Abstract: The recent vision-language pre-training model ImageBind has shown significant success in a wide range of visual tasks, demonstrating excellent ability of joint embedding space across ...