Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Shaeke Salman, Florida State University
Seminar

Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. We develop an effective gradient-based procedure that minimally modifies an image so that its embedding matches that of any given text. We show that the embeddings of clearly distinguishable texts can be aligned to a single image through unnoticeable changes in joint image-text models, revealing that semantically unrelated images can share the embedding of the same text and, at the same time, visually indistinguishable images can be matched to the embeddings of very different texts. We further demonstrate the significance of this vulnerability in vision-language navigation, text-image retrieval, and content moderation applications; the results show that the associations between image contents and labels can be modified arbitrarily. Unless this vulnerability is overcome, images and other inputs can be used to trigger malicious behaviors, rendering multimodal models inherently vulnerable.
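
The talk's exact procedure is the speaker's own; as a rough illustration only, the sketch below shows how such a gradient-based alignment attack could be set up against a CLIP-style joint image-text model. The checkpoint name, step count, learning rate, and perturbation budget eps are illustrative assumptions, not details taken from the abstract.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a public CLIP checkpoint stands in for the joint image-text
# model discussed in the talk; the actual model may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def align_image_to_text(image, target_text, steps=300, lr=1e-2, eps=8 / 255):
    """Gradient descent on a small additive perturbation so the image's
    embedding moves toward the embedding of an arbitrary target text."""
    text_inputs = processor(text=[target_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    pixels = processor(images=image, return_tensors="pt")["pixel_values"]
    delta = torch.zeros_like(pixels, requires_grad=True)  # the perturbation
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        img_emb = model.get_image_features(pixel_values=pixels + delta)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = 1 - (img_emb * text_emb).sum()  # cosine distance to target text
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            # Keep the change visually unnoticeable: an L-infinity budget,
            # here applied in the processor's normalized pixel space.
            delta.clamp_(-eps, eps)
    return (pixels + delta).detach()

# Example: nudge an arbitrary photo toward the embedding of an unrelated caption.
# adv_pixels = align_image_to_text(Image.open("photo.jpg"), "a red stop sign")
```

Because the optimization targets only the shared embedding, the same perturbed image then inherits whatever downstream behavior the target text would trigger, which is what makes the retrieval and content-moderation demonstrations possible.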