Language-Image Alignment with Fixed Text Encoders
Abstract
Learning Language-Image alignment with a Fixed Text encoder (LIFT) from a pre-trained large language model effectively guides visual representation learning, outperforming joint-training methods such as CLIP on tasks involving compositional understanding and long captions.
Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this greatly simplified framework, LIFT, is highly effective: it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
Community
Some qualitative comparisons between LIFT and CLIP! The first line shows the caption or option selected by LIFT, and the second line shows the one selected by CLIP. In every case, LIFT selects the correct one, while CLIP does not. We observe that LIFT compensates for CLIP’s shortcomings in tasks involving compositional information (e.g., spatial locations, object-attribute associations, object-object relations).
The pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP. LIFT uses an LLM-based text encoder to pre-compute the embedding for each text sample. During training, we solely update the image encoder and the projection head to align image embeddings with the pre-computed text embeddings by optimizing an alignment objective.
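To make the pipeline concrete, here is a minimal sketch of a LIFT-style training step in PyTorch. The names (`LIFTImageTower`, `lift_training_step`) are hypothetical, and the cosine-similarity loss is only a stand-in, since the description above refers generically to "an alignment objective"; the vision backbone and the frozen LLM that produces the pre-computed text embeddings are likewise assumptions.

```python
import torch
import torch.nn.functional as F

class LIFTImageTower(torch.nn.Module):
    """Trainable image encoder plus projection head (hypothetical sketch)."""
    def __init__(self, image_encoder, image_dim, text_dim):
        super().__init__()
        self.image_encoder = image_encoder                 # trainable vision backbone (assumed)
        self.proj = torch.nn.Linear(image_dim, text_dim)   # trainable projection head

    def forward(self, images):
        feats = self.image_encoder(images)                 # (B, image_dim)
        return F.normalize(self.proj(feats), dim=-1)       # unit-norm image embeddings

def lift_training_step(model, optimizer, images, text_embeddings):
    """One update aligning image embeddings with pre-computed, frozen text embeddings."""
    # text_embeddings come from the fixed LLM text encoder; no gradients flow into it.
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    image_embeddings = model(images)
    # Illustrative alignment objective: maximize cosine similarity of paired embeddings.
    loss = 1.0 - (image_embeddings * text_embeddings).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text embeddings are pre-computed once and kept fixed, only the image tower and projection head receive gradients, which is where the computational savings over joint CLIP-style training come from.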
good work, but none of the important prior related work has been discussed, including:
- Zhang, Le, Qian Yang, and Aishwarya Agrawal. "Assessing and Learning Alignment of Unimodal Vision and Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025.
- Ruthardt, Jona, et al. "Do Better Language Models Have Crisper Vision?" arXiv preprint arXiv:2410.07173 (2024).
- Zhai, Xiaohua, et al. "LiT: Zero-Shot Transfer with Locked-Image Text Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.