Papers
arxiv:2506.04209

Language-Image Alignment with Fixed Text Encoders

Published on Jun 4
· Submitted by JingfengY on Jun 6

Abstract

AI-generated summary: Learning Language-Image alignment with a Fixed Text encoder (LIFT) using pre-trained large language models effectively guides visual representation learning, outperforming joint training methods like CLIP in compositional understanding and long captions.

Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much-simplified framework, LIFT, is highly effective and outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.
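To make the setup concrete, here is a minimal PyTorch-style sketch of the LIFT idea, not the authors' implementation: a frozen, pre-trained LLM serves as the text encoder, while only the image encoder and a projection head are trained to align image embeddings with the LLM's text embeddings. The backbone choice (ViT-B/16 from torchvision), the mean pooling over token states, and the cosine-similarity alignment loss are illustrative assumptions.

```python
# Minimal sketch of LIFT-style training (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16
from transformers import AutoModel, AutoTokenizer

class LIFTImageTower(nn.Module):
    """Trainable image encoder plus a projection head into the LLM's embedding space."""
    def __init__(self, text_dim: int):
        super().__init__()
        self.backbone = vit_b_16(weights=None)   # trainable image encoder (assumed backbone)
        self.backbone.heads = nn.Identity()      # expose the 768-d ViT representation
        self.proj = nn.Linear(768, text_dim)     # projection head

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))

@torch.no_grad()
def embed_text(llm, tokenizer, captions, device):
    """Encode captions with the frozen LLM; mean-pool the last hidden states (assumed pooling)."""
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt").to(device)
    hidden = llm(**batch).last_hidden_state              # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real (non-padding) tokens

def alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Pull each image embedding toward its caption's embedding (assumed cosine objective)."""
    return 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()

# Frozen text tower -- placeholder model name, not necessarily the encoder used in the paper:
# tokenizer = AutoTokenizer.from_pretrained("<your-llm>")
# llm = AutoModel.from_pretrained("<your-llm>").eval().requires_grad_(False)
```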

Community

Paper author and submitter

Some qualitative comparisons between LIFT and CLIP! The first line shows the caption or option selected by LIFT, and the second line shows the one selected by CLIP. In every case, LIFT selects the correct one, while CLIP does not. We observe that LIFT compensates for CLIP’s shortcomings in tasks involving compositional information (e.g., spatial locations, object-attribute associations, object-object relations).

[Screenshots: qualitative comparison examples showing LIFT's selection vs. CLIP's selection]

Paper author and submitter

The pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP. LIFT uses an LLM-based text encoder to pre-compute the embedding of each text sample. During training, we update only the image encoder and the projection head, aligning image embeddings with the pre-computed text embeddings by optimizing an alignment objective.

[Figure: the LIFT pipeline]
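As a rough illustration of the pre-computation step described above, and reusing the assumed `LIFTImageTower`, `embed_text`, and `alignment_loss` helpers from the sketch after the abstract, caption embeddings can be computed once offline and cached; the training loop then updates only the image encoder and projection head. The `caption_dataset`, `train_loader`, `llm`, and `tokenizer` names here are hypothetical placeholders.

```python
# Illustrative pre-computation + training loop (assumed helpers from the earlier sketch;
# `caption_dataset`, `train_loader`, `llm`, and `tokenizer` are hypothetical placeholders).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Offline pass: embed every caption once with the frozen LLM and cache the result.
#    After this step, the text tower never has to be run again during training.
text_bank = {}
for caption_id, caption in caption_dataset:
    text_bank[caption_id] = embed_text(llm, tokenizer, [caption], device).squeeze(0).cpu()

# 2) Training: gradients flow only through the image encoder and projection head.
text_dim = next(iter(text_bank.values())).numel()
image_tower = LIFTImageTower(text_dim).to(device)
optimizer = torch.optim.AdamW(image_tower.parameters(), lr=1e-4)

for images, caption_ids in train_loader:
    img_emb = image_tower(images.to(device))
    txt_emb = torch.stack([text_bank[cid] for cid in caption_ids]).to(device)
    loss = alignment_loss(img_emb, txt_emb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```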

Paper author and submitter

Good work, but none of the important prior related work has been discussed, including:

  1. Zhang, Le, Qian Yang, and Aishwarya Agrawal. "Assessing and Learning Alignment of Unimodal Vision and Language Models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
  2. Ruthardt, Jona, et al. "Do better language models have crisper vision?" arXiv preprint arXiv:2410.07173 (2024).
  3. Zhai, Xiaohua, et al. "LiT: Zero-Shot Transfer with Locked-image text Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0