---
license: mit
datasets:
- TIGER-Lab/MMEB-train
language:
- en
metrics:
- recall
base_model:
- llava-hf/llava-onevision-qwen2-7b-ov-hf
pipeline_tag: image-text-to-text
library_name: transformers
---

# Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
<a href="https://github.com/GaryGuTC">Tiancheng Gu*</a>,
<a href="https://kaicheng-yang0828.github.io">Kaicheng Yang*</a>,
Ziyong Feng,
Xingjun Wang,
Yanzhao Zhang,
Dingkun Long,
Yingda Chen,
<a href="https://weidong-tom-cai.github.io/">Weidong Cai</a>,
<a href="https://jiankangdeng.github.io">Jiankang Deng</a>

[🏡 Project Page](https://garygutc.github.io/UniME) |  [📄 Paper](https://arxiv.org/pdf/2504.17432) | [💻 Github](https://github.com/deepglint/UniME)

UniME achieves the top ranking on the MMEB leaderboard when trained at a 336×336 image resolution. (Screenshot captured at 08:00 UTC+8 on May 6, 2025.)

<p align="center">
    <img src="figures/MMEB.png">
</p>



## 💡 Highlights
<p align="center">
    <img src="figures/fig1.png">
</p>

To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. The training process decouples the LLM component from the MLLM and processes text with the prompt "Summarize the above sentences in one word.", then aligns the student (MLLM) and teacher (NV-Embed V2) embeddings by minimizing the KL divergence between their batch-wise similarity distributions. **Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen**.

<p align="center">
    <img src="figures/fig2.png">
</p>
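Concretely, the distillation objective is a KL divergence between the batch-wise similarity distributions of the student and teacher text embeddings. Below is a minimal sketch of that objective, not the released training code; the `temperature` value and the batch layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=0.02):
    """KL divergence between batch-wise similarity distributions.

    student_emb: (B, D_s) text embeddings from the MLLM's LLM component.
    teacher_emb: (B, D_t) text embeddings from the teacher (e.g. NV-Embed V2).
    The temperature value is an illustrative assumption.
    """
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)

    # Batch-wise similarity matrices (each row: similarities to all samples in the batch).
    student_sim = student_emb @ student_emb.T / temperature
    teacher_sim = teacher_emb @ teacher_emb.T / temperature

    # Align the student's similarity distribution with the teacher's.
    return F.kl_div(
        F.log_softmax(student_sim, dim=-1),
        F.softmax(teacher_sim, dim=-1),
        reduction="batchmean",
    )
```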

After that, we propose hard negative enhanced instruction tuning, which improves visual sensitivity, strengthens cross-modal alignment, and boosts instruction-following capability. At its core are two key innovations: a false negative filtering mechanism that uses a similarity threshold to eliminate misleading samples, and an automatic hard negative sampling strategy that selects the top-k similar but non-matching examples to increase training difficulty.
<p align="center">
    <img src="figures/fig3.png">
</p>
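The two sampling components can be sketched as follows. This is an illustrative outline rather than the released training implementation; the threshold `beta`, the number of negatives `k`, and the exact filtering rule are assumptions:

```python
import torch

def sample_hard_negatives(query_emb, cand_emb, pos_idx, beta=0.9, k=8):
    """Select top-k hard negatives for each query within a batch.

    query_emb: (B, D) L2-normalized query embeddings.
    cand_emb:  (B, D) L2-normalized candidate embeddings.
    pos_idx:   (B,) index of the positive candidate for each query.
    beta, k:   illustrative similarity threshold / number of hard negatives.
    """
    sim = query_emb @ cand_emb.T                   # (B, B) similarity matrix
    pos_sim = sim.gather(1, pos_idx.unsqueeze(1))  # similarity to the positive

    mask = torch.zeros_like(sim, dtype=torch.bool)
    mask.scatter_(1, pos_idx.unsqueeze(1), True)   # exclude the positive itself
    # False-negative filtering: drop candidates almost as similar as the positive
    # (the specific rule relative to the positive similarity is an assumption).
    mask |= sim > beta * pos_sim

    sim = sim.masked_fill(mask, float("-inf"))
    hard_idx = sim.topk(k, dim=1).indices          # top-k similar but non-matching
    return hard_idx
```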


## 🧭 Quick Start
```bash
git clone https://github.com/deepglint/UniME.git
cd UniME
conda create -n uniME python=3.10 -y
conda activate uniME
pip install -r requirements.txt
pip install transformers==4.49.0
```

```python
import torch
from PIL import Image
from torch.nn import functional as F
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

def build_conversation(image=None, text=None):
    """Build a single-turn conversation that asks the model to summarize
    an image or a sentence in one word (the UniME embedding prompt)."""
    if image is not None:
        return [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Summary above image in one word:\n"},
            ],
        }]
    elif text is not None:
        return [{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{text}\nSummary above sentence in one word:\n"},
            ],
        }]

base_model_path = "DeepGlint-AI/UniME-LLaVA-OneVision-7B"

text = "A man is crossing the street with a red car parked nearby."
image_path = "figures/demo.png"
input_image = [Image.open(image_path)]

transform = AutoProcessor.from_pretrained(base_model_path, trust_remote_code=True)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    base_model_path, device_map="cuda", trust_remote_code=True, torch_dtype=torch.float16)
# Left padding keeps the final token position valid for embedding extraction.
transform.tokenizer.padding_side = "left"

inputs_text = transform.apply_chat_template(
    [build_conversation(text=text)],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True).to("cuda")
inputs_image = transform.apply_chat_template(
    [build_conversation(image=input_image)],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True).to("cuda")

with torch.no_grad():
    # Use the last hidden state of the final token as the embedding.
    emb_text = model(**inputs_text, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_image = model(**inputs_image, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
    emb_text = F.normalize(emb_text, dim=-1)
    emb_image = F.normalize(emb_image, dim=-1)
    score = emb_image @ emb_text.T
print("Score: ", score.item())
```
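
The same embeddings can be reused for simple retrieval, for example ranking several candidate captions against one image. The snippet below continues from the example above (`model`, `transform`, `build_conversation`, and `emb_image` are already defined); the candidate captions are illustrative:

```python
# Continues from the Quick Start snippet: rank candidate captions against the image.
candidate_texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A dog is sleeping on a couch.",
    "Two children are playing football in a park.",
]

inputs_texts = transform.apply_chat_template(
    [build_conversation(text=t) for t in candidate_texts],
    add_generation_prompt=True, tokenize=True, return_dict=True,
    return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    emb_texts = model(**inputs_texts, output_hidden_states=True,
                      return_dict=True).hidden_states[-1][:, -1, :]
    emb_texts = F.normalize(emb_texts, dim=-1)
    scores = (emb_image @ emb_texts.T).squeeze(0)   # one score per caption

# Print captions from most to least similar.
for caption, s in sorted(zip(candidate_texts, scores.tolist()), key=lambda x: -x[1]):
    print(f"{s:.4f}  {caption}")
```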

## 🔢 Results
### Diverse Retrieval
<p align="center">
    <img src="figures/res1.png">
</p>

### MMEB
<p align="center">
    <img src="figures/res2.png">
</p>

## 📖 Citation
If you find this repository useful, please use the following BibTeX entry for citation.
```latex
@misc{gu2025breakingmodalitybarrieruniversal,
      title={Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs}, 
      author={Tiancheng Gu and Kaicheng Yang and Ziyong Feng and Xingjun Wang and Yanzhao Zhang and Dingkun Long and Yingda Chen and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2504.17432},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.17432}, 
}
```