---
license: mit
datasets:
- Tevatron/bge-ir
- Tevatron/wiki-ss-nq-new
- Tevatron/pixmo-docs
- Tevatron/colpali
- Tevatron/msrvtt
- Tevatron/audiocaps
base_model:
- Qwen/Qwen2.5-Omni-7B
- Tevatron/Qwen2.5-Omni-7B-Thinker
pipeline_tag: visual-document-retrieval
---
# Tevatron/OmniEmbed-v0

**OmniEmbed** is a powerful multi-modal embedding model built on [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with our [Tevatron](https://github.com/texttron/tevatron/) toolkit, a unified toolkit for document retrieval across scale, language, and modality.
OmniEmbed generates unified embeddings across multilingual text, images, audio, and video, enabling effective cross-modal retrieval for diverse applications.

✅ Text ✅ Image ✅ Audio ✅ Video ✅ Multilingual

## Evaluation Results

| Benchmark | Task                     | Metric  | OmniEmbed | Baseline (Score)  |
|-----------|--------------------------|---------|-----------|-------------------|
| BEIR-13   | Text Retrieval           | nDCG@10 | 58.2      | MistralE5 (59.0)  |
| MIRACL    | Multilingual Retrieval   | nDCG@10 | 69.1      | BGE‑M3 (69.2)     |
| VIDORE    | Image Document Retrieval | nDCG@5  | 85.8      | DSE‑QWen2 (85.8)  |
| MSRVTT    | Video Retrieval          | R@1     | 51.3      | CLIP (31.2)       |
| AudioCaps | Audio Retrieval          | R@1     | 34.0      | [*](https://paperswithcode.com/sota/text-to-audio-retrieval-on-audiocaps)       |

OmniEmbed achieves strong performance, comparable to models specifically optimized for individual tasks.

> Although multiple baselines exist on AudioCaps, we could not identify a consistent query/corpus setting in the literature for direct comparison.

---

### Usage
```python
# Import Library, Load Model and Processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "Tevatron/OmniEmbed-v0",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
).to(device).eval()

processor.tokenizer.padding_side = "left"
model.padding_side = "left"

# Function to Encode Message
def encode_message(message):
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # Use the last token's hidden state as the embedding, then L2-normalize it
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
```
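Because `encode_message` returns L2-normalized vectors, cosine similarity reduces to a dot product, so an entire corpus can be scored against a query with a single matrix multiply instead of pairwise `cosine_similarity` calls. A minimal sketch of this ranking step, using small random stand-in vectors (in practice these would be the outputs of `encode_message` for the query and each document):

```python
import torch

# Stand-in embeddings: tiny 8-dim unit vectors instead of real model outputs
torch.manual_seed(0)
query_rep = torch.nn.functional.normalize(torch.randn(1, 8), p=2, dim=-1)
corpus_reps = torch.nn.functional.normalize(torch.randn(5, 8), p=2, dim=-1)

# For unit vectors, cosine similarity equals the dot product
scores = query_rep @ corpus_reps.T                       # shape: (1, 5)
ranking = torch.argsort(scores, dim=-1, descending=True)  # best match first

print("Scores:", scores.squeeze(0).tolist())
print("Ranked doc indices:", ranking.squeeze(0).tolist())
```

The same pattern scales to large corpora: stack document embeddings once with `torch.cat`, then score each incoming query with one matmul.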

### 🎬 Video Retrieval
```python
example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🎵 Audio Retrieval
```python
example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
example_query = 'Query: How many input modalities does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🌍 Multilingual Text Retrieval
```python
example_query = 'Query: 氧气在空气中占比多少?'
example_text_1 = "空气是指大气层中由不同气体和各类飘浮在其中的固体与液体颗粒(大气颗粒与气溶胶)所组成的气态混合物。地球大气层的空气主要由78.1%的氮气、20.9%氧气、0.9%的氩气和1~4%的水蒸气组成,其成分并不是固定的,随着高度、气压、温度的改变和对流情况不同,局部空气的组成比例也会改变。空气在大气层(特别是对流层)中的流动形成了风和曳流、气旋、龙卷等自然现象,而空气中飘浮的颗粒则形成了云、雾、霾和沙尘暴等短期天气情况。空气在海洋和陆地之间跨区域流动所承载的湿度和热能传导也是水循环和气候变率与变化的关键一环。"
example_text_2 = "水(化学式:H2O)是一种无机化合物,在常温且无杂质中是无色[1]无味不导电的透明液体,也会通过蒸发产生气态的水蒸气(这种蒸发可以发生在任何温度下,同时取决于与空气接触的表面积和湿度差)。在标准大气压下,水的凝固点是0 °C(32 °F;273 K),沸点是100 °C(212 °F;373 K)。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())
```

## Data & Training
We fully open-sourced the training data and training code in [Tevatron](https://github.com/texttron/tevatron/tree/qwenomni).


## Contact
This model was developed by:

Shengyao Zhuang, Xueguang Ma, Samantha Zhan, Crystina Zhang

Feel free to reach out to us with any questions or for further discussion.