[🏡 Project Page](https://garygutc.github.io/UniME) | [📄 Paper](https://arxiv.org/pdf/2504.17432) | [💻 Github](https://github.com/deepglint/UniME)

UniME achieves the top ranking on the MMEB leaderboard when trained at a 336×336 image resolution. (Screenshot captured at 08:00 UTC+8 on May 6, 2025.)

<p align="center">
<img src="figures/MMEB.png">
</p>

## 💡 Highlights

<p align="center">
<img src="figures/fig1.png">
</p>

To enhance the MLLM's embedding capability, we propose textual discriminative knowledge distillation. The training process involves decoupling the MLLM's LLM component and processing text with the prompt "Summarize the above sentences in one word.", followed by aligning the student (MLLM) and teacher (NV-Embed V2) embeddings via KL divergence on batch-wise similarity distributions. **Notably, only the LLM component is fine-tuned during this process, while all other parameters remain frozen**.
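
A minimal sketch of this distillation objective, assuming PyTorch (the function name, the temperature value, and the exact KL formulation are illustrative assumptions, not the released training code):

```python
import torch
import torch.nn.functional as F

def text_distill_loss(student_emb: torch.Tensor,
                      teacher_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """KL divergence between batch-wise similarity distributions.

    student_emb / teacher_emb: (batch, dim) embeddings of the same batch of
    sentences, each prompted with "Summarize the above sentences in one word.",
    from the student (the MLLM's LLM component) and the teacher (NV-Embed V2).
    The two embedding dimensions may differ; only batch-wise similarities
    are compared.
    """
    # L2-normalize so dot products become cosine similarities.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)

    # Batch-wise similarity matrices: row i holds sentence i's similarity
    # to every sentence in the batch.
    student_sim = s @ s.t() / temperature
    teacher_sim = t @ t.t() / temperature

    # Align the student's row-wise distribution with the teacher's.
    log_p_student = F.log_softmax(student_sim, dim=-1)
    p_teacher = F.softmax(teacher_sim, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

During this stage only the LLM component's parameters would receive gradients; the vision encoder and projector would stay frozen, e.g. via `requires_grad_(False)` before training.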
<p align="center">