Qwen/Qwen2.5-Omni-7B

@@ -43,18 +43,16 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 ### Performance
 <details>
 <summary>Multimodality  -> Text</summary>
-<style type="text/css">
-.tg  {border-collapse:collapse;border-spacing:0;}
-.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg .tg-0lax{text-align:left;vertical-align:top}
-</style>
-<table class=""><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
@@ -76,7 +74,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">video-SALMONN</td>
-    <td class="tg-0lax">34.11%|31.70%|<span style="font-weight:bold">56.60%</span>|35.64%</td>
   </tr>
   <tr>
     <td class="tg-0lax">UnifiedIO2-xlarge</td>
@@ -84,23 +82,19 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">UnifiedIO2-xxlarge</td>
-    <td class="tg-0lax">34.24%|36.98%|29.25%|38.00%</td>
-  </tr>
-  <tr>
-    <td class="tg-0lax">MiniCPM-o</td>
     <td class="tg-0lax">34.24%|36.98%|24.53%|33.98%</td>
   </tr>
   <tr>
-    <td class="tg-0lax">Baichuan-Omni-1.5</td>
     <td class="tg-0lax">-|-|-|40.50%</td>
   </tr>
   <tr>
-    <td class="tg-0lax">Qwen2-Audio</td>
     <td class="tg-0lax">-|-|-|42.90%</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">55.25%</span>|<span style="font-weight:bold">60.00%</span>|52.83%|<span style="font-weight:bold">56.13%</span></td>
   </tr>
 </tbody></table>
 </details>
@@ -109,16 +103,8 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 <details>
 <summary>Audio -> Text</summary>
-<style type="text/css">
-.tg  {border-collapse:collapse;border-spacing:0;}
-.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg .tg-9j4x{font-style:italic;font-weight:bold;text-align:center;text-decoration:underline;vertical-align:top}
-.tg .tg-0lax{text-align:left;vertical-align:top}
-</style>
-<table class=""><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
@@ -151,7 +137,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Seed-ASR-Multilingual</td>
-    <td class="tg-0lax">-|-|<span style="font-weight:bold">1.6</span>|<span style="font-weight:bold">2.8</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
@@ -167,7 +153,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
-    <td class="tg-0lax"><span style="font-weight:bold">1.3</span>|<span style="font-weight:bold">3.4</span>|<span style="font-weight:bold">1.6</span>|3.6</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
@@ -176,7 +162,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   <tr>
     <td class="tg-0lax" rowspan="4">Common Voice 15<br>en | zh | yue | fr</td>
     <td class="tg-0lax">Whisper-large-v3</td>
-    <td class="tg-0lax">9.8|12.8|10.9|10.8</td>
   </tr>
   <tr>
     <td class="tg-0lax">MinMo</td>
@@ -184,11 +170,11 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
-    <td class="tg-0lax">8.6|6.9|<span style="font-weight:bold">5.9</span>|9.6</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">7.6</span>|<span style="font-weight:bold">5.2</span>|7.3|<span style="font-weight:bold">7.5</span></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="7">Fleurs<br>zh | en</td>
@@ -197,7 +183,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Seed-ASR-Multilingual</td>
-    <td class="tg-0lax">-|<span style="font-weight:bold">3.4</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">Megrez-3B-Omni</td>
@@ -217,12 +203,12 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">3.0</span>|4.1</td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="5">Wenetspeech<br>test-net | test-meeting</td>
     <td class="tg-0lax">Seed-ASR-Chinese</td>
-    <td class="tg-0lax"><span style="font-weight:bold">4.7|5.7</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">Megrez-3B-Omni</td>
@@ -247,7 +233,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Llama-3-70B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">5.7</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
@@ -271,11 +257,11 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
-    <td class="tg-0lax">-|-|<span style="font-weight:bold">48.2</span>|27.2</td>
   </tr>
   <tr>
     <td class="tg-0lax">MinMo</td>
-    <td class="tg-0lax">-|<span style="font-weight:bold">39.9</span>|46.7|26.0</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen-Audio</td>
@@ -287,7 +273,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">30.2</span>|37.7|41.4|<span style="font-weight:bold">29.4</span></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">SER</td>
@@ -311,7 +297,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.570</span></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">VSC</td>
@@ -331,11 +317,11 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.939</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.939</span></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Music</td>
@@ -347,16 +333,16 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.88</span></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="2">MusicCaps</td>
     <td class="tg-0lax">LP-MusicCaps</td>
-    <td class="tg-0lax">0.291|0.149|0.089|<span style="font-weight:bold">0.061</span>|<span style="font-weight:bold">0.129</span>|0.130</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.328</span>|<span style="font-weight:bold">0.162</span>|<span style="font-weight:bold">0.090</span>|0.055|0.127|<span style="font-weight:bold">0.225</span></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Audio Reasoning</td>
@@ -368,11 +354,11 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
-    <td class="tg-0lax">54.95|50.98|42.04|49.20.5</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">67.87|69.16|59.76|65.60</span></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Voice Chatting</td>
@@ -380,7 +366,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   <tr>
     <td class="tg-0lax" rowspan="8">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
     <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">4.55</span>|3.90|53.35|47.17</td>
   </tr>
   <tr>
     <td class="tg-0lax">MERaLiON</td>
@@ -396,7 +382,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
-    <td class="tg-0lax">4.42|<span style="font-weight:bold">4.15</span>|50.72|54.78</td>
   </tr>
   <tr>
     <td class="tg-0lax">Baichuan-Omni-1.5</td>
@@ -408,12 +394,12 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax">4.49|3.93|<span style="font-weight:bold">55.71</span>|<span style="font-weight:bold">61.32</span></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="8">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
     <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
-    <td class="tg-0lax">65.27|<span style="font-weight:bold">66.88</span>|98.46|71.45</td>
   </tr>
   <tr>
     <td class="tg-0lax">MERaLiON</td>
@@ -441,7 +427,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-    <td class="tg-0lax"><span style="font-weight:bold">81.10</span>|52.87|<span style="font-weight:bold">99.42</span>|<span style="font-weight:bold">74.12</span></td>
   </tr>
 </tbody></table>
 </details>
@@ -473,16 +459,16 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 | Dataset                  | Qwen2.5-Omni-7B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
 |--------------------------|--------------|---------------|----------------|----------------|
-| Refcoco<sub>val</sub>    | **90.6**     | 90.0          | **90.6**       | 73.2           |
-| Refcoco<sub>textA</sub>  | **93.4**     | 92.5          | 93.2           | 72.9           |
-| Refcoco<sub>textB</sub>  | 86.8         | 85.4          | **88.2**       | 74.6           |
-| Refcoco+<sub>val</sub>   | 85.3         | 84.2          | **88.2**       | 62.5           |
 | Refcoco+<sub>textA</sub> | **91.0**     | 89.1          | 89.0           | 63.9           |
-| Refcoco+<sub>textB</sub> | **79.2**     | 76.9          | 75.9           | 65.0           |
-| Refcocog+<sub>val</sub>  | **87.6**     | 87.2          | 86.1           | 75.2           |
-| Refcocog+<sub>test</sub> | **88.0**     | 87.2          | 87.0           | 76.2           |
 | ODinW                    | 42.4         | 37.3          | **55.0**       | 36.7           |
-| PointGrounding           | 65.3         | **67.3**      | -              | -              |
 </details>
@@ -491,26 +477,17 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 | Dataset                     | Qwen2.5-Omni-7B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
 |-----------------------------|--------------|------------|---------------|-------------|
-| Video-MME<sub>w/o sub</sub> | **65.9**     | 63.9       | 65.1          | 64.8        |
-| Video-MME<sub>w sub</sub>   | **72.9**     | 67.9       | 71.6          | -           |
-| MVBench                     | 68.6         | 67.2       | **69.6**      | -           |
-| EgoSchema<sub>test</sub>    | **69.6**     | 63.2        | 65.0          | -           |
 </details>
 <details>
 <summary>Zero-shot Speech Generation</summary>
-<style type="text/css">
-.tg  {border-collapse:collapse;border-spacing:0;}
-.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg .tg-9j4x{font-style:italic;font-weight:bold;text-align:center;text-decoration:underline;vertical-align:top}
-.tg .tg-0lax{text-align:left;vertical-align:top}
-</style>
-<table class=""><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
@@ -527,7 +504,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Seed-TTS_RL</td>
-    <td class="tg-0lax"><span style="font-weight:bold">1.00</span> | 1.94 | <span style="font-weight:bold">6.42</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">MaskGCT</td>
@@ -539,7 +516,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">F5-TTS</td>
-    <td class="tg-0lax">1.56 | <span style="font-weight:bold">1.83</span> | 8.67</td>
   </tr>
   <tr>
     <td class="tg-0lax">CosyVoice 2</td>
@@ -567,7 +544,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
   </tr>
   <tr>
     <td class="tg-0lax">Seed-TTS_RL</td>
-    <td class="tg-0lax"><span style="font-weight:bold">0.801</span> | <span style="font-weight:bold">0.766</span> | <span style="font-weight:bold">0.782</span></td>
   </tr>
   <tr>
     <td class="tg-0lax">MaskGCT</td>
@@ -611,10 +588,10 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 | GPQA                              | 30.8      | **36.4**   | 34.3     | 32.8        | 32.8      |
 | MATH                              | 71.5      | **75.5**   | 52.9     | 51.9        | 44.3      |
 | GSM8K                             | 88.7      | **91.6**   | 85.7     | 84.5        | 76.7      |
-| HumanEval                         | 79.9      | **84.8**   | 79.9     | 72.6        | 68.9      |
-| MBPP                              | 73.7      | **79.2**   | 67.2     | 69.6        | 74.9      |
-| MultiPL-E                         | 67.0      | **70.4**   | 59.1     | 50.7        | 53.4      |
-| LiveCodeBench<sub>2305-2409</sub> | 25.2      | **28.7**   | 23.9     | 8.3         | 18.9      |
 </details>
 ## Quickstart

 ### Performance
+We conducted a comprehensive evaluation of Qwen2.5-Omni. It demonstrates strong performance across all modalities compared with similarly sized single-modality models such as Qwen2.5-VL-7B and Qwen2-Audio, and with closed-source models such as Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. In single-modality tasks, it excels in speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).
+<p align="center">
+    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/bar.png" width="80%"/>
+</p>
 <details>
 <summary>Multimodality  -> Text</summary>
+<table class="tg"><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
   </tr>
   <tr>
     <td class="tg-0lax">video-SALMONN</td>
+    <td class="tg-0lax">34.11%|31.70%|<strong>56.60%</strong>|35.64%</td>
   </tr>
   <tr>
     <td class="tg-0lax">UnifiedIO2-xlarge</td>
   </tr>
   <tr>
     <td class="tg-0lax">UnifiedIO2-xxlarge</td>
     <td class="tg-0lax">34.24%|36.98%|24.53%|33.98%</td>
   </tr>
   <tr>
+    <td class="tg-0lax">MiniCPM-o</td>
     <td class="tg-0lax">-|-|-|40.50%</td>
   </tr>
   <tr>
+    <td class="tg-0lax">Baichuan-Omni-1.5</td>
     <td class="tg-0lax">-|-|-|42.90%</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>55.25%</strong>|<strong>60.00%</strong>|52.83%|<strong>56.13%</strong></td>
   </tr>
 </tbody></table>
 </details>
 <details>
 <summary>Audio -> Text</summary>
+<table class="tg"><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
   </tr>
   <tr>
     <td class="tg-0lax">Seed-ASR-Multilingual</td>
+    <td class="tg-0lax">-|-|<strong>1.6</strong>|<strong>2.8</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
+    <td class="tg-0lax"><strong>1.3</strong>|<strong>3.4</strong>|<strong>1.6</strong>|3.6</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
   <tr>
     <td class="tg-0lax" rowspan="4">Common Voice 15<br>en | zh | yue | fr</td>
     <td class="tg-0lax">Whisper-large-v3</td>
+    <td class="tg-0lax">9.3|12.8|10.9|10.8</td>
   </tr>
   <tr>
     <td class="tg-0lax">MinMo</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
+    <td class="tg-0lax">8.6|6.9|<strong>5.9</strong>|9.6</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>7.6</strong>|<strong>5.2</strong>|7.3|<strong>7.5</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="7">Fleurs<br>zh | en</td>
   </tr>
   <tr>
     <td class="tg-0lax">Seed-ASR-Multilingual</td>
+    <td class="tg-0lax">-|<strong>3.4</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">Megrez-3B-Omni</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>3.0</strong>|4.1</td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="5">Wenetspeech<br>test-net | test-meeting</td>
     <td class="tg-0lax">Seed-ASR-Chinese</td>
+    <td class="tg-0lax"><strong>4.7|5.7</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">Megrez-3B-Omni</td>
   </tr>
   <tr>
     <td class="tg-0lax">Llama-3-70B</td>
+    <td class="tg-0lax"><strong>5.7</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
+    <td class="tg-0lax">-|-|<strong>48.2</strong>|27.2</td>
   </tr>
   <tr>
     <td class="tg-0lax">MinMo</td>
+    <td class="tg-0lax">-|<strong>39.9</strong>|46.7|26.0</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen-Audio</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>30.2</strong>|37.7|41.4|<strong>29.4</strong></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">SER</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>0.570</strong></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">VSC</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
+    <td class="tg-0lax"><strong>0.939</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>0.939</strong></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Music</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>0.88</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="2">MusicCaps</td>
     <td class="tg-0lax">LP-MusicCaps</td>
+    <td class="tg-0lax">0.291|0.149|0.089|<strong>0.061</strong>|<strong>0.129</strong>|0.130</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>0.328</strong>|<strong>0.162</strong>|<strong>0.090</strong>|0.055|0.127|<strong>0.225</strong></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Audio Reasoning</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2-Audio</td>
+    <td class="tg-0lax">54.95|50.98|42.04|49.20</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>67.87|69.16|59.76|65.60</strong></td>
   </tr>
   <tr>
     <td class="tg-9j4x" colspan="3">Voice Chatting</td>
   <tr>
     <td class="tg-0lax" rowspan="8">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
     <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
+    <td class="tg-0lax"><strong>4.55</strong>|3.90|53.35|47.17</td>
   </tr>
   <tr>
     <td class="tg-0lax">MERaLiON</td>
   </tr>
   <tr>
     <td class="tg-0lax">MiniCPM-o</td>
+    <td class="tg-0lax">4.42|<strong>4.15</strong>|50.72|54.78</td>
   </tr>
   <tr>
     <td class="tg-0lax">Baichuan-Omni-1.5</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax">4.49|3.93|<strong>55.71</strong>|<strong>61.32</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax" rowspan="8">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
     <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
+    <td class="tg-0lax">65.27|<strong>66.88</strong>|98.46|71.45</td>
   </tr>
   <tr>
     <td class="tg-0lax">MERaLiON</td>
   </tr>
   <tr>
     <td class="tg-0lax">Qwen2.5-Omni-7B</td>
+    <td class="tg-0lax"><strong>81.10</strong>|52.87|<strong>99.42</strong>|<strong>74.12</strong></td>
   </tr>
 </tbody></table>
 </details>
 | Dataset                  | Qwen2.5-Omni-7B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
 |--------------------------|--------------|---------------|----------------|----------------|
+| Refcoco<sub>val</sub>    | 90.5         | 90.0          | **90.6**       | 73.2           |
+| Refcoco<sub>textA</sub>  | **93.5**     | 92.5          | 93.2           | 72.9           |
+| Refcoco<sub>textB</sub>  | 86.6         | 85.4          | **88.2**       | 74.6           |
+| Refcoco+<sub>val</sub>   | 85.4         | 84.2          | **88.2**       | 62.5           |
 | Refcoco+<sub>textA</sub> | **91.0**     | 89.1          | 89.0           | 63.9           |
+| Refcoco+<sub>textB</sub> | **79.3**     | 76.9          | 75.9           | 65.0           |
+| Refcocog+<sub>val</sub>  | **87.4**     | 87.2          | 86.1           | 75.2           |
+| Refcocog+<sub>test</sub> | **87.9**     | 87.2          | 87.0           | 76.2           |
 | ODinW                    | 42.4         | 37.3          | **55.0**       | 36.7           |
+| PointGrounding           | 66.5         | **67.3**      | -              | -              |
 </details>
 | Dataset                     | Qwen2.5-Omni-7B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
 |-----------------------------|--------------|------------|---------------|-------------|
+| Video-MME<sub>w/o sub</sub> | 64.3         | 63.9       | **65.1**      | 64.8        |
+| Video-MME<sub>w sub</sub>   | **72.4**     | 67.9       | 71.6          | -           |
+| MVBench                     | **70.3**     | 67.2       | 69.6          | -           |
+| EgoSchema<sub>test</sub>    | **68.6**     | 63.2       | 65.0          | -           |
 </details>
 <details>
 <summary>Zero-shot Speech Generation</summary>
+<table class="tg"><thead>
   <tr>
     <th class="tg-0lax">Datasets</th>
     <th class="tg-0lax">Model</th>
   </tr>
   <tr>
     <td class="tg-0lax">Seed-TTS_RL</td>
+    <td class="tg-0lax"><strong>1.00</strong> | 1.94 | <strong>6.42</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">MaskGCT</td>
   </tr>
   <tr>
     <td class="tg-0lax">F5-TTS</td>
+    <td class="tg-0lax">1.56 | <strong>1.83</strong> | 8.67</td>
   </tr>
   <tr>
     <td class="tg-0lax">CosyVoice 2</td>
   </tr>
   <tr>
     <td class="tg-0lax">Seed-TTS_RL</td>
+    <td class="tg-0lax"><strong>0.801</strong> | <strong>0.766</strong> | <strong>0.782</strong></td>
   </tr>
   <tr>
     <td class="tg-0lax">MaskGCT</td>
 | GPQA                              | 30.8      | **36.4**   | 34.3     | 32.8        | 32.8      |
 | MATH                              | 71.5      | **75.5**   | 52.9     | 51.9        | 44.3      |
 | GSM8K                             | 88.7      | **91.6**   | 85.7     | 84.5        | 76.7      |
+| HumanEval                         | 78.7      | **84.8**   | 79.9     | 72.6        | 68.9      |
+| MBPP                              | 73.2      | **79.2**   | 67.2     | 69.6        | 74.9      |
+| MultiPL-E                         | 65.8      | **70.4**   | 59.1     | 50.7        | 53.4      |
+| LiveCodeBench<sub>2305-2409</sub> | 24.6      | **28.7**   | 23.9     | 8.3         | 18.9      |
 </details>
 ## Quickstart