xiongwang committed · verified
Commit c1169d7 · 1 Parent(s): ad71514

Upload README.md

Files changed (1)
  1. README.md +29 -18
README.md CHANGED
@@ -26,9 +26,9 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 
 ### Key Features
 
-* **Omni and Novel Architecture**: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We prpose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
+* **Omni and Novel Architecture**: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
 
-* **Real-Time Voice and Video Chat**: Architecture Designed for fully real-time interactions, supporting chunked input and immediate output.
+* **Real-Time Voice and Video Chat**: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.
 
 * **Natural and Robust Speech Generation**: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
 
@@ -47,7 +47,7 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).
 
 <p align="center">
-<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/bar.png"/>
+<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/bar.png" width="80%"/>
 <p>
 
 <details>
@@ -600,7 +600,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The codes of Qwen2.5-Omni on Hugging Face Transformers are in pull request stage and not merged into the main branch yet. Therefore, you may need to build from source to use it with command:
 ```
 pip uninstall transformers
-pip install git+https://github.com/huggingface/transformers@3a1ead0aabed473eafe527915eea8c197d424356
+pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
 pip install accelerate
 ```
 or you might encounter the following error:
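If the pinned build-from-source step above is skipped, the Qwen2.5-Omni classes simply will not exist in the installed package. A quick way to confirm the development build is active is to try the import directly; this is a minimal sketch, assuming only that the pinned commit exposes the `Qwen2_5OmniModel` class used later in this README:

```python
# Sanity check for the install step above: a stock transformers release
# does not ship the Qwen2.5-Omni classes, so this import fails there.
import transformers

print("transformers version:", transformers.__version__)

try:
    from transformers import Qwen2_5OmniModel  # class name as used later in this README
except ImportError as err:
    raise SystemExit(
        "Qwen2.5-Omni support not found; re-run the pinned `pip install git+...` command above."
    ) from err
```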
@@ -654,14 +654,17 @@ conversation = [
     },
 ]
 
+# set use audio in video
+USE_AUDIO_IN_VIDEO = True
+
 # Preparation for inference
 text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
-audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
-inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
+audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
 inputs = inputs.to(model.device).to(model.dtype)
 
 # Inference: Generation of the output text and audio
-text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
+text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
 
 text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
 print(text)
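The updated snippet above returns the generated waveform alongside `text_ids` but does not show what to do with it. Below is a minimal follow-up sketch for saving it, assuming `audio` is a torch tensor as returned by `generate` above and a 24 kHz output sample rate (an assumption, not stated in this diff), with `soundfile` as an extra dependency:

```python
# Illustrative only: write the audio returned by model.generate(...) above to disk.
# Assumes `audio` is a (squeezable) torch tensor; 24 kHz is an assumed sample rate.
import soundfile as sf

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),  # flatten and move to CPU for writing
    samplerate=24000,
)
```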
@@ -764,15 +767,18 @@ conversation4 = [
 # Combine messages for batch processing
 conversations = [conversation1, conversation2, conversation3, conversation4]
 
+# set use audio in video
+USE_AUDIO_IN_VIDEO = True
+
 # Preparation for batch inference
 text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
-audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
+audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)
 
-inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
+inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
 inputs = inputs.to(model.device).to(model.dtype)
 
 # Batch Inference
-text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
+text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
 text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
 print(text)
 ```
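The `conversation1` through `conversation4` objects combined above are defined earlier in the README and are not part of this diff. For orientation only, a single conversation is a list of role/content messages roughly shaped as below; the path and prompt text here are hypothetical placeholders, not the README's actual entries:

```python
# Illustrative shape of one element of `conversations` (placeholder values only).
# Each message carries a role and either a string or a list of typed content parts.
conversation_example = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/example_clip.mp4"},  # hypothetical local path
            {"type": "text", "text": "Describe what happens in this clip."},
        ],
    },
]
```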
@@ -795,10 +801,15 @@ In the process of multimodal interaction, the videos provided by users are often
 audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
 ```
 ```python
-# second place, in model inference
+# second place, in model processor
+inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt",
+                   padding=True, use_audio_in_video=True)
+```
+```python
+# third place, in model inference
 text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
 ```
-It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter in these two places must be set to the same, otherwise unexpected results will occur.
+It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter in these places must be set to the same, otherwise unexpected results will occur.
 
 #### Use audio output or not
 
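Because the flag now has to agree across `process_mm_info`, the processor call, and `generate`, one way to keep the three places in sync is to route everything through a single helper. This is a minimal sketch, reusing the `processor`, `model`, and `process_mm_info` names from the examples earlier in this README:

```python
# Sketch: pass use_audio_in_video through one function so the three call sites
# discussed above always receive the same value.
def run_omni(conversation, use_audio_in_video=True):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=use_audio_in_video)
    inputs = processor(text=text, audios=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video)
    inputs = inputs.to(model.device).to(model.dtype)
    return model.generate(**inputs, use_audio_in_video=use_audio_in_video)

text_ids, audio = run_omni(conversation, use_audio_in_video=True)
```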
@@ -833,7 +844,7 @@ Qwen2.5-Omni supports the ability to change the voice of the output audio. The `
 | Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.|
 | Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.|
 
-Users can use the `spk` parameter of `generate` function to specify the voice type. By defalut, if `spk` is not specified, the default voice type is `Chelsie`.
+Users can use the `spk` parameter of `generate` function to specify the voice type. By default, if `spk` is not specified, the default voice type is `Chelsie`.
 
 ```python
 text_ids, audio = model.generate(**inputs, spk="Chelsie")
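To hear the difference between the two listed voices, the same prepared `inputs` can be generated once per speaker and saved to separate files. This is a minimal sketch, assuming the `model` and `inputs` from the earlier examples and the same assumed 24 kHz output rate, with `soundfile` used for writing:

```python
# Illustrative: render the same request with each documented voice and save both.
# Assumes `model` and `inputs` from earlier examples; 24 kHz is an assumed rate.
import soundfile as sf

for voice in ("Chelsie", "Ethan"):
    text_ids, audio = model.generate(**inputs, spk=voice)
    sf.write(f"output_{voice.lower()}.wav",
             audio.reshape(-1).detach().cpu().numpy(),
             samplerate=24000)
```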
@@ -867,7 +878,7 @@ model = Qwen2_5OmniModel.from_pretrained(
 ```
 
 
-<!-- ## Citation
+## Citation
 
 If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
 
@@ -877,10 +888,10 @@ If you find our paper and code useful in your research, please consider giving a
 
 @article{Qwen2.5-Omni,
   title={Qwen2.5-Omni Technical Report},
-  author={},
-  journal={arXiv preprint arXiv:},
+  author={Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin},
+  journal={arXiv preprint arXiv:2503.20215},
   year={2025}
 }
-``` -->
+```
 
-<br>
+<br>
 