OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
Abstract
OmniResponse, a Multimodal Large Language Model, generates high-quality synchronized verbal and non-verbal listener responses using text as an intermediate modality.
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate, in an online manner, synchronized verbal and non-verbal listener feedback conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we introduce text as an intermediate modality to bridge the audio and facial responses. We therefore propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further research on OMCRG, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.
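The abstract does not spell out how Chrono-Text anchors text tokens in time, so the following is only a minimal, hypothetical sketch of the general idea: mapping each generated text token to the video frame of the listener's facial reaction it should co-occur with. The `AnchoredToken` type, `anchor_tokens` helper, and the 25 fps frame rate are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of temporal anchoring of generated text tokens
# (assumption: each token carries an emission time that is mapped to a
# facial-reaction frame index; the paper's actual Chrono-Text design may differ).
from dataclasses import dataclass


@dataclass
class AnchoredToken:
    token: str       # text token emitted by the LLM
    frame_idx: int   # video frame the token is anchored to


def anchor_tokens(tokens, token_times, fps=25):
    """Map each text token to the video frame closest to its emission time (seconds)."""
    return [AnchoredToken(tok, int(round(t * fps)))
            for tok, t in zip(tokens, token_times)]


# Example: three listener back-channel tokens emitted at 0.20 s, 0.48 s, 0.90 s
for a in anchor_tokens(["mm", "-", "hmm"], [0.20, 0.48, 0.90]):
    print(a)  # e.g. AnchoredToken(token='mm', frame_idx=5)
```

Such per-frame anchors are one plausible way a downstream online TTS module (TempoVoice in the paper) could keep generated speech aligned with the facial-reaction stream.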
Community
We introduce a novel task that aims to generate, in an online manner, synchronized verbal and non-verbal listener feedback conditioned on the speaker's multimodal input.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Words: Multimodal LLM Knows When to Speak (2025)
- VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction (2025)
- DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion (2025)
- DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations (2025)
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation (2025)
- AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation (2025)
- REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge (2025)