SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
Abstract
A framework combining visual priors and dynamic constraints within a synchronized diffusion process generates HOI video and motion simultaneously, enhancing video-motion consistency and generalization.
Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, limiting their generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate HOI video and motion simultaneously. To integrate the heterogeneous semantic, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models and explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization to unseen real-world scenarios. Project page: https://github.com/Droliven/SViMo_project.
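The tri-modal alignment described in the abstract can be sketched roughly as follows: each modality's tokens are modulated by a scale/shift predicted from a shared conditioning vector (e.g. a diffusion-timestep embedding), then all tokens are concatenated into one sequence for a single full-attention pass that captures both inter- and intra-modal dependencies. This is a toy NumPy illustration only; all shapes, parameter names, and the scale/shift formulation are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_modulate(tokens, cond, W_scale, W_shift):
    # Per-modality scale/shift predicted from a shared conditioning vector
    # (hypothetical adaLN-style modulation for feature alignment).
    scale = cond @ W_scale          # (d,)
    shift = cond @ W_shift          # (d,)
    return tokens * (1.0 + scale) + shift

def full_attention(x, Wq, Wk, Wv):
    # Single-head attention over the joint token sequence: every token
    # attends to every other, across and within modalities.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(0)
d = 16
cond = rng.normal(size=(d,))        # shared diffusion-step embedding (assumed)
modalities = {name: rng.normal(size=(n, d))
              for name, n in [("text", 4), ("video", 8), ("motion", 6)]}

# Align each modality with its own modulation parameters, then concatenate
aligned = [adaptive_modulate(toks, cond,
                             rng.normal(size=(d, d)) * 0.02,
                             rng.normal(size=(d, d)) * 0.02)
           for toks in modalities.values()]
joint = np.concatenate(aligned, axis=0)   # (4 + 8 + 6, d) joint sequence

out = full_attention(joint, *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
print(out.shape)  # (18, 16)
```

In a real model the modulated streams would also carry positional information and pass through many such attention blocks; the point here is only the joint-sequence structure that lets video and motion tokens condition each other.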
Community
- TL;DR: A novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process for joint generation of video and motion in Hand-Object Interaction (HOI) scenarios.
- Project page at https://github.com/Droliven/SViMo_project.
- Video demonstration: https://www.youtube.com/watch?v=pVkntn-8KHo.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction (2025)
- MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation (2025)
- UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes (2025)
- TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation (2025)
- Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation (2025)
- LatentMove: Towards Complex Human Movement Video Generation (2025)
- DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation (2025)