- Frame In-N-Out: Unbounded Controllable Image-to-Video Generation. arXiv:2505.21491, May 2025.
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation. arXiv:2503.14350, Mar 18, 2025.
- Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation. arXiv:2504.16060, Apr 22, 2025.
- DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences. arXiv:2406.03008, Jun 5, 2024.
- Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models. arXiv:2407.07035, Jul 9, 2024.
- Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors. arXiv:2502.13311, Feb 18, 2025.
- TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies. arXiv:2412.10345, Dec 13, 2024.
- Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel. arXiv:2412.08467, Dec 11, 2024.
- LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. arXiv:2309.12311, Sep 21, 2023.
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling. arXiv:2409.07146, Sep 11, 2024.
- Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions. arXiv:2406.09264, Jun 13, 2024.