ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems
Abstract
ScanBot is a dataset for instruction-conditioned, high-precision robotic surface scanning that exposes the challenges vision-language-action models face in generating precise scanning trajectories under real-world constraints.
We introduce ScanBot, a novel dataset designed for instruction-conditioned, high-precision surface scanning in robotic systems. In contrast to existing robot learning datasets that focus on coarse tasks such as grasping, navigation, or dialogue, ScanBot targets the high-precision demands of industrial laser scanning, where sub-millimeter path continuity and parameter stability are critical. The dataset covers laser scanning trajectories executed by a robot across 12 diverse objects and 6 task types, including full-surface scans, geometry-focused regions, spatially referenced parts, functionally relevant structures, defect inspection, and comparative analysis. Each scan is guided by natural-language instructions and paired with synchronized RGB, depth, and laser profiles, as well as robot pose and joint states. Despite recent progress, existing vision-language-action (VLA) models still fail to generate stable scanning trajectories under fine-grained instructions and real-world precision demands. To investigate this limitation, we benchmark a range of multimodal large language models (MLLMs) across the full perception-planning-execution loop, revealing persistent challenges in instruction-following under realistic constraints.
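The abstract enumerates the modalities bundled with each scan, which suggests a per-episode record layout. Below is a minimal Python sketch of such a layout; all class names, field names, and array shapes (`ScanBotEpisode`, `ee_pose`, the image resolution, profile length, and joint count) are illustrative assumptions, not the dataset's actual schema.

```python
# A minimal sketch of what one ScanBot sample might contain, based on the
# modalities listed in the abstract. Field names and shapes are assumptions.
from dataclasses import dataclass, field
import numpy as np

# The six task types described in the abstract.
TASK_TYPES = (
    "full_surface_scan",
    "geometry_focused_region",
    "spatially_referenced_part",
    "functionally_relevant_structure",
    "defect_inspection",
    "comparative_analysis",
)

@dataclass
class ScanBotFrame:
    """One synchronized timestep of a scanning trajectory (shapes assumed)."""
    rgb: np.ndarray            # (H, W, 3) uint8 camera image
    depth: np.ndarray          # (H, W) float32 depth map in meters
    laser_profile: np.ndarray  # (P, 2) float32 line-scan profile points
    ee_pose: np.ndarray        # (7,) end-effector pose: xyz + quaternion
    joint_states: np.ndarray   # (N_joints,) joint positions in radians

@dataclass
class ScanBotEpisode:
    """One instruction-conditioned scan over a single object."""
    object_id: str
    task_type: str             # one of TASK_TYPES
    instruction: str           # natural-language scanning instruction
    frames: list[ScanBotFrame] = field(default_factory=list)

# Example: construct a tiny dummy episode to show the intended layout.
episode = ScanBotEpisode(
    object_id="object_01",
    task_type="defect_inspection",
    instruction="Scan the scratched area near the top edge of the part.",
)
episode.frames.append(ScanBotFrame(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    depth=np.zeros((480, 640), dtype=np.float32),
    laser_profile=np.zeros((1280, 2), dtype=np.float32),
    ee_pose=np.zeros(7, dtype=np.float32),
    joint_states=np.zeros(6, dtype=np.float32),
))
print(len(episode.frames), episode.task_type)
```

Grouping synchronized modalities per timestep, rather than storing each stream separately, mirrors the abstract's emphasis on synchronized RGB, depth, laser profiles, and robot state, and keeps the instruction and task type at the episode level where they condition the whole trajectory.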
Community
The following papers were recommended by the Semantic Scholar API as similar to this paper:
- RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction (2025)
- PointArena: Probing Multimodal Grounding Through Language-Guided Pointing (2025)
- Robotic Visual Instruction (2025)
- ORACLE-Grasp: Zero-Shot Task-Oriented Robotic Grasping using Large Multimodal Models (2025)
- From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation (2025)
- 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks (2025)
- Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware (2025)