Vision-Language Grasping System

Zero-Shot Robotic Manipulation via Large Vision-Language Models (VLMs)

Course: Advanced Electronic Science Experiments II     Oct 2025 – Jan 2026

Overview

Constructed a zero-shot robotic grasping system leveraging large Vision-Language Models, enabling a robot arm to grasp arbitrary objects specified by natural language prompts without task-specific training.

Key Highlights

  • Open-Vocabulary Perception: Built a zero-shot perception pipeline with YOLO-World and the Segment Anything Model (SAM), achieving pixel-level segmentation ($\mathrm{IoU} > 0.9$) of arbitrary objects specified by natural-language prompts (see the detection-and-segmentation sketch after this list).
  • 6-DOF Grasp Planning: Integrated GraspNet-1Billion for 6-DOF grasp pose estimation and derived a custom rotation matrix to align GraspNet's grasp frame with ROS2 TF conventions, ensuring precise end-effector execution (a frame-alignment sketch follows this list).
  • Dynamic Obstacle Modeling: Applied the alpha-shape algorithm ($\alpha = 0.01$) to reconstruct high-fidelity non-convex obstacle meshes, enabling collision-free trajectory planning with MoveIt2 in unstructured environments (a reconstruction sketch appears at the end of the System Setup section).
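
A minimal sketch of the open-vocabulary detect-then-segment step, assuming the Ultralytics YOLO-World interface and the original segment-anything package; the checkpoint names, the scene.png frame, and the "red mug" prompt are placeholders, not the project's actual configuration.

```python
import cv2
from ultralytics import YOLOWorld                      # open-vocabulary detector
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoints and prompt (not the project's files).
detector = YOLOWorld("yolov8l-worldv2.pt")
detector.set_classes(["red mug"])                      # natural-language prompt -> detector vocabulary

image_bgr = cv2.imread("scene.png")                    # camera frame (OpenCV loads BGR)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# 1) Open-vocabulary detection: bounding box for the prompted object
#    (assumes at least one detection is returned for the prompt).
det = detector.predict(image_bgr, conf=0.25, verbose=False)[0]
box = det.boxes.xyxy[0].cpu().numpy()                  # [x1, y1, x2, y2] of the best match

# 2) Promptable segmentation: refine the box into a pixel-level mask with SAM.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                                        # boolean HxW mask used to crop the point cloud
```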

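A minimal sketch of the frame alignment, under the assumption that GraspNet-1Billion reports grasps in the camera frame with the approach direction along the grasp frame's x-axis while the end-effector in the URDF approaches along +z; the alignment matrix, the placeholder transforms, and the base_link frame name are illustrative and would differ for another gripper.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from geometry_msgs.msg import PoseStamped

# Placeholder values: in the real pipeline these come from GraspNet-1Billion
# (grasp in the camera frame) and from the TF tree (camera -> robot base).
R_cam_grasp = np.eye(3)                       # grasp rotation, GraspNet convention
t_cam_grasp = np.array([0.10, 0.02, 0.45])    # grasp translation in the camera frame
T_base_cam  = np.eye(4)                       # camera extrinsics from TF (placeholder)

# Assumed convention: GraspNet's approach axis is the grasp frame's x-axis,
# while the tool frame approaches along +z. This fixed right-handed rotation
# re-labels the axes; the exact matrix depends on the gripper's TF setup.
R_graspnet_to_tool = np.array([[0.0,  0.0, 1.0],
                               [0.0, -1.0, 0.0],
                               [1.0,  0.0, 0.0]])

T_cam_grasp = np.eye(4)
T_cam_grasp[:3, :3] = R_cam_grasp @ R_graspnet_to_tool
T_cam_grasp[:3, 3] = t_cam_grasp
T_base_grasp = T_base_cam @ T_cam_grasp       # express the grasp in the base frame

qx, qy, qz, qw = Rotation.from_matrix(T_base_grasp[:3, :3]).as_quat()

goal = PoseStamped()                          # target pose handed to MoveIt2
goal.header.frame_id = "base_link"
goal.pose.position.x, goal.pose.position.y, goal.pose.position.z = map(float, T_base_grasp[:3, 3])
goal.pose.orientation.x = float(qx)
goal.pose.orientation.y = float(qy)
goal.pose.orientation.z = float(qz)
goal.pose.orientation.w = float(qw)
```

Post-multiplying by the constant alignment rotation only re-labels the grasp axes into the tool-frame convention; the grasp position itself is unchanged.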
System Setup

The system in RViz

Grasp pose visualization powered by Open3D

Alpha-Shape Obstacle Modeling
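
A minimal sketch of the alpha-shape reconstruction step using Open3D, with a sampled box standing in for a real segmented obstacle cloud; handing the result to MoveIt2 as a collision object is only indicated in a comment.

```python
import numpy as np
import open3d as o3d

# Stand-in obstacle: a 6 x 4 x 12 cm box sampled into a point cloud. In the real
# system the cloud comes from the depth camera after removing the target object.
box = o3d.geometry.TriangleMesh.create_box(width=0.06, height=0.04, depth=0.12)
pcd = box.sample_points_poisson_disk(number_of_points=2000)

# Alpha-shape reconstruction: a small alpha preserves concavities that a convex
# hull would smooth over; alpha = 0.01 matches the value quoted in the highlights.
alpha = 0.01
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_alpha_shape(pcd, alpha)
mesh.compute_vertex_normals()

# The vertices/triangles can then be copied into a shape_msgs/Mesh and wrapped
# in a moveit_msgs/CollisionObject so MoveIt2 plans collision-free trajectories.
vertices = np.asarray(mesh.vertices)
triangles = np.asarray(mesh.triangles)
print(f"alpha-shape mesh: {len(vertices)} vertices, {len(triangles)} triangles")
```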