Vision-Language Grasping System

Zero-Shot Robotic Manipulation via Large Vision-Language Models (VLMs)

Course: Advanced Electronic Science Experiments II     Oct 2025 – Jan 2026

Overview

Constructed a zero-shot robotic grasping system leveraging large Vision-Language Models, enabling a robot arm to grasp arbitrary objects specified by natural language prompts without task-specific training.

Key Highlights

  • Open-Vocabulary Perception: Built a zero-shot perception pipeline with YOLO-World and the Segment Anything Model (SAM), achieving pixel-level segmentation ($\mathrm{IoU} > 0.9$) of arbitrary objects specified by natural-language prompts (see the detection-and-segmentation sketch after this list).
  • 6-DOF Grasp Planning: Integrated GraspNet-1Billion for 6-DOF grasp pose estimation and derived a custom rotation matrix to align GraspNet's grasp frame with ROS2 TF conventions, ensuring precise end-effector execution (a frame-alignment sketch follows this list).
  • Dynamic Obstacle Modeling: Applied the alpha-shape algorithm ($\alpha = 0.01$) to reconstruct high-fidelity non-convex obstacle meshes, enabling collision-free trajectory planning with MoveIt2 in unstructured environments (a reconstruction sketch appears at the end of the System Setup section).
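
A minimal sketch of the open-vocabulary detect-then-segment step, assuming the Ultralytics YOLO-World interface and the original segment-anything package; the checkpoint names, the scene.png frame, and the "red mug" prompt are placeholders, not the project's actual configuration.

```python
import cv2
from ultralytics import YOLOWorld                      # open-vocabulary detector
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoints and prompt (not the project's files).
detector = YOLOWorld("yolov8l-worldv2.pt")
detector.set_classes(["red mug"])                      # natural-language prompt -> detector vocabulary

image_bgr = cv2.imread("scene.png")                    # camera frame (OpenCV loads BGR)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# 1) Open-vocabulary detection: bounding box for the prompted object
#    (assumes at least one detection is returned for the prompt).
det = detector.predict(image_bgr, conf=0.25, verbose=False)[0]
box = det.boxes.xyxy[0].cpu().numpy()                  # [x1, y1, x2, y2] of the best match

# 2) Promptable segmentation: refine the box into a pixel-level mask with SAM.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                                        # boolean HxW mask used to crop the point cloud
```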

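A minimal sketch of the frame alignment, under the assumption that GraspNet-1Billion reports grasps in the camera frame with the approach direction along the grasp frame's x-axis while the end-effector in the URDF approaches along +z; the alignment matrix, the placeholder transforms, and the base_link frame name are illustrative and would differ for another gripper.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from geometry_msgs.msg import PoseStamped

# Placeholder values: in the real pipeline these come from GraspNet-1Billion
# (grasp in the camera frame) and from the TF tree (camera -> robot base).
R_cam_grasp = np.eye(3)                       # grasp rotation, GraspNet convention
t_cam_grasp = np.array([0.10, 0.02, 0.45])    # grasp translation in the camera frame
T_base_cam  = np.eye(4)                       # camera extrinsics from TF (placeholder)

# Assumed convention: GraspNet's approach axis is the grasp frame's x-axis,
# while the tool frame approaches along +z. This fixed right-handed rotation
# re-labels the axes; the exact matrix depends on the gripper's TF setup.
R_graspnet_to_tool = np.array([[0.0,  0.0, 1.0],
                               [0.0, -1.0, 0.0],
                               [1.0,  0.0, 0.0]])

T_cam_grasp = np.eye(4)
T_cam_grasp[:3, :3] = R_cam_grasp @ R_graspnet_to_tool
T_cam_grasp[:3, 3] = t_cam_grasp
T_base_grasp = T_base_cam @ T_cam_grasp       # express the grasp in the base frame

qx, qy, qz, qw = Rotation.from_matrix(T_base_grasp[:3, :3]).as_quat()

goal = PoseStamped()                          # target pose handed to MoveIt2
goal.header.frame_id = "base_link"
goal.pose.position.x, goal.pose.position.y, goal.pose.position.z = map(float, T_base_grasp[:3, 3])
goal.pose.orientation.x = float(qx)
goal.pose.orientation.y = float(qy)
goal.pose.orientation.z = float(qz)
goal.pose.orientation.w = float(qw)
```

Post-multiplying by the constant alignment rotation only re-labels the grasp axes into the tool-frame convention; the grasp position itself is unchanged.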
System Setup

The system in RViz

Grasp pose visualization powered by Open3D

Alpha-Shape Obstacle Modeling
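
A minimal sketch of the alpha-shape reconstruction step using Open3D, with a sampled box standing in for a real segmented obstacle cloud; handing the result to MoveIt2 as a collision object is only indicated in a comment.

```python
import numpy as np
import open3d as o3d

# Stand-in obstacle: a 6 x 4 x 12 cm box sampled into a point cloud. In the real
# system the cloud comes from the depth camera after removing the target object.
box = o3d.geometry.TriangleMesh.create_box(width=0.06, height=0.04, depth=0.12)
pcd = box.sample_points_poisson_disk(number_of_points=2000)

# Alpha-shape reconstruction: a small alpha preserves concavities that a convex
# hull would smooth over; alpha = 0.01 matches the value quoted in the highlights.
alpha = 0.01
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_alpha_shape(pcd, alpha)
mesh.compute_vertex_normals()

# The vertices/triangles can then be copied into a shape_msgs/Mesh and wrapped
# in a moveit_msgs/CollisionObject so MoveIt2 plans collision-free trajectories.
vertices = np.asarray(mesh.vertices)
triangles = np.asarray(mesh.triangles)
print(f"alpha-shape mesh: {len(vertices)} vertices, {len(triangles)} triangles")
```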