You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects

Lei Zhou1, Haozhe Wang1, Zhengshen Zhang1, Zhiyang Liu1, Francis EH Tay1, Marcelo H. Ang Jr.1
1National University of Singapore

Abstract

In the realm of robotic grasping, achieving accurate and reliable interactions with the environment is a pivotal challenge. Traditional grasp planning methods that rely on partial point clouds back-projected from depth images often suffer from reduced scene understanding due to occlusion, ultimately impeding their grasping accuracy. Furthermore, scene reconstruction methods have primarily relied on static techniques; their susceptibility to environment changes during manipulation limits their efficacy in real-time grasping tasks. To address these limitations, this paper introduces a novel two-stage pipeline for dynamic scene reconstruction. In the first stage, our approach takes a scene scan as input and registers each target object through mesh reconstruction and novel object pose tracking. In the second stage, pose tracking continues to provide object poses in real time, enabling our approach to transform the reconstructed object point clouds back into the scene. Unlike conventional methodologies that rely on static scene snapshots, our method continuously captures the evolving scene geometry, resulting in a comprehensive and up-to-date point cloud representation. By circumventing the constraints posed by occlusion, our method enhances the overall grasp planning process and enables state-of-the-art 6-DoF robotic grasping algorithms to achieve markedly improved accuracy.

Video

Pipeline

Overview of the proposed pipeline. Stage I: Given a monocular RGB-D video, object masks are segmented by a video-segmentation module. Feature matching is then performed so that the Object Pose Tracker and the Mesh Generator simultaneously track each object's pose and reconstruct its mesh. Keyframes with informative historical observations are stored in a memory pool to facilitate pose tracking in both stages. Stage II: At test time, given an RGB-D image, the masks of the objects in the workspace are segmented and each object's pose is estimated using the Keyframe Memory Pool as a reference. The reconstructed meshes are then transformed into camera coordinates with the estimated object poses. From this reconstructed scene point cloud, grasp generation produces the top-k grasp poses for real-world experiments. The dotted lines represent the supplementation of historical information.
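
To make the flow above concrete, here is a minimal, hypothetical sketch of the two-stage process. The function names (segment_fn, track_fn, reconstruct_fn, grasp_fn) and the KeyframeMemoryPool structure are illustrative placeholders, not the actual module interfaces of the pipeline.

```python
# Hedged sketch of the two-stage flow; component callables are placeholders.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class KeyframeMemoryPool:
    """Stores informative historical observations used as references for pose tracking."""
    keyframes: List[dict] = field(default_factory=list)

    def add(self, rgb, depth, mask, pose):
        self.keyframes.append({"rgb": rgb, "depth": depth, "mask": mask, "pose": pose})

def stage_one(frames, segment_fn, track_fn, reconstruct_fn, pool):
    """Stage I: register each object by tracking its pose and reconstructing its geometry."""
    objects = {}
    for rgb, depth in frames:
        masks = segment_fn(rgb)                      # video segmentation -> {obj_id: mask}
        for obj_id, mask in masks.items():
            pose = track_fn(rgb, depth, mask, pool)  # feature matching against keyframes
            objects[obj_id] = reconstruct_fn(obj_id, rgb, depth, mask, pose)  # (N, 3) points in object frame
            pool.add(rgb, depth, mask, pose)         # keep informative observations
    return objects

def stage_two(rgb, depth, objects, segment_fn, track_fn, grasp_fn, pool, k=10):
    """Stage II: estimate current poses, rebuild the scene point cloud, generate top-k grasps."""
    masks = segment_fn(rgb)
    scene_points = []
    for obj_id, mask in masks.items():
        pose = track_fn(rgb, depth, mask, pool)      # keyframe pool as reference
        pts = objects[obj_id]                        # reconstructed points in object frame
        scene_points.append(pts @ pose[:3, :3].T + pose[:3, 3])  # transform into camera frame
    scene_cloud = np.concatenate(scene_points, axis=0)
    return grasp_fn(scene_cloud)[:k]                 # top-k grasp poses on the completed scene
```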

Scene Scanning

A wrist camera is used to scan the workspace and capture multi-view RGB-D images.
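
A minimal sketch of what such a scanning routine could look like, assuming placeholder move_to and capture_rgbd callables for the robot and camera drivers; the actual trajectory and hardware interfaces are not specified here.

```python
# Hypothetical scanning loop: drive the wrist camera through viewpoints, record RGB-D frames.
from typing import Callable, List, Tuple
import numpy as np

def scan_workspace(waypoints: List[np.ndarray],
                   move_to: Callable[[np.ndarray], None],
                   capture_rgbd: Callable[[], Tuple[np.ndarray, np.ndarray]]):
    """Collect multi-view RGB-D observations along a predefined scan trajectory."""
    frames = []
    for pose in waypoints:
        move_to(pose)                 # move the wrist camera to the next viewpoint
        rgb, depth = capture_rgbd()   # one RGB-D observation of the workspace
        frames.append((rgb, depth))
    return frames
```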

Stage I

Visualization panels: RGB, Depth, Segmentation Masks, Poses.

Stage II

Visualization panels: RGB, Depth, Segmentation Masks, Poses, Partial Point Cloud, Completed Point Cloud, Top-k Grasp Poses.

Qualitative Results

Comparison of grasp generation accuracy across different input point cloud qualities. Partial denotes the partial point cloud back-projected from a depth image. YOSO (Ours) denotes the scene point cloud reconstructed by our YOSO pipeline. GT denotes the scene-level, fully visible point cloud, regarded as a perfectly reconstructed scene.
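
For reference, the Partial baseline corresponds to standard pinhole back-projection of a depth image. A minimal sketch is given below, where fx, fy, cx, cy are assumed camera intrinsics rather than values from the paper.

```python
# Standard depth back-projection: lift each valid depth pixel into a 3D point.
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert a depth image (H, W) in meters into an (N, 3) partial point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    valid = z > 0                                    # keep only pixels with a depth reading
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)
```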