DexGrasp-Diffusion: Diffusion-based Unified Functional Grasp Synthesis Pipeline for Multi-Dexterous Robotic Hands

Zhengshen Zhang1, Lei Zhou1, Chenchen Liu1, Zhiyang Liu1, Chengran Yuan1, Sheng Guo1, Ruiteng Zhao1, Marcelo H. Ang Jr.1, Francis EH Tay1
1National University of Singapore

Abstract

The versatility and adaptability of human grasping motivate advances in dexterous robotic manipulation. While significant strides have been made in dexterous grasp generation, current research is shifting toward manipulating objects while preserving their function, emphasizing the synthesis of functional grasps that follow desired affordance instructions. This paper addresses the challenge of synthesizing functional grasps tailored to diverse dexterous robotic hands by proposing DexGrasp-Diffusion, an end-to-end modularized diffusion-based pipeline. DexGrasp-Diffusion integrates MultiHandDiffuser, a novel unified data-driven diffusion model for grasp estimation across multiple dexterous hands, with DexDiscriminator, which employs a Physics Discriminator and a Functional Discriminator in an open-vocabulary setting to filter physically plausible functional grasps based on object affordances. Experiments on the MultiDex dataset show that MultiHandDiffuser outperforms the baseline model in success rate, grasp diversity, and collision depth. Moreover, we demonstrate that DexGrasp-Diffusion can reliably generate functional grasps for household objects aligned with specific affordance instructions.

Pipeline

Training: Given a ground-truth hand pose h0 from the dataset, Gaussian noise ε is gradually added to h0 to obtain a series of intermediate hand poses ht. Conditioned on the time step, object point cloud, finger padding mask, hand class, and hand point cloud, the diffusion model predicts the noise added to h0. Testing: Starting from a noisy hand pose sampled from a standard multivariate Gaussian distribution, hT ~ N(0, I), the model corrects ht to a less noisy pose ht-1 at each time step. Subsequently, two discriminators are applied to keep only physically feasible and functional grasps.
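The training step can be illustrated with a minimal NumPy sketch. It assumes the standard DDPM closed form ht = sqrt(ᾱt) h0 + sqrt(1 - ᾱt) ε with a linear noise schedule and a mean-squared-error loss on the predicted noise; the schedule values, the pose dimension, and the function names `make_alpha_bars`, `add_noise`, and `noise_prediction_loss` are illustrative assumptions, not details taken from this page. The conditioning inputs (object point cloud, hand class, and so on) are omitted.

```python
import numpy as np

def make_alpha_bars(T=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def add_noise(h0, t, alpha_bars, rng):
    """Sample h_t from h_0 in one shot; returns the noisy pose and the noise used."""
    eps = rng.standard_normal(h0.shape)
    a = alpha_bars[t]
    ht = np.sqrt(a) * h0 + np.sqrt(1.0 - a) * eps
    return ht, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

# Example: noise a hypothetical 24-dimensional hand pose at step t = 50.
rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
h0 = np.zeros(24)
ht, eps = add_noise(h0, 50, alpha_bars, rng)
```

In a real training loop, `eps_pred` would come from the conditioned network evaluated on `ht` and `t`, and the loss would be minimized with a gradient-based optimizer.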

Denoising Process

Starting from random initial hand poses, our diffusion model iteratively denoises the hand poses over T steps.
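A rough sketch of such an iterative loop is given here, assuming standard DDPM ancestral sampling with a linear noise schedule. The `predict_noise` callable stands in for the trained, conditioned network and is hypothetical, as are the schedule values and the pose dimension; the sampler actually used may differ.

```python
import numpy as np

def sample_hand_pose(predict_noise, dim, T=100, beta_start=1e-4, beta_end=0.02, seed=0):
    """Run T reverse steps from h_T ~ N(0, I) down to h_0.

    predict_noise(h, t) is a placeholder for the trained noise-prediction model.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    h = rng.standard_normal(dim)  # random initial hand pose h_T
    for t in range(T - 1, -1, -1):
        eps_pred = predict_noise(h, t)
        # Posterior mean of h_{t-1} given h_t and the predicted noise.
        mean = (h - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        h = mean + np.sqrt(betas[t]) * noise
    return h

# Example with a dummy predictor that always returns zero noise.
pose = sample_hand_pose(lambda h, t: np.zeros_like(h), dim=24)
```

Sampling several poses with different seeds gives a set of candidate grasps that can then be passed to the discriminators.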

Functional Discriminator

Given an open-vocabulary affordance label, the functional discriminator performs point cloud segmentation. Green points mark the region where the robot can grasp the object without impeding its intended functionality.
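One way a segmentation like this could be used to filter grasps is sketched here. The criterion is an assumption made for illustration, not the rule described on this page: a grasp is kept when a sufficient fraction of its hand contact points lie nearest to object points labeled graspable. The names `graspable_contact_ratio` and `is_functional` and the `min_ratio` threshold are hypothetical.

```python
import numpy as np

def graspable_contact_ratio(contact_points, object_points, graspable_mask):
    """Fraction of contacts whose nearest object point is labeled graspable.

    contact_points: (C, 3) hand contact locations
    object_points:  (N, 3) object point cloud
    graspable_mask: (N,) boolean per-point segmentation output
    """
    # Pairwise distances between contacts and object points, shape (C, N).
    diffs = contact_points[:, None, :] - object_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(graspable_mask[nearest]))

def is_functional(contact_points, object_points, graspable_mask, min_ratio=0.9):
    """Keep a grasp only if enough contacts fall in the graspable region."""
    ratio = graspable_contact_ratio(contact_points, object_points, graspable_mask)
    return ratio >= min_ratio

# Toy example: three object points along a line, the last one not graspable.
object_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
graspable_mask = np.array([True, True, False])
contacts = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
keep = is_functional(contacts, object_points, graspable_mask)
```

The brute-force distance matrix is fine for small point clouds; a spatial index such as a k-d tree would be a natural replacement for larger ones.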
