GenHeld: Generating and Editing Handheld Objects
GenHeld addresses a specific problem in robotics and computer vision, generating realistic grasps, but it appears incremental because it builds on existing object synthesis and image editing techniques.
The paper tackles the inverse problem of synthesizing handheld objects conditioned on hand inputs, either a 3D hand model or a 2D hand image, and reports that the method outperforms baselines, generating high-quality, plausible held objects in both 2D and 3D.
Grasping is an important human activity that has long been studied in robotics, computer vision, and cognitive science. Most existing works study grasping from the perspective of synthesizing hand poses conditioned on 3D or 2D object representations. We propose GenHeld to address the inverse problem of synthesizing held objects conditioned on a 3D hand model or a 2D image. Given a 3D model of a hand, GenHeld 3D can select a plausible held object from a large dataset using compact object representations called object codes. The selected object is then positioned and oriented to form a plausible grasp without changing the hand pose. If only a 2D hand image is available, GenHeld 2D can edit this image to add or replace a held object. GenHeld 2D operates by combining the abilities of GenHeld 3D with diffusion-based image editing. Our experiments show that we outperform baselines and generate high-quality, plausible held objects in both 3D and 2D.
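
The two GenHeld 3D steps described above, selecting an object by its code and then placing it relative to a fixed hand, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how object codes are compared or how the object pose is found, so the Euclidean nearest-neighbor lookup, the hand-supplied rigid transform, and the names select_object and place_object are all assumptions made for illustration.

```python
import math

def select_object(query_code, object_codes):
    """Return the index of the stored code nearest to query_code.

    Assumption: codes are compared by Euclidean distance; the source
    does not specify the metric.
    """
    dists = [math.dist(query_code, code) for code in object_codes]
    return dists.index(min(dists))

def place_object(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to
    object points. Only the object moves; the hand pose is untouched."""
    placed = []
    for p in points:
        rotated = [sum(r * c for r, c in zip(row, p)) for row in rotation]
        placed.append([v + t for v, t in zip(rotated, translation)])
    return placed

# Toy data: three hypothetical objects with 2-D codes.
object_codes = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.5]]
query_code = [0.9, 1.2]  # stand-in for a code derived from the hand
idx = select_object(query_code, object_codes)

# Hypothetical pose: rotate 90 degrees about z, then shift toward the palm.
rotation = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0]]
translation = [0.1, 0.0, 0.05]
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
placed = place_object(points, rotation, translation)
```

In this sketch the pose is supplied by hand; a real system would have to estimate the rotation and translation so that the object forms a plausible grasp with the given hand.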