Fang Wan

CV
h-index: 41
36 papers
2,324 citations
Novelty: 53%
AI Score: 62

36 Papers

CV Jul 19, 2023 Code
Generative Prompt Model for Weakly Supervised Object Localization

Yuzhong Zhao, Qixiang Ye, Weijia Wu et al.

Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp.

LG Jun 1 Code
Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation

Fang Wan, Jingxiang Qu, Yi Liu

Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at https://github.com/jiuguaiwf/UCD.

CV May 19, 2022 Code
Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Feng Liu, Xiaosong Zhang, Zhiliang Peng et al.

Modern object detectors have taken advantage of backbone networks pre-trained on large-scale datasets. Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is "fully pre-trained" so that detectors' generalization capacity is maximized. The essential differences between imTED and the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only reduce randomly initialized parameters significantly but also intentionally unify detector training with representation learning. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by ~2.4 AP. Without bells and whistles, imTED improves the state of the art in few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.

CV Jul 1, 2024 Code
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Mingxiang Liao, Hannan Lu, Xinyu Zhang et al.

Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL.

CV Aug 16, 2024
Correspondence-Guided SfM-Free 3D Gaussian Splatting for NVS

Wei Sun, Xiaosong Zhang, Fang Wan et al.

Novel View Synthesis (NVS) without Structure-from-Motion (SfM) pre-processed camera poses (referred to as SfM-free methods) is crucial for promoting rapid response capabilities and enhancing robustness against variable operating conditions. Recent SfM-free methods have integrated pose optimization, designing end-to-end frameworks for joint camera pose estimation and NVS. However, most existing works rely on per-pixel image loss functions, such as L2 loss. In SfM-free methods, inaccurate initial poses lead to misalignment, which, under the constraints of per-pixel image loss functions, results in excessive gradients, causing unstable optimization and poor convergence for NVS. In this study, we propose a correspondence-guided SfM-free 3D Gaussian splatting method for NVS. We use correspondences between the target and the rendered result to achieve better pixel alignment, facilitating the optimization of relative poses between frames. We then apply the learned poses to optimize the entire scene. Each 2D screen-space pixel is associated with its corresponding 3D Gaussians through approximated surface rendering to facilitate gradient backpropagation. Experimental results underline the superior performance and time efficiency of the proposed approach compared to the state-of-the-art baselines.

RO Aug 16, 2023
Proprioceptive Learning with Soft Polyhedral Networks

Xiaobo Liu, Xudong Han, Wei Hong et al.

Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of adaptive kinesthesia and viscoelastic proprioception by learning kinetic features. This design enables passive adaptations to omni-directional interactions, visually captured by a miniature high-speed motion tracking system embedded inside for proprioceptive learning. The results show that the soft network can infer real-time 6D forces and torques with accuracies of 0.25/0.24/0.35 N and 0.025/0.034/0.006 Nm in dynamic interactions. We also incorporate viscoelasticity in proprioception during static adaptation by adding a creep and relaxation modifier to refine the predicted results. The proposed soft network combines simplicity in design, omni-adaptation, and proprioceptive sensing with high accuracy, making it a versatile solution for robotics at a low cost with more than 1 million use cycles for tasks such as sensitive and competitive grasping, and touch-based geometry reconstruction. This study offers new insights into vision-based proprioception for soft robots in adaptive grasping, soft manipulation, and human-robot interaction.

CV Dec 9, 2025 Code
Thinking with Images via Self-Calling Agent

Wenxi Yang, Yuzhong Zhao, Fang Wan et al.

Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior and enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.

RO Aug 16, 2023
Autoencoding a Soft Touch to Learn Grasping from On-land to Underwater

Ning Guo, Xudong Han, Xiaobo Liu et al.

Robots play a critical role as the physical agent of human operators in exploring the ocean. However, it remains challenging to grasp objects reliably while fully submerged in a highly pressurized aquatic environment with little visible light, mainly due to the fluidic interference on the tactile mechanics between the finger and object surfaces. This study investigates the transferability of grasping knowledge from on-land to underwater via a vision-based soft robotic finger that learns 6D forces and torques (FT) using a Supervised Variational Autoencoder (SVAE). A high-framerate camera captures the whole-body deformations while a soft robotic finger interacts with physical objects on-land and underwater. Results show that the trained SVAE model learned a series of latent representations of the soft mechanics transferable from land to water, presenting a superior adaptation to the changing environments against commercial FT sensors. Soft, delicate, and reactive grasping enabled by tactile intelligence enhances the gripper's underwater interaction with improved reliability and robustness at a much-reduced cost, paving the path for learning-based intelligent grasping to support fundamental scientific discoveries in environmental and ocean research.

CV Jul 17, 2024
Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation

Kaixin Bai, Lei Zhang, Zhaopeng Chen et al.

Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both RGB and physically realistic depth images, surpassing previous dataset generation tools. We create an RGBD dataset tailored for robotic industrial grasping scenarios and evaluate it across various tasks, including object detection, instance segmentation, and embedding sim2real visual perception in industrial robotic grasping. By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings. Project details are available at https://baikaixinpublic.github.io/structured_light_3D_synthesizer/.

RO Apr 27
asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

Fang Wan, Guangyi Huang, Tianyu Wu et al.

We introduce asRoBallet, to the best of our knowledge, the first successful deployment of reinforcement learning (RL) on humanoid ballbot hardware. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-sphere-ground interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency and jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that were previously ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-sphere and sphere-ground interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.

CV Feb 6, 2024 Code
Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

Feng Liu, Tengteng Huang, Qianjing Zhang et al.

Multi-view 3D object detection systems often struggle with generating precise predictions due to the challenges in estimating depth from images, increasing redundant and incorrect detections. Our paper presents Ray Denoising, an innovative method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples, visually challenging to differentiate from true positives, compel the model to learn depth-aware features, thereby improving its capacity to distinguish between true and false positives. Ray Denoising is designed as a plug-and-play module, compatible with any DETR-style multi-view 3D detectors, and it only minimally increases training computational costs without affecting inference speed. Our comprehensive experiments, including detailed ablation studies, consistently demonstrate that Ray Denoising outperforms strong baselines across multiple datasets. It achieves a 1.9% improvement in mean Average Precision (mAP) over the state-of-the-art StreamPETR method on the NuScenes dataset. It shows significant performance gains on the Argoverse 2 dataset, highlighting its generalization capability. The code will be available at https://github.com/LiewFeng/RayDN.

CL Jul 28, 2025 Code
Geometric-Mean Policy Optimization

Yuzhong Zhao, Yue Liu, Junpeng Liu et al.

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim of improving the stability of GRPO by suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. GMPO is plug-and-play: it simply replaces GRPO's arithmetic mean with the geometric mean of token-level rewards. GMPO is also theoretically plausible: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient, while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
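The outlier argument in this abstract can be illustrated with a minimal sketch. It only compares the arithmetic and geometric means of a handful of positive per-token values and is not GMPO's actual objective (which also involves advantages and clipping). The function names and the sample `ratios` list are hypothetical, chosen for illustration.

```python
import math

def arithmetic_mean(values):
    """Plain average; a single large value pulls it up strongly."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean computed in the log domain (values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-token importance ratios; the last one is an outlier.
ratios = [1.0, 1.1, 0.9, 1.05, 8.0]

am = arithmetic_mean(ratios)
gm = geometric_mean(ratios)
print(f"arithmetic mean: {am:.3f}, geometric mean: {gm:.3f}")
```

With the outlier present, the geometric mean stays closer to the typical ratio than the arithmetic mean does, which is the intuition the abstract appeals to.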

CV Jan 31, 2024 Code
ControlCap: Controllable Region-level Captioning

Yuzhong Zhao, Yue Liu, Zonghao Guo et al.

Region-level captioning is challenged by the caption degeneration issue, in which pre-trained multimodal models tend to predict the most frequent captions but miss the less frequent ones. In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. Specifically, ControlCap leverages a discriminative module to generate control words within the caption space to partition it into multiple sub-spaces. The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the opportunity of hitting less frequent captions, alleviating the caption degeneration issue. Furthermore, interactive control words can be given by either a human or an expert model, which enables captioning beyond the training caption space, enhancing the model's generalization ability. Extensive experiments on Visual Genome and RefCOCOg datasets show that ControlCap respectively improves the CIDEr score by 21.6 and 2.2, outperforming the state of the art by significant margins. Code is available at https://github.com/callsys/ControlCap.

RO Jul 1, 2024
Evolutionary Morphology Towards Overconstrained Locomotion via Large-Scale, Multi-Terrain Deep Reinforcement Learning

Yenan Chen, Chuye Zhang, Pengxi Gu et al.

While the fin-to-limb evolution of animals has been well researched in biology, such morphological transformation remains under-adopted in the modern design of advanced robotic limbs. This paper investigates a novel class of overconstrained locomotion from a design and learning perspective inspired by evolutionary morphology, aiming to integrate the concept of 'intelligent design under constraints' (hereafter referred to as constraint-driven design intelligence) in developing modern robotic limbs with superior energy efficiency. We propose a 3D-printable design of robotic limbs parametrically reconfigurable as a classical planar 4-bar linkage, an overconstrained Bennett linkage, and a spherical 4-bar linkage. These limbs adopt co-axial actuation, identical to that of modern legged robot platforms, with the added capability of upgrading into a wheel-legged system. Then, we implemented a large-scale, multi-terrain deep reinforcement learning framework to train these reconfigurable limbs for a comparative analysis of overconstrained locomotion in energy efficiency. Results show that the overconstrained limbs exhibit more efficient locomotion than planar limbs during forward and sideways walking over different terrains, including floors, slopes, and stairs, with or without random noise, by saving at least 22% mechanical energy in completing the traverse task, with the spherical limbs being the least efficient. The overconstrained limbs also achieve the highest average speed of 0.85 meters per second on flat terrain, which is 20% faster than the planar limbs. This study paves the path for an exciting direction for future research in overconstrained robotics leveraging evolutionary morphology and reconfigurable mechanism intelligence when combined with state-of-the-art methods in deep reinforcement learning.

CV Apr 6, 2021 Code
Multiple instance active learning for object detection

Tianning Yuan, Fang Wan, Mengying Fu et al.

Despite the substantial progress of active learning for image recognition, an instance-level active learning method specific to object detection is still lacking. In this paper, we propose Multiple Instance Active Object Detection (MI-AOD), to select the most informative images for detector training by observing instance-level uncertainty. MI-AOD defines an instance uncertainty learning module, which leverages the discrepancy of two adversarial instance classifiers trained on the labeled set to predict instance uncertainty of the unlabeled set. MI-AOD treats unlabeled images as instance bags and feature anchors in images as instances, and estimates the image uncertainty by re-weighting instances in a multiple instance learning (MIL) fashion. Iterative instance uncertainty learning and re-weighting facilitate suppressing noisy instances, toward bridging the gap between instance uncertainty and image-level uncertainty. Experiments validate that MI-AOD sets a solid baseline for instance-level active learning. On commonly used object detection datasets, MI-AOD outperforms state-of-the-art methods with significant margins, particularly when the labeled sets are small. Code is available at https://github.com/yuantn/MI-AOD.

CV Nov 28, 2024
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Feng Liu, Shiwei Zhang, Xiaofeng Wang et al.

As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings so that their differences better approximate those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.
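The caching idea described in this abstract can be sketched as a small decision loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: the relative L1 distance, the accumulate-until-threshold rule, the identity `rescale` default, and all function names and sample values are hypothetical choices made for the example.

```python
def relative_l1(current, previous):
    """Relative L1 distance between two flat lists of floats."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous)
    return num / den if den > 0 else float("inf")

def plan_cache(modulated_inputs, threshold, rescale=lambda d: d):
    """Return one boolean per timestep: True means run the model,
    False means reuse the cached output.

    Assumption: estimated differences are accumulated across steps and
    the model is rerun once the accumulated estimate reaches `threshold`.
    """
    decisions = []
    accumulated = 0.0
    previous = None
    for x in modulated_inputs:
        if previous is None:
            recompute = True  # nothing cached yet at the first step
        else:
            accumulated += rescale(relative_l1(x, previous))
            recompute = accumulated >= threshold
        if recompute:
            accumulated = 0.0  # cache refreshed, restart accumulation
        decisions.append(recompute)
        previous = x
    return decisions

# Hypothetical timestep-modulated inputs (tiny vectors for readability).
inputs = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0], [1.5, 1.0], [1.5, 1.0]]
decisions = plan_cache(inputs, threshold=0.05)
print(decisions)
```

In this toy run the model is executed only at the first step and at the step where the input jumps; the remaining steps reuse the cache.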

LG Oct 3, 2025
Can Data-Driven Dynamics Reveal Hidden Physics? There Is A Need for Interpretable Neural Operators

Wenhan Gao, Jian Luo, Fang Wan et al.

Recently, neural operators have emerged as powerful tools for learning mappings between function spaces, enabling data-driven simulations of complex dynamics. Despite their successes, their learning mechanisms remain underexplored. In this work, we classify neural operators into two types: (1) Spatial domain models that learn on grids and (2) Functional domain models that learn with function bases. We present several viewpoints based on this classification and focus on learning data-driven dynamics adhering to physical principles. Specifically, we provide a way to explain the prediction-making process of neural operators and show that neural operators can learn hidden physical patterns from data. However, this explanation method is limited to specific situations, highlighting the urgent need for generalizable explanation methods. Next, we show that a simple dual-space multi-scale model can achieve SOTA performance, and we believe that dual-space multi-spatio-scale models hold significant potential to learn complex physics and require further investigation. Lastly, we discuss the critical need for principled frameworks to incorporate known physics into neural operators, enabling better generalization and uncovering more hidden physical phenomena.

CV Mar 27, 2021
TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization

Wei Gao, Fang Wan, Xingjia Pan et al.

Weakly supervised object localization (WSOL) is a challenging problem in which object localization models must be learned given only image category labels. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNN, where the convolution operations produce local receptive fields and have difficulty capturing long-range feature dependency among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produce attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics for patch tokens, enabling each of them to be aware of object categories. TS-CAM finally couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.

RO Jan 29, 2021
Learning-based Optoelectronically Innervated Tactile Finger for Rigid-Soft Interactive Grasping

Linhan Yang, Xudong Han, Weijie Guo et al.

This paper presents a novel design of a soft tactile finger with omni-directional adaptation using multi-channel optical fibers for rigid-soft interactive grasping. Machine learning methods are used to train a model for real-time prediction of force, torque, and contact using the tactile data collected. We further integrated such fingers in a reconfigurable gripper design with three fingers so that the finger arrangement can be actively adjusted in real-time based on the tactile data collected during grasping, achieving the process of rigid-soft interactive grasping. Detailed sensor calibration and experimental results are also included to further validate the proposed design for enhanced grasping robustness.

RO Dec 6, 2020
Design of an Optoelectronically Innervated Gripper for Rigid-Soft Interactive Grasping

Linhan Yang, Xudong Han, Weijie Guo et al.

Over the past few decades, efforts have been made towards robust robotic grasping, and therefore dexterous manipulation. Soft grippers have shown their potential in robust grasping due to their inherent properties: low control complexity and high adaptability. However, the deformation of the soft gripper when interacting with objects brings inaccuracy to the grasped objects, which causes instability for robust grasping and further manipulation. In this paper, we present an omni-directional adaptive soft finger that can sense deformation based on embedded optical fibers and the application of machine learning methods to interpret transmitted light intensities. Furthermore, to use tactile information provided by a soft finger, we design a low-cost, multi-degree-of-freedom gripper to conform to the shape of objects actively and optimize grasping policy, which is called Rigid-Soft Interactive Grasping. This grasping policy provides two main advantages: one is that more robust grasping can be achieved through active adaptation; the other is that the tactile information collected can be helpful for further manipulation.

CV Jun 26, 2020
Domain Contrast for Domain Adaptive Object Detection

Feng Liu, Xiaosong Zhang, Fang Wan et al.

We present Domain Contrast (DC), a simple yet effective approach inspired by contrastive learning for training domain adaptive detectors. DC is deduced from the error bound minimization perspective of a transferred model, and is implemented with cross-domain contrast loss which is plug-and-play. By minimizing cross-domain contrast loss, DC guarantees the transferability of detectors while naturally alleviating the class imbalance issue in the target domain. DC can be applied at either image level or region level, consistently improving detectors' transferability and discriminability. Extensive experiments on commonly used benchmarks show that DC improves the baseline and state-of-the-art by significant margins, while demonstrating great potential for large domain divergence.

RO May 6, 2020
DeepClaw: A Robotic Hardware Benchmarking Platform for Learning Object Manipulation

Fang Wan, Haokun Wang, Xiaobo Liu et al.

We present DeepClaw as a reconfigurable benchmark of robotic hardware and task hierarchy for robot learning. The DeepClaw benchmark aims at a mechatronics perspective of the robot learning problem, which features a minimum design of robot cell that can be easily reconfigured to host robot hardware from various vendors, including manipulators, grippers, cameras, desks, and objects, aiming at a streamlined collection of physical manipulation data and evaluation of the learned skills for hardware benchmarking. We provide a detailed design of the robot cell with readily available parts to build the experiment environment that can host a wide range of robotic hardware commonly adopted for robot learning. We also propose a hierarchical pipeline of software integration, including localization, recognition, grasp planning, and motion planning, to streamline learning-based robot control, data collection, and experiment validation towards shareability and reproducibility. We present benchmarking results of the DeepClaw system for a baseline Tic-Tac-Toe task, a bin-clearing task, and a jigsaw puzzle task using three sets of standard robotic hardware. Our results show that tasks defined in DeepClaw can be easily reproduced on three robot cells. Under the same task setup, differences in the robotic hardware used have a non-negligible impact on the performance metrics of robot learning. All design layouts and code are hosted on GitHub for open access.

CV Mar 31, 2020
Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Zhekun Luo, Devin Guillory, Baifeng Shi et al.

Weakly-supervised action localization requires training a model to localize the action segments in the video given only video-level action labels. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is determining which key instances within the bag trigger the bag's label. Most previous models use attention-based approaches, applying attention to generate the bag's representation from instances, and then train it via the bag's classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instances assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M process and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2.

ROMar 7, 2020
Hybrid Actuator Design for a Gait Augmentation Wearable

Fang Wan, Zheng Wang, Brooke Franchuk et al.

We describe a fluidic actuator design that replaces the sealed chamber of a hydraulic cylinder using a soft actuator to provide compliant linear compression with a large force ($\geq$100 N) at a low operation pressure ($\leq$50 kPa) for a lower-limb wearable. The external shells constrain the deformation of the soft actuator under fluidic pressurization. This enables us to use latex party balloons as a quick and cheap alternative for initial design investigation. We found that the forces exerted by the soft material deformation are well-captured by the rigid shells, removing the necessity of explicitly describing the mechanics of the soft material deformation and its interaction with the rigid structure. One can use the classical force, pressure, and area formula, scaled by an efficiency parameter, to characterize the actuator performance. Furthermore, we proposed an engineering design of the hybrid actuator using a customized soft actuator placed inside a single shell cavity with an open end for the compression force. Our results show that the proposed design can generate a very high force within a short stroke distance. At a low input pressure of 50 kPa, the exerted block force is only about 3\% less than the classical equation predicts. The actuator is fitted to a new gait augmentation design for correcting knee alignment, which is usually challenging for actuators made purely from soft material.
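
As a rough illustration of the force, pressure, and area relation mentioned in this abstract, the sketch below computes a block force as efficiency times pressure times area. The function names and the 20 cm² effective area are assumptions chosen for the example, not values from the paper; the 0.97 efficiency simply mirrors the reported "about 3\% less" at 50 kPa.

```python
def block_force(pressure_pa: float, area_m2: float, efficiency: float = 1.0) -> float:
    """Return the block force in newtons: F = efficiency * P * A."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return efficiency * pressure_pa * area_m2


def required_area(force_n: float, pressure_pa: float, efficiency: float = 1.0) -> float:
    """Return the effective area in square metres needed to reach force_n."""
    if pressure_pa <= 0.0 or efficiency <= 0.0:
        raise ValueError("pressure and efficiency must be positive")
    return force_n / (efficiency * pressure_pa)


if __name__ == "__main__":
    pressure = 50_000.0   # 50 kPa, the operating pressure quoted in the abstract
    area = 0.002          # 20 cm^2, a hypothetical effective area
    eta = 0.97            # assumed from "about 3% less than predicted"
    print(block_force(pressure, area))        # ideal force, efficiency = 1
    print(block_force(pressure, area, eta))   # force scaled by the efficiency
    print(required_area(100.0, pressure, eta))
```

With these assumed numbers, the ideal force is 100 N and the efficiency-scaled force is about 97 N.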

ROMar 1, 2020
A Reconfigurable Hybrid Actuator with Rigid and Soft Components

Yaohui Chen, Sing Le, Qiao Chu Tan et al.

Classical rigid-bodied robotic systems, with proven success in theoretical development and industrial applications, have recently been challenged by the emergence of soft robotics due to a growing need in physical human-robot interactions (pHRI), such as wearable devices, medical robots, and personal robots. In this paper, we present the design and fabrication of a robust, hybrid bending actuator built from both rigid and soft components inspired by crustaceans, where its bending radius and axis can be mechanically programmed through the selective activation of the rigid exterior joints, actuated by the soft actuators inside. The hybrid actuator was experimentally characterized through bending and force tests to demonstrate the utility of this design. Finally, a case study was presented to demonstrate its capacity to adapt to the geometry of specific objects, anticipating its potential application in situations where compliance is the priority.

ROMar 1, 2020
A Lobster-inspired Robotic Glove for Hand Rehabilitation

Yaohui Chen, Sing Le, Qiao Chu Tan et al.

This paper presents preliminary results of the design, development, and evaluation of a hand rehabilitation glove fabricated using lobster-inspired hybrid design with rigid and soft components for actuation. Inspired by the bending abdomen of lobsters, hybrid actuators are built with serially jointed rigid shells actuated by pressurized soft chambers inside to generate bending motions. Such bio-inspiration absorbs features from the classical rigid-bodied robotics with precisely-defined motion generation, as well as the emerging soft robotics with light-weight, physically safe, and adaptive actuation. The fabrication procedure is described, followed by experiments to mechanically characterize these actuators. Finally, an open-palm glove design integrated with these hybrid actuators is presented for a qualitative case study. A hand rehabilitation system is developed by learning patterns of the sEMG signals from the user's forearm to train the assistive glove for hand rehabilitation exercises.

ROFeb 29, 2020
Robotic Cane as a Soft SuperLimb for Elderly Sit-to-Stand Assistance

Xia Wu, Haiyuan Liu, Ziqi Liu et al.

Many researchers have identified robotics as a potential solution to the aging population faced by many developed and developing countries. If so, how should we address the cognitive acceptance and ambient control of elderly assistive robots through design? In this paper, we proposed an explorative design of an ambient SuperLimb (Supernumerary Robotic Limb) system that involves a pneumatically-driven robotic cane for at-home motion assistance, an inflatable vest for compliant human-robot interaction, and a depth sensor for ambient intention detection. The proposed system aims at providing active assistance during the sit-to-stand transition for at-home usage by the elderly at the bedside, in the chair, and on the toilet. We proposed a modified biomechanical model with a linear cane robot for closed-loop control implementation. We validated the design feasibility of the proposed ambient SuperLimb system, including the biomechanical model. Our results showed advantages in reducing lower limb efforts and elderly fall risks, yet the detection accuracy using depth sensing and adjustments to the model still require further research. Nevertheless, we summarized empirical guidelines to support the ambient design of elderly-assistive SuperLimb systems for lower limb functional augmentation.

ROFeb 29, 2020
Rigid-Soft Interactive Learning for Robust Grasping

Linhan Yang, Fang Wan, Haokun Wang et al.

Inspired by the soft fingers widely used in grasping, we propose a method of rigid-soft interactive learning, aiming at reducing the time of data collection. In this paper, we classify the interaction categories into Rigid-Rigid, Rigid-Soft, and Soft-Rigid according to the interaction surface between grippers and target objects. We find experimental evidence that the interaction types between grippers and target objects play an essential role in the learning methods. We use soft, stuffed toys for training, instead of everyday objects, to reduce the integration complexity and computational burden, and exploit such rigid-soft interaction by changing the gripper fingers to soft ones when dealing with rigid, daily-life items such as the Yale-CMU-Berkeley (YCB) objects. With a small data collection of 5K picking attempts in total, our results suggest that such Rigid-Soft and Soft-Rigid interactions are transferable. Moreover, the combination of different grasp types shows better performance on the grasping test. We achieve the best grasping performance at 97.5\% for easy YCB objects and 81.3\% for difficult YCB objects while using a precise grasp with a two-soft-finger gripper to collect training data and a power grasp with a four-soft-finger gripper to test.

ROFeb 29, 2020
Scalable Tactile Sensing for an Omni-adaptive Soft Robot Finger

Zeyi Yang, Sheng Ge, Fang Wan et al.

Robotic fingers made of soft material and compliant structures usually lead to superior adaptation when interacting with the unstructured physical environment. In this paper, we present an embedded sensing solution using optical fibers for an omni-adaptive soft robotic finger with exceptional adaptation in all directions. In particular, we managed to insert a pair of optical fibers inside the finger's structural cavity without interfering with its adaptive performance. The resultant integration is scalable as a versatile, low-cost, and moisture-proof solution for physically safe human-robot interaction. In addition, we tested our finger design on an object sorting task, identifying the sectional diameters of 94\% of objects within a $\pm$6 mm error and measuring 80\% of the structural strains within a $\pm$0.1 mm/mm error. The proposed sensor design opens many doors in future applications of soft robotics for scalable and adaptive physical interactions in the unstructured environment.

ROFeb 29, 2020
Reconfigurable Design for Omni-adaptive Grasp Learning

Fang Wan, Haokun Wang, Jiyuan Wu et al.

The engineering design of robotic grippers presents an ample design space for optimization towards robust grasping. In this paper, we adopt the reconfigurable design of the robotic gripper using a novel soft finger structure with omni-directional adaptation, which generates a large number of possible gripper configurations by rearranging these fingers. Such reconfigurable design with these omni-adaptive fingers enables us to systematically investigate the optimal arrangement of the fingers towards robust grasping. Furthermore, we adopt a learning-based method as the baseline to benchmark the effectiveness of each design configuration. As a result, we found that the 3-finger and 4-finger radial configurations are the most effective, achieving an average 96\% grasp success rate on seen and novel objects selected from the YCB dataset. We also discussed the influence of the frictional surface on the finger to improve the grasp robustness.

CVSep 5, 2019
FreeAnchor: Learning to Match Anchors for Visual Object Detection

Xiaosong Zhang, Fang Wan, Chang Liu et al.

Modern CNN-based object detectors assign anchors for ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In this study, we propose a learning-to-match approach to break the IoU restriction, allowing objects to match anchors in a flexible manner. Our approach, referred to as FreeAnchor, updates hand-crafted anchor assignment to "free" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor targets learning features which best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing a detection-customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on COCO demonstrate that FreeAnchor consistently outperforms its counterparts by significant margins.
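
For readers unfamiliar with the IoU restriction mentioned in this abstract, the following sketch computes IoU for axis-aligned boxes and applies a hand-crafted, threshold-based anchor assignment of the kind FreeAnchor is described as replacing. It is not the FreeAnchor method; the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions for illustration.

```python
from typing import List, Optional, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors: List[Box], objects: List[Box],
                   threshold: float = 0.5) -> List[Optional[int]]:
    """Assign each anchor to the object with the highest IoU.

    An anchor whose best IoU falls below the (assumed) threshold is left
    unassigned, represented by None.
    """
    assignment: List[Optional[int]] = []
    for anchor in anchors:
        best_idx, best_iou = None, 0.0
        for idx, obj in enumerate(objects):
            iou = box_iou(anchor, obj)
            if iou > best_iou:
                best_idx, best_iou = idx, iou
        assignment.append(best_idx if best_iou >= threshold else None)
    return assignment


if __name__ == "__main__":
    objects = [(0.0, 0.0, 2.0, 2.0)]
    anchors = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
    print([box_iou(a, objects[0]) for a in anchors])
    print(assign_anchors(anchors, objects))
```

In the example, the second anchor overlaps the object with an IoU of 1/7 and is rejected by the fixed threshold, which is the rigidity a learned matching scheme aims to relax.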

CVJun 14, 2019
Utilizing the Instability in Weakly Supervised Object Detection

Yan Gao, Boxiao Liu, Nan Guo et al.

Weakly supervised object detection (WSOD) focuses on training object detectors with only image-level annotations, and is challenging due to the gap between the supervision and the objective. Most existing approaches model WSOD as a multiple instance learning (MIL) problem. However, we observe that the result of an MIL-based detector is unstable, i.e., the most confident bounding boxes change significantly when using different initializations. We quantitatively demonstrate the instability by introducing a metric to measure it, and empirically analyze the reason for the instability. Although the instability seems harmful for the detection task, we argue that it can be utilized to improve performance by fusing the results of differently initialized detectors. To implement this idea, we propose an end-to-end framework with multiple detection branches, and introduce a simple fusion strategy. We further propose an orthogonal initialization method to increase the difference between detection branches. By utilizing the instability, we achieve 52.6% and 48.0% mAP on the challenging PASCAL VOC 2007 and 2012 datasets, both of which are new state-of-the-art results.

CVApr 11, 2019
C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection

Fang Wan, Chang Liu, Wei Ke et al.

Weakly supervised object detection (WSOD) is a challenging task when provided with image category supervision but required to simultaneously learn object locations and object detectors. Many WSOD approaches adopt multiple instance learning (MIL) and have non-convex loss functions which are prone to getting stuck in local minima (falsely localizing object parts) while missing full object extent during training. In this paper, we introduce a continuation optimization method into MIL and thereby create continuation multiple instance learning (C-MIL), with the intention of alleviating the non-convexity problem in a systematic way. We partition instances into spatially related and class related subsets, and approximate the original loss function with a series of smoothed loss functions defined within the subsets. Optimizing smoothed loss functions prevents the training procedure from falling prematurely into local minima and facilitates the discovery of Stable Semantic Extremal Regions (SSERs) which indicate full object extent. On the PASCAL VOC 2007 and 2012 datasets, C-MIL improves the state-of-the-art of weakly supervised object detection and weakly supervised object localization by large margins.

CVFeb 16, 2019
Min-Entropy Latent Model for Weakly Supervised Object Detection

Fang Wan, Pengxu Wei, Zhenjun Han et al.

Weakly supervised object detection is a challenging task when provided with image category supervision but required to learn, at the same time, object locations and object detectors. The inconsistency between the weak supervision and learning objectives introduces significant randomness to object locations and ambiguity to detectors. In this paper, a min-entropy latent model (MELM) is proposed for weakly supervised object detection. Min-entropy serves as a model to learn object locations and a metric to measure the randomness of object localization during learning. It aims to principally reduce the variance of learned instances and alleviate the ambiguity of detectors. MELM is decomposed into three components including proposal clique partition, object clique discovery, and object localization. MELM is optimized with a recurrent learning algorithm, which leverages continuation optimization to solve the challenging non-convexity problem. Experiments demonstrate that MELM significantly improves the performance of weakly supervised object detection, weakly supervised object localization, and image classification, against the state-of-the-art approaches.

CVJan 2, 2019
SIXray: A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images

Caijing Miao, Lingxi Xie, Fang Wan et al.

In this paper, we present a large-scale dataset and establish a baseline for prohibited item discovery in Security Inspection X-ray images. Our dataset, named SIXray, consists of 1,059,231 X-ray images, in which 6 classes of 8,929 prohibited items are manually annotated. It raises a brand new challenge of overlapping image data, while sharing the same properties with existing datasets, including complex yet meaningless contexts and class imbalance. We propose an approach named class-balanced hierarchical refinement (CHR) to deal with these difficulties. CHR assumes that each input image is sampled from a mixture distribution, and that deep networks require an iterative process to infer image contents accurately. To accelerate, we insert reversed connections to different network backbones, delivering high-level visual cues to assist mid-level features. In addition, a class-balanced loss function is designed to maximally alleviate the noise introduced by easy negative samples. We evaluate CHR on SIXray with different ratios of positive/negative samples. Compared to the baselines, CHR is better at discriminating objects, especially when using mid-level features, which offers the possibility of using a weakly-supervised approach towards accurate object localization. In particular, the advantage of CHR is more significant in scenarios with fewer positive training samples, which demonstrates its potential application in real-world security inspection.

AIMay 23, 2017
Logical Learning Through a Hybrid Neural Network with Auxiliary Inputs

Fang Wan, Chaoyang Song

The human reasoning process is seldom a one-way process from an input leading to an output. Instead, it often involves a systematic deduction by ruling out other possible outcomes as a self-checking mechanism. In this paper, we describe the design of a hybrid neural network for logical learning that is similar to the human reasoning through the introduction of an auxiliary input, namely the indicators, that act as the hints to suggest logical outcomes. We generate these indicators by digging into the hidden information buried underneath the original training data for direct or indirect suggestions. We used the MNIST data to demonstrate the design and use of these indicators in a convolutional neural network. We trained a series of such hybrid neural networks with variations of the indicators. Our results show that these hybrid neural networks are very robust in generating logical outcomes with inherently higher prediction accuracy than the direct use of the original input and output in apparent models. Such improved predictability with reassured logical confidence is obtained through the exhaustion of all possible indicators to rule out all illogical outcomes, which is not available in the apparent models. Our logical learning process can effectively cope with the unknown unknowns using a full exploitation of all existing knowledge available for learning. The design and implementation of the hints, namely the indicators, become an essential part of artificial intelligence for logical learning. We also introduce an ongoing application setup for this hybrid neural network in an autonomous grasping robot, namely as_DeepClaw, aiming at learning an optimized grasping pose through logical learning.