52.1ROJun 3
TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent ElastomersLingyue Yang, Bin Fang
Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address this limitation, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of $70.
ROFeb 21, 2023
Deep Reinforcement Learning for Robotic Pushing and Picking in Cluttered EnvironmentYuhong Deng, Xiaofeng Guo, Yixuan Wei et al.
In this paper, a novel robotic grasping system is established to automatically pick up objects in cluttered scenes. A composite robotic hand composed of a suction cup and a gripper is designed for grasping the object stably. The suction cup is used for lifting the object from the clutter first and the gripper for grasping the object accordingly. We utilize the affordance map to provide pixel-wise lifting point candidates for the suction cup. To obtain a good affordance map, the active exploration mechanism is introduced to the system. An effective metric is designed to calculate the reward for the current affordance map, and a deep Q-Network (DQN) is employed to guide the robotic hand to actively explore the environment until the generated affordance map is suitable for grasping. Experimental results have demonstrated that the proposed robotic grasping system is able to greatly increase the success rate of the robotic grasping in cluttered scenes.
99.8ROApr 13Code
RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated ManipulationShihan Wu, Xuecheng Liu, Shaoxuan Xie et al.
Despite the critical role of bimanual manipulation in endowing robots with human-like dexterity, large-scale and diverse datasets remain scarce due to the significant hardware heterogeneity across bimanual robotic platforms. To bridge this gap, we introduce RoboCOIN, a large-scale multi-embodiment bimanual manipulation dataset comprising over 180,000 demonstrations collected from 15 distinct robotic platforms. Spanning 16 diverse environments-including residential, commercial, and industrial settings-the dataset features 421 bimanual tasks systematically categorized by 39 bimanual collaboration actions and 432 objects. A key innovation of our work is the hierarchical capability pyramid, which provides granular annotations ranging from trajectory-level concepts to segment-level subtasks and frame-level kinematics. Furthermore, we present CoRobot, an efficient data processing pipeline powered by the Robot Trajectory Markup Language (RTML), designed to facilitate quality assessment, automated annotation, and unified multi-embodiment and data management. Extensive experiments demonstrate the effectiveness of RoboCOIN in enhancing the performance of various bimanual manipulation models across a wide spectrum of robotic embodiments. The entire dataset and codebase are fully open-sourced, providing a valuable resource for advancing research in bimanual and multi-embodiment manipulation.
CVOct 4, 2022
ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather ConditionsAnjun Chen, Xiangyu Wang, Kun Shi et al.
3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementary, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for robust all-weather 3D human reconstruction is still an open challenge, given the sparse nature of mmWave and the vulnerability of RGB images. In this paper, we present ImmFusion, the first mmWave-RGB fusion solution to reconstruct 3D human bodies in all weather conditions robustly. Specifically, our ImmFusion consists of image and point backbones for token feature extraction and a Transformer module for token fusion. The image and point backbones refine global and local features from original data, and the Fusion Transformer Module aims for effective information fusion of two modalities by dynamically selecting informative tokens. Extensive experiments on a large-scale dataset, mmBody, captured in various environments demonstrate that ImmFusion can efficiently utilize the information of two modalities to achieve a robust 3D human body reconstruction in all weather conditions. In addition, our method's accuracy is significantly superior to that of state-of-the-art Transformer-based LiDAR-camera fusion methods.
59.4ROMay 17
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven ParadigmsZhixiang Cao, Di Tian, Runwei Guan et al.
Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, the existing researches remain fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources ranging from Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, focusing on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, emphasizing feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.
CVOct 15, 2022
Transformer-based dimensionality reductionRuisheng Ran, Tianyu Gao, Bin Fang
Recently, Transformer is much popular and plays an important role in the fields of Machine Learning (ML), Natural Language Processing (NLP), and Computer Vision (CV), etc. In this paper, based on the Vision Transformer (ViT) model, a new dimensionality reduction (DR) model is proposed, named Transformer-DR. From data visualization, image reconstruction and face recognition, the representation ability of Transformer-DR after dimensionality reduction is studied, and it is compared with some representative DR methods to understand the difference between Transformer-DR and existing DR methods. The experimental results show that Transformer-DR is an effective dimensionality reduction method.
ROFeb 10
AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile PerceptionRuoxuan Feng, Yuxuan Zhou, Siyu Mei et al.
Real-world contact-rich manipulation demands robots to perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We consider that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that covers static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities-from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.
74.9ROMar 10
ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision AssemblyMinchi Ruan, LiangQing Zhou, Hongtong Li et al.
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
CVAug 22, 2025Code
A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect DetectionYong Zhang, Cunjian Chen, Qiang Gao et al.
Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.
CVFeb 29, 2020Code
Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image TranslationRunfa Chen, Wenbing Huang, Binghui Huang et al.
Unsupervised image-to-image translation is a central task in computer vision. Current translation frameworks will abandon the discriminator once the training process is completed. This paper contends a novel role of the discriminator by reusing it for encoding the images of the target domain. The proposed architecture, termed as NICE-GAN, exhibits two advantageous patterns over previous approaches: First, it is more compact since no independent encoding component is required; Second, this plug-in encoder is directly trained by the adversary loss, making it more informative and trained more effectively if a multi-scale discriminator is applied. The main issue in NICE-GAN is the coupling of translation with discrimination along the encoder, which could incur training inconsistency when we play the min-max game via GAN. To tackle this issue, we develop a decoupled training strategy by which the encoder is only trained when maximizing the adversary loss while keeping frozen otherwise. Extensive experiments on four popular benchmarks demonstrate the superior performance of NICE-GAN over state-of-the-art methods in terms of FID, KID, and also human preference. Comprehensive ablation studies are also carried out to isolate the validity of each proposed component. Our codes are available at https://github.com/alpc91/NICE-GAN-pytorch.
LGFeb 15, 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile SensorsRuoxuan Feng, Jiangyu Hu, Wenke Xia et al.
Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to precisely understand and manipulate objects. Over time, numerous meticulously designed visuo-tactile sensors have been integrated into robotic systems, aiding in completing various tasks. However, the distinct data characteristics of these low-standardized visuo-tactile sensors hinder the establishment of a powerful tactile perception system. We consider that the key to addressing this issue lies in learning unified multi-sensor representations, thereby integrating the sensors and promoting tactile knowledge transfer between them. To achieve unified representation of this nature, we introduce TacQuad, an aligned multi-modal multi-sensor tactile dataset from four different visuo-tactile sensors, which enables the explicit integration of various sensors. Recognizing that humans perceive the physical environment by acquiring diverse tactile information such as texture and pressure changes, we further propose to learn unified multi-sensor representations from both static and dynamic perspectives. By integrating tactile images and videos, we present AnyTouch, a unified static-dynamic multi-sensor representation learning framework with a multi-level structure, aimed at both enhancing comprehensive perceptual abilities and enabling effective cross-sensor transfer. This multi-level architecture captures pixel-level details from tactile data via masked modeling and enhances perception and transferability by learning semantic-level sensor-agnostic features through multi-modal alignment and cross-sensor matching. We provide a comprehensive analysis of multi-sensor transferability, and validate our method on various datasets and in the real-world pouring task. Experimental results show that our method outperforms existing methods, exhibits outstanding static and dynamic perception capabilities across various sensors.
ROFeb 11, 2025
RoboBERT: An End-to-end Multimodal Robotic Manipulation ModelSicheng Wang, Sheng Liu, Weiheng Wang et al.
Embodied intelligence seamlessly integrates vision, language, and action.~However, most multimodal robotic models rely on massive fine-tuning, incurring high time and hardware costs.~To address this, we introduce RoboBERT, an end-to-end multimodal manipulation model built around a novel two-stage training paradigm.~In the first stage, we freeze most of the vision encoder and train with a single "standard" instruction phrasing, allowing the model to focus on stable policy learning via a CNN-based diffusion policy.~In the second stage, we unfreeze all modules and inject diverse natural language variants, rapidly aligning varied instructions to the already-learned policy without destabilizing performance.~We further employ systematic data augmentations to enhance robustness against visual perturbations.~Without relying on auxiliary datasets, RoboBERT achieves new state-of-the-art (SOTA) mean episode lengths of 4.52 on the CALVIN ABCD-D benchmark and 3.79 on the ABC-D benchmark using only language-labeled expert demonstrations and a comparatively lightweight architecture.Real-robot trials on a 6-DOF manipulator confirm higher success rates than comparable methods trained on identical data.These results demonstrate that our data-augmentation-enhanced two-stage training paradigm delivers efficient, scalable, and broadly applicable performance for multimodal robotic systems.
LGMay 21, 2024
Transformer in Touch: A SurveyJing Gao, Ning Cheng, Bin Fang et al.
The Transformer model, initially achieving significant success in the field of natural language processing, has recently shown great potential in the application of tactile perception. This review aims to comprehensively outline the application and development of Transformers in tactile technology. We first introduce the two fundamental concepts behind the success of the Transformer: the self-attention mechanism and large-scale pre-training. Then, we delve into the application of Transformers in various tactile tasks, including but not limited to object recognition, cross-modal generation, and object manipulation, offering a concise summary of the core methodologies, performance benchmarks, and design highlights. Finally, we suggest potential areas for further research and future work, aiming to generate more interest within the community, tackle existing challenges, and encourage the use of Transformer models in the tactile field.
IVDec 20, 2024
PIGUIQA: A Physical Imaging Guided Perceptual Framework for Underwater Image Quality AssessmentWeizhi Xian, Mingliang Zhou, Leong Hou U et al.
In this paper, we propose a Physical Imaging Guided perceptual framework for Underwater Image Quality Assessment (UIQA), termed PIGUIQA. First, we formulate UIQA as a comprehensive problem that considers the combined effects of direct transmission attenuation and backward scattering on image perception. By leveraging underwater radiative transfer theory, we systematically integrate physics-based imaging estimations to establish quantitative metrics for these distortions. Second, recognizing spatial variations in image content significance and human perceptual sensitivity to distortions, we design a module built upon a neighborhood attention mechanism for local perception of images. This module effectively captures subtle features in images, thereby enhancing the adaptive perception of distortions on the basis of local information. Third, by employing a global perceptual aggregator that further integrates holistic image scene with underwater distortion information, the proposed model accurately predicts image quality scores. Extensive experiments across multiple benchmarks demonstrate that PIGUIQA achieves state-of-the-art performance while maintaining robust cross-dataset generalizability. The implementation is publicly available at https://anonymous.4open.science/r/PIGUIQA-A465/
CVFeb 19, 2024
A Lightweight Parallel Framework for Blind Image Quality AssessmentQunyue Huang, Bin Fang
Existing blind image quality assessment (BIQA) methods focus on designing complicated networks based on convolutional neural networks (CNNs) or transformer. In addition, some BIQA methods enhance the performance of the model in a two-stage training manner. Despite the significant advancements, these methods remarkably raise the parameter count of the model, thus requiring more training time and computational resources. To tackle the above issues, we propose a lightweight parallel framework (LPF) for BIQA. First, we extract the visual features using a pre-trained feature extraction network. Furthermore, we construct a simple yet effective feature embedding network (FEN) to transform the visual features, aiming to generate the latent representations that contain salient distortion information. To improve the robustness of the latent representations, we present two novel self-supervised subtasks, including a sample-level category prediction task and a batch-level quality comparison task. The sample-level category prediction task is presented to help the model with coarse-grained distortion perception. The batch-level quality comparison task is formulated to enhance the training data and thus improve the robustness of the latent representations. Finally, the latent representations are fed into a distortion-aware quality regression network (DaQRN), which simulates the human vision system (HVS) and thus generates accurate quality scores. Experimental results on multiple benchmark datasets demonstrate that the proposed method achieves superior performance over state-of-the-art approaches. Moreover, extensive analyses prove that the proposed method has lower computational complexity and faster convergence speed.
ROMar 12, 2025
Tacchi 2.0: A Low Computational Cost and Comprehensive Dynamic Contact Simulator for Vision-based Tactile SensorsYuhao Sun, Shixin Zhang, Wenzhuang Li et al.
With the development of robotics technology, some tactile sensors, such as vision-based sensors, have been applied to contact-rich robotics tasks. However, the durability of vision-based tactile sensors significantly increases the cost of tactile information acquisition. Utilizing simulation to generate tactile data has emerged as a reliable approach to address this issue. While data-driven methods for tactile data generation lack robustness, finite element methods (FEM) based approaches require significant computational costs. To address these issues, we integrated a pinhole camera model into the low computational cost vision-based tactile simulator Tacchi that used the Material Point Method (MPM) as the simulated method, completing the simulation of marker motion images. We upgraded Tacchi and introduced Tacchi 2.0. This simulator can simulate tactile images, marked motion images, and joint images under different motion states like pressing, slipping, and rotating. Experimental results demonstrate the reliability of our method and its robustness across various vision-based tactile sensors.
IVFeb 11, 2025
The establishment of static digital humans and the integration with spinal modelsFujiao Ju, Yuxuan Wang, Shuo Wang et al.
Adolescent idiopathic scoliosis (AIS), a prevalent spinal deformity, significantly affects individuals' health and quality of life. Conventional imaging techniques, such as X - rays, computed tomography (CT), and magnetic resonance imaging (MRI), offer static views of the spine. However, they are restricted in capturing the dynamic changes of the spine and its interactions with overall body motion. Therefore, developing new techniques to address these limitations has become extremely important. Dynamic digital human modeling represents a major breakthrough in digital medicine. It enables a three - dimensional (3D) view of the spine as it changes during daily activities, assisting clinicians in detecting deformities that might be missed in static imaging. Although dynamic modeling holds great potential, constructing an accurate static digital human model is a crucial initial step for high - precision simulations. In this study, our focus is on constructing an accurate static digital human model integrating the spine, which is vital for subsequent dynamic digital human research on AIS. First, we generate human point - cloud data by combining the 3D Gaussian method with the Skinned Multi - Person Linear (SMPL) model from the patient's multi - view images. Then, we fit a standard skeletal model to the generated human model. Next, we align the real spine model reconstructed from CT images with the standard skeletal model. We validated the resulting personalized spine model using X - ray data from six AIS patients, with Cobb angles (used to measure the severity of scoliosis) as evaluation metrics. The results indicate that the model's error was within 1 degree of the actual measurements. This study presents an important method for constructing digital humans.
CVMar 14, 2024
Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision DatasetNing Cheng, You Li, Jing Gao et al.
Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
CRMay 18, 2023
Towards Generalizable Data Protection With Transferable Unlearnable ExamplesBin Fang, Bo Li, Shuang Wu et al.
Artificial Intelligence (AI) is making a profound impact in almost every domain. One of the crucial factors contributing to this success has been the access to an abundance of high-quality data for constructing machine learning models. Lately, as the role of data in artificial intelligence has been significantly magnified, concerns have arisen regarding the secure utilization of data, particularly in the context of unauthorized data usage. To mitigate data exploitation, data unlearning have been introduced to render data unexploitable. However, current unlearnable examples lack the generalization required for wide applicability. In this paper, we present a novel, generalizable data protection method by generating transferable unlearnable examples. To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution. Through extensive experimentation, we substantiate the enhanced generalizable protection capabilities of our proposed method.
CRMay 18, 2023
Re-thinking Data Availablity Attacks Against Deep Neural NetworksBin Fang, Bo Li, Shuang Wu et al.
The unauthorized use of personal data for commercial purposes and the clandestine acquisition of private data for training machine learning models continue to raise concerns. In response to these issues, researchers have proposed availability attacks that aim to render data unexploitable. However, many current attack methods are rendered ineffective by adversarial training. In this paper, we re-examine the concept of unlearnable examples and discern that the existing robust error-minimizing noise presents an inaccurate optimization objective. Building on these observations, we introduce a novel optimization paradigm that yields improved protection results with reduced computational time requirements. We have conducted extensive experiments to substantiate the soundness of our approach. Moreover, our method establishes a robust foundation for future research in this area.
ROFeb 25, 2022
Hybrid Robotic Grasping with a Soft Multimodal Gripper and a Deep Multistage Learning SchemeFukang Liu, Fuchun Sun, Bin Fang et al.
Grasping has long been considered an important and practical task in robotic manipulation. Yet achieving robust and efficient grasps of diverse objects is challenging, since it involves gripper design, perception, control and learning, etc. Recent learning-based approaches have shown excellent performance in grasping a variety of novel objects. However, these methods either are typically limited to one single grasping mode, or else more end effectors are needed to grasp various objects. In addition, gripper design and learning methods are commonly developed separately, which may not adequately explore the ability of a multimodal gripper. In this paper, we present a deep reinforcement learning (DRL) framework to achieve multistage hybrid robotic grasping with a new soft multimodal gripper. A soft gripper with three grasping modes (i.e., enveloping, sucking, and enveloping_then_sucking) can both deal with objects of different shapes and grasp more than one object simultaneously. We propose a novel hybrid grasping method integrated with the multimodal gripper to optimize the number of grasping actions. We evaluate the DRL framework under different scenarios (i.e., with different ratios of objects of two grasp types). The proposed algorithm is shown to reduce the number of grasping actions (i.e., enlarge the grasping efficiency, with maximum values of 161% in simulations and 154% in real-world experiments) compared to single grasping modes.
ROAug 11, 2021
Elastic Tactile Simulation Towards Tactile-Visual PerceptionYikai Wang, Wenbing Huang, Bin Fang et al.
Tactile sensing plays an important role in robotic perception and manipulation tasks. To overcome the real-world limitations of data collection, simulating tactile response in a virtual environment comes as a desirable direction of robotic research. In this paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation. Most existing works model the tactile sensor as a rigid multi-body, which is incapable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction between the two objects. By contrast, EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact. With the tactile simulation by EIP, we further propose a tactile-visual perception network that enables information fusion between tactile data and visual images. The perception network is based on a global-to-local fusion mechanism where multi-scale tactile features are aggregated to the corresponding local region of the visual modality with the guidance of tactile positions and directions. The fusion method exhibits superiority regarding the 3D geometric reconstruction task.
RONov 23, 2020
Elastic Interaction of Particles for Robotic Tactile SimulationYikai Wang, Wenbing Huang, Bin Fang et al.
Tactile sensing plays an important role in robotic perception and manipulation. To overcome the real-world limitations of data collection, simulating tactile response in virtual environment comes as a desire direction of robotic research. Most existing works model the tactile sensor as a rigid multi-body, which is incapable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction between two objects. In this paper, we propose Elastic Interaction of Particles (EIP), a novel framework for tactile emulation. At its core, EIP models the tactile sensor as a group of coordinated particles, and the elastic theory is applied to regulate the deformation of particles during the contact process. The implementation of EIP is conducted from scratch, without resorting to any existing physics engine. Experiments to verify the effectiveness of our method have been carried out on two applications: robotic perception with tactile data and 3D geometric reconstruction by tactile-visual fusion. It is possible to open up a new vein for robotic tactile simulation, and contribute to various downstream robotic tasks.
ROMar 2, 2020
Fabric Defect Detection Using Vision-Based Tactile SensorBin Fang, Xingming Long, Yifan Zhang et al.
This paper introduces a new type of system for fabric defect detection with the tactile inspection system. Different from existed visual inspection systems, the proposed system implements a vision-based tactile sensor. The tactile sensor, which mainly consists of a camera, four LEDs, and an elastic sensing layer, captures detailed information about fabric surface structure and ignores the color and pattern. Thus, the ambiguity between a defect and image background related to fabric color and pattern is avoided. To utilize the tactile sensor for fabric inspection, we employ intensity adjustment for image preprocessing, Residual Network with ensemble learning for detecting defects, and uniformity measurement for selecting ideal dataset for model training. An experiment is conducted to verify the performance of the proposed tactile system. The experimental results have demonstrated the feasibility of the proposed system, which performs well in detecting structural defects for various types of fabrics. In addition, the system does not require external light sources, which skips the process of setting up and tuning a lighting environment.
LGNov 16, 2019
Reinforcement Learning from Imperfect Demonstrations under Soft Expert GuidanceMingxuan Jing, Xiaojian Ma, Wenbing Huang et al.
In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations. Most of existing RLfD methods require demonstrations to be perfect and sufficient, which yet is unrealistic to meet in practice. To work on imperfect demonstrations, we first define an imperfect expert setting for RLfD in a formal way, and then point out that previous methods suffer from two issues in terms of optimality and convergence, respectively. Upon the theoretical findings we have derived, we tackle these two issues by regarding the expert guidance as a soft constraint on regulating the policy exploration of the agent, which eventually leads to a constrained optimization problem. We further demonstrate that such problem is able to be addressed efficiently by performing a local linear search on its dual form. Considerable empirical evaluations on a comprehensive collection of benchmarks indicate our method attains consistent improvement over other RLfD counterparts.
SPApr 25, 2019
Attention-based Transfer Learning for Brain-computer InterfaceChuanqi Tan, Fuchun Sun, Tao Kong et al.
Different functional areas of the human brain play different roles in brain activity, which has not been paid sufficient research attention in the brain-computer interface (BCI) field. This paper presents a new approach for electroencephalography (EEG) classification that applies attention-based transfer learning. Our approach considers the importance of different brain functional areas to improve the accuracy of EEG classification, and provides an additional way to automatically identify brain functional areas associated with new activities without the involvement of a medical professional. We demonstrate empirically that our approach out-performs state-of-the-art approaches in the task of EEG classification, and the results of visualization indicate that our approach can detect brain functional areas related to a certain task.
ROSep 17, 2018
PointNetGPD: Detecting Grasp Configurations from Point SetsHongzhuo Liang, Xiaojian Ma, Shuang Li et al.
In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud. Compared to recent grasp evaluation metrics that are based on handcrafted depth features and a convolutional neural network (CNN), our proposed PointNetGPD is lightweight and can directly process the 3D point cloud that locates within the gripper for grasp evaluation. Taking the raw point cloud as input, our proposed grasp evaluation network can capture the complex geometric structure of the contact area between the gripper and the object even if the point cloud is very sparse. To further improve our proposed model, we generate a larger-scale grasp dataset with 350k real point cloud and grasps with the YCB object set for training. The performance of the proposed model is quantitatively measured both in simulation and on robotic hardware. Experiments on object grasping and clutter removal show that our proposed model generalizes well to novel objects and outperforms state-of-the-art methods. Code and video are available at \href{https://lianghongzhuo.github.io/PointNetGPD}{https://lianghongzhuo.github.io/PointNetGPD}