CV · Aug 27, 2024 · Code
Learning effective pruning at initialization from iterative pruning
Shengkai Liu, Yaofeng Cheng, Fusheng Zha et al.
Pruning at initialization (PaI) reduces training costs by removing weights before training, which becomes increasingly crucial as networks grow in size. However, current PaI methods still have a large accuracy gap with iterative pruning, especially at high sparsity levels. This raises an intriguing question: can we get inspiration from iterative pruning to improve PaI performance? In the lottery ticket hypothesis, iterative rewind pruning (IRP) finds subnetworks retroactively by rewinding the parameters to the original initialization in every pruning iteration, which means all the subnetworks are based on the initial state. Here, we hypothesise that the surviving subnetworks are more important, and we link their initial features to their survival scores to form the PaI criterion. We employ an end-to-end neural network (AutoSparse, or AutoS) to learn this correlation: it takes the model's initial features as input, outputs their scores, and the lowest-scoring parameters are then pruned before training. To validate the accuracy and generalization of our method, we performed PaI across various models. Results show that our approach outperforms existing methods in high-sparsity settings. Notably, as the underlying logic of model pruning is consistent across different models, IRP only needs to be run once on one model (e.g., after a single IRP run on ResNet-18/CIFAR-10, AutoS can be generalized to VGG-16/CIFAR-10, ResNet-18/TinyImageNet, etc.). As this is the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach. These results reveal the learning tendencies of neural networks and provide new insights into our understanding and research of PaI from a practical perspective. Our code is available at: https://github.com/ChengYaofeng/AutoSparse.git.
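The final step described above, pruning the lowest-scoring parameters, can be illustrated with a minimal sketch. The function name `prune_by_score`, the flat list of scores, and the example values are all hypothetical; in the paper the scores come from the learned AutoS network, which is not reproduced here.

```python
def prune_by_score(scores, sparsity):
    """Return a 0/1 mask that drops the lowest-scoring fraction of parameters.

    `scores` stands in for the per-parameter scores a learned scorer would
    output; here they are plain numbers supplied by the caller (assumption).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(scores) * sparsity)
    # Indices ordered from lowest to highest score; ties keep original order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


# Hypothetical scores for five parameters; prune 40% of them before training.
scores = [0.9, 0.1, 0.5, 0.3, 0.7]
mask = prune_by_score(scores, sparsity=0.4)
print(mask)
```

A mask of this kind would typically be multiplied element-wise with the initial weights so that pruned parameters stay at zero throughout training.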
LG · Apr 29, 2023
Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning
Mingyang Wang, Zhenshan Bing, Xiangtong Yao et al.
Meta-reinforcement learning enables artificial agents to learn from related training tasks and adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during evaluation, which restricts its applicability. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning to address this challenge. We extend meta-RL to broad non-parametric task distributions which have never been explored before, and also achieve state-of-the-art results in non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We utilize a Gaussian mixture model for task representation to imitate the parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react at the first sight of a task change, making it applicable to non-stationary tasks. MoSS also exhibits strong generalization robustness in out-of-distribution tasks, which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm and the entire network is trained completely off-policy to ensure high sample efficiency. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior works in terms of asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.
13.3 · AI · Apr 13
DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness
Javad M Alizadeh, Genhui Zheng, Chiu C Tan et al.
People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs), which are prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. This demonstration highlights the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations.
RO · Jan 4, 2024
Robot-Assisted Deep Venous Thrombosis Ultrasound Examination using Virtual Fixture
Dianye Huang, Chenguang Yang, Mingchuan Zhou et al.
Deep Venous Thrombosis (DVT) is a common vascular disease in which blood clots form inside deep veins, which may block blood flow or even cause a life-threatening pulmonary embolism. A typical ultrasound (US) exam for DVT involves pressing on the target vein until its lumen is fully compressed. However, the compression exam is highly operator-dependent. To alleviate intra- and inter-operator variations, we present a robotic US system with a novel hybrid force motion control scheme that ensures position and force tracking accuracy, and soft landing of the probe onto the target surface. In addition, a path-based virtual fixture is proposed to realize easy human-robot interaction for repeated compression operations at the lesion location. To ensure that the biometric measurements obtained in different examinations are comparable, the 6D scanning path is determined in a coarse-to-fine manner using both an external RGBD camera and US images. The RGBD camera is first used to extract a rough scanning path on the object. Then, the vascular lumen segmented from US images is used to optimize the scanning path to ensure the visibility of the target object. To generate a continuous scan path for developing virtual fixtures, an arc-length based path fitting model considering both position and orientation is proposed. Finally, the whole system is evaluated on a human-like arm phantom with an uneven surface.
RO · Apr 22, 2025
PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp
Yaofeng Cheng, Fusheng Zha, Wei Guo et al.
The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown significant potential in enabling robots to grasp target objects. However, most existing methods are based on point clouds (2.5D points) generated from single-view depth images. These point clouds capture only one side of the object's surface and thus provide incomplete geometry information, which misleads the grasping algorithm when it judges the shape of the target object, resulting in low grasping accuracy. Humans can accurately grasp objects from a single view by leveraging their geometry experience to estimate object shapes. Inspired by humans, we propose a novel 6-DoF grasping framework that converts point completion results into object shape features to train the 6-DoF grasp network. Here, point completion can generate approximately complete points from the 2.5D points, similar to the human geometry experience, and converting them into shape features is the way to utilize them to improve grasp efficiency. Furthermore, due to the gap between network generation and actual execution, we integrate a score filter into our framework to select more executable grasp proposals for the real robot. This enables our method to maintain a high grasp quality from any camera viewpoint. Extensive experiments demonstrate that utilizing complete point features enables the generation of significantly more accurate grasp proposals and that the inclusion of a score filter greatly enhances the credibility of real-world robot grasping. Our method achieves a success rate 17.8% higher than that of the state-of-the-art method in real-world experiments.
CV · Nov 22, 2024
Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization
Yuhang Song, Mario Gianni, Chenguang Yang et al.
This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences. Nevertheless, they encounter difficulties with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments to assess the effectiveness of the enriched embeddings on fine-grained vision negatives. Experiments on two common VLN benchmarks, R2R and REVERIE, demonstrate that these embeddings benefit navigation and can lead to a promising performance enhancement. Our source code and trained models are available at: https://anonymous.4open.science/r/FGVLN.
RO · Sep 27, 2025
Open-Vocabulary Spatio-Temporal Scene Graph for Robot Perception and Teleoperation Planning
Yi Wang, Zeyu Xue, Mujie Liu et al.
Teleoperation via natural-language reduces operator workload and enhances safety in high-risk or remote settings. However, in dynamic remote scenes, transmission latency during bidirectional communication creates gaps between remote perceived states and operator intent, leading to command misunderstanding and incorrect execution. To mitigate this, we introduce the Spatio-Temporal Open-Vocabulary Scene Graph (ST-OVSG), a representation that enriches open-vocabulary perception with temporal dynamics and lightweight latency annotations. ST-OVSG leverages LVLMs to construct open-vocabulary 3D object representations, and extends them into the temporal domain via Hungarian assignment with our temporal matching cost, yielding a unified spatio-temporal scene graph. A latency tag is embedded to enable LVLM planners to retrospectively query past scene states, thereby resolving local-remote state mismatches caused by transmission delays. To further reduce redundancy and highlight task-relevant cues, we propose a task-oriented subgraph filtering strategy that produces compact inputs for the planner. ST-OVSG generalizes to novel categories and enhances planning robustness against transmission latency without requiring fine-tuning. Experiments show that our method achieves 74 percent node accuracy on the Replica benchmark, outperforming ConceptGraph. Notably, in the latency-robustness experiment, the LVLM planner assisted by ST-OVSG achieved a planning success rate of 70.5 percent.
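The temporal association step mentioned above (Hungarian assignment over a matching cost) can be illustrated with a small sketch. The abstract does not specify the temporal matching cost, so the Euclidean distance between 3D object centroids used here is purely an assumption, and the function `match_objects` is hypothetical. To stay dependency-free, the sketch uses a brute-force search over permutations, which reaches the same minimum-cost assignment as the Hungarian algorithm but is only practical for a handful of objects.

```python
from itertools import permutations
from math import dist


def match_objects(prev_centroids, curr_centroids):
    """Return a list m where prev object i is matched to curr object m[i].

    Brute-force stand-in for Hungarian assignment; the cost is the summed
    Euclidean distance between centroids (an assumed placeholder cost).
    """
    n = len(prev_centroids)
    if n != len(curr_centroids):
        raise ValueError("this sketch assumes equally many objects per frame")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(n)):
        cost = sum(dist(prev_centroids[i], curr_centroids[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


# Hypothetical centroids of three objects in two consecutive frames.
prev = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
curr = [(0.1, 2.0, 0.0), (0.0, 0.1, 0.0), (1.1, 0.0, 0.0)]
matches = match_objects(prev, curr)
print(matches)
```

For larger scenes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice instead of enumerating permutations.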
CV · Aug 4, 2025
Rethinking Transparent Object Grasping: Depth Completion with Monocular Depth Estimation and Instance Mask
Yaofeng Cheng, Xinkai Gao, Sen Zhang et al.
Due to the optical properties, transparent objects often lead depth cameras to generate incomplete or invalid depth data, which in turn reduces the accuracy and reliability of robotic grasping. Existing approaches typically input the RGB-D image directly into the network to output the complete depth, expecting the model to implicitly infer the reliability of depth values. However, while effective in training datasets, such methods often fail to generalize to real-world scenarios, where complex light interactions lead to highly variable distributions of valid and invalid depth data. To address this, we propose ReMake, a novel depth completion framework guided by an instance mask and monocular depth estimation. By explicitly distinguishing transparent regions from non-transparent ones, the mask enables the model to concentrate on learning accurate depth estimation in these areas from RGB-D input during training. This targeted supervision reduces reliance on implicit reasoning and improves generalization to real-world scenarios. Additionally, monocular depth estimation provides depth context between the transparent object and its surroundings, enhancing depth prediction accuracy. Extensive experiments show that our method outperforms existing approaches on both benchmark datasets and real-world scenarios, demonstrating superior accuracy and generalization capability. Code and videos are available at https://chengyaofeng.github.io/ReMake.github.io/.
RO · May 30, 2023
Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data
Hongkuan Zhou, Zhenshan Bing, Xiangtong Yao et al.
Research on language-conditioned robot manipulation aims to develop robots that can interpret language commands and manipulate objects accordingly to execute complex tasks. While language-conditioned approaches demonstrate impressive capabilities for addressing tasks in familiar environments, they encounter limitations in adapting to unfamiliar environment settings. In this study, we propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data to enhance the algorithm's generalization in adapting to unfamiliar environments. We assess our model's performance in both simulated and real-world environments using a zero-shot setting. In the simulated environment, the proposed approach surpasses previously reported scores for the CALVIN benchmark, especially in the challenging Zero-Shot Multi-Environment setting. The average completed task length, indicating the average number of tasks the agent can continuously complete, improves more than 2.5 times compared to the state-of-the-art method HULC. In addition, we conduct a zero-shot evaluation of our policy in a real-world setting, following training exclusively in simulated environments without additional specific adaptations. In this evaluation, we set up ten tasks, and our approach achieved an average 30% improvement over the current state-of-the-art approach, demonstrating a high generalization capability in both simulated environments and the real world. For further details, including access to our code and videos, please refer to https://hk-zh.github.io/spil/
RO · Aug 9, 2021
Unknown Object Segmentation through Domain Adaptation
Yiting Chen, Chenguang Yang, Miao Li
The ability to segment unknown objects in cluttered scenes has a profound impact on robot grasping. The rise of deep learning has greatly transformed the pipeline of robotic grasping from a model-based approach to a data-driven one, which generally requires a large amount of grasping data collected either in simulation or from real-world examples. In this paper, we propose a sim-to-real framework to transfer the object segmentation model learned in simulation to the real world. First, data samples are collected in simulation, including RGB, 6D pose, and point cloud. Second, we present a GAN-based unknown object segmentation method through domain adaptation, which consists of an image translation module and an image segmentation module. The image translation module is used to narrow the reality gap and the segmentation module is responsible for the segmentation mask generation. We use this method to perform segmentation experiments on unknown objects in a bin-picking scenario. Finally, the experimental results show that the segmentation model learned in simulation can be used for real-world data segmentation.
RO · Jul 19, 2021
Learning compliant grasping and manipulation by teleoperation with adaptive force control
Chao Zeng, Shuang Li, Yiming Jiang et al.
In this work, we focus on improving the robot's dexterous capability by exploiting visual sensing and adaptive force control. TeachNet, a vision-based teleoperation learning framework, is exploited to map human hand postures to a multi-fingered robot hand. We augment TeachNet, which is originally based on an imprecise kinematic mapping and position-only servoing, with a biomimetic learning-based compliance control algorithm for dexterous manipulation tasks. This compliance controller takes the mapped robotic joint angles from TeachNet as the desired goal and computes the desired joint torques. It is derived from a computational model of the biomimetic control strategy in human motor learning, which allows adapting the control variables (impedance and feedforward force) online during the execution of the reference joint angle trajectories. The simultaneous adaptation of the impedance and feedforward profiles enables the robot to interact with the environment in a compliant manner. Our approach has been verified in multiple tasks in physics simulation, i.e., grasping, opening-a-door, turning-a-cap, and touching-a-mouse, and has shown more reliable performance than the existing position control and fixed-gain-based force control approaches.
ML · May 20, 2021
Ensemble machine learning approach for screening of coronary heart disease based on echocardiography and risk factors
Jingyi Zhang, Huolan Zhu, Yongkai Chen et al.
Background: Extensive clinical evidence suggests that a preventive screening of coronary heart disease (CHD) at an earlier stage can greatly reduce the mortality rate. We use 64 two-dimensional speckle tracking echocardiography (2D-STE) features and seven clinical features to predict whether one has CHD. Methods: We develop a machine learning approach that integrates a number of popular classification methods together by model stacking, and generalize the traditional stacking method to a two-step stacking method to improve the diagnostic performance. Results: By borrowing strengths from multiple classification models through the proposed method, we improve the CHD classification accuracy from around 70% to 87.7% on the testing set. The sensitivity of the proposed method is 0.903 and the specificity is 0.843, with an AUC of 0.904, which is significantly higher than those of the individual classification models. Conclusions: Our work lays a foundation for the deployment of speckle tracking echocardiography-based screening tools for coronary heart disease.
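For readers less familiar with the reported metrics, the sketch below shows how sensitivity and specificity are computed from binary predictions. The labels and predictions are made-up illustrative values, not data from the study, and the function name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (sensitivity, specificity) for binary labels, 1 = disease."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true cases that are detected
    specificity = tn / (tn + fp)  # share of healthy cases correctly cleared
    return sensitivity, specificity


# Illustrative labels only: four positive and five negative subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

In a screening setting, sensitivity is usually the metric watched most closely, since a missed case is costlier than a false alarm.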
CV · Sep 23, 2019
Retrieval-based Localization Based on Domain-invariant Feature Learning under Changing Environments
Hanjiang Hu, Hesheng Wang, Zhe Liu et al.
Visual localization is a crucial problem in mobile robotics and autonomous driving. One solution is to retrieve images with known pose from a database for the localization of query images. However, in environments with drastically varying conditions (e.g. illumination changes, seasons, occlusion, dynamic objects), retrieval-based localization is severely hampered and becomes a challenging problem. In this paper, a novel domain-invariant feature learning method (DIFL) is proposed based on ComboGAN, a multi-domain image translation network architecture. By introducing a feature consistency loss (FCL) between the encoded features of the original image and the translated image in another domain, we are able to train the encoders to generate domain-invariant features in a self-supervised manner. To retrieve a target image from the database, the query image is first encoded using the encoder belonging to the query domain to obtain a domain-invariant feature vector. We then perform retrieval by selecting the database image with the most similar domain-invariant feature vector. We validate the proposed approach on the CMU-Seasons dataset, where we outperform state-of-the-art learning-based descriptors in retrieval-based localization for high and medium precision scenarios.
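The retrieval step described above, selecting the database image whose feature vector is most similar to the query's, can be sketched as a nearest-neighbor lookup. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the feature vectors, image names, and function names are hypothetical placeholders for encoder outputs.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_feature, database):
    """Return the key of the database entry most similar to the query."""
    return max(database,
               key=lambda name: cosine_similarity(query_feature, database[name]))


# Hypothetical feature vectors standing in for encoder outputs.
database = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}
best = retrieve([0.9, 0.1, 0.0], database)
print(best)
```

The known pose attached to the retrieved image would then serve as the localization estimate for the query.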