Tin Lun Lam

CV
h-index: 34
31 papers
704 citations
Novelty: 50%
AI Score: 53

31 Papers

CV | Mar 28, 2023 | Code
Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks

Mingjian Liang, Junjie Hu, Chenyu Bao et al.

Recently, RGB-Thermal based perception has shown significant advances. Thermal information provides useful clues when visual cameras suffer from poor lighting conditions, such as low light and fog. However, how to effectively fuse RGB images and thermal data remains an open challenge. Previous works involve naive fusion strategies such as merging them at the input, concatenating multi-modality features inside models, or applying attention to each data modality. These fusion strategies are straightforward yet insufficient. In this paper, we propose a novel fusion method named Explicit Attention-Enhanced Fusion (EAEF) that fully takes advantage of each type of data. Specifically, we consider the following cases: i) both RGB data and thermal data, ii) only one of the types of data, and iii) none of them generate discriminative features. EAEF uses one branch to enhance feature extraction for i) and iii) and the other branch to remedy insufficient representations for ii). The outputs of the two branches are fused to form complementary features. As a result, the proposed fusion method outperforms the state of the art by 1.6% in mIoU on semantic segmentation, 3.1% in MAE on salient object detection, 2.3% in mAP on object detection, and 8.1% in MAE on crowd counting. The code is available at https://github.com/FreeformRobotics/EAEFNet.

CV | Aug 29, 2022 | Code
Progressive Self-Distillation for Ground-to-Aerial Perception Knowledge Transfer

Junjie Hu, Chenyou Fan, Mete Ozay et al.

We study a practical yet unexplored problem: how a drone can perceive an environment from different flight heights. Unlike autonomous driving, where perception is always conducted from a ground viewpoint, a flying drone may flexibly change its flight height for specific tasks, which requires viewpoint-invariant perception. Tackling such a problem with supervised learning would incur tremendous costs for annotating data at different flying heights. On the other hand, current semi-supervised learning methods are not effective under viewpoint differences. In this paper, we introduce ground-to-aerial perception knowledge transfer and propose a progressive semi-supervised learning framework that enables drone perception using only labeled data of the ground viewpoint and unlabeled data of flying viewpoints. Our framework has four core components: i) a dense viewpoint sampling strategy that splits the range of vertical flight height into a set of small pieces with evenly distributed intervals, ii) nearest neighbor pseudo-labeling that infers labels of the nearest neighbor viewpoint with a model learned on the preceding viewpoint, iii) MixView that generates augmented images among different viewpoints to alleviate viewpoint differences, and iv) a progressive distillation strategy to gradually learn until reaching the maximum flying height. We collect a synthesized and a real-world dataset, and we perform extensive experimental analyses to show that our method yields 22.2% and 16.9% accuracy improvement for the synthesized and real-world datasets, respectively. Code and datasets are available at https://github.com/FreeformRobotics/Progressive-Self-Distillation-for-Ground-to-Aerial-Perception-Knowledge-Transfer.
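As a rough illustration of components i) and iv) above, the following sketch splits a height range into evenly spaced viewpoints and pairs each viewpoint with its nearest higher neighbor. The function names, the 0 to 10 height range, and the use of metres are assumptions made for this example; they are not taken from the paper or its repository, and the training step itself is left as a placeholder.

```python
from typing import List, Tuple


def sample_viewpoints(h_min: float, h_max: float, num_intervals: int) -> List[float]:
    """Split [h_min, h_max] into num_intervals equal pieces and return the
    num_intervals + 1 boundary heights, ground viewpoint first."""
    if num_intervals < 1:
        raise ValueError("num_intervals must be at least 1")
    step = (h_max - h_min) / num_intervals
    return [h_min + i * step for i in range(num_intervals + 1)]


def progressive_schedule(heights: List[float]) -> List[Tuple[float, float]]:
    """Pair each viewpoint with its nearest higher neighbor: a model trained at
    the first height of a pair pseudo-labels images taken at the second."""
    return list(zip(heights[:-1], heights[1:]))


# Hypothetical height range (metres) chosen only for illustration.
heights = sample_viewpoints(0.0, 10.0, 5)
for source_h, target_h in progressive_schedule(heights):
    # Placeholder for: pseudo-label target_h images with the source_h model,
    # then distill a new model for target_h.
    print(f"teacher at {source_h:.1f} m -> student at {target_h:.1f} m")
```

Keeping the intervals small is what makes the nearest-neighbor step plausible: each teacher only has to label images from a viewpoint close to the one it was trained on.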

CV | May 11, 2022
Deep Depth Completion from Extremely Sparse Data: A Survey

Junjie Hu, Chenyu Bao, Mete Ozay et al.

Depth completion aims at predicting dense pixel-wise depth from an extremely sparse map captured from a depth sensor, e.g., LiDARs. It plays an essential role in various applications such as autonomous driving, 3D reconstruction, augmented reality, and robot navigation. Recent successes on the task have been demonstrated and dominated by deep learning based solutions. In this article, for the first time, we provide a comprehensive literature review that helps readers better grasp the research trends and clearly understand the current advances. We investigate the related studies from the design aspects of network architectures, loss functions, benchmark datasets, and learning strategies with a proposal of a novel taxonomy that categorizes existing methods. Besides, we present a quantitative comparison of model performance on three widely used benchmarks, including indoor and outdoor datasets. Finally, we discuss the challenges of prior works and provide readers with some insights for future research directions.

CV | Dec 31, 2022
Attentional Graph Convolutional Network for Structure-aware Audio-Visual Scene Classification

Liguang Zhou, Yuhongze Zhou, Xiaonan Qi et al.

Audio-visual scene understanding is a challenging problem due to the unstructured spatial-temporal relations that exist in the audio signals and the spatial layouts of different objects and various texture patterns in the visual images. Recently, many studies have focused on abstracting features from convolutional neural networks, while the learning of explicit, semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the spectrogram of the sound and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information of the input features, we utilize an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to better represent the salient regions and contextual information of audio-visual inputs, the salient acoustic graph (SAG), contextual acoustic graph (CAG), salient visual graph (SVG), and contextual visual graph (CVG) are constructed for the audio-visual scene representation. Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experimental results on audio, visual, and audio-visual scene recognition datasets show that the AGCN methods achieve promising results. Visualizations of the graphs on the spectrograms and images are presented to show that the proposed CAG/SAG and CVG/SVG focus on the salient and semantically relevant regions.

CV | Aug 26, 2022
Dense Depth Distillation with Out-of-Distribution Simulated Images

Junjie Hu, Chenyou Fan, Mete Ozay et al.

We study data-free knowledge distillation (KD) for monocular depth estimation (MDE), which learns a lightweight model for real-world depth perception tasks by compressing it from a trained teacher model while lacking training data in the target domain. Owing to the essential difference between image classification and dense regression, previous methods of data-free KD are not applicable to MDE. To strengthen its applicability in real-world tasks, in this paper, we propose to apply KD with out-of-distribution simulated images. The major challenges to be resolved are i) lacking prior information about scene configurations of real-world training data and ii) domain shift between simulated and real-world images. To cope with these difficulties, we propose a tailored framework for depth distillation. The framework generates new training samples for embracing a multitude of possible object arrangements in the target domain and utilizes a transformation network to efficiently adapt them to the feature statistics preserved in the teacher model. Through extensive experiments on various depth estimation models and two different datasets, we show that our method outperforms the baseline KD by a good margin and even achieves slightly better performance with as few as 1/6 of training images, demonstrating a clear superiority.

CV | Mar 9, 2023
Lifelong-MonoDepth: Lifelong Learning for Multi-Domain Monocular Metric Depth Estimation

Junjie Hu, Chenyou Fan, Liguang Zhou et al.

With the rapid advancements in autonomous driving and robot navigation, there is a growing demand for lifelong learning models capable of estimating metric (absolute) depth. Lifelong learning approaches potentially offer significant cost savings in terms of model training, data storage, and collection. However, the quality of RGB images and depth maps is sensor-dependent, and depth maps in the real world exhibit domain-specific characteristics, leading to variations in depth ranges. These challenges limit existing methods to lifelong learning scenarios with small domain gaps and relative depth map estimation. To facilitate lifelong metric depth learning, we identify three crucial technical challenges that require attention: i) developing a model capable of addressing the depth scale variation through scale-aware depth learning, ii) devising an effective learning strategy to handle significant domain gaps, and iii) creating an automated solution for domain-aware depth inference in practical applications. Based on the aforementioned considerations, in this paper, we present i) a lightweight multi-head framework that effectively tackles the depth scale imbalance, ii) an uncertainty-aware lifelong learning solution that adeptly handles significant domain gaps, and iii) an online domain-specific predictor selection method for real-time inference. Through extensive numerical studies, we show that the proposed method can achieve good efficiency, stability, and plasticity, leading the benchmarks by 8% to 15%.

RO | Aug 5, 2022
Learning to Coordinate for a Worker-Station Multi-robot System in Planar Coverage Tasks

Jingtao Tang, Yuan Gao, Tin Lun Lam

For massive large-scale tasks, a multi-robot system (MRS) can effectively improve efficiency by utilizing each robot's different capabilities, mobility, and functionality. In this paper, we focus on the multi-robot coverage path planning (mCPP) problem in large-scale planar areas with random dynamic interferers in the environment, where the robots have limited resources. We introduce a worker-station MRS consisting of multiple workers with limited resources for actual work, and one station with enough resources for resource replenishment. We aim to solve the mCPP problem for the worker-station MRS by formulating it as a fully cooperative multi-agent reinforcement learning problem. We then propose an end-to-end decentralized online planning method, which simultaneously solves coverage planning for the workers and rendezvous planning for the station. Our method reduces the influence of random dynamic interferers on planning, while the robots avoid collisions with them. We conduct simulation and real robot experiments, and the comparison results show that our method has competitive performance in solving the mCPP problem for the worker-station MRS in terms of task finish time.

CV | Dec 31, 2022
Peer Learning for Unbiased Scene Graph Generation

Liguang Zhou, Junjie Hu, Yuhongze Zhou et al.

Unbiased scene graph generation (USGG) is a challenging task that requires predicting diverse and heavily imbalanced predicates between objects in an image. To address this, we propose a novel peer learning framework that uses predicate sampling and consensus voting (PSCV) to encourage multiple peers to learn from each other. Predicate sampling divides the predicate classes into sub-distributions based on frequency, and assigns different peers to handle each sub-distribution or combinations of them. Consensus voting ensembles the peers' complementary predicate knowledge by emphasizing majority opinion and diminishing minority opinion. Experiments on Visual Genome show that PSCV outperforms previous methods and achieves a new state of the art on the SGCls task with 31.6 mean.
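The two ideas named in this abstract can be sketched in a few lines. The code below is a simplified illustration, not the PSCV implementation: it assumes predicate sampling can be approximated by cutting a frequency ranking into equal-sized groups, and it replaces the paper's weighting of majority and minority opinions with a plain majority vote. The predicate counts are made-up numbers.

```python
from collections import Counter
from typing import Dict, List


def split_by_frequency(counts: Dict[str, int], num_groups: int) -> List[List[str]]:
    """Sort predicate classes from most to least frequent and cut the ranking
    into contiguous groups (head first, tail last)."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    size = -(-len(ranked) // num_groups)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def consensus_vote(peer_predictions: List[str]) -> str:
    """Return the predicate predicted by the most peers; ties go to the
    prediction that appears first in the list."""
    if not peer_predictions:
        raise ValueError("peer_predictions must not be empty")
    tally = Counter(peer_predictions)
    best = max(tally.values())
    for prediction in peer_predictions:
        if tally[prediction] == best:
            return prediction
    raise AssertionError("unreachable: the tally always contains the maximum")


# Hypothetical predicate frequencies, for illustration only.
counts = {"on": 900, "has": 500, "near": 120,
          "riding": 30, "parked on": 8, "flying in": 2}
groups = split_by_frequency(counts, 3)
print(groups)
print(consensus_vote(["riding", "on", "riding"]))
```

In this toy setup each group would be handed to a different peer, so that rare predicates such as "flying in" are not drowned out by frequent ones such as "on" during training.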

CV | Aug 15, 2022
Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation

Liguang Zhou, Yuhongze Zhou, Tin Lun Lam et al.

Scene graph generation (SGG) has made tremendous progress in recent years. However, its underlying long-tailed distribution of predicate classes is a challenging problem. For extremely unbalanced predicate distributions, existing approaches usually construct complicated context encoders to extract the intrinsic relevance of scene context to predicates and complex networks to improve the learning ability of network models for highly imbalanced predicate distributions. To address the unbiased SGG problem, we introduce a simple yet effective method dubbed Context-Aware Mixture-of-Experts (CAME) to improve model diversity and mitigate biased SGG without complicated design. Specifically, we propose to integrate the mixture of experts with a divide and ensemble strategy to remedy the severely long-tailed distribution of predicate classes, which is applicable to the majority of unbiased scene graph generators. The biased SGG is thereby reduced, and the model tends to produce more evenly distributed predicate predictions. Experts with the same weights are not sufficiently diverse to differentiate between various predicate distribution levels. To enable the network to dynamically exploit the rich scene context and further boost the diversity of the model, we simply use the built-in module to create a context encoder. The importance of each expert to the scene context and of each predicate to each expert is dynamically associated through expert weighting (EW) and predicate weighting (PW) strategies. We have conducted extensive experiments on three tasks using the Visual Genome dataset, showing that CAME outperforms recent methods and achieves state-of-the-art performance. Our code will be made publicly available.

RO | Mar 19
RhoMorph: Rhombus-shaped Deformable Modular Robots for Stable, Medium-Independent Reconfiguration Motion

Jie Gu, Yirui Sun, Zhihao Xia et al.

In this paper, we present RhoMorph, a novel deformable planar lattice modular self-reconfigurable robot (MSRR) with a rhombus shaped module. Each module consists of a parallelogram skeleton with a single centrally mounted actuator that enables folding and unfolding along its diagonal. The core design philosophy is to achieve essential MSRR functionalities such as morphing, docking, and locomotion with minimal control complexity. This enables a continuous and stable reconfiguration process that is independent of the surrounding medium, allowing the system to reliably form various configurations in diverse environments. To leverage the unique kinematics of RhoMorph, we introduce morphpivoting, a novel motion primitive for reconfiguration that differs from advanced MSRR systems, and propose a strategy for its continuous execution. Finally, a series of physical experiments validate the module's stable reconfiguration ability, as well as its positional and docking accuracy.

CV | Jul 7, 2025 | Code
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Yun Wang, Longguang Wang, Chenghao Zhang et al.

Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.

CV | Sep 21, 2023
Class Relevance Learning For Out-of-distribution Detection

Butian Xiong, Liguang Zhou, Tin Lun Lam et al.

Image classification plays a pivotal role across diverse applications, yet challenges persist when models are deployed in real-world scenarios. Notably, these models falter in detecting unfamiliar classes that were not incorporated during classifier training, a formidable hurdle for safe and effective real-world model deployment, commonly known as out-of-distribution (OOD) detection. While existing techniques, like max logits, aim to leverage logits for OOD identification, they often disregard the intricate interclass relationships that underlie effective detection. This paper presents an innovative class relevance learning method tailored for OOD detection. Our method establishes a comprehensive class relevance learning framework, strategically harnessing interclass relationships within the OOD pipeline. This framework significantly augments OOD detection capabilities. Extensive experimentation on diverse datasets, encompassing generic image classification datasets (Near OOD and Far OOD datasets), demonstrates the superiority of our method over state-of-the-art alternatives for OOD detection.

CV | Oct 23, 2025 | Code
PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

Yun Wang, Junjie Hu, Qiaole Dong et al.

Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic Stereo matching, dubbed PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3% and 9.02% improvements over BiDAStereo) with fewer computational costs. Code is available at https://github.com/cocowy1/PPMStereo.
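A toy version of the two-stage idea, written under strong assumptions, may help fix the terminology. Here each frame is reduced to a small feature vector, 'pick' keeps the k frames with the highest dot-product similarity to the current query, and 'play' blends them with softmax weights. The abstract does not specify the relevance measure or the weighting scheme, so both choices (and all names below) are stand-ins rather than the PPMStereo design.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pick(query: Vector, memory: List[Vector], k: int) -> List[int]:
    """'Pick': return the indices of the k stored frames most similar to query."""
    scores = [dot(query, frame) for frame in memory]
    ranked = sorted(range(len(memory)), key=lambda i: -scores[i])
    return ranked[:k]


def play(query: Vector, memory: List[Vector], picked: List[int]) -> List[float]:
    """'Play': softmax-weight the picked frames and return their weighted sum."""
    scores = [dot(query, memory[i]) for i in picked]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * memory[i][d] for w, i in zip(weights, picked))
            for d in range(len(query))]


# Hypothetical 2-D frame features, for illustration only.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
picked = pick(query, memory, k=2)
print(picked, play(query, memory, picked))
```

Because only k frames survive the pick stage, the cost of the aggregation step depends on k rather than on the length of the video, which is the efficiency argument the abstract makes for a compact memory buffer.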

CV | Aug 1, 2021 | Code
Object-to-Scene: Learning to Transfer Object Knowledge to Indoor Scene Recognition

Bo Miao, Liguang Zhou, Ajmal Mian et al.

Accurate perception of the surrounding scene helps robots make reasonable judgments and behave appropriately. Therefore, developing effective scene representation and recognition methods is of significant importance in robotics. Currently, a large body of research focuses on developing novel auxiliary features and networks to improve indoor scene recognition ability. However, few of them focus on directly constructing object features and relations for indoor scene recognition. In this paper, we analyze the weaknesses of current methods and propose an Object-to-Scene (OTS) method, which extracts object features and learns object relations to recognize indoor scenes. The proposed OTS first extracts object features based on the segmentation network and the proposed object feature aggregation module (OFAM). Afterwards, the object relations are calculated and the scene representation is constructed based on the proposed object attention module (OAM) and global relation aggregation module (GRAM). The final results in this work show that OTS successfully extracts object features and learns object relations from the segmentation network. Moreover, OTS outperforms the state-of-the-art methods by more than 2% on indoor scene recognition without using any additional streams. Code is publicly available at: https://github.com/FreeformRobotics/OTS.

CV | Aug 1, 2021 | Code
BORM: Bayesian Object Relation Model for Indoor Scene Recognition

Liguang Zhou, Jun Cen, Xingchao Wang et al.

Scene recognition is a fundamental task in robotic perception. For human beings, scene recognition is reasonable because they have abundant object knowledge of the real world. The idea of transferring prior object knowledge from humans to scene recognition is significant but still underexploited. In this paper, we propose to utilize meaningful object representations for indoor scene representation. First, we utilize an improved object model (IOM) as a baseline that enriches the object knowledge by introducing a scene parsing algorithm pretrained on the ADE20K dataset with rich object categories related to the indoor scene. To analyze the object co-occurrences and pairwise object relations, we formulate the IOM from a Bayesian perspective as the Bayesian object relation model (BORM). Meanwhile, we combine the proposed BORM with the PlacesCNN model as the combined Bayesian object relation model (CBORM) for scene recognition, which significantly outperforms the state-of-the-art methods on the reduced Places365 dataset and the SUN RGB-D dataset without retraining, showing the excellent generalization ability of the proposed method. Code can be found at https://github.com/hszhoushen/borm.

CV | Nov 30, 2020 | Code
Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training

Shuai Zhao, Liguang Zhou, Wenxiao Wang et al.

The width of a neural network matters since increasing the width will necessarily increase the model capacity. However, the performance of a network does not improve linearly with the width and soon gets saturated. In this case, we argue that increasing the number of networks (ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width. To prove it, one large network is divided into several small ones regarding its parameters and regularization components. Each of these small networks has a fraction of the original one's parameters. We then train these small networks together and make them see various views of the same data to increase their diversity. During this co-training process, networks can also learn from each other. As a result, small networks can achieve better ensemble performance than the large one with few or no extra parameters or FLOPs, i.e., achieving better accuracy-efficiency trade-offs. Small networks can also achieve faster inference speed than the large one by concurrent running. All of the above shows that the number of networks is a new dimension of model scaling. We validate our argument with 8 different neural architectures on common benchmarks through extensive experiments. The code is available at https://github.com/FreeformRobotics/Divide-and-Co-training.
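The claim that several small networks can fit in the parameter budget of one large network follows from simple arithmetic. The sketch below assumes a plain convolutional layer, whose weight count is kernel x kernel x input channels x output channels, so parameters grow with the square of the width. Under that assumption, shrinking the width by a factor of sqrt(S) leaves room for S networks. The helper names are hypothetical, and the paper's exact division rule may differ from this back-of-the-envelope version.

```python
import math


def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weight count of one convolutional layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


def divided_width(width: int, num_networks: int) -> int:
    """Width of each small network so that num_networks of them roughly match
    the parameter count of one network of the given width."""
    return int(round(width / math.sqrt(num_networks)))


wide = conv_params(256, 256)
small_width = divided_width(256, 4)
small = conv_params(small_width, small_width)
print(f"one wide layer:    {wide} weights")
print(f"four small layers: {4 * small} weights at width {small_width}")
```

With a width of 256 and four networks, each small layer has width 128 and one quarter of the weights, so the ensemble matches the original budget exactly in this idealized case.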

RO | Nov 4, 2024
Real-Time Polygonal Semantic Mapping for Humanoid Robot Stair Climbing

Teng Bin, Jianming Yao, Tin Lun Lam et al.

We present a novel algorithm for real-time planar semantic mapping tailored for humanoid robots navigating complex terrains such as staircases. Our method is adaptable to any odometry input and leverages GPU-accelerated processes for planar extraction, enabling the rapid generation of globally consistent semantic maps. We utilize an anisotropic diffusion filter on depth images to effectively minimize noise from gradient jumps while preserving essential edge details, enhancing normal vector images' accuracy and smoothness. Both the anisotropic diffusion and the RANSAC-based plane extraction processes are optimized for parallel processing on GPUs, significantly enhancing computational efficiency. Our approach achieves real-time performance, processing single frames at rates exceeding 30 Hz, which facilitates detailed plane extraction and map management swiftly and efficiently. Extensive testing underscores the algorithm's capabilities in real-time scenarios and demonstrates its practical application in humanoid robot gait planning, significantly improving its ability to navigate dynamic environments.
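To show what an edge-preserving diffusion filter does to a depth image, here is a minimal CPU sketch of the classic Perona-Malik scheme. It is offered as an assumption about the general family of filters the abstract refers to: the paper's actual filter, parameters, and GPU implementation are not described here, and the values of `kappa` and `step` below are illustrative.

```python
import math
from typing import List

Image = List[List[float]]


def anisotropic_diffusion(depth: Image, iterations: int = 10,
                          kappa: float = 0.1, step: float = 0.2) -> Image:
    """Perona-Malik diffusion with 4-neighbor differences.

    Neighbor differences much smaller than kappa are smoothed away; differences
    much larger than kappa (depth discontinuities) conduct almost nothing and
    are therefore preserved. step <= 0.25 keeps the explicit update stable.
    """
    rows, cols = len(depth), len(depth[0])
    current = [row[:] for row in depth]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                flow = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        diff = current[nr][nc] - current[r][c]
                        # Edge-stopping conductance g(d) = exp(-(d / kappa)^2).
                        flow += math.exp(-(diff / kappa) ** 2) * diff
                updated[r][c] = current[r][c] + step * flow
        current = updated
    return current


# A noisy step edge: two flat surfaces at about 1 and 3 depth units.
noisy = [[1.0, 1.02, 0.98, 1.0, 3.0, 3.02, 2.98, 3.0]]
smoothed = anisotropic_diffusion(noisy)
print([round(v, 3) for v in smoothed[0]])
```

Every pixel update depends only on the previous iteration's values at the pixel and its four neighbors, which is why this kind of filter maps naturally onto one GPU thread per pixel.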

AI | Sep 29, 2025
ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration

Shaobin Ling, Yun Wang, Chenyou Fan et al.

Large Language Models (LLMs) enable intelligent multi-robot collaboration but face fundamental trade-offs: declarative methods lack adaptability in dynamic environments, while iterative methods incur prohibitive computational costs that scale poorly with team size and task complexity. In this paper, we propose ELHPlan, a novel framework that introduces Action Chains--sequences of actions explicitly bound to sub-goal intentions--as the fundamental planning primitive. ELHPlan operates via a cyclical process: 1) constructing intention-bound action sequences, 2) proactively validating for conflicts and feasibility, 3) refining issues through targeted mechanisms, and 4) executing validated actions. This design balances adaptability and efficiency by providing sufficient planning horizons while avoiding expensive full re-planning. We further propose comprehensive efficiency metrics, including token consumption and planning time, to more holistically evaluate multi-agent collaboration. Our experiments on benchmark TDW-MAT and C-WAH demonstrate that ELHPlan achieves comparable task success rates while consuming only 24% of the tokens required by state-of-the-art methods. Our research establishes a new efficiency-effectiveness frontier for LLM-based multi-agent planning systems.

CV | Oct 18, 2021
Abnormal Occupancy Grid Map Recognition using Attention Network

Fuqin Deng, Hua Feng, Mingjian Liang et al.

The occupancy grid map is a critical component of autonomous positioning and navigation in mobile robotic systems, as the performance of many other systems depends heavily on it. To guarantee the quality of occupancy grid maps, researchers previously had to perform tedious and time-consuming manual recognition. This work focuses on automatic abnormal occupancy grid map recognition using residual neural networks and a novel attention mechanism module. We propose an effective channel and spatial Residual SE (csRSE) attention module, which contains a residual block for producing hierarchical features, followed by both a channel SE (cSE) block and a spatial SE (sSE) block for sufficient information extraction along the channel and spatial pathways. To further summarize the occupancy grid map characteristics and experiment with our csRSE attention modules, we constructed a dataset called the occupancy grid map dataset (OGMD). On the OGMD test set, we tested a few variants of our proposed structure and compared them with other attention mechanisms. Our experimental results show that the proposed attention network can infer the abnormal map with state-of-the-art (SOTA) accuracy of 96.23% for abnormal occupancy grid map recognition.

CV | Oct 18, 2021
FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation

Fuqin Deng, Hua Feng, Mingjian Liang et al.

RGB-Thermal (RGB-T) information for semantic segmentation has been extensively explored in recent years. However, most existing RGB-T semantic segmentation methods compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. To better extract detailed spatial information, we propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task. Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views. Benefiting from the proposed FEAM module, our FEANet can preserve the spatial information and shift more attention to high-resolution features from the fused RGB-T images. Extensive experiments on the urban scene dataset demonstrate that our FEANet outperforms other state-of-the-art (SOTA) RGB-T methods in terms of objective metrics and subjective visual comparison (+2.6% in global mAcc and +0.8% in global mIoU). For the 480 x 640 RGB-T test images, our FEANet can run at real-time speed on an NVIDIA GeForce RTX 2080 Ti card.

RO | Oct 2, 2021
AB-Mapper: Attention and BicNet Based Multi-agent Path Finding for Dynamic Crowded Environment

Huifeng Guan, Yuan Gao, Min Zhao et al.

Multi-agent path finding in dynamic crowded environments is of great academic and practical value for multi-robot systems in the real world. To improve the effectiveness and efficiency of the communication and learning process during path planning in dynamic crowded environments, we introduce an algorithm called Attention and BicNet based Multi-agent path planning with effective reinforcement (AB-Mapper) under the actor-critic reinforcement learning framework. In this framework, on the one hand, we utilize the BicNet with communication function in the actor-network to achieve intra-team coordination. On the other hand, we propose a centralized critic network that can selectively allocate attention weights to surrounding agents. This attention mechanism allows an individual agent to automatically learn a better evaluation of actions by also considering the behaviours of its surrounding agents. Compared with the state-of-the-art method Mapper, our AB-Mapper is more effective (85.86% vs. 81.56% in terms of success rate) in solving general path finding problems with dynamic obstacles. In addition, in crowded scenarios, our method outperforms the Mapper method by a large margin, reaching a gap of more than 40% in each experiment.

RO | Sep 28, 2021
Meta Reinforcement Learning Based Sensor Scanning in 3D Uncertain Environments for Heterogeneous Multi-Robot Systems

Junfeng Chen, Yuan Gao, Junjie Hu et al.

We study a novel problem that tackles learning-based sensor scanning in 3D and uncertain environments with heterogeneous multi-robot systems. Our motivation is two-fold. First, 3D environments are complex, and the use of heterogeneous multi-robot systems can intuitively facilitate sensor scanning by fully taking advantage of sensors with different capabilities. Second, in uncertain environments (e.g. rescue), time is of great significance. Since the learning process normally takes time to train and adapt to a new environment, we need to find an effective way to explore and adapt quickly. To this end, in this paper, we present a meta-learning approach to improve the exploration and adaptation capabilities. The experimental results demonstrate that our method can outperform other methods by approximately 15%-27% on success rate and 70%-75% on adaptation speed.

IV | Sep 10, 2021
View Blind-spot as Inpainting: Self-Supervised Denoising with Mask Guided Residual Convolution

Yuhongze Zhou, Liguang Zhou, Tin Lun Lam et al.

In recent years, self-supervised denoising methods have shown impressive performance. They circumvent the painstaking collection of noisy-clean image pairs required by supervised denoising methods and boost denoising applicability in the real world. One of the well-known self-supervised denoising strategies is the blind-spot training scheme. However, few works attempt to improve blind-spot based self-denoisers in terms of network architecture. In this paper, we take an intuitive view of the blind-spot strategy and consider its process of using neighbor pixels to predict manipulated pixels as an inpainting process. Therefore, we propose a novel Mask Guided Residual Convolution (MGRConv) for common convolutional neural networks, e.g. U-Net, to promote blind-spot based denoising. Our MGRConv can be regarded as a soft partial convolution and finds a trade-off among partial convolution, learnable attention maps, and gated convolution. It enables dynamic mask learning with an appropriate mask constraint. Different from partial convolution and gated convolution, it provides moderate freedom for network learning. It also avoids leveraging external learnable parameters for mask activation, unlike learnable attention maps. The experiments show that our proposed plug-and-play MGRConv can help blind-spot based denoising networks reach promising results on both existing single-image based and dataset-based methods.

LGSep 8, 2021
Learn2Agree: Fitting with Multiple Annotators without Objective Ground Truth

Chongyang Wang, Yuan Gao, Chenyou Fan et al.

The annotation of domain experts is important for some medical applications where the objective ground truth is ambiguous to define, e.g., the rehabilitation for some chronic diseases, and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper use of the annotations may hinder the development of reliable models. On one hand, forcing the use of a single ground truth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. To address these issues, we propose a novel Learning to Agreement (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams, with one stream fitting with the multiple annotators and the other stream learning agreement information between annotators. In particular, the agreement learning stream provides regularization information to the classifier stream, tuning its decision to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones, and experiments on two medical datasets show better agreement levels with annotators.

ROAug 3, 2021
AcousticFusion: Fusing Sound Source Localization to Visual SLAM in Dynamic Environments

Tianwei Zhang, Huayan Zhang, Xiaofei Li et al.

Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply some learning-based object detectors to remove these dynamic objects. However, these object detectors are computationally too expensive for mobile robot on-board processing. In practical applications, these objects output noisy sounds that can be effectively detected by on-board sound source localization. The directional information of the sound source object can be efficiently obtained by direction of sound arrival (DoA) estimation, but depth estimation is difficult. Therefore, in this paper, we propose a novel audio-visual fusion approach that fuses sound source direction into the RGB-D image and thus removes the effect of dynamic obstacles on the multi-robot SLAM system. Experimental results of multi-robot SLAM in different dynamic environments show that the proposed method uses very small computational resources to obtain very stable self-localization results.

CVAug 2, 2021
PoseFusion2: Simultaneous Background Reconstruction and Human Shape Recovery in Real-time

Huayan Zhang, Tianwei Zhang, Tin Lun Lam et al.

Dynamic environments that include unstructured moving objects pose a hard problem for Simultaneous Localization and Mapping (SLAM) performance. The motion of rigid objects can typically be tracked by exploiting their texture and geometric features. However, humans moving in the scene are often one of the most important, interactive targets - they are very hard to track and reconstruct robustly due to non-rigid shapes. In this work, we present a fast, learning-based human object detector to isolate the dynamic human objects and realise a real-time dense background reconstruction framework. We go further by estimating and reconstructing the human pose and shape. The final output environment maps not only provide the dense static backgrounds but also contain the dynamic human meshes and their trajectories. Our Dynamic SLAM system runs at around 26 frames per second (fps) on GPUs, and at up to 10 fps when accurate human pose estimation is additionally enabled.

CVMay 13, 2021
Boosting Light-Weight Depth Estimation Via Knowledge Distillation

Junjie Hu, Chenyou Fan, Hualie Jiang et al.

Monocular depth estimation (MDE) methods are often either too computationally expensive or not accurate enough due to the trade-off between model complexity and inference performance. In this paper, we propose a lightweight network that can accurately estimate depth maps using minimal computing resources. We achieve this by designing a compact model architecture that maximally reduces model complexity. To improve the performance of our lightweight network, we adopt knowledge distillation (KD) techniques. We consider a large network as an expert teacher that accurately estimates depth maps on the target domain. The student, which is the lightweight network, is then trained to mimic the teacher's predictions. However, this KD process can be challenging and insufficient due to the large model capacity gap between the teacher and the student. To address this, we propose to use auxiliary unlabeled data to guide KD, enabling the student to better learn from the teacher's predictions. This approach helps fill the gap between the teacher and the student, resulting in improved data-driven learning. Our extensive experiments show that our method achieves comparable performance to state-of-the-art methods while using only 1% of their parameters. Furthermore, our method outperforms previous lightweight methods regarding inference accuracy, computational efficiency, and generalizability.
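
The distillation setup described in this abstract (a student mimicking a teacher, with auxiliary unlabeled data guiding the process) can be sketched roughly as follows. This is only an illustration under stated assumptions: the L1 distance, the weighting factor `alpha`, and the names `l1` and `distillation_loss` are hypothetical and are not taken from the paper.

```python
def l1(pred, target):
    """Mean absolute difference between two equally long depth vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def distillation_loss(student, teacher, labeled, unlabeled, alpha=0.5):
    """Combine a supervised term with a teacher-mimicking term.

    student, teacher: callables mapping an image to a list of depth values.
    labeled:   list of (image, ground_truth_depth) pairs.
    unlabeled: list of images without ground truth (the auxiliary data).
    alpha:     assumed weight of the supervised term.
    """
    # Supervised term: the student is compared with ground truth on labeled data.
    supervised = sum(l1(student(x), y) for x, y in labeled) / len(labeled)

    # Distillation term: the student is compared with the teacher on every
    # image, including the unlabeled ones, which need no annotation.
    images = [x for x, _ in labeled] + list(unlabeled)
    mimic = sum(l1(student(x), teacher(x)) for x in images) / len(images)

    return alpha * supervised + (1 - alpha) * mimic
```

The point of the sketch is that the second term requires only teacher predictions, so any number of unlabeled images can contribute to the student's training signal.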

CVMar 31, 2021
Semantic-guided Automatic Natural Image Matting with Trimap Generation Network and Light-weight Non-local Attention

Yuhongze Zhou, Liguang Zhou, Tin Lun Lam et al.

Natural image matting aims to precisely separate foreground objects from the background using an alpha matte. Fully automatic natural image matting without external annotation is challenging. Well-performing matting methods usually require an accurate, labor-intensive handcrafted trimap as extra input, while the performance of automatic trimap generation by dilating a foreground segmentation fluctuates with segmentation quality. Therefore, we argue that how to handle the trade-off of additional information input is a major issue in automatic matting. This paper presents a semantic-guided automatic natural image matting pipeline with a Trimap Generation Network and light-weight non-local attention, which does not need a trimap or background as input. Specifically, guided by foreground segmentation, the Trimap Generation Network estimates an accurate trimap. Then, with the estimated trimap as guidance, our light-weight Non-local Matting Network with Refinement produces the final alpha matte. Its trimap-guided global aggregation attention block is equipped with stride downsampling convolution, reducing computational complexity and improving performance. Experimental results show that our matting algorithm is competitive with state-of-the-art methods in both trimap-free and trimap-needed settings.

CVOct 19, 2020
A Two-stage Unsupervised Approach for Low light Image Enhancement

Junjie Hu, Xiyue Guo, Junfeng Chen et al.

As vision-based perception methods are usually built on the normal light assumption, there will be a serious safety issue when deploying them in low light environments. Recently, deep learning based methods have been proposed to enhance low light images by penalizing the pixel-wise loss of low light and normal light images. However, most of them suffer from the following problems: 1) the need for pairs of low light and normal light images for training, 2) poor performance on dark images, 3) the amplification of noise. To alleviate these problems, in this paper, we propose a two-stage unsupervised method that decomposes low light image enhancement into a pre-enhancement and a post-refinement problem. In the first stage, we pre-enhance a low light image with a conventional Retinex based method. In the second stage, we use a refinement network learned with adversarial training for further improvement of the image quality. The experimental results show that our method outperforms previous methods on four benchmark datasets. In addition, we show that our method can significantly improve feature point matching and simultaneous localization and mapping in low light conditions.

ROOct 19, 2020
Semantic Histogram Based Graph Matching for Real-Time Multi-Robot Global Localization in Large Scale Environment

Xiyue Guo, Junjie Hu, Junfeng Chen et al.

The core problem of visual multi-robot simultaneous localization and mapping (MR-SLAM) is how to efficiently and accurately perform multi-robot global localization (MR-GL). The difficulties are two-fold. The first is the difficulty of global localization under significant viewpoint differences. Appearance-based localization methods tend to fail under large viewpoint changes. Recently, semantic graphs have been utilized to overcome the viewpoint variation problem. However, these methods are highly time-consuming, especially in large-scale environments. This leads to the second difficulty, which is how to perform real-time global localization. In this paper, we propose a semantic histogram-based graph matching method that is robust to viewpoint variation and can achieve real-time global localization. Based on that, we develop a system that can accurately and efficiently perform MR-GL for both homogeneous and heterogeneous robots. The experimental results show that our approach is about 30 times faster than Random Walk based semantic descriptors. Moreover, it achieves an accuracy of 95% for global localization, while the accuracy of the state-of-the-art method is 85%.
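
To make the idea of histogram-based matching on semantic graphs more concrete, here is a heavily simplified sketch. It is not the descriptor from the paper: it assumes each node is described by a count of the semantic labels found within a few hops, and that nodes from two graphs are matched by cosine similarity among candidates with the same label. All names (`neighborhood_histogram`, `cosine_similarity`, `match_nodes`) and the `hops` parameter are hypothetical.

```python
import math
from collections import Counter, deque


def neighborhood_histogram(labels, adjacency, node, hops=2):
    """Count the semantic labels of all nodes within `hops` edges of `node`."""
    seen = {node}
    queue = deque([(node, 0)])
    histogram = Counter()
    while queue:
        current, depth = queue.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                histogram[labels[neighbour]] += 1
                queue.append((neighbour, depth + 1))
    return histogram


def cosine_similarity(h1, h2):
    """Cosine similarity of two label histograms (0.0 if either is empty)."""
    dot = sum(count * h2[label] for label, count in h1.items())
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def match_nodes(labels_a, adj_a, labels_b, adj_b, hops=2):
    """For each node of graph A, pick the most similar same-label node of B."""
    hists_b = {n: neighborhood_histogram(labels_b, adj_b, n, hops) for n in labels_b}
    matches = {}
    for a in labels_a:
        hist_a = neighborhood_histogram(labels_a, adj_a, a, hops)
        candidates = [b for b in labels_b if labels_b[b] == labels_a[a]]
        if candidates:
            matches[a] = max(
                candidates, key=lambda b: cosine_similarity(hist_a, hists_b[b])
            )
    return matches
```

Because each descriptor is a fixed-size count vector, comparing two nodes costs a single dot product instead of a walk over both graphs, which hints at why a histogram-style descriptor can be fast enough for real-time use.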

CVApr 26, 2020
IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report

Qi She, Fan Feng, Qi Liu et al.

This report summarizes IROS 2019-Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top 8 finalists (out of over 150 teams). The competition dataset (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object) is designed for driving lifelong/continual learning research and application in the robotic vision domain, with everyday objects in home, office, campus, and mall scenarios. The dataset explicitly quantifies the variants of illumination, object occlusion, object size, camera-object distance/angles, and clutter information. Rules are designed to quantify the learning capability of the robotic vision system when faced with the objects appearing in the dynamic environments in the contest. Individual reports, dataset information, rules, and released source code can be found at the project homepage: "https://lifelong-robotic-vision.github.io/competition/".