CV Feb 28, 2023
Tracking Fast by Learning Slow: An Event-based Speed Adaptive Hand Tracker Leveraging Knowledge in RGB Domain
Chuanlin Lan, Ziyuan Yin, Arindam Basu et al.
3D hand tracking methods based on monocular RGB videos are easily affected by motion blur, whereas the event camera, a sensor with high temporal resolution and dynamic range, is naturally suited to this task, with sparse output and low power consumption. However, 3D annotations of fast-moving hands are difficult to obtain, which hinders the construction of event-based hand-tracking datasets. In this paper, we present an event-based speed adaptive hand tracker (ESAHT) to solve the hand tracking problem with an event camera. We train a CNN model on a hand tracking dataset with slow motion, which lets the model leverage the knowledge of RGB-based hand tracking solutions, and enable it to work on fast hand tracking tasks. To realize our solution, we constructed the first 3D hand tracking dataset captured by an event camera in a real-world environment, devised two data augmentation methods to narrow the domain gap between slow and fast motion data, developed a speed adaptive event stream segmentation method to handle hand movements at different speeds, and introduced a new event-to-frame representation method that adapts to event streams of different lengths. Experiments showed that our solution outperformed RGB-based as well as previous event-based solutions in fast hand tracking tasks. Our code and dataset will be publicly available.
CV Apr 1, 2025
Spatiotemporal Attention Learning Framework for Event-Driven Object Recognition
Tiantian Xie, Pengpai Wang, Rosa H. M. Chan
Event-based vision sensors, inspired by biological neural systems, asynchronously capture local pixel-level intensity changes as a sparse event stream containing position, polarity, and timestamp information. These neuromorphic sensors offer significant advantages in dynamic range, latency, and power efficiency. Their working principle inherently addresses traditional camera limitations such as motion blur and redundant background information, making them particularly suitable for dynamic vision tasks. While recent works have proposed increasingly complex event-based architectures, the computational overhead and parameter complexity of these approaches limit their practical deployment. This paper presents a novel spatiotemporal learning framework for event-based object recognition, utilizing a VGG network enhanced with Convolutional Block Attention Module (CBAM). Our approach achieves comparable performance to state-of-the-art ResNet-based methods while reducing parameter count by 2.3% compared to the original VGG model. Specifically, it outperforms ResNet-based methods like MVF-Net, achieving the highest Top-1 accuracy of 76.4% (pretrained) and 71.3% (not pretrained) on CIFAR10-DVS, and 72.4% (not pretrained) on N-Caltech101. These results highlight the robustness of our method when pretrained weights are not used, making it suitable for scenarios where transfer learning is unavailable. Moreover, our approach reduces reliance on data augmentation. Experimental results on standard event-based datasets demonstrate the framework's efficiency and effectiveness for real-world applications.
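As an illustration of the event stream described above (position, polarity, timestamp), the following is a minimal sketch of accumulating events into a signed count frame. It is a generic accumulation scheme, not the representation used in the paper; the tuple layout `(x, y, polarity, timestamp_us)` and the function name `accumulate_events` are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout for illustration: (x, y, polarity, timestamp_us).
Event = Tuple[int, int, int, int]


def accumulate_events(events: Iterable[Event], width: int, height: int) -> List[List[int]]:
    """Sum signed polarities per pixel into a height x width frame."""
    frame = [[0] * width for _ in range(height)]
    for x, y, polarity, _timestamp in events:
        # Ignore events that fall outside the sensor area.
        if 0 <= x < width and 0 <= y < height:
            # Positive polarity adds one, any other value subtracts one.
            frame[y][x] += 1 if polarity > 0 else -1
    return frame


events = [(0, 0, 1, 10), (0, 0, 1, 25), (2, 1, 0, 30), (1, 1, 1, 42)]
frame = accumulate_events(events, width=3, height=2)
print(frame)
```

A frame built this way can be fed to a conventional image network, at the cost of discarding the fine timing of the events inside the accumulation window.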
MM Jul 27, 2021
The CORSMAL benchmark for the prediction of the properties of containers
Alessio Xompero, Santiago Donaher, Vladimir Iashin et al.
The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement on the design of new methods. These new methods can be ranked and compared on the individual leaderboards provided by our open framework.
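The weighted average F1-score quoted above can be sketched as follows, assuming the common definition in which each class's F1 is weighted by its share of the true labels. The benchmark's own evaluation code may differ in details; the function name `weighted_f1` and the example labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1, averaged with weights equal to each class's support."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Illustrative labels only, not data from the benchmark.
y_true = ["water", "water", "rice", "pasta"]
y_pred = ["water", "rice", "rice", "pasta"]
print(weighted_f1(y_true, y_pred))
```

Weighting by support keeps frequent classes from being underrepresented in the average, which matters when the content types are not balanced.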
CV Apr 26, 2020
IROS 2019 Lifelong Robotic Vision Challenge -- Lifelong Object Recognition Report
Qi She, Fan Feng, Qi Liu et al.
This report summarizes IROS 2019-Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top 8 finalists (out of over 150 teams). The competition dataset (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object) is designed for driving lifelong/continual learning research and application in robotic vision domain, with everyday objects in home, office, campus, and mall scenarios. The dataset explicitly quantifies the variants of illumination, object occlusion, object size, camera-object distance/angles, and clutter information. Rules are designed to quantify the learning capability of the robotic vision system when faced with the objects appearing in the dynamic environments in the contest. Individual reports, dataset information, rules, and released source code can be found at the project homepage: "https://lifelong-robotic-vision.github.io/competition/".
CV Nov 15, 2019
OpenLORIS-Object: A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning
Qi She, Fan Feng, Xinyue Hao et al.
The recent breakthroughs in computer vision have benefited from the availability of large representative datasets (e.g. ImageNet and COCO) for training. Yet, robotic vision poses unique challenges for applying visual algorithms developed from these standard computer vision datasets due to their implicit assumption over non-varying distributions for a fixed set of tasks. Fully retraining models each time a new task becomes available is infeasible due to computational, storage and sometimes privacy issues, while naïve incremental strategies have been shown to suffer from catastrophic forgetting. It is crucial for the robots to operate continuously under open-set and detrimental conditions with adaptive visual perceptual systems, where lifelong learning is a fundamental capability. However, very few datasets and benchmarks are available to evaluate and compare emerging techniques. To fill this gap, we provide a new lifelong robotic vision dataset ("OpenLORIS-Object") collected via RGB-D cameras. The dataset embeds the challenges faced by a robot in real-life applications and provides new benchmarks for validating lifelong object recognition algorithms. Moreover, we have provided a testbed of 9 state-of-the-art lifelong learning algorithms. Each of them involves 48 tasks with 4 evaluation metrics over the OpenLORIS-Object dataset. The results demonstrate that the object recognition task in environments of ever-changing difficulty is far from being solved and the bottlenecks are at the forward/backward transfer designs. Our dataset and benchmark are publicly available at https://lifelong-robotic-vision.github.io/dataset/object.
RO Nov 13, 2019
Are We Ready for Service Robots? The OpenLORIS-Scene Datasets for Lifelong SLAM
Xuesong Shi, Dongjiang Li, Pengpeng Zhao et al.
Service robots should be able to operate autonomously in dynamic and daily changing environments over an extended period of time. While Simultaneous Localization And Mapping (SLAM) is one of the most fundamental problems for robotic autonomy, most existing SLAM works are evaluated with data sequences that are recorded in a short period of time. In real-world deployment, there can be out-of-sight scene changes caused by both natural factors and human activities. For example, in home scenarios, most objects may be movable, replaceable or deformable, and the visual features of the same place may be significantly different on successive days. Such out-of-sight dynamics pose great challenges to the robustness of pose estimation, and hence a robot's long-term deployment and operation. To differentiate the aforementioned problem from the conventional works which are usually evaluated in a static setting in a single run, the term "lifelong SLAM" is used here to address SLAM problems in an ever-changing environment over a long period of time. To accelerate lifelong SLAM research, we release the OpenLORIS-Scene datasets. The data are collected in real-world indoor scenes, multiple times in each place, to include scene changes in real life. We also design benchmarking metrics for lifelong SLAM, with which the robustness and accuracy of pose estimation are evaluated separately. The datasets and benchmark are available online at https://lifelong-robotic-vision.github.io/dataset/scene.
QM May 10, 2016
An Efficient and Flexible Spike Train Model via Empirical Bayes
Qi She, Xiaoli Wu, Beth Jelfs et al.
Accurate statistical models of neural spike responses can characterize the information carried by neural populations. But the limited samples of spike counts during recording usually result in model overfitting. Besides, current models assume spike counts to be Poisson-distributed, which ignores the fact that many neurons demonstrate over-dispersed spiking behaviour. Although the Negative Binomial Generalized Linear Model (NB-GLM) provides a powerful tool for modeling over-dispersed spike counts, the maximum likelihood-based standard NB-GLM leads to highly variable and inaccurate parameter estimates. Thus, we propose a hierarchical parametric empirical Bayes method to estimate the neural spike responses among neuronal populations. Our method integrates both Generalized Linear Models (GLMs) and empirical Bayes theory, and aims to (1) improve the accuracy and reliability of parameter estimation, compared to the maximum likelihood-based method for NB-GLM and Poisson-GLM; (2) effectively capture the over-dispersion nature of spike counts from both simulated data and experimental data; and (3) provide insight into both neural interactions and spiking behaviours of the neuronal populations. We apply our approach to study both simulated data and experimental neural data. The estimation on simulated data indicates that the new framework can accurately predict mean spike counts simulated from different models and recover the connectivity weights among neural populations. The estimation based on retinal neurons demonstrates that the proposed method outperforms both NB-GLM and Poisson-GLM in terms of the predictive log-likelihood of held-out data. Code is available at https://doi.org/10.5281/zenodo.4704423
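The over-dispersion mentioned above can be checked with a small sketch. Under a Poisson model the variance of the counts equals their mean, so the Fano factor (variance divided by mean) is 1. In one common negative binomial parameterization with mean mu and shape r, the variance is mu + mu^2 / r, which always exceeds the mean. The function names and the spike counts below are assumptions for illustration and are not taken from the paper or its released code.

```python
from statistics import mean, variance
from typing import Sequence


def fano_factor(counts: Sequence[int]) -> float:
    """Sample variance divided by sample mean; about 1 for Poisson counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("Fano factor is undefined when the mean count is zero")
    return variance(counts) / m


def nb_variance(mu: float, r: float) -> float:
    """Variance of a negative binomial with mean mu and shape r (assumed parameterization)."""
    return mu + mu * mu / r


# Made-up spike counts per trial, chosen to look bursty.
counts = [0, 1, 0, 7, 2, 0, 9, 1]
print(fano_factor(counts))    # a value well above 1 suggests over-dispersion
print(nb_variance(2.0, 4.0))  # 2 + 4/4 = 3.0, larger than the mean of 2
```

As r grows, the extra term mu^2 / r shrinks and the negative binomial variance approaches the Poisson case.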
ML Nov 24, 2014
Mutual Information-Based Unsupervised Feature Transformation for Heterogeneous Feature Subset Selection
Min Wei, Tommy W. S. Chow, Rosa H. M. Chan
Conventional mutual information (MI) based feature selection (FS) methods are unable to handle heterogeneous feature subset selection properly because of data format differences or estimation methods of MI between feature subset and class label. A way to solve this problem is feature transformation (FT). In this study, a novel unsupervised feature transformation (UFT) which can transform non-numerical features into numerical features is developed and tested. The UFT process is MI-based and independent of class label. MI-based FS algorithms, such as Parzen window feature selector (PWFS), minimum redundancy maximum relevance feature selection (mRMR), and normalized MI feature selection (NMIFS), can all adopt UFT for pre-processing of non-numerical features. Unlike traditional FT methods, the proposed UFT is unbiased while PWFS is utilized to its full advantage. Simulations and analyses of large-scale datasets showed that feature subset selected by the integrated method, UFT-PWFS, outperformed other FT-FS integrated methods in classification accuracy.