CV · Mar 21, 2022 · Code
Stereo Neural Vernier Caliper
Shichao Li, Zechun Liu, Zhiqiang Shen et al.
We propose a new object-centric framework for learning-based stereo 3D object detection. Previous studies build scene-centric representations that do not consider the significant variation among outdoor instances and thus lack the flexibility and functionalities that an instance-level model can offer. We build such an instance-level model by formulating and tackling a local update problem, i.e., how to predict a refined update given an initial 3D cuboid guess. We demonstrate how solving this problem can complement scene-centric approaches in (i) building a coarse-to-fine multi-resolution system, (ii) performing model-agnostic object location refinement, and (iii) conducting stereo 3D tracking-by-detection. Extensive experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on the KITTI benchmark. Code and pre-trained models are available at https://github.com/Nicholasli1995/SNVC.
LG · Jun 9, 2022
SDQ: Stochastic Differentiable Quantization with Mixed Precision
Xijie Huang, Zhiqiang Shen, Shichao Li et al.
In order to deploy deep models in a computationally efficient manner, model quantization approaches have been frequently used. In addition, as new hardware supports mixed-bitwidth arithmetic operations, recent research on mixed precision quantization (MPQ) has begun to fully leverage the capacity of representation by searching optimized bitwidths for different layers and modules in a network. However, previous studies mainly search the MPQ strategy in a costly scheme using reinforcement learning, neural architecture search, etc., or simply utilize partial prior knowledge for bitwidth assignment, which might be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally-optimized space with smoother gradient approximation. In particular, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidth choices. After the optimal MPQ strategy is acquired, we further train our network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method for several networks on different hardware (GPUs and FPGA) and datasets. SDQ outperforms all state-of-the-art mixed or single precision quantization with a lower bitwidth and is even better than the full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method.
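The idea of stochastic quantization between adjacent bitwidth choices can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the uniform quantizer on [0, 1], the function names, and the parameter `p_keep` standing in for a bitwidth probability. The gradient path that would make such a parameter learnable is not shown.

```python
import random


def uniform_quantize(x, bits):
    """Uniformly quantize x (clipped to [0, 1]) onto 2**bits levels."""
    levels = 2 ** bits - 1
    x = min(max(x, 0.0), 1.0)
    return round(x * levels) / levels


def stochastic_quantize(x, bits, p_keep, rng):
    """Quantize with `bits` with probability p_keep, else with bits - 1.

    p_keep is a hypothetical stand-in for a per-layer bitwidth probability.
    """
    chosen = bits if rng.random() < p_keep else bits - 1
    return uniform_quantize(x, chosen), chosen


def expected_bitwidth(bits, p_keep):
    """Average bitwidth implied by the stochastic choice."""
    return p_keep * bits + (1.0 - p_keep) * (bits - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    value, used_bits = stochastic_quantize(0.37, bits=4, p_keep=0.25, rng=rng)
    print(value, used_bits, expected_bitwidth(4, 0.25))
```

In this sketch, driving `p_keep` toward 0 or 1 collapses the layer onto a single bitwidth, which is one way to read a probability factor as a soft bitwidth choice.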
AI · Apr 9, 2023 · Code
OpenDriver: An Open-Road Driver State Detection Dataset
Delong Liu, Shichao Li, Tianyi Shi et al.
Among numerous studies for driver state detection, wearable physiological measurements offer a practical method for real-time monitoring. However, there are few driver physiological datasets in open-road scenarios, and the existing datasets suffer from issues such as poor signal quality, small sample sizes, and short data collection periods. Therefore, in this paper, a large-scale multimodal driving dataset, OpenDriver, for driver state detection is developed. The OpenDriver encompasses a total of 3,278 driving trips, with a signal collection duration spanning approximately 4,600 hours. Two modalities of driving signals are enrolled in OpenDriver: electrocardiogram (ECG) signals and six-axis motion data of the steering wheel from a motion measurement unit (IMU), which were recorded from 81 drivers and their vehicles. Furthermore, three challenging tasks are involved in our work, namely ECG signal quality assessment, individual biometric identification based on ECG signals, and physiological signal analysis in complex driving environments. To facilitate research in these tasks, corresponding benchmarks have also been introduced. First, a noisy augmentation strategy is applied to generate a larger-scale ECG signal dataset with realistic noise simulation for quality assessment. Second, an end-to-end contrastive learning framework is employed for individual biometric identification. Finally, a comprehensive analysis of drivers' HRV features under different driving conditions is conducted. Each benchmark provides evaluation metrics and reference results. The OpenDriver dataset will be publicly available at https://github.com/bdne/OpenDriver.
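As a rough illustration of what HRV features derived from ECG can look like, the sketch below computes two common time-domain measures, SDNN and RMSSD, from a list of RR intervals. The abstract does not say which HRV features the benchmark uses, so the choice of these two, the function names, and the sample intervals are assumptions.

```python
import math
import statistics


def sdnn(rr_ms):
    """Sample standard deviation of RR intervals, in milliseconds."""
    return statistics.stdev(rr_ms)


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences, in ms."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate(rr_ms):
    """Mean heart rate in beats per minute from RR intervals in ms."""
    return 60000.0 / statistics.mean(rr_ms)


if __name__ == "__main__":
    rr = [800, 810, 790, 800]  # made-up RR intervals, not dataset values
    print(sdnn(rr), rmssd(rr), mean_heart_rate(rr))
```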
57.3 · RO · Mar 17
Toward Deep Representation Learning for Event-Enhanced Visual Autonomous Perception: the eAP Dataset
Jinghang Li, Shichao Li, Qing Lian et al.
Recent visual autonomous perception systems achieve remarkable performance with deep representation learning. However, they fail in scenarios with challenging illumination. While event cameras can mitigate this problem, there is a lack of a large-scale dataset to develop event-enhanced deep visual perception models in autonomous driving scenes. To address the gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest dataset with event cameras for autonomous perception. We demonstrate how eAP can facilitate the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a devoted study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network that operates at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.
IV · Jan 13, 2025 · Code
A Multi-Modal Deep Learning Framework for Pan-Cancer Prognosis
Binyu Zhang, Shichao Li, Junpeng Jian et al.
The prognostic task is of great importance as it is closely related to the survival analysis of patients, the optimization of treatment plans and the allocation of resources. The existing prognostic models have shown promising results on specific datasets, but there are limitations in two aspects. On the one hand, they merely explore certain types of modal data, such as patient histopathology WSI and gene expression analysis. On the other hand, they adopt the per-cancer-per-model paradigm, which means the trained models can only predict the prognostic effect of a single type of cancer, resulting in weak generalization ability. In this paper, a deep-learning based model, named UMPSNet, is proposed. Specifically, to comprehensively understand the condition of patients, in addition to constructing encoders for histopathology images and genomic expression profiles respectively, UMPSNet further integrates four types of important meta data (demographic information, cancer type information, treatment protocols, and diagnosis results) into text templates, and then introduces a text encoder to extract textual features. In addition, an optimal transport (OT)-based attention mechanism is utilized to align and fuse features of different modalities. Furthermore, a guided soft mixture of experts (GMoE) mechanism is introduced to effectively address the issue of distribution differences among multiple cancer datasets. By incorporating the multi-modality of patient data and joint training, UMPSNet outperforms all SOTA approaches, and moreover, it demonstrates the effectiveness and generalization ability of the proposed learning paradigm of a single model for multiple cancer types. The code of UMPSNet is available at https://github.com/binging512/UMPSNet.
LG · Jun 21, 2021 · Code
How Do Adam and Training Strategies Help BNNs Optimization?
Zechun Liu, Zhiqiang Shen, Shichao Li et al.
The best performing Binary Neural Networks (BNNs) are usually attained using Adam optimization and its multi-step training variants. However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support specific training strategies. To address this, in this paper we first investigate the trajectories of gradients and weights in BNNs during the training process. We show the regularization effect of second-order momentum in Adam is crucial to revitalize the weights that are dead due to the activation saturation in BNNs. We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. Furthermore, we inspect the intriguing role of the real-valued weights in binary networks, and reveal the effect of weight decay on the stability and sluggishness of BNN optimization. Through extensive experiments and analysis, we derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset, 1.1% higher than the state-of-the-art ReActNet while using the same architecture. Code and models are available at https://github.com/liuzechun/AdamBNN.
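One way to build intuition for the role of the second-order momentum is to compare a single update step of plain SGD with the first step of standard Adam. The sketch below is not the paper's analysis; it only shows, with assumed default hyperparameters, that Adam's normalization gives a weight with a tiny gradient roughly the same step size as a weight with a large one.

```python
import math


def sgd_step(grad, lr=0.01):
    """Magnitude-preserving SGD update: the step scales with the gradient."""
    return lr * grad


def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update (t = 1), starting from zero moment estimates."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


if __name__ == "__main__":
    for g in (1.0, 1e-4):
        print(g, sgd_step(g), adam_first_step(g))
```

With SGD the step for a gradient of 1e-4 is four orders of magnitude smaller than for a gradient of 1.0, while both Adam steps are close to the learning rate.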
CV · Aug 28, 2019 · Code
Facial age estimation by deep residual decision making
Shichao Li, Kwang-Ting Cheng
Residual representation learning simplifies the optimization problem of learning complex functions and has been widely used by traditional convolutional neural networks. However, it has not been applied to deep neural decision forest (NDF). In this paper we incorporate residual learning into NDF and the resulting model achieves state-of-the-art level accuracy on three public age estimation benchmarks while requiring less memory and computation. We further employ a gradient-based technique to visualize the decision-making process of NDF and understand how it is influenced by facial image inputs. The code and pre-trained models will be available at https://github.com/Nicholasli1995/VisualizingNDF.
CV · Apr 19, 2019 · Code
Visualizing the decision-making process in deep neural decision forest
Shichao Li, Kwang-Ting Cheng
Deep neural decision forest (NDF) achieved remarkable performance on various vision tasks by combining decision trees and deep representation learning. In this work, we first trace the decision-making process of this model and visualize saliency maps to understand which portion of the input influences it more for both classification and regression problems. We then apply NDF on a multi-task coordinate regression problem and demonstrate the distribution of routing probabilities, which is vital for interpreting NDF yet not shown for regression problems. The pre-trained model and code for visualization will be available at https://github.com/Nicholasli1995/VisualizingNDF
CV · May 21, 2025
Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection
Shichao Li, Peiliang Li, Qing Lian et al.
Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such challenging scenes is performance-critical yet very challenging. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point cloud and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this HTTP URL.
MM · Dec 20, 2021
Automated Vision-Based Wellness Analysis for Elderly Care Centers
Xijie Huang, Jeffry Wicaksana, Shichao Li et al.
The growth in the aging population requires caregivers to improve both efficiency and quality of healthcare. In this study, we develop an automatic, vision-based system for monitoring and analyzing the physical and mental well-being of senior citizens. Through collaboration with Haven of Hope Christian Service, we collect video recording data in the care center with surveillance cameras. We then process and extract personalized facial, activity, and interaction features from the video data using deep neural networks. This integrated health information system can help caregivers gain better insights into the seniors they are taking care of. These insights, including wellness metrics and long-term health patterns of senior citizens, can help caregivers update their caregiving strategies. We report the findings of our analysis and evaluate the system quantitatively. We also summarize technical challenges and additional functionalities and technologies needed for offering a comprehensive system.
CV · Nov 25, 2021
Joint stereo 3D object detection and implicit surface reconstruction
Shichao Li, Xijie Huang, Zechun Liu et al.
We present a new learning-based framework S-3D-RCNN that can recover accurate object orientation in SO(3) and simultaneously predict implicit rigid shapes from stereo RGB images. For orientation estimation, in contrast to previous studies that map local appearance to observation angles, we propose a progressive approach by extracting meaningful Intermediate Geometrical Representations (IGRs). This approach features a deep model that transforms perceived intensities from one or two views to object part coordinates to achieve direct egocentric object orientation estimation in the camera coordinate system. To further achieve finer description inside 3D bounding boxes, we investigate the implicit shape estimation problem from stereo images. We model visible object surfaces by designing a point-based representation, augmenting IGRs to explicitly address the unseen surface hallucination problem. Extensive experiments validate the effectiveness of the proposed IGRs, and S-3D-RCNN achieves superior 3D scene understanding performance. We also designed new metrics on the KITTI benchmark for our evaluation of implicit shape estimation.
CV · Nov 17, 2020
Exploring intermediate representation for monocular vehicle pose estimation
Shichao Li, Zengqiang Yan, Hongyang Li et al.
We present a new learning-based framework to recover vehicle pose in SO(3) from a single RGB image. In contrast to previous works that map from local appearance to observation angles, we explore a progressive approach by extracting meaningful Intermediate Geometrical Representations (IGRs) to estimate egocentric vehicle orientation. This approach features a deep model that transforms perceived intensities to IGRs, which are mapped to a 3D representation encoding object orientation in the camera coordinate system. Core problems are what IGRs to use and how to learn them more effectively. We answer the former question by designing IGRs based on an interpolated cuboid that derives from primitive 3D annotation readily. The latter question motivates us to incorporate geometry knowledge with a new loss function based on a projective invariant. This loss function allows unlabeled data to be used in the training stage to improve representation learning. Without additional labels, our system outperforms previous monocular RGB-based methods for joint vehicle detection and pose estimation on the KITTI benchmark, achieving performance even comparable to stereo methods. Code and pre-trained models are available at this https URL.
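The difference between an observation angle and an egocentric orientation can be made concrete for the yaw-only case. The sketch below assumes a KITTI-style camera frame (x to the right, z forward) and the commonly used relation that egocentric yaw equals the observation angle plus the angle of the viewing ray; it does not cover the full SO(3) pose addressed in the abstract, and the function names are illustrative.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def egocentric_yaw(alpha, x, z):
    """Egocentric yaw from observation angle alpha and object center (x, z)."""
    return wrap_angle(alpha + math.atan2(x, z))


def observation_angle(yaw, x, z):
    """Observation angle from egocentric yaw and object center (x, z)."""
    return wrap_angle(yaw - math.atan2(x, z))


if __name__ == "__main__":
    # Same local appearance (alpha = 0) at two positions gives different yaws.
    print(egocentric_yaw(0.0, 0.0, 10.0))
    print(egocentric_yaw(0.0, 10.0, 10.0))
```

Under these assumptions, two vehicles that look the same in their image crops can still have different orientations in the camera coordinate system, because the viewing ray differs with position.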
CV · Jul 26, 2020
GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision
Lei Ke, Shichao Li, Yanan Sun et al.
We present a novel end-to-end framework named GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street view. GSNet utilizes a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that our diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shape with great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of jointly performing both tasks. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. Project page is available at https://lkeab.github.io/gsnet/.
CV · Jun 14, 2020
Cascaded deep monocular 3D human pose estimation with evolutionary training data
Shichao Li, Lei Ke, Kevin Pratama et al.
End-to-end deep representation learning has achieved remarkable accuracy for monocular 3D human pose estimation, yet these models may fail for unseen poses with limited and fixed training data. This paper proposes a novel data augmentation method that (1) is scalable for synthesizing a massive amount of training data (over 8 million valid 3D human poses with corresponding 2D projections) for training 2D-to-3D networks, and (2) can effectively reduce dataset bias. Our method evolves a limited dataset to synthesize unseen 3D human skeletons based on a hierarchical human representation and heuristics inspired by prior knowledge. Extensive experiments show that our approach not only achieves state-of-the-art accuracy on the largest public benchmark, but also generalizes significantly better to unseen and rare poses. Code, pre-trained models and tools are available at this HTTPS URL.
CV · Dec 3, 2018
Deep Hierarchical Machine: a Flexible Divide-and-Conquer Architecture
Shichao Li, Xin Yang, Tim Cheng
We propose Deep Hierarchical Machine (DHM), a model inspired by the divide-and-conquer strategy while emphasizing representation learning ability and flexibility. A stochastic routing framework as used by recent deep neural decision/regression forests is incorporated, but we remove the need to evaluate unnecessary computation paths by utilizing a different topology and introducing a probabilistic pruning technique. We also show a specified version of DHM (DSHM) for efficiency, which inherits the sparse feature extraction process of traditional decision trees with pixel-difference features. To achieve sparse feature extraction, we propose to utilize a sparse convolution operation in DSHM and show one possibility of introducing sparse convolution kernels by using a local binary convolution layer. DHM can be applied to both classification and regression problems, and we validate it on standard image classification and face alignment tasks to show its advantages over past architectures.
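Stochastic routing with pruning can be sketched on a generic soft binary tree. This is not the DHM topology or its pruning rule, neither of which is specified in the abstract. It assumes a complete binary tree stored in heap order, a hypothetical `gates` mapping from each internal node to its probability of routing left, and a simple threshold on the path probability as the pruning criterion.

```python
def leaf_probabilities(gates, depth, threshold=0.0):
    """Return {leaf_index: reach_probability} for a soft binary tree.

    gates: dict mapping internal node index (heap order, root = 0) to the
           probability of routing to the left child.
    depth: number of routing decisions from root to leaf.
    threshold: subtrees whose path probability falls below this value are
               skipped (an assumed pruning rule for illustration).
    """
    probs = {}
    stack = [(0, 0, 1.0)]  # (node index, level, path probability)
    while stack:
        node, level, p = stack.pop()
        if p < threshold:
            continue  # pruned: nothing below this node is evaluated
        if level == depth:
            probs[node] = p
            continue
        g = gates[node]
        stack.append((2 * node + 1, level + 1, p * g))
        stack.append((2 * node + 2, level + 1, p * (1.0 - g)))
    return probs


if __name__ == "__main__":
    gates = {0: 0.9, 1: 0.5, 2: 0.5}
    print(leaf_probabilities(gates, depth=2))
    print(leaf_probabilities(gates, depth=2, threshold=0.2))
```

With the threshold set to 0.2, the right subtree (path probability 0.1) is never expanded, so only the two left leaves are evaluated.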