Quanquan Li

CV
18papers
2,212citations
Novelty50%
AI Score31

18 Papers

CVSep 28, 2022
RALACs: Action Recognition in Autonomous Vehicles using Interaction Encoding and Optical Flow

Eddy Zhou, Alex Zhuang, Alikasim Budhwani et al.

When applied to autonomous vehicle (AV) settings, action recognition can enhance an environment model's situational awareness. This is especially prevalent in scenarios where traditional geometric descriptions and heuristics in AVs are insufficient. However, action recognition has traditionally been studied for humans, and its limited adaptability to noisy, un-clipped, un-pampered, raw RGB data has limited its application in other fields. To push for the advancement and adoption of action recognition into AVs, this work proposes a novel two-stage action recognition system, termed RALACs. RALACs formulates the problem of action recognition for road scenes, and bridges the gap between it and the established field of human action recognition. This work shows how attention layers can be useful for encoding the relations across agents, and stresses how such a scheme can be class-agnostic. Furthermore, to address the dynamic nature of agents on the road, RALACs constructs a novel approach to adapting Region of Interest (ROI) Alignment to agent tracks for downstream action classification. Finally, our scheme also considers the problem of active agent detection, and utilizes a novel application of fusing optical flow maps to discern relevant agents in a road scene. We show that our proposed scheme can outperform the baseline on the ICCV2021 Road Challenge dataset and by deploying it on a real vehicle platform, we provide preliminary insight to the usefulness of action recognition in decision making.

CVApr 17, 2021Code
RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features

Gang Zhang, Xin Lu, Jingru Tan et al.

The two-stage methods for instance segmentation, e.g. Mask R-CNN, have achieved excellent performance recently. However, the segmented masks are still very coarse due to the downsampling operations in both the feature pyramid and the instance-wise pooling process, especially for large objects. In this work, we propose a new method called RefineMask for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features during the instance-wise segmenting process in a multi-stage manner. Through fusing more detailed information stage by stage, RefineMask is able to refine high-quality masks consistently. RefineMask succeeds in segmenting hard cases such as bent parts of objects that are over-smoothed by most previous methods and outputs accurate boundaries. Without bells and whistles, RefineMask yields significant gains of 2.6, 3.4, 3.8 AP over Mask R-CNN on COCO, LVIS, and Cityscapes benchmarks respectively at a small amount of additional computational cost. Furthermore, our single-model result outperforms the winner of the LVIS Challenge 2020 by 1.3 points on the LVIS test-dev set and establishes a new state-of-the-art. Code will be available at https://github.com/zhanggang001/RefineMask.

CVDec 15, 2020Code
Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection

Jingru Tan, Xin Lu, Gang Zhang et al.

Recently proposed decoupled training methods emerge as a dominant paradigm for long-tailed object detection. But they require an extra fine-tuning stage, and the disjointed optimization of representation and classifier might lead to suboptimal results. However, end-to-end training methods, like equalization loss (EQL), still perform worse than decoupled training methods. In this paper, we reveal the main issue in long-tailed object detection is the imbalanced gradients between positives and negatives, and find that EQL does not solve it well. To address the problem of imbalanced gradients, we introduce a new version of equalization loss, called equalization loss v2 (EQL v2), a novel gradient guided reweighing mechanism that re-balances the training process for each category independently and equally. Extensive experiments are performed on the challenging LVIS benchmark. EQL v2 outperforms origin EQL by about 4 points overall AP with 14-18 points improvements on the rare categories. More importantly, it also surpasses decoupled training methods. Without further tuning for the Open Images dataset, EQL v2 improves EQL by 7.3 points AP, showing strong generalization ability. Codes have been released at https://github.com/tztztztztz/eqlv2

CVMay 7, 2020Code
DMCP: Differentiable Markov Channel Pruning for Neural Networks

Shaopeng Guo, Yujie Wang, Quanquan Li et al.

Recent works imply that the channel pruning can be regarded as searching optimal sub-structure from unpruned networks. However, existing works based on this observation require training and evaluating a large number of structures, which limits their application. In this paper, we propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP), to efficiently search the optimal sub-structure. Our method is differentiable and can be directly optimized by gradient descent with respect to standard task loss and budget regularization (e.g. FLOPs constraint). In DMCP, we model the channel pruning as a Markov process, in which each state represents for retaining the corresponding channel during pruning, and transitions between states denote the pruning process. In the end, our method is able to implicitly select the proper number of channels in each layer by the Markov process with optimized transitions. To validate the effectiveness of our method, we perform extensive experiments on Imagenet with ResNet and MobilenetV2. Results show our method can achieve consistent improvement than state-of-the-art pruning methods in various FLOPs settings. The code is available at https://github.com/zx55/dmcp

CVMar 11, 2020Code
Equalization Loss for Long-Tailed Object Recognition

Jingru Tan, Changbao Wang, Buyu Li et al.

Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets, e.g. LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, making the tail categories receive more discouraging gradients. Based on it, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during the network parameter updating. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. With the utilization of the effective equalization loss, we finally won the 1st place in the LVIS Challenge 2019. Code has been made available at: https: //github.com/tztztztztz/eql.detectron2

CVJun 13, 2019Code
Grid R-CNN Plus: Faster and Better

Xin Lu, Buyu Li, Yuxin Yue et al.

Grid R-CNN is a well-performed objection detection framework. It transforms the traditional box offset regression problem into a grid point estimation problem. With the guidance of the grid points, it can obtain high-quality localization results. However, the speed of Grid R-CNN is not so satisfactory. In this technical report we present Grid R-CNN Plus, a better and faster version of Grid R-CNN. We have made several updates that significantly speed up the framework and simultaneously improve the accuracy. On COCO dataset, the Res50-FPN based Grid R-CNN Plus detector achieves an mAP of 40.4%, outperforming the baseline on the same model by 3.0 points with similar inference time. Code is available at https://github.com/STVIR/Grid-R-CNN .

CVMar 31, 2021
Fixing the Teacher-Student Knowledge Discrepancy in Distillation

Jiangfan Han, Mengya Gao, Yujie Wang et al.

Training a small student network with the guidance of a larger teacher network is an effective way to promote the performance of the student. Despite the different types, the guided knowledge used to distill is always kept unchanged for different teacher and student pairs in previous knowledge distillation methods. However, we find that teacher and student models with different networks or trained from different initialization could have distinct feature representations among different channels. (e.g. the high activated channel for different categories). We name this incongruous representation of channels as teacher-student knowledge discrepancy in the distillation process. Ignoring the knowledge discrepancy problem of teacher and student models will make the learning of student from teacher more difficult. To solve this problem, in this paper, we propose a novel student-dependent distillation method, knowledge consistent distillation, which makes teacher's knowledge more consistent with the student and provides the best suitable knowledge to different student networks for distillation. Extensive experiments on different datasets (CIFAR100, ImageNet, COCO) and tasks (image classification, object detection) reveal the widely existing knowledge discrepancy problem between teachers and students and demonstrate the effectiveness of our proposed method. Our method is very flexible that can be easily combined with other state-of-the-art approaches.

CVMar 30, 2021
Differentiable Network Adaption with Elastic Search Space

Shaopeng Guo, Yujie Wang, Kun Yuan et al.

In this paper we propose a novel network adaption method called Differentiable Network Adaption (DNA), which can adapt an existing network to a specific computation budget by adjusting the width and depth in a differentiable manner. The gradient-based optimization allows DNA to achieve an automatic optimization of width and depth rather than previous heuristic methods that heavily rely on human priors. Moreover, we propose a new elastic search space that can flexibly condense or expand during the optimization process, allowing the network optimization of width and depth in a bi-direction manner. By DNA, we successfully achieve network architecture optimization by condensing and expanding in both width and depth dimensions. Extensive experiments on ImageNet demonstrate that DNA can adapt the existing network to meet different targeted computation requirements with better performance than previous methods. What's more, DNA can further improve the performance of high-accuracy networks obtained by state-of-the-art neural architecture search methods such as EfficientNet and MobileNet-v3.

CVOct 2, 2020
Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks

Kun Yuan, Quanquan Li, Dapeng Chen et al.

One practice of employing deep neural networks is to apply the same architecture to all the input instances. However, a fixed architecture may not be representative enough for data with high diversity. To promote the model capacity, existing approaches usually employ larger convolutional kernels or deeper network structure, which may increase the computational cost. In this paper, we address this issue by raising the Dynamic Graph Network (DG-Net). The network learns the instance-aware connectivity, which creates different forward paths for different instances. Specifically, the network is initialized as a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent the connection paths. We generate edge weights by a learnable module \textit{router} and select the edges whose weights are larger than a threshold, to adjust the connectivity of the neural network structure. Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability. To facilitate the training, we represent the network connectivity of each sample in an adjacency matrix. The matrix is updated to aggregate features in the forward pass, cached in the memory, and used for gradient computing in the backward pass. We verify the effectiveness of our method with several static architectures, including MobileNetV2, ResNet, ResNeXt, and RegNet. Extensive experiments are performed on ImageNet classification and COCO object detection, which shows the effectiveness and generalization ability of our approach.

CVSep 24, 2020
MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection

Xin Lu, Quanquan Li, Buyu Li et al.

Modern object detection methods can be divided into one-stage approaches and two-stage ones. One-stage detectors are more efficient owing to straightforward architectures, but the two-stage detectors still take the lead in accuracy. Although recent work try to improve the one-stage detectors by imitating the structural design of the two-stage ones, the accuracy gap is still significant. In this paper, we propose MimicDet, a novel and efficient framework to train a one-stage detector by directly mimic the two-stage features, aiming to bridge the accuracy gap between one-stage and two-stage detectors. Unlike conventional mimic methods, MimicDet has a shared backbone for one-stage and two-stage detectors, then it branches into two heads which are well designed to have compatible features for mimicking. Thus MimicDet can be end-to-end trained without the pre-train of the teacher network. And the cost does not increase much, which makes it practical to adopt large networks as backbones. We also make several specialized designs such as dual-path mimicking and staggered feature pyramid to facilitate the mimicking process. Experiments on the challenging COCO detection benchmark demonstrate the effectiveness of MimicDet. It achieves 46.1 mAP with ResNeXt-101 backbone on the COCO test-dev set, which significantly surpasses current state-of-the-art methods.

CVSep 3, 2020
1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask

Jingru Tan, Gang Zhang, Hanming Deng et al.

This article introduces the solutions of the team lvisTraveler for LVIS Challenge 2020. In this work, two characteristics of LVIS dataset are mainly considered: the long-tailed distribution and high quality instance segmentation mask. We adopt a two-stage training pipeline. In the first stage, we incorporate EQL and self-training to learn generalized representation. In the second stage, we utilize Balanced GroupSoftmax to promote the classifier, and propose a novel proposal assignment strategy and a new balanced mask loss for mask head to get more precise mask predictions. Finally, we achieve 41.5 and 41.2 AP on LVIS v1.0 val and test-dev splits respectively, outperforming the baseline based on X101-FPN-MaskRCNN by a large margin.

CVAug 19, 2020
Learning Connectivity of Neural Networks from a Topological Perspective

Kun Yuan, Quanquan Li, Jing Shao et al.

Seeking effective neural networks is a critical and practical field in deep learning. Besides designing the depth, type of convolution, normalization, and nonlinearities, the topological connectivity of neural networks is also important. Previous principles of rule-based modular design simplify the difficulty of building an effective architecture, but constrain the possible topologies in limited spaces. In this paper, we attempt to optimize the connectivity in neural networks. We propose a topological perspective to represent a network into a complete graph for analysis, where nodes carry out aggregation and transformation of features, and edges determine the flow of information. By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner. We further attach auxiliary sparsity constraint to the distribution of connectedness, which promotes the learned topology focus on critical connections. This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks. Quantitative results of experiments reflect the learned connectivity is superior to traditional rule-based ones, such as random, residual, and complete. In addition, it obtains significant improvements in image classification and object detection without introducing excessive computation burden.

LGFeb 21, 2020
Residual Knowledge Distillation

Mengya Gao, Yujun Shen, Quanquan Li et al.

Knowledge distillation (KD) is one of the most potent ways for model compression. The key idea is to transfer the knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance degradation due to the substantial gap between the learning capacities of S and T. To remedy this problem, this work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A). Specifically, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them. In this way, S and A complement with each other to get better knowledge from T. Furthermore, we devise an effective method to derive S and A from a given model without increasing the total computational cost. Extensive experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet, surpassing state-of-the-art methods.

CVNov 12, 2019
Equalization Loss for Large Vocabulary Instance Segmentation

Jingru Tan, Changbao Wang, Quanquan Li et al.

Recent object detection and instance segmentation tasks mainly focus on datasets with a relatively small set of categories, e.g. Pascal VOC with 20 classes and COCO with 80 classes. The new large vocabulary dataset LVIS brings new challenges to conventional methods. In this work, we propose an equalization loss to solve the long tail of rare categories problem. Combined with exploiting the data from detection datasets to alleviate the effect of missing-annotation problems during the training, our method achieves 5.1\% overall AP gain and 11.4\% AP gain of rare categories on LVIS benchmark without any bells and whistles compared to Mask R-CNN baseline. Finally we achieve 28.9 mask AP on the test-set of the LVIS and rank 1st place in LVIS Challenge 2019.

CVFeb 19, 2019
WIDER Face and Pedestrian Challenge 2018: Methods and Results

Chen Change Loy, Dahua Lin, Wanli Ouyang et al.

This paper presents a review of the 2018 WIDER Challenge on Face and Pedestrian. The challenge focuses on the problem of precise localization of human faces and bodies, and accurate association of identities. It comprises of three tracks: (i) WIDER Face which aims at soliciting new approaches to advance the state-of-the-art in face detection, (ii) WIDER Pedestrian which aims to find effective and efficient approaches to address the problem of pedestrian detection in unconstrained environments, and (iii) WIDER Person Search which presents an exciting challenge of searching persons across 192 movies. In total, 73 teams made valid submissions to the challenge tracks. We summarize the winning solutions for all three tracks. and present discussions on open problems and potential research directions in these topics.

CVDec 5, 2018
An Embarrassingly Simple Approach for Knowledge Distillation

Mengya Gao, Yujun Shen, Quanquan Li et al.

Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss weight to balance these two terms. In this work, we propose to first transfer the backbone knowledge from a teacher to the student, and then only learn the task-head of the student network. Such a decomposition of the training process circumvents the need of choosing an appropriate loss weight, which is often difficult in practice, and thus makes it easier to apply to different datasets and tasks. Importantly, the decomposition permits the core of our method, Stage-by-Stage Knowledge Distillation (SSKD), which facilitates progressive feature mimicking from teacher to student. Extensive experiments on CIFAR-100 and ImageNet suggest that SSKD significantly narrows down the performance gap between student and teacher, outperforming state-of-the-art approaches. We also demonstrate the generalization ability of SSKD on other challenging benchmarks, including face recognition on IJB-A dataset as well as object detection on COCO dataset.

CVNov 29, 2018
Grid R-CNN

Xin Lu, Buyu Li, Yuxin Yue et al.

This paper proposes a novel object detection framework named Grid R-CNN, which adopts a grid guided localization mechanism for accurate object detection. Different from the traditional regression based methods, the Grid R-CNN captures the spatial information explicitly and enjoys the position sensitive property of fully convolutional architecture. Instead of using only two independent points, we design a multi-point supervision formulation to encode more clues in order to reduce the impact of inaccurate prediction of specific points. To take the full advantage of the correlation of points in a grid, we propose a two-stage information fusion strategy to fuse feature maps of neighbor grid points. The grid guided localization approach is easy to be extended to different state-of-the-art detection frameworks. Grid R-CNN leads to high quality object localization, and experiments demonstrate that it achieves a 4.1% AP gain at IoU=0.8 and a 10.0% AP gain at IoU=0.9 on COCO benchmark compared to Faster R-CNN with Res50 backbone and FPN architecture.

CVOct 19, 2016
POI: Multiple Object Tracking with High Performance Detection and Appearance Feature

Fengwei Yu, Wenbo Li, Quanquan Li et al.

Detection and learning based appearance feature play the central role in data association based multiple object tracking (MOT), but most recent MOT works usually ignore them and only focus on the hand-crafted feature and association algorithms. In this paper, we explore the high-performance detection and deep learning based appearance feature, and show that they lead to significantly better MOT results in both online and offline setting. We make our detection and appearance feature publicly available. In the following part, we first summarize the detection and appearance feature, and then introduce our tracker named Person of Interest (POI), which has both online and offline version.