Zhanglin Peng

CV
14papers
703citations
Novelty51%
AI Score30

14 Papers

LGAug 11, 2023Code
Foundation Model is Efficient Multimodal Multitask Model Selector

Fanqing Meng, Wenqi Shao, Zhanglin Peng et al.

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering. A brute-force approach is to finetune all models on all target datasets, bringing high computational costs. Although recent-advanced approaches employed lightweight metrics to measure models' transferability,they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats such as categories, texts, and bounding boxes of different downstream tasks into a unified noisy label embedding. EMMS can estimate a model's transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0\%, 26.3\%, 20.1\%, 54.8\%, 12.2\% performance gain on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.

CVMay 23, 2022
Active Domain Adaptation with Multi-level Contrastive Units for Semantic Segmentation

Hao Zhang, Ruimao Zhang, Zhanglin Peng et al.

To further reduce the cost of semi-supervised domain adaptation (SSDA) labeling, a more effective way is to use active learning (AL) to annotate a selected subset with specific properties. However, domain adaptation tasks are always addressed in two interactive aspects: domain transfer and the enhancement of discrimination, which requires the selected data to be both uncertain under the model and diverse in feature space. Contrary to active learning in classification tasks, it is usually challenging to select pixels that contain both the above properties in segmentation tasks, leading to the complex design of pixel selection strategy. To address such an issue, we propose a novel Active Domain Adaptation scheme with Multi-level Contrastive Units (ADA-MCU) for semantic image segmentation. A simple pixel selection strategy followed with the construction of multi-level contrastive units is introduced to optimize the model for both domain adaptation and active supervised learning. In practice, MCUs are constructed from intra-image, cross-image, and cross-domain levels by using both labeled and unlabeled pixels. At each level, we define contrastive losses from center-to-center and pixel-to-pixel manners, with the aim of jointly aligning the category centers and reducing outliers near the decision boundaries. In addition, we also introduce a categories correlation matrix to implicitly describe the relationship between categories, which are used to adjust the weights of the losses for MCUs. Extensive experimental results on standard benchmarks show that the proposed method achieves competitive performance against state-of-the-art SSDA methods with 50% fewer labeled pixels and significantly outperforms state-of-the-art with a large margin by using the same level of annotation cost.

CVJan 2, 2023
Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

Ziyi Tang, Ruimao Zhang, Zhanglin Peng et al.

In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.

CVJun 28, 2018Code
Differentiable Learning-to-Normalize via Switchable Normalization

Ping Luo, Jiamin Ren, Zhanglin Peng et al.

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks (see Fig.1). Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, and Kinetics. Analyses of SN are also presented. We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN has been made available in https://github.com/switchablenorms/.

CVNov 10, 2020
AIM 2020 Challenge on Learned Image Signal Processing Pipeline

Andrey Ignatov, Radu Timofte, Zhilu Zhang et al.

This paper reviews the second AIM learned ISP challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where to goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with solutions' perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.

CVMar 19, 2020
Exemplar Normalization for Learning Deep Representation

Ruimao Zhang, Zhanglin Peng, Lingyun Wu et al.

Normalization techniques are important in different advanced neural networks and different tasks. This work investigates a novel dynamic learning-to-normalize (L2N) problem by proposing Exemplar Normalization (EN), which is able to learn different normalization methods for different convolutional layers and image samples of a deep network. EN significantly improves flexibility of the recently proposed switchable normalization (SN), which solves a static L2N problem by linearly combining several normalizers in each normalization layer (the combination is the same for all samples). Instead of directly employing a multi-layer perceptron (MLP) to learn data-dependent parameters as conditional batch normalization (cBN) did, the internal architecture of EN is carefully designed to stabilize its optimization, leading to many appealing benefits. (1) EN enables different convolutional layers, image samples, categories, benchmarks, and tasks to use different normalization methods, shedding light on analyzing them in a holistic view. (2) EN is effective for various network architectures and tasks. (3) It could replace any normalization layers in a deep network and still produce stable model training. Extensive experiments demonstrate the effectiveness of EN in a wide spectrum of tasks including image recognition, noisy label learning, and semantic segmentation. For example, by replacing BN in the ordinary ResNet50, improvement produced by EN is 300% more than that of SN on both ImageNet and the noisy WebVision dataset.

CVAug 16, 2019
Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

Zhaoyang Zhang, Jingyu Li, Wenqi Shao et al.

Group convolution, which divides the channels of ConvNets into groups, has achieved impressive improvement over the regular convolution operation. However, existing models, eg. ResNeXt, still suffers from the sub-optimal performance due to manually defining the number of groups as a constant over all of the layers. Toward addressing this issue, we present Groupable ConvNet (GroupNet) built by using a novel dynamic grouping convolution (DGConv) operation, which is able to learn the number of groups in an end-to-end manner. The proposed approach has several appealing benefits. (1) DGConv provides a unified convolution representation and covers many existing convolution operations such as regular dense convolution, group convolution, and depthwise convolution. (2) DGConv is a differentiable and flexible operation which learns to perform various convolutions from training data. (3) GroupNet trained with DGConv learns different number of groups for different convolution layers. Extensive experiments demonstrate that GroupNet outperforms its counterparts such as ResNet and ResNeXt in terms of accuracy and computational complexity. We also present introspection and reproducibility study, for the first time, showing the learning dynamics of training group numbers.

CVJul 22, 2019
Switchable Normalization for Learning-to-Normalize Deep Representation

Ping Luo, Ruimao Zhang, Jiamin Ren et al.

We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks. Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics. Analyses of SN are also presented to answer the following three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choices of normalizers? (c) Do different tasks and datasets prefer different normalizers? We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN has been released at https://github.com/switchablenorms.

CVNov 19, 2018
Do Normalization Layers in a Deep ConvNet Really Need to Be Distinct?

Ping Luo, Zhanglin Peng, Jiamin Ren et al.

Yes, they do. This work investigates a perspective for deep learning: whether different normalization layers in a ConvNet require different normalizers. This is the first step towards understanding this phenomenon. We allow each convolutional layer to be stacked before a switchable normalization (SN) that learns to choose a normalizer from a pool of normalization methods. Through systematic experiments in ImageNet, COCO, Cityscapes, and ADE20K, we answer three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choices of normalizers? (c) Do different tasks and datasets prefer different normalizers? Our results suggest that (1) using distinct normalizers improves both learning and generalization of a ConvNet; (2) the choices of normalizers are more related to depth and batch size, but less relevant to parameter initialization, learning rate decay, and solver; (3) different tasks and datasets have different behaviors when learning to select normalizers.

LGSep 4, 2018
Towards Understanding Regularization in Batch Normalization

Ping Luo, Xinjiang Wang, Wenqi Shao et al.

Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work understands these phenomena theoretically. We analyze BN by using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impacts of BN in three aspects. First, by viewing BN as an implicit regularizer, BN can be decomposed into population normalization (PN) and gamma decay as an explicit regularization. Second, learning dynamics of BN and the regularization show that training converged with large maximum and effective learning rate. Third, generalization of BN is explored by using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks share the same traits of regularization as the above analyses.

CVFeb 20, 2017
Progressively Diffused Networks for Semantic Image Segmentation

Ruimao Zhang, Wei Yang, Zhanglin Peng et al.

This paper introduces Progressively Diffused Networks (PDNs) for unifying multi-scale context modeling with deep feature learning, by taking semantic image segmentation as an exemplar application. Prior neural networks, such as ResNet, tend to enhance representational power by increasing the depth of architectures and driving the training objective across layers. However, we argue that spatial dependencies in different layers, which generally represent the rich contexts among data elements, are also critical to building deep and discriminative representations. To this end, our PDNs enables to progressively broadcast information over the learned feature maps by inserting a stack of information diffusion layers, each of which exploits multi-dimensional convolutional LSTMs (Long-Short-Term Memory Structures). In each LSTM unit, a special type of atrous filters are designed to capture the short range and long range dependencies from various neighbors to a certain site of the feature map and pass the accumulated information to the next layer. From the extensive experiments on semantic image segmentation benchmarks (e.g., ImageNet Parsing, PASCAL VOC2012 and PASCAL-Part), our framework demonstrates the effectiveness to substantially improve the performances over the popular existing neural network models, and achieves state-of-the-art on ImageNet Parsing for large scale semantic segmentation.

CVApr 7, 2016
Geometric Scene Parsing with Hierarchical LSTM

Zhanglin Peng, Ruimao Zhang, Xiaodan Liang et al.

This paper addresses the problem of geometric scene parsing, i.e. simultaneously labeling geometric surfaces (e.g. sky, ground and vertical plane) and determining the interaction relations (e.g. layering, supporting, siding and affinity) between main regions. This problem is more challenging than the traditional semantic scene labeling, as recovering geometric structures necessarily requires the rich and diverse contextual information. To achieve these goals, we propose a novel recurrent neural network model, named Hierarchical Long Short-Term Memory (H-LSTM). It contains two coupled sub-networks: the Pixel LSTM (P-LSTM) and the Multi-scale Super-pixel LSTM (MS-LSTM) for handling the surface labeling and relation prediction, respectively. The two sub-networks provide complementary information to each other to exploit hierarchical scene contexts, and they are jointly optimized for boosting the performance. Our extensive experiments show that our model is capable of parsing scene geometric structures and outperforming several state-of-the-art methods by large margins. In addition, we show promising 3D reconstruction results from the still images based on the geometric parsing.

CVAug 8, 2015
Deep Boosting: Joint Feature Selection and Analysis Dictionary Learning in Hierarchy

Zhanglin Peng, Ya Li, Zhaoquan Cai et al.

This work investigates how the traditional image classification pipelines can be extended into a deep architecture, inspired by recent successes of deep neural networks. We propose a deep boosting framework based on layer-by-layer joint feature boosting and dictionary learning. In each layer, we construct a dictionary of filters by combining the filters from the lower layer, and iteratively optimize the image representation with a joint discriminative-generative formulation, i.e. minimization of empirical classification error plus regularization of analysis image generation over training images. For optimization, we perform two iterating steps: i) to minimize the classification error, select the most discriminative features using the gentle adaboost algorithm; ii) according to the feature selection, update the filters to minimize the regularization on analysis image representation using the gradient descent method. Once the optimization is converged, we learn the higher layer representation in the same way. Our model delivers several distinct advantages. First, our layer-wise optimization provides the potential to build very deep architectures. Second, the generated image representation is compact and meaningful. In several visual recognition tasks, our framework outperforms existing state-of-the-art approaches.

CVFeb 3, 2015
Deep Boosting: Layered Feature Mining for General Image Classification

Zhanglin Peng, Liang Lin, Ruimao Zhang et al.

Constructing effective representations is a critical but challenging problem in multimedia understanding. The traditional handcraft features often rely on domain knowledge, limiting the performances of exiting methods. This paper discusses a novel computational architecture for general image feature mining, which assembles the primitive filters (i.e. Gabor wavelets) into compositional features in a layer-wise manner. In each layer, we produce a number of base classifiers (i.e. regression stumps) associated with the generated features, and discover informative compositions by using the boosting algorithm. The output compositional features of each layer are treated as the base components to build up the next layer. Our framework is able to generate expressive image representations while inducing very discriminate functions for image classification. The experiments are conducted on several public datasets, and we demonstrate superior performances over state-of-the-art approaches.