Jian Sun

h-index: 91

Papers: 35

Citations: 29,403

Novelty: 55%

AI Score: 48

Ranked #26,774 of 194,257 authors (top 14%); #9,641 in CV (top 16%)

35 Papers

18.7 · CV · Jan 10, 2023 · Code

Dynamic Grained Encoder for Vision Transformers

Lin Song, Songyang Zhang, Songtao Liu et al.

Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.

1.2 · NA · May 4, 2016

Convergence of the Point Integral method for Poisson equation on point cloud

Zuoqiang Shi, Jian Sun

The Laplace-Beltrami operator (LBO) is a fundamental object associated with Riemannian manifolds, which encodes all intrinsic geometry of the manifolds and has many desirable properties. Recently, we proposed a novel numerical method, the Point Integral method (PIM), to discretize the Laplace-Beltrami operator on point clouds [LSS]. In this paper, we analyze the convergence of the PIM for the Poisson equation with Neumann boundary condition on submanifolds isometrically embedded in Euclidean spaces.

4.7 · SE · Jul 28, 2024

High-Dimensional Fault Tolerance Testing of Highly Automated Vehicles Based on Low-Rank Models

Yuewen Mei, Tong Nie, Jian Sun et al.

Ensuring fault tolerance of Highly Automated Vehicles (HAVs) is crucial for their safety due to the presence of potentially severe faults. Hence, Fault Injection (FI) testing is conducted by practitioners to evaluate the safety level of HAVs. To fully cover test cases, various driving scenarios and fault settings should be considered. However, due to numerous combinations of test scenarios and fault settings, the testing space can be complex and high-dimensional. In addition, evaluating performance in all newly added scenarios is resource-consuming. The rarity of critical faults that can cause security problems further compounds the challenge. To address these challenges, we propose to accelerate FI testing under the low-rank Smoothness Regularized Matrix Factorization (SRMF) framework. We first organize the sparse evaluated data into a structured matrix based on its safety values. Then the untested values are estimated from the correlations captured by the matrix structure. To address high dimensionality, a low-rank constraint is imposed on the testing space. To exploit the relationships between existing scenarios and new scenarios and capture the local regularity of critical faults, three types of smoothness regularization are further designed as a complement. We conduct experiments on car-following and cut-in scenarios. The results indicate that, compared to other machine learning models, SRMF has the lowest prediction error in various scenarios and is capable of predicting rare critical faults. In addition, SRMF can achieve an acceleration rate of 1171, 99.3% precision and 91.1% F1 score in identifying critical faults. To the best of our knowledge, this is the first work to introduce low-rank models to FI testing of HAVs.

8.4 · CV · May 27, 2025 · Code

BaryIR: Learning Multi-Source Unified Representation in Continuous Barycenter Space for Generalizable All-in-One Image Restoration

Xiaole Tang, Xiaoyi He, Xiang Gu et al.

Despite remarkable advances made in all-in-one image restoration (AIR) for handling different types of degradations simultaneously, existing methods remain vulnerable to out-of-distribution degradations and images, limiting their real-world applicability. In this paper, we propose a multi-source representation learning framework BaryIR, which decomposes the latent space of multi-source degraded images into a continuous barycenter space for unified feature encoding and source-specific subspaces for specific semantic encoding. Specifically, we seek the multi-source unified representation by introducing a multi-source latent optimal transport barycenter problem, in which a continuous barycenter map is learned to transport the latent representations to the barycenter space. The transport cost is designed such that the representations from source-specific subspaces are contrasted with each other while maintaining orthogonality to those from the barycenter space. This enables BaryIR to learn compact representations with unified degradation-agnostic information from the barycenter space, as well as degradation-specific semantics from source-specific subspaces, capturing the inherent geometry of multi-source data manifold for generalizable AIR. Extensive experiments demonstrate that BaryIR achieves competitive performance compared to state-of-the-art all-in-one methods. Particularly, BaryIR exhibits superior generalization ability to real-world data and unseen degradations. The code will be publicly available at https://github.com/xl-tang3/BaryIR.

3.6 · CV · Nov 21, 2025 · Code

Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

Jiayi Wang, Wei Dai, Haoyu Wang et al.

In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation: learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to preserve strong performance on unseen medical datasets. In experiments across nine medical segmentation datasets under a continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released at https://github.com/azzzzyo/Continual-Alignment-for-SAM.

49.4 · CV · Jul 26, 2018 · Code

Unified Perceptual Parsing for Scene Understanding

Tete Xiao, Yingcheng Liu, Bolei Zhou et al.

Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at https://github.com/CSAILVision/unifiedparsing.

50.1 · CV · May 20, 2016 · Code

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Jifeng Dai, Yi Li, Kaiming He et al.

We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn

6.2 · CV · Mar 30, 2025

GMapLatent: Geometric Mapping in Latent Space

Wei Zeng, Xuebin Chang, Jianghao Su et al.

Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generalization capabilities. In this work, we innovate a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces in a rigorous and precise manner, thus avoiding mode collapse and mixture in the encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces with strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between newly transformed latent spaces and generates a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy and applicability of GMapLatent, and demonstrate that the proposed model has superior performance over existing models.

2.0 · CV · Apr 26, 2024 · Code

Adversarial Reweighting with $α$-Power Maximization for Domain Adaptation

Xiang Gu, Xi Yu, Yan Yang et al.

The practical Domain Adaptation (DA) tasks, e.g., Partial DA (PDA), open-set DA, universal DA, and test-time adaptation, have gained increasing attention in the machine learning community. In this paper, we propose a novel approach, dubbed Adversarial Reweighting with $α$-Power Maximization (ARPM), for PDA where the source domain contains private classes absent in target domain. In ARPM, we propose a novel adversarial reweighting model that adversarially learns to reweight source domain data to identify source-private class samples by assigning smaller weights to them, for mitigating potential negative transfer. Based on the adversarial reweighting, we train the transferable recognition model on the reweighted source distribution to be able to classify common class data. To reduce the prediction uncertainty of the recognition model on the target domain for PDA, we present an $α$-power maximization mechanism in ARPM, which enriches the family of losses for reducing the prediction uncertainty for PDA. Extensive experimental results on five PDA benchmarks, i.e., Office-31, Office-Home, VisDA-2017, ImageNet-Caltech, and DomainNet, show that our method is superior to recent PDA methods. Ablation studies also confirm the effectiveness of components in our approach. To theoretically analyze our method, we deduce an upper bound of target domain expected error for PDA, which is approximately minimized in our approach. We further extend ARPM to open-set DA, universal DA, and test time adaptation, and verify the usefulness through experiments.

8.7 · CV · Aug 16, 2021 · Code

Learning Canonical View Representation for 3D Shape Recognition with Arbitrary Views

Xin Wei, Yifei Gong, Fudong Wang et al.

In this paper, we focus on recognizing 3D shapes from arbitrary views, i.e., arbitrary numbers and positions of viewpoints. It is a challenging and realistic setting for view-based 3D shape recognition. We propose a canonical view representation to tackle this challenge. We first transform the original features of arbitrary views to a fixed number of view features, dubbed canonical view representation, by aligning the arbitrary view features to a set of learnable reference view features using optimal transport. In this way, each 3D shape with arbitrary views is represented by a fixed number of canonical view features, which are further aggregated to generate a rich and robust 3D shape representation for shape recognition. We also propose a canonical view feature separation constraint to enforce that the view features in canonical view representation can be embedded into scattered points in a Euclidean space. Experiments on the ModelNet40, ScanObjectNN, and RGBD datasets show that our method achieves competitive results under the fixed viewpoint settings, and significantly outperforms the applicable methods under the arbitrary view setting.

17.0 · IV · Jul 26, 2021 · Code

A Unified Hyper-GAN Model for Unpaired Multi-contrast MR Image Translation

Heran Yang, Jian Sun, Liwei Yang et al.

Cross-contrast image translation is an important task for completing missing contrasts in clinical diagnosis. However, most existing methods learn a separate translator for each pair of contrasts, which is inefficient given the many possible contrast pairs in real scenarios. In this work, we propose a unified Hyper-GAN model for effectively and efficiently translating between different contrast pairs. Hyper-GAN consists of a hyper-encoder and a hyper-decoder that first map from the source contrast to a common feature space, and then map to the target contrast image. To facilitate the translation between different contrast pairs, contrast-modulators are designed to adapt the hyper-encoder and hyper-decoder to different contrasts. We also design a common space loss to enforce that multi-contrast images of a subject share a common feature space, implicitly modeling the shared underlying anatomical structures. Experiments on two datasets, IXI and BraTS 2019, show that our Hyper-GAN achieves state-of-the-art results in both accuracy and efficiency, e.g., improving PSNR by more than 1.47 and 1.09 dB on the two datasets with less than half the number of parameters.

20.0 · CV · May 22, 2021 · Code

ADNet: Attention-guided Deformable Convolutional Network for High Dynamic Range Imaging

Zhen Liu, Wenjie Lin, Xinpeng Li et al.

In this paper, we present an attention-guided deformable convolutional network for hand-held multi-frame high dynamic range (HDR) imaging, namely ADNet. This problem comprises two intractable challenges of how to handle saturation and noise properly and how to tackle misalignments caused by object motion or camera jittering. To address the former, we adopt a spatial attention module to adaptively select the most appropriate regions of various exposure low dynamic range (LDR) images for fusion. For the latter one, we propose to align the gamma-corrected images in the feature-level with a Pyramid, Cascading and Deformable (PCD) alignment module. The proposed ADNet shows state-of-the-art performance compared with previous methods, achieving a PSNR-$l$ of 39.4471 and a PSNR-$μ$ of 37.6359 in NTIRE 2021 Multi-Frame HDR Challenge.

12.9 · IV · May 17, 2021 · Code

Fast Camera Image Denoising on Mobile GPUs with Deep Learning, Mobile AI 2021 Challenge: Report

Andrey Ignatov, Kim Byeoung-su, Radu Timofte et al.

Image denoising is one of the most critical problems in mobile photo processing. While many solutions have been proposed for this task, they are usually working with synthetic data and are too computationally expensive to run on mobile devices. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop an end-to-end deep learning-based image denoising solution that can demonstrate high efficiency on smartphone GPUs. For this, the participants were provided with a novel large-scale dataset consisting of noisy-clean image pairs captured in the wild. The runtime of all models was evaluated on the Samsung Exynos 2100 chipset with a powerful Mali GPU capable of accelerating floating-point and quantized neural networks. The proposed solutions are fully compatible with any mobile GPU and are capable of processing 480p resolution images under 40-80 ms while achieving high fidelity results. A detailed description of all models developed in the challenge is provided in this paper.

4.4 · IV · Apr 13, 2021

Learning to Jointly Deblur, Demosaick and Denoise Raw Images

Thomas Eboli, Jian Sun, Jean Ponce

We address the problem of non-blind deblurring and demosaicking of noisy raw images. We adapt an existing learning-based approach to RGB image deblurring to handle raw images by introducing a new interpretable module that jointly demosaicks and deblurs them. We train this model on RGB images converted into raw ones following a realistic invertible camera pipeline. We demonstrate the effectiveness of this model over two-stage approaches stacking demosaicking and deblurring modules on quantitative benchmarks. We also apply our approach to remove a camera's inherent blur (its color-dependent point-spread function) from real images, in essence deblurring sharp images.

12.4 · CV · Jul 3, 2020 · Code

End-to-end Interpretable Learning of Non-blind Image Deblurring

Thomas Eboli, Jian Sun, Jean Ponce

Non-blind image deblurring is typically formulated as a linear least-squares problem regularized by natural priors on the corresponding sharp picture's gradients, which can be solved, for example, using a half-quadratic splitting method with Richardson fixed-point iterations for its least-squares updates and a proximal operator for the auxiliary variable updates. We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. Using convolutions instead of a generic linear preconditioner allows extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods. More importantly, the proposed architecture is easily adapted to learning both the preconditioner and the proximal operator using CNN embeddings. This yields a simple and efficient algorithm for non-blind image deblurring which is fully interpretable, can be learned end to end, and whose accuracy matches or exceeds the state of the art, quite significantly, in the non-uniform case.
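The preconditioned Richardson update mentioned in this abstract can be illustrated on a toy linear system. The sketch below is not the paper's method: it swaps the learned convolutional inverse filters for a plain diagonal (Jacobi) preconditioner and solves a 2x2 system instead of a deblurring problem. All names (`matvec`, `preconditioned_richardson`, `A`, `b`, `C`) are hypothetical and chosen for illustration only.

```python
def matvec(a, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def preconditioned_richardson(a, b, precond, iterations=50):
    """Iterate x <- x + C (b - A x), where C approximates the inverse of A."""
    x = [0.0] * len(b)
    for _ in range(iterations):
        residual = [b_i - ax_i for b_i, ax_i in zip(b, matvec(a, x))]
        correction = matvec(precond, residual)
        x = [x_i + c_i for x_i, c_i in zip(x, correction)]
    return x


# Toy system (assumption): a small diagonally dominant matrix.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
# Jacobi preconditioner (assumption): the inverse of the diagonal of A.
C = [[0.25, 0.0], [0.0, 1.0 / 3.0]]
solution = preconditioned_richardson(A, b, C)
```

The closer `C` is to the true inverse of `A`, the fewer iterations the update needs, which is the motivation for choosing a good preconditioner in the first place.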

1.2 · LG · Jun 16, 2020

Structured and Localized Image Restoration

Thomas Eboli, Alex Nowak-Vila, Jian Sun et al.

We present a novel approach to image restoration that leverages ideas from localized structured prediction and non-linear multi-task learning. We optimize a penalized energy function regularized by a sum of terms measuring the distance between patches to be restored and clean patches from an external database gathered beforehand. The resulting estimator comes with strong statistical guarantees leveraging local dependency properties of overlapping patches. We derive the corresponding algorithms for energies based on the mean-squared and Euclidean norm errors. Finally, we demonstrate the practical effectiveness of our model on different image restoration problems using standard benchmarks.

2.2 · RO · May 30, 2020

A real-time multi-constraints obstacle avoidance method using LiDAR

Wei Chen, Jian Sun, Weishuo Li et al.

Obstacle avoidance is an essential function for autonomous mobile robots. Most existing solutions are based on a single constraint and cannot incorporate sensor data in real time, so they often fail to respond to unexpected moving obstacles in dynamic, unknown environments. In this paper, a novel real-time multi-constraints obstacle avoidance method using Light Detection and Ranging (LiDAR) is proposed. Based on the latest estimate of the robot pose and environment, it finds the sub-goal defined by a multi-constraints function within the explored region and plans a corresponding optimal trajectory at each time step iteratively, so that the robot approaches the goal over time. Meanwhile, at each time step, the improved Ant Colony Optimization (ACO) algorithm is used to re-plan optimal paths from the latest robot pose to the latest defined sub-goal position. While ensuring convergence, planning in this method is done by repeated local optimizations, so that the latest sensor data from LiDAR and the derived environment information can be fully utilized at each step until the robot reaches the desired position. The method supports real-time operation and has modest memory and computation requirements, so it has great potential to benefit small, low-cost autonomous platforms. The method is evaluated against several existing technologies in both simulation and real-world experiments.

1.8 · CV · Aug 27, 2019

HRGE-Net: Hierarchical Relational Graph Embedding Network for Multi-view 3D Shape Recognition

Xin Wei, Ruixuan Yu, Jian Sun

The view-based approach, which recognizes a 3D shape through its projected 2D images, has achieved state-of-the-art performance for 3D shape recognition. One essential challenge for the view-based approach is how to aggregate the multi-view features extracted from 2D images into a global 3D shape descriptor. In this work, we propose a novel feature aggregation network by fully investigating the relations among views. We construct a relational graph with multi-view images as nodes, and design relational graph embedding by modeling pairwise and neighboring relations among views. By gradually coarsening the graph, we build a hierarchical relational graph embedding network (HRGE-Net) to aggregate the multi-view features into a global shape descriptor. Extensive experiments show that HRGE-Net achieves state-of-the-art performance for 3D shape classification and retrieval on benchmark datasets.

21.8 · CV · Apr 1, 2019 · Code

Perceive Where to Focus: Learning Visibility-aware Part-level Features for Partial Person Re-identification

Yifan Sun, Qin Xu, Yali Li et al.

This paper considers a realistic problem in the person re-identification (re-ID) task, i.e., partial re-ID. Under the partial re-ID scenario, the images may contain a partial observation of a pedestrian. If we directly compare a partial pedestrian image with a holistic one, the extreme spatial misalignment significantly compromises the discriminative ability of the learned representation. We propose a Visibility-aware Part Model (VPM), which learns to perceive the visibility of regions through self-supervision. The visibility awareness allows VPM to extract region-level features and compare two images with focus on their shared regions (which are visible in both images). VPM gains a two-fold benefit toward higher accuracy for partial re-ID. On the one hand, compared with learning a global feature, VPM learns region-level features and benefits from fine-grained information. On the other hand, with visibility awareness, VPM is capable of estimating the shared regions between two images and thus suppresses the spatial misalignment. Experimental results confirm that our method significantly improves the learned representation and the achieved accuracy is on par with the state of the art.

5.7 · LG · Nov 22, 2018

HyperAdam: A Learnable Task-Adaptive Adam for Network Training

Shipeng Wang, Jian Sun, Zongben Xu

Deep neural networks are traditionally trained using human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, the approach of learning to optimize network parameters has emerged as a promising research topic. However, these learned black-box optimizers sometimes do not fully utilize the experience in human-designed optimizers, and therefore have limited generalization ability. In this paper, a new optimizer, dubbed HyperAdam, is proposed that combines the idea of "learning to optimize" with the traditional Adam optimizer. Given a network for training, its parameter update in each iteration generated by HyperAdam is an adaptive combination of multiple updates generated by Adam with varying decay rates. The combination weights and decay rates in HyperAdam are adaptively learned depending on the task. HyperAdam is modeled as a recurrent neural network with AdamCell, WeightCell and StateCell. It is shown to achieve state-of-the-art results when training various networks, such as multilayer perceptrons, CNNs and LSTMs.
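The idea of combining several Adam updates that use different decay rates can be sketched for a single scalar parameter. This is only an illustration under strong assumptions: the decay rates and combination weights below are fixed by hand, whereas the abstract states that HyperAdam learns them with a recurrent network. The names `AdamMoment` and `combined_step` are hypothetical and do not come from the paper.

```python
import math


class AdamMoment:
    """Tracks Adam's bias-corrected moment estimates for one (beta1, beta2) pair."""

    def __init__(self, beta1, beta2):
        self.beta1, self.beta2 = beta1, beta2
        self.m = 0.0  # first-moment (mean) estimate
        self.v = 0.0  # second-moment (uncentered variance) estimate
        self.t = 0    # step counter used for bias correction

    def update(self, grad, eps=1e-8):
        """Return the Adam update direction for this gradient."""
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad * grad
        m_hat = self.m / (1.0 - self.beta1 ** self.t)
        v_hat = self.v / (1.0 - self.beta2 ** self.t)
        return m_hat / (math.sqrt(v_hat) + eps)


def combined_step(param, grad, moments, weights, lr=0.1):
    """Apply a normalized weighted combination of several Adam directions.

    Assumption: `weights` are fixed here; in HyperAdam they are learned.
    """
    updates = [moment.update(grad) for moment in moments]
    direction = sum(w * u for w, u in zip(weights, updates)) / sum(weights)
    return param - lr * direction


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
moments = [AdamMoment(0.9, 0.999), AdamMoment(0.5, 0.9)]
x = 5.0
for _ in range(3):
    x = combined_step(x, 2.0 * x, moments, weights=[0.7, 0.3])
```

Each `AdamMoment` keeps its own state, so candidates with faster decay react more quickly to recent gradients while slower ones smooth over a longer history.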

0.9 · CV · Oct 21, 2018

Learning Spectral Transform Network on 3D Surface for Non-rigid Shape Analysis

Ruixuan Yu, Jian Sun, Huibin Li

Designing a network on 3D surfaces for non-rigid shape analysis is a challenging task. In this work, we propose a novel spectral transform network on 3D surfaces to learn shape descriptors. The proposed network architecture consists of four stages: raw descriptor extraction, surface second-order pooling, mixture of power function-based spectral transform, and metric learning. The proposed network is simple and shallow. Quantitative experiments on challenging benchmarks show its effectiveness for non-rigid shape retrieval and classification, e.g., it achieved the highest accuracies on the SHREC14 and SHREC15 datasets as well as the Range subset of the SHREC17 dataset.

16.9 · CV · Sep 12, 2018

Unpaired Brain MR-to-CT Synthesis using a Structure-Constrained CycleGAN

Heran Yang, Jian Sun, Aaron Carass et al.

The cycleGAN is becoming an influential method in medical image synthesis. However, due to a lack of direct constraints between input and synthetic images, the cycleGAN cannot guarantee structural consistency between these two images, and such consistency is of extreme importance in medical imaging. To overcome this, we propose a structure-constrained cycleGAN for brain MR-to-CT synthesis using unpaired data that defines an extra structure-consistency loss based on the modality independent neighborhood descriptor to constrain structural consistency. Additionally, we use a position-based selection strategy for selecting training images instead of a completely random selection scheme. Experimental results on synthesizing CT images from brain MR images demonstrate that our method is better than the conventional cycleGAN and approximates the cycleGAN trained with paired data.

31.6 · CV · Nov 22, 2017

AlignedReID: Surpassing Human-Level Performance in Person Re-Identification

Xuan Zhang, Hao Luo, Xing Fan et al.

In this paper, we propose a novel method called AlignedReID that extracts a global feature which is jointly learned with local features. Global feature learning benefits greatly from local feature learning, which performs an alignment/matching by calculating the shortest path between two sets of local features, without requiring extra supervision. After the joint learning, we only keep the global feature to compute the similarities between images. Our method achieves rank-1 accuracy of 94.4% on Market1501 and 97.8% on CUHK03, outperforming state-of-the-art methods by a large margin. We also evaluate human-level performance and demonstrate that our method is the first to surpass human-level performance on Market1501 and CUHK03, two widely used Person ReID datasets.
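The shortest-path alignment between two sets of local features can be sketched with a small dynamic program. This is a minimal illustration, assuming plain Euclidean distances between part features and a path that moves only right or down through the distance matrix; any normalization of distances and the training procedure used in the paper are not reproduced. The function name `local_distance` is hypothetical.

```python
import math


def local_distance(feats_a, feats_b):
    """Total cost of the cheapest monotone path through the pairwise distance matrix.

    feats_a and feats_b are lists of local feature vectors (e.g. one per
    horizontal stripe). The path starts at (0, 0), ends at the last pair,
    and moves one step right or one step down at a time.
    """
    n, m = len(feats_a), len(feats_b)
    dist = [[math.dist(a, b) for b in feats_b] for a in feats_a]
    cost = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best_prev = 0.0
            elif i == 0:
                best_prev = cost[0][j - 1]
            elif j == 0:
                best_prev = cost[i - 1][0]
            else:
                best_prev = min(cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = best_prev + dist[i][j]
    return cost[n - 1][m - 1]


# Usage: three 2-D part features per image (toy values).
parts_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
parts_b = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
d = local_distance(parts_a, parts_b)
```

Because the path cannot move diagonally in this sketch, even two identical feature sets accumulate some off-diagonal cost; what matters for matching is that well-aligned pairs yield a smaller total than misaligned ones.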

34.3 · CV · Nov 20, 2017

MegDet: A Large Mini-Batch Object Detector

Chao Peng, Tete Xiao, Zeming Li et al.

The improvements in recent CNN-based object detection works, from R-CNN [11] and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24], mainly come from new networks, new frameworks, or novel loss designs. But mini-batch size, a key factor in training, has not been well studied. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a much larger mini-batch size than before (e.g., from 16 to 256), so that we can effectively utilize multiple GPUs (up to 128 in our experiments) to significantly shorten the training time. Technically, we suggest a learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours), and achieve even better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.

7.6 · CV · Sep 27, 2017

Neural Multi-Atlas Label Fusion: Application to Cardiac MR Images

Heran Yang, Jian Sun, Huibin Li et al.

Multi-atlas segmentation approach is one of the most widely-used image segmentation techniques in biomedical applications. There are two major challenges in this category of methods, i.e., atlas selection and label fusion. In this paper, we propose a novel multi-atlas segmentation method that formulates multi-atlas segmentation in a deep learning framework for better solving these challenges. The proposed method, dubbed deep fusion net (DFN), is a deep architecture that integrates a feature extraction subnet and a non-local patch-based label fusion (NL-PLF) subnet in a single network. The network parameters are learned by end-to-end training for automatically learning deep features that enable optimal performance in a NL-PLF framework. The learned deep features are further utilized in defining a similarity measure for atlas selection. By evaluating on two public cardiac MR datasets of SATA-13 and LV-09 for left ventricle segmentation, our approach achieved 0.833 in averaged Dice metric (ADM) on SATA-13 dataset and 0.95 in ADM for epicardium segmentation on LV-09 dataset, comparing favorably with the other automatic left ventricle segmentation methods. We also tested our approach on Cardiac Atlas Project (CAP) testing set of MICCAI 2013 SATA Segmentation Challenge, and our method achieved 0.815 in ADM, ranking highest at the time of writing.

52.0 · CV · Jul 19, 2017 · Code

Channel Pruning for Accelerating Very Deep Neural Networks

Yihui He, Xiangyu Zhang, Jian Sun

In this paper, we introduce a new channel pruning method to accelerate very deep convolutional neural networks. Given a trained CNN model, we propose an iterative two-step algorithm to effectively prune each layer, by a LASSO regression based channel selection and least square reconstruction. We further generalize this algorithm to multi-layer and multi-branch cases. Our method reduces the accumulated error and enhances the compatibility with various architectures. Our pruned VGG-16 achieves state-of-the-art results with a 5x speed-up and only a 0.3% increase in error. More importantly, our method is able to accelerate modern networks like ResNet and Xception, which suffer only 1.4% and 1.0% accuracy loss respectively under a 2x speed-up, which is significant. Code has been made publicly available.

16.1 · CV · May 19, 2017

ADMM-Net: A Deep Learning Approach for Compressive Sensing MRI

Yan Yang, Jian Sun, Huibin Li et al.

Compressive sensing (CS) is an effective approach for fast Magnetic Resonance Imaging (MRI). It aims at reconstructing MR images from a small number of under-sampled data in k-space, and accelerating the data acquisition in MRI. To improve the current MRI system in reconstruction accuracy and speed, in this paper, we propose two novel deep architectures, dubbed ADMM-Nets in basic and generalized versions. ADMM-Nets are defined over data flow graphs, which are derived from the iterative procedures in Alternating Direction Method of Multipliers (ADMM) algorithm for optimizing a general CS-based MRI model. They take the sampled k-space data as inputs and output reconstructed MR images. Moreover, we extend our network to cope with complex-valued MR images. In the training phase, all parameters of the nets, e.g., transforms, shrinkage functions, etc., are discriminatively trained end-to-end. In the testing phase, they have computational overhead similar to ADMM algorithm but use optimized parameters learned from the data for CS-based reconstruction task. We investigate different configurations in network structures and conduct extensive experiments on MR image reconstruction under different sampling rates. Due to the combination of the advantages in model-based approach and deep learning approach, the ADMM-Nets achieve state-of-the-art reconstruction accuracies with fast computational speed.

21.7CVJul 19, 2016

Supervised Transformer Network for Efficient Face Detection

Dong Chen, Gang Hua, Fang Wen et al.

Large pose variations remain a challenge for real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along with associated facial landmarks. The candidate regions are then warped by mapping the detected facial landmarks to their canonical positions to better normalize the face patterns. The second stage, which is an RCNN, then verifies whether the warped candidate regions are valid faces. We conduct end-to-end learning of the cascaded network, including optimizing the canonical positions of the facial landmarks. This supervised learning of the transformations automatically selects the best scale to differentiate face/non-face patterns. By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracies on several public benchmarks. For real-time performance, we run the cascaded network only on regions of interest produced by a boosting cascade face detector. Our detector runs at 30 FPS on a single CPU core for a VGA-resolution image.

44.7CVDec 14, 2015

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Jifeng Dai, Kaiming He, Jian Sun

Semantic segmentation research has recently witnessed rapid progress, but many leading methods are unable to identify object instances. In this paper, we present Multi-task Network Cascades for instance-aware semantic segmentation. Our model consists of three networks, respectively differentiating instances, estimating masks, and categorizing objects. These networks form a cascaded structure, and are designed to share their convolutional features. We develop an algorithm for the nontrivial end-to-end training of this causal, cascaded structure. Our solution is a clean, single-step training framework and can be generalized to cascades that have more stages. We demonstrate state-of-the-art instance-aware semantic segmentation accuracy on PASCAL VOC. Meanwhile, our method takes only 360 ms to test an image using VGG-16, which is two orders of magnitude faster than previous systems for this challenging problem. As a by-product, our method also achieves compelling object detection results which surpass the competitive Fast/Faster R-CNN systems. The method described in this paper is the foundation of our submissions to the MS COCO 2015 segmentation competition, where we won 1st place.

35.9CVMay 26, 2015

Accelerating Very Deep Convolutional Networks for Classification and Detection

Xiangyu Zhang, Jianhua Zou, Kaiming He et al.

This paper aims to accelerate the test-time computation of convolutional neural networks (CNNs), especially very deep CNNs that have substantially impacted the computer vision community. Unlike previous methods that are designed for approximating linear filters or linear responses, our method takes the nonlinear units into account. We develop an effective solution to the resulting nonlinear optimization problem without the need for stochastic gradient descent (SGD). More importantly, while previous methods mainly focus on optimizing one or two layers, our nonlinear method enables an asymmetric reconstruction that reduces the rapidly accumulated error when multiple (e.g., >=10) layers are approximated. For the widely used very deep VGG-16 model, our method achieves a whole-model speedup of 4x with merely a 0.3% increase in top-5 error in ImageNet classification. Our 4x accelerated VGG-16 model also shows a graceful accuracy degradation for object detection when plugged into the Fast R-CNN detector.

1.2NAAug 4, 2015

Point Integral Method for Solving Poisson-type Equations on Manifolds from Point Clouds with Convergence Guarantees

Zhen Li, Zuoqiang Shi, Jian Sun

Partial differential equations (PDEs) on manifolds arise in many areas, including mathematics and many applied fields. Among all kinds of PDEs, the Poisson-type equations, including the standard Poisson equation and the related eigenproblem of the Laplace-Beltrami operator, are among the most important. Due to the complicated geometrical structure of the manifold, it is difficult to obtain efficient numerical methods for solving PDEs on manifolds. In this paper, we propose a method called the point integral method (PIM) to solve Poisson-type equations from point clouds with convergence guarantees. In PIM, the key idea is to derive integral equations which approximate the Poisson-type equations and contain no derivatives but only the values of the unknown function. The latter makes the integral equations easy to approximate from point clouds. In the paper, we explain the derivation of the integral equations, describe the point integral method and its implementation, and present numerical experiments to demonstrate the convergence of PIM.

29.2CVApr 23, 2015

Object Detection Networks on Convolutional Feature Maps

Shaoqing Ren, Kaiming He, Ross Girshick et al.

Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep convolutional architectures. The object classifier, however, has not received much attention and many recent systems (like SPPnet and Fast/Faster R-CNN) use simple multi-layer perceptrons. This paper demonstrates that carefully designing deep networks for object classification is just as important. We experiment with region-wise classifier networks that use shared, region-independent convolutional features. We call them "Networks on Convolutional feature maps" (NoCs). We discover that aside from deep feature maps, a deep and convolutional per-region classifier is of particular importance for object detection, whereas latest superior image classification models (such as ResNets and GoogLeNets) do not directly lead to good detection accuracy without using such a per-region classifier. We show by experiments that despite the effective ResNets and Faster R-CNN systems, the design of NoCs is an essential element for the 1st-place winning entries in ImageNet and MS COCO challenges 2015.

35.9CVMar 2, 2015

Learning a Convolutional Neural Network for Non-uniform Motion Blur Removal

Jian Sun, Wenfei Cao, Zongben Xu et al.

In this paper, we address the problem of estimating and removing non-uniform motion blur from a single blurry image. We propose a deep learning approach to predicting the probabilistic distribution of motion blur at the patch level using a convolutional neural network (CNN). We further extend the candidate set of motion kernels predicted by the CNN using carefully designed image rotations. A Markov random field model is then used to infer a dense non-uniform motion blur field enforcing motion smoothness. Finally, motion blur is removed by a non-uniform deblurring model using patch-level image prior. Experimental evaluations show that our approach can effectively estimate and remove complex non-uniform motion blur that is not handled well by previous approaches.

60.1CVJun 18, 2014

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren et al.

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
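The fixed-length property described in this abstract can be shown with a minimal sketch of spatial pyramid max-pooling on a single-channel feature map. The pyramid levels `(1, 2, 4)`, the floor/ceil bin boundaries, and the function name are assumptions made for illustration, not the configuration reported in the paper.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into a vector of length sum(n * n for n in levels)."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end, so the n bins cover every row.
            r0, r1 = (i * height) // n, math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = (j * width) // n, math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two feature maps of different sizes (toy data) give vectors of the same length.
small = [[r * 7 + c for c in range(7)] for r in range(5)]
large = [[r * 9 + c for c in range(9)] for r in range(13)]
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Both vectors have 1 + 4 + 16 = 21 entries, so a fully connected layer placed after the pooling sees the same input size regardless of the image size. With multiple channels, the same pooling would be applied per channel and the results concatenated.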

2.3CGMay 6, 2013

Gromov-Hausdorff Approximation of Metric Spaces with Linear Structure

Frédéric Chazal, Jian Sun

In many real-world applications, data come as discrete metric spaces sampled around 1-dimensional filamentary structures that can be seen as metric graphs. In this paper, we address the metric reconstruction problem of such filamentary structures from data sampled around them. We prove that they can be approximated, with respect to the Gromov-Hausdorff distance, by well-chosen Reeb graphs (and some of their variants), and we provide an efficient and easy-to-implement algorithm to compute such approximations in almost linear time. We illustrate the performance of our algorithm on a few synthetic and real data sets.