Yang He

CV
h-index72
48papers
3,536citations
Novelty51%
AI Score64

48 Papers

CVMar 1, 2023Code
Structured Pruning for Deep Convolutional Neural Networks: A survey

Yang He, Lingao Xiao

The remarkable performance of deep Convolutional neural networks (CNNs) is generally attributed to their deeper and wider architectures, which can come with significant computational costs. Pruning neural networks has thus gained interest since it effectively lowers storage and computational costs. In contrast to weight pruning, which results in unstructured models, structured pruning provides the benefit of realistic acceleration by producing models that are friendly to hardware implementation. The special requirements of structured pruning have led to the discovery of numerous new challenges and the development of innovative solutions. This article surveys the recent progress towards structured pruning of deep CNNs. We summarize and compare the state-of-the-art structured pruning techniques with respect to filter ranking methods, regularization methods, dynamic execution, neural architecture search, the lottery ticket hypothesis, and the applications of pruning. While discussing structured pruning algorithms, we briefly introduce the unstructured pruning counterpart to emphasize their differences. Furthermore, we provide insights into potential research opportunities in the field of structured pruning. A curated list of neural network pruning papers can be found at https://github.com/he-y/Awesome-Pruning . A dedicated website offering a more interactive comparison of structured pruning methods can be found at: https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey .

CVDec 15, 2022
Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation

Anurag Das, Yongqin Xian, Yang He et al.

For best performance, today's semantic segmentation methods use large and carefully labeled datasets, requiring expensive annotation budgets. In this work, we show that coarse annotation is a low-cost but highly effective alternative for training semantic segmentation models. Considering the urban scene segmentation scenario, we leverage cheap coarse annotations for real-world captured data, as well as synthetic data to train our model and show competitive performance compared with finely annotated real-world data. Specifically, we propose a coarse-to-fine self-training framework that generates pseudo labels for unlabeled regions of the coarsely annotated data, using synthetic data to improve predictions around the boundaries between semantic classes, and using cross-domain data augmentation to increase diversity. Our extensive experimental results on Cityscapes and BDD100k datasets demonstrate that our method achieves a significantly better performance vs annotation cost tradeoff, yielding a comparable performance to fully annotated data with only a small fraction of the annotation budget. Also, when used as pretraining, our framework performs better compared to the standard fully supervised setting.

DBMar 4Code
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

Rocky Klopfenstein, Yang He, Andrew Tremante et al.

We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.

LGSep 28, 2022Code
A Parameter-free Nonconvex Low-rank Tensor Completion Model for Spatiotemporal Traffic Data Recovery

Yang He, Yuheng Jia, Liyang Hu et al.

Traffic data chronically suffer from missing and corruption, leading to accuracy and utility reduction in subsequent Intelligent Transportation System (ITS) applications. Noticing the inherent low-rank property of traffic data, numerous studies formulated missing traffic data recovery as a low-rank tensor completion (LRTC) problem. Due to the non-convexity and discreteness of the rank minimization in LRTC, existing methods either replaced rank with convex surrogates that are quite far away from the rank function or approximated rank with nonconvex surrogates involving many parameters. In this study, we proposed a Parameter-Free Non-Convex Tensor Completion model (TC-PFNC) for traffic data recovery, in which a log-based relaxation term was designed to approximate tensor algebraic rank. Moreover, previous studies usually assumed the observations are reliable without any outliers. Therefore, we extended the TC-PFNC to a robust version (RTC-PFNC) by modeling potential traffic data outliers, which can recover the missing value from partial and corrupted observations and remove the anomalies in observations. The numerical solutions of TC-PFNC and RTC-PFNC were elaborated based on the alternating direction multiplier method (ADMM). The extensive experimental results conducted on four real-world traffic data sets demonstrated that the proposed methods outperform other state-of-the-art methods in both missing and corrupted data recovery. The code used in this paper is available at: https://github.com/YoungHe49/T-ITSPFNC.

CVOct 21, 2023Code
You Only Condense Once: Two Rules for Pruning Condensed Datasets

Yang He, Lingao Xiao, Joey Tianyi Zhou

Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size of the training dataset, particularly in on-device scenarios. However, these scenarios have two significant challenges: 1) the varying computational resources available on the devices require a dataset size different from the pre-defined condensed dataset, and 2) the limited computational resources often preclude the possibility of conducting additional condensation processes. We introduce You Only Condense Once (YOCO) to overcome these limitations. On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules: Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10, CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class (IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The code is available at: https://github.com/he-y/you-only-condense-once.

AIMay 28
DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

Yang He, Xiao Ding, Bibo Cai et al.

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

CVApr 20Code
Soft Label Pruning and Quantization for Large-Scale Dataset Distillation

Xiao Lingao, Yang He

Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.

CVFeb 24Code
Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

Chenyue Yu, Lingao Xiao, Jinhong Deng et al.

Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at \href{https://github.com/he-y/Dataset-Color-Quantization}{https://github.com/he-y/Dataset-Color-Quantization}.

LGSep 6, 2023
Towards Unsupervised Graph Completion Learning on Graphs with Features and Structure Missing

Sichao Fu, Qinmu Peng, Yang He et al.

In recent years, graph neural networks (GNN) have achieved significant developments in a variety of graph analytical tasks. Nevertheless, GNN's superior performance will suffer from serious damage when the collected node features or structure relationships are partially missing owning to numerous unpredictable factors. Recently emerged graph completion learning (GCL) has received increasing attention, which aims to reconstruct the missing node features or structure relationships under the guidance of a specifically supervised task. Although these proposed GCL methods have made great success, they still exist the following problems: the reliance on labels, the bias of the reconstructed node features and structure relationships. Besides, the generalization ability of the existing GCL still faces a huge challenge when both collected node features and structure relationships are partially missing at the same time. To solve the above issues, we propose a more general GCL framework with the aid of self-supervised learning for improving the task performance of the existing GNN variants on graphs with features and structure missing, termed unsupervised GCL (UGCL). Specifically, to avoid the mismatch between missing node features and structure during the message-passing process of GNN, we separate the feature reconstruction and structure reconstruction and design its personalized model in turn. Then, a dual contrastive loss on the structure level and feature level is introduced to maximize the mutual information of node representations from feature reconstructing and structure reconstructing paths for providing more supervision signals. Finally, the reconstructed node features and structure can be applied to the downstream node classification task. Extensive experiments on eight datasets, three GNN variants and five missing rates demonstrate the effectiveness of our proposed method.

DBOct 30, 2025
SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

Rocky Klopfenstein, Yang He, Andrew Tremante et al.

Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.

SEDec 30, 2023Code
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

Yao Wan, Yang He, Zhangqian Bi et al.

Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools to improve the quality and productivity of computer programming. Currently, there is already a thriving research community focusing on code intelligence, with efforts ranging from software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, from the aspects of code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we inspect the existing code intelligence models under the basis of code representation learning, and provide a comprehensive overview to enhance comprehension of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models (https://xcodemind.github.io). At last, we also point out several challenging and promising directions for future research.

CLMay 4Code
FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

Quang Hieu Pham, Yang He, Ping Nie et al.

Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4\% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10\% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: https://github.com/StringNLPLAB/FlexSQL

LGOct 9, 2023
Temporal Convolutional Explorer Helps Understand 1D-CNN's Learning Behavior in Time Series Classification from Frequency Domain

Junru Zhang, Lang Feng, Yang He et al.

While one-dimensional convolutional neural networks (1D-CNNs) have been empirically proven effective in time series classification tasks, we find that there remain undesirable outcomes that could arise in their application, motivating us to further investigate and understand their underlying mechanisms. In this work, we propose a Temporal Convolutional Explorer (TCE) to empirically explore the learning behavior of 1D-CNNs from the perspective of the frequency domain. Our TCE analysis highlights that deeper 1D-CNNs tend to distract the focus from the low-frequency components leading to the accuracy degradation phenomenon, and the disturbing convolution is the driving factor. Then, we leverage our findings to the practical application and propose a regulatory framework, which can easily be integrated into existing 1D-CNNs. It aims to rectify the suboptimal learning behavior by enabling the network to selectively bypass the specified disturbing convolutions. Finally, through comprehensive experiments on widely-used UCR, UEA, and UCI benchmarks, we demonstrate that 1) TCE's insight into 1D-CNN's learning behavior; 2) our regulatory framework enables state-of-the-art 1D-CNNs to get improved performances with less consumption of memory and computational overhead.

CVMar 10, 2024Code
Multisize Dataset Condensation

Yang He, Lingao Xiao, Joey Tianyi Zhou et al.

While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class. Code is available at: https://github.com/he-y/Multisize-Dataset-Condensation.

CVOct 21, 2024Code
Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?

Lingao Xiao, Yang He

In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain. Code is available at: https://github.com/he-y/soft-label-pruning-for-dataset-distillation

CVApr 21, 2024Code
Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Yang He, Joey Tianyi Zhou

Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning

CVOct 27, 2024Code
UTSRMorph: A Unified Transformer and Superresolution Network for Unsupervised Medical Image Registration

Runshi Zhang, Hao Mo, Junchen Wang et al.

Complicated image registration is a key issue in medical image analysis, and deep learning-based methods have achieved better results than traditional methods. The methods include ConvNet-based and Transformer-based methods. Although ConvNets can effectively utilize local information to reduce redundancy via small neighborhood convolution, the limited receptive field results in the inability to capture global dependencies. Transformers can establish long-distance dependencies via a self-attention mechanism; however, the intense calculation of the relationships among all tokens leads to high redundancy. We propose a novel unsupervised image registration method named the unified Transformer and superresolution (UTSRMorph) network, which can enhance feature representation learning in the encoder and generate detailed displacement fields in the decoder to overcome these problems. We first propose a fusion attention block to integrate the advantages of ConvNets and Transformers, which inserts a ConvNet-based channel attention module into a multihead self-attention module. The overlapping attention block, a novel cross-attention method, uses overlapping windows to obtain abundant correlations with match information of a pair of images. Then, the blocks are flexibly stacked into a new powerful encoder. The decoder generation process of a high-resolution deformation displacement field from low-resolution features is considered as a superresolution process. Specifically, the superresolution module was employed to replace interpolation upsampling, which can overcome feature degradation. UTSRMorph was compared to state-of-the-art registration methods in the 3D brain MR (OASIS, IXI) and MR-CT datasets. The qualitative and quantitative results indicate that UTSRMorph achieves relatively better performance. The code and datasets are publicly available at https://github.com/Runshi-Zhang/UTSRMorph.

CVFeb 10, 2025Code
Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

Lingao Xiao, Songhua Liu, Yang He et al.

Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression

CVMar 2, 2025Code
Training-Free Dataset Pruning for Instance Segmentation

Yalun Dai, Lingao Xiao, Ivor W. Tsang et al.

Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel-level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification-based pruning methods proves ineffective due to their reliance on time-consuming model training process. To address this, we propose a novel Training-Free Dataset Pruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into a Scale-Invariant (SI-SCS) and Class-Balanced (CB-SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state-of-the-art results on VOC 2012, Cityscapes, and COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of 1349$\times$ on COCO compared to the adapted baselines. Source code is available at: https://github.com/he-y/dataset-pruning-for-instance-segmentation

CLJan 5, 2025Code
Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding

Binh-Nguyen Nguyen, Yang He

Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning

CVAug 20, 2025Code
TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network

Runshi Zhang, Bimeng Jie, Yang He et al.

Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs.Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.

CVOct 28, 2025Code
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs

Jinhong Deng, Wen Li, Joey Tianyi Zhou et al.

Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called \textbf{S}aliency-\textbf{C}overage \textbf{O}riented token \textbf{P}runing for \textbf{E}fficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at \href{https://github.com/kinredon/SCOPE}{https://github.com/kinredon/SCOPE}.

CVApr 15, 2025Code
Explicit and Implicit Representations in AI-based 3D Reconstruction for Radiology: A Systematic Review

Yuezhe Yang, Boyu Yang, Yaqian Wang et al.

The demand for high-quality medical imaging in clinical practice and assisted diagnosis has made 3D reconstruction in radiological imaging a key research focus. Artificial intelligence (AI) has emerged as a promising approach to enhancing reconstruction accuracy while reducing acquisition and processing time, thereby minimizing patient radiation exposure and discomfort and ultimately benefiting clinical diagnosis. This review explores state-of-the-art AI-based 3D reconstruction algorithms in radiological imaging, categorizing them into explicit and implicit approaches based on their underlying principles. Explicit methods include point-based, volume-based, and Gaussian representations, while implicit methods encompass implicit prior embedding and neural radiance fields. Additionally, we examine commonly used evaluation metrics and benchmark datasets. Finally, we discuss the current state of development, key challenges, and future research directions in this evolving field. Our project available on: https://github.com/Bean-Young/AI4Radiology.

LGOct 11, 2021Code
ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training

Hui-Po Wang, Sebastian U. Stich, Yang He et al.

Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes the constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and U-nets, and diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to $20\%$ computation and up to $63\%$ communication costs for converged models. As our approach is also complimentary to prior work on compression, we can achieve a wide range of trade-offs by combining these techniques, showing reduced communication of up to $50\times$ at only $0.1\%$ loss in utility. Code is available at https://github.com/hui-po-wang/ProgFed.

CVMay 29, 2021Code
Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis

Yang He, Ning Yu, Margret Keuper et al.

The rapid advances in deep generative models over the past years have led to highly {realistic media, known as deepfakes,} that are commonly indistinguishable from real to human eyes. These advances make assessing the authenticity of visual data increasingly difficult and pose a misinformation threat to the trustworthiness of visual content in general. Although recent work has shown strong detection accuracy of such deepfakes, the success largely relies on identifying frequency artifacts in the generated images, which will not yield a sustainable detection approach as generative models continue evolving and closing the gap to real images. In order to overcome this issue, we propose a novel fake detection that is designed to re-synthesize testing images and extract visual cues for detection. The re-synthesis procedure is flexible, allowing us to incorporate a series of visual tasks - we adopt super-resolution, denoising and colorization as the re-synthesis. We demonstrate the improved effectiveness, cross-GAN generalization, and robustness against perturbations of our approach in a variety of detection scenarios involving multiple generators over CelebA-HQ, FFHQ, and LSUN datasets. Source code is available at https://github.com/SSAW14/BeyondtheSpectrum.

CVNov 1, 2018Code
Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration

Yang He, Ping Liu, Ziwei Wang et al.

Previous works utilized ''smaller-norm-less-important'' criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with ''relatively less'' importance. When applied to two image classification benchmarks, our method validates its usefulness and strengths. Notably, on CIFAR-10, FPGM reduces more than 52% FLOPs on ResNet-110 with even 2.69% relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than 42% FLOPs on ResNet-101 without top-5 accuracy drop, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/filter-pruning-geometric-median

CVAug 21, 2018Code
Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks

Yang He, Guoliang Kang, Xuanyi Dong et al.

This paper proposed a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pre-trained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSCRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even 0.2% top-5 accuracy improvement, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/soft-filter-pruning

CLMay 27, 2025
Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

Yang He, Xiao Ding, Bibo Cai et al.

While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55\% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.

LGNov 12, 2024
Bayesian Deep Learning Approach for Real-time Lane-based Arrival Curve Reconstruction at Intersection using License Plate Recognition Data

Yang He, Chengchuan An, Jiawei Lu et al.

The acquisition of real-time and accurate traffic arrival information is of vital importance for proactive traffic control systems, especially in partially connected vehicle environments. License plate recognition (LPR) data that record both vehicle departures and identities are proven to be desirable in reconstructing lane-based arrival curves in previous works. Existing LPR databased methods are predominantly designed for reconstructing historical arrival curves. For real-time reconstruction of multi-lane urban roads, it is pivotal to determine the lane choice of real-time link-based arrivals, which has not been exploited in previous studies. In this study, we propose a Bayesian deep learning approach for real-time lane-based arrival curve reconstruction, in which the lane choice patterns and uncertainties of link-based arrivals are both characterized. Specifically, the learning process is designed to effectively capture the relationship between partially observed link-based arrivals and lane-based arrivals, which can be physically interpreted as lane choice proportion. Moreover, the lane choice uncertainties are characterized using Bayesian parameter inference techniques, minimizing arrival curve reconstruction uncertainties, especially in low LPR data matching rate conditions. Real-world experiment results conducted in multiple matching rate scenarios demonstrate the superiority and necessity of lane choice modeling in reconstructing arrival curves.

AIDec 21, 2024
Knowledge as a Breaking of Ergodicity

Yang He, Vassiliy Lubchenko

We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally-manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The non-represented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus it is necessary for proper functionality. It may however have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models -- up to one model per free energy minimum.

SYNov 6, 2024
Efficient and Robust Freeway Traffic Speed Estimation under Oblique Grid using Vehicle Trajectory Data

Yang He, Chengchuan An, Yuheng Jia et al.

Accurately estimating spatiotemporal traffic states on freeways is a significant challenge due to limited sensor deployment and potential data corruption. In this study, we propose an efficient and robust low-rank model for precise spatiotemporal traffic speed state estimation (TSE) using lowpenetration vehicle trajectory data. Leveraging traffic wave priors, an oblique grid-based matrix is first designed to transform the inherent dependencies of spatiotemporal traffic states into the algebraic low-rankness of a matrix. Then, with the enhanced traffic state low-rankness in the oblique matrix, a low-rank matrix completion method is tailored to explicitly capture spatiotemporal traffic propagation characteristics and precisely reconstruct traffic states. In addition, an anomaly-tolerant module based on a sparse matrix is developed to accommodate corrupted data input and thereby improve the TSE model robustness. Notably, driven by the understanding of traffic waves, the computational complexity of the proposed efficient method is only correlated with the problem size itself, not with dataset size and hyperparameter selection prevalent in existing studies. Extensive experiments demonstrate the effectiveness, robustness, and efficiency of the proposed model. The performance of the proposed method achieves up to a 12% improvement in Root Mean Squared Error (RMSE) in the TSE scenarios and an 18% improvement in RMSE in the robust TSE scenarios, and it runs more than 20 times faster than the state-of-the-art (SOTA) methods.

LGJun 7, 2024
Diverse Intra- and Inter-Domain Activity Style Fusion for Cross-Person Generalization in Activity Recognition

Junru Zhang, Lang Feng, Zhidan Liu et al.

Existing domain generalization (DG) methods for cross-person generalization tasks often face challenges in capturing intra- and inter-domain style diversity, resulting in domain gaps with the target domain. In this study, we explore a novel perspective to tackle this problem, a process conceptualized as domain padding. This proposal aims to enrich the domain diversity by synthesizing intra- and inter-domain style data while maintaining robustness to class labels. We instantiate this concept using a conditional diffusion model and introduce a style-fused sampling strategy to enhance data generation diversity. In contrast to traditional condition-guided sampling, our style-fused sampling strategy allows for the flexible use of one or more random styles to guide data synthesis. This feature presents a notable advancement: it allows for the maximum utilization of possible permutations and combinations among existing styles to generate a broad spectrum of new style instances. Empirical evaluations on a broad range of datasets demonstrate that our generated data achieves remarkable diversity within the domain space. Both intra- and inter-domain generated data have proven to be significant and valuable, contributing to varying degrees of performance enhancements. Notably, our approach outperforms state-of-the-art DG methods in all human activity recognition tasks.

IRJan 17, 2022
Alleviating Cold-start Problem in CTR Prediction with A Variational Embedding Learning Framework

Xiaoxiao Xu, Chen Yang, Qian Yu et al.

We propose a general Variational Embedding Learning Framework (VELF) for alleviating the severe cold-start problem in CTR prediction. VELF addresses the cold start problem via alleviating over-fits caused by data-sparsity in two ways: learning probabilistic embedding, and incorporating trainable and regularized priors which utilize the rich side information of cold start users and advertisements (Ads). The two techniques are naturally integrated into a variational inference framework, forming an end-to-end training process. Abundant empirical tests on benchmark datasets well demonstrate the advantages of our proposed VELF. Besides, extended experiments confirmed that our parameterized and regularized priors provide more generalization capability than traditional fixed priors.

CVJun 20, 2021
Automated Deepfake Detection

Ping Liu, Yuewei Lin, Yang He et al.

In this paper, we propose to utilize Automated Machine Learning to adaptively search a neural architecture for deepfake detection. This is the first time to employ automated machine learning for deepfake detection. Based on our explored search space, our proposed method achieves competitive prediction accuracy compared to previous methods. To improve the generalizability of our method, especially when training data and testing data are manipulated by different methods, we propose a simple yet effective strategy in our network learning process: making it to estimate potential manipulation regions besides predicting the real/fake labels. Unlike previous works manually design neural networks, our method can relieve us from the high labor cost in network construction. More than that, compared to previous works, our method depends much less on prior knowledge, e.g., which manipulation method is utilized or where exactly the fake image is manipulated. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method for deepfake detection.

LGDec 15, 2020
CosSGD: Communication-Efficient Federated Learning with a Simple Cosine-Based Quantization

Yang He, Hui-Po Wang, Maximilian Zenk et al.

Federated learning is a promising framework to mitigate data privacy and computation concerns. However, the communication cost between the server and clients has become the major bottleneck for successful deployment. Despite notable progress in gradient compression, the existing quantization methods require further improvement when low-bits compression is applied, especially the overall systems often degenerate a lot when quantization are applied in double directions to compress model weights and gradients. In this work, we propose a simple cosine-based nonlinear quantization and achieve impressive results in compressing round-trip communication costs. We are not only able to compress model weights and gradients at higher ratios than previous methods, but also achieve competing model performance at the same time. Further, our approach is highly suitable for federated learning problems since it has low computational complexity and requires only a little additional data to recover the compressed information. Extensive experiments have been conducted on image classification and brain tumor semantic segmentation using the CIFAR-10, and BraTS datasets where we show state-of-the-art effectiveness and impressive communication efficiency.

SEDec 6, 2020
NaturalCC: A Toolkit to Naturalize the Source Code Corpus

Yao Wan, Yang He, Jian-Guo Zhang et al.

We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming language communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model's performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s.

CVSep 18, 2020
Synthetic Convolutional Features for Improved Semantic Segmentation

Yang He, Bernt Schiele, Mario Fritz

Recently, learning-based image synthesis has enabled to generate high-resolution images, either applying popular adversarial training or a powerful perceptual loss. However, it remains challenging to successfully leverage synthetic data for improving semantic segmentation with additional synthetic images. Therefore, we suggest to generate intermediate convolutional features and propose the first synthesis approach that is catered to such intermediate convolutional features. This allows us to generate new features from label masks and include them successfully into the training procedure in order to improve the performance of semantic segmentation. Experimental results and analysis on two challenging datasets Cityscapes and ADE20K show that our generated feature improves performance on segmentation tasks.

CVJan 24, 2020
Progressive Local Filter Pruning for Image Retrieval Acceleration

Xiaodong Wang, Zhedong Zheng, Yang He et al.

This paper focuses on network pruning for image retrieval acceleration. Prevailing image retrieval works target at the discriminative feature learning, while little attention is paid to how to accelerate the model inference, which should be taken into consideration in real-world practice. The challenge of pruning image retrieval models is that the middle-level feature should be preserved as much as possible. Such different requirements of the retrieval and classification model make the traditional pruning methods not that suitable for our task. To solve the problem, we propose a new Progressive Local Filter Pruning (PLFP) method for image retrieval acceleration. Specifically, layer by layer, we analyze the local geometric properties of each filter and select the one that can be replaced by the neighbors. Then we progressively prune the filter by gradually changing the filter weights. In this way, the representation ability of the model is preserved. To verify this, we evaluate our method on two widely-used image retrieval datasets,i.e., Oxford5k and Paris6K, and one person re-identification dataset,i.e., Market-1501. The proposed method arrives with superior performance to the conventional pruning methods, suggesting the effectiveness of the proposed method for image retrieval.

CVDec 20, 2019
Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation

Yang He, Shadi Rahimian, Bernt Schiele et al.

Today's success of state of the art methods for semantic segmentation is driven by large datasets. Data is considered an important asset that needs to be protected, as the collection and annotation of such datasets comes at significant efforts and associated costs. In addition, visual data might contain private or sensitive information, that makes it equally unsuited for public release. Unfortunately, recent work on membership inference in the broader area of adversarial machine learning and inference attacks on machine learning models has shown that even black box classifiers leak information on the dataset that they were trained on. We show that such membership inference attacks can be successfully carried out on complex, state of the art models for semantic segmentation. In order to mitigate the associated risks, we also study a series of defenses against such membership inference attacks and find effective counter measures against the existing risks with little effect on the utility of the segmentation method. Finally, we extensively evaluate our attacks and defenses on a range of relevant real-world datasets: Cityscapes, BDD100K, and Mapillary Vistas.

LGNov 3, 2019
Regularized Adversarial Sampling and Deep Time-aware Attention for Click-Through Rate Prediction

Yikai Wang, Liang Zhang, Quanyu Dai et al.

Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR prediction models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests. As existing CTR prediction works neglect the importance of the temporal signals when embed users' historical clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing the users' periodic behaviors and relative temporal signals for expressing the temporal relation between items. Besides, we propose a regularized adversarial sampling strategy for negative sampling which eases the classification imbalance of CTR data and can make use of the strong guidance provided by the observed negative CTR samples. The adversarial sampling strategy significantly improves the training efficiency, and can be co-trained with the time-aware attention model seamlessly. Experiments are conducted on real-world CTR datasets from both in-station and out-station advertising places.

CVApr 8, 2019
Filter Pruning by Switching to Neighboring CNNs with Good Attributes

Yang He, Ping Liu, Linchao Zhu et al.

Filter pruning is effective to reduce the computational costs of neural networks. Existing methods show that updating the previous pruned filter would enable large model capacity and achieve better performance. However, during the iterative pruning process, even if the network weights are updated to new values, the pruning criterion remains the same. In addition, when evaluating the filter importance, only the magnitude information of the filters is considered. However, in neural networks, filters do not work individually, but they would affect other filters. As a result, the magnitude information of each filter, which merely reflects the information of an individual filter itself, is not enough to judge the filter importance. To solve the above problems, we propose Meta-attribute-based Filter Pruning (MFP). First, to expand the existing magnitude information based pruning criteria, we introduce a new set of criteria to consider the geometric distance of filters. Additionally, to explicitly assess the current state of the network, we adaptively select the most suitable criteria for pruning via a meta-attribute, a property of the neural network at the current state. Experiments on two image classification benchmarks validate our method. For ResNet-50 on ILSVRC-2012, we could reduce more than 50% FLOPs with only 0.44% top-5 accuracy loss.

LGDec 5, 2018
Stochastic Model Pruning via Weight Dropping Away and Back

Haipeng Jia, Xueshuang Xiang, Da Fan et al.

Deep neural networks have dramatically achieved great success on a variety of challenging tasks. However, most successful DNNs have an extremely complex structure, leading to extensive research on model compression.As a significant area of progress in model compression, traditional gradual pruning approaches involve an iterative prune-retrain procedure and may suffer from two critical issues: local importance judgment, where the pruned weights are merely unimportant in the current model; and an irretrievable pruning process, where the pruned weights have no chance to come back. Addressing these two issues, this paper proposes the Drop Pruning approach, which leverages stochastic optimization in the pruning process by introducing a drop strategy at each pruning step, namely, drop away, which stochastically deletes some unimportant weights, and drop back, which stochastically recovers some pruned weights. The suitable choice of drop probabilities decreases the model size during the pruning process and helps it flow to the target sparsity. Compared to the Bayesian approaches that stochastically train a compact model for pruning, we directly aim at stochastic gradual pruning. We provide a detailed analysis showing that the drop away and drop back approaches have individual contributions. Moreover, Drop Pruning can achieve competitive compression performance and accuracy on many benchmark tasks compared with state-of-the-art weights pruning and Bayesian training approaches.

CVAug 22, 2018
Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks

Yang He, Xuanyi Dong, Guoliang Kang et al.

Deeper and wider Convolutional Neural Networks (CNNs) achieve superior performance but bring expensive computation cost. Accelerating such over-parameterized neural network has received increased attention. A typical pruning algorithm is a three-stage pipeline, i.e., training, pruning, and retraining. Prevailing approaches fix the pruned filters to zero during retraining, and thus significantly reduce the optimization space. Besides, they directly prune a large number of filters at first, which would cause unrecoverable information loss. To solve these problems, we propose an Asymptotic Soft Filter Pruning (ASFP) method to accelerate the inference procedure of the deep neural networks. First, we update the pruned filters during the retraining stage. As a result, the optimization space of the pruned model would not be reduced but be the same as that of the original model. In this way, the model has enough capacity to learn from the training data. Second, we prune the network asymptotically. We prune few filters at first and asymptotically prune more filters during the training procedure. With asymptotic pruning, the information of the training set would be gradually concentrated in the remaining filters, so the subsequent training and pruning process would be stable. Experiments show the effectiveness of our ASFP on image classification benchmarks. Notably, on ILSVRC-2012, our ASFP reduces more than 40% FLOPs on ResNet-50 with only 0.14% top-5 accuracy degradation, which is higher than the soft filter pruning (SFP) by 8%.

CVAug 3, 2018
Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes

Yang He, Bernt Schiele, Mario Fritz

Recent advances in Deep Learning and probabilistic modeling have led to strong improvements in generative models for images. On the one hand, Generative Adversarial Networks (GANs) have contributed a highly effective adversarial learning procedure, but still suffer from stability issues. On the other hand, Conditional Variational Auto-Encoders (CVAE) models provide a sound way of conditional modeling but suffer from mode-mixing issues. Therefore, recent work has turned back to simple and stable regression models that are effective at generation but give up on the sampling mechanism and the latent code representation. We propose a novel and efficient stochastic regression approach with latent drop-out codes that combines the merits of both lines of research. In addition, a new training objective increases coverage of the training distribution leading to improvements over the state of the art in terms of accuracy as well as diversity.

CVSep 6, 2017
Learning Dilation Factors for Semantic Segmentation of Street Scenes

Yang He, Margret Keuper, Bernt Schiele et al.

Contextual information is crucial for semantic segmentation. However, finding the optimal trade-off between keeping desired fine details and at the same time providing sufficiently large receptive fields is non trivial. This is even more so, when objects or classes present in an image significantly vary in size. Dilated convolutions have proven valuable for semantic segmentation, because they allow to increase the size of the receptive field without sacrificing image resolution. However, in current state-of-the-art methods, dilation parameters are hand-tuned and fixed. In this paper, we present an approach for learning dilation parameters adaptively per channel, consistently improving semantic segmentation results on street-scene datasets like Cityscapes and Camvid.

IRSep 1, 2017
Telepath: Understanding Users from a Human Vision Perspective in Large-Scale Recommender Systems

Yu Wang, Jixing Xu, Aohan Wu et al.

Designing an e-commerce recommender system that serves hundreds of millions of active users is a daunting challenge. From a human vision perspective, there're two key factors that affect users' behaviors: items' attractiveness and their matching degree with users' interests. This paper proposes Telepath, a vision-based bionic recommender system model, which understands users from such perspective. Telepath is a combination of a convolutional neural network (CNN), a recurrent neural network (RNN) and deep neural networks (DNNs). Its CNN subnetwork simulates the human vision system to extract key visual signals of items' attractiveness and generate corresponding activations. Its RNN and DNN subnetworks simulate cerebral cortex to understand users' interest based on the activations generated from browsed items. In practice, the Telepath model has been launched to JD's recommender system and advertising system. For one of the major item recommendation blocks on the JD app, click-through rate (CTR), gross merchandise value (GMV) and orders have increased 1.59%, 8.16% and 8.71% respectively. For several major ads publishers of JD demand-side platform, CTR, GMV and return on investment have increased 6.58%, 61.72% and 65.57% respectively by the first launch, and further increased 2.95%, 41.75% and 41.37% respectively by the second launch.

LGAug 18, 2017
LADDER: A Human-Level Bidding Agent for Large-Scale Real-Time Online Auctions

Yu Wang, Jiayi Liu, Yuxiang Liu et al.

We present LADDER, the first deep reinforcement learning agent that can successfully learn control policies for large-scale real-world problems directly from raw inputs composed of high-level semantic information. The agent is based on an asynchronous stochastic variant of DQN (Deep Q Network) named DASQN. The inputs of the agent are plain-text descriptions of states of a game of incomplete information, i.e. real-time large scale online auctions, and the rewards are auction profits of very large scale. We apply the agent to an essential portion of JD's online RTB (real-time bidding) advertising business and find that it easily beats the former state-of-the-art bidding policy that had been carefully engineered and calibrated by human experts: during JD.com's June 18th anniversary sale, the agent increased the company's ads revenue from the portion by more than 50%, while the advertisers' ROI (return on investment) also improved significantly.

CVApr 8, 2016
STD2P: RGBD Semantic Segmentation Using Spatio-Temporal Data-Driven Pooling

Yang He, Wei-Chen Chiu, Margret Keuper et al.

We propose a novel superpixel-based multi-view convolutional neural network for semantic image segmentation. The proposed network produces a high quality segmentation of a single image by leveraging information from additional views of the same scene. Particularly in indoor videos such as captured by robotic platforms or handheld and bodyworn RGBD cameras, nearby video frames provide diverse viewpoints and additional context of objects and scenes. To leverage such information, we first compute region correspondences by optical flow and image boundary-based superpixels. Given these region correspondences, we propose a novel spatio-temporal pooling layer to aggregate information over space and time. We evaluate our approach on the NYU--Depth--V2 and the SUN3D datasets and compare it to various state-of-the-art single-view and multi-view approaches. Besides a general improvement over the state-of-the-art, we also show the benefits of making use of unlabeled frames during training for multi-view as well as single-view prediction.