Yanan Sun

CV
h-index: 27
19 papers
461 citations
Novelty: 54%
AI Score: 34

19 Papers

19.3 · CV · Mar 2, 2022 · Code
A Unified Query-based Paradigm for Point Cloud Understanding

Zetong Yang, Li Jiang, Yanan Sun et al.

3D point cloud understanding is an important component in autonomous driving and robotics. In this paper, we present a novel Embedding-Querying paradigm (EQ-Paradigm) for 3D understanding tasks including detection, segmentation, and classification. EQ-Paradigm is a unified paradigm that enables the combination of any existing 3D backbone architectures with different task heads. Under the EQ-Paradigm, the input is first encoded in the embedding stage with an arbitrary feature extraction architecture, which is independent of tasks and heads. Then, the querying stage enables the encoded features to be applicable for diverse task heads. This is achieved by introducing an intermediate representation, i.e., Q-representation, in the querying stage to serve as a bridge between the embedding stage and task heads. We design a novel Q-Net as the querying stage network. Extensive experimental results on various 3D tasks, including object detection, semantic segmentation and shape classification, show that EQ-Paradigm in tandem with Q-Net is a general and effective pipeline, which enables a flexible collaboration of backbones and heads, and further boosts the performance of the state-of-the-art methods. Code and models are available at https://github.com/dvlab-research/DeepVision3D.

15.3 · CV · May 22, 2022 · Code
Human Instance Matting via Mutual Guidance and Multi-Instance Refinement

Yanan Sun, Chi-Keung Tang, Yu-Wing Tai

This paper introduces a new matting task called human instance matting (HIM), which requires the pertinent model to automatically predict a precise alpha matte for each human instance. A straightforward combination of closely related techniques, namely instance segmentation, soft segmentation and human/conventional matting, will easily fail in complex cases requiring disentangling mingled colors belonging to multiple instances along hairy and thin boundary structures. To tackle these technical challenges, we propose a human instance matting framework, called InstMatt, where a novel mutual guidance strategy working in tandem with a multi-instance refinement module is used for delineating multi-instance relationships among humans with complex and overlapping boundaries if present. A new instance matting metric called instance matting quality (IMQ) is proposed, which addresses the absence of a unified and fair means of evaluation emphasizing both instance recognition and matting quality. Finally, we construct a HIM benchmark for evaluation, which comprises both synthetic and natural benchmark images. In addition to thorough experimental results on complex cases with multiple and overlapping human instances, each with intricate boundaries, preliminary results are presented on general instance matting. Code and benchmark are available at https://github.com/nowsyn/InstMatt.

5.7 · CV · Apr 13, 2022
Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions

Zipeng Ye, Zhiyao Sun, Yu-Hui Wen et al. · tsinghua

Recently, talking-face video generation has received considerable attention. So far, most methods generate results with neutral expressions or expressions that are implicitly determined by neural networks in an uncontrollable way. In this paper, we propose a method to generate talking-face videos with continuously controllable expressions in real time. Our method is based on an important observation: in contrast to facial geometry of moderate resolution, most expression information lies in textures. We therefore make use of neural textures to generate high-quality talking-face videos and design a novel neural network that can generate neural textures for image frames (which we call dynamic neural textures) based on the input expression and continuous intensity expression coding (CIEC). Our method uses 3DMM as a 3D model to sample the dynamic neural texture. The 3DMM does not cover the teeth area, so we propose a teeth submodule to complete the details in the teeth. Results and an ablation study show the effectiveness of our method in generating high-quality talking-face videos with continuously controllable expressions. We also set up four baseline methods by combining existing representative methods and compare them with our method. Experimental results including a user study show that our method has the best performance.

12.2 · CV · Sep 17, 2022
Continuously Controllable Facial Expression Editing in Talking Face Videos

Zhiyao Sun, Yu-Hui Wen, Tian Lv et al.

Recently, audio-driven talking face video generation has attracted considerable attention. However, very few studies address the issue of emotional editing of these talking face videos with continuously controllable expressions, for which there is strong demand in the industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Meanwhile, traditional image-to-image translation methods cannot work well in our application due to the coupling of expressions with other attributes such as poses, i.e., translating the expression of the character in each frame may simultaneously change the head pose due to the bias of the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos, allowing the user to control the target emotion in the edited video continuously. We present a new perspective for this task as a special case of motion information editing, where we use a 3DMM to capture major facial movements and an associated texture map modeled by a StyleGAN to capture appearance details. Both representations (3DMM and texture map) contain emotional information and can be continuously modified by neural networks and easily smoothed by averaging in coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.

5.7 · CV · Nov 21, 2022
H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions

Changlin Li, Guangyang Wu, Yanan Sun et al.

Capitalizing on the rapid development of neural networks, recent video frame interpolation (VFI) methods have achieved notable improvements. However, they still fall short for real-world videos containing large motions. Complex deformation and/or occlusion caused by large motions make this an extremely difficult problem in video frame interpolation. In this paper, we propose a simple yet effective solution, H-VFI, to deal with large motions in video frame interpolation. H-VFI contributes a hierarchical video interpolation transformer (HVIT) to learn a deformable kernel with a coarse-to-fine strategy at multiple scales. The learnt deformable kernel is then used to convolve the input frames for predicting the interpolated frame. Starting from the smallest scale, H-VFI successively updates the deformable kernel by a residual based on the previously predicted kernels, intermediate interpolated results and hierarchical features from the transformer. Bias and masks to refine the final outputs are then predicted by a transformer block based on the interpolated results. The advantage of such a progressive approximation is that the large-motion frame interpolation problem can be decomposed into several relatively simpler sub-tasks, which enables a very accurate prediction in the final results. Another noteworthy contribution of our paper is a large-scale high-quality dataset, YouTube200K, which contains videos depicting a great variety of scenarios captured at high resolution and high frame rate. Extensive experiments on multiple frame interpolation benchmarks validate that H-VFI outperforms existing state-of-the-art methods, especially for videos with large motions.

5.0 · CV · Jun 7, 2023
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

Yanan Sun, Zihan Zhong, Qi Fan et al.

Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting their ability to cover a large distribution of real-world data, where noise can also be introduced as misaligned pairs during pre-processing. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement of simultaneous presence of image and text. In this paper, we demonstrate that using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models. Our thorough studies validate that models pre-trained as such can learn rich representations of both modalities, improving their ability to understand how images and text relate to each other. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models by 6.5% (52.3% $\rightarrow$ 58.8%) on PASCAL-5$^i$ and 6.2% (27.2% $\rightarrow$ 33.4%) on COCO-20$^i$ semantic segmentation, respectively, under the zero-shot setting. By learning representations of both modalities, unimodal pre-training offers broader coverage, reduced misalignment errors, and the ability to capture more complex features and patterns in real-world data, resulting in better performance, especially for zero-shot vision-language tasks.

6.7 · NE · Apr 18, 2023
Differentiable Genetic Programming for High-dimensional Symbolic Regression

Peng Zeng, Xiaotian Song, Andrew Lensen et al.

Symbolic regression (SR) is the process of discovering hidden relationships from data with mathematical expressions, which is considered an effective way to reach interpretable machine learning (ML). Genetic programming (GP) has been the dominant approach to solving SR problems. However, as the scale of SR problems increases, GP often performs poorly and cannot effectively address real-world high-dimensional problems. This limitation is mainly caused by the stochastic evolutionary nature of traditional GP in constructing the trees. In this paper, we propose a differentiable approach named DGP to construct GP trees for high-dimensional SR for the first time. Specifically, a new data structure called the differentiable symbolic tree is proposed to relax the discrete structure to be continuous, so that a gradient-based optimizer can be used for efficient optimization. In addition, a sampling method is proposed to eliminate the discrepancy caused by the above relaxation and obtain valid symbolic expressions. Furthermore, a diversification mechanism is introduced to help the optimizer escape from local optima and find globally better solutions. With these designs, the proposed DGP method can efficiently search for GP trees with higher performance, and is thus capable of dealing with high-dimensional SR. To demonstrate the effectiveness of DGP, we conducted various experiments against state-of-the-art methods based on both GP and deep neural networks. The experimental results reveal that DGP can outperform these chosen peer competitors on high-dimensional regression benchmarks with dimensions varying from tens to thousands. In addition, on the synthetic SR problems, the proposed DGP method also achieves the best recovery rate even at different noise levels. It is believed that this work can help SR become a powerful alternative for interpretable ML on a broader range of real-world problems.
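
As a rough illustration of relaxing a discrete tree node into a continuous one, the sketch below mixes candidate operators with softmax weights and then discretizes by keeping the most probable operator. The operator set, the function names `soft_node` and `hard_node`, and the argmax discretization are assumptions made for illustration only; they are not taken from DGP, whose data structure and sampling method are not detailed in the abstract.

```python
import math

# Hypothetical operator set for one internal tree node (illustration only).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_node(logits, a, b):
    """Continuous relaxation: softmax-weighted mixture of all operators."""
    weights = softmax(logits)
    return sum(w * op(a, b) for w, op in zip(weights, OPS.values()))


def hard_node(logits, a, b):
    """Discretized node: keep only the operator with the largest logit."""
    names = list(OPS)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return names[best], OPS[names[best]](a, b)
```

Because `soft_node` is a smooth function of the logits, a gradient-based optimizer could in principle adjust them; `hard_node` then recovers a valid discrete expression from the learned logits.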

3.9 · CV · Apr 12, 2023
Neural Architecture Search Using Genetic Algorithm for Facial Expression Recognition

Shuchao Deng, Yanan Sun, Edgar Galvan

Facial expression is one of the most powerful, natural, and universal signals for human beings to express emotional states and intentions. Thus, the importance of correct and innovative facial expression recognition (FER) approaches in Artificial Intelligence is evident. The current common practice for FER is to correctly design the architectures of convolutional neural networks (CNNs) using human expertise. However, finding a well-performing architecture is often a very tedious and error-prone process for deep learning researchers. Neural architecture search (NAS) is an area of growing interest, as demonstrated by the large number of scientific works published in recent years thanks to the impressive results achieved. We propose a genetic algorithm approach that uses an ingenious encoding-decoding mechanism to automatically evolve CNNs on FER tasks, attaining high classification accuracy. The experimental results demonstrate that the proposed algorithm achieves the best-known results on the CK+ and FERG datasets as well as competitive results on the JAFFE dataset.

7.7 · LG · Feb 27, 2023
Communication-efficient Federated Learning with Single-Step Synthetic Features Compressor for Faster Convergence

Yuhao Zhou, Mingjia Shi, Yuanxi Li et al.

Reducing communication overhead in federated learning (FL) is challenging but crucial for large-scale distributed privacy-preserving machine learning. While methods utilizing sparsification or other techniques can largely lower the communication overhead, the convergence rate is also greatly compromised. In this paper, we propose a novel method, named single-step synthetic features compressor (3SFC), to achieve communication-efficient FL by directly constructing a tiny synthetic dataset based on raw gradients. Thus, 3SFC can achieve an extremely low compression rate when the constructed dataset contains only one data sample. Moreover, 3SFC's compressing phase utilizes a similarity-based objective function so that it can be optimized with just one step, thereby considerably improving its performance and robustness. In addition, to minimize the compressing error, error feedback (EF) is also incorporated into 3SFC. Experiments on multiple datasets and models suggest that 3SFC achieves significantly better convergence rates compared to competing methods with lower compression rates (up to 0.02%). Furthermore, ablation studies and visualizations show that 3SFC can carry more information than competing methods for every communication round, further validating its effectiveness.
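
The error feedback (EF) idea mentioned above can be sketched generically: whatever the compressor discards in one round is stored locally and added back to the next gradient before compressing again. The sketch below uses a simple top-k sparsifier as a stand-in compressor; this is an assumption for illustration and is not 3SFC's synthetic-dataset compressor. The function names are hypothetical.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def compress_with_error_feedback(gradient, residual, k):
    """Return (message, new_residual) for one communication round.

    The residual carries the compression error left over from earlier rounds.
    """
    corrected = [g + r for g, r in zip(gradient, residual)]
    message = top_k(corrected, k)  # what is actually transmitted
    new_residual = [c - m for c, m in zip(corrected, message)]
    return message, new_residual
```

A client would start with an all-zero residual and thread `new_residual` from one round into the next, so that information dropped by the compressor is delayed rather than lost.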

3.3 · LG · Dec 23, 2022
DAS: Neural Architecture Search via Distinguishing Activation Score

Yuqiao Liu, Haipeng Li, Yanan Sun et al.

Neural Architecture Search (NAS) is an automatic technique that can search for well-performing architectures for a specific task. Although NAS surpasses human-designed architectures in many fields, the high computational cost of the architecture evaluation it requires hinders its development. A feasible solution is to directly evaluate some metrics in the initial stage of the architecture without any training. The NAS without training (WOT) score is such a metric, which estimates the final trained accuracy of the architecture through the ability to distinguish different inputs in the activation layer. However, the WOT score is not an atomic metric, meaning that it does not represent a fundamental indicator of the architecture. The contributions of this paper are threefold. First, we decouple WOT into two atomic metrics, which represent the distinguishing ability of the network and the number of activation units, and explore better combination rules, named Distinguishing Activation Score (DAS). We prove the correctness of the decoupling theoretically and confirm the effectiveness of the rules experimentally. Second, in order to improve the prediction accuracy of DAS to meet practical search requirements, we propose a fast training strategy. When DAS is used in combination with the fast training strategy, it yields further improvements. Third, we propose a dataset called Darts-training-bench (DTB), which fills the gap that existing datasets contain no training states of architectures. Our proposed method achieves 1.04$\times$ to 1.56$\times$ improvements on NAS-Bench-101, Network Design Spaces, and the proposed DTB.

4.6 · LG · Dec 28, 2022
Differentiable Search of Accurate and Robust Architectures

Yuwei Ou, Xiangning Xie, Shangce Gao et al.

Deep neural networks (DNNs) are found to be vulnerable to adversarial attacks, and various methods have been proposed for the defense. Among these methods, adversarial training has been drawing increasing attention because of its simplicity and effectiveness. However, the performance of adversarial training is greatly limited by the architectures of the target DNNs, which often leaves the resulting DNNs with poor accuracy and unsatisfactory robustness. To address this problem, we propose DSARA to automatically search for neural architectures that are accurate and robust after adversarial training. In particular, we design a novel cell-based search space specifically for adversarial training, which improves the accuracy and the robustness upper bound of the searched architectures by carefully designing the placement of the cells and the proportional relationship of the filter numbers. Then we propose a two-stage search strategy to search for both accurate and robust neural architectures. In the first stage, the architecture parameters are optimized to minimize the adversarial loss, which makes full use of the effectiveness of adversarial training in enhancing robustness. In the second stage, the architecture parameters are optimized to minimize both the natural loss and the adversarial loss using the proposed multi-objective adversarial training method, so that the searched neural architectures are both accurate and robust. We evaluate the proposed algorithm under natural data and various adversarial attacks, which reveals the superiority of the proposed method in terms of both accurate and robust architectures. We also conclude that accurate and robust neural architectures tend to deploy very different structures near the input and the output, which has great practical significance for both the hand-crafting and the automatic design of accurate and robust neural architectures.

1.8 · LG · Apr 21, 2022
Automating Neural Architecture Design without Search

Zixuan Liang, Yanan Sun

Neural architecture search (NAS), as the mainstream approach to automate deep neural architecture design, has achieved much success in recent years. However, the performance estimation component of NAS is often prohibitively costly, which leads to enormous computational demand. Though a large number of efforts have been dedicated to alleviating this pain point, no consensus has yet been reached on which is optimal. In this paper, we study automated architecture design from a new perspective that eliminates the need to sequentially evaluate each neural architecture generated during algorithm execution. Specifically, the proposed approach is built by learning the knowledge of high-level experts in designing state-of-the-art architectures, and then the new architecture is directly generated from the knowledge learned. We implemented the proposed approach by using a graph neural network for link prediction and acquired the knowledge from NAS-Bench-101. Compared to existing peer competitors, we found a competitive network with minimal cost. In addition, we also utilized the learned knowledge from NAS-Bench-101 to automate architecture design in the DARTS search space, and achieved 97.82% accuracy on CIFAR10 and 76.51% top-1 accuracy on ImageNet, consuming only $2\times10^{-4}$ GPU days. This also demonstrates the high transferability of the proposed approach, which can potentially lead to a new, more computationally efficient paradigm in this research direction.

4.1 · LG · Feb 6, 2025 · Code
FAS: Fast ANN-SNN Conversion for Spiking Large Language Models

Long Chen, Xiaotian Song, Andy Song et al.

Spiking Large Language Models have been shown to be a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance with significantly reduced inference latency and computational costs. Notably, FAS takes only eight timesteps to achieve an accuracy 3% higher than that of the OPT-7B model, while reducing energy consumption by 96.63%. The source code is available at https://github.com/lc783/FAS
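
As general background on why ANN-SNN conversion introduces errors that depend on the number of timesteps, the sketch below simulates a single integrate-and-fire neuron with a constant input and reset by subtraction. Its firing rate approximates a ReLU clipped at the threshold, quantized to multiples of `threshold / timesteps`. This is a textbook-style illustration, not the FAS method; the function names, the unit threshold, and the constant-input assumption are all hypothetical choices.

```python
def if_neuron_spike_count(x, threshold=1.0, timesteps=8):
    """Count the spikes of an integrate-and-fire neuron driven by constant input x."""
    v = 0.0  # membrane potential
    spikes = 0
    for _ in range(timesteps):
        v += x
        if v >= threshold:
            spikes += 1
            v -= threshold  # soft reset (reset by subtraction)
    return spikes


def snn_activation(x, threshold=1.0, timesteps=8):
    """Rate-coded output: approximates min(max(x, 0), threshold)."""
    return threshold * if_neuron_spike_count(x, threshold, timesteps) / timesteps
```

With few timesteps the output can only take a handful of distinct values, which is one reason calibration steps are commonly used after conversion.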

3.7 · CV · Nov 9, 2024
Revisiting Long-Tailed Learning: Insights from an Architectural Perspective

Yuhan Pan, Yanan Sun, Wei Gong

Long-Tailed (LT) recognition has been widely studied to tackle the challenge of imbalanced data distributions in real-world applications. However, the design of neural architectures for LT settings has received limited attention, despite evidence showing that architecture choices can substantially affect performance. This paper aims to bridge the gap between LT challenges and neural network design by providing an in-depth analysis of how various architectures influence LT performance. Specifically, we systematically examine the effects of key network components on LT handling, such as topology, convolutions, and activation functions. Based on these observations, we propose two convolutional operations optimized for improved performance. Recognizing that operation interactions are also crucial to network effectiveness, we apply Neural Architecture Search (NAS) to facilitate efficient exploration. We propose LT-DARTS, a NAS method with a novel search space and search strategy specifically designed for LT data. Experimental results demonstrate that our approach consistently outperforms existing architectures across multiple LT datasets, achieving parameter-efficient, state-of-the-art results when integrated with current LT methods.

3.0 · NE · Nov 15, 2021
Evolving Deep Neural Networks for Collaborative Filtering

Yuhan Fang, Yuqiao Liu, Yanan Sun

Collaborative Filtering (CF) is widely used in recommender systems to model user-item interactions. With the great success of Deep Neural Networks (DNNs) in various fields, recent works have proposed several DNN-based models for CF, which have been proven effective. However, the neural networks are all designed manually. As a consequence, designers need expertise in both CF and DNNs, which limits the application of deep learning methods in CF and the accuracy of the recommended results. In this paper, we introduce the genetic algorithm into the process of designing DNNs. By means of genetic operations such as crossover, mutation, and an environmental selection strategy, the architectures and the connection weight initialization of the DNNs can be designed automatically. We conduct extensive experiments on two benchmark datasets. The results demonstrate that the proposed algorithm outperforms several manually designed state-of-the-art neural networks.

12.8 · NE · Aug 9, 2021
BenchENAS: A Benchmarking Platform for Evolutionary Neural Architecture Search

Xiangning Xie, Yuqiao Liu, Yanan Sun et al.

Neural architecture search (NAS), which automatically designs the architectures of deep neural networks, has achieved breakthrough success in many applications in the past few years. Among different classes of NAS methods, evolutionary computation based NAS (ENAS) methods have recently gained much attention. Unfortunately, the issues of fair comparisons and efficient evaluations have hindered the development of ENAS. The current benchmark architecture datasets designed for fair comparisons only provide the datasets, not the ENAS algorithms or the platform to run the algorithms. The existing efficient evaluation methods are either not suitable for population-based ENAS algorithms or are too complex to use. This paper develops a platform named BenchENAS to address these issues. BenchENAS aims to achieve fair comparisons by running different algorithms in the same environment and with the same settings. To achieve efficient evaluation in a common lab environment, BenchENAS designs a parallel component and a cache component with high maintainability. Furthermore, BenchENAS is easy to install and highly configurable and modular, which brings good usability and easy extensibility. The paper conducts efficient comparison experiments on eight ENAS algorithms with high GPU utilization on this platform. The experiments validate that the fair comparison issue does exist, and that BenchENAS can alleviate this issue. A website has been built to promote BenchENAS at https://benchenas.com, where interested researchers can obtain the source code and documentation of BenchENAS for free.

6.0 · CV · Mar 10, 2019
A Hybrid GA-PSO Method for Evolving Architecture and Short Connections of Deep Convolutional Neural Networks

Bin Wang, Yanan Sun, Bing Xue et al.

Image classification is a difficult machine learning task, where Convolutional Neural Networks (CNNs) have been applied for over 20 years in order to solve the problem. In recent years, instead of the traditional way of only connecting the current layer with its next layer, shortcut connections have been proposed to connect the current layer with its forward layers apart from its next layer, which has been proved to facilitate the training process of deep CNNs. However, because there are various ways to build the shortcut connections, it is hard to manually design the best shortcut connections when solving a particular problem, especially given that the design of the network architecture is already very challenging. In this paper, a hybrid evolutionary computation (EC) method is proposed to automatically evolve both the architecture of deep CNNs and the shortcut connections. The three major contributions of this work are as follows. Firstly, a new encoding strategy is proposed to encode a CNN, where the architecture and the shortcut connections are encoded separately. Secondly, a hybrid two-level EC method, which combines particle swarm optimisation and genetic algorithms, is developed to search for the optimal CNNs. Lastly, an adjustable learning rate is introduced for the fitness evaluations, which provides a better learning rate for the training process given a fixed number of epochs. The proposed algorithm is evaluated on three widely used benchmark datasets of image classification and compared with 12 peer non-EC-based competitors and one EC-based competitor. The experimental results demonstrate that the proposed method outperforms all of the peer competitors in terms of classification accuracy.

14.6 · NE · Aug 20, 2018
A Hybrid Differential Evolution Approach to Designing Deep Convolutional Neural Networks for Image Classification

Bin Wang, Yanan Sun, Bing Xue et al.

Convolutional Neural Networks (CNNs) have demonstrated their superiority in image classification, and evolutionary computation (EC) methods have recently been increasingly used to automatically design the architectures of CNNs, saving the tedious work of designing CNNs manually. In this paper, a new hybrid differential evolution (DE) algorithm with a newly added crossover operator, named DECNN, is proposed to evolve CNN architectures of any length. There are three new ideas in the proposed DECNN method. Firstly, an existing effective encoding scheme is refined to cater for variable-length CNN architectures. Secondly, new mutation and crossover operators are developed for variable-length DE to optimise the hyperparameters of CNNs. Finally, a new second crossover is introduced to evolve the depth of the CNN architectures. The proposed algorithm is tested on six widely used benchmark datasets and the results are compared to 12 state-of-the-art methods, which shows that the proposed method is highly competitive with the state-of-the-art algorithms. Furthermore, the proposed method is also compared with IPPSO, a method using particle swarm optimisation with a similar encoding strategy, and the proposed DECNN outperforms IPPSO in terms of accuracy.
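
For readers unfamiliar with differential evolution, the sketch below shows the classic fixed-length DE/rand/1 mutation with binomial crossover. It is the standard textbook operator, shown only as background: DECNN's variable-length mutation and its second crossover are not reproduced here, and the function name and default values of `f` and `cr` are assumptions.

```python
import random


def de_rand_1_bin(population, target_idx, f=0.5, cr=0.9, rng=random):
    """Build one trial vector with DE/rand/1 mutation and binomial crossover.

    Requires at least four vectors in the population.
    """
    target = population[target_idx]
    n = len(target)
    # Pick three distinct vectors, none of which is the target itself.
    candidates = [i for i in range(len(population)) if i != target_idx]
    a, b, c = (population[i] for i in rng.sample(candidates, 3))
    mutant = [a[j] + f * (b[j] - c[j]) for j in range(n)]
    forced = rng.randrange(n)  # at least one gene always comes from the mutant
    return [
        mutant[j] if (j == forced or rng.random() < cr) else target[j]
        for j in range(n)
    ]
```

In a full DE loop, the trial vector would replace the target only if its fitness is at least as good; for architecture search, that fitness would be the validation accuracy of the decoded CNN.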

15.6 · NE · Mar 17, 2018
Evolving Deep Convolutional Neural Networks by Variable-length Particle Swarm Optimization for Image Classification

Bin Wang, Yanan Sun, Bing Xue et al.

Convolutional neural networks (CNNs) are one of the most effective deep learning methods for solving image classification problems, but the best architecture of a CNN for a specific problem can be extremely complicated and hard to design. This paper focuses on utilising Particle Swarm Optimisation (PSO) to automatically search for the optimal architecture of CNNs without any manual work involved. In order to achieve this goal, three improvements are made to traditional PSO. First, a novel encoding strategy inspired by computer networks, which empowers particle vectors to easily encode CNN layers, is proposed. Second, in order to allow the proposed method to learn variable-length CNN architectures, a Disabled layer is designed to hide some dimensions of the particle vector to achieve variable-length particles. Third, since the learning process on large data is slow, partial datasets are randomly picked for the evaluation to dramatically speed it up. The proposed algorithm is examined and compared with 12 existing algorithms, including state-of-the-art methods, on three widely used image classification benchmark datasets. The experimental results show that the proposed algorithm is a strong competitor to the state-of-the-art algorithms in terms of classification error. This is the first work using PSO for automatically evolving the architectures of CNNs.
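
The Disabled-layer idea can be illustrated with a small decoder: the particle keeps a fixed number of slots, and slots marked as disabled are skipped, so the decoded network has a variable number of layers. The integer layer codes and the `(code, size)` tuple layout below are hypothetical; the paper's actual encoding, which is inspired by computer networks, is not reproduced here.

```python
# Hypothetical layer codes; not the encoding used in the paper.
DISABLED, CONV, POOL, DENSE = 0, 1, 2, 3
LAYER_NAMES = {CONV: "conv", POOL: "pool", DENSE: "dense"}


def decode_particle(particle):
    """Turn a fixed-length particle into a variable-length list of layers."""
    layers = []
    for code, size in particle:
        if code == DISABLED:
            continue  # a disabled slot contributes no layer
        layers.append((LAYER_NAMES[code], size))
    return layers
```

For example, a five-slot particle with two disabled slots decodes to a three-layer network, while PSO still updates all five slots as an ordinary fixed-length vector.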