CVMar 18, 2020Code
Collaborative Distillation for Ultra-Resolution Universal Style TransferHuan Wang, Yijun Li, Yuehai Wang et al.
Universal style transfer methods typically leverage rich representations from deep Convolutional Neural Network (CNN) models (e.g., VGG-19) pre-trained on large collections of images. Despite the effectiveness, its application is heavily constrained by the large model size to handle ultra-resolution images given limited memory. In this work, we present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer to reduce the convolutional filters. The main idea is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for style transfer models. Moreover, to overcome the feature size mismatch when applying collaborative distillation, a linear embedding loss is introduced to drive the student network to learn a linear embedding of the teacher's features. Extensive experiments show the effectiveness of our method when applied to different universal style transfer approaches (WCT and AdaIN), even if the model size is reduced by 15.5 times. Especially, on WCT with the compressed models, we achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time. Further experiments on optimization-based stylization scheme show the generality of our algorithm on different stylization paradigms. Our code and trained models are available at https://github.com/mingsun-tse/collaborative-distillation.
LGApr 25, 2018Code
Structured Pruning for Efficient ConvNets via Incremental RegularizationHuan Wang, Qiming Zhang, Yuehai Wang et al.
Parameter pruning is a promising approach for CNN compression and acceleration by eliminating redundant model parameters with tolerable performance degrade. Despite its effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, which neglects the fragility of the expressiveness of CNNs, and thus calls for a more gentle regularization scheme so that the networks can adapt during pruning. To achieve this, we propose a new and novel regularization-based pruning method, named IncReg, to incrementally assign different regularization factors to different weights based on their relative importance. Empirical analysis on CIFAR-10 dataset verifies the merits of IncReg. Further extensive experiments with popular CNNs on CIFAR-10 and ImageNet datasets show that IncReg achieves comparable to even better results compared with state-of-the-arts. Our source codes and trained models are available here: https://github.com/mingsun-tse/caffe_increg.
ASAug 23, 2025
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio GenerationSizhe Shan, Qiulin Li, Yutao Cui et al.
Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
CLAug 26, 2025
Empathy Omni: Enabling Empathetic Speech Response Generation through Large Language ModelsHaoyu Wang, Guangyan Zhang, Jiale Chen et al.
With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models only convert response content into speech without fully capturing the rich emotional cues in user queries, where the same sentence may convey different meanings depending on the expression. Emotional understanding is thus essential for improving human-machine interaction. Most empathetic speech LLMs rely on massive datasets, demanding high computational cost. A key challenge is to build models that generate empathetic responses with limited data and without large-scale training. To this end, we propose Emotion Omni, a model that understands emotional content in user speech and generates empathetic responses. We further developed a data pipeline to construct a 200k emotional dialogue dataset supporting empathetic speech assistants. Experiments show that Emotion Omni achieves comparable instruction-following ability without large-scale pretraining, while surpassing existing models in speech quality (UTMOS:4.41) and empathy (Emotion GPT Score: 3.97). These results confirm its improvements in both speech fidelity and emotional expressiveness. Demos are available at https://w311411.github.io/omni_demo/.
CVJul 30, 2025
A Large Language Model Powered Integrated Circuit Footprint Geometry UnderstandingYida Wang, Taiting Lu, Runze Liu et al.
Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.
ASAug 12, 2021
Masked Acoustic Unit for Mispronunciation Detection and CorrectionZhan Zhang, Yuehai Wang, Jianyi Yang
Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. Conventional ASR-based CAPT methods require expensive annotation of the ground truth pronunciation for the supervised training. Meanwhile, certain undefined non-native phonemes cannot be correctly classified into standard phonemes, making the annotation process challenging and subjective. On the other hand, ASR-based CAPT methods only give the learner text-based feedback about the mispronunciation, but cannot teach the learner how to pronounce the sentence correctly. To solve these limitations, we propose to use the acoustic unit (AU) as the intermediary feature for both mispronunciation detection and correction. The proposed method uses the masked AU sequence and the target phonemes to detect the error AU and then corrects it. This method can give the learner speech-based self-imitating feedback, making our CAPT powerful for education.
ASMay 5, 2021
Accent Recognition with Hybrid Phonetic FeaturesZhan Zhang, Xi Chen, Yuehai Wang et al.
The performance of voice-controlled systems is usually influenced by accented speech. To make these systems more robust, the frontend accent recognition (AR) technologies have received increased attention in recent years. As accent is a high-level abstract feature that has a profound relationship with the language knowledge, AR is more challenging than other language-agnostic audio classification tasks. In this paper, we use an auxiliary automatic speech recognition (ASR) task to extract language-related phonetic features. Furthermore, we propose a hybrid structure that incorporates the embeddings of both a fixed acoustic model and a trainable acoustic model, making the language-related acoustic feature more robust. We conduct several experiments on the Accented English Speech Recognition Challenge (AESRC) 2020 dataset. The results demonstrate that our approach can obtain a 6.57% relative improvement on the validation set. We also get a 7.28% relative improvement on the final test set for this competition, showing the merits of the proposed method.
CVNov 20, 2018
Structured Pruning for Efficient ConvNets via Incremental RegularizationHuan Wang, Qiming Zhang, Yuehai Wang et al.
Parameter pruning is a promising approach for CNN compression and acceleration by eliminating redundant model parameters with tolerable performance loss. Despite its effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, which neglects the fact that the expressiveness of CNNs is fragile and needs a more gentle way of regularization for the networks to adapt during pruning. To solve this problem, we propose a new regularization-based pruning method (named IncReg) to incrementally assign different regularization factors to different weight groups based on their relative importance, whose effectiveness is proved on popular CNNs compared with state-of-the-art methods.
LGSep 20, 2017
Structured Probabilistic Pruning for Convolutional Neural Network AccelerationHuan Wang, Qiming Zhang, Yuehai Wang et al.
In this paper, we propose a novel progressive parameter pruning method for Convolutional Neural Network acceleration, named Structured Probabilistic Pruning (SPP), which effectively prunes weights of convolutional layers in a probabilistic manner. Unlike existing deterministic pruning approaches, where unimportant weights are permanently eliminated, SPP introduces a pruning probability for each weight, and pruning is guided by sampling from the pruning probabilities. A mechanism is designed to increase and decrease pruning probabilities based on importance criteria in the training process. Experiments show that, with 4x speedup, SPP can accelerate AlexNet with only 0.3% loss of top-5 accuracy and VGG-16 with 0.8% loss of top-5 accuracy in ImageNet classification. Moreover, SPP can be directly applied to accelerate multi-branch CNN networks, such as ResNet, without specific adaptations. Our 2x speedup ResNet-50 only suffers 0.8% loss of top-5 accuracy on ImageNet. We further show the effectiveness of SPP on transfer learning tasks.