ASAug 11, 2024
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech ProcessingChunyu Qiang, Wang Geng, Yi Zhao et al.
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/
CVJul 19, 2024Code
Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit NetworksCheng Gong, Yao Chen, Qiuyang Luo et al.
Multi-exit network is a promising architecture for efficient model inference by sharing backbone networks and weights among multiple exits. However, the gradient conflict of the shared weights results in sub-optimal accuracy. This paper introduces Deep Feature Surgery (\methodname), which consists of feature partitioning and feature referencing approaches to resolve gradient conflict issues during the training of multi-exit networks. The feature partitioning separates shared features along the depth axis among all exits to alleviate gradient conflict while simultaneously promoting joint optimization for each exit. Subsequently, feature referencing enhances multi-scale features for distinct exits across varying depths to improve the model accuracy. Furthermore, \methodname~reduces the training operations with the reduced complexity of backpropagation. Experimental results on Cifar100 and ImageNet datasets exhibit that \methodname~provides up to a \textbf{50.00\%} reduction in training time and attains up to a \textbf{6.94\%} enhancement in accuracy when contrasted with baseline methods across diverse models and tasks. Budgeted batch classification evaluation on MSDNet demonstrates that DFS uses about $\mathbf{2}\boldsymbol{\times}$ fewer average FLOPs per image to achieve the same classification accuracy as baseline methods on Cifar100. The code is available at https://github.com/GongCheng1919/dfs.
81.2ASApr 24
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text InstructionsChunyu Qiang, Xiaopeng Wang, Kang Yin et al.
Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
AIMar 2Code
CoVe: Training Interactive Tool-Use Agents via Constraint-Guided VerificationJinpeng Chen, Cheng Gong, Hanbo Li et al.
Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce \textbf{CoVe} (\textbf{Co}nstraint-\textbf{Ve}rification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging $τ^2$-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact \textbf{CoVe-4B} model achieves success rates of 43.0\% and 59.4\% in the Airline and Retail domains, respectively; its overall performance significantly outperforms strong baselines of similar scale and remains competitive with models up to $17\times$ its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
CVApr 2, 2024Code
Minimize Quantization Output Error with Bias CompensationCheng Gong, Haoshuai Zheng, Mengting Hu et al.
Quantization is a promising method that reduces memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output error that hinder model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, the proposed BC bypasses the step to directly minimize the quantizing output error by identifying a bias vector for compensation. We have established that the minimization of output error through BC is a convex problem and provides an efficient strategy to procure optimal solutions associated with minimal output error,without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. Especially, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2.The code is in https://github.com/GongCheng1919/bias-compensation.
LGApr 7, 2023
AutoQNN: An End-to-End Framework for Automatically Quantizing Neural NetworksCheng Gong, Ye Lu, Surong Dai et al.
Exploring the expected quantizing scheme with suitable mixed-precision policy is the key point to compress deep neural networks (DNNs) in high efficiency and accuracy. This exploration implies heavy workloads for domain experts, and an automatic compression method is needed. However, the huge search space of the automatic method introduces plenty of computing budgets that make the automatic process challenging to be applied in real scenarios. In this paper, we propose an end-to-end framework named AutoQNN, for automatically quantizing different layers utilizing different schemes and bitwidths without any human labor. AutoQNN can seek desirable quantizing schemes and mixed-precision policies for mainstream DNN models efficiently by involving three techniques: quantizing scheme search (QSS), quantizing precision learning (QPL), and quantized architecture generation (QAG). QSS introduces five quantizing schemes and defines three new schemes as a candidate set for scheme search, and then uses the differentiable neural architecture search (DNAS) algorithm to seek the layer- or model-desired scheme from the set. QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes, to the best of our knowledge. QPL optimizes both classification loss and precision loss of DNNs efficiently and obtains the relatively optimal mixed-precision model within limited model size and memory footprint. QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention, to facilitate end-to-end neural network quantization. We have implemented AutoQNN and integrated it into Keras. Extensive experiments demonstrate that AutoQNN can consistently outperform state-of-the-art quantization.
SDSep 27, 2024
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech SynthesisHaoyu Wang, Chunyu Qiang, Tianrui Wang et al.
Recent advancements in speech synthesis models, trained on extensive datasets, have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. Despite these advancements, the choice of prompts significantly impacts the output quality, yet most existing selection schemes do not adequately address the control of emotional intensity. To address this question, this paper proposes a two-stage prompt selection strategy EmoPro, which is specifically designed for emotionally controllable speech synthesis. This strategy focuses on selecting highly expressive and high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected using the proposed method result in more emotionally expressive and engaging synthesized speech compared to those obtained through baseline. Audio samples and codes will be available at https://whyrrrrun.github.io/EmoPro/.
AINov 12, 2025
Efficient Reasoning via Reward ModelYuhao Wang, Xiaopeng Li, Cheng Gong et al.
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning step-a phenomenon known as overthinking-which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. Besides, on the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method's effectiveness and token efficiency, which achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM.
ASAug 4, 2025Code
SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech CodecChunyu Qiang, Haoyu Wang, Cheng Gong et al.
Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. Figure~\ref{fig:pesq_kbps_below_2kbps} shows SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. The code and model weights for SecoustiCodec will be open-sourced upon the completion of the peer-review process. We've open-sourced SecoustiCodec's demo, code, and model weights.
AIAug 2, 2025Code
H2C: Hippocampal Circuit-inspired Continual Learning for Lifelong Trajectory Prediction in Autonomous DrivingYunlong Lin, Zirui Li, Guodong Du et al.
Deep learning (DL) has shown state-of-the-art performance in trajectory prediction, which is critical to safe navigation in autonomous driving (AD). However, most DL-based methods suffer from catastrophic forgetting, where adapting to a new distribution may cause significant performance degradation in previously learned ones. Such inability to retain learned knowledge limits their applicability in the real world, where AD systems need to operate across varying scenarios with dynamic distributions. As revealed by neuroscience, the hippocampal circuit plays a crucial role in memory replay, effectively reconstructing learned knowledge based on limited resources. Inspired by this, we propose a hippocampal circuit-inspired continual learning method (H2C) for trajectory prediction across varying scenarios. H2C retains prior knowledge by selectively recalling a small subset of learned samples. First, two complementary strategies are developed to select the subset to represent learned knowledge. Specifically, one strategy maximizes inter-sample diversity to represent the distinctive knowledge, and the other estimates the overall knowledge by equiprobable sampling. Then, H2C updates via a memory replay loss function calculated by these selected samples to retain knowledge while learning new data. Experiments based on various scenarios from the INTERACTION dataset are designed to evaluate H2C. Experimental results show that H2C reduces catastrophic forgetting of DL baselines by 22.71% on average in a task-free manner, without relying on manually informed distributional shifts. The implementation is available at https://github.com/BIT-Jack/H2C-lifelong.
LGJan 13, 2025Code
MOS-Attack: A Scalable Multi-objective Adversarial Attack FrameworkPing Guo, Cheng Gong, Xi Lin et al.
Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single objective methods, namely adversarial attacks focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions. Our code is available at https://github.com/pgg3/MOS-Attack.
ROFeb 22, 2022Code
An Ensemble Learning Framework for Vehicle Trajectory Prediction in Interactive ScenariosZirui Li, Yunlong Lin, Cheng Gong et al.
Precisely modeling interactions and accurately predicting trajectories of surrounding vehicles are essential to the decision-making and path-planning of intelligent vehicles. This paper proposes a novel framework based on ensemble learning to improve the performance of trajectory predictions in interactive scenarios. The framework is termed Interactive Ensemble Trajectory Predictor (IETP). IETP assembles interaction-aware trajectory predictors as base learners to build an ensemble learner. Firstly, each base learner in IETP observes historical trajectories of vehicles in the scene. Then each base learner handles interactions between vehicles to predict trajectories. Finally, an ensemble learner is built to predict trajectories by applying two ensemble strategies on the predictions from all base learners. Predictions generated by the ensemble learner are final outputs of IETP. In this study, three experiments using different data are conducted based on the NGSIM dataset. Experimental results show that IETP improves the predicting accuracy and decreases the variance of errors compared to base learners. In addition, IETP exceeds baseline models with 50% of the training data, indicating that IETP is data-efficient. Moreover, the implementation of IETP is publicly available at https://github.com/BIT-Jack/IETP.
LGJul 14, 2025
GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement LearningZiru Liu, Cheng Gong, Xinyu Fu et al.
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.
ROApr 27, 2024
Motion planning for off-road autonomous driving based on human-like cognition and weight adaptationYuchun Wang, Cheng Gong, Jianwei Gong et al.
Driving in an off-road environment is challenging for autonomous vehicles due to the complex and varied terrain. To ensure stable and efficient travel, the vehicle requires consideration and balancing of environmental factors, such as undulations, roughness, and obstacles, to generate optimal trajectories that can adapt to changing scenarios. However, traditional motion planners often utilize a fixed cost function for trajectory optimization, making it difficult to adapt to different driving strategies in challenging irregular terrains and uncommon scenarios. To address these issues, we propose an adaptive motion planner based on human-like cognition and cost evaluation for off-road driving. First, we construct a multi-layer map describing different features of off-road terrains, including terrain elevation, roughness, obstacle, and artificial potential field map. Subsequently, we employ a CNN-LSTM network to learn the trajectories planned by human drivers in various off-road scenarios. Then, based on human-like generated trajectories in different environments, we design a primitive-based trajectory planner that aims to mimic human trajectories and cost weight selection, generating trajectories that are consistent with the dynamics of off-road vehicles. Finally, we compute optimal cost weights and select and extend behavioral primitives to generate highly adaptive, stable, and efficient trajectories. We validate the effectiveness of the proposed method through experiments in a desert off-road environment with complex terrain and varying road conditions. The experimental results show that the proposed human-like motion planner has excellent adaptability to different off-road conditions. It shows real-time operation, greater stability, and more human-like planning ability in diverse and challenging scenarios.
CRMar 8, 2024
Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial HypervolumePing Guo, Cheng Gong, Xi Lin et al.
The escalating threat of adversarial attacks on deep learning models, particularly in security-critical fields, has underscored the need for robust deep learning systems. Conventional robustness evaluations have relied on adversarial accuracy, which measures a model's performance under a specific perturbation intensity. However, this singular metric does not fully encapsulate the overall resilience of a model against varying degrees of perturbation. To address this gap, we propose a new metric termed adversarial hypervolume, assessing the robustness of deep learning models comprehensively over a range of perturbation intensities from a multi-objective optimization standpoint. This metric allows for an in-depth comparison of defense mechanisms and recognizes the trivial improvements in robustness afforded by less potent defensive strategies. Additionally, we adopt a novel training algorithm that enhances adversarial robustness uniformly across various perturbation intensities, in contrast to methods narrowly focused on optimizing adversarial accuracy. Our extensive empirical studies validate the effectiveness of the adversarial hypervolume metric, demonstrating its ability to reveal subtle differences in robustness that adversarial accuracy overlooks. This research contributes a new measure of robustness and establishes a standard for assessing and benchmarking the resilience of current and future defensive models against adversarial threats.
ASNov 23, 2025
InstructAudio: Unified speech and music generation with natural language instructionChunyu Qiang, Kang Yin, Xiaopeng Wang et al.
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to joint modeling with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion, style, accent), and musical (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves optimal results on most metrics. To our best knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
LGAug 27, 2025
Complementary Learning System Empowers Online Continual Learning of Vehicle Motion Forecasting in Smart CitiesZirui Li, Yunlong Lin, Guodong Du et al.
Artificial intelligence underpins most smart city services, yet deep neural network (DNN) that forecasts vehicle motion still struggle with catastrophic forgetting, the loss of earlier knowledge when models are updated. Conventional fixes enlarge the training set or replay past data, but these strategies incur high data collection costs, sample inefficiently and fail to balance long- and short-term experience, leaving them short of human-like continual learning. Here we introduce Dual-LS, a task-free, online continual learning paradigm for DNN-based motion forecasting that is inspired by the complementary learning system of the human brain. Dual-LS pairs two synergistic memory rehearsal replay mechanisms to accelerate experience retrieval while dynamically coordinating long-term and short-term knowledge representations. Tests on naturalistic data spanning three countries, over 772,000 vehicles and cumulative testing mileage of 11,187 km show that Dual-LS mitigates catastrophic forgetting by up to 74.31\% and reduces computational resource demand by up to 94.02\%, markedly boosting predictive stability in vehicle motion forecasting without inflating data requirements. Meanwhile, it endows DNN-based vehicle motion forecasting with computation efficient and human-like continual learning adaptability fit for smart cities.
AIJun 27, 2024
Learning Pareto Set for Multi-Objective Continuous Robot ControlTianye Shu, Ke Shang, Cheng Gong et al.
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
CLJun 13, 2024
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource ScenariosCheng Gong, Erica Cooper, Xin Wang et al.
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
RONov 11, 2021
Driver-Specific Risk Recognition in Interactive Driving Scenarios using Graph RepresentationJinghang Li, Chao Lu, Penghui Li et al.
This paper presents a driver-specific risk recognition framework for autonomous vehicles that can extract inter-vehicle interactions. This extraction is carried out for urban driving scenarios in a driver-cognitive manner to improve the recognition accuracy of risky scenes. First, clustering analysis is applied to the operation data of drivers for learning the subjective assessment of risky scenes of different drivers and generating the corresponding risk label for each scene. Second, the graph representation model (GRM) is adopted to unify and construct the features of dynamic vehicles, inter-vehicle interactions and static traffic markings in real driving scenes into graphs. The driver-specific risk label provides ground truth to capture the risk evaluation criteria of different drivers. In addition, the graph model represents multiple features of the driving scenes. Therefore, the proposed framework can learn the risk-evaluating pattern of driving scenes of different drivers and establish driver-specific risk identifiers. Last, the performance of the proposed framework is evaluated via experiments conducted using real-world urban driving datasets collected by multiple drivers. The results show that the risks and their levels in real driving environments can be accurately recognized by the proposed framework.
RONov 6, 2021
Prediction of Pedestrian Spatiotemporal Risk Levels for Intelligent Vehicles: A Data-driven ApproachZheyu Zhang, Boyang Wang, Chao Lu et al.
In recent years, road safety has attracted significant attention from researchers and practitioners in the intelligent transport systems domain. As one of the most common and vulnerable groups of road users, pedestrians cause great concerns due to their unpredictable behavior and movement, as subtle misunderstandings in vehicle-pedestrian interaction can easily lead to risky situations or collisions. Existing methods use either predefined collision-based models or human-labeling approaches to estimate the pedestrians' risks. These approaches are usually limited by their poor generalization ability and lack of consideration of interactions between the ego vehicle and a pedestrian. This work tackles the listed problems by proposing a Pedestrian Risk Level Prediction system. The system consists of three modules. Firstly, vehicle-perspective pedestrian data are collected. Since the data contains information regarding the movement of both the ego vehicle and pedestrian, it can simplify the prediction of spatiotemporal features in an interaction-aware fashion. Using the long short-term memory model, the pedestrian trajectory prediction module predicts their spatiotemporal features in the subsequent five frames. As the predicted trajectory follows certain interaction and risk patterns, a hybrid clustering and classification method is adopted to explore the risk patterns in the spatiotemporal features and train a risk level classifier using the learned patterns. Upon predicting the spatiotemporal features of pedestrians and identifying the corresponding risk level, the risk patterns between the ego vehicle and pedestrians are determined. Experimental results verified the capability of the PRLP system to predict the risk level of pedestrians, thus supporting the collision risk assessment of intelligent vehicles and providing safety warnings to both vehicles and pedestrians.
SDOct 9, 2021
Using multiple reference audios and style embedding constraints for speech synthesisCheng Gong, Longbiao Wang, Zhenhua Ling et al.
The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Due to the fact that only the matched text and speech are used in the training process, using unmatched text and speech for inference would cause the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than using only the target audio. Multiple reference audios are automatically selected using the sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use ''target'' style embedding from a Pre-trained encoder as a constraint by considering the mutual information between the predicted and ''target'' style embedding. The experimental results show that the proposed model can improve the speech naturalness and content quality with multiple reference audios and can also outperform the baseline model in ABX preference tests of style similarity.
ROSep 15, 2021
Life-Long Multi-Task Learning of Adaptive Path Tracking Policy for Autonomous VehicleCheng Gong, Jianwei Gong, Chao Lu et al.
This paper proposes a life-long adaptive path tracking policy learning method for autonomous vehicles that can self-evolve and self-adapt with multi-task knowledge. Firstly, the proposed method can learn a model-free control policy for path tracking directly from the historical driving experience, where the property of vehicle dynamics and corresponding control strategy can be learned simultaneously. Secondly, by utilizing the life-long learning method, the proposed method can learn the policy with task-incremental knowledge without encountering catastrophic forgetting. Thus, with continual multi-task knowledge learned, the policy can iteratively adapt to new tasks and improve its performance with knowledge from new tasks. Thirdly, a memory evaluation and updating method is applied to optimize memory structure for life-long learning which enables the policy to learn toward selected directions. Experiments are conducted using a high-fidelity vehicle dynamic model in a complex curvy road to evaluate the performance of the proposed method. Results show that the proposed method can effectively evolve with continual multi-task knowledge and adapt to the new environment, where the performance of the proposed method can also surpass two commonly used baseline methods after evolving.
CVSep 8, 2021
Elastic Significant Bit Quantization and Acceleration for Deep Neural NetworksCheng Gong, Ye Lu, Kunpeng Xie et al.
Quantization has been proven to be a vital method for improving the inference efficiency of deep neural networks (DNNs). However, it is still challenging to strike a good balance between accuracy and efficiency while quantizing DNN weights or activation values from high-precision formats to their quantized counterparts. We propose a new method called elastic significant bit quantization (ESB) that controls the number of significant bits of quantized values to obtain better inference accuracy with fewer resources. We design a unified mathematical formula to constrain the quantized values of the ESB with a flexible number of significant bits. We also introduce a distribution difference aligner (DDA) to quantitatively align the distributions between the full-precision weight or activation values and quantized values. Consequently, ESB is suitable for various bell-shaped distributions of weights and activation of DNNs, thus maintaining a high inference accuracy. Benefitting from fewer significant bits of quantized values, ESB can reduce the multiplication complexity. We implement ESB as an accelerator and quantitatively evaluate its efficiency on FPGAs. Extensive experimental results illustrate that ESB quantization consistently outperforms state-of-the-art methods and achieves average accuracy improvements of 4.78%, 1.92%, and 3.56% over AlexNet, ResNet18, and MobileNetV2, respectively. Furthermore, ESB as an accelerator can achieve 10.95 GOPS peak performance of 1k LUTs without DSPs on the Xilinx ZCU102 FPGA platform. Compared with CPU, GPU, and state-of-the-art accelerators on FPGAs, the ESB accelerator can improve the energy efficiency by up to 65x, 11x, and 26x, respectively.
SDAug 4, 2021
Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech SynthesisXudong Dai, Cheng Gong, Longbiao Wang et al.
Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style information. However, this embedding process may encode redundant textual information. This phenomenon is called content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision loss functions. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be forced to focus on style information rather than on textual information contained in the reference speech by a well-designed downsample-upsample filter, i.e., the extracted style embeddings can be downsampled at a certain interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers to help the system learn a better latent style space. Objective metrics such as the significantly lower word error rate (WER) demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the model retains its prosody transferability compared with the baseline models such as the original GST-Tacotron and ASR-guided Tacotron.
ROAug 2, 2021
Orientation-Aware Planning for Parallel Task Execution of Omni-Directional Mobile RobotCheng Gong, Zirui Li, Xingyu Zhou et al.
Omni-directional mobile robot (OMR) systems have been very popular in academia and industry for their superb maneuverability and flexibility. Yet their potential has not been fully exploited, where the extra degree of freedom in OMR can potentially enable the robot to carry out extra tasks. For instance, gimbals or sensors on robots may suffer from a limited field of view or be constrained by the inherent mechanical design, which will require the chassis to be orientation-aware and respond in time. To solve this problem and further develop the OMR systems, in this paper, we categorize the tasks related to OMR chassis into orientation transition tasks and position transition tasks, where the two tasks can be carried out at the same time. By integrating the parallel task goals in a single planning problem, we proposed an orientation-aware planning architecture for OMR systems to execute the orientation transition and position transition in a unified and efficient way. A modified trajectory optimization method called orientation-aware timed-elastic-band (OATEB) is introduced to generate the trajectory that satisfies the requirements of both tasks. Experiments in both 2D simulated environments and real scenes are carried out. A four-wheeled OMR is deployed to conduct the real scene experiment and the results demonstrate that the proposed method is capable of simultaneously executing parallel tasks and is applicable to real-life scenarios.
ROJun 24, 2021
Autonomous Driving Strategies at Intersections: Scenarios, State-of-the-Art, and Future OutlooksLianzhen Wei, Zirui Li, Jianwei Gong et al.
Due to the complex and dynamic character of intersection scenarios, the autonomous driving strategy at intersections has been a difficult problem and a hot point in the research of intelligent transportation systems in recent years. This paper gives a brief summary of state-of-the-art autonomous driving strategies at intersections. Firstly, we enumerate and analyze common types of intersection scenarios, corresponding simulation platforms, as well as related datasets. Secondly, by reviewing previous studies, we have summarized characteristics of existing autonomous driving strategies and classified them into several categories. Finally, we point out problems of the existing autonomous driving strategies and put forward several valuable research outlooks.
CVSep 17, 2020
High-precision target positioning system for unmanned vehicles based on binocular visionXianqi He, Zirui Li, Xufeng Yin et al.
Unmanned vehicles often need to locate targets with high precision during work. In the unmanned material handling workshop, the unmanned vehicle needs to perform high-precision pose estimation of the workpiece to accurately grasp the workpiece. In this context, this paper proposes a high-precision unmanned vehicle target positioning system based on binocular vision. The system uses a region-based stereo matching algorithm to obtain a disparity map, and uses the RANSAC algorithm to extract position and posture features, which achives the estimation of the position and attitude of a six-degree-of-freedom cylindrical workpiece. In order to verify the effect of the system, this paper collects the accuracy and calculation time of the output results of the cylinder in different poses. The experimental data shows that the position accuracy of the system is 0.61~1.17mm and the angular accuracy is 1.95~5.13°, which can achieve better high-precision positioning effect.
CVJul 11, 2020
Driver Behavior Modelling at the Urban Intersection via Canonical Correlation AnalysisZirui Li, Chao Lu, Cheng Gong et al.
The urban intersection is a typically dynamic and complex scenario for intelligent vehicles, which exists a variety of driving behaviors and traffic participants. Accurately modelling the driver behavior at the intersection is essential for intelligent transportation systems (ITS). Previous researches mainly focus on using attention mechanism to model the degree of correlation. In this research, a canonical correlation analysis (CCA)-based framework is proposed. The value of canonical correlation is used for feature selection. Gaussian mixture model and Gaussian process regression are applied for driver behavior modelling. Two experiments using simulated and naturalistic driving data are designed for verification. Experimental results are consistent with the driver's judgment. Comparative studies show that the proposed framework can obtain a better performance.
CVMay 18, 2020
VecQ: Minimal Loss DNN Model Compression With Vectorized Weight QuantizationCheng Gong, Yao Chen, Ye Lu et al.
Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs. However, the trade-off between the quantization bitwidth and final accuracy is complex and non-convex, which makes it difficult to be optimized directly. Minimizing direct quantization loss (DQL) of the coefficient data is an effective local optimization method, but previous works often neglect the accurate control of the DQL, resulting in a higher loss of the final DNN model accuracy. In this paper, we propose a novel metric called Vector Loss. Based on this new metric, we develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy. In addition, in order to speed up the proposed quantization process during model training, we accelerate the quantization process with a parameterized probability estimation method and template-based derivation calculation. We evaluate our proposed algorithm on MNIST, CIFAR, ImageNet, IMDB movie review and THUCNews text data sets with numerical DNN models. The results demonstrate that our proposed quantization solution is more accurate and effective than the state-of-the-art approaches yet with more flexible bitwidth support. Moreover, the evaluation of our quantized models on Saliency Object Detection (SOD) tasks maintains comparable feature extraction quality with up to 16$\times$ weight size reduction.
CLAug 2, 2019
DELTA: A DEep learning based Language Technology plAtformKun Han, Junwen Chen, Hui Zhang et al.
In this paper we present DELTA, a deep learning based language technology platform. DELTA is an end-to-end platform designed to solve industry level natural language and speech processing problems. It integrates most popular neural network models for training as well as comprehensive deployment tools for production. DELTA aims to provide easy and fast experiences for using, deploying, and developing natural language processing and speech models for both academia and industry use cases. We demonstrate the reliable performance with DELTA on several natural language processing and speech tasks, including text classification, named entity recognition, natural language inference, speech recognition, speaker verification, etc. DELTA has been used for developing several state-of-the-art algorithms for publications and delivering real production to serve millions of users.
CVMay 26, 2018
Enhanced-alignment Measure for Binary Foreground Map EvaluationDeng-Ping Fan, Cheng Gong, Yang Cao et al.
The existing binary foreground map (FM) measures to address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level match or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic, random Gaussian noise maps, ground-truth switch, as well as human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvementrangingfrom9.08% to 19.65% compared with other popular measures.