89.7SDMay 31Code
SegTune: Structured and Fine-Grained Control for Song GenerationYuejiao Wang, Zihao Ji, Pengfei Cai et al.
Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our project page (https://github.com/KlingAIResearch/SegTune) for codes and more generated songs.
68.3CLMay 29
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea GenerationXu Li, Hanzhe Tu, Xinyi Li et al.
Generating novel research ideas is fundamental to scientific progress. While Large Language Models (LLMs) show promise in assisting this process, existing approaches often exhibit semantic convergence, resulting in limited diversity and novelty. To address this, we introduce EvoGens, an evolution-inspired framework that recasts scientific idea generation as an evolutionary search over a population of ideas. EvoGens iteratively applies rank-based mutation with differentiated retrieval planning to incorporate external knowledge, and semantic-aware crossover to fuse complementary concepts for conceptual reorganization. A lightweight evaluation signal guides the selection process, encouraging sustained exploration while mitigating premature convergence. Extensive experiments demonstrate that EvoGens substantially enhances exploration capabilities compared to state-of-the-art baselines. Specifically, it improves the Novelty from 0.1 to 0.4 and the Diversity from 0.24 to 0.55, while maintaining comparable idea quality under the current automatic evaluation protocol. These findings suggest that evolutionary mechanisms can serve as a useful framework for exploration-oriented research ideation, especially for broadening the novelty and diversity of candidate ideas under a shared automatic evaluation setting.
85.2SDMay 28
SegTune: Structured and Fine-Grained Control for Song GenerationPengfei Cai, Joanna Wang, Haorui Zheng et al.
Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections.The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves superior controllability and musical coherence compared to existing baselines. See https://cai525.github.io/SegTune_demo for demos of our work.
CVMay 9, 2022
Joint learning of object graph and relation graph for visual question answeringHao Li, Xu Li, Belhal Karimi et al.
Modeling visual question answering(VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which causes false attribute selection or missing relation in Figure 1(a). It is because these models cannot balance all kinds of information in scene graphs, neglecting relation and attribute information. In this paper, we introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN), which can obtain a balanced representation by properly encoding multi-scale scene graph information. Specifically, we (i)transform the scene graph into two graphs with diversified focuses on objects and relations; Then we design a dual structure to encode them, which increases the weights from relations (ii)fuse the encoder output with attribute features, which increases the weights from attributes; (iii)propose a message-passing mechanism to enhance the information transfer between objects, relations and attributes. We conduct extensive experiments on datasets including GQA, VG, motif-VG and achieve new state of the art.
LGJun 23, 2022Code
CGAR: Critic Guided Action Redistribution in Reinforcement LeaningTairan Huang, Xu Li, Hao Li et al.
Training a game-playing reinforcement learning agent requires multiple interactions with the environment. Ignorant random exploration may cause a waste of time and resources. It's essential to alleviate such waste. As discussed in this paper, under the settings of the off-policy actor critic algorithms, we demonstrate that the critic can bring more expected discounted rewards than or at least equal to the actor. Thus, the Q value predicted by the critic is a better signal to redistribute the action originally sampled from the policy distribution predicted by the actor. This paper introduces the novel Critic Guided Action Redistribution (CGAR) algorithm and tests it on the OpenAI MuJoCo tasks. The experimental results demonstrate that our method improves the sample efficiency and achieves state-of-the-art performance. Our code can be found at https://github.com/tairanhuang/CGAR.
35.3CVJun 1
Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality AssessmentBishr Omer Abdelrahman Adam, Xu Li
Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
36.3AIJun 1
TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability DeterminationXu Li, Zedong Fu, Xinyi Li et al.
Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.
SDMar 29, 2022
Spoofing-Aware Speaker Verification by Multi-Level FusionHaibin Wu, Lingwei Meng, Jiawen Kang et al.
Recently, many novel techniques have been introduced to deal with spoofing attacks, and achieve promising countermeasure (CM) performances. However, these works only take the stand-alone CM models into account. Nowadays, a spoofing aware speaker verification (SASV) challenge which aims to facilitate the research of integrated CM and ASV models, arguing that jointly optimizing CM and ASV models will lead to better performance, is taking place. In this paper, we propose a novel multi-model and multi-level fusion strategy to tackle the SASV task. Compared with purely scoring fusion and embedding fusion methods, this framework first utilizes embeddings from CM models, propagating CM embeddings into a CM block to obtain a CM score. In the second-level fusion, the CM score and ASV scores directly from ASV systems will be concatenated into a prediction block for the final decision. As a result, the best single fusion system has achieved the SASV-EER of 0.97% on the evaluation set. Then by ensembling the top-5 fusion systems, the final SASV-EER reached 0.89%.
SDOct 23, 2022
Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker SeparationXiaoyu Liu, Xu Li, Joan Serrà
Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
CRFeb 13Code
Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using AgentsXu Li, Simon Yu, Minzhou Pan et al.
LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.
IVSep 9, 2022
DeepSTI: Towards Tensor Reconstruction using Fewer Orientations in Susceptibility Tensor ImagingZhenghan Fang, Kuo-Wei Lai, Peter van Zijl et al.
Susceptibility tensor imaging (STI) is an emerging magnetic resonance imaging technique that characterizes the anisotropic tissue magnetic susceptibility with a second-order tensor model. STI has the potential to provide information for both the reconstruction of white matter fiber pathways and detection of myelin changes in the brain at mm resolution or less, which would be of great value for understanding brain structure and function in healthy and diseased brain. However, the application of STI in vivo has been hindered by its cumbersome and time-consuming acquisition requirement of measuring susceptibility induced MR phase changes at multiple (usually more than six) head orientations. This complexity is enhanced by the limitation in head rotation angles due to physical constraints of the head coil. As a result, STI has not yet been widely applied in human studies in vivo. In this work, we tackle these issues by proposing an image reconstruction algorithm for STI that leverages data-driven priors. Our method, called DeepSTI, learns the data prior implicitly via a deep neural network that approximates the proximal operator of a regularizer function for STI. The dipole inversion problem is then solved iteratively using the learned proximal network. Experimental results using both simulation and in vivo human data demonstrate great improvement over state-of-the-art algorithms in terms of the reconstructed tensor image, principal eigenvector maps and tractography results, while allowing for tensor reconstruction with MR phase measured at much less than six different orientations. Notably, promising reconstruction results are achieved by our method from only one orientation in human in vivo, and we demonstrate a potential application of this technique for estimating lesion susceptibility anisotropy in patients with multiple sclerosis.
CVJul 10, 2024
EA-VTR: Event-Aware Video-Text RetrievalZongyang Ma, Ziqi Zhang, Yuxin Chen et al.
Understanding the content of events occurring in the video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both data and model perspectives. In terms of pre-training data, we focus on supplementing the missing specific event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, ie, EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perceive ability on Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding event temporal logic understanding ability on Test of Time task.
SDSep 14, 2024
Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved IntelligibilityXiaoyu Liu, Xu Li, Joan Serrà et al.
Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. As other models of its kind, MaskSR attains high quality but, as we show, intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves competitive word error rate among other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.
MNJan 10, 2023
Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in BiofilmsSamitha Somathilaka, Daniel P. Martins, Xu Li et al.
Bacterial cells are sensitive to a range of external signals used to learn the environment. These incoming external signals are then processed using a Gene Regulatory Network (GRN), exhibiting similarities to modern computing algorithms. An in-depth analysis of gene expression dynamics suggests an inherited Gene Regulatory Neural Network (GRNN) behavior within the GRN that enables the cellular decision-making based on received signals from the environment and neighbor cells. In this study, we extract a sub-network of \textit{Pseudomonas aeruginosa} GRN that is associated with one virulence factor: pyocyanin production as a use case to investigate the GRNN behaviors. Further, using Graph Neural Network (GNN) architecture, we model a single species biofilm to reveal the role of GRNN dynamics on ecosystem-wide decision-making. Varying environmental conditions, we prove that the extracted GRNN computes input signals similar to natural decision-making process of the cell. Identifying of neural network behaviors in GRNs may lead to more accurate bacterial cell activity predictive models for many applications, including human health-related problems and agricultural applications. Further, this model can produce data on causal relationships throughout the network, enabling the possibility of designing tailor-made infection-controlling mechanisms. More interestingly, these GRNNs can perform computational tasks for bio-hybrid computing systems.
AIOct 11, 2024Code
Baichuan-Omni Technical ReportYadong Li, Haoze Sun, Mingan Lin et al.
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across audio, image, video, and text modal. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.
MLNov 1, 2023
Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance MatrixYunfeng Cai, Xu Li, Minging Sun et al.
Discovering the causal relationship via recovering the directed acyclic graph (DAG) structure from the observed data is a well-known challenging combinatorial problem. When there are latent variables, the problem becomes even more difficult. In this paper, we first propose a DAG structure recovering algorithm, which is based on the Cholesky factorization of the covariance matrix of the observed data. The algorithm is fast and easy to implement and has theoretical grantees for exact recovery. On synthetic and real-world datasets, the algorithm is significantly faster than previous methods and achieves the state-of-the-art performance. Furthermore, under the equal error variances assumption, we incorporate an optimization procedure into the Cholesky factorization based algorithm to handle the DAG recovering problem with latent variables. Numerical simulations show that the modified "Cholesky + optimization" algorithm is able to recover the ground truth graph in most cases and outperforms existing algorithms.
LGMar 11, 2022
PathSAGE: Spatial Graph Attention Neural Networks With Random Path SamplingJunhua Ma, Jiajun Li, Xueming Li et al.
Graph Convolutional Networks (GCNs) achieve great success in non-Euclidean structure data processing recently. In existing studies, deeper layers are used in CCNs to extract deeper features of Euclidean structure data. However, for non-Euclidean structure data, too deep GCNs will confront with problems like "neighbor explosion" and "over-smoothing", it also cannot be applied to large datasets. To address these problems, we propose a model called PathSAGE, which can learn high-order topological information and improve the model's performance by expanding the receptive field. The model randomly samples paths starting from the central node and aggregates them by Transformer encoder. PathSAGE has only one layer of structure to aggregate nodes which avoid those problems above. The results of evaluation shows that our model achieves comparable performance with the state-of-the-art models in inductive learning tasks.
LGOct 19, 2024Code
Baichuan Alignment Technical ReportMingan Lin, Fan Yang, Yanjun Shen et al.
We introduce Baichuan Alignment, a detailed analysis of the alignment techniques employed in the Baichuan series of models. This represents the industry's first comprehensive account of alignment methodologies, offering valuable insights for advancing AI research. We investigate the critical components that enhance model performance during the alignment process, including optimization methods, data strategies, capability enhancements, and evaluation processes. The process spans three key stages: Prompt Augmentation System(PAS), Supervised Fine-Tuning(SFT), and Preference Alignment. The problems encountered, the solutions applied, and the improvements made are thoroughly recorded. Through comparisons across well-established benchmarks, we highlight the technological advancements enabled by Baichuan Alignment. Baichuan-Instruct is an internal model, while Qwen2-Nova-72B and Llama3-PBM-Nova-70B are instruct versions of the Qwen2-72B and Llama-3-70B base models, optimized through Baichuan Alignment. Baichuan-Instruct demonstrates significant improvements in core capabilities, with user experience gains ranging from 17% to 28%, and performs exceptionally well on specialized benchmarks. In open-source benchmark evaluations, both Qwen2-Nova-72B and Llama3-PBM-Nova-70B consistently outperform their respective official instruct versions across nearly all datasets. This report aims to clarify the key technologies behind the alignment process, fostering a deeper understanding within the community. Llama3-PBM-Nova-70B model is available at https://huggingface.co/PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B.
30.4CVMar 11
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepDQinxin Wu, Fucheng Niu, Hengchuan Zhu et al.
Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.
CVDec 11, 2025
VLM-NCD:Novel Class Discovery with Vision-Based Large Language ModelsYuetong Su, Baoguo Wei, Xinyu Wang et al.
Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dualphase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long tail distributions, a first in NCD literature.
91.7LGMar 22
ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language ModelsXu Li, Yi Zheng, Yuxuan Liang et al.
Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can be seamlessly integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks, while achieving effective reductions in computation, memory consumption, and inference latency.
72.4NAApr 13
A Block-Shifted Cyclic Reduction Algorithm for Solving a Class of Quadratic Matrix EquationsXu Li, Beatrice Meini
The cyclic reduction (CR) algorithm is an efficient method for solving quadratic matrix equations that arise in quasi-birth-death (QBD) stochastic processes. However, its convergence is not guaranteed when the associated matrix polynomial has more than one eigenvalue on the unit circle. To address this limitation, we introduce a novel iteration method, referred to as the Block-Shifted CR algorithm, that improves the CR algorithm by utilizing singular value decomposition (SVD) and block shift-and-deflate techniques. This new approach extends the applicability of existing solvers to a broader class of quadratic matrix equations. Numerical experiments demonstrate the effectiveness and robustness of the proposed method.
CLJan 26, 2025
Baichuan-Omni-1.5 Technical ReportYadong Li, Jun Liu, Tao Zhang et al.
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
ASMar 11, 2025
YuE: Scaling Open Foundation Models for Long-Form Music GenerationRuibin Yuan, Hanfeng Lin, Shuyue Guo et al.
We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
LGFeb 2, 2024
Parametric Feature Transfer: One-shot Federated Learning with Foundation ModelsMahdi Beitollahi, Alex Bie, Sobhan Hemati et al.
In one-shot federated learning (FL), clients collaboratively train a global model in a single round of communication. Existing approaches for one-shot FL enhance communication efficiency at the expense of diminished accuracy. This paper introduces FedPFT (Federated Learning with Parametric Feature Transfer), a methodology that harnesses the transferability of foundation models to enhance both accuracy and communication efficiency in one-shot FL. The approach involves transferring per-client parametric models (specifically, Gaussian mixtures) of features extracted from foundation models. Subsequently, each parametric model is employed to generate synthetic features for training a classifier head. Experimental results on eight datasets demonstrate that FedPFT enhances the communication-accuracy frontier in both centralized and decentralized FL scenarios, as well as across diverse data-heterogeneity settings such as covariate shift and task shift, with improvements of up to 20.6%. Additionally, FedPFT adheres to the data minimization principle of FL, as clients do not send real features. We demonstrate that sending real features is vulnerable to potent reconstruction attacks. Moreover, we show that FedPFT is amenable to formal privacy guarantees via differential privacy, demonstrating favourable privacy-accuracy tradeoffs.
LGMar 7, 2025
NoT: Federated Unlearning via Weight NegationYasser H. Khalil, Leo Brunswic, Soufiane Lamghari et al.
Federated unlearning (FU) aims to remove a participant's data contributions from a trained federated learning (FL) model, ensuring privacy and regulatory compliance. Traditional FU methods often depend on auxiliary storage on either the client or server side or require direct access to the data targeted for removal-a dependency that may not be feasible if the data is no longer available. To overcome these limitations, we propose NoT, a novel and efficient FU algorithm based on weight negation (multiplying by -1), which circumvents the need for additional storage and access to the target data. We argue that effective and efficient unlearning can be achieved by perturbing model parameters away from the set of optimal parameters, yet being well-positioned for quick re-optimization. This technique, though seemingly contradictory, is theoretically grounded: we prove that the weight negation perturbation effectively disrupts inter-layer co-adaptation, inducing unlearning while preserving an approximate optimality property, thereby enabling rapid recovery. Experimental results across three datasets and three model architectures demonstrate that NoT significantly outperforms existing baselines in unlearning efficacy as well as in communication and computational efficiency.
CVNov 19, 2024
CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMsZhehan Kan, Ce Zhang, Zihan Liao et al.
Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defect from vision-language misalignment, creating a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLM in various challenging applications.
CLMay 30, 2025
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error CorrectionYangui Fang, Baixu Cheng, Jing Peng et al.
Automatic Speech Recognition (ASR) error correction aims to correct recognition errors while preserving accurate text. Although traditional approaches demonstrate moderate effectiveness, LLMs offer a paradigm that eliminates the need for training and labeled data. However, directly using LLMs will encounter hallucinations problem, which may lead to the modification of the correct text. To address this problem, we propose the Reliable LLM Correction Framework (RLLM-CF), which consists of three stages: (1) error pre-detection, (2) chain-of-thought sub-tasks iterative correction, and (3) reasoning process verification. The advantage of our method is that it does not require additional information or fine-tuning of the model, and ensures the correctness of the LLM correction under multi-pass programming. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
CVDec 26, 2024
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language ModelsXu Li, Yi Zheng, Haotian Chen et al.
Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. Building on these insights, we propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations demonstrate the superior performance of our method. Additionally, an in-depth analysis of the aggregator's behavior highlights the dominance of mid-to-high-level features in semantic-rich tasks and the critical role of low-level features in fine-grained perception.
CLJun 15, 2025
Large Language Models Enhanced by Plug and Play Syntactic Knowledge for Aspect-based Sentiment AnalysisYuanhe Tian, Xu Li, Wei Wang et al.
Aspect-based sentiment analysis (ABSA) generally requires a deep understanding of the contextual information, including the words associated with the aspect terms and their syntactic dependencies. Most existing studies employ advanced encoders (e.g., pre-trained models) to capture such context, especially large language models (LLMs). However, training these encoders is resource-intensive, and in many cases, the available data is insufficient for necessary fine-tuning. Therefore it is challenging for learning LLMs within such restricted environments and computation efficiency requirement. As a result, it motivates the exploration of plug-and-play methods that adapt LLMs to ABSA with minimal effort. In this paper, we propose an approach that integrates extendable components capable of incorporating various types of syntactic knowledge, such as constituent syntax, word dependencies, and combinatory categorial grammar (CCG). Specifically, we propose a memory module that records syntactic information and is incorporated into LLMs to instruct the prediction of sentiment polarities. Importantly, this encoder acts as a versatile, detachable plugin that is trained independently of the LLM. We conduct experiments on benchmark datasets, which show that our approach outperforms strong baselines and previous approaches, thus demonstrates its effectiveness.
CVJan 24, 2025
Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language ModelsYuxuan Liang, Xu Li, Xiaolei Chen et al.
As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA into the InternVL2-2B framework to create SleighVL, a lightweight yet high-performing model. Extensive experiments demonstrate that SleighVL outperforms models with comparable parameters and remains competitive with larger models. Our work provides a promising direction for more efficient and contextually aware high-resolution image processing in LVLMs, advancing multimodal system development.
IMApr 2, 2024
CSST Strong Lensing Preparation: a Framework for Detecting Strong Lenses in the Multi-color Imaging Survey by the China Survey Space Telescope (CSST)Xu Li, Ruiqi Sun, Jiameng Lv et al.
Strong gravitational lensing is a powerful tool for investigating dark matter and dark energy properties. With the advent of large-scale sky surveys, we can discover strong lensing systems on an unprecedented scale, which requires efficient tools to extract them from billions of astronomical objects. The existing mainstream lens-finding tools are based on machine learning algorithms and applied to cut-out-centered galaxies. However, according to the design and survey strategy of optical surveys by CSST, preparing cutouts with multiple bands requires considerable efforts. To overcome these challenges, we have developed a framework based on a hierarchical visual Transformer with a sliding window technique to search for strong lensing systems within entire images. Moreover, given that multi-color images of strong lensing systems can provide insights into their physical characteristics, our framework is specifically crafted to identify strong lensing systems in images with any number of channels. As evaluated using CSST mock data based on an Semi-Analytic Model named CosmoDC2, our framework achieves precision and recall rates of 0.98 and 0.90, respectively. To evaluate the effectiveness of our method in real observations, we have applied it to a subset of images from the DESI Legacy Imaging Surveys and media images from Euclid Early Release Observations. 61 new strong lensing system candidates are discovered by our method. However, we also identified false positives arising primarily from the simplified galaxy morphology assumptions within the simulation. This underscores the practical limitations of our approach while simultaneously highlighting potential avenues for future improvements.
CVApr 9, 2024
Dynamic Resolution Guidance for Facial Expression RecognitionSongpan Wang, Xu Li, Tianxiang Jiang et al.
Facial expression recognition (FER) is vital for human-computer interaction and emotion analysis, yet recognizing expressions in low-resolution images remains challenging. This paper introduces a practical method called Dynamic Resolution Guidance for Facial Expression Recognition (DRGFER) to effectively recognize facial expressions in images with varying resolutions without compromising FER model accuracy. Our framework comprises two main components: the Resolution Recognition Network (RRN) and the Multi-Resolution Adaptation Facial Expression Recognition Network (MRAFER). The RRN determines image resolution, outputs a binary vector, and the MRAFER assigns images to suitable facial expression recognition networks based on resolution. We evaluated DRGFER on widely-used datasets RAFDB and FERPlus, demonstrating that our method retains optimal model performance at each resolution and outperforms alternative resolution approaches. The proposed framework exhibits robustness against resolution variations and facial expressions, offering a promising solution for real-world applications.
LGFeb 2, 2024
DFML: Decentralized Federated Mutual LearningYasser H. Khalil, Amir H. Estiri, Mahdi Beitollahi et al.
In the realm of real-world devices, centralized servers in Federated Learning (FL) present challenges including communication bottlenecks and susceptibility to a single point of failure. Additionally, contemporary devices inherently exhibit model and data heterogeneity. Existing work lacks a Decentralized FL (DFL) framework capable of accommodating such heterogeneity without imposing architectural restrictions or assuming the availability of public data. To address these issues, we propose a Decentralized Federated Mutual Learning (DFML) framework that is serverless, supports nonrestrictive heterogeneous models, and avoids reliance on public data. DFML effectively handles model and data heterogeneity through mutual learning, which distills knowledge between clients, and cyclically varying the amount of supervision and distillation signals. Extensive experimental results demonstrate consistent effectiveness of DFML in both convergence speed and global accuracy, outperforming prevalent baselines under various conditions. For example, with the CIFAR-100 dataset and 50 clients, DFML achieves a substantial increase of +17.20% and +19.95% in global accuracy under Independent and Identically Distributed (IID) and non-IID data shifts, respectively.
SEMay 19, 2025
AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language ModelsShuyang Hou, Zhangxiao Shen, Huayi Wu et al.
Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline-from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs-including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models-revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
CVOct 26, 2025
LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration RenderingWenkai Zhu, Xu Li, Qimin Xu et al.
3D Gaussian Splatting SLAM has emerged as a widely used technique for high-fidelity mapping in spatial intelligence. However, existing methods often rely on a single representation scheme, which limits their performance in large-scale dynamic outdoor scenes and leads to cumulative pose errors and scale ambiguity. To address these challenges, we propose \textbf{LVD-GS}, a novel LiDAR-Visual 3D Gaussian Splatting SLAM system. Motivated by the human chain-of-thought process for information seeking, we introduce a hierarchical collaborative representation module that facilitates mutual reinforcement for mapping optimization, effectively mitigating scale drift and enhancing reconstruction robustness. Furthermore, to effectively eliminate the influence of dynamic objects, we propose a joint dynamic modeling module that generates fine-grained dynamic masks by fusing open-world segmentation with implicit residual constraints, guided by uncertainty estimates from DINO-Depth features. Extensive evaluations on KITTI, nuScenes, and self-collected datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.
CVSep 19, 2025
Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided ImportanceYuxuan Liang, Xu Li, Xiaolei Chen et al.
Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
CVSep 16, 2025
HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language ModelsXu Li, Yuxuan Liang, Xiaolei Chen et al.
By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
CVMay 26, 2025
MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question AnsweringXu Li, Fan Lyu
Continual Visual Question Answering (CVQA) based on pre-trained models(PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.
CVFeb 19, 2025
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video CaptioningCaihua Liu, Xu Li, Wenjing Xue et al.
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. However, object behavior is dynamic and complex. To comprehensively capture the essence of object behavior, we propose a dynamic action semantic-aware graph transformer. Firstly, a multi-scale temporal modeling module is designed to flexibly learn long and short-term latent action features. It not only acquires latent action features across time scales, but also considers local latent action details, enhancing the coherence and sensitiveness of latent action representations. Secondly, a visual-action semantic aware module is proposed to adaptively capture semantic representations related to object behavior, enhancing the richness and accurateness of action representations. By harnessing the collaborative efforts of these two modules,we can acquire rich behavior representations to generate human-like natural descriptions. Finally, this rich behavior representations and object representations are used to construct a temporal objects-action graph, which is fed into the graph transformer to model the complex temporal dependencies between objects and actions. To avoid adding complexity in the inference phase, the behavioral knowledge of the objects will be distilled into a simple network through knowledge distillation. The experimental results on MSVD and MSR-VTT datasets demonstrate that the proposed method achieves significant performance improvements across multiple metrics.
CVJan 25, 2025
Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot ClassificationZhongqi Wang, Jia Dai, Kai Li et al.
Vision language model (VLM) has been designed for large scale image-text alignment as a pretrained foundation model. For downstream few shot classification tasks, parameter efficient fine-tuning (PEFT) VLM has gained much popularity in the computer vision community. PEFT methods like prompt tuning and linear adapter have been studied for fine-tuning VLM while low rank adaptation (LoRA) algorithm has rarely been considered for few shot fine-tuning VLM. The main obstacle to use LoRA for few shot fine-tuning is the catastrophic forgetting problem. Because the visual language alignment knowledge is important for the generality in few shot learning, whereas low rank adaptation interferes with the most informative direction of the pretrained weight matrix. We propose the complementary subspace low rank adaptation (Comp-LoRA) method to regularize the catastrophic forgetting problem in few shot VLM finetuning. In detail, we optimize the low rank matrix in the complementary subspace, thus preserving the general vision language alignment ability of VLM when learning the novel few shot information. We conduct comparison experiments of the proposed Comp-LoRA method and other PEFT methods on fine-tuning VLM for few shot classification. And we also present the suppression on the catastrophic forgetting problem of our proposed method against directly applying LoRA to VLM. The results show that the proposed method surpasses the baseline method by about +1.0\% Top-1 accuracy and preserves the VLM zero-shot performance over the baseline method by about +1.3\% Top-1 accuracy.
SDJun 4, 2024
MaskSR: Masked Language Model for Full-band Speech RestorationXu Li, Qirui Wang, Xiaoyu Liu
Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions. Although several deep learning paradigms have been studied for this task, the power of the recently emerging language models has not been fully explored. In this paper, we propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens extracted using a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens extracted from the high quality target speech, conditioned on the corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR obtains competitive results on both the full-band speech restoration task and also on sub-tasks compared with a wide range of models.
SDSep 1, 2023
Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-trainingShaohuan Zhou, Xu Li, Zhiyong Wu et al.
The single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that are out of the singer's vocal range or associated with limited training samples. Based on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method conducted on a multi-singer dataset to enhance the vocal range of the single-speaker, while not degrading the timbre similarity. This pre-training method can be deployed to a large-scale multi-singer dataset, which only contains audio-and-lyrics pairs without phonemic timing information and pitch annotation. Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information. These pre-trained model parameters are delivered into the fine-tuning step as prior knowledge to enhance the single speaker's vocal range. Moreover, this work also contributes to improving the sound quality and rhythm naturalness of the synthesized singing voices. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice, and a bi-directional flow model to improve the sound quality. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
SDNov 8, 2021
Characterizing the adversarial vulnerability of speech self-supervised learningHaibin Wu, Bo Zheng, Xu Li et al.
A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.
LGOct 26, 2021
Arbitrary Distribution Modeling with Censorship in Real-Time Bidding AdvertisingXu Li, Michelle Ma Zhang, Youjun Tong et al.
The purpose of Inventory Pricing is to bid the right prices to online ad opportunities, which is crucial for a Demand-Side Platform (DSP) to win advertising auctions in Real-Time Bidding (RTB). In the planning stage, advertisers need the forecast of probabilistic models to make bidding decisions. However, most of the previous works made strong assumptions on the distribution form of the winning price, which reduced their accuracy and weakened their ability to make generalizations. Though some works recently tried to fit the distribution directly, their complex structure lacked efficiency on online inference. In this paper, we devise a novel loss function, Neighborhood Likelihood Loss (NLL), collaborating with a proposed framework, Arbitrary Distribution Modeling (ADM), to predict the winning price distribution under censorship with no pre-assumption required. We conducted experiments on two real-world experimental datasets and one large-scale, non-simulated production dataset in our system. Experiments showed that ADM outperformed the baselines both on algorithm and business metrics. By replaying historical data of the production environment, this method was shown to lead to good yield in our system. Without any pre-assumed specific distribution form, ADM showed significant advantages in effectiveness and efficiency, demonstrating its great capability in modeling sophisticated price landscapes.
CVAug 2, 2021
S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for VisionTan Yu, Xu Li, Yunfeng Cai et al.
Recently, MLP-based vision backbones emerge. MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, spatial-shift MLP (S$^2$-MLP), adopting the straightforward spatial-shift operation, achieves better performance than the pioneering works including MLP-mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) achieve better performance than S$^2$-MLP. In this paper, we improve the S$^2$-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on split parts. Meanwhile, we exploit the split-attention operation to fuse these split parts. Moreover, like the counterparts, we adopt smaller-scale patches and use a pyramid structure for boosting the image recognition accuracy. We term the improved spatial-shift MLP vision backbone as S$^2$-MLPv2. Using 55M parameters, our medium-scale model, S$^2$-MLPv2-Medium achieves an $83.6\%$ top-1 accuracy on the ImageNet-1K benchmark using $224\times 224$ images without self-attention and external training data.
ASJul 19, 2021
Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech AttacksXu Li, Xixin Wu, Hui Lu et al.
Existing approaches for anti-spoofing in automatic speaker verification (ASV) still lack generalizability to unseen attacks. The Res2Net approach designs a residual-like connection between feature groups within one block, which increases the possible receptive fields and improves the system's detection generalizability. However, such a residual-like connection is performed by a direct addition between feature groups without channel-wise priority. We argue that the information across channels may not contribute to spoofing cues equally, and the less relevant channels are expected to be suppressed before adding onto the next feature group, so that the system can generalize better to unseen attacks. This argument motivates the current work that presents a novel, channel-wise gated Res2Net (CG-Res2Net), which modifies Res2Net to enable a channel-wise gating mechanism in the connection between feature groups. This gating mechanism dynamically selects channel-wise features based on the input, to suppress the less relevant channels and enhance the detection generalizability. Three gating mechanisms with different structures are proposed and integrated into Res2Net. Experimental results conducted on ASVspoof 2019 logical access (LA) demonstrate that the proposed CG-Res2Net significantly outperforms Res2Net on both the overall LA evaluation set and individual difficult unseen attacks, which also outperforms other state-of-the-art single systems, depicting the effectiveness of our method.
SDJul 7, 2021
VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech SynthesisHui Lu, Zhiyong Wu, Xixin Wu et al.
This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. The autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient with the parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed model of VAENAR-TTS is an end-to-end approach that does not require phoneme-level durations. The VAENAR-TTS model does not contain recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while the synthesis speed is comparable with other NAR-TTS models.
CVJun 28, 2021
Rethinking Token-Mixing MLP for MLP-based Vision BackboneTan Yu, Xu Li, Yunfeng Cai et al.
In the past decade, we have witnessed rapid progress in the machine vision backbone. By introducing the inductive bias from the image processing, convolution neural network (CNN) has achieved excellent performance in numerous computer vision tasks and has been established as \emph{de facto} backbone. In recent years, inspired by the great success achieved by Transformer in NLP tasks, vision Transformer models emerge. Using much less inductive bias, they have achieved promising performance in computer vision tasks compared with their CNN counterparts. More recently, researchers investigate using the pure-MLP architecture to build the vision backbone to further reduce the inductive bias, achieving good performance. The pure-MLP backbone is built upon channel-mixing MLPs to fuse the channels and token-mixing MLPs for communications between patches. In this paper, we re-think the design of the token-mixing MLP. We discover that token-mixing MLPs in existing MLP-based backbones are spatial-specific, and thus it is sensitive to spatial translation. Meanwhile, the channel-agnostic property of the existing token-mixing MLPs limits their capability in mixing tokens. To overcome those limitations, we propose an improved structure termed as Circulant Channel-Specific (CCS) token-mixing MLP, which is spatial-invariant and channel-specific. It takes fewer parameters but achieves higher classification accuracy on ImageNet1K benchmark.
CVJun 14, 2021
S$^2$-MLP: Spatial-Shift MLP Architecture for VisionTan Yu, Xu Li, Yunfeng Cai et al.
Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our S$^2$-MLP only contains channel-mixing MLP. We utilize a spatial-shift operation for communications between patches. It has a local reception field and is spatial-agnostic. It is parameter-free and efficient for computation. The proposed S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S$^2$-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.