Ruiqi Li

AS
h-index29
27papers
908citations
Novelty55%
AI Score62

27 Papers

ASAug 29, 2024Code
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Shengpeng Ji, Ziyue Jiang, Wen Wang et al.

Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1)extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2)improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.

SDSep 14, 2023
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Yongqi Wang, Jionghao Bai, Rongjie Huang et al. · meta-ai

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .

ASSep 20, 2024Code
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

Yu Zhang, Changhao Pan, Wenxiang Guo et al.

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The demos can be found at http://aaronz345.github.io/GTSingerDemo/. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/AaronZ345/GTSinger and https://github.com/AaronZ345/GTSinger.

IVApr 5, 2023
Zero-shot Medical Image Translation via Frequency-Guided Diffusion Models

Yunxiang Li, Hua-Chieh Shao, Xiao Liang et al.

Recently, the diffusion model has emerged as a superior generative model that can produce high quality and realistic images. However, for medical image translation, the existing diffusion models are deficient in accurately retaining structural information since the structure details of source domain images are lost during the forward diffusion process and cannot be fully recovered through learned reverse diffusion, while the integrity of anatomical structures is extremely important in medical images. For instance, errors in image translation may distort, shift, or even remove structures and tumors, leading to incorrect diagnosis and inadequate treatments. Training and conditioning diffusion models using paired source and target images with matching anatomy can help. However, such paired data are very difficult and costly to obtain, and may also reduce the robustness of the developed model to out-of-distribution testing data. We propose a frequency-guided diffusion model (FGDM) that employs frequency-domain filters to guide the diffusion model for structure-preserving image translation. Based on its design, FGDM allows zero-shot learning, as it can be trained solely on the data from the target domain, and used directly for source-to-target domain translation without any exposure to the source-domain data during training. We evaluated it on three cone-beam CT (CBCT)-to-CT translation tasks for different anatomical sites, and a cross-institutional MR imaging translation task. FGDM outperformed the state-of-the-art methods (GAN-based, VAE-based, and diffusion-based) in metrics of Frechet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM), showing its significant advantages in zero-shot medical image translation.

ASMay 29
Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Ke Lei, Yu Zhang, Changhao Pan et al.

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

CVFeb 9Code
ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo, Qijun Gan, Yifu Zhang et al.

Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continue pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.

ASSep 24, 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

Yu Zhang, Ziyue Jiang, Ruiqi Li et al.

Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S\&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at https://aaronz345.github.io/TCSingerDemo/.

ASJul 2, 2024
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Ruiqi Li, Zhiqing Hong, Yongqi Wang et al.

Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.

CLJul 17, 2023
Automated Action Model Acquisition from Narrative Texts

Ruiqi Li, Leyang Cui, Songtuan Lin et al.

Action models, which take the form of precondition/effect axioms, facilitate causal and motivational connections between actions for AI agents. Action model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Acquiring action models from narrative texts in an automated way is essential, but challenging because of the inherent complexities of such texts. We present NaRuto, a system that extracts structured events from narrative text and subsequently generates planning-language-style action models based on predictions of commonsense event relations, as well as textual contradictions and similarities, in an unsupervised manner. Experimental results in classical narrative planning domains show that NaRuto can generate action models of significantly better quality than existing fully automated methods, and even on par with those of semi-automated methods.

CLApr 4, 2023
EDeR: A Dataset for Exploring Dependency Relations Between Events

Ruiqi Li, Patrik Haslum, Leyang Cui

Relation extraction is a central task in natural language processing (NLP) and information retrieval (IR) research. We argue that an important type of relation not explored in NLP or IR research to date is that of an event being an argument - required or optional - of another event. We introduce the human-annotated Event Dependency Relation dataset (EDeR) which provides this dependency relation. The annotation is done on a sample of documents from the OntoNotes dataset, which has the added benefit that it integrates with existing, orthogonal, annotations of this dataset. We investigate baseline approaches for predicting the event dependency relation, the best of which achieves an accuracy of 82.61 for binary argument/non-argument classification. We show that recognizing this relation leads to more accurate event extraction (semantic role labelling) and can improve downstream tasks that depend on this, such as co-reference resolution. Furthermore, we demonstrate that predicting the three-way classification into the required argument, optional argument or non-argument is a more challenging task.

ASApr 27, 2025Code
Versatile Framework for Song Generation with Prompt-based Control

Yu Zhang, Wenxiang Guo, Changhao Pan et al.

Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.

LGMay 22, 2024Code
FAITH: Frequency-domain Attention In Two Horizons for Time Series Forecasting

Ruiqi Li, Maowei Jiang, Kai Wang et al.

Time Series Forecasting plays a crucial role in various fields such as industrial equipment maintenance, meteorology, energy consumption, traffic flow and financial investment. However, despite their considerable advantages over traditional statistical approaches, current deep learning-based predictive models often exhibit a significant deviation between their forecasting outcomes and the ground truth. This discrepancy is largely due to an insufficient emphasis on extracting the sequence's latent information, particularly its global information within the frequency domain and the relationship between different variables. To address this issue, we propose a novel model Frequency-domain Attention In Two Horizons, which decomposes time series into trend and seasonal components using a multi-scale sequence adaptive decomposition and fusion architecture, and processes them separately. FAITH utilizes Frequency Channel feature Extraction Module and Frequency Temporal feature Extraction Module to capture inter-channel relationships and temporal global information in the sequence, significantly improving its ability to handle long-term dependencies and complex patterns. Furthermore, FAITH achieves theoretically linear complexity by modifying the time-frequency domain transformation method, effectively reducing computational costs. Extensive experiments on 6 benchmarks for long-term forecasting and 3 benchmarks for short-term forecasting demonstrate that FAITH outperforms existing models in many fields, such as electricity, weather and traffic, proving its effectiveness and superiority both in long-term and short-term time series forecasting tasks. Our codes and data are available at https://github.com/LRQ577/FAITH.

CVApr 4
ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse

Yunhao Yao, Zhiqiang Wang, Ruiqi Li et al.

As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.

ASDec 17, 2023
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Yu Zhang, Rongjie Huang, Ruiqi Li et al.

Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to perturb the style attributes within the content representation during the training phase and thus improve the model generalization. Our extensive evaluations in zero-shot style transfer undeniably establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Access to singing voice samples can be found at https://aaronz345.github.io/StyleSingerDemo/.

CRJan 12
MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Ruiqi Li, Zhiqiang Wang, Yunhao Yao et al.

To standardize interactions between LLM-based agents and their environments, the Model Context Protocol (MCP) was proposed and has since been widely adopted. However, integrating external tools expands the attack surface, exposing agents to tool poisoning attacks. In such attacks, malicious instructions embedded in tool metadata are injected into the agent context during MCP registration phase, thereby manipulating agent behavior. Prior work primarily focuses on explicit tool poisoning or relied on manually crafted poisoned tools. In contrast, we focus on a particularly stealthy variant: implicit tool poisoning, where the poisoned tool itself remains uninvoked. Instead, the instructions embedded in the tool metadata induce the agent to invoke a legitimate but high-privilege tool to perform malicious operations. We propose MCP-ITP, the first automated and adaptive framework for implicit tool poisoning within the MCP ecosystem. MCP-ITP formulates poisoned tool generation as a black-box optimization problem and employs an iterative optimization strategy that leverages feedback from both an evaluation LLM and a detection LLM to maximize Attack Success Rate (ASR) while evading current detection mechanisms. Experimental results on the MCPTox dataset across 12 LLM agents demonstrate that MCP-ITP consistently outperforms the manually crafted baseline, achieving up to 84.2% ASR while suppressing the Malicious Tool Detection Rate (MDR) to as low as 0.3%.

ASFeb 26, 2025
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Ziyue Jiang, Yi Ren, Ruiqi Li et al.

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces \textit{MegaTTS 3}, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.

SDMar 18, 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang, Ruofan Hu, Rongjie Huang et al.

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io .

SDOct 16, 2024
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Ruiqi Li, Siqi Zheng, Xize Cheng et al.

Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives. This paper presents MuVi, a novel framework that effectively addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that not only matches the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme to ensure synchronization, based on the periodicity nature of music phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, allowing us to control the style and genre of the generated music. Experimental results show that MuVi demonstrates superior performance in both audio quality and temporal synchronization. The generated music video samples are available at https://muvi-v2m.github.io.

ASApr 14, 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Zhiqing Hong, Rongjie Huang, Xize Cheng et al.

A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention was paid to explore song synthesis. In this work, we propose a novel task called text-to-song synthesis which incorporating both vocals and accompaniments generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found in https://text2songMelodist.github.io/Sample/.

CVOct 31, 2024
GaussianMarker: Uncertainty-Aware Copyright Protection of 3D Gaussian Splatting

Xiufeng Huang, Ruiqi Li, Yiu-ming Cheung et al.

3D Gaussian Splatting (3DGS) has become a crucial method for acquiring 3D assets. To protect the copyright of these assets, digital watermarking techniques can be applied to embed ownership information discreetly within 3DGS models. However, existing watermarking methods for meshes, point clouds, and implicit radiance fields cannot be directly applied to 3DGS models, as 3DGS models use explicit 3D Gaussians with distinct structures and do not rely on neural networks. Naively embedding the watermark on a pre-trained 3DGS can cause obvious distortion in rendered images. In our work, we propose an uncertainty-based method that constrains the perturbation of model parameters to achieve invisible watermarking for 3DGS. At the message decoding stage, the copyright messages can be reliably extracted from both 3D Gaussians and 2D rendered images even under various forms of 3D and 2D distortions. We conduct extensive experiments on the Blender, LLFF and MipNeRF-360 datasets to validate the effectiveness of our proposed method, demonstrating state-of-the-art performance on both message decoding accuracy and view synthesis quality.

LGJan 30, 2025
A Learnable Multi-views Contrastive Framework with Reconstruction Discrepancy for Medical Time-Series

Yifan Wang, Hongfeng Ai, Ruiqi Li et al.

In medical time series disease diagnosis, two key challenges are identified.First, the high annotation cost of medical data leads to overfitting in models trained on label-limited, single-center datasets. To address this, we propose incorporating external data from related tasks and leveraging AE-GAN to extract prior knowledge,providing valuable references for downstream tasks. Second, many existing studies employ contrastive learning to derive more generalized medical sequence representations for diagnostic tasks, usually relying on manually designed diverse positive and negative sample pairs.However, these approaches are complex, lack generalizability, and fail to adaptively capture disease-specific features across different conditions.To overcome this, we introduce LMCF (Learnable Multi-views Contrastive Framework), a framework that integrates a multi-head attention mechanism and adaptively learns representations from different views through inter-view and intra-view contrastive learning strategies.Additionally, the pre-trained AE-GAN is used to reconstruct discrepancies in the target data as disease probabilities, which are then integrated into the contrastive learning process.Experiments on three target datasets demonstrate that our method consistently outperforms seven other baselines, highlighting its significant impact on healthcare applications such as the diagnosis of myocardial infarction, Alzheimer's disease, and Parkinson's disease.

CVJul 31, 2025
Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment

Yifan Wang, Hongfeng Ai, Quangao Liu et al.

Vision Language Models (VLMs) face challenges in effectively coordinating diverse attention mechanisms for cross-modal embedding learning, leading to mismatched attention and suboptimal performance. We propose Consistent Cross-layer Regional Alignment (CCRA), which introduces Layer-Patch-wise Cross Attention (LPWCA) to capture fine-grained regional-semantic correlations by jointly weighting patch and layer-wise embedding, and Progressive Attention Integration (PAI) that systematically coordinates LPWCA, layer-wise, and patch-wise attention mechanisms in sequence. This progressive design ensures consistency from semantic to regional levels while preventing attention drift and maximizing individual attention benefits. Experimental results on ten diverse vision-language benchmarks demonstrate that our CCRA-enhanced LLaVA-v1.5-7B model achieves state-of-the-art performance, outperforming all baseline methods with only 3.55M additional parameters, while providing enhanced interpretability through more regionally focused and semantically aligned attention patterns.

LGJun 14, 2024
WindowMixer: Intra-Window and Inter-Window Modeling for Time Series Forecasting

Quangao Liu, Ruiqi Li, Maowei Jiang et al.

Time series forecasting (TSF) is crucial in fields like economic forecasting, weather prediction, traffic flow analysis, and public health surveillance. Real-world time series data often include noise, outliers, and missing values, making accurate forecasting challenging. Traditional methods model point-to-point relationships, which limits their ability to capture complex temporal patterns and increases their susceptibility to noise.To address these issues, we introduce the WindowMixer model, built on an all-MLP framework. WindowMixer leverages the continuous nature of time series by examining temporal variations from a window-based perspective. It decomposes time series into trend and seasonal components, handling them individually. For trends, a fully connected (FC) layer makes predictions. For seasonal components, time windows are projected to produce window tokens, processed by Intra-Window-Mixer and Inter-Window-Mixer modules. The Intra-Window-Mixer models relationships within each window, while the Inter-Window-Mixer models relationships between windows. This approach captures intricate patterns and long-range dependencies in the data.Experiments show WindowMixer consistently outperforms existing methods in both long-term and short-term forecasting tasks.

SDJun 1, 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Yongqi Wang, Wenxiang Guo, Rongjie Huang et al.

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and conducts sampling by solving ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with guided vector field, our model can generate decent audio in a few, or even only one sampling step. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.

CLSep 1, 2023
When Do Discourse Markers Affect Computational Sentence Understanding?

Ruiqi Li, Liesbeth Allein, Damien Sileo et al.

The capabilities and use cases of automatic natural language processing (NLP) have grown significantly over the last few years. While much work has been devoted to understanding how humans deal with discourse connectives, this phenomenon is understudied in computational systems. Therefore, it is important to put NLP models under the microscope and examine whether they can adequately comprehend, process, and reason within the complexity of natural language. In this chapter, we introduce the main mechanisms behind automatic sentence processing systems step by step and then focus on evaluating discourse connective processing. We assess nine popular systems in their ability to understand English discourse connectives and analyze how context and language understanding tasks affect their connective comprehension. The results show that NLP systems do not process all discourse connectives equally well and that the computational processing complexity of different connective kinds is not always consistently in line with the presumed complexity order found in human processing. In addition, while humans are more inclined to be influenced during the reading procedure but not necessarily in the final comprehension performance, discourse connectives have a significant impact on the final accuracy of NLP systems. The richer knowledge of connectives a system learns, the more negative effect inappropriate connectives have on it. This suggests that the correct explicitation of discourse connectives is important for computational natural language processing.

ASMay 8, 2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

Ruiqi Li, Rongjie Huang, Lichao Zhang et al.

The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings while facing a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation. This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment, which views speech variance such as pitch and content as different modalities. Inspired by the mechanism of how humans will sing the lyrics to the melody, AlignSTS: 1) adopts a novel rhythm adaptor to predict the target rhythm representation to bridge the modality gap between content and pitch, where the rhythm representation is computed in a simple yet effective way and is quantized into a discrete space; and 2) uses the predicted rhythm representation to re-align the content based on cross-attention and conducts a cross-modal fusion for re-synthesize. Extensive experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://alignsts.github.io.

LGMay 24, 2021
Latent Space Arc Therapy Optimization

Noah Bice, Mohamad Fakhreddine, Ruiqi Li et al.

Volumetric modulated arc therapy planning is a challenging problem in high-dimensional, non-convex optimization. Traditionally, heuristics such as fluence-map-optimization-informed segment initialization use locally optimal solutions to begin the search of the full arc therapy plan space from a reasonable starting point. These routines facilitate arc therapy optimization such that clinically satisfactory radiation treatment plans can be created in about 10 minutes. However, current optimization algorithms favor solutions near their initialization point and are slower than necessary due to plan overparameterization. In this work, arc therapy overparameterization is addressed by reducing the effective dimension of treatment plans with unsupervised deep learning. An optimization engine is then built based on low-dimensional arc representations which facilitates faster planning times.