Yuchen Cao

h-index: 10

Papers: 18

Citations: 469

Novelty: 40%

AI Score: 54

Ranked #28,307 of 201,326 authors (top 14%); #11,656 in CV (top 20%)

18 Papers

SD · May 26

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Hanlin Zhang, Daxin Tan, Dehua Tao et al.

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to reduce the number of sampling steps to 4 and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.

CV · Aug 24, 2023

Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

Siming Fu, Xiaoxuan He, Xinpeng Ding et al.

Recently, large-scale pre-trained vision-language models have presented benefits for alleviating class imbalance in long-tailed recognition. However, the long-tailed data distribution can corrupt the representation space, where the distance between head and tail categories is much larger than the distance between two tail categories. This uneven feature space distribution causes the model to exhibit unclear and inseparable decision boundaries on the uniformly distributed test set, which lowers its performance. To address these challenges, we propose the uniformly distributed category prototype-guided vision-language framework to effectively mitigate feature space bias caused by data imbalance. Specifically, we generate a set of category prototypes uniformly distributed on a hypersphere. A category prototype-guided mechanism for image-text matching makes the features of different classes converge to these distinct and uniformly distributed category prototypes, which maintains a uniform distribution in the feature space and improves class boundaries. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore irrelevant noisy text and focus more on key attribute information, thereby enhancing the robustness of our framework. In the image recognition fine-tuning stage, to address the positive bias problem of the learnable classifier, we design the class feature prototype-guided classifier, which compensates for the performance of tail classes while maintaining the performance of head classes. Our method outperforms previous vision-language methods for long-tailed learning by a large margin and achieves state-of-the-art performance.
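To make the idea of evenly spread category prototypes concrete, here is a minimal sketch. It does not reproduce the paper's prototype generation, which the abstract does not specify. It assumes the feature dimension is at least the number of classes and uses the closed-form simplex construction, in which every pair of unit-norm prototypes has the same cosine similarity of -1/(K-1). The names `simplex_prototypes` and `cosine` are hypothetical.

```python
import math


def simplex_prototypes(num_classes):
    """Return num_classes unit vectors with identical pairwise cosine.

    Assumption: a closed-form simplex layout stands in for whatever
    prototype generation the paper actually uses.
    """
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    # Row i is the centered one-hot vector e_i - (1/k) * ones, rescaled to unit norm.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


prototypes = simplex_prototypes(4)
pairwise = cosine(prototypes[0], prototypes[1])  # equals -1/3 for four classes
```

With prototypes fixed this way, a head class and a tail class get the same angular separation from every other class, which is the property the abstract attributes to uniformly distributed prototypes.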

CL · Mar 16, 2025 · Code

CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences

Ziran Qin, Yuchen Cao, Mingbao Lin et al.

Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.
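The "cake-slicing" framing can be illustrated with a small sketch of splitting a global KV cache budget across layers and then keeping the highest-scoring tokens in each layer. This is only an illustration under stated assumptions: the per-layer preference scores and per-token importance scores are given as inputs, the split is plainly proportional, and CAKE's actual preference metric, cascading memory management, and eviction indicator are not reproduced. The function names `allocate_budget` and `evict` are hypothetical and do not come from the cakekv repository.

```python
def allocate_budget(preferences, total_budget):
    """Split total_budget cache slots across layers in proportion to preferences.

    Assumption: preferences are non-negative scores, one per layer. Rounding
    uses the largest-remainder rule so the sizes sum exactly to total_budget.
    """
    total = sum(preferences)
    raw = [p / total * total_budget for p in preferences]
    sizes = [int(r) for r in raw]  # floor, since all values are non-negative
    leftover = total_budget - sum(sizes)
    # Hand the remaining slots to the layers with the largest fractional parts.
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


def evict(token_scores, keep):
    """Return the positions of the `keep` highest-scoring tokens, in order."""
    ranked = sorted(
        range(len(token_scores)), key=lambda i: token_scores[i], reverse=True
    )
    return sorted(ranked[:keep])


layer_preferences = [3.0, 1.0, 1.0, 0.5]  # hypothetical per-layer scores
budgets = allocate_budget(layer_preferences, total_budget=100)
kept_positions = evict([0.1, 0.9, 0.3, 0.7], keep=2)
```

Layers with higher preference scores receive a larger slice of the fixed budget, while the total never exceeds the memory limit.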

LG · Mar 22, 2025 · Code

Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Codefuse, Ling Team, Wenting Cai et al.

Recent advancements in code large language models (LLMs) have demonstrated remarkable capabilities in code generation and understanding. It is still challenging to build a code LLM with comprehensive performance yet ultimate efficiency. Many attempts have been released in the open source community to break the trade-off between performance and efficiency, such as the Qwen Coder series and the DeepSeek Coder series. This paper introduces yet another attempt in this area, namely Ling-Coder-Lite. We leverage the efficient Mixture-of-Experts (MoE) architecture along with a set of high-quality data curation methods (especially those based on program analytics) to build an efficient yet powerful code LLM. Ling-Coder-Lite exhibits on-par performance on 12 representative coding benchmarks compared to state-of-the-art models of similar size, such as Qwen2.5-Coder-7B and DeepSeek-Coder-V2-Lite, while offering competitive latency and throughput. In practice, we achieve a 50% reduction in deployment resources compared to the similar-sized dense model without performance loss. To facilitate further research and development in this area, we open-source our models as well as a substantial portion of high-quality data for the annealing and post-training stages. The models and data can be accessed at https://huggingface.co/inclusionAI/Ling-Coder-lite.

CV · May 16

Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities

Chenglizhao Chen, Yuchen Cao, Xinyu Liu et al.

Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and quality imbalance. Existing methods generate features for missing modalities from the available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from the true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.

AI · May 12

Engagement Process: Rethinking the Temporal Interface of Action and Observation

Jialian Li, Yuchen Cao, Junhong Liu et al.

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

CL · Mar 3, 2025

Efficient or Powerful? Trade-offs Between Machine Learning and Deep Learning for Mental Illness Detection on Social Media

Zhanyi Ding, Zhongyan Wang, Yeyubei Zhang et al.

Social media platforms provide valuable insights into mental health trends by capturing user-generated discussions on conditions such as depression, anxiety, and suicidal ideation. Machine learning (ML) and deep learning (DL) models have been increasingly applied to classify mental health conditions from textual data, but selecting the most effective model involves trade-offs in accuracy, interpretability, and computational efficiency. This study evaluates multiple ML models, including logistic regression, random forest, and LightGBM, alongside deep learning architectures such as ALBERT and Gated Recurrent Units (GRUs), for both binary and multi-class classification of mental health conditions. Our findings indicate that ML and DL models achieve comparable classification performance on medium-sized datasets, with ML models offering greater interpretability through variable importance scores, while DL models are more robust to complex linguistic patterns. Additionally, ML models require explicit feature engineering, whereas DL models learn hierarchical representations directly from text. Logistic regression provides the advantage of capturing both positive and negative associations between features and mental health conditions, whereas tree-based models prioritize decision-making power through split-based feature selection. This study offers empirical insights into the advantages and limitations of different modeling approaches and provides recommendations for selecting appropriate methods based on dataset size, interpretability needs, and computational constraints.

LG · Oct 21, 2024

Machine Learning Approaches for Mental Illness Detection on Social Media: A Systematic Review of Biases and Methodological Challenges

Yuchen Cao, Jianglai Dai, Zhongyan Wang et al.

The global increase in mental illness requires innovative detection methods for early intervention. Social media provides a valuable platform to identify mental illness through user-generated content. This systematic review examines machine learning (ML) models for detecting mental illness, with a particular focus on depression, using social media data. It highlights biases and methodological challenges encountered throughout the ML lifecycle. A search of PubMed, IEEE Xplore, and Google Scholar identified 47 relevant studies published after 2010. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was utilized to assess methodological quality and risk of bias. The review reveals significant biases affecting model reliability and generalizability. A predominant reliance on Twitter (63.8%) and English-language content (over 90%) limits diversity, with most studies focused on users from the United States and Europe. Non-probability sampling (80%) limits representativeness. Only 23% explicitly addressed linguistic nuances like negations, crucial for accurate sentiment analysis. Inconsistent hyperparameter tuning (27.7%) and inadequate data partitioning (17%) risk overfitting. While 74.5% used appropriate evaluation metrics for imbalanced data, others relied on accuracy without addressing class imbalance, potentially skewing results. Reporting transparency varied, often lacking critical methodological details. These findings highlight the need to diversify data sources, standardize preprocessing, ensure consistent model development, address class imbalance, and enhance reporting transparency. By overcoming these challenges, future research can develop more robust and generalizable ML models for depression detection on social media, contributing to improved mental health outcomes globally.
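The concern about relying on accuracy under class imbalance can be shown with a short worked example. The data below are made up for illustration (5 positive and 95 negative posts, with a classifier that always predicts the negative class) and are not taken from any reviewed study; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.95 even though the classifier finds none of the positive cases, so recall and F1 are both 0. This is why the review treats metrics designed for imbalanced data as more appropriate than accuracy alone.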

LG · Oct 26, 2024

A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases

Yunchong Liu, Xiaorui Shen, Yeyubei Zhang et al.

Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.

CL · Feb 3, 2025

Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection

Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding et al.

Social media has become an important source for understanding mental health, providing researchers with a way to detect conditions like depression from user-generated posts. This tutorial provides practical guidance to address common challenges in applying machine learning and deep learning methods for mental health detection on these platforms. It focuses on strategies for working with diverse datasets, improving text preprocessing, and addressing issues such as imbalanced data and model evaluation. Real-world examples and step-by-step instructions demonstrate how to apply these techniques effectively, with an emphasis on transparency, reproducibility, and ethical considerations. By sharing these approaches, this tutorial aims to help researchers build more reliable and widely applicable models for mental health research, contributing to better tools for early detection and intervention.

SE · Apr 6

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

Yuchen Cao, Hanlin Zhang, Jacky Wai Keung et al.

Large language models (LLMs) are increasingly used as quantitative research copilots to translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap for benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports multi-dimensional scorecards for spec fidelity, risk discipline, reliability, and out-of-sample robustness indicators, together with cost-effectiveness signals. We evaluate 17 models across 12 strategies. Top models achieve validity above 91.7 percent with strong aggregate scores, but evidence-driven iteration also induces code convergence by Iter2. These findings suggest that LLM iteration complements rather than replaces human quantitative researcher governance: LLMs excel at rapid prototyping and shallow bug fixes, while human oversight remains essential for critical strategies requiring solution diversity and ensemble robustness.

LG · Aug 15, 2025

SeamlessFlow: A Trainer Agent Isolation RL Framework Achieving Bubble-Free Pipelines via Tag Scheduling

Jinghui Wang, Shaojie Wang, Yinghan Cui et al.

We introduce SeamlessFlow, a server-based reinforcement learning (RL) framework that addresses two core challenges in industrial-scale RL: (1) decoupling RL training from the complex execution flow of agents; (2) maximizing GPU utilization with minimal idle time while preserving the stability and scalability required for large-scale deployments. First, SeamlessFlow introduces a data plane that decouples the RL trainer from diverse, complex agent implementations while sustaining high throughput. A central trajectory manager maintains complete interaction histories and supports partial rollout, allowing rollout to pause for weight updates and resume seamlessly, keeping agents unaware of service interruptions. Second, we propose a tag-driven scheduling paradigm that abstracts hardware into capability-tagged resources, unifying colocated and disaggregated architectures. Based on this, SeamlessFlow introduces a spatiotemporal multiplexing pipeline that dynamically reassigns idle training nodes to rollout in a setup where training and rollout are separated, eliminating pipeline bubbles and fully exploiting heterogeneous cluster resources. By combining these innovations, SeamlessFlow delivers both stability and high performance, making it well suited for multi-agent, long-horizon, and other complex RL tasks.

CV · Sep 12, 2021

CropDefender: deep watermark which is more convenient to train and more robust against cropping

Jiayu Ding, Yuchen Cao, Changhao Shi

Digital image watermarking, a technique for invisibly embedding information into an image, is used in fields such as property rights protection. In recent years, some research has proposed using neural networks to add watermarks to natural images. We take StegaStamp as the example for our research. Whether facing traditional image edits, such as brightness, contrast, or saturation adjustment, or style changes, such as 1-bit conversion or GAN-based changes, StegaStamp is far more robust than traditional watermarking techniques. It still has two drawbacks: it is vulnerable to cropping and is hard to train. We found that the cause of the vulnerability to cropping is not the loss of information at the edges but the movement of the watermark position. By explicitly introducing cropping perturbations into training, the cropping resistance is significantly improved. To make training easier, we introduce instance normalization to address vanishing gradients, set the loss weights as learnable parameters to reduce the number of hyperparameters, and use a sigmoid to restrict the pixel values of the generated image.

NUCL-TH · Jan 16, 2020

Quantified limits of the nuclear landscape

Léo Neufcourt, Yuchen Cao, Samuel A. Giuliani et al.

The chart of the nuclides is limited by particle drip lines beyond which nuclear stability to proton or neutron emission is lost. Predicting the range of particle-bound isotopes poses an appreciable challenge for nuclear theory as it involves extreme extrapolations of nuclear masses beyond the regions where experimental information is available. Still, quantified extrapolations are crucial for a variety of applications, including the modeling of stellar nucleosynthesis. We use microscopic nuclear mass models and Bayesian methodology to provide quantified predictions of proton and neutron separation energies as well as Bayesian probabilities of existence throughout the nuclear landscape all the way to the particle drip lines. We apply nuclear density functional theory with several energy density functionals. To account for uncertainties, Bayesian Gaussian processes are trained on the separation-energy residuals for each individual model, and the resulting predictions are combined via Bayesian model averaging. This framework makes it possible to account for systematic and statistical uncertainties and propagate them to extrapolative predictions. We characterize the drip-line regions where the probability that the nucleus is particle-bound decreases from $1$ to $0$. In these regions, we provide quantified predictions for one- and two-nucleon separation energies. According to our Bayesian model averaging analysis, 7759 nuclei with $Z\leq 119$ have a probability of existence $\geq 0.5$. The extrapolations obtained in this study will be put through stringent tests when new experimental information on exotic nuclei becomes available. In this respect, the quantified landscape of nuclear existence obtained in this study should be viewed as a dynamical prediction that will be fine-tuned when new experimental information and improved global mass models become available.

NUCL-TH · Oct 28, 2019

Beyond the proton drip line: Bayesian analysis of proton-emitting nuclei

Léo Neufcourt, Yuchen Cao, Samuel Giuliani et al.

The limits of the nuclear landscape are determined by nuclear binding energies. Beyond the proton drip lines, where the separation energy becomes negative, there is not enough binding energy to prevent protons from escaping the nucleus. Predicting properties of unstable nuclear states in the vast territory of proton emitters poses an appreciable challenge for nuclear theory as it often involves far extrapolations. In addition, significant discrepancies between nuclear models in the proton-rich territory call for quantified predictions. With the help of Bayesian methodology, we mix a family of nuclear mass models corrected with statistical emulators trained on the experimental mass measurements, in the proton-rich region of the nuclear chart. Separation energies were computed within nuclear density functional theory using several Skyrme and Gogny energy density functionals. We also considered mass predictions based on two models used in astrophysical studies. Quantified predictions were obtained for each model using Bayesian Gaussian processes trained on separation-energy residuals and combined via Bayesian model averaging. We obtained a good agreement between averaged predictions of statistically corrected models and experiment. In particular, we quantified model results for one- and two-proton separation energies and derived probabilities of proton emission. This information enabled us to produce a quantified landscape of proton-rich nuclei. The most promising candidates for two-proton decay studies have been identified. The methodology used in this work has broad applications to model-based extrapolations of various nuclear observables. It also provides a reliable uncertainty quantification of theoretical predictions.

NUCL-TH · Jan 22, 2019

Neutron drip line in the Ca region from Bayesian model averaging

Léo Neufcourt, Yuchen Cao, Witold Nazarewicz et al.

The region of heavy calcium isotopes forms the frontier of experimental and theoretical nuclear structure research where the basic concepts of nuclear physics are put to stringent test. The recent discovery of the extremely neutron-rich nuclei around $^{60}$Ca [Tarasov, 2018] and the experimental determination of masses for $^{55-57}$Ca [Michimasa, 2018] provide unique information about the binding energy surface in this region. To assess the impact of these experimental discoveries on the nuclear landscape's extent, we use global mass models and statistical machine learning to make predictions, with quantified levels of certainty, for bound nuclides between Si and Ti. Using a Bayesian model averaging analysis based on Gaussian-process-based extrapolations we introduce the posterior probability $p_{ex}$ for each nucleus to be bound to neutron emission. We find that extrapolations for drip-line locations, at which the nuclear binding ends, are consistent across the global mass models used, in spite of significant variations between their raw predictions. In particular, considering the current experimental information and current global mass models, we predict that $^{68}$Ca has an average posterior probability $p_{ex}\approx 76$% to be bound to two-neutron emission while the nucleus $^{61}$Ca is likely to decay by emitting a neutron ($p_{ex}\approx 46$%).
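One way to read the posterior probability $p_{ex}$ is as the probability that a separation energy is positive once the per-model predictions are mixed with model-averaging weights. The sketch below works under explicit assumptions that are not stated in the abstract: each model's prediction is Gaussian with a given mean and standard deviation, the models share a uniform prior, and the weights come from hypothetical log-evidence values. It illustrates the averaging arithmetic only and is not the authors' pipeline; all function names and numbers are made up.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bma_weights(log_evidences):
    """Normalize model weights from log evidences (uniform model prior assumed)."""
    peak = max(log_evidences)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(v - peak) for v in log_evidences]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def existence_probability(means, sigmas, weights):
    """P(separation energy > 0) under a weighted mixture of Gaussians."""
    return sum(
        w * normal_cdf(m / s) for m, s, w in zip(means, sigmas, weights)
    )


# Hypothetical inputs: two models with equal evidence and opposite predictions.
weights = bma_weights([-10.0, -10.0])
p_ex = existence_probability([1.0, -1.0], [1.0, 1.0], weights)
```

In this symmetric toy case the two models cancel and the mixture gives a probability of one half. In general, a model with stronger evidence pulls the averaged probability toward its own prediction.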

CV · Oct 11, 2018

Dense Object Reconstruction from RGBD Images with Embedded Deep Shape Representations

Lan Hu, Yuchen Cao, Peng Wu et al.

Most problems involving simultaneous localization and mapping can nowadays be solved using one of two fundamentally different approaches. The traditional approach is given by a least-squares objective, which minimizes many local photometric or geometric residuals over explicitly parametrized structure and camera parameters. Unmodeled effects violating the Lambertian surface assumption or geometric invariances of individual residuals are handled through statistical averaging or the addition of robust kernels and smoothness terms. Aiming at more accurate measurement models and the inclusion of higher-order shape priors, the community has more recently shifted its attention to deep end-to-end models for solving geometric localization and mapping problems. However, at test time, these feed-forward models ignore the more traditional geometric or photometric consistency terms, which limits their ability to recover fine details and can lead to complete failure in corner-case scenarios. With an application to dense object modeling from RGBD images, our work aims at taking the best of both worlds by embedding modern higher-order object shape priors into classical iterative residual minimization objectives. We demonstrate a general ability to improve mapping accuracy with respect to each modality alone, and present a successful application to real data.

NUCL-TH · Jun 1, 2018

Bayesian approach to model-based extrapolation of nuclear observables

Léo Neufcourt, Yuchen Cao, Witold Nazarewicz et al.

The mass, or binding energy, is the basic property of the atomic nucleus. It determines the nucleus's stability as well as its reaction and decay rates. Quantifying the nuclear binding is important for understanding the origin of elements in the universe. The astrophysical processes responsible for the nucleosynthesis in stars often take place far from the valley of stability, where experimental masses are not known. In such cases, missing nuclear information must be provided by theoretical predictions using extreme extrapolations. Bayesian machine learning techniques can be applied to improve predictions by taking full advantage of the information contained in the deviations between experimental and calculated masses. We consider 10 global models based on nuclear Density Functional Theory as well as two more phenomenological mass models. The emulators of S2n residuals and credibility intervals defining theoretical error bars are constructed using Bayesian Gaussian processes and Bayesian neural networks. We consider a large training dataset pertaining to nuclei whose masses were measured before 2003. For the testing datasets, we considered those exotic nuclei whose masses have been determined after 2003. We then carried out extrapolations towards the 2n dripline. While both Gaussian processes and Bayesian neural networks reduce the rms deviation from experiment significantly, GP offers a better and much more stable performance. The increase in the predictive power is quite astonishing: the resulting rms deviations from experiment on the testing dataset are similar to those of more phenomenological models. The empirical coverage probability curves we obtain match the reference values very well, which is highly desirable to ensure honesty of uncertainty quantification, and the estimated credibility intervals on predictions make it possible to evaluate the predictive power of individual models.