Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard BenchmarksMahdi Morafah, Saeed Vahidian, Chen Chen et al.
Though successful, federated learning presents new challenges for machine learning, especially when the issue of data heterogeneity, also known as Non-IID data, arises. To cope with the statistical heterogeneity, previous works incorporated a proximal term in local optimization or modified the model aggregation scheme at the server side or advocated clustered federated learning approaches where the central server groups agent population into clusters with jointly trainable data distributions to take the advantage of a certain level of personalization. While effective, they lack a deep elaboration on what kind of data heterogeneity and how the data heterogeneity impacts the accuracy performance of the participating clients. In contrast to many of the prior federated learning approaches, we demonstrate not only the issue of data heterogeneity in current setups is not necessarily a problem but also in fact it can be beneficial for the FL participants. Our observations are intuitive: (1) Dissimilar labels of clients (label skew) are not necessarily considered data heterogeneity, and (2) the principal angle between the agents' data subspaces spanned by their corresponding principal vectors of data is a better estimate of the data heterogeneity. Our code is available at https://github.com/MMorafah/FL-SC-NIID.
5.6CVApr 17
FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from ImagesSabab Ishraq, Aarushi Aarushi, Juncai Jiang et al. · deepmind
Humans routinely infer taste, smell, texture, and even sound from food images a phenomenon well studied in cognitive science. However, prior vision language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision language benchmark model to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually sensory inference.
TARGET: Federated Class-Continual Learning via Exemplar-Free DistillationJie Zhang, Chen Chen, Weiming Zhuang et al.
This paper focuses on an under-explored yet important problem: Federated Class-Continual Learning (FCCL), where new classes are dynamically added in federated learning. Existing FCCL works suffer from various limitations, such as requiring additional datasets or storing the private data from previous tasks. In response, we first demonstrate that non-IID data exacerbates catastrophic forgetting issue in FL. Then we propose a novel method called TARGET (federat\textbf{T}ed cl\textbf{A}ss-continual lea\textbf{R}nin\textbf{G} via \textbf{E}xemplar-free dis\textbf{T}illation), which alleviates catastrophic forgetting in FCCL while preserving client data privacy. Our proposed method leverages the previously trained global model to transfer knowledge of old tasks to the current task at the model level. Moreover, a generator is trained to produce synthetic data to simulate the global distribution of data on each client at the data level. Compared to previous FCCL methods, TARGET does not require any additional datasets or storing real data from previous tasks, which makes it ideal for data-sensitive scenarios.
3.7CVNov 17, 2022
Language-Assisted Deep Learning for Autistic Behaviors RecognitionAndong Deng, Taojiannan Yang, Chen Chen et al.
Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is of vital importance for the diagnosis of Autism and timely early intervention. However, the observation and recording during the treatment from the parents of autistic children may not be accurate and objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) technology can alleviate this issue to a large extent. Existing human action recognition models can now achieve persuasive performance on challenging activity datasets, e.g. daily activity, and sports activity. However, problem behaviors in children with ASD are very different from these general activities, and recognizing these problem behaviors via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e. Video Swin Transformer, on two autism behaviors datasets (SSBD and ESBD) and show that it can achieve high accuracy and outperform the previous methods by a large margin, demonstrating the feasibility of vision-based problem behaviors recognition. Moreover, we propose language-assisted training to further enhance the action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework by incorporating the "freely available" language description for each type of problem behavior. Experimental results demonstrate that incorporating additional language supervision can bring an obvious performance boost for the autism problem behaviors recognition task as compared to using the video information only (i.e. 3.49% improvement on ESBD and 1.46% on SSBD).
FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated LearningGuangyu Sun, Matias Mendieta, Jun Luo et al.
Personalized Federated Learning (PFL) represents a promising solution for decentralized learning in heterogeneous data environments. Partial model personalization has been proposed to improve the efficiency of PFL by selectively updating local model parameters instead of aggregating all of them. However, previous work on partial model personalization has mainly focused on Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can be applied to other popular models such as Vision Transformers (ViTs). In this work, we investigate where and how to partially personalize a ViT model. Specifically, we empirically evaluate the sensitivity to data distribution of each type of layer. Based on the insights that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix, which leverages plugins to transfer information from the aggregated model to the local client as a personalization. Finally, we evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness in improving the model's performance compared to several advanced PFL methods.
IDEAL: Query-Efficient Data-Free Learning from Black-box ModelsJie Zhang, Chen Chen, Lingjuan Lyu
Knowledge Distillation (KD) is a typical method for training a lightweight student model with the help of a well-trained teacher model. However, most KD methods require access to either the teacher's training data or model parameters, which is unrealistic. To tackle this problem, recent works study KD under data-free and black-box settings. Nevertheless, these works require a large number of queries to the teacher model, which incurs significant monetary and computational costs. To address these problems, we propose a novel method called \emph{query-effIcient Data-free lEarning from blAck-box modeLs} (IDEAL), which aims to query-efficiently learn from black-box model APIs to train a good student without any real data. In detail, IDEAL trains the student model in two stages: data generation and model distillation. Note that IDEAL does not require any query in the data generation stage and queries the teacher only once for each sample in the distillation stage. Extensive experiments on various real-world datasets show the effectiveness of the proposed IDEAL. For instance, IDEAL can improve the performance of the best baseline method DFME by 5.83% on CIFAR10 dataset with only 0.02x the query budget of DFME.
15.1LGOct 4, 2022
Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in Federated LearningGuangyu Sun, Umar Khalid, Matias Mendieta et al.
Federated learning (FL) has emerged as a promising paradigm for enabling the collaborative training of models without centralized access to the raw data on local devices. In the typical FL paradigm (e.g., FedAvg), model weights are sent to and from the server each round to participating clients. Recently, the use of small pre-trained models has been shown to be effective in federated learning optimization and improving convergence. However, recent state-of-the-art pre-trained models are getting more capable but also have more parameters, known as the "Foundation Models." In conventional FL, sharing the enormous model weights can quickly put a massive communication burden on the system, especially if more capable models are employed. Can we find a solution to enable those strong and readily available pre-trained models in FL to achieve excellent performance while simultaneously reducing the communication burden? To this end, we investigate the use of parameter-efficient fine-tuning in federated learning and thus introduce a new framework: FedPEFT. Specifically, we systemically evaluate the performance of FedPEFT across a variety of client stability, data distribution, and differential privacy settings. By only locally tuning and globally sharing a small portion of the model weights, significant reductions in the total communication overhead can be achieved while maintaining competitive or even better performance in a wide range of federated learning scenarios, providing insight into a new paradigm for practical and effective federated systems.
46.4CVFeb 26, 2025Code
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement LearningJiazhen Pan, Che Liu, Junde Wu et al.
Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice. Inference model is available at: https://huggingface.co/JZPeterPan/MedVLM-R1.
14.7AIJul 29, 2024
Apple Intelligence Foundation Language ModelsTom Gunter, Zirui Wang, Chong Wang et al.
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
9.6CLMay 22, 2024Code
Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queriesAdam Yang, Chen Chen, Konstantinos Pitas
State-of-the-art large language models are sometimes distributed as open-source software but are also increasingly provided as a closed-source service. These closed-source large-language models typically see the widest usage by the public, however, they often do not provide an estimate of their uncertainty when responding to queries. As even the best models are prone to ``hallucinating" false information with high confidence, a lack of a reliable estimate of uncertainty limits the applicability of these models in critical settings. We explore estimating the uncertainty of closed-source LLMs via multiple rephrasings of an original base query. Specifically, we ask the model, multiple rephrased questions, and use the similarity of the answers as an estimate of uncertainty. We diverge from previous work in i) providing rules for rephrasing that are simple to memorize and use in practice ii) proposing a theoretical framework for why multiple rephrased queries obtain calibrated uncertainty estimates. Our method demonstrates significant improvements in the calibration of uncertainty estimates compared to the baseline and provides intuition as to how query strategies should be designed for optimal test calibration.
19.7LGJul 30, 2025Code
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language ModelsJiazhen Pan, Bailiang Jian, Paul Hager et al.
Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80\%, 94\% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86\% of scenarios, cognitive-bias priming altered clinical recommendations in 81\% of fairness tests, and we identified hallucination rates exceeding 66\% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
6.2CVAug 20, 2025Code
Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation ModelsJiabo Huang, Chen Chen, Lingjuan Lyu
Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the ``imbalanced transfer'' issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.
Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented GenerationRuizhe Li, Chen Chen, Yuchen Hu et al.
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient-calculation or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
Multimodal Latent Fusion of ECG Leads for Early Assessment of Pulmonary HypertensionMohammod N. I. Suvon, Shuo Zhou, Prasun C. Tripathi et al.
Recent advancements in early assessment of pulmonary hypertension (PH) primarily focus on applying machine learning methods to centralized diagnostic modalities, such as 12-lead electrocardiogram (12L-ECG). Despite their potential, these approaches fall short in decentralized clinical settings, e.g., point-of-care and general practice, where handheld 6-lead ECG (6L-ECG) can offer an alternative but is limited by the scarcity of labeled data for developing reliable models. To address this, we propose a lead-specific electrocardiogram multimodal variational autoencoder (\textsc{LS-EMVAE}), which incorporates a hierarchical modality expert (HiME) fusion mechanism and a latent representation alignment loss. HiME combines mixture-of-experts and product-of-experts to enable flexible, adaptive latent fusion, while the alignment loss improves coherence among lead-specific and shared representations. To alleviate data scarcity and enhance representation learning, we adopt a transfer learning strategy: the model is first pre-trained on a large unlabeled 12L-ECG dataset and then fine-tuned on smaller task-specific labeled 6L-ECG datasets. We validate \textsc{LS-EMVAE} across two retrospective cohorts in a 6L-ECG setting: 892 subjects from the ASPIRE registry for (1) PH detection and (2) phenotyping pre-/post-capillary PH, and 16,416 subjects from UK Biobank for (3) predicting elevated pulmonary atrial wedge pressure, where it consistently outperforms unimodal and multimodal baseline methods and demonstrates strong generalizability and interpretability. The code is available at https://github.com/Shef-AIRE/LS-EMVAE.
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical TasksKai Zhang, Rong Zhou, Eashan Adhikarla et al.
Traditional biomedical artificial intelligence (AI) models, designed for specific tasks or modalities, often exhibit limited flexibility in real-world deployment and struggle to utilize holistic information. Generalist AI holds the potential to address these limitations due to its versatility in interpreting different data types and generating tailored outputs for diverse needs. However, existing biomedical generalist AI solutions are typically heavyweight and closed source to researchers, practitioners, and patients. Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model, designed as a generalist capable of performing various biomedical tasks. BiomedGPT achieved state-of-the-art results in 16 out of 25 experiments while maintaining a computing-friendly model scale. We also conducted human evaluations to assess the capabilities of BiomedGPT in radiology visual question answering, report generation, and summarization. BiomedGPT exhibits robust prediction ability with a low error rate of 3.8% in question answering, satisfactory performance with an error rate of 8.3% in writing complex radiology reports, and competitive summarization ability with a nearly equivalent preference score to human experts. Our method demonstrates that effective training with diverse data can lead to more practical biomedical AI for improving diagnosis and workflow efficiency.
26.3SDJan 27, 2025
Audio Large Language Models Can Be Descriptive Speech Quality EvaluatorsChen Chen, Yuchen Hu, Siyin Wang et al.
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In addition to the overall Mean Opinion Score (MOS), this corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. It also enables descriptive comparisons between two speech samples (A/B tests) with human-like judgment. Leveraging this corpus, we propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech and generating meaningful responses. Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. This work advances the comprehensive perception of speech signals by audio LLMs, contributing to the development of real-world auditory and sensory intelligent agents.
36.6LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
5.5CLApr 24, 2024
CORM: Cache Optimization with Recent Message for Large Language Model InferenceJincheng Dai, Zhuowei Huang, Haiyun Jiang et al.
Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70\% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA for further compression rate.
24.9IRSep 22, 2025
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late InteractionZilin Xiao, Qi Ma, Mengting Gu et al.
Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability DetectionMohammod N. I. Suvon, Prasun C. Tripathi, Wenrui Fan et al.
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
7.6CVNov 1, 2024
Replay-Free Continual Low-Rank Adaptation with Dynamic MemoryHuancheng Chen, Jingtao Li, Weiming Zhuang et al.
We revisit continual learning~(CL), which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. However, as the scale of these models increases, catastrophic forgetting remains a more serious challenge. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT), which focuses on fine-tuning only a small set of trainable parameters to adapt to downstream tasks, such as low-rank adaptation (LoRA). While LoRA achieves faster convergence and requires fewer trainable parameters, it has seldom been explored in the context of continual learning. To address this gap, we propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA), which introduces both an orthogonal LoRA adapter and a residual LoRA adapter parallel to pre-trained weights in each layer. These components are orchestrated by a dynamic memory mechanism to strike a balance between stability and plasticity. Additionally, we propose a scheme to predict task identity with confidence and calibrate the model's outputs accordingly. On ViT-based models, we demonstrate that DualLoRA offers significant advantages in accuracy, inference speed, and computation efficiency in training over existing CL methods across multiple benchmarks.
8.6IVAug 19, 2025
Latent Interpolation Learning Using Diffusion Models for Cardiac Volume ReconstructionNiklas Bubeck, Suprosanna Shit, Chen Chen et al.
Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatio and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
6.4CRFeb 23, 2025
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMsPeng Yifeng, Wu Zhizheng, Chen Chen
Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks: localized data poisoning that alters specific factual knowledge while preserving overall model utility. We systematically demonstrate these attacks exploit inherent architectural properties of LLMs, achieving 54.6% increased retrieval inaccuracy on long-tail knowledge versus dominant topics and up to 25.5% increase retrieval inaccuracy on compressed models versus original architectures. Through controlled mutations (e.g., temporal/spatial/entity alterations) and, our method induces localized memorization deterioration with negligible impact on models' performance on regular standard benchmarks (e.g., <2% performance drop on MMLU/GPQA), leading to potential detection evasion. Our findings suggest: (1) Disproportionate vulnerability in long-tail knowledge may result from reduced parameter redundancy; (2) Model compression may increase attack surfaces, with pruned/distilled models requiring 30% fewer poison samples for equivalent damage; (3) Associative memory enables both spread of collateral damage to related concepts and amplification of damage from simultaneous attack, particularly for dominant topics. These findings raise concerns over current scaling paradigms since attack costs are lowering while defense complexity is rising. Our work establishes poison pills as both a security threat and diagnostic tool, revealing critical security-efficiency trade-offs in language model compression that challenges prevailing safety assumptions.
5.8AIDec 21, 2024
Enhancing Conflict Resolution in Language Models via Abstract ArgumentationZhaoqun Li, Xiaotong Fang, Chen Chen et al.
In recent years, large language models (LLMs) have made significant advancements in developing human-like and engaging dialogue systems. However, in tasks such as consensus-building and persuasion, LLMs often struggle to resolve conflicts arising from incomplete or inconsistent information, revealing their limitations in real-world applications. Given these limitations, abstract argumentation, a specialized logical framework designed to resolve conflicts and inconsistencies, becomes particularly relevant. In this paper, we aim to enhance the conflict-solving capabilities of LLMs by leveraging formal abstract argumentation, integrating language model learning with symbolic computation. To achieve this, we develop and curate a dataset comprising diverse abstract argumentation frameworks, accompanied by detailed explanations of the argument acceptability computation process. Subsequently, we fine-tune LLMs on this dataset, focusing on abstract conflict resolution tasks. As a comparative baseline, LLMs are also evaluated using a chain-of-thought approach, however, they fail to solve the conflict-based arguments effectively. Our experiments demonstrate that process explanations play a crucial role in learning. Models trained with explanations exhibit superior generalization accuracy compared to those trained solely on question-answer pairs. Furthermore, leveraging LLMs' self-explanation capabilities, our approach provides detailed illustrations that mitigate the lack of transparency typically associated with neural networks.
10.8CLJun 2, 2024
Enhancing Zero-shot Text-to-Speech Synthesis with Human FeedbackChen Chen, Yuchen Hu, Wen Wu et al.
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches have kept human feedback isolated from training that resulted in mismatched training objectives and evaluation metrics. In this work, we investigate a novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically, UNO eliminates the need for a reward model or preference data by directly maximizing the utility of speech generations while considering the uncertainty that lies in the inherent variability in subjective human speech perception and evaluations. Experimental results of both subjective and objective evaluations demonstrate that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity. Additionally, we present a remarkable ability of UNO that it can adapt to the desired speaking style in emotional TTS seamlessly and flexibly.
22.0LGMar 14, 2024
Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data DomainsLei Wang, Jieming Bian, Letian Zhang et al.
Federated learning (FL) allows collaborative machine learning training without sharing private data. While most FL methods assume identical data domains across clients, real-world scenarios often involve heterogeneous data domains. Federated Prototype Learning (FedPL) addresses this issue, using mean feature vectors as prototypes to enhance model generalization. However, existing FedPL methods create the same number of prototypes for each client, leading to cross-domain performance gaps and disparities for clients with varied data distributions. To mitigate cross-domain feature representation variance, we introduce FedPLVM, which establishes variance-aware dual-level prototypes clustering and employs a novel $α$-sparsity prototype loss. The dual-level prototypes clustering strategy creates local clustered prototypes based on private data features, then performs global prototypes clustering to reduce communication complexity and preserve local data privacy. The $α$-sparsity prototype loss aligns samples from underrepresented domains, enhancing intra-class similarity and reducing inter-class similarity. Evaluations on Digit-5, Office-10, and DomainNet datasets demonstrate our method's superiority over existing approaches.
DENSE: Data-Free One-Shot Federated LearningJie Zhang, Chen Chen, Bo Li et al.
One-shot Federated Learning (FL) has recently emerged as a promising approach, which allows the central server to learn a model in a single communication round. Despite the low communication cost, existing one-shot FL methods are mostly impractical or face inherent limitations, \eg a public dataset is required, clients' models are homogeneous, and additional data/model information need to be uploaded. To overcome these issues, we propose a novel two-stage \textbf{D}ata-fre\textbf{E} o\textbf{N}e-\textbf{S}hot federated l\textbf{E}arning (DENSE) framework, which trains the global model by a data generation stage and a model distillation stage. DENSE is a practical one-shot FL method that can be applied in reality due to the following advantages: (1) DENSE requires no additional information compared with other methods (except the model parameters) to be transferred between clients and the server; (2) DENSE does not require any auxiliary dataset for training; (3) DENSE considers model heterogeneity in FL, \ie different clients can have different model architectures. Experiments on a variety of real-world datasets demonstrate the superiority of our method.For example, DENSE outperforms the best baseline method Fed-ADI by 5.08\% on CIFAR10 dataset.
The Multi-Agent Behavior Dataset: Mouse Dyadic Social InteractionsJennifer J. Sun, Tomomi Karigo, Dipam Chakraborty et al.
Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay. To help accelerate behavioral studies, the CalMS21 dataset provides benchmarks to evaluate the performance of automated behavior classification methods in three settings: (1) for training on large behavioral datasets all annotated by a single annotator, (2) for style transfer to learn inter-annotator differences in behavior definitions, and (3) for learning of new behaviors of interest given limited training data. The dataset consists of 6 million frames of unlabeled tracked poses of interacting mice, as well as over 1 million frames with tracked poses and corresponding frame-level behavior annotations. The challenge of our dataset is to be able to classify behaviors accurately using both labeled and unlabeled tracking data, as well as being able to generalize to new settings.
7.9LGAug 18, 2020
Stochastic Adaptive Line Search for Differentially Private OptimizationChen Chen, Jaewoo Lee
The performance of private gradient-based optimization algorithms is highly dependent on the choice of step size (or learning rate) which often requires non-trivial amount of tuning. In this paper, we introduce a stochastic variant of classic backtracking line search algorithm that satisfies Rényi differential privacy. Specifically, the proposed algorithm adaptively chooses the step size satsisfying the the Armijo condition (with high probability) using noisy gradients and function estimates. Furthermore, to improve the probability with which the chosen step size satisfies the condition, it adjusts per-iteration privacy budget during runtime according to the reliability of noisy gradient. A naive implementation of the backtracking search algorithm may end up using unacceptably large privacy budget as the ability of adaptive step size selection comes at the cost of extra function evaluations. The proposed algorithm avoids this problem by using the sparse vector technique combined with the recent privacy amplification lemma. We also introduce a privacy budget adaptation strategy in which the algorithm adaptively increases the budget when it detects that directions pointed by consecutive gradients are drastically different. Extensive experiments on both convex and non-convex problems show that the adaptively chosen step sizes allow the proposed algorithm to efficiently use the privacy budget and show competitive performance against existing private optimizers.
15.4CRJul 14, 2020
PrivColl: Practical Privacy-Preserving Collaborative Machine LearningYanjun Zhang, Guangdong Bai, Xue Li et al.
Collaborative learning enables two or more participants, each with their own training dataset, to collaboratively learn a joint model. It is desirable that the collaboration should not cause the disclosure of either the raw datasets of each individual owner or the local model parameters trained on them. This privacy-preservation requirement has been approached through differential privacy mechanisms, homomorphic encryption (HE) and secure multiparty computation (MPC), but existing attempts may either introduce the loss of model accuracy or imply significant computational and/or communicational overhead. In this work, we address this problem with the lightweight additive secret sharing technique. We propose PrivColl, a framework for protecting local data and local models while ensuring the correctness of training processes. PrivColl employs secret sharing technique for securely evaluating addition operations in a multiparty computation environment, and achieves practicability by employing only the homomorphic addition operations. We formally prove that it guarantees privacy preservation even though the majority (n-2 out of n) of participants are corrupted. With experiments on real-world datasets, we further demonstrate that PrivColl retains high efficiency. It achieves a speedup of more than 45X over the state-of-the-art MPC/HE based schemes for training linear/logistic regression, and 216X faster for training neural network.
12.0LGJun 15, 2020
Robust Federated Recommendation SystemChen Chen, Jingfeng Zhang, Anthony K. H. Tung et al.
Federated recommendation systems can provide good performance without collecting users' private data, making them attractive. However, they are susceptible to low-cost poisoning attacks that can degrade their performance. In this paper, we develop a novel federated recommendation technique that is robust against the poisoning attack where Byzantine clients prevail. We argue that the key to Byzantine detection is monitoring of gradients of the model parameters of clients. We then propose a robust learning strategy where instead of using model parameters, the central server computes and utilizes the gradients to filter out Byzantine clients. Theoretically, we justify our robust learning strategy by our proposed definition of Byzantine resilience. Empirically, we confirm the efficacy of our robust learning strategy employing four datasets in a federated recommendation system.
1.8LGSep 18, 2019
Renyi Differentially Private ADMM for Non-Smooth Regularized OptimizationChen Chen, Jaewoo Lee
In this paper we consider the problem of minimizing composite objective functions consisting of a convex differentiable loss function plus a non-smooth regularization term, such as $L_1$ norm or nuclear norm, under Rényi differential privacy (RDP). To solve the problem, we propose two stochastic alternating direction method of multipliers (ADMM) algorithms: ssADMM based on gradient perturbation and mpADMM based on output perturbation. Both algorithms decompose the original problem into sub-problems that have closed-form solutions. The first algorithm, ssADMM, applies the recent privacy amplification result for RDP to reduce the amount of noise to add. The second algorithm, mpADMM, numerically computes the sensitivity of ADMM variable updates and releases the updated parameter vector at the end of each epoch. We compare the performance of our algorithms with several baseline algorithms on both real and simulated datasets. Experimental results show that, in high privacy regimes (small $ε$), ssADMM and mpADMM outperform other baseline algorithms in terms of classification and feature selection performance, respectively.
1.5NEAug 5, 2017
Fast Modeling Methods for Complex System with Separable FeaturesChen Chen, Changtong Luo, Zonglin Jiang
Data-driven modeling plays an increasingly important role in different areas of engineering. For most of existing methods, such as genetic programming (GP), the convergence speed might be too slow for large scale problems with a large number of variables. Fortunately, in many applications, the target models are separable in some sense. In this paper, we analyze different types of separability of some real-world engineering equations and establish a mathematical model of generalized separable system (GS system). In order to get the structure of the GS system, two concepts, namely block and factor are introduced, and a special method, block and factor detection is also proposed, in which the target model is decomposed into a number of blocks, further into minimal blocks and factors. Compare to the conventional GP, the new method can make large reductions to the search space. The minimal blocks and factors are optimized and assembled with a global optimization search engine, low dimensional simplex evolution (LDSE). An extensive study between the proposed method and a state-of-the-art data-driven fitting tool, Eureqa, has been presented with several man-made problems. Test results indicate that the proposed method is more effective and efficient under all the investigated cases.
6.1NEMay 23, 2017
A divide and conquer method for symbolic regressionChangtong Luo, Chen Chen, Zonglin Jiang
Symbolic regression aims to find a function that best explains the relationship between independent variables and the objective value based on a given set of sample data. Genetic programming (GP) is usually considered as an appropriate method for the problem since it can optimize functional structure and coefficients simultaneously. However, the convergence speed of GP might be too slow for large scale problems that involve a large number of variables. Fortunately, in many applications, the target function is separable or partially separable. This feature motivated us to develop a new method, divide and conquer (D&C), for symbolic regression, in which the target function is divided into a number of sub-functions and the sub-functions are then determined by any of a GP algorithm. The separability is probed by a new proposed technique, Bi-Correlation test (BiCT). D&C powered GP has been tested on some real-world applications, and the study shows that D&C can help GP to get the target function much more rapidly.
4.1NEMay 22, 2017
Block building programming for symbolic regressionChen Chen, Changtong Luo, Zonglin Jiang
Symbolic regression that aims to detect underlying data-driven models has become increasingly important for industrial data analysis. For most existing algorithms such as genetic programming (GP), the convergence speed might be too slow for large-scale problems with a large number of variables. This situation may become even worse with increasing problem size. The aforementioned difficulty makes symbolic regression limited in practical applications. Fortunately, in many engineering problems, the independent variables in target models are separable or partially separable. This feature inspires us to develop a new approach, block building programming (BBP). BBP divides the original target function into several blocks, and further into factors. The factors are then modeled by an optimization engine (e.g. GP). Under such circumstances, BBP can make large reductions to the search space. The partition of separability is based on a special method, block and factor detection. Two different optimization engines are applied to test the performance of BBP on a set of symbolic regression problems. Numerical results show that BBP has a good capability of structure and coefficient optimization with high computational efficiency.