h-index78
104papers
1,780citations
Novelty51%
AI Score61

104 Papers

CVJun 1, 2023
MammalNet: A Large-scale Video Benchmark for Mammal Recognition and Behavior Understanding

Jun Chen, Ming Hu, Darren J. Coker et al. · mit

Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognition systems is currently hindered by a lack of appropriately labeled datasets. Existing video datasets 1) do not classify animals according to established biological taxonomies; 2) are too small to facilitate large-scale behavioral studies and are often limited to a single species; and 3) do not feature temporally localized annotations and therefore do not facilitate localization of targeted behaviors within longer video sequences. Thus, we propose MammalNet, a new large-scale animal behavior dataset with taxonomy-guided annotations of mammals and their common behaviors. MammalNet contains over 18K videos totaling 539 hours, which is ~10 times larger than the largest existing animal behavior dataset. It covers 17 orders, 69 families, and 173 mammal categories for animal categorization and captures 12 high-level animal behaviors that received focus in previous animal behavior studies. We establish three benchmarks on MammalNet: standard animal and behavior recognition, compositional low-shot animal and behavior recognition, and behavior detection. Our dataset and code have been made available at: https://mammal-net.github.io.

CVJun 1, 2022Code
Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Jun Chen, Ming Hu, Boyang Li et al.

Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7$\times$7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384$\times$384 images, it can reach 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. LoMaR is especially more computation-efficient on pretraining high-resolution images, e.g., it is 3.1$\times$ faster than MAE with 0.2% higher classification accuracy on pretraining 448$\times$448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available in https://github.com/junchen14/LoMaR.

LGFeb 27, 2023
Towards Interpretable Federated Learning

Anran Li, Rui Liu, Ming Hu et al. · mit

Federated learning (FL) enables multiple data owners to build machine learning models collaboratively without exposing their private local data. In order for FL to achieve widespread adoption, it is important to balance the need for performance, privacy-preservation and interpretability, especially in mission critical applications such as finance and healthcare. Thus, interpretable federated learning (IFL) has become an emerging topic of research attracting significant interest from the academia and the industry alike. Its interdisciplinary nature can be challenging for new researchers to pick up. In this paper, we bridge this gap by providing (to the best of our knowledge) the first survey on IFL. We propose a unique IFL taxonomy which covers relevant works enabling FL models to explain the prediction results, support model debugging, and provide insights into the contributions made by individual data owners or data samples, which in turn, is crucial for allocating rewards fairly to motivate active and reliable participation in FL. We conduct comprehensive analysis of the representative IFL approaches, the commonly adopted performance evaluation metrics, and promising directions towards building versatile IFL techniques.

SEApr 23Code
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

Di Yang, Xinou Xie, Xiuwen Yang et al.

Software requirement ambiguity is ubiquitous in real-world development, stemming from the inherent imprecision of natural language and the varying interpretations of stakeholders. While Large Language Models (LLMs) have demonstrated impressive capabilities in generating code from precise specifications, such ambiguity poses a significant obstacle to reliable automated code generation. Existing benchmarks typically assume clear and unambiguous requirements, leaving an empirical gap in understanding how LLMs behave when faced with the inherent uncertainty of real-world software requirements. In this paper, we introduce Orchid, the first code generation benchmark specifically designed with ambiguous requirements. It comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Leveraging this dataset, we conduct the first systematic empirical study to evaluate the impact of requirement ambiguity on LLM-based code generation. Our results demonstrate that ambiguity consistently degrades the performance of all evaluated LLMs, with the most pronounced negative effects observed in highly advanced models. Furthermore, we observe that LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and lack the capability to identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the urgent need for ambiguity-aware techniques in the next generation of automated software engineering tools. The Orchid benchmark is publicly available at https://huggingface.co/datasets/SII-YDD/Orchid.

CVOct 20, 2023Code
NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity Understanding

Ming Hu, Lin Wang, Siyuan Yan et al.

The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. By utilizing the technique, we can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hindered by the scarcity of appropriately labeled datasets. The existing video datasets pose several limitations: 1) these datasets are small-scale in size to support comprehensive investigations of nursing activity; 2) they primarily focus on single procedures, lacking expert-level annotations for various nursing procedures and action steps; and 3) they lack temporally localized annotations, which prevents the effective localization of targeted actions within longer video sequences. To mitigate these limitations, we propose NurViD, a large video dataset with expert-level annotation for nursing procedure activity understanding. NurViD consists of over 1.5k videos totaling 144 hours, making it approximately four times longer than the existing largest nursing activity datasets. Notably, it encompasses 51 distinct nursing procedures and 177 action steps, providing a much more comprehensive coverage compared to existing datasets that primarily focus on limited procedures. To evaluate the efficacy of current deep learning methods on nursing activity understanding, we establish three benchmarks on NurViD: procedure recognition on untrimmed videos, procedure and action recognition on trimmed videos, and action detection. Our benchmark and code will be available at \url{https://github.com/minghu0830/NurViD-benchmark}.

CVNov 23, 2023Code
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding

Peng Xia, Xingtong Yu, Ming Hu et al.

Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (HGCLIP) that effectively combines CLIP with a deeper exploitation of the Hierarchical class structure via Graph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.

LGOct 15, 2022
FedCross: Towards Accurate Federated Learning via Multi-Model Cross-Aggregation

Ming Hu, Peiheng Zhou, Zhihao Yue et al.

As a promising distributed machine learning paradigm, Federated Learning (FL) has attracted increasing attention to deal with data silo problems without compromising user privacy. By adopting the classic one-to-multi training scheme (i.e., FedAvg), where the cloud server dispatches one single global model to multiple involved clients, conventional FL methods can achieve collaborative model training without data sharing. However, since only one global model cannot always accommodate all the incompatible convergence directions of local models, existing FL approaches greatly suffer from inferior classification accuracy. To address this issue, we present an efficient FL framework named FedCross, which uses a novel multi-to-multi FL training scheme based on our proposed multi-model cross-aggregation approach. Unlike traditional FL methods, in each round of FL training, FedCross uses multiple middleware models to conduct weighted fusion individually. Since the middleware models used by FedCross can quickly converge into the same flat valley in terms of loss landscapes, the generated global model can achieve a well-generalization. Experimental results on various well-known datasets show that, compared with state-of-the-art FL methods, FedCross can significantly improve FL accuracy within both IID and non-IID scenarios without causing additional communication overhead.

CVApr 20Code
MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Jiyao Liu, Jianghan Shen, Sida Song et al.

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench

LGNov 22, 2022
GitFL: Adaptive Asynchronous Federated Learning using Version Control

Ming Hu, Zeke Xia, Zhihao Yue et al.

As a promising distributed machine learning paradigm that enables collaborative training without compromising data privacy, Federated Learning (FL) has been increasingly used in AIoT (Artificial Intelligence of Things) design. However, due to the lack of efficient management of straggling devices, existing FL methods greatly suffer from the problems of low inference accuracy and long training time. Things become even worse when taking various uncertain factors (e.g., network delays, performance variances caused by process variation) existing in AIoT scenarios into account. To address this issue, this paper proposes a novel asynchronous FL framework named GitFL, whose implementation is inspired by the famous version control system Git. Unlike traditional FL, the cloud server of GitFL maintains a master model (i.e., the global model) together with a set of branch models indicating the trained local models committed by selected devices, where the master model is updated based on both all the pushed branch models and their version information, and only the branch models after the pull operation are dispatched to devices. By using our proposed Reinforcement Learning (RL)-based device selection mechanism, a pulled branch model with an older version will be more likely to be dispatched to a faster and less frequently selected device for the next round of local training. In this way, GitFL enables both effective control of model staleness and adaptive load balance of versioned models among straggling devices, thus avoiding the performance deterioration. Comprehensive experimental results on well-known models and datasets show that, compared with state-of-the-art asynchronous FL methods, GitFL can achieve up to 2.64X training acceleration and 7.88% inference accuracy improvements in various uncertain scenarios.

CVMar 29
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Zhongying Deng, Cheng Tang, Ziyan Huang et al. · pku

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

MTRL-SCIMar 27, 2022
Physics Guided Deep Learning for Generative Design of Crystal Materials with Symmetry Constraints

Yong Zhao, Edirisuriya M. Dilanga Siriwardane, Zhenyao Wu et al.

Discovering new materials is a challenging task in materials science crucial to the progress of human society. Conventional approaches based on experiments and simulations are labor-intensive or costly with success heavily depending on experts' heuristic knowledge. Here, we propose a deep learning based Physics Guided Crystal Generative Model (PGCGM) for efficient crystal material design with high structural diversity and symmetry. Our model increases the generation validity by more than 700\% compared to FTCP, one of the latest structure generators and by more than 45\% compared to our previous CubicGAN model. Density Functional Theory (DFT) calculations are used to validate the generated structures with 1,869 materials out of 2,000 are successfully optimized and deposited into the Carolina Materials Database \url{www.carolinamatdb.org}, of which 39.6\% have negative formation energy and 5.3\% have energy-above-hull less than 0.25 eV/atom, indicating their thermodynamic stability and potential synthesizability.

LGMay 9, 2022Code
Model-Contrastive Learning for Backdoor Defense

Zhihao Yue, Jun Xia, Zhiwei Ling et al.

Due to the popularity of Artificial Intelligence (AI) techniques, we are witnessing an increasing number of backdoor injection attacks that are designed to maliciously threaten Deep Neural Networks (DNNs) causing misclassification. Although there exist various defense methods that can effectively erase backdoors from DNNs, they greatly suffer from both high Attack Success Rate (ASR) and a non-negligible loss in Benign Accuracy (BA). Inspired by the observation that a backdoored DNN tends to form a new cluster in its feature spaces for poisoned data, in this paper we propose a novel two-stage backdoor defense method, named MCLDef, based on Model-Contrastive Learning (MCL). In the first stage, our approach performs trigger inversion based on trigger synthesis, where the resultant trigger can be used to generate poisoned data. In the second stage, under the guidance of MCL and our defined positive and negative pairs, MCLDef can purify the backdoored model by pulling the feature representations of poisoned data towards those of their clean data counterparts. Due to the shrunken cluster of poisoned data, the backdoor formed by end-to-end supervised learning is eliminated. Comprehensive experimental results show that, with only 5% of clean data, MCLDef significantly outperforms state-of-the-art defense methods by up to 95.79% reduction in ASR, while in most cases the BA degradation can be controlled within less than 2%. Our code is available at https://github.com/WeCanShow/MCL.

CVFeb 6Code
AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion

Mingyu Dou, Shi Qiu, Ming Hu et al.

Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89\% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.

LGJan 28, 2023
CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning

Pengyu Zhang, Yingbo Zhou, Ming Hu et al. · salesforce

Federated learning (FL) has been proposed to enable distributed learning on Artificial Intelligence Internet of Things (AIoT) devices with guarantees of high-level data privacy. Since random initial models in FL can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially in non-IID scenarios. To address this problem, we propose a novel method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. We formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of loss for the pre-trained models by CyclicFL. Moreover, we systematically prove that our method can achieve faster convergence speed under various convexity assumptions. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected AIoT devices cyclically without exposing their local data. Therefore, they can be easily integrated into any security-critical FL methods. Comprehensive experimental results show that CyclicFL can not only improve the maximum classification accuracy by up to $14.11\%$ but also significantly accelerate the overall FL training process.

LGDec 5, 2022
HierarchyFL: Heterogeneous Federated Learning via Hierarchical Self-Distillation

Jun Xia, Yi Zhang, Zhihao Yue et al.

Federated learning (FL) has been recognized as a privacy-preserving distributed machine learning paradigm that enables knowledge sharing among various heterogeneous artificial intelligence (AIoT) devices through centralized global model aggregation. FL suffers from model inaccuracy and slow convergence due to the model heterogeneity of the AIoT devices involved. Although various existing methods try to solve the bottleneck of the model heterogeneity problem, most of them improve the accuracy of heterogeneous models in a coarse-grained manner, which makes it still a great challenge to deploy large-scale AIoT devices. To alleviate the negative impact of this problem and take full advantage of the diversity of each heterogeneous model, we propose an efficient framework named HierarchyFL, which uses a small amount of public data for efficient and scalable knowledge across a variety of differently structured models. By using self-distillation and our proposed ensemble library, each hierarchical model can intelligently learn from each other on cloud servers. Experimental results on various well-known datasets show that HierarchyFL can not only maximize the knowledge sharing among various heterogeneous models in large-scale AIoT systems, but also greatly improve the model performance of each involved heterogeneous AIoT device.

LGAug 22, 2023
FilterFL: Knowledge Filtering-based Data-Free Backdoor Defense for Federated Learning

Yanxin Yang, Ming Hu, Xiaofei Xie et al.

As a distributed machine learning paradigm, Federated Learning (FL) enables large-scale clients to collaboratively train a model without sharing their raw data. However, due to the lack of data auditing for untrusted clients, FL is vulnerable to poisoning attacks, especially backdoor attacks. By using poisoned data for local training or directly changing the model parameters, attackers can easily inject backdoors into the model, which can trigger the model to make misclassification of targeted patterns in images. To address these issues, we propose a novel data-free trigger-generation-based defense approach based on the two characteristics of backdoor attacks: i) triggers are learned faster than normal knowledge, and ii) trigger patterns have a greater effect on image classification than normal class patterns. Our approach generates the images with newly learned knowledge by identifying the differences between the old and new global models, and filters trigger images by evaluating the effect of these generated images. By using these trigger images, our approach eliminates poisoned models to ensure the updated global model is benign. Comprehensive experiments demonstrate that our approach can defend against almost all the existing types of backdoor attacks and outperform all the seven state-of-the-art defense methods with both IID and non-IID scenarios. Especially, our approach can successfully defend against the backdoor attack even when 80\% of the clients are malicious.

CVMay 14Code
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Yize Liu, Siyuan Yan, Ming Hu et al.

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.

CLApr 9Code
OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

Jing Hao, Siyuan Dai, Yongxin Zhang et al.

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

CVMay 25
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Xiang An, Yin Xie, Feilong Tang et al.

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.

CLJan 29
A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Anran Li, Yuanyuan Chen, Wenjun Long et al.

Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models, and LLaMA-3 and DeepSeek-R1, GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.

CVFeb 11
A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

Siyuan Yan, Xieji Li, Dan Mo et al.

Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.

LGNov 22, 2023
AdaptiveFL: Adaptive Heterogeneous Federated Learning for Resource-Constrained AIoT Systems

Chentao Jia, Ming Hu, Zekai Chen et al.

Although Federated Learning (FL) is promising to enable collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from the problem of low classification performance due to various heterogeneity factors (e.g., computing capacity, memory size) of devices and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named AdaptiveFL based on a novel fine-grained width-wise model pruning strategy, which can generate various heterogeneous local models for heterogeneous AIoT devices. By using our proposed reinforcement learning-based device selection mechanism, AdaptiveFL can adaptively dispatch suitable heterogeneous models to corresponding AIoT devices on the fly based on their available resources for local training. Experimental results show that, compared to state-of-the-art methods, AdaptiveFL can achieve up to 16.83% inference improvements for both IID and non-IID scenarios.

CVFeb 13
Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Wenhui Liao, Hongliang Li, Pengyu Xie et al.

Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.

LGNov 23, 2023
AdapterFL: Adaptive Heterogeneous Federated Learning for Resource-constrained Mobile Computing Systems

Ruixuan Liu, Ming Hu, Zeke Xia et al.

Federated Learning (FL) enables collaborative learning of large-scale distributed clients without data sharing. However, due to the disparity of computing resources among massive mobile computing devices, the performance of traditional homogeneous model-based Federated Learning (FL) is seriously limited. On the one hand, to achieve model training in all the diverse clients, mobile computing systems can only use small low-performance models for collaborative learning. On the other hand, devices with high computing resources cannot train a high-performance large model with their insufficient raw data. To address the resource-constrained problem in mobile computing systems, we present a novel heterogeneous FL approach named AdapterFL, which uses a model reassemble strategy to facilitate collaborative training of massive heterogeneous mobile devices adaptively. Specifically, we select multiple candidate heterogeneous models based on the computing performance of massive mobile devices and then divide each heterogeneous model into two partitions. By reassembling the partitions, we can generate models with varied sizes that are combined by the partial parameters of the large model with the partial parameters of the small model. Using these reassembled models for FL training, we can train the partial parameters of the large model using low-performance devices. In this way, we can alleviate performance degradation in large models due to resource constraints. The experimental results show that AdapterFL can achieve up to 12\% accuracy improvement compared to the state-of-the-art heterogeneous federated learning methods in resource-constrained scenarios.

LGMay 24, 2022
FedEntropy: Efficient Device Grouping for Federated Learning Using Maximum Entropy Judgment

Zhiwei Ling, Zhihao Yue, Jun Xia et al.

Along with the popularity of Artificial Intelligence (AI) and Internet-of-Things (IoT), Federated Learning (FL) has attracted steadily increasing attentions as a promising distributed machine learning paradigm, which enables the training of a central model on for numerous decentralized devices without exposing their privacy. However, due to the biased data distributions on involved devices, FL inherently suffers from low classification accuracy in non-IID scenarios. Although various device grouping method have been proposed to address this problem, most of them neglect both i) distinct data distribution characteristics of heterogeneous devices, and ii) contributions and hazards of local models, which are extremely important in determining the quality of global model aggregation. In this paper, we present an effective FL method named FedEntropy with a novel dynamic device grouping scheme, which makes full use of the above two factors based on our proposed maximum entropy judgement heuristic.Unlike existing FL methods that directly aggregate local models returned from all the selected devices, in one FL round FedEntropy firstly makes a judgement based on the pre-collected soft labels of selected devices and then only aggregates the local models that can maximize the overall entropy of these soft labels. Without collecting local models that are harmful for aggregation, FedEntropy can effectively improve global model accuracy while reducing the overall communication overhead. Comprehensive experimental results on well-known benchmarks show that, FedEntropy not only outperforms state-of-the-art FL methods in terms of model accuracy and communication overhead, but also can be integrated into them to enhance their classification performance.

MLNov 20, 2022
Algorithmic Decision-Making Safeguarded by Human Knowledge

Ningyuan Chen, Ming Hu, Wenhao Li

Commercial AI solutions provide analysts and managers with data-driven business intelligence for a wide range of decisions, such as demand forecasting and pricing. However, human analysts may have their own insights and experiences about the decision-making that is at odds with the algorithmic recommendation. In view of such a conflict, we provide a general analytical framework to study the augmentation of algorithmic decisions with human knowledge: the analyst uses the knowledge to set a guardrail by which the algorithmic decision is clipped if the algorithmic output is out of bound, and seems unreasonable. We study the conditions under which the augmentation is beneficial relative to the raw algorithmic decision. We show that when the algorithmic decision is asymptotically optimal with large data, the non-data-driven human guardrail usually provides no benefit. However, we point out three common pitfalls of the algorithmic decision: (1) lack of domain knowledge, such as the market competition, (2) model misspecification, and (3) data contamination. In these cases, even with sufficient data, the augmentation from human knowledge can still improve the performance of the algorithmic decision.

AIDec 18, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou et al.

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.

LGNov 22, 2023
Have Your Cake and Eat It Too: Toward Efficient and Accurate Split Federated Learning

Dengke Yan, Ming Hu, Zeke Xia et al.

Due to its advantages in resource constraint scenarios, Split Federated Learning (SFL) is promising in AIoT systems. However, due to data heterogeneity and stragglers, SFL suffers from the challenges of low inference accuracy and low efficiency. To address these issues, this paper presents a novel SFL approach, named Sliding Split Federated Learning (S$^2$FL), which adopts an adaptive sliding model split strategy and a data balance-based training mechanism. By dynamically dispatching different model portions to AIoT devices according to their computing capability, S$^2$FL can alleviate the low training efficiency caused by stragglers. By combining features uploaded by devices with different data distributions to generate multiple larger batches with a uniform distribution for back-propagation, S$^2$FL can alleviate the performance degradation caused by data heterogeneity. Experimental results demonstrate that, compared to conventional SFL, S$^2$FL can achieve up to 16.5\% inference accuracy improvement and 3.54X training acceleration.

CVApr 22
Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Siyuan Yao, Siavash Ghorbany, Kuangshi Ai et al.

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.

CVMar 9
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Zhongxing Xu, Zhonghua Wang, Zhe Qian et al.

Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.

CVFeb 9
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan et al.

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.

LGAug 21, 2025Code
Intern-S1: A Scientific Multimodal Foundation Model

Lei Bai, Zhongrui Cai, Yuhang Cao et al.

In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.

CVApr 2, 2025Code
GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

Yanzhou Su, Tianbin Li, Jiyao Liu et al.

Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model's generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at \href{https://github.com/uni-medical/GMAI-VL-R1}{this link}.

CVOct 19, 2024Code
A Multimodal Vision Foundation Model for Clinical Dermatology

Siyuan Yan, Zhen Yu, Clare Primiero et al.

Diagnosing and treating skin diseases require advanced visual skills across domains and the ability to synthesize information from multiple imaging modalities. While current deep learning models excel at specific tasks like skin cancer diagnosis from dermoscopic images, they struggle to meet the complex, multimodal requirements of clinical practice. Here, we introduce PanDerm, a multimodal dermatology foundation model pretrained through self-supervised learning on over 2 million real-world skin disease images from 11 clinical institutions across 4 imaging modalities. We evaluated PanDerm on 28 diverse benchmarks, including skin cancer screening, risk stratification, differential diagnosis of common and rare skin conditions, lesion segmentation, longitudinal monitoring, and metastasis prediction and prognosis. PanDerm achieved state-of-the-art performance across all evaluated tasks, often outperforming existing models when using only 10% of labeled data. We conducted three reader studies to assess PanDerm's potential clinical utility. PanDerm outperformed clinicians by 10.2% in early-stage melanoma detection through longitudinal analysis, improved clinicians' skin cancer diagnostic accuracy by 11% on dermoscopy images, and enhanced non-dermatologist healthcare providers' differential diagnosis by 16.5% across 128 skin conditions on clinical photographs. These results demonstrate PanDerm's potential to improve patient care across diverse clinical scenarios and serve as a model for developing multimodal foundation models in other medical specialties, potentially accelerating the integration of AI support in healthcare. The code can be found at https://github.com/SiyuanYan1/PanDerm.

CVOct 15, 2024Code
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Ying Chen, Guoan Wang, Yuanfeng Ji et al.

Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and response complex instruction across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy, diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA), and 54.15% on SlideBench-VQA (BCNB). Our code, data, and model is publicly accessible at https://uni-medical.github.io/SlideChat.github.io.

CVOct 7, 2025Code
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

Yi Xin, Qi Qin, Siqi Luo et al.

We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.

CVMar 19, 2025Code
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan, Ming Hu, Yiwen Jiang et al.

The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be publicly available at https://github.com/SiyuanYan1/Derm1M upon acceptance.

CVApr 11
Radiology Report Generation for Low-Quality X-Ray Images

Hongze Zhu, Chen Hu, Jiaxuan Jiang et al.

Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.

CVMay 14, 2025Code
MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment

Siyuan Yan, Xieji Li, Ming Hu et al.

Dermatological diagnosis represents a complex multimodal challenge that requires integrating visual features with specialized clinical knowledge. While vision-language pretraining (VLP) has advanced medical AI, its effectiveness in dermatology is limited by text length constraints and the lack of structured texts. In this paper, we introduce MAKE, a Multi-Aspect Knowledge-Enhanced vision-language pretraining framework for zero-shot dermatological tasks. Recognizing that comprehensive dermatological descriptions require multiple knowledge aspects that exceed standard text constraints, our framework introduces: (1) a multi-aspect contrastive learning strategy that decomposes clinical narratives into knowledge-enhanced sub-texts through large language models, (2) a fine-grained alignment mechanism that connects subcaptions with diagnostically relevant image features, and (3) a diagnosis-guided weighting scheme that adaptively prioritizes different sub-captions based on clinical significance prior. Through pretraining on 403,563 dermatological image-text pairs collected from education resources, MAKE significantly outperforms state-of-the-art VLP models on eight datasets across zero-shot skin disease classification, concept annotation, and cross-modal retrieval tasks. Our code will be made publicly available at https: //github.com/SiyuanYan1/MAKE.

LGAug 16, 2022
FedMR: Fedreated Learning via Model Recombination

Ming Hu, Zhihao Yue, Zhiwei Ling et al.

As a promising privacy-preserving machine learning method, Federated Learning (FL) enables global model training across clients without compromising their confidential local data. However, existing FL methods suffer from the problem of low inference performance for unevenly distributed data, since most of them rely on Federated Averaging (FedAvg)-based aggregation. By averaging model parameters in a coarse manner, FedAvg eclipses the individual characteristics of local models, which strongly limits the inference capability of FL. Worse still, in each round of FL training, FedAvg dispatches the same initial local models to clients, which can easily result in stuck-at-local-search for optimal global models. To address the above issues, this paper proposes a novel and effective FL paradigm named FedMR (Federating Model Recombination). Unlike conventional FedAvg-based methods, the cloud server of FedMR shuffles each layer of collected local models and recombines them to achieve new models for local training on clients. Due to the fine-grained model recombination and local training in each FL round, FedMR can quickly figure out one globally optimal model for all the clients. Comprehensive experimental results demonstrate that, compared with state-of-the-art FL methods, FedMR can significantly improve the inference accuracy without causing extra communication overhead.

DCSep 23, 2024
Federated Graph Learning with Adaptive Importance-based Sampling

Anran Li, Yuanyuan Chen, Chao Ren et al.

For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ignore graph structural information or dynamics of optimization, resulting in high variance and inaccurate node embeddings. To address this limitation, we propose the Federated Adaptive Importance-based Sampling (FedAIS) approach. It achieves substantial computational cost saving by focusing the limited resources on training important nodes, while reducing communication overhead via adaptive historical embedding synchronization. The proposed adaptive importance-based sampling method jointly considers the graph structural heterogeneity and the optimization dynamics to achieve optimal trade-off between efficiency and accuracy. Extensive evaluations against five state-of-the-art baselines on five real-world graph datasets show that FedAIS achieves comparable or up to 3.23% higher test accuracy, while saving communication and computation costs by 91.77% and 85.59%.

IVMay 19, 2025Code
RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

Junzhi Ning, Cheng Tang, Kaijing Zhou et al.

The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. Existing methods for synthesising Colour Fundus Photographs (CFPs) largely rely on predefined disease labels, which restricts their ability to generate images that reflect fine-grained anatomical variations, subtle disease stages, and diverse pathological features beyond coarse class categories. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, captioned retinal dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses the visual language model(VLM) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Building on this dataset, we employ a novel three-step training framework, RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Through extensive experiments, our method demonstrates superior performance across multiple datasets, with 62.07% of text-driven synthetic CFPs indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 5%-10% in diabetic retinopathy grading and glaucoma detection. Codes are available at https://github.com/uni-medical/retina-text2cfp.

IVMay 18, 2024Code
Diffusion Model Driven Test-Time Image Adaptation for Robust Skin Lesion Classification

Ming Hu, Siyuan Yan, Peng Xia et al.

Deep learning-based diagnostic systems have demonstrated potential in skin disease diagnosis. However, their performance can easily degrade on test domains due to distribution shifts caused by input-level corruptions, such as imaging equipment variability, brightness changes, and image blur. This will reduce the reliability of model deployment in real-world scenarios. Most existing solutions focus on adapting the source model through retraining on different target domains. Although effective, this retraining process is sensitive to the amount of data and the hyperparameter configuration for optimization. In this paper, we propose a test-time image adaptation method to enhance the accuracy of the model on test data by simultaneously updating and predicting test images. We modify the target test images by projecting them back to the source domain using a diffusion model. Specifically, we design a structure guidance module that adds refinement operations through low-pass filtering during reverse sampling, regularizing the diffusion to preserve structural information. Additionally, we introduce a self-ensembling scheme automatically adjusts the reliance on adapted and unadapted inputs, enhancing adaptation robustness by rejecting inappropriate generative modeling results. To facilitate this study, we constructed the ISIC2019-C and Dermnet-C corruption robustness evaluation benchmarks. Extensive experiments on the proposed benchmarks demonstrate that our method makes the classifier more robust across various corruptions, architectures, and data regimes. Our datasets and code will be available at \url{https://github.com/minghu0830/Skin-TTA_Diffusion}.

IVMay 12, 2025Code
Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model

Wei Li, Ming Hu, Guoan Wang et al.

In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor consumption. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering downstream tasks of ophthalmic surgical workflow understanding. Code is available at https://github.com/uni-medical/Ophora.

CVMar 23, 2025Code
PG-SAM: Prior-Guided SAM with Medical for Multi-organ Segmentation

Yiheng Zhong, Zihong Luo, Chengzhi Liu et al.

Segment Anything Model (SAM) demonstrates powerful zero-shot capabilities; however, its accuracy and robustness significantly decrease when applied to medical image segmentation. Existing methods address this issue through modality fusion, integrating textual and image information to provide more detailed priors. In this study, we argue that the granularity of text and the domain gap affect the accuracy of the priors. Furthermore, the discrepancy between high-level abstract semantics and pixel-level boundary details in images can introduce noise into the fusion process. To address this, we propose Prior-Guided SAM (PG-SAM), which employs a fine-grained modality prior aligner to leverage specialized medical knowledge for better modality alignment. The core of our method lies in efficiently addressing the domain gap with fine-grained text from a medical LLM. Meanwhile, it also enhances the priors' quality after modality alignment, ensuring more accurate segmentation. In addition, our decoder enhances the model's expressive capabilities through multi-level feature fusion and iterative mask optimizer operations, supporting unprompted learning. We also propose a unified pipeline that effectively supplies high-quality semantic information to SAM. Extensive experiments on the Synapse dataset demonstrate that the proposed PG-SAM achieves state-of-the-art performance. Our code is released at https://github.com/logan-0623/PG-SAM.

CVMar 20
FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

Ming Hu, Yongsheng Huo, Mingyu Dou et al.

Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.

CLSep 25, 2025Code
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Yizhou Wang, Chen Tang, Han Deng et al.

We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

CVJun 22, 2025Code
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Guankun Wang, Junyi Wang, Wenjin Mo et al.

Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons to understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K that is a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.

IVMar 17, 2025Code
MSWAL: 3D Multi-class Segmentation of Whole Abdominal Lesions Dataset

Zhaodong Wu, Qiaochu Zhao, Ming Hu et al.

With the significantly increasing incidence and prevalence of abdominal diseases, there is a need to embrace greater use of new innovations and technology for the diagnosis and treatment of patients. Although deep-learning methods have notably been developed to assist radiologists in diagnosing abdominal diseases, existing models have the restricted ability to segment common lesions in the abdomen due to missing annotations for typical abdominal pathologies in their training datasets. To address the limitation, we introduce MSWAL, the first 3D Multi-class Segmentation of the Whole Abdominal Lesions dataset, which broadens the coverage of various common lesion types, such as gallstones, kidney stones, liver tumors, kidney tumors, pancreatic cancer, liver cysts, and kidney cysts. With CT scans collected from 694 patients (191,417 slices) of different genders across various scanning phases, MSWAL demonstrates strong robustness and generalizability. The transfer learning experiment from MSWAL to two public datasets, LiTS and KiTS, effectively demonstrates consistent improvements, with Dice Similarity Coefficient (DSC) increase of 3.00% for liver tumors and 0.89% for kidney tumors, demonstrating that the comprehensive annotations and diverse lesion types in MSWAL facilitate effective learning across different domains and data distributions. Furthermore, we propose Inception nnU-Net, a novel segmentation framework that effectively integrates an Inception module with the nnU-Net architecture to extract information from different receptive fields, achieving significant enhancement in both voxel-level DSC and region-level F1 compared to the cutting-edge public algorithms on MSWAL. Our dataset will be released after being accepted, and the code is publicly released at https://github.com/tiuxuxsh76075/MSWAL-.

CVDec 15, 2025Code
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Jiaqi Wang, Weijia Wu, Yi Zhan et al.

Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-Review evaluation. An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.