Haoyuan Li

CV
h-index: 39
58 papers
1,626 citations
Novelty: 54%
AI Score: 61

58 Papers

CV · Aug 20, 2023 · Code
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

Haoyuan Li, Haoye Dong, Hanchao Jia et al.

Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.

IV · Jun 21, 2023 · Code
DIAS: A Dataset and Benchmark for Intracranial Artery Segmentation in DSA sequences

Wentao Liu, Tong Tian, Lemeng Wang et al.

The automated segmentation of Intracranial Arteries (IA) in Digital Subtraction Angiography (DSA) plays a crucial role in the quantification of vascular morphology, significantly contributing to computer-assisted stroke research and clinical practice. Current research primarily focuses on the segmentation of single-frame DSA using proprietary datasets. However, these methods face challenges due to the inherent limitation of single-frame DSA, which only partially displays vascular contrast, thereby hindering accurate vascular structure representation. In this work, we introduce DIAS, a dataset specifically developed for IA segmentation in DSA sequences. We establish a comprehensive benchmark for evaluating DIAS, covering full, weak, and semi-supervised segmentation methods. Specifically, we propose the vessel sequence segmentation network, in which the sequence feature extraction module effectively captures spatiotemporal representations of intravascular contrast, achieving intracranial artery segmentation in 2D+Time DSA sequences. For weakly-supervised IA segmentation, we propose a novel scribble learning-based image segmentation framework, which, under the guidance of scribble labels, employs cross pseudo-supervision and consistency regularization to improve the performance of the segmentation network. Furthermore, we introduce the random patch-based self-training framework, aimed at alleviating the performance constraints encountered in IA segmentation due to the limited availability of annotated DSA data. Our extensive experiments on the DIAS dataset demonstrate the effectiveness of these methods as potential baselines for future research and clinical applications. The dataset and code are publicly available at https://doi.org/10.5281/zenodo.11396520 and https://github.com/lseventeen/DIAS.

CV · Sep 26, 2024 · Code
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.

CV · Aug 28, 2024 · Code
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Fangxun Shu, Yue Liao, Le Zhuo et al.

We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable the student model to emulate the teacher network's understanding. Following this, we introduce preference distillation via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is significantly enhanced beyond l-MLLM, leading to a better student that surpasses its teacher, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available on: https://github.com/shufangxun/LLaVA-MoD.
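
The mimic-distillation step described in this abstract, minimizing the KL divergence between teacher and student output distributions, can be illustrated with a minimal sketch. This is not the LLaVA-MoD implementation: the function names, the plain-Python softmax, and the choice of direction KL(teacher || student) are assumptions made here for illustration only.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mimic_distillation_loss(teacher_logits, student_logits):
    """Hypothetical helper: KL(teacher || student) over one token's vocabulary.

    The direction of the divergence is an assumption; the abstract does not
    specify it.
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# The loss vanishes when the student matches the teacher and is positive otherwise.
same = mimic_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = mimic_distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In practice such a loss would be averaged over tokens and batches inside a deep-learning framework; the sketch only shows the quantity being minimized.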

CL · Aug 19, 2024 · Code
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition

Tianwei Lin, Jiang Liu, Wenqiao Zhang et al.

While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue, one straightforward solution is to introduce task-specific LoRA modules as domain experts, leveraging the modeling of multiple experts' capabilities and thus enhancing the general capability of multi-task learning. Although promising, these additional components often add complexity to the training and inference process, contravening the efficiency that PEFT is designed for. Considering this, we introduce an innovative PEFT method, TeamLoRA, consisting of a collaboration module and a competition module for experts, thus achieving the right balance of effectiveness and efficiency: (i) For collaboration, a novel knowledge-sharing and -organizing mechanism is devised to appropriately reduce the scale of matrix operations, thereby boosting the training and inference speed. (ii) For competition, we propose leveraging a game-theoretic interaction mechanism for experts, encouraging experts to transfer their domain-specific knowledge while facing diverse downstream tasks, thus enhancing the performance. By doing so, TeamLoRA elegantly connects the experts as a "Team" with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm for multi-task learning. To validate the superiority of TeamLoRA, we curate a comprehensive multi-task evaluation (CME) benchmark to thoroughly assess the capability of multi-task learning. Experiments conducted on our CME and other benchmarks indicate the effectiveness and efficiency of TeamLoRA. Our project is available at https://github.com/Lin-Tianwei/TeamLoRA.

CV · Jul 23, 2022 · Code
Combining Self-Training and Hybrid Architecture for Semi-supervised Abdominal Organ Segmentation

Wentao Liu, Weijin Xu, Songlin Yan et al.

Abdominal organ segmentation has many important clinical applications, such as organ quantification, surgical planning, and disease diagnosis. However, manually annotating organs from CT scans is time-consuming and labor-intensive. Semi-supervised learning has shown the potential to alleviate this challenge by learning from a large set of unlabeled images and limited labeled samples. In this work, we follow the self-training strategy and employ a high-performance hybrid architecture (PHTrans) consisting of CNN and Swin Transformer for the teacher model to generate precise pseudo labels for unlabeled data. Afterward, we introduce them with labeled data together into a two-stage segmentation framework with lightweight PHTrans for training to improve the performance and generalization ability of the model while remaining efficient. Experiments on the validation set of FLARE2022 demonstrate that our method achieves excellent segmentation performance as well as fast and low-resource model inference. The average DSC and NSD are 0.8956 and 0.9316, respectively. Under our development environments, the average inference time is 18.62 s, the average maximum GPU memory is 1995.04 MB, and the area under the GPU memory-time curve and the average area under the CPU utilization-time curve are 23196.84 and 319.67. The code is available at https://github.com/lseventeen/FLARE22-TwoStagePHTrans.
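
For readers unfamiliar with the DSC figure reported in this abstract, the Dice similarity coefficient for a single binary mask can be sketched as follows. This is a minimal illustration, not the FLARE2022 evaluation code: the flat 0/1 list representation, the function name, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks (lists of 0/1).

    DSC = 2 * |pred AND target| / (|pred| + |target|)
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom

# One of the two predicted voxels overlaps the single target voxel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A multi-organ score such as the one quoted above is typically obtained by computing this per organ and per case and then averaging.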

CV · Apr 7, 2023 · Code
DATE: Domain Adaptive Product Seeker for E-commerce

Haoyuan Li, Hao Jiang, Tao Jin et al.

Product Retrieval (PR) and Grounding (PG), which aim to seek image-level and object-level products respectively according to a textual query, have attracted great interest recently for a better shopping experience. Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from the Taobao Mall and Live domains with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. As annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from the annotated domain to the unannotated one for PG to achieve un-supervised Domain Adaptation (PG-DA). We propose a Domain Adaptive Product Seeker (DATE) framework, regarding PR and PG as a Product Seeking problem at different levels, to assist the query in dating the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated and comprehensive features for the subsequent efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers to simultaneously search the image for PR and localize the product for PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shift between source and target domains, and design a pseudo box generator to dynamically select reliable instances and generate bounding boxes for further knowledge transfer. Extensive experiments show that our DATE achieves satisfactory performance in fully-supervised PR, PG and un-supervised PG-DA. Our desensitized datasets will be publicly available at https://github.com/Taobao-live/Product-Seeking.

CV · Feb 5 · Code
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang et al.

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

AI · Sep 27, 2024 · Code
Align²LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Hongzhe Huang, Jiang Liu, Zhewen Yu et al.

Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the training instructions from 158k to 14k (9× smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at https://github.com/DCDmllm/Align2LLaVA.

CV · Sep 1, 2022
Video-Guided Curriculum Learning for Spoken Video Grounding

Yan Xia, Zhou Zhao, Shangwei Ye et al.

In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noise to this speech audio, further increasing the difficulty of this task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method for the audio pre-training process, which makes use of vital visual perceptions to help understand the spoken language and suppress the external noise. Because the model cannot obtain ground-truth video segments during inference, we design a curriculum strategy that gradually shifts the input video from the ground truth to the entire video content during pre-training. As a result, the model learns how to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, named the ActivityNet Speech dataset. Extensive experiments demonstrate that our proposed video-guided curriculum learning can facilitate the pre-training process to obtain a mutual audio encoder, significantly improving the performance of spoken video grounding tasks. Moreover, we show that in the case of noisy sound, our model outperforms a method that grounds video using ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.

CV · Jul 10, 2024
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Wanggui He, Siming Fu, Mushui Liu et al.

Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

LG · Sep 3, 2024
Collaboratively Learning Federated Models from Noisy Decentralized Data

Haoyuan Li, Mathias Funk, Nezihe Merve Gürel et al.

Federated learning (FL) has emerged as a prominent method for collaboratively training machine learning models using local data from edge devices, all while keeping data decentralized. However, accounting for the quality of data contributed by local clients remains a critical challenge in FL, as local data are often susceptible to corruption by various forms of noise and perturbations, which compromise the aggregation process and lead to a subpar global model. In this work, we focus on addressing the problem of noisy data in the input space, an under-explored area compared to the label noise. We propose a comprehensive assessment of client input in the gradient space, inspired by the distinct disparity observed between the density of gradient norm distributions of models trained on noisy and clean input data. Based on this observation, we introduce a straightforward yet effective approach to identify clients with low-quality data at the initial stage of FL. Furthermore, we propose a noise-aware FL aggregation method, namely Federated Noise-Sifting (FedNS), which can be used as a plug-in approach in conjunction with widely used FL strategies. Our extensive evaluation on diverse benchmark datasets under different federated settings demonstrates the efficacy of FedNS. Our method effortlessly integrates with existing FL strategies, enhancing the global model's performance by up to 13.68% in IID and 15.85% in non-IID settings when learning from noisy decentralized data.

AI · Nov 11, 2023
TrainerAgent: Customizable and Efficient Model Training through LLM-Powered Multi-Agent System

Haoyuan Li, Hao Jiang, Tianke Zhang et al.

Training AI models has always been challenging, especially when there is a need for custom models to provide personalized services. Algorithm engineers often face a lengthy process to iteratively develop models tailored to specific business requirements, making it even more difficult for non-experts. The quest for high-quality and efficient model development, along with the emergence of Large Language Model (LLM) Agents, has become a key focus in the industry. Leveraging the powerful analytical, planning, and decision-making capabilities of LLM, we propose a TrainerAgent system comprising a multi-agent framework including Task, Data, Model and Server agents. These agents analyze user-defined tasks, input data, and requirements (e.g., accuracy, speed), optimizing them comprehensively from both data and model perspectives to obtain satisfactory models, and finally deploy these models as online service. Experimental evaluations on classical discriminative and generative tasks in computer vision and natural language processing domains demonstrate that our system consistently produces models that meet the desired criteria. Furthermore, the system exhibits the ability to critically identify and reject unattainable tasks, such as fantastical scenarios or unethical requests, ensuring robustness and safety. This research presents a significant advancement in achieving desired models with increased efficiency and quality as compared to traditional model development, facilitated by the integration of LLM-powered analysis, decision-making, and execution capabilities, as well as the collaboration among four agents. We anticipate that our work will contribute to the advancement of research on TrainerAgent in both academic and industry communities, potentially establishing it as a new paradigm for model development in the field of AI.

CV · Feb 14, 2025 · Code
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li et al.

We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.

52.4 · CV · Apr 2
Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception

Haoyuan Li, Wen Yang, Fang Xu et al.

Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.

AI · Mar 20, 2024 · Code
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Wenqiao Zhang, Tianwei Lin, Jiang Liu et al.

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters) that shares the same parameters may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.

IV · Sep 11, 2023
Two-Stage Hybrid Supervision Framework for Fast, Low-resource, and Accurate Organ and Pan-cancer Segmentation in Abdomen CT

Wentao Liu, Tong Tian, Weijin Xu et al.

Abdominal organ and tumor segmentation has many important clinical applications, such as organ quantification, surgical planning, and disease diagnosis. However, manual assessment is inherently subjective with considerable inter- and intra-expert variability. In this paper, we propose a hybrid supervised framework, StMt, that integrates self-training and mean teacher for the segmentation of abdominal organs and tumors using partially labeled and unlabeled data. We introduce a two-stage segmentation pipeline and whole-volume-based input strategy to maximize segmentation accuracy while meeting the requirements of inference time and GPU memory usage. Experiments on the validation set of FLARE2023 demonstrate that our method achieves excellent segmentation performance as well as fast and low-resource model inference. Our method achieved average DSC scores of 89.79% and 45.55% for the organs and lesions on the validation set, and the average running time and area under the GPU memory-time curve are 11.25 s and 9627.82 MB, respectively.

CV · Apr 25, 2024 · Code
CFMW: Cross-modality Fusion Mamba for Robust Object Detection under Adverse Weather

Haoyuan Li, Qi Hu, Binjia Zhou et al.

Visible-infrared image pairs provide complementary information, enhancing the reliability and robustness of object detection applications in real-world scenarios. However, most existing methods face challenges in maintaining robustness under complex weather conditions, which limits their applicability. Meanwhile, the reliance on attention mechanisms in modality fusion introduces significant computational complexity and storage overhead, particularly when dealing with high-resolution images. To address these challenges, we propose the Cross-modality Fusion Mamba with Weather-removal (CFMW) to augment stability and cost-effectiveness under adverse weather conditions. Leveraging the proposed Perturbation-Adaptive Diffusion Model (PADM) and Cross-modality Fusion Mamba (CFM) modules, CFMW is able to reconstruct visual features affected by adverse weather, enriching the representation of image details. With efficient architecture design, CFMW is 3 times faster than Transformer-style fusion (e.g., CFT). To bridge the gap in relevant datasets, we construct a new Severe Weather Visible-Infrared (SWVI) dataset, encompassing diverse adverse weather scenarios such as rain, haze, and snow. The dataset contains 64,281 paired visible-infrared images, providing a valuable resource for future research. Extensive experiments on public datasets (i.e., M3FD and LLVIP) and the newly constructed SWVI dataset conclusively demonstrate that CFMW achieves state-of-the-art detection performance. Both the dataset and source code will be made publicly available at https://github.com/lhy-zjut/CFMW.

CL · Mar 30, 2024 · Code
Rationale-based Opinion Summarization

Haoyuan Li, Snigdha Chaturvedi

Opinion summarization aims to generate concise summaries that present popular opinions of a large group of reviews. However, these summaries can be too generic and lack supporting details. To address these issues, we propose a new paradigm for summarizing reviews, rationale-based opinion summarization. Rationale-based opinion summaries output the representative opinions as well as one or more corresponding rationales. To extract good rationales, we define four desirable properties: relatedness, specificity, popularity, and diversity and present a Gibbs-sampling-based method to extract rationales. Overall, we propose RATION, an unsupervised extractive system that has two components: an Opinion Extractor (to extract representative opinions) and Rationales Extractor (to extract corresponding rationales). We conduct automatic and human evaluations to show that rationales extracted by RATION have the proposed properties and its summaries are more useful than conventional summaries. The implementation of our work is available at https://github.com/leehaoyuan/RATION.

42.6 · AR · Mar 27
VeRA+: Vector-Based Lightweight Digital Compensation for Drift-Resilient RRAM In-Memory Computing

Weirong Dong, Kai Zhou, Zhen Kong et al.

RRAM-based in-memory computing (IMC) offers high energy efficiency but suffers from conductance drift that severely degrades long-term accuracy. Existing approaches including retraining, noise-aware training, and Batch Normalization (BN)-based calibration either require RRAM rewriting, demand large storage overhead, or rely on online correction. We propose VeRA+, a lightweight drift compensation framework that reuses shared projection matrices and introduces only two compact drift-specific vectors per drift level. A drift-aware scheduling algorithm offline-trains a small set of VeRA+ parameters and selects the appropriate set over time without any on-chip retraining or data replay. VeRA+ preserves up to 99.77% of the drift-free accuracy after ten years of simulated drift and reduces storage overhead by more than three orders of magnitude compared with BN-based calibration. To validate VeRA+ under realistic device behavior, we extract one-week drift statistics from measurements on our fabricated 1T1R RRAM devices and use them to simulate realistic drifted weights. Under these measured drift conditions, VeRA+ achieves accuracy close to the drift-free baseline, providing an efficient and practical solution for long-term drift resilience in RRAM-IMC.

ARAug 23, 2024
NAS-Cap: Deep-Learning Driven 3-D Capacitance Extraction with Neural Architecture Search and Data Augmentation

Haoyuan Li, Dingcheng Yang, Chunyan Pei et al.

More accurate capacitance extraction is demanded for designing integrated circuits under advanced process technology. The pattern matching approach and the field solver for capacitance extraction have the drawbacks of inaccuracy and large computational cost, respectively. Recent work \cite{yang2023cnn} proposes a grid-based data representation and convolutional neural network (CNN) based capacitance models (called CNN-Cap), which open a third way for 3-D capacitance extraction that obtains accurate results at much lower time cost than a field solver. In this work, the techniques of neural architecture search (NAS) and data augmentation are proposed to train better CNN models for 3-D capacitance extraction. Experimental results on datasets from different designs show that the obtained NAS-Cap models achieve remarkably higher accuracy than CNN-Cap, while consuming less runtime for inference and less space for model storage. Meanwhile, the transferability of the NAS is validated, as the once-searched architecture brings similar error reduction on coupling/total capacitance for test cases from different designs and/or process technologies.

87.7ROMar 10
Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

Haoyuan Li, Rui Liu, Hehe Fan et al.

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery and training stability. Specifically, (i) policies derived from SFT suffer from compounding errors, struggling to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods, e.g., GRPO, are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step-by-step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, the Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.

IVSep 7, 2023
TSI-Net: A Timing Sequence Image Segmentation Network for Intracranial Artery Segmentation in Digital Subtraction Angiography

Lemeng Wang, Wentao Liu, Weijin Xu et al.

Cerebrovascular disease is one of the major diseases facing the world today. Automatic segmentation of the intracranial artery (IA) in digital subtraction angiography (DSA) sequences is an important step in the diagnosis of vascular-related diseases and in guiding neurointerventional procedures. However, according to the imaging principle of DSA technology, a single image can only show the part of the IA that lies within the contrast medium. Therefore, 2D DSA segmentation methods are unable to capture the complete IA information needed for the treatment of cerebrovascular diseases. We propose a U-shaped timing sequence image segmentation network, called TSI-Net, which incorporates a bi-directional ConvGRU module (BCM) in the encoder. With this module, the network can take variable-length DSA sequences as input, retain past and future information, and segment them into 2D images. In addition, we introduce a sensitive detail branch (SDB) at the end for supervising fine vessels. In experiments on the DSA sequence dataset DIAS, the method performs significantly better than recent state-of-the-art networks. In particular, it achieves a Sen score of 0.797, a 3% improvement over other methods.

LGMar 26, 2024Code
Empowering Data Mesh with Federated Learning

Haoyuan Li, Salman Toor

The evolution of data architecture has seen the rise of data lakes, aiming to solve the bottlenecks of data management and promote intelligent decision-making. However, this centralized architecture is limited by the proliferation of data sources and the growing demand for timely analysis and processing. A new data paradigm, Data Mesh, is proposed to overcome these challenges. Data Mesh treats domains as a first-class concern by distributing the data ownership from the central team to each data domain, while keeping the federated governance to monitor domains and their data products. Many multi-million dollar organizations like PayPal, Netflix, and Zalando have already transformed their data analysis pipelines based on this new architecture. In this decentralized architecture where data is locally preserved by each domain team, traditional centralized machine learning is incapable of conducting effective analysis across multiple domains, especially for security-sensitive organizations. To this end, we introduce a pioneering approach that incorporates Federated Learning into Data Mesh. To the best of our knowledge, this is the first open-source applied work that represents a critical advancement toward the integration of federated learning methods into the Data Mesh paradigm, underscoring the promising prospects for privacy-preserving and decentralized data analysis strategies within Data Mesh architecture.

CLDec 22, 2023Code
Language Model is a Branch Predictor for Simultaneous Machine Translation

Aoxiong Yin, Tianyun Zhong, Haoyuan Li et al.

The primary objective of simultaneous machine translation (SiMT) is to minimize latency while preserving the quality of the final translation. Drawing inspiration from CPU branch prediction techniques, we propose incorporating branch prediction techniques in SiMT tasks to reduce translation latency. Specifically, we utilize a language model as a branch predictor to predict potential branch directions, namely, future source words. Subsequently, we utilize the predicted source words to decode the output in advance. When the actual source word deviates from the predicted source word, we use the real source word to decode the output again, replacing the predicted output. To further reduce computational costs, we share the parameters of the encoder and the branch predictor, and utilize a pre-trained language model for initialization. Our proposed method can be seamlessly integrated with any SiMT model. Extensive experimental results demonstrate that our approach can improve translation quality and latency at the same time. Our code is available at https://github.com/YinAoXiong/simt_branch_predictor .
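
The predict-then-verify loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `speculative_simt`, `predict_next`, and `translate_prefix` are hypothetical names, standing in for the overall loop, the language-model branch predictor, and the SiMT decoder respectively.

```python
from typing import Callable, List, Optional, Tuple


def speculative_simt(
    source_stream: List[str],
    predict_next: Callable[[List[str]], str],      # hypothetical branch predictor
    translate_prefix: Callable[[List[str]], str],  # hypothetical SiMT decoder
) -> Tuple[List[str], int]:
    """Decode ahead on a predicted source word; redo the decode on a miss.

    Returns the output emitted after each source word and the number of
    steps where the prediction matched the real word.
    """
    seen: List[str] = []
    outputs: List[str] = []
    hits = 0
    guess: Optional[str] = None        # predicted next source word
    speculative: Optional[str] = None  # output decoded in advance from the guess

    for word in source_stream:
        seen.append(word)
        if guess is not None and guess == word:
            # Prediction hit: the output was already decoded before the word arrived.
            hits += 1
            outputs.append(speculative)
        else:
            # Prediction miss: decode again with the real source word.
            outputs.append(translate_prefix(seen))
        # Speculate on the next source word and decode in advance.
        guess = predict_next(seen)
        speculative = translate_prefix(seen + [guess])
    return outputs, hits


# Toy stand-ins: a bigram table as predictor, upper-casing as "translation".
bigram = {"a": "b", "b": "x"}
toy_outputs, toy_hits = speculative_simt(
    ["a", "b", "c"],
    predict_next=lambda seen: bigram.get(seen[-1], ""),
    translate_prefix=lambda prefix: " ".join(w.upper() for w in prefix),
)
```

In this sketch a hit removes the decode from the critical path, while a miss costs one extra decode, which is the trade-off the abstract describes.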

AIDec 27, 2024Code
Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework

Jiang Liu, Bolin Li, Haoyuan Li et al.

Efficient multimodal large language models (EMLLMs), in contrast to multimodal large language models (MLLMs), reduce model size and computational costs and are often deployed on resource-constrained devices. However, due to data privacy concerns, existing open-source EMLLMs rarely have access to private domain-specific data during the pre-training process, making them difficult to directly apply in device-specific domains, such as certain business scenarios. To address this weakness, this paper focuses on the efficient adaptation of EMLLMs to private domains, specifically in two areas: 1) how to reduce data requirements, and 2) how to avoid parameter fine-tuning. Specifically, we propose a tunIng-free, aDaptivE, universAL Prompt Optimization Framework, abbreviated as IDEALPrompt, which consists of two stages: 1) Predefined Prompt, which, based on the reinforcement searching strategy, generates a prompt optimization strategy tree to acquire optimization priors; 2) Prompt Reflection, which initializes the prompt based on optimization priors, followed by self-reflection to further search and refine the prompt. By doing so, IDEALPrompt elegantly generates the "ideal prompts" for processing private domain-specific data. Note that our method requires no parameter fine-tuning and only a small amount of data to quickly adapt to the data distribution of private data. Extensive experiments across multiple tasks demonstrate that our proposed IDEALPrompt significantly improves both efficiency and performance compared to baselines.

CLDec 11, 2024Code
Coverage-based Fairness in Multi-document Summarization

Haoyuan Li, Yusen Zhang, Rui Zhang et al.

Fairness in multi-document summarization (MDS) measures whether a system can generate a summary fairly representing information from documents with different social attribute values. Fairness in MDS is crucial since a fair summary can offer readers a comprehensive view. Previous works focus on quantifying summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not consider redundancy in input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and considers the redundancy within documents. To detect the corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more with our definition of fairness. Using our measures, we evaluate the fairness of thirteen different LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs. We also find that almost all LLMs overrepresent different social attribute values. The code is available at https://github.com/leehaoyuan/coverage_fairness.

DATA-ANAug 1, 2023
Beam Detection Based on Machine Learning Algorithms

Haoyuan Li, Qing Yin

The positions of free electron laser beams on screens are precisely determined by a sequence of machine learning models. Transfer training is conducted in a self-constructed convolutional neural network based on the VGG16 model. Outputs of intermediate layers are passed as features to a support vector regression model. With this sequence, 85.8% of predictions on the test data are correct.

CVDec 1, 2025
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Meng Cao, Haokun Lin, Haoyuan Li et al.

Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.

CLJun 9, 2025Code
Improving Fairness of Large Language Models in Multi-document Summarization

Haoyuan Li, Rui Zhang, Snigdha Chaturvedi

Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.

CVMar 4, 2025Code
DQO-MAP: Dual Quadrics Multi-Object mapping with Gaussian Splatting

Haoyuan Li, Ziqin Ye, Yue Hao et al.

Accurate object perception is essential for robotic applications such as object navigation. In this paper, we propose DQO-MAP, a novel object-SLAM system that seamlessly integrates object pose estimation and reconstruction. We employ 3D Gaussian Splatting for high-fidelity object reconstruction and leverage quadrics for precise object pose estimation. The management of both is handled on the CPU, while optimization is performed on the GPU, significantly improving system efficiency. By associating objects with unique IDs, our system enables rapid object extraction from the scene. Extensive experimental results on object reconstruction and pose estimation demonstrate that DQO-MAP achieves outstanding performance in terms of precision, reconstruction quality, and computational efficiency. The code and dataset are available at: https://github.com/LiHaoy-ux/DQO-MAP.

CVApr 22, 2024
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan et al.

The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomenon, where the generated texts do not align with the given contexts, significantly restricting the usage of LVLMs. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset with proprietary models, on which we train a hallucination detection model that can perform sentence-level hallucination detection, covering primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model. Furthermore, we propose differentiating the severity of hallucinations, and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) for mitigating hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments demonstrate the effectiveness of our method.

MEOct 13, 2023
A novel decomposed-ensemble time series forecasting framework: capturing underlying volatility information

Zhengtao Gui, Haoyuan Li, Sijie Xu et al.

Time series forecasting represents a significant and challenging task across various fields. Recently, methods based on mode decomposition have dominated the forecasting of complex time series because of the advantages of capturing local characteristics and extracting intrinsic modes from data. Unfortunately, most models fail to capture the implied volatilities that contain significant information. To enhance the prediction of contemporary diverse and complex time series, we propose a novel time series forecasting paradigm that integrates decomposition with the capability to capture the underlying fluctuation information of the series. In our methodology, we implement the Variational Mode Decomposition algorithm to decompose the time series into K distinct sub-modes. Following this decomposition, we apply the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model to extract the volatility information in these sub-modes. Subsequently, both the numerical data and the volatility information for each sub-mode are harnessed to train a neural network. This network is adept at predicting the information of the sub-modes, and we aggregate the predictions of all sub-modes to generate the final output. By integrating econometric and artificial intelligence methods, and taking into account both the numerical and volatility information of the time series, our proposed framework demonstrates superior performance in time series forecasting, as evidenced by the significant decrease in MSE, RMSE, and MAPE in our comparative experimental results.
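
As a rough illustration of the volatility-extraction step, a GARCH(1,1) conditional variance follows the recursion sigma2[t] = omega + alpha * eps[t-1]^2 + beta * sigma2[t-1]. The sketch below applies that recursion to the residuals of one sub-mode. It is a minimal example under stated assumptions, not the authors' code: the name `garch11_variance` is hypothetical, the order (1,1) is assumed because the abstract does not state it, and the parameters are supplied by hand rather than estimated by maximum likelihood.

```python
from typing import List, Sequence


def garch11_variance(
    residuals: Sequence[float],
    omega: float,
    alpha: float,
    beta: float,
) -> List[float]:
    """Return the GARCH(1,1) conditional variance for each time step.

    Assumes fixed, hand-picked parameters; a real pipeline would fit them.
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")

    # Start from the unconditional (long-run) variance.
    variance = omega / (1.0 - alpha - beta)
    variances: List[float] = []
    for eps in residuals:
        variances.append(variance)
        # Update with the squared shock and the previous variance.
        variance = omega + alpha * eps * eps + beta * variance
    return variances


# Example: a single large shock raises the next step's variance.
example = garch11_variance([0.0, 1.0, 0.0], omega=0.1, alpha=0.1, beta=0.8)
```

The resulting variance series could then be paired with the sub-mode values as input features for the neural network, in the spirit of the framework described above.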

LGFeb 2
Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Yaxiang Zhang, Yingru Li, Jiacai Liu et al.

Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedule of a traditional LR scheduler, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
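
A length-triggered scheduler of this kind might look like the sketch below. The abstract does not specify the trigger rule, so everything here is an assumption made for illustration: the class name `LengthTriggeredLR`, the use of an exponential moving average of the mean response length, and the rule that decay fires when that average exceeds a fixed ratio of its value at the previous decay.

```python
from typing import Optional


class LengthTriggeredLR:
    """Decay the learning rate when response length grows sharply.

    Hypothetical rule (not from the paper): track an exponential moving
    average (EMA) of mean response length and multiply the LR by `decay`
    whenever the EMA exceeds `ratio` times its value at the last decay.
    """

    def __init__(self, lr: float, decay: float = 0.5,
                 ratio: float = 1.5, momentum: float = 0.9) -> None:
        self.lr = lr
        self.decay = decay
        self.ratio = ratio
        self.momentum = momentum
        self.ema: Optional[float] = None
        self.reference: Optional[float] = None

    def step(self, mean_response_length: float) -> float:
        """Update with the latest batch statistic and return the current LR."""
        if self.ema is None:
            # First observation initialises both the EMA and the reference.
            self.ema = mean_response_length
            self.reference = mean_response_length
        else:
            self.ema = (self.momentum * self.ema
                        + (1.0 - self.momentum) * mean_response_length)
        if self.ema > self.ratio * self.reference:
            # Length surge: shrink the update size and reset the reference.
            self.lr *= self.decay
            self.reference = self.ema
        return self.lr
```

Calling `step` once per training iteration with the batch's mean response length returns the learning rate to use for that iteration; the thresholds would need tuning for any real run.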

CVMar 1, 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

Shangzhe Di, Zhelun Yu, Guanghao Zhang et al.

We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
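
The retrieval step can be pictured with the small sketch below, which scores stored cache blocks against a query vector and keeps the top k. It is an illustration under stated assumptions rather than ReKV's actual method: the names `cosine` and `retrieve_blocks` are hypothetical, each KV-cache block is assumed to be summarised by a single vector, and cosine similarity is assumed as the relevance score.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_blocks(
    query: Sequence[float],
    block_summaries: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k most query-relevant blocks, in temporal order."""
    ranked = sorted(
        range(len(block_summaries)),
        key=lambda i: cosine(query, block_summaries[i]),
        reverse=True,
    )
    # Restore temporal order so the reloaded caches stay chronological.
    return sorted(ranked[:k])


# Toy example: four stored blocks, two of them aligned with the query.
summaries = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = retrieve_blocks([1.0, 0.0], summaries, k=2)
```

Only the selected blocks would then be reloaded into GPU memory for answering the question, which is where the efficiency gain described above would come from.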

CLDec 31, 2025
R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Maoyuan Li, Zhongsheng Wang, Haoyuan Li et al.

We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

CVMar 7, 2025
CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

Guanghao Zhang, Tao Zhong, Yan Xia et al.

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, in contrast, typically perform two complementary cognitive operations when engaging in sophisticated multi-image analysis: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: 1. The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. 2. The introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.

CVAug 11, 2025
Matrix-3D: Omnidirectional Explorable 3D World Generation

Zhongqi Yang, Wenhang Ge, Yuqi Li et al.

Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representation for wide-coverage omnidirectional explorable 3D world generation and combines conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as the condition, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.

CVDec 5, 2024
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

Ziwei Huang, Wanggui He, Quanyu Long et al.

Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench in this work - the largest benchmark to date in terms of the number of concepts and prompts specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.

CLOct 31, 2025
Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference

Haoyuan Li, Yuanbo Tong, Yuchen Li et al.

Personality recognition from text is typically cast as hard-label classification, which obscures the graded, prototype-like nature of human personality judgments. We present ProtoMBTI, a cognitively aligned framework for MBTI inference that operationalizes prototype theory within an LLM-based pipeline. First, we construct a balanced, quality-controlled corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). Next, we LoRA-fine-tune a lightweight (<=2B) encoder to learn discriminative embeddings and to standardize a bank of personality prototypes. At inference, we retrieve top-k prototypes for a query post and perform a retrieve--reuse--revise--retain cycle: the model aggregates prototype evidence via prompt-based voting, revises when inconsistencies arise, and, upon correct prediction, retains the sample to continually enrich the prototype library. Across Kaggle and Pandora benchmarks, ProtoMBTI improves over baselines on both the four MBTI dichotomies and the full 16-type task, and exhibits robust cross-dataset generalization. Our results indicate that aligning the inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.

CVNov 22, 2024
Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Haoyuan Li, Chang Xu, Wen Yang et al.

Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods heavily rely on the supervision of labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods have expensive training costs and tend to overfit the region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation to 3D space from UAV observations for satellite image generation, providing robust representation against view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves competitive performance with recent supervised methods.

CVJun 5, 2025
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Haoyuan Li, Yanpeng Zhou, Yufei Gao et al.

Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited comparatively lower performance compared with the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.

CVFeb 25, 2025
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting

Haoyuan Li, Yanpeng Zhou, Tao Tang et al.

Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pretraining, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations to learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations of the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets with zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and stronger aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over previous SOTA, Uni3D, including on zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).

LGApr 3, 2025
FAST: Federated Active Learning with Foundation Models for Communication-efficient Sampling and Training

Haoyuan Li, Mathias Funk, Jindong Wang et al.

Federated Active Learning (FAL) has emerged as a promising framework to leverage large quantities of unlabeled data across distributed clients while preserving data privacy. However, real-world deployments remain limited by high annotation costs and communication-intensive sampling processes, particularly in a cross-silo setting, when clients possess substantial local datasets. This paper addresses the crucial question: What is the best practice to reduce communication costs in human-in-the-loop learning with minimal annotator effort? Existing FAL methods typically rely on iterative annotation processes that separate active sampling from federated updates, leading to multiple rounds of expensive communication and annotation. In response, we introduce FAST, a two-pass FAL framework that harnesses foundation models for weak labeling in a preliminary pass, followed by a refinement pass focused exclusively on the most uncertain samples. By leveraging representation knowledge from foundation models and integrating refinement steps into a streamlined workflow, FAST substantially reduces the overhead incurred by iterative active sampling. Extensive experiments on diverse medical and natural image benchmarks demonstrate that FAST outperforms existing FAL methods by an average of 4.36% while reducing communication rounds eightfold under a limited 5% labeling budget.
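The two-pass idea described above can be illustrated with a minimal single-client sketch. It is not the authors' implementation: the abstract does not say how uncertainty is measured or how the passes interact with federated rounds, so the entropy criterion, the `two_pass_labels` function, and the `annotate` callback standing in for a human annotator are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def two_pass_labels(
    weak_probs: List[Sequence[float]],
    budget: float,
    annotate: Callable[[int], int],
) -> Tuple[List[int], List[int]]:
    """Weak-label everything, then spend the annotation budget on uncertain samples.

    weak_probs: per-sample class probabilities from a (hypothetical) foundation model.
    budget: fraction of samples a human may label, e.g. 0.05.
    annotate: hypothetical oracle mapping a sample index to its true label.
    """
    n = len(weak_probs)
    # Pass 1: take the most likely class as a weak label for every sample.
    labels = [max(range(len(p)), key=p.__getitem__) for p in weak_probs]
    # Pass 2: query the oracle only for the k highest-entropy samples (assumed criterion).
    k = min(n, max(1, round(budget * n)))
    queried = sorted(range(n), key=lambda i: entropy(weak_probs[i]), reverse=True)[:k]
    for i in queried:
        labels[i] = annotate(i)
    return labels, queried
```

In this sketch the annotator is consulted once, after weak labeling, rather than in repeated sampling rounds, which is the property the abstract credits for the reduced communication.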

IVMar 9, 2024
UDCR: Unsupervised Aortic DSA/CTA Rigid Registration Using Deep Reinforcement Learning and Overlap Degree Calculation

Wentao Liu, Bowen Liang, Weijin Xu et al.

The rigid registration of aortic Digital Subtraction Angiography (DSA) and Computed Tomography Angiography (CTA) can provide 3D anatomical details of the vasculature for the interventional surgical treatment of conditions such as aortic dissection and aortic aneurysms, holding significant value for clinical research. However, the current methods for 2D/3D image registration are dependent on manual annotations or synthetic data, as well as the extraction of landmarks, which is not suitable for cross-modal registration of aortic DSA/CTA. In this paper, we propose an unsupervised method, UDCR, for aortic DSA/CTA rigid registration based on deep reinforcement learning. Leveraging the imaging principles and characteristics of DSA and CTA, we have constructed a cross-dimensional registration environment based on spatial transformations. Specifically, we propose an overlap degree calculation reward function that measures the intensity difference between the foreground and background, aimed at assessing the accuracy of registration between segmentation maps and DSA images. This method is highly flexible, allowing for the loading of pre-trained models to perform registration directly or to seek the optimal spatial transformation parameters through online learning. We manually annotated 61 pairs of aortic DSA/CTA for algorithm evaluation. The results indicate that the proposed UDCR achieved a Mean Absolute Error (MAE) of 2.85 mm in translation and 4.35° in rotation, showing significant potential for clinical applications.
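A reward that "measures the intensity difference between the foreground and background" can be sketched roughly as follows. The abstract gives no formula, so this is only one plausible reading: `overlap_reward` is a hypothetical name, the mask is assumed to be the CTA segmentation already projected to 2D under the current transformation, and vessels are assumed to appear darker than the background in the DSA frame.

```python
from typing import List


def overlap_reward(image: List[List[float]], mask: List[List[int]]) -> float:
    """Mean background intensity minus mean foreground intensity.

    image: 2D DSA frame as rows of pixel intensities.
    mask: projected binary segmentation of the same shape (1 = vessel).
    Assumes vessels are darker than background, so a well-aligned mask
    covers dark pixels and yields a large positive value.
    """
    fg, bg = [], []
    for image_row, mask_row in zip(image, mask):
        for pixel, is_vessel in zip(image_row, mask_row):
            (fg if is_vessel else bg).append(pixel)
    if not fg or not bg:
        return 0.0  # degenerate mask: no contrast can be measured
    return sum(bg) / len(bg) - sum(fg) / len(fg)
```

A reinforcement learning agent could use a signal of this kind without registration labels, since it only compares the projected mask against image intensities.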

CVFeb 9, 2024
GS-CLIP: Gaussian Splatting for Contrastive Language-Image-3D Pretraining from Real-World Data

Haoyuan Li, Yanpeng Zhou, Yihan Zeng et al.

3D shapes represented as point clouds have achieved advancements in multimodal pre-training to align image and language descriptions, which is crucial to object identification, classification, and retrieval. However, the discrete representation of point clouds loses the object's surface shape information and creates a gap between rendering results and 2D correspondences. To address this problem, we propose GS-CLIP as the first attempt to introduce 3DGS (3D Gaussian Splatting) into multimodal pre-training to enhance 3D representation. GS-CLIP leverages a pre-trained vision-language model for a common visual and textual space learned on massive real-world image-text pairs and then learns a 3D encoder for aligning the 3DGS optimized per object. Additionally, a novel Gaussian-Aware Fusion is proposed to extract and fuse global explicit features. As a general framework for language-image-3D pre-training, GS-CLIP is agnostic to 3D backbone networks. Experiments on challenging real-world data show that GS-CLIP significantly improves the state-of-the-art, outperforming the previously best results.
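Aligning a 3D encoder with a frozen image-text embedding space is commonly done with a CLIP-style contrastive objective. The abstract does not state the loss, so the sketch below assumes a symmetric InfoNCE formulation; `info_nce`, `alignment_loss`, and the temperature of 0.07 are illustrative choices, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors: List[Vector], targets: List[Vector], temperature: float) -> float:
    """Mean cross-entropy of matching anchor i to target i among all targets."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def alignment_loss(
    emb_3d: List[Vector],
    emb_image: List[Vector],
    emb_text: List[Vector],
    temperature: float = 0.07,
) -> float:
    """Average of 3D-to-image and 3D-to-text contrastive terms, in both directions."""
    pairs = [
        (emb_3d, emb_image), (emb_image, emb_3d),
        (emb_3d, emb_text), (emb_text, emb_3d),
    ]
    return sum(info_nce(x, y, temperature) for x, y in pairs) / len(pairs)
```

In such a setup only the 3D embeddings would receive gradients, while the image and text embeddings come from the pre-trained vision-language model.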

CVMar 8
EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

Binjia Zhou, Dawei Luo, Shuai Chen et al.

With the rapid advancement of AIGC technology, developing identification methods to address the security challenges posed by deepfakes has become urgent. Face forgery identification techniques can be categorized into two types: traditional classification methods and explainable VLM approaches. The former provides classification results but lacks explanatory ability, while the latter, although capable of providing coarse-grained explanations, often suffers from hallucinations and insufficient detail. To overcome these limitations, we propose EvolveReason, which mimics the reasoning and observational processes of human auditors when identifying face forgeries. By constructing a chain-of-thought dataset, CoT-Face, tailored for advanced VLMs, our approach guides the model to think in a human-like way, prompting it to output reasoning processes and judgment results. This provides practitioners with reliable analysis and helps alleviate hallucination. Additionally, our framework incorporates a forgery latent-space distribution capture module, enabling EvolveReason to identify high-frequency forgery cues difficult to extract from the original images. To further enhance the reliability of textual explanations, we introduce a self-evolution exploration strategy, leveraging reinforcement learning to allow the model to iteratively explore and optimize its textual descriptions in a two-stage process. Experimental results show that EvolveReason not only outperforms the current state-of-the-art methods in identification performance but also accurately identifies forgery details and demonstrates generalization capabilities.

CLOct 23, 2025
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

Haoyuan Li, Zhengyuan Shen, Sullam Jeoung et al.

As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
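The reconstruction step, where full segments are recovered by locating generated starting tokens in the original text, can be sketched at the string level. This is a simplified illustration rather than the paper's procedure: it matches character snippets instead of model tokens, and the `reconstruct_segments` function and its `(label, snippet)` input format are assumptions.

```python
from typing import List, Tuple


def reconstruct_segments(
    text: str, starts: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Rebuild labeled segments from (label, starting_snippet) pairs.

    Each snippet is searched for after the previous match, so segments stay
    in document order. Snippets that are empty or cannot be found are skipped,
    and any text before the first located snippet is left uncovered.
    """
    boundaries = []  # (character offset, label) of each located segment start
    cursor = 0
    for label, snippet in starts:
        if not snippet:
            continue
        pos = text.find(snippet, cursor)
        if pos == -1:
            continue
        boundaries.append((pos, label))
        cursor = pos + len(snippet)
    segments = []
    for i, (pos, label) in enumerate(boundaries):
        # A segment runs until the next boundary, or to the end of the text.
        end = boundaries[i + 1][0] if i + 1 < len(boundaries) else len(text)
        segments.append((label, text[pos:end]))
    return segments


prompt = "You are a helpful bot.\nInput: {query}\nAnswer in JSON."
predicted = [("role", "You are"), ("input", "Input:"), ("format", "Answer in")]
segments = reconstruct_segments(prompt, predicted)
```

Because segment contents are copied from the source text rather than generated, the output cannot contain words that are absent from the input, and the generated sequence stays short regardless of segment length.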

AIOct 16, 2025
Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Haoyuan Li, Mathias Funk, Aaqib Saeed

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

CLSep 25, 2025
Comparative Personalization for Multi-document Summarization

Haoyuan Li, Snigdha Chaturvedi

Personalized multi-document summarization (MDS) is essential for meeting individual user preferences for the writing style and content focus of summaries. In this paper, we propose that for effective personalization, it is important to identify fine-grained differences between users' preferences by comparing the given user's preferences with other users' preferences. Motivated by this, we propose ComPSum, a personalized MDS framework. It first generates a structured analysis of a user by comparing their preferences with other users' preferences. The generated structured analysis is then used to guide the generation of personalized summaries. To evaluate the performance of ComPSum, we propose AuthorMap, a fine-grained reference-free evaluation framework for personalized MDS. It evaluates the personalization of a system based on the authorship attribution between two personalized summaries generated for different users. For robust evaluation of personalized MDS, we construct PerMSum, a personalized MDS dataset in the review and news domain. We evaluate the performance of ComPSum on PerMSum using AuthorMap, showing that it outperforms strong baselines.