Chun Li

CV
h-index10
18papers
56citations
Novelty57%
AI Score55

18 Papers

CVMay 20Code
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

Siqi Wei, Hongbin Xu, Feng Xiao et al.

Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, \ie, +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.

LGJan 30Code
ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Xiaoce Wang, Guibin Zhang, Junzhe Li et al.

Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open-source at https://github.com/ZephinueCode/ToolTok.

CVNov 1, 2025
Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities

Xihang Qiu, Jiarong Cheng, Yuhao Fang et al.

Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By federated aggregation of modality-specific diffusion models trained on clients and broadcasting them to clients missing corresponding modalities, FedDISC overcomes single-client reliance on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing modality patterns, outperforming existing approaches.

IVMay 7, 2024
Semi-Supervised Disease Classification based on Limited Medical Image Data

Yan Zhang, Chun Li, Zhaoxia Liu et al.

In recent years, significant progress has been made in the field of learning from positive and unlabeled examples (PU learning), particularly in the context of advancing image and text classification tasks. However, applying PU learning to semi-supervised disease classification remains a formidable challenge, primarily due to the limited availability of labeled medical images. In the realm of medical image-aided diagnosis algorithms, numerous theoretical and practical obstacles persist. The research on PU learning for medical image-assisted diagnosis holds substantial importance, as it aims to reduce the time spent by professional experts in classifying images. Unlike natural images, medical images are typically accompanied by a scarcity of annotated data, while an abundance of unlabeled cases exists. Addressing these challenges, this paper introduces a novel generative model inspired by Hölder divergence, specifically designed for semi-supervised disease classification using positive and unlabeled medical image data. In this paper, we present a comprehensive formulation of the problem and establish its theoretical feasibility through rigorous mathematical analysis. To evaluate the effectiveness of our proposed approach, we conduct extensive experiments on five benchmark datasets commonly used in PU medical learning: BreastMNIST, PneumoniaMNIST, BloodMNIST, OCTMNIST, and AMD. The experimental results clearly demonstrate the superiority of our method over existing approaches based on KL divergence. Notably, our approach achieves state-of-the-art performance on all five disease classification benchmarks. By addressing the limitations imposed by limited labeled data and harnessing the untapped potential of unlabeled medical images, our novel generative model presents a promising direction for enhancing semi-supervised disease classification in the field of medical image analysis.

ROAug 4, 2025
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

Cui Miao, Tao Chang, Meihan Wu et al.

Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer.Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.

CVJan 15, 2025
Unified Few-shot Crack Segmentation and its Precise 3D Automatic Measurement in Concrete Structures

Pengru Deng, Jiapeng Yao, Chun Li et al.

Visual-Spatial Systems has become increasingly essential in concrete crack inspection. However, existing methods often lacks adaptability to diverse scenarios, exhibits limited robustness in image-based approaches, and struggles with curved or complex geometries. To address these limitations, an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement was proposed by integrating computer vision technologies and multi-modal Simultaneous localization and mapping (SLAM) in this study. Firstly, building on a base DeepLabv3+ segmentation model, and incorporating specific refinements utilizing foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the crack geometric attributions were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

IVNov 13, 2024
Robust Divergence Learning for Missing-Modality Segmentation

Runze Cheng, Zhongao Sun, Ye Zhang et al.

Multimodal Magnetic Resonance Imaging (MRI) provides essential complementary information for analyzing brain tumor subregions. While methods using four common MRI modalities for automatic segmentation have shown success, they often face challenges with missing modalities due to image quality issues, inconsistent protocols, allergic reactions, or cost factors. Thus, developing a segmentation paradigm that handles missing modalities is clinically valuable. A novel single-modality parallel processing network framework based on Hölder divergence and mutual information is introduced. Each modality is independently input into a shared network backbone for parallel processing, preserving unique information. Additionally, a dynamic sharing framework is introduced that adjusts network parameters based on modality availability. A Hölder divergence and mutual information-based loss functions are used for evaluating discrepancies between predictions and labels. Extensive testing on the BraTS 2018 and BraTS 2020 datasets demonstrates that our method outperforms existing techniques in handling missing modalities and validates each component's effectiveness.

CLMay 31, 2025
DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA

Yuelyu Ji, Hang Zhang, Shiven Verma et al.

We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.

LGJan 2, 2025
General Information Metrics for Improving AI Model Training Efficiency

Jianfeng Xu, Congcong Liu, Xiaoying Tan et al.

To address the growing size of AI model training data and the lack of a universal data selection methodology-factors that significantly drive up training costs -- this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and costs. Additionally, applying GIME within the Judicial AI Program led to a remarkable 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.

CVNov 1, 2025
A DeepONet joint Neural Tangent Kernel Hybrid Framework for Physics-Informed Inverse Source Problems and Robust Image Reconstruction

Yuhao Fang, Zijian Wang, Yao Lu et al.

This work presents a novel hybrid approach that integrates Deep Operator Networks (DeepONet) with the Neural Tangent Kernel (NTK) to solve complex inverse problem. The method effectively addresses tasks such as source localization governed by the Navier-Stokes equations and image reconstruction, overcoming challenges related to nonlinearity, sparsity, and noisy data. By incorporating physics-informed constraints and task-specific regularization into the loss function, the framework ensures solutions that are both physically consistent and accurate. Validation on diverse synthetic and real datasets demonstrates its robustness, scalability, and precision, showcasing its broad potential applications in computational physics and imaging sciences.

CVOct 5, 2025
Diffusion Low Rank Hybrid Reconstruction for Sparse View Medical Imaging

Zongyin Deng, Qing Zhou, Yuhao Fang et al.

This work presents TV-LoRA, a novel method for low-dose sparse-view CT reconstruction that combines a diffusion generative prior (NCSN++ with SDE modeling) and multi-regularization constraints, including anisotropic TV and nuclear norm (LoRA), within an ADMM framework. To address ill-posedness and texture loss under extremely sparse views, TV-LoRA integrates generative and physical constraints, and utilizes a 2D slice-based strategy with FFT acceleration and tensor-parallel optimization for efficient inference. Experiments on AAPM-2016, CTHD, and LIDC datasets with $N_{\mathrm{view}}=8,4,2$ show that TV-LoRA consistently surpasses benchmarks in SSIM, texture recovery, edge clarity, and artifact suppression, demonstrating strong robustness and generalizability. Ablation studies confirm the complementary effects of LoRA regularization and diffusion priors, while the FFT-PCG module provides a speedup. Overall, Diffusion + TV-LoRA achieves high-fidelity, efficient 3D CT reconstruction and broad clinical applicability in low-dose, sparse-sampling scenarios.

CVJul 14, 2025
Uncertainty Quantification for Incomplete Multi-View Data Using Divergence Measures

Zhipeng Xue, Yan Zhang, Ming Li et al.

Existing multi-view classification and clustering methods typically improve task accuracy by leveraging and fusing information from different views. However, ensuring the reliability of multi-view integration and final decisions is crucial, particularly when dealing with noisy or corrupted data. Current methods often rely on Kullback-Leibler (KL) divergence to estimate uncertainty of network predictions, ignoring domain gaps between different modalities. To address this issue, KPHD-Net, based on Hölder divergence, is proposed for multi-view classification and clustering tasks. Generally, our KPHD-Net employs a variational Dirichlet distribution to represent class probability distributions, models evidences from different views, and then integrates it with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation effects. Our theoretical analysis demonstrates that Proper Hölder divergence offers a more effective measure of distribution discrepancies, ensuring enhanced performance in multi-view learning. Moreover, Dempster-Shafer evidence theory, recognized for its superior performance in multi-view fusion tasks, is introduced and combined with the Kalman filter to provide future state estimations. This integration further enhances the reliability of the final fusion results. Extensive experiments show that the proposed KPHD-Net outperforms the current state-of-the-art methods in both classification and clustering tasks regarding accuracy, robustness, and reliability, with theoretical guarantees.

CVJul 2, 2025
Robust Brain Tumor Segmentation with Incomplete MRI Modalities Using Hölder Divergence and Mutual Information-Enhanced Knowledge Transfer

Runze Cheng, Xihang Qiu, Ming Li et al.

Multimodal MRI provides critical complementary information for accurate brain tumor segmentation. However, conventional methods struggle when certain modalities are missing due to issues such as image quality, protocol inconsistencies, patient allergies, or financial constraints. To address this, we propose a robust single-modality parallel processing framework that achieves high segmentation accuracy even with incomplete modalities. Leveraging Holder divergence and mutual information, our model maintains modality-specific features while dynamically adjusting network parameters based on the available inputs. By using these divergence- and information-based loss functions, the framework effectively quantifies discrepancies between predictions and ground-truth labels, resulting in consistently accurate segmentation. Extensive evaluations on the BraTS 2018 and BraTS 2020 datasets demonstrate superior performance over existing methods in handling missing modalities.

CRFeb 13, 2025
X-SG$^2$S: Safe and Generalizable Gaussian Splatting with X-dimensional Watermarks

Zihang Cheng, Huiping Zhuang, Chun Li et al.

3D Gaussian Splatting (3DGS) has been widely used in 3D reconstruction and 3D generation. Training to get a 3DGS scene often takes a lot of time and resources and even valuable inspiration. The increasing amount of 3DGS digital asset have brought great challenges to the copyright protection. However, it still lacks profound exploration targeted at 3DGS. In this paper, we propose a new framework X-SG$^2$S which can simultaneously watermark 1 to 3D messages while keeping the original 3DGS scene almost unchanged. Generally, we have a X-SG$^2$S injector for adding multi-modal messages simultaneously and an extractor for extract them. Specifically, we first split the watermarks into message patches in a fixed manner and sort the 3DGS points. A self-adaption gate is used to pick out suitable location for watermarking. Then use a XD(multi-dimension)-injection heads to add multi-modal messages into sorted 3DGS points. A learnable gate can recognize the location with extra messages and XD-extraction heads can restore hidden messages from the location recommended by the learnable gate. Extensive experiments demonstrated that the proposed X-SG$^2$S can effectively conceal multi modal messages without changing pretrained 3DGS pipeline or the original form of 3DGS parameters. Meanwhile, with simple and efficient model structure and high practicality, X-SG$^2$S still shows good performance in hiding and extracting multi-modal inner structured or unstructured messages. X-SG$^2$S is the first to unify 1 to 3D watermarking model for 3DGS and the first framework to add multi-modal watermarks simultaneous in one 3DGS which pave the wave for later researches.

CVNov 30, 2024
EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients

Meihan Wu, Tao Chang, Cui Miao et al.

Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. ViTs training demands higher computational resources due to the lack of 2D inductive biases inherent in CNNs. However, efficient federated training of ViTs on resource-constrained edge devices remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained edge devices, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy protection. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on clients and the central server, respectively. The local modules are trained on masked image patches, while the global module is trained on intermediate patch features uploaded from the local client, balanced through a proposed median sampling strategy to erase client data distribution privacy. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on popular benchmarks show that EFTViT achieves up to 28.17% accuracy improvement, reduces local training computational cost by up to 2.8$\times$, and cuts local training time by up to 4.4$\times$ compared to existing methods.

CVOct 29, 2024
Uncertainty Quantification via Hölder Divergence for Multi-View Representation Learning

Yan Zhang, Ming Li, Chun Li et al.

Evidence-based deep learning represents a burgeoning paradigm for uncertainty estimation, offering reliable predictions with negligible extra computational overheads. Existing methods usually adopt Kullback-Leibler divergence to estimate the uncertainty of network predictions, ignoring domain gaps among various modalities. To tackle this issue, this paper introduces a novel algorithm based on Hölder Divergence (HD) to enhance the reliability of multi-view learning by addressing inherent uncertainty challenges from incomplete or noisy data. Generally, our method extracts the representations of multiple modalities through parallel network branches, and then employs HD to estimate the prediction uncertainties. Through the Dempster-Shafer theory, integration of uncertainty from different modalities, thereby generating a comprehensive result that considers all available representations. Mathematically, HD proves to better measure the ``distance'' between real data distribution and predictive distribution of the model and improve the performances of multi-class recognition tasks. Specifically, our method surpass the existing state-of-the-art counterparts on all evaluating benchmarks. We further conduct extensive experiments on different backbones to verify our superior robustness. It is demonstrated that our method successfully pushes the corresponding performance boundaries. Finally, we perform experiments on more challenging scenarios, \textit{i.e.}, learning with incomplete or noisy data, revealing that our method exhibits a high tolerance to such corrupted data.

IVJun 12, 2024
Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning

Zhongao Sun, Jiameng Li, Yuhan Wang et al.

Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI) where missing modality images are common in clinical settings, leading to reduced segmentation accuracy. To address this issue, we propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data. Additionally, in the fine-tuning phase, we utilize a knowledge distillation technique to align features between complete and missing modality data, simultaneously enhancing model robustness. Notably, we leverage the Holder pseudo-divergence instead of the KLD for distillation loss, offering improve mathematical interpretability and properties. Extensive experiments on the BRATS2018 and BRATS2020 datasets demonstrate significant performance enhancements compared to existing state-of-the-art methods.

LGMar 10, 2019
Multinomial Random Forest: Toward Consistency and Privacy-Preservation

Yiming Li, Jiawang Bai, Jiawei Li et al.

Despite the impressive performance of random forests (RF), its theoretical properties have not been thoroughly understood. In this paper, we propose a novel RF framework, dubbed multinomial random forest (MRF), to analyze the \emph{consistency} and \emph{privacy-preservation}. Instead of deterministic greedy split rule or with simple randomness, the MRF adopts two impurity-based multinomial distributions to randomly select a split feature and a split value respectively. Theoretically, we prove the consistency of the proposed MRF and analyze its privacy-preservation within the framework of differential privacy. We also demonstrate with multiple datasets that its performance is on par with the standard RF. To the best of our knowledge, MRF is the first consistent RF variant that has comparable performance to the standard RF.