Yun Li

CV
h-index62
102papers
3,298citations
Novelty46%
AI Score61

102 Papers

85.6ITJun 2
On Secure EKF-enhanced UAV-ISAC Systems

Hongjiang Lei, Heng Jin, Ki-Hong Park et al.

Integrated sensing and communication (ISAC) has emerged as a promising key technology for future wireless networks, enabling the efficient coordination of sensing and communication functions within limited resources. This work investigates a secure ISAC system assisted by an uncrewed aerial vehicle (UAV). By incorporating the extended Kalman filter (EKF), the proposed system is capable of delivering communication services to legitimate users while simultaneously jamming eavesdroppers and performing joint prediction and tracking of the trajectories of both legitimate and illegitimate users. Considering practical constraints such as {sensing beamwidth}, transmit power, and UAV's propulsion energy consumption, the secrecy rate is maximized through the joint design of transmit beamforming and UAV trajectory. To tackle the resulting highly non-convex optimization problem, an efficient iterative algorithm is developed by integrating block coordinate descent, successive convex approximation, and EKF, thereby yielding a high-quality suboptimal solution. Extensive simulation results validate the superior performance of the proposed scheme compared to benchmarks.

LGJul 6, 2024Code
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Peng Xia, Kangyu Zhu, Haoran Li et al.

The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on medical VQA and report generation tasks across three datasets, achieving an average improvement of 47.4% in factual accuracy. We publicly release our benchmark and code in https://github.com/richard-peng-xia/RULE.

CLAug 20, 2024Code
MEGen: Generative Backdoor into Large Language Models via Model Editing

Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang et al.

Large language models (LLMs) have exhibited remarkable versatility and adaptability, while their widespread adoption across various applications also raises critical safety concerns. This paper focuses on the impact of backdoored LLMs. Traditional backdoor injection methods are primarily limited to yes-or-no discriminative tasks, leading users to underestimate the potential risks of backdoored LLMs. Given the inherently generative nature of LLMs, this paper reveals that a generative backdoor injected into LLMs can expose the true safety risks in their applications. We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks in a unified format of any text-to any text, leading to natural generations with a specific intention. Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local parameters with few-shot samples. Notably, we show that the backdoored model, when triggered, can freely output pre-set dangerous information while completing downstream tasks. Our work highlights that MEGen enables backdoors in LLMs to exhibit generative capabilities, causing potential safety risks by altering the generative style. The code is available at https://github.com/MonoQ-hub/MEGen.

NIAug 2, 2022
Generative Adversarial Learning for Intelligent Trust Management in 6G Wireless Networks

Liu Yang, Yun Li, Simon X. Yang et al.

Emerging six generation (6G) is the integration of heterogeneous wireless networks, which can seamlessly support anywhere and anytime networking. But high Quality-of-Trust should be offered by 6G to meet mobile user expectations. Artificial intelligence (AI) is considered as one of the most important components in 6G. Then AI-based trust management is a promising paradigm to provide trusted and reliable services. In this article, a generative adversarial learning-enabled trust management method is presented for 6G wireless networks. Some typical AI-based trust management schemes are first reviewed, and then a potential heterogeneous and intelligent 6G architecture is introduced. Next, the integration of AI and trust management is developed to optimize the intelligence and security. Finally, the presented AI-based trust management method is applied to secure clustering to achieve reliable and real-time communications. Simulation results have demonstrated its excellent performance in guaranteeing network security and service quality.

CVMar 1, 2023
Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning

Yun Li, Zhe Liu, Saurav Jha et al.

Open-World Compositional Zero-Shot Learning (OW-CZSL) aims to recognize new compositions of seen attributes and objects. In OW-CZSL, methods built on the conventional closed-world setting degrade severely due to the unconstrained OW test space. While previous works alleviate the issue by pruning compositions according to external knowledge or correlations in seen pairs, they introduce biases that harm the generalization. Some methods thus predict state and object with independently constructed and trained classifiers, ignoring that attributes are highly context-dependent and visually entangled with objects. In this paper, we propose a novel Distilled Reverse Attention Network to address the challenges. We also model attributes and objects separately but with different motivations, capturing contextuality and locality, respectively. We further design a reverse-and-distill strategy that learns disentangled representations of elementary components in training data supervised by reverse attention and knowledge distillation. We conduct experiments on three datasets and consistently achieve state-of-the-art (SOTA) performance.

CLAug 13, 2024Code
PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Tianyu Liu, Yun Li, Qitan Lv et al.

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first, and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However, existing SD methods suffer from the mutual waiting problem, i.e., the target model gets stuck when the draft model is guessing tokens, and vice versa. This problem is directly incurred by the asynchronous execution of the draft model and the target model and is exacerbated due to the fixed draft length in speculative decoding. To address these challenges, we propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL). Specifically, PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. PEARL parallels the drafting phase and the verification phase via applying the two strategies, and achieves adaptive draft length for different scenarios, which effectively alleviates the mutual waiting problem. Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$\times$ and 1.50$\times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively. Our code is available at https://github.com/smart-lty/ParallelSpeculativeDecoding.

NIOct 14, 2022
Generative Adversarial Learning for Trusted and Secure Clustering in Industrial Wireless Sensor Networks

Liu Yang, Simon X. Yang, Yun Li et al.

Traditional machine learning techniques have been widely used to establish the trust management systems. However, the scale of training dataset can significantly affect the security performances of the systems, while it is a great challenge to detect malicious nodes due to the absence of labeled data regarding novel attacks. To address this issue, this paper presents a generative adversarial network (GAN) based trust management mechanism for Industrial Wireless Sensor Networks (IWSNs). First, type-2 fuzzy logic is adopted to evaluate the reputation of sensor nodes while alleviating the uncertainty problem. Then, trust vectors are collected to train a GAN-based codec structure, which is used for further malicious node detection. Moreover, to avoid normal nodes being isolated from the network permanently due to error detections, a GAN-based trust redemption model is constructed to enhance the resilience of trust management. Based on the latest detection results, a trust model update method is developed to adapt to the dynamic industrial environment. The proposed trust management mechanism is finally applied to secure clustering for reliable and real-time data transmission, and simulation results show that it achieves a high detection rate up to 96%, as well as a low false positive rate below 8%.

96.0LGApr 15Code
MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution

Jianwen Chen, Xinyu Yang, Peng Xia et al.

Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri net theory. The framework adopts a full-stack design across data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning paths and transforms them into Petri net-structured representations. At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. Systematically, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance while delivering a 1.3x reduction in inference latency and a 1.7x increase in generation throughput, enabled by its parallel decoding capability. Code is available at https://github.com/aiming-lab/MedVerse.

SPMay 3, 2022
Disentangled and Side-aware Unsupervised Domain Adaptation for Cross-dataset Subjective Tinnitus Diagnosis

Yun Li, Zhe Liu, Lina Yao et al.

EEG-based tinnitus classification is a valuable tool for tinnitus diagnosis, research, and treatments. Most current works are limited to a single dataset where data patterns are similar. But EEG signals are highly non-stationary, resulting in model's poor generalization to new users, sessions or datasets. Thus, designing a model that can generalize to new datasets is beneficial and indispensable. To mitigate distribution discrepancy across datasets, we propose to achieve Disentangled and Side-aware Unsupervised Domain Adaptation (DSUDA) for cross-dataset tinnitus diagnosis. A disentangled auto-encoder is developed to decouple class-irrelevant information from the EEG signals to improve the classifying ability. The side-aware unsupervised domain adaptation module adapts the class-irrelevant information as domain variance to a new dataset and excludes the variance to obtain the class-distill features for the new dataset classification. It also align signals of left and right ears to overcome inherent EEG pattern difference. We compare DSUDA with state-of-the-art methods, and our model achieves significant improvements over competitors regarding comprehensive evaluation criteria. The results demonstrate our model can successfully generalize to a new dataset and effectively diagnose tinnitus.

CRFeb 22, 2023
PAD: Towards Principled Adversarial Malware Detection Against Evasion Attacks

Deqiang Li, Shicheng Cui, Yun Li et al.

Machine Learning (ML) techniques can facilitate the automation of malicious software (malware for short) detection, but suffer from evasion attacks. Many studies counter such attacks in heuristic manners, lacking theoretical guarantees and defense effectiveness. In this paper, we propose a new adversarial training framework, termed Principled Adversarial Malware Detection (PAD), which offers convergence guarantees for robust optimization methods. PAD lays on a learnable convex measurement that quantifies distribution-wise discrete perturbations to protect malware detectors from adversaries, whereby for smooth detectors, adversarial training can be performed with theoretical treatments. To promote defense effectiveness, we propose a new mixture of attacks to instantiate PAD to enhance deep neural network-based measurements and malware detectors. Experimental results on two Android malware datasets demonstrate: (i) the proposed method significantly outperforms the state-of-the-art defenses; (ii) it can harden ML-based malware detection against 27 evasion attacks with detection accuracies greater than 83.45%, at the price of suffering an accuracy decrease smaller than 2.16% in the absence of attacks; (iii) it matches or outperforms many anti-malware scanners in VirusTotal against realistic adversarial malware.

99.2AIMay 19Code
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li et al.

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a \textsc{Pivot}/\textsc{Refine} decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

CLApr 15, 2022
Chinese Idiom Paraphrasing

Jipeng Qiang, Yang Li, Chaowei Zhang et al.

Idioms, are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters. Due to the properties of non-compositionality and metaphorical meaning, Chinese Idioms are hard to be understood by children and non-native speakers. This study proposes a novel task, denoted as Chinese Idiom Paraphrasing (CIP). CIP aims to rephrase idioms-included sentences to non-idiomatic ones under the premise of preserving the original sentence's meaning. Since the sentences without idioms are easier handled by Chinese NLP systems, CIP can be used to pre-process Chinese datasets, thereby facilitating and improving the performance of Chinese NLP tasks, e.g., machine translation system, Chinese idiom cloze, and Chinese idiom embeddings. In this study, CIP task is treated as a special paraphrase generation task. To circumvent difficulties in acquiring annotations, we first establish a large-scale CIP dataset based on human and machine collaboration, which consists of 115,530 sentence pairs. We further deploy three baselines and two novel CIP approaches to deal with CIP problems. The results show that the proposed methods have better performances than the baselines based on the established CIP dataset.

ROMar 1Code
An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving

Yun Li, Simon Thompson, Yidu Zhang et al.

Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes all solver parameters at export time. We present an open-source modular benchmark that addresses both gaps: using ONNX GraphSurgeon, we decompose a monolithic 18,398 node diffusion planner into three independently executable modules and reimplement the DPM-Solver++ denoising loop in native C++. Integrated as a ROS 2 node within Autoware, the open-source AD stack deployed on real vehicles worldwide, the system enables runtime-configurable solver parameters without model recompilation and per-step observability of the denoising process, breaking the black box of monolithic deployment. Unlike evaluations in standalone simulators such as CARLA, our benchmark operates within a production-grade stack and is validated through AWSIM closed-loop simulation. Through systematic comparison of DPM-Solver++ (first- and second-order) and DDIM across six step-count configurations (N in {3, 5, 7, 10, 15, 20}), we show that encoder caching yields a 3.2x latency reduction, and that second-order solving reduces FDE by 41% at N=3 compared to first-order. The complete codebase will be released as open-source, providing a direct path from simulation benchmarks to real-vehicle deployment.

CVAug 10, 2024Code
SAM-FNet: SAM-Guided Fusion Network for Laryngo-Pharyngeal Tumor Detection

Jia Wei, Yun Li, Meiyu Qiu et al.

Laryngo-pharyngeal cancer (LPC) is a highly fatal malignant disease affecting the head and neck region. Previous studies on endoscopic tumor detection, particularly those leveraging dual-branch network architectures, have shown significant advancements in tumor detection. These studies highlight the potential of dual-branch networks in improving diagnostic accuracy by effectively integrating global and local (lesion) feature extraction. However, they are still limited in their capabilities to accurately locate the lesion region and capture the discriminative feature information between the global and local branches. To address these issues, we propose a novel SAM-guided fusion network (SAM-FNet), a dual-branch network for laryngo-pharyngeal tumor detection. By leveraging the powerful object segmentation capabilities of the Segment Anything Model (SAM), we introduce the SAM into the SAM-FNet to accurately segment the lesion region. Furthermore, we propose a GAN-like feature optimization (GFO) module to capture the discriminative features between the global and local branches, enhancing the fusion feature complementarity. Additionally, we collect two LPC datasets from the First Affiliated Hospital (FAHSYSU) and the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University. The FAHSYSU dataset is used as the internal dataset for training the model, while the SAHSYSU dataset is used as the external dataset for evaluating the model's performance. Extensive experiments on both datasets of FAHSYSU and SAHSYSU demonstrate that the SAM-FNet can achieve competitive results, outperforming the state-of-the-art counterparts. The source code of SAM-FNet is available at the URL of https://github.com/VVJia/SAM-FNet.

CLFeb 23, 2023
Sentence Simplification via Large Language Models

Yutao Feng, Jipeng Qiang, Yun Li et al.

Sentence Simplification aims to rephrase complex sentences into simpler sentences while retaining original meaning. Large Language models (LLMs) have demonstrated the ability to perform a variety of natural language processing tasks. However, it is not yet known whether LLMs can be served as a high-quality sentence simplification system. In this work, we empirically analyze the zero-/few-shot learning ability of LLMs by evaluating them on a number of benchmark test sets. Experimental results show LLMs outperform state-of-the-art sentence simplification methods, and are judged to be on a par with human annotators.

CYJul 6, 2024
AI Safety in Generative AI Large Language Models: A Survey

Jaymari Chua, Yun Li, Shiyi Yang et al.

Large Language Model (LLMs) such as ChatGPT that exhibit generative AI capabilities are facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) inevitably raises concerns about the risks and safety associated with these models. This article provides an up-to-date survey of recent trends in AI safety research of GAI-LLMs from a computer scientist's perspective: specific and technical. In this survey, we explore the background and motivation for the identified harms and risks in the context of LLMs being generative language models; our survey differentiates by emphasising the need for unified theories of the distinct safety challenges in the research development and applications of LLMs. We start our discussion with a concise introduction to the workings of LLMs, supported by relevant literature. Then we discuss earlier research that has pointed out the fundamental constraints of generative models, or lack of understanding thereof (e.g., performance and safety trade-offs as LLMs scale in number of parameters). We provide a sufficient coverage of LLM alignment -- delving into various approaches, contending methods and present challenges associated with aligning LLMs with human preferences. By highlighting the gaps in the literature and possible implementation oversights, our aim is to create a comprehensive analysis that provides insights for addressing AI safety in LLMs and encourages the development of aligned and secure models. We conclude our survey by discussing future directions of LLMs for AI safety, offering insights into ongoing research in this critical area.

SPMay 1, 2018
Controlled Tracking in Urban Terrain: Closing the Loop

Patricia R. Barbosa, Yugandhar Sarkale, Edwin K. P. Chong et al.

We investigate the challenging problem of integrating detection, signal processing, target tracking, and adaptive waveform scheduling with lookahead in urban terrain. We propose a closed-loop active sensing system to address this problem by exploiting three distinct levels of diversity: (1) spatial diversity through the use of coordinated multistatic radars; (2) waveform diversity by adaptively scheduling the transmitted waveform; and (3) motion model diversity by using a bank of parallel filters matched to different motion models. Specifically, at every radar scan, the waveform that yields the minimum trace of the one-step-ahead error covariance matrix is transmitted; the received signal goes through a matched-filter, and curve fitting is used to extract range and range-rate measurements that feed the LMIPDA-VSIMM algorithm for data association and filtering. Monte Carlo simulations demonstrate the effectiveness of the proposed system in an urban scenario contaminated by dense and uneven clutter, strong multipath, and limited line-of-sight.

CLNov 1, 2025Code
ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models

Jiani Guo, Zuchao Li, Jie Wu et al.

Large Language Models (LLMs), constrained by limited context windows, often face significant performance degradation when reasoning over long contexts. To address this, Retrieval-Augmented Generation (RAG) retrieves and reasons over chunks but frequently sacrifices logical coherence due to its reliance on similarity-based rankings. Similarly, divide-and-conquer frameworks (DCF) split documents into small chunks for independent reasoning and aggregation. While effective for local reasoning, DCF struggles to capture long-range dependencies and risks inducing conflicts by processing chunks in isolation. To overcome these limitations, we propose ToM, a novel Tree-oriented MapReduce framework for long-context reasoning. ToM leverages the inherent hierarchical structure of long documents (e.g., main headings and subheadings) by constructing a DocTree through hierarchical semantic parsing and performing bottom-up aggregation. Using a Tree MapReduce approach, ToM enables recursive reasoning: in the Map step, rationales are generated at child nodes; in the Reduce step, these rationales are aggregated across sibling nodes to resolve conflicts or reach consensus at parent nodes. Experimental results on 70B+ LLMs show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning. Our code is available at https://github.com/gjn12-31/ToM .

CVMay 14, 2024Code
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin et al.

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

AIJul 27, 2024
Large Language Models for Human-like Autonomous Driving: A Survey

Yun Li, Kai Katsumata, Ehsan Javanmardi et al.

Large Language Models (LLMs), AI models trained on massive text corpora with remarkable language understanding and generation capabilities, are transforming the field of Autonomous Driving (AD). As AD systems evolve from rule-based and optimization-based methods to learning-based techniques like deep reinforcement learning, they are now poised to embrace a third and more advanced category: knowledge-based AD empowered by LLMs. This shift promises to bring AD closer to human-like AD. However, integrating LLMs into AD systems poses challenges in real-time inference, safety assurance, and deployment costs. This survey provides a comprehensive and critical review of recent progress in leveraging LLMs for AD, focusing on their applications in modular AD pipelines and end-to-end AD systems. We highlight key advancements, identify pressing challenges, and propose promising research directions to bridge the gap between LLMs and AD, thereby facilitating the development of more human-like AD systems. The survey first introduces LLMs' key features and common training schemes, then delves into their applications in modular AD pipelines and end-to-end AD, respectively, followed by discussions on open challenges and future directions. Through this in-depth analysis, we aim to provide insights and inspiration for researchers and practitioners working at the intersection of AI and autonomous vehicles, ultimately contributing to safer, smarter, and more human-centric AD technologies.

CLJun 16, 2023
Clickbait Detection via Large Language Models

Han Wang, Yi Zhu, Ye Wang et al.

Clickbait, which aims to induce users with some surprising and even thrilling headlines for increasing click-through rates, permeates almost all online content publishers, such as news portals and social media. Recently, Large Language Models (LLMs) have emerged as a powerful instrument and achieved tremendous success in a series of NLP downstream tasks. However, it is not yet known whether LLMs can be served as a high-quality clickbait detection system. In this paper, we analyze the performance of LLMs in the few-shot and zero-shot scenarios on several English and Chinese benchmark datasets. Experimental results show that LLMs cannot achieve the best results compared to the state-of-the-art deep and fine-tuning PLMs methods. Different from human intuition, the experiments demonstrated that LLMs cannot make satisfied clickbait detection just by the headlines.

CLJul 28, 2023
Multilingual Lexical Simplification via Paraphrase Generation

Kang Liu, Jipeng Qiang, Yun Li et al.

Lexical simplification (LS) methods based on pretrained language models have made remarkable progress, generating potential substitutes for a complex word through analysis of its contextual surroundings. However, these methods require separate pretrained models for different languages and disregard the preservation of sentence meaning. In this paper, we propose a novel multilingual LS method via paraphrase generation, as paraphrases provide diversity in word selection while preserving the sentence's meaning. We regard paraphrasing as a zero-shot translation task within multilingual neural machine translation that supports hundreds of languages. After feeding the input sentence into the encoder of paraphrase modeling, we generate the substitutes based on a novel decoding strategy that concentrates solely on the lexical variations of the complex word. Experimental results demonstrate that our approach surpasses BERT-based methods and zero-shot GPT3-based method significantly on English, Spanish, and Portuguese.

LGMay 23, 2024Code
Calibrated Self-Rewarding Vision Language Models

Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng et al.

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning. Our data and code are available at https://github.com/YiyangZhou/CSR.

SPMay 3, 2022
Side-aware Meta-Learning for Cross-Dataset Listener Diagnosis with Subjective Tinnitus

Yun Li, Zhe Liu, Lina Yao et al.

With the development of digital technology, machine learning has paved the way for the next generation of tinnitus diagnoses. Although machine learning has been widely applied in EEG-based tinnitus analysis, most current models are dataset-specific. Each dataset may be limited to a specific range of symptoms, overall disease severity, and demographic attributes; further, dataset formats may differ, impacting model performance. This paper proposes a side-aware meta-learning for cross-dataset tinnitus diagnosis, which can effectively classify tinnitus in subjects of divergent ages and genders from different data collection processes. Owing to the superiority of meta-learning, our method does not rely on large-scale datasets like conventional deep learning models. Moreover, we design a subject-specific training process to assist the model in fitting the data pattern of different patients or healthy people. Our method achieves a high accuracy of 73.8\% in the cross-dataset classification. We conduct an extensive analysis to show the effectiveness of side information of ears in enhancing model performance and side-aware meta-learning in improving the quality of the learned features.

CVApr 6, 2023
Synthetic Hard Negative Samples for Contrastive Learning

Hengkui Dong, Xianzhong Long, Yun Li et al.

Contrastive learning has emerged as an essential approach for self-supervised learning in visual representation learning. The central objective of contrastive learning is to maximize the similarities between two augmented versions of an image (positive pairs), while minimizing the similarities between different images (negative pairs). Recent studies have demonstrated that harder negative samples, i.e., those that are more difficult to differentiate from the anchor sample, perform a more crucial function in contrastive learning. This paper proposes a novel feature-level method, namely sampling synthetic hard negative samples for contrastive learning (SSCL), to exploit harder negative samples more effectively. Specifically, 1) we generate more and harder negative samples by mixing negative samples, and then sample them by controlling the contrast of anchor sample with the other negative samples; 2) considering the possibility of false negative samples, we further debias the negative samples. Our proposed method improves the classification performance on different image datasets and can be readily integrated into existing methods.

CVNov 5, 2022
Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning

Zhe Liu, Yun Li, Lina Yao et al.

The task of Compositional Zero-Shot Learning (CZSL) is to recognize images of novel state-object compositions that are absent during the training stage. Previous methods of learning compositional embedding have shown effectiveness in closed-world CZSL. However, in Open-World CZSL (OW-CZSL), their performance tends to degrade significantly due to the large cardinality of possible compositions. Some recent works separately predict simple primitives (i.e., states and objects) to reduce cardinality. However, they consider simple primitives as independent probability distributions, ignoring the heavy dependence between states, objects, and compositions. In this paper, we model the dependence of compositions via feasibility and contextuality. Feasibility-dependence refers to the unequal feasibility relations between simple primitives, e.g., \textit{hairy} is more feasible with \textit{cat} than with \textit{building} in the real world. Contextuality-dependence represents the contextual variance in images, e.g., \textit{cat} shows diverse appearances under the state of \textit{dry} and \textit{wet}. We design Semantic Attention (SA) and generative Knowledge Disentanglement (KD) to learn the dependence of feasibility and contextuality, respectively. SA captures semantics in compositions to alleviate impossible predictions, driven by the visual similarity between simple primitives. KD disentangles images into unbiased feature representations, easing contextual bias in predictions. Moreover, we complement the current compositional probability model with feasibility and contextuality in a compatible format. Finally, we conduct comprehensive experiments to analyze and validate the superior or competitive performance of our model, Semantic Attention and knowledge Disentanglement guided Simple Primitives (SAD-SP), on three widely-used benchmark OW-CZSL datasets.

LGOct 24, 2023
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity

Yun Li, Lin Niu, Xipeng Zhang et al.

Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the information entropy of hidden state features into a pruning metric design, namely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse employs the information richness to leverage the channel importance, and further incorporates several novel techniques to put it into effect: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms as a novel pruning metric, and performs N:M sparsity without modifying the remaining weights. (2) it designs global naive shuffle and local block shuffle to quickly optimize the information distribution and adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse can significantly speed up the model inference over the dense model (up to 1.53X) and obtain significant memory saving (up to 43.52%), with acceptable accuracy loss.

LGFeb 25, 2023
Scalable Attribution of Adversarial Attacks via Multi-Task Learning

Zhongyi Guo, Keji Han, Yao Ge et al.

Deep neural networks (DNNs) can be easily fooled by adversarial attacks during inference phase when attackers add imperceptible perturbations to original examples, i.e., adversarial examples. Many works focus on adversarial detection and adversarial training to defend against adversarial attacks. However, few works explore the tool-chains behind adversarial examples, which can help defenders to seize the clues about the originator of the attack, their goals, and provide insight into the most effective defense algorithm against corresponding attacks. With such a gap, it is necessary to develop techniques that can recognize tool-chains that are leveraged to generate the adversarial examples, which is called Adversarial Attribution Problem (AAP). In this paper, AAP is defined as the recognition of three signatures, i.e., {\em attack algorithm}, {\em victim model} and {\em hyperparameter}. Current works transfer AAP into single label classification task and ignore the relationship between these signatures. The former will meet combination explosion problem as the number of signatures is increasing. The latter dictates that we cannot treat AAP simply as a single task problem. We first conduct some experiments to validate the attributability of adversarial examples. Furthermore, we propose a multi-task learning framework named Multi-Task Adversarial Attribution (MTAA) to recognize the three signatures simultaneously. MTAA contains perturbation extraction module, adversarial-only extraction module and classification and regression module. It takes the relationship between attack algorithm and corresponding hyperparameter into account and uses the uncertainty weighted loss to adjust the weights of three recognition tasks. The experimental results on MNIST and ImageNet show the feasibility and scalability of the proposed framework as well as its effectiveness in dealing with false alarms.

LGOct 3, 2022
Deep Spatial Domain Generalization

Dazhou Yu, Guangji Bai, Yun Li et al.

Spatial autocorrelation and spatial heterogeneity widely exist in spatial data, which make the traditional machine learning model perform badly. Spatial domain generalization is a spatial extension of domain generalization, which can generalize to unseen spatial domains in continuous 2D space. Specifically, it learns a model under varying data distributions that generalizes to unseen domains. Although tremendous success has been achieved in domain generalization, there exist very few works on spatial domain generalization. The advancement of this area is challenged by: 1) Difficulty in characterizing spatial heterogeneity, and 2) Difficulty in obtaining predictive models for unseen locations without training data. To address these challenges, this paper proposes a generic framework for spatial domain generalization. Specifically, We develop the spatial interpolation graph neural network that handles spatial data as a graph and learns the spatial embedding on each node and their relationships. The spatial interpolation graph neural network infers the spatial embedding of an unseen location during the test phase. Then the spatial embedding of the target location is used to decode the parameters of the downstream-task model directly on the target location. Finally, extensive experiments on thirteen real-world datasets demonstrate the proposed method's strength.

LGMar 18, 2025Code
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Siwei Han, Peng Xia, Ruiyi Zhang et al.

Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modal, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks like MMLongBench, LongDocURL demonstrate the effectiveness of our MDocAgent, achieve an average improvement of 12.1% compared to current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at https://github.com/aiming-lab/MDocAgent.

MLOct 25, 2023
On the Identifiability and Interpretability of Gaussian Process Models

Jiawen Chen, Wancen Mu, Yun Li et al.

In this paper, we critically examine the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Matérn kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Matérn kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Matérn. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.

CLFeb 4, 2025Code
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing

Wenhao Zheng, Yixiao Chen, Weitong Zhang et al.

Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs \& LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the costs of the reward estimation and improving the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.

50.9ROApr 2
Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

Yun Li, Yidu Zhang, Simon Thompson et al.

Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.

LGMay 31, 2025Code
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Peng Xia, Jinglu Wang, Yibo Peng et al.

Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy that progressively teaches the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns. Notably, it achieves an average performance gain of 20.7% over supervised fine-tuning baselines.

CVDec 9, 2024Code
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Kangyu Zhu, Peng Xia, Yun Li et al.

The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately mitigated clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by averaging 14.2% and 51.7% across the Med-VQA and report generation tasks. Our code are available in https://github.com/aiming-lab/MMedPO.

33.0CVApr 27Code
RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting

Jinghao Shi, Mengqi Lei, Kunliang He et al.

RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.

64.8CVApr 22Code
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

Yongji Long, Shijun Liang, Jintao Li et al.

Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose \textbf{DynamicRad}, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a \textbf{dual-mode} strategy: \textit{static-ratio} for speed-optimized execution and \textit{dynamic-threshold} for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a \textbf{semantic motion router}. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with \textbf{minimal runtime overhead}. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency--quality Pareto frontier, achieving \textbf{1.7$\times$--2.5$\times$ inference speedups} with \textbf{over 80\% effective sparsity}. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.

CVJun 28, 2023
Multi-network Contrastive Learning Based on Global and Local Representations

Weiquan Li, Xianzhong Long, Yun Li

The popularity of self-supervised learning has made it possible to train models without relying on labeled data, which saves expensive annotation costs. However, most existing self-supervised contrastive learning methods often overlook the combination of global and local feature information. This paper proposes a multi-network contrastive learning framework based on global and local representations. We introduce global and local feature information for self-supervised contrastive learning through multiple networks. The model learns feature information at different scales of an image by contrasting the embedding pairs generated by multiple networks. The framework also expands the number of samples used for contrast and improves the training efficiency of the model. Linear evaluation results on three benchmark datasets show that our method outperforms several existing classical self-supervised learning methods.

CVNov 22, 2023
Rethinking Radiology Report Generation via Causal Inspired Counterfactual Augmentation

Xiao Song, Jiafan Liu, Yun Li et al.

Radiology Report Generation (RRG) draws attention as a vision-and-language interaction of biomedical fields. Previous works inherited the ideology of traditional language generation tasks, aiming to generate paragraphs with high readability as reports. Despite significant progress, the independence between diseases-a specific property of RRG-was neglected, yielding the models being confused by the co-occurrence of diseases brought on by the biased data distribution, thus generating inaccurate reports. In this paper, to rethink this issue, we first model the causal effects between the variables from a causal perspective, through which we prove that the co-occurrence relationships between diseases on the biased distribution function as confounders, confusing the accuracy through two backdoor paths, i.e. the Joint Vision Coupling and the Conditional Sequential Coupling. Then, we proposed a novel model-agnostic counterfactual augmentation method that contains two strategies, i.e. the Prototype-based Counterfactual Sample Synthesis (P-CSS) and the Magic-Cube-like Counterfactual Report Reconstruction (Cube), to intervene the backdoor paths, thus enhancing the accuracy and generalization of RRG models. Experimental results on the widely used MIMIC-CXR dataset demonstrate the effectiveness of our proposed method. Additionally, a generalization performance is evaluated on IU X-Ray dataset, which verifies our work can effectively reduce the impact of co-occurrences caused by different distributions on the results.

LGDec 1, 2025
Sum Rate Maximization in STAR-RIS-UAV-Assisted Networks: A CA-DDPG Approach for Joint Optimization

Yujie Huang, Haibin Wan, Xiangcheng Li et al.

With the rapid advances in programmable materials, reconfigurable intelligent surfaces (RIS) have become a pivotal technology for future wireless communications. The simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) can both transmit and reflect signals, enabling comprehensive signal control and expanding application scenarios. This paper introduces an unmanned aerial vehicle (UAV) to further enhance system flexibility and proposes an optimization design for the spectrum efficiency of the STAR-RIS-UAV-assisted wireless communication system. We present a deep reinforcement learning (DRL) algorithm capable of iteratively optimizing beamforming, phase shifts, and UAV positioning to maximize the system's sum rate through continuous interactions with the environment. To improve exploration in deterministic policies, we introduce a stochastic perturbation factor, which enhances exploration capabilities. As exploration is strengthened, the algorithm's ability to accurately evaluate the state-action value function becomes critical. Thus, based on the deep deterministic policy gradient (DDPG) algorithm, we propose a convolution-augmented deep deterministic policy gradient (CA-DDPG) algorithm that balances exploration and evaluation to improve the system's sum rate. The simulation results demonstrate that the CA-DDPG algorithm effectively interacts with the environment, optimizing the beamforming matrix, phase shift matrix, and UAV location, thereby improving system capacity and achieving better performance than other algorithms.

CVNov 1, 2023
Rethinking Samples Selection for Contrastive Learning: Mining of Potential Samples

Hengkui Dong, Xianzhong Long, Yun Li

Contrastive learning predicts whether two images belong to the same category by training a model to make their feature representations as close or as far away as possible. In this paper, we rethink how to mine samples in contrastive learning, unlike other methods, our approach is more comprehensive, taking into account both positive and negative samples, and mining potential samples from two aspects: First, for positive samples, we consider both the augmented sample views obtained by data augmentation and the mined sample views through data mining. Then, we weight and combine them using both soft and hard weighting strategies. Second, considering the existence of uninformative negative samples and false negative samples in the negative samples, we analyze the negative samples from the gradient perspective and finally mine negative samples that are neither too hard nor too easy as potential negative samples, i.e., those negative samples that are close to positive samples. The experiments show the obvious advantages of our method compared with some traditional self-supervised methods. Our method achieves 88.57%, 61.10%, and 36.69% top-1 accuracy on CIFAR10, CIFAR100, and TinyImagenet, respectively.

CLFeb 21, 2025Code
When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements

Tianjie Ju, Bowen Wang, Hao Fei et al.

Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MAS). However, it remains unclear how disagreements shape collective decision-making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task-critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE-cf), which typically follows a single evidential chain, whereas collaborative programming (HumanEval, GAIA) often adopts multiple valid implementations. Disagreements are instantiated as general heterogeneity among agents and as task-critical counterfactual knowledge edits injected into context or parameters. Experiments reveal that general disagreements consistently improve success by encouraging complementary exploration. By contrast, task-critical disagreements substantially reduce success on single-path reasoning, yet have a limited impact on programming, where agents can choose alternative solutions. Trace analyses show that MAS frequently bypasses the edited facts in programming but rarely does so in reasoning, revealing an emergent self-repair capability that depends on solution-path rather than scale alone. Our code is available at https://github.com/wbw625/MultiAgentRobustness.

CVSep 2, 2024
3D-LSPTM: An Automatic Framework with 3D-Large-Scale Pretrained Model for Laryngeal Cancer Detection Using Laryngoscopic Videos

Meiyu Qiu, Yun Li, Wenjun Huang et al.

Laryngeal cancer is a malignant disease with a high morality rate in otorhinolaryngology, posing an significant threat to human health. Traditionally larygologists manually visual-inspect laryngeal cancer in laryngoscopic videos, which is quite time-consuming and subjective. In this study, we propose a novel automatic framework via 3D-large-scale pretrained models termed 3D-LSPTM for laryngeal cancer detection. Firstly, we collect 1,109 laryngoscopic videos from the First Affiliated Hospital Sun Yat-sen University with the approval of the Ethics Committee. Then we utilize the 3D-large-scale pretrained models of C3D, TimeSformer, and Video-Swin-Transformer, with the merit of advanced featuring videos, for laryngeal cancer detection with fine-tuning techniques. Extensive experiments show that our proposed 3D-LSPTM can achieve promising performance on the task of laryngeal cancer detection. Particularly, 3D-LSPTM with the backbone of Video-Swin-Transformer can achieve 92.4% accuracy, 95.6% sensitivity, 94.1% precision, and 94.8% F_1.

74.8CVMar 16
Learning Latent Proxies for Controllable Single-Image Relighting

Haoze Zheng, Zihao Wang, Xianfeng Wu et al.

Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object and scene level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.

LGMar 30, 2024
Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods

Yuji Cao, Huan Zhao, Yuheng Cheng et al.

With extensive pre-trained knowledge and high-level general capabilities, large language models (LLMs) emerge as a promising avenue to augment reinforcement learning (RL) in aspects such as multi-task learning, sample efficiency, and high-level task planning. In this survey, we provide a comprehensive review of the existing literature in LLM-enhanced RL and summarize its characteristics compared to conventional RL methods, aiming to clarify the research scope and directions for future studies. Utilizing the classical agent-environment interaction paradigm, we propose a structured taxonomy to systematically categorize LLMs' functionalities in RL, including four roles: information processor, reward designer, decision-maker, and generator. For each role, we summarize the methodologies, analyze the specific RL challenges that are mitigated, and provide insights into future directions. Lastly, a comparative analysis of each role, potential applications, prospective opportunities, and challenges of the LLM-enhanced RL are discussed. By proposing this taxonomy, we aim to provide a framework for researchers to effectively leverage LLMs in the RL field, potentially accelerating RL applications in complex applications such as robotics, autonomous driving, and energy systems.

CVJul 5, 2024
Research, Applications and Prospects of Event-Based Pedestrian Detection: A Survey

Han Wang, Yuman Nie, Yun Li et al.

Event-based cameras, inspired by the biological retina, have evolved into cutting-edge sensors distinguished by their minimal power requirements, negligible latency, superior temporal resolution, and expansive dynamic range. At present, cameras used for pedestrian detection are mainly frame-based imaging sensors, which have suffered from lethargic response times and hefty data redundancy. In contrast, event-based cameras address these limitations by eschewing extraneous data transmissions and obviating motion blur in high-speed imaging scenarios. On pedestrian detection via event-based cameras, this paper offers an exhaustive review of research and applications particularly in the autonomous driving context. Through methodically scrutinizing relevant literature, the paper outlines the foundational principles, developmental trajectory, and the comparative merits and demerits of eventbased detection relative to traditional frame-based methodologies. This review conducts thorough analyses of various event stream inputs and their corresponding network models to evaluate their applicability across diverse operational environments. It also delves into pivotal elements such as crucial datasets and data acquisition techniques essential for advancing this technology, as well as advanced algorithms for processing event stream data. Culminating with a synthesis of the extant landscape, the review accentuates the unique advantages and persistent challenges inherent in event-based pedestrian detection, offering a prognostic view on potential future developments in this fast-progressing field.

LGSep 14, 2025Code
From Parameters to Performance: A Data-Driven Study on LLM Structure and Development

Suqing Wang, Zuchao Li, Luohe Shi et al.

Large language models (LLMs) have achieved remarkable success across various domains, driving significant technological advancements and innovations. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. To address this gap, we present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks. Leveraging this dataset, we conduct a systematic, data mining-driven analysis to validate and quantify the relationship between structural configurations and performance. Our study begins with a review of the historical development of LLMs and an exploration of potential future trends. We then analyze how various structural choices impact performance across benchmarks and further corroborate our findings using mechanistic interpretability techniques. By providing data-driven insights into LLM optimization, our work aims to guide the targeted development and application of future models. We will release our dataset at https://huggingface.co/datasets/DX0369/LLM-Structure-Performance-Dataset

CVAug 14, 2025Code
HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

Zhaoyuan Qi, Weihua Gao, Wenlong Niu et al.

In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To our best knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source codes are available at https://github.com/Lurenjia-LRJ/HyperTea.

AIMay 26, 2025Code
Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models

Jianxing Liao, Junyan Xu, Yatao Sun et al.

Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD).Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at https://jianxliao.github.io/cadllm-page/

CVDec 3, 2024Code
Low-Contrast-Enhanced Contrastive Learning for Semi-Supervised Endoscopic Image Segmentation

Lingcong Cai, Yun Li, Xiaomao Fan et al.

The segmentation of endoscopic images plays a vital role in computer-aided diagnosis and treatment. The advancements in deep learning have led to the employment of numerous models for endoscopic tumor segmentation, achieving promising segmentation performance. Despite recent advancements, precise segmentation remains challenging due to limited annotations and the issue of low contrast. To address these issues, we propose a novel semi-supervised segmentation framework termed LoCo via low-contrast-enhanced contrastive learning (LCC). This innovative approach effectively harnesses the vast amounts of unlabeled data available for endoscopic image segmentation, improving both accuracy and robustness in the segmentation process. Specifically, LCC incorporates two advanced strategies to enhance the distinctiveness of low-contrast pixels: inter-class contrast enhancement (ICE) and boundary contrast enhancement (BCE), enabling models to segment low-contrast pixels among malignant tumors, benign tumors, and normal tissues. Additionally, a confidence-based dynamic filter (CDF) is designed for pseudo-label selection, enhancing the utilization of generated pseudo-labels for unlabeled data with a specific focus on minority classes. Extensive experiments conducted on two public datasets, as well as a large proprietary dataset collected over three years, demonstrate that LoCo achieves state-of-the-art results, significantly outperforming previous methods. The source code of LoCo is available at the URL of \href{https://github.com/AnoK3111/LoCo}{https://github.com/AnoK3111/LoCo}.