Jianjiang Yang

CV
h-index15
6papers
78citations
Novelty55%
AI Score48

6 Papers

CVFeb 3, 2024Code
Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey and Benchmark

Yi Xin, Jianjiang Yang, Siqi Luo et al.

Pre-trained vision models (PVMs) have demonstrated remarkable adaptability across a wide range of downstream vision tasks, showcasing exceptional performance. However, as these models scale to billions or even trillions of parameters, conventional full fine-tuning has become increasingly impractical due to its high computational and storage demands. To address these challenges, parameter-efficient fine-tuning (PEFT) has emerged as a promising alternative, aiming to achieve performance comparable to full fine-tuning while making minimal adjustments to the model parameters. This paper presents a comprehensive survey of the latest advancements in the visual PEFT field, systematically reviewing current methodologies and categorizing them into four primary categories: addition-based, partial-based, unified-based, and multi-task tuning. In addition, this paper offers an in-depth analysis of widely used visual datasets and real-world applications where PEFT methods have been successfully applied. Furthermore, this paper introduces the V-PEFT Bench, a unified benchmark designed to standardize the evaluation of PEFT methods across a diverse set of vision tasks, ensuring consistency and fairness in comparison. Finally, the paper outlines potential directions for future research to propel advances in the PEFT field. A comprehensive collection of resources is available at https://github.com/synbol/Awesome-Parameter-Efficient-Transfer-Learning.

CLMay 21, 2025
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li, Jianjiang Yang, Tian Yun et al.

Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.

CVMay 21, 2025
CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

Yanshu Li, Jianjiang Yang, Ziteng Yang et al.

Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications. However, ICL remains unstable, even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. While existing efforts focus on prompt engineering or post-hoc logit calibration, we instead investigate the underlying attention dynamics to overcome LVLMs' inherent limitations. We identify two critical deficits in their self-attention that impair effective ICL. To bridge the gap, we propose \textbf{Context-Aware Modulated Attention} (CAMA), a plug-and-play and training-free method that dynamically modulates LVLM's attention logits based on the input in-context sequence. CAMA employs a two-stage attention modulation to address both identified deficits, enhancing the focus on semantically significant tokens, particularly visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, demonstrating great effectiveness and generalization. It can also activate the desired effects of prompt engineering methods and remains robust under diverse sequence configurations. Thus, CAMA paves the way for deeper explorations of attention dynamics to advance multimodal reasoning.

CVAug 11, 2025
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

Yanshu Li, Jianjiang Yang, Zhennan Shen et al.

Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning, yet greatly increase inference cost. The emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest. These methods can raise efficiency with only modest performance loss. However, most of them only consider single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8\% of the image tokens, CATP produces an average performance gain of 0.6\% over the vanilla model on four LVLMs and eight benchmarks, exceeding all baselines remarkably. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78\% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.

CVJul 7, 2025
Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Jianjiang Yang, Ziyan Huang, Yanshu li et al.

Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.

CVJul 7, 2025
ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

Jianjiang Yang, Yanshu li, Ziyan Huang

While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to the reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to "seeing twice and thinking backwards". Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.