Chunyi Peng

h-index15

6papers

26citations

Novelty61%

AI Score51

Ranked #17,698 of 194,257 authors (top 9%)#3,752 in CL (top 12%)

6 Papers

4.4AIJan 29Code

Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Yuqi Xiong, Chunyi Peng, Zhipeng Xu et al.

Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.

19.4CLJun 12, 2025Code

ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Zhensheng Jin, Xinze Li, Yifan Ji et al.

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs)-one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.

14.7CLMay 28, 2025

Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning

Chunyi Peng, Zhipeng Xu, Zhenghao Liu et al.

Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.

12.0CLOct 10, 2025

VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan et al.

Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27\% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.

1.2NINov 10, 2021

Towards Live Video Analytics with On-Drone Deeper-yet-Compatible Compression

Junpeng Guo, Chunyi Peng

In this work, we present DCC(Deeper-yet-Compatible Compression), one enabling technique for real-time drone-sourced edge-assisted video analytics built on top of the existing codec. DCC tackles an important technical problem to compress streamed video from the drone to the edge without scarifying accuracy and timeliness of video analytical tasks performed at the edge. DCC is inspired by the fact that not every bit in streamed video is equally valuable to video analytics, which opens new compression room over the conventional analytics-oblivious video codec technology. We exploit drone-specific context and intermediate hints from object detection to pursue adaptive fidelity needed to retain analytical quality. We have prototyped DCC in one showcase application of vehicle detection and validated its efficiency in representative scenarios. DCC has reduced transmission volume by 9.5-fold over the baseline approach and 19-683% over the state-of-the-art with comparable detection accuracy.

2.3CRNov 27, 2018

The Untold Secrets of Operational Wi-Fi Calling Services: Vulnerabilities, Attacks, and Countermeasures

Tian Xie, Guan-Hua Tu, Bangjie Yin et al.

Since 2016, all of four major U.S. operators have rolled out nationwide Wi-Fi calling services. They are projected to surpass VoLTE (Voice over LTE) and other VoIP services in terms of mobile IP voice usage minutes in 2018. They enable mobile users to place cellular calls over Wi-Fi networks based on the 3GPP IMS (IP Multimedia Subsystem) technology. Compared with conventional cellular voice solutions, the major difference lies in that their traffic traverses untrustful Wi-Fi networks and the Internet. This exposure to insecure networks may cause the Wi-Fi calling users to suffer from security threats. Its security mechanisms are similar to the VoLTE, because both of them are supported by the IMS. They include SIM-based security, 3GPP AKA (Authentication and Key Agreement), IPSec (Internet Protocol Security), etc. However, are they sufficient to secure Wi-Fi calling services? Unfortunately, our study yields a negative answer. We conduct the first study of exploring security issues of the operational Wi-Fi calling services in three major U.S. operators' networks using commodity devices. We disclose that current Wi-Fi calling security is not bullet-proof and uncover four vulnerabilities which stem from improper standard designs, device implementation issues and network operation slips. By exploiting the vulnerabilities, together with several state-of-the-art computer visual recognition technologies, we devise two proof-of-concept attacks: user privacy leakage and telephony harassment or denial of voice service (THDoS); both of them can bypass the security defenses deployed on mobile devices and the network infrastructure. We have confirmed their feasibility and simplicity using real-world experiments, as well as assessed their potential damages and proposed recommended solutions.