Chunyi Peng

CL
h-index21
8papers
27citations
Novelty62%
AI Score57

8 Papers

AIJan 29Code
Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains

Yuqi Xiong, Chunyi Peng, Zhipeng Xu et al.

Visual Retrieval-Augmented Generation (VRAG) enhances Vision-Language Models (VLMs) by incorporating external visual documents to address a given query. Existing VRAG frameworks usually depend on rigid, pre-defined external tools to extend the perceptual capabilities of VLMs, typically by explicitly separating visual perception from subsequent reasoning processes. However, this decoupled design can lead to unnecessary loss of visual information, particularly when image-based operations such as cropping are applied. In this paper, we propose Lang2Act, which enables fine-grained visual perception and reasoning through self-emergent linguistic toolchains. Rather than invoking fixed external engines, Lang2Act collects self-emergent actions as linguistic tools and leverages them to enhance the visual perception capabilities of VLMs. To support this mechanism, we design a two-stage Reinforcement Learning (RL)-based training framework. Specifically, the first stage optimizes VLMs to self-explore high-quality actions for constructing a reusable linguistic toolbox, and the second stage further optimizes VLMs to exploit these linguistic tools for downstream reasoning effectively. Experimental results demonstrate the effectiveness of Lang2Act in substantially enhancing the visual perception capabilities of VLMs, achieving performance improvements of over 4%. All code and data are available at https://github.com/NEUIR/Lang2Act.

91.7CLMay 5Code
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Zhipeng Xu, Junhao Ji, Zulong Chen et al.

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.

CLJun 12, 2025Code
ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

Zhensheng Jin, Xinze Li, Yifan Ji et al.

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs)-one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.

CLMay 28, 2025
Learning to Route Queries Across Knowledge Bases for Step-wise Retrieval-Augmented Reasoning

Chunyi Peng, Zhipeng Xu, Zhenghao Liu et al.

Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge during generation. Existing MRAG methods typically adopt a static retrieval pipeline that fetches relevant information from multiple Knowledge Bases (KBs), followed by a refinement step. However, these approaches overlook the reasoning and planning capabilities of MLLMs to dynamically determine how to interact with different KBs during the reasoning process. To address this limitation, we propose R1-Router, a novel MRAG framework that learns to decide when and where to retrieve knowledge based on the evolving reasoning state. Specifically, R1-Router can generate follow-up queries according to the current reasoning step, routing these intermediate queries to the most suitable KB, and integrating external knowledge into a coherent reasoning trajectory to answer the original query. Furthermore, we introduce Step-wise Group Relative Policy Optimization (Step-GRPO), a tailored reinforcement learning algorithm that assigns step-specific rewards to optimize the reasoning behavior of MLLMs. Experimental results on various open-domain QA benchmarks across multiple modalities demonstrate that R1-Router outperforms baseline models by over 7%. Further analysis shows that R1-Router can adaptively and effectively leverage diverse KBs, reducing unnecessary retrievals and improving both efficiency and accuracy.

CLOct 10, 2025
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Yubo Sun, Chunyi Peng, Yukun Yan et al.

Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to reason with evidence-guided multi-image to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over backbone VLM with 27\% improvements on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.

NINov 10, 2021
Towards Live Video Analytics with On-Drone Deeper-yet-Compatible Compression

Junpeng Guo, Chunyi Peng

In this work, we present DCC(Deeper-yet-Compatible Compression), one enabling technique for real-time drone-sourced edge-assisted video analytics built on top of the existing codec. DCC tackles an important technical problem to compress streamed video from the drone to the edge without scarifying accuracy and timeliness of video analytical tasks performed at the edge. DCC is inspired by the fact that not every bit in streamed video is equally valuable to video analytics, which opens new compression room over the conventional analytics-oblivious video codec technology. We exploit drone-specific context and intermediate hints from object detection to pursue adaptive fidelity needed to retain analytical quality. We have prototyped DCC in one showcase application of vehicle detection and validated its efficiency in representative scenarios. DCC has reduced transmission volume by 9.5-fold over the baseline approach and 19-683% over the state-of-the-art with comparable detection accuracy.

CRNov 27, 2018
The Untold Secrets of Operational Wi-Fi Calling Services: Vulnerabilities, Attacks, and Countermeasures

Tian Xie, Guan-Hua Tu, Bangjie Yin et al.

Since 2016, all of four major U.S. operators have rolled out nationwide Wi-Fi calling services. They are projected to surpass VoLTE (Voice over LTE) and other VoIP services in terms of mobile IP voice usage minutes in 2018. They enable mobile users to place cellular calls over Wi-Fi networks based on the 3GPP IMS (IP Multimedia Subsystem) technology. Compared with conventional cellular voice solutions, the major difference lies in that their traffic traverses untrustful Wi-Fi networks and the Internet. This exposure to insecure networks may cause the Wi-Fi calling users to suffer from security threats. Its security mechanisms are similar to the VoLTE, because both of them are supported by the IMS. They include SIM-based security, 3GPP AKA (Authentication and Key Agreement), IPSec (Internet Protocol Security), etc. However, are they sufficient to secure Wi-Fi calling services? Unfortunately, our study yields a negative answer. We conduct the first study of exploring security issues of the operational Wi-Fi calling services in three major U.S. operators' networks using commodity devices. We disclose that current Wi-Fi calling security is not bullet-proof and uncover four vulnerabilities which stem from improper standard designs, device implementation issues and network operation slips. By exploiting the vulnerabilities, together with several state-of-the-art computer visual recognition technologies, we devise two proof-of-concept attacks: user privacy leakage and telephony harassment or denial of voice service (THDoS); both of them can bypass the security defenses deployed on mobile devices and the network infrastructure. We have confirmed their feasibility and simplicity using real-world experiments, as well as assessed their potential damages and proposed recommended solutions.

CROct 29, 2015
New Threats to SMS-Assisted Mobile Internet Services from 4G LTE: Lessons Learnt from Distributed Mobile-Initiated Attacks towards Facebook and Other Services

Guan-Hua Tu, Yuanjie Li, Chunyi Peng et al.

Mobile Internet is becoming the norm. With more personalized mobile devices in hand, many services choose to offer alternative, usually more convenient, approaches to authenticating and delivering the content between mobile users and service providers. One main option is to use SMS (i.e., short messaging service). Such carrier-grade text service has been widely used to assist versatile mobile services, including social networking, banking, to name a few. Though the text service can be spoofed via certain Internet text service providers which cooperated with carriers, such attacks haven well studied and defended by industry due to the efforts of research community. However, as cellular network technology advances to the latest IP-based 4G LTE, we find that these mobile services are somehow exposed to new threats raised by this change, particularly on 4G LTE Text service (via brand-new distributed Mobile-Initiated Spoofed SMS attack which is not available in legacy 2G/3G systems). The reason is that messaging service over LTE shifts from the circuit-switched (CS) design to the packet-switched (PS) paradigm as 4G LTE supports PS only. Due to this change, 4G LTE Text Service becomes open to access. However, its shields to messaging integrity and user authentication are not in place. As a consequence, such weaknesses can be exploited to launch attacks (e.g., hijack Facebook accounts) against a targeted individual, a large scale of mobile users and even service providers, from mobile devices. Current defenses for Internet-Initiated Spoofed SMS attacks cannot defend the unprecedented attack. Our study shows that 53 of 64 mobile services over 27 industries are vulnerable to at least one threat. We validate these proof-of-concept attacks in one major US carrier which supports more than 100 million users. We finally propose quick fixes and discuss security insights and lessons we have learnt.