Heng Lian

SE
h-index8
9papers
169citations
Novelty61%
AI Score61

9 Papers

94.7CVMay 18Code
Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang et al.

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

LGApr 25, 2022
Online Deep Learning from Doubly-Streaming Data

Heng Lian, John Scovil Atwood, Bojian Hou et al.

This paper investigates a new online learning problem with doubly-streaming data, where the data streams are described by feature spaces that constantly evolve, with new features emerging and old features fading away. The challenges of this problem are two folds: 1) Data samples ceaselessly flowing in may carry shifted patterns over time, requiring learners to update hence adapt on-the-fly. 2) Newly emerging features are described by very few samples, resulting in weak learners that tend to make error predictions. A plausible idea to overcome the challenges is to establish relationship between the pre-and-post evolving feature spaces, so that an online learner can leverage the knowledge learned from the old features to better the learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional media streams with complex feature interplay, which suffers an tradeoff between onlineness (biasing shallow learners) and expressiveness(requiring deep learners). Motivated by this, we propose a novel OLD^3S paradigm, where a shared latent subspace is discovered to summarize information from the old and new feature spaces, building intermediate feature mapping relationship. A key trait of OLD^3S is to treat the model capacity as a learnable semantics, yields optimal model depth and parameters jointly, in accordance with the complexity and non-linearity of the input data streams in an online fashion. Both theoretical analyses and empirical studies substantiate the viability and effectiveness of our proposal.

86.0CLMay 28
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Weihan Peng, Chenxu Zhang, Qianao Wang et al.

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

CRJan 9Code
FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

Zhi Yang, Runguo Li, Qiqi Qiang et al.

Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated financial environments. However, existing safety evaluations largely focus on language-model-level content compliance or abstract agent settings, failing to capture execution-grounded risks arising from real operational workflows and state-changing actions. To bridge this gap, we propose FinVault, the first execution-grounded security benchmark for financial agents, comprising 31 regulatory case-driven sandbox scenarios with state-writable databases and explicit compliance constraints, together with 107 real-world vulnerabilities and 963 test cases that systematically cover prompt injection, jailbreaking, financially adapted attacks, as well as benign inputs for false-positive evaluation. Experimental results reveal that existing defense mechanisms remain ineffective in realistic financial agent settings, with average attack success rates (ASR) still reaching up to 50.0\% on state-of-the-art models and remaining non-negligible even for the most robust systems (ASR 6.7\%), highlighting the limited transferability of current safety designs and the need for stronger financial-specific defenses. Our code can be found at https://github.com/aifinlab/FinVault.

SEJul 31, 2025Code
SWE-Exp: Experience-Driven Software Issue Resolution

Silin Chen, Shaoxin Lin, Xiaodong Gu et al.

Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers - treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience - enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels - from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

SEJul 31, 2025Code
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Han Li, Yuling Shi, Shaoxin Lin et al.

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents' independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

SEJan 23
SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Yuhang Wang, Yuling Shi, Mo Yang et al.

LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approaches such as LongLLMLingua have emerged to tackle this challenge, they typically rely on fixed metrics such as PPL, ignoring the task-specific nature of code understanding. As a result, they frequently disrupt syntactic and logical structure and fail to retain critical implementation details. In this paper, we propose SWE-Pruner, a self-adaptive context pruning framework tailored for coding agents. Drawing inspiration from how human programmers "selectively skim" source code during development and debugging, SWE-Pruner performs task-aware adaptive pruning for long contexts. Given the current task, the agent formulates an explicit goal (e.g., "focus on error handling") as a hint to guide the pruning targets. A lightweight neural skimmer (0.6B parameters) is trained to dynamically select relevant lines from the surrounding context given the goal. Evaluations across four benchmarks and multiple models validate SWE-Pruner's effectiveness in various scenarios, achieving 23-54% token reduction on agent tasks like SWE-Bench Verified while even improving success rates, and up to 14.84x compression on single-turn tasks like LongCodeQA with minimal performance impact.

MLAug 18, 2017
Debiased distributed learning for sparse partial linear models in high dimensions

Shaogao Lv, Heng Lian

Although various distributed machine learning schemes have been proposed recently for pure linear models and fully nonparametric models, little attention has been paid on distributed optimization for semi-paramemetric models with multiple-level structures (e.g. sparsity, linearity and nonlinearity). To address these issues, the current paper proposes a new communication-efficient distributed learning algorithm for partially sparse linear models with an increasing number of features. The proposed method is based on the classical divide and conquer strategy for handing big data and each sub-method defined on each subsample consists of a debiased estimation of the double-regularized least squares approach. With the proposed method, we theoretically prove that our global parametric estimator can achieve optimal parametric rate in our semi-parametric model given an appropriate partition on the total data. Specially, the choice of data partition relies on the underlying smoothness of the nonparametric component, but it is adaptive to the sparsity parameter. Even under the non-distributed setting, we develop a new and easily-read proof for optimal estimation of the parametric error in high dimensional partial linear model. Finally, several simulated experiments are implemented to indicate comparable empirical performance of our debiased technique under the distributed setting.

CVJun 2, 2009
Total Variation, Adaptive Total Variation and Nonconvex Smoothly Clipped Absolute Deviation Penalty for Denoising Blocky Images

Aditya Chopra, Heng Lian

The total variation-based image denoising model has been generalized and extended in numerous ways, improving its performance in different contexts. We propose a new penalty function motivated by the recent progress in the statistical literature on high-dimensional variable selection. Using a particular instantiation of the majorization-minimization algorithm, the optimization problem can be efficiently solved and the computational procedure realized is similar to the spatially adaptive total variation model. Our two-pixel image model shows theoretically that the new penalty function solves the bias problem inherent in the total variation model. The superior performance of the new penalty is demonstrated through several experiments. Our investigation is limited to "blocky" images which have small total variation.