Bingnan Liu

CV
h-index 49
3 papers
26 citations
Novelty 60%
AI Score 54

3 Papers

94.9 | CV | May 19 | Code
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Bingnan Liu, Chenhang Cui, Rui Huang et al.

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.
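The abstract scores both tracks with the same per-class AP_50 metric. As a rough illustration of what such a metric measures, the sketch below computes average precision at an IoU threshold of 0.5 for one class. It is not the benchmark's evaluation code: the function names, the (x1, y1, x2, y2) box format, the greedy matching rule, and the all-point interpolation are assumptions, and predictions are pooled from a single image for brevity.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ap50(preds: List[Tuple[float, Box]], gts: List[Box]) -> float:
    """AP at IoU >= 0.5 for one class (hypothetical helper).

    preds: (confidence, box) pairs; gts: ground-truth boxes.
    """
    if not gts:
        return 0.0
    # Visit predictions from most to least confident.
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = [False] * len(gts)
    tp_flags = []
    for _, box in ranked:
        # Greedily pick the best still-unmatched ground-truth box.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= 0.5:
            matched[best_j] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)
    # Precision and recall after each ranked prediction.
    precisions, recalls = [], []
    tp = 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += int(flag)
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A per-class report would call `ap50` once per damage class. A full evaluation would rank predictions across every image in the holdout rather than one image at a time.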

AI | Feb 3
Risky-Bench: Probing Agentic Safety Risks under Real-World Deployment

Jingnan Zheng, Yanzhen Luo, Jingjun Xu et al.

Large Language Models (LLMs) are increasingly deployed as agents that operate in real-world environments, introducing safety risks beyond linguistic harm. Existing agent safety evaluations rely on risk-oriented tasks tailored to specific agent settings, resulting in limited coverage of safety risk space and failing to assess agent safety behavior during long-horizon, interactive task execution in complex real-world deployments. Moreover, their specialization to particular agent settings limits adaptability across diverse agent configurations. To address these limitations, we propose Risky-Bench, a framework that enables systematic agent safety evaluation grounded in real-world deployment. Risky-Bench organizes evaluation around domain-agnostic safety principles to derive context-aware safety rubrics that delineate safety space, and systematically evaluates safety risks across this space through realistic task execution under varying threat assumptions. When applied to life-assist agent settings, Risky-Bench uncovers substantial safety risks in state-of-the-art agents under realistic execution conditions. Moreover, as a well-structured evaluation pipeline, Risky-Bench is not confined to life-assist scenarios and can be adapted to other deployment settings to construct environment-specific safety evaluations, providing an extensible methodology for agent safety assessment.

CV | Jul 22, 2025 | Code
LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs

Zitong Xu, Huiyu Duan, Bingnan Liu et al.

The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics are limited in scale or in alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs' understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.