CVNov 24, 2025
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied AgentsDayong Liu, Chao Xu, Weihong Chen et al.
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLMs to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents. Project page: \href{https://cfg-bench.github.io/}{https://cfg-bench.github.io/}.
CVApr 13, 2025
Intelligent driving vehicle front multi-target tracking and detection based on YOLOv5 and point cloud 3D projectionDayong Liu, Qingrui Zhang, Zeyang Meng
In multi-target tracking and detection tasks, it is necessary to continuously track multiple targets, such as vehicles, pedestrians, etc. To achieve this goal, the system must be able to continuously acquire and process image frames containing these targets. These consecutive frame images enable the algorithm to update the position and state of the target in real-time in each frame of the image. How to accurately associate the detected target with the target in the previous or next frame to form a stable trajectory is a complex problem. Therefore, a multi object tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection is proposed. Using Retinex algorithm to enhance the image of the environment in front of the vehicle, remove light interference in the image, and build an intelligent detection model based on YOLOv5 network structure. The enhanced image is input into the model, and multiple targets in front of the vehicle are identified through feature extraction and target localization. By combining point cloud 3D projection technology, the correlation between the position changes of adjacent frame images in the projection coordinate system can be inferred. By sequentially projecting the multi-target recognition results of multiple consecutive frame images into the 3D laser point cloud environment, effective tracking of the motion trajectories of all targets in front of the vehicle can be achieved. The experimental results show that the application of this method for intelligent driving vehicle front multi-target tracking and detection yields a MOTA (Tracking Accuracy) value greater than 30, demonstrating its superior tracking and detection performance.