99.2CLApr 14
AlphaEval: Evaluating Agents in ProductionPengrui Lu, Bingyu Xu, Wenjun Zhang et al.
The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics -- conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products -- Claude Code, Codex, etc. -- as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework -- a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.
53.2CVMay 21
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal GroundingZelin Zheng, Xinyan Liu, Ruixin Li et al.
Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.
ROApr 14, 2021Code
Look at my new blue force-sensing shoes!Yuanfeng Han, Ruixin Li, Gregory S. Chirikjian
To function autonomously in the physical world, humanoid robots need high-fidelity sensing systems, especially for forces that cannot be easily modeled. Modeling forces in robot feet is particularly challenging due to static indeterminacy, thereby requiring direct sensing. Unfortunately, resolving forces in the feet of some smaller-sized humanoids is limited both by the quality of sensors and the current algorithms used to interpret the data. This paper presents light-weight, low-cost and open-source force-sensing shoes to improve force measurement for popular smaller-sized humanoid robots, and a method for calibrating the shoes. The shoes measure center of pressure (CoP) and normal ground reaction force (GRF). The calibration method enables each individual shoe to reach high measurement precision by applying known forces at different locations of the shoe and using a regularized least squares optimization to interpret sensor outputs. A NAO robot is used as our experimental platform. Experiments are conducted to compare the measurement performance between the shoes and the robot's factory-installed force-sensing resistors (FSRs), and to evaluate the calibration method over these two sensing modules. Experimental results show that the shoes significantly improve CoP and GRF measurement precision compared to the robot's built-in FSRs. Moreover, the developed calibration method improves the measurement performance for both our shoes and the built-in FSRs.
36.3CVMar 31
Sketch It Out: Exploring Label-Free Structural Cues for Multimodal Gait RecognitionChao Zhang, Zhuang Zheng, Ruixin Li et al.
Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
LGJun 17, 2024
Modulated differentiable STFT and balanced spectrum metric for freight train wheelset bearing cross-machine transfer monitoring under speed fluctuationsChao He, Hongmei Shi, Ruixin Li et al.
The service conditions of wheelset bearings has a direct impact on the safe operation of railway heavy haul freight trains as the key components. However, speed fluctuation of the trains and few fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network coupled with interpretable modulated differentiable short-time Fourier transform (STFT) and physics-informed balanced spectrum quality metric is proposed to learn domain-invariant and discriminative features under time-varying speeds. Firstly, due to insufficiency in extracting extract frequency components of time-varying speed signals using fixed windows, a modulated differentiable STFT (MDSTFT) that is interpretable with STFT-informed theoretical support, is proposed to extract the robust time-frequency spectrum (TFS). During training process, multiple windows with different lengths dynamically change. Also, in addition to the classification metric and domain discrepancy metric, we creatively introduce a third kind of metric, referred to as the physics-informed metric, to enhance transferable TFS. A physics-informed balanced spectrum quality (BSQ) regularization loss is devised to guide an optimization direction for MDSTFT and model. With it, not only can model acquire high-quality TFS, but also a physics-restricted domain adaptation network can be also acquired, making it learn real-world physics knowledge, ultimately diminish the domain discrepancy across different datasets. The experiment is conducted in the scenario of migrating from the laboratory datasets to the freight train dataset, indicating that the hybrid-driven pyDSN outperforms existing methods and has practical value.
ROAug 9, 2020
Can I lift it? Humanoid robot reasoning about the feasibility of lifting a heavy box with unknown physical propertiesYuanfeng Han, Ruixin Li, Gregory S. Chirikjian
A robot cannot lift up an object if it is not feasible to do so. However, in most research on robot lifting, "feasibility" is usually presumed to exist a priori. This paper proposes a three-step method for a humanoid robot to reason about the feasibility of lifting a heavy box with physical properties that are unknown to the robot. Since feasibility of lifting is directly related to the physical properties of the box, we first discretize a range for the unknown values of parameters describing these properties and tabulate all valid optimal quasi-static lifting trajectories generated by simulations over all combinations of indices. Second, a physical-interaction-based algorithm is introduced to identify the robust gripping position and physical parameters corresponding to the box. During this process, the stability and safety of the robot are ensured. On the basis of the above two steps, a third step of mapping operation is carried out to best match the estimated parameters to the indices in the table. The matched indices are then queried to determine whether a valid trajectory exists. If so, the lifting motion is feasible; otherwise, the robot decides that the task is beyond its capability. Our method efficiently evaluates the feasibility of a lifting task through simple interactions between the robot and the box, while simultaneously obtaining the desired safe and stable trajectory. We successfully demonstrated the proposed method using a NAO humanoid robot.