92.4 | CV | Jun 3 | Code
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Rui Zhao, Kaiming Yang, Jifeng Zhu et al.
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
AI | Feb 12, 2025 | Code
WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu et al.
GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially because of its sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real application scenarios, but existing benchmarks fail to evaluate it. To address this gap, we introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications (e.g., PowerPoint, VSCode, Acrobat), each instantiated with diverse initial states to simulate authentic human-computer interactions. Complementing this, we propose WorldGUI-Agent, a universal framework that unifies three core modules to proactively detect and correct errors: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization. Experimental evaluation shows that WorldGUI-Agent outperforms the leading existing model (Claude-3.5 Computer Use) by 12.4% in success rate on WorldGUI, and achieves a 31.2% overall success rate on WindowsAgentArena, surpassing the prior state-of-the-art by 11.7%. Our analysis further reveals that dynamic augmentation tasks and desktop environments pose substantial hurdles, underscoring the necessity of adaptive planning and feedback-driven execution for advancing real-world GUI automation. The code and data are available at https://github.com/showlab/WorldGUI.
RO | Nov 15, 2020
Intention-Based Lane Changing and Lane Keeping Haptic Guidance Steering System
Zhanhong Yan, Kaiming Yang, Zheng Wang et al.
Haptic guidance in a shared steering assistance system has drawn significant attention in the intelligent vehicle field, owing to its mutual communication ability for vehicle control. By exerting continuous torque on the steering wheel, both the driver and the support system can share lateral control of the vehicle. However, current haptic guidance steering systems show some deficiencies in assisting lane changing. This study explored a new steering interaction method, including the design and evaluation of an intention-based haptic shared steering system. Such an intention-based method can support both lane keeping and lane changing assistance by detecting a driver's lane change intention. By using a deep learning-based method to model a driver's decision timing regarding lane crossing, an adaptive gain control method was proposed for realizing a steering control system. An intention consistency method was proposed to detect whether the driver and the system were acting towards the same target trajectories and to accurately capture the driver's intention. A driving simulator experiment was conducted to test the system performance. Participants were required to perform six trials with assistive methods and one trial without assistance. The results demonstrated that the supporting system decreased the lane departure risk in the lane keeping tasks and could support a fast and stable lane changing maneuver.