AIJun 4
DragOn: A Benchmark and Dataset for Drag-Based GUI InteractionsNathan Bout, Maxime Langevin, Ronan Riochet
GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.
AIOct 22, 2025
Surfer 2: The Next Generation of Cross-Platform Computer Use AgentsMathieu Andreux, Märt Bakler, Yanael Barbier et al. · cambridge
Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
CVApr 30, 2020
Occlusion resistant learning of intuitive physics from videosRonan Riochet, Josef Sivic, Ivan Laptev et al.
To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over state-of-the-art in the intuitive physics benchmark of IntPhys. We apply our method to a second dataset with increasing levels of occlusions, showing it realistically predicts segmentation masks up to 30 frames in the future. Finally, we also show results on predicting motion of objects in real videos.
AIMar 20, 2018
IntPhys: A Framework and Benchmark for Visual Intuitive Physics ReasoningRonan Riochet, Mario Ynocente Castro, Mathieu Bernard et al.
In order to reach human performance on complexvisual tasks, artificial systems need to incorporate a sig-nificant amount of understanding of the world in termsof macroscopic objects, movements, forces, etc. Inspiredby work on intuitive physics in infants, we propose anevaluation benchmark which diagnoses how much a givensystem understands about physics by testing whether itcan tell apart well matched videos of possible versusimpossible events constructed with a game engine. Thetest requires systems to compute a physical plausibilityscore over an entire video. It is free of bias and cantest a range of basic physical reasoning concepts. Wethen describe two Deep Neural Networks systems aimedat learning intuitive physics in an unsupervised way,using only physically possible videos. The systems aretrained with a future semantic mask prediction objectiveand tested on the possible versus impossible discrimi-nation task. The analysis of their results compared tohuman data gives novel insights in the potentials andlimitations of next frame prediction architectures.