ROAug 15, 2023
Hierarchical generative modelling for autonomous robotsKai Yuan, Noor Sajid, Karl Friston et al.
Humans can produce complex whole-body motions when interacting with their surroundings, by planning, executing and combining individual limb movements. We investigated this fundamental aspect of motor control in the setting of autonomous robotic operations. We approach this problem by hierarchical generative modelling equipped with multi-level planning-for autonomous task completion-that mimics the deep temporal architecture of human motor control. Here, temporal depth refers to the nested time scales at which successive levels of a forward or generative model unfold, for example, delivering an object requires a global plan to contextualise the fast coordination of multiple local movements of limbs. This separation of temporal scales also motivates robotics and control. Specifically, to achieve versatile sensorimotor control, it is advantageous to hierarchically structure the planning and low-level motor control of individual limbs. We use numerical and physical simulation to conduct experiments and to establish the efficacy of this formulation. Using a hierarchical generative model, we show how a humanoid robot can autonomously complete a complex task that necessitates a holistic use of locomotion, manipulation, and grasping. Specifically, we demonstrate the ability of a humanoid robot that can retrieve and transport a box, open and walk through a door to reach the destination, approach and kick a football, while showing robust performance in presence of body damage and ground irregularities. Our findings demonstrated the effectiveness of using human-inspired motor control algorithms, and our method provides a viable hierarchical architecture for the autonomous completion of challenging goal-directed tasks.
AIMar 26, 2024Code
Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUsKai Yuan, Christoph Bauinger, Xiangyi Zhang et al.
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), which targets and is optimized for the Intel Data Center GPU Max 1550. To increase the performance, our implementation minimizes the slow global memory accesses by maximizing the data reuse within the general register file and the shared local memory by fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in the arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
GRSep 2, 2025Code
Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common FrameworkNina Wiedemann, Sainan Liu, Quentin Leboutet et al.
Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations. In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly. We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation. The code for our framework is available at https://github.com/isl-org/unifi3d.
LGAug 8, 2024
Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking OptimizationZhou Qin, Kai Yuan, Pratik Lahiri et al.
In a typical e-commerce setting, Content Ranking Optimization (CRO) mechanisms are employed to surface content on the search page to fulfill customers' shopping missions. CRO commonly utilizes models such as contextual deep bandits model to independently rank content at different positions, e.g., one optimizer dedicated to organic search results and another to sponsored results. However, this regional optimization approach does not necessarily translate to whole page optimization, e.g., maximizing revenue at the top of the page may inadvertently diminish the revenue of lower positions. In this paper, we propose a reinforcement learning based method for whole page ranking to jointly optimize across all positions by: 1) shifting from position level optimization to whole page level optimization to achieve an overall optimized ranking; 2) applying reinforcement learning to optimize for the cumulative rewards instead of the instant reward. We formulate page level CRO as a cooperative Multi-agent Markov Decision Process , and address it with the novel Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model. MADDPG supports a flexible and scalable joint optimization framework by adopting a "centralized training and decentralized execution" approach. Extensive experiments demonstrate that MADDPG scales to a 2.5 billion action space in the public Mujoco environment, and outperforms the deep bandits modeling by 25.7% on the offline CRO data set from a leading e-commerce company. We foresee that this novel multi-agent optimization is applicable to similar joint optimization problems in the field of information retrieval.
AIJan 30
Best-of-Q: Improving VLM agents with Q-function Action Ranking at InferenceEmilien Biré, María Santos, Kai Yuan
Vision-Language Models (VLMs) have become powerful backbones for agents to autonomously operate in digital environments like the web and operating systems. However, these models suffer from inadaptability to fast-changing environments like the web, which can be alleviated by fine-tuning requiring expansive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference without policy retraining. Fundamentally, our approach decouples the VLM's role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, and not offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
AISep 22, 2024
Can Large Language Models Logically Predict Myocardial Infarction? Evaluation based on UK Biobank CohortYuxing Zhi, Yuan Guo, Kai Yuan et al.
Background: Large language models (LLMs) have seen extraordinary advances with applications in clinical decision support. However, high-quality evidence is urgently needed on the potential and limitation of LLMs in providing accurate clinical decisions based on real-world medical data. Objective: To evaluate quantitatively whether universal state-of-the-art LLMs (ChatGPT and GPT-4) can predict the incidence risk of myocardial infarction (MI) with logical inference, and to further make comparison between various models to assess the performance of LLMs comprehensively. Methods: In this retrospective cohort study, 482,310 participants recruited from 2006 to 2010 were initially included in UK Biobank database and later on resampled into a final cohort of 690 participants. For each participant, tabular data of the risk factors of MI were transformed into standardized textual descriptions for ChatGPT recognition. Responses were generated by asking ChatGPT to select a score ranging from 0 to 10 representing the risk. Chain of Thought (CoT) questioning was used to evaluate whether LLMs make prediction logically. The predictive performance of ChatGPT was compared with published medical indices, traditional machine learning models and other large language models. Conclusions: Current LLMs are not ready to be applied in clinical medicine fields. Future medical LLMs are suggested to be expert in medical domain knowledge to understand both natural languages and quantified medical data, and further make logical inferences.
ROFeb 8, 2025
Generating Physically Realistic and Directable Human Motions from Multi-Modal InputsAayam Shrestha, Pan Liu, German Ros et al.
This work focuses on generating realistic, physically-based human behaviors from multi-modal inputs, which may only partially specify the desired motion. For example, the input may come from a VR controller providing arm motion and body velocity, partial key-point animation, computer vision applied to videos, or even higher-level motion goals. This requires a versatile low-level humanoid controller that can handle such sparse, under-specified guidance, seamlessly switch between skills, and recover from failures. Current approaches for learning humanoid controllers from demonstration data capture some of these characteristics, but none achieve them all. To this end, we introduce the Masked Humanoid Controller (MHC), a novel approach that applies multi-objective imitation learning on augmented and selectively masked motion demonstrations. The training methodology results in an MHC that exhibits the key capabilities of catch-up to out-of-sync input commands, combining elements from multiple motion sequences, and completing unspecified parts of motions from sparse multimodal input. We demonstrate these key capabilities for an MHC learned over a dataset of 87 diverse skills and showcase different multi-modal use cases, including integration with planning frameworks to highlight MHC's ability to solve new user-defined tasks without any finetuning.
ITAug 11, 2025
Adaptive Source-Channel Coding for Semantic CommunicationsDongxu Li, Kai Yuan, Jianhao Huang et al.
Semantic communications (SemComs) have emerged as a promising paradigm for joint data and task-oriented transmissions, combining the demands for both the bit-accurate delivery and end-to-end (E2E) distortion minimization. However, current joint source-channel coding (JSCC) in SemComs is not compatible with the existing communication systems and cannot adapt to the variations of the sources or the channels, while separate source-channel coding (SSCC) is suboptimal in the finite blocklength regime. To address these issues, we propose an adaptive source-channel coding (ASCC) scheme for SemComs over parallel Gaussian channels, where the deep neural network (DNN)-based semantic source coding and conventional digital channel coding are separately deployed and adaptively designed. To enable efficient adaptation between the source and channel coding, we first approximate the E2E data and semantic distortions as functions of source coding rate and bit error ratio (BER) via logistic regression, where BER is further modeled as functions of signal-to-noise ratio (SNR) and channel coding rate. Then, we formulate the weighted sum E2E distortion minimization problem for joint source-channel coding rate and power allocation over parallel channels, which is solved by the successive convex approximation. Finally, simulation results demonstrate that the proposed ASCC scheme outperforms typical deep JSCC and SSCC schemes for both the single- and parallel-channel scenarios while maintaining full compatibility with practical digital systems.
IRFeb 1
Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective AlignmentKai Yuan, Anthony Zheng, Jia Hu et al.
Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves 5.44\% reduction in keystrokes and 3.46\% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.
AIOct 22, 2025
Surfer 2: The Next Generation of Cross-Platform Computer Use AgentsMathieu Andreux, Märt Bakler, Yanael Barbier et al. · cambridge
Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
LGDec 10, 2024
AHSG: Adversarial Attack on High-level Semantics in Graph Neural NetworksKai Yuan, Jiahao Zhang, Yidi Wang et al.
Adversarial attacks on Graph Neural Networks aim to perturb the performance of the learner by carefully modifying the graph topology and node attributes. Existing methods achieve attack stealthiness by constraining the modification budget and differences in graph properties. However, these methods typically disrupt task-relevant primary semantics directly, which results in low defensibility and detectability of the attack. In this paper, we propose an Adversarial Attack on High-level Semantics for Graph Neural Networks (AHSG), which is a graph structure attack model that ensures the retention of primary semantics. By combining latent representations with shared primary semantics, our model retains detectable attributes and relational patterns of the original graph while leveraging more subtle changes to carry out the attack. Then we use the Projected Gradient Descent algorithm to map the latent representations with attack effects to the adversarial graph. Through experiments on robust graph deep learning models equipped with defense strategies, we demonstrate that AHSG outperforms other state-of-the-art methods in attack effectiveness. Additionally, using Contextual Stochastic Block Models to detect the attacked graph further validates that our method preserves the primary semantics of the graph.
LGJun 2, 2024
IENE: Identifying and Extrapolating the Node Environment for Out-of-Distribution Generalization on GraphsHaoran Yang, Xiaobing Pei, Kai Yuan
Due to the performance degradation of graph neural networks (GNNs) under distribution shifts, the work on out-of-distribution (OOD) generalization on graphs has received widespread attention. A novel perspective involves distinguishing potential confounding biases from different environments through environmental identification, enabling the model to escape environmentally-sensitive correlations and maintain stable performance under distribution shifts. However, in graph data, confounding factors not only affect the generation process of node features but also influence the complex interaction between nodes. We observe that neglecting either aspect of them will lead to a decrease in performance. In this paper, we propose IENE, an OOD generalization method on graphs based on node-level environmental identification and extrapolation techniques. It strengthens the model's ability to extract invariance from two granularities simultaneously, leading to improved generalization. Specifically, to identify invariance in features, we utilize the disentangled information bottleneck framework to achieve mutual promotion between node-level environmental estimation and invariant feature learning. Furthermore, we extrapolate topological environments through graph augmentation techniques to identify structural invariance. We implement the conceptual method with specific algorithms and provide theoretical analysis and proofs for our approach. Extensive experimental evaluations on two synthetic and four real-world OOD datasets validate the superiority of IENE, which outperforms existing techniques and provides a flexible framework for enhancing the generalization of GNNs.
ROSep 28, 2021
Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual ObservationsFernando Acero, Kai Yuan, Zhibin Li
To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable. We aim to investigate the feasibility and performance of using sparse visual observations to achieve perceptual locomotion over a range of common terrains (steps, ramps, gaps, and stairs) in human-centered environments. We formulate a selection of sparse visual inputs suitable for locomotion over the terrains of interest, and propose a learning framework to integrate exteroceptive and proprioceptive states. We specifically design the state observations and a training curriculum to learn feedback control policies effectively over a range of different terrains. We extensively validate and benchmark the learned policy in various tasks: omnidirectional walking on flat ground, and forward locomotion over various obstacles, showing high success rate of traversability. Furthermore, we study exteroceptive ablations and evaluate policy generalization by adding various levels of noise and testing on new unseen terrains. We demonstrate the capabilities of autonomous perceptual locomotion that can be achieved by only using sparse visual observations from direct depth measurements, which are easily available from a Lidar or RGB-D sensor, showing robust ascent and descent over high stairs of 20 cm height, i.e., 50% leg length, and robustness against noise and unseen terrains.
IRAug 11, 2021
Deep Pairwise Learning To Rank For Search AutocompleteKai Yuan, Da Kuang
Autocomplete (a.k.a "Query Auto-Completion", "AC") suggests full queries based on a prefix typed by customer. Autocomplete has been a core feature of commercial search engine. In this paper, we propose a novel context-aware neural network based pairwise ranker (DeepPLTR) to improve AC ranking, DeepPLTR leverages contextual and behavioral features to rank queries by minimizing a pairwise loss, based on a fully-connected neural network structure. Compared to LambdaMART ranker, DeepPLTR shows +3.90% MeanReciprocalRank (MRR) lift in offline evaluation, and yielded +0.06% (p < 0.1) Gross Merchandise Value (GMV) lift in an Amazon's online A/B experiment.
RODec 10, 2020
Multi-expert learning of adaptive legged locomotionChuanyu Yang, Kai Yuan, Qiuguo Zhu et al.
Achieving versatile robot locomotion requires motor skills which can adapt to previously unseen situations. We propose a Multi-Expert Learning Architecture (MELA) that learns to generate adaptive skills from a group of representative expert skills. During training, MELA is first initialised by a distinct set of pre-trained experts, each in a separate deep neural network (DNN). Then by learning the combination of these DNNs using a Gating Neural Network (GNN), MELA can acquire more specialised experts and transitional skills across various locomotion modes. During runtime, MELA constantly blends multiple DNNs and dynamically synthesises a new DNN to produce adaptive behaviours in response to changing situations. This approach leverages the advantages of trained expert skills and the fast online synthesis of adaptive policies to generate responsive motor skills during the changing tasks. Using a unified MELA framework, we demonstrated successful multi-skill locomotion on a real quadruped robot that performed coherent trotting, steering, and fall recovery autonomously, and showed the merit of multi-expert learning generating behaviours which can adapt to unseen scenarios.
ROMay 20, 2020
Learning natural locomotion behaviors for humanoid robots using human knowledgeChuanyu Yang, Kai Yuan, Shuai Heng et al.
This paper presents a new learning framework that leverages the knowledge from imitation learning, deep reinforcement learning, and control theories to achieve human-style locomotion that is natural, dynamic, and robust for humanoids. We proposed novel approaches to introduce human bias, i.e. motion capture data and a special Multi-Expert network structure. We used the Multi-Expert network structure to smoothly blend behavioral features, and used the augmented reward design for the task and imitation rewards. Our reward design is composable, tunable, and explainable by using fundamental concepts from conventional humanoid control. We rigorously validated and benchmarked the learning framework which consistently produced robust locomotion behaviors in various test scenarios. Further, we demonstrated the capability of learning robust and versatile policies in the presence of disturbances, such as terrain irregularities and external pushes.
ROFeb 15, 2020
Learning Pregrasp Manipulation of Objects from Ungraspable PosesZhaole Sun, Kai Yuan, Wenbin Hu et al.
In robotic grasping, objects are often occluded in ungraspable configurations such that no pregrasp pose can be found, eg large flat boxes on the table that can only be grasped from the side. Inspired by humans' bimanual manipulation, eg one hand to lift up things and the other to grasp, we address this type of problems by introducing pregrasp manipulation - push and lift actions. We propose a model-free Deep Reinforcement Learning framework to train control policies that utilize visual information and proprioceptive states of the robot to autonomously discover robust pregrasp manipulation. The robot arm learns to first push the object towards a support surface and establishes a pivot to lift up one side of the object, thus creating a clearance between the object and the table for possible grasping solutions. Furthermore, we show the effectiveness of our proposed learning framework in training robust pregrasp policies that can directly transfer from simulation to real hardware through suitable design of training procedures, state, and action space. Lastly, we evaluate the effectiveness and the generalisation ability of the learned policies in real-world experiments, and demonstrate pregrasp manipulation of objects with various size, shape, weight, and surface friction.
ROFeb 11, 2020
Reaching, Grasping and Re-grasping: Learning Multimode Grasping SkillsWenbin Hu, Chuanyu Yang, Kai Yuan et al.
The ability to adapt to uncertainties, recover from failures, and coordinate between hand and fingers are essential sensorimotor skills for fully autonomous robotic grasping. In this paper, we aim to study a unified feedback control policy for generating the finger actions and the motion of hand to accomplish seamlessly coordinated tasks of reaching, grasping and re-grasping. We proposed a set of quantified metrics for task-orientated rewards to guide the policy exploration, and we analyzed and demonstrated the effectiveness of each reward term. To acquire a robust re-grasping motion, we deployed different initial states in training to experience failures that the robot would encounter during grasping due to inaccurate perception or disturbances. The performance of learned policy is evaluated on three different tasks: grasping a static target, grasping a dynamic target, and re-grasping. The quality of learned grasping policy was evaluated based on success rates in different scenarios and the recovery time from failures. The results indicate that the learned policy is able to achieve stable grasps of a static or moving object. Moreover, the policy can adapt to new environmental changes on the fly and execute collision-free re-grasp after a failed attempt within a short recovery time even in difficult configurations.
ROFeb 7, 2020
Learning Whole-body Motor Skills for HumanoidsChuanyu Yang, Kai Yuan, Wolfgang Merkt et al.
This paper presents a hierarchical framework for Deep Reinforcement Learning that acquires motor skills for a variety of push recovery and balancing behaviors, i.e., ankle, hip, foot tilting, and stepping strategies. The policy is trained in a physics simulator with realistic setting of robot model and low-level impedance control that are easy to transfer the learned skills to real robots. The advantage over traditional methods is the integration of high-level planner and feedback control all in one single coherent policy network, which is generic for learning versatile balancing and recovery motions against unknown perturbations at arbitrary locations (e.g., legs, torso). Furthermore, the proposed framework allows the policy to be learned quickly by many state-of-the-art learning algorithms. By comparing our learned results to studies of preprogrammed, special-purpose controllers in the literature, self-learned skills are comparable in terms of disturbance rejection but with additional advantages of producing a wide range of adaptive, versatile and robust behaviors.
ROFeb 5, 2020
Force-guided High-precision Grasping Control of Fragile and Deformable Objects using sEMG-based Force PredictionRuoshi Wen, Kai Yuan, Qiang Wang et al.
Regulating contact forces with high precision is crucial for grasping and manipulating fragile or deformable objects. We aim to utilize the dexterity of human hands to regulate the contact forces for robotic hands and exploit human sensory-motor synergies in a wearable and non-invasive way. We extracted force information from the electric activities of skeletal muscles during their voluntary contractions through surface electromyography (sEMG). We built a regression model based on a Neural Network to predict the gripping force from the preprocessed sEMG signals and achieved high accuracy (R2 = 0.982). Based on the force command predicted from human muscles, we developed a force-guided control framework, where force control was realized via an admittance controller that tracked the predicted gripping force reference to grasp delicate and deformable objects. We demonstrated the effectiveness of the proposed method on a set of representative fragile and deformable objects from daily life, all of which were successfully grasped without any damage or deformation.
ROMay 6, 2018
Comparison Study of Nonlinear Optimization of Step Durations and Foot Placement for Dynamic WalkingWenbin Hu, Iordanis Chatzinikolaidis, Kai Yuan et al.
This paper studies bipedal locomotion as a nonlinear optimization problem based on continuous and discrete dynamics, by simultaneously optimizing the remaining step duration, the next step duration and the foot location to achieve robustness. The linear inverted pendulum as the motion model captures the center of mass dynamics and its low-dimensionality makes the problem more tractable. We first formulate a holistic approach to search for optimality in the three-dimensional parametric space and use these results as baseline. To further improve computational efficiency, our study investigates a sequential approach with two stages of customized optimization that first optimizes the current step duration, and subsequently the duration and location of the next step. The effectiveness of both approaches is successfully demonstrated in simulation by applying different perturbations. The comparison study shows that these two approaches find mostly the same optimal solutions, but the latter requires considerably less computational time, which suggests that the proposed sequential approach is well suited for real-time implementation with a minor trade-off in optimality.