Xinyao Li

CV
h-index31
11papers
284citations
Novelty56%
AI Score61

11 Papers

ROMar 3
ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Xialin He, Sirui Xu, Xinyao Li et al.

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

CVMar 8, 2024Code
Agile Multi-Source-Free Domain Adaptation

Xinyao Li, Jingjing Li, Fengling Li et al.

Efficiently utilizing rich knowledge in pretrained models has become a critical topic in the era of large models. This work focuses on adaptively utilizing knowledge from multiple source-pretrained models to an unlabeled target domain without accessing the source data. Despite being a practically useful setting, existing methods require extensive parameter tuning over each source model, which is computationally expensive when facing abundant source domains or larger source models. To address this challenge, we propose a novel approach which is free of the parameter tuning over source backbones. Our technical contribution lies in the Bi-level ATtention ENsemble (Bi-ATEN) module, which learns both intra-domain weights and inter-domain ensemble weights to achieve a fine balance between instance specificity and domain consistency. By slightly tuning source bottlenecks, we achieve comparable or even superior performance on a challenging benchmark DomainNet with less than 3% trained parameters and 8 times of throughput compared with SOTA method. Furthermore, with minor modifications, the proposed module can be easily equipped to existing methods and gain more than 4% performance boost. Code is available at https://github.com/TL-UESTC/Bi-ATEN.

CVMar 11, 2024Code
Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Xinyao Li, Yuke Li, Zhekai Du et al.

Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS

CVDec 2, 2025
Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li, Yinjie Min, Hongbo Chen et al.

Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

LGJan 23
Learning ORDER-Aware Multimodal Representations for Composite Materials Design

Xinyao Li, Hangwei Qian, Jingjing Li et al.

Artificial intelligence (AI) has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. Such graph-central paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces that lack well-defined graph structures. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that fundamentally determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented multimodal frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a public Nanofiber-enforced composite dataset and an internally curated dataset that simulates the construction of carbon fiber T700 with diverse fiber distributions. ORDER achieves consistent improvements over state-of-the-art multimodal baselines across property prediction, cross-modal retrieval, and microstructure generation tasks.

AIMar 5, 2024
Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation

Zhekai Du, Xinyao Li, Fengling Li et al.

Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains, which neglects to harness rich semantics from data and struggles to handle complex domain shifts. A promising technique is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Despite some endeavors, current methods often learn textual prompts to embed domain semantics for source and target domains separately and perform classification within each domain, limiting cross-domain knowledge transfer. Moreover, prompting only the language branch lacks flexibility to adapt both modalities dynamically. To bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically, the image contextual information is utilized to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile, visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. These two branches of prompts are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.

AISep 2, 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song et al. · pku

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite-roughly 60% of human-level performance-and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

CVJun 23, 2025
Generalizing vision-language models to novel domains: A comprehensive survey

Xinyao Li, Jingjing Li, Fengling Li et al.

Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking and results in VLM literatures. Delving into the typical VLM structures, current literatures are categorized into prompt-based, parameter-based and feature-based methods according to the transferred modules. The differences and characteristics in each category are furthered summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLM), e.g., DeepSeek-VL. By systematically reviewing the surging literatures in vision-language research from a novel and practical generalization prospective, this survey contributes to a clear landscape of current and future multimodal researches.

ROSep 2, 2025
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

Minghuan Liu, Zhengbang Zhu, Xiaoshen Han et al.

Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties-such as distance, size, and shape-than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.

ROFeb 18, 2025
RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations

Jingxiao Chen, Xinyao Li, Jiahang Cao et al.

Humanoid robots have shown success in locomotion and manipulation. Despite these basic abilities, humanoids are still required to quickly understand human instructions and react based on human interaction signals to become valuable assistants in human daily life. Unfortunately, most existing works only focus on multi-stage interactions, treating each task separately, and neglecting real-time feedback. In this work, we aim to empower humanoid robots with real-time reaction abilities to achieve various tasks, allowing human to interrupt robots at any time, and making robots respond to humans immediately. To support such abilities, we propose a general humanoid-human-object interaction framework, named RHINO, i.e., Real-time Humanoid-human Interaction and Object manipulation. RHINO provides a unified view of reactive motion, instruction-based manipulation, and safety concerns, over multiple human signal modalities, such as languages, images, and motions. RHINO is a hierarchical learning framework, enabling humanoids to learn reaction skills from human-human-object demonstrations and teleoperation data. In particular, it decouples the interaction process into two levels: 1) a high-level planner inferring human intentions from real-time human behaviors; and 2) a low-level controller achieving reactive motion behaviors and object manipulation skills based on the predicted intentions. We evaluate the proposed framework on a real humanoid robot and demonstrate its effectiveness, flexibility, and safety in various scenarios.

CVAug 7, 2025
Unified modality separation: A vision-language framework for unsupervised domain adaptation

Xinyao Li, Jingjing Li, Zhekai Du et al.

Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as modality gap. Our findings reveal that direct UDA with the presence of modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to 9% performance gain with 9 times of computational efficiencies. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.