Xinyuan Liu

LG
h-index: 20
17 papers
365 citations
Novelty: 52%
AI Score: 60

17 Papers

LG · May 30
CUPID in the Model Zoo: Online Matchmaking for Selecting Your Dream LLM

Son Nguyen, Xinyuan Liu, Ransalu Senanayake

Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM's responses or deployment. We propose an interaction-efficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user's latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match well-aligned LLMs to users at a lower cost.
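
As a rough illustration of the pair-selection loop described above, the sketch below runs a generic dueling bandit with an upper confidence bound on pairwise win rates. It is not the belief-aware strategy from the paper: the scoring rule, the Bradley-Terry user simulator, and the names `pick_pair` and `run_duels` are all assumptions made for this example, and cost and time budgets are left out.

```python
import math
import random


def pick_pair(wins, t):
    """Pick the two arms with the highest optimistic (UCB) pairwise win rates."""
    k = len(wins)

    def ucb(i, j):
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return 1.0  # unseen pairs get the most optimistic score
        bonus = math.sqrt(2.0 * math.log(t + 1) / n)
        return min(1.0, wins[i][j] / n + bonus)

    # Score each arm by its mean optimistic win rate against all other arms.
    scores = [sum(ucb(i, j) for j in range(k) if j != i) / (k - 1) for i in range(k)]
    ranked = sorted(range(k), key=lambda i: scores[i], reverse=True)
    return ranked[0], ranked[1]


def run_duels(true_utility, rounds, seed=0):
    """Simulate a user whose choices follow a Bradley-Terry model (assumed)."""
    rng = random.Random(seed)
    k = len(true_utility)
    wins = [[0] * k for _ in range(k)]  # wins[i][j]: times arm i beat arm j
    for t in range(1, rounds + 1):
        a, b = pick_pair(wins, t)
        p_a = 1.0 / (1.0 + math.exp(true_utility[b] - true_utility[a]))
        if rng.random() < p_a:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins
```

Each round shows the user one pair and records a single preference, so the number of rounds is also the number of feedback requests the user has to answer.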

AI · Jan 9 · Code
TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Dawei Wang, Chengming Zhou, Di Zhao et al.

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).

RO · Apr 15
Humanoid Factors: Design Principles for AI Humanoids in Human Worlds

Xinyuan Liu, Eren Sadikoglu, Ransalu Senanayake et al.

Human factors research has long focused on optimizing environments, tools, and systems to account for human performance. Yet, as humanoid robots begin to share our workplaces, homes, and public spaces, the design challenge expands. We must now consider not only factors for humans but also factors for humanoids, since both will coexist and interact within the same environments. Unlike conventional machines, humanoids introduce expectations of human-like behavior, communication, and social presence, which reshape usability, trust, and safety considerations. In this article, we introduce the concept of humanoid factors as a framework structured around four pillars - physical, cognitive, social, and ethical - that shape the development of humanoids to help them effectively coexist and collaborate with humans. This framework characterizes the overlap and divergence between human capabilities and those of general-purpose humanoids powered by AI foundation models. To demonstrate our framework's practical utility, we then apply the framework to evaluate a real-world humanoid control algorithm, illustrating how conventional task completion metrics in robotics overlook key human cognitive and interaction principles. We thus position humanoid factors as a foundational framework for designing, evaluating, and governing sustained human-humanoid coexistence.

AI · Mar 13
LLM Routing as Reasoning: A MaxSAT View

Son Nguyen, Xinyuan Liu, Ransalu Senanayake

Routing a query through an appropriate LLM is challenging, particularly when user preferences are expressed in natural language and model attributes are only partially observable. We propose a constraint-based interpretation of language-conditioned LLM routing, formulating it as a weighted MaxSAT/MaxSMT problem in which natural language feedback induces hard and soft constraints over model attributes. Under this view, routing corresponds to selecting models that approximately maximize satisfaction of feedback-conditioned clauses. Empirical analysis on a 25-model benchmark shows that language feedback produces near-feasible recommendation sets, while no-feedback scenarios reveal systematic priors. Our results suggest that LLM routing can be understood as structured constraint optimization under language-conditioned preferences.
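
The constraint view above can be sketched on a toy model pool: hard constraints filter out infeasible models, and the remaining models are ranked by the total weight of the soft clauses they satisfy. This is a brute-force enumeration rather than a MaxSAT/MaxSMT solver, and the attribute names, weights, and the `route` helper are hypothetical.

```python
def route(models, hard, soft):
    """Rank feasible models by the total weight of satisfied soft clauses.

    models: dict mapping model name -> dict of attributes
    hard:   list of predicates every returned model must satisfy
    soft:   list of (weight, predicate) pairs
    """
    feasible = {n: a for n, a in models.items() if all(h(a) for h in hard)}
    score = {n: sum(w for w, p in soft if p(a)) for n, a in feasible.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(score, key=lambda n: (-score[n], n))


# Hypothetical model pool and constraints derived from user feedback.
models = {
    "small": {"cost": 1, "open_weights": True, "context_k": 8},
    "medium": {"cost": 3, "open_weights": True, "context_k": 32},
    "large": {"cost": 10, "open_weights": False, "context_k": 128},
}
hard = [lambda a: a["open_weights"]]  # e.g. "it must be open-weight"
soft = [
    (2.0, lambda a: a["context_k"] >= 32),  # e.g. "long context matters most"
    (1.0, lambda a: a["cost"] <= 2),  # e.g. "cheaper would be nice"
]
ranking = route(models, hard, soft)
print(ranking)  # ['medium', 'small']
```

With partially observable attributes, the predicates would operate on estimated rather than known values, which this sketch does not model.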

CV · May 23, 2024 · Code
TopoLogic: An Interpretable Pipeline for Lane Topology Reasoning on Driving Scenes

Yanping Fu, Wenbin Liao, Xinyuan Liu et al.

As an emerging task that integrates perception and reasoning, topology reasoning in autonomous driving scenes has recently garnered widespread attention. However, existing work often emphasizes "perception over reasoning": it typically boosts reasoning performance by enhancing the perception of lanes and directly adopts an MLP to learn lane topology from lane queries. This paradigm overlooks the geometric features intrinsic to the lanes themselves and is prone to being influenced by inherent endpoint shifts in lane detection. To tackle this issue, we propose an interpretable method for lane topology reasoning based on lane geometric distance and lane query similarity, named TopoLogic. This method mitigates the impact of endpoint shifts in geometric space and introduces explicit similarity calculation in semantic space as a complement. By integrating results from both spaces, our method provides more comprehensive information for lane topology. Ultimately, our approach significantly outperforms the existing state-of-the-art methods on the mainstream benchmark OpenLane-V2 (23.9 vs. 10.9 in TOP$_{ll}$ and 44.1 vs. 39.8 in OLS on subset_A). Additionally, our proposed geometric distance topology reasoning method can be incorporated into well-trained models without re-training, significantly boosting the performance of lane topology reasoning. The code is released at https://github.com/Franpin/TopoLogic.
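
To make the geometric-distance idea concrete, the sketch below scores lane connectivity from the distance between one lane's end point and another lane's start point. The exponential mapping, the `sigma` and `threshold` values, and the `lane_adjacency` helper are assumptions for illustration, not the function used in TopoLogic.

```python
import math


def lane_adjacency(lanes, sigma=1.0, threshold=0.5):
    """Build a 0/1 matrix where [i][j] = 1 if lane i plausibly feeds lane j.

    lanes: list of polylines, each a list of (x, y) points in driving order.
    """
    n = len(lanes)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Distance from the end of lane i to the start of lane j.
            d = math.dist(lanes[i][-1], lanes[j][0])
            # A soft score tolerates small endpoint shifts from detection.
            score = math.exp(-d / sigma)
            adj[i][j] = 1 if score >= threshold else 0
    return adj


lanes = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.2, 0.1), (20.0, 0.0)],  # starts about 0.22 m from the end of lane 0
    [(0.0, 5.0), (10.0, 5.0)],  # parallel lane, not connected
]
adj = lane_adjacency(lanes)
print(adj)  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
```

Because the score depends only on predicted geometry, a rule of this kind can be applied to the output of an already trained detector without re-training.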

CV · May 23, 2025 · Code
TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving

Yanping Fu, Xinyuan Liu, Tianyu Li et al.

Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoint deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane queries, and propose Point-Lane Merge Self-Attention to enhance global context sharing by incorporating geometric distances between points and lanes as an attention mask. We further design a Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane queries. During inference, we introduce a Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET$_p$ to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 vs. 45.2 on DET$_p$). The code is released at https://github.com/Franpin/TopoPoint.

LG · Oct 6, 2025 · Code
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Siwei Han, Jiaqi Liu, Yaofeng Su et al.

As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.

CV · Jun 12, 2025 · Code
Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

Xinyuan Liu, Hang Xu, Yike Ma et al.

Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, yet progress is hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming the SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.

CV · May 17, 2023 · Code
Rethinking Boundary Discontinuity Problem for Oriented Object Detection

Hang Xu, Xinyuan Liu, Haonan Xu et al.

Oriented object detection has developed rapidly in the past few years, where rotation equivariance is crucial for detectors to predict rotated boxes. It is expected that the prediction can maintain the corresponding rotation when objects rotate, but severe mutation in angular prediction is sometimes observed when objects rotate near the boundary angle, which is the well-known boundary discontinuity problem. The problem has long been believed to be caused by the sharp loss increase at the angular boundary, and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However, we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis, we find that the key to the solution lies in the encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods, the model essentially attempts to fit the angular relationship between box and object, where the break point at the angular boundary makes the predictions highly unstable. To deal with this issue, we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from a single smoothing function into two distinct entities, which for the first time achieves the objectives of both correcting the angular boundary and blending angle with other parameters. Extensive experiments on multiple datasets show that the boundary discontinuity problem is well-addressed. Moreover, typical IoU-like methods are improved to the same level without obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm.
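
The break point at the angular boundary can be seen in a small numeric example. The sketch below assumes a box angle with period 180 degrees and uses a common continuous encoding, (cos 2θ, sin 2θ), purely to illustrate why the choice of encoding matters; it is not the dual-optimization paradigm proposed in the paper, and all function names are hypothetical.

```python
import math

PERIOD = math.pi  # a rectangle looks identical after a 180-degree rotation


def naive_gap(a, b):
    """Raw difference between two angle values, ignoring periodicity."""
    return abs(a - b)


def periodic_gap(a, b):
    """Smallest rotation, modulo PERIOD, that maps angle a onto angle b."""
    d = abs(a - b) % PERIOD
    return min(d, PERIOD - d)


def encode(theta):
    """Continuous encoding of an angle with period pi."""
    return (math.cos(2.0 * theta), math.sin(2.0 * theta))


def decode(vec):
    """Inverse of encode; returns an angle in (-pi/2, pi/2]."""
    return 0.5 * math.atan2(vec[1], vec[0])


# Two nearly identical boxes on opposite sides of the angular boundary.
a = math.radians(89.0)
b = math.radians(-89.0)

print(math.degrees(naive_gap(a, b)))  # about 178 degrees: the raw values jump
print(math.degrees(periodic_gap(a, b)))  # about 2 degrees: the boxes barely differ
print(math.dist(encode(a), encode(b)))  # small: the encoded targets stay close
```

A regressor trained on the raw angle has to fit the 178-degree jump, whereas the encoded targets vary smoothly across the boundary and can still be decoded back to an angle.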

CV · Feb 9
Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation

Chufeng Zhou, Jian Wang, Xinyuan Liu et al.

Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.

DB · Jul 15, 2025
Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

Xinyuan Liu, Jiahui Chen, Bocheng Hu et al.

Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine learning based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.

LG · Jan 27, 2025
Data-Efficient Machine Learning Potentials via Difference Vectors Based on Local Atomic Environments

Xuqiang Shao, Yuqi Zhang, Di Zhang et al.

Constructing efficient and diverse datasets is essential for the development of accurate machine learning potentials (MLPs) in atomistic simulations. However, existing approaches often suffer from data redundancy and high computational costs. Herein, we propose a new method--Difference Vectors based on Local Atomic Environments (DV-LAE)--that encodes structural differences via histogram-based descriptors and enables visual analysis through t-SNE dimensionality reduction. This approach facilitates redundancy detection and dataset optimization while preserving structural diversity. We demonstrate that DV-LAE significantly reduces dataset size and training time across various materials systems, including high-pressure hydrogen, iron-hydrogen binaries, magnesium hydrides, and carbon allotropes, with minimal compromise in prediction accuracy. For instance, in the $α$-Fe/H system, the dataset size was reduced by 56% and the training time per iteration dropped by over 50%, while maintaining highly similar MLP accuracy. Moreover, we show how visualizing the DV-LAE representation aids in identifying out-of-distribution data by examining the spatial distribution of high-error prediction points, providing a robust reliability metric for new structures during simulations. Our results highlight the utility of local environment visualization not only as an interpretability tool but also as a practical means for accelerating MLP development and ensuring data efficiency in large-scale atomistic modeling.

IV · Dec 20, 2023
Computational Spectral Imaging with Unified Encoding Model: A Comparative Study and Beyond

Xinyuan Liu, Lizhi Wang, Lingen Li et al.

Computational spectral imaging is drawing increasing attention owing to the snapshot advantage, and amplitude, phase, and wavelength encoding systems are three types of representative implementations. Fairly comparing and understanding the performance of these systems is essential, but challenging due to the heterogeneity in encoding design. To overcome this limitation, we propose the unified encoding model (UEM) that covers all physical systems using the three encoding types. Specifically, the UEM comprises physical amplitude, physical phase, and physical wavelength encoding models that can be combined with a digital decoding model in a joint encoder-decoder optimization framework to compare the three systems under a unified experimental setup fairly. Furthermore, we extend the UEMs to ideal versions, namely, ideal amplitude, ideal phase, and ideal wavelength encoding models, which are free from physical constraints, to explore the full potential of the three types of computational spectral imaging systems. Finally, we conduct a holistic comparison of the three types of computational spectral imaging systems and provide valuable insights for designing and exploiting these systems in the future.

LG · May 16, 2023
Contrastive Label Enhancement

Yifei Wang, Yiyang Zhou, Jihua Zhu et al.

Label distribution learning (LDL) is a new machine learning paradigm for solving label ambiguity. Since it is difficult to directly obtain label distributions, many studies are focusing on how to recover label distributions from logical labels, dubbed label enhancement (LE). Existing LE methods estimate label distributions by simply building a mapping relationship between features and label distributions under the supervision of logical labels. They typically overlook the fact that both features and logical labels are descriptions of the instance from different views. Therefore, we propose a novel method called Contrastive Label Enhancement (ConLE), which integrates features and logical labels into a unified projection space to generate high-level features through a contrastive learning strategy. In this approach, features and logical labels belonging to the same sample are pulled closer, while those of different samples are projected farther away from each other in the projection space. Subsequently, we leverage the obtained high-level features to gain label distributions through a well-designed training strategy that considers the consistency of label attributes. Extensive experiments on LDL benchmark datasets demonstrate the effectiveness and superiority of our method.

LG · Jul 7, 2020
Bidirectional Loss Function for Label Enhancement and Distribution Learning

Xinyuan Liu, Jihua Zhu, Qinghai Zheng et al.

Label distribution learning (LDL) is an interpretable and general learning paradigm that has been applied in many real-world applications. In contrast to the simple logical vector in single-label learning (SLL) and multi-label learning (MLL), LDL assigns labels with a description degree to each instance. In practice, two challenges exist in LDL, namely, how to address the dimensional gap problem during the learning process of LDL and how to exactly recover label distributions from existing logical labels, i.e., Label Enhancement (LE). Most existing LDL and LE algorithms ignore the fact that the dimension of the input matrix is much higher than that of the output one, which typically leads to dimensional reduction owing to the unidirectional projection. The valuable information hidden in the feature space is lost during the mapping process. To this end, this study considers a bidirectional projection function which can be applied in LE and LDL problems simultaneously. More specifically, this novel loss function not only considers the mapping errors generated from the projection of the input space into the output one but also accounts for the reconstruction errors generated from the projection of the output space back to the input one. This loss function aims to potentially reconstruct the input data from the output data. Therefore, it is expected to obtain more accurate results. Finally, experiments on several real-world datasets are carried out to demonstrate the superiority of the proposed method for both LE and LDL.
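
The two error terms described above can be written down for a linear model as a small sketch. The squared-error form, the separate backward matrix `V`, the trade-off weight `lam`, and the helper names are assumptions for illustration; the abstract does not give the exact formulation used in the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sq_error(a, b):
    """Sum of squared element-wise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def bidirectional_loss(X, D, W, V, lam=1.0):
    """Mapping error of X -> D plus weighted reconstruction error of D -> X.

    X: features (n x d), D: label distributions (n x c),
    W: forward projection (d x c), V: backward projection (c x d).
    """
    mapping = sq_error(matmul(X, W), D)
    reconstruction = sq_error(matmul(D, V), X)
    return mapping + lam * reconstruction


# Toy data: 2 samples, 3 features, 2 labels.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
D = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # maps X exactly onto D
V = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # maps D exactly back onto X
V_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print(bidirectional_loss(X, D, W, V))  # 0.0: both directions fit
print(bidirectional_loss(X, D, W, V_zero, lam=0.5))  # 2.0: reconstruction fails
```

The second call shows the point of the extra term: a forward projection that fits the labels perfectly is still penalized when the features cannot be recovered from the output.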

LG · Apr 7, 2020
Generalized Label Enhancement with Sample Correlations

Qinghai Zheng, Jihua Zhu, Haoyu Tang et al.

Recently, label distribution learning (LDL) has drawn much attention in machine learning, where an LDL model is learned from labeled instances. Different from single-label and multi-label annotations, label distributions describe the instance by multiple labels with different intensities and accommodate more general scenes. Since most existing machine learning datasets merely provide logical labels, label distributions are unavailable in many real-world applications. To handle this problem, we propose two novel label enhancement methods, i.e., Label Enhancement with Sample Correlations (LESC) and generalized Label Enhancement with Sample Correlations (gLESC). More specifically, LESC employs a low-rank representation of samples in the feature space, and gLESC leverages a tensor multi-rank minimization to further investigate the sample correlations in both the feature space and label space. Benefitting from the sample correlations, the proposed methods can boost the performance of label enhancement. Extensive experiments on 14 benchmark datasets demonstrate the effectiveness and superiority of our methods.

SE · Jun 28, 2017
Identifying Patch Correctness in Test-Based Program Repair

Yingfei Xiong, Xinyuan Liu, Muhan Zeng et al.

Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches. To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness. Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGenProg, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3% of the incorrect patches from being generated, without blocking any correct patches.