ROMar 27
Line-of-Sight-Constrained Multi-Robot Mapless Navigation via Polygonal Visible RegionsRuofei Bai, Shenghai Yuan, Xinhang Xu et al.
Multi-robot systems rely on underlying connectivity to ensure reliable communication and timely coordination. This paper studies the line-of-sight (LoS) connectivity maintenance problem in multi-robot navigation with unknown obstacles. Prior works typically assume known environment maps to formulate LoS constraints between robots, which hinders their practical deployment. To overcome this limitation, we propose an inherently distributed approach where each robot only constructs an egocentric visible region based on its real-time LiDAR scans, instead of endeavoring to build a global map online. The individual visible regions are shared through distributed communication to establish inter-robot LoS constraints, which are then incorporated into a multi-robot navigation framework to ensure LoS-connectivity. Moreover, we enhance the robustness of connectivity maintenance by proposing a more accurate LoS-distance metric, which further enables flexible topology optimization that eliminates redundant and effort-demanding connections. The proposed framework is evaluated through extensive multi-robot navigation and exploration tasks in both simulation and real-world experiments. Results show that it reliably maintains LoS-connectivity between robots in challenging environments cluttered with obstacles, even under large visible ranges and fragile minimal topologies, where existing methods consistently fail. Ablation studies also reveal that topology optimization boosts navigation efficiency by around $20\%$, demonstrating the framework's potential for efficient navigation under connectivity constraints.
LGAug 16, 2024
Self-Explainable Graph Transformer for Link Sign PredictionLu Li, Jiale Liu, Xingyu Ji et al.
Signed Graph Neural Networks (SGNNs) have been shown to be effective in analyzing complex patterns in real-world situations where positive and negative links coexist. However, SGNN models suffer from poor explainability, which limit their adoptions in critical scenarios that require understanding the rationale behind predictions. To the best of our knowledge, there is currently no research work on the explainability of the SGNN models. Our goal is to address the explainability of decision-making for the downstream task of link sign prediction specific to signed graph neural networks. Since post-hoc explanations are not derived directly from the models, they may be biased and misrepresent the true explanations. Therefore, in this paper we introduce a Self-Explainable Signed Graph transformer (SE-SGformer) framework, which can not only outputs explainable information while ensuring high prediction accuracy. Specifically, We propose a new Transformer architecture for signed graphs and theoretically demonstrate that using positional encoding based on signed random walks has greater expressive power than current SGNN methods and other positional encoding graph Transformer-based approaches. We constructs a novel explainable decision process by discovering the $K$-nearest (farthest) positive (negative) neighbors of a node to replace the neural network-based decoder for predicting edge signs. These $K$ positive (negative) neighbors represent crucial information about the formation of positive (negative) edges between nodes and thus can serve as important explanatory information in the decision-making process. We conducted experiments on several real-world datasets to validate the effectiveness of SE-SGformer, which outperforms the state-of-the-art methods by improving 2.2\% prediction accuracy and 73.1\% explainablity accuracy in the best-case scenario.
CVMar 2
Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo CameraTutian Tang, Xingyu Ji, Yutong Li et al.
Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused for predicting drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometry variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translation under a long recording time and reduces foot-skating effects.
LGOct 17, 2023
Enhancing Signed Graph Neural Networks through Curriculum-Based TrainingZeyu Zhang, Lu Li, Xingyu Ji et al.
Signed graphs are powerful models for representing complex relations with both positive and negative connections. Recently, Signed Graph Neural Networks (SGNNs) have emerged as potent tools for analyzing such graphs. To our knowledge, no prior research has been conducted on devising a training plan specifically for SGNNs. The prevailing training approach feeds samples (edges) to models in a random order, resulting in equal contributions from each sample during the training process, but fails to account for varying learning difficulties based on the graph's structure. We contend that SGNNs can benefit from a curriculum that progresses from easy to difficult, similar to human learning. The main challenge is evaluating the difficulty of edges in a signed graph. We address this by theoretically analyzing the difficulty of SGNNs in learning adequate representations for edges in unbalanced cycles and propose a lightweight difficulty measurer. This forms the basis for our innovative Curriculum representation learning framework for Signed Graphs, referred to as CSG. The process involves using the measurer to assign difficulty scores to training samples, adjusting their order using a scheduler and training the SGNN model accordingly. We empirically our approach on six real-world signed graph datasets. Our method demonstrates remarkable results, enhancing the accuracy of popular SGNN models by up to 23.7% and showing a reduction of 8.4% in standard deviation, enhancing model stability.
IRMay 14, 2025
TARGET: Benchmarking Table Retrieval for Generative TasksXingyu Ji, Parker Glenn, Aditya G. Parameswaran et al.
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline which is less effective than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
IRMar 7
Fine-Grained Table Retrieval Through the Lens of Complex QueriesWojciech Kosiuk, Xingyu Ji, Yeounoh Chung et al.
Enabling question answering over tables and databases in natural language has become a key capability in the democratization of insights from tabular data sources. These systems first require retrieval of data that is relevant to a given natural language query, for which several methods have been introduced. In this work we present and study a table retrieval mechanism devising fine-grained typed query decomposition and global connectivity-awareness (DCTR), to handle the challenges induced by open-domain question answering over relational databases in complex usage contexts. We evaluate the effectiveness of the two mechanisms through the lens of retrieval complexity which we measure along the axes of query- and data complexity. Our analyses over industry-aligned benchmarks illustrate the robustness of DCTR for highly composite queries and densely connected databases.
ROMar 9
Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLATutian Tang, Xingyu Ji, Wanli Xing et al.
While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation-specifically contact-rich in-hand operations-introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. By utilizing a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating doubled success rate improvement over the baseline in dexterous contact-rich tasks.
ROOct 14, 2025
Gaussian Semantic Field for One-shot LiDAR Global LocalizationPengyu Yin, Shenghai Yuan, Haozhi Cao et al.
We present a one-shot LiDAR global localization algorithm featuring semantic disambiguation ability based on a lightweight tri-layered scene graph. While landmark semantic registration-based methods have shown promising performance improvements in global localization compared with geometric-only methods, landmarks can be repetitive and misleading for correspondence establishment. We propose to mitigate this problem by modeling semantic distributions with continuous functions learned from a population of Gaussian processes. Compared with discrete semantic labels, the continuous functions capture finer-grained geo-semantic information and also provide more detailed metric information for correspondence establishment. We insert this continuous function as the middle layer between the object layer and the metric-semantic layer, forming a tri-layered 3D scene graph, serving as a light-weight yet performant backend for one-shot localization. We term our global localization pipeline Outram-GSF (Gaussian semantic field) and conduct a wide range of experiments on publicly available data sets, validating the superior performance against the current state-of-the-art.
CVMar 11, 2024
Interactive Test-Time Adaptation with Reliable Spatial-Temporal Voxels for Multi-Modal SegmentationHaozhi Cao, Yuecong Xu, Pengyu Yin et al.
Multi-modal test-time adaptation (MM-TTA) adapts models to an unlabeled target domain by leveraging the complementary multi-modal inputs in an online manner. While previous MM-TTA methods for 3D segmentation offer a promising solution by leveraging self-refinement per frame, they suffer from two major limitations: 1) unstable frame-wise predictions caused by temporal inconsistency, and 2) consistently incorrect predictions that violate the assumption of reliable modality guidance. To address these limitations, this work introduces a comprehensive two-fold framework. Firstly, building upon our previous work ReLiable Spatial-temporal Voxels (Latte), we propose Latte++ that better suppresses the unstable frame-wise predictions with more informative geometric correspondences. Instead of utilizing a universal sliding window, Latte++ employs multi-window aggregation to capture more reliable correspondences to better evaluate the local prediction consistency of different semantic categories. Secondly, to tackle the consistently incorrect predictions, we propose Interactive Test-Time Adaptation (ITTA), a flexible add-on to empower effortless human feedback with existing MM-TTA methods. ITTA introduces a novel human-in-the-loop approach that efficiently integrates minimal human feedback through interactive segmentation, requiring only simple point clicks and bounding box annotations. Instead of using independent interactive networks, ITTA employs a lightweight promptable branch with a momentum gradient module to capture and reuse knowledge from scarce human feedback during online inference. Extensive experiments across five MM-TTA benchmarks demonstrate that ITTA achieves consistent and notable improvements with robust performance gains for target classes of interest in challenging imbalanced scenarios, while Latte++ provides complementary benefits for temporal stability.