Bo Yu

RO
h-index98
37papers
431citations
Novelty47%
AI Score56

37 Papers

CVApr 18, 2023Code
Data and Knowledge Co-driving for Cancer Subtype Classification on Multi-Scale Histopathological Slides

Bo Yu, Hechang Chen, Yunke Zhang et al.

Artificial intelligence-enabled histopathological data analysis has become a valuable assistant to the pathologist. However, existing models lack representation and inference abilities compared with those of pathologists, especially in cancer subtype diagnosis, which is unconvincing in clinical practice. For instance, pathologists typically observe the lesions of a slide from global to local, and then can give a diagnosis based on their knowledge and experience. In this paper, we propose a Data and Knowledge Co-driving (D&K) model to replicate the process of cancer subtype classification on a histopathological slide like a pathologist. Specifically, in the data-driven module, the bagging mechanism in ensemble learning is leveraged to integrate the histological features from various bags extracted by the embedding representation unit. Furthermore, a knowledge-driven module is established based on the Gestalt principle in psychology to build the three-dimensional (3D) expert knowledge space and map histological features into this space for metric. Then, the diagnosis can be made according to the Euclidean distance between them. Extensive experimental results on both public and in-house datasets demonstrate that the D&K model has a high performance and credible results compared with the state-of-the-art methods for diagnosing histopathological subtypes. Code: https://github.com/Dennis-YB/Data-and-Knowledge-Co-driving-for-Cancer-Subtypes-Classification

ROSep 23, 2024Code
KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems

Zixuan Wang, Bo Yu, Junzhe Zhao et al.

Embodied AI agents responsible for executing interconnected, long-sequence household tasks often face difficulties with in-context memory, leading to inefficiencies and errors in task execution. To address this issue, we introduce KARMA, an innovative memory system that integrates long-term and short-term memory modules, enhancing large language models (LLMs) for planning in embodied agents through memory-augmented prompting. KARMA distinguishes between long-term and short-term memory, with long-term memory capturing comprehensive 3D scene graphs as representations of the environment, while short-term memory dynamically records changes in objects' positions and states. This dual-memory structure allows agents to retrieve relevant past scene experiences, thereby improving the accuracy and efficiency of task planning. Short-term memory employs strategies for effective and adaptive memory replacement, ensuring the retention of critical information while discarding less pertinent data. Compared to state-of-the-art embodied agents enhanced with memory, our memory-augmented embodied AI agent improves success rates by 1.3x and 2.3x in Composite Tasks and Complex Tasks within the AI2-THOR simulator, respectively, and enhances task execution efficiency by 3.4x and 62.7x. Furthermore, we demonstrate that KARMA's plug-and-play capability allows for seamless deployment on real-world robotic systems, such as mobile manipulation platforms.Through this plug-and-play memory system, KARMA significantly enhances the ability of embodied agents to generate coherent and contextually appropriate plans, making the execution of complex household tasks more efficient. The experimental videos from the work can be found at https://youtu.be/4BT7fnw9ehs. Our code is available at https://github.com/WZX0Swarm0Robotics/KARMA/tree/master.

ARDec 5, 2022
Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators

Abhishek Tyagi, Yiming Gan, Shaoshan Liu et al.

As Deep Neural Networks (DNNs) are increasingly deployed in safety critical and privacy sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primarily focuses on metrics such as Failures In Time (FIT) rate and the Silent Data Corruption (SDC) rate, which quantify how often a device fails. Instead, this paper focuses on quantifying the DNN accuracy given that a transient error has occurred, which tells us how well a network behaves when a transient error occurs. We call this metric Resiliency Accuracy (RA). We show that existing RA formulation is fundamentally inaccurate, because it incorrectly assumes that software variables (model weights/activations) have equal faulty probability under hardware transient faults. We present an algorithm that captures the faulty probabilities of DNN variables under transient faults and, thus, provides correct RA estimations validated by hardware. To accelerate RA estimation, we reformulate RA calculation as a Monte Carlo integration problem, and solve it using importance sampling driven by DNN specific heuristics. Using our lightweight RA estimation method, we show that transient faults lead to far greater accuracy degradation than what todays DNN resiliency tools estimate. We show how our RA estimation tool can help design more resilient DNNs by integrating it with a Network Architecture Search framework.

AIMay 27
CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Fengze Yang, Bo Yu, Xuewen Luo et al.

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

IVApr 18, 2023
Multi-Modality Multi-Scale Cardiovascular Disease Subtypes Classification Using Raman Image and Medical History

Bo Yu, Hechang Chen, Chengyou Jia et al.

Raman spectroscopy (RS) has been widely used for disease diagnosis, e.g., cardiovascular disease (CVD), owing to its efficiency and component-specific testing capabilities. A series of popular deep learning methods have recently been introduced to learn nuance features from RS for binary classifications and achieved outstanding performance than conventional machine learning methods. However, these existing deep learning methods still confront some challenges in classifying subtypes of CVD. For example, the nuance between subtypes is quite hard to capture and represent by intelligent models due to the chillingly similar shape of RS sequences. Moreover, medical history information is an essential resource for distinguishing subtypes, but they are underutilized. In light of this, we propose a multi-modality multi-scale model called M3S, which is a novel deep learning method with two core modules to address these issues. First, we convert RS data to various resolution images by the Gramian angular field (GAF) to enlarge nuance, and a two-branch structure is leveraged to get embeddings for distinction in the multi-scale feature extraction module. Second, a probability matrix and a weight matrix are used to enhance the classification capacity by combining the RS and medical history data in the multi-modality data fusion module. We perform extensive evaluations of M3S and found its outstanding performance on our in-house dataset, with accuracy, precision, recall, specificity, and F1 score of 0.9330, 0.9379, 0.9291, 0.9752, and 0.9334, respectively. These results demonstrate that the M3S has high performance and robustness compared with popular methods in diagnosing CVD subtypes.

ROMar 17Code
DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

Zebin Yang, Yijiahao Qi, Tong Xie et al.

Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.

ROJul 8, 2023
Autonomy 2.0: The Quest for Economies of Scale

Shuang Wu, Bo Yu, Shaoshan Liu et al.

With the advancement of robotics and AI technologies in the past decade, we have now entered the age of autonomous machines. In this new age of information technology, autonomous machines, such as service robots, autonomous drones, delivery robots, and autonomous vehicles, rather than humans, will provide services. In this article, through examining the technical challenges and economic impact of the digital economy, we argue that scalability is both highly necessary from a technical perspective and significantly advantageous from an economic perspective, thus is the key for the autonomy industry to achieve its full potential. Nonetheless, the current development paradigm, dubbed Autonomy 1.0, scales with the number of engineers, instead of with the amount of data or compute resources, hence preventing the autonomy industry to fully benefit from the economies of scale, especially the exponentially cheapening compute cost and the explosion of available data. We further analyze the key scalability blockers and explain how a new development paradigm, dubbed Autonomy 2.0, can address these problems to greatly boost the autonomy industry.

HCSep 27, 2024
Esports Debut as a Medal Event at 2023 Asian Games: Exploring Public Perceptions with BERTopic and GPT-4 Topic Fine-Tuning

Tyreal Yizhou Qian, Bo Yu, Weizhe Li et al.

This study examined the public opinions of esports at the 2023 Asian Games and value co-creation during the event using an LLM-enhanced BERTopic modeling analysis. We identified five major themes representing public perceptions, as well as how major stakeholders co-created value within and beyond the esports ecosystem. Key findings highlighted the strategic use of social media marketing to influence public opinion and promote esports events and brands, emphasizing the importance of event logistics and infrastructure. Additionally, the study revealed the co-creation value contributed by stakeholders outside the traditional esports ecosystem, particularly in promoting national representation and performance. Our findings supported the ongoing efforts to legitimize esports as a sport, noting that mainstream recognition remains a challenge. The inclusion of esports as a medal event showcased broader acceptance and helped mitigate negative public perceptions. Moreover, contributions from non-traditional stakeholders underscored the value of cross-subcultural collaborations in esports.

CVApr 24, 2024Code
OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Lizhi Wang, Feng Zhou, Bo yu et al.

Recent advancements in 3D reconstruction technologies have paved the way for high-quality and real-time rendering of complex 3D scenes. Despite these achievements, a notable challenge persists: it is difficult to precisely reconstruct specific objects from large scenes. Current scene reconstruction techniques frequently result in the loss of object detail textures and are unable to reconstruct object portions that are occluded or unseen in views. To address this challenge, we delve into the meticulous 3D reconstruction of specific objects within large scenes and propose a framework termed OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation. Specifically, we proposed a novel 3D target segmentation technique based on 2D Gaussian Splatting, which segments 3D consistent target masks in multi-view scene images and generates a preliminary target model. Moreover, to reconstruct the unseen portions of the target, we propose a novel target replenishment technique driven by large-scale generative diffusion priors. We demonstrate that our method can accurately reconstruct specific targets from large scenes, both quantitatively and qualitatively. Our experiments show that OMEGAS significantly outperforms existing reconstruction methods across various scenarios. Our project page is at: https://github.com/CrystalWlz/OMEGAS

AIApr 15
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

Bo Yu, Cheng Yang, Dongyang Hou et al.

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.

ROMar 11
KnowDiffuser: A Knowledge-Guided Diffusion Planner with LM Reasoning and Prior-Informed Trajectory Initialization

Fan Ding, Xuewen Luo, Fengze Yang et al.

Recent advancements in Language Models (LMs) have demonstrated strong semantic reasoning capabilities, enabling their application in high-level decision-making for autonomous driving (AD). However, LMs operate over discrete token spaces and lack the ability to generate continuous, physically feasible trajectories required for motion planning. Meanwhile, diffusion models have proven effective at generating reliable and dynamically consistent trajectories, but often lack semantic interpretability and alignment with scene-level understanding. To address these limitations, we propose \textbf{KnowDiffuser}, a knowledge-guided motion planning framework that tightly integrates the semantic understanding of language models with the generative power of diffusion models. The framework employs a language model to infer context-aware meta-actions from structured scene representations, which are then mapped to prior trajectories that anchor the subsequent denoising process. A two-stage truncated denoising mechanism refines these trajectories efficiently, preserving both semantic alignment and physical feasibility. Experiments on the nuPlan benchmark demonstrate that KnowDiffuser significantly outperforms existing planners in both open-loop and closed-loop evaluations, establishing a robust and interpretable framework that effectively bridges the semantic-to-physical gap in AD systems.

ROOct 21, 2025Code
EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval

Zebin Yang, Sunjian Zheng, Tong Xie et al.

Object-goal navigation (ObjNav) tasks an agent with navigating to the location of a specific object in an unseen environment. Embodied agents equipped with large language models (LLMs) and online constructed navigation maps can perform ObjNav in a zero-shot manner. However, existing agents heavily rely on giant LLMs on the cloud, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b, suffer from significant success rate drops due to limited model capacity for understanding complex navigation maps, which prevents deploying ObjNav on local devices. At the same time, the long prompt introduced by the navigation map description will cause high planning latency on local devices. In this paper, we propose EfficientNav to enable on-device efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better understand the environment, we propose semantics-aware memory retrieval to prune redundant information in navigation maps. To reduce planning latency, we propose discrete memory caching and attention-based memory clustering to efficiently save and re-use the KV cache. Extensive experimental results demonstrate that EfficientNav achieves 11.1% improvement in success rate on HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our code will be released soon.

CVMar 13
Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models

Bo Yu, Fengze Yang, Yiming Liu et al.

The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.

CLApr 1
Adaptive Stopping for Multi-Turn LLM Reasoning

Xiaofan Zhou, Huy Nguyen, Bo Yu et al.

Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: \textbf{When should the model stop?} Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.

LGMar 24
StateLinFormer: Stateful Training Enhancing Long-term Memory in Navigation

Zhiyuan Chen, Yuxuan Zhong, Fan Wang et al.

Effective navigation intelligence relies on long-term memory to support both immediate generalization and sustained adaptation. However, existing approaches face a dilemma: modular systems rely on explicit mapping but lack flexibility, while Transformer-based end-to-end models are constrained by fixed context windows, limiting persistent memory across extended interactions. We introduce StateLinFormer, a linear-attention navigation model trained with a stateful memory mechanism that preserves recurrent memory states across consecutive training segments instead of reinitializing them at each batch boundary. This training paradigm effectively approximates learning on infinitely long sequences, enabling the model to achieve long-horizon memory retention. Experiments across both MAZE and ProcTHOR environments demonstrate that StateLinFormer significantly outperforms its stateless linear-attention counterpart and standard Transformer baselines with fixed context windows. Notably, as interaction length increases, persistent stateful training substantially improves context-dependent adaptation, suggesting an enhancement in the model's In-Context Learning (ICL) capabilities for navigation tasks.

CVFeb 26
Don't let the information slip away

Taozhe Li, Guansu Wang, Bo Yu et al.

Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information to slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads rather than in offices, while wild animals are more likely to be found in forests or remote areas rather than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

CVMay 22, 2025
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Shuhao Han, Haotian Fan, Fangyuan Kong et al.

This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.

LGFeb 5, 2025
Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

Fan Wang, Pengtao Shao, Yiming Zhang et al.

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

CVMar 14, 2025
VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

Jiangning Wei, Lixiong Qin, Bo Yu et al.

Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We initially perform a comprehensive analysis of seven prominent action recognition methods across five widely-used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods variably declines, undermining their robustness. This decline in performance poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward in uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios.

CVApr 25, 2024
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection

Mehmet Kerem Turkcan, Sanjeev Narasimhan, Chengbo Zang et al.

We introduce Constellation, a dataset of 13K images suitable for research on detection of objects in dense urban streetscapes observed from high-elevation cameras, collected for a variety of temporal conditions. The dataset addresses the need for curated data to explore problems in small object detection exemplified by the limited pixel footprint of pedestrians observed tens of meters from above. It enables the testing of object detection models for variations in lighting, building shadows, weather, and scene dynamics. We evaluate contemporary object detection architectures on the dataset, observing that state-of-the-art methods have lower performance in detecting small pedestrians compared to vehicles, corresponding to a 10% difference in average precision (AP). Using structurally similar datasets for pretraining the models results in an increase of 1.8% mean AP (mAP). We further find that incorporating domain-specific data augmentations helps improve model performance. Using pseudo-labeled data, obtained from inference outcomes of the best-performing models, improves the performance of the models. Finally, comparing the models trained using the data collected in two different time intervals, we find a performance drift in models due to the changes in intersection conditions over time. The best-performing model achieves a pedestrian AP of 92.0% with 11.5 ms inference time on NVIDIA A100 GPUs, and an mAP of 95.4%.

AIAug 1, 2025
Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance

Fengze Yang, Bo Yu, Yang Zhou et al.

Autonomous driving (AD) systems relying solely on onboard sensors may fail to detect distant or obstacle hazards, potentially causing preventable collisions; however, existing transformer-based Vehicle-to-Everything (V2X) approaches, which mitigate AD sensing limitations, either lack effective multimodal fusion and reasoning or struggle to meet real-time performance requirements under complex, high-dimensional traffic conditions. This paper proposes the Real-time Edge-based Autonomous Co-pilot Trajectory planner (REACT), a V2X-integrated trajectory optimization framework for AD based on a fine-tuned lightweight Vision-Language Model (VLM). REACT integrates infrastructure-provided hazard alerts with onboard sensor data, capturing intricate surrounding traffic dynamics and vehicle intents through visual embeddings, interpreting precise numerical data from symbolic inputs, and employing contextual reasoning to generate optimized, safety-oriented trajectories. To ensure robust real-time deployment on edge devices, REACT innovatively employs Residual Trajectory Fusion (RTF) design and specialized edge-adaptation strategies to reduce model complexity and improve inference efficiency. Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance, a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin. Ablation studies validate the contribution of each input, module, and edge adaptation strategy. These results highlight the effectiveness of lightweight VLMs in enabling real-time cooperative planning on edge platforms and underscore the potential of language-guided contextual reasoning for improving traffic safety and responsiveness.

CVMar 7, 2025
GaussianCAD: Robust Self-Supervised CAD Reconstruction from Three Orthographic Views Using 3D Gaussian Splatting

Zheng Zhou, Zhe Li, Bo Yu et al.

The automatic reconstruction of 3D computer-aided design (CAD) models from CAD sketches has recently gained significant attention in the computer vision community. Most existing methods, however, rely on vector CAD sketches and 3D ground truth for supervision, which are often difficult to be obtained in industrial applications and are sensitive to noise inputs. We propose viewing CAD reconstruction as a specific instance of sparse-view 3D reconstruction to overcome these limitations. While this reformulation offers a promising perspective, existing 3D reconstruction methods typically require natural images and corresponding camera poses as inputs, which introduces two major significant challenges: (1) modality discrepancy between CAD sketches and natural images, and (2) difficulty of accurate camera pose estimation for CAD sketches. To solve these issues, we first transform the CAD sketches into representations resembling natural images and extract corresponding masks. Next, we manually calculate the camera poses for the orthographic views to ensure accurate alignment within the 3D coordinate system. Finally, we employ a customized sparse-view 3D reconstruction method to achieve high-quality reconstructions from aligned orthographic views. By leveraging raster CAD sketches for self-supervision, our approach eliminates the reliance on vector CAD sketches and 3D ground truth. Experiments on the Sub-Fusion360 dataset demonstrate that our proposed method significantly outperforms previous approaches in CAD reconstruction performance and exhibits strong robustness to noisy inputs.

LGSep 26, 2025
Context and Diversity Matter: The Emergence of In-Context Learning in World Models

Fan Wang, Zhiyuan Chen, Yuxuan Zhong et al.

The capability of predicting environmental dynamics underpins both biological neural systems and general embodied AI in adapting to their surroundings. Yet prevailing approaches rest on static world models that falter when confronted with novel or rare configurations. We investigate in-context environment learning (ICEL), shifting attention from zero-shot performance to the growth and asymptotic limits of the world model. Our contributions are three-fold: (1) we formalize in-context learning of a world model and identify two core mechanisms: environment recognition and environment learning; (2) we derive error upper-bounds for both mechanisms that expose how the mechanisms emerge; and (3) we empirically confirm that distinct ICL mechanisms exist in the world model, and we further investigate how data distribution and model architecture affect ICL in a manner consistent with theory. These findings demonstrate the potential of self-adapting world models and highlight the key factors behind the emergence of ICEL, most notably the necessity of long context and diverse environments.

CVSep 24, 2025
Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

Bo Yu, Jianhua Yang, Zetao Du et al.

Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

AISep 9, 2025
BDPM: A Machine Learning-Based Feature Extractor for Parkinson's Disease Classification via Gut Microbiota Analysis

Bo Yu, Zhixiu Hua, Bo Zhao

Background: Parkinson's disease remains a major neurodegenerative disorder with high misdiagnosis rates, primarily due to reliance on clinical rating scales. Recent studies have demonstrated a strong association between gut microbiota and Parkinson's disease, suggesting that microbial composition may serve as a promising biomarker. Although deep learning models based ongut microbiota show potential for early prediction, most approaches rely on single classifiers and often overlook inter-strain correlations or temporal dynamics. Therefore, there is an urgent need for more robust feature extraction methods tailored to microbiome data. Methods: We proposed BDPM (A Machine Learning-Based Feature Extractor for Parkinson's Disease Classification via Gut Microbiota Analysis). First, we collected gut microbiota profiles from 39 Parkinson's patients and their healthy spouses to identify differentially abundant taxa. Second, we developed an innovative feature selection framework named RFRE (Random Forest combined with Recursive Feature Elimination), integrating ecological knowledge to enhance biological interpretability. Finally, we designed a hybrid classification model to capture temporal and spatial patterns in microbiome data.

CVMar 17, 2025
L2HCount:Generalizing Crowd Counting from Low to High Crowd Density via Density Simulation

Guoliang Xu, Jianqin Yin, Ren Zhang et al.

Since COVID-19, crowd-counting tasks have gained wide applications. While supervised methods are reliable, annotation is more challenging in high-density scenes due to small head sizes and severe occlusion, whereas it's simpler in low-density scenes. Interestingly, can we train the model in low-density scenes and generalize it to high-density scenes? Therefore, we propose a low- to high-density generalization framework (L2HCount) that learns the pattern related to high-density scenes from low-density ones, enabling it to generalize well to high-density scenes. Specifically, we first introduce a High-Density Simulation Module and a Ground-Truth Generation Module to construct fake high-density images along with their corresponding ground-truth crowd annotations respectively by image-shifting technique, effectively simulating high-density crowd patterns. However, the simulated images have two issues: image blurring and loss of low-density image characteristics. Therefore, we second propose a Head Feature Enhancement Module to extract clear features in the simulated high-density scene. Third, we propose a Dual-Density Memory Encoding Module that uses two crowd memories to learn scene-specific patterns from low- and simulated high-density scenes, respectively. Extensive experiments on four challenging datasets have shown the promising performance of L2HCount.

ARSep 15, 2021
Dataflow Accelerator Architecture for Autonomous Machine Computing

Shaoshan Liu, Yuhao Zhu, Bo Yu et al.

Commercial autonomous machines is a thriving sector, one that is likely the next ubiquitous computing platform, after Personal Computers (PC), cloud computing, and mobile computing. Nevertheless, a suitable computing substrate for autonomous machines is missing, and many companies are forced to develop ad hoc computing solutions that are neither principled nor extensible. By analyzing the demands of autonomous machine computing, this article proposes Dataflow Accelerator Architecture (DAA), a modern instantiation of the classic dataflow principle, that matches the characteristics of autonomous machine software.

ARApr 11, 2021
iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo Matching on FPGA Platform

Tian Gao, Zishen Wan, Yuyang Zhang et al.

Stereo matching is a critical task for robot navigation and autonomous vehicles, providing the depth estimation of surroundings. Among all stereo matching algorithms, Efficient Large-scale Stereo (ELAS) offers one of the best tradeoffs between efficiency and accuracy. However, due to the inherent iterative process and unpredictable memory access pattern, ELAS can only run at 1.5-3 fps on high-end CPUs and difficult to achieve real-time performance on low-power platforms. In this paper, we propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform. Moreover, the original computational-intensive and irregular triangulation module is reformed in a regular manner with points interpolation, which is much more hardware-friendly. Optimizations, including memory management, parallelism, and pipelining, are further utilized to reduce memory footprint and improve throughput. Compared with Intel i7 CPU and the state-of-the-art CPU+FPGA implementation, our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.

ARApr 1, 2021
An Energy-Efficient Quad-Camera Visual System for Autonomous Machines on FPGA Platform

Zishen Wan, Yuyang Zhang, Arijit Raychowdhury et al.

In our past few years' of commercial deployment experiences, we identify localization as a critical task in autonomous machine applications, and a great acceleration target. In this paper, based on the observation that the visual frontend is a major performance and energy consumption bottleneck, we present our design and implementation of an energy-efficient hardware architecture for ORB (Oriented-Fast and Rotated- BRIEF) based localization system on FPGAs. To support our multi-sensor autonomous machine localization system, we present hardware synchronization, frame-multiplexing, and parallelization techniques, which are integrated in our design. Compared to Nvidia TX1 and Intel i7, our FPGA-based implementation achieves 5.6x and 3.4x speedup, as well as 3.0x and 34.6x power reduction, respectively.

ROMar 30, 2021
The Matter of Time -- A General and Efficient System for Precise Sensor Synchronization in Robotic Computing

Shaoshan Liu, Bo Yu, Yahui Liu et al.

Time synchronization is a critical task in robotic computing such as autonomous driving. In the past few years, as we developed advanced robotic applications, our synchronization system has evolved as well. In this paper, we first introduce the time synchronization problem and explain the challenges of time synchronization, especially in robotic workloads. Summarizing these challenges, we then present a general hardware synchronization system for robotic computing, which delivers high synchronization accuracy while maintaining low energy and resource consumption. The proposed hardware synchronization system is a key building block in our future robotic products.

ROMar 3, 2021
Towards Fully Intelligent Transportation through Infrastructure-Vehicle Cooperative Autonomous Driving: Challenges and Opportunities

Shaoshan Liu, Bo Yu, Jie Tang et al.

The infrastructure-vehicle cooperative autonomous driving approach depends on the cooperation between intelligent roads and intelligent vehicles. This approach is not only safer but also more economical compared to the traditional on-vehicle-only autonomous driving approach. In this paper, we introduce our real-world deployment experiences of cooperative autonomous driving, and delve into the details of new challenges and opportunities. Specifically, based on our progress towards commercial deployment, we follow a three-stage development roadmap of the cooperative autonomous driving approach:infrastructure-augmented autonomous driving (IAAD), infrastructure-guided autonomous driving (IGAD), and infrastructure-planned autonomous driving (IPAD).

RONov 12, 2020
On Designing Computing Systems for Autonomous Vehicles: a PerceptIn Case Study

Bo Yu, Jie Tang, Shaoshan Liu

PerceptIn develops and commercializes autonomous vehicles for micromobility around the globe. This paper makes a holistic summary of PerceptIn's development and operating experiences. This paper provides the business tale behind our product, and presents the development of the computing system for our vehicles. We illustrate the design decision made for the computing system, and show the advantage of offloading localization workloads onto an FPGA platform.

ROSep 13, 2020
A Survey of FPGA-Based Robotic Computing

Zishen Wan, Bo Yu, Thomas Yuang Li et al.

Recent researches on robotics have shown significant improvement, spanning from algorithms, mechanics to hardware architectures. Robotics, including manipulators, legged robots, drones, and autonomous vehicles, are now widely applied in diverse scenarios. However, the high computation and data complexity of robotic algorithms pose great challenges to its applications. On the one hand, CPU platform is flexible to handle multiple robotic tasks. GPU platform has higher computational capacities and easy-touse development frameworks, so they have been widely adopted in several applications. On the other hand, FPGA-based robotic accelerators are becoming increasingly competitive alternatives, especially in latency-critical and power-limited scenarios. With specialized designed hardware logic and algorithm kernels, FPGA-based accelerators can surpass CPU and GPU in performance and energy efficiency. In this paper, we give an overview of previous work on FPGA-based robotic accelerators covering different stages of the robotic system pipeline. An analysis of software and hardware optimization techniques and main technical issues is presented, along with some commercial and space applications, to serve as a guide for future work.

IVMay 7, 2019
PI-BA Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization

Shuzhen Qin, Qiang Liu, Bo Yu et al.

Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, street view map generation etc. Essentially, BA is a joint non-linear optimization problem, and one which can consume a significant amount of time and power, especially for large optimization problems. Previous approaches of optimizing BA performance heavily rely on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear on all images in a BA problem, we designed and implemented a Co-Observation Optimization technique to accelerate BA operations with optimized usage of memory and computation resources. Experimental results confirm that π-BA outperforms the existing software implementations in terms of performance and power consumption.

CLSep 10, 2018
Learning to Generate Structured Queries from Natural Language with Indirect Supervision

Ziwei Bai, Bo Yu, Bowen Wu et al.

Generating structured query language (SQL) from natural language is an emerging research topic. This paper presents a new learning paradigm from indirect supervision of the answers to natural language questions, instead of SQL queries. This paradigm facilitates the acquisition of training data due to the abundant resources of question-answer pairs for various domains in the Internet, and expels the difficult SQL annotation job. An end-to-end neural model integrating with reinforcement learning is proposed to learn SQL generation policy within the answer-driven learning paradigm. The model is evaluated on datasets of different domains, including movie and academic publication. Experimental results show that our model outperforms the baseline models.

ROFeb 23, 2018
PIRT: A Runtime Framework to Enable Energy-Efficient Real-Time Robotic Applications on Heterogeneous Architectures

Liu Liu, Shaoshan Liu, Zhe Zhang et al.

Enabling full robotic workloads with diverse behaviors on mobile systems with stringent resource and energy constraints remains a challenge. In recent years, attempts have been made to deploy single-accelerator-based computing platforms (such as GPU, DSP, or FPGA) to address this challenge, but with little success. The core problem is two-fold: firstly, different robotic tasks require different accelerators, and secondly, managing multiple accelerators simultaneously is overwhelming for developers. In this paper, we propose PIRT, the first robotic runtime framework to efficiently manage dynamic task executions on mobile systems with multiple accelerators as well as on the cloud to achieve better performance and energy savings. With PIRT, we enable a robot to simultaneously perform autonomous navigation with 25 FPS of localization, obstacle detection with 3 FPS, route planning, large map generation, and scene understanding, traveling at a max speed of 5 miles per hour, all within an 11W computing power envelope.

ROOct 18, 2017
FPGA-based ORB Feature Extraction for Real-Time Visual SLAM

Weikang Fang, Yanjun Zhang, Bo Yu et al.

Simultaneous Localization And Mapping (SLAM) is the problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it. How to enable SLAM robustly and durably on mobile, or even IoT grade devices, is the main challenge faced by the industry today. The main problems we need to address are: 1.) how to accelerate the SLAM pipeline to meet real-time requirements; and 2.) how to reduce SLAM energy consumption to extend battery life. After delving into the problem, we found out that feature extraction is indeed the bottleneck of performance and energy consumption. Hence, in this paper, we design, implement, and evaluate a hardware ORB feature extractor and prove that our design is a great balance between performance and energy consumption compared with ARM Krait and Intel Core i5.