h-index: 53
25 papers
405 citations
Novelty: 50%
AI Score: 56

25 Papers

CV Apr 20, 2022
Human-Object Interaction Detection via Disentangled Transformer

Desen Zhou, Zhichao Liu, Jian Wang et al.

Human-Object Interaction Detection tackles the problem of joint localization and classification of human-object interactions. Existing HOI transformers either adopt a single decoder for triplet prediction, or utilize two parallel decoders to detect individual objects and interactions separately, and compose triplets by a matching process. In contrast, we decouple the triplet prediction into human-object pair detection and interaction classification. Our main motivation is that detecting the human-object instances and classifying interactions accurately require learning representations that focus on different regions. To this end, we present Disentangled Transformer, where both encoder and decoder are disentangled to facilitate learning of the two sub-tasks. To associate the predictions of the disentangled decoders, we first generate a unified representation for HOI triplets with a base decoder, and then use it as the input feature of each disentangled decoder. Extensive experiments show that our method outperforms prior work on two public HOI benchmarks by a sizeable margin. Code will be available.

LG Jan 29, 2023
Time-Series Pattern Recognition in Smart Manufacturing Systems: A Literature Review and Ontology

Mojtaba A. Farahani, M. R. McCormick, Robert Gianinny et al.

Since the inception of Industry 4.0 in 2012, emerging technologies have enabled the acquisition of vast amounts of data from diverse sources such as machine tools, robust and affordable sensor systems with advanced information models, and other sources within Smart Manufacturing Systems (SMS). As a result, the amount of data that is available in manufacturing settings has exploded, allowing data-hungry tools such as Artificial Intelligence (AI) and Machine Learning (ML) to be leveraged. Time-series analytics has been successfully applied in a variety of industries, and that success is now being migrated to pattern recognition applications in manufacturing to support higher quality products, zero defect manufacturing, and improved customer satisfaction. However, the diverse landscape of manufacturing presents a challenge for successfully solving problems in industry using time-series pattern recognition. The resulting research gap of understanding and applying the subject matter of time-series pattern recognition in manufacturing is a major limiting factor for adoption in industry. The purpose of this paper is to provide a structured perspective of the current state of time-series pattern recognition in manufacturing with a problem-solving focus. By using an ontology to classify and define concepts, how they are structured, their properties, the relationships between them, and considerations when applying them, this paper aims to provide practical and actionable guidelines for application and recommendations for advancing time-series analytics.

CV Feb 25, 2023
Temporal Segment Transformer for Action Segmentation

Zhichao Liu, Leshan Wang, Desen Zhou et al.

Recognizing human actions from untrimmed videos is an important task in activity understanding, and poses unique challenges in modeling long-range temporal relations. Recent works adopt a predict-and-refine strategy which converts an initial prediction to action segments for global context modeling. However, the generated segment representations are often noisy and exhibit inaccurate segment boundaries, over-segmentation and other problems. To deal with these issues, we propose an attention-based approach, which we call temporal segment transformer, for joint segment relation modeling and denoising. The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments. The refined segment representations are used to predict action labels and adjust segment boundaries, and a final action segmentation is produced based on voting from segment masks. We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks. We also conduct extensive ablations to demonstrate the effectiveness of different components of our design.
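The final voting step mentioned in the abstract can be pictured with a small sketch. This is only an illustration under assumed conventions, not the paper's implementation: each segment is taken to be a hypothetical `(label, mask)` pair with one soft weight per frame, and each frame receives the label with the largest summed weight. The function name `vote_frames` and the `background` fallback are assumptions.

```python
from collections import defaultdict


def vote_frames(segments, num_frames, background="background"):
    """Give each frame the label with the highest summed mask weight.

    segments: list of (label, mask) pairs; mask holds one weight per frame.
    This layout is an assumption made for illustration only.
    """
    scores = [defaultdict(float) for _ in range(num_frames)]
    for label, mask in segments:
        # Ignore any mask entries beyond the video length.
        for t, weight in enumerate(mask[:num_frames]):
            scores[t][label] += weight
    labels = []
    for frame_scores in scores:
        if any(v > 0 for v in frame_scores.values()):
            labels.append(max(frame_scores, key=frame_scores.get))
        else:
            # No segment claims this frame.
            labels.append(background)
    return labels


example_segments = [
    ("cut", [0.9, 0.8, 0.3, 0.0]),
    ("mix", [0.0, 0.1, 0.6, 0.7]),
    ("cut", [0.0, 0.0, 0.2, 0.1]),
]
frame_labels = vote_frames(example_segments, num_frames=4)
```

Summing weights per label lets several overlapping segments of the same action reinforce each other, which is one simple way to resolve over-segmentation at the frame level.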

76.5 CV Mar 18
GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Angen Ye, Boyuan Wang, Chaojun Ni et al.

World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.

AR Jul 26, 2024 Code
ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model

Ning Xu, Zhaoyang Zhang, Lei Qi et al.

The field of integrated circuit (IC) design is highly specialized, presenting significant barriers to entry and research and development challenges. Although large language models (LLMs) have achieved remarkable success in various domains, existing LLMs often fail to meet the specific needs of students, engineers, and researchers. Consequently, the potential of LLMs in the IC design domain remains largely unexplored. To address these issues, we introduce ChipExpert, the first open-source, instructional LLM specifically tailored for the IC design field. ChipExpert is trained on one of the current best open-source base models (Llama-3 8B). The entire training process encompasses several key stages, including data preparation, continued pre-training, instruction-guided supervised fine-tuning, preference alignment, and evaluation. In the data preparation stage, we construct multiple high-quality custom datasets through manual selection and data synthesis techniques. In the subsequent two stages, ChipExpert acquires a vast amount of IC design knowledge and learns how to respond to user queries professionally. ChipExpert also undergoes an alignment phase, using Direct Preference Optimization, to achieve a high standard of ethical performance. Finally, to mitigate the hallucinations of ChipExpert, we have developed a Retrieval-Augmented Generation (RAG) system based on the IC design knowledge base. We also released the first IC design benchmark, ChipICD-Bench, to evaluate the capabilities of LLMs across multiple IC design sub-domains. Through comprehensive experiments conducted on this benchmark, ChipExpert demonstrated a high level of expertise in IC design knowledge Question-and-Answer tasks.

IV Apr 17, 2022
Automatic spinal curvature measurement on ultrasound spine images using Faster R-CNN

Zhichao Liu, Liyue Qian, Wenke Jing et al.

Ultrasound spine imaging has been applied to the assessment of spine deformity. However, manual measurements of scoliotic angles on ultrasound images are time-consuming and rely heavily on raters' experience. The objectives of this study are to construct a fully automatic framework based on Faster R-CNN for detecting vertebral lamina and to measure the fitted spinal curves from the detected lamina pairs. The framework consists of two closely linked modules: 1) the lamina detector, which identifies and locates each lamina pair on ultrasound coronal images, and 2) the spinal curvature estimator, which calculates the scoliotic angles based on the chain of detected lamina. Two hundred ultrasound images obtained from AIS patients were identified and used for the training and evaluation of the proposed method. The experimental results showed an AP of 0.76 on the test set, and the Mean Absolute Difference (MAD) between automatic and manual measurements was within the clinically acceptable error. Meanwhile, the correlation between the automatic measurement and the Cobb angle from radiographs was 0.79. The results revealed that the proposed technique could provide accurate and reliable automatic curvature measurements on ultrasound spine images for spine deformities.
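As a rough picture of how an angle might be read off a chain of detected lamina pairs, the sketch below takes hypothetical midpoints of each pair, ordered from top to bottom in image coordinates, and reports the largest difference in tilt between consecutive segments. This is an assumed simplification for illustration; the paper's curvature estimator and its curve-fitting step are not specified in the abstract, and the name `curvature_angle_deg` is invented here.

```python
import math


def curvature_angle_deg(midpoints):
    """Largest tilt difference (degrees) between consecutive midpoint segments.

    midpoints: list of (x, y) image coordinates ordered along the spine,
    with y increasing from one lamina pair to the next (an assumption).
    """
    tilts = []
    for (x0, y0), (x1, y1) in zip(midpoints, midpoints[1:]):
        # Tilt of the segment measured from the vertical (y) axis.
        tilts.append(math.degrees(math.atan2(x1 - x0, y1 - y0)))
    return max(tilts) - min(tilts)


straight_spine = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
bent_spine = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
straight_angle = curvature_angle_deg(straight_spine)
bent_angle = curvature_angle_deg(bent_spine)
```

A straight chain gives an angle of zero, while the bent example tilts 45 degrees one way and then 45 degrees the other, giving 90 degrees between its two most tilted segments.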

99.9 CV Mar 10
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Zongxia Li, Hongyang Du, Chengsong Huang et al.

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.

CV Feb 12
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Bohan Li et al.

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).

CL Feb 13, 2025 Code
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Wenbo Pan, Zhichao Liu, Qiguang Chen et al.

Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
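The idea of a behavior being represented by a linear direction in activation space can be made concrete with a small sketch. Assuming a hypothetical activation vector and a hypothetical direction vector (both plain lists here), the code splits the activation into its component along the direction and an orthogonal residual. This is a generic linear-algebra illustration, not the analysis pipeline from the paper, and `split_along_direction` is an invented name.

```python
import math


def split_along_direction(activation, direction):
    """Return (coefficient along the unit direction, orthogonal residual)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection: how strongly the activation expresses the direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # What remains after the directional component is subtracted.
    residual = [a - coeff * u for a, u in zip(activation, unit)]
    return coeff, residual


hidden = [3.0, 4.0, 0.0]          # hypothetical activation
dominant = [2.0, 0.0, 0.0]        # hypothetical direction
strength, remainder = split_along_direction(hidden, dominant)
```

The coefficient measures alignment with the chosen direction, and the residual is what other, orthogonal directions would have to explain.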

RO Sep 28, 2024
Language-guided Robust Navigation for Mobile Robots in Dynamically-changing Environments

Cody Simons, Zhichao Liu, Brandon Marcus et al.

In this paper, we develop an embodied AI system for human-in-the-loop navigation with a wheeled mobile robot. We propose a direct yet effective method of monitoring the robot's current plan to detect changes in the environment that significantly impact the intended trajectory of the robot and then query a human for feedback. We also develop a means to parse human feedback expressed in natural language into local navigation waypoints and integrate it into a global planning system, by leveraging a map of semantic features and an aligned obstacle map. Extensive testing in simulation and physical hardware experiments with a resource-constrained wheeled robot tasked to navigate in a real-world environment validates the efficacy and robustness of our method. This work can support applications like precision agriculture and construction, where persistent monitoring of the environment provides a human with information about the environment state.

87.0 CR May 8
WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

Zhichao Liu, Wenbo Pan, Haining Yu et al.

Browser agents are increasingly deployed in long-horizon tasks, which require executing extended action chains to accomplish user goals. However, this prolonged execution process provides attackers with more opportunities to inject malicious instructions. Existing prompt injection attacks against browser agents expose two key gaps: (1) low effectiveness, as attacks optimized for toy baselines fail to achieve end-to-end goals in real-world scenarios with complex environments and longer steps; (2) weak stealthiness, since most attacks pit the attack goal against the user goal, causing a significant drop in system usability under attack. To address these gaps, we propose WebTrap, a mid-task hijacking injection attack. It employs multi-step instruction fusion steering to seamlessly combine both goals, enabling the agent to resume the original user task after executing the attack goal. Furthermore, we design a context-grounded generation method to align the injected content with the task environment and system instructions, maximizing the hijacking success rate. Extensive experiments on two browser agent tasks, based on extended WASP and InjecAgent environments, demonstrate that our method achieves a high attack success rate while preserving the usability of the original system. We find that WebTrap exploits the agent's navigation vulnerabilities, binding the two goals so tightly that standard defense mechanisms cannot restore the system to normal operation. These findings reveal a critical vulnerability in agent systems during long-horizon tasks: they can be stealthily hijacked.

LG Feb 2
Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

Wenbo Pan, Zhichao Liu, Xianlong Wang et al.

Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(M*N) operations, making long-context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.

CV Dec 7, 2023
An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything

Israt Zarin Era, Imtiaz Ahmed, Zhichao Liu et al.

Foundation models are currently driving a paradigm shift in computer vision tasks for various fields including biology, astronomy, and robotics among others, leveraging user-generated prompts to enhance their performance. In the Laser Additive Manufacturing (LAM) domain, accurate image-based defect segmentation is imperative to ensure product quality and facilitate real-time process control. However, such tasks are often characterized by multiple challenges including the absence of labels and the requirement for low latency inference among others. Porosity is a very common defect in LAM due to lack of fusion, entrapped gas, and keyholes, directly affecting mechanical properties like tensile strength, stiffness, and hardness, thereby compromising the quality of the final product. To address these issues, we construct a framework for image segmentation using a state-of-the-art Vision Transformer (ViT) based Foundation model (Segment Anything Model) with a novel multi-point prompt generation scheme using unsupervised clustering. Utilizing our framework we perform porosity segmentation in a case study of laser-based powder bed fusion (L-PBF) and obtain high accuracy without using any labeled data to guide the prompt tuning process. By capitalizing on lightweight foundation model inference combined with unsupervised prompt generation, we envision constructing a real-time anomaly detection pipeline that could revolutionize current laser additive manufacturing processes, thereby facilitating the shift towards Industry 4.0 and promoting defect-free production along with operational efficiency.

RO Jul 27, 2025
Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots

Wei Cui, Haoyu Wang, Wenkang Qin et al.

Humanoid robot technology is advancing rapidly, with manufacturers introducing diverse heterogeneous visual perception modules tailored to specific scenarios. Among various perception paradigms, occupancy-based representation has become widely recognized as particularly suitable for humanoid robots, as it provides both rich semantic and 3D geometric information essential for comprehensive environmental understanding. In this work, we present Humanoid Occupancy, a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, and a dedicated annotation pipeline. Our framework employs advanced multi-modal fusion techniques to generate grid-based occupancy outputs encoding both occupancy status and semantic labels, thereby enabling holistic environmental understanding for downstream tasks such as task planning and navigation. To address the unique challenges of humanoid robots, we overcome issues such as kinematic interference and occlusion, and establish an effective sensor layout strategy. Furthermore, we have developed the first panoramic occupancy dataset specifically for humanoid robots, offering a valuable benchmark and resource for future research and development in this domain. The network architecture incorporates multi-modal feature fusion and temporal information integration to ensure robust perception. Overall, Humanoid Occupancy delivers effective environmental perception for humanoid robots and establishes a technical foundation for standardizing universal visual modules, paving the way for the widespread deployment of humanoid robots in complex real-world scenarios.

RO Oct 22, 2025
GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain Team, Angen Ye, Boyuan Wang et al.

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

CV Nov 18, 2024
In-Situ Melt Pool Characterization via Thermal Imaging for Defect Detection in Directed Energy Deposition Using Vision Transformers

Israt Zarin Era, Fan Zhou, Ahmed Shoyeb Raihan et al.

Directed Energy Deposition (DED) offers significant potential for manufacturing complex and multi-material parts. However, internal defects such as porosity and cracks can compromise mechanical properties and overall performance. This study focuses on in-situ monitoring and characterization of melt pools associated with porosity, aiming to improve defect detection and quality control in DED-printed parts. Traditional machine learning approaches for defect identification rely on extensive labeled datasets, often scarce and expensive to generate in real-world manufacturing. To address this, our framework employs self-supervised learning on unlabeled melt pool data using a Vision Transformer-based Masked Autoencoder (MAE) to produce highly representative embeddings. These fine-tuned embeddings are leveraged via transfer learning to train classifiers on a limited labeled dataset, enabling the effective identification of melt pool anomalies. We evaluate two classifiers: (1) a Vision Transformer (ViT) classifier utilizing the fine-tuned MAE Encoder's parameters and (2) the fine-tuned MAE Encoder combined with an MLP classifier head. Our framework achieves overall accuracy ranging from 95.44% to 99.17% and an average F1 score exceeding 80%, with the ViT Classifier slightly outperforming the MAE Encoder Classifier. This demonstrates the scalability and cost-effectiveness of our approach for automated quality control in DED, effectively detecting defects with minimal labeled data.

CV Nov 19, 2025
First Frame Is the Place to Go for Video Content Customization

Jingxi Chen, Zongxia Li, Zhichao Liu et al.

What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

CV Oct 25, 2025
Audio Frequency-Time Dual Domain Evaluation on Depression Diagnosis

Yu Luo, Nan Huang, Sophie Yu et al.

Depression, as a typical mental disorder, has become a prevalent issue significantly impacting public health. However, the prevention and treatment of depression still face multiple challenges, including complex diagnostic procedures, ambiguous criteria, and low consultation rates, which severely hinder timely assessment and intervention. To address these issues, this study adopts voice as a physiological signal and leverages its frequency-time dual domain multimodal characteristics along with deep learning models to develop an intelligent assessment and diagnostic algorithm for depression. Experimental results demonstrate that the proposed method achieves excellent performance in the classification task for depression diagnosis, offering new insights and approaches for the assessment, screening, and diagnosis of depression.

LG Mar 27, 2025
Confidence Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART): A Data-driven Active Learning Framework for Accelerating Material Discovery under Resource Constraints

Ahmed Shoyeb Raihan, Zhichao Liu, Tanveer Hossain Bhuiyan et al.

Accelerating the discovery and manufacturing of advanced materials with specific properties is a critical yet formidable challenge due to the vast search space, the high cost of experiments, and the time-intensive nature of material characterization. In recent years, active learning, where a surrogate machine learning (ML) model mimics the scientific discovery process of a human scientist, has emerged as a promising approach to address these challenges by guiding experimentation toward high-value outcomes with a limited budget. Among the diverse active learning philosophies, the concept of surprise (capturing the divergence between expected and observed outcomes) has demonstrated significant potential to drive experimental trials and refine predictive models. Scientific discovery often stems from surprise, making it a natural driver to guide the search process. Despite its promise, prior studies leveraging surprise metrics such as Shannon and Bayesian surprise lack mechanisms to account for prior confidence, leading to excessive exploration of uncertain regions that may not yield useful information. To address this, we propose the Confidence-Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART), a novel Bayesian active learning framework tailored for optimizing data-driven experimentation. At a high level, CA-SMART incorporates Confidence-Adjusted Surprise (CAS) to dynamically balance exploration and exploitation by amplifying surprises in regions where the model is more certain while discounting them in highly uncertain areas. We evaluated CA-SMART on two benchmark functions (Six-Hump Camelback and Griewank) and in predicting the fatigue strength of steel. The results demonstrate superior accuracy and efficiency compared to traditional surprise metrics, standard Bayesian Optimization (BO) acquisition functions, and conventional ML methods.

RO Aug 4, 2021
Deformation Recovery Control and Post-Impact Trajectory Replanning for Collision-Resilient Mobile Robots

Zhouyu Lu, Zhichao Liu, Konstantinos Karydis

The paper focuses on collision-inclusive motion planning for impact-resilient mobile robots. We propose a new deformation recovery and replanning strategy to handle collisions that may occur at run-time. Contrary to collision avoidance methods that generate trajectories only in conservative local space or require collision checking that has high computational cost, our method directly generates (local) trajectories while imposing only waypoint constraints. If a collision occurs, our method then estimates the post-impact state and computes from there an intermediate waypoint to recover from the collision. To achieve this, we develop two novel components: 1) a deformation recovery controller that optimizes the robot's states during the post-impact recovery phase, and 2) a post-impact trajectory replanner that uses information from the collision to adjust the next waypoint for the robot to pass through and generates a polynomial-based minimum-effort trajectory. The proposed strategy is evaluated experimentally with an omni-directional impact-resilient wheeled robot. The robot is designed in house, and it can perceive collisions with the aid of Hall effect sensors embedded between the robot's main chassis and a surrounding deflection ring-like structure.

RO Aug 3, 2021
Position Control and Variable-Height Trajectory Tracking of a Soft Pneumatic Legged Robot

Zhichao Liu, Konstantinos Karydis

Soft pneumatic legged robots show promise in their ability to traverse a range of different types of terrain, including natural unstructured terrain met in applications like precision agriculture. They can adapt their body morphology to the intricacies of the terrain at hand, thus enabling robust and resilient locomotion. In this paper, we capitalize upon recent developments in soft pneumatic legged robots to introduce a closed-loop trajectory tracking control scheme for operation over flat ground. Closed-loop pneumatic actuation feedback is achieved via a compact and portable pneumatic regulation board. Experimental results reveal that our soft legged robot can precisely control its body height and orientation while in quasi-static operation based on a geometric model. The robot can track both straight-line and curved trajectories as well as variable-height trajectories. This work lays the basis to enable autonomous navigation for soft legged robots.

RONov 3, 2020
Toward Impact-resilient Quadrotor Design, Collision Characterization and Recovery Control to Sustain Flight after Collisions

Zhichao Liu, Konstantinos Karydis

Collision detection and recovery for aerial robots remain a challenge because of the limited space for sensors and local stability of the flight controller. We introduce a novel collision-resilient quadrotor that features a compliant arm design to enable free flight while allowing for one passive degree of freedom to absorb shocks. We further propose a novel collision detection and characterization method based on Hall sensors, as well as a new recovery control method to generate and track a smooth trajectory after a collision occurs. Experimental results demonstrate that the robot can detect and recover from high-speed collisions with various obstacles such as walls and poles. Moreover, it can survive collisions that are hard to detect with existing methods based on IMU data and contact models, for example, when colliding with unstructured surfaces, or being hit by a moving obstacle while hovering.

ROSep 4, 2020
Motion Planning for Collision-resilient Mobile Robots in Obstacle-cluttered Unknown Environments with Risk Reward Trade-offs

Zhouyu Lu, Zhichao Liu, Gustavo J. Correa et al.

Collision avoidance in unknown obstacle-cluttered environments may not always be feasible. This paper focuses on an emerging paradigm shift in which potential collisions with the environment can be harnessed instead of being avoided altogether. To this end, we introduce a new sampling-based online planning algorithm that can explicitly handle the risk of colliding with the environment and can switch between collision avoidance and collision exploitation. Central to the planner's capabilities is a novel joint optimization function that evaluates the effect of possible collisions using a reflection model. This way, the planner can make deliberate decisions to collide with the environment if such collision is expected to help the robot make progress toward its goal. To make the algorithm online, we present a state expansion pruning technique that significantly reduces the search space while ensuring completeness. The proposed algorithm is evaluated experimentally with a built-in-house holonomic wheeled robot that can withstand collisions. We perform an extensive parametric study to investigate trade-offs between (user-tuned) levels of risk, deliberate collision decision making, and trajectory statistics such as time to reach the goal and path length.
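The abstract does not specify the form of its reflection model. A common minimal version, shown here purely as an assumed sketch, mirrors the velocity component along the surface normal and scales it by a coefficient of restitution. The function name `reflect_velocity` and the `restitution` parameter are illustrative, not taken from the paper.

```python
import math


def reflect_velocity(v, n, restitution=1.0):
    """Return the planar velocity after impact with a surface of normal n.

    Assumed model (not from the abstract): v' = v - (1 + e) (v . n) n, with e
    the coefficient of restitution; e = 1 is a perfect mirror reflection.
    """
    norm = math.hypot(n[0], n[1])
    if norm == 0:
        raise ValueError("surface normal must be non-zero")
    nx, ny = n[0] / norm, n[1] / norm
    vn = v[0] * nx + v[1] * ny  # velocity component along the normal
    if vn >= 0:
        # Already moving away from the surface: no impact, velocity unchanged.
        return (float(v[0]), float(v[1]))
    return (v[0] - (1 + restitution) * vn * nx,
            v[1] - (1 + restitution) * vn * ny)


if __name__ == "__main__":
    # Robot heading down and to the right hits a floor whose normal points up.
    print(reflect_velocity((1.0, -1.0), (0.0, 1.0)))
    print(reflect_velocity((1.0, -1.0), (0.0, 1.0), restitution=0.5))
```

A planner could use such a prediction to score whether a deliberate collision moves the robot closer to its goal. How the paper's joint optimization function weighs that outcome is not described here.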

CVMay 2, 2019
DS-VIO: Robust and Efficient Stereo Visual Inertial Odometry based on Dual Stage EKF

Xiaogang Xiong, Wenqing Chen, Zhichao Liu et al.

This paper presents a dual stage EKF (Extended Kalman Filter)-based algorithm for real-time and robust stereo VIO (visual inertial odometry). The first stage of this EKF-based algorithm performs the fusion of accelerometer and gyroscope while the second performs the fusion of stereo camera and IMU. Due to the sufficient complementary characteristics between accelerometer and gyroscope as well as stereo camera and IMU, the dual stage EKF-based algorithm can achieve a high precision of odometry estimations. At the same time, because of the low dimension of the state vector in this algorithm, its computational efficiency is comparable to previous filter-based approaches. We call our approach DS-VIO (dual stage EKF-based stereo visual inertial odometry) and evaluate our DS-VIO algorithm by comparing it with state-of-the-art approaches including OKVIS, ROVIO, VINS-MONO and S-MSCKF on the EuRoC dataset. Results show that our algorithm can achieve comparable or even better performance in terms of the RMS error.
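To give a feel for why accelerometer and gyroscope readings complement each other, the sketch below fuses them with a scalar linear Kalman filter on a single tilt angle. This is a deliberately simplified stand-in, not the paper's first-stage EKF: the state, the noise values `q` and `r`, and the function name `fuse_tilt` are all assumptions made for illustration.

```python
def fuse_tilt(gyro_rates, accel_angles, dt, q=1e-4, r=1e-2):
    """Fuse gyro rates (rad/s) and accelerometer tilt angles (rad) into one estimate.

    Simplified scalar Kalman filter (an assumption, not the DS-VIO filter):
    the gyro drives the prediction, the accelerometer-derived angle corrects it.
    q is the process noise variance, r the measurement noise variance.
    """
    if not accel_angles:
        return []
    angle = accel_angles[0]  # initialise from the first accelerometer reading
    var = r                  # initial uncertainty equals measurement noise
    estimates = []
    for rate, measured in zip(gyro_rates, accel_angles):
        # Predict: integrate the gyro rate and grow the uncertainty.
        angle += rate * dt
        var += q
        # Update: blend in the accelerometer angle using the Kalman gain.
        gain = var / (var + r)
        angle += gain * (measured - angle)
        var *= 1.0 - gain
        estimates.append(angle)
    return estimates


if __name__ == "__main__":
    rates = [0.0] * 5
    angles = [0.0, 0.1, 0.1, 0.1, 0.1]
    print(fuse_tilt(rates, angles, dt=0.01))
```

The gyro provides smooth short-term motion while the accelerometer bounds long-term drift. A full EKF would track 3D orientation and sensor biases with linearised models.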

CVDec 16, 2017
SRPGAN: Perceptual Generative Adversarial Network for Single Image Super Resolution

Bingzhe Wu, Haodong Duan, Zhichao Liu et al.

Single image super resolution (SISR) aims to reconstruct a high resolution image from a single low resolution image. The SISR task has been a very attractive research topic over the last two decades. In recent years, convolutional neural network (CNN) based models have achieved great performance on the SISR task. Despite the breakthroughs achieved by using CNN models, some problems remain unsolved, such as how to recover high frequency details of high resolution images. Previous CNN based models always use a pixel-wise loss, such as l2 loss. Although the high resolution images constructed by these models have high peak signal-to-noise ratio (PSNR), they tend to be blurry and lack high-frequency details, especially at a large scaling factor. In this paper, we build a super resolution perceptual generative adversarial network (SRPGAN) framework for SISR tasks. In the framework, we propose a robust perceptual loss based on the discriminator of the built SRPGAN model. We use the Charbonnier loss function to build the content loss and combine it with the proposed perceptual loss and the adversarial loss. Compared with other state-of-the-art methods, our method has demonstrated great ability to construct images with sharp edges and rich details. We also evaluate our method on different benchmarks and compare it with previous CNN based methods. The results show that our method can achieve much higher structural similarity index (SSIM) scores on most of the benchmarks than the previous state-of-the-art methods.
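For readers unfamiliar with the Charbonnier loss mentioned above, the sketch below computes its usual form, the mean of sqrt(d^2 + eps^2) over pixel differences d, next to a plain l2 loss. The `eps` default and the flat-list image representation are illustrative assumptions. The abstract does not give the constant or the exact reduction the authors use.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over paired pixel values.

    eps is an assumed smoothing constant; it keeps the penalty differentiable
    at d = 0 while behaving like an l1 loss for large differences.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def l2_loss(pred, target):
    """Mean squared error, shown for comparison with the Charbonnier penalty."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    target = [0.0, 0.0, 0.0, 0.0]
    pred = [0.1, 0.1, 0.1, 2.0]  # three small errors and one outlier pixel
    print("charbonnier:", charbonnier_loss(pred, target))
    print("l2:         ", l2_loss(pred, target))
```

Because the Charbonnier penalty grows linearly rather than quadratically for large differences, a single outlier pixel influences it less than it influences the l2 loss.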