Yuliang Gu

CV
h-index54
15papers
67citations
Novelty57%
AI Score53

15 Papers

LGSep 27, 2022
Phy-Taylor: Physics-Model-Based Deep Neural Networks

Yanbing Mao, Lui Sha, Huajie Shao et al.

Purely data-driven deep neural networks (DNNs) applied to physical engineering systems can infer relations that violate physics laws, thus leading to unexpected consequences. To address this challenge, we propose a physics-model-based DNN framework, called Phy-Taylor, that accelerates learning compliant representations with physical knowledge. The Phy-Taylor framework makes two key contributions; it introduces a new architectural Physics-compatible neural network (PhN), and features a novel compliance mechanism, we call {\em Physics-guided Neural Network Editing\}. The PhN aims to directly capture nonlinearities inspired by physical quantities, such as kinetic energy, potential energy, electrical power, and aerodynamic drag force. To do so, the PhN augments neural network layers with two key components: (i) monomials of Taylor series expansion of nonlinear functions capturing physical knowledge, and (ii) a suppressor for mitigating the influence of noise. The neural-network editing mechanism further modifies network links and activation functions consistently with physical knowledge. As an extension, we also propose a self-correcting Phy-Taylor framework that introduces two additional capabilities: (i) physics-model-based safety relationship learning, and (ii) automatic output correction when violations of safety occur. Through experiments, we show that (by expressing hard-to-learn nonlinearities directly and by constraining dependencies) Phy-Taylor features considerably fewer parameters, and a remarkably accelerated training process, while offering enhanced model robustness and accuracy.

CVSep 19, 2024Code
Prompting Segment Anything Model with Domain-Adaptive Prototype for Generalizable Medical Image Segmentation

Zhikai Wei, Wenhui Dong, Peilin Zhou et al.

Deep learning based methods often suffer from performance degradation caused by domain shift. In recent years, many sophisticated network structures have been designed to tackle this problem. However, the advent of large model trained on massive data, with its exceptional segmentation capability, introduces a new perspective for solving medical segmentation problems. In this paper, we propose a novel Domain-Adaptive Prompt framework for fine-tuning the Segment Anything Model (termed as DAPSAM) to address single-source domain generalization (SDG) in segmenting medical images. DAPSAM not only utilizes a more generalization-friendly adapter to fine-tune the large model, but also introduces a self-learning prototype-based prompt generator to enhance model's generalization ability. Specifically, we first merge the important low-level features into intermediate features before feeding to each adapter, followed by an attention filter to remove redundant information. This yields more robust image embeddings. Then, we propose using a learnable memory bank to construct domain-adaptive prototypes for prompt generation, helping to achieve generalizable medical image segmentation. Extensive experimental results demonstrate that our DAPSAM achieves state-of-the-art performance on two SDG medical image segmentation tasks with different modalities. The code is available at https://github.com/wkklavis/DAPSAM.

CVSep 2, 2024Code
Progressive Retinal Image Registration via Global and Local Deformable Transformations

Yepeng Liu, Baosheng Yu, Tian Chen et al.

Retinal image registration plays an important role in the ophthalmological diagnosis process. Since there exist variances in viewing angles and anatomical structures across different retinal images, keypoint-based approaches become the mainstream methods for retinal image registration thanks to their robustness and low latency. These methods typically assume the retinal surfaces are planar, and adopt feature matching to obtain the homography matrix that represents the global transformation between images. Yet, such a planar hypothesis inevitably introduces registration errors since retinal surface is approximately curved. This limitation is more prominent when registering image pairs with significant differences in viewing angles. To address this problem, we propose a hybrid registration framework called HybridRetina, which progressively registers retinal images with global and local deformable transformations. For that, we use a keypoint detector and a deformation network called GAMorph to estimate the global transformation and local deformable transformation, respectively. Specifically, we integrate multi-level pixel relation knowledge to guide the training of GAMorph. Additionally, we utilize an edge attention module that includes the geometric priors of the images, ensuring the deformation field focuses more on the vascular regions of clinical interest. Experiments on two widely-used datasets, FIRE and FLoRI21, show that our proposed HybridRetina significantly outperforms some state-of-the-art methods. The code is available at https://github.com/lyp-deeplearning/awesome-retinal-registration.

CVMay 20, 2024Code
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays

Zhichao Sun, Yuliang Gu, Yepeng Liu et al.

Anomaly detection in chest X-rays is a critical task. Most methods mainly model the distribution of normal images, and then regard significant deviation from normal distribution as anomaly. Recently, CLIP-based methods, pre-trained on a large number of medical images, have shown impressive performance on zero/few-shot downstream tasks. In this paper, we aim to explore the potential of CLIP-based methods for anomaly detection in chest X-rays. Considering the discrepancy between the CLIP pre-training data and the task-specific data, we propose a position-guided prompt learning method. Specifically, inspired by the fact that experts diagnose chest X-rays by carefully examining distinct lung regions, we propose learnable position-guided text and image prompts to adapt the task data to the frozen pre-trained CLIP-based model. To enhance the model's discriminative capability, we propose a novel structure-preserving anomaly synthesis method within chest x-rays during the training process. Extensive experiments on three datasets demonstrate that our proposed method outperforms some state-of-the-art methods. The code of our implementation is available at https://github.com/sunzc-sunny/PPAD.

IVMar 18, 2024Code
WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising

Haoyu Zhao, Yuliang Gu, Zhou Zhao et al.

In clinical examinations and diagnoses, low-dose computed tomography (LDCT) is crucial for minimizing health risks compared with normal-dose computed tomography (NDCT). However, reducing the radiation dose compromises the signal-to-noise ratio, leading to degraded quality of CT images. To address this, we analyze LDCT denoising task based on experimental results from the frequency perspective, and then introduce a novel self-supervised CT image denoising method called WIA-LD2ND, only using NDCT data. The proposed WIA-LD2ND comprises two modules: Wavelet-based Image Alignment (WIA) and Frequency-Aware Multi-scale Loss (FAM). First, WIA is introduced to align NDCT with LDCT by mainly adding noise to the high-frequency components, which is the main difference between LDCT and NDCT. Second, to better capture high-frequency components and detailed information, Frequency-Aware Multi-scale Loss (FAM) is proposed by effectively utilizing multi-scale feature space. Extensive experiments on two public LDCT denoising datasets demonstrate that our WIA-LD2ND, only uses NDCT, outperforms existing several state-of-the-art weakly-supervised and self-supervised methods. Source code is available at https://github.com/zhaohaoyu376/WI-LD2ND.

CVFeb 1, 2025Code
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes

Zhichao Sun, Yepeng Liu, Huachao Zhu et al.

Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code are be publicly at https://github.com/sunzc-sunny/refdrone.

CVOct 18, 2024Code
Shape Transformation Driven by Active Contour for Class-Imbalanced Semi-Supervised Medical Image Segmentation

Yuliang Gu, Yepeng Liu, Zhichao Sun et al.

Annotating 3D medical images demands expert knowledge and is time-consuming. As a result, semi-supervised learning (SSL) approaches have gained significant interest in 3D medical image segmentation. The significant size differences among various organs in the human body lead to imbalanced class distribution, which is a major challenge in the real-world application of these SSL approaches. To address this issue, we develop a novel Shape Transformation driven by Active Contour (STAC), that enlarges smaller organs to alleviate imbalanced class distribution across different organs. Inspired by curve evolution theory in active contour methods, STAC employs a signed distance function (SDF) as the level set function, to implicitly represent the shape of organs, and deforms voxels in the direction of the steepest descent of SDF (i.e., the normal vector). To ensure that the voxels far from expansion organs remain unchanged, we design an SDF-based weight function to control the degree of deformation for each voxel. We then use STAC as a data-augmentation process during the training stage. Experimental results on two benchmark datasets demonstrate that the proposed method significantly outperforms some state-of-the-art methods. Source code is publicly available at https://github.com/GuGuLL123/STAC.

CVJan 23, 2025Code
Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation

Yuliang Gu, Weilun Tsao, Bo Du et al.

Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: https://github.com/Lunn88/TAK-Semi.

CVFeb 24
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang et al.

Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the \textbf{Tra}ck-\textbf{q}uality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.

ROJan 27, 2024
Proto-MPC: An Encoder-Prototype-Decoder Approach for Quadrotor Control in Challenging Winds

Yuliang Gu, Sheng Cheng, Naira Hovakimyan

Quadrotors are increasingly used in the evolving field of aerial robotics for their agility and mechanical simplicity. However, inherent uncertainties, such as aerodynamic effects coupled with quadrotors' operation in dynamically changing environments, pose significant challenges for traditional, nominal model-based control designs. We propose a multi-task meta-learning method called Encoder-Prototype-Decoder (EPD), which has the advantage of effectively balancing shared and distinctive representations across diverse training tasks. Subsequently, we integrate the EPD model into a model predictive control problem (Proto-MPC) to enhance the quadrotor's ability to adapt and operate across a spectrum of dynamically changing tasks with an efficient online implementation. We validate the proposed method in simulations, which demonstrates Proto-MPC's robust performance in trajectory tracking of a quadrotor being subject to static and spatially varying side winds.

LGJul 25, 2025
Observations Meet Actions: Learning Control-Sufficient Representations for Robust Policy Generalization

Yuliang Gu, Hongpeng Cao, Marco Caccamo et al.

Capturing latent variations ("contexts") is key to deploying reinforcement-learning (RL) agents beyond their training regime. We recast context-based RL as a dual inference-control problem and formally characterize two properties and their hierarchy: observation sufficiency (preserving all predictive information) and control sufficiency (retaining decision-making relevant information). Exploiting this dichotomy, we derive a contextual evidence lower bound(ELBO)-style objective that cleanly separates representation learning from policy learning and optimizes it with Bottlenecked Contextual Policy Optimization (BCPO), an algorithm that places a variational information-bottleneck encoder in front of any off-policy policy learner. On standard continuous-control benchmarks with shifting physical parameters, BCPO matches or surpasses other baselines while using fewer samples and retaining performance far outside the training regime. The framework unifies theory, diagnostics, and practice for context-based RL.

CVJul 21, 2025
Coarse-to-fine crack cue for robust crack detection

Zelong Liu, Yuliang Gu, Zhichao Sun et al.

Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle in generalizing to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies on leveraging the thin structure property to generate a robust crack cue, guiding the crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadow, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.

LGJun 2, 2025
Bregman Centroid Guided Cross-Entropy Method

Yuliang Gu, Hongpeng Cao, Marco Caccamo et al.

The Cross-Entropy Method (CEM) is a widely adopted trajectory optimizer in model-based reinforcement learning (MBRL), but its unimodal sampling strategy often leads to premature convergence in multimodal landscapes. In this work, we propose Bregman Centroid Guided CEM ($\mathcal{BC}$-EvoCEM), a lightweight enhancement to ensemble CEM that leverages $\textit{Bregman centroids}$ for principled information aggregation and diversity control. $\textbf{$\mathcal{BC}$-EvoCEM}$ computes a performance-weighted Bregman centroid across CEM workers and updates the least contributing ones by sampling within a trust region around the centroid. Leveraging the duality between Bregman divergences and exponential family distributions, we show that $\textbf{$\mathcal{BC}$-EvoCEM}$ integrates seamlessly into standard CEM pipelines with negligible overhead. Empirical results on synthetic benchmarks, a cluttered navigation task, and full MBRL pipelines demonstrate that $\textbf{$\mathcal{BC}$-EvoCEM}$ enhances both convergence and solution quality, providing a simple yet effective upgrade for CEM.

CVMar 16, 2025
Consistent-Point: Consistent Pseudo-Points for Semi-Supervised Crowd Counting and Localization

Yuda Zou, Zelong Liu, Yuliang Gu et al.

Crowd counting and localization are important in applications such as public security and traffic management. Existing methods have achieved impressive results thanks to extensive laborious annotations. This paper propose a novel point-localization-based semi-supervised crowd counting and localization method termed Consistent-Point. We identify and address two inconsistencies of pseudo-points, which have not been adequately explored. To enhance their position consistency, we aggregate the positions of neighboring auxiliary proposal-points. Additionally, an instance-wise uncertainty calibration is proposed to improve the class consistency of pseudo-points. By generating more consistent pseudo-points, Consistent-Point provides more stable supervision to the training process, yielding improved results. Extensive experiments across five widely used datasets and three different labeled ratio settings demonstrate that our method achieves state-of-the-art performance in crowd localization while also attaining impressive crowd counting results. The code will be available.

CVDec 12, 2023
Dual Structure-Aware Image Filterings for Semi-supervised Medical Image Segmentation

Yuliang Gu, Zhichao Sun, Tian Chen et al.

Semi-supervised image segmentation has attracted great attention recently. The key is how to leverage unlabeled images in the training process. Most methods maintain consistent predictions of the unlabeled images under variations (e.g., adding noise/perturbations, or creating alternative versions) in the image and/or model level. In most image-level variation, medical images often have prior structure information, which has not been well explored. In this paper, we propose novel dual structure-aware image filterings (DSAIF) as the image-level variations for semi-supervised medical image segmentation. Motivated by connected filtering that simplifies image via filtering in structure-aware tree-based image representation, we resort to the dual contrast invariant Max-tree and Min-tree representation. Specifically, we propose a novel connected filtering that removes topologically equivalent nodes (i.e. connected components) having no siblings in the Max/Min-tree. This results in two filtered images preserving topologically critical structure. Applying the proposed DSAIF to mutually supervised networks decreases the consensus of their erroneous predictions on unlabeled images. This helps to alleviate the confirmation bias issue of overfitting to noisy pseudo labels of unlabeled images, and thus effectively improves the segmentation performance. Extensive experimental results on three benchmark datasets demonstrate that the proposed method significantly/consistently outperforms some state-of-the-art methods. The source codes will be publicly available.