Yafeng Li

CV
h-index14
6papers
212citations
Novelty57%
AI Score54

6 Papers

CVJul 3, 2022Code
Can Language Understand Depth?

Renrui Zhang, Ziyao Zeng, Ziyu Guo et al.

Besides image classification, Contrastive Language-Image Pre-training (CLIP) has accomplished extraordinary success for a wide range of vision tasks, including object-level and 3D space understanding. However, it's still challenging to transfer semantic knowledge learned from CLIP into more intricate tasks of quantified targets, such as depth estimation with geometric information. In this paper, we propose to apply CLIP for zero-shot monocular depth estimation, named DepthCLIP. We found that the patches of the input image could respond to a certain semantic distance token and then be projected to a quantified depth bin for coarse estimation. Without any training, our DepthCLIP surpasses existing unsupervised methods and even approaches the early fully-supervised networks. To our best knowledge, we are the first to conduct zero-shot adaptation from the semantic language knowledge to quantified downstream tasks and perform zero-shot monocular depth estimation. We hope our work could cast a light on future research. The code is available at https://github.com/Adonis-galaxy/DepthCLIP.

SPJun 15, 2023
MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals

Donghong Cai, Junru Chen, Yang Yang et al.

Brain signals are important quantitative data for understanding physiological activities and diseases of human brain. Most existing studies pay attention to supervised learning methods, which, however, require high-cost clinical labels. In addition, the huge difference in the clinical patterns of brain signals measured by invasive (e.g., SEEG) and non-invasive (e.g., EEG) methods leads to the lack of a unified method. To handle the above issues, we propose to study the self-supervised learning (SSL) framework for brain signals that can be applied to pre-train either SEEG or EEG data. Intuitively, brain signals, generated by the firing of neurons, are transmitted among different connecting structures in human brain. Inspired by this, we propose MBrain to learn implicit spatial and temporal correlations between different channels (i.e., contacts of the electrode, corresponding to different brain areas) as the cornerstone for uniformly modeling different types of brain signals. Specifically, we represent the spatial correlation by a graph structure, which is built with proposed multi-channel CPC. We theoretically prove that optimizing the goal of multi-channel CPC can lead to a better predictive representation and apply the instantaneou-time-shift prediction task based on it. Then we capture the temporal correlation by designing the delayed-time-shift prediction task. Finally, replace-discriminative-learning task is proposed to preserve the characteristics of each channel. Extensive experiments of seizure detection on both EEG and SEEG large-scale real-world datasets demonstrate that our model outperforms several state-of-the-art time series SSL and unsupervised models, and has the ability to be deployed to clinical practice.

78.0SYMay 20
Fully Actuated Manifold Constraint Based Output Feedback Control for Input-Constrained Uncertain Nonlinear Systems

Dianrui Mu, Changchun Hua, Yafeng Li et al.

This paper presents a low-complexity, model-free, output-feedback controller for a class of unknown time-varying nonlinear systems with unknown input constraints. The controller achieves the preset control accuracy when the actuator is not saturated and maintains flexible control accuracy after actuator saturation. This result extends existing constraint control methods for linear manifolds to a more general form, including the construction of nonlinear manifolds and various types of constraints, thereby achieving preset control accuracy within finite or fixed time. Additionally, flexible control under unknown saturation is achieved through the construction of an error-driven flexible constraint. Finally, second-order and higher-order control examples and simulations are provided.

CVMar 14, 2025Code
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Hongbin Lin, Zilu Guo, Yifan Zhang et al.

In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at https://github.com/Hongbin98/DriveGEN.

CVNov 24, 2025Code
DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving

Hongbin Lin, Yiming Yang, Chaoda Zheng et al.

In autonomous driving, vision-centric 3D object detection recognizes and localizes 3D objects from RGB images. However, due to high annotation costs and diverse outdoor scenes, training data often fails to cover all possible test scenarios, known as the out-of-distribution (OOD) issue. Training-free image editing offers a promising solution for improving model robustness by training data enhancement without any modifications to pre-trained diffusion models. Nevertheless, inversion-based methods often suffer from limited effectiveness and inherent inaccuracies, while recent rectified-flow-based approaches struggle to preserve objects with accurate 3D geometry. In this paper, we propose DriveFlow, a Rectified Flow Adaptation method for training data enhancement in autonomous driving based on pre-trained Text-to-Image flow models. Based on frequency decomposition, DriveFlow introduces two strategies to adapt noise-free editing paths derived from text-conditioned velocities. 1) High-Frequency Foreground Preservation: DriveFlow incorporates a high-frequency alignment loss for foreground to maintain precise 3D object geometry. 2) Dual-Frequency Background Optimization: DriveFlow also conducts dual-frequency optimization for background, balancing editing flexibility and semantic consistency. Comprehensive experiments validate the effectiveness and efficiency of DriveFlow, demonstrating comprehensive performance improvements on all categories across OOD scenarios. Code is available at https://github.com/Hongbin98/DriveFlow.

CVDec 4, 2021
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Longtian Qiu, Renrui Zhang, Ziyu Guo et al.

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transferring performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide textual features of different categories to adaptively explore informative regions on the image and aggregate visual features by attention mechanisms. In this way, the texts become visual-guided, namely, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.