Yu Hen Hu

CV
h-index: 19
Papers: 14
Citations: 64
Novelty: 43%
AI Score: 35

14 Papers

CV · Jul 22, 2023 · Code
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?

Cheng-En Wu, Yu Tian, Haichao Yu et al.

Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such a prompt tuning process is highly robust to label noise. This motivates us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conduct extensive experiments to explore this property and find that the key factors are: 1) the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2) the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. The code is available at https://github.com/CEWu/PTNL.

CV · Feb 2, 2023
An optimization method for out-of-distribution anomaly detection models

Ji Qiu, Hongmei Shi, Yu Hen Hu et al.

Frequent false alarms impede the adoption of unsupervised anomaly detection algorithms in industrial applications. Potential characteristics of false alarms depending on the trained detector are revealed by investigating probability density distributions of prediction scores in the out-of-distribution anomaly detection tasks. An SVM-based classifier is exploited as a post-processing module to identify false alarms from the anomaly map at the object level. Besides, a sample synthesis strategy is devised to incorporate fuzzy prior knowledge on the specific application in the anomaly-free training dataset. Experimental results illustrate that the proposed method comprehensively improves the performance of two segmentation models at both image and pixel levels on two industrial applications.

CV · Jan 20, 2022 · Code
Self-supervised Video Representation Learning with Cascade Positive Retrieval

Cheng-En Wu, Farley Lai, Yu Hen Hu et al.

Self-supervised video representation learning has been shown to effectively improve downstream tasks such as video retrieval and action recognition. In this paper, we present the Cascade Positive Retrieval (CPR) that successively mines positive examples w.r.t. the query for contrastive learning in a cascade of stages. Specifically, CPR exploits multiple views of a query example in different modalities, where an alternative view may help find another positive example dissimilar in the query view. We explore the effects of possible CPR configurations in ablations including the number of mining stages, the top similar example selection ratio in each stage, and progressive training with an incremental number of the final Top-k selection. The overall mining quality is measured to reflect the recall across training set classes. CPR reaches a median class mining recall of 83.3%, outperforming previous work by 5.5%. Implementation-wise, CPR is complementary to pretext tasks and can be easily applied to previous work. In the evaluation of pretraining on UCF101, CPR consistently improves existing work and even achieves state-of-the-art R@1 of 56.7% and 24.4% in video retrieval as well as 83.8% and 54.8% in action recognition on UCF101 and HMDB51. The code is available at https://github.com/necla-ml/CPR.
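The cascade idea in this abstract can be illustrated with a minimal sketch. It assumes each stage scores candidates with its own similarity function (one per view or modality), keeps a fixed fraction of the pool, and hands the survivors to the next stage. The function name, the `ratio` parameter, and the toy similarity functions are hypothetical and not taken from the CPR code base.

```python
def cascade_retrieve(candidates, stage_sims, ratio, top_k):
    """Mine positives for one query through a cascade of stages.

    candidates: iterable of candidate ids
    stage_sims: list of functions, one per stage, mapping id -> similarity
                to the query in that stage's view (hypothetical interface)
    ratio:      fraction of the pool kept after each stage
    top_k:      number of positives returned at the end
    """
    pool = list(candidates)
    for sim in stage_sims:
        # Rank the current pool by this stage's view of similarity.
        pool.sort(key=sim, reverse=True)
        # Keep the top fraction, but never fewer than top_k candidates.
        keep = max(top_k, int(len(pool) * ratio))
        pool = pool[:keep]
    return pool[:top_k]


# Toy example: two views that disagree about which candidate is closest.
sim_view_a = lambda c: -abs(c - 3)
sim_view_b = lambda c: -abs(c - 5)
positives = cascade_retrieve(range(8), [sim_view_a, sim_view_b], ratio=0.5, top_k=2)
```

In the toy run, candidates must rank well in the first view to survive, and the second view decides the final order among the survivors.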

CV · Sep 22, 2024
Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

Cheng-En Wu, Jinhong Lin, Yu Hen Hu et al.

Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
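One way to picture a greedy search for a token ranking is backward elimination: repeatedly drop the token whose removal leaves the highest score. The sketch below is an illustration under that assumption only. The `score` callable stands in for evaluating the model on the kept tokens and is hypothetical, and the abstract does not state that the paper's search takes this exact form.

```python
def greedy_removal_order(token_ids, score):
    """Return tokens in the order a greedy search would prune them.

    score: function mapping a set of kept token ids to a quality value
           (hypothetical stand-in for model accuracy on those tokens).
    """
    kept = set(token_ids)
    order = []
    while kept:
        # Pick the token whose removal hurts the score the least.
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(kept), key=lambda t: score(kept - {t}))
        order.append(best)
        kept.remove(best)
    return order


# Toy score: each token contributes a fixed, independent weight.
weights = {0: 0.5, 1: 0.1, 2: 0.3}
order = greedy_removal_order(weights, lambda kept: sum(weights[t] for t in kept))
```

Tokens early in the returned order are the cheapest to prune; a lightweight predictor could then be trained to approximate this ordering without running the search at inference time.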

CV · Dec 21, 2024
EasyVis2: A Real Time Multi-view 3D Visualization System for Laparoscopic Surgery Training Enhanced by a Deep Neural Network YOLOv8-Pose

Yung-Hong Sun, Gefei Shen, Jiangang Chen et al.

EasyVis2 is a system designed to provide hands-free, real-time 3D visualization for laparoscopic surgery. It incorporates a surgical trocar equipped with an array of micro-cameras, which can be inserted into the body cavity to offer an enhanced field of view and a 3D perspective of the surgical procedure. A specialized deep neural network algorithm, YOLOv8-Pose, is utilized to estimate the position and orientation of surgical instruments in each individual camera view. These multi-view estimates enable the calculation of 3D poses of surgical tools, facilitating the rendering of a 3D surface model of the instruments, overlaid on the background scene, for real-time visualization. This study presents methods for adapting YOLOv8-Pose to the EasyVis2 system, including the development of a tailored training dataset. Experimental results demonstrate that, with an identical number of cameras, the new system improves 3D reconstruction accuracy and reduces computation time. Additionally, the adapted YOLOv8-Pose system shows high accuracy in 2D pose estimation.

CV · Dec 28, 2023
Block Pruning for Enhanced Efficiency in Convolutional Neural Networks

Cheng-En Wu, Azadeh Davoodi, Yu Hen Hu

This paper presents a novel approach to network pruning, targeting block pruning in deep neural networks for edge computing environments. Our method diverges from traditional techniques that utilize proxy metrics, instead employing a direct block removal strategy to assess the impact on classification accuracy. This hands-on approach allows for an accurate evaluation of each block's importance. We conducted extensive experiments on CIFAR-10, CIFAR-100, and ImageNet datasets using ResNet architectures. Our results demonstrate the efficacy of our method, particularly on large-scale datasets like ImageNet with ResNet50, where it excelled in reducing model size while retaining high accuracy, even when pruning a significant portion of the network. The findings underscore our method's capability in maintaining an optimal balance between model size and performance, especially in resource-constrained edge computing scenarios.
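The direct block removal strategy can be sketched as follows: measure accuracy with each block removed in turn and rank blocks by the resulting drop. The `evaluate` callable is a hypothetical stand-in for running the network on a validation set with the given blocks bypassed; the block names and numbers are toy values, not results from the paper.

```python
def rank_blocks_by_removal(block_ids, evaluate):
    """Rank blocks from least to most important by direct removal.

    evaluate: function mapping a frozenset of removed block ids to
              validation accuracy (hypothetical interface).
    """
    baseline = evaluate(frozenset())
    # Accuracy drop caused by removing each block on its own.
    drops = {b: baseline - evaluate(frozenset([b])) for b in block_ids}
    ranking = sorted(block_ids, key=lambda b: drops[b])
    return ranking, drops


# Toy model: each block contributes a fixed amount of accuracy.
contribution = {"block_a": 0.10, "block_b": 0.01, "block_c": 0.05}

def toy_evaluate(removed):
    return 0.90 - sum(contribution[b] for b in removed)

ranking, drops = rank_blocks_by_removal(list(contribution), toy_evaluate)
```

Blocks at the front of `ranking` are the first candidates for pruning, since removing them costs the least accuracy.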

SY · Nov 16, 2024
A Wearable Gait Monitoring System for 17 Gait Parameters Based on Computer Vision

Jiangang Chen, Yung-Hong Sun, Kristen Pickett et al.

We developed a shoe-mounted gait monitoring system capable of tracking up to 17 gait parameters, including gait length, step time, stride velocity, and others. The system employs a stereo camera mounted on one shoe to track a marker placed on the opposite shoe, enabling the estimation of spatial gait parameters. Additionally, a Force Sensitive Resistor (FSR) affixed to the heel of the shoe, combined with a custom-designed algorithm, is utilized to measure temporal gait parameters. Through testing on multiple participants and comparison with the gait mat, the proposed gait monitoring system exhibited notable performance, with the accuracy of all measured gait parameters exceeding 93.61%. The system also demonstrated a low drift of 4.89% during long-distance walking. A gait identification task conducted on participants using a trained Transformer model achieved 95.7% accuracy on the dataset collected by the proposed system, demonstrating that our hardware has the potential to collect long-sequence gait data suitable for integration with current Large Language Models (LLMs). The system is cost-effective, user-friendly, and well-suited for real-life measurements.

CV · Nov 16, 2024
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling

Jinhong Lin, Cheng-En Wu, Huanran Li et al.

Masked Image Modeling (MIM) has emerged as a powerful self-supervised learning paradigm for visual representation learning, enabling models to acquire rich visual representations by predicting masked portions of images from their visible regions. While this approach has shown promising results, we hypothesize that its effectiveness may be limited by optimization challenges during early training stages, where models are expected to learn complex image distributions from partial observations before developing basic visual processing capabilities. To address this limitation, we propose a prototype-driven curriculum learning framework that structures the learning process to progress from prototypical examples to more complex variations in the dataset. Our approach introduces a temperature-based annealing scheme that gradually expands the training distribution, enabling more stable and efficient learning trajectories. Through extensive experiments on ImageNet-1K, we demonstrate that our curriculum learning strategy significantly improves both training efficiency and representation quality while requiring substantially fewer training epochs compared to standard Masked Auto-Encoding. Our findings suggest that carefully controlling the order of training examples plays a crucial role in self-supervised visual learning, providing a practical solution to the early-stage optimization challenges in MIM.

CV · Jul 18, 2025
C-DOG: Multi-View Multi-instance Feature Association Using Connected δ-Overlap Graphs

Yung-Hong Sun, Ting-Hung Lin, Jiangang Chen et al.

Multi-view multi-instance feature association constitutes a crucial step in 3D reconstruction, facilitating the consistent grouping of object instances across various camera perspectives. The presence of multiple identical objects within a scene often leads to ambiguities for appearance-based feature matching algorithms. Our work circumvents this challenge by exclusively employing geometrical constraints, specifically epipolar geometry, for feature association. We introduce C-DOG (Connected δ-Overlap Graph), an algorithm designed for robust geometrical feature association, even in the presence of noisy feature detections. In a C-DOG graph, two nodes representing 2D feature points from distinct views are connected by an edge if they correspond to the same 3D point. Each edge is weighted by its epipolar distance. Ideally, true associations yield a zero distance; however, noisy feature detections can result in non-zero values. To robustly retain edges where the epipolar distance is less than a threshold δ, we employ the Szymkiewicz–Simpson coefficient. This process leads to a δ-neighbor-overlap clustering of 2D nodes. Furthermore, unreliable nodes are pruned from these clusters using an Inter-quartile Range (IQR)-based criterion. Our extensive experiments on synthetic benchmarks demonstrate that C-DOG not only outperforms geometry-based baseline algorithms but also remains remarkably robust under demanding conditions. This includes scenes with high object density, no visual features, and restricted camera overlap, positioning C-DOG as an excellent solution for scalable 3D reconstruction in practical applications.
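The three building blocks named in this abstract have standard definitions, sketched below: the point-to-epipolar-line distance, the Szymkiewicz-Simpson (overlap) coefficient, and an IQR outlier filter. How C-DOG combines them is not specified here, so the function names, the `k=1.5` fence multiplier, and the toy fundamental matrix are assumptions for illustration only.

```python
import math
import statistics


def epipolar_distance(F, x1, x2):
    """Distance from pixel x2 (view 2) to the epipolar line F @ x1."""
    h1 = (x1[0], x1[1], 1.0)
    # Epipolar line in view 2 as coefficients (a, b, c) of ax + by + c = 0.
    a, b, c = (sum(F[r][k] * h1[k] for k in range(3)) for r in range(3))
    norm = math.hypot(a, b)
    if norm == 0.0:
        return float("inf")  # degenerate line, no meaningful distance
    return abs(a * x2[0] + b * x2[1] + c) / norm


def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson coefficient: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def iqr_inliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (needs >= 2 values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


# Toy fundamental matrix for a pure horizontal translation between views:
# epipolar lines are horizontal, so the distance is the row difference.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
d = epipolar_distance(F, (2, 3), (5, 4))
```

An overlap coefficient of 1.0 means one neighbor set is contained in the other, which is why it suits comparing neighborhoods of different sizes.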

CV · May 23, 2025
Sampling Strategies for Efficient Training of Deep Learning Object Detection Algorithms

Gefei Shen, Yung-Hong Sun, Yu Hen Hu et al.

Two sampling strategies are investigated to enhance efficiency in training a deep learning object detection model. These sampling strategies are employed under the assumption of Lipschitz continuity of deep learning models. The first strategy is uniform sampling, which seeks to obtain samples evenly yet randomly through the state space of the object dynamics. The second strategy, frame difference sampling, is developed to exploit the temporal redundancy among successive frames in a video. Experimental results indicate that the proposed sampling strategies provide a dataset that yields good training performance while requiring relatively few manually labelled samples.
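Frame difference sampling can be sketched as keeping a frame only when it differs enough from the last kept frame, so that near-duplicate frames are not sent for labelling. The mean absolute pixel difference and the `threshold` parameter below are assumptions chosen for illustration; the abstract does not specify the exact difference measure.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized flat frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def frame_difference_sample(frames, threshold):
    """Return indices of frames that differ from the last selected frame
    by more than `threshold` (the first frame is always selected)."""
    selected = []
    last = None
    for idx, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            selected.append(idx)
            last = frame
    return selected


# Toy video: frames are flat lists of pixel intensities.
video = [[0, 0], [0, 1], [10, 10], [10, 11], [30, 30]]
picked = frame_difference_sample(video, threshold=2)  # -> [0, 2, 4]
```

Comparing against the last selected frame, not the immediately preceding one, prevents slow drift from being skipped entirely.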

CV · May 26, 2023
Live American Sign Language Letter Classification with Convolutional Neural Networks

Kyle Boone, Ben Wurster, Seth Thao et al.

This project is centered around building a neural network that is able to recognize ASL letters in images, particularly within the scope of a live video feed. Initial testing results came up short of expectations when both the convolutional network and VGG16 transfer learning approaches failed to generalize in settings of different backgrounds. The use of a pre-trained hand joint detection model was then adopted with the produced joint locations being fed into a fully-connected neural network. The results of this approach exceeded those of prior methods and generalized well to a live video feed application.

CV · May 25, 2023
SimHaze: game engine simulated data for real-world dehazing

Zhengyang Lou, Huan Xu, Fangzhou Mu et al.

Deep models have demonstrated recent success in single-image dehazing. Most prior methods consider fully supervised training and learn from paired clean and hazy images, where a hazy image is synthesized based on a clean image and its estimated depth map. This paradigm, however, can produce low-quality hazy images due to inaccurate depth estimation, resulting in poor generalization of the trained models. In this paper, we explore an alternative approach for generating paired clean-hazy images by leveraging computer graphics. Using a modern game engine, our approach renders crisp clean images and their precise depth maps, based on which high-quality hazy images can be synthesized for training dehazing models. To this end, we present SimHaze: a new synthetic haze dataset. More importantly, we show that training with SimHaze alone allows the latest dehazing models to achieve significantly better performance in comparison to previous dehazing datasets. Our dataset and code will be made publicly available.

CV · Aug 8, 2019
Efficient Inference of CNNs via Channel Pruning

Boyu Zhang, Azadeh Davoodi, Yu Hen Hu

The deployment of Convolutional Neural Networks (CNNs) on resource-constrained platforms such as mobile devices and embedded systems has been greatly hindered by their high implementation cost, which has motivated a lot of research interest in compressing and accelerating trained CNN models. Among the various techniques proposed in the literature, structured pruning, especially channel pruning, has gained a lot of attention due to 1) its superior performance in memory, computation, and energy reduction; and 2) its compatibility with existing hardware and software libraries. In this paper, we investigate the intermediate results of convolutional layers and present a novel pivoted QR factorization based channel pruning technique that can prune any specified number of input channels of any layer. We also explore more pruning opportunities in ResNet-like architectures by applying two tweaks to our technique. Experimental results on VGG-16 and ResNet-50 models with the ImageNet ILSVRC 2012 dataset show 4.29X and 2.84X computation reduction while only sacrificing about 1.40% top-5 accuracy. Compared to many prior works, the pruned models produced by our technique require up to 47.7% less computation while still achieving higher accuracies.
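Column-pivoted QR orders the columns of a matrix so that each new pivot adds the most energy not already explained by earlier pivots. The sketch below implements that pivot order in plain Python and treats the first k pivots as the channels to keep. Mapping channels to columns and keeping the leading pivots are assumptions for illustration, not necessarily the paper's exact procedure. In practice, `scipy.linalg.qr(A, pivoting=True)` returns a comparable permutation.

```python
def pivoted_qr_select(columns, k, tol=1e-12):
    """Return the indices of the first k pivot columns.

    columns: list of equal-length vectors, one per channel (assumed layout).
    Uses Gram-Schmidt with column pivoting: at each step choose the column
    with the largest residual norm, then remove its direction from the rest.
    """
    residual = [list(c) for c in columns]
    remaining = list(range(len(columns)))
    kept = []
    for _ in range(min(k, len(columns))):
        pivot = max(remaining, key=lambda j: sum(v * v for v in residual[j]))
        norm_sq = sum(v * v for v in residual[pivot])
        if norm_sq <= tol:
            break  # remaining columns are numerically dependent
        kept.append(pivot)
        remaining.remove(pivot)
        q = [v / norm_sq ** 0.5 for v in residual[pivot]]
        for j in remaining:
            dot = sum(a * b for a, b in zip(q, residual[j]))
            residual[j] = [a - dot * b for a, b in zip(residual[j], q)]
    return kept


# Channel 0 is a scaled copy of channel 1, so it is redundant once 1 is kept.
channels = [[1, 0, 0], [2, 0, 0], [0, 1, 0], [0, 0, 0.5]]
keep = pivoted_qr_select(channels, k=2)  # -> [1, 2]
```

Because the number of pivots is a free parameter, this kind of selection can target any specified channel count, which matches the flexibility the abstract describes.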

LG · Nov 9, 2018
Design Rule Violation Hotspot Prediction Based on Neural Network Ensembles

Wei Zeng, Azadeh Davoodi, Yu Hen Hu

Design rule check is a critical step in the physical design of integrated circuits to ensure manufacturability. However, it can be done only after a time-consuming detailed routing procedure, which adds drastically to the time of design iterations. With advanced technology nodes, the outcomes of global routing and detailed routing become less correlated, which adds to the difficulty of predicting design rule violations from earlier stages. In this paper, a framework based on neural network ensembles is proposed to predict design rule violation hotspots using information from placement and global routing. A soft voting structure and a PCA-based subset selection scheme are developed on top of a baseline neural network from a recent work. Experimental results show that the proposed architecture achieves significant improvement in model performance compared to the baseline case. For half of the test cases, the performance is even better than random forest, a commonly used ensemble learning model.
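Soft voting, in its usual form, averages the predicted probabilities of the ensemble members and thresholds the mean, in contrast to hard voting, which counts binary votes. The sketch below shows that generic form only. The function name and the 0.5 threshold are assumptions, and the paper's specific voting structure and PCA-based subset selection are not reproduced.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model hotspot probabilities and threshold the mean.

    prob_lists: one list of probabilities per ensemble member, all of the
                same length (one entry per sample).
    Returns (mean_probs, labels) where labels are 0/1 predictions.
    """
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    mean_probs = [
        sum(model[i] for model in prob_lists) / n_models
        for i in range(n_samples)
    ]
    labels = [int(p >= threshold) for p in mean_probs]
    return mean_probs, labels


# Three toy networks scoring two layout regions.
mean_probs, labels = soft_vote([[0.9, 0.2], [0.6, 0.4], [0.3, 0.3]])
```

In the toy call, the third network alone would reject the first region, but the averaged confidence of the ensemble still flags it as a hotspot.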