Yinan Zhao

CV
h-index106
17papers
1,302citations
Novelty50%
AI Score41

17 Papers

CVOct 9, 2022
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Feng Liang, Bichen Wu, Xiaoliang Dai et al.

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.

SPJun 6, 2022
Human Behavior Recognition Method Based on CEEMD-ES Radar Selection

Zhaolin Zhang, Mingqi Song, Wugang Meng et al.

In recent years, the millimeter-wave radar to identify human behavior has been widely used in medical,security, and other fields. When multiple radars are performing detection tasks, the validity of the features contained in each radar is difficult to guarantee. In addition, processing multiple radar data also requires a lot of time and computational cost. The Complementary Ensemble Empirical Mode Decomposition-Energy Slice (CEEMD-ES) multistatic radar selection method is proposed to solve these problems. First, this method decomposes and reconstructs the radar signal according to the difference in the reflected echo frequency between the limbs and the trunk of the human body. Then, the radar is selected according to the difference between the ratio of echo energy of limbs and trunk and the theoretical value. The time domain, frequency domain and various entropy features of the selected radar are extracted. Finally, the Extreme Learning Machine (ELM) recognition model of the ReLu core is established. Experiments show that this method can effectively select the radar, and the recognition rate of three kinds of human actions is 98.53%.

CVDec 6, 2023
AVID: Any-Length Video Inpainting with Diffusion Model

Zhixing Zhang, Bichen Wu, Xiaoyan Wang et al.

Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain, there have been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: ($i$) temporal consistency of the edited video, ($ii$) supporting different inpainting types at different structural fidelity levels, and ($iii$) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality. More visualization results are made publicly available at https://zhang-zx.github.io/AVID/ .

CVDec 29, 2023
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Feng Liang, Bichen Wu, Jialiang Wang et al.

Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).

EPDec 4, 2023
The GPU Phase Folding and Deep Learning Method for Detecting Exoplanet Transits

Kaitlyn Wang, Jian Ge, Kevin Willis et al.

This paper presents GPFC, a novel Graphics Processing Unit (GPU) Phase Folding and Convolutional Neural Network (CNN) system to detect exoplanets using the transit method. We devise a fast folding algorithm parallelized on a GPU to amplify low signal-to-noise ratio transit signals, allowing a search at high precision and speed. A CNN trained on two million synthetic light curves reports a score indicating the likelihood of a planetary signal at each period. While the GPFC method has broad applicability across period ranges, this research specifically focuses on detecting ultra-short-period planets with orbital periods less than one day. GPFC improves on speed by three orders of magnitude over the predominant Box-fitting Least Squares (BLS) method. Our simulation results show GPFC achieves $97%$ training accuracy, higher true positive rate at the same false positive rate of detection, and higher precision at the same recall rate when compared to BLS. GPFC recovers $100\%$ of known ultra-short-period planets in $\textit{Kepler}$ light curves from a blind search. These results highlight the promise of GPFC as an alternative approach to the traditional BLS algorithm for finding new transiting exoplanets in data taken with $\textit{Kepler}$ and other space transit missions such as K2, TESS and future PLATO and Earth 2.0.

EPMay 21, 2024
Improving Earth-like planet detection in radial velocity using deep learning

Yinan Zhao, Xavier Dumusque, Michael Cretignier et al.

Many novel methods have been proposed to mitigate stellar activity for exoplanet detection as the presence of stellar activity in radial velocity (RV) measurements is the current major limitation. Unlike traditional methods that model stellar activity in the RV domain, more methods are moving in the direction of disentangling stellar activity at the spectral level. The goal of this paper is to present a novel convolutional neural network-based algorithm that efficiently models stellar activity signals at the spectral level, enhancing the detection of Earth-like planets. We trained a convolutional neural network to build the correlation between the change in the spectral line profile and the corresponding RV, full width at half maximum (FWHM) and bisector span (BIS) values derived from the classical cross-correlation function. This algorithm has been tested on three intensively observed stars: Alpha Centauri B (HD128621), Tau ceti (HD10700), and the Sun. By injecting simulated planetary signals at the spectral level, we demonstrate that our machine learning algorithm can achieve, for HD128621 and HD10700, a detection threshold of 0.5 m/s in semi-amplitude for planets with periods ranging from 10 to 300 days. This threshold would correspond to the detection of a $\sim$4$\mathrm{M}_{\oplus}$ in the habitable zone of those stars. On the HARPS-N solar dataset, our algorithm is even more efficient at mitigating stellar activity signals and can reach a threshold of 0.2 m/s, which would correspond to a 2.2$\mathrm{M}_{\oplus}$ planet on the orbit of the Earth. To the best of our knowledge, it is the first time that such low detection thresholds are reported for the Sun, but also for other stars, and therefore this highlights the efficiency of our convolutional neural network-based algorithm at mitigating stellar activity in RV measurements.

CVNov 25, 2025
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Chieh-Yun Chen, Zhonghao Wang, Qi Chen et al.

Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.

EPDec 28, 2023
Discovery of Small Ultra-short-period Planets Orbiting KG Dwarfs in Kepler Survey Using GPU Phase Folding and Deep Learning Detection System

Kaitlyn Wang, Jian Ge, Kevin Willis et al.

Of over 5,000 exoplanets identified so far, only a few hundred possess sub-Earth radii. The formation processes of these sub-Earths remain elusive, and acquiring additional samples is essential for investigating this unique population. In our study, we employ the GPFC method, a novel GPU Phase Folding algorithm combined with a Convolutional Neural Network, on Kepler photometry data. This method enhances the transit search speed significantly over the traditional Box-fitting Least Squares method, allowing a complete search of the known Kepler KOI data within days using a commercial GPU card. To date, we have identified five new ultra-short-period planets (USPs): Kepler-158d, Kepler-963c, Kepler-879c, Kepler-1489c, and KOI-4978.02. Kepler-879c with a radius of $0.4 R_\oplus$ completes its orbit around a G dwarf in 0.646716 days. Kepler-158d with a radius of $0.43 R_\oplus$ orbits a K dwarf star every 0.645088 days. Kepler-1489c with a radius of $0.51 R_\oplus$ orbits a G dwarf in 0.680741 days. Kepler-963c with a radius of $0.6 R_\oplus$ revolves around a G dwarf in 0.919783 days, and KOI-4978.02 with a radius of $0.7 R_\oplus$ circles a G dwarf in 0.941967 days. Among our findings, Kepler-879c, Kepler-158d and Kepler-963c rank as the first, the third, the fourth smallest USPs identified to date. Notably, Kepler-158d stands as the smallest USP found orbiting K dwarfs while Kepler-963c, Kepler-879c, Kepler-1489c, and KOI-4978.02 are the smallest USPs found orbiting G dwarfs. Kepler-879c, Kepler-158d, Kepler-1489c, and KOI-4978.02 are among the smallest planets that are closest to their host stars, with orbits within 5 stellar radii. In addition, these discoveries highlight GPFC's promising capability in identifying small, new transiting exoplanets within photometry data from Kepler, TESS, and upcoming space transit missions, PLATO and ET.

CVDec 21, 2021
Generalized Few-Shot Semantic Segmentation: All You Need is Fine-Tuning

Josh Myers-Dean, Yinan Zhao, Brian Price et al.

Generalized few-shot semantic segmentation was introduced to move beyond only evaluating few-shot segmentation models on novel classes to include testing their ability to remember base classes. While the current state-of-the-art approach is based on meta-learning, it performs poorly and saturates in learning after observing only a few shots. We propose the first fine-tuning solution, and demonstrate that it addresses the saturation problem while achieving state-of-the-art results on two datasets, PASCAL-5i and COCO-20i. We also show that it outperforms existing methods, whether fine-tuning multiple final layers or only the final layer. Finally, we present a triplet loss regularization that shows how to redistribute the balance of performance between novel and base categories so that there is a smaller gap between them.

ROJun 8, 2021
A novel partially-decoupled translational parallel manipulator with symbolic kinematics, singularity identification and workspace determination

Huiping Shen, Yinan Zhao, Ju Li et al.

This paper presents a novel three-degree-of-freedom (3-DOF) translational parallel manipulator (TPM) by using a topological design method of parallel mechanism (PM) based on position and orientation characteristic (POC) equations. The proposed PM is only composed of lower-mobility joints and actuated prismatic joints, together with the investigations on three kinematic issues of importance. The first aspect pertains to geometric modeling of the TPM in connection with its topological characteristics, such as the POC, degree of freedom and coupling degree, from which its symbolic direct kinematic solutions are readily obtained. Moreover, the decoupled properties of input-output motions are directly evaluated without Jacobian analysis. Sequentially, based upon the inverse kinematics, the singular configurations of the TPM are identified, wherein the singular surfaces are visualized by means of a Gr{ö}bner based elimination operation. Finally, the workspace of the TPM is evaluated with a geometric approach. This 3-DOF TPM features less joints and links compared with the well-known Delta robot, which reduces the structural complexity. Its symbolic direct kinematics and partially-decoupled property will ease path planning and dynamic analysis. The TPM can be used for manufacturing large work pieces.

CVApr 6, 2020
Objectness-Aware Few-Shot Semantic Segmentation

Yinan Zhao, Brian Price, Scott Cohen et al.

Few-shot semantic segmentation models aim to segment images after learning from only a few annotated examples. A key challenge for them is how to avoid overfitting because limited training data is available. While prior works usually limited the overall model capacity to alleviate overfitting, this hampers segmentation accuracy. We demonstrate how to increase overall model capacity to achieve improved performance, by introducing objectness, which is class-agnostic and so not prone to overfitting, for complementary use with class-specific features. Extensive experiments demonstrate the versatility of our simple approach of introducing objectness for different base architectures that rely on different data loaders and training schedules (DENet, PFENet) as well as with different backbone models (ResNet-50, ResNet-101 and HRNetV2-W48). Given only one annotated example of an unseen category, experiments show that our method outperforms state-of-art methods with respect to mIoU by at least 4.7% and 1.5% on PASCAL-5i and COCO-20i respectively.

CVMar 27, 2020
Assessing Image Quality Issues for Real-World Problems

Tai-Yin Chiu, Yinan Zhao, Danna Gurari

We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, we identify for 39,181 images taken by people who are blind whether each is sufficient quality to recognize the content as well as what quality flaws are observed from six options. These labels serve as a critical foundation for us to make the following contributions: (1) a new problem and algorithms for deciding whether an image is insufficient quality to recognize the content and so not captionable, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application of more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is insufficient quality and so should not be captioned. We publicly-share our datasets and code to facilitate future extensions of this work: https://vizwiz.org.

CVFeb 20, 2020
Captioning Images Taken by People Who Are Blind

Danna Gurari, Yinan Zhao, Meng Zhang et al.

While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind that are each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly-share the dataset with captioning challenge instructions at https://vizwiz.org

CVAug 10, 2019
Unconstrained Foreground Object Search

Yinan Zhao, Brian Price, Scott Cohen et al.

Many people search for foreground objects to use when editing images. While existing methods can retrieve candidates to aid in this, they are constrained to returning objects that belong to a pre-specified semantic class. We instead propose a novel problem of unconstrained foreground object (UFO) search and introduce a solution that supports efficient search by encoding the background image in the same latent space as the candidate foreground objects. A key contribution of our work is a cost-free, scalable approach for creating a large-scale training dataset with a variety of foreground objects of differing semantic categories per image location. Quantitative and human-perception experiments with two diverse datasets demonstrate the advantage of our UFO search solution over related baselines.

CVApr 30, 2019
Predicting How to Distribute Work Between Algorithms and Humans to Segment an Image Batch

Danna Gurari, Yinan Zhao, Suyog Dutt Jain et al.

Foreground object segmentation is a critical step for many image analysis tasks. While automated methods can produce high-quality results, their failures disappoint users in need of practical solutions. We propose a resource allocation framework for predicting how best to allocate a fixed budget of human annotation effort in order to collect higher quality segmentations for a given batch of images and automated methods. The framework is based on a prediction module that estimates the quality of given algorithm-drawn segmentations. We demonstrate the value of the framework for two novel tasks related to predicting how to distribute annotation efforts between algorithms and humans. Specifically, we develop two systems that automatically decide, for a batch of images, when to recruit humans versus computers to create 1) coarse segmentations required to initialize segmentation tools and 2) final, fine-grained segmentations. Experiments demonstrate the advantage of relying on a mix of human and computer efforts over relying on either resource alone for segmenting objects in images coming from three diverse modalities (visible, phase contrast microscopy, and fluorescence microscopy).

CVMar 22, 2018
Guided Image Inpainting: Replacing an Image Region by Pulling Content from Another Image

Yinan Zhao, Brian Price, Scott Cohen et al.

Deep generative models have shown success in automatically synthesizing missing image regions using surrounding context. However, users cannot directly decide what content to synthesize with such approaches. We propose an end-to-end network for image inpainting that uses a different image to guide the synthesis of new content to fill the hole. A key challenge addressed by our approach is synthesizing new content in regions where the guidance image and the context of the original image are inconsistent. We conduct four studies that demonstrate our results yield more realistic image inpainting results over seven baselines.

CVNov 17, 2015
Identifying the Absorption Bump with Deep Learning

Min Li, Sudeep Gaddam, Xiaolin Li et al.

The pervasive interstellar dust grains provide significant insights to understand the formation and evolution of the stars, planetary systems, and the galaxies, and may harbor the building blocks of life. One of the most effective way to analyze the dust is via their interaction with the light from background sources. The observed extinction curves and spectral features carry the size and composition information of dust. The broad absorption bump at 2175 Angstrom is the most prominent feature in the extinction curves. Traditionally, statistical methods are applied to detect the existence of the absorption bump. These methods require heavy preprocessing and the co-existence of other reference features to alleviate the influence from the noises. In this paper, we apply Deep Learning techniques to detect the broad absorption bump. We demonstrate the key steps for training the selected models and their results. The success of Deep Learning based method inspires us to generalize a common methodology for broader science discovery problems. We present our on-going work to build the DeepDis system for such kind of applications.