Zhaoyuan Fang

CV
11papers
725citations
Novelty48%
AI Score30

11 Papers

CVJul 21, 2022
TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors

Gabriel Sarch, Zhaoyuan Fang, Adam W. Harley et al.

We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules: i) visuo-semantic detectors that detect out-of-place objects, ii) an associative neural graph memory of objects and spatial relations that proposes plausible semantic receptacles and surfaces for object repositions, and iii) a visual search network that guides the agent's exploration for efficiently localizing the receptacle-of-interest in the current scene to reposition the object. We test TIDEE on tidying up disorganized scenes in the AI2THOR simulation environment. TIDEE carries out the task directly from pixel and raw depth input without ever having observed the same room beforehand, relying only on priors learned from a separate set of training houses. Human evaluations on the resulting room reorganizations show TIDEE outperforms ablative versions of the model that do not use one or more of the commonsense priors. On a related room rearrangement benchmark that allows the agent to view the goal state prior to rearrangement, a simplified version of our model significantly outperforms a top-performing method by a large margin. Code and data are available at the project website: https://tidee-agent.github.io/.

CVApr 8, 2022
Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories

Adam W. Harley, Zhaoyuan Fang, Katerina Fragkiadaki

Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take this into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, and study pixel tracking as a long-range motion estimation problem, where every pixel is described with a trajectory that locates it in multiple future frames. We re-build this classic approach using components that drive the current state-of-the-art in flow and object tracking, such as dense cost maps, iterative optimization, and learned appearance updates. We train our models using long-range amodal point trajectories mined from existing optical flow data that we synthetically augment with multi-frame occlusions. We test our approach in trajectory estimation benchmarks and in keypoint label propagation tasks, and compare favorably against state-of-the-art optical flow and feature tracking methods.

CVJun 16, 2022
Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

Adam W. Harley, Zhaoyuan Fang, Jie Li et al.

Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the 3D space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that other details in the training setups have shifted at the same time, making it unclear what really matters in top-performing methods. We also observe that using cameras alone is not a real-world constraint, considering that additional sensors like radar have been integrated into real vehicles for years already. In this paper, we first of all attempt to elucidate the high-impact factors in the design and training protocol of BEV perception models. We find that batch size and input resolution greatly affect performance, while lifting strategies have a more modest effect -- even a simple parameter-free lifter works well. Second, we demonstrate that radar data can provide a substantial boost to performance, helping to close the gap between camera-only and LiDAR-enabled systems. We analyze the radar usage details that lead to good performance, and invite the community to re-consider this commonly-neglected part of the sensor platform.

CVApr 27, 2023
Analogy-Forming Transformers for Few-Shot 3D Parsing

Nikolaos Gkanatsios, Mayank Singh, Zhaoyuan Fang et al.

We present Analogical Networks, a model that encodes domain knowledge explicitly, in a collection of structured labelled 3D scenes, in addition to implicitly, as model parameters, and segments 3D object scenes with analogical reasoning: instead of mapping a scene to part segments directly, our model first retrieves related scenes from memory and their corresponding part structures, and then predicts analogous part structures for the input scene, via an end-to-end learnable modulation mechanism. By conditioning on more than one retrieved memories, compositions of structures are predicted, that mix and match parts across the retrieved memories. One-shot, few-shot or many-shot learning are treated uniformly in Analogical Networks, by conditioning on the appropriate set of memories, whether taken from a single, few or many memory exemplars, and inferring analogous parses. We show Analogical Networks are competitive with state-of-the-art 3D segmentation transformers in many-shot settings, and outperform them, as well as existing paradigms of meta-learning and few-shot learning, in few-shot settings. Analogical Networks successfully segment instances of novel object categories simply by expanding their memory, without any weight updates. Our code and models are publicly available in the project webpage: http://analogicalnets.github.io/.

CVSep 1, 2020Code
Iris Liveness Detection Competition (LivDet-Iris) -- The 2020 Edition

Priyanka Das, Joseph McGrath, Zhaoyuan Fang et al.

Launched in 2013, LivDet-Iris is an international competition series open to academia and industry with the aim to assess and report advances in iris Presentation Attack Detection (PAD). This paper presents results from the fourth competition of the series: LivDet-Iris 2020. This year's competition introduced several novel elements: (a) incorporated new types of attacks (samples displayed on a screen, cadaver eyes and prosthetic eyes), (b) initiated LivDet-Iris as an on-going effort, with a testing protocol available now to everyone via the Biometrics Evaluation and Testing (BEAT)(https://www.idiap.ch/software/beat/) open-source platform to facilitate reproducibility and benchmarking of new algorithms continuously, and (c) performance comparison of the submitted entries with three baseline methods (offered by the University of Notre Dame and Michigan State University), and three open-source iris PAD methods available in the public domain. The best performing entry to the competition reported a weighted average APCER of 59.10\% and a BPCER of 0.46\% over all five attack types. This paper serves as the latest evaluation of iris PAD on a large spectrum of presentation attack instruments.

CVAug 19, 2020Code
Open Source Iris Recognition Hardware and Software with Presentation Attack Detection

Zhaoyuan Fang, Adam Czajka

This paper proposes the first known to us open source hardware and software iris recognition system with presentation attack detection (PAD), which can be easily assembled for about 75 USD using Raspberry Pi board and a few peripherals. The primary goal of this work is to offer a low-cost baseline for spoof-resistant iris recognition, which may (a) stimulate research in iris PAD and allow for easy prototyping of secure iris recognition systems, (b) offer a low-cost secure iris recognition alternative to more sophisticated systems, and (c) serve as an educational platform. We propose a lightweight image complexity-guided convolutional network for fast and accurate iris segmentation, domain-specific human-inspired Binarized Statistical Image Features (BSIF) to build an iris template, and to combine 2D (iris texture) and 3D (photometric stereo-based) features for PAD. The proposed iris recognition runs in about 3.2 seconds and the proposed PAD runs in about 4.5 seconds on Raspberry Pi 3B+. The hardware specifications and all source codes of the entire pipeline are made available along with this paper.

CVFeb 21, 2020Code
Robust Iris Presentation Attack Detection Fusing 2D and 3D Information

Zhaoyuan Fang, Adam Czajka, Kevin W. Bowyer

Diversity and unpredictability of artifacts potentially presented to an iris sensor calls for presentation attack detection methods that are agnostic to specificity of presentation attack instruments. This paper proposes a method that combines two-dimensional and three-dimensional properties of the observed iris to address the problem of spoof detection in case when some properties of artifacts are unknown. The 2D (textural) iris features are extracted by a state-of-the-art method employing Binary Statistical Image Features (BSIF) and an ensemble of classifiers is used to deliver 2D modality-related decision. The 3D (shape) iris features are reconstructed by a photometric stereo method from only two images captured under near-infrared illumination placed at two different angles, as in many current commercial iris recognition sensors. The map of normal vectors is used to assess the convexity of the observed iris surface. The combination of these two approaches has been applied to detect whether a subject is wearing a textured contact lens to disguise their identity. Extensive experiments with NDCLD'15 dataset, and a newly collected NDIris3D dataset show that the proposed method is highly robust under various open-set testing scenarios, and that it outperforms all available open-source iris PAD methods tested in identical scenarios. The source code and the newly prepared benchmark are made available along with this paper.

CVNov 30, 2020
Move to See Better: Self-Improving Embodied Object Detection

Zhaoyuan Fang, Ayush Jain, Gabriel Sarch et al.

Passive methods for object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalization to novel or difficult viewpoints thus requires additional training with lots of annotations. In contrast, humans often recognize objects by simply moving around, to get more informative viewpoints. In this paper, we propose a method for improving object detection in testing environments, assuming nothing but an embodied agent with a pre-trained 2D object detector. Our agent collects multi-view data, generates 2D and 3D pseudo-labels, and fine-tunes its detector in a self-supervised manner. Experiments on both indoor and outdoor datasets show that (1) our method obtains high-quality 2D and 3D pseudo-labels from multi-view RGB-D data; (2) fine-tuning with these pseudo-labels improves the 2D detector significantly in the test environment; (3) training a 3D detector with our pseudo-labels outperforms a prior self-supervised method by a large margin; (4) given weak supervision, our method can generate better pseudo-labels for novel objects.

CVJun 23, 2020
Iris Presentation Attack Detection: Where Are We Now?

Aidan Boyd, Zhaoyuan Fang, Adam Czajka et al.

As the popularity of iris recognition systems increases, the importance of effective security measures against presentation attacks becomes paramount. This work presents an overview of the most important advances in the area of iris presentation attack detection published in recent two years. Newly-released, publicly-available datasets for development and evaluation of iris presentation attack detection are discussed. Recent literature can be seen to be broken into three categories: traditional "hand-crafted" feature extraction and classification, deep learning-based solutions, and hybrid approaches fusing both methodologies. Conclusions of modern approaches underscore the difficulty of this task. Finally, commentary on possible directions for future research is provided.

CVFeb 12, 2020
AlignNet: A Unifying Approach to Audio-Visual Alignment

Jianren Wang, Zhaoyuan Fang, Hang Zhao

We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns the end-to-end dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset Dance50 for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Project video and code are available at https://jianrenw.github.io/AlignNet.

CVNov 18, 2018
Iris Presentation Attack Detection Based on Photometric Stereo Features

Adam Czajka, Zhaoyuan Fang, Kevin W. Bowyer

We propose a new iris presentation attack detection method using three-dimensional features of an observed iris region estimated by photometric stereo. Our implementation uses a pair of iris images acquired by a common commercial iris sensor (LG 4000). No hardware modifications of any kind are required. Our approach should be applicable to any iris sensor that can illuminate the eye from two different directions. Each iris image in the pair is captured under near-infrared illumination at a different angle relative to the eye. Photometric stereo is used to estimate surface normal vectors in the non-occluded portions of the iris region. The variability of the normal vectors is used as the presentation attack detection score. This score is larger for a texture that is irregularly opaque and printed on a convex contact lens, and is smaller for an authentic iris texture. Thus the problem is formulated as binary classification into (a) an eye wearing textured contact lens and (b) the texture of an actual iris surface (possibly seen through a clear contact lens). Experiments were carried out on a database of approx. 2,900 iris image pairs acquired from approx. 100 subjects. Our method was able to correctly classify over 95% of samples when tested on contact lens brands unseen in training, and over 98% of samples when the contact lens brand was seen during training. The source codes of the method are made available to other researchers.