CVJul 18, 2024
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video DiffusionBoyang Deng, Richard Tucker, Zhengqi Li et al. · deepmind
We present a method for generating Streetscapes-long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data-posed imagery from Google Street View, along with contextual map data-which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.
CVApr 25, 2023
LumiGAN: Unconditional Generation of Relightable 3D Human FacesBoyang Deng, Yifan Wang, Gordon Wetzstein · stanford
Unsupervised learning of 3D human faces from unstructured 2D image data is an active research area. While recent works have achieved an impressive level of photorealism, they commonly lack control of lighting, which prevents the generated assets from being deployed in novel environments. To this end, we introduce LumiGAN, an unconditional Generative Adversarial Network (GAN) for 3D human faces with a physically based lighting module that enables relighting under novel illumination at inference time. Unlike prior work, LumiGAN can create realistic shadow effects using an efficient visibility formulation that is learned in a self-supervised manner. LumiGAN generates plausible physical properties for relightable faces, including surface normals, diffuse albedo, and specular tint without any ground truth data. In addition to relightability, we demonstrate significantly improved geometry generation compared to state-of-the-art non-relightable 3D GANs and notably better photorealism than existing relightable GANs.
CVSep 18, 2024
Robust Symmetry Detection via Riemannian Langevin DynamicsJihyeon Je, Jiayi Liu, Guandao Yang et al. · stanford
Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating "votes" for each symmetry but struggle with noise. In contrast, learning-based methods may be more robust to noise, but often overlook partial symmetries due to the scarcity of annotated data. In this work, we address this challenge by proposing a novel symmetry detection method that marries classical symmetry detection techniques with recent advances in generative modeling. Specifically, we apply Langevin dynamics to a redefined symmetry space to enhance robustness against noise. We provide empirical results on a variety of shapes that suggest our method is not only robust to noise, but can also identify both partial and global symmetries. Moreover, we demonstrate the utility of our detected symmetries in various downstream tasks, such as compression and symmetrization of noisy shapes.
CVJun 1
Policy-based Foveated Imaging and PerceptionHoward Xiao, Jan Ackermann, Boyang Deng et al.
Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop. Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
CVApr 4, 2023
GINA-3D: Learning to Generate Implicit Neural Assets in the WildBokui Shen, Xinchen Yan, Charles R. Qi et al.
Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress to address such challenges by learning 3D assets using only plentiful 2D images -- but still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to the existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting-variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
CVMay 18
GeoFlow: Enforcing Implicit Geometric Consistency in Video GenerationJan Ackermann, Shengqu Cai, Boyang Deng et al.
Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth--pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.
CVJun 12, 2022
Adding New Categories in Object Detection Using Few-Shot Copy-PasteBoyang Deng, Meiyan Lin, Shoulun Long
Developing data-efficient instance detection models that can handle rare object categories remains a key challenge in computer vision. However, existing research often overlooks data collection strategies and evaluation metrics tailored to real-world scenarios involving neural networks. In this study, we systematically investigate data collection and augmentation techniques focused on object occlusion, aiming to mimic occlusion relationships observed in practical applications. Surprisingly, we find that even a simple occlusion mechanism is sufficient to achieve strong performance when introducing new object categories. Notably, by adding just 15 images of a new category to a large-scale training dataset containing over half a million images across hundreds of categories, the model achieves 95\% accuracy on an unseen test set with thousands of instances of the new category.
AINov 18, 2023
An Improved CNN-based Neural Network Model for Fruit Sugar Level DetectionBoyang Deng, Xin Wen, Zhan Gao
Artificial Intelligence (AI) is widely used in image classification, recognition, text understanding, and natural language processing, leading to significant advancements. In this paper, we introduce AI into the field of fruit quality detection. We designed a regression model for fruit sugar level estimation, utilizing an Artificial Neural Network (ANN) based on the visible/near-infrared (V/NIR) spectra of fruits. After analyzing the fruit spectra, we proposed an innovative neural network structure: the lower layers consist of a Multilayer Perceptron (MLP), a middle layer features a 2-dimensional correlation matrix, and the upper layers contain several Convolutional Neural Network (CNN) layers. Using fruit sugar levels as the detection target, we collected data from two fruit types, Gan Nan Navel and Tian Shan Pear, and conducted separate experiments to compare their results. To assess the reliability of our dataset, we first applied Analysis of Variance (ANOVA). We then explored various strategies for processing spectral data and evaluated their impact. Additionally, we employed Wavelet Decomposition (WD) for dimensionality reduction and a Genetic Algorithm (GA) to identify optimal features. We compared the performance of Neural Network models with traditional Partial Least Squares (PLS) models, and specifically evaluated our proposed MLP-CNN structure against other traditional neural network architectures. Finally, we introduced a novel evaluation metric based on the dataset's standard deviation (STD) to assess detection performance, demonstrating the feasibility of using an artificial neural network model for nondestructive fruit sugar level detection.
CVApr 21
CityRAG: Stepping Into a City via Spatially-Grounded Video GenerationGene Chou, Charles Herrmann, Kyle Genova et al.
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
CVFeb 24, 2025
Semi-Supervised Weed Detection in Vegetable Fields: In-domain and Cross-domain ExperimentsBoyang Deng, Yuzhen Lu
Robust weed detection remains a challenging task in precision weeding, requiring not only potent weed detection models but also large-scale, labeled data. However, the labeled data adequate for model training is practically difficult to come by due to the time-consuming, labor-intensive process that requires specialized expertise to recognize plant species. This study introduces semi-supervised object detection (SSOD) methods for leveraging unlabeled data for enhanced weed detection and proposes a new YOLOv8-based SSOD method, i.e., WeedTeacher. An experimental comparison of four SSOD methods, including three existing frameworks (i.e., DenseTeacher, EfficientTeacher, and SmallTeacher) and WeedTeacher, alongside fully supervised baselines, was conducted for weed detection in both in-domain and cross-domain contexts. A new, diverse weed dataset was created as the testbed, comprising a total of 19,931 field images from two differing domains, including 8,435 labeled (basic-domain) images acquired by handholding devices from 2021 to 2023 and 11,496 unlabeled (new-domain) images acquired by a ground-based mobile platform in 2024. The in-domain experiment with models trained using 10% of the labeled, basic-domain images and tested on the remaining 90% of the data, showed that the YOLOv8-basedWeedTeacher achieved the highest accuracy among all four SSOD methods, with an improvement of 2.6% mAP@50 and 3.1% mAP@50:95 over its supervised baseline (i.e., YOLOv8l). In the cross-domain experiment where the unlabeled new-domain data was incorporated, all four SSOD methods, however, resulted in no or limited improvements over their supervised counterparts. Research is needed to address the difficulty of cross-domain data utilization for robust weed detection.
CVDec 7, 2023
Stable Diffusion for Data Augmentation in COCO and Weed DatasetsBoyang Deng
Generative models have increasingly impacted various tasks, from computer vision to interior design and beyond. Stable Diffusion, a powerful diffusion model, enables the creation of high-resolution images with intricate details from text prompts or reference images. An intriguing challenge lies in improving performance for small datasets with image-sparse categories. This study explores the effectiveness of Stable Diffusion by evaluating seven common categories and three widespread weed species. Synthetic images were generated using three Stable Diffusion-based techniques: Image-to-Image Translation, DreamBooth, and ControlNet, each with distinct focuses. Classification and detection tasks were then performed on these synthetic images, and their performance was compared to models trained on original images. Promising results were achieved for certain classes, demonstrating the potential of Stable Diffusion in enhancing image-sparse datasets. This foundational study may accelerate the adaptation of diffusion models across various domains.
CVApr 11, 2025
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of ImagesBoyang Deng, Songyou Peng, Kyle Genova et al.
We present a system using Multimodal LLMs (MLLMs) to analyze a large database with tens of millions of images captured at different times, with the aim of discovering patterns in temporal changes. Specifically, we aim to capture frequent co-occurring changes ("trends") across a city over a certain period. Unlike previous visual analyses, our analysis answers open-ended queries (e.g., "what are the frequent types of changes in the city?") without any predetermined target subjects or training labels. These properties cast prior learning-based or unsupervised visual analysis tools unsuitable. We identify MLLMs as a novel tool for their open-ended semantic understanding capabilities. Yet, our datasets are four orders of magnitude too large for an MLLM to ingest as context. So we introduce a bottom-up procedure that decomposes the massive visual analysis problem into more tractable sub-problems. We carefully design MLLM-based solutions to each sub-problem. During experiments and ablation studies with our system, we find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities (e.g., "addition of outdoor dining,", "overpass was painted blue," etc.). See more results and interactive demos at https://boyangdeng.com/visual-chronicles.
CVSep 24, 2025
A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard ManagementXinyang Mu, Yuzhen Lu, Boyang Deng
Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur due to environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy gains ranging from -1.4% to 2.9%, with RT-DETR-v2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.
CVDec 14, 2021
Revisiting 3D Object Detection From an Egocentric PerspectiveBoyang Deng, Charles R. Qi, Mahyar Najibi et al.
3D object detection is a key module for safety-critical robotics applications such as autonomous driving. For these applications, we care most about how the detections affect the ego-agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of object geometry when it's more likely to interfere with the ego-agent's motion trajectory. However, current detection metrics, based on box Intersection-over-Union (IoU), are object-centric and aren't designed to capture the spatio-temporal relationship between objects and the ego-agent. To address this issue, we propose a new egocentric measure to evaluate 3D object detection, namely Support Distance Error (SDE). Our analysis based on SDE reveals that the egocentric detection quality is bounded by the coarse geometry of the bounding boxes. Given the insight that SDE would benefit from more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego-agent's safety compared to IoU; and the estimated contours from StarPoly consistently improve the egocentric detection quality over recent 3D object detectors.
CVJun 3, 2021
NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown IlluminationXiuming Zhang, Pratul P. Srinivasan, Boyang Deng et al.
We address the problem of recovering the shape and spatially-varying reflectance of an object from multi-view images (and their camera poses) of an object illuminated by one unknown lighting condition. This enables the rendering of novel views of the object under arbitrary environment lighting and editing of the object's material properties. The key to our approach, which we call Neural Radiance Factorization (NeRFactor), is to distill the volumetric geometry of a Neural Radiance Field (NeRF) [Mildenhall et al. 2020] representation of the object into a surface representation and then jointly refine the geometry while solving for the spatially-varying reflectance and environment lighting. Specifically, NeRFactor recovers 3D neural fields of surface normals, light visibility, albedo, and Bidirectional Reflectance Distribution Functions (BRDFs) without any supervision, using only a re-rendering loss, simple smoothness priors, and a data-driven BRDF prior learned from real-world BRDF measurements. By explicitly modeling light visibility, NeRFactor is able to separate shadows from albedo and synthesize realistic soft or hard shadows under arbitrary lighting conditions. NeRFactor is able to recover convincing 3D models for free-viewpoint relighting in this challenging and underconstrained capture setup for both synthetic and real scenes. Qualitative and quantitative experiments show that NeRFactor outperforms classic and deep learning-based state of the art across various tasks. Our videos, code, and data are available at people.csail.mit.edu/xiuming/projects/nerfactor/.
CVMar 8, 2021
Offboard 3D Object Detection from Point Cloud SequencesCharles R. Qi, Yin Zhou, Mahyar Najibi et al.
While current 3D object recognition research mostly focuses on the real-time, onboard scenario, there are many offboard use cases of perception that are largely under-explored, such as using machines to automatically generate high-quality 3D labels. Existing 3D object detectors fail to satisfy the high-quality requirement for offboard uses due to the limited input and speed constraints. In this paper, we propose a novel offboard 3D object detection pipeline using point cloud sequence data. Observing that different frames capture complementary views of objects, we design the offboard detector to make use of the temporal points through both multi-frame object detection and novel object-centric refinement models. Evaluated on the Waymo Open Dataset, our pipeline named 3D Auto Labeling shows significant gains compared to the state-of-the-art onboard detectors and our offboard baselines. Its performance is even on par with human labels verified through a human label study. Further experiments demonstrate the application of auto labels for semi-supervised learning and provide extensive analysis to validate various design choices.
CVDec 8, 2020
Canonical Capsules: Self-Supervised Capsules in Canonical PoseWeiwei Sun, Andrea Tagliasacchi, Boyang Deng et al.
We propose a self-supervised capsule architecture for 3D point clouds. We compute capsule decompositions of objects through permutation-equivariant attention, and self-supervise the process by training with pairs of randomly rotated objects. Our key idea is to aggregate the attention masks into semantic keypoints, and use these to supervise a decomposition that satisfies the capsule invariance/equivariance properties. This not only enables the training of a semantically consistent decomposition, but also allows us to learn a canonicalization operation that enables object-centric reasoning. To train our neural network we require neither classification labels nor manually-aligned training datasets. Yet, by learning an object-centric representation in a self-supervised manner, our method outperforms the state-of-the-art on 3D point cloud reconstruction, canonicalization, and unsupervised classification.
CVDec 7, 2020
NeRV: Neural Reflectance and Visibility Fields for Relighting and View SynthesisPratul P. Srinivasan, Boyang Deng, Xiuming Zhang et al.
We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model's ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.
CVDec 6, 2019
NASA: Neural Articulated Shape ApproximationBoyang Deng, JP Lewis, Timothy Jeruzalski et al.
Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose. Occupancy testing using NASA is straightforward, circumventing the complexity of meshes and the issue of water-tightness. We demonstrate the effectiveness of NASA for 3D tracking applications, and discuss other potential extensions.
CVSep 12, 2019
CvxNet: Learnable Convex DecompositionBoyang Deng, Kyle Genova, Soroosh Yazdani et al.
Any solid object can be decomposed into a collection of convex polytopes (in short, convexes). When a small number of convexes are used, such a decomposition can be thought of as a piece-wise approximation of the geometry. This decomposition is fundamental in computer graphics, where it provides one of the most common ways to approximate geometry, for example, in real-time physics simulation. A convex object also has the property of being simultaneously an explicit and implicit representation: one can interpret it explicitly as a mesh derived by computing the vertices of a convex hull, or implicitly as the collection of half-space constraints or support functions. Their implicit representation makes them particularly well suited for neural network training, as they abstract away from the topology of the geometry they need to represent. However, at testing time, convexes can also generate explicit representations -- polygonal meshes -- which can then be used in any downstream application. We introduce a network architecture to represent a low dimensional family of convexes. This family is automatically derived via an auto-encoding process. We investigate the applications of this architecture including automatic convex decomposition, image to 3D reconstruction, and part-based shape retrieval.
CVMay 28, 2019
Cerberus: A Multi-headed DerendererBoyang Deng, Simon Kornblith, Geoffrey Hinton
To generalize to novel visual scenes with new viewpoints and new object poses, a visual system needs representations of the shapes of the parts of an object that are invariant to changes in viewpoint or pose. 3D graphics representations disentangle visual factors such as viewpoints and lighting from object structure in a natural way. It is possible to learn to invert the process that converts 3D graphics representations into 2D images, provided the 3D graphics representations are available as labels. When only the unlabeled images are available, however, learning to derender is much harder. We consider a simple model which is just a set of free floating parts. Each part has its own relation to the camera and its own triangular mesh which can be deformed to model the shape of the part. At test time, a neural network looks at a single image and extracts the shapes of the parts and their relations to the camera. Each part can be viewed as one head of a multi-headed derenderer. During training, the extracted parts are used as input to a differentiable 3D renderer and the reconstruction error is backpropagated to train the neural net. We make the learning task easier by encouraging the deformations of the part meshes to be invariant to changes in viewpoint and invariant to the changes in the relative positions of the parts that occur when the pose of an articulated body changes. Cerberus, our multi-headed derenderer, outperforms previous methods for extracting 3D parts from single images without part annotations, and it does quite well at extracting natural parts of human figures.
CVAug 16, 2018
BlockQNN: Efficient Block-wise Neural Network Architecture GenerationZhao Zhong, Zichen Yang, Boyang Deng et al.
Convolutional neural networks have gained a remarkable success in computer vision. However, most usable network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent which is trained to choose component layers sequentially. We stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it yields state-of-the-art results in comparison to the hand-crafted networks on image classification, particularly, the best network generated by BlockQNN achieves 2.35% top-1 error rate on CIFAR-10. (2) it offers tremendous reduction of the search space in designing networks, spending only 3 days with 32 GPUs. A faster version can yield a comparable result with only 1 GPU in 20 hours. (3) it has strong generalizability in that the network built on CIFAR also performs well on the larger-scale dataset. The best network achieves very competitive accuracy of 82.0% top-1 and 96.0% top-5 on ImageNet.
LGDec 9, 2017
Peephole: Predicting Network Performance Before TrainingBoyang Deng, Junjie Yan, Dahua Lin
The quest for performant networks has been a significant force that drives the advancements of deep learning in recent years. While rewarding, improving network design has never been an easy journey. The large design space combined with the tremendous cost required for network training poses a major obstacle to this endeavor. In this work, we propose a new approach to this problem, namely, predicting the performance of a network before training, based on its architecture. Specifically, we develop a unified way to encode individual layers into vectors and bring them together to form an integrated description via LSTM. Taking advantage of the recurrent network's strong expressive power, this method can reliably predict the performances of various network architectures. Our empirical studies showed that it not only achieved accurate predictions but also produced consistent rankings across datasets -- a key desideratum in performance prediction.
CVNov 22, 2017
Few-shot Learning by Exploiting Visual Concepts within CNNsBoyang Deng, Qing Liu, Siyuan Qiao et al.
Convolutional neural networks (CNNs) are one of the driving forces for the advancement of computer vision. Despite their promising performances on many tasks, CNNs still face major obstacles on the road to achieving ideal machine intelligence. One is that CNNs are complex and hard to interpret. Another is that standard CNNs require large amounts of annotated data, which is sometimes hard to obtain, and it is desirable to learn to recognize objects from few examples. In this work, we address these limitations of CNNs by developing novel, flexible, and interpretable models for few-shot learning. Our models are based on the idea of encoding objects in terms of visual concepts (VCs), which are interpretable visual cues represented by the feature vectors within CNNs. We first adapt the learning of VCs to the few-shot setting, and then uncover two key properties of feature encoding using VCs, which we call category sensitivity and spatial pattern. Motivated by these properties, we present two intuitive models for the problem of few-shot learning. Experiments show that our models achieve competitive performances, while being more flexible and interpretable than alternative state-of-the-art few-shot learning methods. We conclude that using VCs helps expose the natural capability of CNNs for few-shot learning.
CVJul 11, 2017
Hierarchical Deep Recurrent Architecture for Video UnderstandingLuming Tang, Boyang Deng, Haiyu Zhao et al.
This paper introduces the system we developed for the Youtube-8M Video Understanding Challenge, in which a large-scale benchmark dataset was used for multi-label video classification. The proposed framework contains hierarchical deep architecture, including the frame-level sequence modeling part and the video-level classification part. In the frame-level sequence modelling part, we explore a set of methods including Pooling-LSTM (PLSTM), Hierarchical-LSTM (HLSTM), Random-LSTM (RLSTM) in order to address the problem of large amount of frames in a video. We also introduce two attention pooling methods, single attention pooling (ATT) and multiply attention pooling (Multi-ATT) so that we can pay more attention to the informative frames in a video and ignore the useless frames. In the video-level classification part, two methods are proposed to increase the classification performance, i.e. Hierarchical-Mixture-of-Experts (HMoE) and Classifier Chains (CC). Our final submission is an ensemble consisting of 18 sub-models. In terms of the official evaluation metric Global Average Precision (GAP) at 20, our best submission achieves 0.84346 on the public 50% of test dataset and 0.84333 on the private 50% of test data.