Rahul Sukthankar

CV
h-index117
41papers
10,101citations
Novelty51%
AI Score46

41 Papers

CVAug 15, 2023
Self-supervised Hypergraphs for Learning Multiple World Interpretations

Alina Marcu, Mihai Pirvu, Dragos Costea et al.

We present a method for learning multiple scene representations given a small labeled set, by exploiting the relationships between such representations in the form of a multi-task hypergraph. We also show how we can use the hypergraph to improve a powerful pretrained VisTransformer model without any additional labeled data. In our hypergraph, each node is an interpretation layer (e.g., depth or segmentation) of the scene. Within each hyperedge, one or several input nodes predict the layer at the output node. Thus, each node could be an input node in some hyperedges and an output node in others. In this way, multiple paths can reach the same node, to form ensembles from which we obtain robust pseudolabels, which allow self-supervised learning in the hypergraph. We test different ensemble models and different types of hyperedges and show superior performance to other multi-task graph models in the field. We also introduce Dronescapes, a large video dataset captured with UAVs in different complex real-world scenes, with multiple representations, suitable for multi-task learning.

CVDec 9, 2025
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula et al.

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

CVDec 23, 2021
HSPACE: Synthetic Parametric Humans Animated in Complex Environments

Eduard Gabriel Bazavan, Andrei Zanfir, Mihai Zanfir et al.

Advances in the state of the art for 3d human sensing are currently limited by the lack of visual datasets with 3d ground truth, including multiple people, in motion, operating in real-world environments, with complex illumination or occlusion, and potentially observed by a moving camera. Sophisticated scene understanding would require estimating human pose and shape as well as gestures, towards representations that ultimately combine useful metric and behavioral signals with free-viewpoint photo-realistic visualisation capabilities. To sustain progress, we build a large-scale photo-realistic dataset, Human-SPACE (HSPACE), of animated humans placed in complex synthetic indoor and outdoor environments. We combine a hundred diverse individuals of varying ages, gender, proportions, and ethnicity, with hundreds of motions and scenes, as well as parametric variations in body shape (for a total of 1,600 different humans), in order to generate an initial dataset of over 1 million frames. Human animations are obtained by fitting an expressive human body model, GHUM, to single scans of people, followed by novel re-targeting and positioning procedures that support the realistic animation of dressed humans, statistical variation of body proportions, and jointly consistent scene placement of multiple moving people. Assets are generated automatically, at scale, and are compatible with existing real time rendering and game engines. The dataset with evaluation server will be made available for research. Our large-scale analysis of the impact of synthetic data, in connection with real data and weak supervision, underlines the considerable potential for continuing quality improvements and limiting the sim-to-real gap, in this practical setting, in connection with increased model capacity.

CVNov 20, 2021
Discrete Representations Strengthen Vision Transformer Robustness

Chengzhi Mao, Lu Jiang, Mostafa Dehghani et al.

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet.

CVJun 17, 2021
THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers

Mihai Zanfir, Andrei Zanfir, Eduard Gabriel Bazavan et al.

We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people, given monocular RGB images. Key to our methodology is an intermediate 3d marker representation, where we aim to combine the predictive power of model-free-output architectures and the regularizing, anthropometrically-preserving properties of a statistical human surface model like GHUM -- a recently introduced, expressive full body statistical 3d human model, trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, for the task of inferring 3d human shape, joint positions, and global translation. Moreover, we observe very solid 3d reconstruction performance for difficult human poses collected in the wild.

CVOct 2, 2020
Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus

Marius Leordeanu, Mihai Pirvu, Dragos Costea et al.

We address the challenging problem of semi-supervised learning in the context of multiple visual interpretations of the world by finding consensus in a graph of neural networks. Each graph node is a scene interpretation layer, while each edge is a deep net that transforms one layer at one node into another from a different node. During the supervised phase edge networks are trained independently. During the next unsupervised stage edge nets are trained on the pseudo-ground truth provided by consensus among multiple paths that reach the nets' start and end nodes. These paths act as ensemble teachers for any given edge and strong consensus is used for high-confidence supervisory signal. The unsupervised learning process is repeated over several generations, in which each edge becomes a "student" and also part of different ensemble "teachers" for training other students. By optimizing such consensus between different paths, the graph reaches consistency and robustness over multiple interpretations and generations, in the face of unknown labels. We give theoretical justifications of the proposed idea and validate it on a large dataset. We show how prediction of different representations such as depth, semantic segmentation, surface normals and pose from RGB input could be effectively learned through self-supervised consensus in our graph. We also compare to state-of-the-art methods for multi-task and semi-supervised learning and show superior performance.

CVAug 16, 2020
Neural Descent for Visual 3D Human Pose and Shape

Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Zanfir et al.

We present deep neural network methodology to reconstruct the 3d pose and shape of people, given an input RGB image. We rely on a recently introduced, expressivefull body statistical 3d human model, GHUM, trained end-to-end, and learn to reconstruct its pose and shape state in a self-supervised regime. Central to our methodology, is a learning to learn and optimize approach, referred to as HUmanNeural Descent (HUND), which avoids both second-order differentiation when training the model parameters,and expensive state gradient descent in order to accurately minimize a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters such that not only losses are minimized effectively, but the process is meta-regularized in order to ensure end-progress. HUND's symmetry between training and testing makes it the first 3d human sensing architecture to natively support different operating regimes including self-supervised ones. In diverse tests, we show that HUND achieves very competitive results in datasets like H3.6M and 3DPW, aswell as good quality 3d reconstructions for complex imagery collected in-the-wild.

CVAug 3, 2020
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

Samuel Albanie, Yang Liu, Arsha Nagrani et al.

We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval-the task of searching for content within a corpus of videos using natural language queries. This report summarizes the results of the first edition of the challenge together with the findings of the participants.

CVJul 29, 2020
Learning Video Representations from Textual Web Supervision

Jonathan C. Stroud, Zhichao Lu, Chen Sun et al.

Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning.

CVMar 30, 2020
Speech2Action: Cross-modal Supervision for Action Recognition

Arsha Nagrani, Chen Sun, David Ross et al.

Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.

CVMar 23, 2020
Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows

Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu et al.

Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body and thedifficulty to acquire training data for large-scale supervised learning in complex visual scenes. In this paper we present practical semi-supervised and self-supervised models that support training and good generalization in real-world images and video. Our formulation is based on kinematic latent normalizing flow representations and dynamics, as well as differentiable, semantic body part alignment loss functions that support self-supervised learning. In extensive experiments using 3D motion capture datasets like CMU, Human3.6M, 3DPW, or AMASS, as well as image repositories like COCO, we show that the proposed methods outperform the state of the art, supporting the practical construction of an accurate family of models based on large-scale training with diverse and incompletely labeled image and video data.

ROSep 14, 2019
Selfie Drone Stick: A Natural Interface for Quadcopter Photography

Saif Alabachi, Gita Sukthankar, Rahul Sukthankar

A physical selfie stick extends the user's reach, enabling the acquisition of personal photos that include more of the background scene. Similarly, a quadcopter can capture photos from vantage points unattainable by the user; but teleoperating a quadcopter to good viewpoints is a difficult task. This paper presents a natural interface for quadcopter photography, the SelfieDroneStick that allows the user to guide the quadcopter to the optimal vantage point based on the phone's sensors. Users specify the composition of their desired long-range selfies using their smartphone, and the quadcopter autonomously flies to a sequence of vantage points from where the desired shots can be taken. The robot controller is trained from a combination of real-world images and simulated flight data. This paper describes two key innovations required to deploy deep reinforcement learning models on a real robot: 1) an abstract state representation for transferring learning from simulation to the hardware platform, and 2) reward shaping and staging paradigms for training the controller. Both of these improvements were found to be essential in learning a robot controller from simulation that transfers successfully to the real robot.

CVJul 21, 2019
An Efficient 3D CNN for Action/Object Segmentation in Video

Rui Hou, Chen Chen, Rahul Sukthankar et al.

Convolutional Neural Network (CNN) based image segmentation has made great progress in recent years. However, video object segmentation remains a challenging task due to its high computational complexity. Most of the previous methods employ a two-stream CNN framework to handle spatial and motion features separately. In this paper, we propose an end-to-end encoder-decoder style 3D CNN to aggregate spatial and temporal information simultaneously for video object segmentation. To efficiently process video, we propose 3D separable convolution for the pyramid pooling module and decoder, which dramatically reduces the number of operations while maintaining the performance. Moreover, we also extend our framework to video action segmentation by adding an extra classifier to predict the action label for actors in videos. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for action and object segmentation compared to the state-of-the-art.

CVApr 8, 2019
Relational Action Forecasting

Chen Sun, Abhinav Shrivastava, Carl Vondrick et al.

This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.

ROFeb 27, 2019
Customizing Object Detectors for Indoor Robots

Saif Alabachi, Gita Sukthankar, Rahul Sukthankar

Object detection models based on convolutional neural networks (CNNs) demonstrate impressive performance when trained on large-scale labeled datasets. While a generic object detector trained on such a dataset performs adequately in applications where the input data is similar to user photographs, the detector performs poorly on small objects, particularly ones with limited training data or imaged from uncommon viewpoints. Also, a specific room will have many objects that are missed by standard object detectors, frustrating a robot that continually operates in the same indoor environment. This paper describes a system for rapidly creating customized object detectors. Data is collected from a quadcopter that is teleoperated with an interactive interface. Once an object is selected, the quadcopter autonomously photographs the object from multiple viewpoints to %create training data that is used by DUNet (Dense Upscaled Net), collect data to train DUNet (Dense Upscaled Network), our proposed model for learning customized object detectors from scratch given limited data. Our experiments compare the performance of learning models from scratch with DUNet vs.\ fine tuning existing state of the art object detectors, both on our indoor robotics domain and on standard datasets.

CVDec 19, 2018
D3D: Distilled 3D Networks for Video Action Recognition

Jonathan C. Stroud, David A. Ross, Chen Sun et al.

State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.

LGNov 30, 2018
Modulated Policy Hierarchies

Alexander Pashevich, Danijar Hafner, James Davidson et al.

Solving tasks with sparse rewards is a main challenge in reinforcement learning. While hierarchical controllers are an intuitive approach to this problem, current methods often require manual reward shaping, alternating training phases, or manually defined sub tasks. We introduce modulated policy hierarchies (MPH), that can learn end-to-end to solve tasks from sparse rewards. To achieve this, we study different modulation signals and exploration for hierarchical controllers. Specifically, we find that communicating via bit-vectors is more efficient than selecting one out of multiple skills, as it enables mixing between them. To facilitate exploration, MPH uses its different time scales for temporally extended intrinsic motivation at each level of the hierarchy. We evaluate MPH on the robotics tasks of pushing and sparse block stacking, where it outperforms recent baselines.

CVJul 28, 2018
Actor-Centric Relation Network

Chen Sun, Abhinav Shrivastava, Carl Vondrick et al.

Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

CVApr 20, 2018
Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold et al.

We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

CVJan 26, 2018
Object category learning and retrieval with weak supervision

Steven Hickson, Anelia Angelova, Irfan Essa et al.

We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a kmeans clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism which will make the approach completely unsupervised.

CVMay 23, 2017
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Chunhui Gu, Chen Sun, David A. Ross et al.

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

CVMay 16, 2017
WebVision Challenge: Visual Learning and Understanding With Web Data

Wen Li, Limin Wang, Wei Li et al.

We present the 2017 WebVision Challenge, a public image recognition challenge designed for deep learning based on web images without instance-level human annotation. Following the spirit of previous vision challenges, such as ILSVRC, Places2 and PASCAL VOC, which have played critical roles in the development of computer vision by contributing to the community with large scale annotated data for model designing and standardized benchmarking, we contribute with this challenge a large scale web images dataset, and a public competition with a workshop co-located with CVPR 2017. The WebVision dataset contains more than $2.4$ million web images crawled from the Internet by using queries generated from the $1,000$ semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information is also included. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development. The 2017 WebVision challenge consists of two tracks, the image classification task on WebVision test set, and the transfer learning task on PASCAL VOC 2012 dataset. In this paper, we describe the details of data collection and annotation, highlight the characteristics of the dataset, and introduce the evaluation metrics.

CVMay 5, 2017
Motion Prediction Under Multimodality with Conditional Stochastic Networks

Katerina Fragkiadaki, Jonathan Huang, Alex Alemi et al.

Given a visual history, multiple future outcomes for a video scene are equally probable, in other words, the distribution of future outcomes has multiple modes. Multimodality is notoriously hard to handle by standard regressors or classifiers: the former regress to the mean and the latter discretize a continuous high dimensional output space. In this work, we present stochastic neural network architectures that handle such multimodality through stochasticity: future trajectories of objects, body joints or frames are represented as deep, non-linear transformations of random (as opposed to deterministic) variables. Such random variables are sampled from simple Gaussian distributions whose means and variances are parametrized by the output of convolutional encoders over the visual history. We introduce novel convolutional architectures for predicting future body joint trajectories that outperform fully connected alternatives \cite{DBLP:journals/corr/WalkerDGH16}. We introduce stochastic spatial transformers through optical flow warping for predicting future frames, which outperform their deterministic equivalents \cite{DBLP:journals/corr/PatrauceanHC15}. Training stochastic networks involves an intractable marginalization over stochastic variables. We compare various training schemes that handle such marginalization through a) straightforward sampling from the prior, b) conditional variational autoencoders \cite{NIPS2015_5775,DBLP:journals/corr/WalkerDGH16}, and, c) a proposed K-best-sample loss that penalizes the best prediction under a fixed "prediction budget". We show experimental results on object trajectory prediction, human body joint trajectory prediction and video prediction under varying future uncertainty, validating quantitatively and qualitatively our architectural choices and training schemes.

CVApr 25, 2017
SfM-Net: Learning of Structure and Motion from Video

Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid et al.

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and back-propagates. The model can be trained with various degrees of supervision: 1) self-supervised by the re-projection photometric error (completely unsupervised), 2) supervised by ego-motion (camera motion), or 3) supervised by depth (e.g., as provided by RGBD sensors). SfM-Net extracts meaningful depth estimates and successfully estimates frame-to-frame camera rotations and translations. It often successfully segments the moving objects in the scene, even though such supervision is never provided.

LGMar 8, 2017
Robust Adversarial Reinforcement Learning

Lerrel Pinto, James Davidson, Rahul Sukthankar et al.

Deep neural networks coupled with fast simulation and improved computation have led to recent successes in the field of reinforcement learning (RL). However, most current RL-based approaches fail to generalize since: (a) the gap between simulation and real world is so large that policy-learning approaches fail to transfer; (b) even if policy learning is done in real world, the data scarcity leads to failed generalization from training to test scenarios (e.g., due to different friction or object masses). Inspired from H-infinity control methods, we note that both modeling errors and differences in training and test scenarios can be viewed as extra forces/disturbances in the system. This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced -- that is, it learns an optimal destabilization policy. We formulate the policy learning as a zero-sum, minimax objective function. Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper and Walker2d) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and c) outperform the baseline even in the absence of the adversary.

CVFeb 13, 2017
Cognitive Mapping and Planning for Visual Navigation

Saurabh Gupta, Varun Tolani, James Davidson et al.

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real world buildings. Our experiments demonstrate that CMP outperforms alternate learning-based architectures, as well as, classical mapping and path planning approaches in many cases. Furthermore, it naturally extends to semantically specified goals, such as 'going to a chair'. We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.

AIFeb 3, 2017
Traffic Lights with Auction-Based Controllers: Algorithms and Real-World Data

Shumeet Baluja, Michele Covell, Rahul Sukthankar

Real-time optimization of traffic flow addresses important practical problems: reducing a driver's wasted time, improving city-wide efficiency, reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the current, already deployed, infrastructure. In this paper, we introduce a traffic light controller that employs bidding within micro-auctions to efficiently incorporate traffic sensor information; no other outside sources of information are assumed. We train and test traffic light controllers on large-scale data collected from opted-in Android cell-phone users over a period of several months in Mountain View, California and the River North neighborhood of Chicago, Illinois. The learned auction-based controllers surpass (in both the relevant metrics of road-capacity and mean travel time) the currently deployed lights, optimized static-program lights, and longer-term planning approaches, in both cities, measured using real user driving data.

CVDec 20, 2016
Beyond Skip Connections: Top-Down Modulation for Object Detection

Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik et al.

In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from low-level requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO testdev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 for InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).

CVApr 21, 2016
The THUMOS Challenge on Action Recognition for Videos "in the Wild"

Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang et al.

Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include `background videos' which share similar scenes and backgrounds as action videos, but are devoid of the specific actions. The three editions of the challenge organized in 2013--2015 have made THUMOS a common benchmark for action classification and detection and the annual challenge is widely attended by teams from around the world. In this paper we describe the THUMOS benchmark in detail and give an overview of data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.

CVDec 1, 2015
Labeling the Features Not the Samples: Efficient Video Classification with Minimal Supervision

Marius Leordeanu, Alexandra Radu, Shumeet Baluja et al.

Feature selection is essential for effective visual recognition. We propose an efficient joint classifier learning and feature selection method that discovers sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. Our method requires only the following knowledge, which we call the \emph{feature sign}---whether or not a particular feature has on average stronger values over positive samples than over negatives. We show how this can be estimated using as few as a single labeled training sample per class. Then, using these feature signs, we extend an initial supervised learning problem into an (almost) unsupervised clustering formulation that can incorporate new data without requiring ground truth labels. Our method works both as a feature selection mechanism and as a fully competitive classifier. It has important properties, low computational cost and excellent accuracy, especially in difficult cases of very limited training data. We experiment on large-scale recognition in video and show superior speed and performance to established feature selection approaches such as AdaBoost, Lasso, greedy forward-backward selection, and powerful classifiers such as SVM.

CVNov 30, 2015
Behavior Discovery and Alignment of Articulated Object Classes from Unstructured Video

Luca Del Pero, Susanna Ricco, Rahul Sukthankar et al.

We propose an automatic system for organizing the content of a collection of unstructured videos of an articulated object class (e.g. tiger, horse). By exploiting the recurring motion patterns of the class across videos, our system: 1) identifies its characteristic behaviors; and 2) recovers pixel-to-pixel alignments across different instances. Our system can be useful for organizing video collections for indexing and retrieval. Moreover, it can be a platform for learning the appearance or behaviors of object classes from Internet video. Traditional supervised techniques cannot exploit this wealth of data directly, as they require a large amount of time-consuming manual annotations. The behavior discovery stage generates temporal video intervals, each automatically trimmed to one instance of the discovered behavior, clustered by type. It relies on our novel motion representation for articulated motion based on the displacement of ordered pairs of trajectories (PoTs). The alignment stage aligns hundreds of instances of the class to a great accuracy despite considerable appearance variations (e.g. an adult tiger and a cub). It uses a flexible Thin Plate Spline deformation model that can vary through time. We carefully evaluate each step of our system on a new, fully annotated dataset. On behavior discovery, we outperform the state-of-the-art Improved DTF descriptor. On spatial alignment, we outperform the popular SIFT Flow algorithm.

CVNov 19, 2015
Coreset-Based Adaptive Tracking

Abhimanyu Dubey, Nikhil Naik, Dan Raviv et al.

We propose a method for learning from streaming visual data using a compact, constant size representation of all the data that was seen until a given moment. Specifically, we construct a 'coreset' representation of streaming data using a parallelized algorithm, which is an approximation of a set with relation to the squared distances between this set and all other points in its ambient space. We learn an adaptive object appearance model from the coreset tree in constant time and logarithmic space and use it for object tracking by detection. Our method obtains excellent results for object tracking on three standard datasets over more than 100 videos. The ability to summarize data efficiently makes our method ideally suited for tracking in long videos in presence of space and time constraints. We demonstrate this ability by outperforming a variety of algorithms on the TLD dataset with 2685 frames on average. This coreset based learning approach can be applied for both real-time learning of small, varied data and fast learning of big data.

CVNov 19, 2015
Variable Rate Image Compression with Recurrent Neural Networks

George Toderici, Sean M. O'Malley, Sung Jin Hwang et al.

A large fraction of Internet traffic is now driven by requests from mobile devices with relatively small screens and often stringent bandwidth requirements. Due to these factors, it has become the norm for modern graphics-heavy websites to transmit low-resolution, low-bytecount image previews (thumbnails) as part of the initial page load process to improve apparent page responsiveness. Increasing thumbnail compression beyond the capabilities of existing codecs is therefore a current research focus, as any byte savings will significantly enhance the experience of mobile device users. Toward this end, we propose a general framework for variable-rate image compression and a novel architecture based on convolutional and deconvolutional LSTM recurrent networks. Our models address the main issues that have prevented autoencoder neural networks from competing with existing image compression algorithms: (1) our networks only need to be trained once (not per-image), regardless of input image dimensions and the desired compression rate; (2) our networks are progressive, meaning that the more bits are sent, the more accurate the image reconstruction; and (3) the proposed architecture is at least as efficient as a standard purpose-trained autoencoder for a given number of bits. On a large-scale benchmark of 32$\times$32 thumbnails, our LSTM-based approaches provide better visual quality than (headerless) JPEG, JPEG2000 and WebP, with a storage size that is reduced by 10% or more.

CVApr 4, 2015
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

Chen Sun, Sanketh Shetty, Rahul Sukthankar et al.

We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a fine-grained sports action data set FGA-240 of more than 130,000 YouTube videos. It has 240 fine-grained actions under 85 sports activities. Convincing results are shown on the FGA-240 data set, as well as the THUMOS 2014 localization data set with untrimmed training videos.

CVDec 1, 2014
Recovering Spatiotemporal Correspondence between Deformable Objects by Exploiting Consistent Foreground Motion in Video

Luca Del Pero, Susanna Ricco, Rahul Sukthankar et al.

Given unstructured videos of deformable objects, we automatically recover spatiotemporal correspondences to map one object to another (such as animals in the wild). While traditional methods based on appearance fail in such challenging conditions, we exploit consistency in object motion between instances. Our approach discovers pairs of short video intervals where the object moves in a consistent manner and uses these candidates as seeds for spatial alignment. We model the spatial correspondence between the point trajectories on the object in one interval to those in the other using a time-varying Thin Plate Spline deformation model. On a large dataset of tiger and horse videos, our method automatically aligns thousands of pairs of frames to a high accuracy, and outperforms the popular SIFT Flow algorithm.

CVNov 28, 2014
Articulated motion discovery using pairs of trajectories

Luca Del Pero, Susanna Ricco, Rahul Sukthankar et al.

We propose an unsupervised approach for discovering characteristic motion patterns in videos of highly articulated objects performing natural, unscripted behaviors, such as tigers in the wild. We discover consistent patterns in a bottom-up manner by analyzing the relative displacements of large numbers of ordered trajectory pairs through time, such that each trajectory is attached to a different moving part on the object. The pairs of trajectories descriptor relies entirely on motion and is more discriminative than state-of-the-art features that employ single trajectories. Our method generates temporal video intervals, each automatically trimmed to one instance of the discovered behavior, and clusters them by type (e.g., running, turning head, drinking water). We present experiments on two datasets: dogs from YouTube-Objects and a new dataset of National Geographic tiger videos. Results confirm that our proposed descriptor outperforms existing appearance- and trajectory-based descriptors (e.g., HOG and DTFs) on both datasets and enables us to segment unconstrained animal video into intervals containing single behaviors.

CVNov 27, 2014
Features in Concert: Discriminative Feature Selection meets Unsupervised Clustering

Marius Leordeanu, Alexandra Radu, Rahul Sukthankar

Feature selection is an essential problem in computer vision, important for category learning and recognition. Along with the rapid development of a wide variety of visual features and classifiers, there is a growing need for efficient feature selection and combination methods, to construct powerful classifiers for more complex and higher-level recognition tasks. We propose an algorithm that efficiently discovers sparse, compact representations of input features or classifiers, from a vast sea of candidates, with important optimality properties, low computational cost and excellent accuracy in practice. Different from boosting, we start with a discriminant linear classification formulation that encourages sparse solutions. Then we obtain an equivalent unsupervised clustering problem that jointly discovers ensembles of diverse features. They are independently valuable but even more powerful when united in a cluster of classifiers. We evaluate our method on the task of large-scale recognition in video and show that it significantly outperforms classical selection approaches, such as AdaBoost and greedy forward-backward selection, and powerful classifiers such as SVMs, in speed of training and performance, especially in the case of limited training data.

CVApr 2, 2014
Thoughts on a Recursive Classifier Graph: a Multiclass Network for Deep Object Recognition

Marius Leordeanu, Rahul Sukthankar

We propose a general multi-class visual recognition model, termed the Classifier Graph, which aims to generalize and integrate ideas from many of today's successful hierarchical recognition approaches. Our graph-based model has the advantage of enabling rich interactions between classes from different levels of interpretation and abstraction. The proposed multi-class system is efficiently learned using step by step updates. The structure consists of simple logistic linear layers with inputs from features that are automatically selected from a large pool. Each newly learned classifier becomes a potential new feature. Thus, our feature pool can consist both of initial manually designed features as well as learned classifiers from previous steps (graph nodes), each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation. Our proposed graph of classifiers becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.

LGJun 20, 2012
Bayesian Active Distance Metric Learning

Liu Yang, Rong Jin, Rahul Sukthankar

Distance metric learning is an important component for many tasks, such as statistical classification and content-based image retrieval. Existing approaches for learning distance metrics from pairwise constraints typically suffer from two major problems. First, most algorithms only offer point estimation of the distance metric and can therefore be unreliable when the number of training examples is small. Second, since these algorithms generally select their training examples at random, they can be inefficient if labeling effort is limited. This paper presents a Bayesian framework for distance metric learning that estimates a posterior distribution for the distance metric from labeled pairwise constraints. We describe an efficient algorithm based on the variational method for the proposed Bayesian approach. Furthermore, we apply the proposed Bayesian framework to active distance metric learning by selecting those unlabeled example pairs with the greatest uncertainty in relative distance. Experiments in classification demonstrate that the proposed framework achieves higher classification accuracy and identifies more informative training examples than the non-Bayesian approach and state-of-the-art distance metric learning algorithms.

CVFeb 16, 2012
Generalized Boundaries from Multiple Image Interpretations

Marius Leordeanu, Rahul Sukthankar, Cristian Sminchisescu

Boundary detection is essential for a variety of computer vision tasks such as segmentation and recognition. In this paper we propose a unified formulation and a novel algorithm that are applicable to the detection of different types of boundaries, such as intensity edges, occlusion boundaries or object category specific boundaries. Our formulation leads to a simple method with state-of-the-art performance and significantly lower computational cost than existing methods. We evaluate our algorithm on different types of boundaries, from low-level boundaries extracted in natural images, to occlusion boundaries obtained using motion cues and RGB-D cameras, to boundaries from soft-segmentation. We also propose a novel method for figure/ground soft-segmentation that can be used in conjunction with our boundary detection method and improve its accuracy at almost no extra computational cost.