Srikumar Ramalingam

CV
h-index38
37papers
1,829citations
Novelty51%
AI Score38

37 Papers

CVNov 30, 2023
Rethinking FID: Towards a Better Evaluation Metric for Image Generation

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit et al.

As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.

LGMar 9, 2022
The Combinatorial Brain Surgeon: Pruning Weights That Cancel One Another in Neural Networks

Xin Yu, Thiago Serra, Srikumar Ramalingam et al.

Neural networks tend to achieve better accuracy with training if they are larger -- even if the resulting models are overparameterized. Nevertheless, carefully removing such excess parameters before, during, or after training may also produce models with similar or even improved accuracy. In many cases, that can be curiously achieved by heuristics as simple as removing a percentage of the weights with the smallest absolute value -- even though magnitude is not a perfect proxy for weight relevance. With the premise that obtaining significantly better performance from pruning depends on accounting for the combined effect of removing multiple weights, we revisit one of the classic approaches for impact-based pruning: the Optimal Brain Surgeon(OBS). We propose a tractable heuristic for solving the combinatorial extension of OBS, in which we select weights for simultaneous removal, as well as a systematic update of the remaining weights. Our selection method outperforms other methods under high sparsity, and the weight update is advantageous even when combined with the other methods.

LGJan 28, 2023
Leveraging Importance Weights in Subset Selection

Gui Citovsky, Giulia DeSalvo, Sanjiv Kumar et al.

We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.

LGOct 28, 2022
When does mixup promote local linearity in learned representations?

Arslan Chaudhry, Aditya Krishna Menon, Andreas Veit et al.

Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance, and has been heavily used as part of semi-supervised learning techniques such as mixmatch~\citep{berthelot2019mixmatch} and interpolation consistent training (ICT)~\citep{verma2019interpolation}. In this paper, we look at Mixup through a \emph{representation learning} lens in a semi-supervised learning setup. In particular, we study the role of Mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the Mixup loss that enforces linearity in the \emph{last} network layer propagate the linearity to the \emph{earlier} layers?; and (2) how does the enforcement of stronger Mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of Mixup on vision datasets such as CIFAR-10, CIFAR-100 and SVHN. Our results show that supervised Mixup training does not make \emph{all} the network layers linear; in fact the \emph{intermediate layers} become more non-linear during Mixup training compared to a network that is trained \emph{without} Mixup. However, when Mixup is used as an unsupervised loss, we observe that all the network layers become more linear resulting in faster training convergence.

CVAug 14, 2023
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam et al.

Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5X and produce higher quality images by decreasing undesirable image artifacts.

LGFeb 15, 2021Code
Scaling Up Exact Neural Network Compression by ReLU Stability

Thiago Serra, Xin Yu, Abhinav Kumar et al.

We can compress a rectifier network while exactly preserving its underlying functionality with respect to a given input domain if some of its neurons are stable. However, current approaches to determine the stability of neurons with Rectified Linear Unit (ReLU) activations require solving or finding a good approximation to multiple discrete optimization problems. In this work, we introduce an algorithm based on solving a single optimization problem to identify all stable neurons. Our approach is on median 183 times faster than the state-of-art method on CIFAR-10, which allows us to explore exact compression on deeper (5 x 100) and wider (2 x 800) networks within minutes. For classifiers trained under an amount of L1 regularization that does not worsen accuracy, we can remove up to 56% of the connections on the CIFAR-10 dataset. The code is available at the following link, https://github.com/yuxwind/ExactCompression.

CVApr 2, 2019Code
3DRegNet: A Deep Neural Network for 3D Point Registration

G. Dias Pais, Srikumar Ramalingam, Venu Madhav Govindu et al.

We present 3DRegNet, a novel deep learning architecture for the registration of 3D scans. Given a set of 3D point correspondences, we build a deep neural network to address the following two challenges: (i) classification of the point correspondences into inliers/outliers, and (ii) regression of the motion parameters that align the scans into a common reference frame. With regard to regression, we present two alternative approaches: (i) a Deep Neural Network (DNN) registration and (ii) a Procrustes approach using SVD to estimate the transformation. Our correspondence-based approach achieves a higher speedup compared to competing baselines. We further propose the use of a refinement network, which consists of a smaller 3DRegNet as a refinement to improve the accuracy of the registration. Extensive experiments on two challenging datasets demonstrate that we outperform other methods and achieve state-of-the-art results. The code is available.

LGDec 17, 2023
A Weighted K-Center Algorithm for Data Subset Selection

Srikumar Ramalingam, Pranjal Awasthi, Sanjiv Kumar

The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to produce similar models as the ones trained with full data. Two prior methods are shown to achieve impressive results: (1) margin sampling that focuses on selecting points with high uncertainty, and (2) core-sets or clustering methods such as k-center for informative and diverse subsets. We are not aware of any work that combines these methods in a principled manner. To this end, we develop a novel and efficient factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions. To handle large datasets, we show a parallel algorithm to run on multiple machines with approximation guarantees. The proposed algorithm achieves similar or better performance compared to other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.

LGFeb 26, 2024
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

Maximilian Böther, Abraham Sebastian, Pranjal Awasthi et al.

Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion datapoint scale, even the subset may not fit a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.

CVJun 10, 2025
SurfR: Surface Reconstruction with Multi-scale Attention

Siddhant Ranade, Gonçalo Dias Pais, Ross Tyler Whitaker et al.

We propose a fast and accurate surface reconstruction algorithm for unorganized point clouds using an implicit representation. Recent learning methods are either single-object representations with small neural models that allow for high surface details but require per-object training or generalized representations that require larger models and generalize to newer shapes but lack details, and inference is slow. We propose a new implicit representation for general 3D shapes that is faster than all the baselines at their optimum resolution, with only a marginal loss in performance compared to the state-of-the-art. We achieve the best accuracy-speed trade-off using three key contributions. Many implicit methods extract features from the point cloud to classify whether a query point is inside or outside the object. First, to speed up the reconstruction, we show that this feature extraction does not need to use the query point at an early stage (lazy query). Second, we use a parallel multi-scale grid representation to develop robust features for different noise levels and input resolutions. Finally, we show that attention across scales can provide improved reconstruction results.

LGFeb 4, 2025
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh et al.

Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.

CVDec 24, 2024
LatentCRF: Continuous CRF for Efficient Latent Diffusion

Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit et al.

Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images, however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally-intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on, which does not require modifying the LDM.

LGFeb 7, 2024
Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo et al.

Recent breakthroughs and successful deployment of large language and vision models in a constrained environment predominantly follow a two phase approach. First, large models are trained to achieve peak performance, followed by a model shrinking method to meet hardware constraints; Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment. Our contribution is an effective architectural change, namely, {\it Majority Kernels} that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), Residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics resulting in performance gains across architectures and tasks while maintaining the inference performance consistent. Furthermore, our approach adds minimal overhead to the cost incurred (wall clock time) at training time. The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles as well as combinatorial optimization methods based on submodular optimization.

CVJun 10, 2021
Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold

Kieran Murphy, Carlos Esteves, Varun Jampani et al.

Single image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer by not completely modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly-symmetric objects. We require no supervision on pose uncertainty -- the model trains only with a single pose per example. Nonetheless, our implicit model is highly expressive to handle complex distributions over 3D poses, while still obtaining accurate pose estimation on standard non-ambiguous environments, achieving state-of-the-art performance on Pascal3D+ and ModelNet10-SO(3) benchmarks.

LGMay 19, 2021
Balancing Robustness and Sensitivity using Feature Contrastive Learning

Seungyeon Kim, Daniel Glasner, Srikumar Ramalingam et al.

It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets.

CVApr 26, 2021
Less is more: Selecting informative and diverse subsets with balancing constraints

Srikumar Ramalingam, Daniel Glasner, Kaushal Patel et al.

Deep learning has yielded extraordinary results in vision and natural language processing, but this achievement comes at a cost. Most models require enormous resources during training, both in terms of computation and in human labeling effort. We show that we can identify informative and diverse subsets of data that lead to deep learning models with similar performance as the ones trained with the original dataset. Prior methods have exploited diversity and uncertainty in submodular objective functions for choosing subsets. In addition to these measures, we show that balancing constraints on predicted class labels and decision boundaries are beneficial. We propose a novel formulation of these constraints using matroids, an algebraic structure that generalizes linear independence in vector spaces, and present an efficient greedy algorithm with constant approximation guarantees. We outperform competing baselines on standard classification datasets such as CIFAR-10, CIFAR-100, ImageNet, as well as long-tailed datasets such as CIFAR-100-LT.

LGMar 4, 2021
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation with weak supervision

Kieran A. Murphy, Varun Jampani, Srikumar Ramalingam et al.

Representational learning forms the backbone of most deep learning applications, and the value of a learned representation is intimately tied to its information content regarding different factors of variation. Finding good representations depends on the nature of supervision and the learning algorithm. We propose a novel algorithm that utilizes a weak form of supervision where the data is partitioned into sets according to certain inactive (common) factors of variation which are invariant across elements of each set. Our key insight is that by seeking correspondence between elements of different sets, we learn strong representations that exclude the inactive factors of variation and isolate the active factors that vary within all sets. As a consequence of focusing on the active factors, our method can leverage a mix of set-supervised and wholly unsupervised data, which can even belong to a different domain. We tackle the challenging problem of synthetic-to-real object pose transfer, without pose annotations on anything, by isolating pose information which generalizes to the category level and across the synthetic/real domain gap. The method can also boost performance in supervised settings, by strengthening intermediate representations, as well as operate in practically attainable scenarios with set-supervised natural images, where quantity is limited and nuisance factors of variation are more plentiful.

LGDec 8, 2020
Kernelized Classification in Deep Networks

Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar

We propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned feature vectors. We advocate a nonlinear classification layer by using the kernel trick on the softmax cross-entropy loss function during training and the scorer function during testing. However, the choice of the kernel remains a challenge. To tackle this, we theoretically show the possibility of optimizing over all possible positive definite kernels applicable to our problem setting. This theory is then used to device a new kernelized classification layer that learns the optimal kernel function for a given problem automatically within the deep network itself. We show the usefulness of the proposed nonlinear classification layer on several datasets and tasks.

CVOct 4, 2020
Mapping of Sparse 3D Data using Alternating Projection

Siddhant Ranade, Xin Yu, Shantnu Kakkar et al.

We propose a novel technique to register sparse 3D scans in the absence of texture. While existing methods such as KinectFusion or Iterative Closest Points (ICP) heavily rely on dense point clouds, this task is particularly challenging under sparse conditions without RGB data. Sparse texture-less data does not come with high-quality boundary signal, and this prohibits the use of correspondences from corners, junctions, or boundary lines. Moreover, in the case of sparse data, it is incorrect to assume that the same point will be captured in two consecutive scans. We take a different approach and first re-parameterize the point-cloud using a large number of line segments. In this re-parameterized data, there exists a large number of line intersection (and not correspondence) constraints that allow us to solve the registration task. We propose the use of a two-step alternating projection algorithm by formulating the registration as the simultaneous satisfaction of intersection and rigidity constraints. The proposed approach outperforms other top-scoring algorithms on both Kinect and LiDAR datasets. In Kinect, we can use 100X downsampled sparse data and still outperform competing methods operating on full-resolution data.

LGJan 1, 2020
Lossless Compression of Deep Neural Networks

Thiago Serra, Abhinav Kumar, Srikumar Ramalingam

Deep neural networks have been successful in many predictive modeling tasks, such as image and language recognition, where large neural networks are often used to obtain good accuracy. Consequently, it is challenging to deploy these networks under limited computational resources, such as in mobile devices. In this work, we introduce an algorithm that removes units and layers of a neural network while not changing the output that is produced, which thus implies a lossless compression. This algorithm, which we denote as LEO (Lossless Expressiveness Optimization), relies on Mixed-Integer Linear Programming (MILP) to identify Rectified Linear Units (ReLUs) with linear behavior over the input domain. By using L1 regularization to induce such behavior, we can benefit from training over a larger architecture than we would later use in the environment where the trained neural network is deployed.

CVJun 13, 2019
Can generalised relative pose estimation solve sparse 3D registration?

Siddhant Ranade, Xin Yu, Shantnu Kakkar et al.

Popular 3D scan registration projects, such as Stanford digital Michelangelo or KinectFusion, exploit the high-resolution sensor data for scan alignment. It is particularly challenging to solve the registration of sparse 3D scans in the absence of RGB components. In this case, we can not establish point correspondences since the same 3D point cannot be captured in two successive scans. In contrast to correspondence based methods, we take a different viewpoint and formulate the sparse 3D registration problem based on the constraints from the intersection of line segments from adjacent scans. We obtain the line segments by modeling every horizontal and vertical scan-line as piece-wise linear segments. We propose a new alternating projection algorithm for solving the scan alignment problem using line intersection constraints. We develop two new minimal solvers for scan alignment in the presence of plane correspondences: 1) 3 line intersections and 1 plane correspondence, and 2) 1 line intersection and 2 plane correspondences. We outperform other competing methods on Kinect and LiDAR datasets.

LGMay 27, 2019
Equivalent and Approximate Transformations of Deep Neural Networks

Abhinav Kumar, Thiago Serra, Srikumar Ramalingam

Two networks are equivalent if they produce the same output for any given input. In this paper, we study the possibility of transforming a deep neural network to another network with a different number of units or layers, which can be either equivalent, a local exact approximation, or a global linear approximation of the original network. On the practical side, we show that certain rectified linear units (ReLUs) can be safely removed from a network if they are always active or inactive for any valid input. If we only need an equivalent network for a smaller domain, then more units can be removed and some layers collapsed. On the theoretical side, we constructively show that for any feed-forward ReLU network, there exists a global linear approximation to a 2-hidden-layer shallow network with a fixed number of units. This result is a balance between the increasing number of units for arbitrary approximation with a single layer and the known upper bound of $\lceil log(n_0+1)\rceil +1$ layers for exact representation, where $n_0$ is the input dimension. While the transformed network may require an exponential number of units to capture the activation patterns of the original network, we show that it can be made substantially smaller by only accounting for the patterns that define linear regions. Based on experiments with ReLU networks on the MNIST dataset, we found that $l_1$-regularization and adversarial training reduces the number of linear regions significantly as the number of stable units increases due to weight sparsity. Therefore, we can also intentionally train ReLU networks to allow for effective loss-less compression and approximation.

CVApr 8, 2019
Minimal Solvers for Mini-Loop Closures in 3D Multi-Scan Alignment

Pedro Miraldo, Surojit Saha, Srikumar Ramalingam

3D scan registration is a classical, yet a highly useful problem in the context of 3D sensors such as Kinect and Velodyne. While there are several existing methods, the techniques are usually incremental where adjacent scans are registered first to obtain the initial poses, followed by motion averaging and bundle-adjustment refinement. In this paper, we take a different approach and develop minimal solvers for jointly computing the initial poses of cameras in small loops such as 3-, 4-, and 5-cycles. Note that the classical registration of 2 scans can be done using a minimum of 3 point matches to compute 6 degrees of relative motion. On the other hand, to jointly compute the 3D registrations in n-cycles, we take 2 point matches between the first n-1 consecutive pairs (i.e., Scan 1 & Scan 2, ... , and Scan n-1 & Scan n) and 1 or 2 point matches between Scan 1 and Scan n. Overall, we use 5, 7, and 10 point matches for 3-, 4-, and 5-cycles, and recover 12, 18, and 24 degrees of transformation variables, respectively. Using simulations and real-data we show that the 3D registration using mini n-cycles are computationally efficient, and can provide alternate and better initial poses compared to standard pairwise methods.

CVOct 8, 2018
Novel Single View Constraints for Manhattan 3D Line Reconstruction

Siddhant Ranade, Srikumar Ramalingam

This paper proposes a novel and exact method to reconstruct line-based 3D structure from a single image using Manhattan world assumption. This problem is a distinctly unsolved problem because there can be multiple 3D reconstructions from a single image. Thus, we are often forced to look for priors like Manhattan world assumption and common scene structures. In addition to the standard orthogonality, perspective projection, and parallelism constraints, we investigate a few novel constraints based on the physical realizability of the 3D scene structure. We treat the line segments in the image to be part of a graph similar to straws and connectors game, where the goal is to back-project the line segments in 3D space and while ensuring that some of these 3D line segments connect with each other (i.e., truly intersect in 3D space) to form the 3D structure. We consider three sets of novel constraints while solving the reconstruction: (1) constraints on a series of Manhattan line intersections that form cycles, but are not all physically realizable, (2) constraints on true and false intersections in the case of nearby lines lying on the same Manhattan plane, and (3) constraints from the intersections on boundary and non-boundary line segments. The reconstruction is achieved using mixed integer linear programming (MILP), and we show compelling results on real images. Along with this paper, we will release a challenging Single View Line Reconstruction dataset with ground truth 3D line models for research purposes.

LGOct 8, 2018
Empirical Bounds on Linear Regions of Deep Rectifier Networks

Thiago Serra, Srikumar Ramalingam

We can compare the expressiveness of neural networks that use rectified linear units (ReLUs) by the number of linear regions, which reflect the number of pieces of the piecewise linear functions modeled by such networks. However, enumerating these regions is prohibitive and the known analytical bounds are identical for networks with same dimensions. In this work, we approximate the number of linear regions through empirical bounds based on features of the trained network and probabilistic inference. Our first contribution is a method to sample the activation patterns defined by ReLUs using universal hash functions. This method is based on a Mixed-Integer Linear Programming (MILP) formulation of the network and an algorithm for probabilistic lower bounds of MILP solution sets that we call MIPBound, which is considerably faster than exact counting and reaches values in similar orders of magnitude. Our second contribution is a tighter activation-based bound for the maximum number of linear regions, which is particularly stronger in networks with narrow layers. Combined, these bounds yield a fast proxy for the number of linear regions of a deep neural network.

CVAug 6, 2018
Simultaneous Edge Alignment and Learning

Zhiding Yu, Weiyang Liu, Yang Zou et al.

Edge detection is among the most fundamental vision problems for its role in perceptual grouping and its wide applications. Recent advances in representation learning have led to considerable improvements in this area. Many state of the art edge detection models are learned with fully convolutional networks (FCNs). However, FCN-based edge learning tends to be vulnerable to misaligned labels due to the delicate structure of edges. While such problem was considered in evaluation benchmarks, similar issue has not been explicitly addressed in general edge learning. In this paper, we show that label misalignment can cause considerably degraded edge learning quality, and address this issue by proposing a simultaneous edge alignment and learning framework. To this end, we formulate a probabilistic model where edge alignment is treated as latent variable optimization, and is learned end-to-end during network training. Experiments show several applications of this work, including improved edge detection with state of the art performance, and automatic refinement of noisy annotations.

CVJul 26, 2018
A Minimal Closed-Form Solution for Multi-Perspective Pose Estimation using Points and Lines

Pedro Miraldo, Tiago Dias, Srikumar Ramalingam

We propose a minimal solution for pose estimation using both points and lines for a multi-perspective camera. In this paper, we treat the multi-perspective camera as a collection of rigidly attached perspective cameras. These type of imaging devices are useful for several computer vision applications that require a large coverage such as surveillance, self-driving cars, and motion-capture studios. While prior methods have considered the cases using solely points or lines, the hybrid case involving both points and lines has not been solved for multi-perspective cameras. We present the solutions for two cases. In the first case, we are given 2D to 3D correspondences for two points and one line. In the later case, we are given 2D to 3D correspondences for one point and two lines. We show that the solution for the case of two points and one line can be formulated as a fourth degree equation. This is interesting because we can get a closed-form solution and thereby achieve high computational efficiency. The later case involving two lines and one point can be mapped to an eighth degree equation. We show simulations and real experiments to demonstrate the advantages and benefits over existing methods.

CVJul 6, 2018
VLASE: Vehicle Localization by Aggregating Semantic Edges

Xin Yu, Sagar Chaturvedi, Chen Feng et al.

In this paper, we propose VLASE, a framework to use semantic edge features from images to achieve on-road localization. Semantic edge features denote edge contours that separate pairs of distinct objects such as building-sky, road- sidewalk, and building-ground. While prior work has shown promising results by utilizing the boundary between prominent classes such as sky and building using skylines, we generalize this approach to consider semantic edge features that arise from 19 different classes. Our localization algorithm is simple, yet very powerful. We extract semantic edge features using a recently introduced CASENet architecture and utilize VLAD framework to perform image retrieval. Our experiments show that we achieve improvement over some of the state-of-the-art localization algorithms such as SIFT-VLAD and its deep variant NetVLAD. We use ablation study to study the importance of different semantic classes and show that our unified approach achieves better performance compared to individual prominent features such as skylines.

OCJun 17, 2018
How Could Polyhedral Theory Harness Deep Learning?

Thiago Serra, Christian Tjandraatmadja, Srikumar Ramalingam

The holy grail of deep learning is to come up with an automatic method to design optimal architectures for different applications. In other words, how can we effectively dimension and organize neurons along the network layers based on the computational resources, input size, and amount of training data? We outline promising research directions based on polyhedral theory and mixed-integer representability that may offer an analytical approach to this question, in contrast to the empirical techniques often employed.

CVApr 25, 2018
Analytical Modeling of Vanishing Points and Curves in Catadioptric Cameras

Pedro Miraldo, Francisco Eiras, Srikumar Ramalingam

Vanishing points and vanishing lines are classical geometrical concepts in perspective cameras that have a lineage dating back to 3 centuries. A vanishing point is a point on the image plane where parallel lines in 3D space appear to converge, whereas a vanishing line passes through 2 or more vanishing points. While such concepts are simple and intuitive in perspective cameras, their counterparts in catadioptric cameras (obtained using mirrors and lenses) are more involved. For example, lines in the 3D space map to higher degree curves in catadioptric cameras. The projection of a set of 3D parallel lines converges on a single point in perspective images, whereas they converge to more than one point in catadioptric cameras. To the best of our knowledge, we are not aware of any systematic development of analytical models for vanishing points and vanishing curves in different types of catadioptric cameras. In this paper, we derive parametric equations for vanishing points and vanishing curves using the calibration parameters, mirror shape coefficients, and direction vectors of parallel lines in 3D space. We show compelling experimental results on vanishing point estimation and absolute pose estimation for a wide range of catadioptric cameras in both simulations and real experiments.

CVApr 5, 2018
Learning Strict Identity Mappings in Deep Residual Networks

Xin Yu, Zhiding Yu, Srikumar Ramalingam

A family of super deep networks, referred to as residual networks or ResNet, achieved record-beating performance in various visual tasks such as image recognition, object detection, and semantic segmentation. The ability to train very deep networks naturally pushed the researchers to use enormous resources to achieve the best performance. Consequently, in many applications super deep residual networks were employed for just a marginal improvement in performance. In this paper, we propose epsilon-ResNet that allows us to automatically discard redundant layers, which produces responses that are smaller than a threshold epsilon, with a marginal or no loss in performance. The epsilon-ResNet architecture can be achieved using a few additional rectified linear units in the original ResNet. Our method does not use any additional variables nor numerous trials like other hyper-parameter optimization techniques. The layer selection is achieved using a single training process and the evaluation is performed on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. In some instances, we achieve about 80% reduction in the number of parameters.

CVMar 30, 2018
Class Subset Selection for Transfer Learning using Submodularity

Varun Manjunatha, Srikumar Ramalingam, Tim K. Marks et al.

In recent years, it is common practice to extract fully-connected layer (fc) features that were learned while performing image classification on a source dataset, such as ImageNet, and apply them generally to a wide range of other tasks. The general usefulness of some large training datasets for transfer learning is not yet well understood, and raises a number of questions. For example, in the context of transfer learning, what is the role of a specific class in the source dataset, and how is the transferability of fc features affected when they are trained using various subsets of the set of all classes in the source dataset? In this paper, we address the question of how to select an optimal subset of the set of classes, subject to a budget constraint, that will more likely generate good features for other tasks. To accomplish this, we use a submodular set function to model the accuracy achievable on a new task when the features have been learned on a given subset of classes of the source dataset. An optimal subset is identified as the set that maximizes this submodular function. The maximization can be accomplished using an efficient greedy algorithm that comes with guarantees on the optimality of the solution. We empirically validate our submodular model by successfully identifying subsets of classes that produce good features for new tasks.

LGNov 6, 2017
Bounding and Counting Linear Regions of Deep Neural Networks

Thiago Serra, Christian Tjandraatmadja, Srikumar Ramalingam

We investigate the complexity of deep neural networks (DNN) that represent piecewise linear (PWL) functions. In particular, we study the number of linear regions, i.e. pieces, that a PWL function represented by a DNN can attain, both theoretically and empirically. We present (i) tighter upper and lower bounds for the maximum number of linear regions on rectifier networks, which are exact for inputs of dimension one; (ii) a first upper bound for multi-layer maxout networks; and (iii) a first method to perform exact enumeration or counting of the number of regions by modeling the DNN with a mixed-integer linear formulation. These bounds come from leveraging the dimension of the space defining each linear region. The results also indicate that a deep rectifier network can only have more linear regions than every shallow counterpart with same number of neurons if that number exceeds the dimension of the input.

AIJun 28, 2017
Submodular Function Maximization for Group Elevator Scheduling

Srikumar Ramalingam, Arvind U. Raghunathan, Daniel Nikovski

We propose a novel approach for group elevator scheduling by formulating it as the maximization of submodular function under a matroid constraint. In particular, we propose to model the total waiting time of passengers using a quadratic Boolean function. The unary and pairwise terms in the function denote the waiting time for single and pairwise allocation of passengers to elevators, respectively. We show that this objective function is submodular. The matroid constraints ensure that every passenger is allocated to exactly one elevator. We use a greedy algorithm to maximize the submodular objective function, and derive provable guarantees on the optimality of the solution. We tested our algorithm using Elevate 8, a commercial-grade elevator simulator that allows simulation with a wide range of elevator settings. We achieve significant improvement over the existing algorithms.

CVMay 27, 2017
CASENet: Deep Category-Aware Semantic Edge Detection

Zhiding Yu, Chen Feng, Ming-Yu Liu et al.

Boundary and edge cues are highly beneficial in improving a wide variety of vision tasks such as semantic segmentation, object recognition, stereo, and object proposal generation. Recently, the problem of edge detection has been revisited and significant progress has been made with deep learning. While classical edge detection is a challenging binary problem in itself, the category-aware semantic edge detection by nature is an even more challenging multi-label problem. We model the problem such that each edge pixel can be associated with more than one class as they appear in contours or junctions belonging to two or more semantic classes. To this end, we propose a novel end-to-end deep semantic edge learning architecture based on ResNet and a new skip-layer architecture where category-wise edge activations at the top convolution layer share and are fused with the same set of bottom layer features. We then propose a multi-label loss function to supervise the fused activations. We show that our proposed architecture benefits this problem with better performance, and we outperform the current state-of-the-art semantic edge detection methods by a large margin on standard data sets such as SBD and Cityscapes.

RONov 3, 2015
High-Performance and Tunable Stereo Reconstruction

Sudeep Pillai, Srikumar Ramalingam, John J. Leonard

Traditional stereo algorithms have focused their efforts on reconstruction quality and have largely avoided prioritizing for run time performance. Robots, on the other hand, require quick maneuverability and effective computation to observe its immediate environment and perform tasks within it. In this work, we propose a high-performance and tunable stereo disparity estimation method, with a peak frame-rate of 120Hz (VGA resolution, on a single CPU-thread), that can potentially enable robots to quickly reconstruct their immediate surroundings and maneuver at high-speeds. Our key contribution is a disparity estimation algorithm that iteratively approximates the scene depth via a piece-wise planar mesh from stereo imagery, with a fast depth validation step for semi-dense reconstruction. The mesh is initially seeded with sparsely matched keypoints, and is recursively tessellated and refined as needed (via a resampling stage), to provide the desired stereo disparity accuracy. The inherent simplicity and speed of our approach, with the ability to tune it to a desired reconstruction quality and runtime performance makes it a compelling solution for applications in high-speed vehicles.

CVJun 15, 2015
Layered Interpretation of Street View Images

Ming-Yu Liu, Shuoxin Lin, Srikumar Ramalingam et al.

We propose a layered street view model to encode both depth and semantic information on street view images for autonomous driving. Recently, stixels, stix-mantics, and tiered scene labeling methods have been proposed to model street view images. We propose a 4-layer street view model, a compact representation over the recently proposed stix-mantics model. Our layers encode semantic classes like ground, pedestrians, vehicles, buildings, and sky in addition to the depths. The only input to our algorithm is a pair of stereo images. We use a deep neural network to extract the appearance features for semantic classes. We use a simple and an efficient inference algorithm to jointly estimate both semantic classes and layered depth values. Our method outperforms other competing approaches in Daimler urban scene segmentation dataset. Our algorithm is massively parallelizable, allowing a GPU implementation with a processing speed about 9 fps.