Manoj Kumar

CV
h-index: 52
32 papers
2,456 citations
Novelty: 46%
AI Score: 49

32 Papers

CV · Feb 10, 2023
Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa et al. · deepmind

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

CV · Jul 10, 2024
PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto et al. · deepmind, oxford

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

CL · Jan 24, 2023
Large language models can segment narrative events similarly to humans

Sebastian Michelmann, Manoj Kumar, Kenneth A. Norman et al. · cmu

Humans perceive discrete events such as "restaurant visits" and "train rides" in their continuous experience. One important prerequisite for studying human event perception is the ability of researchers to quantify when one event ends and another begins. Typically, this information is derived by aggregating behavioral annotations from several observers. Here we present an alternative computational approach where event boundaries are derived using a large language model, GPT-3, instead of using human annotations. We demonstrate that GPT-3 can segment continuous narrative text into events. GPT-3-annotated events are significantly correlated with human event annotations. Furthermore, these GPT-derived annotations achieve a good approximation of the "consensus" solution (obtained by averaging across human annotations); the boundaries identified by GPT-3 are closer to the consensus, on average, than boundaries identified by individual human annotators. This finding suggests that GPT-3 provides a feasible solution for automated event annotations, and it demonstrates a further parallel between human cognition and prediction in large language models. In the future, GPT-3 may thereby help to elucidate the principles underlying human event perception.

SY · Jun 3
A Survey of Smart Grid Emerging Use Cases and Relevant 5G and 6G Capabilities and Features

Manoj Kumar, Nishith D. Tripathi, Jeffrey H. Reed

The growing complexity of modern energy systems has led to the adoption of Smart Grids (SGs) that use advanced communication technologies to facilitate efficient, reliable, secure, and sustainable energy operation and management. Unlike existing surveys that often treat grid and communication domains separately, this work rigorously quantifies service requirements for high-complexity emerging scenarios. It provides a comprehensive overview of SG architecture that integrates digital communication infrastructure with distributed energy resources (DERs), microgrids, energy storage systems, and cybersecurity frameworks. Furthermore, emerging SG use cases such as smart distributed voltage control, real-time fault detection and self-healing, smart and autonomous monitoring, and predictive maintenance are identified, and, more importantly, the service performance requirements associated with these use cases are quantified. Additionally, key capabilities and emerging SG enablers of fifth-generation (5G) and sixth-generation (6G) networks are described. These capabilities and enablers include network slicing, edge computing, spectrum management, artificial intelligence (AI) driven optimization, digital twins, and Open-Radio Access Network (O-RAN). Finally, the paper discusses open challenges and future research directions for designing scalable, intelligent, and secure next-generation SG systems.

CV · Jun 13, 2023
Image Captioners Are Scalable Vision Learners Too

Michael Tschannen, Manoj Kumar, Andreas Steiner et al.

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.

CV · Mar 9, 2022
Do better ImageNet classifiers assess perceptual similarity better?

Manoj Kumar, Neil Houlsby, Nal Kalchbrenner et al.

Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics on assessing perceptual similarity. While the capabilities of older and less accurate models such as AlexNet and VGG to capture perceptual similarity are well known, modern and more accurate models are less studied. In this paper, we present a large-scale empirical study to assess how well ImageNet classifiers perform on perceptual similarity. First, we observe an inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is, better classifiers achieve worse Perceptual Scores. Then, we examine the ImageNet accuracy/Perceptual Score relationship on varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracies and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using a number of plausible hypotheses such as distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly, we discover shallow ResNets and ResNets trained for less than 5 epochs only on ImageNet, whose emergent Perceptual Score matches the prior best networks trained directly on supervised human perceptual judgements. The checkpoints for the models in our study are available at https://console.cloud.google.com/storage/browser/gresearch/perceptual_similarity.

ML · Oct 2, 2022
A Unified Framework for Optimization-Based Graph Coarsening

Manoj Kumar, Anurag Sharma, Sandeep Kumar

Graph coarsening is a widely used dimensionality reduction technique for approaching large-scale graph machine learning problems. Given a large graph, graph coarsening aims to learn a smaller, tractable graph while preserving the properties of the original graph. Graph data consist of node features and a graph matrix (e.g., adjacency or Laplacian). Existing graph coarsening methods ignore the node features and rely solely on the graph matrix to simplify graphs. In this paper, we introduce a novel optimization-based framework for graph dimensionality reduction. The proposed framework unifies graph learning and dimensionality reduction. It takes both the graph matrix and the node features as input and jointly learns the coarsened graph matrix and the coarsened feature matrix while ensuring desired properties. The proposed optimization formulation is a multi-block non-convex optimization problem, which is solved efficiently by leveraging block majorization-minimization, log-determinant, Dirichlet energy, and regularization frameworks. The proposed algorithms are provably convergent and practically amenable to numerous tasks. It is also established that the learned coarsened graph is $\epsilon$-similar to the original graph, with $\epsilon \in (0,1)$. Extensive experiments elucidate the efficacy of the proposed framework for real-world applications.
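To make the objects in this abstract concrete, here is a minimal sketch of coarsening with a fixed hard assignment of nodes to supernodes. It is not the paper's optimization procedure: the assignment matrix `C`, the convention `L_c = C^T L C`, the per-supernode feature averaging, and the 4-node path graph are all assumptions chosen for illustration. The Dirichlet energy is computed as tr(X^T L X).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def coarsen_laplacian(laplacian, assignment):
    """Assumed convention: L_c = C^T L C for a 0/1 assignment matrix C."""
    return matmul(matmul(transpose(assignment), laplacian), assignment)

def coarsen_features(features, assignment):
    """Average the feature rows of the nodes merged into each supernode."""
    ct = transpose(assignment)
    sums = matmul(ct, features)
    sizes = [sum(row) for row in ct]
    return [[v / size for v in row] for row, size in zip(sums, sizes)]

def dirichlet_energy(laplacian, features):
    """tr(X^T L X), a measure of how smooth the features are on the graph."""
    lx = matmul(laplacian, features)
    return sum(x * y for xr, yr in zip(features, lx) for x, y in zip(xr, yr))

# Hypothetical example: path graph 0-1-2-3 with unit edge weights.
L = [[1, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 1]]
X = [[0.0], [0.0], [1.0], [1.0]]      # one feature per node
C = [[1, 0], [1, 0], [0, 1], [0, 1]]  # nodes {0,1} and {2,3} are merged

L_c = coarsen_laplacian(L, C)
X_c = coarsen_features(X, C)
energy_original = dirichlet_energy(L, X)
energy_coarse = dirichlet_energy(L_c, X_c)
```

In this toy case the coarsened Laplacian is again a valid Laplacian (one edge between the two supernodes), and the Dirichlet energy is unchanged because the merged nodes share identical features. The framework described in the abstract learns the coarsened matrices jointly rather than fixing the assignment in advance.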

AI · Jun 25, 2022
Functional Optimization Reinforcement Learning for Real-Time Bidding

Yining Lu, Changjie Lu, Naina Bandyopadhyay et al.

Real-time bidding (RTB) is the new paradigm of programmatic advertising. An advertiser wants to make the intelligent choice of utilizing a Demand-Side Platform to improve the performance of their ad campaigns. Existing approaches struggle to provide a satisfactory solution for bidding optimization due to stochastic bidding behavior. In this paper, we propose a multi-agent reinforcement learning architecture for RTB with functional optimization. We design a four-agent bidding environment: three Lagrange-multiplier-based functional optimization agents and one baseline agent (without any attribute of functional optimization). First, numerous attributes are assigned to each agent, including biased or unbiased win probability, Lagrange multiplier, and click-through rate. To evaluate the performance of the proposed RTB strategy, we demonstrate the results on ten sequential simulated auction campaigns. The results show that agents with functional actions and rewards had the most significant average winning rate and winning surplus, given biased and unbiased winning information respectively. The experimental evaluations show that our approach significantly improves the campaign's efficacy and profitability.

CV · Feb 2, 2023
Dual PatchNorm

Manoj Kumar, Mostafa Dehghani, Neil Houlsby

We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
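The placement described in this abstract (LayerNorm, then patch embedding, then LayerNorm) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names, the patch and embedding sizes, and the random weights are hypothetical, and the learned scale and shift parameters of a real LayerNorm are omitted for brevity.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (near) unit variance.
    Learned scale/shift parameters are left out of this sketch."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    """Dense projection; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def dual_patchnorm_embed(patches, weight, bias):
    """Apply LayerNorm -> patch embedding -> LayerNorm to each flattened patch."""
    return [layer_norm(linear(layer_norm(p), weight, bias)) for p in patches]

random.seed(0)
patch_dim, embed_dim = 12, 4  # hypothetical sizes, e.g. a flattened 2x2 RGB patch
weight = [[random.gauss(0, 0.1) for _ in range(patch_dim)]
          for _ in range(embed_dim)]
bias = [0.0] * embed_dim
patches = [[random.random() for _ in range(patch_dim)] for _ in range(3)]

tokens = dual_patchnorm_embed(patches, weight, bias)
```

The resulting `tokens` would then be fed to the Transformer blocks unchanged; the only difference from a standard patch embedding in this sketch is the pair of normalizations wrapped around the projection.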

CL · Mar 4, 2022
IISERB Brains at SemEval 2022 Task 6: A Deep-learning Framework to Identify Intended Sarcasm in English

Tanuj Singh Shekhawat, Manoj Kumar, Udaybhan Rathore et al.

This paper describes the system architectures and the models submitted by our team "IISERBBrains" to the SemEval 2022 Task 6 competition. We contested all three sub-tasks floated for the English dataset. On the leaderboard, we ranked 19th out of 43 teams for sub-task A, 8th out of 22 teams for sub-task B, and 13th out of 16 teams for sub-task C. Apart from the submitted results and models, we also report the other models and results that we obtained through our experiments after the organizers published the gold labels of their evaluation data.

CV · Feb 8, 2021 · Code
Colorization Transformer

Manoj Kumar, Dirk Weissenborn, Nal Kalchbrenner

We present the Colorization Transformer, a novel approach for diverse high fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition grayscale input. Two subsequent fully parallel networks upsample the coarse colored low resolution image into a finely colored high resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorising ImageNet based on FID results and based on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. The code and pre-trained checkpoints for Colorization Transformer are publicly available at https://github.com/google-research/google-research/tree/master/coltran

AS · Jul 31, 2020 · Code
Designing Neural Speaker Embeddings with Meta Learning

Manoj Kumar, Tae Jin Park, Somer Bishop et al.

Neural speaker embeddings trained using classification objectives have demonstrated state-of-the-art performance in multiple applications. Typically, such embeddings are trained on an out-of-domain corpus on a single task e.g., speaker classification, albeit with a large number of classes (speakers). In this work, we reformulate embedding training under the meta-learning paradigm. We redistribute the training corpus as an ensemble of multiple related speaker classification tasks, and learn a representation that generalizes better to unseen speakers. First, we develop an open source toolkit to train x-vectors that is matched in performance with pre-trained Kaldi models for speaker diarization and speaker verification applications. We find that different bottleneck layers in the architecture variedly favor different applications. Next, we use two meta-learning strategies, namely prototypical networks and relation networks, to improve over the x-vector embeddings. Our best performing model achieves a relative improvement of 12.37% and 7.11% in speaker error on the DIHARD II development corpus and the AMI meeting corpus, respectively. We analyze improvements across different domains in the DIHARD corpus. Notably, on the challenging child speech domain, we study the relation between child age and the diarization performance. Further, we show reductions in equal error rate for speaker verification on the SITW corpus (7.68%) and the VOiCES challenge corpus (8.78%). We observe that meta-learning particularly offers benefits in challenging acoustic conditions and recording setups encountered in these corpora. Our experiments illustrate the applicability of meta-learning as a generalized learning paradigm for training deep neural speaker embeddings.

LG · Jul 9, 2024
Modularity aided consistent attributed graph clustering via coarsening

Samarth Bhatia, Yukti Makhija, Manoj Kumar et al.

Graph clustering is an important unsupervised learning technique for partitioning graphs with attributes and detecting communities. However, current methods struggle to accurately capture true community structures and intra-cluster relations, be computationally efficient, and identify smaller communities. We address these challenges by integrating coarsening and modularity maximization, effectively leveraging both adjacency and node features to enhance clustering accuracy. We propose a loss function incorporating log-determinant, smoothness, and modularity components using a block majorization-minimization technique, resulting in superior clustering outcomes. The method is theoretically consistent under the Degree-Corrected Stochastic Block Model (DC-SBM), ensuring asymptotic error-free performance and complete label recovery. Our provably convergent and time-efficient algorithm seamlessly integrates with graph neural networks (GNNs) and variational graph autoencoders (VGAEs) to learn enhanced node features and deliver exceptional clustering performance. Extensive experiments on benchmark datasets demonstrate its superiority over existing state-of-the-art methods for both attributed and non-attributed graphs.

CV · Mar 15, 2024
Frozen Feature Augmentation for Few-Shot Image Classification

Andreas Bär, Neil Houlsby, Mostafa Dehghani et al.

Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper, we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space, dubbed 'frozen feature augmentation (FroFA)', covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA, such as brightness, can improve few-shot performance consistently across three network architectures, three large pretraining datasets, and eight transfer datasets.

PE · Sep 1, 2024
Analysis of a mathematical model for malaria using data-driven approach

Adithya Rajnarayanan, Manoj Kumar, Abdessamad Tridane

Malaria is one of the deadliest diseases in the world; every year, millions of people fall victim to it and many lose their lives. Medical professionals and governments can take accurate measures to protect people only when the disease dynamics are clearly understood. In this work, we propose a compartmental model to study the dynamics of malaria. We consider a transmission rate that depends on temperature and altitude. We perform a steady-state analysis of the proposed model and check the stability of the disease-free and endemic steady states. Following the mathematical analysis, an artificial neural network (ANN) is applied to the formulated model to predict the trajectory of all five compartments. Three different neural network architectures, namely an artificial neural network (ANN), a convolutional neural network (CNN), and a recurrent neural network (RNN), are used to estimate these parameters from the trajectory of the data. To understand the severity of a disease, it is essential to calculate the risk associated with it. In this work, the risk is calculated using dynamic mode decomposition (DMD) from the trajectory of the infected people.

CV · Nov 27, 2024
XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration

Denys Rozumnyi, Nadine Bertsch, Othman Sbai et al.

Tracking the full body motions of users in XR (AR/VR) devices is a fundamental challenge to bring a sense of authentic social presence. Due to the absence of dedicated leg sensors, currently available body tracking methods adopt a synthesis approach to generate plausible motions given a 3-point signal from the head and controller tracking. In order to enable mixed reality features, modern XR devices are capable of estimating depth information of the headset surroundings using available sensors combined with dedicated machine learning models. Such egocentric depth sensing cannot drive the body directly, as it is not registered and is incomplete due to limited field-of-view and body self-occlusions. For the first time, we propose to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full body motions in real time on XR devices. We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation. These modules are trained jointly in a self-supervised way, leveraging a combination of real unregistered point clouds and simulated data obtained from motion capture. We compare our approach against several state-of-the-art systems for XR body tracking and show that our method accurately tracks a diverse range of body motions. XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind.

LG · Aug 26, 2025
Generalization Bound for a General Class of Neural Ordinary Differential Equations

Madhusudan Verma, Manoj Kumar

Neural ordinary differential equations (neural ODEs) are a popular type of deep learning model that operate with continuous-depth architectures. To assess how well such models perform on unseen data, it is crucial to understand their generalization error bounds. Previous research primarily focused on the linear case for the dynamics function in neural ODEs (Marion, 2023), or provided bounds for Neural Controlled ODEs that depend on the sampling interval (Bleistein et al., 2023). In this work, we analyze a broader class of neural ODEs where the dynamics function is a general nonlinear function, either time dependent or time independent, and is Lipschitz continuous with respect to the state variables. We show that, under this Lipschitz condition, the solutions to neural ODEs have bounded variation. Based on this observation, we establish generalization bounds for both time-dependent and time-independent cases and investigate how overparameterization and domain constraints influence these bounds. To our knowledge, this is the first derivation of generalization bounds for neural ODEs with general nonlinear dynamics.

CV · May 23, 2024
Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations

Manoj Kumar, Neil Houlsby, Emiel Hoogeboom

Generating image variations, where a model produces variations of an input image while preserving the semantic context, has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model Semantica receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in extracting relevant context from the input image. Once trained, Semantica can adaptively generate new images from a dataset by simply using images from that dataset as input. Finally, we identify limitations in standard image consistency metrics for evaluating image variations and propose alternative metrics based on few-shot generation.

LG · Nov 14, 2021
Skillful Twelve Hour Precipitation Forecasts using Large Context Neural Networks

Lasse Espeholt, Shreya Agrawal, Casper Sønderby et al.

The problem of forecasting weather has been scientifically studied for centuries due to its high impact on human lives, transportation, food production and energy management, among others. Current operational forecasting models are based on physics and use supercomputers to simulate the atmosphere to make forecasts hours and days in advance. Better physics-based forecasts require improvements in the models themselves, which can be a substantial scientific challenge, as well as improvements in the underlying resolution, which can be computationally prohibitive. An emerging class of weather models based on neural networks represents a paradigm shift in weather forecasting: the models learn the required transformations from data instead of relying on hand-coded physics and are computationally efficient. For neural models, however, each additional hour of lead time poses a substantial challenge as it requires capturing ever larger spatial contexts and increases the uncertainty of the prediction. In this work, we present a neural network that is capable of large-scale precipitation forecasting up to twelve hours ahead and, starting from the same atmospheric state, the model achieves greater skill than the state-of-the-art physics-based models HRRR and HREF that currently operate in the Continental United States. Interpretability analyses reinforce the observation that the model learns to emulate advanced physics principles. These results represent a substantial step towards establishing a new paradigm of efficient forecasting with neural networks.

CR · Sep 20, 2021
Encrypted Data Processing

Jessica Tseng, Gianfranco Bilardi, Kattamuri Ekanadham et al.

In this paper, we present a comprehensive architecture for confidential computing, which we show to be general purpose and quite efficient. It executes the application as is, without any added burden or discipline requirements from the application developers. Furthermore, it does not require the trust of system software at the computing server and does not impose any added burden on the communication subsystem. The proposed Encrypted Data Processing (EDAP) architecture accomplishes confidentiality, authenticity, and freshness of the key-based cryptographic data protection by adopting data encryption with a multi-level key protection scheme. It guarantees that the user data is visible only in non-privileged mode to a designated program trusted by the data owner on a designated hardware, thus protecting the data from an untrusted hardware, hypervisor, OS, or other users' applications. The cryptographic keys and protocols used for achieving these confidential computing requirements are described in a use case example. Encrypting and decrypting data in an EDAP-enabled processor can lead to performance degradation as it adds cycle time to the overall execution. However, our simulation result shows that the slowdown is only 6% on average across a collection of commercial workloads when the data encryption engine is placed between the L1 and L2 cache. We demonstrate that the EDAP architecture is valuable and practicable in the modern cloud environment for confidential computing. EDAP delivers a zero trust model of computing where the user software does not trust system software and vice versa.

CL · Jan 28, 2021
ProtoDA: Efficient Transfer Learning for Few-Shot Intent Classification

Manoj Kumar, Varun Kumar, Hadrien Glaude et al.

Practical sequence classification tasks in natural language processing often suffer from low training data availability for target classes. Recent works towards mitigating this problem have focused on transfer learning using embeddings pre-trained on often unrelated tasks, for instance, language modeling. We adopt an alternative approach by transfer learning on an ensemble of related tasks using prototypical networks under the meta-learning paradigm. Using intent classification as a case study, we demonstrate that increasing variability in training tasks can significantly improve classification performance. Further, we apply data augmentation in conjunction with meta-learning to reduce sampling bias. We make use of a conditional generator for data augmentation that is trained directly using the meta-learning objective and simultaneously with prototypical networks, hence ensuring that data augmentation is customized to the task. We explore augmentation in the sentence embedding space as well as the prototypical embedding space. Combining meta-learning with augmentation provides up to 6.49% and 8.53% relative F1-score improvements over the best performing systems in 5-shot and 10-shot learning, respectively.

AS · Jul 19, 2020
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization

Monisankha Pal, Manoj Kumar, Raghuveer Peri et al.

The performance of most speaker diarization systems with x-vector embeddings is both vulnerable to noisy environments and lacks domain robustness. Earlier work on speaker diarization using a generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we fetch the pre-trained encoder from the ClusterGAN and fine-tune it using a prototypical loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, DIHARD II (dev set), which is a challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) related to the autism spectrum disorder domain. Extensive analyses of the experimental data are done to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with a normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we employ embedding fusion with x-vectors to provide further improvement in diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. Besides, the MCGAN embeddings perform better than x-vectors and ClusterGAN at estimating the number of speakers and at diarizing short speech segments in telephonic data.

AS · Mar 5, 2020
Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap

Tae Jin Park, Kyu J. Han, Manoj Kumar et al.

In this study, we propose a new spectral clustering framework that can auto-tune the parameters of the clustering algorithm in the context of speaker diarization. The proposed framework uses normalized maximum eigengap (NME) values to estimate the number of clusters and the parameters for the threshold of the elements of each row in an affinity matrix during spectral clustering, without the use of parameter tuning on the development set. Even with this hands-off approach, we achieve comparable or better performance across various evaluation sets than the results found using traditional clustering methods that apply careful parameter tuning and development data. A relative improvement of 17% in the speaker error rate on the well-known CALLHOME evaluation set shows the effectiveness of our proposed spectral clustering with auto-tuning.

AS · Oct 25, 2019
Learning Domain Invariant Representations for Child-Adult Classification from Speech

Rimita Lahiri, Manoj Kumar, Somer Bishop et al.

Diagnostic procedures for ASD (autism spectrum disorder) involve semi-naturalistic interactions between the child and a clinician. Computational methods to analyze these sessions require an end-to-end speech and language processing pipeline that goes from raw audio to clinically-meaningful behavioral features. An important component of this pipeline is the ability to automatically detect who is speaking when, i.e., perform child-adult speaker classification. This binary classification task is often confounded due to variability associated with the participants' speech and background conditions. Further, scarcity of training data often restricts direct application of conventional deep learning methods. In this work, we address two major sources of variability - age of the child and data source collection location - using domain adversarial learning, which does not require labeled target domain data. We use two methods, generative adversarial training with inverted label loss and gradient reversal layer, to learn speaker embeddings invariant to the above sources of variability, and analyze different conditions under which the proposed techniques improve over conventional learning methods. Using a large corpus of ADOS-2 (autism diagnostic observation schedule, 2nd edition) sessions, we demonstrate up to 13.45% and 6.44% relative improvements over conventional learning methods.

ASOct 24, 2019
A study of semi-supervised speaker diarization system using gan mixture model

Monisankha Pal, Manoj Kumar, Raghuveer Peri et al.

We propose a new speaker diarization system based on a recently introduced unsupervised clustering technique, namely the generative adversarial network mixture model (GANMM). The proposed system uses x-vectors as the front-end representation. Spectral embedding is used for dimensionality reduction, followed by k-means initialization during GANMM pre-training. GANMM performs unsupervised speaker clustering by efficiently capturing complex data distributions. Experimental results on the AMI meeting corpus show that the proposed semi-supervised diarization system matches or exceeds the performance of competitive baselines. On an evaluation set containing fifty sessions with varying durations, the best achieved average diarization error rate (DER) is 17.11%, a relative improvement of 33% over the information bottleneck baseline and comparable to the x-vector baseline.

ASOct 24, 2019
Meta-learning for robust child-adult classification from speech

Nithin Rao Koluguri, Manoj Kumar, So Hyun Kim et al.

Computational modeling of naturalistic conversations in clinical applications has seen growing interest in the past decade. An important use-case involves child-adult interactions within the autism diagnosis and intervention domain. In this paper, we address a specific sub-problem of speaker diarization, namely child-adult speaker classification in such dyadic conversations with specified roles. Training a speaker classification system robust to speaker and channel conditions is challenging due to inherent variability in the speech within children and the adult interlocutors. In this work, we propose the use of meta-learning, in particular prototypical networks, which optimize a metric space across multiple tasks. By modeling every child-adult pair in the training set as a separate task during meta-training, we learn a representation with improved generalizability compared to conventional supervised learning. We demonstrate improvements over state-of-the-art speaker embeddings (x-vectors) under two evaluation settings: weakly supervised classification (up to 14.53% relative improvement in F1-scores) and clustering (up to 9.66% relative improvement in cluster purity). Our results show that protonets can potentially extract robust speaker embeddings for child-adult classification from speech.
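
To make the prototypical-network idea concrete, the following is a minimal sketch of the classification rule only: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The 2-D vectors, the labels, and the function names (`prototype`, `sq_dist`, `classify`) are hypothetical placeholders, not the paper's embeddings or code, and the meta-training of the embedding network is not shown.

```python
from math import fsum


def prototype(vectors):
    """Return the element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [fsum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, support):
    """Assign `query` to the class whose prototype is closest.

    support: dict mapping class label -> list of support embeddings.
    """
    protos = {label: prototype(vecs) for label, vecs in support.items()}
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Toy episode for one child-adult pair (hypothetical 2-D embeddings).
support = {
    "child": [[0.9, 0.1], [1.1, -0.1]],
    "adult": [[-1.0, 0.2], [-0.8, -0.2]],
}
print(classify([0.7, 0.0], support))  # prints "child"
```

In this toy episode the two prototypes sit near (1.0, 0.0) and (-0.9, 0.0), so the query at (0.7, 0.0) falls on the "child" side.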

ASOct 24, 2019
Speaker diarization using latent space clustering in generative adversarial network

Monisankha Pal, Manoj Kumar, Raghuveer Peri et al.

In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus and on two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions, respectively, obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperforms the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on the AMI eval, ADOS and BOSCC corpora, respectively, when compared to the x-vector baseline using oracle speech segmentation.

CVMar 4, 2019
VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation

Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan et al.

Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modelling of video.

CLJun 8, 2018
Measuring Conversational Productivity in Child Forensic Interviews

Victor Ardulov, Manoj Kumar, Shanna Williams et al.

Child Forensic Interviewing (FI) presents a challenge for effective information retrieval and decision making. The high stakes associated with the process demand that expert legal interviewers are able to effectively establish a channel of communication and elicit substantive knowledge from the child-client while minimizing potential for experiencing trauma. As a first step toward computationally modeling and producing quality spoken interviewing strategies and a generalized understanding of interview dynamics, we propose a novel methodology to computationally model effectiveness criteria, by applying summarization and topic modeling techniques to objectively measure and rank the responsiveness and conversational productivity of a child during FI. We score information retrieval by constructing an agenda to represent general topics of interest and measuring alignment with a given response and leveraging lexical entrainment for responsiveness. For comparison, we present our methods along with traditional metrics of evaluation and discuss the use of prior information for generating situational awareness.

CVMay 25, 2018
Parallel Architecture and Hyperparameter Search via Successive Halving and Classification

Manoj Kumar, George E. Dahl, Vijay Vasudevan et al.

We present a simple and powerful algorithm for parallel black box optimization called Successive Halving and Classification (SHAC). The algorithm operates in $K$ stages of parallel function evaluations and trains a cascade of binary classifiers to iteratively cull the undesirable regions of the search space. SHAC is easy to implement, requires no tuning of its own configuration parameters, is invariant to the scale of the objective function and can be built using any choice of binary classifier. We adopt tree-based classifiers within SHAC and achieve competitive performance against several strong baselines for optimizing synthetic functions, hyperparameters and architectures.
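
The staged structure described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: it searches a one-dimensional interval, and the tree-based classifiers used in the paper are replaced by a deliberately simple stand-in (the smallest interval containing the better half of each batch). The names `fit_interval`, `sample_accepted` and `shac_sketch`, and the `stages` and `batch` arguments, are hypothetical.

```python
import random


def fit_interval(points, labels):
    """Toy binary classifier: the smallest interval containing all positives."""
    positives = [x for x, keep in zip(points, labels) if keep]
    lo, hi = min(positives), max(positives)
    return lambda x: lo <= x <= hi


def sample_accepted(cascade, rng, low, high):
    """Rejection-sample a point that every classifier in the cascade accepts."""
    while True:
        x = rng.uniform(low, high)
        if all(clf(x) for clf in cascade):
            return x


def shac_sketch(objective, low, high, stages=4, batch=20, seed=0):
    """Run `stages` rounds of evaluate, split at the median, and classify."""
    rng = random.Random(seed)
    cascade = []
    for _ in range(stages):
        # Within a stage the evaluations are independent, so they could
        # run in parallel; here they run sequentially for simplicity.
        points = [sample_accepted(cascade, rng, low, high) for _ in range(batch)]
        scores = [objective(x) for x in points]
        median = sorted(scores)[batch // 2]
        # Only the ranking of scores is used, so rescaling the objective
        # by a positive constant leaves the labels unchanged.
        labels = [score < median for score in scores]
        cascade.append(fit_interval(points, labels))
    return cascade, sample_accepted(cascade, rng, low, high)


def toy_objective(x):
    """Hypothetical objective to minimize, with its optimum at x = 0.3."""
    return (x - 0.3) ** 2


cascade, candidate = shac_sketch(toy_objective, 0.0, 1.0)
print(len(cascade), all(clf(candidate) for clf in cascade))  # prints "4 True"
```

Each stage keeps roughly the better half of the currently accepted region, so the cascade narrows the search space geometrically with the number of stages.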

DCAug 9, 2017
Enabling Massive Deep Neural Networks with the GraphBLAS

Jeremy Kepner, Manoj Kumar, José Moreira et al.

Deep Neural Networks (DNNs) have emerged as a core tool for machine learning. The computations performed during DNN training and inference are dominated by operations on the weight matrices describing the DNN. As DNNs incorporate more stages and more nodes per stage, these weight matrices may be required to be sparse because of memory limitations. The GraphBLAS.org math library standard was developed to provide high performance manipulation of sparse weight matrices and input/output vectors. For sufficiently sparse matrices, a sparse matrix library requires significantly less memory than the corresponding dense matrix implementation. This paper provides a brief description of the mathematics underlying the GraphBLAS. In addition, the equations of a typical DNN are rewritten in a form designed to use the GraphBLAS. An implementation of the DNN is given using a preliminary GraphBLAS C library. The performance of the GraphBLAS implementation is measured relative to a standard dense linear algebra library implementation. For various sizes of DNN weight matrices, it is shown that the GraphBLAS sparse implementation outperforms a BLAS dense implementation as the weight matrix becomes sparser.
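
As a rough illustration of why sparse storage helps, the sketch below computes one layer of the common form y = ReLU(Wx + b) while storing only the nonzero weights. It uses plain Python dictionaries rather than the GraphBLAS API, and the layer form, the function name `sparse_layer`, and the toy values are assumptions made for illustration rather than details taken from the paper.

```python
def sparse_layer(weights, bias, x):
    """Compute y = ReLU(W x + b) with W stored sparsely.

    weights: dict mapping row index -> {column index: nonzero value}
    bias:    list with one bias per output unit
    x:       dense input vector (list of floats)
    """
    y = []
    for i, b in enumerate(bias):
        row = weights.get(i, {})  # a missing row means all-zero weights
        total = b + sum(value * x[j] for j, value in row.items())
        y.append(max(0.0, total))  # ReLU nonlinearity
    return y


# A 3x3 weight matrix with only 3 of its 9 entries stored.
W = {0: {0: 1.0, 2: -2.0}, 2: {1: 0.5}}
b = [0.0, 0.25, -1.0]
print(sparse_layer(W, b, [1.0, 4.0, 0.25]))  # prints [0.5, 0.25, 1.0]
```

Both memory and work scale with the number of stored nonzeros rather than with the full matrix size, which is the effect the abstract describes as the weight matrix becomes sparser.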

CVApr 24, 2012
A New Approach of Improving CFA Image for Digital Camera's

Manoj Kumar, Vikas Kaushik, Pradeep Singla

This paper works directly towards improving image quality for digital cameras and other visual capturing products. In this paper, the authors clearly define the problems that occur in the CFA image. A different methodology for removing noise is discussed for color correction and color balancing of the image. At the same time, the authors propose a new methodology that applies a denoising process before demosaicking to improve the image quality of CFA, which is much more efficient than previously defined approaches. The demosaicking process for producing the colors in the image in the best way is also discussed.