Vivek Sharma

CV
h-index117
37papers
9,021citations
Novelty45%
AI Score46

37 Papers

CLJul 20, 2024
Consent in Crisis: The Rapid Decline of the AI Data Commons

Shayne Longpre, Robert Mahari, Ariel Lee et al. · cambridge, cmu

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.

CVApr 11, 2022
Physically Disentangled Representations

Tzofi Klinghoffer, Kushagra Tiwary, Arkadiusz Balata et al.

State-of-the-art methods in generative representation learning yield semantic disentanglement, but typically do not consider physical scene parameters, such as geometry, albedo, lighting, or camera. We posit that inverse rendering, a way to reverse the rendering process to recover scene parameters from an image, can also be used to learn physically disentangled representations of scenes without supervision. In this paper, we show the utility of inverse rendering in learning representations that yield improved accuracy on downstream clustering, linear classification, and segmentation tasks with the help of our novel Leave-One-Out, Cycle Contrastive loss (LOOCC), which improves disentanglement of scene parameters and robustness to out-of-distribution lighting and viewpoints. We perform a comparison of our method with other generative representation learning methods across a variety of downstream tasks, including face attribute classification, emotion recognition, identification, face segmentation, and car classification. Our physically disentangled representations yield higher accuracy than semantically disentangled alternatives across all tasks and by as much as 18%. We hope that this work will motivate future research in applying advances in inverse rendering and 3D understanding to representation learning.

CVMar 23, 2022
Learning to Censor by Noisy Sampling

Ayush Chopra, Abhinav Java, Abhishek Singh et al.

Point clouds are an increasingly ubiquitous input modality and the raw signal can be efficiently processed with recent progress in deep learning. This signal may, often inadvertently, capture sensitive information that can leak semantic and geometric properties of the scene which the data owner does not want to share. The goal of this work is to protect sensitive information when learning from point clouds; by censoring the sensitive information before the point cloud is released for downstream tasks. Specifically, we focus on preserving utility for perception tasks while mitigating attribute leakage attacks. The key motivating insight is to leverage the localized saliency of perception tasks on point clouds to provide good privacy-utility trade-offs. We realize this through a mechanism called Censoring by Noisy Sampling (CBNS), which is composed of two modules: i) Invariant Sampler: a differentiable point-cloud sampler which learns to remove points invariant to utility and ii) Noisy Distorter: which learns to distort sampled points to decouple the sensitive information from utility, and mitigate privacy leakage. We validate the effectiveness of CBNS through extensive comparisons with state-of-the-art baselines and sensitivity analyses of key design choices. Results show that CBNS achieves superior privacy-utility trade-offs on multiple datasets.

CVSep 22, 2023
Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception?

Xiaoxiao Sun, Nidham Gazagnadou, Vivek Sharma et al.

Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.

CRJun 17, 2022
A Roadmap for Greater Public Use of Privacy-Sensitive Government Data: Workshop Report

Chris Clifton, Bradley Malin, Anna Oganian et al.

Government agencies collect and manage a wide range of ever-growing datasets. While such data has the potential to support research and evidence-based policy making, there are concerns that the dissemination of such data could infringe upon the privacy of the individuals (or organizations) from whom such data was collected. To appraise the current state of data sharing, as well as learn about opportunities for stimulating such sharing at a faster pace, a virtual workshop was held on May 21st and 26th, 2021, sponsored by the National Science Foundation (NSF) and National Institute of Standards and Technologies (NIST), and the White House Office of Science and Technology Policy (OSTP), where a multinational collection of researchers and practitioners were brought together to discuss their experiences and learn about recently developed technologies for managing privacy while sharing data. The workshop specifically focused on challenges and successes in government data sharing at various levels. The first day focused on successful examples of new technology applied to sharing of public data, including formal privacy techniques, synthetic data, and cryptographic approaches. Day two emphasized brainstorming sessions on some of the challenges and directions to address them.

CRMar 17, 2022
Decouple-and-Sample: Protecting sensitive information in task agnostic data release

Abhishek Singh, Ethan Garza, Ayush Chopra et al.

We propose sanitizer, a framework for secure and task-agnostic data release. While releasing datasets continues to make a big impact in various applications of computer vision, its impact is mostly realized when data sharing is not inhibited by privacy concerns. We alleviate these concerns by sanitizing datasets in a two-stage process. First, we introduce a global decoupling stage for decomposing raw data into sensitive and non-sensitive latent representations. Secondly, we design a local sampling stage to synthetically generate sensitive information with differential privacy and merge it with non-sensitive latent features to create a useful representation while preserving the privacy. This newly formed latent information is a task-agnostic representation of the original dataset with anonymized sensitive information. While most algorithms sanitize data in a task-dependent manner, a few task-agnostic sanitization techniques sanitize data by censoring sensitive information. In this work, we show that a better privacy-utility trade-off is achieved if sensitive information can be synthesized privately. We validate the effectiveness of the sanitizer by outperforming state-of-the-art baselines on the existing benchmark tasks and demonstrating tasks that are not possible using existing techniques.

CLMar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei et al. · deepmind, mila

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

CVMar 20, 2021Code
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation

M. Saquib Sarfraz, Naila Murray, Vivek Sharma et al.

Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH

AIJul 27, 2025
Artificial Intelligence In Patent And Market Intelligence: A New Paradigm For Technology Scouting

Manish Verma, Vivek Sharma, Vishal Singh

This paper presents the development of an AI powered software platform that leverages advanced large language models (LLMs) to transform technology scouting and solution discovery in industrial R&D. Traditional approaches to solving complex research and development challenges are often time consuming, manually driven, and heavily dependent on domain specific expertise. These methods typically involve navigating fragmented sources such as patent repositories, commercial product catalogs, and competitor data, leading to inefficiencies and incomplete insights. The proposed platform utilizes cutting edge LLM capabilities including semantic understanding, contextual reasoning, and cross-domain knowledge extraction to interpret problem statements and retrieve high-quality, sustainable solutions. The system processes unstructured patent texts, such as claims and technical descriptions, and systematically extracts potential innovations aligned with the given problem context. These solutions are then algorithmically organized under standardized technical categories and subcategories to ensure clarity and relevance across interdisciplinary domains. In addition to patent analysis, the platform integrates commercial intelligence by identifying validated market solutions and active organizations addressing similar challenges. This combined insight sourced from both intellectual property and real world product data enables R&D teams to assess not only technical novelty but also feasibility, scalability, and sustainability. The result is a comprehensive, AI driven scouting engine that reduces manual effort, accelerates innovation cycles, and enhances decision making in complex R&D environments.

CVOct 22, 2024
Activity Recognition on Avatar-Anonymized Datasets with Masked Differential Privacy

David Schneider, Sina Sajadmanesh, Vikash Sehwag et al. · princeton

Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence. Prevalent methods tackling this problem use differential privacy (DP) or obfuscation techniques to protect the privacy of individuals. In both cases, the utility of the trained model is sacrificed heavily in this process. In this work, we present an anonymization pipeline that replaces sensitive human subjects in video datasets with synthetic avatars within context, employing a combined rendering and stable diffusion-based strategy. Additionally we propose masked differential privacy ({MaskDP}) to protect non-anonymized but privacy sensitive background information. MaskDP allows for controlling sensitive regions where differential privacy is applied, in contrast to applying DP on the entire input. This combined methodology provides strong privacy protection while minimizing the usual performance penalty of privacy preserving methods. Experiments on multiple challenging action recognition datasets demonstrate that our proposed techniques result in better utility-privacy trade-offs compared to standard differentially private training in the especially demanding $ε<1$ regime.

AIAug 31, 2025
A Hybrid Ai Framework For Strategic Patent Portfolio Pruning: Integrating Learning To-Rank And Market Need Analysis For Technology Transfer Optimization

Manish Verma, Vivek Sharma, Vishal Singh

This paper introduces a novel, multi stage hybrid intelligence framework for pruning patent portfolios to identify high value assets for technology transfer. Current patent valuation methods often rely on retrospective indicators or manual, time intensive analysis. Our framework automates and deepens this process by combining a Learning to Rank (LTR) model, which evaluates patents against over 30 legal and commercial parameters, with a unique "Need-Seed" agent-based system. The "Need Agent" uses Natural Language Processing (NLP) to mine unstructured market and industry data, identifying explicit technological needs. Concurrently, the "Seed Agent" employs fine tuned Large Language Models (LLMs) to analyze patent claims and map their technological capabilities. The system generates a "Core Ontology Framework" that matches high potential patents (Seeds) to documented market demands (Needs), providing a strategic rationale for divestment decisions. We detail the architecture, including a dynamic parameter weighting system and a crucial Human in the-Loop (HITL) validation protocol, to ensure both adaptability and real-world credibility.

AIDec 19, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video

Shayne Longpre, Nikhil Singh, Manuel Cherep et al. · mit

Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets, carry non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level, and that visibility into these questions are essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.

CVDec 9, 2024
Rendering-Refined Stable Diffusion for Privacy Compliant Synthetic Data

Kartik Patwari, David Schneider, Xiaoxiao Sun et al.

Growing privacy concerns and regulations like GDPR and CCPA necessitate pseudonymization techniques that protect identity in image datasets. However, retaining utility is also essential. Traditional methods like masking and blurring degrade quality and obscure critical context, especially in human-centric images. We introduce Rendering-Refined Stable Diffusion (RefSD), a pipeline that combines 3D-rendering with Stable Diffusion, enabling prompt-based control over human attributes while preserving posture. Unlike standard diffusion models that fail to retain posture or GANs that lack realism and flexible attribute control, RefSD balances posture preservation, realism, and customization. We also propose HumanGenAI, a framework for human perception and utility evaluation. Human perception assessments reveal attribute-specific strengths and weaknesses of RefSD. Our utility experiments show that models trained on RefSD pseudonymized data outperform those trained on real data in detection tasks, with further performance gains when combining RefSD with real data. For classification tasks, we consistently observe performance improvements when using RefSD data with real data, confirming the utility of our pseudonymized data.

CLJan 21, 2024
MedLM: Exploring Language Models for Medical Question Answering Systems

Niraj Yagnik, Jay Jhaveri, Vivek Sharma et al.

In the face of rapidly expanding online medical literature, automated systems for aggregating and summarizing information are becoming increasingly crucial for healthcare professionals and patients. Large Language Models (LLMs), with their advanced generative capabilities, have shown promise in various NLP tasks, and their potential in the healthcare domain, particularly for Closed-Book Generative QnA, is significant. However, the performance of these models in domain-specific tasks such as medical Q&A remains largely unexplored. This study aims to fill this gap by comparing the performance of general and medical-specific distilled LMs for medical Q&A. We aim to evaluate the effectiveness of fine-tuning domain-specific LMs and compare the performance of different families of Language Models. The study will address critical questions about these models' reliability, comparative performance, and effectiveness in the context of medical Q&A. The findings will provide valuable insights into the suitability of different LMs for specific applications in the medical domain.

LGDec 2, 2021
AdaSplit: Adaptive Trade-offs for Resource-constrained Distributed Deep Learning

Ayush Chopra, Surya Kant Sahu, Abhishek Singh et al.

Distributed deep learning frameworks like federated learning (FL) and its variants are enabling personalized experiences across a wide range of web clients and mobile/IoT devices. However, FL-based frameworks are constrained by computational resources at clients due to the exploding growth of model parameters (eg. billion parameter model). Split learning (SL), a recent framework, reduces client compute load by splitting the model training between client and server. This flexibility is extremely useful for low-compute setups but is often achieved at cost of increase in bandwidth consumption and may result in sub-optimal convergence, especially when client data is heterogeneous. In this work, we introduce AdaSplit which enables efficiently scaling SL to low resource scenarios by reducing bandwidth consumption and improving performance across heterogeneous clients. To capture and benchmark this multi-dimensional nature of distributed deep learning, we also introduce C3-Score, a metric to evaluate performance under resource budgets. We validate the effectiveness of AdaSplit under limited resources through extensive experimental comparison with strong federated and split learning baselines. We also present a sensitivity analysis of key design choices in AdaSplit which validates the ability of AdaSplit to provide adaptive trade-offs across variable resource budgets.

CVDec 20, 2020
DISCO: Dynamic and Invariant Sensitive Channel Obfuscation for deep neural networks

Abhishek Singh, Ayush Chopra, Vivek Sharma et al.

Recent deep learning models have shown remarkable performance in image classification. While these deep learning systems are getting closer to practical deployment, the common assumption made about data is that it does not carry any sensitive information. This assumption may not hold for many practical cases, especially in the domain where an individual's personal information is involved, like healthcare and facial recognition systems. We posit that selectively removing features in this latent space can protect the sensitive information and provide a better privacy-utility trade-off. Consequently, we propose DISCO which learns a dynamic and data driven pruning filter to selectively obfuscate sensitive information in the feature space. We propose diverse attack schemes for sensitive inputs \& attributes and demonstrate the effectiveness of DISCO against state-of-the-art methods through quantitative and qualitative evaluation. Finally, we also release an evaluation benchmark dataset of 1 million sensitive representations to encourage rigorous exploration of novel attack schemes.

LGAug 7, 2020
SplitNN-driven Vertical Partitioning

Iker Ceballos, Vivek Sharma, Eduardo Mugica et al.

In this work, we introduce SplitNN-driven Vertical Partitioning, a configuration of a distributed deep learning method called SplitNN to facilitate learning from vertically distributed features. SplitNN does not share raw data or model details with collaborating institutions. The proposed configuration allows training among institutions holding diverse sources of data without the need of complex encryption algorithms or secure computation protocols. We evaluate several configurations to merge the outputs of the split models, and compare performance and resource efficiency. The method is flexible and allows many different configurations to tackle the specific challenges posed by vertically split datasets.

CVApr 5, 2020
Deep Multimodal Feature Encoding for Video Ordering

Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen

True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications.

CVApr 5, 2020
Clustering based Contrastive Learning for Improving Face Representations

Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz et al.

A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets: BBT-0101, BF-0502, and ACCIO show that CCL achieves a new state-of-the-art on all datasets.

CRMar 19, 2020
Apps Gone Rogue: Maintaining Personal Privacy in an Epidemic

Ramesh Raskar, Isabel Schunemann, Rachel Barbar et al.

Containment, the key strategy in quickly halting an epidemic, requires rapid identification and quarantine of the infected individuals, determination of whom they have had close contact with in the previous days and weeks, and decontamination of locations the infected individual has visited. Achieving containment demands accurate and timely collection of the infected individual's location and contact history. Traditionally, this process is labor intensive, susceptible to memory errors, and fraught with privacy concerns. With the recent almost ubiquitous availability of smart phones, many people carry a tool which can be utilized to quickly identify an infected individual's contacts during an epidemic, such as the current 2019 novel Coronavirus crisis. Unfortunately, the very same first-generation contact tracing tools have been used to expand mass surveillance, limit individual freedoms and expose the most private details about individuals. We seek to outline the different technological approaches to mobile-phone based contact-tracing to date and elaborate on the opportunities and the risks that these technologies pose to individuals and societies. We describe advanced security enhancing approaches that can mitigate these risks and describe trade-offs one must make when developing and deploying any mass contact-tracing technology. With this paper, our aim is to continue to grow the conversation regarding contact-tracing for epidemic and pandemic containment and discuss opportunities to advance this space. We invite feedback and discussion.

CVOct 9, 2019
ExpertMatcher: Automating ML Model Selection for Clients using Hidden Representations

Vivek Sharma, Praneeth Vepakomma, Tristan Swedish et al.

Recently, there has been the development of Split Learning, a framework for distributed computation where model components are split between the client and server (Vepakomma et al., 2018b). As Split Learning scales to include many different model components, there needs to be a method of matching client-side model components with the best server-side model components. A solution to this problem was introduced in the ExpertMatcher (Sharma et al., 2019) framework, which uses autoencoders to match raw data to models. In this work, we propose an extension of ExpertMatcher, where matching can be performed without the need to share the client's raw data representation. The technique is applicable to situations where there are local clients and centralized expert ML models, but the sharing of raw data is constrained.

LGOct 5, 2019
ExpertMatcher: Automating ML Model Selection for Users in Resource Constrained Countries

Vivek Sharma, Praneeth Vepakomma, Tristan Swedish et al.

In this work we introduce ExpertMatcher, a method for automating deep learning model selection using autoencoders. Specifically, we are interested in performing inference on data sources that are distributed across many clients using pretrained expert ML networks on a centralized server. The ExpertMatcher assigns the most relevant model(s) in the central server given the client's data representation. This allows resource-constrained clients in developing countries to utilize the most relevant ML models for their given task without having to evaluate the performance of each ML model. The method is generic and can be beneficial in any setup where there are local clients and numerous centralized expert ML models.

CVApr 25, 2019
Large Scale Holistic Video Understanding

Ali Diba, Mohsen Fayyaz, Vivek Sharma et al.

Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset"~(HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approx.~572k videos in total with 9 million annotations for training, validation, and test set spanning over 3142 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts which naturally captures the real-world scenarios. We demonstrate the generalization capability of HVU on three challenging tasks: 1.) Video classification, 2.) Video captioning and 3.) Video clustering tasks. In particular for video classification, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network"~(HATNet) that builds on fusing 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. Via our experiments, we validate the idea that holistic representation learning is complementary, and can play a key role in enabling many real-world applications.

CVApr 25, 2019
DynamoNet: Dynamic Action and Motion Network

Ali Diba, Vivek Sharma, Luc Van Gool et al.

In this paper, we are interested in self-supervised learning the motion cues in videos using dynamic motion filters for a better motion representation to finally boost human action recognition in particular. Thus far, the vision community has focused on spatio-temporal approaches using standard filters, rather we here propose dynamic filters that adaptively learn the video-specific internal motion representation by predicting the short-term future frames. We name this new motion representation, as dynamic motion representation (DMR) and is embedded inside of 3D convolutional network as a new layer, which captures the visual appearance and motion dynamics throughout entire video clip via end-to-end network learning. Simultaneously, we utilize these motion representation to enrich video classification. We have designed the frame prediction task as an auxiliary task to empower the classification problem. With these overall objectives, to this end, we introduce a novel unified spatio-temporal 3D-CNN architecture (DynamoNet) that jointly optimizes the video classification and learning motion representation by predicting future frames as a multi-task learning problem. We conduct experiments on challenging human action datasets: Kinetics 400, UCF101, HMDB51. The experiments using the proposed DynamoNet show promising results on all the datasets.

CVMar 3, 2019
Self-Supervised Learning of Face Representations for Video Face Clustering

Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz et al.

Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks.

CVFeb 28, 2019
Efficient Parameter-free Clustering Using First Neighbor Relations

M. Saquib Sarfraz, Vivek Sharma, Rainer Stiefelhagen

We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and finding the groups in the data. In contrast to most existing clustering algorithms our method does not require any hyper-parameters, distance thresholds and/or the need to specify the number of clusters. The proposed algorithm belongs to the family of hierarchical agglomerative methods. The technique has a very low computational overhead, is easily scalable and applicable to large practical problems. Evaluation on well known datasets from different domains ranging between 1077 and 8.1 million samples shows substantial performance gains when compared to the existing clustering techniques.

CVDec 27, 2018
Can Image Enhancement be Beneficial to Find Smoke Images in Laparoscopic Surgery?

Congcong Wang, Vivek Sharma, Yu Fan et al.

Laparoscopic surgery has a limited field of view. Laser ablation in a laproscopic surgery causes smoke, which inevitably influences the surgeon's visibility. Therefore, it is of vital importance to remove the smoke, such that a clear visualization is possible. In order to employ a desmoking technique, one needs to know beforehand if the image contains smoke or not, to this date, there exists no accurate method that could classify the smoke/non-smoke images completely. In this work, we propose a new enhancement method which enhances the informative details in the RGB images for discrimination of smoke/non-smoke images. Our proposed method utilizes weighted least squares optimization framework~(WLS). For feature extraction, we use statistical features based on bivariate histogram distribution of gradient magnitude~(GM) and Laplacian of Gaussian~(LoG). We then train a SVM classifier with binary smoke/non-smoke classification task. We demonstrate the effectiveness of our method on Cholec80 dataset. Experiments using our proposed enhancement method show promising results with improvements of 4\% in accuracy and 4\% in F1-Score over the baseline performance of RGB images. In addition, our approach improves over the saturation histogram based classification methodologies Saturation Analysis~(SAN) and Saturation Peak Analysis~(SPA) by 1/5\% and 1/6\% in accuracy/F1-Score metrics.

CVJun 19, 2018
Spatio-Temporal Channel Correlation Networks for Action Classification

Ali Diba, Mohsen Fayyaz, Vivek Sharma et al.

The work in this paper is driven by the question if spatio-temporal correlations are enough for 3D convolutional neural networks (CNN)? Most of the traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between channels of a 3D CNN with respect to temporal and spatial features. This new block can be added as a residual unit to different parts of 3D CNNs. We name our novel block 'Spatio-Temporal Channel Correlation' (STC). By embedding this block to the current state-of-the-art architectures such as ResNext and ResNet, we improved the performance by 2-3\% on Kinetics dataset. Our experiments show that adding STC blocks to current state-of-the-art architectures outperforms the state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D CNNs is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D CNNs is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by fine-tuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and fine-tuned on the target datasets, e.g. HMDB51/UCF101.

CVNov 22, 2017
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

Ali Diba, Mohsen Fayyaz, Vivek Sharma et al.

The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network `Temporal 3D ConvNet'~(T3D) and its new temporal layer `Temporal Transition Layer'~(TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released

CVNov 22, 2017
Weakly Supervised Object Discovery by Generative Adversarial & Ranking Networks

Ali Diba, Vivek Sharma, Rainer Stiefelhagen et al.

The deep generative adversarial networks (GAN) recently have been shown to be promising for different computer vision applications, like image edit- ing, synthesizing high resolution images, generating videos, etc. These networks and the corresponding learning scheme can handle various visual space map- pings. We approach GANs with a novel training method and learning objective, to discover multiple object instances for three cases: 1) synthesizing a picture of a specific object within a cluttered scene; 2) localizing different categories in images for weakly supervised object detection; and 3) improving object discov- ery in object detection pipelines. A crucial advantage of our method is that it learns a new deep similarity metric, to distinguish multiple objects in one im- age. We demonstrate that the network can act as an encoder-decoder generating parts of an image which contain an object, or as a modified deep CNN to rep- resent images for object detection in supervised and weakly supervised scheme. Our ranking GAN offers a novel way to search through images for object specific patterns. We have conducted experiments for different scenarios and demonstrate the method performance for object synthesizing and weakly supervised object detection and classification using the MS-COCO and PASCAL VOC datasets.

CVOct 20, 2017
Classification Driven Dynamic Image Enhancement

Vivek Sharma, Ali Diba, Davy Neven et al.

Convolutional neural networks rely on image texture and structure to serve as discriminative features to classify the image content. Image enhancement techniques can be used as preprocessing steps to help improve the overall image quality and in turn improve the overall effectiveness of a CNN. Existing image enhancement methods, however, are designed to improve the perceptual quality of an image for a human observer. In this paper, we are interested in learning CNNs that can emulate image enhancement and restoration, but with the overall goal to improve image classification and not necessarily human perception. To this end, we present a unified CNN architecture that uses a range of enhancement filters that can enhance image-specific details via end-to-end dynamic filter learning. We demonstrate the effectiveness of this strategy on four challenging benchmark datasets for fine-grained, object, scene, and texture classification: CUB-200-2011, PASCAL-VOC2007, MIT-Indoor, and DTD. Experiments using our proposed enhancement show promising results on all the datasets. In addition, our approach is capable of improving the performance of all generic CNN architectures.

CVNov 24, 2016
Weakly Supervised Cascaded Convolutional Networks

Ali Diba, Vivek Sharma, Ali Pazandeh et al.

Object detection is a challenging task in visual understanding domain, and even more so if the supervision is to be weak. Recently, few efforts to handle the task without expensive human annotations is established by promising deep neural network. A new architecture of cascaded networks is proposed to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two cascade stages or three which are trained in an end-to-end pipeline. The first stage of both architectures extracts best candidate of class specific region proposals by training a fully convolutional network. In the case of the three stage architecture, the middle stage provides object segmentation, using the output of the activation maps of first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on proposals extracted in the previous stage(s). Our experiments on the PASCAL VOC 2007, 2010, 2012 and large scale object datasets, ILSVRC 2013, 2014 datasets show improvements in the areas of weakly-supervised object detection, classification and localization.

CVNov 21, 2016
Deep Temporal Linear Encoding Networks

Ali Diba, Vivek Sharma, Luc Van Gool

The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation, via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks like 2D and 3D CNNs for video classification; and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets: HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.

CVAug 23, 2016
Does V-NIR based Image Enhancement Come with Better Features?

Vivek Sharma, Luc Van Gool

Image enhancement using the visible (V) and near-infrared (NIR) usually enhances useful image details. The enhanced images are evaluated by observers perception, instead of quantitative feature evaluation. Thus, can we say that these enhanced images using NIR information has better features in comparison to the computed features in the Red, Green, and Blue color channels directly? In this work, we present a new method to enhance the visible images using NIR information via edge-preserving filters, and also investigate which method performs best from a image features standpoint. We then show that our proposed enhancement method produces more stable features than the existing state-of-the-art methods.

CVMay 26, 2016
Low-Cost Scene Modeling using a Density Function Improves Segmentation Performance

Vivek Sharma, Sule Yildirim-Yayilgan, Luc Van Gool

We propose a low cost and effective way to combine a free simulation software and free CAD models for modeling human-object interaction in order to improve human & object segmentation. It is intended for research scenarios related to safe human-robot collaboration (SHRC) and interaction (SHRI) in the industrial domain. The task of human and object modeling has been used for detecting activity, and for inferring and predicting actions, different from those works, we do human and object modeling in order to learn interactions in RGB-D data for improving segmentation. For this purpose, we define a novel density function to model a three dimensional (3D) scene in a virtual environment (VREP). This density function takes into account various possible configurations of human-object and object-object relationships and interactions governed by their affordances. Using this function, we synthesize a large, realistic and highly varied synthetic RGB-D dataset that we use for training. We train a random forest classifier, and the pixelwise predictions obtained is integrated as a unary term in a pairwise conditional random fields (CRF). Our evaluation shows that modeling these interactions improves segmentation performance by ~7\% in mean average precision and recall over state-of-the-art methods that ignore these interactions in real-world data. Our approach is computationally efficient, robust and can run real-time on consumer hardware.

CVMay 11, 2016
Image-level Classification in Hyperspectral Images using Feature Descriptors, with Application to Face Recognition

Vivek Sharma, Luc Van Gool

In this paper, we proposed a novel pipeline for image-level classification in the hyperspectral images. By doing this, we show that the discriminative spectral information at image-level features lead to significantly improved performance in a face recognition task. We also explored the potential of traditional feature descriptors in the hyperspectral images. From our evaluations, we observe that SIFT features outperform the state-of-the-art hyperspectral face recognition methods, and also the other descriptors. With the increasing deployment of hyperspectral sensors in a multitude of applications, we believe that our approach can effectively exploit the spectral information in hyperspectral images, thus beneficial to more accurate classification.