Sumit Shekhar

CV
14papers
511citations
Novelty48%
AI Score30

14 Papers

CVMay 24, 2022
SALAD: Source-free Active Label-Agnostic Domain Adaptation for Classification, Segmentation and Detection

Divya Kothandaraman, Sumit Shekhar, Abhilasha Sancheti et al.

We present a novel method, SALAD, for the challenging vision task of adapting a pre-trained "source" domain network to a "target" domain, with a small budget for annotation in the "target" domain and a shift in the label space. Further, the task assumes that the source data is not available for adaptation, due to privacy concerns or otherwise. We postulate that such systems need to jointly optimize the dual task of (i) selecting fixed number of samples from the target domain for annotation and (ii) transfer of knowledge from the pre-trained network to the target domain. To do this, SALAD consists of a novel Guided Attention Transfer Network (GATN) and an active learning function, HAL. The GATN enables feature distillation from pre-trained network to the target network, complemented with the target samples mined by HAL using transfer-ability and uncertainty criteria. SALAD has three key benefits: (i) it is task-agnostic, and can be applied across various visual tasks such as classification, segmentation and detection; (ii) it can handle shifts in output label space from the pre-trained source network to the target domain; (iii) it does not require access to source data for adaptation. We conduct extensive experiments across 3 visual tasks, viz. digits classification (MNIST, SVHN, VISDA), synthetic (GTA5) to real (CityScapes) image segmentation, and document layout detection (PubLayNet to DSSE). We show that our source-free approach, SALAD, results in an improvement of 0.5%-31.3%(across datasets and tasks) over prior adaptation methods that assume access to large amounts of annotated source data for adaptation.

CVSep 24, 2024Code
ImPoster: Text and Frequency Guidance for Subject Driven Action Personalization using Diffusion Models

Divya Kothandaraman, Kuldeep Kulkarni, Sumit Shekhar et al.

We present ImPoster, a novel algorithm for generating a target image of a 'source' subject performing a 'driving' action. The inputs to our algorithm are a single pair of a source image with the subject that we wish to edit and a driving image with a subject of an arbitrary class performing the driving action, along with the text descriptions of the two images. Our approach is completely unsupervised and does not require any access to additional annotations like keypoints or pose. Our approach builds on a pretrained text-to-image latent diffusion model and learns the characteristics of the source and the driving image by finetuning the diffusion model for a small number of iterations. At inference time, ImPoster performs step-wise text prompting i.e. it denoises by first moving in the direction of the image manifold corresponding to the driving image followed by the direction of the image manifold corresponding to the text description of the desired target image. We propose a novel diffusion guidance formulation, image frequency guidance, to steer the generation towards the manifold of the source subject and the driving action at every step of the inference denoising. Our frequency guidance formulations are derived from the frequency domain properties of images. We extensively evaluate ImPoster on a diverse set of source-driving image pairs to demonstrate improvements over baselines. To the best of our knowledge, ImPoster is the first approach towards achieving both subject-driven as well as action-driven image personalization. Code and data is available at https://github.com/divyakraman/ImPosterDiffusion2024.

CVJun 1, 2023
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Niyati Chhaya et al.

Well-formed context aware image captions and tags in enterprise content such as marketing material are critical to ensure their brand presence and content recall. Manual creation and updates to ensure the same is non trivial given the scale and the tedium towards this task. We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model, with a focus on context-assisted image captioning where the caption is generated based on both the image and its context. Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches. We exploit context by pretraining our model with datasets of three tasks: news image captioning where the news article is the context, contextual visual entailment, and keyword extraction from the context. The second pretraining task is a new VL task, and we construct and release two datasets for the task with 1.1M and 2.2K data instances. Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets. To the best of our knowledge, ours is the first effort at incorporating contextual information in pretraining the models for the VL tasks.

GRJan 2, 2023
Interactive Control over Temporal Consistency while Stylizing Video Streams

Sumit Shekhar, Max Reimann, Moritz Hilscher et al.

Image stylization has seen significant advancement and widespread interest over the years, leading to the development of a multitude of techniques. Extending these stylization techniques, such as Neural Style Transfer (NST), to videos is often achieved by applying them on a per-frame basis. However, per-frame stylization usually lacks temporal consistency, expressed by undesirable flickering artifacts. Most of the existing approaches for enforcing temporal consistency suffer from one or more of the following drawbacks: They (1) are only suitable for a limited range of techniques, (2) do not support online processing as they require the complete video as input, (3) cannot provide consistency for the task of stylization, or (4) do not provide interactive consistency control. Domain-agnostic techniques for temporal consistency aim to eradicate flickering completely but typically disregard aesthetic aspects. For stylization tasks, however, consistency control is an essential requirement as a certain amount of flickering adds to the artistic look and feel. Moreover, making this control interactive is paramount from a usability perspective. To achieve the above requirements, we propose an approach that stylizes video streams in real-time at full HD resolutions while providing interactive consistency control. We develop a lite optical-flow network that operates at 80 FPS on desktop systems with sufficient accuracy. Further, we employ an adaptive combination of local and global consistency features and enable interactive selection between them. Objective and subjective evaluations demonstrate that our method is superior to state-of-the-art video consistency approaches.

CVMar 9, 2022
Low-light Image and Video Enhancement via Selective Manipulation of Chromaticity

Sumit Shekhar, Max Reimann, Amir Semmo et al.

Image acquisition in low-light conditions suffers from poor quality and significant degradation in visual aesthetics. This affects the visual perception of the acquired image and the performance of various computer vision and image processing algorithms applied after acquisition. Especially for videos, the additional temporal domain makes it more challenging, wherein we need to preserve quality in a temporally coherent manner. We present a simple yet effective approach for low-light image and video enhancement. To this end, we introduce "Adaptive Chromaticity", which refers to an adaptive computation of image chromaticity. The above adaptivity allows us to avoid the costly step of low-light image decomposition into illumination and reflectance, employed by many existing techniques. All stages in our method consist of only point-based operations and high-pass or low-pass filtering, thereby ensuring that the amount of temporal incoherence is negligible when applied on a per-frame basis for videos. Our results on standard lowlight image datasets show the efficacy of our algorithm and its qualitative and quantitative superiority over several state-of-the-art techniques. For videos captured in the wild, we perform a user study to demonstrate the preference for our method in comparison to state-of-the-art approaches.

CVJun 7, 2024
Unveiling the Invisible: Captioning Videos with Metaphors

Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Sumit Shekhar

Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively but metaphors in other forms have been under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As of now, no probing studies have been done that involve complex language phenomena like metaphors with videos. Hence, we introduce a new VL task of describing the metaphors present in the videos in our work. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.

CVNov 30, 2021
TALISMAN: Targeted Active Learning for Object Detection with Rare Classes and Slices using Submodular Mutual Information

Suraj Kothawade, Saikat Ghosh, Sumit Shekhar et al.

Deep neural networks based object detectors have shown great success in a variety of domains like autonomous vehicles, biomedical imaging, etc. It is known that their success depends on a large amount of data from the domain of interest. While deep models often perform well in terms of overall accuracy, they often struggle in performance on rare yet critical data slices. For example, data slices like "motorcycle at night" or "bicycle at night" are often rare but very critical slices for self-driving applications and false negatives on such rare slices could result in ill-fated failures and accidents. Active learning (AL) is a well-known paradigm to incrementally and adaptively build training datasets with a human in the loop. However, current AL based acquisition functions are not well-equipped to tackle real-world datasets with rare slices, since they are based on uncertainty scores or global descriptors of the image. We propose TALISMAN, a novel framework for Targeted Active Learning or object detectIon with rare slices using Submodular MutuAl iNformation. Our method uses the submodular mutual information functions instantiated using features of the region of interest (RoI) to efficiently target and acquire data points with rare slices. We evaluate our framework on the standard PASCAL VOC07+12 and BDD100K, a real-world self-driving dataset. We observe that TALISMAN outperforms other methods by in terms of average precision on rare slices, and in terms of mAP.

IROct 1, 2021
OPAD: An Optimized Policy-based Active Learning Framework for Document Content Analysis

Sumit Shekhar, Bhanu Prakash Reddy Guda, Ashutosh Chaubey et al.

Documents are central to many business systems, and include forms, reports, contracts, invoices or purchase orders. The information in documents is typically in natural language, but can be organized in various layouts and formats. There have been recent spurt of interest in understanding document content with novel deep learning architectures. However, document understanding tasks need dense information annotations, which are costly to scale and generalize. Several active learning techniques have been proposed to reduce the overall budget of annotation while maintaining the performance of the underlying deep learning model. However, most of these techniques work only for classification problems. But content detection is a more complex task, and has been scarcely explored in active learning literature. In this paper, we propose \textit{OPAD}, a novel framework using reinforcement policy for active learning in content detection tasks for documents. The proposed framework learns the acquisition function to decide the samples to be selected while optimizing performance metrics that the tasks typically have. Furthermore, we extend to weak labelling scenarios to further reduce the cost of annotation significantly. We propose novel rewards to account for class imbalance and user feedback in the annotation interface, to improve the active learning method. We show superior performance of the proposed \textit{OPAD} framework for active learning for various tasks related to document understanding like layout parsing, object detection and named entity recognition. Ablation studies for human feedback and class imbalance rewards are presented, along with a comparison of annotation times for different approaches.

GRJul 13, 2021
COLiER: Collaborative Editing of Raster Images

Ulrike Bath, Sumit Shekhar, Jürgen Döllner et al.

Various web-based image-editing tools and web-based collaborative tools exist in isolation. Research focusing to bridge the gap between these two domains is sparse. We respond to the above and develop prototype groupware for real-time collaborative editing of raster images in a web browser. To better understand the requirements, we conduct a preliminary user study and establish communication and synchronization as key elements. The existing groupware for text documents, presentations, and vector graphics handles the above through well-established techniques. However, those cannot be extended as it is for raster graphics manipulation. To this end, we develop a document model that is maintained by a server and is delivered and synchronized to multiple clients. Our prototypical implementation is based on a scalable client-server architecture: using WebGL for interactive browser-based rendering and WebSocket connections to maintain synchronization. We evaluate our work qualitatively through a post-deployment user study for three different scenarios.

CVJul 30, 2019
LEAF-QA: Locate, Encode & Attend for Figure Question Answering

Ritwick Chaudhry, Sumit Shekhar, Utkarsh Gupta et al.

We introduce LEAF-QA, a comprehensive dataset of $250,000$ densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA), and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. LEAF-QA being constructed from real-world sources, requires a novel architecture to enable question answering. To this end, LEAF-Net, a deep architecture involving chart element localization, question and answer encoding in terms of chart elements, and an attention network is proposed. Different experiments are conducted to demonstrate the challenges of QA on LEAF-QA. The proposed architecture, LEAF-Net also considerably advances the current state-of-the-art on FigureQA and DVQA.

HCMay 20, 2019
Are all the frames equally important?

Oleksii Sidorov, Marius Pedersen, Nam Wook Kim et al.

In this work, we address the problem of measuring and predicting temporal video saliency - a metric which defines the importance of a video frame for human attention. Unlike the conventional spatial saliency which defines the location of the salient regions within a frame (as it is done for still images), temporal saliency considers importance of a frame as a whole and may not exist apart from context. The proposed interface is an interactive cursor-based algorithm for collecting experimental data about temporal saliency. We collect the first human responses and perform their analysis. As a result, we show that qualitatively, the produced scores have very explicit meaning of the semantic changes in a frame, while quantitatively being highly correlated between all the observers. Apart from that, we show that the proposed tool can simultaneously collect fixations similar to the ones produced by eye-tracker in a more affordable way. Further, this approach may be used for creation of first temporal saliency datasets which will allow training computational predictive algorithms. The proposed interface does not rely on any special equipment, which allows to run it remotely and cover a wide audience.

HCNov 19, 2018
VoCoG: An Intelligent, Non-Intrusive Assistant for Voice-based Collaborative Group-Viewing

Sumit Shekhar, Aditya Siddhant, Anindya Shankar Bhandari et al.

There have been significant innovations in media technologies in the recent years. While these developments have improved experiences for individual users, design of multi-user interfaces still remains a challenge. A relatively unexplored area in this context, is enabling multiple users to enjoy shared viewing (e.g. deciding on movies to watch together). In particular, the challenge is to design an intelligent system which would enable viewers to explore together shows or movies they like, seamlessly. This is a complex design problem, as it requires the system to (i) assess affinities of individual users (movies or genres), (ii) combine individual preferences taking into account user-user interactions, and (iii) be non-intrusive simultaneously. The proposed system VoCoG, is an end-to-end intelligent system for collaborative viewing. VoCoG incorporates an online recommendation algorithm, efficient methods for analyzing natural conversation and a graph-based method to fuse preferences of multiple users. It takes user conversation as input, making it non-intrusive. A usability survey of the system indicates that the system provides a good experience to the users as well as relevant recommendations. Further analysis of the usage data reveals insights about the nature of conversation during the interaction sessions, final consensus among the users as well as ratings of varied user groups.

CVJul 17, 2017
Show and Recall: Learning What Makes Videos Memorable

Sumit Shekhar, Dhruv Singal, Harvineet Singh et al.

With the explosion of video content on the Internet, there is a need for research on methods for video analysis which take human cognition into account. One such cognitive measure is memorability, or the ability to recall visual content after watching it. Prior research has looked into image memorability and shown that it is intrinsic to visual content, but the problem of modeling video memorability has not been addressed sufficiently. In this work, we develop a prediction model for video memorability, including complexities of video content in it. Detailed feature analysis reveals that the proposed method correlates well with existing findings on memorability. We also describe a novel experiment of predicting video sub-shot memorability and show that our approach improves over current memorability methods in this task. Experiments on standard datasets demonstrate that the proposed metric can achieve results on par or better than the state-of-the art methods for video summarization.

CVJul 10, 2017
Synthesis-based Robust Low Resolution Face Recognition

Sumit Shekhar, Vishal M. Patel, Rama Chellappa

Recognition of low resolution face images is a challenging problem in many practical face recognition systems. Methods have been proposed in the face recognition literature for the problem which assume that the probe is low resolution, but a high resolution gallery is available for recognition. These attempts have been aimed at modifying the probe image such that the resultant image provides better discrimination. We formulate the problem differently by leveraging the information available in the high resolution gallery image and propose a dictionary learning approach for classifying the low-resolution probe image. An important feature of our algorithm is that it can handle resolution change along with illumination variations. Furthermore, we also kernelize the algorithm to handle non-linearity in data and present a joint dictionary learning technique for robust recognition at low resolutions. The effectiveness of the proposed method is demonstrated using standard datasets and a challenging outdoor face dataset. It is shown that our method is efficient and can perform significantly better than many competitive low resolution face recognition algorithms.