Viswanathan Swaminathan

CV
h-index36
14papers
527citations
Novelty49%
AI Score33

14 Papers

CVJul 26, 2022Code
Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis

Trisha Mittal, Ritwik Sinha, Viswanathan Swaminathan et al.

As tools for content editing mature, and artificial intelligence (AI) based algorithms for synthesizing media grow, the presence of manipulated content across online media is increasing. This phenomenon causes the spread of misinformation, creating a greater need to distinguish between ``real'' and ``manipulated'' content. To this end, we present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated). Many of the existing deepfake datasets focus exclusively on two types of facial manipulations -- swapping with a different subject's face or altering the existing face. VideoSham, on the other hand, contains more diverse, context-rich, and human-centric, high-resolution videos manipulated using a combination of 6 different spatial and temporal attacks. Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham. We performed a user study on Amazon Mechanical Turk with 1200 participants to understand if they can differentiate between the real and manipulated videos in VideoSham. Finally, we dig deeper into the strengths and weaknesses of performances by humans and SOTA-algorithms to identify gaps that need to be filled with better AI algorithms. We present the dataset at https://github.com/adobe-research/VideoSham-dataset.

CVJul 18, 2022
Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

Uttaran Bhattacharya, Gang Wu, Stefano Petrangeli et al.

We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.

CVMar 23, 2023
VADER: Video Alignment Differencing and Retrieval

Alexander Black, Simon Jenni, Tu Bui et al.

We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to any changes due to any residual temporal misalignments or artifacts arising from non-editorial changes of the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered.

CLNov 20, 2023
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

Zhengmian Hu, Gang Wu, Saayan Mitra et al.

In recent years, Large Language Models (LLM) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that mislead LLMs into generating incorrect or undesired outputs. Previous work has revealed that with relatively simple yet effective attacks based on discrete optimization, it is possible to generate adversarial prompts that bypass moderation and alignment of the models. This vulnerability to adversarial prompts underscores a significant concern regarding the robustness and reliability of LLMs. Our work aims to address this concern by introducing a novel approach to detecting adversarial prompts at a token level, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity, where tokens predicted with high probability are considered normal, and those exhibiting high perplexity are flagged as adversarial. Additionaly, our method also integrates context understanding by incorporating neighboring token information to encourage the detection of contiguous adversarial prompt sequences. To this end, we design two algorithms for adversarial prompt detection: one based on optimization techniques and another on Probabilistic Graphical Models (PGM). Both methods are equipped with efficient solving methods, ensuring efficient adversarial prompt detection. Our token-level detection result can be visualized as heatmap overlays on the text sequence, allowing for a clearer and more intuitive representation of which part of the text may contain adversarial prompts.

MENov 3, 2022
Privacy Aware Experiments without Cookies

Shiv Shankar, Ritwik Sinha, Saayan Mitra et al.

Consider two brands that want to jointly test alternate web experiences for their customers with an A/B test. Such collaborative tests are today enabled using \textit{third-party cookies}, where each brand has information on the identity of visitors to another website. With the imminent elimination of third-party cookies, such A/B tests will become untenable. We propose a two-stage experimental design, where the two brands only need to agree on high-level aggregate parameters of the experiment to test the alternate experiences. Our design respects the privacy of customers. We propose an estimater of the Average Treatment Effect (ATE), show that it is unbiased and theoretically compute its variance. Our demonstration describes how a marketer for a brand can design such an experiment and analyze the results. On real and simulated data, we show that the approach provides valid estimate of the ATE with low variance and is robust to the proportion of visitors overlapping across the brands.

AIApr 29, 2025
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Paiheng Xu, Gang Wu, Xiang Chen et al.

Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset's diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.

AIApr 15, 2025
GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

Dayeon Ki, Tianyi Zhou, Marine Carpuat et al.

Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with open-ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across four design types. We further present GraphicTown, an LLM agent framework with three design experts and 46 actions (tools) to choose from for executing each step of the planned workflows in web environments. Experiments with six LLMs demonstrate their ability to generate workflows that integrate both explicit design constraints from user queries and implicit commonsense constraints. However, these workflows often do not lead to successful execution outcomes, primarily due to challenges in: (1) reasoning about spatial relationships, (2) coordinating global dependencies across experts, and (3) retrieving the most appropriate action per step. We envision GraphicBench as a challenging yet valuable testbed for advancing LLM-agent planning and execution in creative design tasks.

CVOct 5, 2021
HighlightMe: Detecting Highlights from Human-Centric Videos

Uttaran Bhattacharya, Gang Wu, Stefano Petrangeli et al.

We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. Our method works on the graph-based representation of multiple observable human-centric modalities in the videos, such as poses and faces. We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions based on these modalities. We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames. We use these scores to compute which frames to highlight and stitch contiguous frames to produce the excerpts. We train our network on the large-scale AVA-Kinetics action dataset and evaluate it on four benchmark video highlight datasets: DSH, TVSum, PHD2, and SumMe. We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods in these datasets, without requiring any user-provided preferences or dataset-specific fine-tuning.

LGMar 31, 2020
Optimal Bidding Strategy without Exploration in Real-time Bidding

Aritra Ghosh, Saayan Mitra, Somdeb Sarkhel et al.

Maximizing utility with a budget constraint is the primary goal for advertisers in real-time bidding (RTB) systems. The policy maximizing the utility is referred to as the optimal bidding strategy. Earlier works on optimal bidding strategy apply model-based batch reinforcement learning methods which can not generalize to unknown budget and time constraint. Further, the advertiser observes a censored market price which makes direct evaluation infeasible on batch test datasets. Previous works ignore the losing auctions to alleviate the difficulty with censored states; thus significantly modifying the test distribution. We address the challenge of lacking a clear evaluation procedure as well as the error propagated through batch reinforcement learning methods in RTB systems. We exploit two conditional independence structures in the sequential bidding process that allow us to propose a novel practical framework using the maximum entropy principle to imitate the behavior of the true distribution observed in real-time traffic. Moreover, the framework allows us to train a model that can generalize to the unseen budget conditions than limit only to those observed in history. We compare our methods on two real-world RTB datasets with several baselines and demonstrate significantly improved performance under various budget settings.

LGJan 18, 2020
Scalable Bid Landscape Forecasting in Real-time Bidding

Aritra Ghosh, Saayan Mitra, Somdeb Sarkhel et al.

In programmatic advertising, ad slots are usually sold using second-price (SP) auctions in real-time. The highest bidding advertiser wins but pays only the second-highest bid (known as the winning price). In SP, for a single item, the dominant strategy of each bidder is to bid the true value from the bidder's perspective. However, in a practical setting, with budget constraints, bidding the true value is a sub-optimal strategy. Hence, to devise an optimal bidding strategy, it is of utmost importance to learn the winning price distribution accurately. Moreover, a demand-side platform (DSP), which bids on behalf of advertisers, observes the winning price if it wins the auction. For losing auctions, DSPs can only treat its bidding price as the lower bound for the unknown winning price. In literature, typically censored regression is used to model such partially observed data. A common assumption in censored regression is that the winning price is drawn from a fixed variance (homoscedastic) uni-modal distribution (most often Gaussian). However, in reality, these assumptions are often violated. We relax these assumptions and propose a heteroscedastic fully parametric censored regression approach, as well as a mixture density censored network. Our approach not only generalizes censored regression but also provides flexibility to model arbitrarily distributed real-world data. Experimental evaluation on the publicly available dataset for winning price estimation demonstrates the effectiveness of our method. Furthermore, we evaluate our algorithm on one of the largest demand-side platforms and significant improvement has been achieved in comparison with the baseline solutions.

MMAug 1, 2018
From Thumbnails to Summaries - A single Deep Neural Network to Rule Them All

Hongxiang Gu, Viswanathan Swaminathan

Video summaries come in many forms, from traditional single-image thumbnails, animated thumbnails, storyboards, to trailer-like video summaries. Content creators use the summaries to display the most attractive portion of their videos; the users use them to quickly evaluate if a video is worth watching. All forms of summaries are essential to video viewers, content creators, and advertisers. Often video content management systems have to generate multiple versions of summaries that vary in duration and presentational forms. We present a framework ReconstSum that utilizes LSTM-based autoencoder architecture to extract and select a sparse subset of video frames or keyshots that optimally represent the input video in an unsupervised manner. The encoder selects a subset from the input video while the decoder seeks to reconstruct the video from the selection. The goal is to minimize the difference between the original input video and the reconstructed video. Our method is easily extendable to generate a variety of applications including static video thumbnails, animated thumbnails, storyboards and "trailer-like" highlights. We specifically study and evaluate two most popular use cases: thumbnail generation and storyboard generation. We demonstrate that our methods generate better results than the state-of-the-art techniques in both use cases.

MMMar 21, 2018
Viewport-Driven Rate-Distortion Optimized 360° Video Streaming

Jacob Chakareski, Ridvan Aksu, Xavier Corbillon et al.

The growing popularity of virtual and augmented reality communications and 360° video streaming is moving video communication systems into much more dynamic and resource-limited operating settings. The enormous data volume of 360° videos requires an efficient use of network bandwidth to maintain the desired quality of experience for the end user. To this end, we propose a framework for viewport-driven rate-distortion optimized 360° video streaming that integrates the user view navigation pattern and the spatiotemporal rate-distortion characteristics of the 360° video content to maximize the delivered user quality of experience for the given network/system resources. The framework comprises a methodology for constructing dynamic heat maps that capture the likelihood of navigating different spatial segments of a 360° video over time by the user, an analysis and characterization of its spatiotemporal rate-distortion characteristics that leverage preprocessed spatial tilling of the 360° view sphere, and an optimization problem formulation that characterizes the delivered user quality of experience given the user navigation patterns, 360° video encoding decisions, and the available system/network resources. Our experimental results demonstrate the advantages of our framework over the conventional approach of streaming a monolithic uniformly encoded 360° video and a state-of-the-art reference method. Considerable video quality gains of 4 - 5 dB are demonstrated in the case of two popular 4K 360° videos.

MMJan 23, 2017
Adaptive 360 VR Video Streaming based on MPEG-DASH SRD

Mohammad Hosseini, Viswanathan Swaminathan

We demonstrate an adaptive bandwidth-efficient 360 VR video streaming system based on MPEG-DASH SRD. We extend MPEG-DASH SRD to the 3D space of 360 VR videos, and showcase a dynamic view-aware adaptation technique to tackle the high bandwidth demands of streaming 360 VR videos to wireless VR headsets. We spatially partition the underlying 3D mesh into multiple 3D sub-meshes, and construct an efficient 3D geometry mesh called hexaface sphere to optimally represent tiled 360 VR videos in the 3D space. We then spatially divide the 360 videos into multiple tiles while encoding and packaging, use MPEG-DASH SRD to describe the spatial relationship of tiles in the 3D space, and prioritize the tiles in the Field of View (FoV) for view-aware adaptation. Our initial evaluation results show that we can save up to 72% of the required bandwidth on 360 VR video streaming with minor negative quality impacts compared to the baseline scenario when no adaptations is applied.

MMSep 28, 2016
Adaptive 360 VR Video Streaming: Divide and Conquer!

Mohammad Hosseini, Viswanathan Swaminathan

While traditional multimedia applications such as games and videos are still popular, there has been a significant interest in the recent years towards new 3D media such as 3D immersion and Virtual Reality (VR) applications, especially 360 VR videos. 360 VR video is an immersive spherical video where the user can look around during playback. Unfortunately, 360 VR videos are extremely bandwidth intensive, and therefore are difficult to stream at acceptable quality levels. In this paper, we propose an adaptive bandwidth-efficient 360 VR video streaming system using a divide and conquer approach. In our approach, we propose a dynamic view-aware adaptation technique to tackle the huge streaming bandwidth demands of 360 VR videos. We spatially divide the videos into multiple tiles while encoding and packaging, use MPEG-DASH SRD to describe the spatial relationship of tiles in the 360-degree space, and prioritize the tiles in the Field of View (FoV). In order to describe such tiled representations, we extend MPEG-DASH SRD to the 3D space of 360 VR videos. We spatially partition the underlying 3D mesh, and construct an efficient 3D geometry mesh called hexaface sphere to optimally represent a tiled 360 VR video in the 3D space. Our initial evaluation results report up to 72% bandwidth savings on 360 VR video streaming with minor negative quality impacts compared to the baseline scenario when no adaptations is applied.