LG · Aug 15, 2022 · Code
Combining Gradients and Probabilities for Heterogeneous Approximation of Neural Networks
Elias Trommer, Bernd Waschneck, Akash Kumar
This work explores the search for heterogeneous approximate multiplier configurations for neural networks that produce high accuracy and low energy consumption. We discuss the validity of additive Gaussian noise added to accurate neural network computations as a surrogate model for behavioral simulation of approximate multipliers. The continuous and differentiable properties of the solution space spanned by the additive Gaussian noise model are used as a heuristic that generates meaningful estimates of layer robustness without the need for combinatorial optimization techniques. Instead, the amount of noise injected into the accurate computations is learned during network training using backpropagation. A probabilistic model of the multiplier error is presented to bridge the gap between the domains; the model estimates the standard deviation of the approximate multiplier error, connecting solutions in the additive Gaussian noise space to actual hardware instances. Our experiments show that the combination of heterogeneous approximation and neural network retraining reduces the energy consumption for multiplications by 70% to 79% for different ResNet variants on the CIFAR-10 dataset with a Top-1 accuracy loss below one percentage point. For the more complex Tiny ImageNet task, our VGG16 model achieves a 53% reduction in energy consumption with a drop in Top-5 accuracy of 0.5 percentage points. We further demonstrate that our error model can predict the parameters of an approximate multiplier in the context of the commonly used additive Gaussian noise (AGN) model with high accuracy. Our software implementation is available at https://github.com/etrommer/agn-approx.
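As a rough illustration of the idea of summarizing a multiplier's error by its standard deviation, the sketch below exhaustively simulates a toy approximate multiplier and then uses the measured statistics to draw Gaussian noise. Everything here is an assumption made for illustration: the truncating multiplier, the function names `truncated_multiply`, `error_stats`, and `noisy_product`, and the choice to match both mean and standard deviation are hypothetical and are not taken from the paper or its repository.

```python
import math
import random

def truncated_multiply(a, b, drop_bits=2):
    """Hypothetical approximate multiplier: zero the lowest `drop_bits` bits of each operand."""
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) * (b & mask)

def error_stats(drop_bits=2, bits=8):
    """Simulate all unsigned operand pairs and return (mean, std) of the error."""
    errors = [truncated_multiply(a, b, drop_bits) - a * b
              for a in range(1 << bits)
              for b in range(1 << bits)]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(var)

def noisy_product(a, b, mean, std, rng):
    """Surrogate: exact product plus Gaussian noise with the measured statistics."""
    return a * b + rng.gauss(mean, std)

if __name__ == "__main__":
    mean, std = error_stats(drop_bits=2)
    rng = random.Random(0)
    print("error mean:", mean, "error std:", std)
    print("surrogate 12*34:", noisy_product(12, 34, mean, std, rng))
```

The exhaustive loop is only practical for small bit widths; the abstract's probabilistic model is presumably what avoids this kind of brute-force simulation.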
CV · Apr 17, 2022
Video Action Detection: Analysing Limitations and Challenges
Rajat Modi, Aayush Jung Rana, Akash Kumar et al.
Beyond being large enough to feed data-hungry machines (e.g., transformers), what attributes measure the quality of a dataset? Assuming that definitions of such attributes exist, how do we quantify their relative presence? Our work attempts to explore these questions for video action detection. The task aims to spatio-temporally localize an actor and assign a relevant action class. We first analyze the existing datasets on video action detection and discuss their limitations. Next, we propose a new dataset, Multi Actor Multi Action (MAMA), which overcomes these limitations and is more suitable for real-world applications. In addition, we perform a bias study which analyzes a key property differentiating videos from static images: the temporal aspect. This reveals whether the actions in these datasets really need the motion information of an actor, or whether the occurrence of an action can be predicted even from a single frame. Finally, we investigate the widely held assumptions on the importance of temporal ordering: is temporal ordering important for detecting these actions? Such extreme experiments show the existence of biases which have managed to creep into existing methods in spite of careful modeling.
LG · Oct 2, 2022
Robust Empirical Risk Minimization with Tolerance
Robi Bhattacharjee, Max Hopkins, Akash Kumar et al.
Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $γ$-tolerant robust learning VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{γδ}}{ε^2}\right)$ samples for robustness regions of (maximum) diameter $D$.
CV · Mar 8, 2022
End-to-End Semi-Supervised Learning for Video Action Detection
Akash Kumar, Yogesh Singh Rawat
In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach which effectively utilizes the unlabeled data. Video action detection requires both action class prediction and spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency, and 2) gradient smoothness. Both these aspects exploit the temporal continuity of action in videos and are found to be effective for utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we also show the effectiveness of the proposed approach for video object segmentation on Youtube-VOS, which demonstrates its generalization capability. The proposed approach achieves competitive performance by using merely 20% of annotations on UCF101-24 when compared with recent fully supervised methods. On UCF101-24, it improves the score by +8.9% and +11% at 0.5 f-mAP and v-mAP respectively, compared to the supervised approach.
65.6 · CV · May 2
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
Alejandro Aparcedo, Akash Kumar, Aaryan Garg et al.
Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first, large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
CV · Jun 9, 2023
A Large-Scale Analysis on Self-Supervised Video Representation Learning
Akash Kumar, Ashlesha Kumar, Vibhav Vineet et al.
Self-supervised learning is an effective way for label-free model pre-training, especially in the video domain where labeling is expensive. Existing self-supervised works in the video domain use varying experimental setups to demonstrate their effectiveness, and comparison across approaches becomes challenging with no standard benchmark. In this work, we first provide a benchmark that enables a comparison of existing approaches on the same ground. Next, we study five different aspects of self-supervised learning important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis. To facilitate this study, we focus on seven different methods along with seven different network architectures and perform an extensive set of experiments on 5 different datasets with an evaluation of two different downstream tasks. We present several interesting insights from this study which span different properties of pretraining and target datasets, pretext tasks, and model architectures, among others. We further put some of these insights to the real test and propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches which use 10x pretraining data. We believe this work will pave the way for researchers to a better understanding of self-supervised pretext tasks in video representation learning.
AR · Sep 23, 2023
AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming
Siva Satyendra Sahoo, Salim Ullah, Akash Kumar
With the increasing application of machine learning (ML) algorithms in embedded systems, there is a rising necessity to design low-cost computer arithmetic for these resource-constrained systems. As a result, emerging models of computation, such as approximate and stochastic computing, that leverage the inherent error-resilience of such algorithms are being actively explored for implementing ML inference on resource-constrained systems. Approximate computing (AxC) aims to provide disproportionate gains in the power, performance, and area (PPA) of an application by allowing some level of reduction in its behavioral accuracy (BEHAV). Using approximate operators (AxOs) for computer arithmetic forms one of the more prevalent methods of implementing AxC. AxOs provide the additional scope for finer granularity of optimization, compared to only precision scaling of computer arithmetic. To this end, designing platform-specific and cost-efficient approximate operators forms an important research goal. Recently, multiple works have reported using AI/ML-based approaches for synthesizing novel FPGA-based AxOs. However, most of such works limit usage of AI/ML to designing ML-based surrogate functions used during iterative optimization processes. To this end, we propose a novel data analysis-driven mathematical programming-based approach to synthesizing approximate operators for FPGAs. Specifically, we formulate mixed integer quadratically constrained programs based on the results of correlation analysis of the characterization data and use the solutions to enable a more directed search approach for evolutionary optimization algorithms. Compared to traditional evolutionary algorithms-based optimization, we report up to 21% improvement in the hypervolume, for joint optimization of PPA and BEHAV, in the design of signed 8-bit multipliers.
AR · Sep 22, 2023
AxOCS: Scaling FPGA-based Approximate Operators using Configuration Supersampling
Siva Satyendra Sahoo, Salim Ullah, Soumyo Bhattacharjee et al.
The rising usage of AI and ML-based processing across application domains has exacerbated the need for low-cost ML implementation, specifically for resource-constrained embedded systems. To this end, approximate computing, an approach that explores the power, performance, area (PPA), and behavioral accuracy (BEHAV) trade-offs, has emerged as a possible solution for implementing embedded machine learning. Due to the predominance of MAC operations in ML, designing platform-specific approximate arithmetic operators forms one of the major research problems in approximate computing. Recently there has been a rising usage of AI/ML-based design space exploration techniques for implementing approximate operators. However, most of these approaches are limited to using ML-based surrogate functions for predicting the PPA and BEHAV impact of a set of related design decisions. While this approach leverages the regression capabilities of ML methods, it does not exploit the more advanced approaches in ML. To this end, we propose AxOCS, a methodology for designing approximate arithmetic operators through ML-based supersampling. Specifically, we present a method to leverage the correlation of PPA and BEHAV metrics across operators of varying bit-widths for generating larger bit-width operators. The proposed approach involves traversing the relatively smaller design space of smaller bit-width operators and employing its associated Design-PPA-BEHAV relationship to generate initial solutions for metaheuristics-based optimization for larger operators. The experimental evaluation of AxOCS for FPGA-optimized approximate operators shows that the proposed approach significantly improves the quality (the resulting hypervolume for multi-objective optimization) of 8x8 signed approximate multipliers.
LG · Sep 11, 2023
Temporal Patience: Efficient Adaptive Deep Learning for Embedded Radar Data Processing
Max Sponner, Julius Ott, Lorenzo Servadei et al.
Radar sensors offer power-efficient solutions for always-on smart devices, but processing the data streams on resource-constrained embedded platforms remains challenging. This paper presents novel techniques that leverage the temporal correlation present in streaming radar data to enhance the efficiency of Early Exit Neural Networks for Deep Learning inference on embedded devices. These networks add classifier branches between the architecture's hidden layers that allow for early termination of the inference if their result is deemed sufficient by a runtime decision mechanism. Our methods enable more informed decisions on when to terminate the inference, reducing computational costs while maintaining a minimal loss of accuracy. Our results demonstrate that our techniques save up to 26% of operations per inference over a Single Exit Network and 12% over a confidence-based Early Exit version. Our proposed techniques work on commodity hardware and can be combined with traditional optimizations, making them accessible for resource-constrained embedded platforms commonly used in smart devices. Such efficiency gains enable real-time radar data processing on resource-constrained platforms, allowing for new applications in the context of smart homes, Internet-of-Things, and human-computer interaction.
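For readers unfamiliar with early exits, the sketch below shows the confidence-based baseline that the abstract compares against, not the paper's temporal technique: each branch is evaluated in order, and inference stops as soon as the top softmax probability reaches a threshold. The function names, the zero-argument branch callables, and the 0.9 default threshold are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(branches, threshold=0.9):
    """Evaluate classifier branches lazily, shallowest first.

    `branches` is a non-empty list of zero-argument callables, each returning
    the logits of one exit (a stand-in for running the network up to that exit).
    Returns (predicted_class, exit_index). The last branch always answers.
    """
    last = len(branches) - 1
    for i, branch in enumerate(branches):
        probs = softmax(branch())      # cost is only paid for exits we reach
        confidence = max(probs)
        if confidence >= threshold or i == last:
            return probs.index(confidence), i

if __name__ == "__main__":
    branches = [lambda: [0.1, 0.2, 0.0], lambda: [5.0, 0.0, 0.0]]
    print(early_exit_predict(branches))         # falls through to the final exit
    print(early_exit_predict(branches, 0.3))    # stops at the first exit
```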
44.2 · DS · May 22
Reducing the Randomness in Partition Oracles for Bounded Degree Minor-Free Graphs
Akash Kumar, Abhiruk Lahiri, C. Seshadhri
Consider a bounded-degree graph $G$ that belongs to a minor-closed family (such as planar graphs). Such a graph has a hyperfinite decomposition, wherein, for a sufficiently small $\varepsilon > 0$, one can remove $\varepsilon dn$ edges to obtain connected components of size independent of $n$. (As usual, $n$ is the number of vertices and $d$ is the degree bound.) In a seminal result, Hassidim-Kelner-Nguyen-Onak (FOCS 2009) introduced the partition oracle, a procedure that provides local access to a hyperfinite decomposition. The partition oracle computes the component containing an input vertex $v$ with query complexity (to $G$) independent of $n$. Remarkably, this is done without any preprocessing on $G$. The coordination is done purely through a shared random seed. Despite a line of work on optimizing the query complexity of partition oracles, there were no attempts to bound the size of the random seed. All existing partition oracles use a random seed of size $Ω(n)$, which technically implies a linear setup time. Any blackbox derandomization would likely need $Ω(\log^2 n)$ uniform random bits. A natural question is whether the random seed can also have length independent of $n$. We prove that the $\mathrm{poly}(d\varepsilon^{-1})$-query partition oracles of Kumar-Seshadhri-Stolman can be implemented with a random seed of length $\mathrm{poly}(d\varepsilon^{-1}) \cdot \log n$. To get a deeper understanding of the randomness complexity, we consider a more general model where the vertex labels come from the universe $[N]$, where $N \geq n$. In this setting, we prove that any partition oracle, even for cycles, requires $ω_N(1)$ random bits.
CV · Dec 12, 2023 · Code
Semi-supervised Active Learning for Video Action Detection
Ayush Singh, Aayush J Rana, Akash Kumar et al.
In this work, we focus on label-efficient learning for video action detection. We develop a novel semi-supervised active learning approach which utilizes both labeled and unlabeled data along with informative sample selection for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both informative sample selection in active learning and pseudo-label generation in semi-supervised learning (SSL). First, we propose NoiseAug, a simple augmentation strategy which effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering which enables effective utilization of pseudo labels for SSL in video action detection by emphasizing the relevant activity region within a video. We evaluate the proposed approach on three different benchmark datasets, UCF-101-24, JHMDB-21, and Youtube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches on both UCF101-24 and JHMDB-21. Next, we also show its effectiveness on Youtube-VOS for video object segmentation, demonstrating its generalization capability for other dense prediction tasks in videos. The code and models are publicly available at: https://github.com/AKASH2907/semi-sup-active-learning.
CV · Nov 26, 2025
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Weitian Wang, Lukas Meiner, Rai Shubham et al.
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens at multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference.
DS · Feb 5, 2013
Approximation Algorithms for Digraph Width Parameters
Shiva Kintali, Nishad Kothari, Akash Kumar
Several problems that are NP-hard on general graphs are efficiently solvable on graphs with bounded treewidth. Efforts have been made to generalize treewidth and the related notion of pathwidth to digraphs. Directed treewidth, DAG-width and Kelly-width are some such notions which generalize treewidth, whereas directed pathwidth generalizes pathwidth. Each of these digraph width measures has an associated decomposition structure. In this paper, we present approximation algorithms for all these digraph width parameters. In particular, we give an $O(\sqrt{\log n})$-approximation algorithm for directed treewidth, and an $O(\log^{3/2} n)$-approximation algorithm for directed pathwidth, DAG-width and Kelly-width. Our algorithms construct the corresponding decompositions whose widths are within the above-mentioned approximation factors.
AR · Feb 26
Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators
Yuhao Liu, Salim Ullah, Akash Kumar
Neural network accelerators have been widely applied to edge devices for complex tasks like object tracking, image recognition, etc. Previous works have explored quantization technologies in related lightweight accelerator designs to reduce hardware resource consumption. However, low precision leads to high accuracy loss in inference. Therefore, mixed-precision quantization becomes an alternative solution by applying different precision in different layers to trade off resource consumption and accuracy. Because regular designs for multiplication on hardware cannot support precision reconfiguration for a multi-precision Quantized Neural Network (QNN) model at runtime, we propose a runtime-reconfigurable multi-precision multi-channel bitwise systolic array design for QNN accelerators. We have implemented and evaluated our work on the Ultra96 FPGA platform. Results show that our work can achieve a 1.3185x to 3.5671x speedup in inferring mixed-precision models and has less critical path delay, supporting a higher clock frequency (250MHz).
BM · Dec 19, 2025
Application of machine learning to predict food processing level using Open Food Facts
Nalin Arora, Aviral Chauhan, Siddhant Rana et al.
Ultra-processed foods are increasingly linked to health issues like obesity, cardiovascular disease, type 2 diabetes, and mental health disorders due to poor nutritional quality. This first-of-its-kind study at such a scale uses machine learning to classify food processing levels (NOVA) based on the Open Food Facts dataset of over 900,000 products. Models including LightGBM, Random Forest, and CatBoost were trained on nutrient concentration data. LightGBM performed best, achieving 80-85% accuracy across different nutrient panels and effectively distinguishing minimally from ultra-processed foods. Exploratory analysis revealed strong associations between higher NOVA classes and lower Nutri-Scores, indicating poorer nutritional quality. Products in NOVA 3 and 4 also had higher carbon footprints and lower Eco-Scores, suggesting greater environmental impact. Allergen analysis identified gluten and milk as common in ultra-processed items, posing risks to sensitive individuals. Categories like Cakes and Snacks were dominant in higher NOVA classes, which also had more additives, highlighting the role of ingredient modification. This study, leveraging the largest dataset of NOVA-labeled products, emphasizes the health, environmental, and allergenic implications of food processing and showcases machine learning's value in scalable classification. A user-friendly web tool is available for NOVA prediction using nutrient data: https://cosylab.iiitd.edu.in/foodlabel/.
AR · Feb 25
GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators
Yuhao Liu, Salim Ullah, Akash Kumar
With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators. Classic multi-threshold activation hardware requires 2^n thresholds for n-bit outputs, causing a rapid increase in hardware cost as precision increases. We propose a reconfigurable activation hardware, GRAU, based on piecewise linear fitting, where the segment slopes are approximated by powers of two. Our design requires only basic comparators and 1-bit right shifters, supporting mixed-precision quantization and nonlinear functions such as SiLU. Compared with multi-threshold activators, GRAU reduces LUT consumption by over 90%, achieving higher hardware efficiency, flexibility, and scalability.
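To make the idea of piecewise linear fitting with power-of-two slopes concrete, here is a minimal integer-only sketch: the segment is selected with comparisons, and the slope is applied as a right shift. The segment layout, the helper names `nearest_pow2_shift` and `pwl_shift_eval`, and the restriction to slopes in (0, 1] are assumptions made for illustration; the abstract does not specify GRAU's actual datapath or how it handles negative slopes such as those in SiLU.

```python
import math

def nearest_pow2_shift(slope):
    """Return s >= 0 such that 2**-s is the power of two nearest to `slope`
    on a log scale. Assumes 0 < slope <= 1."""
    return max(0, round(-math.log2(slope)))

def pwl_shift_eval(x, segments):
    """Evaluate a piecewise linear function using comparisons and shifts only.

    `segments` is a list of (x_start, y_start, shift) tuples sorted by x_start;
    within a segment, y = y_start + ((x - x_start) >> shift).
    """
    x0, y0, shift = segments[0]
    for seg in segments[1:]:
        if x >= seg[0]:                 # comparator picks the active segment
            x0, y0, shift = seg
    return y0 + ((x - x0) >> shift)     # slope 2**-shift as a right shift

if __name__ == "__main__":
    # Two hypothetical segments: slope 1/4 on [0, 16), slope 1 from 16 onward.
    segments = [(0, 0, 2), (16, 4, 0)]
    print([pwl_shift_eval(x, segments) for x in (0, 8, 16, 20)])
```

A multi-bit shift can be built from a chain of 1-bit shifters, which is one plausible reading of the abstract's "1-bit right shifters"; that mapping is a guess rather than something the abstract states.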
PL · Mar 8, 2021 · Code
Compiler Toolchains for Deep Learning Workloads on Embedded Platforms
Max Sponner, Bernd Waschneck, Akash Kumar
As the usage of deep learning becomes increasingly popular in mobile and embedded solutions, it is necessary to convert the framework-specific network representations into executable code for these embedded platforms. This paper consists of two parts. The first part is a survey and benchmark of the available open-source deep learning compiler toolchains, focusing on the capabilities and performance of the individual solutions with regard to targeting embedded devices and microcontrollers that are combined with a dedicated accelerator in a heterogeneous fashion. The second part explores the implementation and evaluation of a compilation flow for such a heterogeneous device and reuses one of the existing toolchains to demonstrate the necessary steps for hardware developers who plan to build a software flow for their own hardware.
AR · Feb 26
BiKA: Kolmogorov-Arnold-Network-inspired Ultra Lightweight Neural Network Hardware Accelerator
Yuhao Liu, Salim Ullah, Akash Kumar
Lightweight neural network accelerators are essential for edge devices with limited resources and power constraints. While quantization and binarization can efficiently reduce hardware cost, they still rely on the conventional Artificial Neural Network (ANN) computation pattern. The recently proposed Kolmogorov-Arnold Network (KAN) presents a novel network paradigm built on learnable nonlinear functions. However, it is computationally expensive for hardware deployment. Inspired by KAN, we propose BiKA, a multiply-free architecture that replaces nonlinear functions with binary, learnable thresholds, introducing an extremely lightweight computational pattern that requires only comparators and accumulators. Our FPGA prototype on Ultra96-V2 shows that BiKA reduces hardware resource usage by 27.73% and 51.54% compared with binarized and quantized neural network systolic array accelerators, while maintaining competitive accuracy. BiKA provides a promising direction for hardware-friendly neural network design on edge devices.
38.3 · DC · Apr 30
AnTi-MiCS: Analytical Framework for Bounding Time in Embedded Mixed-Criticality Systems
Behnaz Ranjbar, Akash Kumar
In Mixed-Criticality (MC) systems, although the high Worst-Case Execution Time (WCET) serves as a conservative upper bound representing the task's maximum execution time under all conditions, obtaining a low WCET is essential for representing realistic executions and improving utilization and Quality-of-Service (QoS). Nevertheless, determining appropriate low WCET(s) for lower-criticality (LO) modes poses a significant challenge. Opting for a very low value of this WCET enhances processor utilization by scheduling more tasks in LO mode. Conversely, employing a larger WCET ensures fewer mode switches, thereby enhancing QoS, albeit at the cost of processor utilization. This paper proposes an analytical approach, AnTi-MiCS, to determine the appropriate low WCET through design-time analysis of task executions. In some cases, a single low WCET may not be adequate to capture large variations in the execution time distribution, for example, in scenarios like bimodal distributions. Therefore, we further propose a scalable approach, MulTi-MiCS, to compute multiple appropriate low WCETs. This approach exploits the temporal correlation between subsequent inputs presented to the application. Experimental results, conducted on a real platform with embedded real-time benchmarks, demonstrate the efficacy of our proposed scheme, in which QoS is improved by 30.27% on average while reducing utilization waste by 35.89%, compared to existing approaches. Besides, MulTi-MiCS improves QoS by 6.41% compared to AnTi-MiCS while reducing utilization waste by 8.23%.
22.5 · AI · Apr 30
Focus Session: Autonomous Systems Dependability in the era of AI: Design Challenges in Safety, Security, Reliability and Certification
Behnaz Ranjbar, Kirankumar Raveendiran, Sudeep Pasricha et al.
The design of embedded safety-critical systems, such as those used in next-generation automotive and autonomous platforms, is increasingly challenged by escalating system complexity, hardware-software heterogeneity, and the integration of intelligent, data-driven components. Ensuring dependability in such systems requires a holistic approach that spans multiple abstraction layers and encompasses both design- and run-time assurance. Traditional methods for reliability, safety, and security management often fall short in addressing the dynamic and uncertain behaviors introduced by Artificial Intelligence (AI) and Machine Learning (ML) components, especially under stringent real-time, power, and safety constraints. While AI and ML offer powerful predictive, adaptive, and self-optimizing capabilities that can enhance system dependability, their inherent non-determinism, data-dependence, and lack of formal guarantees introduce new challenges for verification, validation, and certification. This paper explores emerging methodologies, architectures, and frameworks for designing dependable autonomous and embedded systems in the era of AI. It highlights advances in reliability modeling, secure system design, and certification approaches that account for imperfect, learning-enabled components, aiming to bridge the gap between AI innovation and certifiable system-level dependability.
CV · Dec 10, 2024
Stable Mean Teacher for Semi-supervised Video Action Detection
Akash Kumar, Sirshapan Mitra, Yogesh Singh Rawat
In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.
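As background for the teacher-based setup, the sketch below shows the exponential moving average (EMA) weight update used in the standard Mean Teacher recipe, where the teacher's parameters track a smoothed copy of the student's. It is generic background only: the abstract does not describe the paper's update rule, the Error Recovery module, or the Difference of Pixels constraint in enough detail to sketch them, and the dictionary-of-floats parameter layout and the 0.99 decay are assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    `teacher` and `student` map parameter names to floats (a stand-in for
    tensors); `decay` close to 1.0 makes the teacher change slowly.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

if __name__ == "__main__":
    teacher = {"w": 1.0, "b": 0.0}
    student = {"w": 0.0, "b": 1.0}
    for _ in range(3):                  # three hypothetical training steps
        teacher = ema_update(teacher, student, decay=0.9)
    print(teacher)
```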
CV · Feb 28, 2025
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg, Akash Kumar, Yogesh S Rawat
In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
CVJan 28, 2025
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video GroundingAkash Kumar, Zsolt Kira, Yogesh Singh Rawat
In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
CVApr 8, 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation LearningAkash Kumar, Ashlesha Kumar, Vibhav Vineet et al.
Self-supervised learning has emerged as a powerful paradigm for label-free model pretraining, particularly in the video domain, where manual annotation is costly and time-intensive. However, existing self-supervised approaches employ diverse experimental setups, making direct comparisons challenging due to the absence of a standardized benchmark. In this work, we establish a unified benchmark that enables fair comparisons across different methods. Additionally, we systematically investigate five critical aspects of self-supervised learning in videos: (1) dataset size, (2) model complexity, (3) data distribution, (4) data noise, and (5) feature representations. To facilitate this study, we evaluate six self-supervised learning methods across six network architectures, conducting extensive experiments on five benchmark datasets and assessing performance on two distinct downstream tasks. Our analysis reveals key insights into the interplay between pretraining strategies, dataset characteristics, pretext tasks, and model architectures. Furthermore, we extend these findings to Video Foundation Models (ViFMs), demonstrating their relevance in large-scale video representation learning. Finally, leveraging these insights, we propose a novel approach that significantly reduces training data requirements while surpassing state-of-the-art methods that rely on 10% more pretraining data. We believe this work will guide future research toward a deeper understanding of self-supervised video representation learning and its broader implications.
LGNov 18, 2024
Mirror Descent on Reproducing Kernel Banach SpacesAkash Kumar, Mikhail Belkin, Parthe Pandit
Recent advances in machine learning have led to increased interest in reproducing kernel Banach spaces (RKBS) as a more general framework that extends beyond reproducing kernel Hilbert spaces (RKHS). These works have resulted in the formulation of representer theorems under several regularized learning schemes. However, little is known about an optimization method that encompasses these results in this setting. This paper addresses a learning problem on Banach spaces endowed with a reproducing kernel, focusing on efficient optimization within RKBS. To tackle this challenge, we propose an algorithm based on mirror descent (MDA). Our approach involves an iterative method that employs gradient steps in the dual space of the Banach space using the reproducing kernel. We analyze the convergence properties of our algorithm under various assumptions and establish two types of results: first, we identify conditions under which a linear convergence rate is achievable, akin to optimization in the Euclidean setting, and provide a proof of the linear rate; second, we demonstrate a standard convergence rate in a constrained setting. Moreover, to instantiate this algorithm in practice, we introduce a novel family of RKBSs with $p$-norm ($p \neq 2$), characterized by both an explicit dual map and a kernel.
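To make the idea of "gradient steps in the dual space" concrete, here is a minimal finite-dimensional sketch of mirror descent with the mirror map $\frac{1}{2}\|x\|_p^2$. It is an illustration under stated assumptions, not the paper's RKBS algorithm: it works on plain vectors in $\mathbb{R}^d$ without any kernel, and the objective, the step size `lr`, and the helper names `dual_map` and `mirror_descent` are hypothetical.

```python
import math


def dual_map(x, p):
    """Gradient of 0.5 * ||x||_p^2; maps a point to the dual space.
    Its inverse is the same map with the conjugate exponent q = p / (p - 1)."""
    norm = sum(abs(v) ** p for v in x) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(x)
    scale = norm ** (2.0 - p)
    return [math.copysign(abs(v) ** (p - 1.0), v) * scale for v in x]


def mirror_descent(grad, x0, p=1.5, lr=0.1, steps=500):
    """Mirror descent: map to the dual, take a gradient step, map back."""
    q = p / (p - 1.0)  # conjugate exponent
    x = list(x0)
    for _ in range(steps):
        theta = dual_map(x, p)                      # primal -> dual
        g = grad(x)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        x = dual_map(theta, q)                      # dual -> primal
    return x


# Toy objective (an assumption for this example): squared distance to TARGET.
TARGET = [1.0, -2.0, 0.5]


def loss(x):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, TARGET))


def loss_grad(x):
    return [a - b for a, b in zip(x, TARGET)]


x_final = mirror_descent(loss_grad, [0.0, 0.0, 0.0])
```

For $p = 2$ the dual map is the identity and the loop reduces to ordinary gradient descent, which is why the $p \neq 2$ case is the interesting one.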
LGMar 5
MCEL: Margin-Based Cross-Entropy Loss for Error-Tolerant Quantized Neural NetworksMikail Yayla, Akash Kumar
Robustness to bit errors is a key requirement for the reliable use of neural networks (NNs) on emerging approximate computing platforms and error-prone memory technologies. A common approach to achieve bit error tolerance in NNs is injecting bit flips during training according to a predefined error model. While effective in certain scenarios, training-time bit flip injection introduces substantial computational overhead, often degrades inference accuracy at high error rates, and scales poorly for larger NN architectures. These limitations make error injection an increasingly impractical solution for ensuring robustness on future approximate computing platforms and error-prone memory technologies. In this work, we investigate the mechanisms that enable NNs to tolerate bit errors without relying on error-aware training. We establish a direct connection between bit error tolerance and classification margins at the output layer. Building on this insight, we propose a novel loss function, the Margin Cross-Entropy Loss (MCEL), which explicitly promotes logit-level margin separation while preserving the favorable optimization properties of the standard cross-entropy loss. Furthermore, MCEL introduces an interpretable margin parameter that allows robustness to be tuned in a principled manner. Extensive experimental evaluations across multiple datasets of varying complexity, diverse NN architectures, and a range of quantization schemes demonstrate that MCEL substantially improves bit error tolerance, up to 15 % in accuracy for an error rate of 1 %. Our proposed MCEL method is simple to implement, efficient, and can be integrated as a drop-in replacement for standard CEL. It provides a scalable and principled alternative to training-time bit flip injection, offering new insights into the origins of NN robustness and enabling more efficient deployment on approximate computing and memory systems.
LGMar 12, 2024
Temporal Decisions: Leveraging Temporal Correlation for Efficient Decisions in Early Exit Neural NetworksMax Sponner, Lorenzo Servadei, Bernd Waschneck et al.
Deep Learning is becoming increasingly relevant in Embedded and Internet-of-Things applications. However, deploying models on embedded devices poses a challenge due to their resource limitations. This can impact the model's inference accuracy and latency. One potential solution is Early Exit Neural Networks, which adjust model depth dynamically through additional classifiers attached between their hidden layers. However, the real-time termination decision mechanism is critical for the system's efficiency, latency, and sustained accuracy. This paper introduces Difference Detection and Temporal Patience as decision mechanisms for Early Exit Neural Networks. They leverage the temporal correlation present in sensor data streams to efficiently terminate the inference. We evaluate their effectiveness in health monitoring, image classification, and wake-word detection tasks. Our novel contributions significantly reduced the computational footprint compared to established decision mechanisms while maintaining higher accuracy scores. We achieved a reduction of mean operations per inference by up to 80% while maintaining accuracy levels within 5% of the original model. These findings highlight the importance of considering temporal correlation in sensor data to improve the termination decision.
CVDec 11, 2025
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental UpdatesShresth Grover, Priyank Pathak, Akash Kumar et al.
Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA. Project page: https://shroglck.github.io/cos_plan/
LGJan 7
A Gap Between Decision Trees and Neural NetworksAkash Kumar
We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{R}\mathrm{TV}$) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator $1_A$ has infinite $\mathrm{R}\mathrm{TV}$. Moreover, two natural split-wise continuous surrogates--piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing--also have infinite $\mathrm{R}\mathrm{TV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{R}\mathrm{TV}$ but with an explicit exponential dependence on $d$. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{R}\mathrm{TV}$ whose fixed threshold $τ=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{R}\mathrm{TV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy--complexity tradeoff and how threshold selection shifts where training lands along it.
CVNov 17, 2025
RobustGait: Robustness Analysis for Appearance Based Gait RecognitionReeshoon Sayera, Akash Kumar, Sirshapan Mitra et al.
Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Our evaluation yields several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
CVAug 19, 2025
OmViD: Omni-supervised active learning for video action detectionAayush Rana, Akash Kumar, Vibhav Vineet et al.
Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
CVJul 25, 2025
MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization PerspectiveWeitian Wang, Rai Shubham, Cecilia De La Parra et al.
In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.
ARMar 18, 2025
Retrospective: A CORDIC Based Configurable Activation Function for NN ApplicationsOmkar Kokane, Gopal Raut, Salim Ullah et al.
A CORDIC-based configuration for the design of Activation Functions (AF) was previously suggested to accelerate ASIC hardware design for resource-constrained systems by providing functional reconfigurability. Since its introduction, this new approach for neural network acceleration has gained widespread popularity, influencing numerous designs for activation functions in both academic and commercial AI processors. In this retrospective analysis, we explore the foundational aspects of this initiative, summarize key developments over recent years, and introduce the DA-VINCI AF tailored for the evolving needs of AI applications. This new generation of dynamically configurable and precision-adjustable activation function cores promise greater adaptability for a range of activation functions in AI workloads, including Swish, SoftMax, SeLU, and GeLU, utilizing the Shift-and-Add CORDIC technique. The previously presented design has been optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, culminating in an accumulative NEURIC compute unit. These enhancements position NEURIC as a fundamental component in the resource-efficient vector engine for the realization of AI accelerators that focus on DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.
LGFeb 22, 2025
A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic AnalysisAkash Kumar, Rahul Parhi, Mikhail Belkin
Recent works have characterized the function-space inductive bias of infinite-width bounded-norm single-hidden-layer neural networks as a kind of bounded-variation-type space. This novel neural network Banach space encompasses many classical multivariate function spaces, including certain Sobolev spaces and the spectral Barron spaces. Notably, this Banach space also includes functions that exhibit less classical regularity, such as those that only vary in a few directions. On bounded domains, it is well-established that the Gaussian reproducing kernel Hilbert space (RKHS) strictly embeds into this Banach space, demonstrating a clear gap between the Gaussian RKHS and the neural network Banach space. It turns out that when investigating these spaces on unbounded domains, e.g., all of $\mathbb{R}^d$, the story is fundamentally different. We establish the following fundamental result: Certain functions that lie in the Gaussian RKHS have infinite norm in the neural network Banach space. This provides a nontrivial gap between kernel methods and neural networks by exhibiting functions that kernel methods easily represent, whereas neural networks cannot.
LGFeb 8, 2025
The Complexity of Learning Sparse Superposed Features with FeedbackAkash Kumar
The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machines and dictionary extraction from sparse autoencoders trained on Large Language Models.
LGDec 2, 2024
Learning Smooth Distance Functions via QueriesAkash Kumar, Sanjoy Dasgupta
In this work, we investigate the problem of learning distance functions within the query-based learning framework, where a learner is able to pose triplet queries of the form: ``Is $x_i$ closer to $x_j$ or $x_k$?'' We establish formal guarantees on the query complexity required to learn smooth, but otherwise general, distance functions under two notions of approximation: $ω$-additive approximation and $(1 + ω)$-multiplicative approximation. For the additive approximation, we propose a global method whose query complexity is quadratic in the size of a finite cover of the sample space. For the (stronger) multiplicative approximation, we introduce a method that combines global and local approaches, utilizing multiple Mahalanobis distance functions to capture local geometry. This method has a query complexity that scales quadratically with both the size of the cover and the ambient space dimension of the sample space.
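The triplet query described above can be sketched as a small oracle that a learner would call. This is only an illustration of the query model, assuming the hidden distance is a single Mahalanobis distance with a known positive semi-definite matrix `M`; the function names and the tie-breaking rule are hypothetical and not taken from the paper.

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a square matrix M
    given as a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))


def triplet_query(x_i, x_j, x_k, M):
    """Answer 'Is x_i closer to x_j or x_k?' under the distance induced by M.
    Returns 'j' or 'k'; ties go to 'j' (an arbitrary choice for this sketch)."""
    if mahalanobis_sq(x_i, x_j, M) <= mahalanobis_sq(x_i, x_k, M):
        return "j"
    return "k"


# The same three points give different answers under different metrics.
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[9.0, 0.0], [0.0, 0.1]]
answer_identity = triplet_query((0.0, 0.0), (1.0, 0.0), (0.0, 3.0), identity)
answer_stretched = triplet_query((0.0, 0.0), (1.0, 0.0), (0.0, 3.0), stretched)
```

The learner only ever sees the returned label, never the distances themselves, which is what makes query complexity the natural cost measure in this setting.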
LGMar 12, 2024
Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT EnvironmentsMax Sponner, Lorenzo Servadei, Bernd Waschneck et al.
Early Exit Neural Networks (EENNs) present a solution to enhance the efficiency of neural network deployments. However, creating EENNs is challenging and requires specialized domain knowledge, due to the large number of additional design choices. To address this issue, we propose an automated augmentation flow that focuses on converting an existing model into an EENN. It performs all required design decisions for the deployment to heterogeneous or distributed hardware targets: Our framework constructs the EENN architecture, maps its subgraphs to the hardware targets, and configures its decision mechanism. To the best of our knowledge, it is the first framework that is able to perform all of these steps. We evaluated our approach on a collection of Internet-of-Things and standard image classification use cases. For a speech command detection task, our solution was able to reduce the mean operations per inference by 59.67%. For an ECG classification task, it was able to terminate all samples early, reducing the mean inference energy by 74.9% and computations by 78.3%. On CIFAR-10, our solution was able to achieve up to a 58.75% reduction in computations. The search on a ResNet-152 base model for CIFAR-10 took less than nine hours on a laptop CPU. Our proposed approach enables the creation of EENNs optimized for IoT environments and can reduce the inference cost of Deep Learning applications on embedded and fog platforms, while also significantly reducing the search cost, making it more accessible for scientists and engineers in industry and research. The low search cost improves the accessibility of EENNs, with the potential to improve the efficiency of neural networks in a wide range of practical applications.
LGDec 4, 2020
Logic Synthesis Meets Machine Learning: Trading Exactness for GeneralizationShubham Rai, Walter Lau Neto, Yukio Miyasaka et al.
Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthesis rely on SAT and Boolean methods to exactly implement the care set, we investigate learning in logic synthesis, attempting to trade exactness for generalization. This work is directly related to machine learning where the care set is the training set and the implementation is expected to generalize on a validation set. We present learning incompletely-specified functions based on the results of a competition conducted at IWLS 2020. The goal of the competition was to implement 100 functions given by a set of care minterms for training, while testing the implementation using a set of validation minterms sampled from the same function. We make this benchmark suite available and offer a detailed comparative analysis of the different approaches to learning
ARNov 2, 2020
SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable AccuracyZahra Ebrahimi, Salim Ullah, Akash Kumar
The ever-increasing quest for data-level parallelism and variable precision in ubiquitous multimedia and Deep Neural Network (DNN) applications has motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To alleviate energy as their main resource constraint, approximate computing has re-emerged, albeit mainly specialized for their Application-Specific Integrated Circuit (ASIC) implementations. This paper presents, for the first time, an SIMD architecture based on a novel multiplier and divider with tunable accuracy, targeted for Field-Programmable Gate Arrays (FPGAs). The proposed hybrid architecture implements Mitchell's algorithms and supports precision variability from 8 to 32 bits. Experimental results obtained from Vivado, multimedia and DNN applications indicate the superiority of the proposed architecture (both SISD and SIMD) over accurate and state-of-the-art approximate counterparts. In particular, the proposed SISD divider outperforms the accurate Intellectual Property (IP) divider provided by Xilinx with 4x higher speed and 4.6x less energy while tolerating less than 0.8% error. Moreover, the proposed SIMD multiplier-divider supersedes the accurate SIMD multiplier by achieving up to 26%, 45%, 36%, and 56% improvement in area, throughput, power, and energy, respectively.
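For readers unfamiliar with Mitchell's algorithm, the sketch below shows the classic logarithm-based approximate multiplication in software. It is the textbook form only, written under the assumption of unsigned integer operands; it does not model the paper's hybrid architecture, its tunable accuracy, or its SIMD support, and the function name is hypothetical.

```python
def mitchell_multiply(a: int, b: int) -> int:
    """Classic Mitchell approximation of a * b for non-negative integers.

    Write a = 2^k1 * (1 + x1) and b = 2^k2 * (1 + x2) with 0 <= x < 1, and
    approximate log2(1 + x) by x. The product then becomes
      2^(k1 + k2)     * (1 + x1 + x2)   if x1 + x2 < 1
      2^(k1 + k2 + 1) * (x1 + x2)       otherwise.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1          # position of the leading one of a
    k2 = b.bit_length() - 1          # position of the leading one of b
    # Fractional parts x1 and x2, both scaled by 2^(k1 + k2) to stay integral.
    f1 = (a - (1 << k1)) << k2
    f2 = (b - (1 << k2)) << k1
    frac_sum = f1 + f2
    one = 1 << (k1 + k2)
    if frac_sum < one:
        return one + frac_sum
    return 2 * frac_sum


exact_case = mitchell_multiply(4, 8)    # powers of two are reproduced exactly
worst_case = mitchell_multiply(3, 3)    # 8 instead of 9: the worst-case operand pattern
```

The approximation never overestimates, and its relative error peaks at 1/9 (about 11.1%) when both fractional parts equal 0.5; hardware designs typically add correction terms on top of this baseline to trade area for accuracy.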
LGOct 27, 2020
The Teaching Dimension of Kernel PerceptronAkash Kumar, Hanqi Zhang, Adish Singla et al.
Algorithmic machine teaching has been studied under the linear setting where exact teaching is possible. However, little is known for teaching nonlinear learners. Here, we establish the sample complexity of teaching, aka teaching dimension, for kernelized perceptrons for different families of feature maps. As a warm-up, we show that the teaching complexity is $Θ(d)$ for the exact teaching of linear perceptrons in $\mathbb{R}^d$, and $Θ(d^k)$ for kernel perceptron with a polynomial kernel of order $k$. Furthermore, under certain smooth assumptions on the data distribution, we establish a rigorous bound on the complexity for approximately teaching a Gaussian kernel perceptron. We provide numerical examples of the optimal (approximate) teaching set under several canonical settings for linear, polynomial and Gaussian kernel perceptrons.
AROct 24, 2020
ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network Design in FPGA-based SystemsSuresh Nambi, Salim Ullah, Aditya Lohana et al.
The recent advances in machine learning, in general, and Artificial Neural Networks (ANN), in particular, have made smart embedded systems an attractive option for a larger number of application areas. However, the high computational complexity, memory footprints, and energy requirements of machine learning models hinder their deployment on resource-constrained embedded systems. Most state-of-the-art works have considered this problem by proposing various low bit-width data representation schemes, optimized arithmetic operators' implementations, and different complexity reduction techniques such as network pruning. To further elevate the implementation gains offered by these individual techniques, there is a need to cross-examine and combine these techniques' unique features. This paper presents ExPAN(N)D, a framework to analyze and ingather the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs. The Posit scheme offers a better dynamic range and higher precision for various applications than IEEE $754$ single-precision floating-point format. However, due to the dynamic nature of the various fields of the Posit scheme, the corresponding arithmetic circuits have higher critical path delay and resource requirements than the single-precision-based arithmetic units. Towards this end, we propose a novel Posit to fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs with minimal drop in the output accuracy. We also propose a modified Posit-based representation to store the trained parameters of a network. Compared to an $8$-bit fixed-point-based inference accelerator, our proposed implementation offers $\approx46\%$ and $\approx18\%$ reductions in the storage requirements of the parameters and energy consumption of the MAC units, respectively.
LGJun 25, 2020
Average-case Complexity of Teaching Convex Polytopes via Halfspace QueriesAkash Kumar, Adish Singla, Yisong Yue et al.
We examine the task of locating a target region among those induced by intersections of $n$ halfspaces in $\mathbb{R}^d$. This generic task connects to fundamental machine learning problems, such as training a perceptron and learning a $φ$-separable dichotomy. We investigate the average teaching complexity of the task, i.e., the minimal number of samples (halfspace queries) required by a teacher to help a version-space learner in locating a randomly selected target. As our main result, we show that the average-case teaching complexity is $Θ(d)$, which is in sharp contrast to the worst-case teaching complexity of $Θ(n)$. If instead, we consider the average-case learning complexity, the bounds have a dependency on $n$ as $Θ(n)$ for i.i.d. queries and $Θ(d \log(n))$ for actively chosen queries by the learner. Our proof techniques are based on novel insights from computational geometry, which allow us to count the number of convex polytopes and faces in a Euclidean space depending on the arrangement of halfspaces. Our insights allow us to establish a tight bound on the average-case complexity for $φ$-separable dichotomies, which generalizes the known $\mathcal{O}(d)$ bound on the average number of "extreme patterns" in the classical computational geometry literature (Cover, 1965).
CRMay 26, 2020
SeqL: Secure Scan-Locking for IP ProtectionSeetal Potluri, Aydin Aysu, Akash Kumar
Existing logic-locking attacks are known to successfully decrypt the functionally correct key of a locked combinational circuit. It is possible to extend these attacks to real-world Silicon-based Intellectual Properties (IPs, which are sequential circuits) through scan-chains by selectively initializing the combinational logic and analyzing the responses. In this paper, we propose SeqL, which achieves functional isolation and locks selective flip-flop functional-input/scan-output pairs, thus rendering the decrypted key functionally incorrect. We conduct a formal study of the scan-locking problem and demonstrate automating our proposed defense on any given IP. We show that SeqL hides functionally correct keys from the attacker, thereby increasing the likelihood of the decrypted key being functionally incorrect. When tested on pipelined combinational benchmarks (ISCAS, MCNC), sequential benchmarks (ITC) and a fully-fledged RISC-V CPU, SeqL gave 100% resilience to a broad range of state-of-the-art attacks including SAT[1], Double-DIP[2], HackTest[3], SMT[4], FALL[5], Shift-and-Leak[6] and Multi-cycle attacks[7].
CVMar 19, 2020
Detecting Deepfakes with Metric LearningAkash Kumar, Arnav Bhavsar
With the arrival of several face-swapping applications such as FaceApp, SnapChat, MixBooth, FaceBlender and many more, the authenticity of digital media content is hanging on a very loose thread. On social media platforms, videos are widely circulated, often at a high compression factor. In this work, we analyze several deep learning approaches in the context of deepfake classification in high-compression scenarios and demonstrate that a proposed approach based on metric learning can be very effective in performing such a classification. Using fewer frames per video to assess its realism, the metric learning approach using a triplet network architecture proves to be fruitful. It learns to enhance the feature space distance between the clusters of real and fake video embedding vectors. We validated our approaches on two datasets to analyze the behavior in different environments. We achieved a state-of-the-art AUC score of 99.2% on the Celeb-DF dataset and accuracy of 90.71% on a highly compressed Neural Texture dataset. Our approach is especially helpful on social media platforms where data compression is inevitable.
CVFeb 3, 2020
Syn2Real: Forgery Classification via Unsupervised Domain AdaptationAkash Kumar, Arnav Bhavasar
In recent years, image manipulation has become increasingly accessible and yields more natural-looking images, owing to modern tools in image processing and computer vision. The task of identifying forged images has become very challenging. Among the different types of forgeries, cases of copy-move forgery are increasing manifold because this kind of tampering is difficult to detect. Publicly available datasets are insufficient to tackle such problems. In this paper, we propose to create a synthetic forged dataset using deep semantic image inpainting and a copy-move forgery algorithm. However, models trained on these datasets show a significant drop in performance when tested on more realistic data. To alleviate this problem, we use unsupervised domain adaptation networks to detect copy-move forgery in new domains by mapping the feature space from our synthetically generated dataset. Furthermore, we improved the F1 score on the CASIA and CoMoFoD datasets to 80.3% and 78.8%, respectively. Our approach can be helpful in cases where the classification of data is unavailable.
CVNov 14, 2019
Copy-Move Forgery Classification via Unsupervised Domain AdaptationAkash Kumar, Arnav Bhavsar
In the current era, image manipulation is becoming increasingly easy and yields more natural-looking images, owing to modern tools in image processing and computer vision. The task of segregating forged images has become very challenging. Publicly available datasets are insufficient to tackle such problems. In this paper, we propose to create a synthetic forged dataset using a deep semantic image inpainting algorithm. Furthermore, we use an unsupervised domain adaptation network to detect copy-move forgery in images. Our approach can be helpful in cases where the classification of data is unavailable.
CVDec 30, 2018
Solar Potential Analysis of Rooftops Using Satellite ImageryAkash Kumar
Solar energy is one of the most important sources of renewable energy and the cleanest form of energy. In India, solar energy could produce power on the order of a trillion kilowatt-hours per year, yet the country currently produces power only on the order of gigawatts. Many people are not aware of the solar potential of their rooftops and therefore assume that installing solar panels is very expensive. In this work, we introduce an approach that remotely generates a report on the solar potential of a building using only its latitude and longitude. We further evaluated various types of rooftops to make our solution more robust. We also provide the approximate rooftop area that can be used for solar panel placement and a visual analysis of how solar panels can be placed to maximize the solar power output at a location.
CVNov 30, 2018
Improving Landmark Recognition using Saliency detection and Feature classificationAkash Kumar, Sagnik Bhowmick, N. Jayanthi et al.
Image landmark recognition has been one of the most sought-after classification challenges in the field of vision and perception. After many years of generic classification of buildings and monuments from images, researchers are now focusing on fine-grained problems: recognizing the category of each building or monument. We propose an ensemble network for the classification of Indian landmark images. Our method gives robust classification by ensembling the predictions from a Graph-Based Visual Saliency (GBVS) network along with supervised feature-based classification algorithms such as kNN and Random Forest. The final architecture adaptively learns from all of the mentioned networks. The proposed network produces a reliable score to eliminate false category cases. We evaluated our model on a new dataset, which involves challenges such as landmark clutter, variable scaling, partial occlusion, etc.
CVOct 9, 2018
Bird Species Classification using Transfer Learning with Multistage TrainingSourya Dipta Das, Akash Kumar
Bird species classification has received increasing attention in the field of computer vision because of its promising applications in biology and environmental studies. Recognizing bird species is difficult due to the challenges of discriminative region localization and fine-grained feature learning. In this paper, we introduce a transfer-learning-based method with multistage training. We use a pre-trained Mask-RCNN and an ensemble model consisting of Inception Nets (InceptionV3 & InceptionResNetV2) to obtain the localization and the species of the bird from the images, respectively. Our final model achieves an F1 score of 0.5567 (55.67%) on the dataset provided in the CVIP 2018 Challenge.
IRAug 16, 2018
IceBreaker: Solving Cold Start Problem for Video Recommendation EnginesYaman Kumar, Agniv Sharma, Abhigyan Khaund et al.
The Internet has brought about a tremendous increase in content of all forms, and video content constitutes the backbone of the total content being published as well as watched. It therefore becomes imperative for video recommendation engines such as Hulu to look for novel and innovative ways to recommend newly added videos to their users. However, the problem with new videos is that they lack the metadata and user interaction needed to rate them for consumers. To this end, this paper introduces several techniques we developed for the Content Based Video Relevance Prediction (CBVRP) Challenge hosted by Hulu for the ACM Multimedia Conference 2018. We employ different architectures on the CBVRP dataset to make use of the provided frame-level and video-level features and to predict which videos are similar to other videos. We also implement several ensemble strategies to explore the complementarity between both types of provided features. The obtained results are encouraging and will push the boundaries of research on multimedia-based video recommendation systems.