Dipankar Sarkar

h-index31

11papers

3,777citations

Novelty36%

AI Score23

Ranked #175,993 of 194,257 authors (top 91%)#38,004 in LG (top 95%)

11 Papers

LGJun 23

Operator-Aware Mixed-Precision Tolerance Calibration for Tensor Kernels

Dipankar Sarkar

Most tensor-kernel correctness tests go through a fixed-shape all close-style check with hand-picked absolute and relative tolerances. The thresholds are copied across the corpus and rarely revisited. We mine the element-wise error distribution of every test case from accumulated cloud GPU runs across the 26-entry gpuemu corpus and 2 dtypes (8,076 result rows). We then ask one empirical question: what absolute tolerance would the kernel itself, observed under its correct implementation, justify? The answer is much tighter than the current hand-picked atol. The largest tightening is attention_triton fp16 at $2{,}184\times$. Restricted to the seven LLM-style buggy variants for which the corpus ships a paired correct counterpart, calibrated per-(op, dtype) tolerances raise bug-detection recall from 73.2% (1,805 of 2,467) to 82.4% (2,034 of 2,467), an absolute gain of 9.3 percentage points (+229 new detections). The control false-positive count rises from 0 to 20 out of 1,882 correct-control cases (+1.1 percentage points).

9.9SEJun 17Code

Before the Pull Request: Mining Multi-Agent Coordination

Dipankar Sarkar

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.

SEJun 23

Test-Input Generation for Tensor Programs: What Actually Finds Kernel Bugs

Dipankar Sarkar

Test-input generation for tensor kernels is folkloric. Most projects pick a representative shape and dtype, run a fixed-shape allclose-style check, and ship. We make the choices explicit and measure them. Using the gpuemu op-schema-aware seeded fuzzer (arXiv:2606.20128), we evaluate seven test-generation strategies across a 26-op corpus (16 correct controls and 10 LLM-style buggy variants seeded with documented transcription patterns) on an RTX 3060 GPU instance. Strategies vary the shape candidate set, the dtype mix, and the input value distribution. We report each strategy on two axes: bug recall and control false-positive (FP) rate. Boundary-only shape sampling is the operationally safe winner: 78% recall on the 10 buggy kernels with 0% FP on the 16 controls. Adversarial value sampling reaches higher recall (99%) but inflates control FP to 94% because the strategy injects NaN and Inf inputs and the validator's NaN check fires on every kernel that propagates them, not only on buggy kernels. On the two softmax tail-mask bugs the "regular" strategy (no boundary shapes) catches 0%, while boundary raises recall to 100% and 62% respectively. That gap is the clearest single signal in the data. The corpus result is about which seeded bug patterns each strategy catches, not about the bug rate of any specific deployed LLM.

8.1SEJun 18

The Correctness Illusion in LLM-Generated GPU Kernels

Dipankar Sarkar

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.

DCJun 23

Static PTX Metrics Track Structural Kernel Regressions but Miss Semantic Ones

Dipankar Sarkar

We pair each GPU kernel's static PTX metrics (registers, spills, instruction count) with CUDA-event-timed runtime on five GPU classes: RTX 3060, A10, L40S, A100 SXM4, and H100 NVL. In this corpus and toolchain the static and measured signals separate cleanly along one axis. Per-pair Delta-regs and Delta-instrs are identical across all five GPUs for any given (correct, buggy) pair. Measured Delta-perf% is not. Structural bugs that change the kernel's work are unambiguous in the static signal. The gelu_triton_buggy variant, which drops a leading 0.5 factor, removes 8 instructions and 8 registers. The corresponding measured Delta-perf% on RTX 3060 is +3.2%, within the run-to-run noise band at the sub-millisecond scale these corpus kernels occupy. Semantic bugs that swap one constant for another are invisible to the static signal. The softmax_triton_buggy variant, which substitutes other=0.0 for -inf on the masked load, compiles to byte-identical PTX. The paper's bounded claim is that, for this corpus and toolchain, a static-PTX delta gate is a portable pre-filter that separates structural from semantic changes; measured runtime deltas at this scale are hardware- and noise-sensitive and are not a substitute.

2.2IRFeb 7, 2024

Navigating the Knowledge Sea: Planet-scale answer retrieval using LLMs

Dipankar Sarkar

Information retrieval is a rapidly evolving field of information retrieval, which is characterized by a continuous refinement of techniques and technologies, from basic hyperlink-based navigation to sophisticated algorithm-driven search engines. This paper aims to provide a comprehensive overview of the evolution of Information Retrieval Technology, with a particular focus on the role of Large Language Models (LLMs) in bridging the gap between traditional search methods and the emerging paradigm of answer retrieval. The integration of LLMs in the realms of response retrieval and indexing signifies a paradigm shift in how users interact with information systems. This paradigm shift is driven by the integration of large language models (LLMs) like GPT-4, which are capable of understanding and generating human-like text, thus enabling them to provide more direct and contextually relevant answers to user queries. Through this exploration, we seek to illuminate the technological milestones that have shaped this journey and the potential future directions in this rapidly changing field.

4.6LGDec 31, 2023

Viz: A QLoRA-based Copyright Marketplace for Legally Compliant Generative AI

Dipankar Sarkar

This paper aims to introduce and analyze the Viz system in a comprehensive way, a novel system architecture that integrates Quantized Low-Rank Adapters (QLoRA) to fine-tune large language models (LLM) within a legally compliant and resource efficient marketplace. Viz represents a significant contribution to the field of artificial intelligence, particularly in addressing the challenges of computational efficiency, legal compliance, and economic sustainability in the utilization and monetization of LLMs. The paper delineates the scholarly discourse and developments that have informed the creation of Viz, focusing primarily on the advancements in LLM models, copyright issues in AI training (NYT case, 2023), and the evolution of model fine-tuning techniques, particularly low-rank adapters and quantized low-rank adapters, to create a sustainable and economically compliant framework for LLM utilization. The economic model it proposes benefits content creators, AI developers, and end-users, delineating a harmonious integration of technology, economy, and law, offering a comprehensive solution to the complex challenges of today's AI landscape.

1.6LGJun 16, 2021

Curriculum generation using Autoencoder based continuous optimization

Dipankar Sarkar, Mukur Gupta

Research in Curriculum Learning has shown better performance on the task by optimizing the sequence of the training data. Recent works have focused on using complex reinforcement learning techniques to find the optimal data ordering strategy to maximize learning for a given network. In this paper, we present a simple yet efficient technique based on continuous optimization trained with auto-encoding procedure. We call this new approach Training Sequence Optimization (TSO). With a usual encoder-decoder setup we try to learn the latent space continuous representation of the training strategy and a predictor network is used on the continuous representation to predict the accuracy of the strategy on the fixed network architecture. The performance predictor and encoder enable us to perform gradient-based optimization by gradually moving towards the latent space representation of training data ordering with potentially better accuracy. We show an empirical gain of 2AP with our generated optimal curriculum strategy over the random strategy using the CIFAR-100 and CIFAR-10 datasets and have better boosts than the existing state-of-the-art CL algorithms.

1.4CVFeb 19, 2021

One Shot Audio to Animated Video Generation

Neeraj Kumar, Srishti Goel, Ankur Narang et al.

We consider the challenging problem of audio to animated video generation. We propose a novel method OneShotAu2AV to generate an animated video of arbitrary length using an audio clip and a single unseen image of a person as an input. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates the talking-head video in the human domain given an audio and a person's image. In the second stage, the talking-head video from the human domain is converted to the animated domain. The model architecture of the first stage consists of spatially adaptive normalization based multi-level generator and multiple multilevel discriminators along with multiple adversarial and non-adversarial losses. The second stage leverages attention based normalization driven GAN architecture along with temporal predictor based recycle loss and blink loss coupled with lipsync loss, for unsupervised generation of animated video. In our approach, the input audio clip is not restricted to any specific language, which gives the method multilingual applicability. OneShotAu2AV can generate animated videos that have: (a) lip movements that are in sync with the audio, (b) natural facial expressions such as blinks and eyebrow movements, (c) head movements. Experimental evaluation demonstrates superior performance of OneShotAu2AV as compared to U-GAT-IT and RecycleGan on multiple quantitative metrics including KID(Kernel Inception Distance), Word error rate, blinks/sec

2.3LGNov 14, 2020

CatFedAvg: Optimising Communication-efficiency and Classification Accuracy in Federated Learning

Dipankar Sarkar, Sumit Rai, Ankur Narang

Federated learning has allowed the training of statistical models over remote devices without the transfer of raw client data. In practice, training in heterogeneous and large networks introduce novel challenges in various aspects like network load, quality of client data, security and privacy. Recent works in FL have worked on improving communication efficiency and addressing uneven client data distribution independently, but none have provided a unified solution for both challenges. We introduce a new family of Federated Learning algorithms called CatFedAvg which not only improves the communication efficiency but improves the quality of learning using a category coverage maximization strategy. We use the FedAvg framework and introduce a simple and efficient step every epoch to collect meta-data about the client's training data structure which the central server uses to request a subset of weight updates. We explore two distinct variations which allow us to further explore the tradeoffs between communication efficiency and model accuracy. Our experiments based on a vision classification task have shown that an increase of 10% absolute points in accuracy using the MNIST dataset with 70% absolute points lower network transfer over FedAvg. We also run similar experiments with Fashion MNIST, KMNIST-10, KMNIST-49 and EMNIST-47. Further, under extreme data imbalance experiments for both globally and individual clients, we see the model performing better than FedAvg. The ablation study further explores its behaviour under varying data and client parameter conditions showcasing the robustness of the proposed approach.

12.8LGNov 12, 2020

Fed-Focal Loss for imbalanced data classification in Federated Learning

Dipankar Sarkar, Ankur Narang, Sumit Rai

The Federated Learning setting has a central server coordinating the training of a model on a network of devices. One of the challenges is variable training performance when the dataset has a class imbalance. In this paper, we address this by introducing a new loss function called Fed-Focal Loss. We propose to address the class imbalance by reshaping cross-entropy loss such that it down-weights the loss assigned to well-classified examples along the lines of focal loss. Additionally, by leveraging a tunable sampling framework, we take into account selective client model contributions on the central server to further focus the detector during training and hence improve its robustness. Using a detailed experimental analysis with the VIRTUAL (Variational Federated Multi-Task Learning) approach, we demonstrate consistently superior performance in both the balanced and unbalanced scenarios for MNIST, FEMNIST, VSN and HAR benchmarks. We obtain a more than 9% (absolute percentage) improvement in the unbalanced MNIST benchmark. We further show that our technique can be adopted across multiple Federated Learning algorithms to get improvements.