Varun Singh

CL
h-index: 43
8 papers
123 citations
Novelty: 38%
AI Score: 44

8 Papers

CL · Apr 4, 2025 · Code
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Aaron Blakeman, Aarti Basant, Abhinav Khattar et al. · nvidia

As inference-time scaling becomes critical for enhanced reasoning capabilities, it is becoming increasingly important to build models that are efficient at inference. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly sized state-of-the-art open-source Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression technique based on pruning and distillation, called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster at inference. In addition, we introduce an FP8-based training recipe and show that it can achieve results on par with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.

LG · Feb 19 · Code
Arcee Trinity Large Technical Report

Varun Singh, Lucas Krauss, Sami Jaghouar et al.

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano, which has 6B total parameters with 1B activated per token, and Trinity Mini, which has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy called Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.

CL · May 18, 2025 · Code
KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

Nikita Tatarinov, Vidhyakshaya Kannan, Haricharana Srinivasa et al. · gatech

The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions -- multi-hop retrieval, set operations, and answer plurality -- enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models struggle with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and an inability to handle implicit relations.

SE · Mar 28
A Multi-agent AI System for Deep Learning Model Migration from TensorFlow to JAX

Stoyan Nikolov, Bernhard Konrad, Moritz Gronbach et al.

The rapid development of AI-based products and their underlying models has led to constant innovation in deep learning frameworks. Google has been pioneering machine learning usage across dozens of products. Maintaining the source code of many models across different ML frameworks and versions is a significant challenge. So far, the maintenance and migration work has been done largely manually by human experts. We describe an AI-based multi-agent system that we built to support automatic migration of TensorFlow-based deep learning models into JAX-based ones. We make three main contributions. First, we show how an AI planner that combines static analysis with AI instructions can create migration plans for very complex code components, and that these plans are reliably followed by the combination of an orchestrator and coders using AI-generated example-based playbooks. Second, we define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements. Third, we demonstrate how the system accelerates code migrations in a large hyperscaler environment on commercial real-world use cases. Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations and creates a virtuous circle in which AI effectively supports its own development workflow. We expect that the techniques and approaches described here can be generalized to other framework migrations and general code transformation tasks.

GR · Jan 9, 2025
A Scalable System for Visual Analysis of Ocean Data

Toshit Jain, Upkar Singh, Varun Singh et al.

Oceanographers rely on visual analysis to interpret model simulations, identify events and phenomena, and track dynamic ocean processes. The ever-increasing resolution and complexity of ocean data, due to its dynamic nature and multivariate relationships, demand a scalable and adaptable visualization tool for interactive exploration. We introduce pyParaOcean, a scalable and interactive visualization system designed specifically for ocean data analysis. pyParaOcean offers specialized modules for common oceanographic analysis tasks, including eddy identification and salinity movement tracking. These modules integrate with ParaView as filters, providing an easy-to-use system while leveraging the parallelization capabilities of ParaView and its many built-in general-purpose visualization functionalities. The creation of an auxiliary dataset stored as a Cinema database helps address I/O and network bandwidth bottlenecks while supporting the generation of quick overview visualizations. We present a case study on the Bay of Bengal (BoB) to demonstrate the utility of the system and scaling studies to evaluate the efficiency of the system.

IV · Jan 5, 2020
Automated Segmentation of Vertebrae on Lateral Chest Radiography Using Deep Learning

Sanket Badhe, Varun Singh, Joy Li et al.

The purpose of this study is to develop an automated algorithm for thoracic vertebral segmentation on chest radiography using deep learning. A total of 124 de-identified lateral chest radiographs from unique patients were obtained. Segmentations of visible vertebrae were manually performed by a medical student and verified by a board-certified radiologist. 74 images were used for training, 10 for validation, and 40 were held out for testing. A U-Net deep convolutional neural network was employed for segmentation, using the sum of Dice coefficient and binary cross-entropy as the loss function. On the test set, the algorithm demonstrated an average Dice coefficient value of 90.5 and an average intersection-over-union (IoU) of 81.75. Deep learning demonstrates promise in the segmentation of vertebrae on lateral chest radiography.
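The loss and metrics named in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, masks are flattened to plain lists, and the sketch assumes the Dice term enters the loss as 1 - Dice (the abstract only says "sum of Dice coefficient and binary cross-entropy").

```python
import math

def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice between predicted probabilities and binary labels (flat lists)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def combined_loss(pred, target):
    # Assumption: the Dice term is used as (1 - Dice) so that lower is better.
    return (1.0 - dice_coefficient(pred, target)) + binary_cross_entropy(pred, target)

def iou(pred_mask, target_mask):
    """Intersection-over-union for two binary masks (flat lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, target_mask) if p or t)
    return inter / union if union else 1.0
```

For a single pair of binary masks, Dice and IoU are related by IoU = Dice / (2 - Dice); averages over a test set need not satisfy this identity exactly.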

MM · Aug 7, 2014
Characterizing Internet Video for Large-scale Active Measurements

Saba Ahsan, Varun Singh, Jörg Ott

The availability of high-definition video content on the web has brought about a significant change in the characteristics of Internet video, but few studies have characterized video since this change. Video characteristics such as video length, format, target bit rate, and resolution provide valuable input for designing Adaptive Bit Rate (ABR) algorithms, sizing playout buffers in Dynamic Adaptive HTTP streaming (DASH) players, modeling the variability in video frame sizes, etc. This paper presents datasets collected in 2013 and 2014 that contain over 130,000 videos from YouTube's most viewed (or most popular) video charts in 58 countries. We describe the basic characteristics of the videos on YouTube for each category, format, video length, file size, and data rate variation, observing that video length and file size fit a log-normal distribution. We show that three minutes of a video suffice to represent its instant data rate fluctuation and that we can infer data rate characteristics of different video resolutions from a single given one. Based on our findings, we design active measurements for measuring the performance of Internet video.
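As an illustration of what fitting a log-normal distribution to video lengths involves, the following sketch estimates the two parameters from samples by taking the mean and standard deviation of the logarithms (the maximum-likelihood estimates). The data here is synthetic and the parameter values 5.0 and 0.8 are arbitrary placeholders, not figures from the paper; `fit_lognormal` is a hypothetical helper name.

```python
import math
import random
import statistics

def fit_lognormal(samples):
    """Return (mu, sigma) of the log-normal that best fits positive samples."""
    logs = [math.log(x) for x in samples]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)  # population std dev = MLE for sigma
    return mu, sigma

def lognormal_median(mu):
    """The median of a log-normal distribution is exp(mu)."""
    return math.exp(mu)

# Synthetic "video lengths" in seconds, drawn with placeholder parameters.
rng = random.Random(0)
synthetic_lengths = [rng.lognormvariate(5.0, 0.8) for _ in range(20000)]

mu_hat, sigma_hat = fit_lognormal(synthetic_lengths)
print(f"mu={mu_hat:.3f} sigma={sigma_hat:.3f} median={lognormal_median(mu_hat):.1f}s")
```

With real measurements, the fit would normally be checked against the empirical distribution (for example with a Q-Q plot of the log values) before relying on it.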

NI · Oct 6, 2013
Congestion Control using FEC for Conversational Multimedia Communication

Marcin Nagy, Varun Singh, Joerg Ott et al.

In this paper, we propose a new rate control algorithm for conversational multimedia flows. In our approach, along with Real-time Transport Protocol (RTP) media packets, we propose sending redundant packets to probe for available bandwidth. These redundant packets are Forward Error Correction (FEC) encoded RTP packets. A straightforward interpretation is that if no losses occur, the sender can increase the sending rate to include the FEC bit rate, and in the case of losses due to congestion, the redundant packets help in recovering the lost packets. We also show that by varying the FEC bit rate, the sender is able to conservatively or aggressively probe for available bandwidth. We evaluate our FEC-based Rate Adaptation (FBRA) algorithm in a network simulator and in the real world and compare it to other congestion control algorithms.
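The probing idea in this abstract can be sketched roughly as follows. This is a simplified illustration, not the FBRA algorithm itself: the class and function names are hypothetical, and the reaction to loss (a multiplicative backoff of 0.85) is an assumption, since the abstract only states that the redundant packets help recover lost data.

```python
from dataclasses import dataclass

@dataclass
class SenderState:
    media_rate_kbps: float  # rate of RTP media packets
    fec_rate_kbps: float    # rate of redundant FEC packets used as a probe

def total_rate(state):
    """Bit rate actually placed on the wire: media plus FEC probe."""
    return state.media_rate_kbps + state.fec_rate_kbps

def on_feedback(state, loss_detected, backoff=0.85):
    """Update the sender after one feedback interval."""
    if not loss_detected:
        # The path carried media + FEC without loss, so the media rate
        # can grow to include the FEC bit rate.
        return SenderState(state.media_rate_kbps + state.fec_rate_kbps,
                           state.fec_rate_kbps)
    # Assumption: on congestion loss, reduce the media rate multiplicatively;
    # the FEC packets already sent help recover what was lost.
    return SenderState(state.media_rate_kbps * backoff, state.fec_rate_kbps)

def set_probe_aggressiveness(state, fec_fraction):
    """A larger FEC fraction probes more aggressively, a smaller one more conservatively."""
    return SenderState(state.media_rate_kbps,
                       state.media_rate_kbps * fec_fraction)
```

For example, a sender at 1000 kbps media with a 100 kbps FEC probe would move to 1100 kbps media after a loss-free interval under this sketch.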