Anshul Kumar

AI
h-index: 3
5 papers
4 citations
Novelty: 39%
AI Score: 43

5 Papers

CL · Jan 5 · Code
Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece

Anshul Kumar

Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita in three languages (Sanskrit, English, and Hindi), along with a transliteration of Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest-generation tokenizers from Gemini and GPT. We use three metrics: token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM baseline. English/Hindi translations of Sanskrit commentary result in an approximately 20x increase in token count. GPT o200k_base (latest, used by GPT-4o) and Gemini (latest) substantially reduce this bias compared to GPT cl100k_base (used until GPT-4), but still fail to fully capture Sanskrit's compactness. This matters because non-English users may face a penalty in the form of inflated token counts. This research provides a foundation for improving future tokenizer design and shows the potential of Sanskrit for highly compact encoding, saving on cost while speeding up training and inference. The code and dataset are available at https://github.com/anshulkr713/sanskrit-token-efficiency
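
The three metrics named in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the code from the linked repository: `token_metrics` and `utf8_byte_tokenize` are hypothetical names, and the byte-level tokenizer is a stand-in assumption used so the example runs without any tokenizer library. Any callable that maps a string to a list of tokens could be passed instead.

```python
def utf8_byte_tokenize(text):
    """Stand-in tokenizer (an assumption for this sketch): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def token_metrics(text, tokenize):
    """Return token count, characters per token, and tokens per character.

    Characters are counted as Unicode code points (len of a Python str),
    which is not the same as user-perceived graphemes in Devanagari.
    """
    tokens = tokenize(text)
    if not text or not tokens:
        raise ValueError("text and its tokenization must be non-empty")
    n_tokens = len(tokens)
    n_chars = len(text)
    return {
        "token_count": n_tokens,
        "chars_per_token": n_chars / n_tokens,   # token efficiency
        "tokens_per_char": n_tokens / n_chars,   # token cost
    }


if __name__ == "__main__":
    latin = "dharma"
    devanagari = "\u0927\u0930\u094d\u092e"  # the same word in Devanagari script
    print(token_metrics(latin, utf8_byte_tokenize))
    print(token_metrics(devanagari, utf8_byte_tokenize))
```

Under this byte-level stand-in, each Devanagari code point costs three tokens while each ASCII letter costs one, which illustrates the kind of script-dependent penalty the abstract describes. Learned tokenizers such as SentencePiece or the GPT encodings would give different numbers.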

LG · Dec 12, 2025
AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

Anshul Kumar, Gagan Raj Gupta, Manisha Chawla

Large Language Models (LLMs) can perform many NLP tasks well, but fully fine-tuning them is expensive and requires a lot of memory. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict the training to a limited subspace, which can sometimes reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that selects which transformer blocks to update based on gradients. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect adaptively chooses which blocks to train. It uses a combination of Dirichlet-based sampling, which depends on how frequently blocks were updated in the past, and an epsilon-greedy exploration strategy. This lets the method explore different blocks in early training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
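
A rough sketch of how gradient-guided selection, Dirichlet sampling over past update counts, and epsilon-greedy exploration might fit together is shown below. The abstract does not say how the three pieces are combined, so the control flow here is an assumption: with probability `epsilon` the step picks the top-k blocks by current gradient norm, and otherwise it samples blocks from a Dirichlet draw parameterized by how often each block was updated before. The names `select_blocks`, `sample_dirichlet`, and the `+ 1` smoothing are all hypothetical and not taken from the paper.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_sample_without_replacement(weights, k, rng):
    """Pick k distinct indices, each draw proportional to the remaining weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def select_blocks(grad_norms, update_counts, k, epsilon, rng):
    """Choose k transformer blocks to train in this step (assumed scheme)."""
    n = len(grad_norms)
    if rng.random() < epsilon:
        # Gradient-guided step: take the k blocks with the largest gradient norms.
        ranked = sorted(range(n), key=lambda i: grad_norms[i], reverse=True)
        chosen = ranked[:k]
    else:
        # Dirichlet step: blocks updated more often in the past are favored.
        probs = sample_dirichlet([c + 1 for c in update_counts], rng)
        chosen = weighted_sample_without_replacement(probs, k, rng)
    for i in chosen:
        update_counts[i] += 1  # record the update so later draws reflect it
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [0, 0, 0, 0]
    norms = [0.1, 0.9, 0.5, 0.7]  # toy gradient norms, one per block
    for step in range(5):
        eps = 1.0 / (step + 1)  # assumed decay schedule
        print(step, select_blocks(norms, counts, k=2, epsilon=eps, rng=rng))
```

Decaying `epsilon` over time matches the abstract's description of broad exploration early and a narrower focus later, but the specific schedule used here is only a placeholder.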

AI · Nov 17, 2025
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

Gagan Raj Gupta, Anshul Kumar, Manish Rai et al.

Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces a range of text-based and image-based tasks that address practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas of current state-of-the-art multimodal LLMs, thus guiding further development and research.

SI · Oct 29, 2025
Flex-GAD: Flexible Graph Anomaly Detection

Apu Chakraborty, Anshul Kumar, Gagan Raj Gupta

Detecting anomalous nodes in attributed networks, where each node is associated with both structural connections and descriptive attributes, is essential for identifying fraud, misinformation, and suspicious behavior in domains such as social networks, academic citation graphs, and e-commerce platforms. We propose Flex-GAD, a novel unsupervised framework for graph anomaly detection at the node level. Flex-GAD integrates two encoders to capture complementary aspects of graph data. The framework incorporates a novel community-based GCN encoder to model intra-community and inter-community information into node embeddings, thereby ensuring structural consistency, along with a standard attribute encoder. These diverse representations are fused using a self-attention-based representation fusion module, which enables adaptive weighting and effective integration of the encoded information. This fusion mechanism automatically emphasizes the most relevant node representation across the different encoders. We evaluate Flex-GAD on seven real-world attributed graphs with varying sizes, node degrees, and attribute homogeneity. Flex-GAD achieves an average AUC improvement of 7.98% over the previously best-performing method, GAD-NR, demonstrating its effectiveness and flexibility across diverse graph structures. Moreover, it significantly reduces training time, running 102x faster per epoch than AnomalyDAE and 3x faster per epoch than GAD-NR on average across seven benchmark datasets.
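
The idea of adaptively weighting the outputs of two encoders can be illustrated with a much simpler mechanism than the one in the paper. The sketch below is an assumption-laden simplification: it scores each encoder's embedding for a single node against a hypothetical learned `query` vector, turns the scores into weights with a softmax, and returns the weighted sum. It is attention-style pooling, not the full self-attention module the abstract refers to, and the names `fuse` and `query` are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(views, query):
    """Fuse several embeddings of one node into a single vector.

    views: list of equal-length embedding vectors, one per encoder
           (for example, a structural view and an attribute view).
    query: hypothetical learned vector used to score each view.
    Returns (fused_embedding, per_view_weights).
    """
    scores = [sum(q * v for q, v in zip(query, view)) for view in views]
    weights = softmax(scores)
    dim = len(views[0])
    fused = [sum(w * view[d] for w, view in zip(weights, views)) for d in range(dim)]
    return fused, weights


if __name__ == "__main__":
    structural = [1.0, 0.0]   # toy embedding from a structure encoder
    attribute = [0.0, 1.0]    # toy embedding from an attribute encoder
    fused, weights = fuse([structural, attribute], query=[2.0, 0.5])
    print(weights, fused)
```

With a zero `query` both views receive equal weight; as the query aligns more strongly with one view, that view dominates the fused embedding, which is the "automatic emphasis" behavior described above in its simplest form.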

CY · Aug 5, 2021
The application of adaptive minimum match k-nearest neighbors to identify at-risk students in health professions education

Anshul Kumar, Taylor DiJohnson, Roger Edwards et al.

Purpose: When a learner fails to reach a milestone, educators often wonder whether there were any warning signs that could have allowed them to intervene sooner. Machine learning can predict which students are at risk of failing a high-stakes certification exam. If predictions can be made well in advance of the exam, then educators can meaningfully intervene before students take the exam to reduce the chances of a failing score. Methods: Using already-collected, first-year student assessment data from five cohorts in a Master of Physician Assistant Studies program, the authors implement an "adaptive minimum match" version of the k-nearest neighbors algorithm (AMMKNN), using changing numbers of neighbors to predict each student's future exam scores on the Physician Assistant National Certifying Examination (PANCE). Validation occurred in two ways: leave-one-out cross-validation (LOOCV) and evaluating the predictions in a new cohort. Results: AMMKNN achieved an accuracy of 93% in LOOCV. AMMKNN generates a predicted PANCE score for each student, one year before they are scheduled to take the exam. Students can then be classified into extra support, optional extra support, or no extra support groups. The educator then has one year to provide the appropriate customized support to each category of student. Conclusions: Predictive analytics can identify at-risk students, so they can receive additional support or remediation when preparing for high-stakes certification exams. Educators can use the included methods and code to generate predicted test outcomes for students. The authors recommend that educators use this or similar predictive methods responsibly and transparently, as one of many tools used to support students.
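
The abstract does not spell out the "adaptive minimum match" rule, so the following is only one plausible reading, offered as a sketch: use every training student within a distance `radius` of the query, and if fewer than `min_matches` fall inside that radius, fall back to the `min_matches` nearest students. The function names, the Euclidean distance, and the toy numbers are all assumptions and are not the authors' published code or data.

```python
import math


def ammknn_predict(train, query, radius, min_matches):
    """Predict a score as the mean over an adaptively sized neighbor set.

    train: list of (feature_tuple, score) pairs.
    Neighbors are all training points within `radius` of `query`; if there
    are fewer than `min_matches`, the nearest `min_matches` are used instead.
    """
    by_distance = sorted((math.dist(features, query), score) for features, score in train)
    neighbors = [score for dist, score in by_distance if dist <= radius]
    if len(neighbors) < min_matches:
        neighbors = [score for _, score in by_distance[:min_matches]]
    return sum(neighbors) / len(neighbors)


def loocv_predictions(data, radius, min_matches):
    """Leave-one-out cross-validation: predict each point from all the others."""
    predictions = []
    for i, (features, _) in enumerate(data):
        held_out_train = data[:i] + data[i + 1:]
        predictions.append(ammknn_predict(held_out_train, features, radius, min_matches))
    return predictions


if __name__ == "__main__":
    # Made-up toy data: (first-year assessment average,) -> later exam score.
    students = [((1.0,), 300.0), ((1.1,), 310.0), ((5.0,), 450.0), ((5.2,), 470.0)]
    print(loocv_predictions(students, radius=0.5, min_matches=1))
```

Predicted scores like these could then be thresholded into the support categories the abstract mentions; where to place those cutoffs is a policy decision that this sketch does not attempt to model.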