Neha Gupta

LG
h-index42
23papers
734citations
Novelty50%
AI Score56

23 Papers

AIMar 26
Voxtral TTS

Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg et al. · deepmind, tsinghua

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4\% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

MLJun 21, 2022
Ensembling over Classifiers: a Bias-Variance Perspective

Neha Gupta, Jamie Smith, Ben Adlam et al. · deepmind

Ensembles are a straightforward, remarkably effective method for improving the accuracy,calibration, and robustness of models on classification tasks; yet, the reasons that underlie their success remain an active area of research. We build upon the extension to the bias-variance decomposition by Pfau (2013) in order to gain crucial insights into the behavior of ensembles of classifiers. Introducing a dual reparameterization of the bias-variance tradeoff, we first derive generalized laws of total expectation and variance for nonsymmetric losses typical of classification tasks. Comparing conditional and bootstrap bias/variance estimates, we then show that conditional estimates necessarily incur an irreducible error. Next, we show that ensembling in dual space reduces the variance and leaves the bias unchanged, whereas standard ensembling can arbitrarily affect the bias. Empirically, standard ensembling reducesthe bias, leading us to hypothesize that ensembles of classifiers may perform well in part because of this unexpected reduction.We conclude by an empirical analysis of recent deep learning methods that ensemble over hyperparameters, revealing that these techniques indeed favor bias reduction. This suggests that, contrary to classical wisdom, targeting bias reduction may be a promising direction for classifier ensembles.

LGJul 6, 2023
When Does Confidence-Based Cascade Deferral Suffice?

Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon et al.

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.

CLJan 13
Ministral 3

Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian et al.

We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute and memory constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction finetuned, and a reasoning model for complex problem-solving. In addition, we present our recipe to derive the Ministral 3 models through Cascade Distillation, an iterative pruning and continued training with distillation technique. Each model comes with image understanding capabilities, all under the Apache 2.0 license.

CLJun 12, 2025Code
Magistral

Mistral-AI, Abhinav Rastogi, Albert Q. Jiang et al.

We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.

SEAug 8, 2025Code
Devstral: Fine-tuning Language Models for Coding Agent Applications

Abhinav Rastogi, Adam Yang, Albert Q. Jiang et al. · deepmind

We introduce Devstral-Small, a lightweight open source model for code agents with the best performance among models below 100B size. In this technical report, we give an overview of how we design and develop a model and craft specializations in agentic software development. The resulting model, Devstral-Small is a small 24B model, fast and easy to serve. Despite its size, Devstral-Small still attains competitive performance compared to models more than an order of magnitude larger.

CLApr 15, 2024
Language Model Cascades: Token-level uncertainty and beyond

Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum et al.

Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs: here, a small model is invoked for most "easy" instances, while a few "hard" instances are deferred to the large model. While the principles underpinning cascading are well-studied for classification tasks - with deferral based on predicted class uncertainty favored theoretically and practically - a similar understanding is lacking for generative LM tasks. In this work, we initiate a systematic study of deferral rules for LM cascades. We begin by examining the natural extension of predicted class uncertainty to generative LM tasks, namely, the predicted sequence uncertainty. We show that this measure suffers from the length bias problem, either over- or under-emphasizing outputs based on their lengths. This is because LMs produce a sequence of uncertainty values, one for each output token; and moreover, the number of output tokens is variable across examples. To mitigate this issue, we propose to exploit the richer token-level uncertainty information implicit in generative LMs. We argue that naive predicted sequence uncertainty corresponds to a simple aggregation of these uncertainties. By contrast, we show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform such simple aggregation strategies, via experiments on a range of natural language benchmarks with FLAN-T5 models. We further show that incorporating embeddings from the smaller model and intermediate layers of the larger model can give an additional boost in the overall cost-quality tradeoff.

CLDec 12, 2025
When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta et al.

Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging -- annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the base of the vanilla Qwen3-1.7B model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.

SDJul 17, 2025
Voxtral

Alexander H. Liu, Andy Ehrenberg, Andy Lo et al. · deepmind

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under Apache 2.0 license.

AIMay 15, 2025
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents

Mrinal Rawat, Ambuje Gupta, Rushil Goomer et al.

The ReAct (Reasoning + Action) capability in large language models (LLMs) has become the foundation of modern agentic systems. Recent LLMs, such as DeepSeek-R1 and OpenAI o1/o3, exemplify this by emphasizing reasoning through the generation of ample intermediate tokens, which help build a strong premise before producing the final output tokens. In this paper, we introduce Pre-Act, a novel approach that enhances the agent's performance by creating a multi-step execution plan along with the detailed reasoning for the given user input. This plan incrementally incorporates previous steps and tool outputs, refining itself after each step execution until the final response is obtained. Our approach is applicable to both conversational and non-conversational agents. To measure the performance of task-oriented agents comprehensively, we propose a two-level evaluation framework: (1) turn level and (2) end-to-end. Our turn-level evaluation, averaged across five models, shows that our approach, Pre-Act, outperforms ReAct by 70% in Action Recall on the Almita dataset. While this approach is effective for larger models, smaller models crucial for practical applications, where latency and cost are key constraints, often struggle with complex reasoning tasks required for agentic systems. To address this limitation, we fine-tune relatively small models such as Llama 3.1 (8B & 70B) using the proposed Pre-Act approach. Our experiments show that the fine-tuned 70B model outperforms GPT-4, achieving a 69.5% improvement in action accuracy (turn-level) and a 28% improvement in goal completion rate (end-to-end) on the Almita (out-of-domain) dataset.

CLAug 10, 2025
DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Kabir Khan, Priya Sharma, Arjun Mehta et al.

Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.

AIFeb 11
Voxtral Realtime

Alexander H. Liu, Andy Ehrenberg, Andy Lo et al.

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.

LGDec 5, 2025
NeuroMemFPP: A recurrent neural approach for memory-aware parameter estimation in fractional Poisson process

Neha Gupta, Aditya Maheshwari

In this paper, we propose a recurrent neural network (RNN)-based framework for estimating the parameters of the fractional Poisson process (FPP), which models event arrivals with memory and long-range dependence. The Long Short-Term Memory (LSTM) network estimates the key parameters $μ>0$ and $β\in(0,1)$ from sequences of inter-arrival times, effectively modeling their temporal dependencies. Our experiments on synthetic data show that the proposed approach reduces the mean squared error (MSE) by about 55.3\% compared to the traditional method of moments (MOM) and performs reliably across different training conditions. We tested the method on two real-world high-frequency datasets: emergency call records from Montgomery County, PA, and AAPL stock trading data. The results show that the LSTM can effectively track daily patterns and parameter changes, indicating its effectiveness on real-world data with complex time dependencies.

MLFeb 8, 2022
Understanding the bias-variance tradeoff of Bregman divergences

Ben Adlam, Neha Gupta, Zelda Mariet et al.

This paper builds upon the work of Pfau (2013), which generalized the bias variance tradeoff to any Bregman divergence loss function. Pfau (2013) showed that for Bregman divergences, the bias and variances are defined with respect to a central label, defined as the mean of the label variable, and a central prediction, of a more complex form. We show that, similarly to the label, the central prediction can be interpreted as the mean of a random variable, where the mean operates in a dual space defined by the loss function itself. Viewing the bias-variance tradeoff through operations taken in dual space, we subsequently derive several results of interest. In particular, (a) the variance terms satisfy a generalized law of total variance; (b) if a source of randomness cannot be controlled, its contribution to the bias and variance has a closed form; (c) there exist natural ensembling operations in the label and prediction spaces which reduce the variance and do not affect the bias.

LGNov 3, 2020
Estimating decision tree learnability with polylogarithmic sample complexity

Guy Blanc, Neha Gupta, Jane Lange et al.

We show that top-down decision tree learning heuristics are amenable to highly efficient learnability estimation: for monotone target functions, the error of the decision tree hypothesis constructed by these heuristics can be estimated with polylogarithmically many labeled examples, exponentially smaller than the number necessary to run these heuristics, and indeed, exponentially smaller than information-theoretic minimum required to learn a good decision tree. This adds to a small but growing list of fundamental learning algorithms that have been shown to be amenable to learnability estimation. En route to this result, we design and analyze sample-efficient minibatch versions of top-down decision tree learning heuristics and show that they achieve the same provable guarantees as the full-batch versions. We further give "active local" versions of these heuristics: given a test point $x^\star$, we show how the label $T(x^\star)$ of the decision tree hypothesis $T$ can be computed with polylogarithmically many labeled examples, exponentially smaller than the number necessary to learn $T$.

LGOct 16, 2020
Universal guarantees for decision tree induction via a higher-order splitting criterion

Guy Blanc, Neha Gupta, Jane Lange et al.

We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes. Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $ε$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/ε^2)}$ that achieves error $\le O(\mathsf{opt}_s) + ε$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.

CRSep 23, 2020
I-SiamIDS: an improved Siam-IDS for handling class imbalance in network-based intrusion detection systems

Punam Bedi, Neha Gupta, Vinita Jindal

NIDSs identify malicious activities by analyzing network traffic. NIDSs are trained with the samples of benign and intrusive network traffic. Training samples belong to either majority or minority classes depending upon the number of available instances. Majority classes consist of abundant samples for the normal traffic as well as for recurrent intrusions. Whereas, minority classes include fewer samples for unknown events or infrequent intrusions. NIDSs trained on such imbalanced data tend to give biased predictions against minority attack classes, causing undetected or misclassified intrusions. Past research works handled this class imbalance problem using data-level approaches that either increase minority class samples or decrease majority class samples in the training data set. Although these data-level balancing approaches indirectly improve the performance of NIDSs, they do not address the underlying issue in NIDSs i.e. they are unable to identify attacks having limited training data only. This paper proposes an algorithm-level approach called I-SiamIDS, which is a two-layer ensemble for handling class imbalance problem. I-SiamIDS identifies both majority and minority classes at the algorithm-level without using any data-level balancing techniques. The first layer of I-SiamIDS uses an ensemble of b-XGBoost, Siamese-NN and DNN for hierarchical filtration of input samples to identify attacks. These attacks are then sent to the second layer of I-SiamIDS for classification into different attack classes using m-XGBoost. As compared to its counterparts, I-SiamIDS showed significant improvement in terms of Accuracy, Recall, Precision, F1-score and values of AUC for both NSL-KDD and CIDDS-001 datasets. To further strengthen the results, computational cost analysis was also performed to study the acceptability of the proposed I-SiamIDS.

LGAug 31, 2020
Active Local Learning

Arturs Backurs, Avrim Blum, Neha Gupta

In this work we consider active local learning: given a query point $x$, and active access to an unlabeled training set $S$, output the prediction $h(x)$ of a near-optimal $h \in H$ using significantly fewer labels than would be needed to actually learn $h$ fully. In particular, the number of label queries should be independent of the complexity of $H$, and the function $h$ should be well-defined, independent of $x$. This immediately also implies an algorithm for distance estimation: estimating the value $opt(H)$ from many fewer labels than needed to actually learn a near-optimal $h \in H$, by running local learning on a few random query points and computing the average error. For the hypothesis class consisting of functions supported on the interval $[0,1]$ with Lipschitz constant bounded by $L$, we present an algorithm that makes $O(({1 / ε^6}) \log(1/ε))$ label queries from an unlabeled pool of $O(({L / ε^4})\log(1/ε))$ samples. It estimates the distance to the best hypothesis in the class to an additive error of $ε$ for an arbitrary underlying distribution. We further generalize our algorithm to more than one dimensions. We emphasize that the number of labels used is independent of the complexity of the hypothesis class which depends on $L$. Furthermore, we give an algorithm to locally estimate the values of a near-optimal function at a few query points of interest with number of labels independent of $L$. We also consider the related problem of approximating the minimum error that can be achieved by the Nadaraya-Watson estimator under a linear diagonal transformation with eigenvalues coming from a small range. For a $d$-dimensional pointset of size $N$, our algorithm achieves an additive approximation of $ε$, makes $\tilde{O}({d}/{ε^2})$ queries and runs in $\tilde{O}({d^2}/{ε^{d+4}}+{dN}/{ε^2})$ time.

LGApr 19, 2019
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Guy Blanc, Neha Gupta, Gregory Valiant et al.

We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.

NAMay 6, 2019
On Extending the Applicability of two-Step Secant Method for non-differentiable operators

Neha Gupta, J. P. Jaiswal

The semi-local convergence analysis of a well defined and efficient two-step Secant method in Banach spaces is presented in this study. The recurrence relation technique is used under some weak assumptions. The pertinency of the assumed method is extended for nonlinear non-differentiable operators. The convergence theorem is also established to show the existence and uniqueness of the approximate solution. A numerical illustration is quoted to certify the theoretical part which shows that the earlier study will fail if the function is non-differentiable.

DSNov 27, 2018
Exploiting Numerical Sparsity for Efficient Learning : Faster Eigenvector Computation and Regression

Neha Gupta, Aaron Sidford

In this paper, we obtain improved running times for regression and top eigenvector computation for numerically sparse matrices. Given a data matrix $A \in \mathbb{R}^{n \times d}$ where every row $a \in \mathbb{R}^d$ has $\|a\|_2^2 \leq L$ and numerical sparsity at most $s$, i.e. $\|a\|_1^2 / \|a\|_2^2 \leq s$, we provide faster algorithms for these problems in many parameter settings. For top eigenvector computation, we obtain a running time of $\tilde{O}(nd + r(s + \sqrt{r s}) / \mathrm{gap}^2)$ where $\mathrm{gap} > 0$ is the relative gap between the top two eigenvectors of $A^\top A$ and $r$ is the stable rank of $A$. This running time improves upon the previous best unaccelerated running time of $O(nd + r d / \mathrm{gap}^2)$ as it is always the case that $r \leq d$ and $s \leq d$. For regression, we obtain a running time of $\tilde{O}(nd + (nL / μ) \sqrt{s nL / μ})$ where $μ> 0$ is the smallest eigenvalue of $A^\top A$. This running time improves upon the previous best unaccelerated running time of $\tilde{O}(nd + n L d / μ)$. This result expands the regimes where regression can be solved in nearly linear time from when $L/μ= \tilde{O}(1)$ to when $L / μ= \tilde{O}(d^{2/3} / (sn)^{1/3})$. Furthermore, we obtain similar improvements even when row norms and numerical sparsities are non-uniform and we show how to achieve even faster running times by accelerating using approximate proximal point [Frostig et. al. 2015] / catalyst [Lin et. al. 2015]. Our running times depend only on the size of the input and natural numerical measures of the matrix, i.e. eigenvalues and $\ell_p$ norms, making progress on a key open problem regarding optimal running times for efficient large-scale learning.

CRJun 14, 2014
bit.ly/malicious: Deep Dive into Short URL based e-Crime Detection

Neha Gupta, Anupama Aggarwal, Ponnurangam Kumaraguru

Existence of spam URLs over emails and Online Social Media (OSM) has become a massive e-crime. To counter the dissemination of long complex URLs in emails and character limit imposed on various OSM (like Twitter), the concept of URL shortening has gained a lot of traction. URL shorteners take as input a long URL and output a short URL with the same landing page (as in the long URL) in return. With their immense popularity over time, URL shorteners have become a prime target for the attackers giving them an advantage to conceal malicious content. Bitly, a leading service among all shortening services is being exploited heavily to carry out phishing attacks, work-from-home scams, pornographic content propagation, etc. This imposes additional performance pressure on Bitly and other URL shorteners to be able to detect and take a timely action against the illegitimate content. In this study, we analyzed a dataset of 763,160 short URLs marked suspicious by Bitly in the month of October 2013. Our results reveal that Bitly is not using its claimed spam detection services very effectively. We also show how a suspicious Bitly account goes unnoticed despite of a prolonged recurrent illegitimate activity. Bitly displays a warning page on identification of suspicious links, but we observed this approach to be weak in controlling the overall propagation of spam. We also identified some short URL based features and coupled them with two domain specific features to classify a Bitly URL as malicious or benign and achieved an accuracy of 86.41%. The feature set identified can be generalized to other URL shortening services as well. To the best of our knowledge, this is the first large scale study to highlight the issues with the implementation of Bitly's spam detection policies and proposing suitable countermeasures.

SIMay 7, 2014
Exploration of gaps in Bitly's spam detection and relevant counter measures

Neha Gupta, Ponnurangam Kumaraguru

Existence of spam URLs over emails and Online Social Media (OSM) has become a growing phenomenon. To counter the dissemination issues associated with long complex URLs in emails and character limit imposed on various OSM (like Twitter), the concept of URL shortening gained a lot of traction. URL shorteners take as input a long URL and give a short URL with the same landing page in return. With its immense popularity over time, it has become a prime target for the attackers giving them an advantage to conceal malicious content. Bitly, a leading service in this domain is being exploited heavily to carry out phishing attacks, work from home scams, pornographic content propagation, etc. This imposes additional performance pressure on Bitly and other URL shorteners to be able to detect and take a timely action against the illegitimate content. In this study, we analyzed a dataset marked as suspicious by Bitly in the month of October 2013 to highlight some ground issues in their spam detection mechanism. In addition, we identified some short URL based features and coupled them with two domain specific features to classify a Bitly URL as malicious / benign and achieved a maximum accuracy of 86.41%. To the best of our knowledge, this is the first large scale study to highlight the issues with Bitly's spam detection policies and proposing a suitable countermeasure.