Basab Jha

CL
h-index8
5papers
3citations
Novelty53%
AI Score46

5 Papers

AIApr 20Code
SAGE-32B: Agentic Reasoning via Iterative Distillation

Basab Jha, Firoj Paudel, Ujjwal Puri et al.

We demonstrate SAGE-32B, a 32 billion parameter language model that focuses on agentic reasoning and long range planning tasks. Unlike chat models that aim for general conversation fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine tuned using Iterative Distillation, a two stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at https://huggingface.co/sagea-ai/sage-reasoning-32b

CLMar 24
SAGE Celer 2.6 Technical Card

SAGEA Research Team, Basab Jha, Firoj Paudel et al.

We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

CLDec 9, 2024
The Rosetta Paradox: Domain-Specific Performance Inversions in Large Language Models

Basab Jha, Ujjwal Puri

While large language models, such as GPT and BERT, have already demonstrated unprecedented skills in everything from natural language processing to domain-specific applications, there came an unexplored phenomenon we term the Rosetta Paradox. The Rosetta Paradox characterizes the counterintuitive performance inversions across domains of knowledge. This paradox captures how such LLMs can excel in highly specialized fields but do poorly on tasks which require general, everyday knowledge. This paper formalizes the definition of the Rosetta Paradox and introduces a panoramic analysis framework that includes both a Domain Specificity Index (DSI) and a Performance Inversion Metric (PIM) for consistent quantification of domain-specific behavior in LLMs. We adopt this paradox and conduct a series of investigations through extensive experiments across diverse models and knowledge domains, ranging from rich technical areas to common-sense reasoning. Our findings indicate that the Rosetta Paradox is likely not a mere artifact of data distribution but an intrinsic architectural and emergent property of deep neural networks. We present comparative analyses across different model architectures, sizes, and training methodologies that shed light into the peculiar ways this paradox manifests itself and challenge the standard evaluation metrics.

AIJun 30, 2025
Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models

Basab Jha, Firoj Paudel, Ujjwal Puri et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought (CoT) prompting, but their decision-making processes remain somewhat blackbox. We introduce textbfinverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc. Our approach, used in SAGE-nano, a 4-billion-parameter reasoning model, employs a metacognitive structure that reflects back via attention processes to identify major decision points and generate explanations of reasoning choices. While typical CoT approaches are directed towards forward reasoning generation, inverse reasoning provides insight into why specific reasoning chains were selected over others. Through thorough testing of logical reasoning puzzles, math problems and ethical dilemmas from AQUA-RAT, CommonsenseQA, and customized benchmarks, we demonstrate that SAGE-nano is at the cutting edge both on reasoning accuracy (74.6% on AQUA-RAT) and explanation quality (92.1% human preference score) for its task, and offers performance almost on par with models like Claude-3.5 Sonnet or GPT-4o. Our contributions are: (i) the first rigorous framework for LLM self-reflection via inverse reasoning, (ii) a novel metalearning framework to reverse the attention flow, (iii) comprehensive evaluation frameworks for reasoning transparency, and (iv) evidence that increasing reasoning using inverse reasoning improves interpretability along with reasoning performance. Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.

CLMar 16, 2025
Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

Basab Jha, Firoj Paudel

The application of on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that strikes a fine balance between computational effectiveness, memory, power usage, and linguistic capacity across heterogeneous tasks. This holistic study conducts a thorough investigation of the trade-offs between domain-specific optimization and cross-domain robustness, culminating in the proposal of the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization in a harmonious manner. With a rigorous experimental approach testing 47 well-chosen benchmarks in eight domains--healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks--we show that conventional optimization techniques decrease target task perplexity by 18-25% but result in a precipitous decline in general-task performance with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation to a variable number of computing resources with a cross-domain F1 accuracy of 0.89 on less than 100ms latency across Raspberry Pi 4, Pixel 6, iPhone 13, and bespoke custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM enhances the general-task level by 7% with respect and parity in domain-specific performance. We propose three new measurement tools--Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)--which show strong correlation between model compression intensity and brittleness.