CLAIAug 3, 2025

Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

arXiv:2508.01918v21 citationsHas Code
Originality Incremental advance
AI Analysis

This work addresses digital access for Punjabi speakers by advancing language generation and retrieval, though it is incremental in applying existing methods to a new language.

The authors tackled the exclusion of low-resource languages like Punjabi from NLP by developing PunGPT2, the first open-source Punjabi generative model suite trained on a 35GB corpus, and introduced Quantum-RAG, a retrieval-augmented framework that improves recall and BLEU scores over baselines.

Despite rapid advances in large language models (LLMs), low-resource languages remain excluded from NLP, limiting digital access for millions. We present PunGPT2, the first fully open-source Punjabi generative model suite, trained on a 35GB corpus covering literature, religious texts, news, social discourse, etc. PunGPT2 captures Punjabi's syntactic and morphological richness through a tokenizer optimized for Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmented framework integrating PunGPT2 with a FAISS retriever over a curated Punjabi knowledge base, and Pun-Instruct, an instruction-tuned variant using QLoRA for robust zero-shot summarization, translation, and question answering. Our key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings for efficient, context-aware retrieval with low memory overhead, marking the first practical quantum-inspired retrieval in a low-resource LLM. Our models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and a new PunjabiEval suite. Quantum-RAG yields +7.4 Recall@10 over FAISS and +3.5 BLEU over mT5 on PunjabiEval. We publicly release all training scripts, hyperparameters, evaluation pipelines, the 35GB Punjabi corpus, the PunjabiEval benchmark, and all model weights, establishing new state-of-the-art results for Punjabi language generation and retrieval.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes