CL · AI · LG · Oct 16, 2024

LEGAL-UQA: A Low-Resource Urdu-English Dataset for Legal Question Answering

arXiv:2410.13013v1 · 2 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This paper addresses limited access to legal information in Pakistan by providing a foundational dataset for Urdu-English legal NLP, though the contribution is incremental: it applies existing methods to new data.

The authors address the lack of domain-specific NLP resources for low-resource languages by creating LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. In their experiments, Claude-3.5-Sonnet achieves 99.19% human-evaluated accuracy on it.

We present LEGAL-UQA, the first Urdu legal question-answering dataset derived from Pakistan's constitution. This parallel English-Urdu dataset includes 619 question-answer pairs, each with corresponding legal article contexts, addressing the need for domain-specific NLP resources in low-resource languages. We describe the dataset creation process, including OCR extraction, manual refinement, and GPT-4-assisted translation and generation of QA pairs. Our experiments evaluate the latest generalist language and embedding models on LEGAL-UQA, with Claude-3.5-Sonnet achieving 99.19% human-evaluated accuracy. We fine-tune mt5-large-UQA-1.0, highlighting the challenges of adapting multilingual models to specialized domains. Additionally, we assess retrieval performance, finding OpenAI's text-embedding-3-large outperforms Mistral's mistral-embed. LEGAL-UQA bridges the gap between global NLP advancements and localized applications, particularly in constitutional law, and lays the foundation for improved legal information access in Pakistan.
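The retrieval assessment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which metric or similarity function the authors used, so recall@k over cosine similarity is an assumption here, and the toy 2-D vectors merely stand in for real embeddings such as those from text-embedding-3-large or mistral-embed. The function names and the data are hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(question_vecs, context_vecs, gold_indices, k):
    """Fraction of questions whose gold context is among the k most similar contexts."""
    hits = 0
    for q_vec, gold in zip(question_vecs, gold_indices):
        # Rank every context by similarity to this question, best first.
        ranked = sorted(
            range(len(context_vecs)),
            key=lambda i: cosine(q_vec, context_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(question_vecs)


# Toy stand-ins for embeddings (assumption: real ones would come from an embedding model).
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per legal article context
questions = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]  # one vector per question
gold = [0, 1, 2]                                  # index of the correct context per question

print(recall_at_k(questions, contexts, gold, k=1))
print(recall_at_k(questions, contexts, gold, k=2))
```

Comparing two embedding models under this setup would amount to embedding the same questions and contexts with each model and comparing the resulting recall@k values.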
