CYMar 31

Can Commercial LLMs Be Parliamentary Political Companions? Comparing LLM Reasoning Against Romanian Legislative Expuneri de Motive

arXiv:2603.300286.3

Predicted impact top 96% in CY · last 90 daysOriginality Incremental advance

AI Analysis

This work addresses the problem of using LLMs as political companions for legislators, highlighting risks of contextual ignorance, but it is incremental as it builds on existing evaluation frameworks and principal-agent theory.

This paper evaluated whether commercial large language models (LLMs) can serve as reliable political advisory tools by comparing their generated rationales against official legislative reasoning from Romanian Senate proposals, finding that frontier models achieved high semantic closeness scores (above 4.6 out of 5.0) while open-weight models performed significantly worse, but all models exhibited confabulation in politically idiosyncratic cases.

This paper evaluates whether commercial large language models (LLMs) can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales evaluated through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d larger than 1.4). However, all models exhibit task-dependent confabulation, performing well on standardized legislative templates (e.g., EU directive transpositions) but generating plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.

View on arXiv PDF

Similar