Akhil Deo

h-index15
2papers

2 Papers

93.4CLApr 6
DeonticBench: A Benchmark for Reasoning over Rules

Guangyao Dou, Luis Brena, Akhil Deo et al.

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.

CLOct 6, 2025
SocialNLI: A Dialogue-Centric Social Inference Dataset

Akhil Deo, Kate Sanders, Benjamin Van Durme

Making theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.