MA · Mar 17
MetaCrit: A Critical Thinking Framework for Self-Regulated LLM Reasoning
Xinmeng Hou, Ziting Chang, Zhouquan Lu et al.
Large language models (LLMs) fail on over one-third of multi-hop questions with counterfactual premises and remain vulnerable to adversarial prompts that trigger biased or factually incorrect responses, exposing a fundamental deficit in self-regulated reasoning. We propose MetaCrit, a multi-agent framework grounded in Nelson and Narens' metacognitive regulation theory. MetaCrit decomposes reasoning regulation into four agents: an object-level generation agent, a monitoring agent that assesses response validity, a control agent that critiques logical soundness, and a meta-level synthesizer that integrates all signals into a final response. Evaluation across eight benchmarks, four model backbones, and a college-level analytical writing study shows that MetaCrit significantly improves content truthfulness and logical soundness while eliminating toxic outputs. Its modular design allows individual agents to be integrated into existing frameworks as drop-in components without architectural modifications.
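The four-agent decomposition described in this abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `LLM` callable type, the `MetaCritTrace` container, the function name `run_metacrit_sketch`, the prompt wording, and the choice to run monitoring and control independently on the same draft are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class MetaCritTrace:
    """Hypothetical container for the four intermediate outputs."""
    draft: str       # object-level generation
    monitoring: str  # validity assessment of the draft
    control: str     # critique of the draft's logical soundness
    final: str       # meta-level synthesis


def run_metacrit_sketch(question: str, llm: LLM) -> MetaCritTrace:
    # 1. Object-level agent: produce an initial answer.
    draft = llm(f"Answer the question.\nQuestion: {question}")

    # 2. Monitoring agent: assess whether the draft is valid.
    #    (Prompt wording is an assumption, not taken from the paper.)
    monitoring = llm(
        "Assess the validity of the response.\n"
        f"Question: {question}\nResponse: {draft}"
    )

    # 3. Control agent: critique the logical soundness of the draft.
    #    Assumption: it sees the draft but not the monitoring output.
    control = llm(
        "Critique the logical soundness of the response.\n"
        f"Question: {question}\nResponse: {draft}"
    )

    # 4. Meta-level synthesizer: integrate all signals into a final response.
    final = llm(
        "Write a final response using the draft and both assessments.\n"
        f"Question: {question}\nDraft: {draft}\n"
        f"Validity assessment: {monitoring}\nSoundness critique: {control}"
    )
    return MetaCritTrace(draft, monitoring, control, final)
```

Because each stage is just a call through the same `LLM` interface, a single stage (for example, only the monitoring step) could in principle be reused inside another pipeline, which is the kind of modularity the abstract refers to.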
CL · Nov 29, 2024 · Code
Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction
Xinmeng Hou, Lingyue Fu, Chenhao Meng et al.
Aspect-Opinion Pair Extraction (AOPE) and Aspect Sentiment Triplet Extraction (ASTE) have drawn growing attention in NLP. However, most existing approaches extract aspects and opinions independently, optionally adding pairwise relations, which often leads to error propagation and high time complexity. To address these challenges, and inspired by transition-based dependency parsing, we propose the first transition-based model for AOPE and ASTE that performs aspect and opinion extraction jointly, which also better captures position-aware aspect-opinion relations and mitigates entity-level bias. By integrating contrastive-augmented optimization, our model delivers more accurate action predictions and jointly optimizes separate subtasks in linear time. Extensive experiments on 4 commonly used ASTE/AOPE datasets show that, although our model performs worse than some previous models when trained on a single dataset, it achieves the best performance on both ASTE and AOPE when trained on combined datasets, outperforming the strongest previous models in F1 (often by a large margin). We hypothesize that this is due to our model's ability to learn transition actions from multiple datasets and domains. Our code is available at https://anonymous.4open.science/r/trans_aste-8FCF.
CL · Oct 17, 2024
Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language
Xinmeng Hou
This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language use. We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and large language model (LLM) annotations than the original datasets, which were based on descriptive instructions. Our experiments show that LLMs can serve as effective alternatives when professional annotators are unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity. Content Warning: This article only analyzes offensive language for academic purposes. Discretion is advised.
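The abstract does not say which agreement statistic the study uses. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected measure of agreement between two annotators (here, a human and an LLM). The function name `cohen_kappa` and the example labels are assumptions and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Expected agreement under independence, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels (1 = offensive, 0 = not offensive), for illustration only.
human_labels = [1, 1, 0, 0]
llm_labels = [1, 0, 0, 0]
kappa = cohen_kappa(human_labels, llm_labels)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so a higher value under one set of guidelines than another would indicate more consistent labeling.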