CLJun 29, 2023
Surveying (Dis)Parities and Concerns of Compute Hungry NLP ResearchJi-Ung Lee, Haritz Puerto, Betty van Aken et al. · allen-ai, cmu
Many recent improvements in NLP stem from the development and use of large pre-trained language models (PLMs) with billions of parameters. Large model sizes makes computational cost one of the main limiting factors for training and evaluating such models; and has raised severe concerns about the sustainability, reproducibility, and inclusiveness for researching PLMs. These concerns are often based on personal experiences and observations. However, there had not been any large-scale surveys that investigate them. In this work, we provide a first attempt to quantify these concerns regarding three topics, namely, environmental impact, equity, and impact on peer reviewing. By conducting a survey with 312 participants from the NLP community, we capture existing (dis)parities between different and within groups with respect to seniority, academia, and industry; and their impact on the peer reviewing process. For each topic, we provide an analysis and devise recommendations to mitigate found disparities, some of which already successfully implemented. Finally, we discuss additional concerns raised by many participants in free-text responses.
CLOct 21, 2022
CEFR-Based Sentence Difficulty Annotation and AssessmentYuki Arase, Satoru Uchida, Tomoyuki Kajiwara
Controllable text simplification is a crucial assistive technique for language learning and teaching. One of the primary factors hindering its advancement is the lack of a corpus annotated with sentence difficulty levels based on language ability descriptions. To address this problem, we created the CEFR-based Sentence Profile (CEFR-SP) corpus, containing 17k English sentences annotated with the levels based on the Common European Framework of Reference for Languages assigned by English-education professionals. In addition, we propose a sentence-level assessment model to handle unbalanced level distribution because the most basic and highly proficient sentences are naturally scarce. In the experiments in this study, our method achieved a macro-F1 score of 84.5% in the level assessment, thus outperforming strong baselines employed in readability assessment.
CLJun 7, 2023
Unbalanced Optimal Transport for Unbalanced Word AlignmentYuki Arase, Han Bao, Sho Yokoi
Monolingual word alignment is crucial to model semantic interactions between sentences. In particular, null alignment, a phenomenon in which words have no corresponding counterparts, is pervasive and critical in handling semantically divergent sentences. Identification of null alignment is useful on its own to reason about the semantic similarity of sentences by indicating there exists information inequality. To achieve unbalanced word alignment that values both alignment and null alignment, this study shows that the family of optimal transport (OT), i.e., balanced, partial, and unbalanced OT, are natural and powerful approaches even without tailor-made techniques. Our extensive experiments covering unsupervised and supervised settings indicate that our generic OT-based alignment methods are competitive against the state-of-the-arts specially designed for word alignment, remarkably on challenging datasets with high null alignment frequencies.
CLMar 8, 2024Code
An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human AssessmentXuanxin Wu, Yuki Arase
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
CLJan 7
Reasoning Model Is Superior LLM-Judge, Yet Suffers from BiasesHui Huang, Xuanxin Wu, Muyun Yang et al.
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judge to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong biases in superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
CLSep 28, 2024
Edit-Constrained Decoding for Sentence SimplificationTatsuya Zetsu, Yuki Arase, Tomoyuki Kajiwara
We propose edit operation based lexically constrained decoding for sentence simplification. In sentence simplification, lexical paraphrasing is one of the primary procedures for rewriting complex sentences into simpler correspondences. While previous studies have confirmed the efficacy of lexically constrained decoding on this task, their constraints can be loose and may lead to sub-optimal generation. We address this problem by designing constraints that replicate the edit operations conducted in simplification and defining stricter satisfaction conditions. Our experiments indicate that the proposed method consistently outperforms the previous studies on three English simplification corpora commonly used in this task.
CLJan 7
DiVA: Fine-grained Factuality Verification with Agentic-Discriminative VerifierHui Huang, Muyun Yang, Yuki Arase
Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
CLSep 13, 2024
Distilling Monolingual and Crosslingual Word-in-Context RepresentationsYuki Arase, Tomoyuki Kajiwara
In this study, we propose a method that distils representations of word meaning in context from a pre-trained masked language model in both monolingual and crosslingual settings. Word representations are the basis for context-aware lexical semantics and unsupervised semantic textual similarity (STS) estimation. Different from existing approaches, our method does not require human-annotated corpora nor updates of the parameters of the pre-trained model. The latter feature is appealing for practical scenarios where the off-the-shelf pre-trained model is a common asset among different applications. Specifically, our method learns to combine the outputs of different hidden layers of the pre-trained model using self-attention. Our auto-encoder based training only requires an automatically generated corpus. To evaluate the performance of the proposed approach, we performed extensive experiments using various benchmark tasks. The results on the monolingual tasks confirmed that our representations exhibited a competitive performance compared to that of the previous study for the context-aware lexical semantic tasks and outperformed it for STS estimation. The results of the crosslingual tasks revealed that the proposed method largely improved crosslingual word representations of multilingual pre-trained models.
CLMay 30, 2025
Adaptive LoRA Merge with Parameter Pruning for Low-Resource GenerationRyota Miyano, Yuki Arase
This study proposes a simple yet effective LoRA merge method to achieve LLM adaptation for low-resource language generation tasks. The LoRA merge technique, which integrates multiple LoRA modules trained on different tasks, has gained attention as an effective and efficient approach for adapting LLMs to target tasks. However, previous methods are limited in adaptability as they keep the LoRA parameters frozen. Additionally, the low-resource problem has been out of their scope. We propose a LoRA merge method that updates and prunes LoRA parameters through fine-tuning with minimal target task data, which allows finer-grained adjustments of LoRA parameters and enhancement of task adaptability. Extensive experiments have been conducted taking summarization as a benchmark task. Our datasets cover various domains and multiple languages of English and Japanese. The results confirm that the proposed method achieves significant and consistent improvements in task adaptability over the previous methods.
CLApr 6, 2025
Hallucinated Span Detection with Multi-View Attention FeaturesYuya Ogasa, Yuki Arase
This study addresses the problem of hallucinated span detection in the outputs of large language models. It has received less attention than output-level hallucination detection despite its practical importance. Prior work has shown that attentions often exhibit irregular patterns when hallucinations occur. Motivated by these findings, we extract features from the attention matrix that provide complementary views capturing (a) whether certain tokens are influential or ignored, (b) whether attention is biased toward specific subsets, and (c) whether a token is generated referring to a narrow or broad context, in the generation. These features are input to a Transformer-based classifier to conduct sequential labelling to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucinated span detection with longer input contexts, such as data-to-text and summarisation tasks.
CLFeb 17, 2025
Aligning Sentence Simplification with ESL Learner's Proficiency for Language AcquisitionGuanlin Li, Yuki Arase, Noel Crespi
Text simplification is crucial for improving accessibility and comprehension for English as a Second Language (ESL) learners. This study goes a step further and aims to facilitate ESL learners' language acquisition by simplification. Specifically, we propose simplifying complex sentences to appropriate levels for learners while also increasing vocabulary coverage of the target level in the simplifications. We achieve this without a parallel corpus by conducting reinforcement learning on a large language model. Our method employs token-level and sentence-level rewards, and iteratively trains the model on its self-generated outputs to guide the model to search for simplification hypotheses that satisfy the target attributes. Experiment results on CEFR-SP and TurkCorpus datasets show that the proposed method can effectively increase the frequency and diversity of vocabulary of the target level by more than $20\%$ compared to baseline models, while maintaining high simplification quality.
CLOct 6, 2025
Camellia: Benchmarking Cultural Biases in LLMs for Asian LanguagesTarek Naous, Anagha Savit, Carlos Rafael Catalan et al.
As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear if such biases also manifest in different non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with the specific Asian or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show a struggle by LLMs at cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data. We further observe that different LLM families hold their distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.
CLOct 17, 2020
A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable SentencesAndrew Merritt, Chenhui Chu, Yuki Arase
Multimodal neural machine translation (NMT) has become an increasingly important area of research over the years because additional modalities, such as image data, can provide more context to textual data. Furthermore, the viability of training multimodal NMT models without a large parallel corpus continues to be investigated due to low availability of parallel sentences with images, particularly for English-Japanese data. However, this void can be filled with comparable sentences that contain bilingual terms and parallel phrases, which are naturally created through media such as social network posts and e-commerce product descriptions. In this paper, we propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets. In addition, we supplement our comparable sentences with a smaller parallel corpus for validation and test purposes. To test the performance of this comparable sentence translation scenario, we train several baseline NMT models with our comparable corpus and evaluate their English-Japanese translation performance. Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data. Despite this, we hope for our corpus to be used to further research into multimodal NMT with comparable sentences.
CLOct 11, 2020
Lexically Cohesive Neural Machine Translation with Copy MechanismVipul Mishra, Chenhui Chu, Yuki Arase
Lexically cohesive translations preserve consistency in word choices in document-level translation. We employ a copy mechanism into a context-aware neural machine translation model to allow copying words from previous translation outputs. Different from previous context-aware neural machine translation models that handle all the discourse phenomena implicitly, our model explicitly addresses the lexical cohesion problem by boosting the probabilities to output words consistently. We conduct experiments on Japanese to English translation using an evaluation dataset for discourse translation. The results showed that the proposed model significantly improved lexical cohesion compared to previous context-aware models.
CLSep 3, 2019
Transfer Fine-Tuning: A BERT Case StudyYuki Arase, Junichi Tsujii
A semantic equivalence assessment is defined as a task that assesses semantic equivalence in a sentence pair by binary judgment (i.e., paraphrase identification) or grading (i.e., semantic textual similarity measurement). It constitutes a set of tasks crucial for research on natural language understanding. Recently, BERT realized a breakthrough in sentence representation learning (Devlin et al., 2019), which is broadly transferable to various NLP tasks. While BERT's performance improves by increasing its model size, the required computational power is an obstacle preventing practical applications from adopting the technology. Herein, we propose to inject phrasal paraphrase relations into BERT in order to generate suitable representations for semantic equivalence assessment instead of increasing the model size. Experiments on standard natural language understanding tasks confirm that our method effectively improves a smaller BERT model while maintaining the model size. The generated model exhibits superior performance compared to a larger BERT model on semantic equivalence assessment tasks. Furthermore, it achieves larger performance gains on tasks with limited training datasets for fine-tuning, which is a property desirable for transfer learning.
CLMay 25, 2018
Recursive Neural Network Based Preordering for English-to-Japanese Machine TranslationYuki Kawara, Chenhui Chu, Yuki Arase
The word order between source and target languages significantly influences the translation quality in machine translation. Preordering can effectively address this problem. Previous preordering methods require a manual feature design, making language dependent design costly. In this paper, we propose a preordering method with a recursive neural network that learns features from raw inputs. Experiments show that the proposed method achieves comparable gain in translation quality to the state-of-the-art method but without a manual feature design.
SISep 7, 2015
Wikipedia Page View Reflects Web Search TrendMitsuo Yoshida, Yuki Arase, Takaaki Tsunoda et al.
The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different resource such as Wikipedia page views of open data. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.