CL · Apr 1, 2024

Estimating Lexical Complexity from Document-Level Distributions

Sondre Wold, Petter Mæhlum, Oddbjørn Hove

arXiv:2404.01196v124.083 citations · h-index: 15 · Has Code · LREC

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better tools to support health practitioners in creating accessible assessment tools for patients with varying cognitive and linguistic abilities, though it is incremental as it adapts complexity estimation to a new scope.

The paper tackles the problem of estimating lexical complexity for short texts such as health assessment sentences, which existing document-level methods cannot handle. It develops a two-step approach that requires no pre-annotated data and verifies it for Norwegian using statistical testing and a qualitative evaluation.

Existing methods for complexity estimation are typically developed for entire documents. This limitation in scope makes them inapplicable for shorter pieces of text, such as health assessment tools. These typically consist of lists of independent sentences, all of which are too short for existing methods to apply. The choice of wording in these assessment tools is crucial, as both the cognitive capacity and the linguistic competency of the intended patient groups could vary substantially. As a first step towards creating better tools for supporting health practitioners, we develop a two-step approach for estimating lexical complexity that does not rely on any pre-annotated data. We implement our approach for the Norwegian language and verify its effectiveness using statistical testing and a qualitative evaluation of samples from real assessment tools. We also investigate the relationship between our complexity measure and certain features typically associated with complexity in the literature, such as word length, frequency, and the number of syllables.
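The surface features named at the end of the abstract (word length, frequency, and number of syllables) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy sentence, and the vowel-group syllable heuristic are all assumptions made for illustration, and the abstract does not say how the authors count syllables or which correlation statistic they use.

```python
import re
from collections import Counter
from math import sqrt

# Assumption: syllables are approximated by counting vowel groups,
# using the Norwegian vowel inventory (including æ, ø, å).
VOWEL_GROUP = re.compile(r"[aeiouyæøå]+")


def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups, never less than 1."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def word_features(word: str, frequencies: Counter) -> dict:
    """Collect the three surface features for a single word."""
    return {
        "length": len(word),
        "frequency": frequencies[word.lower()],
        "syllables": count_syllables(word),
    }


def pearson(xs, ys) -> float:
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy sentence standing in for a real frequency corpus (hypothetical data).
tokens = "jeg er trøtt og jeg er sliten".split()
frequencies = Counter(tokens)

for word in sorted(set(tokens)):
    print(word, word_features(word, frequencies))
```

Given per-word scores from any complexity measure, a call such as `pearson(lengths, scores)` would give one simple view of how a feature relates to that measure; a rank-based statistic may be a better fit for skewed frequency counts.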
