CL · Sep 24, 2016

The distribution of information content in English sentences

Shuiyuan Yu, Jin Cong, Junying Liang, Haitao Liu

arXiv:1609.07681v1 · 0.86 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses a fundamental linguistic problem of interest to researchers in natural language processing and linguistics, offering insight into sentence structure and information flow. Its contribution is incremental, since it builds on existing entropy-based analyses.

The study tackled the problem of how information content is distributed across positions in English sentences by calculating entropy statistics from authentic language data, revealing a three-step staircase-shaped pattern with lower entropy at the initial position, higher at the final, and no significant differences in medial positions.

The sentence is a basic linguistic unit; however, little is known about how information content is distributed across the different positions of a sentence. Based on authentic English language data, the present study calculated the entropy and other entropy-related statistics for different sentence positions. The statistics indicate a three-step staircase-shaped distribution pattern: entropy in the initial position is lower than in the medial positions (positions other than the initial and final), entropy in the medial positions is lower than in the final position, and the medial positions show no significant differences among themselves. The results suggest that: (1) the hypotheses of Constant Entropy Rate and Uniform Information Density do not hold for the sentence-medial positions; (2) the context of a word in a sentence should not be simply defined as all the words preceding it in the same sentence; and (3) the contextual information content in a sentence does not accumulate incrementally but follows a pattern of "the whole is greater than the sum of parts".
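The abstract does not spell out how the per-position entropy was computed, so the following is only a minimal sketch of one plausible reading: the Shannon entropy of the word distribution observed at each sentence position. The function name `positional_entropy`, the toy corpus, and the choice to index positions from the start of equal-length sentences are all assumptions made for illustration, not the authors' procedure or data.

```python
import math
from collections import Counter, defaultdict


def positional_entropy(sentences):
    """Return {position: entropy in bits} of the word distribution at each position.

    `sentences` is a list of tokenized sentences (lists of strings).
    Positions are counted from the start of the sentence (0 = initial).
    This is an illustrative assumption, not the paper's documented method.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for position, word in enumerate(sentence):
            counts[position][word.lower()] += 1

    entropies = {}
    for position, counter in counts.items():
        total = sum(counter.values())
        # Shannon entropy: H = -sum(p * log2(p)) over words seen at this position
        entropies[position] = -sum(
            (c / total) * math.log2(c / total) for c in counter.values()
        )
    return entropies


# Hypothetical toy corpus; all sentences have the same length so that
# position 2 is also the final position.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
    ["a", "bird", "sang"],
]

result = positional_entropy(toy_corpus)
for position in sorted(result):
    print(position, round(result[position], 4))
```

On real data, sentences differ in length, so comparing initial, medial, and final positions would need an explicit alignment choice (for example, grouping sentences by length); the sketch sidesteps that by using equal-length toy sentences.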
