CL · DATA-AN · Apr 13, 2013

The risks of mixing dependency lengths from sequences of different length

arXiv:1304.3841v2 · 61 citations
AI Analysis

This work addresses a methodological issue for linguists and computational linguists: it offers an incremental but important correction to common practice in dependency length analysis.

The paper examines the practice of mixing dependency lengths from sentences of different lengths in language research. It shows that this practice can lead to misleading results, such as false conclusions about language optimization, because the distribution of dependency lengths depends on sentence length and the distribution of sentence lengths can differ between languages.

Mixing dependency lengths from sequences of different length is a common practice in language research. However, the empirical distribution of dependency lengths of sentences of the same length differs from that of sentences of varying length. Moreover, the distribution of dependency lengths depends on sentence length, both for real sentences and under the null hypothesis that dependencies connect vertices located in random positions of the sequence. This suggests that certain results, such as the distribution of syntactic dependency lengths obtained by mixing dependencies from sentences of varying length, could be a mere consequence of that mixing. Furthermore, differences in the global averages of dependency length (mixing lengths from sentences of varying length) for two different languages do not imply a priori that one language optimizes dependency lengths better than the other, because those differences could be due to differences in the distribution of sentence lengths and other factors.
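
The effect of mixing can be illustrated with a minimal sketch of the null hypothesis mentioned in the abstract. The code below is not taken from the paper. It assumes that the two endpoints of a dependency occupy a uniformly random pair of distinct positions in a sentence of n words, which gives P(d) = 2(n - d) / (n(n - 1)) and a mean length of (n + 1) / 3. It also assumes that each sentence of n words contributes n - 1 dependencies (a tree). The function names and the two toy corpora are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def null_length_pmf(n):
    """P(d), d = 1..n-1, when a dependency joins a uniformly random pair
    of distinct positions in a sentence of n words (null hypothesis)."""
    return {d: Fraction(2 * (n - d), n * (n - 1)) for d in range(1, n)}


def null_mean_length(n):
    """Expected dependency length under the null; equals (n + 1) / 3."""
    return sum(d * p for d, p in null_length_pmf(n).items())


def pooled_mean_length(sentence_counts):
    """Global mean dependency length after pooling all dependencies.

    sentence_counts maps sentence length n -> number of sentences.
    Assumption: a sentence of n words has n - 1 dependencies (a tree).
    """
    total_deps = sum(c * (n - 1) for n, c in sentence_counts.items())
    total_len = sum(
        c * (n - 1) * null_mean_length(n) for n, c in sentence_counts.items()
    )
    return total_len / total_deps


# Two hypothetical corpora that differ ONLY in their sentence-length mix.
corpus_a = {5: 90, 20: 10}  # mostly short sentences
corpus_b = {5: 10, 20: 90}  # mostly long sentences

print(null_mean_length(5), null_mean_length(20))  # 2 and 7
print(pooled_mean_length(corpus_a))               # 41/11, about 3.73
print(pooled_mean_length(corpus_b))               # 241/35, about 6.89
```

Both toy corpora follow the same random placement of dependencies, so neither is "optimized" more than the other. The gap between their pooled averages comes entirely from their different sentence-length distributions, which is the kind of confound the abstract warns about.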
