Unsupervised Disambiguation of Syncretism in Inflected Lexicons
This paper addresses a challenge in computational linguistics that matters to researchers and practitioners working with inflected languages, though the contribution is incremental in that it builds on existing unsupervised learning techniques.
The paper tackles lexical ambiguity in morphological analysis: it develops an unsupervised method that probabilistically disambiguates each word form among its possible morphological feature bundles, and it reports results on 5 languages.
Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token's context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on 5 languages.
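The idea of partitioning each unigram count among its analyses can be illustrated with a small EM loop. This is a minimal sketch under simplifying assumptions, not the paper's model: it replaces the neural prior with a plain categorical distribution over feature bundles, and it assumes a form's posterior over its analyses is proportional to that prior. The function name `em_disambiguate`, the `counts` and `analyses` inputs, and the toy English data are all hypothetical.

```python
from collections import defaultdict


def em_disambiguate(counts, analyses, iterations=50):
    """Split each form's count among its candidate feature bundles.

    counts:   dict mapping a word form to its unigram type count.
    analyses: dict mapping a word form to its list of possible bundles.
    Assumption: a categorical prior stands in for the neural prior.
    """
    bundles = sorted({b for bs in analyses.values() for b in bs})
    # Start from a uniform prior over feature bundles.
    prior = {b: 1.0 / len(bundles) for b in bundles}

    def posterior(form):
        # p(bundle | form) is the prior renormalized over the form's analyses.
        z = sum(prior[b] for b in analyses[form])
        return {b: prior[b] / z for b in analyses[form]}

    for _ in range(iterations):
        # E-step: fractional counts for each bundle.
        expected = defaultdict(float)
        for form, c in counts.items():
            if c == 0:
                continue
            for b, p in posterior(form).items():
                expected[b] += c * p
        # M-step: re-estimate the prior from the fractional counts.
        total = sum(expected.values())
        prior = {b: expected[b] / total for b in bundles}

    # Final partition of every positive count among its analyses.
    split = {
        form: {b: c * p for b, p in posterior(form).items()}
        for form, c in counts.items()
        if c > 0
    }
    return prior, split


if __name__ == "__main__":
    # Toy data: "walked" is syncretic between past tense and past participle.
    toy_counts = {"walked": 100, "ate": 30, "eaten": 10}
    toy_analyses = {
        "walked": ["PAST", "PTCP"],
        "ate": ["PAST"],
        "eaten": ["PTCP"],
    }
    prior, split = em_disambiguate(toy_counts, toy_analyses)
    print(prior)
    print(split["walked"])
```

In this toy run the unambiguous forms "ate" and "eaten" pull the prior toward a 3:1 ratio, so the 100 tokens of "walked" end up split roughly 75 to 25 between PAST and PTCP. No token context is used at any point, which is why a list of type counts is sufficient input.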