A Machine Learning Approach for the Identification of Bengali Noun-Noun Compound Multiword Expressions
This work addresses a domain-specific problem in Bengali natural language processing. Its contribution is incremental, since it applies existing methods to a new language context.
The paper tackles the identification of Bengali bigram nominal compound multiword expressions (MWEs) with a two-step machine learning approach: candidates are extracted and then classified by a Random Forest over a variety of features. It does not report specific performance numbers.
This paper presents a machine learning approach for identifying Bengali multiword expressions (MWEs) that are bigram nominal compounds. Our proposed approach has two steps: (1) candidate extraction using chunk information and various heuristic rules, and (2) training a Random Forest classifier to label each candidate as either a bigram nominal compound MWE or not. A variety of association measures, syntactic and linguistic clues, and a set of WordNet-based similarity features are used for the MWE identification task. The approach presented in this paper can be used to identify bigram nominal compound MWEs in Bengali running text.
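The two steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes POS-tagged input in place of chunk information, keeps only adjacent noun-noun pairs as a stand-in for the heuristic rules, and computes pointwise mutual information (PMI) as one possible association measure. The tag name `NN`, the function names, and the placeholder tokens in the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def extract_candidates(tagged_sentences):
    """Step 1 (simplified): collect adjacent noun-noun bigrams.

    Assumption: a plain POS tag "NN" stands in for the chunk
    information and heuristic rules described in the abstract.
    """
    candidates = []
    for sent in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if t1 == "NN" and t2 == "NN":
                candidates.append((w1, w2))
    return candidates

def pmi_scores(tagged_sentences):
    """Compute PMI (log base 2) for every distinct noun-noun candidate."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for w1, w2 in set(extract_candidates(tagged_sentences)):
        p_xy = bigrams[(w1, w2)] / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus with placeholder tokens (hypothetical, not data from the paper).
corpus = [
    [("noun_a", "NN"), ("noun_b", "NN"), ("verb_x", "VB")],
    [("noun_a", "NN"), ("noun_b", "NN"), ("noun_c", "NN")],
    [("noun_c", "NN"), ("verb_y", "VB"), ("noun_a", "NN")],
]

candidates = extract_candidates(corpus)
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In a fuller pipeline, scores like these would form one column of a feature vector alongside the syntactic, linguistic, and WordNet-based features, and the vectors would be passed to a Random Forest classifier (for example, scikit-learn's `RandomForestClassifier`, which is an assumption here since the abstract names no toolkit).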