SMPOST: Parts of Speech Tagger for Code-Mixed Indic Social Media Text
This work addresses the challenge of processing informal, mixed-language social media text for NLP applications, but its contribution is incremental: it applies an existing method to new data.
The authors tackled the problem of part-of-speech tagging for code-mixed Indic social media text by developing a supervised system based on Conditional Random Fields with rich linguistic features, achieving encouraging performance across three language pairs (English-Hindi, English-Bengali, English-Telugu) and three platforms (Twitter, Facebook, WhatsApp).
Use of social media has grown dramatically during the last few years. Users communicate informally on social media, and the language is often mixed: people commonly transcribe their regional language alongside English, a practice that has become extremely popular. Natural language processing (NLP) aims to extract information from such text, and Part-of-Speech (PoS) tagging plays an important role in understanding the structure of the written text. For the task of PoS tagging on code-mixed Indian social media text, we develop a supervised system based on a Conditional Random Field (CRF) classifier. To tackle the problem effectively, we focus on extracting rich linguistic features. We participate in three language pairs, i.e., English-Hindi, English-Bengali and English-Telugu, on three social media platforms: Twitter, Facebook and WhatsApp. The proposed system assigns both coarse-grained and fine-grained PoS tags to a given code-mixed sentence. Experiments show that our system is quite generic, with encouraging performance on all three language pairs in all the domains.
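To make the idea of per-token linguistic features for a CRF tagger concrete, the following is a minimal sketch in Python. The abstract does not list the features actually used, so every feature here (affixes, capitalization, hashtag, mention and URL flags, neighboring tokens) is an assumption chosen for illustration, as are the function names and the example sentence.

```python
import re

# Hypothetical pattern for spotting URLs in social media tokens.
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def token_features(tokens, i):
    """Build a feature dict for the token at position i.

    The feature set is illustrative only; it is not taken from the paper.
    """
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "is_url": bool(URL_RE.match(tok)),
        "is_ascii": tok.isascii(),  # rough cue for romanized vs. native script
        "length": len(tok),
    }
    # Context features: neighboring tokens, or sentence-boundary markers.
    if i == 0:
        feats["BOS"] = True
    else:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i == len(tokens) - 1:
        feats["EOS"] = True
    else:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up English-Hindi code-mixed example, not drawn from the shared-task data.
example = ["@riya", "kal", "movie", "dekhne", "chalein", "?", "#weekend"]
features = sentence_features(example)
```

A sequence of feature dicts like this, paired with a sequence of gold PoS tags per sentence, is the kind of input that common CRF toolkits (for example, sklearn-crfsuite) accept for training. Coarse-grained and fine-grained tagging could then be handled by training on the corresponding tag set.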