IR · Oct 16, 2019

Rule based Approach for Word Normalization by resolving Transcription Ambiguity in Transliterated Search Queries

arXiv:1910.07233v1 · 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of letting common people search for information via SMS in their own transliteration style. It is incremental, however, because it builds on existing methods such as Levenshtein distance.

The paper tackled the problem of transcription ambiguity in transliterated search queries for SMS-based information systems in Marathi and Hindi, developing a rule-based approach that resolved term-level noise to improve query matching, with experiments showing results on literature datasets including songs and gazals.

Matching query terms against document terms is the basic function of any best-effort Information Retrieval model, such as the Vector Space Model. In our problem of SMS-based Information Systems, we expect common people to participate in information search. Our system allows mobile users to formulate queries in their own words, transliteration style, and spelling. To achieve this flexibility, we resolve the term-level ambiguity caused by inherent transcription noise in user query terms. We have developed a rule-based approach that selects the closest relevant standard term for each noisy term in the user query. We used four versions of the rule-based algorithm, each with a different rule set. The rule set includes the basic Levenshtein minimum edit distance algorithm for term matching. This paper presents the experiments and corresponding results for a Marathi and Hindi literature information system. We experimented on Marathi and Hindi literature, including songs, gazals, powadas, bharud, and other types, in a standard transliteration form such as ITRANS.
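The term-matching step that the abstract attributes to Levenshtein minimum edit distance can be illustrated with a minimal sketch. This is not the authors' algorithm: the paper's four rule sets are not given here, so the sketch covers only the plain edit-distance baseline. The function names, the tie-breaking rule, and the sample vocabulary are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_standard_term(noisy: str, vocabulary: Sequence[str]) -> str:
    """Return the vocabulary term with the smallest edit distance to noisy.

    Assumption (not from the paper): ties are broken alphabetically.
    Raises ValueError if vocabulary is empty.
    """
    return min(vocabulary, key=lambda term: (levenshtein(noisy, term), term))


# Hypothetical standard terms; a real system would draw these from the
# indexed collection in its standard transliteration form.
standard_terms = ["paaus", "prem", "chandra"]
print(closest_standard_term("paus", standard_terms))
```

Plain edit distance treats every character edit as equally costly, which is presumably why the paper layers additional rules on top of it: a transliteration-aware rule could treat common spelling variants of the same sound as cheaper than arbitrary substitutions.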

