Natural Language Processing using Hadoop and KOSHIK
This work addresses data processing bottlenecks in NLP for researchers and practitioners, but it is incremental as it focuses on integrating existing tools rather than introducing new methods.
The study tackles large-scale data processing for natural language processing by building and evaluating the KOSHIK architecture on Hadoop with tools such as Stanford CoreNLP and OpenNLP. It analyzes wiki data to assess the architecture's performance and offers recommendations for improvement.
Natural language processing, a technology closely related to data analytics, is widely used in research areas such as artificial intelligence, human language processing, and translation. The explosive growth of data now poses many challenges for natural language processing. Hadoop is one platform that can process the large volumes of data that natural language processing requires. KOSHIK is a natural language processing architecture that runs on Hadoop and includes language processing components such as Stanford CoreNLP and OpenNLP. This study describes how to build a KOSHIK platform with the relevant tools and gives the steps for analyzing wiki data. Finally, it evaluates and discusses the advantages and disadvantages of the KOSHIK architecture and gives recommendations for improving processing performance.
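
To make the pairing of Hadoop with language processing components more concrete, the following is a minimal sketch of the MapReduce pattern applied to token counting over a few documents. It is not KOSHIK's implementation: it runs locally in plain Python, the regex tokenizer is a stand-in for a real component such as Stanford CoreNLP or OpenNLP, and every name in it (tokenize, map_phase, shuffle, reduce_phase, run_job) is hypothetical.

```python
import re
from itertools import groupby

# Assumption: a simple regex tokenizer stands in for CoreNLP/OpenNLP.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def map_phase(doc_id, text):
    """Emit a (token, 1) pair for every token in one document."""
    for token in tokenize(text):
        yield token, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def reduce_phase(token, counts):
    """Sum the partial counts for one token."""
    return token, sum(counts)


def run_job(documents):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = [
        kv
        for doc_id, text in documents.items()
        for kv in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical sample input, not data from the study.
    docs = {
        "page-1": "Hadoop processes data. Hadoop scales.",
        "page-2": "Data grows.",
    }
    for token, count in sorted(run_job(docs).items()):
        print(token, count)
```

On a real cluster the map and reduce functions would run in parallel across nodes, and the tokenizer would be replaced by the heavier annotation steps (sentence splitting, tagging, parsing) that the language processing components provide.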