CL · Feb 26, 2020

Detecting Potential Topics In News Using BERT, CRF and Wikipedia

arXiv:2002.11402v20.32 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better topic detection in news content distribution platforms like Dailyhunt to enhance user recommendations and notifications, though it is incremental as it builds on existing NER and NLP techniques.

The paper tackles the detection of important case-less n-grams (e.g., 'me too movement') in English news text for use as topics or hashtags. Its model combines BERT, Bi-GRU, and CRF, and shows promising results, with improved F1 and recall compared to industry benchmarks like Flair and spaCy.
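To make the F1 and recall comparison concrete, here is a minimal sketch of span-level scoring. It assumes exact-match scoring over sets of predicted n-grams, which may differ from the paper's actual evaluation protocol; the function name `span_prf` and the sample data are hypothetical.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of n-gram strings.

    Assumption: a predicted n-gram counts as correct only if it matches
    a gold n-gram exactly. The paper may score differently.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold topics, two predictions, one correct.
gold = {"me too movement", "beef ban", "alwar mob lynching"}
pred = {"me too movement", "mob"}
precision, recall, f1 = span_prf(gold, pred)
```

Under this scoring, a system that returns only a fragment such as "mob" gets no credit for "alwar mob lynching", so recall drops whenever multi-word topics are missed or truncated.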

For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, and organisations in news for 13+ Indian languages and using them in algorithms, we also need to identify n-grams that do not necessarily fit the definition of a named entity yet are important, for example "me too movement", "beef ban", and "alwar mob lynching". In this exercise, given an English-language text, we try to detect case-less n-grams that convey important information and can be used as topics and/or hashtags for a news article. The model is built using Wikipedia title data, a private English news corpus, the BERT-Multilingual pre-trained model, and a Bi-GRU and CRF architecture. It shows promising results compared with the industry-best Flair, spaCy, and Stanford-caseless-NER in terms of F1 and especially recall.
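The abstract does not say how the Wikipedia titles are used. One plausible reading is that they label case-less text for a sequence tagger. The sketch below illustrates that idea only: it is not the authors' pipeline, and the title set, the `max_len` window, and both function names are hypothetical. It tags lowercased tokens with B/I/O labels by greedy longest match against a title set, then decodes the labels back into topic n-grams, the step a CRF output layer would feed.

```python
def bio_tag(tokens, titles, max_len=4):
    """Greedy longest-match B/I/O tagging of tokens against a title set."""
    tokens = [t.lower() for t in tokens]  # case-less matching
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest window first so a full title beats its sub-phrases.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in titles:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def spans_from_bio(tokens, tags):
    """Decode a B/I/O tag sequence back into n-gram strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical title set, built from the abstract's own examples.
TITLES = {"me too movement", "beef ban", "alwar mob lynching"}
tokens = "protests over the beef ban and the me too movement continued".split()
tags = bio_tag(tokens, TITLES)
topics = spans_from_bio(tokens, tags)
```

Here `topics` comes out as `["beef ban", "me too movement"]`. In a trained model, the tag sequence would come from the BERT, Bi-GRU, and CRF stack rather than from dictionary lookup, while the decoding step would stay the same.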
