Regular Expressions for Fast-response COVID-19 Text Classification
The approach provides a fast and explainable method for narrow-topic text classification on social media platforms such as Facebook. The contribution is incremental, since it adapts existing regex techniques to a specific domain.
The paper tackles the problem of quickly classifying COVID-19-related text using large-scale regular expressions. It builds two sets of expressions: one covering 66 languages with 99% precision and over 50% recall, and one covering the 11 most common languages with over 90% precision and over 90% recall. Both support low-latency queries.
Text classifiers are at the core of many NLP applications and use a variety of algorithmic approaches and software. This paper introduces infrastructure and methodologies for text classifiers based on large-scale regular expressions. In particular, we describe how Facebook determines if a given piece of text (anything from a hashtag to a post) belongs to a narrow topic such as COVID-19. To fully define a topic and evaluate classifier performance, we employ human-guided iterations of keyword discovery, but do not require labeled data. For COVID-19, we build two sets of regular expressions: (1) for 66 languages, with 99% precision and recall >50%, (2) for the 11 most common languages, with precision >90% and recall >90%. Regular expressions enable low-latency queries from multiple platforms. Response to challenges like COVID-19 is fast and so are revisions. Comparisons to a DNN classifier show explainable results, higher precision and recall, and less overfitting. Our learnings can be applied to other narrow-topic classifiers.
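To make the idea of a per-language regex classifier concrete, the following is a minimal Python sketch, not the paper's actual implementation. The keyword lists, the `compile_topic_regex` and `classify` helpers, and the matching rules (optional leading `#`, trailing word characters or hyphens) are all illustrative assumptions; the real expressions are described as far larger and are refined through human-guided keyword discovery.

```python
import re

# Hypothetical, tiny keyword lists for illustration only.
KEYWORDS_BY_LANGUAGE = {
    "en": ["covid", "coronavirus", "quarantine"],
    "es": ["covid", "coronavirus", "cuarentena"],
    "de": ["covid", "coronavirus", "quarantäne"],
}


def compile_topic_regex(keywords):
    """Build one case-insensitive alternation from a keyword list."""
    # Escape each keyword so it is matched literally; longest first so the
    # alternation prefers the most specific term at a given position.
    alternatives = "|".join(
        re.escape(k) for k in sorted(keywords, key=len, reverse=True)
    )
    # (?<!\w) : do not start in the middle of a word
    # #?      : allow a hashtag prefix
    # [\w-]*  : absorb suffixes such as "19" or "-19"
    return re.compile(r"(?<!\w)#?(?:" + alternatives + r")[\w-]*", re.IGNORECASE)


# Compile once up front so each query is a single regex scan.
TOPIC_REGEXES = {
    lang: compile_topic_regex(kws) for lang, kws in KEYWORDS_BY_LANGUAGE.items()
}


def classify(text, language):
    """Return (is_on_topic, matched_strings) for a text in a given language."""
    regex = TOPIC_REGEXES.get(language)
    if regex is None:
        # No expression exists for this language in the sketch.
        return False, []
    matches = [m.group(0) for m in regex.finditer(text)]
    return bool(matches), matches


if __name__ == "__main__":
    print(classify("New #COVID19 guidance released today", "en"))
    print(classify("Great football match last night", "en"))
```

Returning the matched strings alongside the decision is one simple way to get the explainability the abstract mentions: every positive classification can be traced to the exact keyword that triggered it, and revising the classifier amounts to editing a keyword list and recompiling.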