CL · Sep 17, 2021

Reproducing "ner and pos when nothing is capitalized"

Andreas Kuster, Jakub Filipek, Viswa Virinchi Muppirala

arXiv:2109.08396v1 · Has Code

Originality: Synthesis-oriented

AI Analysis

This is an incremental reproduction study for NLP practitioners dealing with casing inconsistencies in text data.

The researchers reproduced a study on mitigating the performance drop that NLP tasks such as NER and POS tagging suffer when capitalization is mismatched between training and testing data. They confirm the original claim that lowercasing 50% of the dataset yields the best performance, though their scores were slightly lower in almost all of the reproduced experiments.

Capitalization is an important feature in many NLP tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. We attempt to reproduce the results of a paper that shows how to mitigate a significant performance drop when casing is mismatched between training and testing data. In particular, we show that lowercasing 50% of the dataset provides the best performance, matching the claims of the original paper. We also obtained slightly lower performance in almost all of the experiments we tried to reproduce, suggesting that there might be some hidden factors impacting our performance. Lastly, we make all of our work available in a public GitHub repository.
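
The idea of lowercasing 50% of the dataset can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function name `lowercase_fraction`, the sentence-level random selection, the fixed seed, and the toy data are all assumptions made for illustration, and the original experiments may select or combine cased and lowercased data differently.

```python
import random


def lowercase_fraction(sentences, fraction=0.5, seed=0):
    """Return a copy of `sentences` with a random `fraction` of them lowercased.

    Each sentence is a list of (token, tag) pairs. Only tokens are lowercased;
    tags are left untouched. Selection at sentence level is an assumption.
    """
    rng = random.Random(seed)
    n_lower = round(len(sentences) * fraction)
    chosen = set(rng.sample(range(len(sentences)), n_lower))
    return [
        [(tok.lower(), tag) for tok, tag in sent] if i in chosen else list(sent)
        for i, sent in enumerate(sentences)
    ]


# Hypothetical toy NER data; every sentence contains at least one capital letter.
train = [
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Bob", "B-PER"), ("works", "O"), ("at", "O"), ("Google", "B-ORG")],
    [("Berlin", "B-LOC"), ("is", "O"), ("large", "O")],
    [("Carol", "B-PER"), ("left", "O"), ("Rome", "B-LOC")],
]

mixed = lowercase_fraction(train, fraction=0.5, seed=1)
for sent in mixed:
    print(" ".join(tok for tok, _ in sent))
```

A model trained on `mixed` sees both cased and uncased versions of the same kinds of entities, which is the intuition behind the reported robustness to casing mismatch at test time.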
