CL | CV | Mar 1, 2024

Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction

arXiv:2403.00528v1 | 3.45 citations | h-index: 6 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of improving named entity recognition in noisy, domain-specific text such as receipts. The contribution is incremental: it builds on existing methods and reports only minor gains.

The paper tackled the problem of extracting named entities from text with spelling errors, specifically from OCR-processed Japanese shop receipts, and found that fine-tuned large language models perform similarly to BERT models in extraction while also correcting some OCR errors.

Language Models (LMs) such as BERT have been shown to perform well on the task of identifying Named Entities (NEs) in text. A BERT LM is typically used as a classifier to classify individual tokens in the input text, or spans of tokens, as belonging to one of a set of possible NE categories. In this paper, we hypothesise that decoder-only Large Language Models (LLMs) can also be used generatively, both to extract the NEs and potentially to recover their correct surface forms, with any spelling errors present in the input text corrected automatically. We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on the task of producing NEs from text obtained by applying Optical Character Recognition (OCR) to images of Japanese shop receipts; in this work, we do not attempt to find or evaluate the location of NEs in the text. We show that the best fine-tuned LLM performs as well as, or slightly better than, the best fine-tuned BERT LM, although the differences are not significant. However, the best LLM is also shown to correct OCR errors in some cases, as initially hypothesised.
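Because the abstract states that NE locations are not evaluated, one way to picture the comparison is a location-free score over (category, surface form) pairs. The sketch below is only an illustration under that assumption: the function name `entity_prf`, the category labels, and the receipt strings are hypothetical and not taken from the paper, and the paper's actual metric may differ. It shows why a model that copies an OCR error verbatim loses credit while a model that generates the corrected form does not.

```python
from collections import Counter


def entity_prf(predicted, reference):
    """Location-free precision, recall and F1 over (category, surface form) pairs.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Multiset intersection: a pair is matched once per occurrence in both lists.
    true_positives = sum((pred_counts & ref_counts).values())
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold annotation for one receipt (labels and values are made up).
reference = [("SHOP_NAME", "Lawson"), ("TOTAL", "1080")]

# A token classifier can only copy the OCR text, so the "0" for "o" error survives.
copied_output = [("SHOP_NAME", "Laws0n"), ("TOTAL", "1080")]

# A generative model may emit the corrected surface form instead.
generated_output = [("SHOP_NAME", "Lawson"), ("TOTAL", "1080")]

copied_scores = entity_prf(copied_output, reference)
generated_scores = entity_prf(generated_output, reference)
```

Under exact matching of surface forms, the copied output scores lower than the generated one even though both identify the same two entities, which is the effect the abstract's spelling-correction hypothesis relies on.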
