Query-Based Keyphrase Extraction from Long Documents
This work addresses a practical problem for NLP practitioners who process long documents, but its contribution is incremental: it adapts existing BERT models using chunking and query techniques.
The paper tackles the problem of keyphrase extraction from long documents by using a query-based approach to maintain global context across chunks, overcoming transformer input size limits. Results show that shorter contexts with a query outperform longer ones without a query on long documents, as demonstrated on Inspec, SemEval, and a novel dataset.
Transformer-based architectures in natural language processing impose input size limits that can be problematic when long documents need to be processed. This paper addresses this issue for keyphrase extraction by chunking the long documents while keeping a global context as a query defining the topic for which relevant keyphrases should be extracted. The developed system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase. We experimented with various context sizes on two popular datasets, Inspec and SemEval, and on a large novel dataset. The presented results show that a shorter context with a query outperforms a longer one without the query on long documents.
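
The chunking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name build_chunk_inputs, the whitespace tokenization, and the "[CLS] query [SEP] chunk [SEP]" layout are assumptions made for illustration. A real system would use the BERT WordPiece tokenizer and would feed each input to a span-scoring head, which is omitted here.

```python
from typing import List, Optional

CLS, SEP = "[CLS]", "[SEP]"


def build_chunk_inputs(
    query: str,
    document: str,
    context_size: int,
    stride: Optional[int] = None,
) -> List[List[str]]:
    """Split a document into chunks and prepend the same query to each one.

    Hypothetical helper: tokens are whitespace-separated words, which is a
    simplification of subword tokenization.
    """
    query_tokens = query.split()
    doc_tokens = document.split()

    # Three special tokens are reserved: one [CLS] and two [SEP].
    chunk_len = context_size - len(query_tokens) - 3
    if chunk_len <= 0:
        raise ValueError("context_size is too small to fit the query")

    # Without an explicit stride the chunks do not overlap.
    step = chunk_len if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")

    inputs: List[List[str]] = []
    for start in range(0, len(doc_tokens), step):
        chunk = doc_tokens[start:start + chunk_len]
        # Every chunk carries the query, so the global topic is visible
        # even though the model only sees a local window of the document.
        inputs.append([CLS] + query_tokens + [SEP] + chunk + [SEP])
        if start + chunk_len >= len(doc_tokens):
            break  # the last window already reached the end of the document
    return inputs


if __name__ == "__main__":
    for tokens in build_chunk_inputs(
        "keyphrase extraction", "a b c d e f g", context_size=8
    ):
        print(" ".join(tokens))
```

In this sketch the query consumes part of the input budget, so each chunk is shorter than it would be without a query. That is the trade-off the experiments examine: a shorter local context plus a query versus a longer context alone.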