Capturing Global Informativeness in Open Domain Keyphrase Extraction
This work addresses a bottleneck in open-domain keyphrase extraction for applications like web content analysis, though it is incremental as it builds on existing neural methods.
The paper tackles the problem that neural keyphrase extraction methods often prioritize short, entity-style phrases over globally informative ones in open-domain documents, and presents JointKPE, which improves performance by capturing both local phraseness and global informativeness, achieving significant gains on datasets like OpenKP and KP20k.
Open-domain KeyPhrase Extraction (KPE) aims to extract keyphrases from documents without domain or quality restrictions, e.g., web pages with variant domains and qualities. Recently, neural methods have shown promising results in many KPE tasks due to their powerful capacity for modeling contextual semantics of the given documents. However, we empirically show that most neural KPE methods prefer to extract keyphrases with good phraseness, such as short and entity-style n-grams, instead of globally informative keyphrases from open-domain documents. This paper presents JointKPE, an open-domain KPE architecture built on pre-trained language models, which can capture both local phraseness and global informativeness when extracting keyphrases. JointKPE learns to rank keyphrases by estimating their informativeness in the entire document and is jointly trained on the keyphrase chunking task to guarantee the phraseness of keyphrase candidates. Experiments on two large KPE datasets with diverse domains, OpenKP and KP20k, demonstrate the effectiveness of JointKPE on different pre-trained variants in open-domain scenarios. Further analyses reveal the significant advantages of JointKPE in predicting long and non-entity keyphrases, which are challenging for previous neural KPE methods. Our code is publicly available at https://github.com/thunlp/BERT-KPE.