Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation
This work addresses the need for more versatile language models in domain-specific applications such as tourism, though the contribution is incremental because it builds on existing extensions of BERT.
The paper tackles the problem of pre-training a language model that can handle multiple text formats. It proposes HKLM, which learns from unstructured, semi-structured, and well-structured text using objectives tailored to these formats. On tourism datasets, HKLM outperforms plain-text pre-training while using only 1/4 of the data, and a domain-agnostic version shows gains on XNLI.
Existing work extends BERT from different perspectives, e.g., by designing different pre-training tasks, semantic granularities, and model architectures. Few models consider extending BERT to different text formats. In this paper, we propose the heterogeneous knowledge language model (HKLM), a unified pre-trained language model (PLM) for all forms of text, including unstructured, semi-structured, and well-structured text. To capture the relations among these multi-format knowledge sources, our approach uses a masked language model objective to learn word knowledge, and a triple classification objective and a title matching objective to learn entity knowledge and topic knowledge, respectively. To obtain such multi-format text, we construct a corpus in the tourism domain and conduct experiments on 5 tourism NLP datasets. The results show that our approach outperforms pre-training on plain text while using only 1/4 of the data. We further pre-train a domain-agnostic HKLM and achieve performance gains on the XNLI dataset.
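The three objectives named above (masked language modeling, triple classification, title matching) can be pictured as one multi-task loss. The sketch below is illustrative only and is not the authors' implementation. The abstract does not say how the losses are combined, so the weighted sum, the default weights, the function names `cross_entropy` and `multi_objective_loss`, and the two-class heads in the usage example are all assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability list."""
    return -math.log(probs[target])


def multi_objective_loss(mlm_preds, tc_pred, tm_pred, weights=(1.0, 1.0, 1.0)):
    """Combine three per-objective losses into one scalar (hypothetical helper).

    mlm_preds: list of (probs, target_id) pairs, one per masked token.
    tc_pred:   (probs, label) for triple classification.
    tm_pred:   (probs, label) for title matching.
    weights:   assumed per-objective weights; the abstract does not specify any.
    """
    # Average the masked-token losses so sequence length does not dominate.
    mlm = sum(cross_entropy(p, t) for p, t in mlm_preds) / len(mlm_preds)
    tc = cross_entropy(*tc_pred)
    tm = cross_entropy(*tm_pred)
    w_mlm, w_tc, w_tm = weights
    return w_mlm * mlm + w_tc * tc + w_tm * tm


# Usage: toy predictions with uniform two-class distributions (assumed sizes).
mlm_preds = [([0.5, 0.5], 0), ([0.5, 0.5], 1)]
loss = multi_objective_loss(mlm_preds, ([0.5, 0.5], 1), ([0.5, 0.5], 0))
```

With every prediction uniform over two classes, each objective contributes ln 2, so the combined loss under unit weights is 3 ln 2. In a real model the probability lists would come from prediction heads on a shared encoder, which this sketch leaves out.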