CL · AI · LG · May 25, 2023

Extracting Text Representations for Terms and Phrases in Technical Domains

arXiv:2305.15867v1222 citations
Originality: Incremental advance
AI Analysis

The work addresses two problems in knowledge discovery platforms for technical fields: the computational cost of sentence encoders and the out-of-vocabulary limitation of static embeddings. It offers a smaller and faster alternative.

The paper tackles the problem of extracting dense text representations for terms and phrases in technical domains, proposing an unsupervised approach that matches sentence encoder quality while being 5 times smaller and up to 10 times faster.

Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach can not only match the quality of sentence encoders in technical domains, but are 5 times smaller and up to 10 times faster, even on high-end GPUs.
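The core idea in the abstract, a small character-level model trained to reconstruct vectors from a pre-trained embedding matrix, can be illustrated with a toy sketch. This is not the authors' architecture, which the abstract does not specify. It swaps in a deliberately simple stand-in: each term is encoded as the mean of hashed character-trigram vectors, and those vectors are fitted with a normalised squared-error update. The `TARGETS` vectors, the function names, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import random
import zlib

# Hypothetical stand-in for a large pre-trained embedding matrix (term -> vector).
TARGETS = {
    "transformer":  [0.9, 0.1, 0.0],
    "transformers": [0.8, 0.2, 0.0],
    "resistor":     [0.0, 0.9, 0.3],
    "capacitor":    [0.1, 0.8, 0.4],
}


def ngram_ids(term, num_buckets, n=3):
    """Map a term to hashed character n-gram bucket ids (with boundary markers)."""
    padded = f"<{term}>"
    grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in grams]


def encode(term, table):
    """Encode any string as the mean of its n-gram rows; no vocabulary lookup."""
    ids = ngram_ids(term, len(table))
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


def train(targets, dim, num_buckets=2048, epochs=300, lr=0.5, seed=0):
    """Fit the n-gram table so that encode(term) reconstructs targets[term]."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
             for _ in range(num_buckets)]
    terms = sorted(targets)
    for _ in range(epochs):
        rng.shuffle(terms)
        for term in terms:
            ids = ngram_ids(term, num_buckets)
            pred = encode(term, table)
            for d in range(dim):
                err = pred[d] - targets[term][d]
                # Step along the squared-error gradient direction; the
                # constant 2/len(ids) factor is folded into the step size.
                for i in ids:
                    table[i][d] -= lr * err
    return table


table = train(TARGETS, dim=3)

# A term absent from TARGETS still gets a vector, built from its trigrams.
oov_vector = encode("transistors", table)
```

Because the encoder only ever looks at character n-grams, it has no fixed vocabulary, which is the property the abstract contrasts with static embeddings. The trained table is also much smaller than a model that would have to store one vector per term in a large vocabulary, although this toy example makes no claim about the size or speed figures reported in the paper.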

