IR CL Jul 28, 2017

Putting Self-Supervised Token Embedding on the Tables

Marc Szafraniec, Gautier Marti, Philippe Donnat

arXiv:1708.04120v22.2

Originality: Incremental advance

AI Analysis

This work addresses the need for automated information extraction from the semi-structured electronic messages that businesses and individuals exchange. It represents an incremental improvement over existing methods.

The paper tackles the problem of extracting text and numbers from plain-text tables with many variations, fuzzy structure, or implicit labels, by introducing SC2T, a self-supervised model for token embeddings, which enables unsupervised labeling or semi-supervised information extraction.

Distributing information by electronic message is a preferred means of transmission for many businesses and individuals, often in the form of plain-text tables. As the number of such messages grows, it becomes necessary to extract text and numbers with an algorithm rather than by hand. Usual methods rely on regular expressions or on a strict structure in the data, but they are ineffective when there are many variations, a fuzzy structure, or implicit labels. In this paper we introduce SC2T, a fully self-supervised model that constructs vector representations of tokens in semi-structured messages using both character-level and context-level information, which addresses these issues. It can then be used for unsupervised labeling of tokens, or serve as the basis for a semi-supervised information extraction system.
