Nonsymbolic Text Representation
This work addresses text processing in settings where segmentation or tokenization is unreliable or unavailable; it is an incremental advance in text representation methods.
The authors address text representation without relying on symbolic units such as words. They introduce the first generic nonsymbolic model and report that it outperforms prior work on an information extraction task and a text denoising task; no specific numbers are provided.
We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words or other symbolic units in text. This applies to training the parameters of the model on a training corpus as well as to applying it when computing the representation of a new text. We show that our model performs better than prior work on an information extraction and a text denoising task.
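To make concrete what "no segmentation or tokenization" can mean in practice, here is a minimal sketch in Python. It is not the model introduced above, whose details the abstract does not give. It is a generic stand-in that assumes a bag of overlapping character n-grams computed directly on the raw string. The function names `char_ngram_counts` and `cosine` and the n-gram range are hypothetical choices made for illustration only.

```python
from collections import Counter


def char_ngram_counts(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count overlapping character n-grams over the raw string.

    Illustrative assumption, not the paper's model: no word boundaries are
    identified, so spaces and punctuation are treated as ordinary characters.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over every position in the string.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The same function is applied to corpus text and to new text; neither step
# needs a tokenizer.
clean = char_ngram_counts("tokenization")
noisy = char_ngram_counts("tokenizat1on")  # one corrupted character
other = char_ngram_counts("segmentation")
print(cosine(clean, noisy) > cosine(clean, other))
```

In this sketch a string with one corrupted character still shares most of its n-grams with the clean string, which hints at why representations that avoid symbolic units may suit noisy text. How the actual model is trained and applied is not specified in the abstract.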