Thomas Roland Barillot

2papers

2 Papers

LGJun 12, 2024
Topological quantification of ambiguity in semantic search

Thomas Roland Barillot, Alex De Castro

We studied how the local topological structure of sentence-embedding neighborhoods encodes semantic ambiguity. Extending ideas that link word-level polysemy to non-trivial persistent homology, we generalized the concept to full sentences and quantified ambiguity of a query in a semantic search process with two persistent homology metrics: the 1-Wasserstein norm of $H_{0}$ and the maximum loop lifetime of $H_{1}$. We formalized the notion of ambiguity as the relative presence of semantic domains or topics in sentences. We then used this formalism to compute "ab-initio" simulations that encode datapoints as linear combination of randomly generated single topics vectors in an arbitrary embedding space and demonstrate that ambiguous sentences separate from unambiguous ones in both metrics. Finally we validated those findings with real-world case by investigating on a fully open corpus comprising Nobel Prize Physics lectures from 1901 to 2024, segmented into contiguous, non-overlapping chunks at two granularity: $\sim\!250$ tokens and $\sim\!750$ tokens. We tested embedding with four publicly available models. Results across all models reproduce simulations and remain stable despite changes in embedding architecture. We conclude that persistent homology provides a model-agnostic signal of semantic discontinuities, suggesting practical use for ambiguity detection and semantic search recall.

CLFeb 18, 2022
Modelling the semantics of text in complex document layouts using graph transformer networks

Thomas Roland Barillot, Jacob Saks, Polena Lilyanova et al.

Representing structured text from complex documents typically calls for different machine learning techniques, such as language models for paragraphs and convolutional neural networks (CNNs) for table extraction, which prohibits drawing links between text spans from different content types. In this article we propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span irrespective of the content type they are found in. We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information, similar to language models that work only on text sequences.