LG, CV | Nov 10, 2025

Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery

arXiv:2511.06973v2
Originality: Incremental advance
AI Analysis

The method enables large-scale automated template discovery for applications such as retrieval-augmented generation and data cleaning. The advance is incremental, since it builds on existing similarity methods.

The paper tackles the problem of identifying structurally similar spreadsheets by introducing a hybrid distance metric that combines semantic embeddings, data types, and spatial positioning. On the FUSTE dataset it reports perfect template reconstruction, with an Adjusted Rand Index of 1.00 compared to 0.90 for the graph-based Mondrian baseline.
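For readers unfamiliar with the Adjusted Rand Index (ARI) quoted above, the following is a minimal sketch of the standard pair-counting formula. It is not the paper's evaluation code; the function name and the toy labels are illustrative assumptions. An ARI of 1.00 means the predicted clusters match the ground-truth template families exactly, up to a renaming of cluster labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings trivially agree

    # Contingency counts: how many items share each (true, predicted) label pair.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_joint - expected) / (max_index - expected)


# Hypothetical template-family labels for six spreadsheets.
truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]    # same grouping, different cluster names
imperfect = [0, 0, 1, 2, 2, 2]  # one spreadsheet assigned to the wrong family

print(adjusted_rand_index(truth, perfect))
print(adjusted_rand_index(truth, imperfect))
```

Because the score is corrected for chance, a random clustering scores near 0 regardless of the number of clusters, which makes it suitable for comparing unsupervised methods.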

Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns defining templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. In order to calculate spreadsheet similarity, our method converts spreadsheets into cell-level embeddings and then uses aggregation techniques like Chamfer and Hausdorff distances. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 versus 0.90) on the FUSTE dataset. Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
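The abstract does not give the exact formulation, so the following is only a sketch of how a hybrid cell-level distance could be aggregated into a spreadsheet-level score with Chamfer and Hausdorff distances. The cell representation `(row, col, dtype, embedding)`, the weighted-sum combination, the weights, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cell_distance(a, b, w_sem=1.0, w_type=1.0, w_pos=1.0):
    """Assumed hybrid distance: weighted sum of semantic, type, and spatial terms."""
    (row_a, col_a, type_a, emb_a), (row_b, col_b, type_b, emb_b) = a, b
    semantic = cosine_distance(emb_a, emb_b)
    type_mismatch = 0.0 if type_a == type_b else 1.0
    spatial = math.hypot(row_a - row_b, col_a - col_b)
    return w_sem * semantic + w_type * type_mismatch + w_pos * spatial


def nearest_distances(sheet_a, sheet_b, **weights):
    """For each cell in sheet_a, the distance to its closest cell in sheet_b."""
    return [min(cell_distance(a, b, **weights) for b in sheet_b) for a in sheet_a]


def chamfer(sheet_a, sheet_b, **weights):
    """Symmetric Chamfer distance: mean nearest-cell distance in both directions."""
    ab = nearest_distances(sheet_a, sheet_b, **weights)
    ba = nearest_distances(sheet_b, sheet_a, **weights)
    return sum(ab) / len(ab) + sum(ba) / len(ba)


def hausdorff(sheet_a, sheet_b, **weights):
    """Symmetric Hausdorff distance: worst-case nearest-cell distance."""
    ab = nearest_distances(sheet_a, sheet_b, **weights)
    ba = nearest_distances(sheet_b, sheet_a, **weights)
    return max(max(ab), max(ba))


# Toy sheets with 2-d stand-in embeddings (real embeddings would come from a text model).
sheet_a = [(0, 0, "text", [1.0, 0.0]), (1, 0, "number", [0.0, 1.0])]
sheet_c = [(0, 0, "text", [1.0, 0.0]), (3, 0, "number", [0.0, 1.0])]  # second cell moved down

print(chamfer(sheet_a, sheet_a))    # identical sheets
print(chamfer(sheet_a, sheet_c))    # averaged mismatch
print(hausdorff(sheet_a, sheet_c))  # worst single-cell mismatch
```

Chamfer averages over all cells, so it tolerates a few stray cells, while Hausdorff is driven by the single worst-matched cell. The resulting pairwise distance matrix can then be passed to any clustering algorithm that accepts precomputed distances.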

