CL, AI, CV | Jun 13, 2025

Unsupervised Document and Template Clustering using Multimodal Embeddings

arXiv:2506.12116v3 | 2 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses document organization for applications like digitization and archiving, but it is incremental as it systematizes existing methods without introducing new paradigms.

The paper tackles unsupervised clustering of documents at the category and template levels using multimodal encoders and classical algorithms. It finds that vision features excel on clean pages, text dominates under covariate shift, and fused encoders provide the best balance.

We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
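The "HDBSCAN + $k$-NN assignment" step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the density-based clusterer marks unassigned points with the label -1 (as HDBSCAN implementations commonly do) and that each such point is given the majority label of its k nearest clustered neighbours under Euclidean distance. The function name `assign_noise_by_knn`, the default `k=3`, and the toy vectors are hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter

NOISE = -1  # assumed label for points the density-based clusterer left unassigned


def assign_noise_by_knn(vectors, labels, k=3):
    """Return a copy of `labels` in which every NOISE point takes the majority
    label of its k nearest already-clustered neighbours (Euclidean distance)."""
    clustered = [(v, l) for v, l in zip(vectors, labels) if l != NOISE]
    if not clustered:
        return list(labels)  # no clustered points to vote with; leave unchanged
    result = []
    for v, l in zip(vectors, labels):
        if l != NOISE:
            result.append(l)  # keep labels the clusterer already assigned
            continue
        # k closest clustered points, nearest first
        nearest = sorted(clustered, key=lambda item: math.dist(v, item[0]))[:k]
        votes = Counter(label for _, label in nearest)
        result.append(votes.most_common(1)[0][0])
    return result


# Toy document vectors: two tight groups plus two points flagged as noise.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
           (0.2, 0.2), (4.8, 4.9)]
labels = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
print(assign_noise_by_knn(vectors, labels))  # [0, 0, 0, 1, 1, 1, 0, 1]
```

In a real pipeline the vectors would be the pooled encoder embeddings and the initial labels would come from the clusterer; the vote only fills in the points it left unlabeled, so every document ends up in some cluster.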

