Patent Representation Learning via Self-supervision
This work addresses scalable and generalizable patent understanding for researchers and practitioners. It offers an incremental improvement by exploiting the inherent structure of patent documents to improve representation learning.
The paper tackles the problem of learning patent embeddings. It identifies a failure mode in SimCSE-style dropout augmentation that leads to overly uniform embeddings, and proposes section-based augmentation, which uses different patent sections as complementary views. The result is a self-supervised framework that matches or surpasses supervised baselines in prior-art retrieval and classification on large-scale benchmarks, with analysis showing that different sections benefit different tasks.
This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification. This underscores the role of patents' inherent discourse structure in representation learning. Overall, these results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
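As a rough illustration of the section-based augmentation idea, the sketch below computes a standard InfoNCE loss with in-batch negatives, treating embeddings of two sections of the same patent as a positive pair. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify the loss, the encoder, or the temperature, so the `info_nce` function, the `temperature` value, and the synthetic "abstract" and "claims" embeddings are all hypothetical stand-ins.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(view_a, view_b, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    Row i of view_a and row i of view_b are assumed to come from two
    sections of the same patent (the positive pair); every other row in
    the batch acts as a negative. The temperature is an assumed value.
    """
    a = l2_normalize(view_a)
    b = l2_normalize(view_b)
    logits = a @ b.T / temperature                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives on the diagonal


# Synthetic stand-ins for encoder outputs (assumption: a real system would
# encode the section text with a trained language model instead).
rng = np.random.default_rng(0)
batch, dim = 8, 16
topic = rng.normal(size=(batch, dim))                        # shared per-patent content
abstract_emb = topic + 0.1 * rng.normal(size=(batch, dim))   # "abstract" view
claims_emb = topic + 0.1 * rng.normal(size=(batch, dim))     # "claims" view

aligned = info_nce(abstract_emb, claims_emb)          # sections paired correctly
shuffled = info_nce(abstract_emb, claims_emb[::-1])   # sections paired with the wrong patent
print(f"aligned loss:  {aligned:.4f}")
print(f"shuffled loss: {shuffled:.4f}")
```

In this toy setup the loss is lower when each abstract is paired with the claims of its own patent than when the pairing is reversed, which is the signal a contrastive objective of this kind would use to pull the sections of one patent together in embedding space.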