IR CV LG · Jan 22, 2025

Patent Figure Classification using Large Vision-language Models

Sushil Awale, Eric Müller-Budack, Ralph Ewerth

arXiv:2501.12751v1 · 3.61 citations · h-index: 10 · Has Code · ECIR

Originality: Incremental advance

AI Analysis

This work aims to support efficient prior art search in patent retrieval systems by improving patent figure classification across multiple aspects. It appears incremental, however, because it applies existing LVLMs to a new domain, contributing new datasets and a new classification strategy rather than a new model.

The paper tackles patent figure classification by exploring large vision-language models (LVLMs) in zero-shot and few-shot scenarios. It introduces two new datasets (PatFigVQA and PatFigCLS) and a tournament-style classification strategy, and its experiments show that the approach is feasible in comparison with CNN-based baselines.

Patent figure classification facilitates faceted search in patent retrieval systems, enabling efficient prior art search. Existing approaches have explored patent figure classification for only a single aspect and for aspects with a limited number of concepts. In recent years, large vision-language models (LVLMs) have shown tremendous performance across numerous downstream computer vision tasks; however, they remain unexplored for patent figure classification. Our work explores the efficacy of LVLMs in patent figure visual question answering (VQA) and classification, focusing on zero-shot and few-shot learning scenarios. For this purpose, we introduce new datasets, PatFigVQA and PatFigCLS, for fine-tuning and evaluation regarding multiple aspects of patent figures (i.e., type, projection, patent class, and objects). For computationally efficient handling of a large number of classes with LVLMs, we propose a novel tournament-style classification strategy that leverages a series of multiple-choice questions. Experimental results and comparisons of multiple classification approaches based on LVLMs and Convolutional Neural Networks (CNNs) in few-shot settings show the feasibility of the proposed approaches.
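To illustrate the general idea of a tournament over multiple-choice questions, here is a minimal sketch, not the authors' implementation. It assumes that candidate labels are split into groups of at most `group_size` options (4 is an arbitrary choice here), that one multiple-choice question is asked per group, and that only each group's chosen label advances to the next round. The `choose` callable is a hypothetical stand-in for prompting an LVLM with the figure and the options. `mock_choose` is a toy chooser used only to make the example runnable.

```python
from typing import Callable, Sequence

# Hypothetical interface: given an image and candidate labels, return the
# index of the option picked (in practice, an LVLM answering an MCQ prompt).
MCQChooser = Callable[[object, Sequence[str]], int]


def tournament_classify(
    image: object,
    labels: Sequence[str],
    choose: MCQChooser,
    group_size: int = 4,
) -> str:
    """Reduce a large label set to a single label through rounds of MCQs."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    candidates = list(labels)
    if not candidates:
        raise ValueError("labels must be non-empty")
    while len(candidates) > 1:
        winners = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            if len(group) == 1:
                # A lone leftover candidate advances without a question.
                winners.append(group[0])
                continue
            winners.append(group[choose(image, group)])
        candidates = winners
    return candidates[0]


def mock_choose(image: object, options: Sequence[str]) -> int:
    # Toy stand-in for an LVLM: pick the option equal to `image`, else option 0.
    return list(options).index(image) if image in options else 0


classes = [f"class_{i}" for i in range(10)]
print(tournament_classify("class_7", classes, mock_choose))  # class_7
```

With 10 classes and groups of 4, the first round asks 3 questions and the second asks 1. Each prompt therefore lists at most 4 options instead of all 10, which is what keeps each question short when the label set is large.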
