CL · Dec 23, 2022

CinPatent: Datasets for Patent Classification

Minh-Tien Nguyen, Nhung Bui, Manh Tran-Tien, Linh Le, Huy-The Vu

arXiv:2212.12192v30.31 citations · h-index: 26

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of systematic benchmarks and accessible datasets for patent classification. The contribution is incremental: it builds on existing methods by providing new data and baseline comparisons.

The authors introduced two new datasets for patent classification in English and Japanese, containing 45,131 and 54,657 documents respectively, and compared multi-label text classification methods, finding that AttentionXML consistently outperformed other baselines.

Patent classification is the task of assigning one or more codes (classes) to each input patent. Because the task is in high demand, several datasets and methods have been introduced. However, there is no systematic performance comparison of baselines, and some datasets are not accessible, which leaves a gap for the task. To fill this gap, we introduce two new datasets in English and Japanese, collected by using CPC codes. The English dataset includes 45,131 patent documents with 425 labels, and the Japanese dataset contains 54,657 documents with 523 labels. To facilitate future studies, we compare the performance of strong multi-label text classification methods on the two datasets. Experimental results show that AttentionXML is consistently better than the other strong baselines. We also conduct an ablation study in two aspects: the contribution of the different parts of a patent (title, abstract, description, and claims) and the performance of the baselines with different training data segmentation. We release the two new datasets together with the code of the baselines.
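
To make the multi-label setup concrete, the following is a minimal sketch of how a patent record could be turned into a model input and a multi-hot target, and how the input could be restricted to selected parts for an ablation. It is not the authors' released code: the field names (`title`, `abstract`, `description`, `claims`, `labels`), the helper functions, and the toy records are all assumptions made for illustration, and the released datasets may use a different schema.

```python
from typing import Dict, List, Sequence

# Hypothetical field names; the released datasets may use a different schema.
PARTS = ("title", "abstract", "description", "claims")


def build_input(patent: Dict[str, str], parts: Sequence[str] = PARTS) -> str:
    """Concatenate the selected, non-empty parts of a patent into one string."""
    unknown = [p for p in parts if p not in PARTS]
    if unknown:
        raise ValueError(f"unknown parts: {unknown}")
    return "\n".join(patent[p] for p in parts if patent.get(p))


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct code to a column index, in sorted order."""
    codes = sorted({code for labels in label_sets for code in labels})
    return {code: i for i, code in enumerate(codes)}


def multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Encode one patent's codes as a 0/1 vector over the label vocabulary."""
    vec = [0] * len(index)
    for code in labels:
        vec[index[code]] = 1
    return vec


# Toy records, invented for illustration only (no description field).
patents = [
    {"title": "Battery pack", "abstract": "A cooling plate for cells.",
     "claims": "1. A battery pack comprising...", "labels": ["H01M", "B60L"]},
    {"title": "Image sensor", "abstract": "A pixel array with shared readout.",
     "claims": "1. An image sensor comprising...", "labels": ["H04N"]},
]

index = build_label_index([p["labels"] for p in patents])
targets = [multi_hot(p["labels"], index) for p in patents]

# Ablation-style input that uses only the title and abstract.
title_abstract = [build_input(p, ("title", "abstract")) for p in patents]
```

Each patent can carry several codes at once, so the target is a vector with one position per label rather than a single class index. Passing a different `parts` tuple to `build_input` is one simple way to compare how much each section of a patent contributes.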
