CL, LG | May 31, 2022

Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk, Szymon Łukasik

arXiv:2205.15712v2 | 0.31 citations | h-index: 14

Originality: Synthesis-oriented

AI Analysis

This paper addresses product matching for e-commerce and data integration, but its contribution is incremental: it applies existing models to a dataset in a new language.

The paper tackles product matching across data sources using textual features in English and Polish. It shows that fine-tuned multilingual Transformer models such as mBERT and XLM-RoBERTa perform similarly to, or better than, the latest solutions on an English benchmark, and it provides baseline results on a new Polish dataset.

Product matching is the task of matching identical products across different data sources. It typically relies on available product features which, apart from being multimodal (i.e., composed of various data types), may be non-homogeneous and incomplete. The paper shows that pre-trained multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested the multilingual mBERT and XLM-RoBERTa models in English on the Web Data Commons training dataset and gold standard for large-scale product matching. The results show that these models perform similarly to the latest solutions tested on this set, and in some cases better. Additionally, we prepared a new dataset entirely in Polish, based on offers in selected categories obtained from several online stores for research purposes. It is the first open dataset for product matching in Polish, and it allows the effectiveness of pre-trained models to be compared. We therefore also report baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
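
The abstract frames product matching as a decision over pairs of offers described by incomplete textual features. A minimal sketch of how such a pair might be prepared and scored is given below. It is an illustration under stated assumptions, not the authors' pipeline: the attribute names (`title`, `brand`, `description`), the helper names, and the `max_length` value are hypothetical, and the Hugging Face checkpoint `xlm-roberta-base` is used only as one publicly available multilingual model.

```python
# Hypothetical attribute names; the datasets in the paper may use other fields.
FIELDS = ("title", "brand", "description")


def serialize_offer(offer: dict, fields=FIELDS) -> str:
    """Join the available textual attributes, skipping missing or empty ones."""
    parts = []
    for name in fields:
        value = offer.get(name)
        if value is not None and str(value).strip():
            parts.append(str(value).strip())
    return " ".join(parts)


def make_pair(offer_a: dict, offer_b: dict) -> tuple:
    """Turn two offers into the (text_a, text_b) pair a pair classifier consumes."""
    return serialize_offer(offer_a), serialize_offer(offer_b)


def match_probability(offer_a: dict, offer_b: dict,
                      model_name: str = "xlm-roberta-base") -> float:
    """Score a pair with a sequence-pair classifier (assumed setup).

    The classification head loaded here is randomly initialized, so the
    score is meaningful only after fine-tuning on labeled offer pairs.
    """
    # Local imports so the serialization helpers work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )
    model.eval()

    text_a, text_b = make_pair(offer_a, offer_b)
    # Passing two strings makes the tokenizer encode them as one sequence pair.
    inputs = tokenizer(text_a, text_b, truncation=True,
                       max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 1 is assumed to be the "match" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```

Skipping empty attributes during serialization is one simple way to cope with incomplete offers. Swapping `model_name` for `bert-base-multilingual-cased` would give the corresponding mBERT variant of the same sketch.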
