CL AI IR LG | May 17, 2024

A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification

D. Subhalingam, Keshav Kolluru, Mausam, Saurabh Singal

arXiv:2405.10918v21.01 citations | h-index: 6 | KDD

Originality: Incremental advance

AI Analysis

This work addresses the challenge of incomplete annotations in e-commerce attribute-value extraction, which can enable more accurate search and recommendation systems. The advance is incremental, since it builds on existing neural methods for a specific domain.

The paper tackles the extraction of attribute-value pairs from e-commerce product data, where training annotations are often incomplete. It introduces GenToC, a model that learns directly from partially-labeled data. GenToC produces up to 56.3% more accurate extractions than prior state-of-the-art models and, deployed at IndiaMART, correctly identifies 20.2% more attribute-value pairs than the existing system at a precision of 89.5%.

In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address this, we introduce GenToC, a model designed for training directly with partially-labeled data, eliminating the necessity for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the associated values for each attribute. GenToC outperforms existing state-of-the-art models, with up to a 56.3% increase in the number of accurate extractions. Furthermore, we utilize GenToC to regenerate the training dataset to expand attribute-value annotations. This bootstrapping substantially improves the data quality for training other standard NER models, which are typically faster but less capable of handling partially-labeled data, enabling them to achieve performance comparable to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially-labeled data and improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, achieving a significant increase of 20.2% in the number of correctly identified attribute-value pairs over the existing deployed system while achieving a high precision of 89.5%.
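
The two-stage flow described in the abstract (first propose attributes, then tag value tokens for each attribute, then reuse the extractions to expand partial annotations) can be illustrated with a minimal Python sketch. This is not the authors' implementation: both stages are replaced by a toy keyword lookup, and the names `ATTRIBUTE_CUES`, `generate_attributes`, `tag_values`, `extract`, and `bootstrap` are hypothetical. The sketch does not reproduce the marker augmentation or the training on partially-labeled data; it only shows how the pieces connect.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for both learned models.
ATTRIBUTE_CUES: Dict[str, set] = {
    "Brand": {"apple", "samsung"},
    "Color": {"black", "red", "blue"},
    "Storage": {"64gb", "128gb", "256gb"},
}


def generate_attributes(tokens: List[str]) -> List[str]:
    """Stage 1 stand-in: propose attributes that may be present in the text."""
    lowered = {t.lower() for t in tokens}
    return [attr for attr, cues in ATTRIBUTE_CUES.items() if lowered & cues]


def tag_values(tokens: List[str], attribute: str) -> List[int]:
    """Stage 2 stand-in: one 0/1 label per token, marking the attribute's value."""
    cues = ATTRIBUTE_CUES[attribute]
    return [1 if t.lower() in cues else 0 for t in tokens]


def extract(text: str) -> Dict[str, str]:
    """Run stage 1, then stage 2 once per proposed attribute."""
    tokens = text.split()
    pairs: Dict[str, str] = {}
    for attr in generate_attributes(tokens):
        labels = tag_values(tokens, attr)
        value = " ".join(t for t, keep in zip(tokens, labels) if keep)
        if value:
            pairs[attr] = value
    return pairs


def bootstrap(titles: List[str], partial_labels: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Expand partial annotations with extractions (assumed merge rule:
    existing human labels win when both sources give the same attribute)."""
    expanded = []
    for title, labels in zip(titles, partial_labels):
        merged = dict(extract(title))
        merged.update(labels)
        expanded.append(merged)
    return expanded


if __name__ == "__main__":
    title = "Apple iPhone 13 128GB Black"
    print(extract(title))
    print(bootstrap([title], [{"Brand": "Apple"}]))
```

In this sketch the expanded annotations returned by `bootstrap` play the role of the regenerated training set that the abstract says is used to train faster standard NER models.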
