IR · Oct 14, 2021

Tagged Documents Co-Clustering

arXiv:2110.11079v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of improving tag-based information retrieval and recommender systems for users, but it appears incremental as it builds on existing co-clustering methods with specific preprocessing steps.

The paper tackled the problem of clustering tags into conceptual groups by preprocessing data to mitigate power-law effects and proposing a hierarchical agglomerative co-clustering algorithm, evaluating it on synthetic and real-world datasets with an unsupervised stopping criterion.

Tags are short sequences of words used to describe textual and non-textual resources such as music, images, or books. Machine information retrieval systems can use tags to access a document quickly, and recommender systems can use them to suggest similar items to a user. However, the number of tags per document is limited, and tags are often distributed according to a Zipf law. In this paper, we propose a methodology to cluster tags into conceptual groups. Data are preprocessed to remove power-law effects and enhance the context of low-frequency words. Then, a hierarchical agglomerative co-clustering algorithm is proposed to group the most related tags into clusters. Its capabilities were evaluated on a sparse synthetic dataset and a real-world tag collection associated with scientific papers. Because the task is unsupervised, we propose a stopping criterion for selecting an optimal partitioning.
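To make the idea of grouping related tags concrete, here is a minimal sketch in Python. It is not the paper's co-clustering algorithm: it is a one-sided simplification that clusters tags only, using plain centroid-based agglomerative merging. The function names, the cosine similarity measure, the unit-length normalisation (standing in for the power-law preprocessing), and the `min_similarity` stopping threshold are all assumptions introduced for illustration.

```python
import math
from itertools import combinations


def tag_vectors(docs):
    """Map each tag to a unit-length vector over document indices.

    Normalising to unit length is an assumed stand-in for the paper's
    preprocessing: it stops very frequent tags from dominating similarity.
    """
    vectors = {}
    for doc_id, tags in enumerate(docs):
        for tag in set(tags):
            vectors.setdefault(tag, {})[doc_id] = 1.0
    for vec in vectors.values():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        for doc_id in vec:
            vec[doc_id] /= norm
    return vectors


def cosine(u, v):
    """Dot product of two sparse unit vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(k, 0.0) for k, w in u.items())


def centroid(cluster, vectors):
    """Unit-length sum of the tag vectors in a cluster."""
    total = {}
    for tag in cluster:
        for k, w in vectors[tag].items():
            total[k] = total.get(k, 0.0) + w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {k: w / norm for k, w in total.items()}


def agglomerate(docs, min_similarity=0.3):
    """Merge the two most similar tag clusters until none is similar enough.

    min_similarity is a hypothetical stopping threshold, not the
    unsupervised criterion proposed in the paper.
    """
    vectors = tag_vectors(docs)
    clusters = [frozenset([t]) for t in sorted(vectors)]
    while len(clusters) > 1:
        cents = {c: centroid(c, vectors) for c in clusters}
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cosine(cents[pair[0]], cents[pair[1]]),
        )
        if cosine(cents[a], cents[b]) < min_similarity:
            break  # remaining clusters are too dissimilar to merge
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)
```

Calling `agglomerate` on a list of per-document tag lists returns the tag groups as sorted lists. A full co-clustering approach would also group the documents and let each side's clusters inform the other, which this sketch omits.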
