SI AI CL IR LG Dec 4, 2024

YT-30M: A multi-lingual multi-category dataset of YouTube comments

arXiv:2412.03465v1 1.2 h-index: 7

Originality: Synthesis-oriented

AI Analysis

The paper provides a new dataset for analyzing multilingual, multi-category social media content, but the contribution is incremental because it builds on existing data collection efforts.

The paper introduces YT-30M, a large-scale multilingual dataset of YouTube comments with categories, containing over 32 million comments, and releases it publicly for research.

This paper introduces two large-scale multilingual comment datasets from YouTube: YT-30M and YT-100K. The analysis in the paper is performed on YT-100K, a randomly selected sample of roughly 100K comments from YT-30M. Both datasets are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted on YouTube channels that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
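
The per-comment fields listed above can be pictured as a simple record type. The following is only a minimal sketch in Python: the class name, field names, and helper functions are hypothetical and are not taken from the released files, and the random sampling shown is a generic uniform sample, not necessarily the procedure the authors used to build YT-100K.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    # Field names are illustrative assumptions; the released files may use
    # different column names or ordering.
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    text: str
    upvotes: int
    original_channel_id: str  # assumed: the channel the comment was posted on
    category: str             # e.g. "News & Politics"


def sample_comments(comments, k, seed=0):
    """Draw a uniform random sample of k comments without replacement."""
    rng = random.Random(seed)
    return rng.sample(comments, k)


def count_by_category(comments):
    """Count how many comments fall under each channel category."""
    return Counter(c.category for c in comments)
```

With the comments loaded into a list, `sample_comments(all_comments, 100_000)` would yield a subsample comparable in spirit to YT-100K, and `count_by_category` gives a quick view of how categories are distributed in it.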
