CL

Apr 2, 2020

4chan & 8chan embeddings

arXiv:2005.06946v1

0.55 citations

Originality: Synthesis-oriented

AI Analysis

The release provides a resource for researchers and developers working on toxic discourse analysis and hate speech detection. The contribution is incremental: it applies existing embedding methods to new data.

The researchers collected over 30 million messages from the /pol/ boards on 4chan and 8chan to model toxic language. They released the trained word embeddings (0.4GB) for free, for further study or to enhance hate speech detection systems.

We have collected over 30M messages from the publicly available /pol/ message boards on 4chan and 8chan, and compiled them into a model of toxic language use. The trained word embeddings (0.4GB) are released for free and may be useful for further study on toxic discourse or to boost hate speech detection systems: https://textgain.com/8chan.
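As a rough illustration of how word embeddings like these might be queried, the sketch below looks up the nearest neighbours of a word by cosine similarity. It is not the authors' code. The file format of the released embeddings is not stated here, so the sketch assumes a whitespace-separated text layout (one word followed by its vector per line). The names `load_vectors`, `cosine` and `nearest` are hypothetical, and the toy vectors are invented for the example.

```python
import io
import math


def load_vectors(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict.

    Assumption: a plain-text, whitespace-separated layout. The actual
    format of the released embeddings may differ.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vectors, word, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [(cosine(query, v), w) for w, v in vectors.items() if w != word]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


# Invented toy data, for illustration only (not taken from the release).
TOY = """\
insult 0.9 0.1 0.0
slur 0.8 0.2 0.1
weather 0.0 0.1 0.9
forecast 0.1 0.0 0.8
"""
toy_vectors = load_vectors(io.StringIO(TOY))
print(nearest(toy_vectors, "insult", k=1))
```

In a hate speech detection setting, neighbours of known toxic terms could serve as candidate lexicon entries for a human to review, or the vectors could be fed as input features to a classifier.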
