Detection of Criminal Texts for the Polish State Border Guard
This work addresses a domain-specific problem for the Polish State Border Guard by creating a new benchmark for criminal text detection, though the contribution appears incremental because it applies existing methods to new data.
The researchers tackled the problem of detecting Polish criminal texts on the Internet by developing a classification model, achieving the best performance through fine-tuning a pre-trained Polish transformer language model on a large annotated dataset.
This paper describes research on detecting Polish criminal texts that appear on the Internet. We carried out experiments to find the best available setup for efficiently classifying imbalanced and noisy data. The best performance was achieved by fine-tuning a pre-trained Polish transformer language model. For the detection task, we collected a large corpus of annotated Internet snippets as training data. We share this dataset and define a new criminal text detection task, using the Gonito platform as the benchmark.
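
The abstract does not state how class imbalance was handled during training. As an illustration only, one common option is to weight the loss by inverse class frequency, so that the rare class contributes as much as the frequent one. The sketch below is a minimal pure-Python version under that assumption; the function names and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs[i] maps each class to the predicted probability for example i.
    The sum is normalised by the total weight of the examples.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical toy labels: 1 = criminal text, 0 = other text.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(labels)

# An uninformative classifier that always predicts 50/50.
uniform_probs = [{0: 0.5, 1: 0.5} for _ in labels]
loss = weighted_cross_entropy(uniform_probs, labels, weights)
```

With these toy labels the minority class receives a weight four times that of the majority class, so errors on rare criminal texts are penalised more heavily. In a fine-tuning setup the same weights would typically be passed to the framework's loss function rather than computed by hand.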