CL · AI · LG · Jul 16, 2025

StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features

arXiv:2507.12064v1 · 4 citations · h-index: 14 · CLEF
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of detecting AI-generated text for security and authenticity applications, but it is incremental as it builds on existing non-neural, explainable methods.

The paper tackles AI-generated text detection with a stylometric pipeline that combines frequency-based features with gradient-boosted trees, trained on a corpus of more than 500,000 machine-generated texts.

This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequencies of n-grams of the above linguistic annotations); light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.
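The feature step described above (frequencies of n-grams over linguistic annotations) can be illustrated with a minimal sketch. It is not the authors' code: the function names `ngrams` and `ngram_frequencies`, the `pos:` feature prefix, and the per-order normalisation are assumptions made for illustration. The part-of-speech tags are written by hand so the example has no dependencies; in the described pipeline they would come from a spaCy model (for example each token's `pos_` attribute).

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def ngram_frequencies(tags: Sequence[str], max_n: int = 2) -> Dict[str, float]:
    """Relative frequency of every 1..max_n-gram in one document.

    Assumption: counts are normalised by the number of n-grams of the
    same order, so each order's frequencies sum to 1.
    """
    features: Dict[str, float] = {}
    for n in range(1, max_n + 1):
        grams = ngrams(tags, n)
        if not grams:
            continue  # document shorter than n tokens
        for gram, count in Counter(grams).items():
            features["pos:" + "_".join(gram)] = count / len(grams)
    return features


# Hand-written POS tags for "The cat sat on the mat ." (illustrative only).
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
features = ngram_frequencies(tags, max_n=2)
```

Repeating this over several annotation layers (dependency labels, morphology, entity types) and larger n is one way the feature count could reach the thousands mentioned in the abstract. The resulting per-document dictionaries would then be vectorised and passed to a gradient-boosted tree classifier such as LightGBM's `LGBMClassifier`.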

