CL Jul 16, 2025
StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba et al.
This submission to the binary AI detection task is based on a modular stylometric pipeline in which public spaCy models are used for text preprocessing (tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features (frequencies of n-grams of these linguistic annotations), and light gradient-boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts to train the classifier. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive, yet explainable approach previously found effective.
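The feature-extraction step described above can be illustrated with a minimal sketch. It assumes the linguistic annotations (here, part-of-speech tags) have already been produced by a tagger, and it only shows how relative n-gram frequencies might be turned into a feature matrix. The function names `ngram_frequencies` and `feature_matrix` and the toy tag sequences are hypothetical and are not taken from the submission.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_frequencies(tags: Sequence[str], n: int) -> Dict[NGram, float]:
    """Relative frequency of each n-gram in one annotation sequence."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def feature_matrix(
    docs: Sequence[Sequence[str]], n: int
) -> Tuple[List[NGram], List[List[float]]]:
    """Build a documents-by-n-grams matrix over a shared, sorted vocabulary."""
    per_doc = [ngram_frequencies(doc, n) for doc in docs]
    vocab = sorted({gram for freqs in per_doc for gram in freqs})
    rows = [[freqs.get(gram, 0.0) for gram in vocab] for freqs in per_doc]
    return vocab, rows


# Hypothetical, hand-written POS sequences standing in for tagger output.
docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN"],
]
vocab, rows = feature_matrix(docs, 2)
```

Each row of `rows` could then be passed to a tree-based classifier; the choice of n, of annotation layers, and of any frequency cut-off is left open here because the abstract does not specify them.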
CL Jul 1, 2025
Stylometry recognizes human and LLM-generated texts in short samples
Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska et al.
The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. We create a benchmark dataset based on Wikipedia, comprising (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) texts processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) texts processed through rephrasing methods (Dipper, T5). The 10-sentence-long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a Matthews correlation coefficient of up to .87 in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, and a greater grammatical standardisation of LLM-generated texts compared with human-written ones. These results show, crucially in the context of increasingly sophisticated LLMs, that it is possible to distinguish machine-generated from human-written texts, at least for a well-defined text type.
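For readers unfamiliar with the headline metric, the following is a minimal sketch of the multiclass Matthews correlation coefficient, written from the standard confusion-matrix generalisation of the binary formula. The function name `multiclass_mcc` and the toy labels are illustrative assumptions; the abstract does not say which implementation the authors used.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass Matthews correlation coefficient.

    Uses s (number of samples), c (number of correct predictions),
    t_k (how often class k truly occurs) and p_k (how often k is predicted).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    spread_true = s * s - sum(v * v for v in t_k.values())
    spread_pred = s * s - sum(v * v for v in p_k.values())
    denominator = sqrt(spread_true * spread_pred)
    # Convention: return 0.0 when the coefficient is undefined
    # (for example, when every prediction falls into a single class).
    return numerator / denominator if denominator else 0.0


# Toy labels, not data from the paper.
perfect = multiclass_mcc([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2])
partial = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike accuracy, the coefficient stays near zero for a classifier that ignores the input, which makes it a reasonable summary when the seven classes are not equally represented.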
SI Nov 3, 2015
Reinventing the Triangles: Rule of Thumb for Assessing Detectability
Jeremi K. Ochab
The statistical significance of network clustering has been an unresolved problem since it was observed that community detection algorithms produce false positives even in random graphs. After a phase transition between undetectable and detectable cluster structures was discovered, the connection between the spectra of adjacency matrices and detectability limits was shown, and both were calculated for a wide range of networks with arbitrary degree distributions and community structure. In practice, the full eigenspectrum is not known, and whether a given network has any communities within the detectable regime cannot be easily established. Based on the global clustering coefficient, we construct a criterion that tells whether an undirected, unweighted network has some detectable community structure, has none, or is in a transient regime. The method is simple and faster than methods involving bootstrapping.
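The quantity on which the criterion is built, the global clustering coefficient, can be computed as in the minimal sketch below. It covers only the coefficient itself (closed connected triples divided by all connected triples) for an undirected, unweighted edge list; the detectability thresholds are not given in the abstract and are deliberately not reproduced. The function name `global_clustering` and the example graph are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple


def global_clustering(edges: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Global clustering coefficient of an undirected, unweighted graph.

    Equals 3 * (number of triangles) / (number of connected triples).
    Self-loops are ignored and duplicate edges are collapsed.
    """
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triples = 0  # paths of length two, counted at their centre node
    closed = 0   # such paths whose end points are also linked
    for neighbours in adj.values():
        k = len(neighbours)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    # Each triangle is counted once per corner, which supplies the factor of 3.
    return closed / triples if triples else 0.0


# A triangle with one pendant node attached: 3 closed triples out of 5.
coefficient = global_clustering([(1, 2), (2, 3), (1, 3), (3, 4)])
```

Counting neighbour pairs takes time proportional to the sum of squared degrees, which is consistent with the abstract's point that such a criterion is cheaper than bootstrapping.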