Florian Cafiero

h-index: 7

6 papers

279 citations

Novelty: 20%

AI Score: 30

Ranked #135,566 of 194,257 authors (top 70%) · #24,400 in CL (top 79%)

6 Papers

2.1 · CL · Mar 3, 2023

Who could be behind QAnon? Authorship attribution with supervised machine-learning

Florian Cafiero, Jean-Baptiste Camps

A series of social media posts signed under the pseudonym "Q" started a movement known as QAnon, which led some of its most radical supporters to violent and illegal actions. To identify the person(s) behind Q, we evaluate the coincidence between the linguistic properties of the texts written by Q and those written by a list of suspects provided by journalistic investigation. Identifying the authors of these posts raises serious challenges. The "Q drops" are very short texts, written in a way that constitutes a sort of literary genre in itself, with very peculiar features of style. These texts might have been written by different authors, whose other writings are often hard to find. After an online ethnology of the movement, necessary to collect enough material written by these thirteen potential authors, we use supervised machine learning to build stylistic profiles for each of them. We then perform a rolling analysis on Q's writings to see whether any of those linguistic profiles match the "Q drops" in part or in their entirety. We conclude that two different individuals, Paul F. and Ron W., are the closest match to Q's linguistic signature, and that they could have successively written Q's texts. These potential authors are not high-ranking personalities from the U.S. administration, but rather social media activists.
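The rolling analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the features or the classifier, so everything here is an assumption: character trigram counts serve as the stylistic profile, and a nearest-profile rule based on cosine similarity stands in for the supervised model. The names `char_ngrams`, `cosine`, and `rolling_attribution` are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed feature set)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rolling_attribution(text, profiles, window=200, step=100):
    """Slide a window over `text` and label each window with the
    candidate whose profile is most similar (nearest-profile rule,
    a stand-in for the paper's unspecified supervised classifier)."""
    results = []
    last_start = max(len(text) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = char_ngrams(text[start:start + window])
        best = max(profiles, key=lambda name: cosine(chunk, profiles[name]))
        results.append((start, best))
    return results
```

Given a dictionary of candidate profiles built with `char_ngrams` from each candidate's known writings, `rolling_attribution(disputed_text, profiles)` returns one `(offset, candidate)` pair per window, which makes a change of closest candidate along the text visible.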

0.6 · CL · Feb 17

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero

Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
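The difference between the few-shot and zero-shot settings can be made concrete with a small prompt builder. This is only a sketch: the prompt wording, the tab-separated output format, and the function name `build_few_shot_prompt` are assumptions, not the prompts used in the paper, and the Ancient Greek annotations are illustrative. No model is called.

```python
def build_few_shot_prompt(language, examples, target_tokens):
    """Assemble a plain-text prompt for lemmatization and POS tagging.

    `examples` is a list of annotated sentences, each a list of
    (token, lemma, pos) triples. An empty list gives a zero-shot prompt.
    The layout is a hypothetical choice, not the paper's format.
    """
    lines = [
        f"Annotate each {language} token with its lemma and part of speech.",
        "Answer with one line per token: token<TAB>lemma<TAB>POS.",
        "",
    ]
    for sentence in examples:
        lines.append("Sentence: " + " ".join(tok for tok, _, _ in sentence))
        lines.extend(f"{tok}\t{lemma}\t{pos}" for tok, lemma, pos in sentence)
        lines.append("")
    lines.append("Sentence: " + " ".join(target_tokens))
    return "\n".join(lines)


# Illustrative annotated example (tags chosen for demonstration only).
examples = [[
    ("ἐν", "ἐν", "ADP"),
    ("ἀρχῇ", "ἀρχή", "NOUN"),
    ("ἦν", "εἰμί", "VERB"),
    ("ὁ", "ὁ", "DET"),
    ("λόγος", "λόγος", "NOUN"),
]]
target = ["καὶ", "θεὸς", "ἦν", "ὁ", "λόγος"]

few_shot = build_few_shot_prompt("Ancient Greek", examples, target)
zero_shot = build_few_shot_prompt("Ancient Greek", [], target)
```

The resulting string would then be sent to whichever model is being evaluated, and the tab-separated reply parsed back into triples for scoring.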

5.8 · DL · Jun 4

Hybrid Metadata Extraction from League of Nations Index Cards: From Feasibility Study to Archival System Integration

Florian Cafiero, Grégoire Mallard

This project report presents a hybrid AI-assisted workflow for extracting and reintegrating archival metadata from League of Nations index cards. The project is situated in the broader context of the Total Digital Access to the League of Nations Archives project (LONTAD). Rather than attempting full OCR of the underlying archival collections, the workflow targets the index cards themselves as documentary access points to files, series, archival descriptions, and digital objects. The project evolved from a layout-aware pipeline combining YOLO, TrOCR, and local LLM post-correction to a hybrid architecture using a fine-tuned vision-language model for broad extraction while retaining specialized OCR for file and series identifiers.

2.0 · CV · Nov 15, 2024

Diachronic Document Dataset for Semantic Layout Analysis

Thibault Clérice, Juliette Janes, Hugo Scheithauer et al.

We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.

0.2 · CL · Apr 19, 2021

No comments: Addressing commentary sections in websites' analyses

Florian Cafiero, Paul Guille-Escuret, Jeremy Ward

Removing or extracting the commentary sections from a series of websites is a tedious task, as no standard way to code them is widely adopted. This operation is thus very rarely performed. In this paper, we show that these commentary sections can induce significant biases in the analyses, especially in the case of controversial

Highlights
• Commentary sections can induce biases in the analysis of websites' contents.
• Analyzing these sections can be interesting per se.
• We illustrate these points using a corpus of anti-vaccine websites.
• We provide guidelines to remove or extract these sections.

1.7 · CL · Jan 2, 2020

Why Molière most likely did write his plays

Florian Cafiero, Jean-Baptiste Camps

As with Shakespeare, a hard-fought debate has emerged about Molière, a supposedly uneducated actor who, according to some, could not have written the masterpieces attributed to him. In the past decades, the century-old thesis according to which Pierre Corneille would be their actual author has become popular, mostly because of new works in computational linguistics. These results are reassessed here through state-of-the-art attribution methods. We study a corpus of comedies in verse by major authors of Molière and Corneille's time. Analyses of lexicon, rhymes, word forms, affixes, morphosyntactic sequences, and function words give no indication that another author among the major playwrights of the time wrote the plays signed under the name Molière.
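One of the feature families listed above, function words, can be sketched as a relative-frequency profile. This is an illustration under stated assumptions, not the paper's pipeline: the word list `FUNCTION_WORDS` is a deliberately tiny hypothetical sample, and the Manhattan distance is just one simple way to compare two profiles.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of French function words.
FUNCTION_WORDS = ("le", "la", "de", "et", "que", "ne", "pas", "à")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word among all word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


def manhattan(p, q):
    """Sum of absolute differences between two profiles with the same keys."""
    return sum(abs(p[word] - q[word]) for word in p)
```

A smaller `manhattan` distance between the profile of a disputed play and that of a candidate's known plays suggests a closer stylistic match on this feature family alone; a real study would use a much larger word list and normalize frequencies across the corpus.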