CL · Sep 19, 2023

FRACAS: A FRench Annotated Corpus of Attribution relations in newS

Ange Richard, Laura Alonzo-Canul, François Portet

arXiv:2309.10604v1 · 17.084 citations · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This paper provides a resource for studying quotation extraction in French, addressing a gap for NLP researchers and sociologists. The contribution is incremental, as it extends existing work to a new language.

The authors address the lack of non-English data for quotation extraction by creating a manually annotated corpus of 1676 French newswire texts, and they report high inter-annotator agreement on this challenging task.

Quotation extraction is a widely useful task both from a sociological and from a Natural Language Processing perspective. However, very little data is available to study this task in languages other than English. In this paper, we present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution. We first describe the composition of our corpus and the choices that were made in selecting the data. We then detail the annotation guidelines and annotation process, as well as a few statistics about the final corpus and the obtained balance between quote types (direct, indirect and mixed, which are particularly challenging). We end by detailing our inter-annotator agreement between the 8 annotators who worked on manual labelling, which is substantially high for such a difficult linguistic phenomenon.
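
To make the notion of inter-annotator agreement concrete, here is a minimal sketch of Cohen's kappa computed between two annotators over token-level labels. The abstract does not say which agreement metric the authors used, so the choice of Cohen's kappa, the tag set (`O`, `SOURCE`, `CUE`, `QUOTE`), and the toy label sequences are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    This is an illustrative metric choice; the paper's actual agreement
    measure is not stated in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently according to their own frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations of one sentence (not from the corpus).
annotator_1 = ["O", "SOURCE", "CUE", "QUOTE", "QUOTE", "QUOTE", "O", "O"]
annotator_2 = ["O", "SOURCE", "CUE", "QUOTE", "QUOTE", "O", "O", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

With 8 annotators, a pairwise measure like this would typically be averaged over annotator pairs or replaced by a multi-rater statistic; which approach the authors took is described in the paper itself.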
