CL AIMar 28, 2023

Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha, Mariana Lourenço Sturzeneker, Felipe Ribas Serras, Guilherme Lamartine de Mello, Aline Silva Costa, Mayara Feliciano Palma, Renata Morais Mesquita, Raquel de Paula Guets, Mariana Marques da Silva, Marcelo Finger, Maria Clara Paixão de Sousa

arXiv:2303.16098v11.76 citationsh-index: 19

Originality Synthesis-oriented

AI Analysis

It provides a foundational resource for researchers in linguistics and NLP working with Brazilian Portuguese, though it is incremental in building upon existing web-as-corpus methodologies.

The paper introduces the Carolina Corpus, a large open corpus of Brazilian Portuguese with 653,322,577 tokens across 7 types, designed to support research in Linguistics and Computer Science by addressing the low-resource status of Portuguese.

This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus aims at being used both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, contributing towards removing Portuguese from the set of low-resource languages. Here we present the construction of the corpus methodology, comparing it with other existing methodologies, as well as the corpus current state: Carolina's first public version has $653,322,577$ tokens, distributed over $7$ broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute with their own.

View on arXiv PDF

Similar