CLApr 26, 2024
Evaluation of Geographical Distortions in Language ModelsRémy Decoupes, Roberto Interdonato, Mathieu Roche et al.
Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
IRJun 2, 2024
A lexicon obtained and validated by a data-driven approach for organic residues valorization in emerging and developing countriesChristiane Rakotomalala, Jean-Marie Paillat, Frédéric Feder et al.
The text mining method presented in this paper was used for annotation of terms related to biological transformation and valorization of organic residues in agriculture in low and middle-income country. Specialized lexicon was obtained through different steps: corpus and extraction of terms, annotation of extracted terms, selection of relevant terms.
IRJan 15, 2021
Annotation of epidemiological information in animal disease-related news articles: guidelinesSarah Valentin, Elena Arsevska, Aline Vilain et al.
This paper describes a method for annotation of epidemiological information in animal disease-related news articles. The annotation guidelines are generic and aim to embrace all animal or zoonotic infectious diseases, regardless of the pathogen involved or its way of transmission (e.g. vector-borne, airborne, by contact). The framework relies on the successive annotation of all the sentences from a news article. The annotator evaluates the sentences in a specific epidemiological context, corresponding to the publication of the news article.
IRApr 16, 2019
How to define co-occurrence in different domains of study?Mathieu Roche
This position paper presents a comparative study of co-occurrences. Some similarities and differences in the definition exist depending on the research domain (e.g. linguistics, NLP, computer science). This paper discusses these points, and deals with the methodological aspects in order to identify co-occurrences in a multidisciplinary paradigm.
IRAug 13, 2018
Methodology for identifying study sites in scientific corpusEric Kergosien, Marie-Noëlle Bessagnet, Maguelonne Teisseire et al.
The TERRE-ISTEX project aims at identifying the evolution of research working relation to study areas, disciplinary crossings and concrete research methods based on the heterogeneous digital content available in scientific corpora. The project is divided into three main actions: (1) to identify the periods and places which have been the subject of empirical studies, and which reflect the publications resulting from the corpus analyzed, (2) to identify the thematics addressed in these works and (3) to develop a web-based geographical information retrieval tool (GIR). The first two actions involve approaches combining Natural languages processing patterns with text mining methods. By crossing the three dimensions (spatial, thematic and temporal) in a GIR engine, it will be possible to understand what research has been carried out on which territories and at what time. In the project, the experiments are carried out on a heterogeneous corpus including electronic thesis and scientific articles from the ISTEX digital libraries and the CIRAD research center.
IRJun 8, 2018
Automatic Identification of Research Fields in Scientific PapersEric Kergosien, Amin Farvardin, Maguelonne Teisseire et al.
The TERRE-ISTEX project aims to identify scientific research dealing with specific geographical territories areas based on heterogeneous digital content available in scientific papers. The project is divided into three main work packages: (1) identification of the periods and places of empirical studies, and which reflect the publications resulting from the analyzed text samples, (2) identification of the themes which appear in these documents, and (3) development of a web-based geographical information retrieval tool (GIR). The first two actions combine Natural Language Processing patterns with text mining methods. The integration of the spatial, thematic and temporal dimensions in a GIR contributes to a better understanding of what kind of research has been carried out, of its topics and its geographical and historical coverage. Another originality of the TERRE-ISTEX project is the heterogeneous character of the corpus, including PhD theses and scientific articles from the ISTEX digital libraries and the CIRAD research center.