Antoine Doucet

CL
h-index: 10
19 papers
1,559 citations
Novelty: 24%
AI Score: 32

19 Papers

CL | Feb 11, 2023 | Code
DocILE Benchmark for Document Information Localization and Extraction

Štěpán Šimsa, Milan Šulc, Michal Uřičář et al.

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table Transformer, applied to both tasks of the DocILE benchmark; the results are shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.

CL | Jan 31, 2023
Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections

Nicolas Gutehrlé, Antoine Doucet, Adam Jatowt

Archive collections are nowadays mostly available through search engine interfaces, which allow a user to retrieve documents by issuing queries. The study of these collections may, however, be impaired by some aspects of search engines, such as the overwhelming number of documents returned or the lack of contextual knowledge provided. New methods that could work independently or in combination with search engines are therefore required to access these collections. In this position paper, we propose to extend TimeLine Summarization (TLS) methods to archive collections to assist in their study. We provide an overview of existing TLS methods and we describe a conceptual framework for an Archive TimeLine Summarization (ATLS) system, which aims to generate informative, readable and interpretable timelines.

CL | Jan 17, 2023
The Recent Advances in Automatic Term Extraction: A survey

Hanh Thi Hong Tran, Matej Martinc, Jaya Caporusso et al.

Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and improve several complex downstream tasks, e.g., information retrieval, machine translation, topic detection, and sentiment analysis. ATE systems, along with annotated datasets, have been studied and developed widely for decades, but recently we observed a surge in novel neural systems for the task at hand. Despite a large amount of new research on ATE, systematic survey studies covering novel neural approaches are lacking. We present a comprehensive survey of deep learning-based approaches to ATE, with a focus on Transformer-based neural models. The study also offers a comparison between these systems and previous ATE approaches, which were based on feature engineering and non-neural supervised learning algorithms.

DL | Mar 30, 2023
Yes but.. Can ChatGPT Identify Entities in Historical Documents?

Carlos-Emiliano González-Gallardo, Emanuela Boros, Nancy Girdhar et al.

Large language models (LLMs) have been leveraged for several years now, obtaining state-of-the-art performance in recognizing entities from modern documents. For the last few months, the conversational agent ChatGPT has "prompted" a lot of interest in the scientific community and the public due to its capacity for generating plausible-sounding answers. In this paper, we explore this ability by probing it on the named entity recognition and classification (NERC) task in primary sources (e.g., historical newspapers and classical commentaries) in a zero-shot manner and by comparing it with state-of-the-art LM-based systems. Our findings indicate several shortcomings in identifying entities in historical text that range from the consistency of entity annotation guidelines, entity complexity, and code-switching, to the specificity of prompting. Moreover, as expected, the inaccessibility of historical archives to the public (and thus on the Internet) also impacts its performance.

CL | Dec 12, 2022
Ensembling Transformers for Cross-domain Automatic Term Extraction

Hanh Thi Hong Tran, Matej Martinc, Andraz Pelicon et al.

Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by conducting the intersection or union on the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models, regarding all the languages except Dutch and French if the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best performing models, we achieve significant improvements.
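The union and intersection ensembling described in this abstract can be illustrated with plain set operations. The following is a minimal sketch, not the authors' implementation: the function names, the example terms and the gold list are all hypothetical, and the exact-match scoring is an assumption about how term sets might be compared.

```python
def ensemble_terms(term_sets, mode="union"):
    """Combine candidate-term sets predicted by several models.

    mode="union" keeps a term if any model proposed it;
    mode="intersection" keeps a term only if every model proposed it.
    """
    sets = [set(s) for s in term_sets]
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown mode: {mode!r}")


def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two term sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical outputs of a monolingual and a multilingual model.
mono_terms = {"wind turbine", "rotor blade", "energy"}
multi_terms = {"wind turbine", "rotor blade", "offshore"}
gold_terms = {"wind turbine", "rotor blade", "offshore"}

union_terms = ensemble_terms([mono_terms, multi_terms], mode="union")
shared_terms = ensemble_terms([mono_terms, multi_terms], mode="intersection")
union_scores = precision_recall_f1(union_terms, gold_terms)
shared_scores = precision_recall_f1(shared_terms, gold_terms)
```

In this toy case the union favours recall and the intersection favours precision, which is the usual trade-off between the two combination modes.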

CL | Sep 28, 2023
A Comprehensive Survey of Document-level Relation Extraction (2016-2023)

Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo et al.

Document-level relation extraction (DocRE) is an active area of research in natural language processing (NLP) concerned with identifying and extracting relationships between entities beyond sentence boundaries. Compared to the more traditional sentence-level relation extraction, DocRE provides a broader context for analysis and is more challenging because it involves identifying relationships that may span multiple sentences or paragraphs. This task has gained increased interest as a viable solution to build and populate knowledge bases automatically from unstructured large-scale documents (e.g., scientific papers, legal contracts, or news articles), in order to have a better understanding of relationships between entities. This paper aims to provide a comprehensive overview of recent advances in this field, highlighting its different applications in comparison to sentence-level relation extraction.

CL | Jul 4, 2022
Using contextual sentence analysis models to recognize ESG concepts

Elvys Linhares Pontes, Mohamed Benjannet, Jose G. Moreno et al.

This paper summarizes the joint participation of the Trading Central Labs and the L3i laboratory of the University of La Rochelle in both sub-tasks of the Shared Task FinSim-4 evaluation campaign. The first sub-task aims to enrich the 'Fortia ESG taxonomy' with new lexicon entries, while the second one aims to classify sentences as either 'sustainable' or 'unsustainable' with respect to ESG (Environment, Social and Governance) related factors. For the first sub-task, we proposed a model based on pre-trained Sentence-BERT models to project sentences and concepts in a common space in order to better represent ESG concepts. The official task results show that our system yields a significant performance improvement compared to the baseline and outperforms all other submissions on the first sub-task. For the second sub-task, we combine the RoBERTa model with a feed-forward multi-layer perceptron in order to extract the context of sentences and classify them. Our model achieved high accuracy scores (over 92%) and was ranked among the top 5 systems.

CL | Aug 6, 2024
L3iTC at the FinLLM Challenge Task: Quantization for Financial Text Classification & Summarization

Elvys Linhares Pontes, Carlos-Emiliano González-Gallardo, Mohamed Benjannet et al.

This article details our participation (L3iTC) in the FinLLM Challenge Task 2024, focusing on two key areas: Task 1, financial text classification, and Task 2, financial text summarization. To address these challenges, we fine-tuned several large language models (LLMs) to optimize performance for each task. Specifically, we used 4-bit quantization and LoRA to determine which layers of the LLMs should be trained at a lower precision. This approach not only accelerated the fine-tuning process on the training data provided by the organizers but also enabled us to run the models on low GPU memory. Our fine-tuned models achieved third place for the financial classification task with an F1-score of 0.7543 and secured sixth place in the financial summarization task on the official test datasets.

CL | Jan 20, 2023
Contextualizing Emerging Trends in Financial News Articles

Nhu Khoa Nguyen, Thierry Delahaut, Emanuela Boros et al.

Identifying and exploring emerging trends in the news is becoming more essential than ever, with many changes occurring worldwide due to the global health crises. However, most recent research has focused mainly on detecting trends in social media, thus benefiting from social features (e.g., likes and retweets on Twitter) that help the task because they can be used to measure the engagement and diffusion rate of content. Yet formal text data, unlike short social media posts, comes with a longer, less restricted writing format and is thus more challenging. In this paper, we focus our study on emerging trend detection in financial news articles about Microsoft, collected before and during the start of the COVID-19 pandemic (July 2019 to July 2020). We make the dataset accessible and propose a strong baseline (Contextual Leap2Trend) for exploring the dynamics of similarities between pairs of keywords based on topic modelling and term frequency. Finally, we evaluate against a gold standard (Google Trends) and present noteworthy real-world scenarios regarding the influence of the pandemic on Microsoft.

CL | Feb 24, 2025
Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data

Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah et al.

Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors (imperfect extraction of text, including character insertion, deletion, and substitution) can significantly impact downstream tasks like question answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of multilingual QA systems. To support this analysis, we introduce a multilingual QA dataset, MultiOCR-QA, comprising 50K question-answer pairs across three languages: English, French, and German. The dataset is curated from OCR-ed historical documents, which include different levels and types of OCR noise. We then evaluate how different state-of-the-art Large Language Models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.
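To make the three character-level error types named in this abstract concrete (insertion, deletion, substitution), here is a small synthetic-noise sketch. It is only an illustration: the dataset described above is built from real OCR-ed documents, and the function name `add_ocr_noise`, its parameters, and the lowercase-ASCII replacement alphabet are assumptions made for this example.

```python
import random
import string

VALID_KINDS = ("insert", "delete", "substitute")


def add_ocr_noise(text, rate, kinds=VALID_KINDS, seed=0):
    """Corrupt each character with probability `rate`.

    A corrupted character is either followed by a random extra letter
    (insert), dropped (delete), or replaced by a different letter
    (substitute). The seed makes the corruption reproducible.
    """
    for kind in kinds:
        if kind not in VALID_KINDS:
            raise ValueError(f"unknown error kind: {kind!r}")
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    noisy = []
    for char in text:
        if rng.random() >= rate:
            noisy.append(char)  # character survives unchanged
            continue
        kind = rng.choice(kinds)
        if kind == "insert":
            noisy.append(char)
            noisy.append(rng.choice(alphabet))
        elif kind == "substitute":
            noisy.append(rng.choice([c for c in alphabet if c != char]))
        # kind == "delete": append nothing
    return "".join(noisy)


clean = "historical newspapers"
unchanged = add_ocr_noise(clean, rate=0.0)
deleted = add_ocr_noise("abc", rate=1.0, kinds=("delete",))
inserted = add_ocr_noise("abc", rate=1.0, kinds=("insert",))
substituted = add_ocr_noise("abc", rate=1.0, kinds=("substitute",))
```

Feeding the same question to a QA system with the clean and the corrupted context is one simple way to probe how sensitive a model is to each error type in isolation.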

CL | Jul 4, 2025
Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis

Elvys Linhares Pontes, Carlos-Emiliano González-Gallardo, Georgeta Bordea et al.

Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark Buy&Hold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark Buy&Hold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.
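The comparison against Buy&Hold mentioned in this abstract can be sketched with a toy backtest. This is not the paper's setup: the long-only rule, the absence of transaction costs, the price series and the signals are all assumptions made for illustration.

```python
def buy_and_hold_return(prices):
    """Total return from buying at the first price and selling at the last."""
    return prices[-1] / prices[0] - 1.0


def sentiment_strategy_return(prices, signals):
    """Total return of a long-only rule driven by daily sentiment signals.

    signals[t] is 1 to hold the asset from day t to day t + 1 and 0 to stay
    in cash over that interval. Transaction costs are ignored.
    """
    if len(signals) != len(prices) - 1:
        raise ValueError("need exactly one signal per price interval")
    equity = 1.0
    for t, signal in enumerate(signals):
        daily_return = prices[t + 1] / prices[t] - 1.0
        equity *= 1.0 + signal * daily_return
    return equity - 1.0


# Hypothetical prices and signals: the strategy steps aside on the down day.
prices = [100.0, 110.0, 99.0, 108.9]
signals = [1, 0, 1]

benchmark = buy_and_hold_return(prices)
strategy = sentiment_strategy_return(prices, signals)
alpha = strategy - benchmark  # simple excess return over the benchmark
```

Here the excess return is just the difference between the two total returns; a risk-adjusted definition of alpha would need a market model, which this sketch leaves out.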

CL | Dec 11, 2024
DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization

Phan Phuong Mai Chau, Souhail Bakkali, Antoine Doucet

Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. Leveraging pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.

CL | Jun 13, 2024
CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature

Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo et al.

The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and an F1 score of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.

CL | Dec 15, 2021
Named entity recognition architecture combining contextual and global features

Tran Thi Hong Hanh, Antoine Doucet, Nicolas Sidere et al.

Named entity recognition (NER) is an information extraction technique that aims to locate and classify named entities (e.g., organizations and locations) within a document into predefined categories. Correctly identifying these phrases plays a significant role in simplifying information access. However, it remains a difficult task because named entities (NEs) have multiple forms and are context-dependent. While the context can be represented by contextual features, global relations are often misrepresented by those models. In this paper, we propose the combination of contextual features from XLNet and global features from a Graph Convolution Network (GCN) to enhance NER performance. Experiments on a widely used dataset, CoNLL 2003, show the benefits of our strategy, with results competitive with the state of the art (SOTA).

CL | Sep 23, 2021
Named Entity Recognition and Classification on Historical Documents: A Survey

Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes et al.

After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.

CL | Apr 14, 2021
Event Detection as Question Answering with Entity Information

Emanuela Boros, Jose G. Moreno, Antoine Doucet

In this paper, we propose a recent and under-researched paradigm for the task of event detection (ED) by casting it as a question-answering (QA) problem with the possibility of multiple answers and the support of entities. The extraction of event triggers is, thus, transformed into the task of identifying answer spans from a context, while also focusing on the surrounding entities. The architecture is based on a pre-trained and fine-tuned language model, where the input context is augmented with entities marked at different levels, their positions, their types, and, finally, the argument roles. Experiments on the ACE 2005 corpus demonstrate that the proposed paradigm is a viable solution for the ED task and it significantly outperforms the state-of-the-art models. Moreover, we prove that our methods are also able to extract unseen event types.

CL | Apr 13, 2021
Transformer-based Methods for Recognizing Ultra Fine-grained Entities (RUFES)

Emanuela Boros, Antoine Doucet

This paper summarizes the participation of the Laboratoire Informatique, Image et Interaction (L3i laboratory) of the University of La Rochelle in the Recognizing Ultra Fine-grained Entities (RUFES) track within the Text Analysis Conference (TAC) series of evaluation workshops. Our participation relies on two neural-based models, one based on a pre-trained and fine-tuned language model with a stack of Transformer layers for fine-grained entity extraction and one out-of-the-box model for within-document entity coreference. We observe that our approach has great potential in increasing the performance of fine-grained entity recognition. Thus, the future work envisioned is to enhance the ability of the models following additional experiments and a deeper analysis of the results.

CV | Mar 5, 2020
AI outperformed every dermatologist: Improved dermoscopic melanoma diagnosis through customizing batch logic and loss function in an optimized Deep CNN architecture

Cong Tri Pham, Mai Chi Luong, Dung Van Hoang et al.

Melanoma, one of the most dangerous types of skin cancer, results in a very high mortality rate. Early detection and resection are two key points for a successful cure. Recent research has used artificial intelligence to classify melanoma and nevus and to compare the assessment of these algorithms to that of dermatologists. However, an imbalance of sensitivity and specificity measures affected the performance of existing models. This study proposes a method using deep convolutional neural networks aiming to detect melanoma as a binary classification problem. It involves 3 key features, namely customized batch logic, a customized loss function and reformed fully connected layers. The training dataset is kept up to date and includes 17,302 images of melanoma and nevus; this is the largest dataset by far. The model performance is compared to that of 157 dermatologists from 12 university hospitals in Germany based on the MClass-D dataset. The model outperformed all 157 dermatologists and achieved state-of-the-art performance with an AUC of 94.4%, a sensitivity of 85.0% and a specificity of 95.0% using a prediction threshold of 0.5 on the MClass-D dataset of 100 dermoscopic images. Moreover, a threshold of 0.40858 showed the most balanced measure compared to other studies, and is promisingly applicable to medical diagnosis, with a sensitivity of 90.0% and a specificity of 93.8%.
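The effect of the prediction threshold on sensitivity and specificity, which this abstract reports for thresholds of 0.5 and 0.40858, can be shown with a short sketch. The scores and labels below are made up, and treating a score equal to the threshold as a positive prediction is an assumption; none of this reproduces the paper's model or data.

```python
def sensitivity_specificity(scores, labels, threshold=0.5):
    """Sensitivity and specificity of a binary classifier at a threshold.

    scores are predicted probabilities of the positive class (melanoma);
    labels are 1 for melanoma and 0 for nevus.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical model outputs for two melanomas and three nevi.
scores = [0.90, 0.45, 0.30, 0.60, 0.10]
labels = [1, 1, 0, 0, 0]

at_default = sensitivity_specificity(scores, labels, threshold=0.5)
at_lowered = sensitivity_specificity(scores, labels, threshold=0.4)
```

Lowering the threshold can only turn negative predictions into positive ones, so sensitivity never decreases and specificity never increases; in this toy data only the borderline melanoma at 0.45 changes class.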

IR | Nov 15, 2015
Applying Semantic Web Technologies for Improving the Visibility of Tourism Data

Fayrouz Soualah-Alila, Cyril Faucher, Frédéric Bertrand et al.

The tourism industry is an extremely information-intensive, complex and dynamic activity. It can benefit from semantic Web technologies, due to the significant heterogeneity of information sources and the high volume of on-line data. The management of semantically diverse annotated tourism data is facilitated by ontologies that provide methods and standards, which allow flexibility and more intelligent access to on-line data. This paper describes some of the early results of the Tourinflux project, which aims to apply semantic Web technologies to support tourism actors in effectively finding and publishing information on the Web.