Asier Gutiérrez-Fandiño

CL
11papers
270citations
Novelty22%
AI Score20

11 Papers

CLJun 30, 2022
esCorpius: A Massive Spanish Crawling Corpus

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé et al.

In the recent years, transformer-based models have lead to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or present a low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from near 1 Pb of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to complain with EU regulations. esCorpius has been released under CC BY-NC-ND 4.0 license and is available on HuggingFace.

CLSep 8, 2021Code
Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño et al.

This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model suitable for real-world clinical data. We evaluated our models on Named Entity Recognition (NER) tasks for biomedical documents and challenging hospital discharge reports. When compared against the competitive mBERT and BETO models, we outperform them in all NER tasks by a significant margin. Finally, we studied the impact of the model's vocabulary on the NER performances by offering an interesting vocabulary-centric analysis. The results confirm that domain-specific pretraining is fundamental to achieving higher performances in downstream NER tasks, even within a mid-resource scenario. To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine. Our best models are freely available in the HuggingFace hub: https://huggingface.co/BSC-TeMU.

CVDec 10, 2021
The Large Labelled Logo Dataset (L3D): A Multipurpose and Hand-Labelled Continuously Growing Dataset

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé

In this work, we present the Large Labelled Logo Dataset (L3D), a multipurpose, hand-labelled, continuously growing dataset. It is composed of around 770k of color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) open registry. Each of them is associated to multiple labels that classify the figurative and textual elements that appear in the images. These annotations have been classified by the EUIPO evaluators using the Vienna classification, a hierarchical classification of figurative marks. We suggest two direct applications of this dataset, namely, logo classification and logo generation.

CLOct 31, 2021
FinEAS: Financial Embedding Analysis of Sentiment

Asier Gutiérrez-Fandiño, Miquel Noguer i Alonso, Petter Kolm et al.

We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In financial markets, news and investor sentiment are significant drivers of security prices. Thus, leveraging the capabilities of modern NLP approaches for financial sentiment analysis is a crucial component in identifying patterns and trends that are useful for market participants and regulators. In recent years, methods that use transfer learning from large Transformer-based language models like BERT, have achieved state-of-the-art results in text classification tasks, including sentiment analysis using labelled datasets. Researchers have quickly adopted these approaches to financial texts, but best practices in this domain are not well-established. In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model. We demonstrate our approach achieves significant improvements in comparison to vanilla BERT, LSTM, and FinBERT, a financial domain specific BERT.

CLOct 23, 2021
Spanish Legalese Language Model and Corpora

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre et al.

There are many Language Models for the English language according to its worldwide relevance. However, for the Spanish language, even if it is a widely spoken language, there are very few Spanish Language Models which result to be small and too general. Legal slang could be think of a Spanish variant on its own as it is very complicated in vocabulary, semantics and phrase understanding. For this work we gathered legal-domain corpora from different sources, generated a model and evaluated against Spanish general domain tasks. The model provides reasonable results in those tasks.

CLSep 16, 2021
Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

Casimiro Pio Carrino, Jordi Armengol-Estapé, Ona de Gibert Bonet et al.

We introduce CoWeSe (the Corpus Web Salud Español), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawler on 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embbedings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license, both in Zenodo (\url{https://zenodo.org/record/4561971\#.YTI5SnVKiEA}).

CLJul 15, 2021
MarIA: Spanish Language Models

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies et al.

This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB of clean and deduplicated texts with 135 billion words extracted from the Spanish Web Archive crawled by the National Library of Spain between 2009 and 2019. We assessed the performance of the models with nine existing evaluation datasets and with a novel extractive Question Answering dataset created ex novo. Overall, MarIA models outperform the existing Spanish models across a variety of NLU tasks and training settings.

LGMay 31, 2021
Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé et al.

The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model. This is done instead of measuring intrinsic properties of the model to determine whether it is learning appropriately. In this work, we suggest studying the training of neural networks with Algebraic Topology, specifically Persistent Homology (PH). Using simplicial complex representations of neural networks, we study the PH diagram distance evolution on the neural network learning process with different architectures and several datasets. Results show that the PH diagram distance between consecutive neural network states correlates with the validation accuracy, implying that the generalization error of a neural network could be intrinsically estimated without any holdout set.

CLFeb 25, 2021
Spanish Biomedical and Clinical Language Embeddings

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Casimiro Pio Carrino et al.

We computed both Word and Sub-word Embeddings using FastText. For Sub-word embeddings we selected Byte Pair Encoding (BPE) algorithm to represent the sub-words. We evaluated the Biomedical Word Embeddings obtaining better results than previous versions showing the implication that with more data, we obtain better representations.

LGJan 19, 2021
Characterizing and Measuring the Similarity of Neural Networks with Persistent Homology

David Pérez-Fernández, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé et al.

Characterizing the structural properties of neural networks is crucial yet poorly understood, and there are no well-established similarity measures between networks. In this work, we observe that neural networks can be represented as abstract simplicial complex and analyzed using their topological 'fingerprints' via Persistent Homology (PH). We then describe a PH-based representation proposed for characterizing and measuring similarity of neural networks. We empirically show the effectiveness of this representation as a descriptor of different architectures in several datasets. This approach based on Topological Data Analysis is a step towards better understanding neural networks and serves as a useful similarity measure.

CRDec 21, 2020
A Vulnerability Study on Academic Collaboration Networks Based on Network Dynamics

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marta Villegas

Researchers that work for the same institution use their email as the main communication tool. Email can be one of the most fruitful attack vectors of research institutions as they also contain access to all accounts and thus to all private information. We propose an approach for analyzing in terms of security research institutions' communication networks. We first obtained institutions' communication networks as well as a method to analyze possible breaches of collected emails. We downloaded the network of 4 different research centers, three from Spain and one from Portugal. We then ran simulations of Susceptible-Exposed-Infected-Recovered (SEIR) complex network dynamics model for analyzing the vulnerability of the network. More than half of the nodes have more than one security breach, and our simulation results show that more than 90\% of the networks' nodes are vulnerable. This method can be employed for enhancing security of research centers and can make email accounts' use security-aware. It may additionally open new research lines in communication security. Finally, we manifest that, due to confidentiality reasons, the sources we utilized for obtaining communication networks should not be providing the information that we were able to gather.