Antoine J. -P. Tixier

CL
11papers
1,902citations
Novelty38%
AI Score27

11 Papers

LGJan 9, 2023
Safer Together: Machine Learning Models Trained on Shared Accident Datasets Predict Construction Injuries Better than Company-Specific Models

Antoine J. -P. Tixier, Matthew R. Hallowell

In this study, we capitalized on a collective dataset repository of 57k accidents from 9 companies belonging to 3 domains and tested whether models trained on multiple datasets (generic models) predicted safety outcomes better than the company-specific models. We experimented with full generic models (trained on all data), per-domain generic models (construction, electric T&D, oil & gas), and with ensembles of generic and specific models. Results are very positive, with generic models outperforming the company-specific models in most cases while also generating finer-grained, hence more useful, forecasts. Successful generic models remove the needs for training company-specific models, saving a lot of time and resources, and give small companies, whose accident datasets are too limited to train their own models, access to safety outcome predictions. It may still however be advantageous to train specific models to get an extra boost in performance through ensembling with the generic models. Overall, by learning lessons from a pool of datasets whose accumulated experience far exceeds that of any single company, and making these lessons easily accessible in the form of simple forecasts, generic models tackle the holy grail of safety cross-organizational learning and dissemination in the construction industry.

CLMar 23, 2020Code
Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings

Christos Xypolopoulos, Antoine J. -P. Tixier, Michalis Vazirgiannis

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment . The paper was accepted as a long paper at EACL 2021.

CLAug 17, 2019Code
Message Passing Attention Networks for Document Understanding

Giannis Nikolentzos, Antoine J. -P. Tixier, Michalis Vazirgiannis

Graph neural networks have recently emerged as a very effective framework for processing graph-structured data. These models have achieved state-of-the-art performance in many tasks. Most graph neural networks can be described in terms of message passing, vertex update, and readout functions. In this paper, we represent documents as word co-occurrence networks and propose an application of the message passing framework to NLP, the Message Passing Attention network for Document understanding (MPAD). We also propose several hierarchical variants of MPAD. Experiments conducted on 10 standard text classification datasets show that our architectures are competitive with the state-of-the-art. Ablation studies reveal further insights about the impact of the different components on performance. Code is publicly available at: https://github.com/giannisnik/mpad .

CLOct 16, 2021
FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metricsfor Automatic Text Generation

Moussa Kamal Eddine, Guokan Shang, Antoine J. -P. Tixier et al.

Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more reliable, but require significant computational resources. In this paper, we propose FrugalScore, an approach to learn a fixed, low cost version of any expensive NLG metric, while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude less parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times less parameters than the original metrics. We make our trained metrics publicly available, to benefit the entire NLP community and in particular researchers and practitioners with limited resources.

CLOct 23, 2020
BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

Moussa Kamal Eddine, Antoine J. -P. Tixier, Michalis Vazirgiannis

Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez' corpus, and show our resulting model, mBARThez, to significantly boost BARThez' generative performance. Code, data and models are publicly available.

LGAug 16, 2019
AI-based Prediction of Independent Construction Safety Outcomes from Universal Attributes

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier

This paper significantly improves on, and finishes to validate, an approach proposed in previous research in which safety outcomes were predicted from attributes with machine learning. Like in the original study, we use Natural Language Processing (NLP) to extract fundamental attributes from raw incident reports and machine learning models are trained to predict safety outcomes. The outcomes predicted here are injury severity, injury type, body part impacted, and incident type. However, unlike in the original study, safety outcomes were not extracted via NLP but were provided by independent human annotations, eliminating any potential source of artificial correlation between predictors and predictands. Results show that attributes are still highly predictive, confirming the validity of the original approach. Other improvements brought by the current study include the use of (1) a much larger dataset featuring more than 90,000 reports, (2) two new models, XGBoost and linear SVM (Support Vector Machines), (3) model stacking, (4) a more straightforward experimental setup with more appropriate performance metrics, and (5) an analysis of per-category attribute importance scores. Finally, the injury severity outcome is well predicted, which was not the case in the original study. This is a significant advancement.

CLJul 26, 2019
Automatically Learning Construction Injury Precursors from Text

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier

In light of the increasing availability of digitally recorded safety reports in the construction industry, it is important to develop methods to exploit these data to improve our understanding of safety incidents and ability to learn from them. In this study, we compare several approaches to automatically learn injury precursors from raw construction accident reports. More precisely, we experiment with two state-of-the-art deep learning architectures for Natural Language Processing (NLP), Convolutional Neural Networks (CNN) and Hierarchical Attention Networks (HAN), and with the established Term Frequency - Inverse Document Frequency representation (TF-IDF) + Support Vector Machine (SVM) approach. For each model, we provide a method to identify (after training) the textual patterns that are, on average, the most predictive of each safety outcome. We show that among those pieces of text, valid injury precursors can be found. The proposed methods can also be used by the user to visualize and understand the models' predictions.

SIJul 13, 2018
Perturb and Combine to Identify Influential Spreaders in Real-World Networks

Antoine J. -P. Tixier, Maria-Evgenia G. Rossi, Fragkiskos D. Malliaros et al.

Some of the most effective influential spreader detection algorithms are unstable to small perturbations of the network structure. Inspired by bagging in Machine Learning, we propose the first Perturb and Combine (P&C) procedure for networks. It (1) creates many perturbed versions of a given graph, (2) applies a node scoring function separately to each graph, and (3) combines the results. Experiments conducted on real-world networks of various sizes with the k-core, generalized k-core, and PageRank algorithms reveal that P&C brings substantial improvements. Moreover, this performance boost can be obtained at almost no extra cost through parallelization. Finally, a bias-variance analysis suggests that P&C works mainly by reducing bias, and that therefore, it should be capable of improving the performance of all vertex scoring functions, including stable ones.

CLOct 28, 2016
Word Embeddings for the Construction Domain

Antoine J. -P. Tixier, Michalis Vazirgiannis, Matthew R. Hallowell

We introduce word vectors for the construction domain. Our vectors were obtained by running word2vec on an 11M-word corpus that we created from scratch by leveraging freely-accessible online sources of construction-related text. We first explore the embedding space and show that our vectors capture meaningful construction-specific concepts. We then evaluate the performance of our vectors against that of ones trained on a 100B-word corpus (Google News) within the framework of an injury report classification task. Without any parameter tuning, our embeddings give competitive results, and outperform the Google News vectors in many cases. Using a keyword-based compression of the reports also leads to a significant speed-up with only a limited loss in performance. We release our corpus and the data set we created for the classification task as publicly available, in the hope that they will be used by future studies for benchmarking and building on our work.

APSep 26, 2016
Construction Safety Risk Modeling and Simulation

Antoine J. -P. Tixier, Matthew R. Hallowell, Balaji Rajagopalan

By building on a recently introduced genetic-inspired attribute-based conceptual framework for safety risk analysis, we propose a novel methodology to compute construction univariate and bivariate construction safety risk at a situational level. Our fully data-driven approach provides construction practitioners and academicians with an easy and automated way of extracting valuable empirical insights from databases of unstructured textual injury reports. By applying our methodology on an attribute and outcome dataset directly obtained from 814 injury reports, we show that the frequency-magnitude distribution of construction safety risk is very similar to that of natural phenomena such as precipitation or earthquakes. Motivated by this observation, and drawing on state-of-the-art techniques in hydroclimatology and insurance, we introduce univariate and bivariate nonparametric stochastic safety risk generators, based on Kernel Density Estimators and Copulas. These generators enable the user to produce large numbers of synthetic safety risk values faithfully to the original data, allowing safetyrelated decision-making under uncertainty to be grounded on extensive empirical evidence. Just like the accurate modeling and simulation of natural phenomena such as wind or streamflow is indispensable to successful structure dimensioning or water reservoir management, we posit that improving construction safety calls for the accurate modeling, simulation, and assessment of safety risk. The underlying assumption is that like natural phenomena, construction safety may benefit from being studied in an empirical and quantitative way rather than qualitatively which is the current industry standard. Finally, a side but interesting finding is that attributes related to high energy levels and to human error emerge as strong risk shapers on the dataset we used to illustrate our methodology.