Gautier Marti

LG
h-index: 24
Papers: 15
Citations: 284
Novelty: 47%
AI Score: 44

15 Papers

CL · Sep 17, 2024
Enriching Datasets with Demographics through Large Language Models: What's in a Name?

Khaled AlNuaimi, Gautier Marti, Mathieu Ravaut et al.

Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.

LG · Nov 28, 2025
Measuring What LLMs Think They Do: SHAP Faithfulness and Deployability on Financial Tabular Classification

Saeed AlMarri, Mathieu Ravaut, Kristof Juhasz et al.

Large Language Models (LLMs) have attracted significant attention for classification tasks, offering a flexible alternative to trusted classical machine learning models like LightGBM through zero-shot prompting. However, their reliability for structured tabular data remains unclear, particularly in high-stakes applications like financial risk assessment. Our study systematically evaluates LLMs and generates their SHAP values on financial classification tasks. Our analysis shows a divergence between LLMs' self-explanations of feature impact and their SHAP values, as well as notable differences between LLM and LightGBM SHAP values. These findings highlight the limitations of LLMs as standalone classifiers for structured financial modeling, but also instill optimism that improved explainability mechanisms coupled with few-shot prompting will make LLMs usable in risk-sensitive domains.

CL · Oct 29, 2025
Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?

Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut et al.

Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
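One common way to quantify how far two feature-importance rankings diverge is the Spearman rank correlation. The sketch below is only an illustration of that idea, not the paper's evaluation code: the function names are hypothetical, the importance values are toy placeholders rather than results from the study, and the closed-form formula used assumes there are no ties.

```python
def ranks(values):
    """Rank from 1 (largest value) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equally long, tie-free lists."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy mean |SHAP| values for five placeholder features (not from the paper).
importance_model_a = [0.40, 0.25, 0.15, 0.12, 0.08]
importance_model_b = [0.10, 0.30, 0.05, 0.35, 0.20]
rho = spearman(importance_model_a, importance_model_b)
```

A value of `rho` near 1 would indicate that the two models order the features similarly, while values near 0 or below indicate diverging rankings.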

SI · Sep 7, 2025
Network Contagion in Financial Labor Markets: Predicting Turnover in Hong Kong

Abdulla AlKetbi, Patrick Yam, Gautier Marti et al.

Employee turnover is a critical challenge in financial markets, yet little is known about the role of professional networks in shaping career moves. Using the Hong Kong Securities and Futures Commission (SFC) public register (2007-2024), we construct temporal networks of 121,883 professionals and 4,979 firms to analyze and predict employee departures. We introduce a graph-based feature propagation framework that captures peer influence and organizational stability. Our analysis shows a contagion effect: professionals are 23% more likely to leave when over 30% of their peers depart within six months. Embedding these network signals into machine learning models improves turnover prediction by 30% over baselines. These results highlight the predictive power of temporal network effects in workforce dynamics, and demonstrate how network-based analytics can inform regulatory monitoring, talent management, and systemic risk assessment.
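The peer-departure signal described above can be illustrated with a small feature function. This is a minimal sketch under stated assumptions: the abstract does not define "peer" or the exact window, so the peer list, the 183-day window, and the function name `peer_departure_share` are all hypothetical, and the dates are made-up examples.

```python
from datetime import date, timedelta


def peer_departure_share(peers, departures, as_of, window_days=183):
    """Share of `peers` whose departure falls in the window ending at `as_of`.

    peers:      iterable of peer identifiers (the peer definition is assumed).
    departures: dict mapping identifier -> departure date; people who have
                not left are simply absent from the dict.
    """
    peers = list(peers)
    if not peers:
        return 0.0
    start = as_of - timedelta(days=window_days)
    left = sum(
        1 for p in peers
        if p in departures and start < departures[p] <= as_of
    )
    return left / len(peers)


# Made-up example: two of four peers left in the trailing six months.
departures = {"b": date(2024, 3, 1), "c": date(2024, 5, 15), "d": date(2022, 1, 10)}
share = peer_departure_share(["b", "c", "d", "e"], departures, as_of=date(2024, 6, 30))
high_exposure = share > 0.30  # the 30% threshold quoted in the abstract
```

A feature like `share` could then be passed to a downstream classifier alongside non-network features.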

LG · Feb 26, 2025
Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis

Hamdan Al Ahbabi, Gautier Marti, Saeed AlMarri et al.

Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
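The residual idea can be sketched with ordinary least squares. The code below assumes a linear regression with an intercept, which the abstract does not specify, and it uses synthetic arrays in place of real speech and text embeddings; `residual_embeddings` is a hypothetical name.

```python
import numpy as np


def residual_embeddings(speech: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Return the part of `speech` not linearly explained by `text`.

    speech: (n_samples, d_speech) speech embeddings.
    text:   (n_samples, d_text) text embeddings of the same utterances.
    """
    # Add an intercept column so the regression also absorbs the mean.
    design = np.hstack([text, np.ones((text.shape[0], 1))])
    # Ordinary least squares, solved jointly for every speech dimension.
    coef, *_ = np.linalg.lstsq(design, speech, rcond=None)
    return speech - design @ coef


# Synthetic stand-ins: a "linguistic" part plus an independent "tone" part.
rng = np.random.default_rng(0)
text = rng.normal(size=(200, 4))
tone = rng.normal(size=(200, 6))
speech = text @ rng.normal(size=(4, 6)) + tone
residual = residual_embeddings(speech, text)
```

By construction, the residuals are uncorrelated with every text-embedding dimension on the fitting data, so a simple classifier trained on them cannot rely on linearly encoded textual content.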

ST · Jul 22, 2021
cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the Elliptope

Gautier Marti, Victor Goubet, Frank Nielsen

We propose a methodology to approximate conditional distributions in the elliptope of correlation matrices based on conditional generative adversarial networks. We illustrate the methodology with an application from quantitative finance: Monte Carlo simulations of correlated returns to compare risk-based portfolio construction methods. Finally, we discuss current limitations and advocate for further exploration of the elliptope geometry to improve results.
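For readers unfamiliar with the term, the elliptope is the set of symmetric positive semidefinite matrices with unit diagonal, i.e. the set of valid correlation matrices. The following is a minimal membership check, not part of the paper's method; the function name `in_elliptope` and the tolerance are assumptions made for illustration.

```python
import numpy as np


def in_elliptope(matrix, tol=1e-10):
    """True if `matrix` is symmetric, has unit diagonal, and is PSD."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    symmetric = np.allclose(m, m.T, atol=tol)
    unit_diagonal = np.allclose(np.diag(m), 1.0, atol=tol)
    if not (symmetric and unit_diagonal):
        return False
    # eigvalsh assumes symmetry, which was verified above.
    return bool(np.linalg.eigvalsh(m).min() >= -tol)


valid = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
# Pairwise correlations that cannot hold simultaneously: not PSD.
invalid = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
```

A check of this kind is one way to confirm that matrices sampled from a generative model are usable as correlation matrices before feeding them to a simulation.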

ST · Oct 21, 2019
CorrGAN: Sampling Realistic Financial Correlation Matrices Using Generative Adversarial Networks

Gautier Marti

We propose a novel approach for sampling realistic financial correlation matrices. This approach is based on generative adversarial networks. Experiments demonstrate that generative adversarial networks are able to recover most of the known stylized facts about empirical correlation matrices estimated on asset returns. This is the first time such results are documented in the literature. Practical financial applications range from trading strategies enhancement to risk and portfolio stress testing. Such generative models can also help ground empirical finance deeper into science by allowing for falsifiability of statements and more objective comparison of empirical methods.

IR · Jul 28, 2017
Putting Self-Supervised Token Embedding on the Tables

Marc Szafraniec, Gautier Marti, Philippe Donnat

Information distribution by electronic messages is a privileged means of transmission for many businesses and individuals, often in the form of plain-text tables. As their number grows, it becomes necessary to use an algorithm, rather than a human, to extract text and numbers. Usual methods rely on regular expressions or on a strict structure in the data, but are not efficient when there are many variations, fuzzy structure or implicit labels. In this paper we introduce SC2T, a totally self-supervised model for constructing vector representations of tokens in semi-structured messages by using character and context levels that address these issues. It can then be used for unsupervised labeling of tokens, or be the basis for a semi-supervised information extraction system.

LG · Mar 12, 2017
Autoregressive Convolutional Neural Networks for Asynchronous Time Series

Mikołaj Bińkowski, Gautier Marti, Philippe Donnat

We propose the Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of adjusted regressors, while the weights are data-dependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and the UCI household electricity consumption dataset. The proposed architecture achieves promising results compared to convolutional and recurrent neural networks.
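The AR-like weighting described above can be illustrated in a few lines. This is a simplified sketch, not the published architecture: in the paper the weights and adjustments are produced by convolutional networks, whereas here the `scores` and `offsets` are passed in directly, and the softmax normalization of the weights is an assumption made for the example.

```python
import math


def weighted_ar_prediction(past_values, offsets, scores):
    """Weighted sum of adjusted regressors with data-dependent weights.

    past_values: lagged observations used as regressors.
    offsets:     per-lag adjustments added to the regressors (assumed given).
    scores:      per-lag unnormalized significance scores (assumed given).
    """
    adjusted = [x + o for x, o in zip(past_values, offsets)]
    # Numerically stable softmax turns scores into weights that sum to 1.
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    return sum(w * a for w, a in zip(weights, adjusted))


# With equal scores the prediction reduces to the mean of the regressors.
flat = weighted_ar_prediction([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# A dominant score makes the prediction follow one adjusted regressor.
peaked = weighted_ar_prediction([1.0, 2.0, 3.0], [0.0, 0.0, 0.5], [0.0, 0.0, 100.0])
```

Unlike a classical AR model with fixed coefficients, the weights here change with the input, which is what allows the model to down-weight uninformative or noisy observations.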

ML · Oct 30, 2016
Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering

Gautier Marti, Sébastien Andler, Frank Nielsen et al.

We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the cluster centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research.

ML · Apr 28, 2016
Optimal Transport vs. Fisher-Rao distance between Copulas for Clustering Multivariate Time Series

Gautier Marti, Sébastien Andler, Frank Nielsen et al.

We present a methodology for clustering N objects which are described by multivariate time series, i.e. several sequences of real-valued random variables. This clustering methodology leverages copulas which are distributions encoding the dependence structure between several random variables. To take fully into account the dependence information while clustering, we need a distance between copulas. In this work, we compare renowned distances between distributions: the Fisher-Rao geodesic distance, related divergences and optimal transport, and discuss their advantages and disadvantages. Applications of such methodology can be found in the clustering of financial assets. A tutorial, experiments and implementation for reproducible research can be found at www.datagrapple.com/Tech.

ML · Mar 13, 2016
Clustering Financial Time Series: How Long is Enough?

Gautier Marti, Sébastien Andler, Frank Nielsen et al.

Researchers have used from 30 days to several years of daily returns as source data for clustering financial time series based on their correlations. This paper sets up a statistical framework to study the validity of such practices. We first show that clustering correlated random variables from their observed values is statistically consistent. Then, we also give a first empirical answer to the much-debated question: How long should the time series be? If too short, the clusters found can be spurious; if too long, dynamics can be smoothed out.

LG · Sep 27, 2015
Optimal Copula Transport for Clustering Multivariate Time Series

Gautier Marti, Frank Nielsen, Philippe Donnat

This paper presents a new methodology for clustering multivariate time series leveraging optimal transport between copulas. Copulas are used to encode both (i) intra-dependence of a multivariate time series, and (ii) inter-dependence between two time series. Then, optimal copula transport allows us to define two distances between multivariate time series: (i) one for measuring intra-dependence dissimilarity, (ii) another one for measuring inter-dependence dissimilarity based on a new multivariate dependence coefficient which is robust to noise, deterministic, and which can target specified dependencies.

HC · Jul 29, 2015
HCMapper: An interactive visualization tool to compare partition-based flat clustering extracted from pairs of dendrograms

Gautier Marti, Philippe Donnat, Frank Nielsen et al.

We describe a new visualization tool, dubbed HCMapper, that visually helps to compare a pair of dendrograms computed on the same dataset by displaying multiscale partition-based layered structures. The dendrograms are obtained by hierarchical clustering techniques whose output reflects some hypothesis on the data, and HCMapper is specifically designed to grasp at first glance both whether the two compared hypotheses broadly agree and the data points on which they do not concur. Leveraging juxtaposition and explicit encodings, HCMapper focuses on two selected partitions while displaying coarser ones in context areas for understanding multiscale structure and eventually switching the selected partitions. HCMapper's utility is shown through the example of testing whether the prices of credit default swap financial time series only undergo correlation. This use case is detailed in the supplementary material as well as experiments with code on toy datasets for reproducible research. HCMapper is currently released as a visualization tool on the DataGrapple time series and clustering analysis platform at www.datagrapple.com.

LG · Jun 2, 2015
Toward a generic representation of random variables for machine learning

Gautier Marti, Philippe Very, Philippe Donnat

This paper presents a pre-processing and a distance which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to represent random variables which splits apart dependency and distribution without losing any information. We also propound an associated metric leveraging this representation and its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution are illustrated through the example of clustering financial time series, for instance prices from the credit default swaps market. Results are available on the website www.datagrapple.com and an IPython Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.
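One way to see how a sample can be split into a dependence part and a distribution part without losing information is to separate its ranks from its sorted values. The sketch below illustrates that general idea only and should not be read as the paper's exact representation or metric; `split_sample` and `rebuild_sample` are hypothetical names.

```python
def split_sample(values):
    """Split a sample into (ranks, sorted values).

    ranks[i] is the 0-based position of values[i] in sorted order; ties are
    broken by original index. Ranks compared across variables carry the
    dependence information, sorted values carry the marginal distribution.
    """
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks, [values[i] for i in order]


def rebuild_sample(ranks, sorted_values):
    """Invert split_sample: look up each observation by its rank."""
    return [sorted_values[r] for r in ranks]


x = [0.3, -1.2, 2.5, 0.0]
ranks, marginal = split_sample(x)
```

Because `rebuild_sample(ranks, marginal)` returns the original sample, the two parts together lose nothing, while each can be compared separately across variables.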