LG AI | Mar 2

Tide: A Customisable Dataset Generator for Anti-Money Laundering Research

Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu

arXiv:2603.01863v11.4 | h-index: 1 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses the limited availability of data for AML research by providing a customisable benchmark for advancing detection methods. The advance is incremental: it builds on existing synthetic generation approaches and adds temporal features.

The paper tackles the lack of accessible transactional data for Anti-Money Laundering (AML) research by introducing Tide, an open-source synthetic dataset generator that produces graph-based financial networks with structural and temporal laundering patterns. Evaluation on its two reference datasets yields condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) at the low illicit ratio, while XGBoost performs best (85.12) at the higher illicit ratio.

The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with varying illicit ratios (LI: 0.10%, HI: 0.19%), alongside the implementation of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) at higher fraud prevalence. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.
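
To make the evaluation setting more concrete, the sketch below shows one common way to estimate PR-AUC (as average precision) and why the metric is informative at illicit ratios as low as 0.10% and 0.19%: a scorer that ranks transactions at random lands near the illicit ratio itself, while a perfect ranking reaches 1.0. This is an illustration only and is not taken from Tide or the paper's code. The helper `make_labels`, the dataset size of 100,000, and the random and perfect scorers are all hypothetical. Scores here are on a 0 to 1 scale, whereas the reported figures (78.05, 85.12) are presumably percentages.

```python
import random


def average_precision(labels, scores):
    """Estimate PR-AUC as average precision.

    Ranks items by descending score and averages the precision
    observed at the rank of each positive (illicit) item.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


def make_labels(n, illicit_ratio, rng):
    """Hypothetical helper: n shuffled labels, 1 = illicit, 0 = licit."""
    n_illicit = max(1, round(n * illicit_ratio))
    labels = [1] * n_illicit + [0] * (n - n_illicit)
    rng.shuffle(labels)
    return labels


rng = random.Random(0)  # fixed seed so the run is reproducible
results = {}
for name, ratio in [("LI", 0.0010), ("HI", 0.0019)]:
    labels = make_labels(100_000, ratio, rng)  # assumed size, not from the paper
    random_scores = [rng.random() for _ in labels]  # uninformative scorer
    perfect_scores = [float(y) for y in labels]     # oracle scorer
    results[name] = {
        "n_illicit": sum(labels),
        "random": average_precision(labels, random_scores),
        "perfect": average_precision(labels, perfect_scores),
    }
    print(name, results[name])
```

Because the random baseline sits close to the illicit ratio, the same PR-AUC value implies a different lift over chance in the LI and HI conditions, which is one reason to compare models within a condition and not across them.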
