LG · May 31, 2025
RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models
Valter Hudovernik, Minkai Xu, Juntong Shi et al.
Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a $2K+$SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures high fidelity and referential integrity, both of which are crucial to synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at https://github.com/ValterH/RelDiff.
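To make the idea of block-model structure generation concrete, here is a minimal sketch of SBM-style foreign key sampling. It is not the $2K+$SBM generator from the paper; the function name `sample_foreign_keys`, the block labels, and the probability matrix are all hypothetical and chosen only for illustration. Each child row draws a parent block from a block-level distribution and then a parent row uniformly within that block, so every generated foreign key points to an existing parent row.

```python
import random


def sample_foreign_keys(child_blocks, parent_blocks, block_probs, seed=0):
    """Assign one parent row to each child row using block-level link probabilities.

    child_blocks[i]   : block label (int) of child row i
    parent_blocks[j]  : block label (int) of parent row j
    block_probs[a][b] : weight for a child in block a linking to a parent in block b
    Assumes every parent block with positive weight contains at least one row.
    """
    rng = random.Random(seed)
    n_parent_blocks = len(block_probs[0])

    # Group parent row indices by block so we can sample uniformly within a block.
    parents_by_block = [[] for _ in range(n_parent_blocks)]
    for row, block in enumerate(parent_blocks):
        parents_by_block[block].append(row)

    foreign_keys = []
    for block in child_blocks:
        # Pick the parent block first, then a concrete parent row inside it.
        target = rng.choices(range(n_parent_blocks), weights=block_probs[block])[0]
        foreign_keys.append(rng.choice(parents_by_block[target]))
    return foreign_keys


# Hypothetical toy example: 5 child rows, 3 parent rows, 2 blocks on each side.
child_blocks = [0, 0, 1, 1, 1]
parent_blocks = [0, 0, 1]
block_probs = [[0.9, 0.1], [0.2, 0.8]]
fks = sample_foreign_keys(child_blocks, parent_blocks, block_probs)
```

Because the structure is sampled before any attributes, referential integrity holds by construction in this sketch, which mirrors the separation of structure and attributes described in the abstract.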
LG · Nov 4, 2025
Reducing normalizing flow complexity for MCMC preconditioning
David Nabergoj, Erik Štrumbelj
Preconditioning is a key component of MCMC algorithms that improves sampling efficiency by facilitating exploration of geometrically complex target distributions through an invertible map. While linear preconditioners are often sufficient for moderately complex target distributions, recent work has explored nonlinear preconditioning with invertible neural networks as components of normalizing flows (NFs). However, empirical and theoretical studies show that overparameterized NF preconditioners can degrade sampling efficiency and fit quality. Moreover, existing NF-based approaches do not adapt their architectures to the target distribution. Related work outside of MCMC similarly finds that suitably parameterized NFs can achieve comparable or superior performance with substantially less training time or data. We propose a factorized preconditioning architecture that reduces NF complexity by combining a linear component with a conditional NF, improving adaptability to target geometry. The linear preconditioner is applied to dimensions that are approximately Gaussian, as estimated from warmup samples, while the conditional NF models more complex dimensions. Our method yields significantly better tail samples on two complex synthetic distributions and consistently better performance on a sparse logistic regression posterior across varying likelihood and prior strengths. It also achieves higher effective sample sizes on hierarchical Bayesian model posteriors with weak likelihoods and strong funnel geometries. This approach is particularly relevant for hierarchical Bayesian model analyses with limited data and could inform current theoretical and software strides in neural MCMC design.
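The dimension split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not say how approximate Gaussianity is estimated, so the skewness and excess kurtosis thresholds here are hypothetical, as are all function names. The linear component is reduced to a diagonal shift and scale, and the conditional NF for the remaining dimensions is omitted.

```python
import random
import statistics


def skew_and_excess_kurtosis(values):
    """Return sample skewness and excess kurtosis of a list of floats."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    z = [(v - mean) / std for v in values]
    skew = statistics.fmean([x ** 3 for x in z])
    kurt = statistics.fmean([x ** 4 for x in z]) - 3.0
    return skew, kurt


def split_dimensions(warmup, skew_tol=0.5, kurt_tol=1.0):
    """Split dimension indices into approximately Gaussian and non-Gaussian.

    The tolerances are hypothetical; any normality criterion could be used here.
    """
    gaussian_dims, complex_dims = [], []
    for d in range(len(warmup[0])):
        column = [row[d] for row in warmup]
        skew, kurt = skew_and_excess_kurtosis(column)
        if abs(skew) < skew_tol and abs(kurt) < kurt_tol:
            gaussian_dims.append(d)
        else:
            complex_dims.append(d)
    return gaussian_dims, complex_dims


def fit_diagonal_preconditioner(warmup, dims):
    """Fit a per-dimension shift and scale (a diagonal linear map) on the given dims."""
    params = {}
    for d in dims:
        column = [row[d] for row in warmup]
        params[d] = (statistics.fmean(column), statistics.pstdev(column))
    return params


# Hypothetical warmup samples: dimension 0 is Gaussian, dimension 1 is log-normal.
rng = random.Random(0)
warmup = [[rng.gauss(2.0, 3.0), rng.lognormvariate(0.0, 1.0)] for _ in range(2000)]

gaussian_dims, complex_dims = split_dimensions(warmup)
linear_params = fit_diagonal_preconditioner(warmup, gaussian_dims)
# Dimensions in complex_dims would be handed to a conditional NF (not shown).
```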
LG · Dec 22, 2024
Empirical evaluation of normalizing flows in Markov Chain Monte Carlo
David Nabergoj, Erik Štrumbelj
Recent advances in MCMC use normalizing flows to precondition target distributions and enable jumps to distant regions. However, there is currently no systematic comparison of different normalizing flow architectures for MCMC. As such, many works choose simple flow architectures that are readily available and do not consider other models. Guidelines for choosing an appropriate architecture would reduce analysis time for practitioners and motivate researchers to take the recommended models as foundations to be improved. We provide the first such guideline by extensively evaluating many normalizing flow architectures on various flow-based MCMC methods and target distributions. When the target density gradient is available, we show that flow-based MCMC outperforms classic MCMC for suitable NF architecture choices with minor hyperparameter tuning. When the gradient is unavailable, flow-based MCMC wins with off-the-shelf architectures. We find contractive residual flows to be the best general-purpose models with relatively low sensitivity to hyperparameter choice. We also provide various insights into normalizing flow behavior within MCMC when varying their hyperparameters, properties of target distributions, and the overall computational budget.
ME · Jul 31, 2025
A General Approach to Visualizing Uncertainty in Statistical Graphics
Bernarda Petek, David Nabergoj, Erik Štrumbelj
Visualizing uncertainty is integral to data analysis, yet its application is often hindered by the need for specialized methods for quantifying and representing uncertainty for different types of graphics. We introduce a general approach that simplifies this process. The core idea is to treat the statistical graphic as a function of the underlying distribution. Instead of first calculating uncertainty metrics and then plotting them, the method propagates uncertainty through to the visualization. By repeatedly sampling from the data distribution and generating a complete statistical graphic for each sample, a distribution over graphics is produced. These graphics are aggregated pixel-by-pixel to create a single, static image. This approach is versatile, requires no specific knowledge from the user beyond how to create the basic statistical graphic, and comes with theoretical coverage guarantees for standard cases such as confidence intervals and bands. We provide a reference implementation as a Python library to demonstrate the method's utility. Our approach not only reproduces conventional uncertainty visualizations for point estimates and regression lines but also seamlessly extends to non-standard cases, including pie charts, stacked bar charts, and tables. This approach makes uncertainty visualization more accessible to practitioners and can be a valuable tool for teaching uncertainty.
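The sample, render, and aggregate loop can be sketched in a few lines. This is a minimal illustration and not the reference library's API: the names `render_bar` and `uncertainty_image` are hypothetical, the bootstrap is assumed as the way to sample from the data distribution, and the "graphic" is a single bar whose height is the sample mean, rasterized onto a small pixel grid.

```python
import random
import statistics

HEIGHT, WIDTH = 40, 10      # pixel grid size (hypothetical)
Y_MIN, Y_MAX = 0.0, 10.0    # value range mapped onto the vertical axis


def render_bar(value):
    """Rasterize one bar of the given height as a binary image (row 0 is the top)."""
    frac = min(max((value - Y_MIN) / (Y_MAX - Y_MIN), 0.0), 1.0)
    filled_rows = round(frac * HEIGHT)
    return [
        [1.0 if (HEIGHT - row) <= filled_rows else 0.0 for _ in range(WIDTH)]
        for row in range(HEIGHT)
    ]


def uncertainty_image(data, n_samples=200, seed=0):
    """Average the rendered graphic over bootstrap resamples, pixel by pixel."""
    rng = random.Random(seed)
    total = [[0.0] * WIDTH for _ in range(HEIGHT)]
    for _ in range(n_samples):
        resample = rng.choices(data, k=len(data))          # one bootstrap sample
        image = render_bar(statistics.fmean(resample))     # full graphic for it
        for r in range(HEIGHT):
            for c in range(WIDTH):
                total[r][c] += image[r][c]
    return [[v / n_samples for v in row] for row in total]


# Hypothetical data: 30 draws from a normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(5.0, 2.0) for _ in range(30)]
image = uncertainty_image(data)
```

Each pixel of `image` holds the fraction of resampled graphics in which that pixel was filled, so the top edge of the bar fades gradually instead of ending at a hard line. The user only supplies the rendering function; no graphic-specific uncertainty calculation is needed.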
LG · Oct 6, 2021
Predicting the Popularity of Games on Steam
Andraž De Luisa, Jan Hartman, David Nabergoj et al.
The video game industry has seen rapid growth over the last decade. Thousands of video games are released and played by millions of people every year, creating a large community of players. Steam is a leading gaming platform and social networking site, which allows its users to purchase and store games. A by-product of Steam is a large database of information about games, players, and gaming behavior. In this paper, we take recent video games released on Steam and aim to discover the relation between game popularity and a game's features that can be acquired through Steam. We approach this task by predicting the popularity of Steam games in the early stages after their release and we use a Bayesian approach to understand the influence of a game's price, size, supported languages, release date, and genres on its player count. We implement several models and discover that a genre-based hierarchical approach achieves the best performance. We further analyze the model and interpret its coefficients, which indicate that games released at the beginning of the month and games of certain genres correlate with game popularity.