Stefano Bortoli

h-index13
2papers

2 Papers

MLNov 2, 2023
Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison

Kensuke Mitsuzawa, Motonobu Kanagawa, Stefano Bortoli et al.

We study two-sample variable selection: identifying variables that discriminate between the distributions of two sets of data vectors. Such variables help scientists understand the mechanisms behind dataset discrepancies. Although domain-specific methods exist (e.g., in medical imaging, genetics, and computational social science), a general framework remains underdeveloped. We make two separate contributions. (i) We introduce a mathematical notion of the discriminating set of variables: the largest subset containing no variables whose marginals are identical across the two distributions and independent of the remaining variables. We prove this set is uniquely defined and establish further properties, making it a suitable ground truth for theory and evaluation. (ii) We propose two methods for two-sample variable selection that assign weights to variables and optimise them to maximise the power of a kernel two-sample test while enforcing sparsity to downweight redundant variables. To select the regularisation parameter - unknown in practice, as it controls the number of selected variables - we develop two data-driven procedures to balance recall and precision. Synthetic experiments show improved performance over baselines, and we illustrate the approach on two applications using datasets from water-pipe and traffic networks.

MEDec 9, 2024
Variable Selection for Comparing High-dimensional Time-Series Data

Kensuke Mitsuzawa, Margherita Grossi, Stefano Bortoli et al.

Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. In applications where one time series is an output from a computationally expensive simulator, the approach may be used for validating the simulator against real data, for comparing the outputs of two simulators, and for validating a machine learning-based emulator against the simulator. With the proposed approach, the entire time interval is split into multiple subintervals, and on each subinterval, the two sample sets are compared to select variables that distinguish their distributions and a two-sample test is performed. The validity and limitations of the proposed approach are investigated in synthetic data experiments. Its usefulness is demonstrated in an application with a particle-based fluid simulator, where a deep neural network model is compared against the simulator, and in an application with a microscopic traffic simulator, where the effects of changing the simulator's parameters on traffic flows are analysed.