CY May 7
How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
Sophia N. Wilson, Sebastian Mair, Mophat Okinyi et al.
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
SY May 12
Provably-Correct Safety Protocol for Cooperative Platooning
Sebastian Mair, Matthias Althoff
Cooperative Adaptive Cruise Control (CACC) is a well-studied technology for forming string-stable vehicle platoons. Ensuring collision avoidance is particularly difficult in CACC due to the small desired inter-vehicle spacing. We propose a safety protocol preventing collisions in a provably-correct manner while still maintaining a small distance to the preceding vehicle, by utilizing communicated braking capabilities. In addition, the safety of the protocol is ensured despite possible communication failures. While our concept can be applied to any CACC system, we particularly consider a class of CACCs, where the platoon vehicles successively agree on a consensus behavior. Our safety protocol is evaluated on various scenarios using the CommonRoad benchmark suite.
CR Jul 5, 2023
Personalized Privacy Amplification via Importance Sampling
Dominik Fay, Sebastian Mair, Jens Sjölund
For scalable machine learning on large data sets, subsampling a representative subset is a common approach for efficient model training. This is often achieved through importance sampling, whereby informative data points are sampled more frequently. In this paper, we examine the privacy properties of importance sampling, focusing on an individualized privacy analysis. We find that, in importance sampling, privacy is well aligned with utility but at odds with sample size. Based on this insight, we propose two approaches for constructing sampling distributions: one that optimizes the privacy-efficiency trade-off; and one based on a utility guarantee in the form of coresets. We evaluate both approaches empirically in terms of privacy, efficiency, and accuracy on the differentially private $k$-means problem. We observe that both approaches yield similar outcomes and consistently outperform uniform sampling across a wide range of data sets. Our code is available on GitHub: https://github.com/smair/personalized-privacy-amplification-via-importance-sampling
LG Jan 31, 2023
Archetypal Analysis++: Rethinking the Initialization Strategy
Sebastian Mair, Jens Sjölund
Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential, but frequently used initialization methods either yield sub-optimal starting points or are prone to getting stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective function, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest adapting an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation on 15 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones.
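For reference, the $k$-means++ seeding that the abstract compares against can be sketched as below. This is standard D² sampling, not the AA++ procedure itself, whose exact sampling weights are not stated in the abstract. The function name `kmeanspp_init` and its parameters are illustrative assumptions.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """Pick k row indices of X via k-means++ seeding (D^2 sampling).

    Illustrative helper only; this is NOT the AA++ procedure.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First center: chosen uniformly at random.
    idx = [int(rng.integers(n))]
    # Squared distance of every point to its closest chosen center.
    d2 = np.sum((X - X[idx[0]]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # All points coincide with a chosen center: fall back to uniform.
            probs = np.full(n, 1.0 / n)
        else:
            # Sample proportionally to the squared distance.
            probs = d2 / total
        j = int(rng.choice(n, p=probs))
        idx.append(j)
        # Update the closest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((X - X[j]) ** 2, axis=1))
    return np.array(idx)
```

Points far from all previously chosen centers are favoured, which is the property the abstract relates to sampling by influence on the objective.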
LG Oct 30, 2023
On Feynman--Kac training of partial Bayesian neural networks
Zheng Zhao, Sebastian Mair, Thomas B. Schön et al.
Recently, partial Bayesian neural networks (pBNNs), which only consider a subset of the parameters to be stochastic, were shown to perform competitively with full Bayesian neural networks. However, pBNNs are often multi-modal in the latent variable space and thus challenging to approximate with parametric models. To address this problem, we propose an efficient sampling-based training strategy, wherein the training of a pBNN is formulated as simulating a Feynman--Kac model. We then describe variations of sequential Monte Carlo samplers that allow us to simultaneously estimate the parameters and the latent posterior distribution of this model at a tractable computational cost. Using various synthetic and real-world datasets we show that our proposed training scheme outperforms the state of the art in terms of predictive performance.
LG Apr 5, 2023
Self-Supervised Siamese Autoencoders
Friederike Baier, Sebastian Mair, Samuel G. Fadel
In contrast to fully-supervised models, self-supervised representation learning only needs a fraction of data to be labeled and often achieves the same or even higher downstream performance. The goal is to pre-train deep neural networks on a self-supervised task, making them able to extract meaningful features from raw input data afterwards. Previously, autoencoders and Siamese networks have been successfully employed as feature extractors for tasks such as image classification. However, both have their individual shortcomings and benefits. In this paper, we combine their complementary strengths by proposing a new method called SidAE (Siamese denoising autoencoder). Using an image classification downstream task, we show that our model outperforms two self-supervised baselines across multiple data sets and scenarios. Crucially, this includes conditions in which only a small amount of labeled data is available. Empirically, the Siamese component has more impact, but the denoising autoencoder is nevertheless necessary to improve performance.
LG Feb 23
Stop Preaching and Start Practising Data Frugality for Responsible Development of AI
Sophia N. Wilson, Guðrún Fjóla Guðmundsdóttir, Andrew Millard et al.
This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For a long time, progress has been equated with ever-larger datasets, driving remarkable advances but now yielding diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data-frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preaching and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that coreset-based subset selection can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetoric to concrete practice for responsible development of AI.
LG Feb 15, 2024
Ising on the Graph: Task-specific Graph Subsampling via the Ising Model
Maria Bånkestad, Jennifer R. Andersson, Sebastian Mair et al.
Reducing a graph while preserving its overall properties is an important problem with many applications. Typically, reduction approaches either remove edges (sparsification) or merge nodes (coarsening) in an unsupervised way with no specific downstream task in mind. In this paper, we present an approach for subsampling graph structures using an Ising model defined on either the nodes or edges and learning the external magnetic field of the Ising model using a graph neural network. Our approach is task-specific as it can learn how to reduce a graph for a specific downstream task in an end-to-end fashion without requiring a differentiable loss function for the task. We showcase the versatility of our approach on four distinct applications: image segmentation, explainability for graph classification, 3D shape sparsification, and sparse approximate matrix inverse determination.
ME Apr 16, 2025
A Survey on Archetypal Analysis
Aleix Alcacer, Irene Epifanio, Sebastian Mair et al.
Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure to extract the distinct aspects, called archetypes, in observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data with wide applications throughout the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This survey provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA has to offer, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data using AA and its limitations. The survey concludes by explaining important future research directions concerning AA.
RO Mar 29, 2025
Predictive Traffic Rule Compliance using Reinforcement Learning
Yanliang Huang, Sebastian Mair, Zhuoqi Zeng et al.
Autonomous vehicle path planning has reached a stage where safety and regulatory compliance are crucial. This paper presents an approach that integrates a motion planner with a deep reinforcement learning model to predict potential traffic rule violations. Our main innovation is replacing the standard actor network in an actor-critic method with a motion planning module, which ensures both stable and interpretable trajectory generation. In this setup, we use traffic rule robustness as the reward to train a reinforcement learning agent's critic, and the output of the critic is used directly as the cost function of the motion planner, which guides the choice of trajectory. We incorporate some key interstate rules from the German Road Traffic Regulation into a rule book and use a graph-based state representation to handle complex traffic information. Experiments on an open German highway dataset show that the model can predict and prevent traffic rule violations beyond the planning horizon, increasing safety and rule compliance in challenging traffic scenarios.
CV Nov 29, 2024
Explaining the Impact of Training on Vision Models via Activation Clustering
Ahcène Boubekki, Samuel G. Fadel, Sebastian Mair
This paper introduces Neuro-Activated Vision Explanations (NAVE), a method for extracting and visualizing the internal representations of vision model encoders. By clustering feature activations, NAVE provides insights into learned semantics without fine-tuning. Using object localization, we show that NAVE's concepts align with image semantics. Through extensive experiments, we analyze the impact of training strategies and architectures on encoder representation capabilities. Additionally, we apply NAVE to study training artifacts in vision transformers and reveal how weak training strategies and spurious correlations degrade model performance. Our findings establish NAVE as a valuable tool for post-hoc model inspection and improving transparency in vision models.
CV Jun 7, 2024
Leveraging Activations for Superpixel Explanations
Ahcène Boubekki, Samuel G. Fadel, Sebastian Mair
Saliency methods have become standard in the explanation toolkit of deep neural networks. Recent developments specific to image classifiers have investigated region-based explanations with either new methods or by adapting well-established ones using ad-hoc superpixel algorithms. In this paper, we aim to avoid relying on these segmenters by extracting a segmentation from the activations of a deep neural network image classifier without fine-tuning the network. Our so-called Neuro-Activated Superpixels (NAS) can isolate the regions of interest in the input relevant to the model's prediction, which boosts high-threshold weakly supervised object localization performance. This property enables the semi-supervised semantic evaluation of saliency methods. The aggregation of NAS with existing saliency methods eases their interpretation and reveals the inconsistencies of the widely used area under the relevance curve metric.
MED-PH May 6, 2024
Efficient Radiation Treatment Planning based on Voxel Importance
Sebastian Mair, Anqi Fu, Jens Sjölund
Radiation treatment planning involves optimization over a large number of voxels, many of which carry limited information about the clinical problem. We propose an approach to reduce the large optimization problem by only using a representative subset of informative voxels. This way, we drastically improve planning efficiency while maintaining the plan quality. Within an initial probing step, we pre-solve an easier optimization problem involving a simplified objective from which we derive an importance score per voxel. This importance score is then turned into a sampling distribution, which allows us to subsample a small set of informative voxels using importance sampling. By solving this reduced version of the original optimization problem using this subset, we effectively reduce the problem's size and computational demands while accounting for regions where satisfactory dose deliveries are challenging. In contrast to other stochastic (sub-)sampling methods, our technique only requires a single probing and sampling step to define a reduced optimization problem. This problem can be efficiently solved using established solvers without needing to modify or adapt them. Empirical experiments on open benchmark data highlight substantially reduced optimization times, up to 50 times faster than the original ones, for intensity-modulated radiation therapy (IMRT), all while upholding plan quality comparable to traditional methods. Our novel approach has the potential to significantly accelerate radiation treatment planning by addressing its inherent computational challenges. We reduce the treatment planning time by reducing the size of the optimization problem rather than modifying and improving the optimization method. Our efforts are thus complementary to many previous developments.
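The step from importance scores to a subsample can be illustrated with a minimal sketch. It assumes non-negative per-voxel scores are already available (the probing step is not reproduced here) and uses generic importance sampling with inverse-probability weights. The function name `subsample_voxels` and the `uniform_mix` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def subsample_voxels(scores, m, uniform_mix=0.1, rng=None):
    """Draw m voxel indices with probability driven by importance scores.

    Returns (indices, weights). The weights 1 / (m * p_i) make a weighted
    sum over the sample an unbiased estimate of the full sum over voxels.
    `uniform_mix` (hypothetical) blends in a uniform distribution so that
    low-score voxels keep a non-zero sampling probability.
    """
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    if np.any(scores < 0) or scores.sum() <= 0:
        raise ValueError("scores must be non-negative with a positive sum")
    # Turn scores into a sampling distribution.
    probs = (1.0 - uniform_mix) * scores / scores.sum() + uniform_mix / n
    # Sample with replacement according to that distribution.
    idx = rng.choice(n, size=m, replace=True, p=probs)
    # Inverse-probability weights correct for the non-uniform sampling.
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

A reduced objective would then sum the weighted per-voxel terms over `idx` only, so an unmodified solver sees a smaller problem of the same form.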
ML Oct 22, 2020
Principled Interpolation in Normalizing Flows
Samuel G. Fadel, Sebastian Mair, Ricardo da S. Torres et al.
Generative models based on normalizing flows are very successful in modeling complex data distributions using simpler ones. However, straightforward linear interpolations show unexpected side effects, as interpolation paths lie outside the area where samples are observed. This is caused by the standard choice of Gaussian base distributions and can be seen in the norms of the interpolated samples, which lie outside the data manifold. This observation suggests that changing the way of interpolating should generally result in better interpolations, but it is not clear how to do that in an unambiguous way. In this paper, we solve this issue by enforcing a specific manifold and hence changing the base distribution, which allows for a principled way of interpolation. Specifically, we use the Dirichlet and von Mises-Fisher base distributions on the probability simplex and the hypersphere, respectively. Our experimental results show superior performance in terms of bits per dimension, Fréchet Inception Distance (FID), and Kernel Inception Distance (KID) scores for interpolation, while maintaining the generative performance.
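The norm effect described above is easy to reproduce numerically. The sketch below draws two samples from a high-dimensional standard Gaussian and compares the norm of their linear midpoint with that of a spherical (slerp) midpoint. Slerp is used here only as a familiar norm-preserving alternative; it is not the Dirichlet or von Mises-Fisher construction the paper proposes, and the helper names are illustrative.

```python
import numpy as np

def lerp(x, y, t):
    """Linear interpolation between x and y."""
    return (1.0 - t) * x + t * y

def slerp(x, y, t):
    """Spherical interpolation; assumes x and y are not (anti-)parallel."""
    ux = x / np.linalg.norm(x)
    uy = y / np.linalg.norm(y)
    # Angle between the two directions.
    omega = np.arccos(np.clip(ux @ uy, -1.0, 1.0))
    return (np.sin((1.0 - t) * omega) * x + np.sin(t * omega) * y) / np.sin(omega)

rng = np.random.default_rng(0)
d = 10_000
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Gaussian samples concentrate near norm sqrt(d); the linear midpoint
# falls well inside that shell, the spherical midpoint stays close to it.
norm_x = np.linalg.norm(x)
norm_y = np.linalg.norm(y)
norm_lerp = np.linalg.norm(lerp(x, y, 0.5))
norm_slerp = np.linalg.norm(slerp(x, y, 0.5))
print(norm_x, norm_y, norm_lerp, norm_slerp)
```

Because independent Gaussian samples are nearly orthogonal in high dimensions, the linear midpoint has a norm of roughly $\sqrt{d/2}$ rather than $\sqrt{d}$, which is the mismatch the abstract attributes to the Gaussian base distribution.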