DCNov 14, 2025Code
What happens when nanochat meets DiLoCo?Alexander Acker, Soeren Becker, Sasho Nedelkoski et al.
Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat's training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.
LGJan 25, 2023
PULL: Reactive Log Anomaly Detection Based On Iterative PU LearningThorsten Wittkopp, Dominik Scheinert, Philipp Wiesner et al.
Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
DCAug 22, 2023
Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data AnalyticsDominik Scheinert, Philipp Wiesner, Thorsten Wittkopp et al.
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
DCFeb 26
Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility StudyPhilipp Wiesner, Soeren Becker, Brett Cornick et al.
Training large language models (LLMs) requires substantial compute and energy. At the same time, renewable energy sources regularly produce more electricity than the grid can absorb, leading to curtailment, the deliberate reduction of clean generation that would otherwise go to waste. These periods represent an opportunity: if training is aligned with curtailment windows, LLMs can be pretrained using electricity that is both clean and cheap. This technical report presents a system that performs full-parameter LLM training across geo-distributed GPU clusters during regional curtailment windows, elastically switching between local single-site training and federated multi-site synchronization as sites become available or unavailable. Our prototype trains a 561M-parameter transformer model across three clusters using the Flower federated learning framework, with curtailment periods derived from real-world marginal carbon intensity traces. Preliminary results show that curtailment-aware scheduling preserves training quality while reducing operational emissions to 5-12% of single-site baselines.
LGMay 22, 2024
LogRCA: Log-based Root Cause Analysis for Distributed ServicesThorsten Wittkopp, Philipp Wiesner, Odej Kao
To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure. We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause. LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data. We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts. LogRCA consistently outperforms baselines based on deep learning and statistical analysis in terms of precision and recall to detect candidate root causes. In addition, we investigated the impact of our deployed data balancing approach, demonstrating that it considerably improves performance on rare failures.
LGMar 5, 2024
Federated Learning over Connected ModesDennis Grinwald, Philipp Wiesner, Shinichi Nakajima
Statistical heterogeneity in federated learning poses two major challenges: slow global training due to conflicting gradient signals, and the need of personalization for local distributions. In this work, we tackle both challenges by leveraging recent advances in \emph{linear mode connectivity} -- identifying a linearly connected low-loss region in the parameter space of neural networks, which we call solution simplex. We propose federated learning over connected modes (\textsc{Floco}), where clients are assigned local subregions in this simplex based on their gradient signals, and together learn the shared global solution simplex. This allows personalization of the client models to fit their local distributions within the degrees of freedom in the solution simplex and homogenizes the update signals for the global simplex training. Our experiments show that \textsc{Floco} accelerates the global training process, and significantly improves the local accuracy with minimal computational overhead in cross-silo federated learning settings.
LGApr 1
Exploring Silent Data Corruption as a Reliability Challenge in LLM TrainingAnton Altenbernd, Philipp Wiesner, Odej Kao
As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter divergence. Building on the observed corruption signatures, we propose a lightweight detection method that identifies potentially harmful parameter updates. Experiments on LLaMA models with 60M, 350M, and 1.3B parameters demonstrate that recomputing the most recent training step upon detection can effectively mitigate the impact of these events.
AINov 19, 2025
Efficiency Will Not Lead to Sustainable Reasoning AIPhilipp Wiesner, Daniel W. O'Neill, Francesca Larosa et al.
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing's global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
LGMar 26, 2025
$β$-GNN: A Robust Ensemble Approach Against Graph Structure PerturbationHaci Ismail Aslan, Philipp Wiesner, Ping Xiong et al.
Graph Neural Networks (GNNs) are playing an increasingly important role in the efficient operation and security of computing systems, with applications in workload scheduling, anomaly detection, and resource management. However, their vulnerability to network perturbations poses a significant challenge. We propose $β$-GNN, a model enhancing GNN robustness without sacrificing clean data performance. $β$-GNN uses a weighted ensemble, combining any GNN with a multi-layer perceptron. A learned dynamic weight, $β$, modulates the GNN's contribution. This $β$ not only weights GNN influence but also indicates data perturbation levels, enabling proactive mitigation. Experimental results on diverse datasets show $β$-GNN's superior adversarial accuracy and attack severity quantification. Crucially, $β$-GNN avoids perturbation assumptions, preserving clean data structure and performance.
LGMay 24, 2023
FedZero: Leveraging Renewable Excess Energy in Federated LearningPhilipp Wiesner, Ramin Khalili, Dennis Grinwald et al.
Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair training. We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.
DCDec 17, 2021
Continuously Testing Distributed IoT Systems: An Overview of the State of the ArtJossekin Beilharz, Philipp Wiesner, Arne Boockmeyer et al.
The continuous testing of small changes to systems has proven to be useful and is widely adopted in the development of software systems. For this, software is tested in environments that are as close as possible to the production environments. When testing IoT systems, this approach is met with unique challenges that stem from the typically large scale of the deployments, heterogeneity of nodes, challenging network characteristics, and tight integration with the environment among others. IoT test environments present a possible solution to these challenges by emulating the nodes, networks, and possibly domain environments in which IoT applications can be executed. This paper gives an overview of the state of the art in IoT testing. We derive desirable characteristics of IoT test environments, compare 18 tools that can be used in this respect, and give a research outlook of future trends in this area.
DBNov 26, 2021
A Taxonomy of Anomalies in Log DataThorsten Wittkopp, Philipp Wiesner, Dominik Scheinert et al.
Log data anomaly detection is a core component in the area of artificial intelligence for IT operations. However, the large amount of existing methods makes it hard to choose the right approach for a specific system. A better understanding of different kinds of anomalies, and which algorithms are suitable for detecting them, would support researchers and IT operators. Although a common taxonomy for anomalies already exists, it has not yet been applied specifically to log data, pointing out the characteristics and peculiarities in this domain. In this paper, we present a taxonomy for different kinds of log data anomalies and introduce a method for analyzing such anomalies in labeled datasets. We applied our taxonomy to the three common benchmark datasets Thunderbird, Spirit, and BGL, and trained five state-of-the-art unsupervised anomaly detection algorithms to evaluate their performance in detecting different kinds of anomalies. Our results show, that the most common anomaly type is also the easiest to predict. Moreover, deep learning-based approaches outperform data mining-based approaches in all anomaly types, but especially when it comes to detecting contextual anomalies.
LGNov 2, 2021
LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak SupervisionThorsten Wittkopp, Philipp Wiesner, Dominik Scheinert et al.
With increasing scale and complexity of cloud operations, automated detection of anomalies in monitoring data such as logs will be an essential part of managing future IT infrastructures. However, many methods based on artificial intelligence, such as supervised deep learning models, require large amounts of labeled training data to perform well. In practice, this data is rarely available because labeling log data is expensive, time-consuming, and requires a deep understanding of the underlying system. We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts. Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect. It is based on the attention mechanism and uses a custom objective function for weak supervision deep learning techniques that accounts for imbalanced data. Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
DCAug 10, 2021
Evaluation of Load Prediction Techniques for Distributed Stream ProcessingKordian Gontarska, Morgan Geldenhuys, Dominik Scheinert et al.
Distributed Stream Processing (DSP) systems enable processing large streams of continuous data to produce results in near to real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at DSP systems can vary considerably over time, which may be due to trends, cyclic, and seasonal patterns within the data streams. A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization tasks such as dynamic scaling, live migration of resources, and the tuning of configuration parameters during run-times, thus leading to a potentially better Quality of Service. In this paper we conduct a comprehensive evaluation of different load prediction techniques for DSP jobs. We identify three use-cases and formulate requirements for making load predictions specific to DSP jobs. Automatically optimized classical and Deep Learning methods are being evaluated on nine different datasets from typical DSP domains, i.e. the IoT, Web 2.0, and cluster monitoring. We compare model performance with respect to overall accuracy and training duration. Our results show that the Deep Learning methods provide the most accurate load predictions for the majority of the evaluated datasets.