49.1 AI May 11
Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance
Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen et al.
Data is fundamental to large language models (LLMs). However, what makes certain data useful for the different stages of an LLM workflow (training, tuning, alignment, in-context learning, etc.), and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute-intensive and lack a principled way of understanding how specific data characteristics drive LLM behavior. In this position paper, we advocate developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences reveal useful characteristics when they are used in one or more stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically study how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be analyzed with theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.
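To make the typical-set notion mentioned in the abstract concrete, here is a minimal sketch of the classical (weakly) typical set for an i.i.d. source. The i.i.d. categorical source, the function names, and the tolerance `eps` are illustrative assumptions; the abstract does not specify which random processes or which generalization of typicality the authors use.

```python
import math
import random


def entropy(p):
    """Shannon entropy (bits per symbol) of a categorical distribution p."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def sample_sequence(p, n, rng):
    """Draw n i.i.d. symbols from p (assumed source; not from the paper)."""
    return rng.choices(range(len(p)), weights=p, k=n)


def per_symbol_surprisal(seq, p):
    """Compute -(1/n) * log2 P(seq) under the i.i.d. model p."""
    return -sum(math.log2(p[s]) for s in seq) / len(seq)


def is_typical(seq, p, eps):
    """Weak typicality: per-symbol surprisal is within eps of the entropy."""
    return abs(per_symbol_surprisal(seq, p) - entropy(p)) <= eps


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.9, 0.1]
    probe = sample_sequence(p, 1000, rng)
    print("entropy:", entropy(p))
    print("surprisal of sampled probe:", per_symbol_surprisal(probe, p))
    # The all-zeros sequence is the single most likely one, yet not typical.
    print("all-zeros typical?", is_typical([0] * 1000, p, eps=0.1))
```

By the asymptotic equipartition property, long sampled sequences fall in the typical set with high probability, while the single most probable sequence (all zeros here) does not.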
LG Dec 4, 2025
MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System
Felix Mulitze, Herbert Woisetschläger, Hans Arno Jacobsen
The convergence of next-generation wireless systems and distributed Machine Learning (ML) demands Federated Learning (FL) methods that remain efficient and robust with wirelessly connected peers and under network churn. Peer-to-peer (P2P) FL removes the bottleneck of a central coordinator, but existing approaches suffer from excessive communication complexity, limiting their scalability in practice. We introduce MAR-FL, a novel P2P FL system that leverages iterative group-based aggregation to substantially reduce communication overhead while retaining resilience to churn. MAR-FL achieves communication costs that scale as O(N log N), in contrast to the O(N^2) complexity of existing baselines, and thereby remains effective as the number of peers in an aggregation round grows. The system is robust to unreliable FL clients and can integrate private computing.
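The gap between O(N^2) and O(N log N) can be illustrated with a simple message count. The sketch below compares all-to-all exchange with a hypothetical iterative scheme in which peers are regrouped each iteration and exchange only within their group. The grouping rule, `group_size`, and the function names are assumptions made for illustration; this is not MAR-FL's actual protocol, which the abstract does not detail.

```python
def all_to_all_messages(n: int) -> int:
    """Messages per round when every peer sends its update to every other peer."""
    return n * (n - 1)


def grouped_messages(n: int, group_size: int) -> int:
    """Messages per round for an assumed iterative group-based scheme.

    Each iteration, every peer sends to the other (group_size - 1) members
    of its group; ceil(log_{group_size} n) iterations are assumed to be
    enough for information to reach all n peers.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    iterations, reach = 0, 1
    while reach < n:  # integer loop avoids floating-point log rounding
        reach *= group_size
        iterations += 1
    return n * (group_size - 1) * iterations


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, all_to_all_messages(n), grouped_messages(n, group_size=2))
```

Under these assumptions, 64 peers need 4032 messages with all-to-all exchange but 384 with pairwise groups over 6 iterations, and the ratio widens as N grows.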
LG Oct 31, 2024
MESS+: Energy-Optimal Inferencing in Language Model Zoos with Service Level Guarantees
Ryan Zhang, Herbert Woisetschläger, Shiqiang Wang et al.
Open-weight large language model (LLM) zoos allow users to quickly integrate state-of-the-art models into systems. Despite increasing availability, selecting the most appropriate model for a given task still largely relies on public benchmark leaderboards and educated guesses. This can be unsatisfactory for both inference service providers and end users: providers usually prioritize cost efficiency, while end users usually prioritize model output quality for their inference requests. In commercial settings, these two priorities are often brought together in Service Level Agreements (SLAs). We present MESS+, an online stochastic optimization algorithm for energy-optimal model selection from a model zoo, which works on a per-inference-request basis. For a given SLA that requires high accuracy, MESS+ is up to 2.5x more energy efficient than randomly selecting an LLM from the zoo, while maintaining SLA quality constraints.
LG Jun 22, 2024
Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition
Kai Shao, Rui Wang, Yixue Hao et al.
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signal representation learning framework that uses a Siamese architecture with multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representations associated with stimulation tasks, we propose a semantic consistency contrast module that aims to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signal datasets indicate that MRLMC outperforms state-of-the-art models. Moreover, our proposed framework can be transferred to multimodal time series downstream tasks.
SY Jun 30, 2017
On the Effects of Distributed Electric Vehicle Network Utility Maximization in Low Voltage Feeders
Jose Rivera, Hans Arno Jacobsen
The fast charging of Electric Vehicles (EVs) in distribution networks requires real-time EV charging control to avoid overloading grid components. Recent studies have proposed congestion control protocols, which result from distributed optimization solutions to the Network Utility Maximization (NUM) problem. While the NUM formulation allows the definition of distributed computations with closed-form solutions, its simple model does not account for many of the feeders' operational constraints. This calls the effectiveness of the resulting control algorithms into question. In this paper, we investigate the impact of implementing such algorithms for congestion control in low voltage feeders. We review the latest NUM-based algorithms for real-time EV charging control, and evaluate their behavior and impact on the comprehensive IEEE European Low Voltage Test Feeder. Our results show that the EV NUM problem can effectively capture the relevant operational constraints, as long as ampacity violations are the main bottleneck. Moreover, the results demonstrate an advantage of the primal NUM solution over the more conventional dual NUM solution in preventing a system overload.