LGMay 20Code
Divide and Contrast: Learning Robust Temporal Features without AugmentationAbdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Self-supervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps. Di-COT stochastically partitions each window into a small number of overlapping sub-blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions. To further improve scalability, we adopt a contrastive objective whose computation depends on the batch size and the number of sub-blocks, making loss computation independent of sequence length. Extensive experiments on six large-scale real-world datasets, as well as the UCR and UEA benchmarks, demonstrate that Di-COT learns semantically structured and transferable representations, achieving state-of-the-art performance on classification, clustering, $k$NN, and cross-dataset transfer, while substantially reducing training time. The source code is publicly available at https://github.com/sfi-norwai/Di-COT.
LGJul 20, 2025Code
eMargin: Revisiting Contrastive Learning with Margin-Based SeparationAbdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
We revisit previous contrastive learning frameworks to investigate the effect of introducing an adaptive margin into the contrastive loss function for time series representation learning. Specifically, we explore whether an adaptive margin (eMargin), adjusted based on a predefined similarity threshold, can improve the separation between adjacent but dissimilar time steps and subsequently lead to better performance in downstream tasks. Our study evaluates the impact of this modification on clustering performance and classification in three benchmark datasets. Our findings, however, indicate that achieving high scores on unsupervised clustering metrics does not necessarily imply that the learned embeddings are meaningful or effective in downstream tasks. To be specific, eMargin added to InfoNCE consistently outperforms state-of-the-art baselines in unsupervised clustering metrics, but struggles to achieve competitive results in downstream classification with linear probing. The source code is publicly available at https://github.com/sfi-norwai/eMargin.
LGOct 20, 2024Code
Contrast All the Time: Learning Time Series Representation from Temporal ConsistencyAbdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor
Representation learning for time series using contrastive learning has emerged as a critical technique for improving the performance of downstream tasks. To advance this effective approach, we introduce CaTT (\textit{Contrast All The Time}), a new approach to unsupervised contrastive learning for time series, which takes advantage of dynamics between temporally similar moments more efficiently and effectively than existing methods. CaTT departs from conventional time-series contrastive approaches that rely on data augmentations or selected views. Instead, it uses the full temporal dimension by contrasting all time steps in parallel. This is made possible by a scalable NT-pair formulation, which extends the classic N-pair loss across both batch and temporal dimensions, making the learning process end-to-end and more efficient. CaTT learns directly from the natural structure of temporal data, using repeated or adjacent time steps as implicit supervision, without the need for pair selection heuristics. We demonstrate that this approach produces superior embeddings which allow better performance in downstream tasks. Additionally, training is faster than other contrastive learning approaches, making it suitable for large-scale and real-world time series applications. The source code is publicly available at \href{https://github.com/sfi-norwai/CaTT}{https://github.com/sfi-norwai/CaTT}.
LGDec 5, 2021Code
Probabilistic Deep Learning to Quantify Uncertainty in Air Quality ForecastingAbdulmajid Murad, Frank Alexander Kraemer, Kerstin Bach et al.
Data-driven forecasts of air quality have recently achieved more accurate short-term predictions. Despite their success, most of the current data-driven solutions lack proper quantifications of model uncertainty that communicate how much to trust the forecasts. Recently, several practical tools to estimate uncertainty have been developed in probabilistic deep learning. However, there have not been empirical applications and extensive comparisons of these tools in the domain of air quality forecasts. Therefore, this work applies state-of-the-art techniques of uncertainty quantification in a real-world setting of air quality forecasts. Through extensive experiments, we describe training probabilistic models and evaluate their predictive uncertainties based on empirical performance, reliability of confidence estimate, and practical applicability. We also propose improving these models using "free" adversarial training and exploiting temporal and spatial correlation inherent in air quality data. Our experiments demonstrate that the proposed models perform better than previous works in quantifying uncertainty in data-driven air quality forecasts. Overall, Bayesian neural networks provide a more reliable uncertainty estimate but can be challenging to implement and scale. Other scalable methods, such as deep ensemble, Monte Carlo (MC) dropout, and stochastic weight averaging-Gaussian (SWAG), can perform well if applied correctly but with different tradeoffs and slight variations in performance metrics. Finally, our results show the practical impact of uncertainty estimation and demonstrate that, indeed, probabilistic models are more suitable for making informed decisions. Code and dataset are available at \url{https://github.com/Abdulmajid-Murad/deep_probabilistic_forecast}
LGApr 1, 2020Code
MetaPoison: Practical General-purpose Clean-label Data PoisoningW. Ronny Huang, Jonas Geiping, Liam Fowl et al.
Data poisoning -- the process by which an attacker takes control of a model by making imperceptible changes to a subset of the training data -- is an emerging threat in the context of neural networks. Existing attacks for data poisoning neural networks have relied on hand-crafted heuristics, because solving the poisoning problem directly via bilevel optimization is generally thought of as intractable for deep models. We propose MetaPoison, a first-order method that approximates the bilevel problem via meta-learning and crafts poisons that fool neural networks. MetaPoison is effective: it outperforms previous clean-label poisoning methods by a large margin. MetaPoison is robust: poisoned data made for one model transfer to a variety of victim models with unknown training settings and architectures. MetaPoison is general-purpose, it works not only in fine-tuning scenarios, but also for end-to-end training from scratch, which till now hasn't been feasible for clean-label attacks with deep nets. MetaPoison can achieve arbitrary adversary goals -- like using poisons of one class to make a target image don the label of another arbitrarily chosen class. Finally, MetaPoison works in the real-world. We demonstrate for the first time successful data poisoning of models trained on the black-box Google Cloud AutoML API. Code and premade poisons are provided at https://github.com/wronnyhuang/metapoison
LGApr 29, 2019Code
Adversarial Training for Free!Ali Shafahi, Mahyar Najibi, Amin Ghiasi et al.
Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks. Unfortunately, the high cost of generating strong adversarial examples makes standard adversarial training impractical on large-scale problems like ImageNet. We present an algorithm that eliminates the overhead cost of generating adversarial examples by recycling the gradient information computed when updating model parameters. Our "free" adversarial training algorithm achieves comparable robustness to PGD adversarial training on the CIFAR-10 and CIFAR-100 datasets at negligible additional cost compared to natural training, and can be 7 to 30 times faster than other strong adversarial training methods. Using a single workstation with 4 P100 GPUs and 2 days of runtime, we can train a robust model for the large-scale ImageNet classification task that maintains 40% accuracy against PGD attacks. The code is available at https://github.com/ashafahi/free_adv_train.
LGJan 3, 2022
Execute Order 66: Targeted Data Poisoning for Reinforcement LearningHarrison Foley, Liam Fowl, Tom Goldstein et al.
Data poisoning for reinforcement learning has historically focused on general performance degradation, and targeted attacks have been successful via perturbations that involve control of the victim's policy and rewards. We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states - all while minimally modifying a small fraction of training observations without assuming any control over policy or reward. We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning. We test our method and demonstrate success in two Atari games of varying difficulty.
CVJan 20, 2021
LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial RecognitionValeriia Cherepanova, Micah Goldblum, Harrison Foley et al.
Facial recognition systems are increasingly deployed by private corporations, government agencies, and contractors for consumer services and mass surveillance programs alike. These systems are typically built by scraping social media profiles for user images. Adversarial perturbations have been proposed for bypassing facial recognition systems. However, existing methods fail on full-scale systems and commercial APIs. We develop our own adversarial filter that accounts for the entire image processing pipeline and is demonstrably effective against industrial-grade pipelines that include face detection and large scale databases. Additionally, we release an easy-to-use webtool that significantly degrades the accuracy of Amazon Rekognition and the Microsoft Azure Face Recognition API, reducing the accuracy of each to below 1%.
LGOct 19, 2020
Robust Optimization as Data Augmentation for Large-scale GraphsKezhi Kong, Guohao Li, Mucong Ding et al.
Data augmentation helps neural networks generalize better by enlarging the training set, but it remains an open question how to effectively augment graph data to enhance the performance of GNNs (Graph Neural Networks). While most existing graph regularizers focus on manipulating graph topological structures by adding/removing edges, we offer a method to augment node features for better performance. We propose FLAG (Free Large-scale Adversarial Augmentation on Graphs), which iteratively augments node features with gradient-based adversarial perturbations during training. By making the model invariant to small fluctuations in input data, our method helps models generalize to out-of-distribution samples and boosts model performance at test time. FLAG is a general-purpose approach for graph data, which universally works in node classification, link prediction, and graph classification tasks. FLAG is also highly flexible and scalable, and is deployable with arbitrary GNN backbones and large-scale datasets. We demonstrate the efficacy and stability of our method through extensive experiments and ablation studies. We also provide intuitive observations for a deeper understanding of our method.
LGOct 8, 2020
Information-Driven Adaptive Sensing Based on Deep Reinforcement LearningAbdulmajid Murad, Frank Alexander Kraemer, Kerstin Bach et al.
In order to make better use of deep reinforcement learning in the creation of sensing policies for resource-constrained IoT devices, we present and study a novel reward function based on the Fisher information value. This reward function enables IoT sensor devices to learn to spend available energy on measurements at otherwise unpredictable moments, while conserving energy at times when measurements would provide little new information. This is a highly general approach, which allows for a wide range of use cases without significant human design effort or hyper-parameter tuning. We illustrate the approach in a scenario of workplace noise monitoring, where results show that the learned behavior outperforms a uniform sampling strategy and comes close to a near-optimal oracle solution.
CVSep 4, 2020
Witches' Brew: Industrial Scale Data Poisoning via Gradient MatchingJonas Geiping, Liam Fowl, W. Ronny Huang et al.
Data Poisoning attacks modify training data to maliciously control a model trained on such data. In this work, we focus on targeted poisoning attacks which cause a reclassification of an unmodified test image and as such breach model integrity. We consider a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the training data. Previous poisoning attacks against deep neural networks in this setting have been limited in scope and success, working only in simplified settings or being prohibitively expensive for large datasets. The central mechanism of the new attack is matching the gradient direction of malicious examples. We analyze why this works, supplement with practical considerations. and show its threat to real-world practitioners, finding that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset. Finally we demonstrate the limitations of existing defensive strategies against such an attack, concluding that data poisoning is a credible threat, even for large-scale deep learning systems.
MLMay 15, 2019
Transferable Clean-Label Poisoning Attacks on Deep Neural NetsChen Zhu, W. Ronny Huang, Ali Shafahi et al.
Clean-label poisoning attacks inject innocuous looking (and "correctly" labeled) poison images into training data, causing a model to misclassify a targeted image after being trained on this data. We consider transferable poisoning attacks that succeed without access to the victim network's outputs, architecture, or (in some cases) training data. To achieve this, we propose a new "polytope attack" in which poison images are designed to surround the targeted image in feature space. We also demonstrate that using Dropout during poison creation helps to enhance transferability of this attack. We achieve transferable attack success rates of over 50% while poisoning only 1% of the training set.
LGMay 10, 2019
Autonomous Management of Energy-Harvesting IoT Nodes Using Deep Reinforcement LearningAbdulmajid Murad, Frank Alexander Kraemer, Kerstin Bach et al.
Reinforcement learning (RL) is capable of managing wireless, energy-harvesting IoT nodes by solving the problem of autonomous management in non-stationary, resource-constrained settings. We show that the state-of-the-art policy-gradient approaches to RL are appropriate for the IoT domain and that they outperform previous approaches. Due to the ability to model continuous observation and action spaces, as well as improved function approximation capability, the new approaches are able to solve harder problems, permitting reward functions that are better aligned with the actual application goals. We show such a reward function and use policy-gradient approaches to learn capable policies, leading to behavior more appropriate for IoT nodes with less manual design effort, increasing the level of autonomy in IoT.
LGDec 28, 2017
Visualizing the Loss Landscape of Neural NetsHao Li, Zheng Xu, Gavin Taylor et al.
Neural network training relies on our ability to find "good" minimizers of highly non-convex loss functions. It is well-known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effects on the underlying loss landscape, are not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple "filter normalization" method that helps us visualize loss function curvature and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.
LGJun 9, 2017
Adaptive Consensus ADMM for Distributed OptimizationZheng Xu, Gavin Taylor, Hao Li et al.
The alternating direction method of multipliers (ADMM) is commonly used for distributed model fitting problems, but its performance and reliability depend strongly on user-defined penalty parameters. We study distributed ADMM methods that boost performance by using different fine-tuned algorithm parameters on each worker node. We present a O(1/k) convergence rate for adaptive ADMM methods with node-specific parameters, and propose adaptive consensus ADMM (ACADMM), which automatically tunes parameters without user oversight.
LGMay 6, 2016
Training Neural Networks Without Gradients: A Scalable ADMM ApproachGavin Taylor, Ryan Burmeister, Zheng Xu et al.
With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don't scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps. The proposed method reduces the network training problem to a sequence of minimization sub-steps that can each be solved globally in closed form. The proposed method is advantageous because it avoids many of the caveats that make gradient methods slow on highly non-convex problems. The method exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.
LGDec 5, 2015
Variance Reduction for Distributed Stochastic Gradient DescentSoham De, Gavin Taylor, Tom Goldstein
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms perform favorably to other stochastic methods.
CVOct 15, 2015
Layer-Specific Adaptive Learning Rates for Deep NetworksBharat Singh, Soham De, Yangmuzi Zhang et al.
The increasing complexity of deep learning architectures is resulting in training time requiring weeks or even months. This slow training is due in part to vanishing gradients, in which the gradients used by back-propagation are extremely large for weights connecting deep layers (layers near the output layer), and extremely small for shallow layers (near the input layer); this results in slow learning in the shallow layers. Additionally, it has also been shown that in highly non-convex problems, such as deep neural networks, there is a proliferation of high-error low curvature saddle points, which slows down learning dramatically. In this paper, we attempt to overcome the two above problems by proposing an optimization method for training deep neural networks which uses learning rates which are both specific to each layer in the network and adaptive to the curvature of the function, increasing the learning rate at low curvature points. This enables us to speed up learning in the shallow layers of the network and quickly escape high-error low curvature saddle points. We test our method on standard image classification datasets such as MNIST, CIFAR10 and ImageNet, and demonstrate that our method increases accuracy as well as reduces the required training time over standard algorithms.
DCApr 8, 2015
Unwrapping ADMM: Efficient Distributed Computing via Transpose ReductionTom Goldstein, Gavin Taylor, Kawika Barabin et al.
Recent approaches to distributed model fitting rely heavily on consensus ADMM, where each node solves small sub-problems using only local data. We propose iterative methods that solve {\em global} sub-problems over an entire distributed dataset. This is possible using transpose reduction strategies that allow a single node to solve least-squares over massive datasets without putting all the data in one place. This results in simple iterative methods that avoid the expensive inner loops required for consensus methods. To demonstrate the efficiency of this approach, we fit linear classifiers and sparse linear models to datasets over 5 Tb in size using a distributed implementation with over 7000 cores in far less time than previous approaches.
AIApr 16, 2014
An Analysis of State-Relevance Weights and Sampling Distributions on L1-Regularized Approximate Linear Programming Approximation AccuracyGavin Taylor, Connor Geer, David Piekut
Recent interest in the use of $L_1$ regularization in the use of value function approximation includes Petrik et al.'s introduction of $L_1$-Regularized Approximate Linear Programming (RALP). RALP is unique among $L_1$-regularized approaches in that it approximates the optimal value function using off-policy samples. Additionally, it produces policies which outperform those of previous methods, such as LSPI. RALP's value function approximation quality is affected heavily by the choice of state-relevance weights in the objective function of the linear program, and by the distribution from which samples are drawn; however, there has been no discussion of these considerations in the previous literature. In this paper, we discuss and explain the effects of choices in the state-relevance weights and sampling distribution on approximation quality, using both theoretical and experimental illustrations. The results provide insight not only onto these effects, but also provide intuition into the types of MDPs which are especially well suited for approximation with RALP.
LGOct 16, 2012
Value Function Approximation in Noisy Environments Using Locally Smoothed Regularized Approximate Linear ProgramsGavin Taylor, Ron Parr
Recently, Petrik et al. demonstrated that L1Regularized Approximate Linear Programming (RALP) could produce value functions and policies which compared favorably to established linear value function approximation techniques like LSPI. RALP's success primarily stems from the ability to solve the feature selection and value function approximation steps simultaneously. RALP's performance guarantees become looser if sampled next states are used. For very noisy domains, RALP requires an accurate model rather than samples, which can be unrealistic in some practical scenarios. In this paper, we demonstrate this weakness, and then introduce Locally Smoothed L1-Regularized Approximate Linear Programming (LS-RALP). We demonstrate that LS-RALP mitigates inaccuracies stemming from noise even without an accurate model. We show that, given some smoothness assumptions, as the number of samples increases, error from noise approaches zero, and provide experimental examples of LS-RALP's success on common reinforcement learning benchmark problems.