Lukas Schott

CV
h-index25
15papers
1,031citations
Novelty46%
AI Score45

15 Papers

LGAug 15, 2024
Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning

Lukas Kirchdorfer, Cathrin Elich, Simon Kutsche et al. · eth-zurich

With the rise of neural networks in various domains, multi-task learning (MTL) gained significant relevance. A key challenge in MTL is balancing individual task losses during neural network training to improve performance and efficiency through knowledge sharing across tasks. To address these challenges, we propose a novel task-weighting method by building on the most prevalent approach of Uncertainty Weighting and computing analytically optimal uncertainty-based weights, normalized by a softmax function with tunable temperature. Our approach yields comparable results to the combinatorially prohibitive, brute-force approach of Scalarization while offering a more cost-effective yet high-performing alternative. We conduct an extensive benchmark on various datasets and architectures. Our method consistently outperforms six other common weighting methods. Furthermore, we report noteworthy experimental findings for the practical application of MTL. For example, larger networks diminish the influence of weighting methods, and tuning the weight decay has a low impact compared to the learning rate.

MLOct 6, 2022
Understanding Neural Coding on Latent Manifolds by Sharing Features and Dividing Ensembles

Martin Bjerke, Lukas Schott, Kristopher T. Jensen et al.

Systems neuroscience relies on two complementary views of neural data, characterized by single neuron tuning curves and analysis of population activity. These two perspectives combine elegantly in neural latent variable models that constrain the relationship between latent variables and neural activity, modeled by simple tuning curve functions. This has recently been demonstrated using Gaussian processes, with applications to realistic and topologically relevant latent manifolds. Those and previous models, however, missed crucial shared coding properties of neural populations. We propose feature sharing across neural tuning curves which significantly improves performance and helps optimization. We also propose a solution to the ensemble detection problem, where different groups of neurons, i.e., ensembles, can be modulated by different latent manifolds. Achieved through a soft clustering of neurons during training, this allows for the separation of mixed neural populations in an unsupervised manner. These innovations lead to more interpretable models of neural population activity that train well and perform better even on mixtures of complex latent manifolds. Finally, we apply our method on a recently published grid cell dataset, and recover distinct ensembles, infer toroidal latents and predict neural tuning curves in a single integrated modeling framework.

LGNov 8, 2023
Examining Common Paradigms in Multi-Task Learning

Cathrin Elich, Lukas Kirchdorfer, Jan M. Köhler et al. · eth-zurich

While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we investigate paradigms in MTL in the context of STL: First, the impact of the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL empirically in various experiments. To further investigate Adam's effectiveness, we theoretical derive a partial loss-scale invariance under mild assumptions. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.

35.9CVApr 28
The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh et al.

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, \textit{canonical} logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99\% of the teacher's mIoU on Cityscapes (79.0 vs.\ 79.8) and 92\% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.

CVDec 30, 2024
Attention Is All You Need For Mixture-of-Depths Routing

Advait Gadhikar, Souptik Kumar Majumdar, Niclas Popp et al.

Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanism A-MoD that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, A-MoD allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines. Furthermore, A-MoD improves the MoD training convergence, leading to up to 2x faster transfer learning.

CVJun 2, 2025
Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

Niclas Popp, Kevin Alexander Laube, Matthias Hein et al.

Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines

CVDec 13, 2024
Single-Pass Object-Focused Data Selection

Niclas Popp, Dan Zhang, Jan Hendrik Metzen et al.

While unlabeled image data is often plentiful, the costs of high-quality labels pose an important practical challenge: Which images should one select for labeling to use the annotation budget for a particular target task most effectively? To address this problem, we focus on single-pass data selection, which refers to the process of selecting all data to be annotated at once before training a downstream model. Prior methods for single-pass data selection rely on image-level representations and fail to reliably outperform random selection for object detection and segmentation. We propose Object-Focused Data Selection (OFDS) which leverages object-level features from foundation models and ensures semantic coverage of all target classes. In extensive experiments across tasks and target domains, OFDS consistently outperforms random selection and all baselines. The best results for constrained annotation budgets are obtained by combining human labels from OFDS with autolabels from foundation models. Moreover, using OFDS to select the initial labeled set for active learning yields consistent improvements

CVMay 6, 2024
Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data

Leonhard Hennicke, Christian Medeiros Adriano, Holger Giese et al.

Generative foundation models like Stable Diffusion comprise a diverse spectrum of knowledge in computer vision with the potential for transfer learning, e.g., via generating data to train student models for downstream tasks. This could circumvent the necessity of collecting labeled real-world data, thereby presenting a form of data-free knowledge distillation. However, the resultant student models show a significant drop in accuracy compared to models trained on real data. We investigate possible causes for this drop and focus on the role of the different layers of the student model. By training these layers using either real or synthetic data, we reveal that the drop mainly stems from the model's final layers. Further, we briefly investigate other factors, such as differences in data-normalization between synthetic and real, the impact of data augmentations, texture vs.\ shape learning, and assuming oracle prompts. While we find that some of those factors can have an impact, they are not sufficient to close the gap towards real data. Building upon our insights that mainly later layers are responsible for the drop, we investigate the data-efficiency of fine-tuning a synthetically trained model with real data applied to only those last layers. Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy. Our findings contribute to the understanding of the gap between synthetic and real data and indicate solutions to mitigate the scarcity of labeled real data.

MLOct 1, 2021
Score-Based Generative Classifiers

Roland S. Zimmermann, Lukas Schott, Yang Song et al.

The tremendous success of generative models in recent years raises the question whether they can also be used to perform classification. Generative models have been used as adversarially robust classifiers on simple datasets such as MNIST, but this robustness has not been observed on more complex datasets like CIFAR-10. Additionally, on natural image datasets, previous results have suggested a trade-off between the likelihood of the data and classification accuracy. In this work, we investigate score-based generative models as classifiers for natural images. We show that these models not only obtain competitive likelihood values but simultaneously achieve state-of-the-art classification accuracy for generative classifiers on CIFAR-10. Nevertheless, we find that these models are only slightly, if at all, more robust than discriminative baseline models on out-of-distribution tasks based on common image corruptions. Similarly and contrary to prior results, we find that score-based are prone to worst-case distribution shifts in the form of adversarial perturbations. Our work highlights that score-based generative models are closing the gap in classification accuracy compared to standard discriminative models. While they do not yet deliver on the promise of adversarial and out-of-domain robustness, they provide a different approach to classification that warrants further research.

LGJul 17, 2021
Visual Representation Learning Does Not Generalize Strongly Within the Same Domain

Lukas Schott, Julius von Kügelgen, Frederik Träuble et al.

An important component for generalization in machine learning is to uncover underlying latent factors of variation as well as the mechanism through which each factor acts in the world. In this paper, we test whether 17 unsupervised, weakly supervised, and fully supervised representation learning approaches correctly infer the generative factors of variation in simple datasets (dSprites, Shapes3D, MPI3D) from controlled environments, and on our contributed CelebGlow dataset. In contrast to prior robustness work that introduces novel factors of variation during test time, such as blur or other (un)structured noise, we here recompose, interpolate, or extrapolate only existing factors of variation from the training data set (e.g., small and medium-sized objects during training and large objects during testing). Models that learn the correct mechanism should be able to generalize to this benchmark. In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. Moreover, the generalization capabilities of all tested models drop significantly as we move from artificial datasets towards more realistic real-world datasets. Despite their inability to identify the correct mechanism, the models are quite modular as their ability to infer other in-distribution factors remains fairly stable, providing only a single factor is out-of-distribution. These results point to an important yet understudied problem of learning mechanistic models of observations that can facilitate generalization.

MLJul 21, 2020
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

David Klindt, Lukas Schott, Yash Sharma et al.

We construct an unsupervised learning model that achieves nonlinear disentanglement of underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they will transfer to natural scenes. Here we provide evidence that objects in segmented natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for unsupervised representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several established benchmark datasets, often surpassing the current state-of-the-art. We additionally demonstrate transferability towards video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.

CVJan 16, 2020
A simple way to make neural networks robust against diverse image corruptions

Evgenia Rusak, Lukas Schott, Roland S. Zimmermann et al.

The human visual system is remarkably robust against a wide range of naturally occurring variations and corruptions like rain or snow. In contrast, the performance of modern image recognition models strongly degrades when evaluated on previously unseen corruptions. Here, we demonstrate that a simple but properly tuned training with additive Gaussian and Speckle noise generalizes surprisingly well to unseen corruptions, easily reaching the previous state of the art on the corruption benchmark ImageNet-C (with ResNet50) and on MNIST-C. We build on top of these strong baseline results and show that an adversarial training of the recognition model against uncorrelated worst-case noise distributions leads to an additional increase in performance. This regularization can be combined with previously proposed defense methods for further improvement.

CVMay 23, 2018
Towards the first adversarially robust neural network model on MNIST

Lukas Schott, Jonas Rauber, Matthias Bethge et al.

Despite much effort, deep neural networks remain highly susceptible to tiny input perturbations and even for MNIST, one of the most common toy datasets in computer vision, no neural network model exists for which adversarial perturbations are large and make semantic sense to humans. We show that even the widely recognized and by far most successful defense by Madry et al. (1) overfits on the L-infinity metric (it's highly susceptible to L2 and L0 perturbations), (2) classifies unrecognizable images with high certainty, (3) performs not much better than simple input binarization and (4) features adversarial perturbations that make little sense to humans. These results suggest that MNIST is far from being solved in terms of adversarial robustness. We present a novel robust classification model that performs analysis by synthesis using learned class-conditional data distributions. We derive bounds on the robustness and go to great length to empirically evaluate our model using maximally effective adversarial attacks by (a) applying decision-based, score-based, gradient-based and transfer-based attacks for several different Lp norms, (b) by designing a new attack that exploits the structure of our defended model and (c) by devising a novel decision-based attack that seeks to minimize the number of perturbed pixels (L0). The results suggest that our approach yields state-of-the-art robustness on MNIST against L0, L2 and L-infinity perturbations and we demonstrate that most adversarial examples are strongly perturbed towards the perceptual boundary between the original and the adversarial class.

CVApr 7, 2017
Learned Watershed: End-to-End Learning of Seeded Segmentation

Steffen Wolf, Lukas Schott, Ullrich Köthe et al.

Learned boundary maps are known to outperform hand- crafted ones as a basis for the watershed algorithm. We show, for the first time, how to train watershed computation jointly with boundary map prediction. The estimator for the merging priorities is cast as a neural network that is con- volutional (over space) and recurrent (over iterations). The latter allows learning of complex shape priors. The method gives the best known seeded segmentation results on the CREMI segmentation challenge.

LGNov 19, 2015
Comparative Study of Deep Learning Software Frameworks

Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott et al.

Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of five deep learning frameworks, namely Caffe, Neon, TensorFlow, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks support various convolutional algorithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best performance on GPU for training and deployment of LSTM networks. Caffe is the easiest for evaluating the performance of standard deep architectures. Finally, TensorFlow is a very flexible framework, similar to Theano, but its performance is currently not competitive compared to the other studied frameworks.