Timo Milbich

CV
h-index19
17papers
832citations
Novelty51%
AI Score46

17 Papers

CVJan 8
Atlas 2 -- Foundation models for clinical deployment

Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie et al.

Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions Charité - Universtätsmedizin Berlin, LMU Munich, and Mayo Clinic.

LGJul 20, 2021Code
Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning

Timo Milbich, Karsten Roth, Samarth Sinha et al.

Deep Metric Learning (DML) aims to find representations suitable for zero-shot transfer to a priori unknown test distributions. However, common evaluation protocols only test a single, fixed data split in which train and test classes are assigned randomly. More realistic evaluations should consider a broad spectrum of distribution shifts with potentially varying degree and difficulty. In this work, we systematically construct train-test splits of increasing difficulty and present the ooDML benchmark to characterize generalization under out-of-distribution shifts in DML. ooDML is designed to probe the generalization performance on much more challenging, diverse train-to-test distribution shifts. Based on our new benchmark, we conduct a thorough empirical analysis of state-of-the-art DML methods. We find that while generalization tends to consistently degrade with difficulty, some methods are better at retaining performance as the distribution shift increases. Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML. Code available here: https://github.com/CompVis/Characterizing_Generalization_in_DML.

CVSep 17, 2020Code
S2SD: Simultaneous Similarity-based Self-Distillation for Deep Metric Learning

Karsten Roth, Timo Milbich, Björn Ommer et al.

Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose \emph{Simultaneous Similarity-based Self-distillation (S2SD). S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training while retaining test-time cost and with negligible changes to the training time. Experiments and ablations across different objectives and standard benchmarks show S2SD offers notable improvements of up to 7% in Recall@1, while also setting a new state-of-the-art. Code available at https://github.com/MLforHealth/S2SD.

CVMar 24, 2020Code
PADS: Policy-Adapted Sampling for Visual Similarity Learning

Karsten Roth, Timo Milbich, Björn Ommer

Learning visual similarity requires to learn relations, typically between triplets of images. Albeit triplet approaches being powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm are fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We, therefore, employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach competitive results to state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures. Code can be found under https://github.com/Confusezius/CVPR2020_PADS.

CVFeb 19, 2020Code
Revisiting Training Strategies and Generalization Performance in Deep Metric Learning

Karsten Roth, Timo Milbich, Samarth Sinha et al.

Deep Metric Learning (DML) is arguably one of the most influential lines of research for learning visual similarities with many proposed approaches every year. Although the field benefits from the rapid progress, the divergence in training protocols, architectures, and parameter choices make an unbiased comparison difficult. To provide a consistent reference point, we revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices as well as the commonly neglected mini-batch sampling process. Under consistent comparison, DML objectives show much higher saturation than indicated by literature. Further based on our analysis, we uncover a correlation between the embedding space density and compression to the generalization performance of DML models. Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets. Code and a publicly accessible WandB-repo are available at https://github.com/Confusezius/Revisiting_Deep_Metric_Learning_PyTorch.

IVJan 8, 2024
RudolfV: A Foundation Model by Pathologists for Pathologists

Jonas Dippel, Barbara Feulner, Tobias Winterhoff et al.

Artificial intelligence has started to transform histopathology impacting clinical diagnostics and biomedical research. However, while many computational pathology approaches have been proposed, most current AI models are limited with respect to generalization, application variety, and handling rare diseases. Recent efforts introduced self-supervised foundation models to address these challenges, yet existing approaches do not leverage pathologist knowledge by design. In this study, we present a novel approach to designing foundation models for computational pathology, incorporating pathologist expertise, semi-automated data curation, and a diverse dataset from over 15 laboratories, including 58 tissue types, and encompassing 129 different histochemical and immunohistochemical staining modalities. We demonstrate that our model "RudolfV" surpasses existing state-of-the-art foundation models across different benchmarks focused on tumor microenvironment profiling, biomarker evaluation, and reference case search while exhibiting favorable robustness properties. Our study shows how domain-specific knowledge can increase the efficiency and performance of pathology foundation models and enable novel application areas.

CVJan 9, 2025
Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics

Maximilian Alber, Stephan Tietz, Jonas Dippel et al.

Recent advances in digital pathology have demonstrated the effectiveness of foundation models across diverse applications. In this report, we present Atlas, a novel vision foundation model based on the RudolfV approach. Our model was trained on a dataset comprising 1.2 million histopathology whole slide images, collected from two medical institutions: Mayo Clinic and Charité - Universtätsmedizin Berlin. Comprehensive evaluations show that Atlas achieves state-of-the-art performance across twenty-one public benchmark datasets, even though it is neither the largest model by parameter count nor by training dataset size.

CVJan 5
Mind the Gap: Continuous Magnification Sampling for Pathology Foundation Models

Alexander Möllers, Julius Hense, Florian Schulz et al.

In histopathology, pathologists examine both tissue architecture at low magnification and fine-grained morphology at high magnification. Yet, the performance of pathology foundation models across magnifications and the effect of magnification sampling during training remain poorly understood. We model magnification sampling as a multi-source domain adaptation problem and develop a simple theoretical framework that reveals systematic trade-offs between sampling strategies. We show that the widely used discrete uniform sampling of magnifications (0.25, 0.5, 1.0, 2.0 mpp) leads to degradation at intermediate magnifications. We introduce continuous magnification sampling, which removes gaps in magnification coverage while preserving performance at standard scales. Further, we derive sampling distributions that optimize representation quality across magnification scales. To evaluate these strategies, we introduce two new benchmarks (TCGA-MS, BRACS-MS) with appropriate metrics. Our experiments show that continuous sampling substantially improves over discrete sampling at intermediate magnifications, with gains of up to 4 percentage points in balanced classification accuracy, and that optimized distributions can further improve performance. Finally, we evaluate current histopathology foundation models, finding that magnification is a primary driver of performance variation across models. Our work paves the way towards future pathology foundation models that perform reliably across magnifications.

CVJul 6, 2021
iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis

Andreas Blattmann, Timo Milbich, Michael Dorkenwald et al.

How would a static scene react to a local poke? What are the effects on other parts of an object if you could locally push it? There will be distinctive movement, despite evident variations caused by the stochastic nature of our world. These outcomes are governed by the characteristic kinematics of objects that dictate their overall motion caused by a local interaction. Conversely, the movement of an object provides crucial information about its underlying distinctive kinematics and the interdependencies between its parts. This two-way relation motivates learning a bijective mapping between object kinematics and plausible future image sequences. Therefore, we propose iPOKE -- invertible Prediction of Object Kinematics -- that, conditioned on an initial frame and a local poke, allows to sample object kinematics and establishes a one-to-one correspondence to the corresponding plausible videos, thereby providing a controlled stochastic video synthesis. In contrast to previous works, we do not generate arbitrary realistic videos, but provide efficient control of movements, while still capturing the stochastic nature of our environment and the diversity of plausible outcomes it entails. Moreover, our approach can transfer kinematics onto novel object instances and is not confined to particular object classes. Our project page is available at https://bit.ly/3dJN4Lf.

CVJun 21, 2021
Understanding Object Dynamics for Interactive Image-to-Video Synthesis

Andreas Blattmann, Timo Milbich, Michael Dorkenwald et al.

What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information of the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks. Project page is available at https://bit.ly/3cxfA2L .

CVMay 10, 2021
Stochastic Image-to-Video Synthesis using cINNs

Michael Dorkenwald, Timo Milbich, Andreas Blattmann et al.

Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis. Experiments on four diverse video datasets demonstrate the effectiveness of our approach in terms of both the quality and diversity of the synthesized results. Our project page is available at https://bit.ly/3t66bnU.

CVMar 8, 2021
Behavior-Driven Synthesis of Human Dynamics

Andreas Blattmann, Timo Milbich, Michael Dorkenwald et al.

Generating and representing human behavior are of major importance for various computer vision applications. Commonly, human video synthesis represents behavior as sequences of postures while directly predicting their likely progressions or merely changing the appearance of the depicted persons, thus not being able to exercise control over their actual behavior during the synthesis process. In contrast, controlled behavior synthesis and transfer across individuals requires a deep understanding of body dynamics and calls for a representation of behavior that is independent of appearance and also of specific postures. In this work, we present a model for human behavior synthesis which learns a dedicated representation of human dynamics independent of postures. Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or to even directly transfer behavior observed in a given video sequence. To this end, we propose a conditional variational framework which explicitly disentangles posture from behavior. We demonstrate the effectiveness of our approach on this novel task, evaluating capturing, transferring, and sampling fine-grained, diverse behavior, both quantitatively and qualitatively. Project page is available at https://cutt.ly/5l7rXEp

CVApr 28, 2020
DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning

Timo Milbich, Karsten Roth, Homanga Bharadhwaj et al.

Visual Similarity plays an important role in many computer vision applications. Deep metric learning (DML) is a powerful framework for learning such similarities which not only generalize from training data to identically distributed test distributions, but in particular also translate to unknown test classes. However, its prevailing learning paradigm is class-discriminative supervised training, which typically results in representations specialized in separating training classes. For effective generalization, however, such an image representation needs to capture a diverse range of data characteristics. To this end, we propose and study multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting. Through simultaneous optimization of our tasks we learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance on multiple established DML benchmark datasets.

CVApr 12, 2020
Sharing Matters for Generalization in Deep Metric Learning

Timo Milbich, Karsten Roth, Biagio Brattoli et al.

Learning the similarity between images constitutes the foundation for numerous vision tasks. The common paradigm is discriminative metric learning, which seeks an embedding that separates different training classes. However, the main challenge is to learn a metric that not only generalizes from training to novel, but related, test samples. It should also transfer to different object classes. So what complementary information is missed by the discriminative paradigm? Besides finding characteristics that separate between classes, we also need them to likely occur in novel categories, which is indicated if they are shared across training classes. This work investigates how to learn such characteristics without the need for extra annotations or training data. By formulating our approach as a novel triplet sampling strategy, it can be easily applied on top of recent ranking loss frameworks. Experiments show that, independent of the underlying network architecture and the specific ranking loss, our approach significantly improves performance in deep metric learning, leading to new the state-of-the-art results on various standard benchmark datasets. Preliminary early access page can be found here: https://ieeexplore.ieee.org/document/9141449

CVNov 18, 2019
Unsupervised Representation Learning by Discovering Reliable Image Relations

Timo Milbich, Omair Ghori, Ferran Diego et al.

Learning robust representations that allow to reliably establish relations between images is of paramount importance for virtually all of computer vision. Annotating the quadratic number of pairwise relations between training images is simply not feasible, while unsupervised inference is prone to noise, thus leaving the vast majority of these relations to be unreliable. To nevertheless find those relations which can be reliably utilized for learning, we follow a divide-and-conquer strategy: We find reliable similarities by extracting compact groups of images and reliable dissimilarities by partitioning these groups into subsets, converting the complicated overall problem into few reliable local subproblems. For each of the subsets we obtain a representation by learning a mapping to a target feature space so that their reliable relations are kept. Transitivity relations between the subsets are then exploited to consolidate the local solutions into a concerted global representation. While iterating between grouping, partitioning, and learning, we can successively use more and more reliable relations which, in turn, improves our image representation. In experiments, our approach shows state-of-the-art performance on unsupervised classification on ImageNet with 46.0% and competes favorably on different transfer learning tasks on PASCAL VOC.

CVMar 16, 2019
Unsupervised Part-Based Disentangling of Object Shape and Appearance

Dominik Lorenz, Leonard Bereska, Timo Milbich et al.

Large intra-class variation is the result of changes in multiple object characteristics. Images, however, only show the superposition of different variable factors such as appearance or shape. Therefore, learning to disentangle and represent these different characteristics poses a great challenge, especially in the unsupervised case. Moreover, large object articulation calls for a flexible part-based model. We present an unsupervised approach for disentangling appearance and shape by learning parts consistently over all instances of a category. Our model for learning an object representation is trained by simultaneously exploiting invariance and equivariance constraints between synthetically transformed images. Since no part annotation or prior information on an object class is required, the approach is applicable to arbitrary classes. We evaluate our approach on a wide range of object categories and diverse tasks including pose prediction, disentangled image synthesis, and video-to-video translation. The approach outperforms the state-of-the-art on unsupervised keypoint prediction and compares favorably even against supervised approaches on the task of shape and appearance transfer.

CVAug 3, 2017
Unsupervised Video Understanding by Reconciliation of Posture Similarities

Timo Milbich, Miguel Bautista, Ekaterina Sutter et al.

Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents, the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. Therefore, we propose a completely unsupervised deep learning procedure based solely on video sequences, which starts from scratch without requiring pre-trained networks, predefined body models, or keypoints. A combinatorial sequence matching algorithm proposes relations between frames from subsets of the training data, while a CNN is reconciling the transitivity conflicts of the different subsets to learn a single concerted pose embedding despite changes in appearance across sequences. Without any manual annotation, the model learns a structured representation of postures and their temporal development. The model not only enables retrieval of similar postures but also temporal super-resolution. Additionally, based on a recurrent formulation, next frames can be synthesized.