Brunó B. Englert

CV
h-index19
4papers
24citations
Novelty43%
AI Score32

4 Papers

CVJul 12, 2025Code
Simplifying Traffic Anomaly Detection with Video Foundation Models

Svetlana Orlova, Tommie Kerssies, Brunó B. Englert et al.

Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.

CVApr 25, 2025
What is the Added Value of UDA in the VFM Era?

Brunó B. Englert, Tommie Kerssies, Gijs Dubbelman

Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs have strong generalization from their pre-training, more straightforward, source-only fine-tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative for real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) if source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA's improvement over source-only fine-tuning of VFMs reduces from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is always higher in all synthetic data scenarios than source-only fine-tuning and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.

CVMar 11, 2025
VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation

Brunó B. Englert, Gijs Dubbelman

Unsupervised Domain Adaptation (UDA) enables strong generalization from a labeled source domain to an unlabeled target domain, often with limited data. In parallel, Vision Foundation Models (VFMs) pretrained at scale without labels have also shown impressive downstream performance and generalization. This motivates us to explore how UDA can best leverage VFMs. Prior work (VFM-UDA) demonstrated that replacing a standard ImageNet-pretrained encoder with a VFM improves generalization. However, it also showed that commonly used feature distance losses harm performance when applied to VFMs. Additionally, VFM-UDA does not incorporate multi-scale inductive biases, which are known to improve semantic segmentation. Building on these insights, we propose VFM-UDA++, which (1) investigates the role of multi-scale features, (2) adapts feature distance loss to be compatible with ViT-based VFMs and (3) evaluates how UDA benefits from increased synthetic source and real target data. By addressing these questions, we can improve performance on the standard GTA5 $\rightarrow$ Cityscapes benchmark by +1.4 mIoU. While prior non-VFM UDA methods did not scale with more data, VFM-UDA++ shows consistent improvement and achieves a further +2.4 mIoU gain when scaling the data, demonstrating that VFM-based UDA continues to benefit from increased data availability.

CVJun 14, 2024
Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation

Brunó B. Englert, Fabrizio J. Piva, Tommie Kerssies et al.

Achieving robust generalization across diverse data domains remains a significant challenge in computer vision. This challenge is important in safety-critical applications, where deep-neural-network-based systems must perform reliably under various environmental conditions not seen during training. Our study investigates whether the generalization capabilities of Vision Foundation Models (VFMs) and Unsupervised Domain Adaptation (UDA) methods for the semantic segmentation task are complementary. Results show that combining VFMs with UDA has two main benefits: (a) it allows for better UDA performance while maintaining the out-of-distribution performance of VFMs, and (b) it makes certain time-consuming UDA components redundant, thus enabling significant inference speedups. Specifically, with equivalent model sizes, the resulting VFM-UDA method achieves an 8.4$\times$ speed increase over the prior non-VFM state of the art, while also improving performance by +1.2 mIoU in the UDA setting and by +6.1 mIoU in terms of out-of-distribution generalization. Moreover, when we use a VFM with 3.6$\times$ more parameters, the VFM-UDA approach maintains a 3.3$\times$ speed up, while improving the UDA performance by +3.1 mIoU and the out-of-distribution performance by +10.3 mIoU. These results underscore the significant benefits of combining VFMs with UDA, setting new standards and baselines for Unsupervised Domain Adaptation in semantic segmentation.