CV Jul 18, 2022
Action-based Contrastive Learning for Trajectory Prediction
Marah Halawa, Olaf Hellwich, Pia Bideau
Trajectory prediction is an essential task for successful human-robot interaction, such as in autonomous driving. In this work, we address the problem of predicting future pedestrian trajectories in a first-person view setting with a moving camera. To that end, we propose a novel action-based contrastive learning loss that uses pedestrian action information to improve the learned trajectory embeddings. The fundamental idea behind this new loss is that trajectories of pedestrians performing the same action should be closer to each other in the feature space than the trajectories of pedestrians with significantly different actions. In other words, we argue that behavioral information about pedestrian action influences their future trajectory. Furthermore, we introduce a novel sampling strategy for trajectories that effectively increases the number of negative and positive contrastive samples. Additional synthetic trajectory samples are generated using a trained Conditional Variational Autoencoder (CVAE), which is at the core of several models developed for trajectory prediction. Results show that our proposed contrastive framework makes effective use of contextual information about pedestrian behavior, i.e., action, and learns a better trajectory representation. Thus, integrating the proposed contrastive framework within a trajectory prediction model improves its results and outperforms state-of-the-art methods on three trajectory prediction benchmarks [31, 32, 26].
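The idea that same-action trajectories should sit closer together in feature space can be illustrated with a supervised-contrastive style loss. This is a minimal sketch, not the authors' loss: the function name `action_contrastive_loss`, the temperature value, and the exact normalization are assumptions made for illustration.

```python
import numpy as np

def action_contrastive_loss(embeddings, actions, temperature=0.1):
    """SupCon-style loss (assumed form): trajectories sharing an action label
    are treated as positives, all other trajectories as negatives."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(actions)
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each anchor's similarities over all *other* samples.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denominator

    # Positives: same action label, excluding the anchor itself.
    positives = (actions[:, None] == actions[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0  # anchors without any positive are skipped
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())
```

With embeddings that already cluster by action, the loss is lower than with labels that cut across the clusters, which is the behavior the abstract describes. The sketch expects at least two samples and at least one anchor with a positive.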
CV Aug 3, 2022
Image-based Detection of Surface Defects in Concrete during Construction
Dominik Kuhnke, Monika Kwiatkowski, Olaf Hellwich
Defects increase the cost and duration of construction projects as they require significant inspection and documentation efforts. Automating defect detection could significantly reduce these efforts. This work focuses on detecting honeycombs, a substantial defect in concrete structures that may affect structural integrity. We compared honeycomb images scraped from the web with images obtained from real construction inspections. We found that web images do not capture the complete variance found in real-case scenarios and that there is still a lack of data in this domain. Our dataset is therefore freely available for further research. A Mask R-CNN and EfficientNet-B0 were trained for honeycomb detection. The Mask R-CNN model allows detecting honeycombs based on instance segmentation, whereas the EfficientNet-B0 model allows a patch-based classification. Our experiments demonstrate that both approaches are suitable for solving and automating honeycomb detection. In the future, this solution can be incorporated into defect documentation systems.
IR Aug 3, 2022
Content-Based Landmark Retrieval Combining Global and Local Features using Siamese Neural Networks
Tianyi Hu, Monika Kwiatkowski, Simon Matern et al.
In this work, we present a method for landmark retrieval that utilizes global and local features. A Siamese network is used for global feature extraction and metric learning, which gives an initial ranking of the landmark search. We use the feature maps extracted from the Siamese architecture as local descriptors; the search results are then further refined using the cosine similarity between local descriptors. We conduct a deeper analysis of the Google Landmark Dataset, which is used for evaluation, and augment the dataset to handle various intra-class variances. Furthermore, we conduct several experiments to compare the effects of transfer learning and metric learning, as well as experiments using other local descriptors. We show that re-ranking using local features can improve the search results. We believe that the proposed local feature extraction using cosine similarity is a simple approach that can be extended to many other retrieval tasks.
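The two-stage search described above (a global ranking followed by a local re-ranking) might look like the sketch below. The abstract does not say how local cosine similarities are aggregated into one score, so the mean best-match similarity in `local_score` is an assumption, as are all function and parameter names.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def local_score(query_local, cand_local):
    """Assumed aggregation: for every query descriptor take its best cosine
    match among the candidate's descriptors, then average over the query."""
    q = l2_normalize(query_local)
    c = l2_normalize(cand_local)
    return float((q @ c.T).max(axis=1).mean())

def retrieve(query_global, query_local, db_global, db_local, top_k=3):
    """Stage 1: shortlist by global cosine similarity.
    Stage 2: re-rank the shortlist by the local descriptor score."""
    global_sim = l2_normalize(db_global) @ l2_normalize(query_global)
    shortlist = np.argsort(-global_sim)[:top_k]
    scores = np.array([local_score(query_local, db_local[i]) for i in shortlist])
    return shortlist[np.argsort(-scores)]
```

Here `db_local` is a list with one `(num_descriptors, dim)` array per database image, standing in for flattened feature maps. Only the shortlist is compared locally, which keeps the more expensive pairwise descriptor comparison off the full database.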
20.2 CV May 6 Code
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
Gasser Elazab, Frank Neuhaus, Tilman Koß et al.
Autonomous driving must operate across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we establish a standardized evaluation protocol for road surface irregularities on CARD and benchmark state-of-the-art depth estimation models to provide strong baselines. The CARD dataset is hosted on https://huggingface.co/CARD-Data.
CV Sep 4, 2024
How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations
Florian Blume, Runfeng Qu, Pia Bideau et al.
Facial expression perception in humans inherently relies on prior knowledge and contextual cues, contributing to efficient and flexible processing. For instance, multi-modal emotional context (such as voice color, affective text, body pose, etc.) can prompt people to perceive emotional expressions in objectively neutral faces. Drawing inspiration from this, we introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. With this, our model offers visual insights into its internal decision-making process. We achieve this by learning two independent representations of content and context using a VAE-GAN architecture. Subsequently, we propose a novel attention mechanism for context-dependent feature adaptation. The adapted representation is used for classification and to generate a context-augmented expression. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations. We achieve State-of-the-Art classification accuracies of 81.01% on the RAVDESS dataset and 79.34% on the MEAD dataset. We make our code publicly available.
CV Jan 13
Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation
Runfeng Qu, Ole Hall, Pia K Bideau et al.
Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods in their spatial understanding as demonstrated by the Pairwise Localization Average Precision
CV Jul 24, 2024
Nonverbal Immediacy Analysis in Education: A Multimodal Computational Model
Uroš Petković, Jonas Frenkel, Olaf Hellwich et al.
This paper introduces a novel computational approach for analyzing nonverbal social behavior in educational settings. Integrating multimodal behavioral cues, including facial expressions, gesture intensity, and spatial dynamics, the model assesses the nonverbal immediacy (NVI) of teachers from RGB classroom videos. A dataset of 400 30-second video segments from German classrooms was constructed for model training and validation. The gesture intensity regressor achieved a correlation of 0.84, the perceived distance regressor 0.55, and the NVI model 0.44 with median human ratings. The model demonstrates the potential to provide valuable support in nonverbal behavior assessment, approximating the accuracy of individual human raters. Validated against both questionnaire data and trained observer ratings, our models show moderate to strong correlations with relevant educational outcomes, indicating their efficacy in reflecting effective teaching behaviors. This research advances the objective assessment of nonverbal communication behaviors, opening new pathways for educational research.
CV Dec 3, 2025
Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications
Gasser Elazab, Maximilian Jansen, Michael Unterreiner et al.
Accurate perception of the vehicle's 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point's height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera's height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance. Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
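The abstract defines gamma as height above the road plane divided by depth, and mentions a closed form for metric depth without stating it. One way to derive such a form, under assumed conventions, is sketched here. Assume a camera frame with x right, y down, z forward, a road plane satisfying n·X = h_c (unit normal n pointing from the camera toward the road, camera height h_c), and a pixel ray r = K⁻¹p scaled so that its z-component is 1. A point at depth Z is X = Z·r, its height above the plane is h = h_c − Z·(n·r), and with gamma = h / Z this rearranges to Z = h_c / (gamma + n·r). The function name and these sign conventions are illustrative and may differ from the paper's.

```python
import numpy as np

def depth_from_gamma(gamma, pixels, K, normal, camera_height):
    """Recover metric depth from gamma under the assumed conventions:
    plane n.X = camera_height, rays r = K^-1 [u, v, 1]^T with z-component 1."""
    homogeneous = np.column_stack([pixels, np.ones(len(pixels))])
    rays = (np.linalg.inv(K) @ homogeneous.T).T  # shape (N, 3), rays[:, 2] == 1
    denominator = gamma + rays @ normal           # gamma + n.r
    return camera_height / denominator
```

For a level camera with `normal = (0, 1, 0)`, a road pixel (gamma = 0) reduces to the familiar flat-ground relation Z = h_c / r_y. The denominator approaches zero near the horizon, so a practical implementation would need to mask or clamp those pixels.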
CV Oct 17, 2023
DIAR: Deep Image Alignment and Reconstruction using Swin Transformers
Monika Kwiatkowski, Simon Matern, Olaf Hellwich
When taking images of some occluded content, one is often faced with the problem that every individual image frame contains unwanted artifacts, but a collection of images contains all relevant information if properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them. We create a dataset that contains images with image distortions, such as lighting, specularities, shadows, and occlusion. We create perspective distortions with corresponding ground-truth homographies as labels. We use our dataset to train Swin transformer models to analyze sequential image data. The attention maps enable the model to detect relevant image content and differentiate it from outliers and artifacts. We further explore using neural feature maps as alternatives to classical key point detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We utilize this to compute coarse image alignments and explore its limitations.
CV Nov 28, 2023
Decomposer: Semi-supervised Learning of Image Restoration and Image Decomposition
Boris Meinardus, Mariusz Trzeciakiewicz, Tim Herzig et al.
We present Decomposer, a semi-supervised reconstruction model that decomposes distorted image sequences into their fundamental building blocks - the original image and the applied augmentations, i.e., shadow, light, and occlusions. To solve this problem, we use the SIDAR dataset that provides a large number of distorted image sequences: each sequence contains images with shadows, lighting, and occlusions applied to an undistorted version. Each distortion changes the original signal in different ways, e.g., additive or multiplicative noise. We propose a transformer-based model to explicitly learn this decomposition. The sequential model uses 3D Swin-Transformers for spatio-temporal encoding and 3D U-Nets as prediction heads for individual parts of the decomposition. We demonstrate that by separately pre-training our model on weakly supervised pseudo labels, we can steer our model to optimize for our ambiguous problem definition and learn to differentiate between the different image distortions.
CV Aug 5, 2024
ConDL: Detector-Free Dense Image Matching
Monika Kwiatkowski, Simon Matern, Olaf Hellwich
In this work, we introduce a deep-learning framework designed for estimating dense image correspondences. Our fully convolutional model generates dense feature maps for images, where each pixel is associated with a descriptor that can be matched across multiple images. Unlike previous methods, our model is trained on synthetic data that includes significant distortions, such as perspective changes, illumination variations, shadows, and specular highlights. Utilizing contrastive learning, our feature maps achieve greater invariance to these distortions, enabling robust matching. Notably, our method eliminates the need for a keypoint detector, setting it apart from many existing image-matching techniques.
11.4 CV May 3
ProtoFair: Fair Self-Supervised Contrastive Learning via Pseudo-Counterfactual Pairs
Marah Halawa, Olaf Hellwich
Self-supervised learning methods learn high-quality visual representations, yet recent studies show that these representations often capture demographic biases present in the training data. Existing fairness-aware methods address this by redesigning the self-supervised objective itself, limiting portability across the rapidly evolving landscape of self-supervised learning (SSL) frameworks. We propose ProtoFair, a fairness-aware contrastive loss designed to work alongside existing SSL objectives without modifying them. ProtoFair leverages unsupervised prototype clustering to identify pseudo-counterfactual pairs: samples sharing the same cluster assignment but belonging to different sensitive groups. By pulling these content-matched, cross-group samples together in the embedding space, ProtoFair encourages the encoder to learn representations that are invariant to the sensitive attribute. The method requires only sensitive attribute annotations, no target labels, and integrates seamlessly with both SimCLR and SupCon. Experiments on CelebA and UTKFace demonstrate consistent fairness improvements while maintaining competitive accuracy.
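The pair selection rule stated above (same cluster assignment, different sensitive group) can be written as a boolean mask. The sketch below assumes cluster assignments are already available from some prototype clustering step. `pull_together_loss` is a deliberately simplified stand-in (mean cosine distance over the selected pairs) and not the contrastive loss used by ProtoFair; both function names are hypothetical.

```python
import numpy as np

def pseudo_counterfactual_mask(cluster_ids, sensitive):
    """mask[i, j] is True when samples i and j share a cluster assignment
    but belong to different sensitive groups."""
    same_cluster = cluster_ids[:, None] == cluster_ids[None, :]
    different_group = sensitive[:, None] != sensitive[None, :]
    return same_cluster & different_group

def pull_together_loss(embeddings, mask):
    """Simplified surrogate (assumption): mean cosine distance over the
    pseudo-counterfactual pairs; 0.0 when no such pair exists."""
    if not mask.any():
        return 0.0
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = z @ z.T
    return float((1.0 - cosine[mask]).mean())
```

A sample never pairs with itself because its sensitive attribute equals its own, so the diagonal of the mask is always False. In training, a term like this would be added to an existing SSL objective such as SimCLR rather than replacing it, which matches the abstract's description of ProtoFair as an add-on loss.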
CV Nov 29, 2024
MonoPP: Metric-Scaled Self-Supervised Monocular Depth Estimation by Planar-Parallax Geometry in Automotive Applications
Gasser Elazab, Torben Gräber, Michael Unterreiner et al.
Self-supervised monocular depth estimation (MDE) has gained popularity for obtaining depth predictions directly from videos. However, these methods often produce scale-invariant results unless additional training signals are provided. Addressing this challenge, we introduce a novel self-supervised metric-scaled MDE model that requires only monocular video data and the camera's mounting position, both of which are readily available in modern vehicles. Our approach leverages planar-parallax geometry to reconstruct scene structure. The full pipeline consists of three main networks: a multi-frame network, a single-frame network, and a pose network. The multi-frame network processes sequential frames to estimate the structure of the static scene using planar-parallax geometry and the camera mounting position. Based on this reconstruction, it acts as a teacher, distilling knowledge such as scale information, masked drivable area, metric-scale depth for the static scene, and dynamic object mask to the single-frame network. It also aids the pose network in predicting a metric-scaled relative pose between two subsequent images. Our method achieves state-of-the-art results on the KITTI driving benchmark for metric-scaled depth prediction. Notably, it is one of the first methods to produce self-supervised metric-scaled depth prediction for the challenging Cityscapes dataset, demonstrating its effectiveness and versatility.
CV Apr 16, 2024
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition
Marah Halawa, Florian Blume, Pia Bideau et al.
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-task multi-modal self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotation required. We release our pre-trained models as well as source code publicly.
CV May 21, 2025
Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes
Patrik Reiske, Marcus N. Boon, Niek Andresen et al.
Machine learning and computer vision methods have a major impact on the study of natural animal behavior, as they enable the (semi-)automatic analysis of vast amounts of video data. Mice are the standard mammalian model system in most research fields, but the datasets available today to refine such methods focus either on simple or social behaviors. In this work, we present a video dataset of individual mice solving complex mechanical puzzles, so-called lockboxes. The more than 110 hours of total playtime show their behavior recorded from three different perspectives. As a benchmark for frame-level action classification methods, we provide human-annotated labels for all videos of two different mice, which amount to 13% of our dataset. Our keypoint (pose) tracking-based action classification framework illustrates the challenges of automated labeling of fine-grained behaviors, such as the manipulation of objects. We hope that our work will help accelerate the advancement of automated action and behavior classification in the computational neuroscience community. Our dataset is publicly available at https://doi.org/10.14279/depositonce-23850
CV Jan 19, 2025
Tracking Mouse from Incomplete Body-Part Observations and Deep-Learned Deformable-Mouse Model Motion-Track Constraint for Behavior Analysis
Olaf Hellwich, Niek Andresen, Katharina Hohlbaum et al.
Tracking mouse body parts in video is often incomplete due to occlusions, so that, for example, subsequent action and behavior analysis is impeded. In this conceptual work, videos from several perspectives are integrated via global exterior camera orientation; body part positions are estimated by 3D triangulation and bundle adjustment. Consistency of the overall 3D track reconstruction is achieved by introducing a 3D mouse model, deep-learned body part movements, and a global motion-track smoothness constraint. The resulting 3D body and body part track estimates are substantially more complete than the original single-frame-based body part detection, thereby allowing improved animal behavior analysis.
CV May 19, 2023
SIDAR: Synthetic Image Dataset for Alignment & Restoration
Monika Kwiatkowski, Simon Matern, Olaf Hellwich
Image alignment and image restoration are classical computer vision tasks. However, there is still a lack of datasets that provide enough data to train and evaluate end-to-end deep learning models. Obtaining ground-truth data for image alignment requires sophisticated structure-from-motion methods or optical flow systems that often do not provide enough data variance, i.e., typically providing a high number of image correspondences, while only introducing few changes of scenery within the underlying image sequences. Alternative approaches utilize random perspective distortions on existing image data. However, this only provides trivial distortions, lacking the complexity and variance of real-world scenarios. Instead, our proposed data augmentation helps to overcome the issue of data scarcity by using 3D rendering: images are added as textures onto a plane, then varying lighting conditions, shadows, and occlusions are added to the scene. The scene is rendered from multiple viewpoints, generating perspective distortions more consistent with real-world scenarios, with homographies closely resembling those of camera projections rather than randomized homographies. For each scene, we provide a sequence of distorted images with corresponding occlusion masks, homographies, and ground-truth labels. The resulting dataset can serve as a training and evaluation set for a multitude of tasks involving image alignment and artifact removal, such as deep homography estimation, dense image matching, 2D bundle adjustment, inpainting, shadow removal, denoising, content retrieval, and background subtraction. Our data generation pipeline is customizable and can be applied to any existing dataset, serving as a data augmentation to further improve the feature learning of any existing method.
CV Jun 10, 2021
A modular framework for object-based saccadic decisions in dynamic scenes
Nicolas Roth, Pia Bideau, Olaf Hellwich et al.
Visually exploring the world around us is not a passive process. Instead, we actively explore the world and acquire visual information over time. Here, we present a new model for simulating human eye-movement behavior in dynamic real-world scenes. We model this active scene exploration as a sequential decision making process. We adapt the popular drift-diffusion model (DDM) for perceptual decision making and extend it towards multiple options, defined by objects present in the scene. For each possible choice, the model integrates evidence over time and a decision (saccadic eye movement) is triggered as soon as evidence crosses a decision threshold. Drawing this explicit connection between decision making and object-based scene perception is highly relevant in the context of active viewing, where decisions are made continuously while interacting with an external environment. We validate our model with a carefully designed ablation study and explore influences of our model parameters. A comparison on the VidCom dataset supports the plausibility of the proposed approach.
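The decision mechanism described above, where evidence for each object accumulates over time and a saccade is triggered once one accumulator crosses a threshold, can be sketched as a multi-option race of noisy accumulators. This is an illustrative simulation only: the function name, the parameter values, and the fixed per-object drift rates are assumptions, and the paper's model derives its evidence from the scene rather than from constants.

```python
import numpy as np

def simulate_saccade_decision(drift_rates, threshold=1.0, noise_std=0.1,
                              dt=0.01, max_steps=10_000, seed=0):
    """Race between one evidence accumulator per object (assumed form).
    Returns (index of the chosen object, decision time in seconds),
    or (None, elapsed time) if no accumulator reaches the threshold."""
    rng = np.random.default_rng(seed)
    drift = np.asarray(drift_rates, dtype=float)
    evidence = np.zeros_like(drift)
    for step in range(1, max_steps + 1):
        # Euler-Maruyama update: deterministic drift plus diffusion noise.
        noise = noise_std * np.sqrt(dt) * rng.standard_normal(drift.shape)
        evidence += drift * dt + noise
        if evidence.max() >= threshold:
            return int(evidence.argmax()), step * dt
    return None, max_steps * dt
```

With `noise_std=0` the object with the largest drift always wins after roughly `threshold / max(drift_rates)` seconds. Adding noise makes both the chosen target and the decision time vary between runs, which is how a model of this family produces variable fixation durations.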
CV Aug 16, 2020
Learning Disentangled Expression Representations from Facial Images
Marah Halawa, Manuel Wöllhaf, Eduardo Vellasques et al.
Face images are subject to many different factors of variation, especially in unconstrained in-the-wild scenarios. For most tasks involving such images, e.g. expression recognition from video streams, having enough labeled data is prohibitively expensive. One common strategy to tackle such a problem is to learn disentangled representations for the different factors of variation of the observed data using adversarial learning. In this paper, we use a formulation of the adversarial loss to learn disentangled representations for face images. The used model facilitates learning on single-task datasets and improves the state-of-the-art in expression recognition with an accuracy of 60.53% on the AffectNet dataset, without using any additional data.
CV Nov 1, 2013
Iterative Bilateral Filtering of Polarimetric SAR Data
Olivier D'Hondt, Stéphane Guillaso, Olaf Hellwich
In this paper, we introduce an iterative speckle filtering method for polarimetric SAR (PolSAR) images based on the bilateral filter. To locally adapt to the spatial structure of images, this filter relies on pixel similarities in both spatial and radiometric domains. To deal with polarimetric data, we study the use of similarities based on a statistical distance, the Kullback-Leibler divergence, as well as two geodesic distances on Riemannian manifolds. To cope with speckle, we propose to progressively refine the result using an iterative scheme. Experiments are run over synthetic and experimental data. First, simulations are generated to study the effects of filtering parameters in terms of polarimetric reconstruction error, edge preservation and smoothing of homogeneous areas. A comparison shows that our approach performs on par with other state-of-the-art methods in the extraction of polarimetric information and shows superior performance for edge restoration and noise smoothing. The filter is then applied to experimental data sets from ESAR and FSAR sensors (DLR) at L-band and S-band, respectively. These last experiments show the ability of the filter to restore structures such as buildings and roads and to preserve boundaries between regions while achieving a high amount of smoothing in homogeneous areas.
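A bilateral filter whose radiometric similarity comes from a Kullback-Leibler based distance between covariance matrices could be sketched as follows. The symmetrized KL divergence between zero-mean Gaussians is used here (up to a constant factor that depends on the real or complex convention). The exponential kernel shape, the parameters `sigma_s` and `sigma_r`, and both function names are assumptions, so this is not the paper's exact filter.

```python
import numpy as np

def symmetric_kl(cov_a, cov_b):
    """Symmetrized KL divergence between zero-mean Gaussians with covariances
    cov_a and cov_b: 0.5 * tr(A^-1 B + B^-1 A) - d. It is zero when A == B."""
    d = cov_a.shape[0]
    trace = np.trace(np.linalg.solve(cov_a, cov_b) + np.linalg.solve(cov_b, cov_a))
    return 0.5 * float(np.real(trace)) - d

def bilateral_step(covs, radius=1, sigma_s=1.0, sigma_r=1.0):
    """One filtering pass over an (H, W, d, d) image of covariance matrices.
    Each weight combines spatial closeness and radiometric similarity."""
    height, width = covs.shape[:2]
    out = np.zeros_like(covs)
    for i in range(height):
        for j in range(width):
            acc = np.zeros_like(covs[i, j])
            weight_sum = 0.0
            for u in range(max(0, i - radius), min(height, i + radius + 1)):
                for v in range(max(0, j - radius), min(width, j + radius + 1)):
                    spatial = ((u - i) ** 2 + (v - j) ** 2) / (2.0 * sigma_s ** 2)
                    radiometric = symmetric_kl(covs[i, j], covs[u, v]) / sigma_r ** 2
                    w = np.exp(-spatial - radiometric)
                    acc += w * covs[u, v]
                    weight_sum += w
            out[i, j] = acc / weight_sum
    return out
```

The iterative scheme mentioned in the abstract would correspond to calling `bilateral_step` repeatedly on its own output, so that similarities are computed from progressively less noisy estimates. The nested loops are written for clarity; a real implementation would vectorize them.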