CVAug 31, 2023Code
BTSeg: Barlow Twins Regularization for Domain Adaptation in Semantic SegmentationJohannes Künzel, Anna Hilsmann, Peter Eisert
We introduce BTSeg (Barlow Twins regularized Segmentation), an innovative, semi-supervised training approach enhancing semantic segmentation models in order to effectively tackle adverse weather conditions without requiring additional labeled training data. Images captured at similar locations but under varying adverse conditions are regarded as manifold representation of the same scene, thereby enabling the model to conceptualize its understanding of the environment. BTSeg shows cutting-edge performance for the new challenging ACG benchmark and sets a new state-of-the-art for weakly-supervised domain adaptation for the ACDC dataset. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/BTSeg .
CVAug 29, 2022
Perfusion assessment via local remote photoplethysmography (rPPG)Benjamin Kossack, Eric Wisotzky, Peter Eisert et al.
This paper presents an approach to assess the perfusion of visible human tissue from RGB video files. We propose metrics derived from remote photoplethysmography (rPPG) signals to detect whether a tissue is adequately supplied with blood. The perfusion analysis is done in three different scales, offering a flexible approach for different applications. We perform a plane-orthogonal-to-skin rPPG independently for locally defined regions of interest on each scale. From the extracted signals, we derive the signal-to-noise ratio, magnitude in the frequency domain, heart rate, perfusion index as well as correlation between specific rPPG signals in order to locally assess the perfusion of a specific region of human tissue. We show that locally resolved rPPG has a broad range of applications. As exemplary applications, we present results in intraoperative perfusion analysis and visualization during skin and organ transplantation as well as an application for liveliness assessment for the detection of presentation attacks to authentication systems.
CVJun 16, 2023
Unsupervised Learning of Style-Aware Facial Animation from Real Acting PerformancesWolfgang Paier, Anna Hilsmann, Peter Eisert
This paper presents a novel approach for text/speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a VAE for geometry and texture yields a parametric model for accurate capturing and realistic synthesis of facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transforms text or speech into a sequence of animation parameters. In contrast to previous approaches, our animation model learns disentangling/synthesizing different acting-styles in an unsupervised manner, requiring only phonetic labels that describe the content of training sequences. For realistic real-time rendering, we train a U-Net that refines rasterization-based renderings by computing improved pixel colors and a foreground matte. We compare our framework qualitatively/quantitatively against recent methods for head modeling as well as facial animation and evaluate the perceived rendering/animation quality in a user-study, which indicates large improvements compared to state-of-the-art approaches
CVJun 2, 2023
Automatic Reconstruction of Semantic 3D Models from 2D Floor PlansAleixo Cambeiro Barreiro, Mariusz Trzeciakiewicz, Anna Hilsmann et al.
Digitalization of existing buildings and the creation of 3D BIM models for them has become crucial for many tasks. Of particular importance are floor plans, which contain information about building layouts and are vital for processes such as construction, maintenance or refurbishing. However, this data is not always available in digital form, especially for older buildings constructed before CAD tools were widely available, or lacks semantic information. The digitalization of such information usually requires manual work of an expert that must reconstruct the layouts by hand, which is a cumbersome and error-prone process. In this paper, we present a pipeline for reconstruction of vectorized 3D models from scanned 2D plans, aiming at increasing the efficiency of this process. The method presented achieves state-of-the-art results in the public dataset CubiCasa5k, and shows good generalization to different types of plans. Our vectorization approach is particularly effective, outperforming previous methods.
CVNov 21, 2022
Hyperspectral Demosaicing of Snapshot Camera Images Using Deep LearningEric L. Wisotzky, Charul Daudkhane, Anna Hilsmann et al.
Spectral imaging technologies have rapidly evolved during the past decades. The recent development of single-camera-one-shot techniques for hyperspectral imaging allows multiple spectral bands to be captured simultaneously (3x3, 4x4 or 5x5 mosaic), opening up a wide range of applications. Examples include intraoperative imaging, agricultural field inspection and food quality assessment. To capture images across a wide spectrum range, i.e. to achieve high spectral resolution, the sensor design sacrifices spatial resolution. With increasing mosaic size, this effect becomes increasingly detrimental. Furthermore, demosaicing is challenging. Without incorporating edge, shape, and object information during interpolation, chromatic artifacts are likely to appear in the obtained images. Recent approaches use neural networks for demosaicing, enabling direct information extraction from image data. However, obtaining training data for these approaches poses a challenge as well. This work proposes a parallel neural network based demosaicing procedure trained on a new ground truth dataset captured in a controlled environment by a hyperspectral snapshot camera with a 4x4 mosaic pattern. The dataset is a combination of real captured scenes with images from publicly available data adapted to the 4x4 mosaic pattern. To obtain real world ground-truth data, we performed multiple camera captures with 1-pixel shifts in order to compose the entire data cube. Experiments show that the proposed network outperforms state-of-art networks.
CVOct 5, 2023
Animatable Virtual Humans: Learning pose-dependent human representations in UV space for interactive performance synthesisWieland Morgenstern, Milena T. Bagdasarian, Anna Hilsmann et al.
We propose a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications. We learn pose dependent appearance and geometry from highly accurate dynamic mesh sequences obtained from state-of-the-art multiview-video reconstruction. Learning pose-dependent appearance and geometry from mesh sequences poses significant challenges, as it requires the network to learn the intricate shape and articulated motion of a human body. However, statistical body models like SMPL provide valuable a-priori knowledge which we leverage in order to constrain the dimension of the search space enabling more efficient and targeted learning and define pose-dependency. Instead of directly learning absolute pose-dependent geometry, we learn the difference between the observed geometry and the fitted SMPL model. This allows us to encode both pose-dependent appearance and geometry in the consistent UV space of the SMPL model. This approach not only ensures a high level of realism but also facilitates streamlined processing and rendering of virtual humans in real-time scenarios.
CVNov 6, 2023
Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human PerformancesPaul Knoll, Wieland Morgenstern, Anna Hilsmann et al.
Creating high-quality controllable 3D human models from multi-view RGB videos poses a significant challenge. Neural radiance fields (NeRFs) have demonstrated remarkable quality in reconstructing and free-viewpoint rendering of static as well as dynamic scenes. The extension to a controllable synthesis of dynamic human performances poses an exciting research question. In this paper, we introduce a novel NeRF-based framework for pose-dependent rendering of human performances. In our approach, the radiance field is warped around an SMPL body mesh, thereby creating a new surface-aligned representation. Our representation can be animated through skeletal joint parameters that are provided to the NeRF in addition to the viewpoint for pose dependent appearances. To achieve this, our representation includes the corresponding 2D UV coordinates on the mesh texture map and the distance between the query point and the mesh. To enable efficient learning despite mapping ambiguities and random visual variations, we introduce a novel remapping process that refines the mapped coordinates. Experiments demonstrate that our approach results in high-quality renderings for novel-view and novel-pose synthesis.
CVOct 11, 2022
CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose EstimationNiklas Gard, Anna Hilsmann, Peter Eisert
Applications in the field of augmented reality or robotics often require joint localisation and 6D pose estimation of multiple objects. However, most algorithms need one network per object class to be trained in order to provide the best results. Analysing all visible objects demands multiple inferences, which is memory and time-consuming. We present a new single-stage architecture called CASAPose that determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass. It is fast and memory efficient, and achieves high accuracy for multiple objects by exploiting the output of a semantic segmentation decoder as control input to a keypoint recognition decoder via local class-adaptive normalisation. Our new differentiable regression of keypoint locations significantly contributes to a faster closing of the domain gap between real test and synthetic training data. We apply segmentation-aware convolutions and upsampling operations to increase the focus inside the object mask and to reduce mutual interference of occluding objects. For each inserted object, the network grows by only one output segmentation map and a negligible number of parameters. We outperform state-of-the-art approaches in challenging multi-object scenes with inter-object occlusion and synthetic training.
CVNov 25, 2023
Multi-task Planar Reconstruction with Feature Warping GuidanceLuan Wei, Anna Hilsmann, Peter Eisert
Piece-wise planar 3D reconstruction simultaneously segments plane instances and recovers their 3D plane parameters from an image, which is particularly useful for indoor or man-made environments. Efficient reconstruction of 3D planes coupled with semantic predictions offers advantages for a wide range of applications requiring scene understanding and concurrent spatial mapping. However, most existing planar reconstruction models either neglect semantic predictions or do not run efficiently enough for real-time applications. We introduce SOLOPlanes, a real-time planar reconstruction model based on a modified instance segmentation architecture which simultaneously predicts semantics for each plane instance, along with plane parameters and piece-wise plane instance masks. We achieve an improvement in instance mask segmentation by including multi-view guidance for plane predictions in the training process. This cross-task improvement, training for plane prediction but improving the mask segmentation, is due to the nature of feature sharing in multi-task learning. Our model simultaneously predicts semantics using single images at inference time, while achieving real-time predictions at 43 FPS.
CVMar 6, 2023
System for 3D Acquisition and 3D Reconstruction using Structured Light for Sewer Line InspectionJohannes Künzel, Darko Vehar, Rico Nestler et al.
The assessment of sewer pipe systems is a highly important, but at the same time cumbersome and error-prone task. We introduce an innovative system based on single-shot structured light modules that facilitates the detection and classification of spatial defects like jutting intrusions, spallings, or misaligned joints. This system creates highly accurate 3D measurements with sub-millimeter resolution of pipe surfaces and fuses them into a holistic 3D model. The benefit of such a holistic 3D model is twofold: on the one hand, it facilitates the accurate manual sewer pipe assessment, on the other, it simplifies the detection of defects in downstream automatic systems as it endows the input with highly accurate depth information. In this work, we provide an extensive overview of the system and give valuable insights into our design choices.
CVMar 3
R3GW: Relightable 3D Gaussians for Outdoor Scenes in the WildMargherita Lea Corona, Wieland Morgenstern, Peter Eisert et al.
3D Gaussian Splatting (3DGS) has established itself as a leading technique for 3D reconstruction and novel view synthesis of static scenes, achieving outstanding rendering quality and fast training. However, the method does not explicitly model the scene illumination, making it unsuitable for relighting tasks. Furthermore, 3DGS struggles to reconstruct scenes captured in the wild by unconstrained photo collections featuring changing lighting conditions. In this paper, we present R3GW, a novel method that learns a relightable 3DGS representation of an outdoor scene captured in the wild. Our approach separates the scene into a relightable foreground and a non-reflective background (the sky), using two distinct sets of Gaussians. R3GW models view-dependent lighting effects in the foreground reflections by combining Physically Based Rendering with the 3DGS scene representation in a varying illumination setting. We evaluate our method quantitatively and qualitatively on the NeRF-OSR dataset, offering state-of-the-art performance and enhanced support for physically-based relighting of unconstrained scenes. Our method synthesizes photorealistic novel views under arbitrary illumination conditions. Additionally, our representation of the sky mitigates depth reconstruction artifacts, improving rendering quality at the sky-foreground boundary
61.9CVMay 22
COSY: Compositional 3DGS Synthesis for Disentangled Human Head EditingFlorian Barthel, Shalini De Mello, Koki Nagano et al.
Recent 3D Gaussian Splatting (3DGS) GANs for human heads synthesize and render photorealistic 3D models in real-time and offer a vast variety in identity and appearance. However, controlling specific semantic attributes such as hair color or glasses remains challenging, as edits in the entangled latent space often induce unintended changes in identity or appearance. Although there are several methods that aim to disentangle the latent space post training by estimating directions that only modify certain features, these methods cannot guarantee complete disentanglement and often require pre-trained classifiers. In our approach, we propose a new generator architecture that synthesizes components, such as hair, skin, glasses, and torso, completely independently. This allows for changing the latent vector for one region while keeping the remaining parts fixed. Further, we achieve this separation using only sparse information such as the hair or skin color, eliminating the requirement of segmentation masks or geometric priors, often seen in prior work. To ensure matching shape and lighting conditions during editing, we allow minimal shared information via context tokens between the independent generators. These tokens even allow us to control the shape and light, without any prior annotation. Compared to existing works on GAN-based generation and editing, our method shows better disentanglement, more precise editing control, and competitive visual quality.
CVFeb 26
Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway PlatformsBin Zeng, Johannes Künzel, Anna Hilsmann et al.
Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform while arriving. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, MOT-RailwayPlatformCrowdHead Dataset(MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
CVJul 7, 2025Code
RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint ExtractionJohannes Künzel, Anna Hilsmann, Peter Eisert
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
CVDec 19, 2023
Compact 3D Scene Representation via Self-Organizing Gaussian GridsWieland Morgenstern, Florian Barthel, Anna Hilsmann et al.
3D Gaussian Splatting has recently emerged as a highly promising technique for modeling of static 3D scenes. In contrast to Neural Radiance Fields, it utilizes efficient rasterization allowing for very fast rendering at high-quality. However, the storage size is significantly higher, which hinders practical deployment, e.g. on resource constrained devices. In this paper, we introduce a compact scene representation organizing the parameters of 3D Gaussian Splatting (3DGS) into a 2D grid with local homogeneity, ensuring a drastic reduction in storage requirements without compromising visual quality during rendering. Central to our idea is the explicit exploitation of perceptual redundancies present in natural scenes. In essence, the inherent nature of a scene allows for numerous permutations of Gaussian parameters to equivalently represent it. To this end, we propose a novel highly parallel algorithm that regularly arranges the high-dimensional Gaussian parameters into a 2D grid while preserving their neighborhood structure. During training, we further enforce local smoothness between the sorted parameters in the grid. The uncompressed Gaussians use the same structure as 3DGS, ensuring a seamless integration with established renderers. Our method achieves a reduction factor of 17x to 42x in size for complex scenes with no increase in training time, marking a substantial leap forward in the domain of 3D scene distribution and consumption. Additional information can be found on our project page: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/
CVApr 16, 2024
Gaussian Splatting Decoder for 3D-aware Generative Adversarial NetworksFlorian Barthel, Arian Beckmann, Wieland Morgenstern et al.
NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes. Project page: florian-barthel.github.io/gaussian_decoder
CVFeb 15, 2024
X-maps: Direct Depth Lookup for Event-based Structured Light SystemsWieland Morgenstern, Niklas Gard, Simon Baumann et al.
We present a new approach to direct depth estimation for Spatial Augmented Reality (SAR) applications using event cameras. These dynamic vision sensors are a great fit to be paired with laser projectors for depth estimation in a structured light approach. Our key contributions involve a conversion of the projector time map into a rectified X-map, capturing x-axis correspondences for incoming events and enabling direct disparity lookup without any additional search. Compared to previous implementations, this significantly simplifies depth estimation, making it more efficient, while the accuracy is similar to the time map-based process. Moreover, we compensate non-linear temporal behavior of cheap laser projectors by a simple time map calibration, resulting in improved performance and increased depth estimation accuracy. Since depth estimation is executed by two lookups only, it can be executed almost instantly (less than 3 ms per frame with a Python implementation) for incoming events. This allows for real-time interactivity and responsiveness, which makes our approach especially suitable for SAR experiences where low latency, high frame rates and direct feedback are crucial. We present valuable insights gained into data transformed into X-maps and evaluate our depth from disparity estimation against the state of the art time map-based results. Additional results and code are available on our project page: https://fraunhoferhhi.github.io/X-maps/
CVNov 10, 2024
Adaptive and Temporally Consistent Gaussian Surfels for Multi-view Dynamic ReconstructionDecai Chen, Brianne Oberson, Ingo Feldmann et al.
3D Gaussian Splatting has recently achieved notable success in novel view synthesis for dynamic scenes and geometry reconstruction in static scenes. Building on these advancements, early methods have been developed for dynamic surface reconstruction by globally optimizing entire sequences. However, reconstructing dynamic scenes with significant topology changes, emerging or disappearing objects, and rapid movements remains a substantial challenge, particularly for long sequences. To address these issues, we propose AT-GS, a novel method for reconstructing high-quality dynamic surfaces from multi-view videos through per-frame incremental optimization. To avoid local minima across frames, we introduce a unified and adaptive gradient-aware densification strategy that integrates the strengths of conventional cloning and splitting techniques. Additionally, we reduce temporal jittering in dynamic surfaces by ensuring consistency in curvature maps across consecutive frames. Our method achieves superior accuracy and temporal coherence in dynamic surface reconstruction, delivering high-fidelity space-time novel view synthesis, even in complex and challenging scenes. Extensive experiments on diverse multi-view video datasets demonstrate the effectiveness of our approach, showing clear advantages over baseline methods. Project page: \url{https://fraunhoferhhi.github.io/AT-GS}
IVDec 15, 2023
Multispectral Stereo-Image Fusion for 3D Hyperspectral Scene ReconstructionEric L. Wisotzky, Jost Triller, Anna Hilsmann et al.
Spectral imaging enables the analysis of optical material properties that are invisible to the human eye. Different spectral capturing setups, e.g., based on filter-wheel, push-broom, line-scanning, or mosaic cameras, have been introduced in the last years to support a wide range of applications in agriculture, medicine, and industrial surveillance. However, these systems often suffer from different disadvantages, such as lack of real-time capability, limited spectral coverage or low spatial resolution. To address these drawbacks, we present a novel approach combining two calibrated multispectral real-time capable snapshot cameras, covering different spectral ranges, into a stereo-system. Therefore, a hyperspectral data-cube can be continuously captured. The combined use of different multispectral snapshot cameras enables both 3D reconstruction and spectral analysis. Both captured images are demosaicked avoiding spatial resolution loss. We fuse the spectral data from one camera into the other to receive a spatially and spectrally high resolution video stream. Experiments demonstrate the feasibility of this approach and the system is investigated with regard to its applicability for surgical assistance monitoring.
CVApr 16, 2024
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen EnvironmentsNiklas Gard, Anna Hilsmann, Peter Eisert
In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image and requires minimal scene-specific prior knowledge and no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which only comprises approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to the state of the art methods and also estimates more degrees of freedom of the camera pose. Our source code is publicly available at https://fraunhoferhhi.github.io/spvloc .
IVDec 21, 2023
Efficient and Accurate Hyperspectral Image Demosaicing with Neural Network ArchitecturesEric L. Wisotzky, Lara Wallburg, Anna Hilsmann et al.
Neural network architectures for image demosaicing have been become more and more complex. This results in long training periods of such deep networks and the size of the networks is huge. These two factors prevent practical implementation and usage of the networks in real-time platforms, which generally only have limited resources. This study investigates the effectiveness of neural network architectures in hyperspectral image demosaicing. We introduce a range of network models and modifications, and compare them with classical interpolation methods and existing reference network approaches. The aim is to identify robust and efficient performing network architectures. Our evaluation is conducted on two datasets, "SimpleData" and "SimRealData," representing different degrees of realism in multispectral filter array (MSFA) data. The results indicate that our networks outperform or match reference models in both datasets demonstrating exceptional performance. Notably, our approach focuses on achieving correct spectral reconstruction rather than just visual appeal, and this emphasis is supported by quantitative and qualitative assessments. Furthermore, our findings suggest that efficient demosaicing solutions, which require fewer parameters, are essential for practical applications. This research contributes valuable insights into hyperspectral imaging and its potential applications in various fields, including medical imaging.
IVDec 6, 2024
Automatic Tissue Differentiation in Parotidectomy using Hyperspectral ImagingEric L. Wisotzky, Alexander Schill, Anna Hilsmann et al.
In head and neck surgery, continuous intraoperative tissue differentiation is of great importance to avoid injury to sensitive structures such as nerves and vessels. Hyperspectral imaging (HSI) with neural network analysis could support the surgeon in tissue differentiation. A 3D Convolutional Neural Network with hyperspectral data in the range of $400-1000$ nm is used in this work. The acquisition system consisted of two multispectral snapshot cameras creating a stereo-HSI-system. For the analysis, 27 images with annotations of glandular tissue, nerve, muscle, skin and vein in 18 patients undergoing parotidectomy are included. Three patients are removed for evaluation following the leave-one-subject-out principle. The remaining images are used for training, with the data randomly divided into a training group and a validation group. In the validation, an overall accuracy of $98.7\%$ is achieved, indicating robust training. In the evaluation on the excluded patients, an overall accuracy of $83.4\%$ has been achieved showing good detection and identification abilities. The results clearly show that it is possible to achieve robust intraoperative tissue differentiation using hyperspectral imaging. Especially the high sensitivity in parotid or nerve tissue is of clinical importance. It is interesting to note that vein was often confused with muscle. This requires further analysis and shows that a very good and comprehensive data basis is essential. This is a major challenge, especially in surgery.
CVDec 13, 2023
Towards Better Morphed Face Images without Ghosting ArtifactsClemens Seibold, Anna Hilsmann, Peter Eisert
Automatic generation of morphed face images often produces ghosting artifacts due to poorly aligned structures in the input images. Manual processing can mitigate these artifacts. However, this is not feasible for the generation of large datasets, which are required for training and evaluating robust morphing attack detectors. In this paper, we propose a method for automatic prevention of ghosting artifacts based on a pixel-wise alignment during morph generation. We evaluate our proposed method on state-of-the-art detectors and show that our morphs are harder to detect, particularly, when combined with style-transfer-based improvement of low-level image characteristics. Furthermore, we show that our approach does not impair the biometric quality, which is essential for high quality morphs.
CVDec 8, 2023
Multi-view Inversion for 3D-aware Generative Adversarial NetworksFlorian Barthel, Anna Hilsmann, Peter Eisert
Current 3D GAN inversion methods for human heads typically use only one single frontal image to reconstruct the whole 3D head model. This leaves out meaningful information when multi-view data or dynamic videos are available. Our method builds on existing state-of-the-art 3D GAN inversion techniques to allow for consistent and simultaneous inversion of multiple views of the same subject. We employ a multi-latent extension to handle inconsistencies present in dynamic face videos to re-synthesize consistent 3D representations from the sequence. As our method uses additional information about the target subject, we observe significant enhancements in both geometric accuracy and image quality, particularly when rendering from wide viewing angles. Moreover, we demonstrate the editability of our inverted 3D renderings, which distinguishes them from NeRF-based scene reconstructions.
CVMar 18, 2025
Improving Adaptive Density Control for 3D Gaussian SplattingGlenn Grubert, Florian Barthel, Anna Hilsmann et al.
3D Gaussian Splatting (3DGS) has become one of the most influential works in the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges to properly manage the number of Gaussian primitives that are used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians in under-reconstructed regions are created, while Gaussians that do not contribute to the rendering quality are pruned. We observe that those criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts. We especially observe under-reconstructed background or overfitted foreground regions. To encounter both problems, we propose three new improvements to the adaptive density control mechanism. Those include a correction for the scene extent calculation that does not only rely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and significance-aware pruning strategy to avoid background artifacts. With these adaptions, we show that the rendering quality improves while using the same number of Gaussians primitives. Furthermore, with our improvements, the training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS making them relevant for future works.
CVMar 5
Video-based Locomotion Analysis for Fish Health MonitoringTimon Palm, Clemens Seibold, Anna Hilsmann et al.
Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11-architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.
CVMar 6
Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo VisionClemens Seibold, Anna Hilsmann, Peter Eisert
Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
CVMay 23, 2025
CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head SynthesisFlorian Barthel, Wieland Morgenstern, Paul Hinzer et al.
Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check our our project page here: https://fraunhoferhhi.github.io/cgs-gan/
CVMay 20, 2025
AppleGrowthVision: A large-scale stereo dataset for phenological analysis, fruit detection, and 3D reconstruction in apple orchardsLaura-Sophia von Hirschhausen, Jannes S. Magnusson, Mykyta Kovalenko et al.
Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints. The lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 1,125 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 31,084 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 7.69 % in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN F1-score by 31.06 %. Additionally, six BBCH stages were predicted with over 95 % accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision, by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.
CVMar 17, 2025
Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor ScenariosIryna Repinetska, Anna Hilsmann, Peter Eisert
Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as ''floaters'' that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree ''inside-out'' views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
CVMar 5, 2025
Automatic Drywall Analysis for Progress Tracking and Quality Control in ConstructionMariusz Trzeciakiewicz, Aleixo Cambeiro Barreiro, Niklas Gard et al.
Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource-efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image-based automated drywall analysis enabling construction progress and quality assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.
CVNov 25, 2024
Multi-Resolution Generative Modeling of Human Motion from Limited DataDavid Eduardo Moreno-Villamarín, Anna Hilsmann, Peter Eisert
We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.
CVJun 17, 2024
3DGS.zip: A survey on 3D Gaussian Splatting Compression MethodsMilena T. Bagdasarian, Paul Knoll, Yi-Hsin Li et al.
3D Gaussian Splatting (3DGS) has emerged as a cutting-edge technique for real-time radiance field rendering, offering state-of-the-art performance in terms of both quality and speed. 3DGS models a scene as a collection of three-dimensional Gaussians, with additional attributes optimized to conform to the scene's geometric and visual properties. Despite its advantages in rendering speed and image fidelity, 3DGS is limited by its significant storage and memory demands. These high demands make 3DGS impractical for mobile devices or headsets, reducing its applicability in important areas of computer graphics. To address these challenges and advance the practicality of 3DGS, this survey provides a comprehensive and detailed examination of compression and compaction techniques developed to make 3DGS more efficient. We classify existing methods into two categories: compression, which focuses on reducing file size, and compaction, which aims to minimize the number of Gaussians. Both methods aim to maintain or improve quality, each by minimizing its respective attribute: file size for compression and Gaussian count for compaction. We introduce the basic mathematical concepts underlying the analyzed methods, as well as key implementation details and design choices. Our report thoroughly discusses similarities and differences among the methods, as well as their respective advantages and disadvantages. We establish a consistent framework for comparing the surveyed methods based on key performance metrics and datasets. Specifically, since these methods have been developed in parallel and over a short period of time, currently, no comprehensive comparison exists. This survey, for the first time, presents a unified framework to evaluate 3DGS compression techniques. We maintain a website that will be regularly updated with emerging methods: https://w-m.github.io/3dgs-compression-survey/ .
LGMay 14, 2024
EEG-Features for Generalized Deepfake DetectionArian Beckmann, Tilman Stephani, Felix Klotzsche et al.
Since the advent of Deepfakes in digital media, the development of robust and reliable detection mechanism is urgently called for. In this study, we explore a novel approach to Deepfake detection by utilizing electroencephalography (EEG) measured from the neural processing of a human participant who viewed and categorized Deepfake stimuli from the FaceForensics++ datset. These measurements serve as input features to a binary support vector classifier, trained to discriminate between real and manipulated facial images. We examine whether EEG data can inform Deepfake detection and also if it can provide a generalized representation capable of identifying Deepfakes beyond the training domain. Our preliminary results indicate that human neural processing signals can be successfully integrated into Deepfake detection frameworks and hint at the potential for a generalized neural representation of artifacts in computer generated faces. Moreover, our study provides next steps towards the understanding of how digital realism is embedded in the human cognitive system, possibly enabling the development of more realistic digital avatars in the future.
CVMar 7, 2024
Video-Driven Animation of Neural Head AvatarsWolfgang Paier, Paul Hinzer, Anna Hilsmann et al.
We present a new approach for video-driven animation of high-quality neural 3D head models, addressing the challenge of person-independent animation from video input. Typically, high-quality generative models are learned for specific individuals from multi-view video footage, resulting in person-specific latent representations that drive the generation process. In order to achieve person-independent animation from video input, we introduce an LSTM-based animation network capable of translating person-independent expression features into personalized animation parameters of person-specific 3D head models. Our approach combines the advantages of personalized head models (high quality and realism) with the convenience of video-driven animation employing multi-person facial performance capture. We demonstrate the effectiveness of our approach on synthesized animations with high quality based on different source videos as well as an ablation study.
CVMay 10, 2023
Towards L-System Captioning for Tree ReconstructionJannes S. Magnusson, Anna Hilsmann, Peter Eisert
This work proposes a novel concept for tree and plant reconstruction by directly inferring a Lindenmayer-System (L-System) word representation from image data in an image captioning approach. We train a model end-to-end which is able to translate given images into L-System words as a description of the displayed tree. To prove this concept, we demonstrate the applicability on 2D tree topologies. Transferred to real image data, this novel idea could lead to more efficient, accurate and semantically meaningful tree and plant reconstruction without using error-prone point cloud extraction, and other processes usually utilized in tree reconstruction. Furthermore, this approach bypasses the need for a predefined L-System grammar and enables species-specific L-System inference without biological knowledge.
CVMay 9, 2023
Fooling State-of-the-Art Deepfake Detection with High-Quality DeepfakesArian Beckmann, Anna Hilsmann, Peter Eisert
Due to the rising threat of deepfakes to security and privacy, it is most important to develop robust and reliable detectors. In this paper, we examine the need for high-quality samples in the training datasets of such detectors. Accordingly, we show that deepfake detectors proven to generalize well on multiple research datasets still struggle in real-world scenarios with well-crafted fakes. First, we propose a novel autoencoder for face swapping alongside an advanced face blending technique, which we utilize to generate 90 high-quality deepfakes. Second, we feed those fakes to a state-of-the-art detector, causing its performance to decrease drastically. Moreover, we fine-tune the detector on our fakes and demonstrate that they contain useful clues for the detection of manipulations. Overall, our results provide insights into the generalization of deepfake detectors and suggest that their training datasets should be complemented by high-quality fakes since training on mere research data is insufficient.
CVFeb 7, 2022
Imposing Temporal Consistency on Deep Monocular Body Shape and Pose EstimationAlexandra Zimmer, Anna Hilsmann, Wieland Morgenstern et al.
Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence is still challenging and the modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for the integration of temporal constraints in the fitting process. This does not only increase temporal consistency but also robustness during the optimization. In detail, we derive parameters of a sequence of body models, representing shape and motion of a person, including jaw poses, facial expressions, and finger poses. We optimize these parameters over the complete image sequence, fitting one consistent body shape while imposing temporal consistency on the body motion, assuming linear body joint trajectories over a short time. Our approach enables the derivation of realistic 3D body models from image sequences, including facial expression and articulated hands. In extensive experiments, we show that our approach results in accurately estimated body shape and motion, also for challenging movements and poses. Further, we apply it to the special application of sign language analysis, where accurate and temporal consistent motion modelling is essential, and show that the approach is well-suited for this kind of application.
CVFeb 1, 2022
From Explanations to Segmentation: Using Explainable AI for Image SegmentationClemens Seibold, Johannes Künzel, Anna Hilsmann et al.
The new era of image segmentation leveraging the power of Deep Neural Nets (DNNs) comes with a price tag: to train a neural network for pixel-wise segmentation, a large amount of training samples has to be manually labeled on pixel-precision. In this work, we address this by following an indirect solution. We build upon the advances of the Explainable AI (XAI) community and extract a pixel-wise binary segmentation from the output of the Layer-wise Relevance Propagation (LRP) explaining the decision of a classification network. We show that we achieve similar results compared to an established U-Net segmentation architecture, while the generation of the training data is significantly simplified. The proposed method can be trained in a weakly supervised fashion, as the training samples must be only labeled on image-level, at the same time enabling the output of a segmentation mask. This makes it especially applicable to a wider range of real applications where tedious pixel-level labelling is often not possible.
CVJan 31, 2022
Combining Local and Global Pose Estimation for Precise Tracking of Similar ObjectsNiklas Gard, Anna Hilsmann, Peter Eisert
In this paper, we present a multi-object 6D detection and tracking pipeline for potentially similar and non-textured objects. The combination of a convolutional neural network for object classification and rough pose estimation with a local pose refinement and an automatic mismatch detection enables direct application in real-time AR scenarios. A new network architecture, trained solely with synthetic images, allows simultaneous pose estimation of multiple objects with reduced GPU memory consumption and enhanced performance. In addition, the pose estimates are further improved by a local edge-based refinement step that explicitly exploits known object geometry information. For continuous movements, the sole use of local refinement reduces pose mismatches due to geometric ambiguities or occlusions. We showcase the entire tracking pipeline and demonstrate the benefits of the combined approach. Experiments on a challenging set of non-textured similar objects demonstrate the enhanced quality compared to the baseline method. Finally, we illustrate how the system can be used in a real AR assistance application within the field of construction.
CVNov 30, 2021
Automated Damage Inspection of Power Transmission Towers from UAV ImagesAleixo Cambeiro Barreiro, Clemens Seibold, Anna Hilsmann et al.
Infrastructure inspection is a very costly task, requiring technicians to access remote or hard-to-reach places. This is the case for power transmission towers, which are sparsely located and require trained workers to climb them to search for damages. Recently, the use of drones or helicopters for remote recording is increasing in the industry, sparing the technicians this perilous task. This, however, leaves the problem of analyzing big amounts of images, which has great potential for automation. This is a challenging task for several reasons. First, the lack of freely available training data and the difficulty to collect it complicate this problem. Additionally, the boundaries of what constitutes a damage are fuzzy, introducing a degree of subjectivity in the labelling of the data. The unbalanced class distribution in the images also plays a role in increasing the difficulty of the task. This paper tackles the problem of structural damage detection in transmission towers, addressing these issues. Our main contributions are the development of a system for damage detection on remotely acquired drone images, applying techniques to overcome the issue of data scarcity and ambiguity, as well as the evaluation of the viability of such an approach to solve this particular problem.
CVAug 9, 2021
Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic DataMaciej Janik, Niklas Gard, Anna Hilsmann et al.
We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to scarcity of datasets with exact photograph-mesh correspondences, we train our network with only synthetic data. Our experiments investigate the effect of different qualities and quantities of training data on retrieval accuracy and present insights from bridging the domain gap. We show that increasing the variety of synthetic data improves retrieval accuracy and that our system's performance in zero-shot mode can match that of the instance-aware mode, as far as narrowing down the search to the top 10% of objects.
CVMar 26, 2021
Focused LRP: Explainable AI for Face Morphing Attack DetectionClemens Seibold, Anna Hilsmann, Peter Eisert
The task of detecting morphed face images has become highly relevant in recent years to ensure the security of automatic verification systems based on facial images, e.g. automated border control gates. Detection methods based on Deep Neural Networks (DNN) have been shown to be very suitable to this end. However, they do not provide transparency in the decision making and it is not clear how they distinguish between genuine and morphed face images. This is particularly relevant for systems intended to assist a human operator, who should be able to understand the reasoning. In this paper, we tackle this problem and present Focused Layer-wise Relevance Propagation (FLRP). This framework explains to a human inspector on a precise pixel level, which image regions are used by a Deep Neural Network to distinguish between a genuine and a morphed face image. Additionally, we propose another framework to objectively analyze the quality of our method and compare FLRP to other DNN interpretability methods. This evaluation framework is based on removing detected artifacts and analyzing the influence of these changes on the decision of the DNN. Especially, if the DNN is uncertain in its decision or even incorrect, FLRP performs much better in highlighting visible artifacts compared to other methods.
CVSep 22, 2020
Neural Face Models for Example-Based Visual Speech SynthesisWolfgang Paier, Anna Hilsmann, Peter Eisert
Creating realistic animations of human faces with computer graphic models is still a challenging task. It is often solved either with tedious manual work or motion capture based techniques that require specialised and costly hardware. Example based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples in a way such that the desired facial performance is achieved. Two difficulties with example based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps creating seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.
CVSep 2, 2020
Going beyond Free Viewpoint: Creating Animatable Volumetric Video of Human PerformancesAnna Hilsmann, Philipp Fechteler, Wieland Morgenstern et al.
In this paper, we present an end-to-end pipeline for the creation of high-quality animatable volumetric video content of human performances. Going beyond the application of free-viewpoint volumetric video, we allow re-animation and alteration of an actor's performance through (i) the enrichment of the captured data with semantics and animation properties and (ii) applying hybrid geometry- and video-based animation methods that allow a direct animation of the high-quality data itself instead of creating an animatable model that resembles the captured data. Semantic enrichment and geometric animation ability are achieved by establishing temporal consistency in the 3D data, followed by an automatic rigging of each frame using a parametric shape-adaptive full human body model. Our hybrid geometry- and video-based animation approaches combine the flexibility of classical CG animation with the realism of real captured data. For pose editing, we exploit the captured data as much as possible and kinematically deform the captured frames to fit a desired pose. Further, we treat the face differently from the body in a hybrid geometry- and video-based animation approach where coarse movements and poses are modeled in the geometry only, while very fine and subtle details in the face, often lacking in purely geometric methods, are captured in video-based textures. These are processed to be interactively combined to form new facial expressions. On top of that, we learn the appearance of regions that are challenging to synthesize, such as the teeth or the eyes, and fill in missing regions realistically in an autoencoder-based approach. This paper covers the full pipeline from capturing and producing high-quality video content, over the enrichment with semantics and deformation properties for re-animation and processing of the data for the final hybrid animation.
CVApr 23, 2020
Style Your Face Morph and Improve Your Face Morphing Attack DetectorClemens Seibold, Anna Hilsmann, Peter Eisert
A morphed face image is a synthetically created image that looks so similar to the faces of two subjects that both can use it for verification against a biometric verification system. It can be easily created by aligning and blending face images of the two subjects. In this paper, we propose a style transfer based method that improves the quality of morphed face images. It counters the image degeneration during the creation of morphed face images caused by blending. We analyze different state of the art face morphing attack detection systems regarding their performance against our improved morphed face images and other methods that improve the image quality. All detection systems perform significantly worse, when first confronted with our improved morphed face images. Most of them can be enhanced by adding our quality improved morphs to the training data, which further improves the robustness against other means of quality improvement.
CVJul 5, 2018
Reflection Analysis for Face Morphing Attack DetectionClemens Seibold, Anna Hilsmann, Peter Eisert
A facial morph is a synthetically created image of a face that looks similar to two different individuals and can even trick biometric facial recognition systems into recognizing both individuals. This attack is known as face morphing attack. The process of creating such a facial morph is well documented and a lot of tutorials and software to create them are freely available. Therefore, it is mandatory to be able to detect this kind of fraud to ensure the integrity of the face as reliable biometric feature. In this work, we study the effects of face morphing on the physically correctness of the illumination. We estimate the direction to the light sources based on specular highlights in the eyes and use them to generate a synthetic map for highlights on the skin. This map is compared with the highlights in the image that is suspected to be a fraud. Morphing faces with different geometries, a bad alignment of the source images or using images with different illuminations, can lead to inconsistencies in reflections that indicate the existence of a morphing attack.
CVJun 11, 2018
Accurate and Robust Neural Networks for Security Related Applications Exampled by Face Morphing AttacksClemens Seibold, Wojciech Samek, Anna Hilsmann et al.
Artificial neural networks tend to learn only what they need for a task. A manipulation of the training data can counter this phenomenon. In this paper, we study the effect of different alterations of the training data, which limit the amount and position of information that is available for the decision making. We analyze the accuracy and robustness against semantic and black box attacks on the networks that were trained on different training data modifications for the particular example of morphing attacks. A morphing attack is an attack on a biometric facial recognition system where the system is fooled to match two different individuals with the same synthetic face image. Such a synthetic image can be created by aligning and blending images of the two individuals that should be matched with this image.