CVJul 7, 2023Code
RGB-D Mapping and Tracking in a Plenoxel Radiance FieldAndreas L. Teigen, Yeonsoo Park, Annette Stahl et al.
The widespread adoption of Neural Radiance Fields (NeRFs) have ensured significant advances in the domain of novel view synthesis in recent years. These models capture a volumetric radiance field of a scene, creating highly convincing, dense, photorealistic models through the use of simple, differentiable rendering equations. Despite their popularity, these algorithms suffer from severe ambiguities in visual data inherent to the RGB sensor, which means that although images generated with view synthesis can visually appear very believable, the underlying 3D model will often be wrong. This considerably limits the usefulness of these models in practical applications like Robotics and Extended Reality (XR), where an accurate dense 3D reconstruction otherwise would be of significant value. In this paper, we present the vital differences between view synthesis models and 3D reconstruction models. We also comment on why a depth sensor is essential for modeling accurate geometry in general outward-facing scenes using the current paradigm of novel view synthesis methods. Focusing on the structure-from-motion task, we practically demonstrate this need by extending the Plenoxel radiance field model: Presenting an analytical differential approach for dense mapping and tracking with radiance fields based on RGB-D data without a neural network. Our method achieves state-of-the-art results in both mapping and tracking tasks, while also being faster than competing neural network-based approaches. The code is available at: https://github.com/ysus33/RGB-D_Plenoxel_Mapping_Tracking.git
13.0CVMay 18Code
Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory LabelsEspen Uri Høgstedt, Christian Schellewald, Annette Stahl et al.
Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.
CVMar 8, 2022
Geolocation estimation of target vehicles using image processing and geometric computationElnaz Namazi, Rudolf Mester, Chaoru Lu et al.
Estimating vehicles' locations is one of the key components in intelligent traffic management systems (ITMSs) for increasing traffic scene awareness. Traditionally, stationary sensors have been employed in this regard. The development of advanced sensing and communication technologies on modern vehicles (MVs) makes it feasible to use such vehicles as mobile sensors to estimate the traffic data of observed vehicles. This study aims to explore the capabilities of a monocular camera mounted on an MV in order to estimate the geolocation of the observed vehicle in a global positioning system (GPS) coordinate system. We proposed a new methodology by integrating deep learning, image processing, and geometric computation to address the observed-vehicle localization problem. To evaluate our proposed methodology, we developed new algorithms and tested them using real-world traffic data. The results indicated that our proposed methodology and algorithms could effectively estimate the observed vehicle's latitude and longitude dynamically.
CVNov 17, 2023
Removing Adverse Volumetric Effects From Trained Neural Radiance FieldsAndreas L. Teigen, Mauhing Yip, Victor P. Hamran et al.
While the use of neural radiance fields (NeRFs) in different challenging settings has been explored, only very recently have there been any contributions that focus on the use of NeRF in foggy environments. We argue that the traditional NeRF models are able to replicate scenes filled with fog and propose a method to remove the fog when synthesizing novel views. By calculating the global contrast of a scene, we can estimate a density threshold that, when applied, removes all visible fog. This makes it possible to use NeRF as a way of rendering clear views of objects of interest located in fog-filled environments. Additionally, to benchmark performance on such scenes, we introduce a new dataset that expands some of the original synthetic NeRF scenes through the addition of fog and natural environments. The code, dataset, and video results can be found on our project page: https://vegardskui.com/fognerf/
CVSep 30, 2025Code
A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging EnvironmentsEspen Uri Høgstedt, Christian Schellewald, Annette Stahl et al.
Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.
CVJan 6, 2022Code
Realistic Full-Body Anonymization with Surface-Guided GANsHåkon Hukkelås, Morten Smebye, Rudolf Mester et al.
Recent work on image anonymization has shown that generative adversarial networks (GANs) can generate near-photorealistic faces to anonymize individuals. However, scaling up these networks to the entire human body has remained a challenging and yet unsolved task. We propose a new anonymization method that generates realistic humans for in-the-wild images. A key part of our design is to guide adversarial nets by dense pixel-to-surface correspondences between an image and a canonical 3D surface. We introduce Variational Surface-Adaptive Modulation (V-SAM) that embeds surface information throughout the generator. Combining this with our novel discriminator surface supervision loss, the generator can synthesize high quality humans with diverse appearances in complex and varying scenes. We demonstrate that surface guidance significantly improves image quality and diversity of samples, yielding a highly practical generator. Finally, we show that our method preserves data usability without infringing privacy when collecting image datasets for training computer vision models. Source code and appendix is available at: \href{https://github.com/hukkelas/full_body_anonymization}{github.com/hukkelas/full\_body\_anonymization}
CVFeb 25, 2025
Near-Shore Mapping for Detection and Tracking of VesselsNicholas Dalhaug, Annette Stahl, Rudolf Mester et al.
For an autonomous surface vessel (ASV) to dock, it must track other vessels close to the docking area. Kayaks present a particular challenge due to their proximity to the dock and relatively small size. Maritime target tracking has typically employed land masking to filter out land and the dock. However, imprecise land masking makes it difficult to track close-to-dock objects. Our approach uses Light Detection And Ranging (LiDAR) data and maps the docking area offline. The precise 3D measurements allow for precise map creation. However, the mapping could result in static, yet potentially moving, objects being mapped. We detect and filter out potentially moving objects from the LiDAR data by utilizing image data. The visual vessel detection and segmentation method is a neural network that is trained on our labeled data. Close-to-shore tracking improves with an accurate map and is demonstrated on a recently gathered real-world dataset. The dataset contains multiple sequences of a kayak and a day cruiser moving close to the dock, in a collision path with an autonomous ferry prototype.
CVFeb 15
Differential pose optimization in descriptor space -- Combining Geometric and Photometric Methods for Motion EstimationAndreas L. Teigen, Annette Stahl, Rudolf Mester
One of the fundamental problems in computer vision is the two-frame relative pose optimization problem. Primarily, two different kinds of error values are used: photometric error and re-projection error. The selection of error value is usually directly dependent on the selection of feature paradigm, photometric features, or geometric features. It is a trade-off between accuracy, robustness, and the possibility of loop closing. We investigate a third method that combines the strengths of both paradigms into a unified approach. Using densely sampled geometric feature descriptors, we replace the photometric error with a descriptor residual from a dense set of descriptors, thereby enabling the employment of sub-pixel accuracy in differential photometric methods, along with the expressiveness of the geometric feature descriptor. Experiments show that although the proposed strategy is an interesting approach that results in accurate tracking, it ultimately does not outperform pose optimization strategies based on re-projection error despite utilizing more information. We proceed to analyze the underlying reason for this discrepancy and present the hypothesis that the descriptor similarity metric is too slowly varying and does not necessarily correspond strictly to keypoint placement accuracy.
CVSep 3, 2021
Model-Based Parameter Optimization for Ground Texture Based Localization MethodsJan Fabian Schmid, Stephan F. Simon, Rudolf Mester
A promising approach to accurate positioning of robots is ground texture based localization. It is based on the observation that visual features of ground images enable fingerprint-like place recognition. We tackle the issue of efficient parametrization of such methods, deriving a prediction model for localization performance, which requires only a small collection of sample images of an application area. In a first step, we examine whether the model can predict the effects of changing one of the most important parameters of feature-based localization methods: the number of extracted features. We examine two localization methods, and in both cases our evaluation shows that the predictions are sufficiently accurate. Since this model can be used to find suitable values for any parameter, we then present a holistic parameter optimization framework, which finds suitable texture-specific parameter configurations, using only the model to evaluate the considered parameter configurations.
CVMay 31, 2021
Urban Traffic Surveillance (UTS): A fully probabilistic 3D tracking approach based on 2D detectionsHenry Bradler, Adrian Kretz, Rudolf Mester
Urban Traffic Surveillance (UTS) is a surveillance system based on a monocular and calibrated video camera that detects vehicles in an urban traffic scenario with dense traffic on multiple lanes and vehicles performing sharp turning maneuvers. UTS then tracks the vehicles using a 3D bounding box representation and a physically reasonable 3D motion model relying on an unscented Kalman filter based approach. Since UTS recovers positions, shape and motion information in a three-dimensional world coordinate system, it can be employed to recognize diverse traffic violations or to supply intelligent vehicles with valuable traffic information. We build on YOLOv3 as a detector yielding 2D bounding boxes and class labels for each vehicle. A 2D detector renders our system much more independent to different camera perspectives as a variety of labeled training data is available. This allows for a good generalization while also being more hardware efficient. The task of 3D tracking based on 2D detections is supported by integrating class specific prior knowledge about the vehicle shape. We quantitatively evaluate UTS using self generated synthetic data and ground truth from the CARLA simulator, due to the non-existence of datasets with an urban vehicle surveillance setting and labeled 3D bounding boxes. Additionally, we give a qualitative impression of how UTS performs on real-world data. Our implementation is capable of operating in real time on a reasonably modern workstation. To the best of our knowledge, UTS is to date the only 3D vehicle tracking system in a surveillance scenario (static camera observing moving targets).
CVNov 2, 2020
Image Inpainting with Learnable Feature ImputationHåkon Hukkelås, Frank Lindseth, Rudolf Mester
A regular convolution layer applying a filter in the same way over known and unknown areas causes visual artifacts in the inpainted image. Several studies address this issue with feature re-normalization on the output of the convolution. However, these models use a significant amount of learnable parameters for feature re-normalization, or assume a binary representation of the certainty of an output. We propose (layer-wise) feature imputation of the missing input values to a convolution. In contrast to learned feature re-normalization, our method is efficient and introduces a minimal number of parameters. Furthermore, we propose a revised gradient penalty for image inpainting, and a novel GAN architecture trained exclusively on adversarial loss. Our quantitative evaluation on the FDF dataset reflects that our revised gradient penalty and alternative convolution improves generated image quality significantly. We present comparisons on CelebA-HQ and Places2 to current state-of-the-art to validate our model.
CVFeb 27, 2020
Features for Ground Texture Based Localization -- A SurveyJan Fabian Schmid, Stephan F. Simon, Rudolf Mester
Ground texture based vehicle localization using feature-based methods is a promising approach to achieve infrastructure-free high-accuracy localization. In this paper, we provide the first extensive evaluation of available feature extraction methods for this task, using separately taken image pairs as well as synthetic transformations. We identify AKAZE, SURF and CenSurE as best performing keypoint detectors, and find pairings of CenSurE with the ORB, BRIEF and LATCH feature descriptors to achieve greatest success rates for incremental localization, while SIFT stands out when considering severe synthetic transformations as they might occur during absolute localization.
CVFeb 25, 2020
Ground Texture Based Localization Using Compact Binary DescriptorsJan Fabian Schmid, Stephan F. Simon, Rudolf Mester
Ground texture based localization is a promising approach to achieve high-accuracy positioning of vehicles. We present a self-contained method that can be used for global localization as well as for subsequent local localization updates, i.e. it allows a robot to localize without any knowledge of its current whereabouts, but it can also take advantage of a prior pose estimate to reduce computation time significantly. Our method is based on a novel matching strategy, which we call identity matching, that is based on compact binary feature descriptors. Identity matching treats pairs of features as matches only if their descriptors are identical. While other methods for global localization are faster to compute, our method reaches higher localization success rates, and can switch to local localization after the initial localization.
CVSep 10, 2019
DeepPrivacy: A Generative Adversarial Network for Face AnonymizationHåkon Hukkelås, Rudolf Mester, Frank Lindseth
We propose a novel architecture which is able to automatically anonymize faces in images while retaining the original data distribution. We ensure total anonymization of all faces in an image by generating images exclusively on privacy-safe information. Our model is based on a conditional generative adversarial network, generating images considering the original pose and image background. The conditional information enables us to generate highly realistic faces with a seamless transition between the generated face and the existing background. Furthermore, we introduce a diverse dataset of human faces, including unconventional poses, occluded faces, and a vast variability in backgrounds. Finally, we present experimental results reflecting the capability of our model to anonymize images while preserving the data distribution, making the data suitable for further training of deep learning models. As far as we know, no other solution has been proposed that guarantees the anonymization of faces while generating realistic images.
CVAug 17, 2019
Mono-SF: Multi-View Geometry Meets Single-View Depth for Monocular Scene Flow Estimation of Dynamic Traffic ScenesFabian Brickwedde, Steffen Abraham, Rudolf Mester
Existing 3D scene flow estimation methods provide the 3D geometry and 3D motion of a scene and gain a lot of interest, for example in the context of autonomous driving. These methods are traditionally based on a temporal series of stereo images. In this paper, we propose a novel monocular 3D scene flow estimation method, called Mono-SF. Mono-SF jointly estimates the 3D structure and motion of the scene by combining multi-view geometry and single-view depth information. Mono-SF considers that the scene flow should be consistent in terms of warping the reference image in the consecutive image based on the principles of multi-view geometry. For integrating single-view depth in a statistical manner, a convolutional neural network, called ProbDepthNet, is proposed. ProbDepthNet estimates pixel-wise depth distributions from a single image rather than single depth values. Additionally, as part of ProbDepthNet, a novel recalibration technique for regression problems is proposed to ensure well-calibrated distributions. Our experiments show that Mono-SF outperforms state-of-the-art monocular baselines and ablation studies support the Mono-SF approach and ProbDepthNet design.
CVAug 7, 2019
Mono-Stixels: Monocular depth reconstruction of dynamic street scenesFabian Brickwedde, Steffen Abraham, Rudolf Mester
In this paper we present mono-stixels, a compact environment representation specially designed for dynamic street scenes. Mono-stixels are a novel approach to estimate stixels from a monocular camera sequence instead of the traditionally used stereo depth measurements. Our approach jointly infers the depth, motion and semantic information of the dynamic scene as a 1D energy minimization problem based on optical flow estimates, pixel-wise semantic segmentation and camera motion. The optical flow of a stixel is described by a homography. By applying the mono-stixel model the degrees of freedom of a stixel-homography are reduced to only up to two degrees of freedom. Furthermore, we exploit a scene model and semantic information to handle moving objects. In our experiments we use the public available DeepFlow for optical flow estimation and FCN8s for the semantic information as inputs and show on the KITTI 2015 dataset that mono-stixels provide a compact and reliable depth reconstruction of both the static and moving parts of the scene. Thereby, mono-stixels overcome the limitation to static scenes of previous structure-from-motion approaches.
CVJul 24, 2019
SDNet: Semantically Guided Depth Estimation NetworkMatthias Ochs, Adrian Kretz, Rudolf Mester
Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically prove that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.
CVApr 17, 2019
DistanceNet: Estimating Traveled Distance from Monocular Images using a Recurrent Convolutional Neural NetworkRobin Kreuzig, Matthias Ochs, Rudolf Mester
Classical monocular vSLAM/VO methods suffer from the scale ambiguity problem. Hybrid approaches solve this problem by adding deep learning methods, for example by using depth maps which are predicted by a CNN. We suggest that it is better to base scale estimation on estimating the traveled distance for a set of subsequent images. In this paper, we propose a novel end-to-end many-to-one traveled distance estimator. By using a deep recurrent convolutional neural network (RCNN), the traveled distance between the first and last image of a set of consecutive frames is estimated by our DistanceNet. Geometric features are learned in the CNN part of our model, which are subsequently used by the RNN to learn dynamics and temporal information. Moreover, we exploit the natural order of distances by using ordinal regression to predict the distance. The evaluation on the KITTI dataset shows that our approach outperforms current state-of-the-art deep learning pose estimators and classical mono vSLAM/VO methods in terms of distance prediction. Thus, our DistanceNet can be used as a component to solve the scale problem and help improve current and future classical mono vSLAM/VO methods.
AINov 19, 2018
Simulated Autonomous Driving in a Realistic Driving Environment using Deep Reinforcement Learning and a Deterministic Finite State MachinePatrick Klose, Rudolf Mester
In the field of Autonomous Driving, the system controlling the vehicle can be seen as an agent acting in a complex environment and thus naturally fits into the modern framework of Reinforcement Learning. However, learning to drive can be a challenging task and current results are often restricted to simplified driving environments. To advance the field, we present a method to adaptively restrict the action space of the agent according to its current driving situation and show that it can be used to swiftly learn to drive in a realistic environment based on the Deep Q-Network algorithm.
AIDec 12, 2017
Simulated Autonomous Driving on Realistic Road Networks using Deep Reinforcement LearningPatrick Klose, Rudolf Mester
Using Deep Reinforcement Learning (DRL) can be a promising approach to handle various tasks in the field of (simulated) autonomous driving. However, recent publications mainly consider learning in unusual driving environments. This paper presents Driving School for Autonomous Agents (DSA^2), a software for validating DRL algorithms in more usual driving environments based on artificial and realistic road networks. We also present the results of applying DSA^2 for handling the task of driving on a straight road while regulating the velocity of one vehicle according to different speed limits.
CVMar 15, 2017
Joint Epipolar Tracking (JET): Simultaneous optimization of epipolar geometry and feature correspondencesHenry Bradler, Matthias Ochs, Rudolf Mester
Traditionally, pose estimation is considered as a two step problem. First, feature correspondences are determined by direct comparison of image patches, or by associating feature descriptors. In a second step, the relative pose and the coordinates of corresponding points are estimated, most often by minimizing the reprojection error (RPE). RPE optimization is based on a loss function that is merely aware of the feature pixel positions but not of the underlying image intensities. In this paper, we propose a sparse direct method which introduces a loss function that allows to simultaneously optimize the unscaled relative pose, as well as the set of feature correspondences directly considering the image intensity values. Furthermore, we show how to integrate statistical prior information on the motion into the optimization process. This constructive inclusion of a Bayesian bias term is particularly efficient in application cases with a strongly predictable (short term) dynamic, e.g. in a driving scenario. In our experiments, we demonstrate that the JET algorithm we propose outperforms the classical reprojection error optimization on two synthetic datasets and on the KITTI dataset. The JET algorithm runs in real-time on a single CPU thread.
CVMar 15, 2017
Learning Rank Reduced Interpolation with Principal Component AnalysisMatthias Ochs, Henry Bradler, Rudolf Mester
In computer vision most iterative optimization algorithms, both sparse and dense, rely on a coarse and reliable dense initialization to bootstrap their optimization procedure. For example, dense optical flow algorithms profit massively in speed and robustness if they are initialized well in the basin of convergence of the used loss function. The same holds true for methods as sparse feature tracking when initial flow or depth information for new features at arbitrary positions is needed. This makes it extremely important to have techniques at hand that allow to obtain from only very few available measurements a dense but still approximative sketch of a desired 2D structure (e.g. depth maps, optical flow, disparity maps, etc.). The 2D map is regarded as sample from a 2D random process. The method presented here exploits the complete information given by the principal component analysis (PCA) of that process, the principal basis and its prior distribution. The method is able to determine a dense reconstruction from sparse measurement. When facing situations with only very sparse measurements, typically the number of principal components is further reduced which results in a loss of expressiveness of the basis. We overcome this problem and inject prior knowledge in a maximum a posterior (MAP) approach. We test our approach on the KITTI and the virtual KITTI datasets and focus on the interpolation of depth maps for driving scenes. The evaluation of the results show good agreement to the ground truth and are clearly better than results of interpolation by the nearest neighbor method which disregards statistical information.
CVSep 15, 2016
Lost and Found: Detecting Small Road Hazards for Self-Driving VehiclesPeter Pinggera, Sebastian Ramos, Stefan Gehrig et al.
Detecting small obstacles on the road ahead is a critical part of the driving task which has to be mastered by fully autonomous cars. In this paper, we present a method based on stereo vision to reliably detect such obstacles from a moving vehicle. The proposed algorithm performs statistical hypothesis tests in disparity space directly on stereo image data, assessing freespace and obstacle hypotheses on independent local patches. This detection approach does not depend on a global road model and handles both static and moving obstacles. For evaluation, we employ a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of obstacle and free-space and provide a thorough comparison to several stereo-based baseline methods. The dataset will be made available to the community to foster further research on this important topic. The proposed approach outperforms all considered baselines in our evaluations on both pixel and object level and runs at frame rates of up to 20 Hz on 2 mega-pixel stereo imagery. Small obstacles down to the height of 5 cm can successfully be detected at 20 m distance at low false positive rates.