CVSep 15, 2023Code
YCB-Ev 1.1: Event-vision dataset for 6DoF object pose estimationPavel Rojtberg, Thomas Pöllabauer
Our work introduces the YCB-Ev dataset, which contains synchronized RGB-D frames and event data that enables evaluating 6DoF object pose estimation algorithms using these modalities. This dataset provides ground truth 6DoF object poses for the same 21 YCB objects that were used in the YCB-Video (YCB-V) dataset, allowing for cross-dataset algorithm performance evaluation. The dataset consists of 21 synchronized event and RGB-D sequences, totalling 13,851 frames (7 minutes and 43 seconds of event data). Notably, 12 of these sequences feature the same object arrangement as the YCB-V subset used in the BOP challenge. Ground truth poses are generated by detecting objects in the RGB-D frames, interpolating the poses to align with the event timestamps, and then transferring them to the event coordinate frame using extrinsic calibration. Our dataset is the first to provide ground truth 6DoF pose data for event streams. Furthermore, we evaluate the generalization capabilities of two state-of-the-art algorithms, which were pre-trained for the BOP challenge, using our novel YCB-V sequences. The dataset is publicly available at https://github.com/paroj/ycbev.
CVNov 14, 2025Code
YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimationPavel Rojtberg, Julius Kühn
We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/paroj/ycbev_sd.
CVOct 5, 2022
Transferring dense object detection models to event-based dataVincenz Mechler, Pavel Rojtberg
Event-based image representations are fundamentally different to traditional dense images. This poses a challenge to apply current state-of-the-art models for object detection as they are designed for dense images. In this work we evaluate the YOLO object detection model on event data. To this end we replace dense-convolution layers by either sparse convolutions or asynchronous sparse convolutions which enables direct processing of event-based images and compare the performance and runtime to feeding event-histograms into dense-convolutions. Here, hyper-parameters are shared across all variants to isolate the effect sparse-representation has on detection performance. At this, we show that current sparse-convolution implementations cannot translate their theoretical lower computation requirements into an improved runtime.
CVApr 28, 2020
Style-transfer GANs for bridging the domain gap in synthetic pose estimator trainingPavel Rojtberg, Thomas Pöllabauer, Arjan Kuijper
Given the dependency of current CNN architectures on a large training set, the possibility of using synthetic data is alluring as it allows generating a virtually infinite amount of labeled training data. However, producing such data is a non-trivial task as current CNN architectures are sensitive to the domain gap between real and synthetic data. We propose to adopt general-purpose GAN models for pixel-level image translation, allowing to formulate the domain gap itself as a learning problem. The obtained models are then used either during training or inference to bridge the domain gap. Here, we focus on training the single-stage YOLO6D object pose estimator on synthetic CAD geometry only, where not even approximate surface information is available. When employing paired GAN models, we use an edge-based intermediate domain and introduce different mappings to represent the unknown surface properties. Our evaluation shows a considerable improvement in model performance when compared to a model trained with the same degree of domain randomization, while requiring only very little additional effort.
CVDec 13, 2019
Real-time texturing for 6D object instance detection from RGB ImagesPavel Rojtberg, Arjan Kuijper
For objected detection, the availability of color cues strongly influences detection rates and is even a prerequisite for many methods. However, when training on synthetic CAD data, this information is not available. We therefore present a method for generating a texture-map from image sequences in real-time. The method relies on 6 degree-of-freedom poses and a 3D-model being available. In contrast to previous works this allows interleaving detection and texturing for upgrading the detector on-the-fly. Our evaluation shows that the acquired texture-map significantly improves detection rates using the LINEMOD detector on RGB images only. Additionally, we use the texture-map to differentiate instances of the same object by surface color.
HCJul 9, 2019
User Guidance for Interactive Camera CalibrationPavel Rojtberg
For building a Augmented Reality (AR) pipeline, the most crucial step is the camera calibration as overall quality heavily depends on it. In turn camera calibration itself is influenced most by the choice of camera-to-pattern poses - yet currently there is only little research on guiding the user to a specific pose. We build upon our novel camera calibration framework that is capable to generate calibration poses in real-time and present a user study evaluating different visualization methods to guide the user to a target pose. Using the presented method even novel users are capable to perform a precise camera calibration in about 2 minutes.
CVJul 9, 2019
calibDB: enabling web based computer vision through on-the-fly camera calibrationPavel Rojtberg, Felix Gorschlüter
For many computer vision applications, the availability of camera calibration data is crucial as overall quality heavily depends on it. While calibration data is available on some devices through Augmented Reality (AR) frameworks like ARCore and ARKit, for most cameras this information is not available. Therefore, we propose a web based calibration service that not only aggregates calibration data, but also allows calibrating new cameras on-the-fly. We build upon a novel camera calibration framework that enables even novice users to perform a precise camera calibration in about 2 minutes. This allows general deployment of computer vision algorithms on the web, which was previously not possible due to lack of calibration data.
CVJul 9, 2019
Efficient Pose Selection for Interactive Camera CalibrationPavel Rojtberg, Arjan Kuijper
The choice of poses for camera calibration with planar patterns is only rarely considered - yet the calibration precision heavily depends on it. This work presents a pose selection method that finds a compact and robust set of calibration poses and is suitable for interactive calibration. Consequently, singular poses that would lead to an unreliable solution are avoided explicitly, while poses reducing the uncertainty of the calibration are favoured. For this, we use uncertainty propagation. Our method takes advantage of a self-identifying calibration pattern to track the camera pose in real-time. This allows to iteratively guide the user to the target poses, until the desired quality level is reached. Therefore, only a sparse set of key-frames is needed for calibration. The method is evaluated on separate training and testing sets, as well as on synthetic data. Our approach performs better than comparable solutions while requiring 30% less calibration frames.