Eckehard Steinbach

CV
h-index: 18
29 papers
415 citations
Novelty: 48%
AI Score: 56

29 Papers

CV · Jul 8, 2022 · Code
Bounding Box Disparity: 3D Metrics for Object Detection With Full Degree of Freedom

Michael G. Adam, Martin Piccolrovazzi, Sebastian Eger et al.

The most popular evaluation metric for object detection in 2D images is Intersection over Union (IoU). Existing implementations of the IoU metric for 3D object detection usually neglect one or more degrees of freedom. In this paper, we first derive the analytic solution of the IoU for three-dimensional bounding boxes. As a second contribution, a closed-form solution of the volume-to-volume distance is derived. Finally, the Bounding Box Disparity is proposed as a combined positive continuous metric. We provide open source implementations of the three metrics as standalone Python functions, as well as extensions to the Open3D library and as ROS nodes.
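
For orientation, the sketch below computes IoU for two axis-aligned 3D boxes. This is the restricted case that ignores rotation, not the full-degree-of-freedom solution the abstract describes. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` and the function names are assumptions made for this illustration.

```python
from typing import Tuple

# Assumed layout for this sketch: (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned 3D boxes (no rotation)."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])            # start of the overlap on this axis
        high = min(a[axis + 3], b[axis + 3])   # end of the overlap on this axis
        intersection *= max(0.0, high - low)   # zero if the boxes do not overlap
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0.0 else 0.0


if __name__ == "__main__":
    # Two 2x2x2 cubes offset by 1 along every axis: overlap 1, union 15.
    print(iou_3d_axis_aligned((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)))
```

Once boxes may rotate freely, the overlap is no longer a product of per-axis intervals, which is the gap the paper's analytic solution addresses.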

IV · Mar 4, 2022
Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression

A. Burakhan Koyuncu, Han Gao, Atanas Boev et al.

Entropy modeling is a key component for high-performance image compression algorithms. Recent developments in autoregressive context modeling helped learning-based methods to surpass their classical counterparts. However, the performance of those models can be further improved due to the underexploited spatio-channel dependencies in latent space, and the suboptimal implementation of context adaptivity. Inspired by the adaptive characteristics of the transformers, we propose a transformer-based context model, named Contextformer, which generalizes the de facto standard attention mechanism to spatio-channel attention. We replace the context model of a modern compression framework with the Contextformer and test it on the widely used Kodak, CLIC2020, and Tecnick image datasets. Our experimental results show that the proposed model provides up to 11% rate savings compared to the standard Versatile Video Coding (VVC) Test Model (VTM) 16.2, and outperforms various learning-based models in terms of PSNR and MS-SSIM.

IV · Jun 25, 2023
Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression

A. Burakhan Koyuncu, Panqi Jia, Atanas Boev et al.

Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, however, at the expense of a significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer) - a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has ~145x lower model complexity and ~210x faster decoding speed, and achieves higher average bit savings on Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.

CV · Mar 10, 2023
MCROOD: Multi-Class Radar Out-Of-Distribution Detection

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Eckehard Steinbach

Out-of-distribution (OOD) detection has recently received special attention due to its critical role in safely deploying modern deep learning (DL) architectures. This work proposes a reconstruction-based multi-class OOD detector that operates on radar range-Doppler images (RDIs). The detector aims to classify any moving object other than a person sitting, standing, or walking as OOD. We also provide a simple yet effective pre-processing technique to detect minor human body movements like breathing. This technique, called the respiration detector (RESPD), eases OOD detection, especially for the human sitting and standing classes. On our dataset collected by a 60 GHz short-range FMCW radar, we achieve AUROCs of 97.45%, 92.13%, and 96.58% for sitting, standing, and walking classes, respectively. We perform extensive experiments and show that our method outperforms state-of-the-art (SOTA) OOD detection methods. Also, our pipeline performs 24 times faster than the second-best method and is very suitable for real-time processing.

RO · Oct 9, 2023
Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care Environments

Michael G. Adam, Sebastian Eger, Martin Piccolrovazzi et al.

As labor shortage increases in the health sector, the demand for assistive robotics grows. However, the needed test data to develop those robots is scarce, especially for the application of active 3D object detection, where no real data exists at all. This short paper counters this by introducing such an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We further provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.

SP · Jul 24, 2023
HOOD: Real-Time Human Presence and Out-of-Distribution Detection Using FMCW Radar

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Eckehard Steinbach

Detecting human presence indoors with millimeter-wave frequency-modulated continuous-wave (FMCW) radar faces challenges from both moving and stationary clutter. This work proposes a robust and real-time capable human presence and out-of-distribution (OOD) detection method using 60 GHz short-range FMCW radar. HOOD solves the human presence and OOD detection problems simultaneously in a single pipeline. Our solution relies on a reconstruction-based architecture and works with radar macro and micro range-Doppler images (RDIs). HOOD aims to accurately detect the presence of humans in the presence or absence of moving and stationary disturbers. Since HOOD is also an OOD detector, it aims to detect moving or stationary clutters as OOD in humans' absence and predicts the current scene's output as "no presence." HOOD performs well in diverse scenarios, demonstrating its effectiveness across different human activities and situations. On our dataset collected with a 60 GHz short-range FMCW radar, we achieve an average AUROC of 94.36%. Additionally, our extensive evaluations and experiments demonstrate that HOOD outperforms state-of-the-art (SOTA) OOD detection methods in terms of common OOD detection metrics. Importantly, HOOD also perfectly fits on Raspberry Pi 3B+ with an ARM Cortex-A53 CPU, which showcases its versatility across different hardware environments. Videos of our human presence detection experiments are available at: https://muskahya.github.io/HOOD

SP · Feb 27, 2023
Reconstruction-based Out-of-Distribution Detection for Short-Range FMCW Radar

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Eckehard Steinbach

Out-of-distribution (OOD) detection recently has drawn attention due to its critical role in the safe deployment of modern neural network architectures in real-world applications. The OOD detectors aim to distinguish samples that lie outside the training distribution in order to avoid the overconfident predictions of machine learning models on OOD data. Existing detectors, which mainly rely on the logit, intermediate feature space, softmax score, or reconstruction loss, manage to produce promising results. However, most of these methods are developed for the image domain. In this study, we propose a novel reconstruction-based OOD detector to operate on the radar domain. Our method exploits an autoencoder (AE) and its latent representation to detect the OOD samples. We propose two scores incorporating the patch-based reconstruction loss and the energy value calculated from the latent representations of each patch. We achieve an AUROC of 90.72% on our dataset collected by using 60 GHz short-range FMCW Radar. The experiments demonstrate that, in terms of AUROC and AUPR, our method outperforms the baseline (AE) and the other state-of-the-art methods. Also, thanks to its model size of 641 kB, our detector is suitable for embedded usage.
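
The two ingredients named in the abstract, a patch-based reconstruction loss and an energy value computed from each patch's latent representation, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: patches and latents are plain lists of floats, the energy uses the common form `-T * logsumexp(z / T)`, and the helper names are hypothetical. How the paper combines these quantities into its two scores is not given in the abstract, so the sketch only reports both averages.

```python
import math
from typing import Sequence, Tuple


def patch_mse(patch: Sequence[float], recon: Sequence[float]) -> float:
    """Mean squared reconstruction error of one flattened patch."""
    return sum((p - r) ** 2 for p, r in zip(patch, recon)) / len(patch)


def latent_energy(latent: Sequence[float], temperature: float = 1.0) -> float:
    """Energy -T * logsumexp(z / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in latent]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


def patch_scores(
    patches: Sequence[Sequence[float]],
    recons: Sequence[Sequence[float]],
    latents: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Average reconstruction error and average latent energy over all patches."""
    count = len(patches)
    mean_mse = sum(patch_mse(p, r) for p, r in zip(patches, recons)) / count
    mean_energy = sum(latent_energy(z) for z in latents) / count
    return mean_mse, mean_energy


if __name__ == "__main__":
    print(patch_scores([[1.0, 2.0]], [[1.0, 4.0]], [[0.0, 0.0]]))
```

A detector would compare these values against thresholds chosen on in-distribution data; the threshold choice is outside what the abstract specifies.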

CV · Mar 17, 2023
Remote Task-oriented Grasp Area Teaching By Non-Experts through Interactive Segmentation and Few-Shot Learning

Furkan Kaynar, Sudarshan Rajagopalan, Shaobo Zhou et al.

A robot operating in unstructured environments must be able to discriminate between different grasping styles depending on the prospective manipulation task. Having a system that allows learning from remote non-expert demonstrations can very feasibly extend the cognitive skills of a robot for task-oriented grasping. We propose a novel two-step framework towards this aim. The first step involves grasp area estimation by segmentation. We receive grasp area demonstrations for a new task via interactive segmentation, and learn from these few demonstrations to estimate the required grasp area on an unseen scene for the given task. The second step is autonomous grasp estimation in the segmented region. To train the segmentation network for few-shot learning, we built a grasp area segmentation (GAS) dataset with 10089 images grouped into 1121 segmentation tasks. We benefit from an efficient meta learning algorithm for training for few-shot adaptation. Experimental evaluation showed that our method successfully detects the correct grasp area on the respective objects in unseen test scenes and effectively allows remote teaching of new grasp strategies by non-experts.

20.8 · CV · May 26
DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models

Furkan Mert Algan, Eckehard Steinbach

3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative based completion methods while using fewer parameters, requiring lower memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.

26.3 · CV · May 7
A Causal Diffusion Model for Video Reconstruction from Ultra-Low-Bitrate Representations

Cem Eteke, Batuhan Tosun, Martin Piccolrovazzi et al.

We study video reconstruction from ultra-low-bitrate representations, where the primary challenge shifts from encoding to decoding. In this regime, reconstruction with classical and neural codecs introduces blur, while generative and semantic approaches often struggle to jointly preserve fidelity, temporal consistency, and perceptual quality. To address these limitations, we propose a causal video diffusion model that reconstructs videos from ultra-low-bitrate semantics and highly compressed frames by jointly modeling their complementary information. We further introduce temporal-only distillation from a bidirectional teacher to enable parameter-efficient training and causal few-step inference. Through extensive quantitative, qualitative, and subjective evaluation, we show that the proposed method outperforms classical, neural, generative, and semantic baselines in ultra-low-bitrate video reconstruction.

CV · Sep 18, 2024
LEMON: Localized Editing with Mesh Optimization and Neural Shaders

Furkan Mert Algan, Umut Yazgan, Driton Salihu et al.

In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.

CV · Jan 13 · Code
REVNET: Rotation-Equivariant Point Cloud Completion via Vector Neuron Anchor Transformer

Zhifan Ni, Eckehard Steinbach

Incomplete point clouds captured by 3D sensors often result in the loss of both geometric and semantic information. Most existing point cloud completion methods are built on rotation-variant frameworks trained with data in canonical poses, limiting their applicability in real-world scenarios. While data augmentation with random rotations can partially mitigate this issue, it significantly increases the learning burden and still fails to guarantee robust performance under arbitrary poses. To address this challenge, we propose the Rotation-Equivariant Anchor Transformer (REVNET), a novel framework built upon the Vector Neuron (VN) network for robust point cloud completion under arbitrary rotations. To preserve local details, we represent partial point clouds as sets of equivariant anchors and design a VN Missing Anchor Transformer to predict the positions and features of missing anchors. Furthermore, we extend VN networks with a rotation-equivariant bias formulation and a ZCA-based layer normalization to improve feature expressiveness. Leveraging the flexible conversion between equivariant and invariant VN features, our model can generate point coordinates with greater stability. Experimental results show that our method outperforms state-of-the-art approaches on the synthetic MVP dataset in the equivariant setting. On the real-world KITTI dataset, REVNET delivers competitive results compared to non-equivariant networks, without requiring input pose alignment. The source code will be released on GitHub under URL: https://github.com/nizhf/REVNET.

RO · May 23, 2018 · Code
MAVI: A Research Platform for Telepresence and Teleoperation

Mojtaba Karimi, Tamay Aykut, Eckehard Steinbach

One of the goals in telepresence is to be able to perform daily tasks remotely. A key requirement for this is a robust and reliable mobile robotic platform. Ideally, such a platform should support 360-degree stereoscopic vision and semi-autonomous telemanipulation ability. In this technical report, we present our latest work on designing the telepresence mobile robot platform called MAVI. MAVI is a low-cost and robust but extendable platform for research and educational purposes, especially for machine vision and human interaction in telepresence setups. The MAVI platform offers a balance between modularity, capabilities, accessibility, cost and an open source software framework. With a range of different sensors such as an Inertial Measurement Unit (IMU), 360-degree laser rangefinder, ultrasonic proximity sensors, and force sensors, along with smart actuation in omnidirectional holonomic locomotion, a high-load cylindrical manipulator, and an actuated stereoscopic Pan-Tilt-Roll Unit (PTRU), MAVI can not only provide basic feedback from its surroundings but also interact within the remote environment in multiple ways. The software architecture of MAVI is based on the Robot Operating System (ROS), which allows for the easy integration of state-of-the-art software packages.

CV · Dec 19, 2025
FOODER: Real-time Facial Authentication and Expression Recognition

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Boran Hamdi Sivrikaya et al.

Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.

CV · Dec 14, 2023
HAROOD: Human Activity Classification and Out-of-Distribution Detection with Short-Range FMCW Radar

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Eckehard Steinbach

We propose HAROOD as a short-range FMCW radar-based human activity classifier and out-of-distribution (OOD) detector. It aims to classify human sitting, standing, and walking activities and to detect any other moving or stationary object as OOD. We introduce a two-stage network. The first stage is trained with a novel loss function that includes intermediate reconstruction loss, intermediate contrastive loss, and triplet loss. The second stage uses the first stage's output as its input and is trained with cross-entropy loss. It creates a simple classifier that performs the activity classification. On our dataset collected by 60 GHz short-range FMCW radar, we achieve an average classification accuracy of 96.51%. Also, we achieve an average AUROC of 95.04% as an OOD detector. Additionally, our extensive evaluations demonstrate the superiority of HAROOD over the state-of-the-art OOD detection methods in terms of standard OOD detection metrics.
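
Of the three loss terms the abstract lists for the first stage, the triplet loss has a widely used standard form, sketched below. The Euclidean distance, the margin value, and the function names are assumptions for illustration; the abstract does not state which variant or weighting HAROOD uses, and the intermediate reconstruction and contrastive terms are not shown.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """Standard triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Negative is already far away, so the loss vanishes.
    print(triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0)))
    # Negative is only slightly farther than the positive, so a penalty remains.
    print(triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5)))
```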

CV · Feb 27, 2024
ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living

Marsil Zakour, Partha Pratim Nath, Ludwig Lohmer et al.

Hand-Object Interactions (HOIs) are conditioned on spatial and temporal contexts like surrounding objects, previous actions, and future intents (for example, grasping and handover actions vary greatly based on object proximity and trajectory obstruction). However, existing datasets for 4D HOI (3D HOI over time) are limited to one subject interacting with one object only. This restricts the generalization of learning-based HOI methods trained on those datasets. We introduce ADL4D, a dataset of up to two subjects interacting with different sets of objects performing Activities of Daily Living (ADL) like breakfast or lunch preparation activities. The transition between multiple objects to complete a certain task over time introduces a unique context lacking in existing datasets. Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations. We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time. We integrate and test it against publicly available datasets. Finally, we evaluate our dataset on the tasks of Hand Mesh Recovery (HMR) and Hand Action Segmentation (HAS).

IV · Feb 27, 2024
Adapting Learned Image Codecs to Screen Content via Adjustable Transformations

H. Burak Dogaroglu, A. Burakhan Koyuncu, Atanas Boev et al.

As learned image codecs (LICs) become more prevalent, their low coding efficiency for out-of-distribution data becomes a bottleneck for some applications. To improve the performance of LICs for screen content (SC) images without breaking backwards compatibility, we propose to introduce parameterized and invertible linear transformations into the coding pipeline without changing the underlying baseline codec's operation flow. We design two neural networks to act as prefilters and postfilters in our setup to increase the coding efficiency and help with the recovery from coding artifacts. Our end-to-end trained solution achieves up to 10% bitrate savings on SC compression compared to the baseline LICs while introducing only 1% extra parameters.

CV · Sep 8, 2025
BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration

Cem Eteke, Alexander Griessel, Wolfgang Kellerer et al.

This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.

CV · Jan 14, 2025
FARE: A Deep Learning-Based Framework for Radar-based Face Recognition and Out-of-distribution Detection

Sabri Mustafa Kahya, Boran Hamdi Sivrikaya, Muhammet Sami Yavuz et al.

In this work, we propose a novel pipeline for face recognition and out-of-distribution (OOD) detection using short-range FMCW radar. The proposed system utilizes range-Doppler and micro range-Doppler images. The architecture features a primary path (PP) responsible for the classification of in-distribution (ID) faces, complemented by intermediate paths (IPs) dedicated to OOD detection. The network is trained in two stages. In the first stage, the PP is trained using triplet loss to optimize ID face classification. In the second stage, the PP is frozen, and the IPs, which comprise simple linear autoencoder networks, are trained specifically for OOD detection. Using our dataset generated with a 60 GHz FMCW radar, our method achieves an ID classification accuracy of 99.30% and an OOD detection AUROC of 96.91%.

CV · Nov 18, 2024
FERT: Real-Time Facial Expression Recognition with Short-Range FMCW Radar

Sabri Mustafa Kahya, Muhammet Sami Yavuz, Eckehard Steinbach

This study proposes a novel approach for real-time facial expression recognition utilizing short-range Frequency-Modulated Continuous-Wave (FMCW) radar equipped with one transmit (Tx) and three receive (Rx) antennas. The system leverages four distinct modalities simultaneously: range-Doppler images (RDIs), micro range-Doppler images (micro-RDIs), range-azimuth images (RAIs), and range-elevation images (REIs). Our innovative architecture integrates feature extractor blocks, intermediate feature extractor blocks, and a ResNet block to accurately classify facial expressions into smile, anger, neutral, and no-face classes. Our model achieves an average classification accuracy of 98.91% on the dataset collected using a 60 GHz short-range FMCW radar. The proposed solution operates in real time in a person-independent manner, which shows the potential use of low-cost FMCW radars for effective facial expression recognition in various applications.

CV · Jun 6, 2024
FOOD: Facial Authentication and Out-of-Distribution Detection with Short-Range FMCW Radar

Sabri Mustafa Kahya, Boran Hamdi Sivrikaya, Muhammet Sami Yavuz et al.

This paper proposes a short-range FMCW radar-based facial authentication and out-of-distribution (OOD) detection framework. Our pipeline jointly estimates the correct classes for the in-distribution (ID) samples and detects the OOD samples to prevent their inaccurate prediction. Our reconstruction-based architecture consists of a main convolutional block with one encoder and multi-decoder configuration, and intermediate linear encoder-decoder parts. Together, these elements form an accurate human face classifier and a robust OOD detector. For our dataset, gathered using a 60 GHz short-range FMCW radar, our network achieves an average classification accuracy of 98.07% in identifying in-distribution human faces. As an OOD detector, it achieves an average Area Under the Receiver Operating Characteristic (AUROC) curve of 98.50% and an average False Positive Rate at 95% True Positive Rate (FPR95) of 6.20%. Also, our extensive experiments show that the proposed approach outperforms previous OOD detectors in terms of common OOD detection metrics.
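
The two OOD metrics quoted above can be computed from raw detector scores as in the sketch below. It assumes that higher scores indicate the positive class and uses small pure-Python helpers with hypothetical names. Which class counts as positive (ID or OOD) varies between papers and is not stated in the abstract.

```python
import math
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney U formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    target_tpr: float = 0.95,
) -> float:
    """False positive rate at the threshold where at least target_tpr of positives pass."""
    ranked = sorted(pos_scores, reverse=True)
    needed = max(1, math.ceil(target_tpr * len(ranked)))  # positives that must be accepted
    threshold = ranked[needed - 1]
    false_positives = sum(1 for n in neg_scores if n >= threshold)
    return false_positives / len(neg_scores)


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.7, 0.6]
    negatives = [0.65, 0.3, 0.2, 0.1]
    print(auroc(positives, negatives))       # 15 of 16 pairs ranked correctly
    print(fpr_at_tpr(positives, negatives))  # one of four negatives passes the threshold
```

The pairwise AUROC is quadratic in the number of samples, which is fine for illustration; library implementations use a sort-based computation instead.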

RO · Sep 24, 2019
Minimal Work: A Grasp Quality Metric for Deformable Hollow Objects

Jingyi Xu, Michael Danielczuk, Jeff Ichnowski et al.

Robot grasping of deformable hollow objects such as plastic bottles and cups is challenging as the grasp should resist disturbances while minimally deforming the object so as not to damage it or dislodge liquids. We propose minimal work as a novel grasp quality metric that combines wrench resistance and the object deformation. We introduce an efficient algorithm to compute required work to resist an external wrench for a manipulation task by solving a linear program. The algorithm first computes the minimum required grasp force and an estimation of the gripper jaw displacements based on the object deformability at different locations measured with physical experiments. The work done by the jaws is the product of the grasp force and the displacements. The grasp quality metric is computed based on the required work under perturbations of grasp poses to address uncertainties in actuation. We collect 460 physical grasps with a UR5 robot and a Robotiq gripper. Physical experiments suggest the minimal work quality metric reaches 74.2% balanced accuracy and is up to 24.2% higher than classical wrench-based quality metrics, where the balanced accuracy is the raw accuracy normalized by the number of successful and failed real-world grasps.
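
Two quantities from this abstract are simple enough to sketch: the jaw work as the product of grasp force and jaw displacements, and balanced accuracy, taken here in its usual sense as the mean of the success rates on successful and failed grasps. The function names, units, and the assumption that every jaw applies the same force are illustrative only. The linear program that yields the minimum grasp force is not reproduced.

```python
from typing import Sequence


def jaw_work(grasp_force_n: float, jaw_displacements_m: Sequence[float]) -> float:
    """Work in joules: grasp force times each jaw's displacement, summed over jaws.

    Assumes, for this sketch, the same force magnitude on every jaw.
    """
    return grasp_force_n * sum(jaw_displacements_m)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the true positive rate and the true negative rate."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return 0.5 * (true_positive_rate + true_negative_rate)


if __name__ == "__main__":
    # Hypothetical grasp: 10 N with jaw displacements of 2 mm and 3 mm.
    print(jaw_work(10.0, [0.002, 0.003]))
    # Hypothetical confusion counts for predicted grasp success.
    print(balanced_accuracy(tp=8, fn=2, tn=3, fp=7))
```

Balanced accuracy avoids rewarding a metric that simply predicts the majority outcome when successful and failed grasps are unevenly represented.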

RO · Sep 15, 2019
6DLS: Modeling Nonplanar Frictional Surface Contacts for Grasping using 6D Limit Surfaces

Jingyi Xu, Tamay Aykut, Daolin Ma et al.

Robot grasping with deformable gripper jaws results in nonplanar surface contacts if the jaws deform to the nonplanar local geometry of an object. The frictional force and torque that can be transmitted through a nonplanar surface contact are both three-dimensional, resulting in a six-dimensional frictional wrench (6DFW). Applying traditional planar contact models to such contacts leads to over-conservative results as the models do not consider the nonplanar surface geometry and only compute a three-dimensional subset of the 6DFW. To address this issue, we derive the 6DFW for nonplanar surfaces by combining concepts of differential geometry and Coulomb friction. We also propose two 6D limit surface (6DLS) models, generalized from well-known three-dimensional LS (3DLS) models, which describe the friction-motion constraints for a contact. We evaluate the 6DLS models by fitting them to the 6DFW samples obtained from six parametric surfaces and 2,932 meshed contacts from finite element method simulations of 24 rigid objects. We further present an algorithm to predict multicontact grasp success by building a grasp wrench space with the 6DLS model of each contact. To evaluate the algorithm, we collected 1,035 physical grasps of ten 3D-printed objects with a KUKA robot and a deformable parallel-jaw gripper. In our experiments, the algorithm achieves 66.8% precision, a metric inversely related to false positive predictions, and 76.9% recall, a metric inversely related to false negative predictions. The 6DLS models increase recall by up to 26.1% over 3DLS models with similar precision.

MM · Jun 20, 2019
Probabilistic Tile Visibility-Based Server-Side Rate Adaptation for Adaptive 360-Degree Video Streaming

Junni Zou, Chenglin Li, Chengming Liu et al.

In this paper, we study the server-side rate adaptation problem for streaming tile-based adaptive 360-degree videos to multiple users who are competing for transmission resources at the network bottleneck. Specifically, we develop a convolutional neural network (CNN)-based viewpoint prediction model to capture the nonlinear relationship between the future and historical viewpoints. A Laplace distribution model is utilized to characterize the probability distribution of the prediction error. Given the predicted viewpoint, we then map the viewport in the spherical space into its corresponding planar projection in the 2-D plane, and further derive the visibility probability of each tile based on the planar projection and the prediction error probability. According to the visibility probability, tiles are classified as viewport, marginal and invisible tiles. The server-side tile rate allocation problem for multiple users is then formulated as a non-linear discrete optimization problem to minimize the overall received video distortion of all users and the quality difference between the viewport and marginal tiles of each user, subject to the transmission capacity constraints and users' specific viewport requirements. We develop a steepest descent algorithm to solve this non-linear discrete optimization problem, by initializing the feasible starting point in accordance with the optimal solution of its continuous relaxation. Extensive experimental results show that the proposed algorithm can achieve a near-optimal solution, and outperforms the existing rate adaptation schemes for tile-based adaptive 360-video streaming.
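
A one-dimensional simplification of the visibility-probability step is sketched below: with a Laplace-distributed prediction error, the probability that the true viewpoint falls inside a tile's angular span follows directly from the Laplace CDF. The 1-D setting, the classification thresholds, and all names are assumptions for illustration. The paper works with the planar projection of a spherical viewport, which is not reproduced here.

```python
import math


def laplace_cdf(x: float, location: float, scale: float) -> float:
    """CDF of a Laplace distribution with the given location and scale."""
    if x < location:
        return 0.5 * math.exp((x - location) / scale)
    return 1.0 - 0.5 * math.exp(-(x - location) / scale)


def visibility_probability(
    tile_low: float, tile_high: float, predicted: float, scale: float
) -> float:
    """Probability that the true viewpoint lies within [tile_low, tile_high]."""
    return laplace_cdf(tile_high, predicted, scale) - laplace_cdf(tile_low, predicted, scale)


def classify_tile(
    probability: float,
    viewport_threshold: float = 0.5,    # hypothetical threshold
    invisible_threshold: float = 0.05,  # hypothetical threshold
) -> str:
    """Label a tile as viewport, marginal, or invisible from its visibility probability."""
    if probability >= viewport_threshold:
        return "viewport"
    if probability <= invisible_threshold:
        return "invisible"
    return "marginal"


if __name__ == "__main__":
    p = visibility_probability(-1.0, 1.0, predicted=0.0, scale=1.0)
    print(p, classify_tile(p))
```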

CVApr 5, 2018
Noise-resistant Deep Learning for Object Classification in 3D Point Clouds Using a Point Pair Descriptor

Dmytro Bobkov, Sili Chen, Ruiqing Jian et al.

Object retrieval and classification in point cloud data are challenged by noise, irregular sampling density, and occlusion. To address this issue, we propose a point pair descriptor that is robust to noise and occlusion and achieves high retrieval accuracy. We further show how the proposed descriptor can be used in a 4D convolutional neural network for the task of object classification. We propose a novel 4D convolutional layer that is able to learn class-specific clusters in the descriptor histograms. Finally, we provide experimental validation on three benchmark datasets, which confirms the superiority of the proposed approach.

HCMay 16, 2017
Toward QoE-Driven Dynamic Control Scheme Switching for Time-Delayed Teleoperation Systems: A Dedicated Case Study

Xiao Xu, Qian Liu, Eckehard Steinbach

Networked teleoperation with haptic feedback is a prime example of the emerging Tactile Internet, which requires a careful orchestration of haptic communication and control. One major challenge in this context is how to maximize the user's quality-of-experience (QoE) while ensuring at the same time the stability of the global control loop in the presence of communication delay. In this paper, we propose a dynamic control scheme switching strategy for teleoperation systems, which maximizes the QoE for time-varying communication delay. In order to validate the feasibility of the proposed approach, we perform a dedicated case study for a virtual teleoperation environment consisting of a one-dimensional spring-damper system, and conduct extensive subjective tests under various delay conditions for two control schemes: (1) teleoperation with the time-domain passivity approach (TDPA), which is highly delay-sensitive but supports highly dynamic interaction between the operator and a potentially quickly changing remote environment; (2) model-mediated teleoperation (MMT), which is tolerant of relatively large communication delays, but unsuitable for quickly changing, highly dynamic remote environments. For both schemes, we use recently proposed extensions, which incorporate perceptual data reduction to reduce the required packet rate between the operator and the teleoperator. One key contribution of this paper lies in the exploration of the intrinsic relationship among QoE, communication delay, and the control schemes, which provides fundamental guidance, not only to this research, but also to the future joint optimization of communication and control for time-delayed teleoperation systems.

RODec 21, 2015
Deep Learning for Surface Material Classification Using Haptic And Visual Information

Haitian Zheng, Lu Fang, Mengqi Ji et al.

When a user scratches a hand-held rigid tool across an object surface, an acceleration signal can be captured, which carries relevant information about the surface. More importantly, such a haptic signal is complementary to the visual appearance of the surface, which suggests the combination of both modalities for the recognition of the surface material. In this paper, we present a novel deep learning method for surface material classification based on a Fully Convolutional Network (FCN), which takes as input the aforementioned acceleration signal and a corresponding image of the surface texture. Compared to previous surface material classification solutions, which rely on a careful design of hand-crafted domain-specific features, our method automatically extracts discriminative features using deep learning. Experiments performed on the TUM surface material database demonstrate that our method achieves state-of-the-art classification accuracy robustly and efficiently.

MMOct 5, 2015
A System for Precise End-to-End Delay Measurements in Video Communication

Christoph Bachhuber, Eckehard Steinbach

Low-delay video transmission is becoming increasingly important. Delay-critical, video-enabled applications range from teleoperation scenarios such as controlling drones or telesurgery to autonomous control through computer vision algorithms applied to real-time video. To judge the quality of the video transmission in such a system, it is important to be able to precisely measure the end-to-end (E2E) delay of the transmitted video. We present a low-complexity system that automatically takes pairwise independent measurements of E2E delay. The precision can be well below one millisecond and is mainly limited by the sampling rate of the measurement system. In our implementation, we achieve a precision of 0.5 milliseconds with a sampling rate of 2 kHz.
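
The stated figures are consistent with the precision being one sampling period: at 2 kHz, consecutive samples are 0.5 ms apart. The sketch below only performs that arithmetic. It assumes the precision equals one sampling period, and the function name is hypothetical rather than part of the described system.

```python
def sampling_period_ms(sampling_rate_hz):
    """Time between two consecutive samples, in milliseconds."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return 1000.0 / sampling_rate_hz


print(sampling_period_ms(2000))  # 0.5
```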

MMJun 27, 2015
Keypoint Encoding for Improved Feature Extraction from Compressed Video at Low Bitrates

Jianshu Chao, Eckehard Steinbach

In many mobile visual analysis applications, compressed video is transmitted over a communication network and analyzed by a server. Typical processing steps performed at the server include keypoint detection, descriptor calculation, and feature matching. Video compression has been shown to have an adverse effect on feature-matching performance. The negative impact of compression can be reduced by using the keypoints extracted from the uncompressed video to calculate descriptors from the compressed video. Based on this observation, we propose to provide these keypoints to the server as side information and to extract only the descriptors from the compressed video. First, we introduce four different frame types for keypoint encoding to address different types of changes in video content. These frame types represent a new scene, the same scene, a slowly changing scene, or a rapidly moving scene and are determined by comparing features between successive video frames. Then, we propose Intra, Skip and Inter modes of encoding the keypoints for different frame types. For example, keypoints for new scenes are encoded using the Intra mode, and keypoints for unchanged scenes are skipped. As a result, the bitrate of the side information related to keypoint encoding is significantly reduced. Finally, we present pairwise matching and image retrieval experiments conducted to evaluate the performance of the proposed approach using the Stanford mobile augmented reality dataset and 720p format videos. The results show that the proposed approach offers significantly improved feature matching and image retrieval performance at a given bitrate.