Darius Burschka

CV
h-index: 2
12 papers
134 citations
Novelty: 42%
AI Score: 33

12 Papers

CV · Jul 13, 2022
Joint Prediction of Monocular Depth and Structure using Planar and Parallax Geometry

Hao Xing, Yifan Cao, Maximilian Biber et al.

Supervised learning depth estimation methods can achieve good performance when trained on high-quality ground truth, such as LiDAR data. However, LiDAR can only generate sparse 3D maps, which causes information loss, and high-quality per-pixel ground-truth depth data is difficult to acquire. To overcome this limitation, we propose a novel approach combining structure information from a promising Plane and Parallax geometry pipeline with depth information in a U-Net supervised learning network, which results in quantitative and qualitative improvement compared to existing popular learning-based methods. In particular, the model is evaluated on two large-scale and challenging datasets, the KITTI Vision Benchmark and the Cityscapes dataset, and achieves the best performance in terms of relative error. Compared with pure depth supervision models, our model has impressive performance on depth prediction of thin objects and edges, and compared to the structure prediction baseline, our model performs more robustly.

CV · Jul 12, 2022
Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network

Hao Xing, Darius Burschka

In skeleton-based action recognition, Graph Convolutional Networks model human skeletal joints as vertices and connect them through an adjacency matrix, which can be seen as a local attention mask. However, in most existing Graph Convolutional Networks, the local attention mask is defined based on the natural connections of human skeleton joints and ignores dynamic relations, for example between the head, hand and foot joints. In addition, the attention mechanism has been proven effective in Natural Language Processing and image description, but it is rarely investigated in existing methods. In this work, we propose a new adaptive spatial attention layer that extends the local attention map to a global one based on relative distance and relative angle information. Moreover, we design a new initial graph adjacency matrix that connects head, hands and feet, which shows visible improvement in terms of action recognition accuracy. The proposed model is evaluated on two large-scale and challenging datasets in the field of human activities in daily life: NTU-RGB+D and Kinetics skeleton. The results demonstrate that our model has strong performance on both datasets.

AI · Sep 13, 2024
Using The Concept Hierarchy for Household Action Recognition

Andrei Costinescu, Luis Figueredo, Darius Burschka

We propose a method to systematically represent both the static and the dynamic components of environments, i.e. objects and agents, as well as the changes that are happening in the environment, i.e. the actions and skills performed by agents. Our approach, the Concept Hierarchy, provides the necessary information for autonomous systems to represent environment states, perform action modeling and recognition, and plan the execution of tasks. Additionally, the hierarchical structure supports generalization and knowledge transfer to environments. We rigorously define tasks, actions, skills, and affordances that enable human-understandable action and skill recognition.

CV · Jul 1, 2025
Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Hao Xing, Kai Zhe Boey, Yuankai Wu et al.

Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.

RO · Feb 6, 2025
Adaptation of Task Goal States from Prior Knowledge

Andrei Costinescu, Darius Burschka

This paper presents a framework to define a task with freedom and variability in its goal state. A robot could use this to observe the execution of a task and target a different goal from the observed one; a goal that is still compatible with the task description but would be easier for the robot to execute. We define the model of an environment state and an environment variation, and present experiments on how to interactively create the variation from a single task demonstration and how to use this variation to create an execution plan for bringing any environment into the goal state.

CV · Sep 6, 2021
Robust Event Detection based on Spatio-Temporal Latent Action Unit using Skeletal Information

Hao Xing, Yuxuan Xue, Mingchuan Zhou et al.

This paper proposes a novel dictionary learning approach to detect event actions using skeletal information extracted from RGBD video. The event action is represented as several latent atoms and composed of latent spatial and temporal attributes. We demonstrate the method on the example of fall event detection. The skeleton frames are clustered by an initial K-means method. Each skeleton frame is assigned a varying weight parameter and fed into our Gradual Online Dictionary Learning (GODL) algorithm. During the training process, outlier frames are gradually filtered out by reducing the weight, which is inversely proportional to a cost. In order to strictly distinguish the event action from similar actions and robustly acquire its action unit, we build a latent unit temporal structure for each sub-action. We evaluate the proposed method on parts of the NTU RGB+D dataset, which include 209 fall videos, 405 ground-lift videos, 420 sit-down videos, and 280 videos of 46 other actions. We present the experimental validation of the achieved accuracy, recall and precision. Our approach achieves the best performance on precision and accuracy of human fall event detection, compared with other existing dictionary learning methods. With increasing noise ratio, our method retains the highest accuracy and the lowest variance.

RO · Jul 7, 2020
Optical Navigation in Unstructured Dynamic Railroad Environments

Darius Burschka, Christian Robl, Sebastian Ohrendorf-Weiss

We present an approach for optical navigation in unstructured, dynamic railroad environments. We propose a way to estimate the train motion solely from observations of the planar track bed. The occasional significant occlusions during the operation of the train limit the available observation to this difficult-to-track, repetitive area. This approach is a step towards replacing the expensive train management infrastructure with local intelligence on the train for SmartRail 4.0. We derive our approach for robust estimation of translation and rotation in these difficult environments and provide experimental validation of the approach on real rail scenarios.

CV · Jun 20, 2018
Classifying Object Manipulation Actions based on Grasp-types and Motion-Constraints

Kartik Gupta, Darius Burschka, Arnav Bhavsar

In this work, we address the challenging problem of fine-grained and coarse-grained recognition of object manipulation actions. Due to variations in geometrical and motion constraints, different manipulation actions can be used to perform different sets of actions with an object. Also, subtle movements are involved in completing most object manipulation actions. This makes the task of object manipulation action recognition difficult with motion information alone. We propose to use grasp and motion-constraint information to recognise and understand action intention with different objects. We also provide an extensive experimental evaluation on the recent Yale Human Grasping dataset, consisting of a large set of 455 manipulation actions. The evaluation involves a) different contemporary multi-class classifiers, and binary classifiers with a one-vs-one multi-class voting scheme, b) differential comparison results based on subsets of attributes involving information on grasp and motion constraints, c) fine-grained and coarse-grained object manipulation action recognition based on fine-grained as well as coarse-grained grasp type information, and d) a comparison between instance-level and sequence-level modeling of object manipulation actions. Our results justify the efficacy of grasp attributes for the task of fine-grained and coarse-grained object manipulation action recognition.

RO · Apr 27, 2018
Interaction-Aware Probabilistic Behavior Prediction in Urban Environments

Jens Schulz, Constantin Hubmann, Julian Löchner et al.

Planning for autonomous driving in complex, urban scenarios requires accurate prediction of the trajectories of surrounding traffic participants. Their future behavior depends on their route intentions, the road geometry, traffic rules and mutual interaction, resulting in interdependencies between their trajectories. We present a probabilistic prediction framework based on a dynamic Bayesian network, which represents the state of the complete scene including all agents and respects the aforementioned dependencies. We propose Markovian, context-dependent motion models to define the interaction-aware behavior of drivers. First, the state of the dynamic Bayesian network is estimated over time by tracking the single agents via sequential Monte Carlo inference. Second, we perform a probabilistic forward simulation of the network's estimated belief state to generate the different combinatorial scene developments. This provides the corresponding trajectories for the set of possible future scenes. Our framework can handle various road layouts and numbers of traffic participants. We evaluate the approach in online simulations and real-world scenarios. It is shown that our interaction-aware prediction outperforms interaction-unaware physics- and map-based approaches.

CV · Sep 18, 2017
Direct Pose Estimation with a Monocular Camera

Darius Burschka, Elmar Mair

We present a direct method to calculate a 6DoF pose change of a monocular camera for mobile navigation. The calculated pose is estimated up to a constant unknown scale parameter that is kept constant over the entire reconstruction process. This method allows a direct calculation of the metric position and rotation without any need to fuse the information in a probabilistic approach over longer frame sequences, as is the case in most currently used VSLAM approaches. The algorithm contributes two novel aspects to the field of monocular navigation. It allows a direct pose estimation from any two images without any a priori knowledge about the world, and it provides a quality measure for the estimated motion parameters that allows the resulting information to be fused in Kalman filters. We present the mathematical formulation of the approach together with experimental validation on real scene images.

CV · Sep 7, 2017
Monocular Navigation in Large Scale Dynamic Environments

Darius Burschka

We present a processing technique for a robust reconstruction of motion properties for single points in large-scale, dynamic environments. We assume that the acquisition camera is moving and that there are other independently moving agents in a large environment, such as road scenarios. The separation of direction and magnitude of the reconstructed motion allows for robust reconstruction of the dynamic state of the objects in situations where conventional binocular systems fail due to a small signal (disparity) from the images due to a constant detection error, and where structure-from-motion approaches fail due to unobserved motion of other agents between the camera frames. We present the mathematical framework and the sensitivity analysis for the resulting system.

CV · Nov 15, 2016
Motion Estimated-Compensated Reconstruction with Preserved-Features in Free-Breathing Cardiac MRI

Aurelien Bustin, Anne Menini, Martin A. Janich et al.

To develop an efficient motion-compensated reconstruction technique for free-breathing cardiac magnetic resonance imaging (MRI) that allows high-quality images to be reconstructed from multiple undersampled single-shot acquisitions. The proposed method is a joint image reconstruction and motion correction method consisting of several steps, including a non-rigid motion extraction and a motion-compensated reconstruction. The reconstruction includes a denoising with the Beltrami regularization, which offers an ideal compromise between feature preservation and staircasing reduction. Results were assessed in simulation, phantom and volunteer experiments. The proposed joint image reconstruction and motion correction method exhibits visible quality improvement over previous methods while reconstructing sharper edges. Moreover, when the acceleration factor increases, standard methods show blurry results while the proposed method preserves image quality. The method was applied to free-breathing single-shot cardiac MRI, successfully achieving high image quality and higher spatial resolution than conventional segmented methods, with the potential to offer high-quality delayed enhancement scans in challenging patients.