CVSep 11, 2024Code
ENACT: Entropy-based Clustering of Attention Input for Reducing the Computational Needs of Object Detection TransformersGiorgos Savathrakis, Antonis Argyros
Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy, due to its similarity between same object pixels. This is expected to reduce GPU usage during training, while maintaining reasonable accuracy. This idea is realized with an implemented module that is called ENtropy-based Attention Clustering for detection Transformers (ENACT), which serves as a plug-in to any multi-head self-attention based transformer network. Experiments on the COCO object detection dataset and three detection transformers demonstrate that the requirements on memory are reduced, while the detection accuracy is degraded only slightly. The code of the ENACT module is available at https://github.com/GSavathrakis/ENACT.
CVApr 29, 2023Code
TAPE: Temporal Attention-based Probabilistic human pose and shape EstimationNikolaos Vasilikopoulos, Nikos Kolotouros, Aggeliki Tsoli et al.
Reconstructing 3D human pose and shape from monocular videos is a well-studied but challenging problem. Common challenges include occlusions, the inherent ambiguities in the 2D to 3D mapping and the computational complexity of video processing. Existing methods ignore the ambiguities of the reconstruction and provide a single deterministic estimate for the 3D pose. In order to address these issues, we present a Temporal Attention based Probabilistic human pose and shape Estimation method (TAPE) that operates on an RGB video. More specifically, we propose to use a neural network to encode video frames to temporal features using an attention-based neural network. Given these features, we output a per-frame but temporally-informed probability distribution for the human pose using Normalizing Flows. We show that TAPE outperforms state-of-the-art methods in standard benchmarks and serves as an effective video-based prior for optimization-based human pose and shape estimation. Code is available at: https: //github.com/nikosvasilik/TAPE
CVSep 12, 2022
Graphing the Future: Activity and Next Active Object Prediction using Graph-based Activity RepresentationsVictoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros
We present a novel approach for the visual prediction of human-object interactions in videos. Rather than forecasting the human and object motion or the future hand-object contact points, we aim at predicting (a)the class of the on-going human-object interaction and (b) the class(es) of the next active object(s) (NAOs), i.e., the object(s) that will be involved in the interaction in the near future as well as the time the interaction will occur. Graph matching relies on the efficient Graph Edit distance (GED) method. The experimental evaluation of the proposed approach was conducted using two well-established video datasets that contain human-object interactions, namely the MSR Daily Activities and the CAD120. High prediction accuracy was obtained for both action prediction and NAO forecasting.
CVJul 22, 2023
Recognizing Unseen States of Unknown Objects by Leveraging Knowledge GraphsFilipos Gouidis, Konstantinos Papoutsakis, Theodore Patkos et al.
We investigate the problem of Object State Classification (OSC) as a zero-shot learning problem. Specifically, we propose the first Object-agnostic State Classification (OaSC) method that infers the state of a certain object without relying on the knowledge or the estimation of the object class. In that direction, we capitalize on Knowledge Graphs (KGs) for structuring and organizing knowledge, which, in combination with visual information, enable the inference of the states of objects in object/state pairs that have not been encountered in the method's training set. A series of experiments investigate the performance of the proposed method in various settings, against several hypotheses and in comparison with state of the art approaches for object attribute classification. The experimental results demonstrate that the knowledge of an object class is not decisive for the prediction of its state. Moreover, the proposed OaSC method outperforms existing methods in all datasets and benchmarks by a great margin.
CVJan 20
OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object CountingMichail Spanakis, Iason Oikonomidis, Antonis Argyros
Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at https://mikespanak.github.io/OCCAM_counter.
CVJan 29
Understanding Multimodal Complementarity for Single-Frame Action AnticipationManuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis et al.
Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
CVDec 2, 2025
Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki et al.
Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
AIMar 18, 2024
Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State ClassificationFilippos Gouidis, Katerina Papantoniou, Konstantinos Papoutsakis et al.
Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
CVOct 28, 2024
Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual ContextManuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez et al.
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action's temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
CVJan 7
Combining facial videos and biosignals for stress estimation during drivingParaskevi Valergaki, Vassilis C. Nicodemou, Iason Oikonomidis et al.
Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92\%, Accuracy 86.7\%), with EMOCA-gaze fusion also competitive (AUROC 91.8\%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.
CVMay 21, 2024
Anticipating Object State Changes in Long Procedural VideosVictoria Manousaki, Konstantinos Bacharidis, Filippos Gouidis et al.
In this work, we introduce (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) new curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem. Solutions to this new task have important implications in vision-based scene understanding, automated monitoring systems, and action planning. The proposed novel framework predicts object state changes that will occur in the near future due to yet unseen human actions by integrating learned visual features that represent recent visual information with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce an extension noted Ego4D-OSCA that provides new curated annotation data for the object state change anticipation task (OSCA). An extensive experimental evaluation is presented demonstrating the proposed method's efficacy in predicting object state changes in dynamic scenarios. The performance of the proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems and lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.
CVDec 15, 2021
Detecting Object States vs Detecting Objects: A New Dataset and a Quantitative Experimental StudyFilippos Gouidis, Theodore Patkos, Antonis Argyros et al.
The detection of object states in images (State Detection - SD) is a problem of both theoretical and practical importance and it is tightly interwoven with other important computer vision problems, such as action recognition and affordance detection. It is also highly relevant to any entity that needs to reason and act in dynamic domains, such as robotic systems and intelligent agents. Despite its importance, up to now, the research on this problem has been limited. In this paper, we attempt a systematic study of the SD problem. First, we introduce the Object State Detection Dataset (OSDD), a new publicly available dataset consisting of more than 19,000 annotations for 18 object categories and 9 state classes. Second, using a standard deep learning framework used for Object Detection (OD), we conduct a number of appropriately designed experiments, towards an in-depth study of the behavior of the SD problem. This study enables the setup of a baseline on the performance of SD, as well as its relative performance in comparison to OD, in a variety of scenarios. Overall, the experimental outcomes confirm that SD is harder than OD and that tailored SD methods need to be developed for addressing effectively this significant problem.
CVDec 9, 2021
PE-former: Pose Estimation TransformerPaschalis Panteleris, Antonis Argyros
Vision transformer architectures have been demonstrated to work very effectively for image classification tasks. Efforts to solve more challenging vision tasks with transformers rely on convolutional backbones for feature extraction. In this paper we investigate the use of a pure transformer architecture (i.e., one with no CNN backbone) for the problem of 2D body pose estimation. We evaluate two ViT architectures on the COCO dataset. We demonstrate that using an encoder-decoder transformer architecture yields state of the art results on this estimation problem.
NEJul 8, 2021
Even Faster SNN Simulation with Lazy+Event-driven Plasticity and Shared AtomicsDennis Bautembach, Iason Oikonomidis, Antonis Argyros
We present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators. The first one targets spike timing dependent plasticity (STDP). It combines lazy- with event-driven plasticity and efficiently facilitates the computation of pre- and post-synaptic spikes using bitfields and integer intrinsics. It offers higher bandwidth than event-driven plasticity alone and achieves a 1.5x-2x speedup over our closest competitor. The second optimization targets spike delivery. We partition our graph representation in a way that bounds the number of neurons that need be updated at any given time which allows us to perform said update in shared memory instead of global memory. This is 2x-2.5x faster than our closest competitor. Both optimizations represent the final evolutionary stages of years of iteration on STDP and spike delivery inside "Spice" (/spaIk/), our state of the art SNN simulator. The proposed optimizations are not exclusive to our graph representation or pipeline but are applicable to a multitude of simulator designs. We evaluate our performance on three well-established models and compare ourselves against three other state of the art simulators.
CVMar 27, 2021
H-GAN: the power of GANs in your HandsSergiu Oprea, Giorgos Karvounas, Pablo Martinez-Gonzalez et al.
We present HandGAN (H-GAN), a cycle-consistent adversarial learning approach implementing multi-scale perceptual discriminators. It is designed to translate synthetic images of hands to the real domain. Synthetic hands provide complete ground-truth annotations, yet they are not representative of the target distribution of real-world data. We strive to provide the perfect blend of a realistic hand appearance with synthetic annotations. Relying on image-to-image translation, we improve the appearance of synthetic hands to approximate the statistical distribution underlying a collection of real images of hands. H-GAN tackles not only the cross-domain tone mapping but also structural differences in localized areas such as shading discontinuities. Results are evaluated on a qualitative and quantitative basis improving previous works. Furthermore, we relied on the hand classification task to claim our generated hands are statistically similar to the real domain of hands.
NEFeb 9, 2021
Multi-GPU SNN Simulation with Static Load BalancingDennis Bautembach, Iason Oikonomidis, Antonis Argyros
We present a SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs. This is made possible by 1) a novel, cache-aware spike transmission algorithm 2) a model parallel multi-GPU distribution scheme and 3) a static, yet very effective load balancing strategy. The simulator further features an easy to use API and the ability to create custom models. We compare the proposed simulator against two state of the art ones on a series of benchmarks using three well-established models. We find that our simulator is faster, consumes less memory, and scales linearly with the number of GPUs.
CVApr 10, 2020
A Review on Deep Learning Techniques for Video PredictionSergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia et al.
The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions.
CVDec 26, 2019
A Review on Intelligent Object Perception Methods Combining Knowledge-based Reasoning and Machine LearningFilippos Gouidis, Alexandros Vassiliades, Theodore Patkos et al.
Object perception is a fundamental sub-field of Computer Vision, covering a multitude of individual areas and having contributed high-impact results. While Machine Learning has been traditionally applied to address related problems, recent works also seek ways to integrate knowledge engineering in order to expand the level of intelligence of the visual interpretation of objects, their properties and their relations with their environment. In this paper, we attempt a systematic investigation of how knowledge-based methods contribute to diverse object perception tasks. We review the latest achievements and identify prominent research directions.
NEDec 16, 2019
Faster and Simpler SNN Simulation with Work QueuesDennis Bautembach, Iason Oikonomidis, Nikolaos Kyriazis et al.
We present a clock-driven Spiking Neural Network simulator which is up to 3x faster than the state of the art while, at the same time, being more general and requiring less programming effort on both the user's and maintainer's side. This is made possible by designing our pipeline around "work queues" which act as interfaces between stages and greatly reduce implementation complexity. We evaluate our work using three well-established SNN models on a series of benchmarks.
CVOct 14, 2019
ReActNet: Temporal Localization of Repetitive Activities in Real-World VideosGiorgos Karvounas, Iason Oikonomidis, Antonis Argyros
We address the problem of temporal localization of repetitive activities in a video, i.e., the problem of identifying all segments of a video that contain some sort of repetitive or periodic motion. To do so, the proposed method represents a video by the matrix of pairwise frame distances. These distances are computed on frame representations obtained with a convolutional neural network. On top of this representation, we design, implement and evaluate ReActNet, a lightweight convolutional neural network that classifies a given frame as belonging (or not) to a repetitive video segment. An important property of the employed representation is that it can handle repetitive segments of arbitrary number and duration. Furthermore, the proposed training process requires a relatively small number of annotated videos. Our method raises several of the limiting assumptions of existing approaches regarding the contents of the video and the types of the observed repetitive activities. Experimental results on recent, publicly available datasets validate our design choices, verify the generalization potential of ReActNet and demonstrate its superior performance in comparison to the current state of the art.
CVDec 19, 2018
Accurate Hand Keypoint Localization on Mobile DevicesFilippos Gouidis, Paschalis Panteleris, Iason Oikonomidis et al.
We present a novel approach for 2D hand keypoint localization from regular color input. The proposed approach relies on an appropriately designed Convolutional Neural Network (CNN) that computes a set of heatmaps, one per hand keypoint of interest. Extensive experiments with the proposed method compare it against state of the art approaches and demonstrate its accuracy and computational performance on standard, publicly available datasets. The obtained results demonstrate that the proposed method matches or outperforms the competing methods in accuracy, but clearly outperforms them in computational efficiency, making it a suitable building block for applications that require hand keypoint estimation on mobile devices.
CVDec 6, 2018
Learning to Infer the Depth Map of a Hand from its Color ImageVassilis C. Nicodemou, Iason Oikonomidis, Georgios Tzimiropoulos et al.
We propose the first approach to the problem of inferring the depth map of a human hand based on a single RGB image. We achieve this with a Convolutional Neural Network (CNN) that employs a stacked hourglass model as its main building block. Intermediate supervision is used in several outputs of the proposed architecture in a staged approach. To aid the process of training and inference, hand segmentation masks are also estimated in such an intermediate supervision step, and used to guide the subsequent depth estimation process. In order to train and evaluate the proposed method we compile and make publicly available HandRGBD, a new dataset of 20,601 views of hands, each consisting of an RGB image and an aligned depth map. Based on HandRGBD, we explore variants of the proposed approach in an ablative study and determine the best performing one. The results of an extensive experimental evaluation demonstrate that hand depth estimation from a single RGB frame can be achieved with an accuracy of 22mm, which is comparable to the accuracy achieved by contemporary low-cost depth cameras. Such a 3D reconstruction of hands based on RGB information is valuable as a final result on its own right, but also as an input to several other hand analysis and perception algorithms that require depth input. Essentially, in such a context, the proposed approach bridges the gap between RGB and RGBD, by making all existing RGBD-based methods applicable to RGB input.
CVOct 25, 2018
HANDS18: Methods, Techniques and Applications for Hand ObservationIason Oikonomidis, Guillermo Garcia-Hernando, Angela Yao et al.
This report outlines the proceedings of the Fourth International Workshop on Observing and Understanding Hands in Action (HANDS 2018). The fourth instantiation of this workshop attracted significant interest from both academia and the industry. The program of the workshop included regular papers that are published as the workshop's proceedings, extended abstracts, invited posters, and invited talks. Topics of the submitted works and invited talks and posters included novel methods for hand pose estimation from RGB, depth, or skeletal data, datasets for special cases and real-world applications, and techniques for hand motion re-targeting and hand gesture recognition. The invited speakers are leaders in their respective areas of specialization, coming from both industry and academia. The main conclusions that can be drawn are the turn of the community towards RGB data and the maturation of some methods and techniques, which in turn has led to increasing interest for real-world applications.
DCApr 30, 2018
On the Feasibility of Real-Time 3D Hand Tracking using Edge GPGPU AccelerationAmmar Qammaz, Sokol Kosta, Nikolaos Kyriazis et al.
This paper presents the case study of a non-intrusive porting of a monolithic C++ library for real-time 3D hand tracking, to the domain of edge-based computation. Towards a proof of concept, the case study considers a pair of workstations, a computationally powerful and a computationally weak one. By wrapping the C++ library in Java container and by capitalizing on a Java-based offloading infrastructure that supports both CPU and GPGPU computations, we are able to establish automatically the required server-client workflow that best addresses the resource allocation problem in the effort to execute from the weak workstation. As a result, the weak workstation can perform well at the task, despite lacking the sufficient hardware to do the required computations locally. This is achieved by offloading computations which rely on GPGPU, to the powerful workstation, across the network that connects them. We show the edge-based computation challenges associated with the information flow of the ported algorithm, demonstrate how we cope with them, and identify what needs to be improved for achieving even better performance.
CVDec 11, 2017
Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future GoalsShanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger et al.
In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation from depth images? And, what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate the top 10 state-of-the-art methods on three tasks: single frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, view point and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the view point range of [70, 120] degrees, but it is far from being solved for extreme view points; (2) 3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3) Discriminative methods still generalize poorly to unseen hand shapes; (4) While joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.
CVDec 11, 2017
Using a single RGB frame for real time 3D hand pose estimation in the wildPaschalis Panteleris, Iason Oikonomidis, Antonis Argyros
We present a method for the real-time estimation of the full 3D pose of one or more human hands using a single commodity RGB camera. Recent work in the area has displayed impressive progress using RGBD input. However, since the introduction of RGBD sensors, there has been little progress for the case of monocular color input. We capitalize on the latest advancements of deep learning, combining them with the power of generative hand pose estimation techniques to achieve real-time monocular 3D hand pose estimation in unrestricted scenarios. More specifically, given an RGB image and the relevant camera calibration information, we employ a state-of-the-art detector to localize hands. Given a crop of a hand in the image, we run the pretrained network of OpenPose for hands to estimate the 2D location of hand joints. Finally, non-linear least-squares minimization fits a 3D model of the hand to the estimated 2D joint positions, recovering the 3D hand pose. Extensive experimental results provide comparison to the state of the art as well as qualitative assessment of the method in the wild.
CVMay 15, 2017
Back to RGB: 3D tracking of hands and hand-object interactions based on short-baseline stereoPaschalis Panteleris, Antonis Argyros
We present a novel solution to the problem of 3D tracking of the articulated motion of human hand(s), possibly in interaction with other objects. The vast majority of contemporary relevant work capitalizes on depth information provided by RGBD cameras. In this work, we show that accurate and efficient 3D hand tracking is possible, even for the case of RGB stereo. A straightforward approach for solving the problem based on such input would be to first recover depth and then apply a state of the art depth-based 3D hand tracking method. Unfortunately, this does not work well in practice because the stereo-based, dense 3D reconstruction of hands is far less accurate than the one obtained by RGBD cameras. Our approach bypasses 3D reconstruction and follows a completely different route: 3D hand tracking is formulated as an optimization problem whose solution is the hand configuration that maximizes the color consistency between the two views of the hand. We demonstrate the applicability of our method for real time tracking of a single hand, of a hand manipulating an object and of two interacting hands. The method has been evaluated quantitatively on standard datasets and in comparison to relevant, state of the art RGBD-based approaches. The obtained results demonstrate that the proposed stereo-based method performs equally well to its RGBD-based competitors, and in some cases, it even outperforms them.