CVNov 24, 2022Code
Explainable Model-Agnostic Similarity and Confidence in Face VerificationMartin Knoche, Torben Teepe, Stefan Hörmann et al.
Recently, face recognition systems have demonstrated remarkable performances and thus gained a vital role in our daily life. They already surpass human face verification accountability in many scenarios. However, they lack explanations for their predictions. Compared to human operators, typical face recognition network system generate only binary decisions without further explanation and insights into those decisions. This work focuses on explanations for face recognition systems, vital for developers and operators. First, we introduce a confidence score for those systems based on facial feature distances between two input images and the distribution of distances across a dataset. Secondly, we establish a novel visualization approach to obtain more meaningful predictions from a face recognition system, which maps the distance deviation based on a systematic occlusion of images. The result is blended with the original images and highlights similar and dissimilar facial regions. Lastly, we calculate confidence scores and explanation maps for several state-of-the-art face verification datasets and release the results on a web platform. We optimize the platform for a user-friendly interaction and hope to further improve the understanding of machine learning decisions. The source code is available on GitHub, and the web platform is publicly available at http://explainable-face-verification.ey.r.appspot.com.
CVSep 6, 2023Code
Do We Still Need Non-Maximum Suppression? Accurate Confidence Estimates and Implicit Duplication Modeling with IoU-Aware CalibrationJohannes Gilg, Torben Teepe, Fabian Herzog et al.
Object detectors are at the heart of many semi- and fully autonomous decision systems and are poised to become even more indispensable. They are, however, still lacking in accessibility and can sometimes produce unreliable predictions. Especially concerning in this regard are the -- essentially hand-crafted -- non-maximum suppression algorithms that lead to an obfuscated prediction process and biased confidence estimates. We show that we can eliminate classic NMS-style post-processing by using IoU-aware calibration. IoU-aware calibration is a conditional Beta calibration; this makes it parallelizable with no hyper-parameters. Instead of arbitrary cutoffs or discounts, it implicitly accounts for the likelihood of each detection being a duplicate and adjusts the confidence score accordingly, resulting in empirically based precision estimates for each detection. Our extensive experiments on diverse detection architectures show that the proposed IoU-aware calibration can successfully model duplicate detections and improve calibration. Compared to the standard sequential NMS and calibration approach, our joint modeling can deliver performance gains over the best NMS-based alternative while producing consistently better-calibrated confidence predictions with less complexity. The \hyperlink{https://github.com/Blueblue4/IoU-AwareCalibration}{code} for all our experiments is publicly available.
CVJun 8, 2022Code
Wavelet Regularization Benefits Adversarial TrainingJun Yan, Huilin Yin, Xiaoyang Deng et al.
Adversarial training methods are state-of-the-art (SOTA) empirical defense methods against adversarial examples. Many regularization methods have been proven to be effective with the combination of adversarial training. Nevertheless, such regularization methods are implemented in the time domain. Since adversarial vulnerability can be regarded as a high-frequency phenomenon, it is essential to regulate the adversarially-trained neural network models in the frequency domain. Faced with these challenges, we make a theoretical analysis on the regularization property of wavelets which can enhance adversarial training. We propose a wavelet regularization method based on the Haar wavelet decomposition which is named Wavelet Average Pooling. This wavelet regularization module is integrated into the wide residual neural network so that a new WideWaveletResNet model is formed. On the datasets of CIFAR-10 and CIFAR-100, our proposed Adversarial Wavelet Training method realizes considerable robustness under different types of attacks. It verifies the assumption that our wavelet regularization method can enhance adversarial robustness especially in the deep wide neural networks. The visualization experiments of the Frequency Principle (F-Principle) and interpretability are implemented to show the effectiveness of our method. A detailed comparison based on different wavelet base functions is presented. The code is available at the repository: \url{https://github.com/momo1986/AdversarialWaveletTraining}.
CVApr 16, 2022
Towards a Deeper Understanding of Skeleton-based Gait RecognitionTorben Teepe, Johannes Gilg, Fabian Herzog et al.
Gait recognition is a promising biometric with unique properties for identifying individuals from a long distance by their walking patterns. In recent years, most gait recognition methods used the person's silhouette to extract the gait features. However, silhouette images can lose fine-grained spatial information, suffer from (self) occlusion, and be challenging to obtain in real-world scenarios. Furthermore, these silhouettes also contain other visual clues that are not actual gait features and can be used for identification, but also to fool the system. Model-based methods do not suffer from these problems and are able to represent the temporal motion of body joints, which are actual gait features. The advances in human pose estimation started a new era for model-based gait recognition with skeleton-based gait recognition. In this work, we propose an approach based on Graph Convolutional Networks (GCNs) that combines higher-order inputs, and residual networks to an efficient architecture for gait recognition. Extensive experiments on the two popular gait datasets, CASIA-B and OUMVLP-Pose, show a massive improvement (3x) of the state-of-the-art (SotA) on the largest gait dataset OUMVLP-Pose and strong temporal modeling capabilities. Finally, we visualize our method to understand skeleton-based gait recognition better and to show that we model real gait features.
CVOct 20, 2023
EarlyBird: Early-Fusion for Multi-View Tracking in the Bird's Eye ViewTorben Teepe, Philipp Wolters, Johannes Gilg et al.
Multi-view aggregation promises to overcome the occlusion and missed detection challenge in multi-object detection and tracking. Recent approaches in multi-view detection and 3D object detection made a huge performance leap by projecting all views to the ground plane and performing the detection in the Bird's Eye View (BEV). In this paper, we investigate if tracking in the BEV can also bring the next performance breakthrough in Multi-Target Multi-Camera (MTMC) tracking. Most current approaches in multi-view tracking perform the detection and tracking task in each view and use graph-based approaches to perform the association of the pedestrian across each view. This spatial association is already solved by detecting each pedestrian once in the BEV, leaving only the problem of temporal association. For the temporal association, we show how to learn strong Re-Identification (re-ID) features for each detection. The results show that early-fusion in the BEV achieves high accuracy for both detection and tracking. EarlyBird outperforms the state-of-the-art methods and improves the current state-of-the-art on Wildtrack by +4.6 MOTA and +5.6 IDF1.
CVAug 30, 2022
Synthehicle: Multi-Vehicle Multi-Camera Tracking in Virtual CitiesFabian Herzog, Junpeng Chen, Torben Teepe et al.
Smart City applications such as intelligent traffic routing or accident prevention rely on computer vision methods for exact vehicle localization and tracking. Due to the scarcity of accurately labeled data, detecting and tracking vehicles in 3D from multiple cameras proves challenging to explore. We present a massive synthetic dataset for multiple vehicle tracking and segmentation in multiple overlapping and non-overlapping camera views. Unlike existing datasets, which only provide tracking ground truth for 2D bounding boxes, our dataset additionally contains perfect labels for 3D bounding boxes in camera- and world coordinates, depth estimation, and instance, semantic and panoptic segmentation. The dataset consists of 17 hours of labeled video material, recorded from 340 cameras in 64 diverse day, rain, dawn, and night scenes, making it the most extensive dataset for multi-target multi-camera tracking so far. We provide baselines for detection, vehicle re-identification, and single- and multi-camera tracking. Code and data are publicly available.
CVJul 14, 2022
Octuplet Loss: Make Face Recognition Robust to Image ResolutionMartin Knoche, Mohamed Elkadeem, Stefan Hörmann et al.
Image resolution, or in general, image quality, plays an essential role in the performance of today's face recognition systems. To address this problem, we propose a novel combination of the popular triplet loss to improve robustness against image resolution via fine-tuning of existing face recognition models. With octuplet loss, we leverage the relationship between high-resolution images and their synthetically down-sampled variants jointly with their identity labels. Fine-tuning several state-of-the-art approaches with our method proves that we can significantly boost performance for cross-resolution (high-to-low resolution) face verification on various datasets without meaningfully exacerbating the performance on high-to-high resolution images. Our method applied on the FaceTransformer network achieves 95.12% face verification accuracy on the challenging XQLFW dataset while reaching 99.73% on the LFW database. Moreover, the low-to-low face verification accuracy benefits from our method. We release our code to allow seamless integration of the octuplet loss into existing frameworks.
CVApr 17, 2023
Tackling Face Verification Edge Cases: In-Depth Analysis and Human-Machine Fusion ApproachMartin Knoche, Gerhard Rigoll
Nowadays, face recognition systems surpass human performance on several datasets. However, there are still edge cases that the machine can't correctly classify. This paper investigates the effect of a combination of machine and human operators in the face verification task. First, we look closer at the edge cases for several state-of-the-art models to discover common datasets' challenging settings. Then, we conduct a study with 60 participants on these selected tasks with humans and provide an extensive analysis. Finally, we demonstrate that combining machine and human decisions can further improve the performance of state-of-the-art face verification systems on various benchmark datasets. Code and data are publicly available on GitHub.
CVMay 27, 2022
Face Morphing: Fooling a Face Recognition System Is Simple!Stefan Hörmann, Tianlin Kong, Torben Teepe et al.
State-of-the-art face recognition (FR) approaches have shown remarkable results in predicting whether two faces belong to the same identity, yielding accuracies between 92% and 100% depending on the difficulty of the protocol. However, the accuracy drops substantially when exposed to morphed faces, specifically generated to look similar to two identities. To generate morphed faces, we integrate a simple pretrained FR model into a generative adversarial network (GAN) and modify several loss functions for face morphing. In contrast to previous works, our approach and analyses are not limited to pairs of frontal faces with the same ethnicity and gender. Our qualitative and quantitative results affirm that our approach achieves a seamless change between two faces even in unconstrained scenarios. Despite using features from a simpler FR model for face morphing, we demonstrate that even recent FR systems struggle to distinguish the morphed face from both identities obtaining an accuracy of only 55-70%. Besides, we provide further insights into how knowing the FR system makes it particularly vulnerable to face morphing attacks.
CVMar 12, 2024Code
Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D PerceptionPhilipp Wolters, Johannes Gilg, Torben Teepe et al.
Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense BEV (Bird's Eye View)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.
CVNov 29, 2024Code
SpaRC: Sparse Radar-Camera Fusion for 3D Object DetectionPhilipp Wolters, Johannes Gilg, Torben Teepe et al.
In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
CVMar 19, 2024Code
Lifting Multi-View Detection and Tracking to the Bird's Eye ViewTorben Teepe, Philipp Wolters, Johannes Gilg et al.
Taking advantage of multi-view aggregation presents a promising solution to tackle challenges such as occlusion and missed detection in multi-object tracking and detection. Recent advancements in multi-view detection and 3D object recognition have significantly improved performance by strategically projecting all views onto the ground plane and conducting detection analysis from a Bird's Eye View. In this paper, we compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation. Additionally, we present an architecture that aggregates the features of multiple times steps to learn robust detection and combines appearance- and motion-based cues for tracking. Most current tracking approaches either focus on pedestrians or vehicles. In our work, we combine both branches and add new challenges to multi-view detection with cross-scene setups. Our method generalizes to three public datasets across two domains: (1) pedestrian: Wildtrack and MultiviewX, and (2) roadside perception: Synthehicle, achieving state-of-the-art performance in detection and tracking. https://github.com/tteepe/TrackTacular
ROJan 23, 2025
Knowledge-Informed Multi-Agent Trajectory Prediction at Signalized Intersections for Infrastructure-to-EverythingHuilin Yin, Yangwenhui Xu, Jiaxiang Li et al.
Multi-agent trajectory prediction at signalized intersections is crucial for developing efficient intelligent transportation systems and safe autonomous driving systems. Due to the complexity of intersection scenarios and the limitations of single-vehicle perception, the performance of vehicle-centric prediction methods has reached a plateau. In this paper, we introduce an Infrastructure-to-Everything (I2X) collaborative prediction scheme. In this scheme, roadside units (RSUs) independently forecast the future trajectories of all vehicles and transmit these predictions unidirectionally to subscribing vehicles. Building on this scheme, we propose I2XTraj, a dedicated infrastructure-based trajectory prediction model. I2XTraj leverages real-time traffic signal states, prior maneuver strategy knowledge, and multi-agent interactions to generate accurate, joint multi-modal trajectory prediction. First, a continuous signal-informed mechanism is proposed to adaptively process real-time traffic signals to guide trajectory proposal generation under varied intersection configurations. Second, a driving strategy awareness mechanism estimates the joint distribution of maneuver strategies by integrating spatial priors of intersection areas with dynamic vehicle states, enabling coverage of the full set of feasible maneuvers. Third, a spatial-temporal-mode attention network models multi-agent interactions to refine and adjust joint trajectory outputs.Finally, I2XTraj is evaluated on two real-world datasets of signalized intersections, the V2X-Seq and the SinD drone dataset. In both single-infrastructure and online collaborative scenarios, our model outperforms state-of-the-art methods by over 30\% on V2X-Seq and 15\% on SinD, demonstrating strong generalizability and robustness.
ROOct 26, 2025
RoGER-SLAM: A Robust Gaussian Splatting SLAM System for Noisy and Low-light Environment ResilienceHuilin Yin, Zhaolin Yang, Linchuan Zhang et al.
The reliability of Simultaneous Localization and Mapping (SLAM) is severely constrained in environments where visual inputs suffer from noise and low illumination. Although recent 3D Gaussian Splatting (3DGS) based SLAM frameworks achieve high-fidelity mapping under clean conditions, they remain vulnerable to compounded degradations that degrade mapping and tracking performance. A key observation underlying our work is that the original 3DGS rendering pipeline inherently behaves as an implicit low-pass filter, attenuating high-frequency noise but also risking over-smoothing. Building on this insight, we propose RoGER-SLAM, a robust 3DGS SLAM system tailored for noise and low-light resilience. The framework integrates three innovations: a Structure-Preserving Robust Fusion (SP-RoFusion) mechanism that couples rendered appearance, depth, and edge cues; an adaptive tracking objective with residual balancing regularization; and a Contrastive Language-Image Pretraining (CLIP)-based enhancement module, selectively activated under compounded degradations to restore semantic and structural fidelity. Comprehensive experiments on Replica, TUM, and real-world sequences show that RoGER-SLAM consistently improves trajectory accuracy and reconstruction quality compared with other 3DGS-SLAM systems, especially under adverse imaging conditions.
CVAug 14, 2025
SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous DrivingPhilipp Wolters, Johannes Gilg, Torben Teepe et al.
End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment, and doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/
CVDec 3, 2021
The Box Size Confidence Bias Harms Your Object DetectorJohannes Gilg, Torben Teepe, Fabian Herzog et al.
Countless applications depend on accurate predictions with reliable confidence estimates from modern object detectors. It is well known, however, that neural networks including object detectors produce miscalibrated confidence estimates. Recent work even suggests that detectors' confidence predictions are biased with respect to object size and position, but it is still unclear how this bias relates to the performance of the affected object detectors. We formally prove that the conditional confidence bias is harming the expected performance of object detectors and empirically validate these findings. Specifically, we demonstrate how to modify the histogram binning calibration to not only avoid performance impairment but also improve performance through conditional confidence calibration. We further find that the confidence bias is also present in detections generated on the training data of the detector, which we leverage to perform our de-biasing without using additional data. Moreover, Test Time Augmentation magnifies this bias, which results in even larger performance gains from our calibration method. Finally, we validate our findings on a diverse set of object detection architectures and show improvements of up to 0.6 mAP and 0.8 mAP50 without extra data or training.
CVAug 23, 2021
Cross-Quality LFW: A Database for Analyzing Cross-Resolution Image Face Recognition in Unconstrained EnvironmentsMartin Knoche, Stefan Hörmann, Gerhard Rigoll
Real-world face recognition applications often deal with suboptimal image quality or resolution due to different capturing conditions such as various subject-to-camera distances, poor camera settings, or motion blur. This characteristic has an unignorable effect on performance. Recent cross-resolution face recognition approaches used simple, arbitrary, and unrealistic down- and up-scaling techniques to measure robustness against real-world edge-cases in image quality. Thus, we propose a new standardized benchmark dataset and evaluation protocol derived from the famous Labeled Faces in the Wild (LFW). In contrast to previous derivatives, which focus on pose, age, similarity, and adversarial attacks, our Cross-Quality Labeled Faces in the Wild (XQLFW) maximizes the quality difference. It contains only more realistic synthetically degraded images when necessary. Our proposed dataset is then used to further investigate the influence of image quality on several state-of-the-art approaches. With XQLFW, we show that these models perform differently in cross-quality cases, and hence, the generalizing capability is not accurately predicted by their performance on LFW. Additionally, we report baseline accuracy with recent deep learning models explicitly trained for cross-resolution applications and evaluate the susceptibility to image quality. To encourage further research in cross-resolution face recognition and incite the assessment of image quality robustness, we publish the database and code for evaluation.
CVJul 8, 2021
Susceptibility to Image Resolution in Face Recognition and Trainings StrategiesMartin Knoche, Stefan Hörmann, Gerhard Rigoll
Face recognition approaches often rely on equal image resolution for verifying faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on face verification performance with a state-of-the-art face recognition model. For images synthetically reduced to $5\,\times\,5$ px resolution, the verification performance drops from $99.23\%$ increasingly down to almost $55\%$. Especially for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforwardly with $50\%$ low-resolution images directly within each batch. 2) Train a siamese-network structure and add a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution pair. Thus, we extend the aforementioned methods by training them with multiple image resolutions at once. The performances for particular testing image resolutions are slightly worse, but the advantage is that this model can be applied to arbitrary resolution images and achieves an overall better performance ($97.72\%$ compared to $96.86\%$). Due to the lack of a benchmark for arbitrary resolution images for the cross-resolution and equal-resolution task, we propose an evaluation protocol for five well-known datasets, focusing on high, mid, and low-resolution images.
CVJun 11, 2021
Attention-based Partial Face RecognitionStefan Hörmann, Zeyuan Zhang, Martin Knoche et al.
Photos of faces captured in unconstrained environments, such as large crowds, still constitute challenges for current face recognition approaches as often faces are occluded by objects or people in the foreground. However, few studies have addressed the task of recognizing partial faces. In this paper, we propose a novel approach to partial face recognition capable of recognizing faces with different occluded areas. We achieve this by combining attentional pooling of a ResNet's intermediate feature maps with a separate aggregation module. We further adapt common losses to partial faces in order to ensure that the attention maps are diverse and handle occluded parts. Our thorough analysis demonstrates that we outperform all baselines under multiple benchmark protocols, including naturally and synthetically occluded partial faces. This suggests that our method successfully focuses on the relevant parts of the occluded face.
CVJun 7, 2021
How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the WildOkan Köpüklü, Maja Taseska, Gerhard Rigoll
Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker. Each stage of this pipeline plays an important role for the final performance of the created architecture. Based on a series of controlled experiments, this work presents several practical guidelines for audio-visual active speaker detection. Correspondingly, we present a new architecture called ASDNet, which achieves a new state-of-the-art on the AVA-ActiveSpeaker dataset with a mAP of 93.5% outperforming the second best with a large margin of 4.7%. Our code and pretrained models are publicly available.
CVJan 27, 2021
GaitGraph: Graph Convolutional Network for Skeleton-Based Gait RecognitionTorben Teepe, Ali Khan, Johannes Gilg et al.
Gait recognition is a promising video-based biometric for identifying individual walking patterns from a long distance. At present, most gait recognition methods use silhouette images to represent a person in each frame. However, silhouette images can lose fine-grained spatial information, and most papers do not regard how to obtain these silhouettes in complex scenes. Furthermore, silhouette images contain not only gait features but also other visual clues that can be recognized. Hence these approaches can not be considered as strict gait recognition. We leverage recent advances in human pose estimation to estimate robust skeleton poses directly from RGB images to bring back model-based gait recognition with a cleaner representation of gait. Thus, we propose GaitGraph that combines skeleton poses with Graph Convolutional Network (GCN) to obtain a modern model-based approach for gait recognition. The main advantages are a cleaner, more elegant extraction of the gait features and the ability to incorporate powerful spatio-temporal modeling using GCN. Experiments on the popular CASIA-B gait dataset show that our method archives state-of-the-art performance in model-based gait recognition. The code and models are publicly available.
CVJan 26, 2021
Lightweight Multi-Branch Network for Person Re-IdentificationFabian Herzog, Xunbo Ji, Torben Teepe et al.
Person Re-Identification aims to retrieve person identities from images captured by multiple cameras or the same cameras in different time instances and locations. Because of its importance in many vision applications from surveillance to human-machine interaction, person re-identification methods need to be reliable and fast. While more and more deep architectures are proposed for increasing performance, those methods also increase overall model complexity. This paper proposes a lightweight network that combines global, part-based, and channel features in a unified multi-branch architecture that builds on the resource-efficient OSNet backbone. Using a well-founded combination of training techniques and design choices, our final model achieves state-of-the-art results on CUHK03 labeled, CUHK03 detected, and Market-1501 with 85.1% mAP / 87.2% rank1, 82.4% mAP / 84.9% rank1, and 91.5% mAP / 96.3% rank1, respectively.
ASOct 15, 2020
Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-ConvolutionsLudwig Kürzinger, Nicolas Lindae, Palle Klewitz et al.
Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate the human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. For this, we propose Lightweight Sinc-Convolutions (LSC) that integrate Sinc-convolutions with depthwise convolutions as a low-parameter machine-learnable feature extraction for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in time-domain. We also discuss filter-level improvements, such as using log-compression as activation function. Our model achieves a word error rate of 10.7% on the TEDlium v2 test dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9%, but only has 21% of its model size.
CVSep 30, 2020
Driver Anomaly Detection: A Dataset and Contrastive Learning ApproachOkan Köpüklü, Jiapeng Zheng, Hang Xu et al.
Distracted drivers are more likely to fail to anticipate hazards, which result in car accidents. Therefore, detecting anomalies in drivers' actions (i.e., any action deviating from normal driving) contains the utmost importance to reduce driver-related accidents. However, there are unbounded many anomalous actions that a driver can do while driving, which leads to an 'open set recognition' problem. Accordingly, instead of recognizing a set of anomalous actions that are commonly defined by previous dataset providers, in this work, we propose a contrastive learning approach to learn a metric to differentiate normal driving from anomalous driving. For this task, we introduce a new video-based benchmark, the Driver Anomaly Detection (DAD) dataset, which contains normal driving videos together with a set of anomalous actions in its training set. In the test set of the DAD dataset, there are unseen anomalous actions that still need to be winnowed out from normal driving. Our method reaches 0.9673 AUC on the test set, demonstrating the effectiveness of the contrastive learning approach on the anomaly detection task. Our dataset, codes and pre-trained models are publicly available.
CVSep 30, 2020
Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video ProcessingOkan Köpüklü, Stefan Hörmann, Fabian Herzog et al.
Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks due to their supremacy in extracting spatiotemporal features within video frames. There have been many successful 3D-CNN architectures surpassing the state-of-the-art results successively. However, nearly all of them are designed to operate offline creating several serious handicaps during online operation. Firstly, conventional 3D-CNNs are not dynamic since their output features represent the complete input clip instead of the most recent frame in the clip. Secondly, they are not temporal resolution-preserving due to their inherent temporal downsampling. Lastly, 3D-CNNs are constrained to be used with fixed temporal input size limiting their flexibility. In order to address these drawbacks, we propose dissected 3D-CNNs, where the intermediate volumes of the network are dissected and propagated over depth (time) dimension for future calculations, substantially reducing the number of computations at online operation. For action classification, the dissected version of ResNet models performs 77-90% fewer computations at online operation while achieving ~5% better classification accuracy on the Kinetics-600 dataset than conventional 3D-ResNet models. Moreover, the advantages of dissected 3D-CNNs are demonstrated by deploying our approach onto several vision tasks, which consistently improved the performance.
ASJul 25, 2020
MP3 Compression To Diminish Adversarial Noise in End-to-End Speech RecognitionIustina Andronic, Ludwig Kürzinger, Edgar Ricardo Chavez Rosas et al.
Audio Adversarial Examples (AAE) represent specially created inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. Our method is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on uncompressed, as well as MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, giving further evidence that MP3 encoding is effective in diminishing only AN.
ASJul 21, 2020
Audio Adversarial Examples for Robust Hybrid CTC/Attention Speech RecognitionLudwig Kürzinger, Edgar Ricardo Chavez Rosas, Lujun Li et al.
Recent advances in Automatic Speech Recognition (ASR) demonstrated how end-to-end systems are able to achieve state-of-the-art performance. There is a trend towards deeper neural networks, however those ASR models are also more complex and prone against specially crafted noisy data. Those Audio Adversarial Examples (AAE) were previously demonstrated on ASR systems that use Connectionist Temporal Classification (CTC), as well as attention-based encoder-decoder architectures. Following the idea of the hybrid CTC/attention ASR system, this work proposes algorithms to generate AAEs to combine both approaches into a joint CTC-attention gradient method. Evaluation is performed using a hybrid CTC/attention end-to-end ASR model on two reference sentences as case study, as well as the TEDlium v2 speech recognition task. We then demonstrate the application of this algorithm for adversarial training to obtain a more robust ASR model.
ASJun 15, 2020
Regularized Forward-Backward Decoder for Attention ModelsTobias Watzel, Ludwig Kürzinger, Lujun Li et al.
Nowadays, attention models are one of the popular candidates for speech recognition. So far, many studies mainly focus on the encoder structure or the attention module to enhance the performance of these models. However, mostly ignore the decoder. In this paper, we propose a novel regularization technique incorporating a second decoder during the training phase. This decoder is optimized on time-reversed target labels beforehand and supports the standard decoder during training by adding knowledge from future context. Since it is only added during training, we are not changing the basic structure of the network or adding complexity during decoding. We evaluate our approach on the smaller TEDLIUMv2 and the larger LibriSpeech dataset, achieving consistent improvements on both of them.
CVJun 2, 2020
A Multi-Task Comparator Framework for Kinship VerificationStefan Hörmann, Martin Knoche, Gerhard Rigoll
Approaches for kinship verification often rely on cosine distances between face identification features. However, due to gender bias inherent in these features, it is hard to reliably predict whether two opposite-gender pairs are related. Instead of fine tuning the feature extractor network on kinship verification, we propose a comparator network to cope with this bias. After concatenating both features, cascaded local expert networks extract the information most relevant for their corresponding kinship relation. We demonstrate that our framework is robust against this gender bias and achieves comparable results on two tracks of the RFIW Challenge 2020. Moreover, we show how our framework can be further extended to handle partially known or unknown kinship relations.
CVMar 2, 2020
DriverMHG: A Multi-Modal Dataset for Dynamic Recognition of Driver Micro Hand Gestures and a Real-Time Recognition FrameworkOkan Köpüklü, Thomas Ledwon, Yao Rong et al.
The use of hand gestures provides a natural alternative to cumbersome interface devices for Human-Computer Interaction (HCI) systems. However, real-time recognition of dynamic micro hand gestures from video streams is challenging for in-vehicle scenarios since (i) the gestures should be performed naturally without distracting the driver, (ii) micro hand gestures occur within very short time intervals at spatially constrained areas, (iii) the performed gesture should be recognized only once, and (iv) the entire architecture should be designed lightweight as it will be deployed to an embedded system. In this work, we propose an HCI system for dynamic recognition of driver micro hand gestures, which can have a crucial impact in automotive sector especially for safety related issues. For this purpose, we initially collected a dataset named Driver Micro Hand Gestures (DriverMHG), which consists of RGB, depth and infrared modalities. The challenges for dynamic recognition of micro hand gestures have been addressed by proposing a lightweight convolutional neural network (CNN) based architecture which operates online efficiently with a sliding window approach. For the CNN model, several 3-dimensional resource efficient networks are applied and their performances are analyzed. Online recognition of gestures has been performed with 3D-MobileNetV2, which provided the best offline accuracy among the applied networks with similar computational complexities. The final architecture is deployed on a driver simulator operating in real-time. We make DriverMHG dataset and our source code publicly available.
CVDec 10, 2019
Deep Attention Based Semi-Supervised 2D-Pose Estimation for Surgical InstrumentsMert Kayhan, Okan Köpüklü, Mhd Hasan Sarhan et al.
For many practical problems and applications, it is not feasible to create a vast and accurately labeled dataset, which restricts the application of deep learning in many areas. Semi-supervised learning algorithms intend to improve performance by also leveraging unlabeled data. This is very valuable for 2D-pose estimation task where data labeling requires substantial time and is subject to noise. This work aims to investigate if semi-supervised learning techniques can achieve acceptable performance level that makes using these algorithms during training justifiable. To this end, a lightweight network architecture is introduced and mean teacher, virtual adversarial training and pseudo-labeling algorithms are evaluated on 2D-pose estimation for surgical instruments. For the applicability of pseudo-labelling algorithm, we propose a novel confidence measure, total variation. Experimental results show that utilization of semi-supervised learning improves the performance on unseen geometries drastically while maintaining high accuracy for seen geometries. For RMIT benchmark, our lightweight architecture outperforms state-of-the-art with supervised learning. For Endovis benchmark, pseudo-labelling algorithm improves the supervised baseline achieving the new state-of-the-art performance.
CVNov 15, 2019
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action LocalizationOkan Köpüklü, Xiangyu Wei, Gerhard Rigoll
Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract these information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.
ASNov 5, 2019
Small-Footprint Keyword Spotting on Raw Audio Data with Sinc-ConvolutionsSimon Mittermaier, Ludwig Kürzinger, Bernd Waschneck et al.
Keyword Spotting (KWS) enables speech-based user interaction on smart devices. Always-on and battery-powered application scenarios for smart devices put constraints on hardware resources and power consumption, while also demanding high accuracy as well as real-time capability. Previous architectures first extracted acoustic features and then applied a neural network to classify keyword probabilities, optimizing towards memory footprint and execution time. Compared to previous publications, we took additional steps to reduce power and memory consumption without reducing classification accuracy. Power-consuming audio preprocessing and data transfer steps are eliminated by directly classifying from raw audio. For this, our end-to-end architecture extracts spectral features using parametrized Sinc-convolutions. Its memory footprint is further reduced by grouping depthwise separable convolutions. Our network achieves the competitive accuracy of 96.4% on Google's Speech Commands test set with only 62k parameters.
CVSep 11, 2019
Comparative Analysis of CNN-based Spatiotemporal Reasoning in VideosOkan Köpüklü, Fabian Herzog, Gerhard Rigoll
Understanding actions and gestures in video streams requires temporal reasoning of the spatial content from different time instants, i.e., spatiotemporal (ST) modeling. In this survey paper, we have made a comparative analysis of different ST modeling techniques for action and gecture recognition tasks. Since Convolutional Neural Networks (CNNs) are proved to be an effective tool as a feature extractor for static images, we apply ST modeling techniques on the features of static images from different time instants extracted by CNNs. All techniques are trained end-to-end together with a CNN feature extraction part and evaluated on two publicly available benchmarks: The Jester and the Something-Something datasets. The Jester dataset contains various dynamic and static hand gestures, whereas the Something-Something dataset contains actions of human-object interactions. The common characteristic of these two benchmarks is that the designed architectures need to capture the full temporal content of videos in order to correctly classify actions/gestures. Contrary to expectations, experimental results show that Recurrent Neural Network (RNN) based ST modeling techniques yield inferior results compared to other techniques such as fully convolutional architectures. Codes and pretrained models of this work are publicly available.
CVJul 18, 2019
Real-Time Driver State Monitoring Using a CNN Based Spatio-Temporal ApproachNeslihan Kose, Okan Kopuklu, Alexander Unnervik et al.
Many road accidents occur due to distracted drivers. Today, driver monitoring is essential even for the latest autonomous vehicles to alert distracted drivers in order to take over control of the vehicle in case of emergency. In this paper, a spatio-temporal approach is applied to classify drivers' distraction level and movement decisions using convolutional neural networks (CNNs). We approach this problem as action recognition to benefit from temporal information in addition to spatial information. Our approach relies on features extracted from sparsely selected frames of an action using a pre-trained BN-Inception network. Experiments show that our approach outperforms the state-of-the art results on the Distracted Driver Dataset (96.31%), with an accuracy of 99.10% for 10-class classification while providing real-time performance. We also analyzed the impact of fusion using RGB and optical flow modalities with a very recent data level fusion strategy. The results on the Distracted Driver and Brain4Cars datasets show that fusion of these modalities further increases the accuracy.
CVMay 12, 2019
On Flow Profile Image for Video RepresentationMohammadreza Babaee, David Full, Gerhard Rigoll
Video representation is a key challenge in many computer vision applications such as video classification, video captioning, and video surveillance. In this paper, we propose a novel approach for video representation that captures meaningful information including motion and appearance from a sequence of video frames and compacts it into a single image. To this end, we compute the optical flow and use it in a least squares optimization to find a new image, the so-called Flow Profile Image (FPI). This image encodes motions as well as foreground appearance information while background information is removed. The quality of this image is validated in activity recognition experiments and the results are compared with other video representation techniques such as dynamic images [1] and eigen images [2]. The experimental results as well as visual quality confirm that FPIs can be successfully used in video processing applications.
CVMay 10, 2019
Talking With Your Hands: Scaling Hand Gestures and Recognition With CNNsOkan Köpüklü, Yao Rong, Gerhard Rigoll
The use of hand gestures provides a natural alternative to cumbersome interface devices for Human-Computer Interaction (HCI) systems. As the technology advances and communication between humans and machines becomes more complex, HCI systems should also be scaled accordingly in order to accommodate the introduced complexities. In this paper, we propose a methodology to scale hand gestures by forming them with predefined gesture-phonemes, and a convolutional neural network (CNN) based framework to recognize hand gestures by learning only their constituents of gesture-phonemes. The total number of possible hand gestures can be increased exponentially by increasing the number of used gesture-phonemes. For this objective, we introduce a new benchmark dataset named Scaled Hand Gestures Dataset (SHGD) with only gesture-phonemes in its training set and 3-tuples gestures in the test set. In our experimental analysis, we achieve to recognize hand gestures containing one and three gesture-phonemes with an accuracy of 98.47% (in 15 classes) and 94.69% (in 810 classes), respectively. Our dataset, code and pretrained models are publicly available.
CVApr 4, 2019
Resource Efficient 3D Convolutional Neural NetworksOkan Köpüklü, Neslihan Kose, Ahmet Gunduz et al.
Recently, convolutional neural networks with 3D kernels (3D CNNs) have been very popular in computer vision community as a result of their superior ability of extracting spatio-temporal features within video frames compared to 2D CNNs. Although there has been great advances recently to build resource efficient 2D CNN architectures considering memory and power budget, there is hardly any similar resource efficient architectures for 3D CNNs. In this paper, we have converted various well-known resource efficient 2D CNNs to 3D CNNs and evaluated their performance on three major benchmarks in terms of classification accuracy for different complexity levels. We have experimented on (1) Kinetics-600 dataset to inspect their capacity to learn, (2) Jester dataset to inspect their ability to capture motion patterns, and (3) UCF-101 to inspect the applicability of transfer learning. We have evaluated the run-time performance of each model on a single Titan XP GPU and a Jetson TX2 embedded system. The results of this study show that these models can be utilized for different types of real-world applications since they provide real-time performance with considerable accuracies and memory usage. Our analysis on different complexity levels shows that the resource efficient 3D CNNs should not be designed too shallow or narrow in order to save complexity. The codes and pretrained models used in this work are publicly available.
CVJan 29, 2019
Real-time Hand Gesture Detection and Classification Using Convolutional Neural NetworksOkan Köpüklü, Ahmet Gunduz, Neslihan Kose et al.
Real-time recognition of dynamic hand gestures from video streams is a challenging task since (i) there is no indication when a gesture starts and ends in the video, (ii) performed gestures should only be recognized once, and (iii) the entire architecture should be designed considering the memory and power budget. In this work, we address these challenges by proposing a hierarchical structure enabling offline-working convolutional neural network (CNN) architectures to operate online efficiently by using sliding window approach. The proposed architecture consists of two models: (1) A detector which is a lightweight CNN architecture to detect gestures and (2) a classifier which is a deep CNN to classify the detected gestures. In order to evaluate the single-time activations of the detected gestures, we propose to use Levenshtein distance as an evaluation metric since it can measure misclassifications, multiple detections, and missing detections at the same time. We evaluate our architecture on two publicly available datasets - EgoGesture and NVIDIA Dynamic Hand Gesture Datasets - which require temporal detection and classification of the performed hand gestures. ResNeXt-101 model, which is used as a classifier, achieves the state-of-the-art offline classification accuracy of 94.04% and 83.82% for depth modality on EgoGesture and NVIDIA benchmarks, respectively. In real-time detection and classification, we obtain considerable early detections while achieving performances close to offline operation. The codes and pretrained models used in this work are publicly available.
CVJan 28, 2019
Convolutional Neural Networks with Layer ReuseOkan Köpüklü, Maryam Babaee, Stefan Hörmann et al.
A convolutional layer in a Convolutional Neural Network (CNN) consists of many filters which apply convolution operation to the input, capture some special patterns and pass the result to the next layer. If the same patterns also occur at the deeper layers of the network, why wouldn't the same convolutional filters be used also in those layers? In this paper, we propose a CNN architecture, Layer Reuse Network (LruNet), where the convolutional layers are used repeatedly without the need of introducing new layers to get a better performance. This approach introduces several advantages: (i) Considerable amount of parameters are saved since we are reusing the layers instead of introducing new layers, (ii) the Memory Access Cost (MAC) can be reduced since reused layer parameters can be fetched only once, (iii) the number of nonlinearities increases with layer reuse, and (iv) reused layers get gradient updates from multiple parts of the network. The proposed approach is evaluated on CIFAR-10, CIFAR-100 and Fashion-MNIST datasets for image classification task, and layer reuse improves the performance by 5.14%, 5.85% and 2.29%, respectively. The source code and pretrained models are publicly available.
CVNov 9, 2018
Multiple People Tracking Using Hierarchical Deep Tracklet Re-identificationMaryam Babaee, Ali Athar, Gerhard Rigoll
The task of multiple people tracking in monocular videos is challenging because of the numerous difficulties involved: occlusions, varying environments, crowded scenes, camera parameters and motion. In the tracking-by-detection paradigm, most approaches adopt person re-identification techniques based on computing the pairwise similarity between detections. However, these techniques are less effective in handling long-term occlusions. By contrast, tracklet (a sequence of detections) re-identification can improve association accuracy since tracklets offer a richer set of visual appearance and spatio-temporal cues. In this paper, we propose a tracking framework that employs a hierarchical clustering mechanism for merging tracklets. To this end, tracklet re-identification is performed by utilizing a novel multi-stage deep network that can jointly reason about the visual appearance and spatio-temporal properties of a pair of tracklets, thereby providing a robust measure of affinity. Experimental results on the challenging MOT16 and MOT17 benchmarks show that our method significantly outperforms state-of-the-arts.
CVApr 23, 2018
Person Identification from Partial Gait Cycle Using Fully Convolutional Neural NetworkMaryam Babaee, Linwei Li, Gerhard Rigoll
Gait as a biometric property for person identification plays a key role in video surveillance and security applications. In gait recognition, normally, gait feature such as Gait Energy Image (GEI) is extracted from one full gait cycle. However in many circumstances, such a full gait cycle might not be available due to occlusion. Thus, the GEI is not complete giving rise to a degrading in gait-based person identification rate. In this paper, we address this issue by proposing a novel method to identify individuals from gait feature when a few (or even single) frame(s) is available. To do so, we propose a deep learning-based approach to transform incomplete GEI to the corresponding complete GEI obtained from a full gait cycle. More precisely, this transformation is done gradually by training several auto encoders independently and then combining these as a uniform model. Experimental results on two public gait datasets, namely OULP and Casia-B demonstrate the validity of the proposed method in dealing with very incomplete gait cycles.
CVApr 19, 2018
Motion Fused Frames: Data Level Fusion Strategy for Hand Gesture RecognitionOkan Köpüklü, Neslihan Köse, Gerhard Rigoll
Acquiring spatio-temporal states of an action is the most crucial step for action classification. In this paper, we propose a data level fusion strategy, Motion Fused Frames (MFFs), designed to fuse motion information into static images as better representatives of spatio-temporal states of an action. MFFs can be used as input to any deep learning architecture with very little modification on the network. We evaluate MFFs on hand gesture recognition tasks using three video datasets - Jester, ChaLearn LAP IsoGD and NVIDIA Dynamic Hand Gesture Datasets - which require capturing long-term temporal relations of hand movements. Our approach obtains very competitive performance on Jester and ChaLearn benchmarks with the classification accuracies of 96.28% and 57.4%, respectively, while achieving state-of-the-art performance with 84.7% accuracy on NVIDIA benchmark.
CVFeb 6, 2017
A Deep Convolutional Neural Network for Background SubtractionMohammadreza Babaee, Duc Tung Dinh, Gerhard Rigoll
In this work, we present a novel background subtraction system that uses a deep Convolutional Neural Network (CNN) to perform the segmentation. With this approach, feature engineering and parameter tuning become unnecessary since the network parameters can be learned from data by training a single CNN that can handle various video scenes. Additionally, we propose a new approach to estimate background model from video. For the training of the CNN, we employed randomly 5 percent video frames and their ground truth segmentations taken from the Change Detection challenge 2014(CDnet 2014). We also utilized spatial-median filtering as the post-processing of the network outputs. Our method is evaluated with different data-sets, and the network outperforms the existing algorithms with respect to the average ranking over different evaluation metrics. Furthermore, due to the network architecture, our CNN is capable of real time processing.
HCSep 13, 2016
Blending Entropy: A Term for Addressing Information Density in Mediated RealityPhilipp Tiefenbacher, Gerhard Rigoll
The virtuality continuum describes the degrees of positive virtuality under the umbrella term mixed reality. Besides adding virtual information within a mixed environment, diminished reality aims at reducing real world information. Mann defined the term mediated reality (MR), which also considered diminished reality, but without the possibility to describe different degrees of fusion between a mixed and a diminished reality. That is why this work defines the new term blending entropy that captures the relations between a mixed and a diminished reality. The blending entropy is based on the information density of the mediated reality and the actual area the user has to comprehend, which is named perceptual frustum. We describe the blending entropy's twodimensional dependencies and detail important points in the blending entropy's space.
CLDec 15, 2014
A Broadcast News Corpus for Evaluation and Tuning of German LVCSR SystemsFelix Weninger, Björn Schuller, Florian Eyben et al.
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus including over 160 hours of German broadcast news, and propose it as an evaluation framework of LVCSR systems. We show our own experimental results on the corpus, achieved with a state-of-the-art LVCSR decoder, measuring the effect of different feature sets and decoding parameters, and thereby demonstrate that real-time decoding of our test set is feasible on a desktop PC at 9.2% word error rate.
HCJun 11, 2014
Acoustic Gait-based Person Identification using Hidden Markov ModelsJürgen T. Geiger, Maximilian Kneißl, Björn Schuller et al.
We present a system for identifying humans by their walking sounds. This problem is also known as acoustic gait recognition. The goal of the system is to analyse sounds emitted by walking persons (mostly the step sounds) and identify those persons. These sounds are characterised by the gait pattern and are influenced by the movements of the arms and legs, but also depend on the type of shoe. We extract cepstral features from the recorded audio signals and use hidden Markov models for dynamic classification. A cyclic model topology is employed to represent individual gait cycles. This topology allows to model and detect individual steps, leading to very promising identification rates. For experimental validation, we use the publicly available TUM GAID database, which is a large gait recognition database containing 3050 recordings of 305 subjects in three variations. In the best setup, an identification rate of 65.5 % is achieved out of 155 subjects. This is a relative improvement of almost 30 % compared to our previous work, which used various audio features and support vector machines.