CV · Mar 4, 2022 · Code
F2DNet: Fast Focal Detection Network for Pedestrian Detection
Abdul Hannan Khan, Mohsin Munir, Ludger van Elst et al.
Two-stage detectors are state-of-the-art in object detection as well as pedestrian detection. However, current two-stage detectors are inefficient because they perform bounding box regression in multiple steps, i.e., in region proposal networks and bounding box heads. Also, anchor-based region proposal networks are computationally expensive to train. We propose F2DNet, a novel two-stage detection architecture which eliminates the redundancy of current two-stage detectors by replacing the region proposal network with our focal detection network and the bounding box head with our fast suppression head. We benchmark F2DNet on top pedestrian detection datasets, thoroughly compare it against the existing state-of-the-art detectors, and conduct cross-dataset evaluation to test the generalizability of our model to unseen data. Our F2DNet achieves 8.7%, 2.2%, and 6.1% MR-2 on the City Persons, Caltech Pedestrian, and Euro City Person datasets, respectively, when trained on a single dataset and reaches 20.4% and 26.2% MR-2 in the heavy occlusion setting of the Caltech Pedestrian and City Persons datasets when using progressive fine-tuning. Furthermore, F2DNet has a significantly lower inference time than the current state-of-the-art. Code and trained models will be available at https://github.com/AbdulHannanKhan/F2DNet.
CV · Aug 17, 2023
Predicting Crop Yield With Machine Learning: An Extensive Analysis Of Input Modalities And Models On a Field and sub-field Level
Deepak Pathak, Miro Miranda, Francisco Mena et al. · cmu
We introduce a simple yet effective early fusion method for crop yield prediction that handles multiple input modalities with different temporal and spatial resolutions. We use high-resolution crop yield maps as ground truth data to train crop and machine learning model agnostic methods at the sub-field level. We use Sentinel-2 satellite imagery as the primary modality for input data with other complementary modalities, including weather, soil, and DEM data. The proposed method uses input modalities available with global coverage, making the framework globally scalable. We explicitly highlight the importance of input modalities for crop yield prediction and emphasize that the best-performing combination of input modalities depends on region, crop, and chosen model.
CV · Jun 3
OA-CutMix: Correcting the Label Bias of CutMix
Tobias Christian Nauen, Stanislav Frolov, Federico Raue et al.
CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
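To make the label bias concrete, the following is a minimal sketch of an object-aware label weight, not the authors' implementation. It assumes binary object masks given as nested lists, a patch pasted at the same coordinates it was cut from, and a normalization by the total visible object area. The function names and the fallback for the case where no object is visible are assumptions made for illustration.

```python
def cutmix_area_weight(height, width, box):
    """Standard CutMix weight of the pasted image: patch area / image area."""
    top, left, bottom, right = box
    return ((bottom - top) * (right - left)) / (height * width)


def object_aware_weight(mask_base, mask_patch, box):
    """Weight of the pasted image from visible object pixels (assumed formula)."""
    top, left, bottom, right = box
    height, width = len(mask_base), len(mask_base[0])
    visible_base = 0
    visible_patch = 0
    for y in range(height):
        for x in range(width):
            inside = top <= y < bottom and left <= x < right
            if inside:
                # Inside the box only the pasted image is visible.
                visible_patch += mask_patch[y][x]
            else:
                # Outside the box only the base image is visible.
                visible_base += mask_base[y][x]
    total = visible_base + visible_patch
    if total == 0:
        # Assumption: fall back to the area-based weight if no object is visible.
        return cutmix_area_weight(height, width, box)
    return visible_patch / total


# Toy 4x4 example: both objects sit in the top-left corner,
# while the patch is cut from the (background-only) bottom-right corner.
obj = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
box = (2, 2, 4, 4)
area_w = cutmix_area_weight(4, 4, box)
object_w = object_aware_weight(obj, obj, box)
```

In this toy case the area-based rule gives the pasted image a label weight of 0.25 although none of its object pixels are visible, whereas the mask-based weight is 0.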
LG · Mar 1, 2023 · Code
On the Importance of Feature Representation for Flood Mapping using Classical Machine Learning Approaches
Kevin Iselborn, Marco Stricker, Takashi Miyamoto et al.
Climate change has increased the severity and frequency of weather disasters all around the world. Flood inundation mapping based on earth observation data can help in this context by providing cheap and accurate maps depicting the area affected by a flood event to emergency-relief units in near-real-time. Building upon the recent development of the Sen1Floods11 dataset, which provides a limited amount of hand-labeled high-quality training data, this paper evaluates the potential of five traditional machine learning approaches such as gradient boosted decision trees, support vector machines, or quadratic discriminant analysis. By performing a grid-search-based hyperparameter optimization on 23 feature spaces, we show that all considered classifiers are capable of outperforming the current state-of-the-art neural network-based approaches in terms of total IoU on their best-performing feature spaces. With total and mean IoU values of 0.8751 and 0.7031 compared to 0.70 and 0.5873 as the previous best-reported results, we show that a simple gradient boosting classifier can significantly improve over deep neural network-based approaches, despite using less training data. Furthermore, an analysis of the regional distribution of the Sen1Floods11 dataset reveals a problem of spatial imbalance. We show that traditional machine learning models can learn this bias and argue that modified metric evaluations are required to counter artifacts due to spatial imbalance. Lastly, a qualitative analysis shows that this pixel-wise classifier provides highly precise surface water classifications, indicating that a good choice of a feature space and pixel-wise classification can generate high-quality flood maps using optical and SAR data. We make our code publicly available at: https://github.com/DFKI-Earth-And-Space-Applications/Flood_Mapping_Feature_Space_Importance
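The gap between the total and mean IoU values quoted above is easier to read with a small example. The sketch below assumes that "total" IoU pools pixel counts over all images and that "mean" IoU averages per-image scores; this reading, the function names, and the rule for skipping images with an empty union are assumptions, not details taken from the paper.

```python
def iou_counts(pred, truth):
    """Intersection and union sizes for two flat lists of 0/1 labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, truth) if p == 1 or t == 1)
    return inter, union


def total_and_mean_iou(images):
    """images: list of (pred, truth) pairs of flat 0/1 lists."""
    inter_sum = 0
    union_sum = 0
    per_image = []
    for pred, truth in images:
        inter, union = iou_counts(pred, truth)
        inter_sum += inter
        union_sum += union
        if union > 0:
            # Assumption: images with an empty union are skipped in the mean.
            per_image.append(inter / union)
    total = inter_sum / union_sum if union_sum else 0.0
    mean = sum(per_image) / len(per_image) if per_image else 0.0
    return total, mean


# One large, perfectly mapped flood and one small, completely missed one.
images = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),
    ([1, 0, 0, 0], [0, 1, 0, 0]),
]
total_iou, mean_iou = total_and_mean_iou(images)
```

Here the pooled score is 4/6 while the per-image average is 0.5, which illustrates how images with large water areas can dominate a pooled metric.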
CV · Jun 1
GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics
Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise et al.
Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.
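The soft combination of cell-type-specific experts can be illustrated with a generic mixture-of-experts step. This is a minimal sketch under stated assumptions: the router logits and expert outputs are plain lists, the routing probabilities come from a softmax, and all names are hypothetical. It does not reproduce the CAP or C2CA modules.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def soft_mixture(router_logits, expert_outputs):
    """Weight each expert's gene-expression prediction by its routing probability."""
    probs = softmax(router_logits)
    n_genes = len(expert_outputs[0])
    return [
        sum(p * out[g] for p, out in zip(probs, expert_outputs))
        for g in range(n_genes)
    ]


# Two hypothetical cell-type experts predicting two genes for one cell.
prediction = soft_mixture([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```

With equal router logits both experts receive probability 0.5, so the prediction is the average of the two expert outputs.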
CV · Sep 27, 2022
Hitchhiker's Guide to Super-Resolution: Introduction and Recent Advances
Brian Moser, Federico Raue, Stanislav Frolov et al.
With the advent of Deep Learning (DL), Super-Resolution (SR) has also become a thriving research area. However, despite promising results, the field still faces challenges that require further research, e.g., allowing flexible upsampling, more effective loss functions, and better evaluation metrics. We review the domain of SR in light of recent advances, and examine state-of-the-art models such as diffusion (DDPM) and transformer-based SR models. We present a critical discussion on contemporary strategies used in SR, and identify promising yet unexplored research directions. We complement previous surveys by incorporating the latest developments in the field such as uncertainty-driven losses, wavelet networks, neural architecture search, novel normalization methods, and the latest evaluation techniques. We also include several visualizations for the models and methods throughout each chapter in order to facilitate a global understanding of the trends in the field. This review is ultimately aimed at helping researchers to push the boundaries of DL applied to SR.
LG · Nov 8, 2022
Privacy Meets Explainability: A Comprehensive Impact Benchmark
Saifullah Saifullah, Dominique Mercier, Adriano Lucieri et al.
Since the mid-2010s, the era of Deep Learning (DL) has continued to this day, bringing forth new superlatives and innovations each year. Nevertheless, the speed with which these innovations translate into real applications lags behind this fast pace. Safety-critical applications, in particular, are subject to strict regulatory and ethical requirements which need to be taken care of and are still active areas of debate. eXplainable AI (XAI) and privacy-preserving machine learning (PPML) are both crucial research fields, aiming at mitigating some of the drawbacks of prevailing data-hungry black-box models in DL. Despite brisk research activity in the respective fields, no attention has yet been paid to their interaction. This work is the first to investigate the impact of private learning techniques on generated explanations for DL-based models. In an extensive experimental analysis covering various image and time series datasets from multiple domains, as well as varying privacy techniques, XAI methods, and model architectures, the effects of private training on generated explanations are studied. The findings suggest non-negligible changes in explanations through the introduction of privacy. Apart from reporting individual effects of PPML on XAI, the paper gives clear recommendations for the choice of techniques in real applications. By unveiling the interdependencies of these pivotal technologies, this work is a first step towards overcoming the remaining hurdles for practically applicable AI in safety-critical domains.
CV · Apr 4, 2023
Waving Goodbye to Low-Res: A Diffusion-Wavelet Approach for Image Super-Resolution
Brian Moser, Stanislav Frolov, Federico Raue et al.
This paper presents a novel Diffusion-Wavelet (DiWa) approach for Single-Image Super-Resolution (SISR). It leverages the strengths of Denoising Diffusion Probabilistic Models (DDPMs) and Discrete Wavelet Transformation (DWT). By enabling DDPMs to operate in the DWT domain, our DDPM models effectively hallucinate high-frequency information for super-resolved images on the wavelet spectrum, resulting in high-quality and detailed reconstructions in image space. Quantitatively, we outperform state-of-the-art diffusion-based SISR methods, namely SR3 and SRDiff, regarding PSNR, SSIM, and LPIPS on both face (8x scaling) and general (4x scaling) SR benchmarks. Meanwhile, using DWT enabled us to use fewer parameters than the compared models: 92M parameters instead of 550M compared to SR3 and 9.3M instead of 12M compared to SRDiff. Additionally, our method outperforms other state-of-the-art generative methods on classical general SR datasets while saving inference time. Finally, our work highlights its potential for various applications.
LG · Mar 14, 2022
Less is More: Proxy Datasets in NAS approaches
Brian Moser, Federico Raue, Jörn Hees et al.
Neural Architecture Search (NAS) defines the design of Neural Networks as a search problem. Unfortunately, NAS is computationally intensive because of the many possibilities arising from the number of elements in the design and the possible connections between them. In this work, we extensively analyze the role of the dataset size based on several sampling approaches for reducing the dataset size (unsupervised and supervised cases) as an agnostic approach to reduce search time. We compared these techniques with four common NAS approaches in NAS-Bench-201 in roughly 1,400 experiments on CIFAR-100. One of our surprising findings is that in most cases we can reduce the amount of training data to 25%, consequently reducing search time to 25%, while at the same time maintaining the same accuracy as if training on the full dataset. Additionally, some designs derived from subsets outperform designs derived from the full dataset by up to 22 p.p. accuracy.
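One simple way to build a 25% proxy dataset is class-stratified random sampling. The sketch below uses that technique as a stand-in; it is not one of the specific sampling approaches evaluated in the paper, and the function name and the rule of keeping at least one sample per class are assumptions.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a class-balanced random subset of the dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Assumption: keep at least one example of every class.
        keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


# Toy dataset: eight samples of class 0 followed by four samples of class 1.
labels = [0] * 8 + [1] * 4
subset = stratified_subset(labels, 0.25)
```

The subset keeps the original class ratio, so a search run on it sees the same label distribution at a quarter of the data.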
CV · Jul 22, 2024 · Code
SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time
Stanislav Frolov, Brian B. Moser, Andreas Dengel
Generating high-resolution images with generative models has recently been made widely accessible by leveraging diffusion models pre-trained on large-scale datasets. Various techniques, such as MultiDiffusion and SyncDiffusion, have further pushed image generation beyond training resolutions, i.e., from square images to panorama, by merging multiple overlapping diffusion paths or employing gradient descent to maintain perceptual coherence. However, these methods suffer from significant computational inefficiencies due to generating and averaging numerous predictions, which is required in practice to produce high-quality and seamless images. This work addresses this limitation and presents a novel approach that eliminates the need to generate and average numerous overlapping denoising predictions. Our method shifts non-overlapping denoising windows over time, ensuring that seams in one timestep are corrected in the next. This results in coherent, high-resolution images with fewer overall steps. We demonstrate the effectiveness of our approach through qualitative and quantitative evaluations, comparing it with MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our method offers several key benefits, including improved computational efficiency and faster inference times while producing comparable or better image quality. Link to code https://github.com/stanifrolov/spotdiffusion
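The idea of shifting non-overlapping windows over time can be sketched on a single row of latent columns. This is an illustration under stated assumptions rather than the released implementation: windows wrap around the panorama border, the shift per timestep is passed in by the caller, and both function names are hypothetical.

```python
def shifted_windows(width, window, shift):
    """Split range(width) into non-overlapping windows that start at `shift`."""
    if width % window != 0:
        raise ValueError("width must be a multiple of window")
    windows = []
    for k in range(width // window):
        start = (shift + k * window) % width
        # Assumption: columns wrap around the panorama border.
        windows.append([(start + i) % width for i in range(window)])
    return windows


def seam_positions(width, window, shift):
    """Columns at which a new window begins, i.e. where seams can appear."""
    return sorted((shift + k * window) % width for k in range(width // window))


# Two consecutive timesteps with different shifts on an 8-column latent.
seams_t0 = seam_positions(8, 4, 0)
seams_t1 = seam_positions(8, 4, 2)
covered_t1 = sorted(c for w in shifted_windows(8, 4, 2) for c in w)
```

Every column is denoised exactly once per timestep, and the seam columns of one timestep fall inside a window in the next, so no overlapping predictions need to be averaged.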
CV · Jan 9, 2023
Instance Segmentation Based Graph Extraction for Handwritten Circuit Diagram Images
Johannes Bayer, Amit Kumar Roy, Andreas Dengel
Handwritten circuit diagrams from educational scenarios or historic sources usually exist on analogue media. For deriving their functional principles or flaws automatically, they need to be digitized, extracting their electrical graph. Recently, the base technologies for automated pipelines facilitating this process shifted from computer vision to machine learning. This paper describes an approach for extracting both the electrical components (including their terminals and describing texts) as well as their interconnections (including junctions and wire hops) by means of instance segmentation and keypoint extraction. Consequently, the resulting graph extraction process consists of a simple two-step process of model inference and trivial geometric keypoint matching. The dataset itself, its preparation, model training and post-processing are described and publicly available.
CV · Aug 15, 2023
Dynamic Attention-Guided Diffusion for Image Super-Resolution
Brian B. Moser, Stanislav Frolov, Federico Raue et al.
Diffusion models in image Super-Resolution (SR) treat all image regions uniformly, which risks compromising the overall image quality by potentially introducing artifacts during denoising of less-complex regions. To address this, we propose "You Only Diffuse Areas" (YODA), a dynamic attention-guided diffusion process for image SR. YODA selectively focuses on spatial regions defined by attention maps derived from the low-resolution images and the current denoising time step. This time-dependent targeting enables a more efficient conversion to high-resolution outputs by focusing on areas that benefit the most from the iterative refinement process, i.e., detail-rich objects. We empirically validate YODA by extending leading diffusion-based methods SR3, DiffBIR, and SRDiff. Our experiments demonstrate new state-of-the-art performances in face and general SR tasks across PSNR, SSIM, and LPIPS metrics. As a side effect, we find that YODA reduces color shift issues and stabilizes training with small batches.
CV · Aug 18, 2023
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
Tobias Christian Nauen, Sebastian Palacio, Federico Raue et al.
Self-attention in Transformers comes with a high computational cost because of its quadratic computational complexity, but the effectiveness of Transformers in addressing problems in language and vision has sparked extensive research aimed at enhancing their efficiency. However, diverse experimental conditions, spanning multiple input domains, prevent a fair comparison based solely on reported results, posing challenges for model selection. To address this gap in comparability, we perform a large-scale benchmark of more than 45 models for image classification, evaluating key efficiency aspects, including accuracy, speed, and memory usage. Our benchmark provides a standardized baseline for efficiency-oriented transformers. We analyze the results based on the Pareto front -- the boundary of optimal models. Surprisingly, despite claims of other models being more efficient, ViT remains Pareto optimal across multiple metrics. We observe that hybrid attention-CNN models exhibit remarkable inference memory- and parameter-efficiency. Moreover, our benchmark shows that using a larger model in general is more efficient than using higher resolution images. Thanks to our holistic evaluation, we provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting or developing efficient transformers.
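A Pareto front over two metrics can be computed with a simple dominance check. The sketch below assumes that higher is better for both accuracy and throughput; the model names and numbers are made up for illustration and are not results from the benchmark.

```python
def pareto_front(models):
    """models: dict name -> (accuracy, throughput); higher is better for both."""
    front = []
    for name, (acc, thr) in models.items():
        dominated = any(
            # Another model is at least as good on both metrics
            # and strictly better on at least one of them.
            (a >= acc and t >= thr) and (a > acc or t > thr)
            for other, (a, t) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical models: (top-1 accuracy, images per second).
models = {
    "A": (0.80, 1000),
    "B": (0.82, 800),
    "C": (0.79, 900),
    "D": (0.85, 300),
}
front = pareto_front(models)
```

Model C is dominated by model A, which is both more accurate and faster, so only A, B, and D remain on the front.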
IV · Jul 10, 2023
DWA: Differential Wavelet Amplifier for Image Super-Resolution
Brian B. Moser, Stanislav Frolov, Federico Raue et al.
This work introduces Differential Wavelet Amplifier (DWA), a drop-in module for wavelet-based image Super-Resolution (SR). DWA invigorates an approach recently receiving less attention, namely Discrete Wavelet Transformation (DWT). DWT enables an efficient image representation for SR and reduces the spatial area of its input by a factor of 4, the overall model size, and computation cost, framing it as an attractive approach for sustainable ML. Our proposed DWA model improves wavelet-based SR models by leveraging the difference between two convolutional filters to refine relevant feature extraction in the wavelet domain, emphasizing local contrasts and suppressing common noise in the input signals. We show its effectiveness by integrating it into existing SR models, e.g., DWSR and MWCNN, and demonstrate a clear improvement in classical SR tasks. Moreover, DWA enables a direct application of DWSR and MWCNN to input image space, reducing the DWT representation channel-wise since it omits traditional DWT.
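The factor-of-4 reduction in spatial area follows from a single level of a 2D DWT, which maps each 2x2 pixel block to one coefficient in each of four sub-bands. A minimal sketch with the Haar wavelet is shown below; it assumes an even-sized grayscale image stored as nested lists and an averaging normalization (dividing by 4 rather than by 2 as in the orthonormal transform), and it does not implement the DWA module itself.

```python
def haar_dwt2(image):
    """One level of a 2D Haar transform; returns four half-resolution sub-bands."""
    rows, cols = len(image), len(image[0])
    approx, row_diff, col_diff, diag = [], [], [], []
    for y in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for x in range(0, cols, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            a_row.append((a + b + c + d) / 4)  # local average (low-pass)
            r_row.append((a + b - c - d) / 4)  # top minus bottom row
            c_row.append((a - b + c - d) / 4)  # left minus right column
            d_row.append((a - b - c + d) / 4)  # diagonal difference
        approx.append(a_row)
        row_diff.append(r_row)
        col_diff.append(c_row)
        diag.append(d_row)
    return approx, row_diff, col_diff, diag


# A 4x4 image becomes four 2x2 sub-bands.
flat = [[5, 5, 5, 5] for _ in range(4)]
bands_flat = haar_dwt2(flat)
bands_block = haar_dwt2([[1, 2], [3, 4]])
```

For a constant image all detail bands are zero, which shows that they encode only local contrast.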
CV · Aug 10, 2023
A Comparative Assessment of Multi-view fusion learning for Crop Classification
Francisco Mena, Diego Arenas, Marlon Nuske et al.
With a rapidly increasing amount and diversity of remote sensing (RS) data sources, there is a strong need for multi-view learning modeling. This is a complex task when considering the differences in resolution, magnitude, and noise of RS data. The typical approach for merging multiple RS sources has been input-level fusion, but other - more advanced - fusion strategies may outperform this traditional approach. This work assesses different fusion strategies for crop classification in the CropHarvest dataset. The fusion methods proposed in this work outperform models based on individual views and previous fusion methods. We do not find one single fusion method that consistently outperforms all other approaches. Instead, we present a comparison of multi-view fusion methods for three different datasets and show that, depending on the test region, different methods obtain the best performance. Despite this, we suggest a preliminary criterion for the selection of fusion methods.
CV · Apr 11, 2023
DartsReNet: Exploring new RNN cells in ReNet architectures
Brian Moser, Federico Raue, Jörn Hees et al.
We present new Recurrent Neural Network (RNN) cells for image classification using a Neural Architecture Search (NAS) approach called DARTS. We are interested in the ReNet architecture, which is an RNN-based approach presented as an alternative to convolutional and pooling steps. ReNet can be defined using any standard RNN cells, such as LSTM and GRU. One limitation is that standard RNN cells were designed for one-dimensional sequential data and not for two dimensions, as is the case for image classification. We overcome this limitation by using DARTS to find new cell designs. We compare our results with ReNet that uses GRU and LSTM cells. The cells we found outperform the standard RNN cells on CIFAR-10 and SVHN. The improvements on SVHN indicate generalizability, as we derived the RNN cell designs from CIFAR-10 without performing a new cell search for SVHN.
CV · Jul 15, 2024 · Code
PupilSense: A Novel Application for Webcam-Based Pupil Diameter Estimation
Vijul Shah, Ko Watanabe, Brian B. Moser et al.
Measuring pupil diameter is vital for gaining insights into physiological and psychological states - traditionally captured by expensive, specialized equipment like Tobii eye-trackers and Pupillabs glasses. This paper presents a novel application that enables pupil diameter estimation using standard webcams, making the process accessible in everyday environments without specialized equipment. Our app estimates pupil diameters from videos and offers detailed analysis, including class activation maps, graphs of predicted left and right pupil diameters, and eye aspect ratios during blinks. This tool expands the accessibility of pupil diameter measurement, particularly in everyday settings, benefiting fields like human behavior research and healthcare. Additionally, we present a new open source dataset for pupil diameter estimation using webcam images containing cropped eye images and corresponding pupil diameter measurements.
AI · Aug 3, 2023
Towards Self-organizing Personal Knowledge Assistants in Evolving Corporate Memories
Christian Jilek, Markus Schröder, Heiko Maus et al.
This paper presents a retrospective overview of a decade of research in our department towards self-organizing personal knowledge assistants in evolving corporate memories. Our research is typically inspired by real-world problems and often conducted in interdisciplinary collaborations with research and industry partners. We summarize past experiments and results comprising topics like various ways of knowledge graph construction in corporate and personal settings, Managed Forgetting and (Self-organizing) Context Spaces as a novel approach to Personal Information Management (PIM) and knowledge work support. Past results are complemented by an overview of related work and some of our latest findings not published so far. Last, we give an overview of our related industry use cases including a detailed look into CoMem, a Corporate Memory based on our presented research already in productive use and providing challenges for further research. Many contributions are only first steps in new directions with still a lot of untapped potential, especially with regard to further increasing the automation in PIM and knowledge work support.
LG · Jan 31, 2023
A Bayesian Generative Adversarial Network (GAN) to Generate Synthetic Time-Series Data, Application in Combined Sewer Flow Prediction
Amin E. Bakhshipour, Alireza Koochali, Ulrich Dittmer et al.
Despite various breakthroughs in machine learning and data analysis techniques for improving smart operation and management of urban water infrastructures, some key limitations obstruct this progress. Among these shortcomings, the absence of freely available data due to data privacy or high costs of data gathering and the lack of adequate rare or extreme events in the available data play a crucial role. Here, Generative Adversarial Networks (GANs) can help overcome these challenges. In machine learning, generative models are a class of methods capable of learning a data distribution to generate artificial data. In this study, we developed a GAN model to generate synthetic time series to balance our limited recorded time series data and improve the accuracy of a data-driven model for combined sewer flow prediction. We considered the sewer system of a small town in Germany as the test case. Precipitation and inflow to the storage tanks are used for the data-driven model development. The aim is to predict the flow using precipitation data and examine the impact of data augmentation using synthetic data on model performance. Results show that the GAN can successfully generate synthetic time series from the real data distribution, which enables more accurate peak flow prediction. However, the model without data augmentation works better for dry weather prediction. Therefore, an ensemble model is suggested to combine the advantages of both models.
LG · Oct 14, 2022
Quantifying Quality of Class-Conditional Generative Models in Time-Series Domain
Alireza Koochali, Maria Walch, Sankrutyayan Thota et al.
Generative models are designed to address the data scarcity problem. Even with the exploding amount of data, due to computational advancements, some applications (e.g., health care, weather forecast, fault detection) still suffer from data insufficiency, especially in the time-series domain. Thus generative models are essential and powerful tools, but they still lack a consensual approach for quality assessment. Such deficiency hinders the confident application of modern implicit generative models on time-series data. Inspired by assessment methods on the image domain, we introduce the InceptionTime Score (ITS) and the Frechet InceptionTime Distance (FITD) to gauge the qualitative performance of class conditional generative models on the time-series domain. We conduct extensive experiments on 80 different datasets to study the discriminative capabilities of proposed metrics alongside two existing evaluation metrics: Train on Synthetic Test on Real (TSTR) and Train on Real Test on Synthetic (TRTS). Extensive evaluation reveals that the proposed assessment method, i.e., ITS and FITD in combination with TSTR, can accurately assess class-conditional generative model performance.
CV · Jun 13, 2022
Revisiting the Shape-Bias of Deep Learning for Dermoscopic Skin Lesion Classification
Adriano Lucieri, Fabian Schmeisser, Christoph Peter Balada et al.
It is generally believed that the human visual system is biased towards the recognition of shapes rather than textures. This assumption has led to a growing body of work aiming to align deep models' decision-making processes with the fundamental properties of human vision. The reliance on shape features is primarily expected to improve the robustness of these models under covariate shift. In this paper, we revisit the significance of shape-biases for the classification of skin lesion images. Our analysis shows that different skin lesion datasets exhibit varying biases towards individual image features. Interestingly, despite deep feature extractors being inclined towards learning entangled features for skin lesion classification, individual features can still be decoded from this entangled representation. This indicates that these features are still represented in the learnt embedding spaces of the models, but not used for classification. In addition, the spectral analysis of different datasets shows that in contrast to common visual recognition, dermoscopic skin lesion classification, by nature, is reliant on complex feature combinations beyond shape-bias. As a natural consequence, shifting away from the prevalent desire of shape-biasing models can even improve skin lesion classifiers in some cases.
CL · Apr 13, 2022
A Novel Approach to Train Diverse Types of Language Models for Health Mention Classification of Tweets
Pervaiz Iqbal Khan, Imran Razzak, Andreas Dengel et al.
Health mention classification deals with disease detection in a given text containing disease words. However, non-health and figurative use of disease words adds challenges to the task. Recently, adversarial training acting as a means of regularization has gained popularity in many NLP tasks. In this paper, we propose a novel approach to train language models for health mention classification of tweets that involves adversarial training. We generate adversarial examples by adding perturbation to the representations of transformer models for tweet examples at various levels using Gaussian noise. Further, we employ contrastive loss as an additional objective function. We evaluate the proposed method on the extended version of the PHM2017 dataset. Results show that our proposed approach improves the performance of the classifier significantly over the baseline methods. Moreover, our analysis shows that adding noise at earlier layers improves models' performance whereas adding noise at intermediate layers deteriorates models' performance. Finally, adding noise towards the final layers performs better than adding noise at the middle layers.
CV · Apr 5, 2022
DT2I: Dense Text-to-Image Generation from Region Descriptions
Stanislav Frolov, Prateek Bansal, Jörn Hees et al.
Despite astonishing progress, generating realistic images of complex scenes remains a challenging problem. Recently, layout-to-image synthesis approaches have attracted much interest by conditioning the generator on a list of bounding boxes and corresponding class labels. However, previous approaches are very restrictive because the set of labels is fixed a priori. Meanwhile, text-to-image synthesis methods have substantially improved and provide a flexible way for conditional image generation. In this work, we introduce dense text-to-image (DT2I) synthesis as a new task to pave the way toward more intuitive image generation. Furthermore, we propose DTC-GAN, a novel method to generate images from semantically rich region descriptions, and a multi-modal region feature matching loss to encourage semantic image-text matching. Our results demonstrate the capability of our approach to generate plausible images of complex scenes using region captions.
HC · Apr 17 · Code
HandyLabel: Towards Post-Processing to Real-Time Annotation Using Skeleton Based Hand Gesture Recognition
Sachin Kumar Singh, Ko Watanabe, Brian Moser et al.
The success of machine learning is deeply linked to the availability of high-quality training data, yet retrieving and manually labeling new data remains a time-consuming and error-prone process. Traditional annotation tools, such as Label Studio, often require post-processing, where users label data after it has been recorded. Post-processing is highly time-consuming and labor-intensive, especially with large datasets, and may lead to erroneous annotations due to the difficulty of subjects' memory tasks when labeling cognitive activities such as emotions or comprehension levels. In this work, we introduce HandyLabel, a real-time annotation tool that leverages hand gesture recognition to map hand signs for labeling. The application enables users to customize gesture mappings through a web-based interface, allowing for real-time annotations. To ensure the performance of HandyLabel, we evaluate several hand gesture recognition models on an open-source hand sign (HaGRID) dataset, with and without skeleton-based preprocessing. We discovered that ResNet50 with preprocessed skeleton-based images achieves an F1-score of 0.923. To validate the usability of HandyLabel, a user study was conducted with 46 participants. The results suggest that 88.9% of participants preferred HandyLabel over traditional annotation tools.
AI · Mar 3, 2022
Improving Health Mentioning Classification of Tweets using Contrastive Adversarial Training
Pervaiz Iqbal Khan, Shoaib Ahmed Siddiqui, Imran Razzak et al.
Health mentioning classification (HMC) classifies an input text as health mention or not. Figurative and non-health mention of disease words makes the classification task challenging. Learning the context of the input text is the key to this problem. The idea is to learn word representation by its surrounding words and utilize emojis in the text to help improve the classification results. In this paper, we improve the word representation of the input text using adversarial training that acts as a regularizer during fine-tuning of the model. We generate adversarial examples by perturbing the embeddings of the model and then train the model on a pair of clean and adversarial examples. Additionally, we utilize contrastive loss that pushes a pair of clean and perturbed examples close to each other and other examples away in the representation space. We train and evaluate the method on an extended version of the publicly available PHM2017 dataset. Experiments show an improvement of 1.0% over the BERT-Large baseline, 0.6% over the RoBERTa-Large baseline, and 5.8% over the state of the art in terms of F1 score. Furthermore, we provide a brief analysis of the results by utilizing the power of explainable AI.
LGMar 28, 2023
From Private to Public: Benchmarking GANs in the Context of Private Time Series ClassificationDominique Mercier, Andreas Dengel, Sheraz Ahmed
Deep learning has proven to be successful in various domains and for different tasks. However, when it comes to private data several restrictions are making it difficult to use deep learning approaches in these application fields. Recent approaches try to generate data privately instead of applying a privacy-preserving mechanism directly, on top of the classifier. The solution is to create public data from private data in a manner that preserves the privacy of the data. In this work, two very prominent GAN-based architectures were evaluated in the context of private time series classification. In contrast to previous work, mostly limited to the image domain, the scope of this benchmark was the time series domain. The experiments show that especially GSWGAN performs well across a variety of public datasets outperforming the competitor DPWGAN. An analysis of the generated datasets further validates the superiority of GSWGAN in the context of time series generation.
LGJul 11, 2024
Explainability of Sub-Field Level Crop Yield Prediction using Remote SensingHiba Najjar, Miro Miranda, Marlon Nuske et al.
Crop yield forecasting plays a significant role in addressing growing concerns about food security and guiding decision-making for policymakers and farmers. When deep learning is employed, understanding the learning and decision-making processes of the models, as well as their interaction with the input data, is crucial for establishing trust in the models and gaining insight into their reliability. In this study, we focus on the task of crop yield prediction, specifically for soybean, wheat, and rapeseed crops in Argentina, Uruguay, and Germany. Our goal is to develop and explain predictive models for these crops, using a large dataset of satellite images, additional data modalities, and crop yield maps. We employ a long short-term memory network and investigate the impact of using different temporal samplings of the satellite data and the benefit of adding more relevant modalities. For model explainability, we utilize feature attribution methods to quantify input feature contributions, identify critical growth stages, analyze yield variability at the field level, and explain less accurate predictions. The modeling results show an improvement when adding more modalities or using all available instances of satellite data. The explainability results reveal distinct feature importance patterns for each crop and region. We further found that the most influential growth stages on the prediction are dependent on the temporal sampling of the input data. We demonstrated how these critical growth stages, which hold significant agronomic value, closely align with the existing literature in agronomy and crop development biology.
CVJul 4, 2024
DocXplain: A Novel Model-Agnostic Explainability Method for Document Image ClassificationSaifullah Saifullah, Stefan Agne, Andreas Dengel et al.
Deep learning (DL) has revolutionized the field of document image analysis, showcasing superhuman performance across a diverse set of tasks. However, the inherent black-box nature of deep learning models still presents a significant challenge to their safe and robust deployment in industry. Regrettably, while a plethora of research has been dedicated in recent years to the development of DL-powered document analysis systems, research addressing their transparency aspects has been relatively scarce. In this paper, we aim to bridge this research gap by introducing DocXplain, a novel model-agnostic explainability method specifically designed for generating high interpretability feature attribution maps for the task of document image classification. In particular, our approach involves independently segmenting the foreground and background features of the documents into different document elements and then ablating these elements to assign feature importance. We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics, 2 widely recognized document benchmark datasets, and 10 state-of-the-art document image classification models. By conducting a thorough quantitative and qualitative analysis against 9 existing state-of-the-art attribution methods, we demonstrate the superiority of our approach in terms of both faithfulness and interpretability. To the best of the authors' knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images. We anticipate that our work will significantly contribute to advancing research on transparency, fairness, and robustness of document image classification models.
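The ablation idea in this abstract (remove one document element at a time and measure how the classifier's score changes) can be sketched as follows. This is a generic occlusion-style attribution under stated assumptions, not DocXplain itself: the segmentation is given as an input, and `score_fn`, `segments`, and `fill_value` are hypothetical names.

```python
def ablation_attribution(image, segments, score_fn, fill_value=0.0):
    """Assign each segment an importance equal to the score drop
    observed when that segment alone is replaced by fill_value.

    image:    2D list of floats (rows of pixel intensities)
    segments: dict mapping an element name to a list of (row, col) pixels
    score_fn: callable taking an image and returning the class score
    """
    baseline = score_fn(image)
    importance = {}
    for name, pixels in segments.items():
        ablated = [row[:] for row in image]  # copy so the input stays intact
        for r, c in pixels:
            ablated[r][c] = fill_value
        importance[name] = baseline - score_fn(ablated)
    return importance
```

A positive importance means the score fell when the element was removed, so the model relied on it; a value near zero means the element barely mattered for this prediction.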
IVOct 31, 2023
Medi-CAT: Contrastive Adversarial Training for Medical Image ClassificationPervaiz Iqbal Khan, Andreas Dengel, Sheraz Ahmed
There are not many large medical image datasets available. For these datasets, too small deep learning models can't learn useful features, so they don't work well due to underfitting, and too big models tend to overfit the limited data. As a result, there is a compromise between the two issues. This paper proposes a training strategy Medi-CAT to overcome the underfitting and overfitting phenomena in medical imaging datasets. Specifically, the proposed training methodology employs large pre-trained vision transformers to overcome underfitting and adversarial and contrastive learning techniques to prevent overfitting. The proposed method is trained and evaluated on four medical image classification datasets from the MedMNIST collection. Our experimental results indicate that the proposed approach improves the accuracy up to 2% on three benchmark datasets compared to well-known approaches, whereas it increases the performance up to 4.1% over the baseline methods.
CVAug 19, 2024
Latent Diffusion for Guided Document Table GenerationSyed Jawwad Haider Hamdani, Saifullah Saifullah, Stefan Agne et al.
Obtaining annotated table structure data for complex tables is a challenging task due to the inherent diversity and complexity of real-world document layouts. The scarcity of publicly available datasets with comprehensive annotations for intricate table structures hinders the development and evaluation of models designed for such scenarios. This research paper introduces a novel approach for generating annotated images for table structure by leveraging conditioned mask images of rows and columns through the application of latent diffusion models. The proposed method aims to enhance the quality of synthetic data used for training object detection models. Specifically, the study employs a conditioning mechanism to guide the generation of complex document table images, ensuring a realistic representation of table layouts. To evaluate the effectiveness of the generated data, we employ the popular YOLOv5 object detection model for training. The generated table images serve as valuable training samples, enriching the dataset with diverse table structures. The model is subsequently tested on the challenging pubtables-1m testset, a benchmark for table structure recognition in complex document layouts. Experimental results demonstrate that the introduced approach significantly improves the quality of synthetic data for training, leading to YOLOv5 models with enhanced performance. The mean Average Precision (mAP) values obtained on the pubtables-1m testset showcase results closely aligned with state-of-the-art methods. Furthermore, low FID results obtained on the synthetic data further validate the efficacy of the proposed methodology in generating annotated images for table structure.
CLJul 22, 2024
StylusAI: Stylistic Adaptation for Robust German Handwritten Text GenerationNauman Riaz, Saifullah Saifullah, Stefan Agne et al.
In this study, we introduce StylusAI, a novel architecture leveraging diffusion models in the domain of handwriting style generation. StylusAI is specifically designed to adapt and integrate the stylistic nuances of one language's handwriting into another, particularly focusing on blending English handwriting styles into the context of the German writing system. This approach enables the generation of German text in English handwriting styles and English text in German handwriting styles, enriching machine-generated handwriting diversity while ensuring that the generated text remains legible across both languages. To support the development and evaluation of StylusAI, we present the 'Deutscher Handschriften-Datensatz' (DHSD), a comprehensive dataset encompassing 37 distinct handwriting styles within the German language. This dataset provides a fundamental resource for training and benchmarking in the realm of handwritten text generation. Our results demonstrate that StylusAI not only introduces a new method for style adaptation in handwritten text generation but also surpasses existing models in generating handwriting samples that improve both text quality and stylistic fidelity, evidenced by its performance on the IAM database and our newly proposed DHSD. Thus, StylusAI represents a significant advancement in the field of handwriting style generation, offering promising avenues for future research and applications in cross-linguistic style adaptation for languages with similar scripts.
AIOct 29, 2023
A Unique Training Strategy to Enhance Language Models Capabilities for Health Mention Detection from Social Media ContentPervaiz Iqbal Khan, Muhammad Nabeel Asim, Andreas Dengel et al.
An ever-increasing amount of social media content requires advanced AI-based computer programs capable of extracting useful information. Specifically, the extraction of health-related content from social media is useful for the development of diverse types of applications including disease spread, mortality rate prediction, and finding the impact of diverse types of drugs on diverse types of diseases. Language models are competent in extracting the syntax and semantics of text. However, they have a hard time extracting similar patterns from social media texts. The primary reason for this shortfall lies in the non-standardized writing style commonly employed by social media users. Following the need for an optimal language model competent in extracting useful patterns from social media text, the key goal of this paper is to train language models in such a way that they learn to derive generalized patterns. The key goal is achieved through the incorporation of random weighted perturbation and contrastive learning strategies. On top of a unique training strategy, a meta predictor is proposed that reaps the benefits of 5 different language models for discriminating posts of social media text into non-health and health-related classes. Comprehensive experimentation across 3 public benchmark datasets reveals that the proposed training strategy improves the performance of the language models up to 3.87%, in terms of F1-score, as compared to their performance with traditional training. Furthermore, the proposed meta predictor outperforms existing health mention classification predictors across all 3 benchmark datasets.
CLOct 5, 2023
PrIeD-KIE: Towards Privacy Preserved Document Key Information ExtractionSaifullah Saifullah, Stefan Agne, Andreas Dengel et al.
In this paper, we introduce strategies for developing private Key Information Extraction (KIE) systems by leveraging large pretrained document foundation models in conjunction with differential privacy (DP), federated learning (FL), and Differentially Private Federated Learning (DP-FL). Through extensive experimentation on six benchmark datasets (FUNSD, CORD, SROIE, WildReceipts, XFUND, and DOCILE), we demonstrate that large document foundation models can be effectively fine-tuned for the KIE task under private settings to achieve adequate performance while maintaining strong privacy guarantees. Moreover, by thoroughly analyzing the impact of various training and model parameters on model performance, we propose simple yet effective guidelines for achieving an optimal privacy-utility trade-off for the KIE task under global DP. Finally, we introduce FeAm-DP, a novel DP-FL algorithm that enables efficiently upscaling global DP from a standalone context to a multi-client federated environment. We conduct a comprehensive evaluation of the algorithm across various client and privacy settings, and demonstrate its capability to achieve comparable performance and privacy guarantees to standalone DP, even when accommodating an increasing number of participating clients. Overall, our study offers valuable insights into the development of private KIE systems, and highlights the potential of document foundation models for privacy-preserved Document AI applications. To the best of the authors' knowledge, this is the first work that explores privacy-preserved document KIE using document foundation models.
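For readers unfamiliar with how differential privacy enters fine-tuning, the following sketch shows the usual clip-then-add-noise gradient step used in DP-SGD. Whether the paper uses exactly this mechanism is an assumption; FeAm-DP is not reproduced here, and `max_norm` and `noise_multiplier` are illustrative parameter names.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """DP-SGD-style aggregation (assumed mechanism): clip each example's
    gradient, sum, add Gaussian noise scaled to the clipping bound, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]
```

Clipping bounds how much any single training document can move the model, and the noise scale is tied to that bound; a larger `noise_multiplier` gives stronger privacy at the cost of utility, which is the trade-off the abstract refers to.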
CVApr 16
MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical EngineeringSaif ur Rehman Khan, Imad Ahmed Waqar, Arooj Zaib et al.
Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.
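The squeeze-and-excitation style channel attention mentioned in this abstract follows a well-known pattern: pool each channel to a scalar, pass the vector through a small bottleneck, and rescale the channels with the resulting gates. The sketch below shows only that generic pattern in plain Python; the weight matrices are hypothetical, biases are omitted, and the multi-scale and spatial attention parts of MS-SSE-Net are not covered.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]


def dense(vec, weights, activation):
    """Fully connected layer without bias; weights is a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, vec))) for row in weights]


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def se_block(feature_maps, w_reduce, w_expand):
    """Squeeze, excite through a ReLU bottleneck and sigmoid gates, then rescale."""
    hidden = dense(squeeze(feature_maps), w_reduce, relu)
    gates = dense(hidden, w_expand, sigmoid)
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates
```

Each gate lies in (0, 1), so informative channels are passed through almost unchanged while others are attenuated.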
LGAug 8, 2024
FedAD-Bench: A Unified Benchmark for Federated Unsupervised Anomaly Detection in Tabular DataAhmed Anwar, Brian Moser, Dayananda Herurkar et al.
The emergence of federated learning (FL) presents a promising approach to leverage decentralized data while preserving privacy. Furthermore, the combination of FL and anomaly detection is particularly compelling because it allows for detecting rare and critical anomalies (usually also rare in locally gathered data) in sensitive data from multiple sources, such as cybersecurity and healthcare. However, benchmarking the performance of anomaly detection methods in FL environments remains an underexplored area. This paper introduces FedAD-Bench, a unified benchmark for evaluating unsupervised anomaly detection algorithms within the context of FL. We systematically analyze and compare the performance of recent deep learning anomaly detection models under federated settings, which were typically assessed solely in centralized settings. FedAD-Bench encompasses diverse datasets and metrics to provide a holistic evaluation. Through extensive experiments, we identify key challenges such as model aggregation inefficiencies and metric unreliability. We present insights into FL's regularization effects, revealing scenarios in which it outperforms centralized approaches due to its inherent ability to mitigate overfitting. Our work aims to establish a standardized benchmark to guide future research and development in federated anomaly detection, promoting reproducibility and fair comparison across studies.
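The model aggregation step that this abstract identifies as a challenge can be pictured with federated averaging (FedAvg), the common baseline rule. The abstract does not say which aggregation rule the benchmark uses, so treating it as FedAvg is an assumption, and `fed_avg` is a hypothetical helper name.

```python
def fed_avg(client_params, client_sizes):
    """Average client parameter vectors, weighting each client
    by the number of local samples it trained on."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[d] * size for params, size in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]
```

Because anomalies are rare in each client's local data, a client with few samples contributes little to the weighted average, which hints at why aggregation can behave differently for anomaly detection than for balanced classification.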
LGFeb 19
The Sound of Death: Deep Learning Reveals Vascular Damage from Carotid UltrasoundChristoph Balada, Aida Romano-Martinez, Payal Varshney et al.
Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, yet early risk detection is often limited by available diagnostics. Carotid ultrasound, a non-invasive and widely accessible modality, encodes rich structural and hemodynamic information that is largely untapped. Here, we present a machine learning (ML) framework that extracts clinically meaningful representations of vascular damage (VD) from carotid ultrasound videos, using hypertension as a weak proxy label. The model learns robust features that are biologically plausible, interpretable, and strongly associated with established cardiovascular risk factors, comorbidities, and laboratory measures. High VD stratifies individuals for myocardial infarction, cardiac death, and all-cause mortality, matching or outperforming conventional risk models such as SCORE2. Explainable AI analyses reveal that the model relies on vessel morphology and perivascular tissue characteristics, uncovering novel functional and anatomical signatures of vascular damage. This work demonstrates that routine carotid ultrasound contains far more prognostic information than previously recognized. Our approach provides a scalable, non-invasive, and cost-effective tool for population-wide cardiovascular risk assessment, enabling earlier and more personalized prevention strategies without reliance on laboratory tests or complex clinical inputs.
LGFeb 25
DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting DiffusionMarcel Lamott, Saifullah Saifullah, Nauman Riaz et al.
Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average 87% of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.
HCApr 20
Empowering Vocabulary Learning Through Teaching AI: Using LLMs as a Student to Perform Learning by Teaching in Vocabulary AcquisitionTokio Uchida, Ko Watanabe, Andrew Vargo et al.
"Learning by Teaching (LbT)" helps learners deepen their understanding by explaining concepts to others, with questions playing a vital role in identifying knowledge gaps and reinforcing comprehension. However, existing systems for generating such questions often rely on rigid templates and are expensive to build. To overcome these limitations, we developed a system using Large Language Models (LLMs) to create dynamic, contextually relevant questions for LbT. In our English vocabulary learning study, we examined which learner characteristics best leverage the system's benefits. Our results showed improved memory retention over traditional methods at three and seven days of testing, with ten participants. Additionally, we identified traits linked to better learning outcomes, highlighting the potential for tailored approaches. These findings support the development of scalable, cost-effective solutions to enhance LbT methods across various fields.
CVNov 21, 2023
Q-Seg: Quantum Annealing-Based Unsupervised Image SegmentationSupreeth Mysore Venkatesh, Antonio Macaluso, Marlon Nuske et al.
We present Q-Seg, a novel unsupervised image segmentation method based on quantum annealing, tailored for existing quantum hardware. We formulate the pixel-wise segmentation problem, which assimilates spectral and spatial information of the image, as a graph-cut optimization task. Our method efficiently leverages the interconnected qubit topology of the D-Wave Advantage device, offering superior scalability over existing quantum approaches and outperforming several tested state-of-the-art classical methods. Empirical evaluations on synthetic datasets have shown that Q-Seg has better runtime performance than the state-of-the-art classical optimizer Gurobi. The method has also been tested on earth observation image segmentation, a critical area with noisy and unreliable annotations. In the era of noisy intermediate-scale quantum, Q-Seg emerges as a reliable contender for real-world applications in comparison to advanced techniques like Segment Anything. Consequently, Q-Seg offers a promising solution using available quantum hardware, especially in situations constrained by limited labeled data and the need for efficient computational runtime.
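The graph-cut formulation in this abstract can be made concrete with a toy example. The sketch assumes signed edge weights (positive for similar pixels, negative for dissimilar ones) and replaces the quantum annealer with exhaustive search, which only works for a handful of nodes; the function names are hypothetical.

```python
from itertools import product


def cut_cost(labels, edges):
    """Quadratic binary objective of a cut: an edge (i, j, w) contributes w
    when its endpoints get different labels, since for binary values
    x_i + x_j - 2 * x_i * x_j equals 1 exactly when x_i != x_j."""
    return sum(w * (labels[i] + labels[j] - 2 * labels[i] * labels[j])
               for i, j, w in edges)


def brute_force_min_cut(num_nodes, edges):
    """Classical stand-in for the annealer: try every binary labeling."""
    best = min(product((0, 1), repeat=num_nodes),
               key=lambda labels: cut_cost(labels, edges))
    return best, cut_cost(best, edges)
```

For four pixels in a row with `edges = [(0, 1, 1.0), (1, 2, -1.0), (2, 3, 1.0)]`, the cheapest labeling separates pixels 0 and 1 from pixels 2 and 3, cutting only the dissimilar edge. The objective is quadratic in binary variables, which is the form annealing hardware accepts.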
AISep 6, 2024
Using Large Language Models to Generate Authentic Multi-agent Knowledge Work DatasetsDesiree Heim, Christian Jilek, Adrian Ulges et al.
Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and optimizations of knowledge work assistance systems. Due to the considerable resources needed to collect such data in real-life settings and the necessity of data censorship, collecting such a dataset appears nearly impossible. For this reason, we propose a configurable, multi-agent knowledge work dataset generator. This system simulates collaborative knowledge work among agents producing Large Language Model-generated documents and accompanying data traces. Additionally, the generator captures all background information, given in its configuration or created during the simulation process, in a knowledge graph. Finally, the resulting dataset can be utilized and shared without privacy or confidentiality concerns. This paper introduces our approach's design and vision and focuses on generating authentic knowledge work documents using Large Language Models. Our study involving human raters who assessed 53% of the generated and 74% of the real documents as realistic demonstrates the potential of our approach. Furthermore, we analyze the authenticity criteria mentioned in the participants' comments and elaborate on potential improvements for identified common issues.
CVDec 20, 2022
Common Practices and Taxonomy in Deep Multi-view Fusion for Remote Sensing ApplicationsFrancisco Mena, Diego Arenas, Marlon Nuske et al.
The advances in remote sensing technologies have boosted applications for Earth observation. These technologies provide multiple observations or views with different levels of information. They might contain static or temporary views with different levels of resolution, in addition to having different types and amounts of noise due to sensor calibration or deterioration. A great variety of deep learning models have been applied to fuse the information from these multiple views, known as deep multi-view or multi-modal fusion learning. However, the approaches in the literature vary greatly since different terminology is used to refer to similar concepts or different illustrations are given to similar techniques. This article gathers works on multi-view fusion for Earth observation by focusing on the common practices and approaches used in the literature. We summarize and structure insights from several different publications concentrating on unifying points and ideas. In this manuscript, we provide a harmonized terminology while at the same time mentioning the various alternative terms that are used in literature. The topics covered by the works reviewed focus on supervised learning with the use of neural network models. We hope this review, with a long list of recent references, can support future research and lead to a unified advance in the area.
CVMar 16
Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI ApproachHasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer et al.
Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model performance. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1). Notably, the model attained near-perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.
CVMay 21
TextTeacher: What Can Language Teach About Images?Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser et al.
The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher
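An auxiliary objective of the kind described in this abstract can be sketched as a classification loss plus an alignment term between projected image features and a frozen caption embedding. The cosine form of the alignment term, the direction of the projection, and the `aux_weight` parameter are assumptions; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def project(features, weights):
    """Lightweight linear projection; weights is a list of rows."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]


def training_loss(logits, target, image_features, text_embedding,
                  projection, aux_weight=0.5):
    """Classification loss plus an assumed cosine alignment term that pulls
    projected image features toward the frozen text embedding."""
    alignment = 1.0 - cosine(project(image_features, projection), text_embedding)
    return cross_entropy(logits, target) + aux_weight * alignment
```

Only `training_loss` depends on the text embedding, so dropping the auxiliary term at inference leaves the vision model unchanged, consistent with the abstract's claim of no inference-time overhead.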
CVMay 21
SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor SegmentationHasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer et al.
Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder-decoder network introducing a novel SegAttentionGate module that explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour) via a lightweight auxiliary loss, adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931), respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR, a 10-model ensemble, within 2-4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.
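The reported mean Dice values are consistent with a plain average of the three per-region scores given in this abstract, as the short check below shows. The `dice` helper uses the standard definition on flattened binary masks and is included only for context; the per-region numbers are taken from the abstract.

```python
def dice(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0


def mean_dice(scores):
    """Unweighted mean over the per-region Dice scores."""
    return sum(scores.values()) / len(scores)


# Per-region scores as stated in the abstract.
brats2021 = {"ET": 0.873, "TC": 0.906, "WT": 0.935}
brats2023 = {"ET": 0.859, "TC": 0.902, "WT": 0.931}

print(round(mean_dice(brats2021), 3))  # 0.905
print(round(mean_dice(brats2023), 3))  # 0.897
```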
GNMar 25
A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing DataYuichiro Iwashita, Ahtisham Fazeel Abbasi, Muhammad Nabeel Asim et al.
Single-cell RNA sequencing (scRNA-seq) is inherently affected by sparsity caused by dropout events, in which expressed genes are recorded as zeros due to technical limitations. These artifacts distort gene expression distributions and can compromise downstream analyses. Numerous imputation methods have been proposed to address this, and these methods encompass a wide range of approaches from traditional statistical models to recently developed deep learning (DL)-based methods. However, their comparative performance remains unclear, as existing benchmarking studies typically evaluate only a limited subset of methods, datasets, and downstream analytical tasks. Here, we present a comprehensive benchmark of 15 scRNA-seq imputation methods spanning 7 methodological categories, including traditional and modern DL-based methods. These methods are evaluated across 30 datasets sourced from 10 experimental protocols and assessed in terms of 6 downstream analytical tasks. Our results show that traditional imputation methods, such as model-based, smoothing-based, and low-rank matrix-based methods, generally outperform DL-based methods, such as diffusion-based, GAN-based, GNN-based, and autoencoder-based methods. In addition, strong performance in numerical gene expression recovery does not necessarily translate into improved biological interpretability in downstream analyses. Furthermore, the performance of imputation methods varies substantially across datasets, protocols, and downstream analytical tasks, and no single method consistently outperforms others across all evaluation scenarios. Together, our results provide practical guidance for selecting imputation methods tailored to specific analytical objectives and highlight the importance of task-specific evaluation when assessing imputation performance in scRNA-seq data analysis.
CV · Mar 6, 2024 · Code
Unlocking Dataset Distillation with Diffusion Models
Brian B. Moser, Federico Raue, Sebastian Palacio et al.
Dataset distillation seeks to condense datasets into smaller but highly representative synthetic samples. While diffusion models now lead all generative benchmarks, current distillation methods avoid them and rely instead on GANs or autoencoders, or, at best, sampling from a fixed diffusion prior. This trend arises because naive backpropagation through the long denoising chain leads to vanishing gradients, which prevents effective synthetic sample optimization. To address this limitation, we introduce Latent Dataset Distillation with Diffusion Models (LD3M), the first method to learn gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model. A linearly decaying skip connection, injected from the initial noisy state into every reverse step, preserves the gradient signal across dozens of timesteps without requiring diffusion weight fine-tuning. Across multiple ImageNet subsets at 128x128 and 256x256, LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over the prior state-of-the-art. The code for LD3M is provided at https://github.com/Brian-Moser/prune_and_distill.
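The linearly decaying skip connection can be illustrated with a toy sketch. It assumes a convex blend in which the weight on the initial noisy state falls linearly with the remaining timestep. The names `skip_weight`, `reverse_with_skip`, and `toy_denoise`, as well as the blend formula itself, are illustrative assumptions and not LD3M's actual implementation.

```python
from typing import Callable, List

Vector = List[float]


def skip_weight(t: int, num_steps: int) -> float:
    """Weight on the initial noisy state at step t (assumed linear schedule t / T)."""
    return t / num_steps


def reverse_with_skip(
    z_init: Vector,
    denoise_step: Callable[[Vector, int], Vector],
    num_steps: int,
) -> Vector:
    """Run a toy reverse process, re-injecting z_init after every step."""
    z = list(z_init)
    for t in range(num_steps, 0, -1):
        # Hypothetical stand-in for one reverse step of a frozen diffusion model.
        z = denoise_step(z, t)
        # The skip weight decays linearly and reaches 0 at the final step.
        w = skip_weight(t - 1, num_steps)
        z = [w * zi + (1.0 - w) * zc for zi, zc in zip(z_init, z)]
    return z


def toy_denoise(z: Vector, t: int) -> Vector:
    """Placeholder denoiser: halves every component (assumption for the demo)."""
    return [0.5 * v for v in z]


result = reverse_with_skip([1.0, -2.0], toy_denoise, num_steps=4)
print(result)
```

Because `z_init` enters every step directly, each step's output depends on the initial state through a short path, which is the intuition behind keeping a usable gradient over a long chain.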
LG · Jul 22, 2024
Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation
Francisco Mena, Diego Arenas, Andreas Dengel
Multi-sensor machine learning (ML) models for Earth observation (EO) aim to enhance prediction accuracy by integrating data from various sources. However, missing data poses a significant challenge, particularly for non-persistent sensors that can be affected by external factors. Existing literature has explored strategies such as temporal dropout and sensor-invariant models to improve generalization to missing data. Inspired by these works, we study two novel methods tailored for multi-sensor scenarios, namely Input Sensor Dropout (ISensD) and Ensemble Sensor Invariant (ESensI). Through experimentation on three multi-sensor temporal EO datasets, we demonstrate that these methods effectively increase the robustness of model predictions to missing sensors. In particular, we focus on how the predictive performance of models drops when sensors are missing at different levels. We observe that ensemble multi-sensor models are the most robust to the lack of sensors. In addition, the sensor dropout component in ISensD shows promising robustness results.
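The idea of dropping whole sensors at the input during training can be sketched as follows. This is a minimal illustration, not the ISensD implementation: the function name `input_sensor_dropout`, the use of a constant fill value for masked sensors, and the rule of always keeping at least one sensor are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional

SensorInputs = Dict[str, List[float]]


def input_sensor_dropout(
    inputs: SensorInputs,
    drop_prob: float,
    rng: Optional[random.Random] = None,
    fill_value: float = 0.0,
) -> SensorInputs:
    """Randomly mask whole sensors, always keeping at least one (assumed rule)."""
    rng = rng or random.Random()
    names = list(inputs)
    dropped = [name for name in names if rng.random() < drop_prob]
    if len(dropped) == len(names):
        # Keep one sensor so the model never sees an entirely masked input.
        dropped.remove(rng.choice(dropped))
    return {
        name: [fill_value] * len(values) if name in dropped else list(values)
        for name, values in inputs.items()
    }


# Hypothetical sample with three sensors of different lengths.
sample = {"optical": [0.2, 0.4], "radar": [1.0, 3.0], "weather": [5.0]}
masked = input_sensor_dropout(sample, drop_prob=0.5, rng=random.Random(0))
print(masked)
```

Applying such masking only at training time exposes the model to the missing-sensor patterns it may later face at inference.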
LG · Aug 21, 2024
Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features
Hiba Najjar, Marlon Nuske, Andreas Dengel
The availability of temporal geospatial data in multiple modalities has been extensively leveraged to enhance the performance of machine learning models. While efforts on the design of adequate model architectures are approaching a level of saturation, focusing on a data-centric perspective can complement these efforts to achieve further enhancements in data usage efficiency and model generalization capacities. This work contributes to this direction. We leverage model explanation methods to identify the features crucial for the model to reach optimal performance and the smallest set of features sufficient to achieve this performance. We evaluate our approach on three temporal multimodal geospatial datasets and compare multiple model explanation techniques. Our results reveal that some datasets can reach their optimal accuracy with fewer than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.
CV · Aug 19, 2024
Webcam-based Pupil Diameter Prediction Benefits from Upscaling
Vijul Shah, Brian B. Moser, Ko Watanabe et al.
Capturing pupil diameter is essential for assessing psychological and physiological states such as stress levels and cognitive load. However, the low resolution of images in eye datasets often hampers precise measurement. This study evaluates the impact of various upscaling methods, ranging from bicubic interpolation to advanced super-resolution, on pupil diameter predictions. We compare several pre-trained methods, including CodeFormer, GFPGAN, Real-ESRGAN, HAT, and SRResNet. Our findings suggest that pupil diameter prediction models trained on upscaled datasets are highly sensitive to the selected upscaling method and scale. Our results demonstrate that upscaling methods consistently enhance the accuracy of pupil diameter prediction models, highlighting the importance of upscaling in pupillometry. Overall, our work provides valuable insights for selecting upscaling techniques, paving the way for more accurate assessments in psychological and physiological research.
CV · Jul 11, 2024
XAI-Guided Enhancement of Vegetation Indices for Crop Mapping
Hiba Najjar, Francisco Mena, Marlon Nuske et al.
Vegetation indices allow efficient monitoring of vegetation growth and agricultural activities. Previous generations of satellites captured a limited number of spectral bands, and a few expert-designed vegetation indices were sufficient to harness their potential. New generations of multi- and hyperspectral satellites capture additional bands, but these are not yet efficiently exploited. In this work, we propose an explainable-AI-based method to select and design suitable vegetation indices. We first train a deep neural network using multispectral satellite data, then extract feature importance to identify the most influential bands. We subsequently select suitable existing vegetation indices or modify them to incorporate the identified bands and retrain our model. We validate our approach on a crop classification task. Our results indicate that models trained on individual indices achieve comparable results to the baseline model trained on all bands, while the combination of two indices surpasses the baseline in certain cases.
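As a concrete illustration of what a vegetation index is and how one can be modified to use a different band, the sketch below computes two standard normalized-difference indices: NDVI, defined as (NIR - Red) / (NIR + Red), and NDRE, which substitutes the red-edge band for red. The reflectance values and the choice of red edge as the "identified" band are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def normalized_difference(band_a: List[float], band_b: List[float]) -> List[float]:
    """Per-pixel (a - b) / (a + b), returning 0.0 where the denominator is zero."""
    out = []
    for a, b in zip(band_a, band_b):
        denom = a + b
        out.append((a - b) / denom if denom != 0 else 0.0)
    return out


# Hypothetical surface reflectances for two pixels.
pixels: Dict[str, List[float]] = {
    "nir": [0.5, 0.3],
    "red": [0.1, 0.3],
    "red_edge": [0.3, 0.3],
}

# Classic NDVI from the near-infrared and red bands.
ndvi = normalized_difference(pixels["nir"], pixels["red"])

# Same index form with red swapped for red edge (NDRE); which band to swap in
# would come from the feature-importance step, assumed here for illustration.
ndre = normalized_difference(pixels["nir"], pixels["red_edge"])

print(ndvi, ndre)
```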