LGMay 22
When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by SplittingDiede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich
Dimensionality Reduction (DR) methods are widely used to visualize high-dimensional data. One key task in DR-based analysis is discovering neighborhoods, which relies on analyzing the fine-grained local structure of a projection. However, DR is an inherently lossy process; no technique can perfectly preserve the high-dimensional relationships, and projections therefore contain visual artifacts. In this paper, we highlight a typically overlooked source of visual artifacts: ambiguous instances. These are instances that are highly similar to multiple mutually dissimilar neighborhoods in the high-dimensional space. Standard DR methods cannot faithfully project such instances, since each data instance is mapped to a single point in the visual space. As a result, such an instance is placed in only one of its neighborhoods (or in none at all), so only part of its neighborhood structure is represented. We call this distortion partial neighborhood embedding. In this paper, we introduce a graph-based approach that identifies ambiguous instances and replicates them as multiple points in the projection, placing each copy within its respective neighborhood. We use UMAP for our results, but our approach also generalizes to other local graph-based DR techniques, and we show that our approach reveals previously hidden neighborhood memberships in projections and reduces partial neighborhood embedding across multiple examples, and is further supported by quantitative analyses.
LGSep 4, 2025
Why Can't I See My Clusters? A Precision-Recall Approach to Dimensionality Reduction ValidationDiede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich
Dimensionality Reduction (DR) is widely used for visualizing high-dimensional data, often with the goal of revealing expected cluster structure. However, such a structure may not always appear in the projections. Existing DR quality metrics assess projection reliability (to some extent) or cluster structure quality, but do not explain why expected structures are missing. Visual Analytics solutions can help, but are often time-consuming due to the large hyperparameter space. This paper addresses this problem by leveraging a recent framework that divides the DR process into two phases: a relationship phase, where similarity relationships are modeled, and a mapping phase, where the data is projected accordingly. We introduce two supervised metrics, precision and recall, to evaluate the relationship phase. These metrics quantify how well the modeled relationships align with an expected cluster structure based on some set of labels representing this structure. We illustrate their application using t-SNE and UMAP, and validate the approach through various usage scenarios. Our approach can guide hyperparameter tuning, uncover projection artifacts, and determine if the expected structure is captured in the relationships, making the DR process faster and more reliable.
LGMar 29, 2022
HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling TechniquesAngelos Chatzimparmpas, Fernando V. Paulovich, Andreas Kerren
Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy and difficult to classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance in a test set separately. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also look at how useful our system is based on feedback we received from ML experts.
LGJun 21, 2021
Multivariate Data Explanation by Jumping Emerging Patterns VisualizationMário Popolin Neto, Fernando V. Paulovich
Visual Analytics (VA) tools and techniques have been instrumental in supporting users to build better classification models, interpret models' overall logic, and audit results. In a different direction, VA has recently been applied to transform classification models into descriptive mechanisms instead of predictive. The idea is to use such models as surrogates for data patterns, visualizing the model to understand the phenomenon represented by the data. Although very useful and inspiring, the few proposed approaches have opted to use low complex classification models to promote straightforward interpretation, presenting limitations to capture intricate data patterns. In this paper, we present VAX (multiVariate dAta eXplanation), a new VA method to support the identification and visual interpretation of patterns in multivariate datasets. Unlike the existing similar approaches, VAX uses the concept of Jumping Emerging Patterns to identify and aggregate several diversified patterns, producing explanations through logic combinations of data variables. The potential of VAX to interpret complex multivariate datasets is demonstrated through use cases employing two real-world datasets covering different scenarios.
LGJun 14, 2021
HUMAP: Hierarchical Uniform Manifold Approximation and ProjectionWilson E. Marcílio-Jr, Danilo M. Eler, Fernando V. Paulovich et al.
Dimensionality reduction (DR) techniques help analysts to understand patterns in high-dimensional spaces. These techniques, often represented by scatter plots, are employed in diverse science domains and facilitate similarity analysis among clusters and data samples. For datasets containing many granularities or when analysis follows the information visualization mantra, hierarchical DR techniques are the most suitable approach since they present major structures beforehand and details on demand. This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to be flexible on preserving local and global structures and preserve the mental map throughout hierarchical exploration. We provide empirical evidence of our technique's superiority compared with current hierarchical approaches and show a case study applying HUMAP for dataset labelling.
LGMar 18, 2021
Explainable Adversarial Attacks in Deep Neural Networks Using Activation ProfilesGabriel D. Cantareira, Rodrigo F. Mello, Fernando V. Paulovich
As neural networks become the tool of choice to solve an increasing variety of problems in our society, adversarial attacks become critical. The possibility of generating data instances deliberately designed to fool a network's analysis can have disastrous consequences. Recent work has shown that commonly used methods for model training often result in fragile abstract representations that are particularly vulnerable to such attacks. This paper presents a visual framework to investigate neural network models subjected to adversarial examples, revealing how models' perception of the adversarial data differs from regular data instances and their relationships with class perception. Through different use cases, we show how observing these elements can quickly pinpoint exploited areas in a model, allowing further study of vulnerable features in input data and serving as a guide to improving model training and architecture.
HCJan 19, 2021
Q4EDA: A Novel Strategy for Textual Information Retrieval Based on User Interactions with Visual Representations of Time SeriesLeonardo Christino, Martha D. Ferreira, Fernando V. Paulovich
Knowing how to construct text-based Search Queries (SQs) for use in Search Engines (SEs) such as Google or Wikipedia has become a fundamental skill. Though much data are available through such SEs, most structured datasets live outside their scope. Visualization tools aid in this limitation, but no such tools come close to the sheer amount of information available through general-purpose SEs. To fill this gap, this paper presents Q4EDA, a novel framework that converts users' visual selection queries executed on top of time series visual representations, providing valid and stable SQs to be used in general-purpose SEs and suggestions of related information. The usefulness of Q4EDA is presented and validated by users through an application linking a Gapminder's line-chart replica with a SE populated with Wikipedia documents, showing how Q4EDA supports and enhances exploratory analysis of United Nations world indicators. Despite some limitations, Q4EDA is unique in its proposal and represents a real advance towards providing solutions for querying textual information based on user interactions with visual representations.
LGMay 8, 2020
Explainable Matrix -- Visualization for Global and Local Interpretability of Random Forest Classification EnsemblesMário Popolin Neto, Fernando V. Paulovich
Over the past decades, classification models have proven to be essential machine learning tools given their potential and applicability in various domains. In these years, the north of the majority of the researchers had been to improve quantitative metrics, notwithstanding the lack of information about models' decisions such metrics convey. This paradigm has recently shifted, and strategies beyond tables and numbers to assist in interpreting models' decisions are increasing in importance. Part of this trend, visualization techniques have been extensively used to support classification models' interpretability, with a significant focus on rule-based models. Despite the advances, the existing approaches present limitations in terms of visual scalability, and the visualization of large and complex models, such as the ones produced by the Random Forest (RF) technique, remains a challenge. In this paper, we propose Explainable Matrix (ExMatrix), a novel visualization method for RF interpretability that can handle models with massive quantities of rules. It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rules predicates, enabling the analysis of entire models and auditing classification results. ExMatrix applicability is confirmed via different examples, showing how it can be used in practice to promote RF models interpretability.
SPMar 8, 2020
Xtreaming: an incremental multidimensional projection technique and its application to streaming dataTácito T. A. T. Neves, Rafael M. Martins, Danilo B. Coimbra et al.
Streaming data applications are becoming more common due to the ability of different information sources to continuously capture or produce data, such as sensors and social media. Despite recent advances, most visualization approaches, in particular, multidimensional projection or dimensionality reduction techniques, cannot be directly applied in such scenarios due to the transient nature of streaming data. Currently, only a few methods address this limitation using online or incremental strategies, continuously processing data, and updating the visualization. Despite their relative success, most of them impose the need for storing and accessing the data multiple times, not being appropriate for streaming where data continuously grow. Others do not impose such requirements but are not capable of updating the position of the data already projected, potentially resulting in visual artifacts. In this paper, we present Xtreaming, a novel incremental projection technique that continuously updates the visual representation to reflect new emerging structures or patterns without visiting the multidimensional data more than once. Our tests show that Xtreaming is competitive in terms of global distance preservation if compared to other streaming and incremental techniques, but it is orders of magnitude faster. To the best of our knowledge, it is the first methodology that is capable of evolving a projection to faithfully represent new emerging structures without the need to store all data, providing reliable results for efficiently and effectively projecting streaming data.
CVMar 8, 2019
A Grid-based Method for Removing Overlaps of Dimensionality Reduction Scatterplot LayoutsGladys M. Hilasaca, Wilson E. Marcílio-Jr, Danilo M. Eler et al.
Dimensionality Reduction (DR) scatterplot layouts have become a ubiquitous visualization tool for analyzing multidimensional datasets. Despite their popularity, such scatterplots suffer from occlusion, especially when informative glyphs are used to represent data instances, potentially obfuscating critical information for the analysis under execution. Different strategies have been devised to address this issue, either producing overlap-free layouts that lack the powerful capabilities of contemporary DR techniques in uncovering interesting data patterns or eliminating overlaps as a post-processing strategy. Despite the good results of post-processing techniques, most of the best methods typically expand or distort the scatterplot area, thus reducing glyphs' size (sometimes) to unreadable dimensions, defeating the purpose of removing overlaps. This paper presents Distance Grid (DGrid), a novel post-processing strategy to remove overlaps from DR layouts that faithfully preserves the original layout's characteristics and bounds the minimum glyph sizes. We show that DGrid surpasses the state-of-the-art in overlap removal (through an extensive comparative evaluation considering multiple different metrics) while also being one of the fastest techniques, especially for large datasets. A user study with 51 participants also shows that DGrid is consistently ranked among the top techniques for preserving the original scatterplots' visual characteristics and the aesthetics of the final results.