CL · Mar 6
Navigating the Concept Space of Language Models
Wilson E. Marcílio-Jr, Danilo M. Eler
Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that can be mapped to human-interpretable concepts. The current practice for analyzing these features relies primarily on inspecting top-activating examples, manually browsing individual features, or performing semantic search for concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.
LG · Jun 14, 2021
HUMAP: Hierarchical Uniform Manifold Approximation and Projection
Wilson E. Marcílio-Jr, Danilo M. Eler, Fernando V. Paulovich et al.
Dimensionality reduction (DR) techniques help analysts understand patterns in high-dimensional spaces. These techniques, often represented by scatter plots, are employed in diverse scientific domains and facilitate similarity analysis among clusters and data samples. For datasets containing many granularities, or when analysis follows the information visualization mantra, hierarchical DR techniques are the most suitable approach since they present major structures first and details on demand. This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to flexibly preserve local and global structures and to preserve the mental map throughout hierarchical exploration. We provide empirical evidence of our technique's superiority compared with current hierarchical approaches and show a case study applying HUMAP to dataset labelling.
HC · Jan 26, 2021
Contrastive analysis for scatterplot-based representations of dimensionality reduction
Wilson E. Marcílio-Jr, Danilo M. Eler, Rogério E. Garcia
Cluster interpretation after dimensionality reduction (DR) is a ubiquitous part of exploring multidimensional datasets. DR results are frequently represented by scatterplots, where spatial proximity encodes similarity among data samples. Existing techniques support the understanding of a scatterplot's organization by visualizing the importance of the features for cluster definition through layout enrichment strategies. However, current approaches usually focus on global information, which hampers analysis whenever the goal is to understand the differences among clusters. Thus, this paper introduces a methodology to visually explore DR results and interpret cluster formation based on contrastive analysis. We also introduce a bipartite graph to visually interpret and explore the relationship between the statistical variables employed to understand how the data features influence cluster formation. Our approach is demonstrated through case studies, in which we explore two document collections related to news articles and tweets about COVID-19 symptoms. Finally, we evaluate our approach through quantitative results to demonstrate its robustness in supporting multidimensional analysis.
LG · Jan 26, 2021
Model-agnostic interpretation by visualization of feature perturbations
Wilson E. Marcílio-Jr, Danilo M. Eler, Fabrício Breve
Interpretation of machine learning models has become one of the most important research topics due to the necessity of maintaining control and avoiding bias in these algorithms. Since many machine learning algorithms are published every day, there is a need for novel model-agnostic interpretation approaches that could be used to interpret a great variety of algorithms. Thus, one advantageous way to interpret machine learning models is to feed different input data to understand the changes in the prediction. Using such an approach, practitioners can define relations among data patterns and a model's decision. This work proposes a model-agnostic interpretation approach that uses visualization of feature perturbations induced by the PSO algorithm. We validate our approach on publicly available datasets, showing the capability to enhance the interpretation of different classifiers while yielding very stable results compared with state-of-the-art algorithms.
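The idea of feeding different input data to a model and observing the change in its prediction can be illustrated with a minimal sketch. This is not the paper's method: the abstract states that perturbations are induced by the PSO algorithm, whereas the sketch below simply shifts one feature at a time by a fixed amount. The names `perturbation_effects` and `toy_model`, the fixed `delta`, and the toy coefficients are all assumptions made for illustration.

```python
def perturbation_effects(predict, x, delta=1.0):
    """Change in the prediction when each feature is shifted by `delta` in turn.

    `predict` is any callable taking a list of feature values, so the model
    is treated as a black box (model-agnostic).
    """
    base = predict(x)
    effects = []
    for i in range(len(x)):
        perturbed = list(x)      # copy so the original instance is untouched
        perturbed[i] += delta    # perturb a single feature
        effects.append(predict(perturbed) - base)
    return effects


def toy_model(x):
    # Hypothetical stand-in for a trained model: a fixed linear score
    # in which the third feature has no influence.
    return 2.0 * x[0] - 0.5 * x[1] + 0.0 * x[2]


effects = perturbation_effects(toy_model, [1.0, 2.0, 3.0])
```

In this toy setting, a larger absolute effect suggests the prediction is more sensitive to that feature around the given instance; a real analysis would explore many perturbations rather than a single fixed step.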
HC · Jun 25, 2020
Visual analytics of COVID-19 dissemination in São Paulo state, Brazil
Wilson E. Marcílio-Jr, Danilo M. Eler, Rogério E. Garcia et al.
Visual analytics techniques are useful tools to support decision-making and cope with increasing data, which is particularly important when monitoring natural or artificial phenomena. When monitoring disease progression, visual analytics approaches help decision-makers understand or even prevent dissemination paths. In this paper, we propose a new visual analytics tool for monitoring COVID-19 dissemination. We use the k-nearest neighbors of a city to approximate its neighboring cities and analyze COVID-19 dissemination by comparing the city under consideration with its neighborhood. Moreover, the analysis is performed over time periods, which facilitates the assessment of isolation policies. We validate our tool by analyzing the progression of COVID-19 in neighboring cities of São Paulo state, Brazil.
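The comparison of a city with its k-nearest neighbors can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the distance measure or the comparison statistic, so the sketch assumes straight-line distance on planar coordinates and a difference from the neighborhood mean. The city names, coordinates, case counts, and function names are all hypothetical.

```python
import math


def k_nearest_cities(coords, target, k):
    """Names of the k cities closest to `target` (Euclidean distance, assumed)."""
    tx, ty = coords[target]
    distances = [
        (math.hypot(x - tx, y - ty), name)
        for name, (x, y) in coords.items()
        if name != target
    ]
    return [name for _, name in sorted(distances)[:k]]


def compare_to_neighborhood(cases, coords, target, k):
    """Target city's case count minus the mean count of its k nearest cities."""
    neighbors = k_nearest_cities(coords, target, k)
    neighborhood_mean = sum(cases[n] for n in neighbors) / len(neighbors)
    return cases[target] - neighborhood_mean


# Hypothetical toy data for one time period (not taken from the paper).
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (5.0, 5.0)}
cases = {"A": 10, "B": 4, "C": 6, "D": 100}

neighbors_of_a = k_nearest_cities(coords, "A", 2)
difference_a = compare_to_neighborhood(cases, coords, "A", 2)
```

Running the same comparison separately for each time period would give one value per period, which is one way to look at how a city's situation relative to its neighborhood changes over time.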
CV · Mar 8, 2019
A Grid-based Method for Removing Overlaps of Dimensionality Reduction Scatterplot Layouts
Gladys M. Hilasaca, Wilson E. Marcílio-Jr, Danilo M. Eler et al.
Dimensionality Reduction (DR) scatterplot layouts have become a ubiquitous visualization tool for analyzing multidimensional datasets. Despite their popularity, such scatterplots suffer from occlusion, especially when informative glyphs are used to represent data instances, potentially obscuring information that is critical for the analysis at hand. Different strategies have been devised to address this issue, either producing overlap-free layouts that lack the powerful capabilities of contemporary DR techniques in uncovering interesting data patterns or eliminating overlaps as a post-processing strategy. Despite the good results of post-processing techniques, most of the best methods expand or distort the scatterplot area, sometimes reducing glyph sizes to unreadable dimensions and defeating the purpose of removing overlaps. This paper presents Distance Grid (DGrid), a novel post-processing strategy to remove overlaps from DR layouts that faithfully preserves the original layout's characteristics and bounds the minimum glyph sizes. We show that DGrid surpasses the state of the art in overlap removal (through an extensive comparative evaluation considering multiple metrics) while also being one of the fastest techniques, especially for large datasets. A user study with 51 participants also shows that DGrid is consistently ranked among the top techniques for preserving the original scatterplots' visual characteristics and the aesthetics of the final results.