CL Jul 25, 2024
A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations
Daniel Atzberger, Tim Cech, Willy Scheibel et al.
The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. The resulting layout therefore depends on the input data and the hyperparameters of the dimensionality reduction and is affected by changes in them. However, such changes to the layout require additional cognitive effort from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameters, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combinations of three text corpora, six text embeddings, and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics, concerning local and global structures and class separation. Second, we analyzed the resulting 42817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlighted specific hyperparameter settings. We provide our implementation as a Git repository at https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study and results as a Zenodo archive at https://doi.org/10.5281/zenodo.12772898.
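To make the idea of quantifying layout similarity concrete, the following is a minimal sketch of one possible local-structure measure: the mean Jaccard overlap of k-nearest-neighbor sets between two layouts of the same documents. It is an illustrative assumption, not necessarily one of the ten metrics used in the study; the function names `knn_sets` and `knn_jaccard` and the toy coordinates are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank all other points by distance to p (ties broken by index).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: squared_distance(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def knn_jaccard(layout_a, layout_b, k=2):
    """Mean Jaccard overlap of k-NN sets; 1.0 means identical neighborhoods."""
    if len(layout_a) != len(layout_b):
        raise ValueError("both layouts must contain the same documents")
    sets_a = knn_sets(layout_a, k)
    sets_b = knn_sets(layout_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)


# Hypothetical toy layouts of five documents (not data from the study).
layout_base = [(0, 0), (1, 0), (0, 2), (5, 5), (7, 5)]
# The same layout rotated by 90 degrees: neighborhoods are unchanged.
layout_rotated = [(-y, x) for x, y in layout_base]
# A layout in which the third document jumped to the other cluster.
layout_changed = [(0, 0), (1, 0), (6, 6), (5, 5), (7, 5)]

stable_score = knn_jaccard(layout_base, layout_rotated)
changed_score = knn_jaccard(layout_base, layout_changed)
```

A measure of this kind is invariant to rotations of the scatterplot, so it only penalizes changes in which documents appear near each other, which is the kind of change a user would have to re-learn.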
CL Jul 17, 2023
Large-Scale Evaluation of Topic Models and Dimensionality Reduction Methods for 2D Text Spatialization
Daniel Atzberger, Tim Cech, Willy Scheibel et al.
Topic models are a class of unsupervised learning algorithms for detecting the semantic structure within a text corpus. Together with a subsequent dimensionality reduction algorithm, topic models can be used for deriving spatializations for text corpora as two-dimensional scatter plots, reflecting semantic similarity between the documents and supporting corpus analysis. Although the choice of the topic model, the dimensionality reduction, and their underlying hyperparameters significantly impact the resulting layout, it is unknown which particular combinations result in high-quality layouts with respect to accuracy and perception metrics. To investigate the effectiveness of topic models and dimensionality reduction methods for the spatialization of corpora as two-dimensional scatter plots (or basis for landscape-type visualizations), we present a large-scale, benchmark-based computational evaluation. Our evaluation consists of (1) a set of corpora, (2) a set of layout algorithms that are combinations of topic models and dimensionality reductions, and (3) quality metrics for quantifying the resulting layout. The corpora are given as document-term matrices, and each document is assigned to a thematic class. The chosen metrics quantify the preservation of local and global properties and the perceptual effectiveness of the two-dimensional scatter plots. By evaluating the benchmark on a computing cluster, we derived a multivariate dataset with over 45 000 individual layouts and corresponding quality metrics. Based on the results, we propose guidelines for the effective design of text spatializations that are based on topic models and dimensionality reductions. As a main result, we show that interpretable topic models are beneficial for capturing the structure of text corpora. We furthermore recommend the use of t-SNE as a subsequent dimensionality reduction.
85.9 HC Apr 28
Visual Boosting Techniques for Spatiotemporal Dense Pixel Visualizations
Julius Rauscher, Frederik L. Dennig, Udo Schlegel et al.
The analysis of spatiotemporal data is essential in domains such as epidemiology and environmental monitoring, where understanding the interplay between spatially distributed phenomena and their temporal evolution is critical. Dense pixel visualizations offer a compact, effective overview of spatiotemporal dynamics. However, the necessary linearization of 2D geographic space into a 1D ordering inevitably introduces structural distortions that manifest as visual artifacts. We propose a measure-driven visual analytics approach that captures visual artifacts through neighborhood preservation measures for 1D orderings and renders them using visual boosting techniques such as glyphs, halos, and hatching. We demonstrate our approach through a usage scenario analyzing COVID-19 incidence data across German districts, showing that interactive, measure-driven boosting enables analysts to reliably distinguish genuine spatial patterns from linearization artifacts.
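The distortion introduced by linearization can be illustrated with a small sketch. It assumes a regular grid and a Z-order (Morton) curve as the 1D ordering, and it scores each cell by the fraction of its 4-connected grid neighbors that remain adjacent in the ordering. Both the choice of curve and the measure are assumptions made for illustration; the abstract does not specify which orderings or measures the authors use, and real district data is not a regular grid.

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of integer grid coordinates (Z-order curve)."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x occupies even bit positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y occupies odd bit positions
    return key


def linearize(cells):
    """Order 2D grid cells along the Z-order curve."""
    return sorted(cells, key=lambda cell: morton_key(*cell))


def neighborhood_preservation(ordering):
    """Per cell: share of 4-connected grid neighbors adjacent in the 1D order."""
    position = {cell: i for i, cell in enumerate(ordering)}
    scores = {}
    for (x, y), i in position.items():
        candidates = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        grid_neighbors = [n for n in candidates if n in position]
        if not grid_neighbors:
            scores[(x, y)] = 1.0  # an isolated cell cannot lose neighbors
            continue
        kept = sum(1 for n in grid_neighbors if abs(position[n] - i) <= 1)
        scores[(x, y)] = kept / len(grid_neighbors)
    return scores


# Hypothetical 4x4 grid standing in for spatial units.
cells = [(x, y) for x in range(4) for y in range(4)]
ordering = linearize(cells)
scores = neighborhood_preservation(ordering)
```

Cells with low scores are those whose spatial neighbors end up far apart in the 1D ordering; these are the positions where a dense pixel display is most likely to show artifacts and where boosting cues such as glyphs or hatching would be warranted.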
CE Sep 23, 2025
AlloyInter: Visualising Alloy Mixture Interpolations in t-SNE Representations
Benedikt Kantz, Peter Waldert, Stefan Lengauer et al.
This entry description proposes AlloyInter, a novel system that enables joint exploration of the input mixture space and the output parameter space in the context of the SciVis Contest 2025. We propose an interpolation approach, guided by eXplainable Artificial Intelligence (XAI) based on a learned model ensemble, that allows users to discover input mixture ratios by specifying output parameter goals, which can be iteratively adjusted and refined. We strengthen the capabilities of our system by building upon prior research on the robustness of XAI and by combining well-established techniques such as manifold learning with interpolation approaches.
HC Sep 15, 2020
dg2pix: Pixel-Based Visual Analysis of Dynamic Graphs
Eren Cakmak, Dominik Jäckle, Tobias Schreck et al.
Presenting long sequences of dynamic graphs remains challenging due to the underlying large-scale and high-dimensional data. We propose dg2pix, a novel pixel-based visualization technique, to visually explore temporal and structural properties in long sequences of large-scale graphs. The approach consists of three main steps: (1) the multiscale modeling of the temporal dimension; (2) unsupervised graph embeddings to learn low-dimensional representations of the dynamic graph data; and (3) an interactive pixel-based visualization to simultaneously explore the evolving data at different temporal aggregation scales. dg2pix provides a scalable overview of a dynamic graph, supports the exploration of long sequences of high-dimensional graph data, and enables the identification and comparison of similar temporal states. We show the applicability of the technique to synthetic and real-world datasets, demonstrating that temporal patterns in dynamic graphs can be identified and interpreted over time. dg2pix contributes a suitable intermediate representation between node-link diagrams at the high detail end and matrix representations on the low detail end.
HC Aug 19, 2020
Multiscale Snapshots: Visual Analysis of Temporal Summaries in Dynamic Graphs
Eren Cakmak, Udo Schlegel, Dominik Jäckle et al.
The overview-driven visual analysis of large-scale dynamic graphs poses a major challenge. We propose Multiscale Snapshots, a visual analytics approach to analyze temporal summaries of dynamic graphs at multiple temporal scales. First, we recursively generate temporal summaries to abstract overlapping sequences of graphs into compact snapshots. Second, we apply graph embeddings to the snapshots to learn low-dimensional representations of each sequence of graphs to speed up specific analytical tasks (e.g., similarity search). Third, we visualize the evolving data from coarse to fine-granular snapshots to semi-automatically analyze temporal states, trends, and outliers. The approach enables the discovery of similar temporal summaries (e.g., recurring states), reduces the temporal data to speed up automatic analysis, and supports the exploration of both structural and temporal properties of a dynamic graph. We demonstrate the usefulness of our approach through a quantitative evaluation and an application to a real-world dataset.
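The first step, recursively summarizing overlapping sequences of graphs, might be sketched as follows. The sketch assumes that each graph is a set of edges and that a summary is simply the union of the edges in a window; the abstract does not describe the actual summary function, so `summarize`, `multiscale_snapshots`, and the `window` and `stride` parameters are hypothetical.

```python
def summarize(graphs):
    """Summarize a sequence of graphs as the union of their edge sets.

    Assumption for illustration only; other summaries are possible.
    """
    return frozenset().union(*graphs)


def multiscale_snapshots(graphs, window=3, stride=2):
    """Build levels of snapshots; windows overlap because stride < window."""
    if not 0 < stride < window:
        raise ValueError("need 0 < stride < window for overlapping windows")
    levels = [list(graphs)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        starts = list(range(0, max(len(current) - window, 0) + 1, stride))
        # Add a final window if the regular stride leaves the tail uncovered.
        if starts[-1] + window < len(current):
            starts.append(len(current) - window)
        levels.append([summarize(current[s:s + window]) for s in starts])
    return levels


# Hypothetical dynamic graph with five time steps, one edge set per step.
timeline = [
    frozenset({("a", "b")}),
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("b", "c")}),
    frozenset({("c", "d")}),
    frozenset({("c", "d"), ("d", "e")}),
]
levels = multiscale_snapshots(timeline)
```

Each level is coarser than the one below it, and the top level holds a single snapshot covering the whole sequence. In the approach described above, an embedding would then be computed per snapshot so that similar temporal summaries can be found by comparing vectors.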
CR Jul 30, 2020
SMAP: A Joint Dimensionality Reduction Scheme for Secure Multi-Party Visualization
Jiazhi Xia, Tianxiang Chen, Lei Zhang et al.
As data becomes increasingly complex and distributed, data analyses often involve several related datasets that are stored on different servers and may be owned by different stakeholders. While there is an emerging need to provide these stakeholders with a full picture of their data in a global context, conventional visual analytics methods, such as dimensionality reduction, could compromise data privacy when multi-party datasets are fused at a single site to build point-level relationships. In this paper, we reformulate the conventional t-SNE method from the single-site mode into a secure distributed infrastructure. We present a secure multi-party scheme for joint t-SNE computation, which can minimize the risk of data leakage. Aggregated visualization can optionally be employed to hide point-level relationships. We build a prototype system based on our method, SMAP, to support the organization, computation, and exploration of secure joint embeddings. We demonstrate the effectiveness of our approach with three case studies, one of which is based on the deployment of our system in real-world applications.
HC Jul 30, 2020
ConceptExplorer: Visual Analysis of Concept Drifts in Multi-source Time-series Data
Xumeng Wang, Wei Chen, Jiazhi Xia et al.
Time-series data is widely studied in various scenarios, such as weather forecasting, stock markets, and customer behavior analysis. To comprehensively learn about dynamic environments, it is necessary to comprehend features from multiple data sources. This paper proposes a novel visual analysis approach for detecting and analyzing concept drifts in multi-source time-series data. We propose a visual detection scheme for discovering concept drifts from multi-source time series based on prediction models. We design a drift level index to depict the dynamics, and a consistency judgment model to judge whether the concept drifts from various sources are consistent. Our integrated visual interface, ConceptExplorer, facilitates visual exploration, extraction, understanding, and comparison of concepts and concept drifts from multi-source time-series data. We conduct three case studies and expert interviews to verify the effectiveness of our approach.
LG Jul 29, 2019
FDive: Learning Relevance Models using Pattern-based Similarity Measures
Frederik L. Dennig, Tom Polk, Zudi Lin et al.
The detection of interesting patterns in large high-dimensional datasets is difficult because of their dimensionality and pattern complexity. Therefore, analysts require automated support for the extraction of relevant patterns. In this paper, we present FDive, a visual active learning system that helps to create visually explorable relevance models, assisted by learning a pattern-based similarity. We use a small set of user-provided labels to rank similarity measures, consisting of feature descriptor and distance function combinations, by their ability to distinguish relevant from irrelevant data. Based on the best-ranked similarity measure, the system calculates an interactive Self-Organizing Map-based relevance model, which classifies data according to the cluster affiliation. It also automatically prompts further relevance feedback to improve its accuracy. Uncertain areas, especially near the decision boundaries, are highlighted and can be refined by the user. We evaluate our approach by comparison to state-of-the-art feature selection techniques and demonstrate the usefulness of our approach by a case study classifying electron microscopy images of brain cells. The results show that FDive enhances both the quality and understanding of relevance models and can thus lead to new insights for brain research.
IR Jul 3, 2018
Visual Pattern-Driven Exploration of Big Data
Michael Behrisch, Robert Krueger, Fritz Lekschas et al.
Pattern extraction algorithms enable insights into the ever-growing amount of today's datasets by translating recurring data properties into compact representations. Yet, a practical problem arises: as data volumes and complexity increase, so does the number of patterns, leaving the analyst with a vast result space. Current algorithmic and especially visualization approaches often fail to answer central overview questions essential for a comprehensive understanding of pattern distributions and support, their quality, and their relevance to the analysis task. To address these challenges, we contribute a visual analytics pipeline targeted at the pattern-driven exploration of result spaces in a semi-automatic fashion. Specifically, we combine image feature analysis and unsupervised learning to partition the pattern space into interpretable, coherent chunks, which should be given priority in a subsequent in-depth analysis. In our analysis scenarios, no ground truth is given. Thus, we employ and evaluate novel quality metrics derived from the distance distributions of our image feature vectors and the derived cluster model to guide the feature selection process. We visualize our results interactively, allowing the user to drill down from overview to detail in the pattern space, and demonstrate our techniques in a case study on biomedical genomic data.