Michaël Aupetit

h-index: 19

13 papers

72 citations

Novelty: 40%

AI Score: 31

Ranked #131,021 of 194,257 authors (top 67%); #1,136 in HC (top 45%)

13 Papers

15.5 | LG | Aug 1, 2023 | Code

Classes are not Clusters: Improving Label-based Evaluation of Dimensionality Reduction

Hyeon Jeon, Yun-Hsin Kuo, Michaël Aupetit et al.

A common way to evaluate the reliability of dimensionality reduction (DR) embeddings is to quantify how well labeled classes form compact, mutually separated clusters in the embeddings. This approach is based on the assumption that the classes stay as clear clusters in the original high-dimensional space. However, in reality, this assumption can be violated; a single class can be fragmented into multiple separated clusters, and multiple classes can be merged into a single cluster. We thus cannot always assure the credibility of the evaluation using class labels. In this paper, we introduce two novel quality measures -- Label-Trustworthiness and Label-Continuity (Label-T&C) -- advancing the process of DR evaluation based on class labels. Instead of assuming that classes are well-clustered in the original space, Label-T&C work by (1) estimating the extent to which classes form clusters in the original and embedded spaces and (2) evaluating the difference between the two. A quantitative evaluation showed that Label-T&C outperform widely used DR evaluation measures (e.g., Trustworthiness and Continuity, Kullback-Leibler divergence) in terms of the accuracy in assessing how well DR embeddings preserve the cluster structure, and are also scalable. Moreover, we present case studies demonstrating that Label-T&C can be successfully used for revealing the intrinsic characteristics of DR techniques and their hyperparameters.

7.8 | LG | Sep 20, 2022 | Code

Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures

Hyeon Jeon, Michael Aupetit, DongHwa Shin et al.

We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, because this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can be used to quantify CLM within the same dataset to evaluate its different clusterings but are not designed to compare clusterings of different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes to generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating CLM of benchmark datasets before conducting external validation.
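
As a rough illustration of how an internal measure can score cluster-label matching within one dataset, here is a minimal sketch of the standard Calinski-Harabasz index computed with class labels used as the cluster assignment. It is not the between-dataset variant proposed in the paper; the function name `calinski_harabasz` and the toy data are assumptions made for this example.

```python
def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index, treating class labels as clusters.

    Illustrative only: this is the textbook within-dataset index, not the
    between-dataset generalization described in the abstract.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by their class label.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more points than classes")

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    overall = centroid(points)
    between = 0.0  # size-weighted squared distances of class centroids to the overall centroid
    within = 0.0   # squared distances of points to their own class centroid
    for rows in groups.values():
        c = centroid(rows)
        between += len(rows) * sq_dist(c, overall)
        within += sum(sq_dist(r, c) for r in rows)

    if within == 0.0:
        return float("inf")  # every class collapses to a single location
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: two tight groups far apart along the first axis.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
matched = calinski_harabasz(toy_points, [0, 0, 1, 1])     # labels follow the groups
mismatched = calinski_harabasz(toy_points, [0, 1, 0, 1])  # labels cut across the groups
```

On this toy data the score is much higher when the labels follow the two groups than when they cut across them, which is the within-dataset behavior the abstract refers to.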

14.4 | LG | Mar 3, 2025 | Code

Measuring the Validity of Clustering Validation Datasets

Hyeon Jeon, Michaël Aupetit, DongHwa Shin et al.

Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM across different labelings of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation.

6.3 | IR | Mar 9, 2025

HCT-QA: A Benchmark for Question Answering on Human-Centric Tables

Mohammad S. Ahmad, Zan A. Naeem, Michaël Aupetit et al.

Tabular data embedded within PDF files, web pages, and other document formats are prevalent across numerous sectors such as government, engineering, science, and business. These human-centric tables (HCTs) possess a unique combination of high business value, intricate layouts, and limited operational power at scale, and sometimes serve as the only data source for critical insights. However, their complexity poses significant challenges to traditional data extraction, processing, and querying methods. While current solutions focus on transforming these tables into relational formats for SQL queries, they fall short in handling the diverse and complex layouts of HCTs and hence in making them amenable to querying. This paper describes HCT-QA, an extensive benchmark of HCTs, natural language queries, and related answers on thousands of tables. Our dataset includes 2,188 real-world HCTs with 9,835 QA pairs and 4,679 synthetic tables with 67.5K QA pairs. While HCTs can potentially be processed by different types of query engines, in this paper we focus on Large Language Models as potential engines and assess their ability to process and query such tables.

5.1 | HC | Jan 30, 2022

ClassSPLOM -- A Scatterplot Matrix to Visualize Separation of Multiclass Multidimensional Data

Michael Aupetit, Ahmed Ali

In multiclass classification of multidimensional data, the user wants to build a model of the classes to predict the label of unseen data. The model is trained on the data and tested on unseen data with known labels to evaluate its quality. The results are visualized as a confusion matrix, which shows how many data labels have been predicted correctly or confused with other classes. The multidimensional nature of the data prevents the direct visualization of the classes, so we design ClassSPLOM to give more perceptual insights about the classification results. It uses the Scatterplot Matrix (SPLOM) metaphor to visualize a Linear Discriminant Analysis projection of the data for each pair of classes and a set of Receiver Operating Characteristic curves to evaluate their trustworthiness. We illustrate ClassSPLOM on a use case in Arabic dialect identification.

10.0 | HC | Jan 17, 2022

Distortion-Aware Brushing for Reliable Cluster Analysis in Multidimensional Projections

Hyeon Jeon, Michaël Aupetit, Soohyun Lee et al.

Brushing is a common interaction technique in 2D scatterplots, allowing users to select clustered points within a continuous, enclosed region for further analysis or filtering. However, applying conventional brushing to 2D representations of multidimensional (MD) data, i.e., Multidimensional Projections (MDPs), can lead to unreliable cluster analysis due to MDP-induced distortions that inaccurately represent the cluster structure of the original MD data. To alleviate this problem, we introduce a novel brushing technique for MDPs called Distortion-aware brushing. As users perform brushing, Distortion-aware brushing corrects distortions around the currently brushed points by dynamically relocating points in the projection, pulling data points close to the brushed points in MD space while pushing distant ones apart. This dynamic adjustment helps users brush MD clusters more accurately, leading to more reliable cluster analysis. Our user studies with 24 participants show that Distortion-aware brushing significantly outperforms previous brushing techniques for MDPs in accurately separating clusters in the MD space and remains robust against distortions. We further demonstrate the effectiveness of our technique through two use cases: (1) conducting cluster analysis of geospatial data and (2) interactively labeling MD clusters.

6.4 | HC | Jun 1, 2021

ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings

Mostafa M. Abbas, Ehsan Ullah, Abdelkader Baggag et al.

Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The numbers of initial mixture components and final combined groups. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.

3.7 | HC | Jan 29, 2021

Aquanims: Area-Preserving Animated Transitions in Statistical Data Graphics based on a Hydraulic Metaphor

Michael Aupetit

We propose "aquanims" as new design metaphors for animated transitions that preserve displayed areas during the transformation. Animated transitions are used to facilitate understanding of graphical transformations between different visualizations. Area is key information to preserve during filtering or ordering transitions of area-based charts like bar charts, histograms, treemaps, or mosaic plots. As liquids are incompressible fluids, we use a hydraulic metaphor to convey the sense of area preservation during animated transitions: in aquanims, graphical objects can change shape, position, color, and even connectedness but not displayed area, as for a liquid contained in a transparent vessel or transferred between such vessels communicating through hidden pipes. We present various aquanims for product plots like bar charts and histograms to accommodate changes in data, in the ordering of bars, or in the number of bins, and to provide animated tips. We also consider confusion matrices visualized as fluctuation diagrams and mosaic plots, and show how aquanims can be used to ease the understanding of different classification errors of real data.

3.3 | HC | Dec 8, 2020 | Code

An Enhanced MA Plot with R-Shiny to Ease Exploratory Analysis of Transcriptomic Data

Ali Sheharyar, Talar Boghos Yacoubian, Dina Aljogol et al.

MA plots are used to analyze the genome-wide differences in gene expression between two distinct biological conditions. An MA plot is usually rendered as a static scatter plot. Our interview with 3 experts in genomics showed that we could improve the usability of this plot by adding interactive analytic features. In this work we present the design study of the enhanced MA plot.
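
For readers unfamiliar with the plot, the sketch below computes the two coordinates an MA plot conventionally shows for each gene: M, the log2 ratio between the two conditions, and A, the mean log2 expression. The function name `ma_values`, the pseudocount of 1, and the toy values are assumptions for illustration and are not taken from the paper or its R-Shiny tool.

```python
import math


def ma_values(cond_a, cond_b, pseudocount=1.0):
    """Return one (M, A) pair per gene for paired expression values.

    M = log2(a) - log2(b)        (log ratio, plotted on the y-axis)
    A = (log2(a) + log2(b)) / 2  (mean log expression, plotted on the x-axis)
    The pseudocount is an assumption of this sketch; it avoids log2(0).
    """
    pairs = []
    for a, b in zip(cond_a, cond_b):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        pairs.append((log_a - log_b, (log_a + log_b) / 2))
    return pairs


# Hypothetical expression values for three genes under two conditions.
expr_a = [7.0, 15.0, 3.0]
expr_b = [7.0, 3.0, 15.0]
ma_points = ma_values(expr_a, expr_b)
# The first gene is unchanged (M = 0); the other two differ fourfold in opposite directions.
```

Genes with M near zero are expressed similarly in both conditions, so points far from the horizontal axis are the candidates an analyst would inspect.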

3.3 | HC | Nov 15, 2020

Aquanims -- Area-Preserving Animated Transitions based on a Hydraulic Metaphor

Michael Aupetit

We propose "Aquanims" as new design metaphors for animated transitions that preserve displayed areas during the transformation. As liquids are incompressible fluids, we use a hydraulic metaphor to convey the sense of area preservation during animated transitions. We study the design space of Aquanims for rectangle-based charts.

7.7 | HC | May 15, 2017

Visualizing Dimensionality Reduction Artifacts: An Evaluation

Nicolas Heulot, Jean-Daniel Fekete, Michael Aupetit

Multidimensional scaling allows visualizing high-dimensional data as 2D maps with the premise that insights in 2D reveal valid information in high dimensions. However, the resulting projections suffer from artifacts such as poor local neighborhood preservation and cluster tearing. Interactively coloring the projection according to the discrepancy between original proximities relative to a reference item reveals these artifacts, but it is not clear whether conveying these proximities using color and displaying only local information really helps the visual analysis of projections. We conducted a controlled experiment to investigate the relevance of this interactive technique for the visual analysis of any projection regardless of its quality. We compared the bare projection to the interactive coloring of the original proximities on different visual analysis tasks involving outliers and clusters. Results indicate that the interactive coloring is worthwhile for local tasks, as it is significantly robust to projection artifacts whereas the projection is not. However, this interactive technique does not help significantly for visual clustering tasks, for which projections already give a suitable overview.

3.2 | HC | May 10, 2017

Visualization of Wearable Data and Biometrics for Analysis and Recommendations in Childhood Obesity

Michael Aupetit, Luis Fernandez-Luque, Meghna Singh et al.

Obesity is one of the major health risk factors behind the rise of non-communicable conditions. Understanding the factors influencing obesity is very complex since there are many variables that can affect the health behaviors leading to it. Nowadays, multiple data sources can be used to study health behaviors, such as wearable sensors for physical activity and sleep, social media, mobile and health data. In this paper we describe the design of a dashboard for the visualization of actigraphy and biometric data from a childhood obesity camp in Qatar. This dashboard allows quantitative discoveries that can be used to guide patient behavior and orient qualitative research.

2.3 | IR | Mar 9, 2012

A new supervised non-linear mapping

Sylvain Lespinats, Anke Meyer-Baese, Michael Aupetit

Supervised mapping methods project multi-dimensional labeled data onto a 2-dimensional space, attempting to preserve both data similarities and the topology of classes. Supervised mappings are expected to help the user understand the underlying original class structure and classify new data visually. Several methods have been designed to achieve supervised mapping, but many of them modify original distances prior to the mapping, so that original data similarities are corrupted and even overlapping classes tend to be separated on the map, ignoring their original topology. We propose ClassiMap, an alternative method for supervised mapping. Mappings come with distortions which can be split between tears (close points mapped far apart) and false neighborhoods (points far apart mapped as neighbors). Some mapping methods favor the former while others favor the latter. ClassiMap switches between such mapping methods so that tears tend to appear between classes and false neighborhoods within classes, better preserving the classes' topology. We also propose two new objective criteria instead of the usual subjective visual inspection to perform fair comparisons of supervised mapping methods. ClassiMap appears to be the best supervised mapping method according to these criteria in our experiments on synthetic and real datasets.