Gerd Stumme

AI
h-index: 5
33 papers
172 citations
Novelty: 36%
AI Score: 42

33 Papers

LG · Feb 19, 2023
Greedy Discovery of Ordinal Factors

Dominik Dürrschnabel, Gerd Stumme

In large datasets, it is hard to discover and analyze structure. It is thus common to introduce tags or keywords for the items. In applications, such datasets are then filtered based on these tags. Still, even medium-sized datasets with a few tags result in systems that are complex and hard for humans to navigate. In this work, we adopt the method of ordinal factor analysis to address this problem. An ordinal factor arranges a subset of the tags in a linear order based on their underlying structure. A complete ordinal factorization, which consists of such ordinal factors, precisely represents the original dataset. Based on such an ordinal factorization, we provide a way to discover and explain relationships between different items and attributes in the dataset. However, computing even just one ordinal factor of high cardinality is computationally complex. We thus propose a greedy algorithm in this work. This algorithm extracts ordinal factors using already existing fast algorithms developed in formal concept analysis. Then, we leverage this algorithm to propose a comprehensive way to discover relationships in the dataset. We furthermore introduce a distance measure based on the representation emerging from the ordinal factorization to discover similar items. To evaluate the method, we conduct a case study on different datasets.

AI · Apr 10, 2023
Ordinal Motifs in Lattices

Johannes Hirth, Viktoria Horn, Gerd Stumme et al.

Lattices are a commonly used structure for the representation and analysis of relational and ontological knowledge. In particular, the analysis of these requires a decomposition of a large and high-dimensional lattice into a set of understandably large parts. With the present work we propose ordinal motifs as analytical units of meaning. We study these ordinal substructures (or standard scales) through (full) scale-measures of formal contexts from the field of formal concept analysis. We show that the underlying decision problems are NP-complete and provide results on how one can incrementally identify ordinal motifs to save computational effort. Accompanying our theoretical results, we demonstrate how ordinal motifs can be leveraged to retrieve basic meaning from a medium-sized ordinal data set.

AI · Jul 13, 2023
Towards Ordinal Data Science

Gerd Stumme, Dominik Dürrschnabel, Tom Hanika

Order is one of the main instruments to measure the relationship between objects in (empirical) data. However, compared to methods that use numerical properties of objects, the number of ordinal methods developed is rather small. One reason for this is the limited availability of computational resources in the last century that would have been required for ordinal computations. Another reason -- particularly important for this line of research -- is that order-based methods are often seen as too mathematically rigorous for applying them to real-world data. In this paper, we will therefore discuss different means for measuring and 'calculating' with ordinal structures -- a specific class of directed graphs -- and show how to infer knowledge from them. Our aim is to establish Ordinal Data Science as a fundamentally new research agenda. Besides cross-fertilization with other cornerstone machine learning and knowledge representation methods, a broad range of disciplines will benefit from this endeavor, including psychology, sociology, economics, web science, knowledge engineering, and scientometrics.

AI · Apr 6, 2023
Maximal Ordinal Two-Factorizations

Dominik Dürrschnabel, Gerd Stumme

Given a formal context, an ordinal factor is a subset of its incidence relation that forms a chain in the concept lattice, i.e., a part of the dataset that corresponds to a linear order. To visualize the data in a formal context, Ganter and Glodeanu proposed a biplot based on two ordinal factors. For the biplot to be useful, it is important that these factors comprise as many data points as possible, i.e., that they cover a large part of the incidence relation. In this work, we investigate such ordinal two-factorizations. First, for formal contexts that admit ordinal two-factorizations, we investigate the disjointness of the two factors. Then, we show that deciding on the existence of two-factorizations of a given size is an NP-complete problem, which makes computing maximal factorizations computationally expensive. Finally, we provide the algorithm Ord2Factor that allows us to compute large ordinal two-factorizations.

AI · Apr 17, 2023
Automatic Textual Explanations of Concept Lattices

Johannes Hirth, Viktoria Horn, Gerd Stumme et al.

Lattices and their order diagrams are an essential tool for communicating knowledge and insights about data. This is in particular true when applying Formal Concept Analysis. Such representations, however, are difficult for untrained users to comprehend, and in general whenever lattices are large. We tackle this problem by automatically generating textual explanations for lattices using standard scales. Our method is based on the general notion of ordinal motifs in lattices for the special case of standard scales. We show the computational complexity of identifying a small number of standard scales that cover most of the lattice structure. For these, we provide textual explanation templates, which can be applied to any occurrence of a scale in any data domain. These templates are derived using principles from human-computer interaction and allow for a comprehensive textual explanation of lattices. We demonstrate our approach on the spices planner data set, which is a medium-sized formal context consisting of fifty-six meals (objects) and thirty-seven spices (attributes). The resulting 531 formal concepts can be covered by means of about 100 standard scales.

AI · Nov 18, 2022
Discovering Locally Maximal Bipartite Subgraphs

Dominik Dürrschnabel, Tom Hanika, Gerd Stumme

Induced bipartite subgraphs of maximal vertex cardinality are an essential concept for the analysis of graphs. Yet, discovering them in large graphs is known to be computationally hard. Therefore, we consider in this work a weaker notion of this problem, where we discard the maximality constraint in favor of inclusion maximality. Thus, we aim to discover locally maximal bipartite subgraphs. For this, we present three heuristic approaches to extract such subgraphs and compare their results to the solutions of the global problem. For the latter, we employ the algorithmic strength of fast SAT-solvers. Our three proposed heuristics are based on a greedy strategy, a simulated annealing approach, and a genetic algorithm, respectively. We evaluate all four algorithms with respect to their time requirement and the vertex cardinality of the discovered bipartite subgraphs on several benchmark datasets.
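To make the notion of inclusion maximality concrete, here is a minimal Python sketch of one possible greedy strategy. It is an illustration under stated assumptions, not the heuristic from the paper: the function names `is_bipartite` and `greedy_bipartite`, the adjacency-dictionary input, and the fixed vertex order are all hypothetical choices. Because every induced subgraph of a bipartite graph is bipartite, a vertex rejected once can never be added later, so the result is inclusion-maximal (though not necessarily of maximum cardinality).

```python
from collections import deque


def is_bipartite(adj, vertices):
    """Return True if the subgraph induced by `vertices` is 2-colourable."""
    vertices = set(vertices)
    colour = {}
    for start in vertices:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in vertices:
                    continue  # edge leaves the induced subgraph
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False  # odd cycle found
    return True


def greedy_bipartite(adj, order=None):
    """Add vertices one by one, keeping each only if bipartiteness survives.

    `order` is a hypothetical parameter: the sequence in which vertices are
    tried. Different orders can yield different locally maximal results.
    """
    chosen = set()
    for v in (order if order is not None else sorted(adj)):
        if is_bipartite(adj, chosen | {v}):
            chosen.add(v)
    return chosen


# A triangle 0-1-2 with a pendant vertex 3 attached to 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(greedy_bipartite(graph))
```

The sketch rechecks bipartiteness from scratch for every candidate vertex, which is simple but quadratic in the worst case; an incremental colouring would be faster.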

AI · May 31, 2022
Attribute Exploration with Multiple Contradicting Partial Experts

Maximilian Felde, Gerd Stumme

Attribute exploration is a method from Formal Concept Analysis (FCA) that helps a domain expert discover structural dependencies in knowledge domains which can be represented as formal contexts (cross tables of objects and attributes). In this paper we present an extension of attribute exploration that allows for a group of domain experts and explores their shared views. Each expert has their own view of the domain and the views of multiple experts may contain contradicting information.

DL · Apr 25, 2022
Mapping Research Trajectories

Bastian Schäfermeier, Gerd Stumme, Tom Hanika

Steadily growing amounts of information, such as annually published scientific papers, have become so large that they elude an extensive manual analysis. Hence, to maintain an overview, automated methods for the mapping and visualization of knowledge domains are necessary and important, e.g., for scientific decision makers. Of particular interest in this field is the development of research topics of different entities (e.g., scientific authors and venues) over time. However, existing approaches for their analysis are only suitable for single entity types, such as venues, and they often do not capture the research topics or the time dimension in an easily interpretable manner. Hence, we propose a principled approach for mapping research trajectories, which is applicable to all kinds of scientific entities that can be represented by sets of published papers. For this, we transfer ideas and principles from the geographic visualization domain, specifically trajectory maps and interactive geographic maps. Our visualizations depict the research topics of entities over time in a straightforward, interpretable manner. They can be navigated by the user intuitively and restricted to specific elements of interest. The maps are derived from a corpus of research publications (i.e., titles and abstracts) through a combination of unsupervised machine learning methods. In a practical demonstrator application, we exemplify the proposed approach on a publication corpus from machine learning. We observe that our trajectory visualizations of 30 top machine learning venues and 1000 major authors in this field are well interpretable and are consistent with background knowledge drawn from the entities' publications. Next to producing interactive, interpretable visualizations supporting different kinds of analyses, our computed trajectories are suitable for trajectory mining applications in the future.

CG · Mar 17
DimFlux: Force-Directed Additive Line Diagrams

Marcel Nöhre, Dominik Dürrschnabel, Bernhard Ganter et al.

The visualization of concept lattices is a central problem in the field of Formal Concept Analysis. Force-directed algorithms, as popular in graph drawing, are a promising approach, treating lattice diagrams as physical models, optimizing node positions based on forces derived from the lattice structure. We build on the work of Zschalig, who, however, limited himself to attribute-additive diagrams. We use a more general additivity, in which both the attributes and the objects contribute to the positions of the concept nodes. We replace the planarity enhancer used by Zschalig to obtain a starting diagram for force-directed optimization with the DimDraw algorithm, which generates structured order diagrams on its own. The combination results in DimFlux, an algorithm that leverages the advantages of DimDraw but generates additive diagrams in which readability is increased by maximizing the conflict distance between nodes and non-incident edges.

DM · May 2, 2024
The Birkhoff completion of finite lattices

Mohammad Abdulla, Johannes Hirth, Gerd Stumme

We introduce the Birkhoff completion as the smallest distributive lattice in which a given finite lattice can be embedded as a semi-lattice. We discuss its relationship to implicational theories, in particular to R. Wille's simply-implicational theories. By an example, we show how the Birkhoff completion can be used as a tool for ordinal data science.

AI · Jun 29, 2025
Rises for Measuring Local Distributivity in Lattices

Mohammad Abdulla, Tobias Hille, Dominik Dürrschnabel et al.

Distributivity is a well-established and extensively studied notion in lattice theory. In the context of data analysis, particularly within Formal Concept Analysis (FCA), lattices are often observed to exhibit a high degree of distributivity. However, no standardized measure exists to quantify this property. In this paper, we introduce the notion of rises in (concept) lattices as a means to assess distributivity. Rises capture how the number of attributes or objects in covering concepts changes within the concept lattice. We show that a lattice is distributive if and only if no non-unit rises occur. Furthermore, we relate rises to the classical notions of meet- and join-distributivity. We observe that concept lattices from real-world data are to a high degree join-distributive, but much less meet-distributive. We additionally study how join-distributivity manifests on the level of ordered sets.

AI · Jun 27, 2025
Conceptual Topic Aggregation

Klara M. Gutekunst, Dominik Dürrschnabel, Johannes Hirth et al.

The vast growth of data has rendered traditional manual inspection infeasible, necessitating the adoption of computational methods for efficient data exploration. Topic modeling has emerged as a powerful tool for analyzing large-scale textual datasets, enabling the extraction of latent semantic structures. However, existing methods for topic modeling often struggle to provide interpretable representations that facilitate deeper insights into data structure and content. In this paper, we propose FAT-CAT, an approach based on Formal Concept Analysis (FCA) to enhance meaningful topic aggregation and visualization of discovered topics. Our approach can handle diverse topics and file types -- grouped by directories -- to construct a concept lattice that offers a structured, hierarchical representation of their topic distribution. In a case study on the ETYNTKE dataset, we evaluate the effectiveness of our approach against other representation methods to demonstrate that FCA-based aggregation provides more meaningful and interpretable insights into dataset composition than existing topic modeling techniques.

SI · Apr 25, 2024
Conceptual Mapping of Controversies

Claude Draude, Dominik Dürrschnabel, Johannes Hirth et al.

With our work, we contribute towards a qualitative analysis of the discourse on controversies in online news media. For this, we employ Formal Concept Analysis and the economics of conventions to derive conceptual controversy maps. In our experiments, we analyze two maps from different news journals with methods from ordinal data science. We show how these methods can be used to assess the diversity, complexity and potential bias of controversies. In addition to that, we discuss how the diagrams of concept lattices can be used to navigate between news articles.

IR · Sep 21, 2021
Towards Explainable Scientific Venue Recommendations

Bastian Schäfermeier, Gerd Stumme, Tom Hanika

Selecting the best scientific venue (i.e., conference/journal) for the submission of a research article constitutes a multifaceted challenge. Important aspects to consider are the suitability of research topics, a venue's prestige, and the probability of acceptance. The selection problem is exacerbated through the continuous emergence of additional venues. Previously proposed approaches for supporting authors in this process rely on complex recommender systems, e.g., based on Word2Vec or TextCNN. These, however, often fail to provide an explanation for their recommendations. In this work, we propose an unsophisticated method that advances the state of the art in two aspects: first, we enhance the interpretability of recommendations through topic models based on non-negative matrix factorization; second, we obtain, surprisingly, competitive recommendation performance while using simpler learning methods.

LG · Sep 3, 2021
LG4AV: Combining Language Models and Graph Neural Networks for Author Verification

Maximilian Stubbemann, Gerd Stumme

The automatic verification of document authorships is important in various settings. Researchers are, for example, judged and compared by the amount and impact of their publications, and public figures are confronted with their posts on social media platforms. Therefore, it is important that authorship information in frequently used web services and platforms is correct. The question whether a given document is written by a given author is commonly referred to as authorship verification (AV). While AV is a widely investigated problem in general, only a few works consider settings where the documents are short and written in a rather uniform style. This makes most approaches impractical for online databases and knowledge graphs in the scholarly domain. Here, authorships of scientific publications have to be verified, often with just abstracts and titles available. To this end, we present our novel approach LG4AV, which combines language models and graph neural networks for authorship verification. By directly feeding the available texts into a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features that are not meaningful in scenarios where the writing style is, at least to some extent, standardized. By the incorporation of a graph neural network structure, our model can benefit from relations between authors that are meaningful with respect to the verification process. For example, scientific authors are more likely to write about topics that are addressed by their co-authors, and Twitter users tend to post about the same subjects as people they follow. We experimentally evaluate our model and study to what extent the inclusion of co-authorships enhances verification decisions in bibliometric environments.

AI · Jun 21, 2021
Attribute Selection using Contranominal Scales

Dominik Dürrschnabel, Maren Koyda, Gerd Stumme

Formal Concept Analysis (FCA) allows one to analyze binary data by deriving concepts and ordering them in lattices. One of the main goals of FCA is to enable humans to comprehend the information that is encapsulated in the data; however, the large size of concept lattices is a limiting factor for the feasibility of understanding the underlying structural properties. The size of such a lattice depends on the number of subcontexts in the corresponding formal context that are isomorphic to a contranominal scale of high dimension. In this work, we propose the algorithm ContraFinder that enables the computation of all contranominal scales of a given formal context. Leveraging this algorithm, we introduce delta-adjusting, a novel approach to decrease the number of contranominal scales in a formal context by the selection of an appropriate attribute subset. We demonstrate that delta-adjusting a context reduces the size of the hereby emerging sub-semilattice and that the implication set is restricted to meaningful implications. This is evaluated with respect to its associated knowledge by means of a classification task. Hence, our proposed technique strongly improves understandability while preserving important conceptual structures.
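The link between contranominal scales and lattice size can be illustrated with a small Python sketch. It is not ContraFinder and not delta-adjusting; it only enumerates the formal concepts of a tiny context by brute force. The helper names (`extent`, `intent`, `concepts`, `contranominal_scale`) and the dictionary representation of a context are assumptions made for this illustration. A contranominal scale of dimension n relates object g to attribute m exactly when g differs from m, and its concept lattice has 2^n concepts.

```python
from itertools import combinations


def extent(context, attrs):
    """Objects that have every attribute in `attrs`."""
    return frozenset(g for g, has in context.items() if attrs <= has)


def intent(context, objs, all_attrs):
    """Attributes shared by every object in `objs`."""
    shared = set(all_attrs)
    for g in objs:
        shared &= context[g]
    return frozenset(shared)


def concepts(context, all_attrs):
    """Enumerate all formal concepts by closing every attribute subset.

    Exponential in the number of attributes, so only usable on toy inputs.
    """
    all_attrs = frozenset(all_attrs)
    found = set()
    for k in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), k):
            ext = extent(context, frozenset(subset))
            found.add((ext, intent(context, ext, all_attrs)))
    return found


def contranominal_scale(n):
    """Context ({0..n-1}, {0..n-1}, !=) as a dict: object -> attribute set."""
    return {g: frozenset(m for m in range(n) if m != g) for g in range(n)}


for n in range(1, 5):
    print(n, len(concepts(contranominal_scale(n), range(n))))
```

Each added dimension doubles the number of concepts, which is why a context containing a high-dimensional contranominal scale as a subcontext necessarily has a large concept lattice.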

NI · Jun 17, 2021
Topological Indoor Mapping through WiFi Signals

Bastian Schaefermeier, Gerd Stumme, Tom Hanika

The ubiquitous presence of WiFi access points and mobile devices capable of measuring WiFi signal strengths allows for real-world applications in indoor localization and mapping. In particular, no additional infrastructure is required. Previous approaches in this field were, however, often hindered by problems such as effortful map-building processes, changing environments and hardware differences. We tackle these problems focussing on topological maps. These represent discrete locations, such as rooms, and their relations, e.g., distances and transition frequencies. In our unsupervised method, we employ WiFi signal strength distributions, dimension reduction and clustering. It can be used in settings where users carry mobile devices and follow their normal routine. We aim for applications in short-lived indoor events such as conferences.

AI · Feb 4, 2021
Triadic Exploration and Exploration with Multiple Experts

Maximilian Felde, Gerd Stumme

Formal Concept Analysis (FCA) provides a method called attribute exploration which helps a domain expert discover structural dependencies in knowledge domains that can be represented by a formal context (a cross table of objects and attributes). Triadic Concept Analysis is an extension of FCA that incorporates the notion of conditions. Many extensions and variants of attribute exploration have been studied, but only a few attempts at incorporating multiple experts have been made. In this paper we present triadic exploration based on Triadic Concept Analysis to explore conditional attribute implications in a triadic domain. We then adapt this approach to formulate attribute exploration with multiple experts that have different views on a domain.

LG · Oct 23, 2020
Topic Space Trajectories: A case study on machine learning literature

Bastian Schäfermeier, Gerd Stumme, Tom Hanika

The annual number of publications at scientific venues, for example, conferences and journals, is growing quickly. Hence, even for researchers it becomes harder and harder to keep track of research topics and their progress. In this task, researchers can be supported by automated publication analysis. Yet, many such methods result in uninterpretable, purely numerical representations. As an attempt to support human analysts, we present topic space trajectories, a structure that allows for the comprehensible tracking of research topics. We demonstrate how these trajectories can be interpreted based on eight different analysis approaches. To obtain comprehensible results, we employ non-negative matrix factorization as well as suitable visualization techniques. We show the applicability of our approach on a publication corpus spanning 50 years of machine learning research from 32 publication venues. Our novel analysis method may be employed for paper classification, for the prediction of future research topics, and for the recommendation of fitting conferences and journals for submitting unpublished work.

AI · Aug 23, 2019
Interactive Collaborative Exploration using Incomplete Contexts

Maximilian Felde, Gerd Stumme

A well-known knowledge acquisition method in the field of Formal Concept Analysis (FCA) is attribute exploration. It is used to reveal dependencies in a set of attributes with the help of a domain expert. In most applications no single expert is capable (time- and knowledge-wise) of exploring the knowledge domain alone. However, there is up to now no theory that models the interaction of multiple experts for the task of attribute exploration with incomplete knowledge. To this end, we develop a theoretical framework that allows multiple experts to explore domains together. We use a representation of incomplete knowledge as three-valued contexts. We then adapt the corresponding version of attribute exploration to fit the setting of multiple experts. We suggest formalizations for key components like expert knowledge, interaction and collaboration strategy. In particular, we define an order that allows one to compare the results of different exploration strategies on the same task with respect to their information completeness. Furthermore, we discuss other ways of comparing collaboration strategies and suggest avenues for future research.

AI · Jul 22, 2019
Orometric Methods in Bounded Metric Data

Maximilian Stubbemann, Tom Hanika, Gerd Stumme

A large amount of data accommodated in knowledge graphs (KG) is actually metric. For example, the Wikidata KG contains a plenitude of metric facts about geographic entities like cities, chemical compounds or celestial objects. In this paper, we propose a novel approach that transfers orometric (topographic) measures to bounded metric spaces. While these methods were originally designed to identify relevant mountain peaks on the surface of the earth, we demonstrate how to use them for metric data sets in general, notably metric sets of items enclosed in knowledge graphs. Based on this, we present a method for identifying outstanding items using the transferred valuation functions 'isolation' and 'prominence'. Building on this, we envision an item recommendation process. To demonstrate the relevance of the novel valuations for such processes, we use item sets from the Wikidata knowledge graph. We then evaluate the usefulness of 'isolation' and 'prominence' empirically in a supervised machine learning setting. In particular, we find structurally relevant items in the geographic population distributions of Germany and France.
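As a rough illustration of 'isolation', the following Python sketch uses the topographic reading of the term: the distance from an item to the nearest item with a strictly higher value. This is an assumption based on how isolation is defined for mountain peaks; the exact definition transferred in the paper may differ, and 'prominence' is not covered here. The function name `isolation`, its parameters, and the toy data are hypothetical.

```python
import math


def isolation(points, heights, dist):
    """Distance from each point to its nearest strictly higher point.

    `points` is an iterable of hashable items, `heights` maps each item to a
    numeric value (e.g. population), and `dist` is a metric on the items.
    The highest item has no higher neighbour and gets infinity here; in a
    bounded metric space one could cap this at the diameter instead.
    """
    points = list(points)
    result = {}
    for p in points:
        higher = [dist(p, q) for q in points if heights[q] > heights[p]]
        result[p] = min(higher) if higher else math.inf
    return result


# Toy example: three "cities" on a line, valued by a made-up height.
position = {"a": 0.0, "b": 3.0, "c": 10.0}
height = {"a": 5, "b": 2, "c": 9}
iso = isolation(position, height, lambda p, q: abs(position[p] - position[q]))
print(iso)
```

An item with a large isolation value is locally outstanding: nothing more highly valued lies anywhere near it, even if it is not the global maximum.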

LG · May 16, 2019
Collaborative Interactive Learning -- A clarification of terms and a differentiation from other research fields

Tom Hanika, Marek Herde, Jochen Kuhn et al.

The field of collaborative interactive learning (CIL) aims at developing and investigating the technological foundations for a new generation of smart systems that support humans in their everyday life. While the concept of CIL has already been carved out in detail (including the fields of dedicated CIL and opportunistic CIL) and many research objectives have been stated, there is still the need to clarify some terms such as information, knowledge, and experience in the context of CIL and to differentiate CIL from recent and ongoing research in related fields such as active learning, collaborative learning, and others. Both aspects are addressed in this paper.

CG · Mar 2, 2019
DimDraw -- A novel tool for drawing concept lattices

Dominik Dürrschnabel, Tom Hanika, Gerd Stumme

Concept lattice drawings are an important tool to visualize complex relations in data in a simple manner for human readers. Many attempts were made to transfer classical graph drawing approaches to order diagrams. Although those methods are satisfactory for some lattices, they unfortunately perform poorly in general. In this work we present a novel tool to draw concept lattices that is purely motivated by the order structure.

AI · Feb 3, 2019
Discovering Implicational Knowledge in Wikidata

Tom Hanika, Maximilian Marx, Gerd Stumme

Knowledge graphs have recently become the state-of-the-art tool for representing the diverse and complex knowledge of the world. Examples include the proprietary knowledge graphs of companies such as Google, Facebook, IBM, or Microsoft, but also freely available ones such as YAGO, DBpedia, and Wikidata. A distinguishing feature of Wikidata is that the knowledge is collaboratively edited and curated. While this greatly enhances the scope of Wikidata, it also makes it impossible for a single individual to grasp complex connections between properties or understand the global impact of edits in the graph. We apply Formal Concept Analysis to efficiently identify comprehensible implications that are implicitly present in the data. Although the complex structure of data modelling in Wikidata is not amenable to a direct approach, we overcome this limitation by extracting contextual representations of parts of Wikidata in a systematic fashion. We demonstrate the practical feasibility of our approach through several experiments and show that the results may lead to the discovery of interesting implicational knowledge. Besides providing a method for obtaining large real-world data sets for FCA, we sketch potential applications in offering semantic assistance for editing and curating Wikidata.

AI · Dec 20, 2018
Relevant Attributes in Formal Contexts

Tom Hanika, Maren Koyda, Gerd Stumme

Computing conceptual structures, like formal concept lattices, is a challenging task in the age of massive data sets. There are various approaches to deal with this, e.g., random sampling, parallelization, or attribute extraction. A method not yet investigated in the realm of formal concept analysis is attribute selection, as done in machine learning. Building on this, we introduce a method for attribute selection in formal contexts. To this end, we propose the notion of relevant attributes, which enables us to define a relative relevance function reflecting both the order structure of the concept lattice and the distribution of objects on it. Finally, we overcome computational challenges for computing the relative relevance through an approximation approach based on information entropy.

LG · Sep 19, 2018
Distances for WiFi Based Topological Indoor Mapping

Bastian Schäfermeier, Tom Hanika, Gerd Stumme

For localization and mapping of indoor environments through WiFi signals, locations are often represented as likelihoods of the received signal strength indicator. In this work we compare various measures of distance between such likelihoods in combination with different methods for estimation and representation. In particular, we show that among the considered distance measures the Earth Mover's Distance seems the most beneficial for the localization task. Combined with kernel density estimation we were able to retain the topological structure of rooms in a real-world office scenario.
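For readers unfamiliar with the Earth Mover's Distance, here is a minimal Python sketch for the one-dimensional case, which is the natural setting for signal strength histograms. It assumes both histograms share the same equally spaced bins; under that assumption the distance equals the summed absolute difference of the two cumulative distributions times the bin width. The function name `emd_1d` and the example histograms are hypothetical, and the paper's actual estimation pipeline (e.g. kernel density estimation) is not reproduced.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b, bin_width=1.0):
    """Earth Mover's Distance between two histograms over identical bins.

    Both histograms are normalised to total mass 1 before comparison.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("histograms must have positive total mass")
    cdf_a = accumulate(x / total_a for x in hist_a)
    cdf_b = accumulate(x / total_b for x in hist_b)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


# Two made-up signal strength histograms over the same three bins.
room_1 = [1, 0, 0]
room_2 = [0, 0, 1]
print(emd_1d(room_1, room_2))
```

Unlike a bin-wise comparison, this distance grows with how far the mass has to be moved, so two distributions that are shifted by a few dBm count as closer than two that are far apart.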

AI · May 15, 2018
Intrinsic dimension and its application to association rules

Tom Hanika, Friedrich Martin Schneider, Gerd Stumme

The curse of dimensionality in the realm of association rules is twofold. Firstly, we have the well-known exponential increase in computational complexity with increasing item set size. Secondly, there is a related curse concerned with the distribution of (sparse) data itself in high dimension. The former problem is often coped with by projection, i.e., feature selection, whereas the best known strategy for the latter is avoidance. This work summarizes the first attempt to provide a computationally feasible method for measuring the extent of dimension curse present in a data set with respect to a particular class of machine learning procedures. This recent development enables the application of various other methods from geometric analysis to be investigated and applied in machine learning procedures in the presence of high dimension.

AI · Jan 24, 2018
Intrinsic Dimension of Geometric Data Sets

Tom Hanika, Friedrich Martin Schneider, Gerd Stumme

The curse of dimensionality is a phenomenon frequently observed in machine learning (ML) and knowledge discovery (KD). There is a large body of literature investigating its origin and impact, using methods from mathematics as well as from computer science. Among the mathematical insights into data dimensionality, there is an intimate link between the dimension curse and the phenomenon of measure concentration, which makes the former accessible to methods of geometric analysis. The present work provides a comprehensive study of the intrinsic geometry of a data set, based on Gromov's metric measure geometry and Pestov's axiomatic approach to intrinsic dimension. In detail, we define a concept of geometric data set and introduce a metric as well as a partial order on the set of isomorphism classes of such data sets. Based on these objects, we propose and investigate an axiomatic approach to the intrinsic dimension of geometric data sets and establish a concrete dimension function with the desired properties. Our model for data sets and their intrinsic dimension is computationally feasible and, moreover, adaptable to specific ML/KD-algorithms, as illustrated by various experiments.

CV · Dec 14, 2017
Adaptive kNN using Expected Accuracy for Classification of Geo-Spatial Data

Mark Kibanov, Martin Becker, Juergen Mueller et al.

The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy of a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data. Each classification task consists of (tens of) thousands of items. We demonstrate that the presented expected accuracy measures can be a good estimator for kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. Also, we show that the range of considered k can be significantly reduced to speed up the algorithm without negative influence on classification accuracy.
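The idea of choosing k per instance can be sketched in Python as follows. This is only one possible reading of the abstract, not the authors' implementation: here the "structurally similar observations" are simply the `m` training points nearest to the query, and the expected accuracy of a candidate k is the leave-one-out accuracy of plain kNN on those points. The names `knn_predict`, `adaptive_knn_predict`, `candidate_ks` and `m` are all hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k, dist, skip=None):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (features, label) pairs; `skip` optionally excludes
    one training index (used for leave-one-out evaluation).
    """
    neighbours = sorted(
        (i for i in range(len(train)) if i != skip),
        key=lambda i: dist(train[i][0], query),
    )[:k]
    votes = Counter(train[i][1] for i in neighbours)
    return votes.most_common(1)[0][0]


def adaptive_knn_predict(train, query, candidate_ks, dist, m=5):
    """Pick the k with the best estimated accuracy near `query`, then predict."""
    # Assumed similarity function: the m training points nearest to the query.
    similar = sorted(
        range(len(train)), key=lambda i: dist(train[i][0], query)
    )[:m]

    def expected_accuracy(k):
        hits = sum(
            knn_predict(train, train[i][0], k, dist, skip=i) == train[i][1]
            for i in similar
        )
        return hits / len(similar)

    best_k = max(candidate_ks, key=expected_accuracy)
    return knn_predict(train, query, best_k, dist), best_k


# Toy one-dimensional data with two well-separated classes.
data = [(0, "a"), (1, "a"), (2, "a"), (10, "b"), (11, "b"), (12, "b")]
label, k = adaptive_knn_predict(data, 1.5, [1, 3], lambda x, y: abs(x - y), m=3)
print(label, k)
```

Restricting `candidate_ks` to a short list corresponds to the abstract's remark that the range of considered k can be reduced to speed up the algorithm.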

IRMay 9, 2017
Predicting Rising Follower Counts on Twitter Using Profile Information

Juergen Mueller, Gerd Stumme

When evaluating the cause of one's popularity on Twitter, one thing is considered to be the main driver: many tweets. There is debate about the kind of tweet one should publish, but little about factors beyond tweets. Of particular interest is the information provided by each Twitter user's profile page. One of these features is the given name on those profiles. Studies in psychology and economics identified correlations of the first name to, e.g., one's school marks or chances of getting a job interview in the US. Therefore, we are interested in the influence of this profile information on the follower count. We addressed this question by analyzing the profiles of about 6 million Twitter users. All profiles are separated into three groups: users that have a first name, English words, or neither in their name field. The assumption is that names and words influence the discoverability of a user and subsequently his/her follower count. We propose a classifier that labels users who will increase their follower count within a month by applying different models based on the user's group. The classifiers are evaluated with the area under the receiver operating characteristic curve (AUC) and achieve a score above 0.800.
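For readers unfamiliar with the evaluation metric, the area under the ROC curve can be computed as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The sketch below is a generic illustration of that definition; the function name and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positive and 0 for negative examples;
    `scores` holds the classifier's score for each example.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count correctly ordered pairs; a tie contributes half a pair.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: 1 = follower count rose, 0 = it did not.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The quadratic pairwise loop is chosen for clarity; for millions of profiles a rank-based computation would be the more practical choice.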

CLJun 17, 2016
Gender Inference using Statistical Name Characteristics in Twitter

Juergen Mueller, Gerd Stumme

Much attention has been given to the task of gender inference of Twitter users. Although names are strong gender indicators, the names of Twitter users are rarely used as a feature, probably due to the high number of ill-formed names, which cannot be found in any name dictionary. Instead of relying solely on a name database, we propose a novel name classifier. Our approach extracts characteristics from the user names and uses those in order to assign the names to a gender. This enables us to classify international first names as well as ill-formed names.
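To illustrate how a classifier based on name characteristics can handle names missing from any dictionary, here is a minimal sketch. The abstract does not say which characteristics or which model are used, so the features (last letter, two-letter suffix, first letter, length), the naive Bayes scoring, and the tiny labelled list are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import log


def name_features(name):
    """Simple surface characteristics of a name (assumed for illustration)."""
    n = name.strip().lower()
    return [f"last={n[-1:]}", f"suffix2={n[-2:]}", f"first={n[:1]}", f"len={min(len(n), 10)}"]


def train(labelled_names):
    """Count labels and per-label feature occurrences."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for name, label in labelled_names:
        label_counts[label] += 1
        for feat in name_features(name):
            feature_counts[label][feat] += 1
    return label_counts, feature_counts


def classify(model, name):
    """Naive Bayes with add-one smoothing over the name's features."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    vocab = {f for counts in feature_counts.values() for f in counts}

    def score(label):
        denom = sum(feature_counts[label].values()) + len(vocab)
        s = log(label_counts[label] / total)
        for feat in name_features(name):
            s += log((feature_counts[label][feat] + 1) / denom)
        return s

    return max(label_counts, key=score)


# Hypothetical toy training list; a real system would use a large name database.
toy_model = train([
    ("anna", "f"), ("maria", "f"), ("julia", "f"), ("laura", "f"),
    ("peter", "m"), ("martin", "m"), ("stefan", "m"), ("oliver", "m"),
])
guess = classify(toy_model, "annna")  # an ill-formed spelling not in the list
```

Because the features are computed from the string itself, a misspelled or unseen name still receives a score instead of a dictionary miss.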

IRMar 3, 2013
Onomastics 2.0 - The Power of Social Co-Occurrences

Folke Mitzlaff, Gerd Stumme

Onomastics is "the science or study of the origin and forms of proper names of persons or places." ["Onomastics". Merriam-Webster.com, 2013. http://www.merriam-webster.com (11 February 2013)]. Personal names in particular play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste. With the rise of the Social Web and its applications, users increasingly interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as newly emerging interrelations among names. The present work shows how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, for given names and for city names respectively. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph-based similarities are observed. The discovered relations among given names are the foundation of "nameling" [http://nameling.net], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.
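The idea of deriving structural similarities from a co-occurrence graph can be sketched as follows. This is an illustrative example only: the abstract does not specify the similarity measure, so cosine similarity over neighbourhood weight vectors is an assumption, as are the function names and the toy documents.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence_graph(documents):
    """Weighted undirected graph: edge weight = number of documents
    in which two names appear together."""
    graph = defaultdict(Counter)
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def cosine_similarity(graph, a, b):
    """Cosine similarity of the two names' neighbourhood weight vectors."""
    va = graph.get(a, Counter())
    vb = graph.get(b, Counter())
    dot = sum(w * vb[n] for n, w in va.items())
    norm = sqrt(sum(w * w for w in va.values())) * sqrt(sum(w * w for w in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy "documents", each listing the city names it mentions.
toy_graph = cooccurrence_graph([
    ["Kassel", "Göttingen"],
    ["Kassel", "Marburg"],
    ["Göttingen", "Marburg"],
    ["Hamburg", "Kiel"],
])
close = cosine_similarity(toy_graph, "Kassel", "Göttingen")
far = cosine_similarity(toy_graph, "Kassel", "Kiel")
```

Such structural scores could then be compared against a semantically grounded measure, for example geographical distance between the cities.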

IRFeb 18, 2013
Recommending Given Names

Folke Mitzlaff, Gerd Stumme

All over the world, future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and especially personal taste. Although this task is omnipresent, little research has been conducted on the analysis and application of interrelations among given names from a data mining perspective. The present work tackles the problem of recommending given names by first mining for inter-name relatedness in data from the Social Web. Based on these results, the name search engine "Nameling" was built, which attracted more than 35,000 users within less than six months, underpinning the relevance of the underlying recommendation task. The accruing usage data is then used for evaluating different state-of-the-art recommendation systems, as well as our new NameRank algorithm, which we adopted from our previous work on folksonomies and which yields the best results, considering the trade-off between prediction accuracy and runtime performance as well as its ability to generate personalized recommendations. We also show how the gathered inter-name relationships can be used for meaningful result diversification of PageRank-based recommendation systems. As all of the considered usage data is made publicly available, the present work establishes baseline results, encouraging other researchers to implement advanced recommendation systems for given names.
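As background for the PageRank-based recommenders mentioned above, the following is a minimal sketch of generic personalized PageRank on a weighted name graph. It is not NameRank, whose details are not given here; the function name, the damping value, the iteration count, and the toy graph are assumptions for illustration.

```python
def personalized_pagerank(graph, preference, damping=0.85, iterations=50):
    """Power iteration for PageRank with a user-specific preference vector.

    `graph` maps each node to a dict of {neighbour: weight}; every
    neighbour must itself be a key of `graph`. `preference` maps nodes
    to non-negative weights describing the user's interest.
    """
    nodes = list(graph)
    pref_total = sum(preference.values())
    jump = {n: preference.get(n, 0.0) / pref_total for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random-jump part: restart at the preferred nodes.
        new_rank = {n: (1 - damping) * jump[n] for n in nodes}
        for n in nodes:
            out_weight = sum(graph[n].values())
            if out_weight == 0:
                # Dangling node: hand its mass back to the preferred nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * jump[m]
                continue
            for m, w in graph[n].items():
                new_rank[m] += damping * rank[n] * w / out_weight
        rank = new_rank
    return rank


# Hypothetical toy name graph with two disconnected groups.
toy_names = {
    "emma": {"anna": 2, "mia": 1},
    "anna": {"emma": 2},
    "mia": {"emma": 1},
    "paul": {"max": 1},
    "max": {"paul": 1},
}
ranks = personalized_pagerank(toy_names, {"anna": 1.0})
recommendations = sorted(ranks, key=ranks.get, reverse=True)
```

With the preference placed on one name, names connected to it accumulate rank while the unrelated group fades, which is what makes the ranking personalized.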