LGAug 1, 2023
Explainable Graph Spectral Clustering of Text DocumentsBartłomiej Starosta, Mieczysław A. Kłopotek, Sławomir T. Wierzchoń
Spectral clustering methods are known for their ability to represent clusters of diverse shapes, densities etc. However, results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relation to document contents. Therefore there is an urgent need to elaborate methods for explaining the outcome of the clustering. This paper presents a contribution towards this goal. We present a proposal of explanation of results of combinatorial Laplacian based graph spectral clustering. It is based on showing (approximate) equivalence of combinatorial Laplacian embedding, $K$-embedding (proposed in this paper) and term vector space embedding. Hence a bridge is constructed between the textual contents and the clustering results. We provide theoretical background for this approach. We performed experimental study showing that $K$-embedding approximates well Laplacian embedding under favourable block matrix conditions and show that approximation is good enough under other conditions.
LGAug 18, 2023
Eigenvalue-based Incremental Spectral ClusteringMieczysław A. Kłopotek, Bartłmiej Starosta, Sławomir T. Wierzchoń
Our previous experiments demonstrated that subsets collections of (short) documents (with several hundred entries) share a common normalized in some way eigenvalue spectrum of combinatorial Laplacian. Based on this insight, we propose a method of incremental spectral clustering. The method consists of the following steps: (1) split the data into manageable subsets, (2) cluster each of the subsets, (3) merge clusters from different subsets based on the eigenvalue spectrum similarity to form clusters of the entire set. This method can be especially useful for clustering methods of complexity strongly increasing with the size of the data sample,like in case of typical spectral clustering. Experiments were performed showing that in fact the clustering and merging the subsets yields clusters close to clustering the entire dataset.
LGAug 7, 2023
Wide Gaps and Clustering AxiomsMieczysław A. Kłopotek
The widely applied k-means algorithm produces clusterings that violate our expectations with respect to high/low similarity/density and is in conflict with Kleinberg's axiomatic system for distance based clustering algorithms that formalizes those expectations in a natural way. k-means violates in particular the consistency axiom. We hypothesise that this clash is due to the not explicated expectation that the data themselves should have the property of being clusterable in order to expect the algorithm clustering hem to fit a clustering axiomatic system. To demonstrate this, we introduce two new clusterability properties, variational k-separability and residual k-separability and show that then the Kleinberg's consistency axiom holds for k-means operating in the Euclidean or non-Euclidean space. Furthermore, we propose extensions of k-means algorithm that fit approximately the Kleinberg's richness axiom that does not hold for k-means. In this way, we reconcile k-means with Kleinberg's axiomatic framework in Euclidean and non-Euclidean settings. Besides contribution to the theory of axiomatic frameworks of clustering and for clusterability theory, practical contribution is the possibility to construct {datasets for testing purposes of algorithms optimizing k-means cost function. This includes a method of construction of {clusterable data with known in advance global optimum.
LGAug 2, 2023
Are Easy Data Easy (for K-Means)Mieczysław A. Kłopotek
This paper investigates the capability of correctly recovering well-separated clusters by various brands of the $k$-means algorithm. The concept of well-separatedness used here is derived directly from the common definition of clusters, which imposes an interplay between the requirements of within-cluster-homogenicity and between-clusters-diversity. Conditions are derived for a special case of well-separated clusters such that the global minimum of $k$-means cost function coincides with the well-separatedness. An experimental investigation is performed to find out whether or no various brands of $k$-means are actually capable of discovering well separated clusters. It turns out that they are not. A new algorithm is proposed that is a variation of $k$-means++ via repeated {sub}sampling when choosing a seed. The new algorithm outperforms four other algorithms from $k$-means family on the task.
LGNov 30, 2022
High-Dimensional Wide Gap $k$-Means Versus Clustering AxiomsMieczysław A. Kłopotek
Kleinberg's axioms for distance based clustering proved to be contradictory. Various efforts have been made to overcome this problem. Here we make an attempt to handle the issue by embedding in high-dimensional space and granting wide gaps between clusters.
AIOct 27, 2022
How To Overcome Richness Axiom FallacyMieczysław A. Kłopotek, Robert A. Kłopotek
The paper points at the grieving problems implied by the richness axiom in the Kleinberg's axiomatic system and suggests resolutions. The richness induces learnability problem in general and leads to conflicts with consistency axiom. As a resolution, learnability constraints and usage of centric consistency or restriction of the domain of considered clusterings to super-ball-clusterings is proposed.
CLApr 16, 2025
A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents -- Extended VersionMieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta et al.
This paper investigates the problem of Graph Spectral Clustering with negative similarities, resulting from document embeddings different from the traditional Term Vector Space (like doc2vec, GloVe, etc.). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of 6 different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized Laplacian based GSC due to negative similarities. Furthermore, application of methods curing similarity negativity leads to accuracy improvement for both combinatorial and normalized Laplacian based GSC. It also leads to applicability for GloVe embeddings of explanation methods developed originally bythe authors for Term Vector Space embeddings.
LGDec 13, 2025
Rough Sets for Explainability of Spectral Graph ClusteringBartłomiej Starosta, Sławomir T. Wierzchoń, Piotr Borkowski et al.
Graph Spectral Clustering methods (GSC) allow representing clusters of diverse shapes, densities, etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relation to document contents. Furthermore, the presence of documents without clear content meaning and the stochastic nature of the clustering algorithms deteriorate explainability. This paper proposes an enhancement to the explanation methodology, proposed in an earlier research of our team. It allows us to overcome the latter problems by taking inspiration from rough set theory.
LGAug 12, 2025
Explainable Graph Spectral Clustering For Text EmbeddingsMieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta et al.
In a previous paper, we proposed an introduction to the explainability of Graph Spectral Clustering results for textual documents, given that document similarity is computed as cosine similarity in term vector space. In this paper, we generalize this idea by considering other embeddings of documents, in particular, based on the GloVe embedding idea.
LGFeb 19, 2022
A Clustering Preserving Transformation for k-Means Algorithm OutputMieczysław A. Kłopotek
This note introduces a novel clustering preserving transformation of cluster sets obtained from $k$-means algorithm. This transformation may be used to generate new labeled data{}sets from existent ones. It is more flexible that Kleinberg axiom based consistency transformation because data points in a cluster can be moved away and datapoints between clusters may come closer together.
AIJun 15, 2020
p-d-Separation -- A Concept for Expressing Dependence/Independence Relations in Causal NetworksMieczysław A. Kłopotek
Spirtes, Glymour and Scheines formulated a Conjecture that a direct dependence test and a head-to-head meeting test would suffice to construe directed acyclic graph decompositions of a joint probability distribution (Bayesian network) for which Pearl's d-separation applies. This Conjecture was later shown to be a direct consequence of a result of Pearl and Verma. This paper is intended to prove this Conjecture in a new way, by exploiting the concept of p-d-separation (partial dependency separation). While Pearl's d-separation works with Bayesian networks, p-d-separation is intended to apply to causal networks: that is partially oriented networks in which orientations are given to only to those edges, that express statistically confirmed causal influence, whereas undirected edges express existence of direct influence without possibility of determination of direction of causation. As a consequence of the particular way of proving the validity of this Conjecture, an algorithm for construction of all the directed acyclic graphs (dags) carrying the available independence information is also presented. The notion of a partially oriented graph (pog) is introduced and within this graph the notion of p-d-separation is defined. It is demonstrated that the p-d-separation within the pog is equivalent to d-separation in all derived dags.
AIMay 25, 2020
Non-Destructive Sample Generation From Conditional Belief FunctionsMieczysław A. Kłopotek
This paper presents a new approach to generate samples from conditional belief functions for a restricted but non trivial subset of conditional belief functions. It assumes the factorization (decomposition) of a belief function along a bayesian network structure. It applies general conditional belief functions.
LGSep 28, 2019
A Note On k-Means Probabilistic PovertyMieczysław A. Kłopotek
It is proven, by example, that the version of $k$-means with random initialization does not have the property probabilistic k-richness.
AISep 26, 2019
Query Optimization Properties of Modified VBSMieczysław A. Kłopotek, Sławomir T. Wierzchoń
Valuation-Based~System can represent knowledge in different domains including probability theory, Dempster-Shafer theory and possibility theory. More recent studies show that the framework of VBS is also appropriate for representing and solving Bayesian decision problems and optimization problems. In this paper after introducing the valuation based system (VBS) framework, we present Markov-like properties of VBS and a method for resolving queries to VBS.
AIDec 14, 2018
Factorization of Dempster-Shafer Belief Functions Based on DataAndrzej Matuszewski, Mieczysław A. Kłopotek
One important obstacle in applying Dempster-Shafer Theory (DST) is its relationship to frequencies. In particular, there exist serious difficulties in finding factorizations of belief functions from data. In probability theory factorizations are usually related to notion of (conditional) independence and their possibility tested accordingly. However, in DST conditional belief distributions prove to be non-proper belief functions (that is ones connected with negative "frequencies"). This makes statistical testing of potential conditional independencies practically impossible, as no coherent interpretation could be found so far for negative belief function values. In this paper a novel attempt is made to overcome this difficulty. In the proposal no conditional beliefs are calculated, but instead a new measure F is introduced within the framework of DST, closely related to conditional independence, allowing to apply conventional statistical tests for detection of dependence/independence.
AIDec 7, 2018
On Marginally Correct Approximations of Dempster-Shafer Belief Functions from DataMieczysław A. Kłopotek, Sławomir T. Wierzchoń
Mathematical Theory of Evidence (MTE), a foundation for reasoning under partial ignorance, is blamed to leave frequencies outside (or aside of) its framework. The seriousness of this accusation is obvious: no experiment may be run to compare the performance of MTE-based models of real world processes against real world data. In this paper we consider this problem from the point of view of conditioning in the MTE. We describe the class of belief functions for which marginal consistency with observed frequencies may be achieved and conditional belief functions are proper belief functions,%\ and deal with implications for (marginal) approximation of general belief functions by this class of belief functions and for inference models in MTE.
CVDec 7, 2018
Rigid Body Structure and Motion From Two-Frame Point-Correspondences Under Perspective ProjectionMieczysław A. Kłopotek
This paper is concerned with possibility of recovery of motion and structure parameters from multiframes under perspective projection when only points on a rigid body are traced. Free (unrestricted and uncontrolled) pattern of motion between frames is assumed. The major question is how many points and/or how many frames are necessary for the task. It has been shown in an earlier paper {Klopotek:95b} that for orthogonal projection two frames are insufficient for the task. The paper demonstrates that, under perspective projection, that total uncertainty about relative position of focal point versus projection plane makes the recovery of structure and motion from two frames impossible.
CVNov 30, 2018
Structure and Motion from MultiframesMieczysław A. Kłopotek
The paper gives an overview of the problems and methods of recovery of structure and motion parameters of rigid bodies from multiframes.
AIJun 6, 2018
Dempsterian-Shaferian Belief Network From DataMieczysław A. Kłopotek
Shenoy and Shafer {Shenoy:90} demonstrated that both for Dempster-Shafer Theory and probability theory there exists a possibility to calculate efficiently marginals of joint belief distributions (by so-called local computations) provided that the joint distribution can be decomposed (factorized) into a belief network. A number of algorithms exists for decomposition of probabilistic joint belief distribution into a bayesian (belief) network from data. For example Spirtes, Glymour and Schein{Spirtes:90b} formulated a Conjecture that a direct dependence test and a head-to-head meeting test would suffice to construe bayesian network from data in such a way that Pearl's concept of d-separation {Geiger:90} applies. This paper is intended to transfer Spirtes, Glymour and Scheines {Spirtes:90b} approach onto the ground of the Dempster-Shafer Theory (DST). For this purpose, a frequentionistic interpretation of the DST developed in {Klopotek:93b} is exploited. A special notion of conditionality for DST is introduced and demonstrated to behave with respect to Pearl's d-separation {Geiger:90} much the same way as conditional probability (though some differences like non-uniqueness are evident). Based on this, an algorithm analogous to that from {Spirtes:90b} is developed. The notion of a partially oriented graph (pog) is introduced and within this graph the notion of p-d-separation is defined. If direct dependence test and head-to-head meeting test are used to orient the pog then its p-d-separation is shown to be equivalent to the Pearl's d-separation for any compatible dag.
AIMay 30, 2018
Too Fast Causal Inference under Causal InsufficiencyMieczysław A. Kłopotek
Causally insufficient structures (models with latent or hidden variables, or with confounding etc.) of joint probability distributions have been subject of intense study not only in statistics, but also in various AI systems. In AI, belief networks, being representations of joint probability distribution with an underlying directed acyclic graph structure, are paid special attention due to the fact that efficient reasoning (uncertainty propagation) methods have been developed for belief network structures. Algorithms have been therefore developed to acquire the belief network structure from data. As artifacts due to variable hiding negatively influence the performance of derived belief networks, models with latent variables have been studied and several algorithms for learning belief network structure under causal insufficiency have also been developed. Regrettably, some of them are known already to be erroneous (e.g. IC algorithm of [Pearl:Verma:91]. This paper is devoted to another algorithm, the Fast Causal Inference (FCI) Algorithm of [Spirtes:93]. It is proven by a specially constructed example that this algorithm, as it stands in [Spirtes:93], is also erroneous. Fundamental reason for failure of this algorithm is the temporary introduction of non-real links between nodes of the network with the intention of later removal. While for trivial dependency structures these non-real links may be actually removed, this may not be the case for complex ones, e.g. for the case described in this paper. A remedy of this failure is proposed.
AIJul 13, 2017
On (Anti)Conditional Independence in Dempster-Shafer TheoryMieczysław A. Kłopotek
This paper verifies a result of {Shenoy:94} concerning graphoidal structure of Shenoy's notion of independence for Dempster-Shafer theory of belief functions. Shenoy proved that his notion of independence has graphoidal properties for positive normal valuations. The requirement of strict positive normal valuations as prerequisite for application of graphoidal properties excludes a wide class of DS belief functions. It excludes especially so-called probabilistic belief functions. It is demonstrated that the requirement of positiveness of valuation may be weakened in that it may be required that commonality function is non-zero for singleton sets instead, and the graphoidal properties for independence of belief function variables are then preserved. This means especially that probabilistic belief functions with all singleton sets as focal points possess graphoidal properties for independence.
AIJul 13, 2017
Fast Restricted Causal InferenceMieczysław A. Kłopotek
Hidden variables are well known sources of disturbance when recovering belief networks from data based only on measurable variables. Hence models assuming existence of hidden variables are under development. This paper presents a new algorithm "accelerating" the known CI algorithm of Spirtes, Glymour and Scheines {Spirtes:93}. We prove that this algorithm does not produces (conditional) independencies not present in the data if statistical independence test is reliable. This result is to be considered as non-trivial since e.g. the same claim fails to be true for FCI algorithm, another "accelerator" of CI, developed in {Spirtes:93}.
AIJul 12, 2017
Identification and Interpretation of Belief Structure in Dempster-Shafer TheoryMieczysław A. Kłopotek
Mathematical Theory of Evidence called also Dempster-Shafer Theory (DST) is known as a foundation for reasoning when knowledge is expressed at various levels of detail. Though much research effort has been committed to this theory since its foundation, many questions remain open. One of the most important open questions seems to be the relationship between frequencies and the Mathematical Theory of Evidence. The theory is blamed to leave frequencies outside (or aside of) its framework. The seriousness of this accusation is obvious: (1) no experiment may be run to compare the performance of DST-based models of real world processes against real world data, (2) data may not serve as foundation for construction of an appropriate belief model. In this paper we develop a frequentist interpretation of the DST bringing to fall the above argument against DST. An immediate consequence of it is the possibility to develop algorithms acquiring automatically DST belief models from data. We propose three such algorithms for various classes of belief model structures: for tree structured belief networks, for poly-tree belief networks and for general type belief networks.
AIJul 12, 2017
Independence, Conditionality and Structure of Dempster-Shafer Belief FunctionsMieczysław A. Kłopotek
Several approaches of structuring (factorization, decomposition) of Dempster-Shafer joint belief functions from literature are reviewed with special emphasis on their capability to capture independence from the point of view of the claim that belief functions generalize bayes notion of probability. It is demonstrated that Zhu and Lee's {Zhu:93} logical networks and Smets' {Smets:93} directed acyclic graphs are unable to capture statistical dependence/independence of bayesian networks {Pearl:88}. On the other hand, though Shenoy and Shafer's hypergraphs can explicitly represent bayesian network factorization of bayesian belief functions, they disclaim any need for representation of independence of variables in belief functions. Cano et al. {Cano:93} reject the hypergraph representation of Shenoy and Shafer just on grounds of missing representation of variable independence, but in their frameworks some belief functions factorizable in Shenoy/Shafer framework cannot be factored. The approach in {Klopotek:93f} on the other hand combines the merits of both Cano et al. and of Shenoy/Shafer approach in that for Shenoy/Shafer approach no simpler factorization than that in {Klopotek:93f} approach exists and on the other hand all independences among variables captured in Cano et al. framework and many more are captured in {Klopotek:93f} approach.%
AIJun 30, 2017
Restricted Causal Inference AlgorithmMieczysław A. Kłopotek
This paper proposes a new algorithm for recovery of belief network structure from data handling hidden variables. It consists essentially in an extension of the CI algorithm of Spirtes et al. by restricting the number of conditional dependencies checked up to k variables and in an extension of the original CI by additional steps transforming so called partial including path graph into a belief network. Its correctness is demonstrated.
AIJun 8, 2017
Evidence Against Evidence Theory (?!)Mieczysław A. Kłopotek, Andrzej Matuszewski
This paper is concerned with the apparent greatest weakness of the Mathematical Theory of Evidence (MTE) of Shafer \cite{Shafer:76}, which has been strongly criticized by Wasserman \cite{Wasserman:92ijar} - the relationship to frequencies. Weaknesses of various proposals of probabilistic interpretation of MTE belief functions are demonstrated. A new frequency-based interpretation is presented overcoming various drawbacks of earlier interpretations.
AIJun 8, 2017
What Does a Belief Function Believe In ?Andrzej Matuszewski, Mieczysław A. Kłopotek
The conditioning in the Dempster-Shafer Theory of Evidence has been defined (by Shafer \cite{Shafer:90} as combination of a belief function and of an "event" via Dempster rule. On the other hand Shafer \cite{Shafer:90} gives a "probabilistic" interpretation of a belief function (hence indirectly its derivation from a sample). Given the fact that conditional probability distribution of a sample-derived probability distribution is a probability distribution derived from a subsample (selected on the grounds of a conditioning event), the paper investigates the empirical nature of the Dempster- rule of combination. It is demonstrated that the so-called "conditional" belief function is not a belief function given an event but rather a belief function given manipulation of original empirical data.\\ Given this, an interpretation of belief function different from that of Shafer is proposed. Algorithms for construction of belief networks from data are derived for this interpretation.
CVMay 11, 2017
Distribution of degrees of freedom over structure and motion of rigid bodiesMieczysław A. Kłopotek
This paper is concerned with recovery of motion and structure parameters from multiframes under orthogonal projection when only points are traced. The main question is how many points and/or how many frames are necessary for the task. It is demonstrated that 3 frames and 3 points are the absolute minimum. Closed-form solution is presented. Furthermore, it is shown that the task may be linearized if either four points or four frames are available. It is demonstrated that no increase in the number of points may lead to recovery of structure and motion parameters from two frames only. It is shown that instead the increase in the number of points may support the task of tracing the points from frame to frame.
LGApr 24, 2017
An Aposteriorical Clusterability Criterion for $k$-Means++ and Simplicity of ClusteringMieczysław A. Kłopotek
We define the notion of a well-clusterable data set combining the point of view of the objective of $k$-means clustering algorithm (minimising the centric spread of data elements) and common sense (clusters shall be separated by gaps). We identify conditions under which the optimum of $k$-means objective coincides with a clustering under which the data is separated by predefined gaps. We investigate two cases: when the whole clusters are separated by some gap and when only the cores of the clusters meet some separation condition. We overcome a major obstacle in using clusterability criteria due to the fact that known approaches to clusterability checking had the disadvantage that they are related to the optimal clustering which is NP hard to identify. Compared to other approaches to clusterability, the novelty consists in the possibility of an a posteriori (after running $k$-means) check if the data set is well-clusterable or not. As the $k$-means algorithm applied for this purpose has polynomial complexity so does therefore the appropriate check. Additionally, if $k$-means++ fails to identify a clustering that meets clusterability criteria, with high probability the data is not well-clusterable.
CVApr 18, 2017
A Comment on "Analysis of Video Image Sequences Using Point and Line Correspondences"Mieczysław A. Kłopotek
In this paper we would like to deny the results of Wang et al. raising two fundamental claims: * A line does not contribute anything to recognition of motion parameters from two images * Four traceable points are not sufficient to recover motion parameters from two perspective To be constructive, however, we show that four traceable points are sufficient to recover motion parameters from two frames under orthogonal projection and that five points are sufficient to simplify the solution of the two-frame problem under orthogonal projection to solving a linear equation system.
AIApr 12, 2017
Beliefs in Markov Trees - From Local Computations to Local ValuationMieczysław A. Kłopotek
This paper is devoted to expressiveness of hypergraphs for which uncertainty propagation by local computations via Shenoy/Shafer method applies. It is demonstrated that for this propagation method for a given joint belief distribution no valuation of hyperedges of a hypergraph may provide with simpler hypergraph structure than valuation of hyperedges by conditional distributions. This has vital implication that methods recovering belief networks from data have no better alternative for finding the simplest hypergraph structure for belief propagation. A method for recovery tree-structured belief networks has been developed and specialized for Dempster-Shafer belief functions
CVApr 11, 2017
Reconstruction of~3-D Rigid Smooth Curves Moving Free when Two Traceable Points Only are AvailableMieczysław A. Kłopotek
This paper extends previous research in that sense that for orthogonal projections of rigid smooth (true-3D) curves moving totally free it reduces the number of required traceable points to two only (the best results known so far to the author are 3 points from free motion and 2 for motion restricted to rotation around a fixed direction and and 2 for motion restricted to influence of a homogeneous force field). The method used is exploitation of information on tangential projections. It discusses also possibility of simplification of reconstruction of flat curves moving free for prospective projections.
AIApr 11, 2017
Beliefs and Probability in Bacchus' l.p. Logic: A~3-Valued Logic Solution to Apparent Counter-intuitionMieczysław A. Kłopotek
Fundamental discrepancy between first order logic and statistical inference (global versus local properties of universe) is shown to be the obstacle for integration of logic and probability in L.p. logic of Bacchus. To overcome the counterintuitiveness of L.p. behaviour, a 3-valued logic is proposed.
AIApr 8, 2017
Basic Formal Properties of A Relational Model of The Mathematical Theory of EvidenceMieczysław A. Kłopotek, Sławomir T. Wierzchoń
The paper presents a novel view of the Dempster-Shafer belief function as a measure of diversity in relational data bases. It is demonstrated that under the interpretation The Dempster rule of evidence combination corresponds to the join operator of the relational database theory. This rough-set based interpretation is qualitative in nature and can represent a number of belief function operators. The interpretation has the property that Given a definition of the belief measure of objects in the interpretation domain we can perform operations in this domain and the measure of the resulting object is derivable from measures of component objects via belief operator. We demonstrated this property for Dempster rule of combination, marginalization, Shafer's conditioning, independent variables, Shenoy's notion of conditional independence of variables. The interpretation is based on rough sets (in connection with decision tables), but differs from previous interpretations of this type in that it counts the diversity rather than frequencies in a decision table.
DSMar 4, 2017
Machine Learning Friendly Set Version of Johnson-Lindenstrauss LemmaMieczysław A. Kłopotek
In this paper we make a novel use of the Johnson-Lindenstrauss Lemma. The Lemma has an existential form saying that there exists a JL transformation $f$ of the data points into lower dimensional space such that all of them fall into predefined error range $δ$. We formulate in this paper a theorem stating that we can choose the target dimensionality in a random projection type JL linear transformation in such a way that with probability $1-ε$ all of them fall into predefined error range $δ$ for any user-predefined failure probability $ε$. This result is important for applications such a data clustering where we want to have a priori dimensionality reducing transformation instead of trying out a (large) number of them, as with traditional Johnson-Lindenstrauss Lemma. In particular, we take a closer look at the $k$-means algorithm and prove that a good solution in the projected space is also a good solution in the original space. Furthermore, under proper assumptions local optima in the original space are also ones in the projected space. We define also conditions for which clusterability property of the original space is transmitted to the projected space, so that special case algorithms for the original space are also applicable in the projected space.
LGFeb 20, 2017
On the Consistency of $k$-means++ algorithmMieczysław A. Kłopotek
We prove in this paper that the expected value of the objective function of the $k$-means++ algorithm for samples converges to population expected value. As $k$-means++, for samples, provides with constant factor approximation for $k$-means objectives, such an approximation can be achieved for the population with increase of the sample size. This result is of potential practical relevance when one is considering using subsampling when clustering large data sets (large data bases).
LGJan 19, 2017
Validity of Clusters Produced By kernel-$k$-means With Kernel-TrickMieczysław A. Kłopotek
This paper corrects the proof of the Theorem 2 from the Gower's paper \cite[page 5]{Gower:1982} as well as corrects the Theorem 7 from Gower's paper \cite{Gower:1986}. The first correction is needed in order to establish the existence of the kernel function used commonly in the kernel trick e.g. for $k$-means clustering algorithm, on the grounds of distance matrix. The correction encompasses the missing if-part proof and dropping unnecessary conditions. The second correction deals with transformation of the kernel matrix into a one embeddable in Euclidean space.
IRJan 16, 2017
Semantic classifier approach to document classificationPiotr Borkowski, Krzysztof Ciesielski, Mieczysław A. Kłopotek
In this paper we propose a new document classification method, bridging discrepancies (so-called semantic gap) between the training set and the application sets of textual data. We demonstrate its superiority over classical text classification approaches, including traditional classifier ensembles. The method consists in combining a document categorization technique with a single classifier or a classifier ensemble (SEMCOM algorithm - Committee with Semantic Categorizer).
CLMay 10, 2016
Grammatical Case Based IS-A Relation Extraction with Boosting for PolishPaweł Łoziński, Dariusz Czerski, Mieczysław A. Kłopotek
Pattern-based methods of IS-A relation extraction rely heavily on so called Hearst patterns. These are ways of expressing instance enumerations of a class in natural language. While these lexico-syntactic patterns prove quite useful, they may not capture all taxonomical relations expressed in text. Therefore in this paper we describe a novel method of IS-A relation extraction from patterns, which uses morpho-syntactical annotations along with grammatical case of noun phrases that constitute entities participating in IS-A relation. We also describe a method for increasing the number of extracted relations that we call pseudo-subclass boosting which has potential application in any pattern-based relation extraction method. Experiments were conducted on a corpus of about 0.5 billion web documents in Polish language.