ITJan 22, 2025
Guaranteed Recovery of Unambiguous ClustersKayvon Mazooji, Ilan Shomorony
Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the $K$-clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.
ITJan 28, 2021
Private DNA Sequencing: Hiding Information in Discrete NoiseKayvon Mazooji, Roy Dong, Ilan Shomorony
When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own DNA samples afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to the problem of finding a worst-case noise distribution for recovering $X$ from the noisy observation among a set of feasible discrete distributions. We characterize upper and lower bounds to the solution of this problem, which are empirically shown to be very close. The lower bound is obtained through a convex relaxation of the original discrete optimization problem, and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.