ITJan 20, 2018
Analog-to-Digital Compression: A New Paradigm for Converting Signals to BitsAlon Kipnis, Yonina C. Eldar, Andrea J. Goldsmith
Processing, storing and communicating information that originates as an analog signal involves conversion of this information to bits. This conversion can be described by the combined effect of sampling and quantization, as illustrated in Fig. 1. The digital representation is achieved by first sampling the analog signal so as to represent it by a set of discrete-time samples and then quantizing these samples to a finite number of bits. Traditionally, these two operations are considered separately. The sampler is designed to minimize information loss due to sampling based on characteristics of the continuous-time input. The quantizer is designed to represent the samples as accurately as possible, subject to a constraint on the number of bits that can be used in the representation. The goal of this article is to revisit this paradigm by illuminating the dependency between these two operations. In particular, we explore the requirements on the sampling system subject to constraints on the available number of bits for storing, communicating or processing the analog information.
CLAug 25, 2023
EntropyRank: Unsupervised Keyphrase Extraction via Side-Information Optimization for Language Model-based Text CompressionAlexander Tsvetkov, Alon Kipnis
We propose an unsupervised method to extract keywords and keyphrases from texts based on a pre-trained language model (LM) and Shannon's information maximization. Specifically, our method extracts phrases having the highest conditional entropy under the LM. The resulting set of keyphrases turns out to solve a relevant information-theoretic problem: if provided as side information, it leads to the expected minimal binary code length in compressing the text using the LM and an entropy encoder. Alternately, the resulting set is an approximation via a causal LM to the set of phrases that minimize the entropy of the text when conditioned upon it. Empirically, the method provides results comparable to the most commonly used methods in various keyphrase extraction benchmark challenges.
ITAug 24, 2023
An Information-Theoretic Approach for Detecting Edits in AI-Generated TextIdan Kashtan, Alon Kipnis
We propose a method to determine whether a given article was written entirely by a generative language model or perhaps contains edits by a different author, possibly a human. Our process involves multiple tests for the origin of individual sentences or other pieces of text and combining these tests using a method that is sensitive to rare alternatives, i.e., non-null effects are few and scattered across the text in unknown locations. Interestingly, this method also identifies pieces of text suspected to contain edits. We demonstrate the effectiveness of the method in detecting edits through extensive evaluations using real data and provide an information-theoretic analysis of the factors affecting its success. In particular, we discuss optimality properties under a theoretical framework for text editing saying that sentences are generated mainly by the language model, except perhaps for a few sentences that might have originated via a different mechanism. Our analysis raises several interesting research questions at the intersection of information theory and data science.
CLOct 24, 2024
Critical biblical studies via word frequency analysis: unveiling text authorshipShira Faigenbaum-Golovin, Alon Kipnis, Axel Bühler et al.
The Bible, a product of an extensive and intricate process of oral-written transmission spanning centuries, obscures the contours of its earlier recensions. Debate rages over determining the existing layers and identifying the date of composition and historical background of the biblical texts. Traditional manual methodologies have grappled with authorship challenges through scrupulous textual criticism, employing linguistic, stylistic, inner-biblical, and historical criteria. Despite recent progress in computer-assisted analysis, many patterns still need to be uncovered in Biblical Texts. In this study, we address the question of authorship of biblical texts by employing statistical analysis to the frequency of words using a method that is particularly sensitive to deviations in frequencies associated with a few words out of potentially many. We aim to differentiate between three distinct authors across numerous chapters spanning the first nine books of the Bible. In particular, we examine 50 chapters labeled according to biblical exegesis considerations into three corpora (D, DtrH, and P). Without prior assumptions about author identity, our approach leverages subtle differences in word frequencies to distinguish among the three corpora and identify author-dependent linguistic properties. Our analysis indicates that the first two authors (D and DtrH) are much more closely related compared to P, a fact that aligns with expert assessments. Additionally, we attain high accuracy in attributing authorship by evaluating the similarity of each chapter with the reference corpora. This study sheds new light on the authorship of biblical texts by providing interpretable, statistically significant evidence that there are different linguistic characteristics of biblical authors and that these differences can be identified.
ITJan 9, 2020
Gaussian Approximation of Quantization Error for Estimation from Compressed DataAlon Kipnis, Galen Reeves
We consider the distributional connection between the lossy compressed representation of a high-dimensional signal $X$ using a random spherical code and the observation of $X$ under an additive white Gaussian noise (AWGN). We show that the Wasserstein distance between a bitrate-$R$ compressed version of $X$ and its observation under an AWGN-channel of signal-to-noise ratio $2^{2R}-1$ is sub-linear in the problem dimension. We utilize this fact to connect the risk of an estimator based on an AWGN-corrupted version of $X$ to the risk attained by the same estimator when fed with its bitrate-$R$ quantized version. We demonstrate the usefulness of this connection by deriving various novel results for inference problems under compression constraints, including minimax estimation, sparse regression, compressed sensing, and the universality of linear estimation in remote source coding.
CLOct 30, 2019
Higher Criticism for Discriminating Word-Frequency Tables and Testing AuthorshipAlon Kipnis
We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning; reporting accuracy at the state of the art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
ITJan 10, 2019
Mean Estimation from One-Bit MeasurementsAlon Kipnis, John C. Duchi
We consider the problem of estimating the mean of a symmetric log-concave distribution under the constraint that only a single bit per sample from this distribution is available to the estimator. We study the mean squared error as a function of the sample size (and hence the number of bits). We consider three settings: first, a centralized setting, where an encoder may release $n$ bits given a sample of size $n$, and for which there is no asymptotic penalty for quantization; second, an adaptive setting in which each bit is a function of the current observation and previously recorded bits, where we show that the optimal relative efficiency compared to the sample mean is precisely the efficiency of the median; lastly, we show that in a distributed setting where each bit is only a function of a local sample, no estimator can achieve optimal efficiency uniformly over the parameter space. We additionally complement our results in the adaptive setting by showing that \emph{one} round of adaptivity is sufficient to achieve optimal mean-square error.