CRMay 21
Exact Bias of Linear TRNG Correctors -- Spectral ApproachMaciej Skorski, Francisco-Javier Soto, Onur Günlü
Using Fourier analysis, this paper establishes near-optimal security bounds for linear correctors commonly used in True Random Number Generators (TRNGs), expressed through code weight enumerators and input bias parameters. We provide the first near-tight bias characterization in total variation, by interpolating between optimal $\ell_\infty$ and $\ell_2$ norm results. Our bounds improve security assessments by an order of magnitude over previously known (overly conservative) estimates. Across $\sim $20,000 codes, we examine fundamental trade-offs between compression efficiency, cryptographic security, and hardware complexity. Achieving 80-bit security with 10\% input bias typically requires sacrificing more than 50\% of the code rate and incurs increased hardware cost. This quantifies the inherent cost of randomness extraction in hardware TRNG implementations.
CYJul 20, 2023Code
What Twitter Data Tell Us about the Future?Alina Landowska, Marek Robak, Maciej Skorski
Anticipation is a fundamental human cognitive ability that involves thinking about and living towards the future. While language markers reflect anticipatory thinking, research on anticipation from the perspective of natural language processing is limited. This study aims to investigate the futures projected by futurists on Twitter and explore the impact of language cues on anticipatory thinking among social media users. We address the research questions of what futures Twitter's futurists anticipate and share, and how these anticipated futures can be modeled from social data. To investigate this, we review related works on anticipation, discuss the influence of language markers and prestigious individuals on anticipatory thinking, and present a taxonomy system categorizing futures into "present futures" and "future present". This research presents a compiled dataset of over 1 million publicly shared tweets by future influencers and develops a scalable NLP pipeline using SOTA models. The study identifies 15 topics from the LDA approach and 100 distinct topics from the BERTopic approach within the futurists' tweets. These findings contribute to the research on topic modelling and provide insights into the futures anticipated by Twitter's futurists. The research demonstrates the futurists' language cues signals futures-in-the-making that enhance social media users to anticipate their own scenarios and respond to them in present. The fully open-sourced dataset, interactive analysis, and reproducible source code are available for further exploration.
CLMay 21
Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations CorporaMaciej Skorski
Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using $\sim$50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning -- with mean cosine similarity of 0.86 and AUC gaps of 0.01--0.02 across all foundations closing further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
LGMar 21, 2023
Exact Non-Oblivious Performance of Rademacher Random EmbeddingsMaciej Skorski, Alessandro Temperoni
This paper revisits the performance of Rademacher random projections, establishing novel statistical guarantees that are numerically sharp and non-oblivious with respect to the input data. More specifically, the central result is the Schur-concavity property of Rademacher random projections with respect to the inputs. This offers a novel geometric perspective on the performance of random projections, while improving quantitatively on bounds from previous works. As a corollary of this broader result, we obtained the improved performance on data which is sparse or is distributed with small spread. This non-oblivious analysis is a novelty compared to techniques from previous work, and bridges the frequently observed gap between theory and practise. The main result uses an algebraic framework for proving Schur-concavity properties, which is a contribution of independent interest and an elegant alternative to derivative-based criteria.
AIJul 20, 2024
Mapping the Technological Future: A Topic, Sentiment, and Emotion Analysis in Social Media DiscourseAlina Landowska, Maciej Skorski, Krzysztof Rajda
People worldwide are currently confronted with a number of technological challenges, which act as a potent source of uncertainty. The uncertainty arising from the volatility and unpredictability of technology (such as AI) and its potential consequences is widely discussed on social media. This study uses BERTopic modelling along with sentiment and emotion analysis on 1.5 million tweets from 2021 to 2023 to identify anticipated tech-driven futures and capture the emotions communicated by 400 key opinion leaders (KOLs). Findings indicate positive sentiment significantly outweighs negative, with a prevailing dominance of positive anticipatory emotions. Specifically, the 'Hope' score is approximately 10.33\% higher than the median 'Anxiety' score. KOLs emphasize 'Optimism' and benefits over 'Pessimism' and challenges. The study emphasizes the important role KOLs play in shaping future visions through anticipatory discourse and emotional tone during times of technological uncertainty.
CLAug 19, 2025
Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values UnderstandingMaciej Skorski, Alina Landowska
How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25\% of human annotators, performing much better than average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
CLJul 24, 2025
The Moral Gap of Large Language ModelsMaciej Skorski, Alina Landowska
Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
SIMar 25, 2025
Mapping Technological Futures: Anticipatory Discourse Through Text MiningMaciej Skorski, Alina Landowska, Krzysztof Rajda
The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing \textit{present futures} -- optimistic visions of transformative technologies like AI and IoT -- and influencing \textit{future presents}, where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like ``Machine Learning, Data Science, and Deep Learning,'' while discussions around ``Climate Change'' and ``War, Ukraine, and Trump People'' elicit \textit{Anxiety}. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.
LGFeb 22, 2022
Robust and Provable Guarantees for Sparse Random EmbeddingsMaciej Skorski, Alessandro Temperoni, Martin Theobald
In this work, we improve upon the guarantees for sparse random embeddings, as they were recently provided and analyzed by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'19). Specifically, we show that (a) our bounds are explicit as opposed to the asymptotic guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants across a wide range of parameters, including the dimensionality, sparsity and dispersion of the data. Moreover, we empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets, such as collections of images, text documents represented as bags-of-words, and text sequences vectorized by neural embeddings. Behind our numerical improvements are techniques of broader interest, which improve upon key steps of previous analyses in terms of (c) tighter estimates for certain types of quadratic chaos, (d) establishing extreme properties of sparse linear forms, and (e) improvements on bounds for the estimation of sums of independent random variables.
MLApr 14, 2021
Mean-Squared Accuracy of Good-Turing EstimatorMaciej Skorski
The brilliant method due to Good and Turing allows for estimating objects not occurring in a sample. The problem, known under names "sample coverage" or "missing mass" goes back to their cryptographic work during WWII, but over years has found has many applications, including language modeling, inference in ecology and estimation of distribution properties. This work characterizes the maximal mean-squared error of the Good-Turing estimator, for any sample \emph{and} alphabet size.
LGApr 6, 2021
Confidence-Optimal Random EmbeddingsMaciej Skorski
The seminal result of Johnson and Lindenstrauss on random embeddings has been intensively studied in applied and theoretical computer science. Despite that vast body of literature, we still lack of complete understanding of statistical properties of random projections; a particularly intriguing question is: why are the theoretical bounds that far behind the empirically observed performance? Motivated by this question, this work develops Johnson-Lindenstrauss distributions with optimal, data-oblivious, statistical confidence bounds. These bounds are numerically best possible, for any given data dimension, embedding dimension, and distortion tolerance. They improve upon prior works in terms of statistical accuracy, as well as exactly determine the no-go regimes for data-oblivious approaches. Furthermore, the corresponding projection matrices are efficiently samplable. The construction relies on orthogonal matrices, and the proof uses certain elegant properties of the unit sphere. The following techniques introduced in this work are of independent interest: a) a compact expression for distortion in terms of singular eigenvalues of the projection matrix, b) a parametrization linking the unit sphere and the Dirichlet distribution and c) anti-concentration bounds for the Dirichlet distribution. Besides the technical contribution, the paper presents applications and numerical evaluation along with working implementation in Python.
LGDec 31, 2020
Random Embeddings with Optimal AccuracyMaciej Skorski
This work constructs Jonson-Lindenstrauss embeddings with best accuracy, as measured by variance, mean-squared error and exponential concentration of the length distortion. Lower bounds for any data and embedding dimensions are determined, and accompanied by matching and efficiently samplable constructions (built on orthogonal matrices). Novel techniques: a unit sphere parametrization, the use of singular-value latent variables and Schur-convexity are of independent interest.
STDec 23, 2020
A Modern Analysis of Hutchinson's Trace EstimatorMaciej Skorski
The paper establishes the new state-of-art in the accuracy analysis of Hutchinson's trace estimator. Leveraging tools that have not been previously used in this context, particularly hypercontractive inequalities and concentration properties of sub-gamma distributions, we offer an elegant and modular analysis, as well as numerically superior bounds. Besides these improvements, this work aims to better popularize the aforementioned techniques within the CS community.
STAug 20, 2020
Simple Analysis of Johnson-Lindenstrauss Transform under Neuroscience ConstraintsMaciej Skorski
The paper re-analyzes a version of the celebrated Johnson-Lindenstrauss Lemma, in which matrices are subjected to constraints that naturally emerge from neuroscience applications: a) sparsity and b) sign-consistency. This particular variant was studied first by Allen-Zhu, Gelashvili, Micali, Shavit and more recently by Jagadeesan (RANDOM'19). The contribution of this work is a novel proof, which in contrast to previous works a) uses the modern probability toolkit, particularly basics of sub-gaussian and sub-gamma estimates b) is self-contained, with no dependencies on subtle third-party results c) offers explicit constants. At the heart of our proof is a novel variant of Hanson-Wright Lemma (on concentration of quadratic forms). Of independent interest are also auxiliary facts on sub-gaussian random variables.
STMay 19, 2020
Revisiting Concentration of Missing MassMaciej Skorski
We revisit the problem of \emph{missing mass concentration}, developing a new method of estimating concentration of heterogenic sums, in spirit of celebrated Rosenthal's inequality. As a result we slightly improve the state-of-art bounds due to Ben-Hamou at al., and simplify the proofs.
LGApr 20, 2020
Revisiting Initialization of Neural NetworksMaciej Skorski, Alessandro Temperoni, Martin Theobald
The proper initialization of weights is crucial for the effective training and fast convergence of deep neural networks (DNNs). Prior work in this area has mostly focused on balancing the variance among weights per layer to maintain stability of (i) the input data propagated forwards through the network and (ii) the loss gradients propagated backwards, respectively. This prevalent heuristic is however agnostic of dependencies among gradients across the various layers and captures only firstorder effects. In this paper, we propose and discuss an initialization principle that is based on a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix. The proposed approach is more systematic and recovers previous results for DNN activations such as smooth functions, dropouts, and ReLU. Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool which helps to more rigorously initialize weights
LGApr 10, 2020
Efficient Sampled Softmax for TensorflowMaciej Skorski
This short paper discusses an efficient implementation of \emph{sampled softmax loss} for Tensorflow. The speedup over the default implementation is achieved due to simplification of the graph for the forward and backward passes.
OTFeb 28, 2019
Bounds on Bayes Factors for Binomial A/B TestingMaciej Skorski
Bayes factors, in many cases, have been proven to bridge the classic -value based significance testing and bayesian analysis of posterior odds. This paper discusses this phenomena within the binomial A/B testing setup (applicable for example to conversion testing). It is shown that the bayes factor is controlled by the \emph{Jensen-Shannon divergence} of success ratios in two tested groups, which can be further bounded by the Welch statistic. As a result, bayesian sample bounds almost match frequentionist's sample bounds. The link between Jensen-Shannon divergence and Welch's test as well as the derivation are an elegant application of tools from information geometry.
STJan 2, 2019
Kernel Density Estimation Bias under Minimal AssumptionsMaciej Skorski
Kernel Density Estimation is a very popular technique of approximating a density function from samples. The accuracy is generally well-understood and depends, roughly speaking, on the kernel decay and local smoothness of the true density. However concrete statements in the literature are often invoked in very specific settings (simplified or overly conservative assumptions) or miss important but subtle points (e.g. it is common to heuristically apply Taylor's expansion globally without referring to compactness). The contribution of this paper is twofold (a) we demonstrate that, when the bandwidth is an arbitrary invertible matrix going to zero, it is necessary to keep a certain balance between the \emph{kernel decay} and \emph{magnitudes of bandwidth eigenvalues}; in fact, without the sufficient decay the estimates may not be even bounded (b) we give a rigorous derivation of bounds with explicit constants for the bias, under possibly minimal assumptions. This connects the kernel decay, bandwidth norm, bandwidth determinant and density smoothness. It has been folklore that the issue with Taylor's formula can be fixed with more complicated assumptions on the density (for example p. 95 of "Kernel Smoothing" by Wand and Jones); we show that this is actually not necessary and can be handled by the kernel decay alone.
LGAug 13, 2018
Simple Root Cause Analysis by Separable LikelihoodsMaciej Skorski
Root Cause Analysis for Anomalies is challenging because of the trade-off between the accuracy and its explanatory friendliness, required for industrial applications. In this paper we propose a framework for simple and friendly RCA within the Bayesian regime under certain restrictions (that Hessian at the mode is diagonal, here referred to as \emph{separability}) imposed on the predictive posterior. We show that this assumption is satisfied for important base models, including Multinomal, Dirichlet-Multinomial and Naive Bayes. To demonstrate the usefulness of the framework, we embed it into the Bayesian Net and validate on web server error logs (real world data set).
CRApr 27, 2017
Non-Uniform Attacks Against PseudoentropyKrzysztof Pietrzak, Maciej Skorski
De, Trevisan and Tulsiani [CRYPTO 2010] show that every distribution over $n$-bit strings which has constant statistical distance to uniform (e.g., the output of a pseudorandom generator mapping $n-1$ to $n$ bit strings), can be distinguished from the uniform distribution with advantage $ε$ by a circuit of size $O( 2^nε^2)$. We generalize this result, showing that a distribution which has less than $k$ bits of min-entropy, can be distinguished from any distribution with $k$ bits of $δ$-smooth min-entropy with advantage $ε$ by a circuit of size $O(2^kε^2/δ^2)$. As a special case, this implies that any distribution with support at most $2^k$ (e.g., the output of a pseudoentropy generator mapping $k$ to $n$ bit strings) can be distinguished from any given distribution with min-entropy $k+1$ with advantage $ε$ by a circuit of size $O(2^kε^2)$. Our result thus shows that pseudoentropy distributions face basically the same non-uniform attacks as pseudorandom distributions.
CRJun 22, 2015
A New Approximate Min-Max Theorem with Applications in CryptographyMaciej Skorski
We propose a novel proof technique that can be applied to attack a broad class of problems in computational complexity, when switching the order of universal and existential quantifiers is helpful. Our approach combines the standard min-max theorem and convex approximation techniques, offering quantitative improvements over the standard way of using min-max theorems as well as more concise and elegant proofs.
CRMay 25, 2015
A Time-Success Ratio Analysis of wPRF-based Leakage-Resilient Stream CiphersMaciej Skorski
Weak pseudorandom functions (wPRFs) found an important application as main building blocks for leakage-resilient ciphers (EUROCRYPT'09). Several security bounds, based on different techniques, were given to these stream ciphers. The security loss in these reduction-based proofs is always polynomial, but has not been studied in detail. The aim of this paper is twofold. First, we present a clear comparison of quantitatively different security bounds in the literature. Second, we revisit the current proof techniques and answer the natural question of how far we are from meaningful and provable security guarantees, when instantiating weak PRFs with standard primitives (block ciphers or hash functions). In particular, we demonstrate a flaw in the recent (TCC'14) analysis of the EUROCRYPT'09 stream cipher, which means that we still don't know if it offers provable security when instantiated with a standard block cipher. Our approach is a \emph{time-to-success Ratio} analysis, a universal measure introduced by Luby, which allow us to compare different security bounds.
CRApr 28, 2015
Condensed UnpredictabilityMaciej Skorski, Alexander Golovnev, Krzysztof Pietrzak
We consider the task of deriving a key with high HILL entropy from an unpredictable source. Previous to this work, the only known way to transform unpredictability into a key that was $\eps$ indistinguishable from having min-entropy was via pseudorandomness, for example by Goldreich-Levin (GL) hardcore bits. This approach has the inherent limitation that from a source with $k$ bits of unpredictability entropy one can derive a key of length (and thus HILL entropy) at most $k-2\log(1/ε)$ bits. In many settings, e.g. when dealing with biometric data, such a $2\log(1/ε)$ bit entropy loss in not an option. Our main technical contribution is a theorem that states that in the high entropy regime, unpredictability implies HILL entropy. The loss in circuit size in this argument is exponential in the entropy gap $d$. To overcome the above restriction, we investigate if it's possible to first "condense" unpredictability entropy and make the entropy gap small. We show that any source with $k$ bits of unpredictability can be condensed into a source of length $k$ with $k-3$ bits of unpredictability entropy. Our condenser simply "abuses" the GL construction and derives a $k$ bit key from a source with $k$ bits of unpredicatibily. The original GL theorem implies nothing when extracting that many bits, but we show that in this regime, GL still behaves like a "condenser" for unpredictability. This result comes with two caveats (1) the loss in circuit size is exponential in $k$ and (2) we require that the source we start with has \emph{no} HILL entropy (equivalently, one can efficiently check if a guess is correct). We leave it as an intriguing open problem to overcome these restrictions or to prove they're inherent.
CRApr 9, 2015
Lower bounds on $q$-wise independence tails and applications to min-entropy condensersMaciej Skorski
We present novel and sharp lower bounds for higher load moments in the classical problem of mapping $M$ balls into $N$ bins by $q$-universal hashing, specialized to the case when $M=N$. As a corollary we prove a tight counterpart for the result about min-entropy condensers due to Dodis, Pietrzak and Wichs (CRYPTO'14), which has found important applications in key derivation. It states that condensing $k$ bits of min-entropy into a $k$-bit string $ε$-close to almost full min-entropy (precisely $ k-\log\log(1/ε)$ bits of entropy) can be achieved by the use of $q$-independent hashing with $q= \log(1/ε)$. We prove that when given a source of min-entropy $k$ and aiming at entropy loss $\ell = \log\log (1/ε) - 3$, the independence level $q=(1-o(1))\log(1/ε)$ is necessary (for small values of $ε$), which almost matches the positive result. Besides these asymptotic bounds, we provide clear hard bounds in terms of Bell numbers and some numerical examples. Our technique is based on an explicit representation of the load moments in terms of Stirling numbers, some asymptotic estimates on Stirling numbers and a tricky application of the Paley-Zygmund inequality. \keywords{ min-entropy condensers, key derivation, balls and bins hashing, anti-concentration inequalities }
CRMar 2, 2015
Simulating Auxiliary Inputs, RevisitedMaciej Skorski
For any pair $(X,Z)$ of correlated random variables we can think of $Z$ as a randomized function of $X$. Provided that $Z$ is short, one can make this function computationally efficient by allowing it to be only approximately correct. In folklore this problem is known as \emph{simulating auxiliary inputs}. This idea of simulating auxiliary information turns out to be a powerful tool in computer science, finding applications in complexity theory, cryptography, pseudorandomness and zero-knowledge. In this paper we revisit this problem, achieving the following results: \begin{enumerate}[(a)] We discuss and compare efficiency of known results, finding the flaw in the best known bound claimed in the TCC'14 paper "How to Fake Auxiliary Inputs". We present a novel boosting algorithm for constructing the simulator. Our technique essentially fixes the flaw. This boosting proof is of independent interest, as it shows how to handle "negative mass" issues when constructing probability measures in descent algorithms. Our bounds are much better than bounds known so far. To make the simulator $(s,ε)$-indistinguishable we need the complexity $O\left(s\cdot 2^{5\ell}ε^{-2}\right)$ in time/circuit size, which is better by a factor $ε^{-2}$ compared to previous bounds. In particular, with our technique we (finally) get meaningful provable security for the EUROCRYPT'09 leakage-resilient stream cipher instantiated with a standard 256-bit block cipher, like $\mathsf{AES256}$.
ITFeb 8, 2013
Modulus Computational EntropyMaciej Skorski
The so-called {\em leakage-chain rule} is a very important tool used in many security proofs. It gives an upper bound on the entropy loss of a random variable $X$ in case the adversary who having already learned some random variables $Z_{1},\ldots,Z_{\ell}$ correlated with $X$, obtains some further information $Z_{\ell+1}$ about $X$. Analogously to the information-theoretic case, one might expect that also for the \emph{computational} variants of entropy the loss depends only on the actual leakage, i.e. on $Z_{\ell+1}$. Surprisingly, Krenn et al.\ have shown recently that for the most commonly used definitions of computational entropy this holds only if the computational quality of the entropy deteriorates exponentially in $|(Z_{1},\ldots,Z_{\ell})|$. This means that the current standard definitions of computational entropy do not allow to fully capture leakage that occurred "in the past", which severely limits the applicability of this notion. As a remedy for this problem we propose a slightly stronger definition of the computational entropy, which we call the \emph{modulus computational entropy}, and use it as a technical tool that allows us to prove a desired chain rule that depends only on the actual leakage and not on its history. Moreover, we show that the modulus computational entropy unifies other,sometimes seemingly unrelated, notions already studied in the literature in the context of information leakage and chain rules. Our results indicate that the modulus entropy is, up to now, the weakest restriction that guarantees that the chain rule for the computational entropy works. As an example of application we demonstrate a few interesting cases where our restricted definition is fulfilled and the chain rule holds.