Tomasz Steifer

LG
h-index5
11papers
40citations
Novelty52%
AI Score50

11 Papers

LGMay 15Code
Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

Tomasz Steifer

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.

CCFeb 6, 2023
Find a witness or shatter: the landscape of computable PAC learning

Valentino Delle Rose, Alexander Kozachinskiy, Cristobal Rojas et al.

This paper contributes to the study of CPAC learnability -- a computable version of PAC learning -- by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypothesis which is properly CPAC learnable, but only with uncomputably fast growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.

LGAug 15, 2023
Simple online learning with consistent oracle

Alexander Kozachinskiy, Tomasz Steifer

We consider online learning in the model where a learning algorithm can access the class only via the \emph{consistent oracle} -- an oracle, that, at any moment, can give a function from the class that agrees with all examples seen so far. This model was recently considered by Assos et al.~(COLT'23). It is motivated by the fact that standard methods of online learning rely on computing the Littlestone dimension of subclasses, a computationally intractable problem. Assos et al.~gave an online learning algorithm in this model that makes at most $C^d$ mistakes on classes of Littlestone dimension $d$, for some absolute unspecified constant $C > 0$. We give a novel algorithm that makes at most $O(256^d)$ mistakes. Our proof is significantly simpler and uses only very basic properties of the Littlestone dimension. We also show that there exists no algorithm in this model that makes less than $3^d$ mistakes.

LGFeb 5
Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga

The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- or more generally -- which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY -- by showing that it cannot be done with only one layer and one head.

AINov 3, 2022
No Agreement Without Loss: Learning and Social Choice in Peer Review

Pablo Barceló, Mauricio Duarte, Cristóbal Rojas et al.

In peer review systems, reviewers are often asked to evaluate various features of submissions, such as technical quality or novelty. A score is given to each of the predefined features and based on these the reviewer has to provide an overall quantitative recommendation. It may be assumed that each reviewer has her own mapping from the set of features to a recommendation, and that different reviewers have different mappings in mind. This introduces an element of arbitrariness known as commensuration bias. In this paper we discuss a framework, introduced by Noothigattu, Shah and Procaccia, and then applied by the organizers of the AAAI 2022 conference. Noothigattu, Shah and Procaccia proposed to aggregate reviewer's mapping by minimizing certain loss functions, and studied axiomatic properties of this approach, in the sense of social choice theory. We challenge several of the results and assumptions used in their work and report a number of negative results. On the one hand, we study a trade-off between some of the axioms proposed and the ability of the method to properly capture agreements of the majority of reviewers. On the other hand, we show that dropping a certain unrealistic assumption has dramatic effects, including causing the method to be discontinuous.

GTDec 20, 2024
Optimal bounds for dissatisfaction in perpetual voting

Alexander Kozachinskiy, Alexander Shen, Tomasz Steifer

In perpetual voting, multiple decisions are made at different moments in time. Taking the history of previous decisions into account allows us to satisfy properties such as proportionality over periods of time. In this paper, we consider the following question: is there a perpetual approval voting method that guarantees that no voter is dissatisfied too many times? We identify a sufficient condition on voter behavior -- which we call 'bounded conflicts' condition -- under which a sublinear growth of dissatisfaction is possible. We provide a tight upper bound on the growth of dissatisfaction under bounded conflicts, using techniques from Kolmogorov complexity. We also observe that the approval voting with binary choices mimics the machine learning setting of prediction with expert advice. This allows us to present a voting method with sublinear guarantees on dissatisfaction under bounded conflicts, based on the standard techniques from prediction with expert advice.

LGJan 31, 2025
Strassen Attention, Split VC Dimension and Compositionality in Transformers

Alexander Kozachinskiy, Felipe Urrutia, Hector Jimenez et al.

We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

LGJan 22, 2025
Ehrenfeucht-Haussler Rank and Chain of Thought

Pablo Barceló, Alexander Kozachinskiy, Tomasz Steifer

The notion of \emph{rank} of a Boolean function has been a cornerstone in PAC learning theory, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of \emph{Chain of Thought} (CoT) steps required by a single-layer Transformer with hard attention to compute $f$. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that \(\ell\)-fold function composition necessitates exactly \(\ell\) CoT steps. Furthermore, we analyze the problem of identifying the position of the \(k\)-th occurrence of 1 in a Boolean sequence, proving that it requires \(k\) CoT steps. Finally, we introduce the notion of the multi-head rank that captures multi-head single-layer transformers, and perform the analysis of PAC-learnability of the classes of functions with bounded multi-head rank.

LGNov 22, 2024
Effective Littlestone Dimension

Valentino Delle Rose, Alexander Kozachinskiy, Tomasz Steifer

Delle Rose et al.~(COLT'23) introduced an effective version of the Vapnik-Chervonenkis dimension, and showed that it characterizes improper PAC learning with total computable learners. In this paper, we introduce and study a similar effectivization of the notion of Littlestone dimension. Finite effective Littlestone dimension is a necessary condition for computable online learning but is not a sufficient one -- which we already establish for classes of the effective Littlestone dimension 2. However, the effective Littlestone dimension equals the optimal mistake bound for computable learners in two special cases: a) for classes of Littlestone dimension 1 and b) when the learner receives as additional information an upper bound on the numbers to be guessed. Interestingly, finite effective Littlestone dimension also guarantees that the class consists only of computable functions.

LGJan 5, 2025
A completely uniform transformer for parity

Alexander Kozachinskiy, Tomasz Steifer

We construct a 3-layer constant-dimension transformer, recognizing the parity language, where neither parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak who use a positional encoding, depending on the input length (but their construction has 2 layers).

LGOct 21, 2025
Computable universal online learning

Dariusz Kalociński, Tomasz Steifer

Understanding when learning is possible is a fundamental task in the theory of machine learning. However, many characterizations known from the literature deal with abstract learning as a mathematical object and ignore the crucial question: when can learning be implemented as a computer program? We address this question for universal online learning, a generalist theoretical model of online binary classification, recently characterized by Bousquet et al. (STOC'21). In this model, there is no hypothesis fixed in advance; instead, Adversary -- playing the role of Nature -- can change their mind as long as local consistency with the given class of hypotheses is maintained. We require Learner to achieve a finite number of mistakes while using a strategy that can be implemented as a computer program. We show that universal online learning does not imply computable universal online learning, even if the class of hypotheses is relatively easy from a computability-theoretic perspective. We then study the agnostic variant of computable universal online learning and provide an exact characterization of classes that are learnable in this sense. We also consider a variant of proper universal online learning and show exactly when it is possible. Together, our results give a more realistic perspective on the existing theory of online binary classification and the related problem of inductive inference.