Sergey Zlobin

CL · Aug 12, 2025

READER: Retrieval-Assisted Drafter for Efficient LLM Inference

Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin et al.

Autoregressive language models define a factorized likelihood over token sequences, yet their strictly sequential decoding imposes an intrinsic lower bound on inference latency. This bottleneck has become a central obstacle to deploying large generative models at scale. Existing acceleration techniques partially reduce per-token latency, but they rely on auxiliary draft models or an additional training phase and do not address the dominant memory and communication costs. We present READER, a provably lossless speculative decoding framework that requires no training of an auxiliary draft model. READER formalizes speculative decoding as a stochastic tree construction problem and exploits the empirical redundancy of natural language to generate high-probability candidate continuations. Revisiting how draft trees are constructed, we establish substantial statistical improvements over stochastic draft-tree methods and provide a complexity-theoretic analysis that characterizes the optimality frontier of speculative decoding under bounded computation and memory. Beyond the single-sequence regime considered in prior work, we introduce a memory-optimal key-value (KV) cache-serving strategy with amortized sublinear overhead in the batch dimension, which allows READER to scale to realistic inference workloads. Comprehensive experiments show up to 6.13x wall-clock speedup on single-prompt inference and up to 5.92x on batched inference, consistently surpassing prior speculative decoding baselines while preserving exact output equivalence; the gains are even larger in retrieval-augmented generation pipelines. Our results close a key gap between theoretical parallelism limits and practical LLM inference and suggest a new standard for efficient deployment.
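The basic retrieve-then-verify loop behind retrieval-based drafting can be sketched in Python. This is a deliberately simplified illustration, not READER's algorithm: it drafts a single chain (not a tree) by n-gram lookup in a token corpus and verifies under greedy decoding only, where "lossless" means the output is identical to plain greedy decoding. The toy target model and all names (`build_ngram_index`, `retrieve_draft`, `speculative_greedy_decode`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = int
# Hypothetical stand-in for the target LLM: returns its greedy (argmax) next token.
Model = Callable[[Sequence[Token]], Token]


def build_ngram_index(corpus: Sequence[Token], n: int) -> Dict[Tuple[Token, ...], int]:
    """Map each n-gram to the corpus position just after its latest occurrence."""
    index: Dict[Tuple[Token, ...], int] = {}
    for i in range(len(corpus) - n + 1):
        index[tuple(corpus[i:i + n])] = i + n
    return index


def retrieve_draft(context: Sequence[Token], corpus: Sequence[Token],
                   index: Dict[Tuple[Token, ...], int], n: int, k: int) -> List[Token]:
    """Draft up to k tokens by copying what followed the context's last n-gram."""
    if len(context) < n:
        return []
    pos = index.get(tuple(context[-n:]))
    return [] if pos is None else list(corpus[pos:pos + k])


def greedy_decode(model: Model, prompt: Sequence[Token], max_new: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_greedy_decode(model: Model, prompt: Sequence[Token], corpus: Sequence[Token],
                              max_new: int, n: int = 2, k: int = 4) -> Tuple[List[Token], int]:
    """Retrieve-then-verify decoding; returns (tokens, number of verification passes)."""
    index = build_ngram_index(corpus, n)
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        draft = retrieve_draft(out, corpus, index, n, k)
        # A real system scores every draft position in ONE batched forward pass;
        # this toy calls the model per position but counts the whole step as one pass.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            if model(out + accepted) != tok:
                break  # first mismatch: discard the rest of the draft
            accepted.append(tok)
        # The target always contributes one token (a correction or a bonus token),
        # so every emitted token is exactly what greedy decoding would produce.
        accepted.append(model(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes
```

For example, with the toy model `lambda ctx: (ctx[-1] + 1) % 10`, the corpus `list(range(10)) * 3`, and the prompt `[0, 1]`, generating 8 tokens yields the same output as `greedy_decode` in 2 verification passes instead of 8. When the corpus predicts the continuation poorly, drafts are rejected early and the loop degrades gracefully to one token per pass.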

IM · Sep 10, 2019

Photometric light curves classification with machine learning

Tatiana Gabruseva, Sergey Zlobin, Peter Wang

The Large Synoptic Survey Telescope is scheduled to begin its survey in 2022 and will produce terabytes of imaging data each night. Handling this massive influx of data requires automated algorithms for classifying astronomical light curves. Here, we present a method for automated classification of photometric light curves for a range of astronomical objects. Our approach combines gradient-boosted decision trees with feature extraction and selection and with data augmentation. The solution was developed for the Photometric LSST Astronomical Time Series Classification Challenge (PLAsTiCC) and achieved one of the top results in the challenge.
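The feature-extraction and augmentation steps can be sketched in Python with only the standard library. The record layout (`Observation`), the per-passband statistics, and the augmentation scheme (randomly dropping points and resampling each flux within its reported error) are illustrative assumptions, not the features or augmentation used in the paper. The resulting feature dictionaries would then be passed to a gradient-boosted decision tree classifier, which is not shown.

```python
import random
import statistics
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Observation(NamedTuple):
    mjd: float       # observation time (Modified Julian Date)
    passband: int    # filter index, e.g. 0..5 for LSST u, g, r, i, z, y
    flux: float
    flux_err: float


def extract_features(obs: List[Observation]) -> Dict[str, float]:
    """Hypothetical minimal feature set: summary statistics per passband."""
    by_band: Dict[int, List[float]] = defaultdict(list)
    for o in obs:
        by_band[o.passband].append(o.flux)
    feats: Dict[str, float] = {}
    for band in sorted(by_band):
        fluxes = by_band[band]
        feats[f"band{band}_mean"] = statistics.fmean(fluxes)
        feats[f"band{band}_std"] = statistics.pstdev(fluxes)
        feats[f"band{band}_amplitude"] = max(fluxes) - min(fluxes)
    return feats


def augment(obs: List[Observation], rng: random.Random,
            drop_frac: float = 0.2) -> List[Observation]:
    """Perturbed copy: drop a random fraction of points, then jitter each flux by its error."""
    kept = [o for o in obs if rng.random() >= drop_frac]
    if not kept:  # never return an empty light curve
        kept = list(obs)
    # Draw each new flux from a Gaussian centered on the measurement, sigma = flux_err.
    return [o._replace(flux=rng.gauss(o.flux, o.flux_err)) for o in kept]
```

Generating several augmented copies per object and extracting features from each enlarges the training set, which is most useful for rare classes with few labeled light curves.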


2 Papers