CLOct 11, 2024Code
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model EnsemblesBuu Phan, Brandon Amos, Itai Gat et al. · meta-ai
Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term as ``tokenization bias''. To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks where input prompts may terminate mid-token, leading to out-of-distribution tokenization, our method mitigates performance degradation and achieves 18% improvement in FIM coding benchmarks, while consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance up to 3.7% over individual models across various standard baselines in reasoning, knowledge, and coding. Code is available at: https://github.com/facebookresearch/Exact-Byte-Level-Probabilities-from-Tokenized-LMs
ITApr 16
One-Shot Broadcast Joint Source-Channel Coding with Codebook DiversityJoseph Rowan, Buu Phan, Ashish Khisti
We study a one-shot joint source-channel coding setting where the source is encoded once and broadcast to $K$ decoders through independent channels. Success is predicated on at least one decoder recovering the source within a maximum distortion constraint. We find that in the one-shot regime, utilizing disjoint codebooks at each decoder yields a codebook diversity gain, distinct from the channel diversity gain that may be expected when several decoders observe independent realizations of the channel's output but share the same codebook. Coding schemes are introduced that leverage this phenomenon, where first- and second-order achievability bounds are derived via an adaptation of the Poisson matching lemma which allows for multiple decoders using disjoint codebooks. We further propose a hybrid coding scheme that partitions decoders into groups to optimally balance codebook and channel diversity. Numerical results on the binary symmetric channel demonstrate that the hybrid approach outperforms strategies where the decoders' codebooks are either fully shared or disjoint.
COMay 12
Multi-Marginal Couplings for Metropolis-HastingsBuu Phan, Gergely Flamich, Ashish Khisti et al.
Convergence diagnosis for Markov chain Monte Carlo is a matter of fundamental importance in computational statistics: it determines the resources allocated to a particular sampling problem and influences the practitioner's view of the quality of estimates obtained from a Markov chain. Motivated by this, we contribute to the emerging class of coupling-based convergence diagnostic algorithms. Concretely, we study coupling multiple Metropolis-Hastings chains using multi-marginal coupling. We introduce a natural objective for this setting and establish lower and upper bounds by drawing connections to list-level distribution coupling and distributed pairwise-matching problems. This analysis ultimately leads to a shared-randomness Poisson Monte Carlo construction for coupling multiple Markov chains. In this process, we avoid a key dimension-dependent bottleneck in the runtime complexity of classical Poisson Monte Carlo by developing an adaptive rule for updating the point process, yielding significant gains in high-dimensional settings. Experiments on grand couplings of Markov chains show that our methods improve coalescence rates across dimensions, reducing meeting times by up to 50% compared with existing baselines.
CLDec 16, 2025
Cross-Tokenizer Likelihood Scoring Algorithms for Language Model DistillationBuu Phan, Ashish Khisti, Karen Ullrich
Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
ITOct 7, 2025
Channel Simulation and Distributed Compression with Ensemble Rejection SamplingBuu Phan, Ashish Khisti
We study channel simulation and distributed matching, two fundamental problems with several applications to machine learning, using a recently introduced generalization of the standard rejection sampling (RS) algorithm known as Ensemble Rejection Sampling (ERS). For channel simulation, we propose a new coding scheme based on ERS that achieves a near-optimal coding rate. In this process, we demonstrate that standard RS can also achieve a near-optimal coding rate and generalize the result of Braverman and Garg (2014) to the continuous alphabet setting. Next, as our main contribution, we present a distributed matching lemma for ERS, which serves as the rejection sampling counterpart to the Poisson Matching Lemma (PML) introduced by Li and Anantharam (2021). Our result also generalizes a recent work on importance matching lemma (Phan et al, 2024) and, to our knowledge, is the first result on distributed matching in the family of rejection sampling schemes where the matching probability is close to PML. We demonstrate the practical significance of our approach over prior works by applying it to distributed compression. The effectiveness of our proposed scheme is validated through experiments involving synthetic Gaussian sources and distributed image compression using the MNIST dataset.
LGSep 30, 2025
Delayed Attention Training Improves Length Generalization in Transformer--RNN HybridsBuu Phan, Reza Ebrahimi, Sanjay Haresh et al.
We study length generalization in sequence models on a composite problem involving both state tracking and associative recall. Prior work finds that recurrent networks handle state tracking well but struggle with recall, whereas Transformers excel at recall yet fail to extend state-tracking capabilities to longer sequences. Motivated by the complementary strengths of these architectures, we construct hybrid models integrating recurrent and attention-based components, and train them on the combined task to evaluate whether both capabilities can be preserved. Our results reveal that, in such hybrids, the Transformer component tends to exploit shortcut solutions, leading to poor length generalization. We identify this shortcut reliance as a key obstacle and propose a simple yet effective training strategy -- delaying the training of the attention layers -- that mitigates this effect and significantly improves length generalization performance. Our experiments show that this approach enables hybrid models to achieve near-perfect accuracy ($>90\%$) on hybrid sequences three times longer than those used during training.
LGJun 5, 2025
Gumbel-max List Sampling for Distribution Coupling with Multiple SamplesJoseph Rowan, Buu Phan, Ashish Khisti
We study a relaxation of the problem of coupling probability distributions -- a list of samples is generated from one distribution and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (arXiv:2408.07978) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens which is not supported by existing schemes. We also provide a theoretical lower bound on the token level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
LGFeb 15, 2025
On Self-Adaptive Perception Loss Function for Sequential Lossy CompressionSadaf Salehkalaibar, Buu Phan, Likun Cai et al.
We consider causal, low-latency, sequential lossy compression, with mean squared-error (MSE) as the distortion loss, and a perception loss function (PLF) to enhance the realism of reconstructions. As the main contribution, we propose and analyze a new PLF that considers the joint distribution between the current source frame and the previous reconstructions. We establish the theoretical rate-distortion-perception function for first-order Markov sources and analyze the Gaussian model in detail. From a qualitative perspective, the proposed metric can simultaneously avoid the error-permanence phenomenon and also better exploit the temporal correlation between high-quality reconstructions. The proposed metric is referred to as self-adaptive perception loss function (PLF-SA), as its behavior adapts to the quality of reconstructed frames. We provide a detailed comparison of the proposed perception loss function with previous approaches through both information theoretic analysis as well as experiments involving moving MNIST and UVG datasets.
CLJun 24, 2024
Understanding and Mitigating Tokenization Bias in Language ModelsBuu Phan, Marton Havasi, Matthew Muckley et al.
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
IVMay 30, 2023
On the Choice of Perception Loss Function for Learned Video CompressionSadaf Salehkalaibar, Buu Phan, Jun Chen et al.
We study causal, low-latency, sequential video compression when the output is subjected to both a mean squared-error (MSE) distortion loss as well as a perception loss to target realism. Motivated by prior approaches, we consider two different perception loss functions (PLFs). The first, PLF-JD, considers the joint distribution (JD) of all the video frames up to the current one, while the second metric, PLF-FMD, considers the framewise marginal distributions (FMD) between the source and reconstruction. Using information theoretic analysis and deep-learning based experiments, we demonstrate that the choice of PLF can have a significant effect on the reconstruction, especially at low-bit rates. In particular, while the reconstruction based on PLF-JD can better preserve the temporal correlation across frames, it also imposes a significant penalty in distortion compared to PLF-FMD and further makes it more difficult to recover from errors made in the earlier output frames. Although the choice of PLF decisively affects reconstruction quality, we also demonstrate that it may not be essential to commit to a particular PLF during encoding and the choice of PLF can be delegated to the decoder. In particular, encoded representations generated by training a system to minimize the MSE (without requiring either PLF) can be {\em near universal} and can generate close to optimal reconstructions for either choice of PLF at the decoder. We validate our results using (one-shot) information-theoretic analysis, detailed study of the rate-distortion-perception tradeoff of the Gauss-Markov source model as well as deep-learning based experiments on moving MNIST and KTH datasets.
CVFeb 7, 2021
Adversarial Imaging PipelinesBuu Phan, Fahim Mannan, Felix Heide
Adversarial attacks play an essential role in understanding deep neural network predictions and improving their robustness. Existing attack methods aim to deceive convolutional neural network (CNN)-based classifiers by manipulating RGB images that are fed directly to the classifiers. However, these approaches typically neglect the influence of the camera optics and image processing pipeline (ISP) that produce the network inputs. ISPs transform RAW measurements to RGB images and traditionally are assumed to preserve adversarial patterns. However, these low-level pipelines can, in fact, destroy, introduce or amplify adversarial patterns that can deceive a downstream detector. As a result, optimized patterns can become adversarial for the classifier after being transformed by a certain camera ISP and optic but not for others. In this work, we examine and develop such an attack that deceives a specific camera ISP while leaving others intact, using the same down-stream classifier. We frame camera-specific attacks as a multi-task optimization problem, relying on a differentiable approximation for the ISP itself. We validate the proposed method using recent state-of-the-art automotive hardware ISPs, achieving 92% fooling rate when attacking a specific ISP. We demonstrate physical optics attacks with 90% fooling rate for a specific camera lenses.
CVDec 13, 2019
Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler RadarNicolas Scheiner, Florian Kraus, Fangyin Wei et al.
Conventional sensor systems record information about directly visible objects, whereas occluded scene components are considered lost in the measurement process. Non-line-of-sight (NLOS) methods try to recover such hidden objects from their indirect reflections - faint signal components, traditionally treated as measurement noise. Existing NLOS approaches struggle to record these low-signal components outside the lab, and do not scale to large-scale outdoor scenes and high-speed motion, typical in automotive scenarios. In particular, optical NLOS capture is fundamentally limited by the quartic intensity falloff of diffuse indirect reflections. In this work, we depart from visible-wavelength approaches and demonstrate detection, classification, and tracking of hidden objects in large-scale dynamic environments using Doppler radars that can be manufactured at low-cost in series production. To untangle noisy indirect and direct reflections, we learn from temporal sequences of Doppler velocity and position measurements, which we fuse in a joint NLOS detection and tracking network over time. We validate the approach on in-the-wild automotive scenes, including sequences of parked cars or house facades as relay surfaces, and demonstrate low-cost, real-time NLOS in dynamic automotive environments.
LGOct 23, 2019
Detecting Out-of-Distribution Inputs in Deep Neural Networks Using an Early-Layer OutputVahdat Abdelzad, Krzysztof Czarnecki, Rick Salay et al.
Deep neural networks achieve superior performance in challenging tasks such as image classification. However, deep classifiers tend to incorrectly classify out-of-distribution (OOD) inputs, which are inputs that do not belong to the classifier training distribution. Several approaches have been proposed to detect OOD inputs, but the detection task is still an ongoing challenge. In this paper, we propose a new OOD detection approach that can be easily applied to an existing classifier and does not need to have access to OOD samples. The detector is a one-class classifier trained on the output of an early layer of the original classifier fed with its original training set. We apply our approach to several low- and high-dimensional datasets and compare it to the state-of-the-art detection approaches. Our approach achieves substantially better results over multiple metrics.
LGApr 27, 2019
Analysis of Confident-Classifiers for Out-of-distribution DetectionSachin Vernekar, Ashish Gaurav, Taylor Denouden et al.
Discriminatively trained neural classifiers can be trusted, only when the input data comes from the training distribution (in-distribution). Therefore, detecting out-of-distribution (OOD) samples is very important to avoid classification errors. In the context of OOD detection for image classification, one of the recent approaches proposes training a classifier called "confident-classifier" by minimizing the standard cross-entropy loss on in-distribution samples and minimizing the KL divergence between the predictive distribution of OOD samples in the low-density regions of in-distribution and the uniform distribution (maximizing the entropy of the outputs). Thus, the samples could be detected as OOD if they have low confidence or high entropy. In this paper, we analyze this setting both theoretically and experimentally. We conclude that the resulting confident-classifier still yields arbitrarily high confidence for OOD samples far away from the in-distribution. We instead suggest training a classifier by adding an explicit "reject" class for OOD samples.
LGDec 6, 2018
Improving Reconstruction Autoencoder Out-of-distribution Detection with Mahalanobis DistanceTaylor Denouden, Rick Salay, Krzysztof Czarnecki et al.
There is an increasingly apparent need for validating the classifications made by deep learning systems in safety-critical applications like autonomous vehicle systems. A number of recent papers have proposed methods for detecting anomalous image data that appear different from known inlier data samples, including reconstruction-based autoencoders. Autoencoders optimize the compression of input data to a latent space of a dimensionality smaller than the original input and attempt to accurately reconstruct the input using that compressed representation. Since the latent vector is optimized to capture the salient features from the inlier class only, it is commonly assumed that images of objects from outside of the training class cannot effectively be compressed and reconstructed. Some thus consider reconstruction error as a kind of novelty measure. Here we suggest that reconstruction-based approaches fail to capture particular anomalies that lie far from known inlier samples in latent space but near the latent dimension manifold defined by the parameters of the model. We propose incorporating the Mahalanobis distance in latent space to better capture these out-of-distribution samples and our results show that this method often improves performance over the baseline approach.
LGNov 27, 2018
Calibrating Uncertainties in Object Localization TaskBuu Phan, Rick Salay, Krzysztof Czarnecki et al.
In many safety-critical applications such as autonomous driving and surgical robots, it is desirable to obtain prediction uncertainties from object detection modules to help support safe decision-making. Specifically, such modules need to estimate the probability of each predicted object in a given region and the confidence interval for its bounding box. While recent Bayesian deep learning methods provide a principled way to estimate this uncertainty, the estimates for the bounding boxes obtained using these methods are uncalibrated. In this paper, we address this problem for the single-object localization task by adapting an existing technique for calibrating regression models. We show, experimentally, that the resulting calibrated model obtains more reliable uncertainty estimates.
LGMay 4, 2018
Improve Uncertainty Estimation for Unknown Classes in Bayesian Neural Networks with Semi-Supervised /One Set ClassificationBuu Phan
Although deep neural network (DNN) has achieved many state-of-the-art results, estimating the uncertainty presented in the DNN model and the data is a challenging task. Problems related to uncertainty such as classifying unknown classes (class which does not appear in the training data) data as known class with high confidence, is critically concerned in the safety domain area (e.g, autonomous driving, medical diagnosis). In this paper, we show that applying current Bayesian Neural Network (BNN) techniques alone does not effectively capture the uncertainty. To tackle this problem, we introduce a simple way to improve the BNN by using one class classification (in this paper, we use the term "set classification" instead). We empirically show the result of our method on an experiment which involves three datasets: MNIST, notMNIST and FMNIST.