AI, CL · Jan 29

FBS: Modeling Native Parallel Reading inside a Transformer

arXiv:2601.21708v1 · 6 citations

Originality: Incremental advance

AI Analysis

The paper targets inference acceleration for LLMs, though it appears incremental because it builds on existing Transformer methods.

The paper tackles the inefficiency of token-by-token autoregression in large language models by proposing the Fovea-Block-Skip Transformer (FBS), which integrates content-adaptive foresight and chunk-structure-aware compute allocation to improve the quality-efficiency trade-off without adding parameters.

Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train-test consistency for preview/skimming. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.
