FBS: Modeling Native Parallel Reading inside a Transformer
This work targets inference acceleration for LLM users, though it appears incremental because it builds on existing Transformer methods.
The paper tackles the inefficiency of token-by-token autoregression in large language models by proposing the Fovea-Block-Skip Transformer (FBS), which integrates content-adaptive foresight and chunk-structure-aware compute allocation to improve the quality-efficiency trade-off without adding parameters.
Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core ingredients of human reading: content-adaptive foresight, chunk-structure-aware compute allocation, and train--test consistency for preview/skimming. We propose the \textbf{Fovea-Block-Skip Transformer} (FBS), which injects a causal, trainable loop into Transformers via three modules: the Parafovea-Attention Window (PAW), the Chunk-Head (CH), and the Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.