Transformers from Compressed Representations
This work addresses the high computational and memory costs of AI models, a concern for both researchers and practitioners. The contribution is incremental, since it builds on existing transformer and compression techniques.
The paper tackles inefficient data processing in representation learning by introducing TEMPEST, a method that uses the structure of compressed files for tokenization. This lets transformers learn from compressed data without full decoding, which reduces the number of tokens and achieves competitive accuracy with efficiency gains in memory and compute.
Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state-of-the-art while delivering efficiency gains in memory and compute.
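To make the token-count argument concrete, the following is a minimal sketch and not the TEMPEST tokenizer itself, whose details are not given in this abstract. It assumes, purely for illustration, that a compressed byte stream is cut into fixed-size blocks and that each block would become one transformer token. The use of zlib, the names `byte_tokens` and `block_tokens`, and the `block_size` value are all hypothetical choices.

```python
import zlib


def byte_tokens(data: bytes) -> list[int]:
    """Baseline: one token per raw byte."""
    return list(data)


def block_tokens(stream: bytes, block_size: int = 16) -> list[bytes]:
    """Cut a compressed stream into fixed-size blocks (one token each).

    The final block is zero-padded so every token has the same length.
    Fixed-size blocking is an assumption of this sketch, not the paper's scheme.
    """
    blocks = []
    for start in range(0, len(stream), block_size):
        block = stream[start:start + block_size]
        blocks.append(block.ljust(block_size, b"\x00"))
    return blocks


# Toy input: highly repetitive text, so it compresses well.
raw = b"the quick brown fox jumps over the lazy dog. " * 200
compressed = zlib.compress(raw, 9)

raw_tokens = byte_tokens(raw)
comp_tokens = block_tokens(compressed, block_size=16)

print("raw byte tokens:        ", len(raw_tokens))
print("compressed block tokens:", len(comp_tokens))
```

On this toy input the compressed stream yields far fewer tokens than byte-level processing of the raw data. Since self-attention cost grows quadratically with sequence length, a shorter token sequence lowers both compute and memory, which is the kind of saving the abstract describes. Real media codecs and the paper's actual encoding will behave differently from this zlib example.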