LG | AI | Jun 2, 2025

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

arXiv:2506.01919v1 | h-index: 10
Originality: Incremental advance
AI Analysis

This work offers theoretical insights into the mechanisms Transformers use for multi-task learning. It is an incremental advance: it builds on existing empirical success to provide a foundational understanding.

The paper investigates how Transformers generalize across multiple tasks by analyzing their layerwise behavior on Hidden Markov Models, finding that lower layers extract local features while upper layers decouple features for time disentanglement, and provides theoretical constructions that align with these empirical observations.

Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific tasks by leveraging input-output examples. Despite their empirical success, a comprehensive theoretical understanding of this phenomenon remains limited. In this work, we investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Exploring a typical class of sequence models, i.e., Hidden Markov Models, which are fundamental to many language tasks, we observe that: first, lower layers of Transformers focus on extracting feature representations, primarily influenced by neighboring tokens; second, in the upper layers, features become decoupled, exhibiting a high degree of time disentanglement. Building on these empirical insights, we provide a theoretical analysis of the expressive power of Transformers. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency in sequence learning across diverse tasks.
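
For readers unfamiliar with the sequence model named in the abstract, the following is a minimal sketch of sampling from a discrete Hidden Markov Model. It is not taken from the paper: the function name `sample_hmm`, the two-state, three-symbol parameters, and the seed are all hypothetical and chosen only for illustration. It shows the two properties that define an HMM: each observation depends only on the current hidden state, and each hidden state depends only on the previous one.

```python
import random


def sample_hmm(transition, emission, initial, length, seed=0):
    """Sample (hidden_states, observations) from a discrete HMM.

    transition[i][j]: probability of moving from hidden state i to j.
    emission[i][k]:   probability of emitting symbol k in hidden state i.
    initial[i]:       probability of starting in hidden state i.
    """
    rng = random.Random(seed)
    states, observations = [], []
    # Draw the first hidden state from the initial distribution.
    state = rng.choices(range(len(initial)), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # The observation depends only on the current hidden state.
        symbols = range(len(emission[state]))
        observations.append(rng.choices(symbols, weights=emission[state])[0])
        # The next hidden state depends only on the current one.
        nxt = range(len(transition[state]))
        state = rng.choices(nxt, weights=transition[state])[0]
    return states, observations


# Hypothetical 2-state, 3-symbol HMM; the numbers are illustrative only.
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
initial = [0.5, 0.5]

states, observations = sample_hmm(transition, emission, initial, length=10)
print("hidden states:", states)
print("observations: ", observations)
```

One way to picture a multi-task setting, under the assumption that each task corresponds to its own HMM, is to call `sample_hmm` with different `transition` and `emission` tables and present only the observation sequences to the model.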

