LG · AI · Mar 8

Interpretable-by-Design Transformers via Architectural Stream Independence

arXiv:2603.07482v17.8 · h-index: 2

Predicted impact: top 54% in LG · last 90 days · Originality: Highly original

AI Analysis

This work gives researchers and practitioners who need to understand model decision-making a method for designing more interpretable transformers, addressing the long-standing problem of transformer opacity.

This paper proposes a transformer architecture that enforces interpretability by keeping the token stream and contextual semantics in independent streams and integrating them only at the output. This Late Fusion Architecture (LFA) retains interpretable symbolic heads in layers where standard transformers lose them, as measured by a Token-Position Dependence Score ($PDS_{max}$ = 0.276 for LFA vs. 0.058 for standard transformers). It also shows functional modularity: intervening on its heads causes far less semantic damage than in baselines (Cohen's d = -0.158 vs. -0.672).
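As a rough illustration of what "integrating them only at the output" means structurally, here is a toy sketch in plain Python. It is not the paper's LFA: the feed-forward layers, the absence of attention, and every name in it (`late_fusion_forward`, `make_weights`, the layer sizes) are assumptions made for illustration. It only shows the generic late-fusion pattern, in which each stream is updated without reading the other and the two are concatenated at the very end.

```python
import random

def linear(x, w):
    # Multiply vector x (length n) by weight matrix w (n rows, m columns).
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_weights(n_in, n_out, rng):
    # Hypothetical random initialisation, used only to make the toy runnable.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]

def late_fusion_forward(token_vec, context_vec, token_layers, context_layers, w_out):
    t, c = token_vec, context_vec
    for w_t, w_c in zip(token_layers, context_layers):
        t = relu(linear(t, w_t))  # token stream never reads the context stream
        c = relu(linear(c, w_c))  # context stream never reads the token stream
    fused = t + c                 # list concatenation: the only point of fusion
    return linear(fused, w_out), t, c

rng = random.Random(0)
token_layers = [make_weights(3, 3, rng) for _ in range(2)]
context_layers = [make_weights(3, 3, rng) for _ in range(2)]
w_out = make_weights(6, 2, rng)

tokens = [1.0, 0.5, -0.2]
_, t_a, _ = late_fusion_forward(tokens, [0.3, 0.3, 0.3], token_layers, context_layers, w_out)
_, t_b, _ = late_fusion_forward(tokens, [9.0, -1.0, 2.0], token_layers, context_layers, w_out)

# Changing the context input leaves the token stream's final state untouched.
print(t_a == t_b)  # True
```

Because the token stream's state depends only on the token input, it can be inspected at any layer without contamination from the other stream, which is the property the summary calls independent observability.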

While transformers achieve strong performance, their internal decision-making processes remain opaque. We investigate whether architectural constraints can enforce interpretability by design through architectural stream independence: maintaining a token stream (carrying symbolic structure) and contextual semantics in separate streams that remain independently observable throughout processing, with integration delayed until output. We validate this principle through the Late Fusion Architecture (LFA), which retains interpretable symbolic heads through the final layers, while in standard transformers they dissolve by the third of six layers; we quantify this effect by introducing the Token-Position Dependence Score (PDS), with $PDS_{max}$ = 0.276 and 0.058, respectively. Crucially, intervention experiments demonstrate functional modularity: suppressing LFA's recency heads causes minimal semantic damage (Cohen's d = -0.158), versus catastrophic entanglement in baselines (d = -0.672). LFA demonstrates that architectural constraints improve underlying learning mechanisms, averaging 42% stability versus 19% and 11% for baseline comparisons, with extremes ranging from 50% on LFA's best pairs (6 of 12 heads position-invariant) down to 0% (complete collapse) in over-constrained cases. By preventing premature entanglement, architectural independence steers models toward semantic understanding over positional heuristics, establishing interpretability as an architectural design criterion enforceable through structural constraints rather than post-hoc analysis.
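For readers unfamiliar with the effect sizes quoted above, the following is a minimal sketch of Cohen's d in its common pooled-standard-deviation form. The abstract does not say which variant the authors used, so the pooled form, the sign convention (intervened minus control, so a negative value means the intervention lowered the score), and the function name `cohens_d` are assumptions. The sample numbers are made up for illustration and are not data from the paper.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(intervened, control):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n1, n2 = len(intervened), len(control)
    # Pool the two unbiased sample variances, weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * variance(intervened) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(intervened) - mean(control)) / sqrt(pooled_var)

# Made-up scores: the intervened group averages one unit lower,
# with a pooled standard deviation of 2.
print(cohens_d([2, 4, 6], [3, 5, 7]))  # -0.5
```

Under the usual rule of thumb, a magnitude near 0.2 counts as a small effect and one between 0.5 and 0.8 as medium to large, which is the contrast the abstract draws between -0.158 for LFA and -0.672 for the baselines.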
