You Only Need Your Transformer 25% of the Time: Meaning-First Execution for Eliminating Unnecessary Inference
This work addresses the inefficiency of AI inference systems by reducing computational cost without modifying models, though the contribution is incremental, since it builds on existing optimization techniques.
The paper tackles the problem of unnecessary transformer inference by introducing Meaning-First Execution (MFEE), a control-plane architecture that invokes inference only when it is needed. It reports a 78.1% reduction in execution while maintaining 100% exact-match equivalence for invoked executions.
Modern AI inference systems treat transformer execution as mandatory, conflating model capability with execution necessity. We reframe inference as a control-plane decision problem: determining when execution is necessary versus when correctness can be preserved through alternative pathways. We introduce Meaning-First Execution (MFEE), a control-plane architecture implementing this framework, selectively invoking transformer inference only when required. MFEE operates as a gating layer above existing stacks without modifying models, weights, or parameters. Across 1,000 diverse prompts under deterministic decoding, MFEE achieves 78.1% execution reduction while maintaining 100% exact-match equivalence for invoked executions. Comparative evaluation reveals pattern-based routers achieve at most 53.3% avoidance with correctness failures, while MFEE reaches 100% avoidance with zero failures through semantic analysis. We prove this limitation via Theorem 1: any router operating solely on finite feature maps cannot simultaneously guarantee zero false skips and positive avoidance on feature-collision pairs. These results establish execution governance as a foundational layer in ML systems infrastructure, orthogonal to model-level optimization techniques.
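To make the two central ideas of the abstract concrete, the following is a minimal sketch and not the authors' implementation. The first part shows a gating layer that sits above an unmodified model and invokes it only when no alternative pathway resolves the prompt. The second part illustrates the intuition behind Theorem 1 on a single feature-collision pair. All names here (`make_gate`, `toy_resolver`, `toy_model`, `first_word_feature`, `evaluate`) are hypothetical, the lookup-table resolver is only a stand-in for MFEE's semantic analysis (which the abstract does not specify), and the execution labels on the example prompts are assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    output: str
    invoked_model: bool


def make_gate(resolve: Callable[[str], Optional[str]],
              run_model: Callable[[str], str]) -> Callable[[str], GateResult]:
    """Wrap an unmodified model behind a gating layer (hypothetical API)."""
    def gate(prompt: str) -> GateResult:
        alternative = resolve(prompt)
        if alternative is not None:
            # An alternative pathway resolved the prompt: skip inference.
            return GateResult(alternative, invoked_model=False)
        # Otherwise pass the prompt through to the model untouched.
        return GateResult(run_model(prompt), invoked_model=True)
    return gate


# Toy stand-ins (assumptions): a canned-answer table and a fake model.
CANNED = {"hello": "Hello!"}


def toy_resolver(prompt: str) -> Optional[str]:
    return CANNED.get(prompt.strip().lower())


def toy_model(prompt: str) -> str:
    return f"<model output for: {prompt}>"


gate = make_gate(toy_resolver, toy_model)


# --- Feature-collision illustration (intuition for Theorem 1) ---

def first_word_feature(prompt: str) -> str:
    """A deliberately coarse finite feature map: the lowercased first word."""
    return prompt.split()[0].lower()


def evaluate(decide_skip, feature_map, labeled_prompts):
    """Return (false_skips, avoided) for a router that sees only features."""
    false_skips = avoided = 0
    for prompt, needs_execution in labeled_prompts:
        if decide_skip(feature_map(prompt)):
            avoided += 1
            if needs_execution:
                false_skips += 1
    return false_skips, avoided


# Both prompts map to the feature "what", but only one needs the model
# (labels assumed for illustration).
COLLISION_PAIR = [
    ("What is 2 + 2?", False),
    ("What is a good name for my startup?", True),
]

# A feature-only router must treat both prompts identically.
skip_all = evaluate(lambda f: True, first_word_feature, COLLISION_PAIR)
skip_none = evaluate(lambda f: False, first_word_feature, COLLISION_PAIR)
```

Because both prompts share a feature value, any router that decides from features alone either skips both (incurring a false skip) or executes both (achieving no avoidance). On this pair, zero false skips and positive avoidance cannot hold together, which is the tension the theorem formalizes.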