PL AR LG | Nov 11, 2025

Streaming Tensor Program: A streaming abstraction for dynamic parallelism

Gina Sohn, Genghan Zhang, Konstantin Hossfeld, Jungwoo Kim, Nathan Sobotka, Nathan Zhang, Olivia Hsu, Kunle Olukotun

arXiv:2511.07776v1 | 1.2 | h-index: 69

Originality: Highly original

AI Analysis

The paper addresses the limited expressiveness of programming abstractions for dynamic behaviors in tensor applications, such as machine learning with ragged tensors, for developers targeting spatial dataflow accelerators. It proposes a novel method for a known bottleneck.

The paper tackles the challenge of efficiently running dynamic tensor workloads on spatial dataflow accelerators by introducing the Streaming Tensor Program (STeP) abstraction. The optimizations it enables reduce on-chip memory requirements by 2.18x (dynamic tiling), improve latency by 1.5x (dynamic parallelization), and improve compute utilization by 2.57x (configuration time-multiplexing) over implementations available in prior abstractions.

Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, input tensors are often dynamically shaped or ragged, and data-dependent control flow is widely used in many models. However, prior programming abstractions for spatial dataflow accelerators have limited expressiveness: they either force dynamic behaviors to be implemented statically or lack the visibility needed for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations (dynamic tiling, dynamic parallelization, and configuration time-multiplexing) that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces on-chip memory requirements by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.
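
As a rough intuition for why ragged shapes penalize static sizing, the toy sketch below compares the buffer rows needed when every sequence in a batch is padded to the longest one against the rows needed when each sequence is tiled to its own length. This is not STeP's API, its operators, or the paper's evaluation setup: the function names, the batch of sequence lengths, and the tile size are all hypothetical, and the paper's dynamic tiling may differ in detail.

```python
from math import ceil

def static_tile_rows(lengths, tile):
    """Rows buffered when every sequence is padded to the longest one.

    Hypothetical helper: models a static schedule sized for the worst case.
    """
    padded = ceil(max(lengths) / tile) * tile
    return padded * len(lengths)

def dynamic_tile_rows(lengths, tile):
    """Rows buffered when each sequence is rounded up to its own tile count.

    Hypothetical helper: models sizing that follows the runtime shape.
    """
    return sum(ceil(n / tile) * tile for n in lengths)

# Hypothetical ragged batch and tile size, chosen only for illustration.
lengths = [3, 17, 64, 9]
tile = 8

static_rows = static_tile_rows(lengths, tile)
dynamic_rows = dynamic_tile_rows(lengths, tile)
print(static_rows, dynamic_rows)
```

The gap between the two counts grows with the spread of sequence lengths, which is the kind of waste a shape-aware abstraction can expose. The numbers here come from made-up inputs and say nothing about the 2.18x figure reported above.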
