Sarah Schultz

AR
3papers
1citation
Novelty58%
AI Score48

3 Papers

79.5ARMay 22Code
DORA: Dataflow-Instruction Orchestration Architecture for DNN Acceleration

Xingzhen Chen, Zhuoping Yang, Jinming Zhuang et al.

As deep neural networks develop significantly more diverse and complex, achieving high performance and efficiency on complicated DNN models faces pressing challenges. Modern DNN workloads are increasingly diverse in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing diverse workloads. To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow via a proposed ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To support flexibility while achieving high performance, DORA adopts a novel on-chip memory management and computation parallelism management mechanism. DORA proposes a compilation framework that can generate instructions for given DNN workloads after a two-stage design space exploration. DORA framework also incorporates a MILP-based and a heuristic-based search engine to generate the schedule solution for different needs and constraints. We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5\% variation on a single vector processor across workloads exhibiting up to 6$\times$ variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5$\times$ throughput improvement. The heuristic-based scheduler further achieves up to 90\% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.

14.3ARApr 7
PHAROS: Pipelined Heterogeneous Accelerators for Real-time Safety-critical Systems With Deadline Compliance

Shixin Ji, Jinming Zhuang, Sarah Schultz et al.

Spatially partitioned heterogeneous accelerators (HAs) are increasingly adopted in embedded systems for their performance and flexibility. Yet most existing HA design frameworks optimize primarily for throughput or quality-of-service (QoS) metrics. They often overlook safety-critical real-time requirements, including hardware support for predictable execution, real-time-aware design space exploration (DSE), and rigorous schedulability analysis. These requirements are essential in safety-critical applications such as smart transportation, where schedulability guarantees directly affect system safety. To address this gap, we present PHAROS, a real-time-centric HA design framework. PHAROS introduces preemption mechanisms and scheduler designs for spatially partitioned HAs under first-in-first-out (FIFO) and earliest-deadline-first (EDF) policies. Leveraging modern real-time theory, we further develop a soft real-time (SRT) schedulability-oriented DSE with objectives and constraints tailored to SRT schedulability. Through comprehensive modeling, analysis, and evaluation across diverse applications, we show that PHAROS's DSE discovers more feasible configurations for a broader range of task sets than throughput-oriented DSE baselines while delivering improved real-time performance. We also provide response-time analyses for the supported scheduling algorithms.

50.2ARApr 8
FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration

Xingzhen Chen, Jinming Zhuang, Zhuoping Yang et al.

With the development of deep neural network (DNN) enabled applications, achieving high hardware resource efficiency on diverse workloads is non-trivial in heterogeneous computing platforms. Prior works discuss dedicated architectures to achieve maximal resource efficiency. However, a mismatch between hardware and workloads always exists in various diverse workloads. Other works discuss overlay architecture that can dynamically switch dataflow for different workloads. However, these works are still limited by flexibility granularity and induce much resource inefficiency. To solve this problem, we propose a flexible composing architecture, FILCO, that can efficiently match diverse workloads to achieve the optimal storage and computation resource efficiency. FILCO can be reconfigured in real-time and flexibly composed into a unified or multiple independent accelerators. We also propose the FILCO framework, including an analytical model with a two-stage DSE that can achieve the optimal design point. We also evaluate the FILCO framework on the 7nm AMD Versal VCK190 board. Compared with prior works, our design can achieve 1.3x - 5x throughput and hardware efficiency on various diverse workloads.