Keep Your Friends Close: Leveraging Affinity Groups to Accelerate AI Inference Workflows
This work addresses latency in streaming AI inference, offering developers an incremental improvement that complements existing techniques such as caching and scheduling.
The paper tackles high latency in AI inference workflows by proposing an affinity grouping mechanism that enables coordinated data management. The mechanism requires only minor code changes and maintains significantly lower latency as workload and scale-out increase.
AI inference workflows are typically structured as a pipeline or graph of AI programs triggered by events. As events occur, these programs perform inference or classification tasks under time pressure to respond or take some action. Standard techniques that reduce latency in other streaming settings (such as caching and optimization-driven scheduling) are of limited value because AI data access patterns (models, databases) change depending on the triggering event, a significant departure from traditional streaming. In this work, we propose a novel affinity grouping mechanism that makes it easier for developers to express application-specific data access correlations, enabling coordinated management of data objects in server clusters hosting streaming inference tasks. Our proposals are thus complementary to other approaches such as caching and scheduling. Experiments confirm the limitations of standard techniques and show that the proposed mechanism maintains significantly lower latency as workload and scale-out increase, while requiring only minor code changes.
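As a rough illustration of what expressing a data access correlation could look like, the sketch below derives an affinity key from each object key and hashes that affinity key, rather than the full object key, to choose a server shard. This is a minimal sketch under stated assumptions, not the paper's interface: the abstract does not describe the actual API, and the key layout, the regular expression, and the names `affinity_key` and `place` are all hypothetical.

```python
import hashlib
import re


def affinity_key(object_key: str, pattern: str) -> str:
    """Return the part of object_key matched by pattern (hypothetical rule).

    Objects whose keys yield the same affinity key are treated as one group.
    Keys that do not match the pattern fall back to the full key.
    """
    match = re.search(pattern, object_key)
    return match.group(0) if match else object_key


def place(object_key: str, pattern: str, num_shards: int) -> int:
    """Pick a shard by hashing the affinity key instead of the full object key."""
    digest = hashlib.sha256(affinity_key(object_key, pattern).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical key layout: everything related to one camera forms a group.
PATTERN = r"^/camera/\d+"
keys = [
    "/camera/17/frame/0001",
    "/camera/17/model/tracker",
    "/camera/17/db/trajectories",
]

# All three keys share the affinity key "/camera/17", so they map to one shard.
shards = {place(k, PATTERN, 8) for k in keys}
```

Because placement depends only on the affinity key, the data a triggering event touches can end up on the same server as the task that needs it, which is the kind of coordinated management the abstract describes. The single pattern here stands in for whatever application-specific rule a developer would supply.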