Zengyi Qin, Jinyuan Chen, Yunze Man et al.
This addresses the resource-intensive infrastructure needed for computer-use agent research, offering a scalable solution.
Distributed systems, parallel computing, cloud
Ruibo Fan, Xiangrui Yu, Xinglin Pan et al.
This addresses the problem of slow and memory-intensive LLM serving for AI practitioners, offering a novel co-designed solution that provides both compression and acceleration.
Xinyi Hu, Yuhao Shen, Baolin Zhang et al.
This addresses the bottleneck of verification compute in production-grade LLM serving, offering a novel solution for high-concurrency scenarios.
Xueshen Liu, Yongji Wu, Yuncheng Yao et al.
This addresses a major bottleneck for LLM service providers using autoscaling, enabling faster response to changing workloads.
Yuhao Shen, Junyi Shen, Quan Kong et al.
This addresses inference latency issues for users of large language models, representing a significant but incremental improvement over existing speculative decoding methods.
Ruihang Lai, Hao Kang, Haozhan Tang et al.
This work addresses the high cost and complexity for AI coding agents to understand and extend existing MoE training frameworks, offering a more efficient development path for framework engineers and researchers.
Yongjun He, Shuai Zhang, Jiading Gai et al. · amazon-science
This work provides a practical solution for organizations to utilize heterogeneous GPU clusters for LLM post-training, reducing reliance on scarce homogeneous high-end GPUs.
Jiale Xu, Rui Zhang, Yi Xiong et al.
For LLM serving systems, eLLM addresses memory fragmentation and utilization inefficiencies, enabling higher throughput and larger batch sizes.
Feng Ren, Ruoyu Qin, Teng Ma et al.
This addresses performance and resilience issues in large-scale GPU clusters for industrial LLM serving, offering a novel solution to a critical bottleneck.
Tianhao Hu, Xiangcheng Liu, Youshao Xiao et al.
For large-scale LLM RL training, DORA solves the efficiency-accuracy tradeoff in asynchronous rollout, enabling faster training while maintaining algorithmic correctness.
Zhuoshan Zhou, Chen Zhang, Shuyi Zhang et al.
For ML practitioners deploying large MoE models on multi-GPU systems, MoE-Hub addresses the communication bottleneck with a novel hardware-accelerated approach, though it requires hardware changes.
Youhe Jiang, Ran Yan, You Peng et al.
This addresses the challenge of adapting LLM serving to dynamic workloads and cluster conditions, representing a new paradigm rather than an incremental improvement.
Lingfeng Tang, Daoping Zhang, Junjie Chen et al.
For LLM serving systems, MMA addresses the critical bottleneck of host-GPU data movement, significantly improving bandwidth and reducing latency without hardware changes.
Abolfazl Younesi, Nouhaila Innan, Alberto Marchisio et al.
For researchers and practitioners running hybrid quantum-classical algorithms on cloud quantum computers, EFaaS addresses the critical bottleneck of decoupled batch queues that cause long delays and drift penalties.
Nina Wiedemann, Quentin Leboutet, Michael Paulitsch et al.
This addresses the problem of optimizing GPU kernels efficiently for developers and researchers, offering a novel method that improves performance over existing approaches.
Haohui Mai, Xiaoyan Guo, Xiangyun Ding et al.
For GPU kernel developers, Argus bridges the gap between automated code generation and hand-tuned performance, solving a critical bottleneck in LLM inference.
Insu Jang, Runyu Lu, Nikhil Bansal et al.
For researchers and engineers training large multimodal models, Cornstarch provides a more efficient distributed training approach tailored to the heterogeneity of MLLMs.
Li Zhang, Youhe Jiang, Guoliang He et al.
This work addresses the need for automated, hardware-aware mixed-precision inference in LLMs, offering significant performance gains for practitioners deploying large models.
Yichao Yuan, Mosharaf Chowdhury, Nishil Talati
For AI serving systems, KAIROS tackles the emerging problem of power management in agentic workloads, which are fundamentally different from traditional LLM serving.
Xinran Wei, Yan Pan, Fusong Ju et al.
This work addresses a computational bottleneck in quantum chemistry simulations, enabling faster and more scalable calculations for molecular systems.