DC Jan 5
RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference
Jiarui Wang, Huichao Chai, Yuanhang Zhang et al.
Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5$\times$ longer sequences and improves SLO-compliant throughput by up to 3.6$\times$.
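The trigger and router ideas in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not RelayGR's implementation: the names `should_pre_infer`, `route`, and `Instance` are hypothetical, rendezvous hashing is a swapped-in stand-in for whatever affinity mechanism the system actually uses, and a plain dictionary stands in for KV caches held in HBM.

```python
import hashlib


def should_pre_infer(seq_len, len_threshold, cached_users, cache_budget):
    """Hypothetical trigger: admit only long sequences while the cache has room."""
    return seq_len > len_threshold and cached_users < cache_budget


class Instance:
    """Stand-in for one ranking instance; the dict mimics a device-resident prefix cache."""

    def __init__(self, name):
        self.name = name
        self.prefix_cache = {}  # user_id -> placeholder for the user's prefix KV cache

    def pre_infer(self, user_id, behaviors):
        # Placeholder for running the model over the candidate-independent prefix.
        self.prefix_cache[user_id] = tuple(behaviors)

    def rank(self, user_id, candidates):
        # Ranking reuses the prefix if this instance produced it earlier.
        if user_id in self.prefix_cache:
            return "prefix reused"
        return "prefix recomputed"


def route(user_id, instances):
    """Rendezvous hashing: every caller maps the same user to the same instance."""
    def score(inst):
        digest = hashlib.sha256(f"{inst.name}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(instances, key=score)


fleet = [Instance(f"npu-{i}") for i in range(4)]
behaviors = list(range(5000))

if should_pre_infer(len(behaviors), len_threshold=2048, cached_users=0, cache_budget=100):
    route("user-42", fleet).pre_infer("user-42", behaviors)

# The later ranking request is routed by the same key, so it lands on the producer.
print(route("user-42", fleet).rank("user-42", ["item-a", "item-b"]))
```

Because both the pre-infer signal and the ranking request are keyed on the user ID alone, the routing decision needs no shared state between pipeline stages, which is the property the abstract attributes to co-locating cache production and consumption.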
DC Apr 1
Hotspot-Aware Scheduling of Virtual Machines with Overcommitment for Ultimate Utilization in Cloud Datacenters
Jiaxi Wu, Pavel Popov, Wenquan Yang et al.
We address the problem of under-utilization of resources in datacenters during cloud operations, specifically focusing on the challenge of online virtual machine (VM) scheduling. Rather than following the traditional approach of scheduling VMs based solely on their static flavors, we take into account their dynamic CPU utilization. We employ $\Gamma$-robustness theory to manage the dynamic nature of this utilization and introduce a novel variant of bin packing, Probabilistic k-Bins Packing (PkBP), which theoretically protects the Physical Machines (PMs) from hotspot formation within a specified probability $\alpha$. We develop a scheduling algorithm named CloseRadiusFit and cold-start AI-based prediction algorithms for the online version of PkBP. To verify the quality of our approach relative to the optimal solutions, we solve the Offline PkBP problem by designing a novel Mixed Integer Linear Programming (MILP) model and a combination of numerical upper and lower bounds. Our experimental results demonstrate that CloseRadiusFit achieves narrow gaps of 1.6% and 3.1% when compared to the lower and upper bounds, respectively.
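To make the idea of packing under a hotspot probability concrete, here is a minimal sketch of a chance-constrained first-fit heuristic. It is not CloseRadiusFit and does not use $\Gamma$-robustness; it swaps in a simpler normal approximation, and it assumes each VM's CPU demand is described by a hypothetical `(mean, std)` pair with independent demands across VMs.

```python
from statistics import NormalDist


def fits(pm_vms, vm, capacity, alpha):
    """Accept the VM if the estimated overload probability stays at or below alpha.

    Assumption: demands are independent and their sum is roughly normal,
    so the check is mean + z * std <= capacity with z = Phi^{-1}(1 - alpha).
    """
    vms = pm_vms + [vm]
    mean = sum(m for m, _ in vms)
    var = sum(s * s for _, s in vms)
    z = NormalDist().inv_cdf(1 - alpha)
    return mean + z * var ** 0.5 <= capacity


def first_fit(vms, capacity, alpha):
    """Place each arriving VM on the first PM that passes the chance constraint."""
    pms = []
    for vm in vms:
        for pm in pms:
            if fits(pm, vm, capacity, alpha):
                pm.append(vm)
                break
        else:
            # No existing PM can host the VM safely, so open a new one.
            pms.append([vm])
    return pms


vms = [(4.0, 1.0)] * 4  # four VMs, each with mean demand 4 and std 1 (illustrative values)
print(len(first_fit(vms, capacity=10.0, alpha=0.5)))
print(len(first_fit(vms, capacity=10.0, alpha=0.05)))
```

A smaller $\alpha$ tightens the safety margin, so the same VMs occupy more PMs; this is the utilization-versus-hotspot trade-off that the abstract's probabilistic packing formulation is meant to control.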
RO May 11
ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments
Hanyu Jin, Zhefan Xu, Haoyu Shen et al.
Inspection of indoor infrastructure, such as tunnels and industrial facilities, requires systematic surface coverage to ensure that all inspection targets are properly observed. Unmanned Aerial Vehicles (UAVs) offer an alternative to manual inspection by conducting map-guided surface inspection using prior structural models. However, in practice, indoor inspection often relies on floorplan-derived reference maps that may not reflect unforeseen obstacles, such as temporary structures or equipment, leading to occluded viewpoints and degraded inspection quality. Existing coverage planning methods typically assume a fully known inspection environment and perform deterministic global viewpoint optimization based on accurate prior maps, making them vulnerable to environmental discrepancies during execution. This work presents an adaptive UAV inspection framework for partially known structured indoor environments. The proposed method integrates a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module. The global planner organizes planar inspection targets into surface-aligned clusters to generate compact viewpoint sequences with improved orientation consistency. The local planner generates collision-free trajectories and adjusts the viewing direction online to mitigate occlusion-induced coverage loss while preserving the planned trajectory structure. Simulation results across randomized scene configurations demonstrate that the proposed global planner achieves near-complete coverage while reducing trajectory length compared to representative baselines. Real-world flight experiments further validate that the framework produces usable inspection data for downstream analysis. These results indicate that the proposed framework improves inspection efficiency and adaptability in partially known structured indoor environments.
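The grouping of planar targets into surface-aligned clusters can be pictured with a small sketch. This is an illustrative simplification, not the ASIP-Planner algorithm: the helper names, the grouping by dominant normal axis, the lexicographic sweep within a cluster, and the fixed `standoff` distance are all assumptions.

```python
from collections import defaultdict


def dominant_axis(normal):
    """Return (axis index, sign) of the largest component of a surface normal."""
    axis = max(range(3), key=lambda i: abs(normal[i]))
    sign = 1 if normal[axis] >= 0 else -1
    return axis, sign


def plan_viewpoints(targets, standoff=1.0):
    """Group targets by facing direction, then emit viewpoints cluster by cluster.

    Each target is (position, unit normal). The viewpoint is placed `standoff`
    metres off the surface along the normal, so all viewpoints in a cluster
    share one viewing orientation.
    """
    clusters = defaultdict(list)
    for pos, normal in targets:
        clusters[dominant_axis(normal)].append((pos, normal))

    sequence = []
    for key in sorted(clusters):
        # Sweep each surface in coordinate order to keep the path compact.
        for pos, normal in sorted(clusters[key]):
            viewpoint = tuple(p + standoff * n for p, n in zip(pos, normal))
            sequence.append((viewpoint, key))
    return sequence


targets = [
    ((0, 2, 1), (1, 0, 0)),  # wall facing +x
    ((5, 0, 1), (0, 1, 0)),  # wall facing +y
    ((0, 1, 1), (1, 0, 0)),  # same +x wall, different patch
]
for viewpoint, cluster in plan_viewpoints(targets):
    print(viewpoint, cluster)
```

Visiting one cluster at a time avoids repeatedly re-orienting the camera between surfaces, which is one plausible reading of the "improved orientation consistency" the abstract mentions.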