NIMar 6
CrossCheck: Input Validation for WAN Control SystemsAlexander Krentsel, Rishabh Iyer, Isaac Keslassy et al.
We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs - often stemming from bugs in the SDN control infrastructure - CrossCheck alerts operators before they trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
96.8AIMay 22
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified SystemsShubham Agarwal, Alexander Krentsel, Shu Liu et al.
AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.
94.5DBMay 22
The Time is Here for Just-in-Time Systems: Challenges and OpportunitiesShu Liu, Alexander Krentsel, Shubham Agarwal et al.
Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across deployments, paying a significant performance cost. We argue that LLM-based coding agents now make a different approach tractable: Just-in-Time Systems, in which the entire system is synthesized from scratch, specialized to the environment, workload, and required system properties. We present a JIT system synthesis pipeline, Jitskit, and explore its effectiveness in synthesizing key-value stores from spec cards that span different YCSB workloads, deployment constraints (e.g., compute resources), and system properties (e.g., consistency and durability). Jitskit iteratively refines a system implementation to match the specification against an evolving evaluation test suite. The resulting synthesized systems are performant, beating comparable state-of-the-art systems on 18 of 18 specs tried, by up to 4.6x over the best off-the-shelf baseline on the most favorable spec. Naively running Claude Code either reward-hacks or underperforms Jitskit by up to 5.4x. We discuss the challenges we overcame in building Jitskit and our key takeaways.
65.4NIMay 21
EnCoR: An end-to-end architecture for simplifying cellular networksWesley Woo, Zhuowei Wen, Monniiesh Velmurugan et al.
Since their creation, cellular networks have made in-network mobility support a key feature of their service model. While this approach provides seamless connectivity for legacy traffic, it has the side effects of inflating end-user latency and increasing complexity and operational overhead for operators. Yet modern applications and transport protocols are increasingly mobility tolerant, prompting us to revisit the assumption that mobility must be provided as an in-network service. In this paper, we propose EnCoR (End-to-End Core and RAN), a deployable cellular network architecture that removes mobility from the core entirely. Leveraging end-to-end mobility, EnCoR eliminates tunnel-based IP anchoring while preserving compatibility with existing authentication, charging, and QoS techniques. We demonstrate that EnCoR works with unmodified phones while providing equivalent performance as traditional LTE networks for real applications including video and voice calling and video streaming. We show that EnCoR not only allows network operators to reduce end to end latency, but can also reduce the capital cost of providing low latency service to users by more than 90% compared to 3GPP networks, based on cost estimates for cellular network core and border router infrastructure provided by the FCC. Finally, we demonstrate that these gains are achieved while reducing the amount of overall handover control messaging, allowing the EnCoR core network to handle a greater number of mobility handover events than an LTE core under identical hardware constraints, achieving a 2.6x lower handover latency under load.
64.2OSMay 19
Clove: Object-Level CXL Memory Management in Managed RuntimesSam Son, Zhihong Luo, Wen Zhang et al.
Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store interface. As a result, existing approaches remain limited in scope, primarily targeting unmanaged-language applications with bespoke runtimes or compiler support. This paper identifies and explores a new design point for object-level CXL management: managed languages and their runtimes. The key observation is that existing managed runtimes already provide highly optimized mechanisms for problems closely related to object-level management, including object relocation and dynamic code generation. However, they still lack the features needed for tiered memory management, such as hotness tracking and relocation policies, and thus must be carefully extended to fully realize this direction. We present Clove, a system that extends existing managed runtimes to support object-level CXL management for managed-language applications. Clove combines profile-guided object hotness tracking with object relocation techniques and policies. Our JVM prototype demonstrates that this extension enables high utilization of fast-tier memory while bounding runtime overhead, reducing application slowdown by 22-84% compared to page-based systems.
51.5NIMay 3
GATE: GPU-Accelerated Traffic Engineering for the WANRahul Bothra, Alexander Krentsel, Saptarshi Mandal et al.
Traffic engineering (TE) has become a crucial tool for enforcing routing policy and maintaining operational efficiency in large networks. Existing TE solutions pick an objective function to optimize, aiming to balance (i) allocating traffic optimally with (ii) reacting quickly to demand changes and disruption events. However, as the scale of networks grows, the runtime of the existing optimal solution becomes infeasibly large. The alternative - approximate solvers - result in costly inefficiencies. We present GPU-Accelerated Traffic Engineering (GATE), which achieves the best of both worlds: enabling fast TE runtimes through a highly-parallelizable GPU-compatible decomposition, while iteratively converging to the provably optimal solution. GATE unlocks a unique set of desirable properties: it becomes increasingly parallelizable with network size, supports a wide spectrum of fairness objectives, and offers theoretically guaranteed convergence to the optimal solution and near-optimal convergence within a bounded time. We evaluate GATE on production traces from two large cloud WANs, and show that GATE achieves near-optimal solutions 5-10x faster than state-of-the-art.
93.3NIApr 30
Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM ServingJunsun Choi, Sam Son, Sunjin Choi et al.
Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (scale-up, scale-out, 3D torus, and 3D full-mesh). We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6-56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
NIJul 28, 2025
Load Balancing for AI Training WorkloadsSarah McClure, Sylvia Ratnasamy, Scott Shenker
We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.
NIOct 21, 2024
Managing Bandwidth: The Key to Cloud-Assisted Autonomous DrivingAlexander Krentsel, Peter Schafhalter, Joseph E. Gonzalez et al.
Prevailing wisdom asserts that one cannot rely on the cloud for critical real-time control systems like self-driving cars. We argue that we can, and must. Following the trends of increasing model sizes, improvements in hardware, and evolving mobile networks, we identify an opportunity to offload parts of time-sensitive and latency-critical compute to the cloud. Doing so requires carefully allocating bandwidth to meet strict latency SLOs, while maximizing benefit to the car.
CRNov 8, 2018
TimeCrypt: Encrypted Data Stream Processing at Scale with Cryptographic Access ControlLukas Burkhalter, Anwar Hithnawi, Alexander Viand et al.
A growing number of devices and services collect detailed time series data that is stored in the cloud. Protecting the confidentiality of this vast and continuously generated data is an acute need for many applications in this space. At the same time, we must preserve the utility of this data by enabling authorized services to securely and selectively access and run analytics. This paper presents TimeCrypt, a system that provides scalable and real-time analytics over large volumes of encrypted time series data. TimeCrypt allows users to define expressive data access and privacy policies and enforces it cryptographically via encryption. In TimeCrypt, data is encrypted end-to-end, and authorized parties can only decrypt and verify queries within their authorized access scope. Our evaluation of TimeCrypt shows that its memory overhead and performance are competitive and close to operating on data in the clear.
CRJun 6, 2018
Droplet: Decentralized Authorization and Access Control for Encrypted Data StreamsHossein Shafagh, Lukas Burkhalter, Anwar Hithnawi et al.
This paper presents Droplet, a decentralized data access control service. Droplet enables data owners to securely and selectively share their encrypted data while guaranteeing data confidentiality in the presence of unauthorized parties and compromised data servers. Droplet's contribution lies in coupling two key ideas: (i) a cryptographically-enforced access control construction for encrypted data streams which enables users to define fine-grained stream-specific access policies, and (ii) a decentralized authorization service that serves user-defined access policies. In this paper, we present Droplet's design, the reference implementation of Droplet, and the experimental results of three case-study applications deployed with Droplet: Fitbit activity tracker, Ava health tracker, and ECOviz smart meter dashboard, demonstrating Droplet's applicability for secure sharing of IoT streams.