Fiodar Kazhamiaka

DC
h-index15
5papers
17citations
Novelty55%
AI Score53

5 Papers

74.5AIMay 26
Natural Language Query to Configuration for Retrieval Agents

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob et al.

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

64.5ARApr 3
Octopus: Enhancing CXL Memory Pods via Sparse Topology

Yuhong Zhong, Fiodar Kazhamiaka, Pantea Zardoshti et al.

The Compute Express Link (CXL) interconnect enables compute "pods" that pool memory across servers to reduce cost and improve efficiency. These pods also facilitate pairwise communication whose needs conflict with pooling. Importantly, existing pod designs are small or require indirection through expensive switches. These conventional designs implicitly assume that pods must fully connect all servers to all CXL pooling devices. This paper breaks with this conventional wisdom by introducing Octopus pods. Octopus directly connects servers to low-port-count CXL pooling devices (e.g., 4 ports) yet scales to large pods without switches by constructing a sparse CXL topology in which each pooling device connects to a carefully chosen subset of servers. Octopus explicitly balances "overlap", where two servers connect to the same pooling device: overlap reduces pooling efficiency but enables low-latency communication. Octopus resolves this tension by grouping servers into "islands" with low-latency intra-island communication and interconnecting islands to favor pooling. We build a three-server CXL pod prototype and simulate scaled pods with 96 servers under measured device characteristics and physical constraints (1.5 m copper cables). On hardware, Octopus RPCs are 3.2x faster than in-rack RDMA and 2.4x faster than CXL switches. In simulation, Octopus achieves net server cost savings of 3-5.4% whereas CXL switches result in a net cost increase.

84.2DCMar 19
From Servers to Sites: Compositional Power Trace Generation of LLM Inference for Infrastructure Planning

Grant Wilkins, Fiodar Kazhamiaka, Ram Rajagopal

Datacenter operators and electrical utilities rely on power traces at different spatiotemporal scales. Operators use fine-grained traces for provisioning, facility management, and scheduling, while utilities use site-level load profiles for capacity and interconnection planning. Existing datacenter power models do not capture LLM inference workloads, in which GPUs shift rapidly among compute-intensive prefill, lower-power decode, and idle states, and facility demand depends on how these states evolve and synchronize across many devices. We show that LLM inference power can be represented compositionally through two components: workload-driven transitions among operating states and configuration-specific power distributions within those states. Building on this observation, we develop a trace-generation framework that learns from measured traces and synthesizes power profiles for new traffic conditions and serving configurations. These traces aggregate from GPU servers to rack-, row-, and facility-scale load profiles at the temporal granularity required by the study. Across multiple LLMs, tensor-parallel settings, and GPU generations, our framework achieves median absolute energy error below 5% for most configurations while preserving temporal autocorrelation structure. The resulting traces support downstream analyses including oversubscription, power modulation, and utility-facing load characterization, enabling infrastructure evaluations that flat nameplate assumptions and static trace replay cannot support.

67.8DCMay 15
Designing Datacenter Power Delivery Hierarchies for the AI Era

Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare et al.

Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, each of these factors evolve over time, have inter-dependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.

LGSep 24, 2025
Energy Use of AI Inference: Efficiency Pathways and Test-Time Compute

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse et al.

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are increasingly important for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use, because they extrapolate from limited benchmarks and fail to reflect efficiency gains achievable at scale. In this perspective, we introduce a bottom-up methodology to estimate the per-query energy of large-scale LLM systems based on token throughput. For models running on an H100 node under realistic workloads, GPU utilization and PUE constraints, we estimate a median energy per query of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200 billion parameters). These results are consistent with measurements using production-scale configurations and show that non-production estimates and assumptions can overstate energy use by 4-20x. Extending to test-time scaling scenarios with 15x more tokens per typical query, the median energy rises 13x to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving platform, and hardware levels, finding individual median reductions of 1.5-3.5x in energy per query, while combined advances can plausibly deliver 8-20x reductions. To illustrate the system-level impact, we estimate the baseline daily energy use of a deployment serving 1 billion queries to be 0.8 GWh/day. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions, it falls to 0.9 GWh/day, similar to the energy footprint of web search at that scale. This echoes how data centers historically tempered energy growth through efficiency gains during the internet and cloud build-up.