Aditya Gaur

DCJul 7, 2024

The infrastructure powering IBM's Gen AI model development

Talia Gershon, Seetharami Seelam, Brian Belgodere et al.

AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.

11.4LGMay 25

Closed-Form Node Classification with Exact Graph Unlearning

Aditya Gaur, Charu Sharma

Graph neural networks for node classification are typically trained by gradient descent over hundreds or thousands of epochs. Recent work has shown that, when properly tuned, classic GCN/SAGE/GAT architectures can match graph transformers on many node-classification benchmarks. We ask a complementary question: how much of this performance can be recovered by deterministic closed-form solvers, and what guarantees does this enable? We introduce a routed closed-form framework selected by adjusted homophily. For assortative graphs, we use SGC-style propagation followed by Ridge regression; for heterophilous graphs, we introduce LCF-Net, a layer-wise closed-form graph feature-refinement network whose per-layer Ridge solves are capped by a Gaussian kernel-Ridge head. Across 14 benchmarks, including ogbn-arxiv and ogbn-proteins, our closed-form predictors match or beat the best vanilla 2-layer GCN/SAGE/GAT on 9 of 9 measured datasets, tie tuned deep recipes within one standard deviation on 9 of 12 small benchmarks, and exceed the OGB-leaderboard plain GCN on both large graphs. The remaining heterophilous gap closely tracks the gain from vanilla 2-layer to deep SAGE, suggesting that the residual difference is primarily architectural. Because our predictors are explicit solutions of deterministic linear systems, modified graph inputs can be re-solved to obtain retrain-equivalent parameters. We formalize exact graph-object unlearning for label, feature, edge, node, and subgraph modifications, prove K-hop locality for Ridge components, and verify exactness across 109 configurations. On ogbn-arxiv, localized updates give $21$--$45\times$ speedups over full re-solving and roughly $10^{6}\times$ speedups over gradient retraining. Structural-inversion experiments further quantify the privacy floor of exact retraining and the additional leakage of approximate graph-unlearning methods.

Aditya Gaur

2 Papers