LG, Jan 4, 2025

TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training

Sebastian Loeschcke, David Pitt, Robert Joseph George, Jiawei Zhao, Cheng Luo, Yuandong Tian, Jean Kossaifi, Anima Anandkumar

arXiv:2501.02379v2, 11.46 citations, h-index: 28

Originality: Incremental advance

AI Analysis

This addresses memory efficiency for industrial-scale neural operator training in scientific computing, representing a strong specific gain rather than a broad paradigm shift.

The paper tackles the memory challenges in training neural operators for high-resolution scientific problems by introducing TensorGRaD, a method that reduces total memory usage by over 50% while maintaining or improving accuracy on tasks like turbulent Navier-Stokes equations.

Scientific problems require resolving multi-scale phenomena across different resolutions and learning solution operators in infinite-dimensional function spaces. Neural operators provide a powerful framework for this, using tensor-parameterized layers to capture complex, multi-dimensional relationships. However, scaling neural operators to high-resolution problems leads to significant computational demands, making the training of industrial-scale models prohibitive. In this work, we introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with optimizing large tensor-structured weights. Our approach, based on a robust tensor decomposition, factorizes gradients as the sum of a low-rank tensor and a sparse one to efficiently capture information within optimizer states, including outliers. Additionally, we provide a recipe for mixed precision training of TensorGRaD, achieving further memory savings without sacrificing accuracy. We showcase the effectiveness of TensorGRaD on Fourier Neural Operators, a class of models crucial for solving partial differential equations (PDEs). We provide theoretical guarantees for TensorGRaD, demonstrating its fundamental advantage over matrix-based gradient compression methods. We empirically demonstrate large improvements across various PDE tasks, including the challenging turbulent Navier-Stokes case at a Reynolds number of $10^5$. TensorGRaD reduces total memory usage by over 50% while maintaining and sometimes even improving accuracy.
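
To make the "low-rank plus sparse" idea in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the sparse part is taken as the top-k entries by magnitude, the low-rank part is a truncated higher-order SVD (Tucker form) of the residual, and every function name (`top_k_sparse`, `truncated_hosvd`, `robust_decompose`, and so on) is hypothetical. The abstract does not specify how TensorGRaD selects the sparse entries or computes the low-rank factors.

```python
import numpy as np


def unfold(t, mode):
    """Matricize tensor t along `mode`: shape (t.shape[mode], product of the rest)."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)


def mode_product(t, m, mode):
    """Multiply tensor t by matrix m (shape r x t.shape[mode]) along axis `mode`."""
    out = np.tensordot(m, t, axes=(1, mode))  # contracted axis is replaced at the front
    return np.moveaxis(out, 0, mode)


def top_k_sparse(g, k):
    """Keep the k largest-magnitude entries of g and zero out the rest (assumed rule)."""
    flat = g.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(g.shape)


def truncated_hosvd(t, ranks):
    """Tucker approximation: one truncated SVD per mode, then project onto the factors."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = t
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Expand a Tucker core back to a full tensor."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


def robust_decompose(g, ranks, k):
    """Split g into a sparse outlier tensor plus a low-rank Tucker part of the residual."""
    sparse = top_k_sparse(g, k)
    core, factors = truncated_hosvd(g - sparse, ranks)
    return sparse, core, factors
```

In a sketch like this, an optimizer would only need to keep the k sparse values with their indices, the small core, and the per-mode factor matrices, rather than a dense tensor the size of the gradient. The point of the sparse term is that a handful of large outlier entries would otherwise dominate the leading singular vectors of a purely low-rank fit.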

