CE · DC · Apr 20

Matrix-Free 3D SIMP Topology Optimization with Fused Gather-GEMM-Scatter Kernels

arXiv:2604.18020 · 4.91 citations · h-index: 1

Predicted impact: top 55% in CE · last 90 days · Originality: Synthesis-oriented

AI Analysis

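For practitioners of 3D topology optimization, this work offers a practical speedup and energy saving on a consumer GPU (a single RTX 4090). The method is incremental, fusing existing operations into one kernel, and the BF16 variant appears infeasible: the estimated condition numbers are far too large for BF16, and BF16 iterative refinement stagnated at both tested inner tolerances.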

The paper presents a fused CUDA kernel for 3D SIMP topology optimization that eliminates DRAM traffic between the gather, GEMM, and scatter stages. On an RTX 4090 it achieves a 2.3-7.3x end-to-end speedup over conventional three-stage implementations, depending on problem size and baseline precision, and 3.2-4.9x lower energy than matched FP64 runs.

The matrix-free gather-batched-GEMM-scatter pattern eliminates global stiffness assembly for three-dimensional SIMP topology optimization, but the conventional three-stage implementation forces avoidable DRAM traffic between stages. We present a single fused CUDA kernel, implemented through CuPy's runtime compilation interface, that performs gather, per-element stiffness multiplication, and scatter accumulation in one pass. On a single RTX 4090 (24 GB), the fused path reaches a problem-size-dependent 4.6-7.3x end-to-end SIMP wall-time speedup across 216k-4.9M cantilever elements and 4.4x on the 499,125-element torsion benchmark. Against the same-precision FP32 three-stage baseline, the fused path still yields 2.3-4.6x on cantilever and 2.8x on torsion. Isolated CUDA-event cantilever-operator measurements reach 8.9-13.8x per matvec call, while separate instrumented board-power traces at 216k and 1M show 3.2-4.9x lower energy than matched FP64 runs. A separate bridge stress test shows the same FP32-versus-FP64 three-stage trend under one distributed-load case; direct fused-kernel bridge benchmarks are not reported. We also evaluate a BF16 WMMA variant: a separate PyTorch BF16 GEMM proxy on matching tensor shapes yields 14.3x, but direct condition-number estimates of 6.1e5-2.3e6 across 64k-512k uniform-density test states imply BF16 conditioning products of 2.4e3-9.1e3, far above the 256 threshold, observed alongside BF16 iterative-refinement stagnation at the two tested inner tolerances.
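The gather, per-element multiply, and scatter pattern described in the abstract can be illustrated with a minimal sketch. This is not the paper's kernel: it is pure Python on a hypothetical 1D bar mesh with two-node elements instead of 3D elements on a GPU, and the names `simp_modulus`, `matvec_three_stage`, `matvec_fused`, `K0`, and `conn` are assumptions made for illustration. The stiffness interpolation is the standard modified-SIMP form, which the source does not spell out. The three-stage version materializes per-element intermediates (the analogue of the DRAM round trips between stages), while the fused version keeps them local to a single loop.

```python
# Illustrative sketch only: 1D bar elements in pure Python, not the paper's
# fused CUDA kernel. All names here are hypothetical.

def simp_modulus(rho, e0=1.0, e_min=1e-9, penal=3.0):
    """Modified-SIMP interpolation: E(rho) = Emin + rho^p * (E0 - Emin)."""
    return e_min + rho ** penal * (e0 - e_min)

# Reference stiffness of a unit-length, unit-modulus two-node bar element.
K0 = [[1.0, -1.0],
      [-1.0, 1.0]]

def matvec_three_stage(u, rho, conn):
    """Compute f = K(rho) u without assembling K, in three separate passes."""
    # Stage 1 (gather): copy each element's nodal values into a local vector.
    u_e = [[u[d] for d in dofs] for dofs in conn]
    # Stage 2 (batched GEMM): multiply each local vector by the scaled K0.
    f_e = []
    for ue, r in zip(u_e, rho):
        scale = simp_modulus(r)
        n = len(ue)
        f_e.append([scale * sum(K0[i][j] * ue[j] for j in range(n))
                    for i in range(n)])
    # Stage 3 (scatter): accumulate element results into the global vector.
    f = [0.0] * len(u)
    for dofs, fe in zip(conn, f_e):
        for d, val in zip(dofs, fe):
            f[d] += val
    return f

def matvec_fused(u, rho, conn):
    """Same result in one pass: no per-element arrays persist between stages."""
    f = [0.0] * len(u)
    for dofs, r in zip(conn, rho):
        ue = [u[d] for d in dofs]          # gather into local storage
        scale = simp_modulus(r)
        n = len(ue)
        for i, d in enumerate(dofs):       # multiply and scatter-accumulate
            f[d] += scale * sum(K0[i][j] * ue[j] for j in range(n))
    return f
```

For example, with `conn = [(0, 1), (1, 2), (2, 3)]`, unit densities, and a linear displacement field `u = [0.0, 1.0, 2.0, 3.0]`, both functions return the same vector: the interior entries cancel and only the end nodes carry a net force. In a parallel GPU version the scatter step would need some form of conflict handling, such as atomic adds or element coloring; the abstract does not say which the authors use.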
