Clebsch-Gordan Transformer: Fast and Global Equivariant Attention
This work targets the high computational cost and restricted feature modeling of equivariant transformers, a concern for researchers and practitioners in fields such as physics, biochemistry, and robotics. It presents a novel method rather than an incremental improvement.
The paper tackles the computational inefficiency and limited expressiveness of existing equivariant transformers by proposing the Clebsch-Gordan Transformer, which achieves efficient global attention with O(N log N) complexity in the number of tokens and scales to high-order features. It reports improved GPU memory usage, speed, and accuracy on benchmarks such as n-body simulation and QM9.
The global attention mechanism is one of the keys to the success of the transformer architecture, but it incurs a computational cost that is quadratic in the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structure of problem instances, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute requirements. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes the Clebsch-Gordan Transformer, which achieves efficient global attention through a novel Clebsch-Gordan Convolution on $\mathrm{SO}(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving $O(N \log N)$ complexity in the number of input tokens. Additionally, the proposed method scales well with high-order irreducible features by exploiting the sparsity of the Clebsch-Gordan matrix. Lastly, we incorporate optional token permutation equivariance through either weight sharing or data augmentation. We evaluate our method on a diverse set of benchmarks, including n-body simulation, QM9, ModelNet point cloud classification, and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory usage, speed, and accuracy.
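The sparsity claim can be made concrete with a small counting sketch. It is an illustration only, not the paper's implementation. It assumes the standard complex basis, where a Clebsch-Gordan coefficient coupling orders $l_1$ and $l_2$ into $l_3$ can be nonzero only if $|l_1 - l_2| \le l_3 \le l_1 + l_2$ and $m_1 + m_2 = m_3$. The function name `cg_density_upper_bound` is hypothetical, and the returned value is an upper bound, since some coefficients that pass both rules still vanish.

```python
def cg_density_upper_bound(l1: int, l2: int, l3: int) -> float:
    """Upper bound on the fraction of nonzero entries in the CG block l1 x l2 -> l3.

    Assumption: complex-basis selection rules only (triangle inequality and
    m1 + m2 = m3). Real-basis tensors, as used in some libraries, have a
    different sparsity pattern, so this is illustrative rather than exact.
    """
    # Triangle inequality: the whole block vanishes if it is violated.
    if not (abs(l1 - l2) <= l3 <= l1 + l2):
        return 0.0

    # Dense size of the block indexed by (m1, m2, m3).
    total = (2 * l1 + 1) * (2 * l2 + 1) * (2 * l3 + 1)

    # For each (m1, m2) there is at most one admissible m3 = m1 + m2,
    # and it must lie in the range [-l3, l3].
    allowed = sum(
        1
        for m1 in range(-l1, l1 + 1)
        for m2 in range(-l2, l2 + 1)
        if abs(m1 + m2) <= l3
    )
    return allowed / total


if __name__ == "__main__":
    # The admissible fraction shrinks as the feature order grows.
    for order in (1, 2, 4, 8):
        print(order, cg_density_upper_bound(order, order, order))
```

Under these assumptions the admissible fraction falls as the order increases, which is the kind of structure a sparse contraction can exploit for high-order features.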