Capacity of Group-invariant Linear Readouts from Equivariant Representations: How Many Objects can be Linearly Classified Under All Possible Views?
This work addresses a foundational gap in understanding the capacity of equivariant representations for machine learning, with implications for designing robust models in computer vision and related fields.
The paper quantifies the expressivity of group-equivariant representations for linear classification of objects under all possible views. It finds that the fraction of separable dichotomies depends on the dimension of the space fixed by the group action, and that local pooling decreases this fraction. The theory is validated on convolutional neural networks, with perfect agreement.
Equivariance has emerged as a desirable property of representations of objects subject to identity-preserving transformations that constitute a group, such as translations and rotations. However, the expressivity of a representation constrained by group equivariance is still not fully understood. We address this gap by providing a generalization of Cover's Function Counting Theorem that quantifies the number of linearly separable and group-invariant binary dichotomies that can be assigned to equivariant representations of objects. We find that the fraction of separable dichotomies is determined by the dimension of the space that is fixed by the group action. We show how this relation extends to operations such as convolutions, element-wise nonlinearities, and global and local pooling. While other operations do not change the fraction of separable dichotomies, local pooling decreases the fraction, despite being a highly nonlinear operation. Finally, we test our theory on intermediate representations of randomly initialized and fully trained convolutional neural networks and find perfect agreement.
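As a rough illustration of the counting argument, the sketch below evaluates Cover's classical formula, under which the fraction of linearly separable dichotomies of P points in general position in N dimensions is 2^(1-P) * sum_{k=0}^{N-1} C(P-1, k). The abstract states only that the group-invariant fraction is determined by the dimension of the fixed subspace; the sketch assumes this means the same formula with N replaced by that dimension, written N_0 here. The function names, the example group (cyclic shifts acting on a channels-by-positions feature map), and the chosen sizes are hypothetical and not taken from the paper.

```python
from math import comb


def cover_fraction(num_points: int, dim: int) -> float:
    """Fraction of dichotomies of `num_points` points in general position
    in `dim` dimensions that a homogeneous linear readout can separate
    (Cover's Function Counting Theorem)."""
    if num_points < 1 or dim < 0:
        raise ValueError("need num_points >= 1 and dim >= 0")
    # Number of separable dichotomies: 2 * sum_{k=0}^{dim-1} C(P-1, k).
    separable = 2 * sum(comb(num_points - 1, k) for k in range(dim))
    return separable / 2 ** num_points


def fixed_dim_cyclic_shift(channels: int, positions: int) -> int:
    """Dimension of the subspace fixed by cyclic shifts acting on a
    (channels x positions) feature map by permuting positions.

    The group average (1/|G|) * sum_g pi(g) projects onto the fixed
    subspace, so its trace is the fixed dimension. For a permutation
    representation, trace(pi(g)) counts the coordinates that g fixes."""
    if channels < 0 or positions < 1:
        raise ValueError("need channels >= 0 and positions >= 1")
    total_trace = 0
    for shift in range(positions):
        total_trace += sum(
            1
            for _ in range(channels)
            for p in range(positions)
            if (p + shift) % positions == p
        )
    return total_trace // positions


if __name__ == "__main__":
    channels, positions, num_objects = 3, 8, 6  # hypothetical sizes
    n_full = channels * positions  # ambient dimension N
    n_fixed = fixed_dim_cyclic_shift(channels, positions)  # N_0 (assumed role)
    print("ambient dim:", n_full, "fixed dim:", n_fixed)
    print("fraction with N  :", cover_fraction(num_objects, n_full))
    print("fraction with N_0:", cover_fraction(num_objects, n_fixed))
```

In this toy setting the ambient dimension is 24 while only 3 dimensions are fixed by the shifts, so under the stated assumption the capacity for invariant dichotomies is governed by the much smaller number. This is also consistent with the abstract's remark that operations preserving the fixed dimension leave the fraction unchanged.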