Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment
This addresses resource efficiency for deploying models in constrained environments like edge devices, though it is incremental as it builds on existing Transformer architectures.
The paper tackles the problem of resource waste in deep learning models under dynamic scenarios by proposing a Transformer^-1 architecture that dynamically matches input features with computational resources, reducing FLOPs by 42.7% and peak memory usage by 34.1% on ImageNet-1K while maintaining comparable accuracy.
Addressing the resource waste caused by fixed computation paradigms in deep learning models under dynamic scenarios, this paper proposes a Transformer$^{-1}$ architecture based on the principle of deep adaptivity. This architecture achieves dynamic matching between input features and computational resources by establishing a joint optimization model for complexity and computation. Our core contributions include: (1) designing a two-layer control mechanism, composed of a complexity predictor and a reinforcement learning policy network, enabling end-to-end optimization of computation paths; (2) deriving a lower bound theory for dynamic computation, proving the system's theoretical reach to optimal efficiency; and (3) proposing a layer folding technique and a CUDA Graph pre-compilation scheme, overcoming the engineering bottlenecks of dynamic architectures. In the ImageNet-1K benchmark test, our method reduces FLOPs by 42.7\% and peak memory usage by 34.1\% compared to the standard Transformer, while maintaining comparable accuracy ($\pm$0.3\%). Furthermore, we conducted practical deployment on the Jetson AGX Xavier platform, verifying the effectiveness and practical value of this method in resource-constrained environments. To further validate the generality of the method, we also conducted experiments on several natural language processing tasks and achieved significant improvements in resource efficiency.