Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers
The paper gives an efficient solution for in-context regression, reducing the number of parameters and pretraining sequences needed while retaining the minimax-optimal rate; the contribution is incremental in that it builds on existing transformer methods.
The paper tackles in-context learning for nonparametric regression with $α$-Hölder smooth functions, proving that a pretrained transformer with $Θ(\log n)$ parameters and $Ω\bigl(n^{2α/(2α+d)}\log^3 n\bigr)$ pretraining sequences achieves the minimax-optimal convergence rate $O\bigl(n^{-2α/(2α+d)}\bigr)$ in mean squared error, where $n$ is the number of in-context examples and $d$ the covariate dimension. Both requirements are substantially smaller than in prior work.
We study in-context learning for nonparametric regression with $α$-Hölder smooth regression functions, for some $α>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $Θ(\log n)$ parameters and $Ω\bigl(n^{2α/(2α+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2α/(2α+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.
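To make the last sentence of the abstract concrete, the following is a minimal sketch of a local polynomial estimator computed by gradient descent on a kernel-weighted polynomial basis. It is an illustration under stated assumptions, not the paper's transformer construction: the function name `local_linear_gd`, the degree-1 basis, the Epanechnikov-type kernel, the step size, and the step count are all choices made here for the example.

```python
import numpy as np

def local_linear_gd(X, y, x0, h, steps=500):
    """Estimate f(x0) by a local linear fit, solved with gradient descent.

    Illustrative sketch only: X has shape (n, d), y has shape (n,),
    x0 has shape (d,), and h > 0 is the bandwidth. Assumes at least a
    few sample points fall within distance h of x0.
    """
    # Degree-1 polynomial basis, centred at x0 and rescaled by the bandwidth.
    Z = (X - x0) / h                                   # (n, d)
    Phi = np.hstack([np.ones((X.shape[0], 1)), Z])     # (n, d + 1)

    # Compactly supported kernel weights (Epanechnikov-type, unnormalised).
    w = np.maximum(1.0 - np.sum(Z**2, axis=1), 0.0)    # (n,)

    # Objective: 0.5 * sum_i w_i * (Phi_i @ theta - y_i)^2.
    A = Phi.T @ (w[:, None] * Phi)                     # weighted Gram matrix
    b = Phi.T @ (w * y)

    # Gradient descent with step 1/L, L = largest eigenvalue of A
    # (eigvalsh returns eigenvalues in ascending order).
    L = np.linalg.eigvalsh(A)[-1]
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        theta -= (A @ theta - b) / L

    # The intercept of the local fit is the estimate of f(x0).
    return theta[0]

# Usage: noisy samples of a smooth function on [0, 1].
rng = np.random.default_rng(0)
n, d, alpha = 200, 1, 2.0
X = rng.uniform(0.0, 1.0, size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
h = n ** (-1.0 / (2 * alpha + d))   # classical bandwidth order for local polynomials
print(local_linear_gd(X, y, np.array([0.5]), h))
```

The bandwidth in the usage lines follows the classical order $h \asymp n^{-1/(2α+d)}$ for local polynomial estimators, which is what yields the $n^{-2α/(2α+d)}$ rate; a degree-1 basis is only appropriate for $α \le 2$, and higher smoothness would call for higher-degree monomials in `Phi`.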