Do Subsampled Newton Methods Work for High-Dimensional Data?
This work addresses scalability issues in optimization for high-dimensional data, making subsampled Newton methods more practical for large-scale machine learning applications.
The paper tackles the problem of subsampled Newton methods requiring too many samples for high-dimensional data, proving that only θ(d_eff^γ) samples are needed, where d_eff^γ is much smaller than d when nγ ≫ 1, enabling efficient optimization.
Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices but using sufficient curvature information. However, previous results require $Ω(d)$ samples to approximate Hessians, where $d$ is the dimension of data points, making it less practically feasible for high-dimensional data. The situation is deteriorated when $d$ is comparably as large as the number of data points $n$, which requires to take the whole dataset into account, making subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on high dimensional data. Specifically, we prove only $\widetildeΘ(d^γ_{\rm eff})$ samples are needed in the approximation of Hessian matrices, where $d^γ_{\rm eff}$ is the $γ$-ridge leverage and can be much smaller than $d$ as long as $nγ\gg 1$. Additionally, we extend this result so that subsampled Newton methods can work for high-dimensional data on both distributed optimization problems and non-smooth regularized problems.