LG | ML | Aug 16, 2023

Convergence of Two-Layer Regression with Nonlinear Units

arXiv:2308.08358v1 | 9 citations | h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses a foundational optimization challenge in training large language models, but it is incremental as it builds on existing methods for specific structures like ReLU units.

The paper studies a regression problem involving ReLU units. It derives a closed-form representation of the Hessian of the loss and proves that a greedy approximate Newton method converges under certain conditions: first in distance to the optimal solution, and then, with the Lipschitz condition relaxed, in loss value.

Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance in many tasks in human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the PSDness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Last, we relax the Lipschitz condition and prove convergence in the sense of loss value.
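To make the role of the Hessian concrete, here is a minimal sketch of a Newton-type iteration on a toy objective that combines a softmax unit and a ReLU unit. It is not the paper's method or loss: the objective `loss`, the data `A` and `B_VEC`, the regularization weight `LAM`, and all function names are assumptions chosen for illustration. A damped Newton step with a finite-difference Hessian stands in for the greedy approximate Newton method, whose details this summary does not give. The diagonal shift in `newton_direction` hints at why a positive semidefinite Hessian matters: without it, the Newton direction need not decrease the loss.

```python
import math

# Hypothetical toy data (not from the paper): three rows, two unknowns.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_VEC = [0.2, 0.3, 0.1]
LAM = 0.1  # assumed regularization weight

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def relu(v):
    return v if v > 0.0 else 0.0

def loss(x):
    # Assumed toy objective: 0.5*||ReLU(softmax(Ax) - b)||^2 + 0.5*LAM*||x||^2
    z = [row[0] * x[0] + row[1] * x[1] for row in A]
    f = softmax(z)
    fit = 0.5 * sum(relu(fi - bi) ** 2 for fi, bi in zip(f, B_VEC))
    return fit + 0.5 * LAM * (x[0] ** 2 + x[1] ** 2)

def grad(x, h=1e-5):
    # Central finite differences; a closed form would replace this in practice.
    g = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((loss(xp) - loss(xm)) / (2 * h))
    return g

def hessian(x, h=1e-4):
    # Finite-difference Hessian, symmetrized; only an approximation.
    H = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        gp, gm = grad(xp), grad(xm)
        for j in range(2):
            H[i][j] = (gp[j] - gm[j]) / (2 * h)
    off = 0.5 * (H[0][1] + H[1][0])
    H[0][1] = H[1][0] = off
    return H

def newton_direction(H, g):
    # Shift the diagonal until the 2x2 matrix is positive definite,
    # then solve (H + mu*I) p = -g with the explicit 2x2 inverse.
    mu = 0.0
    while True:
        a, off, d = H[0][0] + mu, H[0][1], H[1][1] + mu
        det = a * d - off * off
        if a > 0.0 and det > 1e-12:
            break
        mu = max(2.0 * mu, 1e-3)
    return [(-d * g[0] + off * g[1]) / det, (off * g[0] - a * g[1]) / det]

def minimize(x0, iters=20):
    x = list(x0)
    history = [loss(x)]
    for _ in range(iters):
        g = grad(x)
        if math.hypot(g[0], g[1]) < 1e-9:
            break
        p = newton_direction(hessian(x), g)
        t = 1.0
        # Backtracking: halve the step until the loss does not increase.
        while t > 1e-8 and loss([x[0] + t * p[0], x[1] + t * p[1]]) > history[-1]:
            t *= 0.5
        if t <= 1e-8:
            break
        x = [x[0] + t * p[0], x[1] + t * p[1]]
        history.append(loss(x))
    return x, history

x_final, history = minimize([1.0, -1.0])
print("initial loss:", history[0], "final loss:", history[-1])
```

Because each step is accepted only if the loss does not increase, `history` is non-increasing by construction. That is a much weaker guarantee than the convergence results the abstract states, which rely on the Hessian being Lipschitz and PSD.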

