2.8CVApr 8
Orthogonal Quadratic Complements for Vision Transformer Feed-Forward NetworksWang Zixian
Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.
LGDec 28, 2025
APO: Alpha-Divergence Preference OptimizationWang Zixian
Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses Csiszar alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze gradient variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 demonstrate that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
LGFeb 24
Group Orthogonalized Policy Optimization:Group Policy Optimization as Orthogonal Projection in Hilbert SpaceWang Zixian
We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition <v, 1> = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v >= -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature mu I, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
LGNov 2, 2025
Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer AdaptationWang Zixian
Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.
LGOct 21, 2025
ADPO: Anchored Direct Preference OptimizationWang Zixian
Direct Preference Optimization (DPO) has emerged as a simple alternative to reinforcement learning from human feedback (RLHF) for aligning language models, but its reliance on hard pairwise labels makes it brittle under noise; our experiments show performance degrading by up to 93 percent in noisy settings. We introduce Anchored Direct Preference Optimization (ADPO), a unified framework that addresses this fragility through reference anchoring. By minimizing KL(q || softmax((l - l_ref) / tau_anc)), where l_ref are reference policy log probabilities, ADPO provides three key advantages: (1) it unifies major learning paradigms, including supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO, as special cases through different choices of target distribution q, anchor policy pi_ref, and temperature tau_anc; (2) it induces an implicit trust region governed by the softmax Fisher metric with curvature scaling as 1 / tau_anc^2, providing geometric regularization absent in standard methods; and (3) it enables flexible anchor strategies tailored to different learning contexts. Empirically, ADPO consistently outperforms standard DPO by 12 to 93 percent across twelve noisy scenarios, with listwise variants achieving top performance in eleven of twelve cases. In offline distillation, ADPO reduces student-teacher KL by 4 to 49 times while achieving superior returns (for example, 279.3 vs -309.0 for knowledge distillation on HalfCheetah). We further uncover a task-dependent tradeoff: dynamic anchors excel at online exploration in noisy environments (plus 5 to 11 percent), while fixed anchors enable stable offline distillation. Our work establishes anchoring as a general principle for robust policy optimization, with clear practical guidance for anchor selection across diverse learning scenarios.