Ortho-Hydra: Orthogonalized Experts for DiT LoRA

arXiv:2605.03252

AI Analysis

For practitioners fine-tuning diffusion transformers on multi-style data, this work addresses the cold-start router deadlock in mixture-of-experts LoRA, so that the router can begin to specialize early in training.

LoRA fine-tuning of diffusion transformers on multi-style data suffers from style bleed. Mixture-of-experts approaches such as HydraLoRA do not fix this, because zero-initialized experts leave the router deadlocked at the uniform prior. Ortho-Hydra introduces a re-parameterization with an orthogonalized shared basis and disjoint per-expert output subspaces, so the router receives a specialization signal from the first step and begins to leave the uniform prior within a few hundred steps.

LoRA fine-tuning of diffusion transformers (DiT) on multi-style data suffers from style bleed: a single low-rank residual cannot represent several distinct artist fingerprints, and the optimizer converges to their average. Mixture-of-experts LoRA in the HydraLoRA style replaces the up-projection with $E$ heads under a router, but when every expert is zero-initialized the router receives identical gradient from each head and remains at the uniform prior. The experts then evolve permutation-symmetrically, and the network trains as a single rank-$r$ LoRA at $E\times$ the cost. We present Ortho-Hydra, a re-parameterisation that combines an OFT-style Cayley-orthogonal shared basis with per-expert disjoint output subspaces carved from the top-$(Er)$ left singular vectors of the pretrained weight. Disjointness makes the router's per-expert score non-degenerate at step $0$, so specialization receives gradient signal before any expert has trained. We test the predicted deadlock on a DiT pipeline by comparing two HydraLoRA baselines, a zero-initialized shared-basis variant and the original $\sigma = 0.1$ Gaussian-jitter mitigation, against Ortho-Hydra under a matched optimiser, dataset, and step budget. Neither baseline leaves the uniform prior within the first 1k steps; Ortho-Hydra begins de-uniformising within the first few hundred. End-task generation quality on multi-style data is out of scope; we report the construction, the cold-start mechanism, and the routing dynamics it changes. Code: https://github.com/sorryhyun/anima_lora.
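
The cold-start argument in the abstract can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the function names, the shapes, and the choice to assign contiguous blocks of singular vectors to experts are all hypothetical. It assumes a mixed output $y = \sum_e g_e B_e h$ with $g = \mathrm{softmax}(z)$ and a shared down-projected activation $h$, and it uses the basis blocks directly as heads purely for illustration. It does not model the Cayley-orthogonal shared basis or how the actual method treats the initial residual.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def router_logit_grad(upstream, heads, h, logits):
    """dL/dz for y = sum_e g_e * (B_e @ h), with g = softmax(z).

    upstream: dL/dy, shape (d_out,)
    heads:    per-expert up-projections B_e, shape (E, d_out, r)
    h:        shared down-projected activation, shape (r,)
    logits:   router logits z, shape (E,)
    """
    g = softmax(logits)
    # Per-expert score s_e = <dL/dy, B_e h>.
    scores = np.einsum("d,edr,r->e", upstream, heads, h)
    # Softmax Jacobian applied to the scores: g_k * (s_k - <g, s>).
    # This vanishes whenever all scores are identical.
    return g * (scores - g @ scores)


def disjoint_output_bases(W, num_experts, rank):
    """Split the top-(E*r) left singular vectors of W into E blocks of r columns.

    Assumption: contiguous blocks per expert; the paper may assign them differently.
    Requires num_experts * rank <= min(W.shape).
    """
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    top = U[:, : num_experts * rank]
    return np.stack(np.split(top, num_experts, axis=1))  # (E, d_out, r)


rng = np.random.default_rng(0)
d_out, d_in, E, r = 32, 24, 4, 3
W = rng.standard_normal((d_out, d_in))  # stand-in for a pretrained weight
h = rng.standard_normal(r)
upstream = rng.standard_normal(d_out)
logits = np.zeros(E)  # router starts at the uniform prior

# Zero-initialized heads: every score is 0, so the router gets no signal.
zero_heads = np.zeros((E, d_out, r))
grad_zero = router_logit_grad(upstream, zero_heads, h, logits)

# Disjoint orthonormal output blocks: scores differ across experts.
ortho_heads = disjoint_output_bases(W, E, r)
grad_ortho = router_logit_grad(upstream, ortho_heads, h, logits)

print("zero-init router grad:", grad_zero)
print("disjoint-subspace router grad:", grad_ortho)
```

In this toy setting the zero-initialized heads give an exactly zero router gradient, while heads that write into mutually orthogonal output blocks give per-expert scores that generically differ, so the softmax Jacobian no longer cancels them.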
