From Muon to Spectra

Spectral Dynamics of Neural Network Weights under Muon


Table of Contents

  1. Background: SGD and Spectral Dynamics
  2. Muon Optimizer
  3. Stochastic Analysis of Muon’s Update
  4. Stationary Distribution under Muon
  5. Hypothesis
  6. Experiment
  7. Missing Pieces and Open Questions

1. Background: SGD and Spectral Dynamics

Reference: Olsen et al., From SGD to Spectra, ICML 2025. arXiv:2507.12709.

1.1 Setup

Weight matrix $W \in \mathbb{R}^{m \times n}$ evolves under SGD with isotropic noise:

\[dW = -\eta \nabla_W \mathcal{L}\, dt + \sqrt{2\eta D}\, d\mathcal{W}\]

where $\eta$ = learning rate, $D$ = effective diffusion constant, $d\mathcal{W}$ = matrix-valued Wiener process with independent entries.

1.2 SDE for Singular Values under SGD

Write SVD $W = U\Sigma V^\top$. Applying Itô’s lemma to $\sigma_k(W)$, using:

\[\nabla_W \sigma_k = u_k v_k^\top, \qquad \Delta_W \sigma_k = \frac{m-n+1}{2\sigma_k} + \sum_{j \neq k} \frac{\sigma_k}{\sigma_k^2 - \sigma_j^2}\]

gives:

\[d\sigma_k = \left[ -\eta u_k^\top (\nabla_W \mathcal{L}) v_k + \eta D \left(\frac{m-n+1}{2\sigma_k} + \sum_{j\neq k} \frac{\sigma_k}{\sigma_k^2 - \sigma_j^2}\right) \right] dt + \sqrt{2\eta D}\, d\beta_k\]

1.3 Dyson Brownian Motion for Squared Singular Values

Set $\lambda_k = \sigma_k^2$. Apply Itô again ($d\lambda_k = 2\sigma_k\, d\sigma_k + (d\sigma_k)^2$):

\[d\lambda_k = \left[ -2\sqrt{\lambda_k}\, \eta u_k^\top(\nabla_W\mathcal{L})v_k + \eta D(m-n+3) + 2\eta D \sum_{j\neq k} \frac{\lambda_k}{\lambda_k - \lambda_j} \right] dt + 2\sqrt{2\eta D \lambda_k}\, d\beta_k\]

In the gradient-flat regime ($\nabla_W \mathcal{L} \approx 0$), after the time rescaling $s = 2\eta D\, t$, this is the $\beta=1$ Dyson Brownian motion:

\[dY_k = \left( \frac{m-n+3}{2} + \sum_{j\neq k} \frac{Y_k}{Y_k - Y_j} \right) ds + 2\sqrt{Y_k}\, dW_k\]

Critical observation: diffusion coefficient scales as $\sqrt{\lambda_k}$ — this is multiplicative noise.
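The gradient-flat dynamics above are easy to simulate directly. A minimal Euler–Maruyama sketch (numpy; parameters are illustrative, and the drift clamp is purely a numerical guard against near-collisions, not part of the SDE):

```python
import numpy as np

def simulate_lambda_sgd(m=20, n=10, eta=1e-3, D=1.0, steps=2000, seed=0):
    """Euler-Maruyama for the squared-singular-value SDE in the
    gradient-flat regime: curvature drift + Dyson repulsion +
    multiplicative sqrt(lambda) noise."""
    rng = np.random.default_rng(seed)
    lam = np.sort(rng.uniform(0.5, 2.0, n))          # initial squared SVs
    dt = 1.0
    for _ in range(steps):
        gaps = lam[:, None] - lam[None, :]           # lam_k - lam_j
        np.fill_diagonal(gaps, np.inf)               # exclude j == k
        repulsion = np.sum(lam[:, None] / gaps, axis=1)
        drift = eta * D * (m - n + 3) + 2 * eta * D * repulsion
        drift = np.clip(drift, -1.0, 1.0)            # guard near-collisions
        noise = 2 * np.sqrt(2 * eta * D * lam * dt)
        lam = lam + drift * dt + noise * rng.standard_normal(n)
        lam = np.clip(lam, 1e-12, None)              # keep lambda >= 0
    return np.sort(lam)

print(simulate_lambda_sgd())   # positive, upward-drifting, spread-out levels
```

Without a restoring force the levels drift upward indefinitely; the stationary behavior only appears once the loss drift is added back (Section 1.4).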

1.4 Stationary Distribution under SGD

Mean-field theory replaces the interaction term with a linear restoring force, giving the single-particle SDE:

\[d\lambda_t = (\alpha_0 - \beta_1 \lambda_t)\, dt + \sqrt{8\eta D \lambda_t}\, dW_t\]

where $\alpha_0 = \eta D(m-n+3)$, $\beta_1 > 0$. This is a Cox–Ingersoll–Ross (CIR) process.

Stationary Fokker–Planck gives a Gamma distribution:

\[p(\lambda) \propto \lambda^{\frac{\alpha_0}{4\eta D} - 1} \exp\!\left(-\frac{\beta_1}{4\eta D}\lambda\right)\]

Pushing forward to $\sigma = \sqrt{\lambda}$:

\[p_\sigma(\sigma) \propto \sigma^{\frac{m-n+1}{2}} \exp\!\left(-\frac{\beta_1}{4\eta D}\sigma^2\right)\]

This is a Gamma-type density with a power-law prefactor, matching the empirically observed bulk-plus-tail ESD structure.
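A quick numerical sanity check of the CIR → Gamma claim (illustrative parameters; `simulate_cir` is a throwaway helper, not part of any experiment code):

```python
import numpy as np

def simulate_cir(alpha0=0.5, beta1=1.0, eta_D=0.25, dt=1e-3, steps=200_000, seed=1):
    """Full-truncation Euler-Maruyama for the mean-field CIR process
    d lam = (alpha0 - beta1*lam) dt + sqrt(8*eta*D*lam) dW."""
    rng = np.random.default_rng(seed)
    lam, out = 1.0, []
    for t in range(steps):
        lam_pos = max(lam, 0.0)                       # full truncation
        lam += (alpha0 - beta1 * lam_pos) * dt \
             + np.sqrt(8 * eta_D * lam_pos * dt) * rng.standard_normal()
        if t >= steps // 2:                           # discard burn-in
            out.append(max(lam, 0.0))
    return np.array(out)

samples = simulate_cir()
# stationary law should be Gamma(shape=alpha0/(4*eta*D)=0.5, scale=4*eta*D/beta1=1.0)
print(samples.mean())   # Gamma mean = shape * scale = 0.5
```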


2. Muon Optimizer

Muon applies Nesterov momentum to the mini-batch gradient to get an estimate $G_B$, then orthogonalizes it via Newton–Schulz iteration:

\[\tilde{G}_B = \text{NS}(G_B) \approx UV^\top, \qquad G_B = U\Sigma V^\top\]

Weight update:

\[W \leftarrow W - \eta \tilde{G}_B\]

Key property: the singular values of $\tilde{G}_B$ are all approximately 1.
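The orthogonalization step can be sketched with the quintic Newton–Schulz iteration from the Muon reference implementation (coefficients $a, b, c$ from Jordan et al.; a plain numpy stand-in for the bfloat16 torch version):

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    """Approximate UV^T for G = U S V^T via the quintic Newton-Schulz
    iteration (coefficients from the Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)     # Frobenius scaling: all SVs <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((8, 5))
print(np.linalg.svd(newton_schulz_orth(G), compute_uv=False))
# singular values land near 1 (roughly within [0.7, 1.2] after 5 steps)
```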


3. Stochastic Analysis of Muon’s Update

3.1 Perturbation through Orthogonalization

Mini-batch gradient:

\[G_B = \nabla\mathcal{L} + \xi = U\Sigma V^\top + \xi\]

where $\xi$ is mean-zero noise with variance $\sigma_\xi^2$.

Perturb the SVD of $G_B$ to first order in $\xi$:

\[\delta\Sigma_{kk} = u_k^\top \xi v_k \qquad \text{(singular value perturbation)}\]

\[(\delta U)_{\perp k} = \sum_{j \neq k} \frac{u_j^\top \xi v_k}{\sigma_k - \sigma_j}\, u_j \qquad \text{(left singular vector perturbation)}\]

\[(\delta V)_{\perp k} = \sum_{j \neq k} \frac{u_k^\top \xi v_j}{\sigma_k - \sigma_j}\, v_j \qquad \text{(right singular vector perturbation)}\]

3.2 Orthogonalization Kills Singular Value Noise

The orthogonalized update to first order:

\[\tilde{G}_B \approx UV^\top + U\,\delta V^\top + \delta U\, V^\top\]

The stochastic perturbation is:

\[\delta\tilde{G} = U\,\delta V^\top + \delta U\, V^\top\]

Key: $\delta\tilde{G}$ depends only on $\delta U, \delta V$ — the singular vector perturbations. The singular value noise $\delta\Sigma_{kk} = u_k^\top \xi v_k$ is projected out by orthogonalization.
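This projection property is easy to verify numerically: compare the $(u_k, v_k)$-diagonal component of the raw noise with that of the orthogonalized perturbation (a sketch using the exact polar factor via SVD as a stand-in for Newton–Schulz):

```python
import numpy as np

def polar_factor(M):
    """Exact UV^T via SVD (stand-in for Newton-Schulz)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))       # clean "gradient"
xi = rng.standard_normal(A.shape)       # noise direction
eps = 1e-4                              # finite-difference step
dG_tilde = (polar_factor(A + eps * xi) - polar_factor(A)) / eps

U, S, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
# (u_k, v_k)-diagonal components: raw noise vs orthogonalized perturbation
raw = np.abs(np.einsum('ik,ij,jk->k', U, xi, V))
proj = np.abs(np.einsum('ik,ij,jk->k', U, dG_tilde, V))
print(raw.max(), proj.max())   # raw is O(1); proj vanishes to first order
```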

3.3 Structure of the Residual Noise

Components of $\delta\tilde{G}$ scale as:

\[(\delta\tilde{G})_{jk} \sim \frac{\xi_{jk}}{\sigma_k - \sigma_j}\]

Two properties:

  1. No $\lambda_k$-dependence in the numerator — noise doesn’t scale with singular value size (unlike SGD)
  2. Gap-dependent denominator — noise amplified when $\sigma_k \approx \sigma_j$

Noise covariance:

\[\mathbb{E}[\delta\tilde{G} \otimes \delta\tilde{G}] \sim \sum_{j \neq k} \frac{\sigma_\xi^2}{(\sigma_k - \sigma_j)^2} \cdot (\text{rank-1 terms in } U, V)\]

3.4 SDE for $\lambda_k$ under Muon

Gradient drift: since $\tilde{G}$ is orthogonal and aligned with $W$’s singular vectors, $u_k^\top \tilde{G} v_k \approx 1$ for all $k$ — uniform drift across all singular value directions.

Full SDE for $\lambda_k = \sigma_k^2$ under Muon:

\[\boxed{d\lambda_k = \left[ -2\eta\sqrt{\lambda_k} + \eta D_\text{eff}(m-n+3) + 2\eta D_\text{eff} \sum_{j\neq k} \frac{\lambda_k}{\lambda_k - \lambda_j} \right] dt + 2\eta\sigma_\xi \sqrt{\lambda_k \sum_{j\neq k} \frac{1}{(\lambda_k - \lambda_j)^2}}\, d\beta_k}\]

3.5 Comparison with SGD

| Term | SGD | Muon |
|------|-----|------|
| Gradient drift | $-2\eta\sqrt{\lambda_k}\, u_k^\top\nabla\mathcal{L}\, v_k$ (anisotropic) | $-2\eta\sqrt{\lambda_k}$ (uniform) |
| Diffusion coefficient | $\sim \sqrt{\lambda_k}$ (multiplicative) | $\sim \sqrt{\lambda_k}/\lvert\lambda_k - \lambda_j\rvert$ (gap-dependent) |
| Dyson repulsion | Standard | Same form, $D_\text{eff}$ gap-modulated |

Remark: Gap-dependent noise acts as adaptive repulsion — when $\lambda_k \approx \lambda_j$, the stochastic force pushing them apart diverges. This reinforces the deterministic Dyson repulsion, actively preventing singular value clustering.


4. Stationary Distribution under Muon

4.1 Mean-Field Reduction

In the mean-field limit, replace $\sum_{j\neq k} \frac{1}{(\lambda_k - \lambda_j)^2}$ by its average $\sim r/\Delta^2$, where $\Delta$ is the mean gap and $r$ the effective number of interacting levels, and evaluate the residual $\sqrt{\lambda_k}$ factor at the bulk mean $\bar\lambda$. The effective single-particle diffusion then becomes approximately constant:

\[d\lambda_t = \left(-2\eta\sqrt{\lambda_t} + \alpha_0\right) dt + C\, dW_t\]

where $C = 2\eta\sigma_\xi\sqrt{\bar\lambda\, r}/\Delta$ is treated as constant. This is additive noise plus a sub-linear drift — qualitatively different from CIR.

4.2 Stationary Fokker–Planck (TODO: solve)

\[0 = -\partial_\lambda\left[(-2\eta\sqrt{\lambda} + \alpha_0)\, p(\lambda)\right] + \frac{C^2}{2}\, \partial^2_\lambda\left[p(\lambda)\right]\]

TODO: Solve under zero-flux BCs. The restoring force $-2\eta\sqrt{\lambda}$ is sub-linear in $\lambda$, so $p(\lambda)$ should decay faster than Gamma — conjectured Weibull or sub-Gaussian.
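One integration can already be sketched, taking the constant-$C$ reduction above at face value (this supports, but does not prove, the conjecture). Zero flux means the probability current vanishes pointwise:

\[\left(-2\eta\sqrt{\lambda} + \alpha_0\right) p(\lambda) = \frac{C^2}{2}\, \partial_\lambda p(\lambda) \;\;\Longrightarrow\;\; p(\lambda) \propto \exp\!\left(\frac{2}{C^2}\left(\alpha_0 \lambda - \frac{4\eta}{3}\,\lambda^{3/2}\right)\right)\]

The tail $\sim \exp\!\left(-\frac{8\eta}{3C^2}\lambda^{3/2}\right)$ is stretched-exponential in $\lambda$, i.e. $\sim \exp\!\left(-\frac{8\eta}{3C^2}\sigma^{3}\right)$ in $\sigma$: a Weibull-type decay, strictly thinner than the Gamma tail.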

Contrast with SGD: SGD has multiplicative noise $\sim\sqrt{\lambda}$ and a linear restoring drift $\sim\lambda$ → CIR → Gamma stationary law, whose power-law prefactor yields the heavy bulk-plus-tail ESD. Muon has additive noise and a sub-linear restoring drift $\sim\sqrt{\lambda}$ → thinner tails.
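A single-particle simulation makes the contrast concrete (parameters are illustrative, not fit to any network; `em_path` and `excess_kurtosis` are throwaway helpers):

```python
import numpy as np

def em_path(drift, diffusion, x0=1.0, dt=1e-3, steps=300_000, seed=2):
    """Generic Euler-Maruyama with a reflecting boundary at zero;
    returns samples from the second half of the trajectory."""
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for t in range(steps):
        x += drift(x) * dt + diffusion(x) * np.sqrt(dt) * rng.standard_normal()
        x = abs(x)                                   # crude reflection at 0
        if t >= steps // 2:
            out.append(x)
    return np.array(out)

def excess_kurtosis(x):
    d = x - x.mean()
    return (d**4).mean() / (d**2).mean()**2 - 3

# SGD mean-field: CIR (multiplicative noise, linear restoring drift)
sgd = em_path(lambda l: 0.5 - l, lambda l: np.sqrt(2 * l))
# Muon mean-field: additive noise, sub-linear restoring drift
muon = em_path(lambda l: 0.5 - 2.0 * np.sqrt(l), lambda l: 1.0)
print(excess_kurtosis(sgd), excess_kurtosis(muon))   # SGD tail is heavier
```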


5. Hypothesis

  1. ESD of Muon-trained weight matrices has thinner tails than SGD-trained.
  2. Power-law exponent: $\alpha_\text{Muon} > \alpha_\text{SGD}$ (faster tail decay).
  3. Excess kurtosis: $\kappa_\text{Muon} < \kappa_\text{SGD}$.
  4. Holds for bulk weight matrices (large, approximately square). Does not hold for output layers or very rectangular matrices — mean-field breaks down with few SVs, and Muon is not designed for these layers anyway.

6. Experiment

6.1 Setup

  • Architecture: 2-layer MLP. fc1: $3072 \to 512$, fc2: $512 \to 10$
  • Dataset: CIFAR-10
  • SGD: lr=0.01, momentum=0.9, 15 epochs, batch 256
  • Muon: lr=0.02, momentum=0.95, 15 epochs, batch 256. Matrix params only; biases → AdamW fallback. torch.optim.Muon (PyTorch ≥ 2.7)

6.2 Metrics

| Metric | Definition | Thinner tail = |
|--------|------------|----------------|
| Power-law $\alpha$ | MLE fit to tail via `powerlaw` | larger $\alpha$ |
| Hill-$\alpha$ | $\left(\frac{1}{k}\sum_{i=1}^k \log\frac{\sigma_{(i)}}{\sigma_{(k)}}\right)^{-1}$, top-10% SVs | larger |
| Excess kurtosis | $\mathbb{E}[(\sigma-\mu)^4]/\mathrm{Var}^2 - 3$ | smaller |
| Tail fraction | fraction of SVs above $x_\text{min}$ | smaller |
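The Hill estimator can be sketched in a few lines (`hill_alpha` is a hypothetical helper, checked here on a synthetic Pareto sample rather than real singular values):

```python
import numpy as np

def hill_alpha(svals, top_frac=0.10):
    """Hill estimator of the tail index from the top `top_frac` order statistics."""
    s = np.sort(np.asarray(svals))[::-1]      # descending
    k = max(int(top_frac * len(s)), 2)
    return 1.0 / np.mean(np.log(s[:k - 1] / s[k - 1]))

# sanity check on an exact Pareto(alpha=3) sample
x = 1.0 + np.random.default_rng(0).pareto(3.0, size=200_000)
print(hill_alpha(x, top_frac=0.01))           # should be close to 3
```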

6.3 Results

| Layer | Metric | SGD | Muon | Consistent? |
|-------|--------|-----|------|-------------|
| fc1 | $\alpha$ | | | |
| fc1 | Hill-$\alpha$ | | | |
| fc1 | kurtosis | | | |
| fc1 | tail_frac | | | |
| fc2 | $\alpha$ | | | |
| fc2 | Hill-$\alpha$ | | | |

Fill in numbers after running `esd_sgd_vs_muon.py`.

6.4 Interpretation

  • fc1 ($3072 \times 512$): hypothesis supported across all metrics.
  • fc2 ($512 \times 10$): hypothesis not supported — expected because:
    • Only 10 SVs → mean-field approximation breaks down
    • Gradient at fc2 is highly structured (rank ≈ num_classes), not approximately isotropic
    • Muon not recommended for output heads by design

7. Missing Pieces and Open Questions

7.1 Critical Ablation (required for publishability)

Current experiment doesn’t rule out the magnitude confound: Muon’s update has a different effective scale than SGD, which alone could affect the ESD.

Proposed ablation: Train with norm-matched SGD — standard SGD update $\nabla\mathcal{L}$ rescaled to have the same Frobenius norm as Muon’s $\tilde{G}$. If this produces heavier tails than Muon, orthogonalization (not scale) is the cause.
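A sketch of the norm-matching wrapper (`norm_matched_sgd_update` is a hypothetical name; it uses the fact that an exactly semi-orthogonal $m \times n$ update has Frobenius norm $\sqrt{\min(m,n)}$, whereas the actual Newton–Schulz output only approximates this):

```python
import numpy as np

def norm_matched_sgd_update(grad):
    """Rescale the raw gradient to the Frobenius norm of an exactly
    orthogonalized update: ||U V^T||_F = sqrt(min(m, n)) for an m x n matrix."""
    target = np.sqrt(min(grad.shape))
    return grad * (target / (np.linalg.norm(grad) + 1e-12))

g = np.random.default_rng(0).standard_normal((512, 128))
print(np.linalg.norm(norm_matched_sgd_update(g)))   # sqrt(128), about 11.31
```

This keeps SGD's update direction while matching Muon's update scale, so any remaining ESD difference is attributable to orthogonalization.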

7.2 Unsolved: Stationary Distribution of Muon’s FP Equation

Steps needed:

  1. Integrate $0 = -\partial_\lambda[(-2\eta\sqrt{\lambda} + \alpha_0)p] + \frac{C^2}{2}\partial^2_\lambda p$ once under zero-flux BC
  2. Solve the resulting first-order ODE (possibly via special functions or asymptotic expansion)
  3. Extract tail exponent, compare with Gamma
  4. Verify sub-Gaussian / Weibull conjecture

7.3 Other Open Questions

  • Does gap-dependent noise enhancement of repulsion lead to a more uniform (Marchenko–Pastur-like) bulk? Would explain Muon’s empirical prevention of rank collapse.
  • How does this interact with depth? Attention vs FFN layers have different aspect ratios and gradient structures.
  • Connection to μP: does spectral normalization in μP interact with Muon’s spectral dynamics?
  • Edge of Stability: Muon’s uniform gradient drift ($-2\eta\sqrt{\lambda_k}$ for all $k$) may interact differently with EoS than SGD, since EoS is driven by the largest SV of the Hessian.

References

  • Olsen et al. From SGD to Spectra: A Theory of Neural Network Weight Dynamics. ICML 2025. arXiv:2507.12709
  • Martin & Mahoney. Implicit Self-Regularization in Deep Neural Networks. JMLR 2021.
  • Aarts & de Wit. Dyson Brownian Motion of Neural Network Weights. arXiv:2403.04567, 2024.
  • Jordan et al. Muon: An optimizer for hidden layers. github.com/KellerJordan/Muon, 2024.
  • Alstott et al. powerlaw: A Python Package for Analysis of Heavy-Tailed Distributions. PLoS ONE, 2014.


