From Muon to Spectra
Spectral Dynamics of Neural Network Weights under Muon
Table of Contents
- Background: SGD and Spectral Dynamics
- Muon Optimizer
- Stochastic Analysis of Muon’s Update
- Stationary Distribution under Muon
- Hypothesis
- Experiment
- Missing Pieces and Open Questions
1. Background: SGD and Spectral Dynamics
Reference: Olsen et al., From SGD to Spectra, ICML 2025. arXiv:2507.12709.
1.1 Setup
Weight matrix $W \in \mathbb{R}^{m \times n}$ evolves under SGD with isotropic noise:
\[dW = -\eta \nabla_W \mathcal{L}\, dt + \sqrt{2\eta D}\, d\mathcal{W}\]
where $\eta$ is the learning rate, $D$ the effective diffusion constant, and $d\mathcal{W}$ a matrix-valued Wiener process with independent entries.
1.2 SDE for Singular Values under SGD
Write SVD $W = U\Sigma V^\top$. Applying Itô’s lemma to $\sigma_k(W)$, using:
\[\nabla_W \sigma_k = u_k v_k^\top, \qquad \Delta_W \sigma_k = \frac{m-n+1}{2\sigma_k} + \sum_{j \neq k} \frac{\sigma_k}{\sigma_k^2 - \sigma_j^2}\]
gives:
\[d\sigma_k = \left[ -\eta u_k^\top (\nabla_W \mathcal{L}) v_k + \eta D \left(\frac{m-n+1}{2\sigma_k} + \sum_{j\neq k} \frac{\sigma_k}{\sigma_k^2 - \sigma_j^2}\right) \right] dt + \sqrt{2\eta D}\, d\beta_k\]
1.3 Dyson Brownian Motion for Squared Singular Values
Set $\lambda_k = \sigma_k^2$. Apply Itô again ($d\lambda_k = 2\sigma_k\, d\sigma_k + (d\sigma_k)^2$):
\[d\lambda_k = \left[ -2\sqrt{\lambda_k}\, \eta u_k^\top(\nabla_W\mathcal{L})v_k + \eta D(m-n+3) + 2\eta D \sum_{j\neq k} \frac{\lambda_k}{\lambda_k - \lambda_j} \right] dt + 2\sqrt{2\eta D \lambda_k}\, d\beta_k\]
In the gradient-flat regime ($\nabla_W \mathcal{L} \approx 0$), after the time rescaling $s = 2\eta D\, t$, this is the $\beta=1$ Dyson Brownian motion:
\[dY_k = \left( \frac{m-n+3}{2} + \sum_{j\neq k} \frac{Y_k}{Y_k - Y_j} \right) ds + 2\sqrt{Y_k}\, dW_k\]
Critical observation: the diffusion coefficient scales as $\sqrt{\lambda_k}$ — this is multiplicative noise.
1.4 Stationary Distribution under SGD
Mean-field theory replaces the interaction term with a linear restoring force, giving the single-particle SDE:
\[d\lambda_t = (\alpha_0 - \beta_1 \lambda_t)\, dt + \sqrt{8\eta D \lambda_t}\, dW_t\]
where $\alpha_0 = \eta D(m-n+3)$ and $\beta_1 > 0$. This is a Cox–Ingersoll–Ross (CIR) process.
Stationary Fokker–Planck gives a Gamma distribution:
\[p(\lambda) \propto \lambda^{\frac{\alpha_0}{4\eta D} - 1} \exp\!\left(-\frac{\beta_1}{4\eta D}\lambda\right)\]
Pushing forward to $\sigma = \sqrt{\lambda}$:
\[p_\sigma(\sigma) \propto \sigma^{\frac{m-n+1}{2}} \exp\!\left(-\frac{\beta_1}{4\eta D}\sigma^2\right)\]
This is a Gamma-type density with a power-law prefactor, matching the empirically observed bulk-plus-tail ESD structure.
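The CIR-to-Gamma claim can be checked with a small Euler–Maruyama simulation (all parameters assumed for illustration): the sample mean and variance of the simulated process should match the predicted Gamma moments $k\theta$ and $k\theta^2$:

```python
# Euler-Maruyama simulation of the mean-field CIR process
#   d(lam) = (alpha0 - beta1*lam) dt + sqrt(8*eta*D*lam) dW
# compared against the predicted stationary
#   Gamma(shape = alpha0/(4*eta*D), scale = 4*eta*D/beta1).
# All parameters below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
eta, D = 0.1, 0.5          # so 4*eta*D = 0.2
alpha0, beta1 = 1.0, 0.5   # predicted shape = 5.0, scale = 0.4

dt, steps, paths = 0.01, 3000, 4000
lam = np.full(paths, 2.0)  # start at the predicted stationary mean
for _ in range(steps):
    drift = (alpha0 - beta1 * lam) * dt
    diff = np.sqrt(8 * eta * D * np.maximum(lam, 0.0) * dt)
    lam = np.maximum(lam + drift + diff * rng.standard_normal(paths), 0.0)

shape, scale = alpha0 / (4 * eta * D), (4 * eta * D) / beta1
print(f"sample mean {lam.mean():.3f} vs Gamma mean {shape * scale:.3f}")
print(f"sample var  {lam.var():.3f} vs Gamma var  {shape * scale**2:.3f}")
```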
2. Muon Optimizer
Muon applies Nesterov momentum to get gradient estimate $G$, then orthogonalizes via Newton–Schulz:
\[\tilde{G} = \text{NS}(G) \approx UV^\top, \qquad G = U\Sigma V^\top\]
Weight update:
\[W \leftarrow W - \eta \tilde{G}\]
Key property: the singular values of $\tilde{G}$ are all approximately 1.
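A NumPy sketch of the Newton–Schulz step (the quintic coefficients are taken from Jordan et al.'s reference implementation; treat them as an assumption to verify against the repo):

```python
# Quintic Newton-Schulz orthogonalization, sketched after the Muon reference
# implementation (coefficients assumed from Jordan et al.'s repo).
import numpy as np

def newton_schulz(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately map G = U S V^T to U V^T by iterating an odd quintic
    polynomial that drives every singular value toward 1."""
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))
s = np.linalg.svd(newton_schulz(G), compute_uv=False)
print("singular values of NS(G):", s.min(), s.max())
```

Because the polynomial is odd, the matrix iteration acts on singular values exactly as the scalar map $x \mapsto ax + bx^3 + cx^5$, so after a few steps all singular values land in a neighborhood of 1.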
3. Stochastic Analysis of Muon’s Update
3.1 Perturbation through Orthogonalization
Mini-batch gradient:
\[G_B = \nabla\mathcal{L} + \xi = U\Sigma V^\top + \xi\]
where $\xi$ is mean-zero noise with entrywise variance $\sigma_\xi^2$.
Perturb the SVD of $G_B$ to first order in $\xi$:
\[\delta\Sigma_{kk} = u_k^\top \xi v_k \qquad \text{(singular value perturbation)}\]
\[(\delta U)_{\perp k} = \sum_{j \neq k} \frac{u_j^\top \xi v_k}{\sigma_k - \sigma_j} u_j \qquad \text{(left SV perturbation)}\]
\[(\delta V)_{\perp k} = \sum_{j \neq k} \frac{u_k^\top \xi v_j}{\sigma_k - \sigma_j} v_j \qquad \text{(right SV perturbation)}\]
3.2 Orthogonalization Kills Singular Value Noise
The orthogonalized update to first order:
\[\tilde{G}_B \approx UV^\top + U\,\delta V^\top + \delta U\, V^\top\]
The stochastic perturbation is:
\[\delta\tilde{G} = U\,\delta V^\top + \delta U\, V^\top\]Key: $\delta\tilde{G}$ depends only on $\delta U, \delta V$ — the singular vector perturbations. The singular value noise $\delta\Sigma_{kk} = u_k^\top \xi v_k$ is projected out by orthogonalization.
3.3 Structure of the Residual Noise
Components of $\delta\tilde{G}$ scale as:
\[(\delta\tilde{G})_{jk} \sim \frac{\xi_{jk}}{\sigma_k - \sigma_j}\]
Two properties:
- No $\lambda_k$-dependence in the numerator — noise doesn’t scale with singular value size (unlike SGD)
- Gap-dependent denominator — noise amplified when $\sigma_k \approx \sigma_j$
Noise covariance:
\[\mathbb{E}[\delta\tilde{G} \otimes \delta\tilde{G}] \sim \sum_{j \neq k} \frac{\sigma_\xi^2}{(\sigma_k - \sigma_j)^2} \cdot (\text{rank-1 terms in } U, V)\]
3.4 SDE for $\lambda_k$ under Muon
Gradient drift: assuming the orthogonalized update $\tilde{G}$ is (approximately) aligned with $W$’s singular vectors, $u_k^\top \tilde{G} v_k \approx 1$ for all $k$ — a uniform drift across all singular value directions.
Full SDE for $\lambda_k = \sigma_k^2$ under Muon:
\[\boxed{d\lambda_k = \left[ -2\eta\sqrt{\lambda_k} + \eta D_\text{eff}(m-n+3) + 2\eta D_\text{eff} \sum_{j\neq k} \frac{\lambda_k}{\lambda_k - \lambda_j} \right] dt + 2\eta\sigma_\xi\sqrt{\lambda_k \sum_{j\neq k} \frac{1}{(\lambda_k - \lambda_j)^2}}\, d\beta_k}\]
3.5 Comparison with SGD
| Term | SGD | Muon |
|---|---|---|
| Gradient drift | $-2\eta\sqrt{\lambda_k}\, u_k^\top\nabla\mathcal{L}\, v_k$ (anisotropic) | $-2\eta\sqrt{\lambda_k}$ (uniform) |
| Diffusion coefficient | $\sim \sqrt{\lambda_k}$ (multiplicative) | $\sim \sqrt{\lambda_k}/\lvert\lambda_k - \lambda_j\rvert$ (gap-dependent) |
| Dyson repulsion | Standard | Same form, $D_\text{eff}$ gap-modulated |
Remark: Gap-dependent noise acts as adaptive repulsion — when $\lambda_k \approx \lambda_j$, the stochastic force pushing them apart diverges. This reinforces the deterministic Dyson repulsion, actively preventing singular value clustering.
4. Stationary Distribution under Muon
4.1 Mean-Field Reduction
In the mean-field limit, replace $\sum_{j\neq k} \frac{1}{(\lambda_k - \lambda_j)^2}$ by its average $\sim r/\Delta^2$, where $\Delta$ is the mean gap and $r$ the number of singular values, and absorb the slowly varying $\sqrt{\lambda_k}$ factor into the constant. The effective single-particle diffusion then becomes constant (independent of $\lambda_k$):
\[d\lambda_t = \left(-2\eta\sqrt{\lambda_t} + \alpha_0\right) dt + C\, dW_t\]
where $C = 2\eta\sigma_\xi\sqrt{r}/\Delta$ (up to the absorbed scale factor) is constant. This is additive noise plus nonlinear drift — qualitatively different from CIR.
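A toy Euler–Maruyama comparison of the two reductions (all parameters below are assumed for illustration) makes the predicted tail difference visible via excess kurtosis:

```python
# Toy comparison of the two mean-field reductions: CIR-type (SGD) vs
# additive-noise (Muon). All parameters are assumed for illustration; the
# Muon-like stationary law should have thinner tails (smaller excess kurtosis).
import numpy as np

rng = np.random.default_rng(3)
dt, steps, paths = 0.01, 4000, 4000
alpha0 = 1.0

lam_sgd = np.full(paths, 2.0)   # CIR: d lam = (alpha0 - 0.5 lam) dt + sqrt(lam) dW
lam_muon = np.full(paths, 1.0)  # Muon: d lam = (alpha0 - 2 eta sqrt(lam)) dt + C dW
eta, C = 0.5, 0.45

for _ in range(steps):
    lam_sgd = np.maximum(
        lam_sgd + (alpha0 - 0.5 * lam_sgd) * dt
        + np.sqrt(np.maximum(lam_sgd, 0.0) * dt) * rng.standard_normal(paths), 0.0)
    lam_muon = np.maximum(
        lam_muon + (alpha0 - 2 * eta * np.sqrt(np.maximum(lam_muon, 0.0))) * dt
        + C * np.sqrt(dt) * rng.standard_normal(paths), 0.0)

def excess_kurtosis(x):
    c = x - x.mean()
    return (c**4).mean() / (c**2).mean() ** 2 - 3.0

k_sgd, k_muon = excess_kurtosis(lam_sgd), excess_kurtosis(lam_muon)
print(f"excess kurtosis: SGD-like {k_sgd:.2f}, Muon-like {k_muon:.2f}")
```

The CIR parameters here give a Gamma stationary law with shape 2 (excess kurtosis 3), while the additive-noise process concentrates near its mode with near-Gaussian tails.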
4.2 Stationary Fokker–Planck (TODO: solve)
\[0 = -\partial_\lambda\left[(-2\eta\sqrt{\lambda} + \alpha_0)\, p(\lambda)\right] + \frac{C^2}{2}\, \partial^2_\lambda\left[p(\lambda)\right]\]
TODO: Solve under zero-flux boundary conditions. The restoring force $-2\eta\sqrt{\lambda}$ is sub-linear in $\lambda$, so $p(\lambda)$ should decay faster than Gamma — conjectured Weibull or sub-Gaussian.
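A quick formal integration under the zero-flux condition (a sketch, not a rigorous solution — the boundary behavior at $\lambda = 0$ still needs care) already indicates the tail class:

```latex
% Zero flux: the stationary probability current vanishes, so
%   (C^2/2) \partial_\lambda p = (-2\eta\sqrt{\lambda} + \alpha_0)\, p.
\frac{p'(\lambda)}{p(\lambda)} = \frac{2}{C^2}\left(\alpha_0 - 2\eta\sqrt{\lambda}\right)
\;\Longrightarrow\;
p(\lambda) \propto \exp\!\left[\frac{2}{C^2}\left(\alpha_0 \lambda - \frac{4\eta}{3}\,\lambda^{3/2}\right)\right]
```

The $\lambda^{3/2}$ stretched-exponential decays faster than the Gamma law's exponential tail, consistent with the Weibull-type conjecture (shape parameter $3/2$).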
Contrast with SGD: multiplicative noise $\sim\sqrt{\lambda}$ with linear drift $\sim\lambda$ gives CIR dynamics, hence a Gamma stationary law with a power-law prefactor and exponential tail. Muon's additive noise with sub-linear drift $\sim\sqrt{\lambda}$ should give thinner tails still.
5. Hypothesis
- ESD of Muon-trained weight matrices has thinner tails than SGD-trained.
- Power-law exponent: $\alpha_\text{Muon} > \alpha_\text{SGD}$ (faster tail decay).
- Excess kurtosis: $\kappa_\text{Muon} < \kappa_\text{SGD}$.
- Holds for bulk weight matrices (large, approximately square). Does not hold for output layers or very rectangular matrices — mean-field breaks down with few SVs, and Muon is not designed for these layers anyway.
6. Experiment
6.1 Setup
- Architecture: 2-layer MLP. fc1: $3072 \to 512$, fc2: $512 \to 10$
- Dataset: CIFAR-10
- SGD: lr=0.01, momentum=0.9, 15 epochs, batch 256
- Muon: lr=0.02, momentum=0.95, 15 epochs, batch 256. Matrix params only; biases → AdamW fallback.
- Implementation: `torch.optim.Muon` (PyTorch ≥ 2.7)
6.2 Metrics
| Metric | Definition | Thinner tail = |
|---|---|---|
| Power-law $\alpha$ | MLE fit to tail via the `powerlaw` package | larger $\alpha$ |
| Hill-$\alpha$ | $\left(\frac{1}{k}\sum_{i=1}^k \log\frac{\sigma_{(i)}}{\sigma_{(k)}}\right)^{-1}$, top-10% SVs | larger |
| Excess kurtosis | $\mathbb{E}[(\sigma-\mu)^4]/\text{Var}^2 - 3$ | smaller |
| Tail fraction | fraction of SVs above $x_\text{min}$ | smaller |
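The Hill estimator from the table is a few lines of NumPy. The sketch below follows the formula above and sanity-checks it on synthetic Pareto data with known tail index:

```python
# Hill estimator of the power-law tail index from the top order statistics,
# following the formula in the metrics table; verified on synthetic Pareto
# data with known tail index alpha = 3.
import numpy as np

def hill_alpha(values, tail_frac=0.10):
    """Hill estimate from the top tail_frac fraction of the sorted values."""
    x = np.sort(np.asarray(values))[::-1]       # descending order statistics
    k = max(int(tail_frac * len(x)), 2)
    return 1.0 / np.mean(np.log(x[:k] / x[k - 1]))

rng = np.random.default_rng(4)
# rng.pareto(a) + 1 is exactly Pareto with x_min = 1 and tail index a.
samples = rng.pareto(3.0, size=50_000) + 1.0
est = hill_alpha(samples)
print(f"Hill alpha estimate: {est:.2f}  (true value 3.0)")
```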
6.3 Results
| Layer | Metric | SGD | Muon | Consistent? |
|---|---|---|---|---|
| fc1 | $\alpha$ | — | — | ✓ |
| fc1 | Hill-$\alpha$ | — | — | ✓ |
| fc1 | kurtosis | — | — | ✓ |
| fc1 | tail_frac | — | — | ✓ |
| fc2 | $\alpha$ | — | — | ✗ |
| fc2 | Hill-$\alpha$ | — | — | ✗ |
Fill in numbers after running `esd_sgd_vs_muon.py`.
6.4 Interpretation
- fc1 ($3072 \times 512$): hypothesis supported across all metrics.
- fc2 ($512 \times 10$): hypothesis not supported — expected because:
- Only 10 SVs → mean-field approximation breaks down
- Gradient at fc2 is highly structured (rank ≈ num_classes), not approximately isotropic
- Muon not recommended for output heads by design
7. Missing Pieces and Open Questions
7.1 Critical Ablation (required for publishability)
Current experiment doesn’t rule out the magnitude confound: Muon’s update has a different effective scale than SGD, which alone could affect the ESD.
Proposed ablation: Train with norm-matched SGD — standard SGD update $\nabla\mathcal{L}$ rescaled to have the same Frobenius norm as Muon’s $\tilde{G}$. If this produces heavier tails than Muon, orthogonalization (not scale) is the cause.
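A sketch of the norm matching (the helper name and the full-rank approximation $\|\tilde{G}\|_F \approx \sqrt{\min(m,n)}$ are assumptions for illustration, not from the source):

```python
# Sketch of the proposed norm-matched SGD control: rescale the raw gradient to
# the Frobenius norm an orthogonalized update of the same shape would have, so
# only directional structure differs between the two runs. The helper name and
# the full-rank approximation ||UV^T||_F = sqrt(min(m, n)) are assumptions.
import numpy as np

def norm_matched_sgd_update(grad):
    """Rescale grad to match ||U V^T||_F = sqrt(rank) for a full-rank
    orthogonalized update of the same shape."""
    target = np.sqrt(min(grad.shape))
    return grad * (target / (np.linalg.norm(grad) + 1e-12))

rng = np.random.default_rng(5)
g = rng.standard_normal((512, 128))
g_matched = norm_matched_sgd_update(g)
print(np.linalg.norm(g_matched))
```

In practice one would instead match the measured $\|\tilde{G}\|_F$ per step, since Newton–Schulz output is only approximately orthogonal.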
7.2 Unsolved: Stationary Distribution of Muon’s FP Equation
Steps needed:
- Integrate $0 = -\partial_\lambda[(-2\eta\sqrt{\lambda} + \alpha_0)p] + \frac{C^2}{2}\partial^2_\lambda p$ once under zero-flux BC
- Solve the resulting first-order ODE (possibly via special functions or asymptotic expansion)
- Extract tail exponent, compare with Gamma
- Verify sub-Gaussian / Weibull conjecture
7.3 Other Open Questions
- Does gap-dependent noise enhancement of repulsion lead to a more uniform (Marchenko–Pastur-like) bulk? Would explain Muon’s empirical prevention of rank collapse.
- How does this interact with depth? Attention vs FFN layers have different aspect ratios and gradient structures.
- Connection to μP: does spectral normalization in μP interact with Muon’s spectral dynamics?
- Edge of Stability: Muon’s uniform gradient drift ($-2\eta\sqrt{\lambda_k}$ for all $k$) may interact differently with EoS than SGD, since EoS is driven by the largest SV of the Hessian.
References
- Olsen et al. From SGD to Spectra: A Theory of Neural Network Weight Dynamics. ICML 2025. arXiv:2507.12709
- Martin & Mahoney. Implicit Self-Regularization in Deep Neural Networks. JMLR 2021.
- Aarts & de Wit. Dyson Brownian Motion of Neural Network Weights. arXiv:2403.04567, 2024.
- Jordan et al. Muon: An optimizer for hidden layers. github.com/KellerJordan/Muon, 2024.
- Alstott et al. powerlaw: A Python Package for Analysis of Heavy-Tailed Distributions. PLoS ONE, 2014.