A Simple Toy Model Bridging HTSR & Alpha-ReQ
TLDR
Generalization in neural networks has been shown to correlate with one phenomenon measured in two places: how strongly singular values decay in (i) the weights (Martin and Mahoney’s HTSR theory) and (ii) the features (Agarwal’s $\alpha$-ReQ). This post connects the two approaches for a toy linear model to show how HTSR may give rise to $\alpha$-ReQ.
Introduction
Recent advances in deep learning theory have used spectral analysis to study the generalization of neural networks. On one hand, Heavy-Tailed Self-Regularization (HTSR) theory analyzes a network’s weights and shows that, in well-trained networks, the singular values of the weights follow a power-law decay. On the other hand, $\alpha$-ReQ analyzes activations (features) and shows that high-quality representations also exhibit a specific spectral decay that correlates with generalization.
While the two approaches were proposed and are used separately, the properties of weights and features are clearly coupled by design. Hence, this post explores the connection between HTSR and $\alpha$-ReQ through elementary linear algebra on a toy linear model $Y = XW$.
- First, we will quickly relate the condition number of a matrix to the decay of the singular values.
- Then we will use our linear model to look at how the singular value decay/conditioning of the weights $W$ can impact the features $Y$.
The result in this post is mainly a corollary of results in the vast literature on signal propagation in neural networks. Our aim here is simply, yet mathematically, to relate two interesting emergent geometric properties in distinct spaces, namely weights and features.
Linking singular value decay to condition number
Intuitively, one can see that a steeper decay in the singular value spectrum of a matrix should correspond to a higher condition number and vice versa, as a flatter spectrum implies that the minimum singular value is still reasonably close to the maximum singular value. Let us prove this formally.
We start with a matrix $A \in \mathbb{R}^{n \times m}$ (with $n \geq m$) that has full rank, so that $A^T A$ is symmetric and positive definite and all of its eigenvalues are strictly positive. Then, we can write the condition number of $A$ as
\[\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{\sqrt{\lambda_{\max}(A^T A)}}{\sqrt{\lambda_{\min}(A^T A)}}\]
Then, assume that the eigenvalues of $A^T A$ follow a power law $\lambda = c_0 u^{-\alpha}$, where $u$ is the scalar rank index and $c_0$ is a fixed constant. Since the eigenvalues decay with $u$, the largest eigenvalue sits at the smallest index and the smallest eigenvalue at the largest index:
\[\lambda_{\max} = c_0 (u_{\min})^{-\alpha} \quad \text{and} \quad \lambda_{\min} = c_0 (u_{\max})^{-\alpha}\]
Since $u_{\max} = m \, u_{\min}$ (for instance, the index runs from $u_{\min} = 1$ to $u_{\max} = m$), we can substitute this into the expression for $\kappa(A)$ to find
\[\kappa(A) = \frac{\sqrt{\lambda_{\max}}}{\sqrt{\lambda_{\min}}} = \frac{\sqrt{c_0 (u_{\min})^{-\alpha}}}{\sqrt{c_0 (m \, u_{\min})^{-\alpha}}} = m^{\alpha/2}\]
This confirms our intuition: the condition number $\kappa(A)$ grows exponentially with the power-law exponent $\alpha$ of the spectrum, so a steeper decay corresponds to worse conditioning. This result will let us connect HTSR with $\alpha$-ReQ cleanly in the following.
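To sanity-check this relation numerically, here is a minimal NumPy sketch (the dimensions $n, m$ and the values of $\alpha, c_0$ are arbitrary choices for illustration): it constructs a matrix whose spectrum follows the assumed power law and compares its condition number against $m^{\alpha/2}$.

```python
import numpy as np

n, m, alpha, c0 = 200, 50, 2.0, 3.0

# Build A = U diag(sigma) V^T with prescribed singular values sigma_i = sqrt(c0 * i^(-alpha)),
# so the eigenvalues of A^T A follow the assumed power law lambda_i = c0 * i^(-alpha).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((n, m)))   # n x m, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((m, m)))   # m x m, orthogonal
sigma = np.sqrt(c0 * np.arange(1, m + 1) ** (-alpha))
A = U @ np.diag(sigma) @ V.T

print(np.linalg.cond(A), m ** (alpha / 2))         # both ~ 50 for alpha = 2, m = 50
```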
Toy model to bridge HTSR and $\alpha$-ReQ
Given a data matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of data points and $d$ is the dimension of each sample’s features, we are interested in a linear transform of $X$ by weights $W \in \mathbb{R}^{d \times m}$ to compute features $Y \in \mathbb{R}^{n \times m}$ as
\[Y = X W\]
According to HTSR theory, the power-law exponent fitted to the decay of the singular values of $W$ should decrease as the weights are updated throughout training, and $\alpha$-ReQ reports the same trend for the features $Y$. We will now derive a link between the two observations.
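As an aside, here is a rough sketch of how one might estimate such exponents empirically: a least-squares fit of $\log \lambda_i$ against $\log i$ on the ranked spectrum. HTSR tooling (e.g., WeightWatcher) and $\alpha$-ReQ use more careful fitting procedures; the dimensions and the synthetic power-law weights below are arbitrary choices for illustration.

```python
import numpy as np

def fit_alpha(M):
    """Least-squares fit of lambda_i ~ c0 * i^(-alpha) to the eigenvalues of M^T M."""
    lam = np.linalg.svd(M, compute_uv=False) ** 2       # eigenvalues of M^T M, descending
    ranks = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(lam), deg=1)
    return -slope                                        # alpha is minus the log-log slope

rng = np.random.default_rng(0)
n, d, m = 1000, 128, 64
X = rng.standard_normal((n, d))                          # data matrix
U, _ = np.linalg.qr(rng.standard_normal((d, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
W = U @ np.diag(np.arange(1, m + 1) ** (-1.5)) @ V.T     # weights with lambda_i ~ i^(-3)
Y = X @ W                                                # features

print(fit_alpha(W), fit_alpha(Y))                        # feature decay tracks weight decay
```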
For any two compatible matrices $A$ and $B$, the condition number of their product is bounded as
\[\kappa(AB) \leq \kappa(A) \cdot \kappa(B)\]
This follows from standard properties of singular values: with the singular values arranged in descending order, $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_N \geq 0$, we have for full-rank $A$ and $B$
\[\sigma_N(A)\,\sigma_i(B) \leq \sigma_i(AB) \leq \sigma_1(A)\,\sigma_i(B), \quad i=1,2,\dots,N\]
Correspondingly, $\sigma_{\max}(AB) \leq \sigma_{\max}(A)\,\sigma_{\max}(B)$ and $\sigma_{\min}(AB) \geq \sigma_{\min}(A)\,\sigma_{\min}(B)$, which together give the above bound on the condition number of a product. Then, we can see that
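For the skeptical reader, here is a quick numerical sanity check of these inequalities on arbitrary random square matrices (not a proof, just a check):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))

s_A = np.linalg.svd(A, compute_uv=False)    # descending: s_A[0] = sigma_1, s_A[-1] = sigma_N
s_B = np.linalg.svd(B, compute_uv=False)
s_AB = np.linalg.svd(A @ B, compute_uv=False)

tol = 1e-9
assert np.all(s_A[-1] * s_B <= s_AB + tol)                             # lower bound on sigma_i(AB)
assert np.all(s_AB <= s_A[0] * s_B + tol)                              # upper bound on sigma_i(AB)
assert np.linalg.cond(A @ B) <= np.linalg.cond(A) * np.linalg.cond(B)  # kappa(AB) <= kappa(A) kappa(B)
print("all inequalities hold")
```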
\[\kappa(Y) \leq \kappa(X) \cdot \kappa(W) = C_{\text{data}} \cdot \kappa(W)\]
Here we denote $\kappa(X)$ by a fixed scalar $C_{\text{data}}$, which assumes the data matrix is full rank and held fixed. Finally, using the relationship between the condition number and the power-law exponent of the singular value decay (with constant prefactors $c_W$ and $c_Y$ for the respective fits of $W$ and $Y$), we arrive at the bound
\[c_{Y} \cdot m^{\alpha_Y/2} \leq (C_{\text{data}} \cdot c_{W}) \cdot m^{\alpha_W/2} \implies m^{\alpha_Y/2} \leq \frac{C_{\text{data}} \cdot c_{W}}{c_{Y}} \cdot m^{\alpha_W/2}\]
Taking logarithms base $m$ gives $\alpha_Y \leq \alpha_W + 2 \log_m\!\left(\frac{C_{\text{data}} \, c_W}{c_Y}\right)$. In other words, as the power-law exponent $\alpha_W$ of the singular value decay of the weights $W$ decreases, so does the upper bound on the corresponding exponent $\alpha_Y$ for the features $Y$.
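A minimal sketch tying the pieces together, assuming weights with an exact power-law spectrum and a fixed Gaussian data matrix (both arbitrary choices for illustration): sweeping $\alpha_W$ and checking that the exponent implied by $\kappa(Y) = m^{\alpha_Y/2}$ stays below the cap implied by $C_{\text{data}} \cdot \kappa(W)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 128, 64
X = rng.standard_normal((n, d))                  # fixed data; C_data = kappa(X)
C_data = np.linalg.cond(X)

for alpha_W in (4.0, 3.0, 2.0, 1.0):             # progressively flatter weight spectra
    U, _ = np.linalg.qr(rng.standard_normal((d, m)))
    V, _ = np.linalg.qr(rng.standard_normal((m, m)))
    W = U @ np.diag(np.arange(1, m + 1) ** (-alpha_W / 2)) @ V.T   # lambda_i ~ i^(-alpha_W)
    Y = X @ W

    kappa_W, kappa_Y = np.linalg.cond(W), np.linalg.cond(Y)
    alpha_Y_obs = 2 * np.log(kappa_Y) / np.log(m)             # exponent implied by kappa(Y) = m^(alpha_Y/2)
    alpha_Y_cap = 2 * np.log(C_data * kappa_W) / np.log(m)    # cap implied by kappa(Y) <= C_data * kappa(W)
    print(f"alpha_W={alpha_W:.1f}  alpha_Y_obs={alpha_Y_obs:.2f}  cap={alpha_Y_cap:.2f}")
```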
Discussion
What happens when $\kappa(X)$ is not stationary? This is precisely the case in deep neural networks, where the inputs to a linear projection evolve during training due to non-linear activation functions and cross-activation interactions. While one could still chain the bound layer by layer, starting from the fixed condition number of the inputs to the network, the non-linearities complicate any analytical expression for the relationship between HTSR and $\alpha$-ReQ.