A Simple Toy Model Bridging HTSR & Alpha-REQ
TLDR
Generalization in neural networks has been shown to correlate with one phenomenon expressed in two places: how strongly singular values decay in (i) the weights (Martin and Mahoney’s HTSR theory) and (ii) the features (Agarwal et al.’s $\alpha$-ReQ). This post connects the two approaches for toy linear models to show how HTSR may give rise to $\alpha$-ReQ.
Introduction
Recent advances in deep learning theory have used spectral analysis as a tool to study generalization in neural networks. On one hand, we have Heavy-Tailed Self-Regularization (HTSR) theory, which analyzes a network’s weights to demonstrate that in well-trained networks, the singular values of the weights follow a power-law decay. If the power-law coefficient is less than 2, that signifies over-fitting; coefficients between 2 and 4 signify a well-generalized state, and coefficients above 6 denote under-trained weights. On the other hand, $\alpha$-ReQ analyzes activations, or features, to show that high-quality representations also follow a specific spectral decay that correlates with generalization: a power-law fit of less than 1.5 denotes over-fitting, while 1.5 to 2 denotes general features.
While the two approaches were proposed and are used separately, the properties of weights and features are plainly coupled by design. Hence, this post explores the connection between HTSR and $\alpha$-ReQ through fundamental linear algebra on a toy linear model $Y = XW$.
- First, we will quickly relate the condition number of a matrix to the decay of the singular values.
- Then we will use our linear model to look at how the singular value decay/conditioning of the weights $W$ can impact the features $Y$.
The result in this post is mainly a corollary of those in the vast literature on signal propagation in neural networks. Our aim here is simply to relate, mathematically, two interesting emergences of geometry in distinct spaces: weights and features.
Linking singular value decay to condition number
Intuitively, a steeper decay in the singular value spectrum of a matrix should correspond to a higher condition number, and vice versa: a flatter spectrum implies that the minimum singular value is still reasonably close to the maximum singular value. Let us prove this formally.
We start with a full-rank matrix $A \in \mathbb{R}^{n \times m}$ (with $n \geq m$), so that $A^T A$ is symmetric and positive definite. Then, we can write the condition number of $A$ as
\[\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{\sqrt{\lambda_{\max}(A^T A)}}{\sqrt{\lambda_{\min}(A^T A)}}\]Then, we fit a power law to the eigenvalues of $A^T A$ such that $\lambda_k = c \cdot k^{-\alpha}$, where $k$ is the index of the eigenvalue in the spectrum (in descending order) and $c$ is a fixed scalar. Note that since $A$ is full rank, the number of non-zero eigenvalues of $A^T A$ is $m$, which is also its rank. Accordingly, one can say that if $\lambda_{\max} = c$ then $\lambda_{\min} = c \cdot m^{-\alpha}$.
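As a minimal sketch of such a fit, we can build a matrix $A$ whose spectrum follows $\lambda_k = c \cdot k^{-\alpha}$ exactly and recover $\alpha$ by least squares in log-log space (all names and values below are illustrative, not from the original references):

```python
import numpy as np

# Illustrative toy setup: construct A with a prescribed power-law spectrum,
# then recover alpha from the eigenvalues of A^T A.
rng = np.random.default_rng(0)
m, alpha_true, c = 50, 1.2, 3.0

k = np.arange(1, m + 1)
sigma = np.sqrt(c * k**(-alpha_true))            # singular values of A
U, _ = np.linalg.qr(rng.standard_normal((m, m))) # random orthogonal factors
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = U @ np.diag(sigma) @ V.T                     # A with prescribed spectrum

# Eigenvalues of A^T A are sigma_k^2 = c * k^(-alpha); fit
# log(lambda_k) = log(c) - alpha * log(k) by linear least squares.
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
slope, _ = np.polyfit(np.log(k), np.log(lam), 1)
alpha_fit = -slope
print(round(alpha_fit, 3))  # -> 1.2
```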
Then, we can substitute this in the expression for $\kappa(A)$ to find
\[\kappa(A) = \frac{\sqrt{\lambda_{\max}(A^T A)}}{\sqrt{\lambda_{\min}(A^T A)}} = \frac{\sqrt{c}}{\sqrt{c \cdot m^{-\alpha}}} = m^{\alpha/2}\]This result confirms our intuition: $\kappa(A)$ indeed grows with how fast the singular value spectrum of $A$ decays. In the following, we will use this result to connect HTSR with $\alpha$-ReQ cleanly.
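A quick numerical confirmation of $\kappa(A) = m^{\alpha/2}$ under the power-law spectrum (toy values, assumed for illustration):

```python
import numpy as np

# For lambda_k = c * k^(-alpha), kappa(A) = sqrt(lambda_max / lambda_min)
# should equal m^(alpha / 2) exactly.
m, alpha, c = 100, 1.5, 2.0
k = np.arange(1, m + 1)
lam = c * k**(-alpha)                   # eigenvalues of A^T A
kappa = np.sqrt(lam.max() / lam.min())  # condition number of A
assert np.isclose(kappa, m**(alpha / 2))
print(kappa)  # = 100^0.75 ~ 31.62
```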
Toy model to bridge HTSR and $\alpha$-ReQ
Given a data matrix $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of data points and $d$ is the dimension of each sample’s features, we are interested in a linear transform of $X$ by weights $W \in \mathbb{R}^{d \times m}$ to compute $Y \in \mathbb{R}^{n \times m}$ as
\[Y = X W\]According to HTSR theory, the power-law coefficient of the singular value decay of $W$ should decrease as the weights are updated throughout training, and $\alpha$-ReQ shows how the same happens for the features $Y$. We will now derive a link between the two observations.
For any two compatible matrices $A$ and $B$, the condition number of their matrix multiplication is bounded as
\[\kappa(AB) \leq \kappa(A) \cdot \kappa(B)\]This result follows from standard properties of singular values: given singular values arranged in descending order, $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_N \geq 0$, we have for $A$ and $B$
\[\sigma_N(A)\sigma_i(B) \leq \sigma_i(AB) \leq \sigma_1(A)\sigma_i(B) \quad i=1,2,\dots,N\]Correspondingly, $\sigma_{\max}(AB) \leq \sigma_{\max}(A) \ \sigma_{\max}(B)$ and $\sigma_{\min}(AB) \geq \sigma_{\min}(A) \ \sigma_{\min}(B)$ give the above result for the condition number of a matrix multiplication. Then, we can see that
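The submultiplicativity of the condition number is easy to check numerically; here is a small sanity check on random full-rank matrices (sizes are arbitrary):

```python
import numpy as np

# Verify kappa(AB) <= kappa(A) * kappa(B) over a few random trials.
# The small multiplicative slack absorbs floating-point error.
rng = np.random.default_rng(1)
for _ in range(5):
    A = rng.standard_normal((20, 20))
    B = rng.standard_normal((20, 20))
    assert np.linalg.cond(A @ B) <= np.linalg.cond(A) * np.linalg.cond(B) * (1 + 1e-9)
print("kappa(AB) <= kappa(A) * kappa(B) held on all trials")
```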
\[\kappa(Y) \leq \kappa(X) \cdot \kappa(W) = C_{\text{data}} \cdot \kappa(W)\]Here $\kappa(X)$ is denoted by a fixed scalar $C_{\text{data}}$, which assumes the data matrix to be full rank and fixed. Finally, using the relationship between the condition number and the power-law coefficient of the singular value decay, we obtain the bound
\[{r_Y}^{\alpha_Y/2} \leq (C_{\text{data}}) \cdot {r_W}^{\alpha_W/2} \implies \alpha_Y \leq 2\log_{r_Y}(C_{\text{data}}) + \alpha_W \log_{r_Y}(r_W)\]Here, $r_Y$ and $r_W$ are the ranks of $Y^T Y$ and $W^T W$, respectively. If both $Y$ and $W$ are full rank, then $r_Y = r_W = m$.
This shows that as the power-law coefficient $\alpha_W$ of the singular value decay of the weights $W$ decreases, so does the upper bound on the corresponding coefficient $\alpha_Y$ for the features $Y$.
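We can check the condition-number side of this bound on the toy model directly. The sketch below gives $W$ a power-law singular-value spectrum with exponent $\alpha_W$; all shapes and values are illustrative assumptions:

```python
import numpy as np

# Toy verification of kappa(Y) <= C_data * kappa(W) for Y = X W.
rng = np.random.default_rng(2)
n, d, m, alpha_W = 200, 64, 32, 1.0

X = rng.standard_normal((n, d))                    # full-rank data matrix
k = np.arange(1, m + 1)
sigma_W = np.sqrt(k**(-alpha_W))                   # power-law singular values (c = 1)
U, _ = np.linalg.qr(rng.standard_normal((d, m)))   # d x m, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
W = U @ np.diag(sigma_W) @ V.T                     # weights with prescribed spectrum

Y = X @ W
C_data = np.linalg.cond(X)                         # kappa(X), fixed by the data
assert np.linalg.cond(Y) <= C_data * np.linalg.cond(W) * (1 + 1e-9)
print("kappa(Y):", np.linalg.cond(Y), "bound:", C_data * np.linalg.cond(W))
```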
Discussion on deeper models
It is not difficult to see what would happen for a 2-layer network with the following architecture:
\[Y = XW_1 \ ; \ \widetilde{Y} = \varphi(Y) \ ; \ Z = \widetilde{Y}W_2\] \[X \in \mathbb{R}^{n\times d}, W_1 \in \mathbb{R}^{d\times m}, Y \in \mathbb{R}^{n\times m}, \widetilde{Y} \in \mathbb{R}^{n\times m}, W_2 \in \mathbb{R}^{m \times c}, Z \in \mathbb{R}^{n\times c}\]and $\varphi(\bullet)$ is an element-wise non-linear activation function such as $\text{ReLU}(\bullet)$. While we have derived a relation between the conditioning/singular value decay of $Y$ and $W_1$ above, note that there is no definitive analytical link from $\kappa(Y)$ to $\kappa(Z)$: the non-linearity $\varphi(\bullet)$ changes the conditioning of $Y$ rather opaquely. However, this does not mean we can say nothing about the conditioning of $Z$, as shown next.
Since ReLU-like non-linearities zero-out negative inputs, we cannot know $\text{rank}(\widetilde{Y}^\top \widetilde{Y})$ without explicitly measuring it. Let us denote it by $r_{\widetilde{Y}}$. Then, using the bound on condition number of the product of two matrices, we now write
\[\kappa(Z) \leq \kappa(\widetilde{Y}) \cdot \kappa(W_2) \implies r_Z^{\alpha_Z/2} \leq r_{\widetilde{Y}}^{\alpha_{\widetilde{Y}}/2} \cdot r_{W_2}^{\alpha_{W_2}/2}\]Taking the logarithm base $r_Z$ on both sides, we get
\[\alpha_Z \leq \alpha_{\widetilde{Y}}\cdot \log_{r_Z}(r_{\widetilde{Y}}) + \alpha_{W_2} \cdot \log_{r_Z}(r_{W_2})\]This means that the representation quality of the output is bounded by the representation quality of the input and how “well-trained” the weight operand is. In contrast, for the input layer, the bound depends only on the quality of the weights. In particular, since ReLU can reduce the steepness of the singular value decay, high-quality input features may prevent the quality of output features from collapsing, even under under-trained weights.
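The two-layer setup can be sketched numerically as follows: we measure the rank of the post-ReLU Gram matrix (since it cannot be known in advance) and check the bound on $\kappa(Z)$. Shapes and the random initialization are illustrative assumptions:

```python
import numpy as np

# Sketch of the 2-layer case: Y = X W1, Y_tilde = ReLU(Y), Z = Y_tilde W2.
rng = np.random.default_rng(3)
n, d, m, c = 200, 64, 32, 10

X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, m))
W2 = rng.standard_normal((m, c))

Y = X @ W1
Y_tilde = np.maximum(Y, 0.0)                       # element-wise ReLU
Z = Y_tilde @ W2

# r_{Y_tilde} must be measured explicitly after the non-linearity.
r_Yt = np.linalg.matrix_rank(Y_tilde.T @ Y_tilde)
assert np.linalg.cond(Z) <= np.linalg.cond(Y_tilde) * np.linalg.cond(W2) * (1 + 1e-9)
print("measured rank of post-ReLU Gram matrix:", r_Yt)
```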
Why should we care about this?
While the above result is no ground-breaking insight, it is in fact a nice sanity check: the quality of input features and weights can make or break the quality of output features. In terms of applications, it is interesting to think about interventions that mainly address features, yet remain crucially shaped by weights. In particular, the result above invites speculation about steering of language models. In steering, one adds learnable vectors to certain features within the network in order to modify, or steer, the network’s predictions. Given a query for which we want to optimize our network via steering, we would like to intervene on high-quality features: those that encode rich, general concept modes relevant to our query. In contrast, features that have collapsed to only a few concept modes due to over-fitting would be hard to steer. Moreover, knowing the conditioning (trained-ness) of the weights that operate on the chosen features can give insight into how the effect of our steering propagates through the network. For instance, ill-conditioned weights can likely suppress directions that are potentially relevant to optimizing the query. However, this remains conjecture at present, requiring experimental validation and deeper analysis that we leave for future work.