How Spurious Features are Memorized: Precise Analysis for Random and NTK Features
This paper offers a theoretical explanation for why AI models memorize irrelevant data, revealing how model stability and feature alignment play key roles.
Background & Academic Lineage
The Origin & Academic Lineage
The problem addressed in this paper originates from a fundamental challenge in deep learning: the tendency of models to overfit and memorize spurious features present in the training dataset. This phenomenon, where models learn irrelevant patterns that are uncorrelated with the actual learning task, has been widely observed empirically. However, a rigorous theoretical framework to precisely quantify and understand this behavior has been conspicuously absent.
Historically, the emergence of this specific problem is rooted in the practical limitations of deep learning models when deployed in real-world scenarios. Spurious correlations, such as background noise or specific visual textures, have been shown to lead to poor out-of-distribution robustness (e.g., Geirhos et al., 2019; Zhou et al., 2021), fairness issues (Zliobaite, 2015), and even data privacy breaches (Leino & Fredrikson, 2020). For instance, a model trained to identify cats might inadvertently learn to associate cats with a particular background, like a grassy field. If presented with a cat in a different setting, say, a snowy landscape, its performance might degrade significantly. Formally, we model the sample $z$ as composed by two distinct parts, i.e., $z = [x, y]$, where $x$ is the core feature and $y$ the spurious one, see Figure 1 for an illustration.
Figure 1. Example of a training sample z (top-left) and its spuri- ous counterpart zs (top-right). In experiments, we add a noise background (y) around the original images (x) before training (bottom-left). We then query the trained model only with the noise component (bottom-right)
Previous research has explored related concepts like "benign overfitting," where models achieve zero training error but still generalize well, and the in-distribution generalization of interpolating models such as random features (RF) and neural tangent kernels (NTK). However, these powerful theoretical tools typically model noise as residing in the labels, not within the input data itself. This crucial distinction meant that the memorization of spurious features—which are inherent to the input data structure—was not adequately covered. Furthermore, while practical work has attempted to disentangle spurious features from core features, theoretical approaches predominantly focused on the complexity of features or the degree of overparameterization, often neglecting the architectural role in this memorization process. This paper aims to bridge this theoretical gap by providing an analytically tractable framework to quantify how and why spurious features are memorized, specifically considering their independence from the true label.
Intuitive Domain Terms
To help a zero-base reader grasp the core concepts, here are some specialized terms from the paper, translated into intuitive analogies:
- Spurious Features: Imagine you're teaching a child to identify apples. The "core feature" is the apple itself (its shape, color, stem). A "spurious feature" might be the specific fruit bowl the apple is always placed in during training, or the pattern on the tablecloth beneath it. These are irrelevant to what makes an apple an apple, but the child might accidentally associate them with "apple-ness."
- Memorization (of Spurious Features): This is like the child, when asked to find an apple, looking for the fruit bowl or tablecloth pattern instead of the apple itself. If you show them an apple in a different setting, they might struggle because they've "memorized" the irrelevant context rather than the essential characteristics.
- Feature Alignment: Think of it as how much a model "sees" a similarity between a complete picture (e.g., an apple in its usual bowl) and just the irrelevant part of that picture (e.g., only the fruit bowl, with the apple digitally removed). High feature alignment means the model strongly associates the spurious part with the whole, even when the core object is gone.
- Stability ($S_{z_1}$): Consider a chef perfecting a recipe. If removing one specific ingredient from their practice (say, a pinch of salt) causes the entire dish to taste completely different, their recipe is "unstable" with respect to that ingredient. If the taste remains largely the same, it's "stable." In machine learning, it measures how much a model's predictions change if a single training example is removed from the dataset.
Notation Table
| Notation | Description |
|---|---|
| $z = [x, y]$ | Input sample decomposed into core feature $x$ and spurious feature $y$. |
| $g$ | True label, dependent only on $x$. |
| $f(z, \theta^*)$ | Model's output for sample $z$ with learned parameters $\theta^*$. |
| $z^s = [-, y]$ | Spurious counterpart of sample $z$, where core feature $x$ is removed (e.g., replaced with zeros). |
| $\text{Cov}(f(z^s, \theta^*), g_i)$ | Metric for quantifying memorization of spurious features. |
| $S_{z_1}(z)$ | Stability of model's output for generic sample $z$ with respect to training sample $z_1$. |
| $F_\phi(z, z_1)$ | Feature alignment between samples $z$ and $z_1$ in the feature space induced by $\phi$. |
| $\phi$ | Feature map (e.g., for Random Features or Neural Tangent Kernel models). |
| $P_{\Phi_{-1}}$ | Projection matrix onto the subspace spanned by features of all training samples except $z_1$. |
| $\gamma_\phi$ | Concentrated value of feature alignment, dependent on activation function and $\alpha$. |
| $R_{Z_{-1}}$ | Generalization error of the model trained on $Z_{-1}$ (dataset without $z_1$). |
| $\alpha = d_y/d$ | Fraction of spurious feature dimension to total input dimension. |
Problem Definition & Constraints
Core Problem Formulation & The Dilemma
Deep learning models are widely known to overfit and memorize spurious features present in their training datasets. These spurious features are patterns that are uncorrelated with the actual learning task. For instance, an image might contain a cat (core feature, $x$) on a specific background (spurious feature, $y$). The true label $g$ (e.g., "cat") depends only on $x$, while $y$ is uninformative. The current state is that models, especially over-parameterized ones, often learn these irrelevant patterns, leading to undesirable behaviors like poor out-of-distribution robustness, fairness issues, and data privacy risks.
The desired endpoint is to develop a rigorous theoretical framework that can precisely understand and quantify how spurious features are memorized. This framework should allow for the design of machine learning models that are less prone to memorizing such correlations, thereby improving robustness, fairness, and privacy.
The exact missing link or mathematical gap this paper attempts to bridge lies in the existing theoretical machinery. While prior work has characterized benign overfitting and in-distribution generalization, it typically models noise as residing in the labels, not within the input data as spurious features. Furthermore, previous theoretical approaches often focused on feature complexity or overparameterization without adequately capturing the role of the model's architecure and activation function in this memorization process. This paper introduces an analytically tractable framwork to precisely characterize the memorization of spurious features by decomposing it into two key components: (i) the stability of the model with respect to individual training samples, and (ii) the feature alignment between a training sample and its spurious counterpart. The memorization is formally quantified by the covariance $\text{Cov}(f(z_i^s, \theta^*), g_i)$, where $f(z_i^s, \theta^*)$ is the model's output on a spurious sample $z_i^s = [x, y_i]$ (where $x$ is an independent feature, effectively removing the core feature $x_i$), and $g_i = g(x_i)$ is the true label.
The painful trade-off or dilemma that has trapped previous researchers is that avoiding the memorization of spurious features is not always straightforward. In many cases, memorization can actually be optimal for achieving high accuracy, particularly in over-parameterized models trained to achieve zero training error. This creates a tension: pursuing maximum training accuracy might inherently lead to memorizing irrelevant patterns. The paper's findings suggest that memorization of spurious features weakens as the generalization capability of the model increases, implying an intrinsic link between these two phenomena rather than a simple trade-off where one must be sacrificed for the other.
Constraints & Failure Modes
The problem of precisely characterizing spurious feature memorization is insanely difficult due to several harsh, realistic walls:
- Uncorrelated Nature of Spurious Features: The core definition of spurious features in this paper is that they are uncorrelated with the learning task. This makes them inherently uninformative for the true task, yet models still learn them. Distinguishing and disentangling these from core features is a fundamental challenge.
- Unpredictable Test-Time Behavior: Memorization can lead to models exhibiting unpredictable behaviors at test time, such as poor performance on samples where the core feature is different but the spurious feature is the same as in the training data. This highlights the practical implications of failure modes.
- Model Architecture and Activation Function Dependence: The model's architecture and its activation function play a crucial role in how spurious features are memorized. Previous theoretical frameworks often overlooked this dependency, making it difficult to develop generalizable insights.
- Mathematical Tractability: To make the problem analytically tractable, the authors focus on specific model classes:
- Generalized Linear Regression: The analysis is primarily conducted within this simplified setting, although empirical results suggest its applicability to more complex neural networks.
- Random Features (RF) and Neural Tangent Kernel (NTK) Regression: These two prototypical settings are chosen for their analytical tractability, allowing for precise characterization of feature alignment.
- Kernel Invertibility: A critical mathematical constraint is that the induced kernel $K$ (or $K_{NTK}$) must be invertible. This condition is necessary for the model to perfectly interpolate the training data, which is a prerequisite for the theoretical framework to apply. Invertibility implies specific relationships between the number of data points $N$ and the number of neurons $k$.
- Over-parameterization Regime: The analysis strictly operates within an over-parameterized regime. For Random Features, this means $N \log^3 N = o(k)$, $\sqrt{d} \log d = o(k)$, and $k \log^4 k = o(d^2)$. For NTK, it requires $N \log^3 N = o(kd)$, $N > d$, and $k = O(d)$. These conditions ensure interpolation and allow for the theoretical machinery to work, but they represent specific operating conditions that might not always hold in practice.
- Activation Function Properties: For RF models, the activation function $\phi$ must be a non-linear L-Lipschitz function (Assumption 5.3). For NTK models, the activation function $\phi$ must be non-linear, and its derivative $\phi'$ must be L-Lipschitz (Assumption 6.2). These properties are crucial for the theoritical framework's concentration results.
- Data Distribution Assumptions: Strict assumptions are made on the data distribution (Assumption 5.1):
- Training samples $z_i = [x_i, y_i]$ are i.i.d. from $P_x \times P_y$.
- Core features $x_i$ are independent of spurious features $y_i$.
- Specific norm conditions: $||x||_2 = \sqrt{d_x}$ and $||y||_2 = \sqrt{d_y}$.
- Mean-zero for $x$ and $y$.
- $P_x$ and $P_y$ must satisfy Lipschitz concentration. These assumptions simplify the analysis but might not hold for all real-world datasets.
Why This Approach
The Inevitability of the Choice
The authors embarked on this specific analytical approach because a rigorous theoretical framework to quantify the memorization of spurious features in deep learning was conspicuously absent. While deep learning models are empirically known to overfit and memorize these irrelevant features, and extensive empirical efforts have been made to mitigate this, a foundational theoretical understanding was missing.
The exact moment the authors realized traditional "SOTA" theoretical methods were insufficient is clearly articulated in the introduction. Existing powerful theoretical machinery, particularly work characterizing benign overfitting in interpolating models like Random Features (RF) and Neural Tangent Kernels (NTK), primarily models noise as residing in the labels, not in the input data. This crucial distinction meant that these established frameworks, despite their sophistication, could not adequately explain or quantify the memorization of spurious features that are part of the input sample itself and are uncorrelated with the true learning task. To bridge this significant gap, the authors needed an analytically tractable framework that could precisely characterize this phenomenon when spurious features are embedded within the input.
Comparative Superiority
The superiority of this approach isn't measured by typical performance metrics like accuracy or speed, as it's a theoretical framework rather than a new practical algorithm. Instead, its qualitative superiority lies in its ability to provide a rigorous and analytically tractable understanding of a previously unquantified phenomenon. The previous "gold standard" was, in essence, a theoretical void regarding spurious feature memorization in the input data.
This method introduces a novel decomposition of spurious feature memorization into two distinct components: (i) the model's stability with respect to individual training samples, and (ii) the feature alignment between the spurious feature and the full sample. The latter, feature alignment, is a novel concept introduced by this paper. This structural advantage allows for a precise characterization of how memorization occurs, unveiling the specific roles of the model architecture and its activation function. By focusing on RF and NTK models, which are known for their analytical tractability while still exhibiting deep-learning-like behaviors, the authors gain the ability to derive closed-form expressions and concentration results (e.g., for $\gamma_{\text{RF}}$ and $\gamma_{\text{NTK}}$). This level of analytical depth and quantification was simply not available in prior work, which largely treated spurious features either empirically or as label noise. It's a qualitative leap in theoretical understanding, providing a foundational lens through which to view a complex problem.
Alignment with Constraints
The chosen method perfectly aligns with the problem's inherent constraints and definitions. The core problem is to quantify the memorization of spurious features $y$ that are uncorrelated with the true label $g$, where the input sample is defined as $z = [x, y]$ (core feature $x$, spurious feature $y$). The memorization is specifically quantified by the covariance $\text{Cov}(f(z^s, \theta^*), g_i)$, where $z^s = [-, y]$ represents querying the model with only the spurious component.
The solution's unique properties, particularly its focus on generalized linear regression with Random Features (RF) and Neural Tangent Kernel (NTK) models, are a direct "marriage" to these requirements.
1. Analytical Tractability: RF and NTK models are chosen precisely because they are analytically tractable. This property is paramount for developing a rigorous theoretical framework that can quantify memorization, rather than just observing it. This tractability allows for the derivation of precise mathematical relationships, such as the connection between memorization, generalization error, and the newly defined feature alignment.
2. Decomposition of Input: The framework's ability to analyze feature alignment $F_\phi(z^s, z)$ directly addresses the problem's decomposition of the input into core and spurious parts. The definition of $z^s$ as the spurious counterpart is central to the problem, and the feature alignment term directly measures the similarity in feature space between the original sample and its spurious version.
3. Role of Model and Activation Function: The analysis of feature alignment for RF and NTK models explicitly unveils the role of the model and its activation function in memorization. This directly addresses the need to understand how deep learning architectures contribute to this phenomenon, going beyond just observing it.
In essence, the method's analytical rigor and its ability to dissect the problem into quantifiable components (stability and feature alignment) are perfectly suited to the constraint of providing a precise characterization of spurious feature memorization in a theoretically sound manner.
Rejection of Alternatives
The paper doesn't explicitly reject other popular models like GANs or Diffusion for practical tasks. Instead, it highlights the limitations of existing theoretical frameworks for deep learning when applied to the specific problem of spurious feature memorization in input data.
The authors implicitly reject the direct application of prior theoretical work on benign overfitting because such work typically models noise as being present in the labels rather than in the input data. While these frameworks, often utilizing RF and NTK models, are powerful for understanding generalization in interpolating models, they simply do not capture the mechanism of memorizing spurious features that are part of the input sample itself. The problem definition, where $z = [x, y]$ and $y$ is an uninformative spurious feature in the input, necessitates a different theoretical lens.
Therefore, the "rejection" is not of alternative deep learning architectures for solving tasks, but of alternative theoretical modeling assumptions that would have failed to provide a precise, analytically tractable framework for the specific problem of input-data spurious feature memorization. The paper's contribution is to fill this theoretical void by developing a framework tailored to this specific type of spurious correlation.
Figure 6. Test and spurious accuracies as a function of the number of training samples N, for various binary classification tasks. In the first two plots, we consider the NTK model in (24) with k = 100 trained over Gaussian data with d = 1000. The labeling function is g(x) = sign(u⊤x). We repeat the experiments for α = {0.25, 0.5}, and for the two activations whose derivatives are ϕ′ 2 = h0 + h1 and ϕ′ 4 = h0 + h3, where hi denotes the i-th Hermite polynomial (see Appendix A.1). In the last two plots, we consider the same model with ReLU activation, trained over two MNIST and CIFAR-10 classes. The width of the noise background is 10 pixels for MNIST and 8 pixels for CIFAR-10, see Figure 1. The spurious accuracy is obtained by querying the model only with the noise background from the training set, replacing all the other pixels with 0, and taking the sign of the output. As we consider binary classification, an accuracy of 0.5 is achieved by random guessing. We plot the average over 10 independent trials and the confidence band at 1 standard deviation
Mathematical & Logical Mechanism
The Master Equation
The core of this paper's analysis, which elegantly connects the model's stability to the alignment of features, is encapsulated in Lemma 4.1. This lemma establishes a fundamental relationship between the stability of the model's output for a generic sample and its stability for a specific training sample, mediated by a novel concept called "feature alignment."
The master equation, representing this transformation logic, is:
$$S_{z_1}(z) = F_\phi(z, z_1) S_{z_1}(z_1)$$
This equation is further powered by the definition of the feature alignment term $F_\phi(z, z_1)$:
$$F_\phi(z, z_1) := \frac{\phi(z)^T P_{\Phi_{-1}} \phi(z_1)}{\|P_{\Phi_{-1}} \phi(z_1)\|_2^2}$$
Term-by-Term Autopsy
Let's dissect these equations to understand the role of each component.
Equation: $S_{z_1}(z) = F_\phi(z, z_1) S_{z_1}(z_1)$
-
$S_{z_1}(z)$
- Mathematical Definition: This term, defined as $f(z, \theta^*) - f(z, \theta^*_{-1})$, represents the difference in the model's output for a generic sample $z$. The first part, $f(z, \theta^*)$, is the output when the model is trained on the full dataset $Z$. The second part, $f(z, \theta^*_{-1})$, is the output when the model is trained on the dataset $Z_{-1}$ (which excludes the specific training sample $z_1$).
- Physical/Logical Role: This is the model's stability with respect to the training sample $z_1$, evaluated at the generic sample $z$. It quantifies how much the model's prediction for $z$ changes if $z_1$ is removed from (or added to) the training set. A large value indicates that $z_1$ has a significant influence on the model's prediction for $z$, implying low stability.
- Why subtraction: Subtraction is used here to directly measure the change or impact. It isolates the effect of $z_1$'s presence in the training data on the prediction for $z$.
-
$F_\phi(z, z_1)$
- Mathematical Definition: This is the "feature alignment" term, defined by the second equation above.
- Physical/Logical Role: This term measures the similarity or alignment between the feature representation of the generic sample $z$ and the feature representation of the training sample $z_1$, specifically after $z_1$'s features have been projected onto the subspace spanned by the other training samples. It captures how much $z$ "looks like" $z_1$ in the model's internal feature space, considering the context of the rest of the training data. A higher value means stronger alignment.
- Why multiplication: In the master equation, multiplication combines the "self-influence" of $z_1$ (its own stability) with how that influence "propogates" to other samples ($z$) based on their feature alignment. It suggests a direct scaling: if $z$ is highly aligned with $z_1$, then $z_1$'s impact on itself will strongly affect $z$.
-
$S_{z_1}(z_1)$
- Mathematical Definition: This is $f(z_1, \theta^*) - f(z_1, \theta^*_{-1})$, which is the stability of the model's output for the specific training sample $z_1$ itself.
- Physical/Logical Role: This term quantifies how much the model's prediction for $z_1$ changes when $z_1$ is included in the training set versus when it's excluded. In the over-parameterized regime where the model interpolates the data, $f(z_1, \theta^*)$ often equals the true label $g_1$. Thus, $S_{z_1}(z_1)$ becomes $g_1 - f(z_1, \theta^*_{-1})$, representing the "error" or "surprise" the model experiences when $z_1$ is added, given its prior knowledge from $Z_{-1}$. This term is directly linked to the generalization error of the model trained on $Z_{-1}$.
- Why subtraction: Similar to $S_{z_1}(z)$, it measures the direct change in output for $z_1$.
Equation: $F_\phi(z, z_1) := \frac{\phi(z)^T P_{\Phi_{-1}} \phi(z_1)}{\|P_{\Phi_{-1}} \phi(z_1)\|_2^2}$
-
$\phi(z)$ and $\phi(z_1)$
- Mathematical Definition: $\phi: \mathbb{R}^d \to \mathbb{R}^p$ is the feature map. $\phi(z)$ and $\phi(z_1)$ are the feature vectors in $\mathbb{R}^p$ corresponding to input samples $z$ and $z_1$, respectively.
- Physical/Logical Role: These are the model's internal representations of the input data points in a higher-dimensional feature space. They are how the model "perceives" the raw inputs.
- Why vectors: They represent points in a multi-dimensional feature space, where each dimension corresponds to a learned feature.
-
$P_{\Phi_{-1}}$
- Mathematical Definition: This is a projection matrix that projects vectors onto the linear subspace spanned by the rows of $\Phi_{-1}$. $\Phi_{-1}$ is the feature matrix of the training dataset $Z_{-1}$ (all training samples except $z_1$).
- Physical/Logical Role: This projector "filters" the feature vector $\phi(z_1)$, retaining only the component that lies within the subspace already "learned" or "represented" by the other training samples. It effectively contextualizes $\phi(z_1)$ by removing any part unique to $z_1$ that is orthogonal to the existing data manifold.
- Why projector: A projector is a standard linear algebra tool to decompose a vector into components that lie within and orthogonal to a given subspace. It's used here to isolate the "shared" information.
-
$\phi(z)^T P_{\Phi_{-1}} \phi(z_1)$
- Mathematical Definition: This is the dot product between the feature vector of $z$ and the projected feature vector of $z_1$.
- Physical/Logical Role: This measures the linear correlation or similarity between $z$'s feature representation and the "contextualized" feature representation of $z_1$. A larger dot product implies greater alignment.
- Why dot product: The dot product is a fundamental way to quantify the extent to which two vectors point in the same direction or overlap.
-
$\|P_{\Phi_{-1}} \phi(z_1)\|_2^2$
- Mathematical Definition: This is the squared Euclidean (L2) norm of the projected feature vector of $z_1$.
- Physical/Logical Role: This term serves as a normalization factor. It represents the "strength" or "magnitude" of the part of $z_1$'s feature representation that is aligned with the rest of the training data. Normalizing by this value ensures that $F_\phi(z, z_1)$ measures relative alignment, not absolute magnitude, making it comparable across different samples.
- Why squared L2 norm: The L2 norm is a standard measure of vector length. Squaring it simplifies calculations and is common in many similarity metrics, often resembling a cosine similarity when normalized.
Step-by-Step Flow
Imagine a mechanical assembly line where raw data points are transformed and interact to produce a final measure of influence.
- Raw Input Stage: We begin with two raw data points: a generic sample $z$ (for which we want to assess stability) and a specific training sample $z_1$ (whose influence we're examining).
- Feature Transformation Unit ($\phi$): Both $z$ and $z_1$ enter the feature transformation unit. This unit, powered by the feature map $\phi$, converts the raw inputs from their original $\mathbb{R}^d$ space into richer, higher-dimensional feature vectors, $\phi(z)$ and $\phi(z_1)$, in $\mathbb{R}^p$. This is like converting raw materials into standardized, high-quality components.
- Dual Model Training Bay: In parallel, two versions of the model are "trained" (or their parameters are determined):
- Full Model ($\theta^*$): One model is trained using the entire dataset $Z$ (including $z_1$).
- Reduced Model ($\theta^*_{-1}$): Another model is trained using a slightly smaller dataset $Z_{-1}$ (which is $Z$ minus $z_1$).
This creates two slightly different "machines," one with $z_1$ as a design input, and one without.
- Self-Stability Measurement Station ($S_{z_1}(z_1)$): The specific training sample $z_1$ is fed into both the Full Model and the Reduced Model. Their outputs, $f(z_1, \theta^*)$ and $f(z_1, \theta^*_{-1})$, are compared. The difference, $S_{z_1}(z_1)$, is calculated. This measures how much $z_1$ itself "stresses" or "changes" the model's internal state when it's part of the training data. This value is then stored.
- Contextual Projection Unit ($P_{\Phi_{-1}}$): The feature vector $\phi(z_1)$ (from step 2) enters this unit. Here, it's projected onto the subspace defined by the feature matrix $\Phi_{-1}$ (which represents all other training samples). This operation, $P_{\Phi_{-1}} \phi(z_1)$, extracts the part of $z_1$'s feature that is already "understood" or "aligned" with the existing training data. It's like filtering out the unique noise of $z_1$ to see its common patterns.
- Alignment Calculation Module ($F_\phi(z, z_1)$):
- The output from the Contextual Projection Unit, $P_{\Phi_{-1}} \phi(z_1)$, is sent to a norm calculator, which computes its squared length, $\|P_{\Phi_{-1}} \phi(z_1)\|_2^2$. This provides a normalization factor.
- Simultaneously, $\phi(z)$ (from step 2) and $P_{\Phi_{-1}} \phi(z_1)$ are fed into a dot product engine, yielding $\phi(z)^T P_{\Phi_{-1}} \phi(z_1)$. This measures their raw similarity.
- Finally, the dot product is divided by the squared norm. The result, $F_\phi(z, z_1)$, is the feature alignment score. This score indicates how well the generic sample $z$ "fits" with the contextualized representation of $z_1$.
- Final Stability Assembly: The self-stability value $S_{z_1}(z_1)$ (from step 4) and the feature alignment score $F_\phi(z, z_1)$ (from step 6) are brought together. They are multiplied to produce the final output: $S_{z_1}(z)$. This final value quantifies the total influence of training sample $z_1$ on the model's prediction for $z$, showing how $z_1$'s internal impact is transmitted to $z$ through feature similarity.
Optimization Dynamics
The paper's analysis is grounded in the behavior of generalized linear regression models operating in the over-parameterized regime. This context is crucial for understanding how the mechanism learns, updates, and converges.
- Interpolation and Loss Landscape: In the over-parameterized setting, the model has enough capacity (parameters $p$ are sufficiently large relative to data points $N$) to achieve zero training error. This means the model can perfectly interpolate all training labels, $G$, even if they contain noise or spurious correlations. The empirical risk, defined by a quadratic loss $\min_\theta \|\Phi\theta - G\|_2^2$, is convex. Gradient descent, when run to convergence, finds a solution $\theta^*$ that perfectly fits the training data. This implies that the loss landscape has a "flat" minimum or a subspace of global minima where the training error is zero. The model doesn't just find a good solution; it finds an interpolating solution.
- Gradient Behavior and Convergence: The gradient descent process iteratively updates the model parameters $\theta$. Because the loss function is quadratic and the model is over-parameterized, gradient descent will converge to an interpolating solution. Specifically, it converges to the minimum L2-norm interpolator (as per equation 3). This means that the gradients drive the parameters towards a region where the training loss is zero, and among those zero-loss solutions, it picks the one closest to the initialization in L2 norm. The gradients effectively "pull" the model towards perfect fit, regardless of whether that fit is due to core or spurious features.
- Stability and Generalization: A key insight is the established link between model stability and generalization error. Equation (6) shows that the expected squared stability of a training sample $z_1$ (i.e., $E_{z_1 \sim P_Z} [S_{z_1}(z_1)^2]$) is equivalent to the generalization error $R_{Z_{-1}}$ of the model trained without $z_1$. As the model learns and its generalization error decreases over time (e.g., with more data or better training), its stability with respect to individual samples tends to increase. This means the model becomes less sensitive to the presence or absence of any single training point.
- Role of Feature Alignment ($\gamma_\phi$) in Memorization: The constant $\gamma_\phi$, which is a concentrated version of the feature alignment $F_\phi(z_1^s, z_1)$, is a critical factor in the memorization dynamics. Theorems 5.4 and 6.3 demonstrate that $\gamma_\phi$ is strictly positive and bounded. This implies that there is always some degree of feature alignment between spurious and core features. The value of $\gamma_\phi$ is not learned directly but is an inherent property determined by the chosen model architecture (RF or NTK), the activation function, and the proportion of spurious features ($\alpha$). For instance, activation functions with dominant low-order Hermite coefficients lead to higher $\gamma_\phi$, increasing memorization.
- How Memorization Occurs: The mechanism learns spurious features not through explicit optimization for them, but as an unavoidable consequence of interpolating the training data and the inherent feature alignment. The model's objective is to minimize the overall training error. If spurious features are present and exhibit sufficient alignment with core features (quantified by $\gamma_\phi$), the model will memorize these correlations to achieve zero training error. The optimization process, by driving the model to perfectly fit all labels, inadvertently picks up these spurious correlations. The loss landscape does not penalize memorization of spurious features directly; it only guides the model to fit the data. The final extent of memorization is thus a product of the model's generalization capability (lower $R_{Z_{-1}}$) and the intrinsic feature alignment ($\gamma_\phi$), as shown in equation (11). This highlights that even a well-generalizing model can memorize spurious features if the feature alignment is strong. The dependance on the activation function is a crucial insight here.
Figure 4. Test and spurious accuracies as a function of the number of training samples N, for a fully connected (FC, first two plots), and a small convolutional neural network (CNN, last two plots). In the first plot, we use synthetic (Gaussian) data with d = 1000, and the labeling function is g(x) = sign(u⊤x). As we consider binary classification, the accuracy of random guessing is 0.5. The other plots use subsets of the MNIST and CIFAR-10 datasets, with an external layer of noise added to images, see Figure 1. As we consider 10 classes, the accuracy of random guessing is 0.1. We plot the average over 10 independent trials and the confidence band at 1 standard deviation
Figure 5. Test and spurious accuracies as a function of the number of training samples N, for two ResNet architectures. We use subsets of the CIFAR-10 dataset, with an external layer of noise added to images, see Figure 1. As we consider 10 classes, the accuracy of random guessing is 0.1. We plot the average over 10 independent trials and the confidence band at 1 standard deviation
Results, Limitations & Conclusion
Experimental Design & Baselines
The authors meticulously designed experiments to validate their theoretical claims regarding spurious feature memorization in deep learning models. They focused on two prototypical settings: Random Features (RF) models, as defined by Equation (14), and Neural Tangent Kernel (NTK) models, described by Equation (24).
For data, they utilized both synthetic and standard datasets. The synthetic data consisted of Gaussian distributions with an input dimension $d = 1000$, where the labeling function was a simple binary classifier, $g(x) = \text{sign}(u^T x)$. For real-world validation, they employed subsets of the MNIST dataset (binary classification of digits 1 and 7) and CIFAR-10 (binary classification of cats and ships).
Spurious features were introduced by modeling each sample $z$ as a concatenation of a core feature $x$ and a spurious feature $y$, i.e., $z = [x, y]$. For image datasets, $y$ was implemented as an external layer of noise background added to the original images $x$, as illustrated in Figure 1. Specifically, for MNIST, a noise background of 10 pixels width was used, and for CIFAR-10, 8 pixels.
The primary metric for quantifying memorization was "spurious accuracy." This was obtained by querying the trained model only with the spurious component $z^s = [-, y]$ (where the core feature $x$ was replaced with zeros) and then checking if the output matched the true label $g(x)$. An accuracy of 0.5 for binary classification (or 0.1 for 10-class classification) represents random guessing, serving as a clear baseline for non-memorization. The "victims" in these experiments were essentially the models themselves, as the goal was to understand how and when they memorized, rather than defeating external models.
The experimental architecture was designed to ruthlessly prove their mathematical claims by varying key parameters:
- Number of training samples ($N$): Experiments tracked test accuracy and spurious accuracy as $N$ increased, ranging from $10^1$ to $10^4$ or $10^5$.
- Activation functions: For RF models (Figure 2), they compared activations $\phi_2 = h_1 + h_2$ and $\phi_4 = h_1 + h_4$, which have different dominant Hermite coefficients. For NTK models (Figure 6), they used derivatives of activations $\phi'_2 = h_0 + h_1$ and $\phi'_4 = h_0 + h_3$.
- Fraction of spurious feature ($\alpha$): They investigated the impact of $\alpha = d_y/d$ (the ratio of spurious feature dimension to total input dimension) on both test and spurious accuracies (Figure 7).
- Neural network architechtures: Beyond the theoretical RF and NTK models, they empirically tested fully connected (FC), convolutional neural networks (CNN), and ResNet architectures (ResNet18 and ResNet50) on standard datasets (Figures 4 and 5) to demonstrate the broader applicability of their theory.
All models were trained in the interpolating regime, aiming for 0 training error, a setting where memorization is known to occur. Each experiment was averaged over 10 independent trials, with confidence bands indicating one standard deviation.
What the Evidence Proves
The evidence presented in the paper provides a definitive, undeniable validation of the core mathematical mechanism proposed: that spurious feature memorization is a consequence of the model's generlization error and the novel concept of "feature alignment."
The central theoretical claim, articulated in Equations (11) and (27), posits that the covariance between the model's output on a spurious sample and the true label (a measure of memorization) is bounded by $\gamma_\phi \sqrt{R_{Z_{-1}}} \sqrt{\text{Var}(g_1)}$. Here, $\gamma_\phi$ represents the feature alignment and $R_{Z_{-1}}$ is the generalization error. This equation is the definitive evidence that memorization is directly linked to generalization error and feature alignment.
The numerical experiments robustly confirm this relationship:
- Memorization and Generalization Error: Figures 2, 3, 4, 5, and 6 consistently show that as the number of training samples $N$ increases, the test accuracy improves (implying a decrease in generalization error), and concurrently, the spurious accuracy decreases (indicating reduced memorization). This inverse relationship between generalization performance and spurious memorization is a direct empirical validation of the theoretical bound. The models, in essence, "defeat" random guessing by achieving higher test accuracy, but in doing so, they also exhibit spurious memorization, which then decreases as they generalize better.
Figure 2. Test and spurious accuracies as a function of the number of training samples N, for various binary classification tasks. In the first two plots, we consider the RF model in (14) with k = 105 trained over Gaussian data with d = 1000. The labeling function is g(x) = sign(u⊤x). We repeat the experiments for α = {0.25, 0.5} and for the two activations ϕ2 = h1 + h2 and ϕ4 = h1 + h4, where hi denotes the i-th Hermite polynomial (see Appendix A.1). In the last two plots, we consider the same model with ReLU activation, trained over two MNIST and CIFAR-10 classes. The width of the noise background is 10 pixels for MNIST and 8 pixels for CIFAR-10, see Figure 1. The spurious accuracy is obtained by querying the model only with the noise background from the training set, replacing all the other pixels with 0, and taking the sign of the output. As we consider binary classification, an accuracy of 0.5 is achieved by random guessing. We plot the average over 10 independent trials and the confidence band at 1 standard deviation
- Role of Activation Functions: The paper's theory, particularly Theorems 5.4 and 6.3, predicts that the constant $\gamma_\phi$ (feature alignment) depends on the activation function's Hermite coefficients. Specifically, activations with dominant higher-order Hermite coefficients are predicted to lead to less memorization. This is strikingly confirmed in Figures 2 and 6. For both RF and NTK models on synthetic data, activations like $\phi_4$ (or $\phi'_4$) which have higher-order Hermite coefficients, consistently show lower spurious accuracy compared to $\phi_2$ (or $\phi'_2$) as $N$ increases. This demonstrates that the choice of activation function plays a crucial role in shaping the model's susceptibility to memorizing spurious features.
- Impact of Spurious Feature Fraction ($\alpha$): Figure 7, which plots accuracies as a function of $\alpha = d_y/d$, provides further strong evidence. It shows that while test accuracy remains largely independent of $\alpha$, spurious accuracy monotonically increases with $\alpha$. This aligns perfectly with the theoretical predictions of Theorems 5.4 and 6.3, where $\gamma_{RF}$ and $\gamma_{NTK}$ are shown to depend on $\alpha$. This ruthlessly proves that the extent of spuriousness in the input directly influences memorization.
Figure 7. Test and spurious accuracies as a function of α. We consider RF (first and second plot) and NTK (third and fourth plot) models trained on a synthetic dataset. The settings are the same as in Figures 2 and 6, and we use a ReLU activation function. The number of training samples is fixed to N = 200
- Transferability Across Architectures and Datasets: A significant strength of the evidence is its transferability. Although the core mathematical framework is developed for generalized linear regression with RF and NTK, the empirical results on standard datasets (MNIST, CIFAR-10) and diverse neural network architechtures (fully connected, convolutional, and ResNet models in Figures 3, 4, and 5) show the same trends. This suggests that the underlying mechanisms of spurious feature memorization, as characterized by stability and feature alignment, are fundamental and not limited to simplified theoretical models. The consistent behavior across these varied settings provides undeniable proof that the core mechanism works in reality.
- Strictly Positive Feature Alignment: Theorems 5.4 and 6.3 prove that the feature alignment $F_\phi(z_1^s, z_1)$ concentrates to a strictly positive constant $\gamma > 0$. This "geometric overlap" guarantees that spurious correlations are indeed memorized as long as the generalization error is not vanishing, which is the case in interpolating models.
Limitations & Future Directions
This paper presents a robust theorectical framework and compelling empirical evidence for understanding spurious feature memorization. However, like all scientific endeavors, it operates within certain constraints and opens numerous avenues for future exploration.
One primary limitation of the current work is its focus on spurious features that are uncorrelated with the true learning task, meaning they are independent of the ground-truth label $g$. While this simplified setting allows for precise analytical characterization, many real-world scenarios involve spurious correlations that are subtly or strongly linked to the label. Extending this framework to cases where the spurious feature $y$ is correlated with the ground-truth label $g$ would be a significant and complex undertaking, requiring new mathematical tools to disentangle causal and spurious relationships.
Another intriguing direction involves testing the model's behavior with novel spurious features. What happens if a trained model is queried with a spurious feature $y'$ that was not present in the training set, but is correlated with a feature $y$ that was seen? This could shed light on the model's ability to generalize spurious patterns or its vulnerability to new, unseen biases, which is crucial for out-of-distribution robustness.
The paper also briefly mentions the potential to capture the role of "simplicity bias" within this formalism. Simplicity bias refers to the phenomenon where models tend to learn "easy" patterns first, even if they are spurious. Integrating simplicity bias into the feature alignment and stability framework could provide a more holistic understanding of why certain spurious features are memorized preferentially. This would require a deeper characterization of feature complexity and its interaction with model learning dynamics.
While the mathematical formalism of feature alignment (Lemma 4.1) is general enough to apply to any feature map $\phi$ (including those from complex fully-connected, convolutional, or attention layers), the current paper provides a precise analysis only for Random Features and Neural Tangent Kernel models. A substantial future direction would be to characterize the feature alignment for these more complex, feature-learning architectures. This would enable a direct comparison of different deep learning models and architectures to establish which ones are inherently less prone to memorizing spurious correlations, moving beyond the current generalized linear regression setting.
Finally, the paper's impact statement explicitly notes its theoretical nature and refrains from discussing societal consequences. However, the implications of understanding and mitigating spurious feature memorization are profound, touching upon critical areas such as robustness to distribution shifts, fairness, and data privacy. Future work could bridge this gap by translating the theoretical insights into actionable design principles for machine learning models that are not only accurate but also robust, fair, and private. This would involve developing practical methodologies based on the $\gamma_\phi$ and $R_{Z_{-1}}$ terms to guide model development and deployment in sensitive applications.
Connections to Other Fields
Mathematical Skeleton
The paper's mathematical core involves quantifying the relationship between a model's sensitivty to individual data points (stability) and a novel geometric measure of "feature alignment" in high-dimensional feature spaces. This is achieved through concentration results for kernel-based models, leveraging tools from random matrix theory and functional analysis (Hermite polynomials).
Adjacent Research Areas
Algorithmic Stability and Generalization Bounds
The concept of model stability, particularly the leave-one-out stability $S_{z_1}$ (Definition 3.1, Equation 5), is a cornerstone of statistical learning theory for deriving generalization boundss. This work directly builds upon and extends this classical notion by connecting it to the memorization of spurious features. The paper shows that memorization is a natural consequence of the generalization error, mediated by the feature alignment, which is a novel contribution.
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2:499–526.
Neural Tangent Kernel (NTK) and Random Feature Models
The paper's analysis is explicitly framed within the contexts of Random Features (RF) and Neural Tangent Kernel (NTK) regression models. The feature maps $\phi_{RF}(z)$ (Equation 14) and $\phi_{NTK}(z)$ (Equation 24) are central to the entire theoreical framework. The work provides a precise characterization of feature alignment $F_\phi(z^s, z_1)$ (Equation 8) and its concentration properties (Theorems 5.4 and 6.3) for these models, thereby deepening the understanding of how model architecture and activation functions influence memorization in these widely studied kernel regimes.
Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS).
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
High-Dimensional Probability and Random Matrix Theory
The theoretical derivations throughout the paper heavily rely on advanced tools from high-dimensional probability and random matrix theory. For instance, the Hanson-Wright inequality (Adamczak, 2015) is used to prove concentration results for the feature alignment (Lemma D.7). Schur's inequality (Schur, 1911) is applied in Lemma D.1 to bound the smallest eigenvalue of Khatri-Rao products of feature matrices. These techniques are crucial for analyzing the behavior of feature matrices and kernels in the over-parameterized, high-dimensional settings considered.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge university press.