Abstract
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Prior work connects linear self-attention (LSA) to gradient descent (GD), but this connection has mostly been established under simplified conditions: zero-mean Gaussian priors over task weights and zero initialization for GD. Subsequent studies have challenged this view as overly restrictive, showing that with multi-layer or nonlinear attention, self-attention performs optimization-like inference that is akin to, but distinct from, GD.
We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimate of the query's label, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further show that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and validate them experimentally on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the linear regression case, we augment widely used LLMs with initial-guess capabilities and show that their performance improves on a semantic similarity task.
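To make the setting concrete, here is a minimal numpy sketch of the ICL linear regression task with one GD step on the in-context least-squares loss. All names and values (`d`, `n`, `mu`, `eta`) are illustrative choices, not the paper's exact setup; the point is only that the prediction depends on the initialization `w0`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 64                       # feature dimension, context length (toy values)
mu = np.ones(d)                    # non-zero Gaussian prior mean over task weights
w = mu + rng.normal(size=d)        # task weights drawn around the prior mean
X = rng.normal(size=(n, d))        # in-context inputs
y = X @ w                          # in-context labels
x_q = rng.normal(size=d)           # query input

def one_step_gd_prediction(w0, eta=0.5):
    """Predict the query label after one GD step on the in-context
    least-squares loss, starting from initialization w0."""
    grad = X.T @ (X @ w0 - y) / n
    return x_q @ (w0 - eta * grad)

pred_zero  = one_step_gd_prediction(np.zeros(d))  # classic zero initialization
pred_prior = one_step_gd_prediction(mu)           # initialization at the prior mean
```

The zero-initialization variant is the one matched by standard LSA constructions; starting from the prior mean is the better-aligned starting point that the paper's initial guess is meant to recover.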
Key Contributions
- A limitation of multi-head LSA under non-zero priors: We show that when regression weights have a non-zero prior mean, multi-head linear self-attention cannot in general reproduce one-step gradient descent, even with many attention heads.
- Query initialization is the decisive factor: We identify the initial query prediction y_q as the source of the gap. Misaligned initialization creates a persistent error, while correcting the initial guess is sufficient to recover the gradient-descent solution.
- A minimal architectural extension: We introduce yq-LSA, which equips LSA with a trainable initial guess. This restores the ICL-GD correspondence in the non-zero prior setting without requiring a larger architectural change.
- Theory and experiments align: We provide theoretical analysis and empirical validation showing head-count saturation, the persistent gap for non-zero priors, and improved in-context performance on a semantic similarity task when initial guesses are introduced in pretrained LLMs.
Methodology Overview
The paper revisits the connection between in-context learning and gradient descent in a linear regression setting. The analysis focuses on how multi-head linear self-attention behaves when the prior over task weights has a non-zero mean, which better reflects realistic pretraining assumptions than the standard zero-mean setup.
The main intervention is to make the query initialization explicit. Instead of fixing the query prediction to zero, the proposed yq-LSA model introduces a trainable initial guess for the query, allowing the attention update to start from a better-aligned point.
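As a rough illustration of where the initial guess enters, the sketch below writes a single linear self-attention (no softmax) update on a query token that carries its own label slot. The parameterization with merged matrices `P` and `Q` is a common simplification in the LSA-as-GD literature, not the paper's exact notation, and the weight choice at the end is one illustrative construction.

```python
import numpy as np

def lsa_predict(X, y, x_q, y_q, P, Q):
    """One linear self-attention (no softmax) update on the query token.

    Context tokens are e_i = [x_i; y_i]; the query token is [x_q; y_q],
    where y_q is the initial guess. P and Q merge the usual value and
    key/query projections (illustrative names).
    """
    n = X.shape[0]
    E = np.concatenate([X, y[:, None]], axis=1)   # (n, d+1) context tokens
    e_q = np.concatenate([x_q, [y_q]])            # query token carries the guess
    update = P @ (E.T @ E / n) @ Q @ e_q          # linear attention on the query
    return y_q + update[-1]                       # read off the label coordinate

# One classic weight choice: keys/queries ignore the labels, and the value
# projection writes an eta-scaled correction into the label coordinate. With
# y_q = 0 this yields (eta / n) * x_q^T X^T y, i.e. one GD step from zero.
d, eta = 5, 0.5
Q = np.zeros((d + 1, d + 1)); Q[:d, :d] = np.eye(d)
P = np.zeros((d + 1, d + 1)); P[-1, -1] = eta
```

Under this construction the attention update itself is unchanged by y_q; the trainable initial guess only shifts where the update starts, which is exactly the degree of freedom that yq-LSA adds.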
Main Findings
- Increasing the number of heads helps only up to a saturation point: once the number of heads reaches the feature dimension plus one, adding more heads does not improve the achievable ICL risk.
- Multi-head LSA only matches one-step GD in the special case of a zero prior mean. Under non-zero prior means, a systematic performance gap remains.
- A correctly chosen query initialization closes this gap. In the linear setting, yq-LSA recovers the one-step gradient-descent behavior.
- The same idea transfers beyond the toy setup: adding non-trivial initial guesses improves in-context performance for pretrained language models on a semantic similarity task.
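The gap under a non-zero prior mean can be illustrated with a small Monte Carlo sketch (dimensions and step size are toy choices): averaged over tasks drawn around a non-zero prior mean `mu`, one GD step started from the prior mean achieves a lower query risk than one step from zero, the initialization implicitly assumed by standard LSA constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, trials = 5, 32, 0.3, 2000
mu = np.full(d, 2.0)                          # non-zero prior mean (toy choice)

def one_step_gd_risk(w0):
    """Average squared query error of one GD step from w0, over random tasks."""
    errs = []
    for _ in range(trials):
        w = mu + rng.normal(size=d)           # task weights around the prior mean
        X = rng.normal(size=(n, d))
        y = X @ w
        x_q = rng.normal(size=d)
        w1 = w0 - eta * X.T @ (X @ w0 - y) / n
        errs.append((x_q @ (w1 - w)) ** 2)
    return float(np.mean(errs))

risk_zero  = one_step_gd_risk(np.zeros(d))    # zero init: the persistent gap
risk_prior = one_step_gd_risk(mu)             # prior-mean init closes much of it
```

The residual after one step is (I - eta * X^T X / n)(w0 - w), so the risk scales with the squared distance from the initialization to the task weights; a well-aligned starting point is what the trainable initial guess buys.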
