Token Embeddings Violate the Manifold Hypothesis
This article walks through my understanding of the paper Token Embeddings Violate the Manifold Hypothesis, which I highly recommend reading.
Manifold
In many machine learning papers, the manifold hypothesis is either tacitly or explicitly assumed: high-dimensional data are not uniformly scattered throughout a high-dimensional space, but rather concentrate near a lower-dimensional manifold. For token embeddings, this means the token cloud \(e(T)\subset \mathbb{R}^{\ell}\) (defined precisely below) locally resembles a smooth \(d\)-dimensional manifold with bounded curvature.
The definition of a manifold is global: if even a single point lacks a neighborhood homeomorphic to Euclidean space, the whole space fails to be a manifold. Such a point is called a singularity.
Since the token input-embedding matrix defines the salient subspace of the latent space for language models, let's take a look. Assume the token subspace has reach \(\tau\). We would like to estimate the dimension of a ball centered at a token \(\psi\) with radius \(r < \tau\). The idea is to construct two statistical tests (a sketch of the statistic they both probe follows the table):
| | Manifold test | Fiber bundle test |
|---|---|---|
| \(H_0\) | There is a unique dimension at \(\psi\). | The dimension at \(\psi\) within a radius-\(r\) ball does not increase as \(r\) grows. |
| \(H_1\) | There is not a unique dimension at \(\psi\). | The dimension at \(\psi\) increases for some \(r\). |
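Here is a minimal Python sketch of the quantity both tests probe (my own illustration, not the paper's exact test statistics): the log-log slope of neighbor counts against radius, evaluated at several scales. A single stable slope is consistent with the manifold \(H_0\); a slope that climbs as the radius grows points to the fiber-bundle \(H_1\).

```python
import numpy as np

def local_dimension_profile(emb: np.ndarray, idx: int, quantiles) -> np.ndarray:
    """Slopes of log(#neighbors) vs. log(radius) between consecutive scales;
    each slope is a local dimension estimate at that scale."""
    dists = np.linalg.norm(emb - emb[idx], axis=1)
    radii = np.quantile(dists[dists > 0], quantiles)  # radii with guaranteed counts
    counts = np.array([(dists <= r).sum() for r in radii], dtype=float)
    return np.diff(np.log(counts)) / np.diff(np.log(radii))

# Synthetic sanity check: a 2-d plane embedded in R^64 should give slopes near 2.
# For a real model, `emb` would be its input-embedding matrix (one row per token).
rng = np.random.default_rng(0)
emb = rng.normal(size=(20_000, 2)) @ rng.normal(size=(2, 64))
print(local_dimension_profile(emb, idx=0, quantiles=[0.01, 0.02, 0.05, 0.10]))
```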
A low reach \(\tau\) indicates high curvature, and a high reach indicates low curvature. Intuitively, if the manifold has low curvature, we can roll a massive ball along it without the ball getting wedged into the surface; with high curvature, only a small ball fits into the bends. We can therefore use the reach as a proxy for the curvature of the token subspace.
Modeling the token subspace as a fiber bundle
Let \(T\) denote the vocabulary (the set of discrete tokens a language model knows how to emit) and \(e: T\to\mathbb{R}^{\ell}\) the embedding map, so \(e(T)\subset\mathbb{R}^{\ell}\) is the resulting token cloud.
Given an embedding dimension \(\ell\), the token cloud lives inside a subspace \(\mathcal{M}\subset\mathbb{R}^{\ell}\) and inherits the ambient Euclidean metric and volume. Because the inclusion \(\mathcal{M}\hookrightarrow\mathbb{R}^{\ell}\) is one-to-one with full-rank Jacobian, tiny neighborhoods behave like \(\mathbb{R}^d\) for \(d=\dim \mathcal{M}\). Classical geometry therefore gives, for the volume \(v(r)\) of a ball of radius \(r\) in \(\mathcal{M}\),
\[v(r) = K r^d + (\text{curvature terms})\cdot r^{d+1},\]
As shown in Gray (1974), the leading term grows like \(r^d\) and the correction encodes curvature. Large corrections indicate high curvature. To handle regions where the manifold picture breaks down (boundaries, cusps, or dimension jumps), we use fiber bundles: locally the space looks like \(B\times F\), with \(B\) smooth (the base) and \(F\) capturing boundary-like behavior.
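As a toy check of that leading-order \(r^d\) growth, we can sample a manifold whose dimension and reach we know exactly; for the unit 2-sphere in \(\mathbb{R}^3\) (\(\tau = 1\)), the fitted slope of log-volume against log-radius should land near \(d = 2\). This is my own synthetic example, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(200_000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # uniform samples on the unit 2-sphere

center = pts[0]
radii = np.geomspace(0.02, 0.2, 10)                 # keep r well below tau = 1
counts = [int((np.linalg.norm(pts - center, axis=1) <= r).sum()) for r in radii]

slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
print(f"estimated dimension ~ {slope:.2f}")         # close to 2
```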
Curvature is not well-defined at singularities, so we switch to the embedded reach. For \(\mathcal{M}\subset\mathbb{R}^{\ell}\), set
\[\tau(\mathcal{M}) = \sup\Big\{\rho \ge 0 : \forall y\in\mathbb{R}^{\ell},\ \operatorname{dist}(y,\mathcal{M}) < \rho \Rightarrow \#\operatorname*{arg\,min}_{x\in\mathcal{M}} \|y-x\|_2 = 1 \Big\},\]
the largest radius for which every point still has a unique closest projection. Convex sets have \(\tau=\infty\); cusps or self-intersections force \(\tau=0\).
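As a quick numeric sanity check (again mine, not the paper's), Federer's characterization \(\tau = \inf_{x\ne y} \|y-x\|^2 / \big(2\,\operatorname{dist}(y-x,\, T_x\mathcal{M})\big)\) turns into a plug-in estimator over sample pairs. For a circle of radius \(R\) every pair yields exactly \(R\), so the estimate is exact; note the tangent/normal directions are assumed known here, which a real point cloud would have to estimate.

```python
import numpy as np

R = 3.0
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
pts = R * np.stack([np.cos(theta), np.sin(theta)], axis=1)
normals = pts / R                                   # outward unit normal at each sample

tau_hat = np.inf
for i in range(len(pts)):
    diff = np.delete(pts, i, axis=0) - pts[i]       # y - x over all other samples
    off_tangent = np.abs(diff @ normals[i])         # distance of y - x to the tangent line
    tau_hat = min(tau_hat, np.min(np.sum(diff**2, axis=1) / (2 * off_tangent)))

print(tau_hat)                                      # ~= 3.0, the circle's radius
```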
With reach under control we can compare volumes. Write \(B_r(\psi)\) for the Euclidean ball of radius \(r\) around \(\psi\). Take any volume form \(\nu\) on \(T\); for a continuous \(e: T\to\mathbb{R}^{\ell}\), define the pushforward
\[(e_*\nu)(V) = \nu\big(e^{-1}(V)\big),\]
for measurable \(V\subset\mathbb{R}^{\ell}\). If \(e\) is an embedding or fiber-bundle map, \(e_*\nu\) is again a volume form, so the geometry of \(T\) carries over to \(e(T)\).
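For an actual vocabulary, \(\nu\) is just the counting measure on \(T\), so the pushforward volume of a ball is literally a neighbor count over the embedding matrix; `emb` below is assumed to be a model's input-embedding matrix, one row per token.

```python
import numpy as np

def pushforward_ball_volume(emb: np.ndarray, psi: np.ndarray, r: float) -> int:
    """(e_* nu)(B_r(psi)) with nu the counting measure: #tokens within r of psi."""
    return int((np.linalg.norm(emb - psi, axis=1) <= r).sum())
```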
Assume now that \(e\) is smooth. Then \(e(T)\) is a \(\dim T\)-dimensional manifold with volume form \(e_*\nu\), and Gray's result tells us that for \(\psi\in e(T)\) and \(r\ll\tau\)
\[(e_*\nu)(B_r(\psi)) = \Theta\big(r^{\dim T}\big).\]
Compactness guarantees a strict containment radius: there exists \(\rho(\psi)>0\) so that \(B_r(\psi)\) stays clear of the edge of \(e(T)\) for all \(r\le\rho(\psi)\), with bounds satisfying \(0< r_1\le \rho(\psi)\le r_2 < \tau\). Hence the \(r^{\dim T}\) scaling holds up to \(r=\rho(\psi)\). Once \(\rho(\psi) < r < \tau\), the ball leaks past the edge of \(e(T)\) and the volume is strictly smaller than the \(\Theta(r^{\dim T})\) envelope.
To bound that leaked regime, push \(\nu\) through the global bundle projection \(p: T\to S\) (here \(S\) is the same “base” as the local \(B\) from \(B\times F\)). We obtain
\[(e_*\nu)(B_r(\psi)) = \nu\big(e^{-1}(B_r(\psi))\big) \le (p_*\nu)\big(p(e^{-1}(B_r(\psi)))\big) = O\big(r^{\dim S}\big).\]
Re-centering at \(r = \rho(\psi)\) gives the two regimes in a single expression:
\[ (e_*\nu)(B_r(\psi)) = \begin{cases} \Theta\big(r^{\dim T}\big), & 0 \le r \le \rho(\psi), \\ (e_*\nu)(B_{\rho(\psi)}(\psi)) + O\big((r-\rho(\psi))^{\dim S}\big), & \rho(\psi) < r < \tau. \end{cases} \]
That matches the two-case statement of Theorem 1 from the paper: up until the ball hits the edge of \(e(T)\) it grows with slope \(\dim T\), and once it “sticks out” the slope drops to the base dimension \(\dim S\). Full technical details (compactness arguments, tubular neighborhoods, and the exact inequalities) live in the appendix of the paper.
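The slope break is easy to reproduce synthetically. Below (my own toy example), points are drawn uniformly from a thin strip, a trivial bundle with a one-dimensional base and a short fiber of half-width `eps`: ball counts grow with slope 2 while the ball fits inside the strip, then drift toward slope 1 once it sticks out.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 500_000, 0.01                              # eps: strip half-width (fiber size)
pts = np.stack([rng.uniform(0.0, 1.0, n), rng.uniform(-eps, eps, n)], axis=1)
center = np.array([0.5, 0.0])

radii = np.geomspace(1e-3, 0.2, 12)
counts = [int((np.linalg.norm(pts - center, axis=1) <= r).sum()) for r in radii]
slopes = np.diff(np.log(counts)) / np.diff(np.log(radii))
print(np.round(slopes, 2))                          # ~2 at small r, drifting toward ~1
```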
Context windows cannot hide singularities
A \(w\)-token context is simply a tuple \((x_1,\ldots,x_w)\) of tokens, i.e. a point of the Cartesian product \(T^w = T\times\cdots\times T\). If \(T\) fails to be a manifold at some token \(x\), then \(T^w\) fails wherever a coordinate equals \(x\). Therefore, stacking tokens cannot eliminate singularities.
Let \(Z\) be a \(d\)-dimensional smooth manifold containing \(T\subset Z\subset\mathbb{R}^{\ell}\). A transformer with context window \(w\) and output length \(m\) induces a map \(E:T^w\to Z^m\). When
\[2wd < m\ell \qquad \text{and} \qquad m\le w,\]
the Jacobian of \(E\) is generically full rank: the right inequality keeps the output dimension \(m\ell\) no larger than the input latent dimension \(w\ell\), while the left inequality is a Whitney-type condition, giving the output more than twice the intrinsic dimension \(wd\) of \(T^w\) so that a generic map has no self-collisions. Therefore \(E\) restricts to a topological embedding on \(T^w\). Equivalently, there exists a map \(\Phi\) such that
\[\Phi(e(x_1),\ldots,e(x_w)) = (y_1,\ldots,y_m) \in Z^m, \qquad \Phi \text{ is injective with a continuous inverse on its image.}\]
Embeddings preserve local topology, so any singularity carried by \((x_1,\ldots,x_w)\) survives in the output. Hence, if the token subspace \(T\) has irregular points, the generated tokens inherit them whenever the inequalities above are met.
Intuitively, a longer context window simply copies \(T\) multiple times and feeds the product through the transformer. As soon as \(2wd < m\ell\), the mixing remains injective (there are enough observed coordinates to keep track of every token) so the geometry of \(T^w\) lives on inside the response. Therefore, singularities persist.
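For a sense of scale, here is a check of both inequalities with made-up but plausible numbers: a GPT2-sized embedding width and an assumed local dimension \(d = 8\) for the token subspace (the specific values are illustrative, not taken from the paper).

```python
ell = 768   # embedding dimension (GPT2-small scale)
d = 8       # assumed local dimension of the token subspace (illustrative)
w = 1024    # context window length
m = 256     # generated output length

print(2 * w * d < m * ell)   # 16384 < 196608 -> True: room to embed T^w
print(m <= w)                # True: output no longer than the context
```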
There's a lot more in the paper that I highly recommend reading: the deterministic neighbor-count tests, synthetic sanity checks, and a full sweep over GPT2, Llemma7B, Mistral7B, and Pythia 6.9B showing wide variation in local dimension and many tokens violating both hypotheses. The authors also explore auxiliary questions (e.g., approximating softmax as non-parametric regression or via Taylor series) that are perfect rabbit holes if you want to tinker further.
Paper link: https://arxiv.org/abs/2509.19059