From QK-clip to Muon clip
QK-clip is a simple but sharp idea: if attention logits explode, rescale the query and key projections so the largest logit is capped. The same intuition survives in multi-head latent attention, but the shared rotary key projection forces a careful tweak. This post walks through the math and the fix, usually called muon clip.
Why exploding logits are harmful
For a token \(i\), attention weights are
\[
A_{ij} = \frac{\exp(S_{ij})}{\sum_t \exp(S_{it})},
\]
where \(S_{ij}\) is the pre-softmax logit between query \(i\) and key \(j\).
Let \(j^* = \operatorname*{argmax}_j S_{ij}\), so \(S_{ij^*} \ge S_{ij}\) for all \(j\). Dividing the numerator and denominator by \(\exp(S_{ij^*})\), the largest entry for that row, makes the collapse explicit:
\[
A_{ij} = \frac{\exp(S_{ij} - S_{ij^*})}{\sum_t \exp(S_{it} - S_{ij^*})} \le \exp\!\left(S_{ij} - S_{ij^*}\right),
\]
since the \(t = j^*\) term alone already makes the denominator at least \(1\).
If the gaps \(S_{ij^*} - S_{ij}\) are large, the distribution collapses, so \(A_{ij}\) is essentially zero for every \(j\) other than \(j^*\). This low-entropy regime starves most keys of gradient. Taking the Jacobian makes it clear:
\[
\frac{\partial A_{ij}}{\partial S_{ik}} = A_{ij}\left(\delta_{jk} - A_{ik}\right).
\]
To see this, write \(A_{ij} = \exp(S_{ij})/Z_i\) with \(Z_i = \sum_t \exp(S_{it})\). Then
\[
\frac{\partial A_{ij}}{\partial S_{ik}}
= \frac{\delta_{jk} \exp(S_{ij})\, Z_i - \exp(S_{ij}) \exp(S_{ik})}{Z_i^2}
= A_{ij}\left(\delta_{jk} - A_{ik}\right).
\]
When \(A_{ij^*} \approx 1\), we get \(\partial A_{ij^*}/\partial S_{ij^*} \approx 0\) and \(\partial A_{ij}/\partial S_{ik} \approx 0\) for \(j \ne j^*\). This means brittle, overconfident attention and vanishing gradients to most keys. Large \(S_{ij}\) also strain reduced precision: the logits themselves, or a naively computed \(\exp(S_{ij})\), can overflow or lose accuracy in fp16/bf16, which destabilizes training.
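To make the saturation concrete, here is a small numerical check (a NumPy sketch; the logit values are invented for illustration). With a modest gap the softmax Jacobian has entries of order \(10^{-1}\); with a large gap every entry is numerically zero.

```python
import numpy as np

def softmax(s):
    s = s - s.max()                       # stabilize before exponentiating
    e = np.exp(s)
    return e / e.sum()

moderate = np.array([2.0, 1.0, 0.0])      # small gaps: healthy distribution
extreme = np.array([30.0, 1.0, 0.0])      # huge gap: near one-hot

for s in (moderate, extreme):
    a = softmax(s)
    jac = np.diag(a) - np.outer(a, a)     # dA_j/dS_k = A_j (delta_jk - A_k)
    print(a.round(4), "max |dA/dS| =", float(np.abs(jac).max()))
# The second line prints a near one-hot distribution whose Jacobian entries are
# all ~1e-13: almost no gradient flows back to any of the logits.
```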
QK-clip in one equation
Let \(X \in \mathbb{R}^{n \times d}\), \(Q = XW^Q\), \(K = XW^K\), and logits
\[
S = \frac{QK^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times n},
\qquad
S_{\max} = \max_{i,j} S_{ij},
\]
where \(d_k\) is the query/key dimension.
QK-clip introduces a threshold \(\tau\) and rescales both projections when \(S_{\max} > \tau\):
\[
\gamma = \frac{\tau}{S_{\max}},
\qquad
W^Q \leftarrow \sqrt{\gamma}\, W^Q,
\qquad
W^K \leftarrow \sqrt{\gamma}\, W^K.
\]
The dot product is bilinear, so the logits scale by \(\gamma\): \(S \leftarrow \gamma S\), and the maximum is pulled back to \(\tau\).
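As a concrete illustration, here is a minimal NumPy sketch of that rule (the function name, shapes, and threshold are assumptions for the example, not a reference implementation). In practice the maximum logit would typically be tracked during the forward pass rather than recomputed from the weights.

```python
import numpy as np

def qk_clip(W_q, W_k, X, tau=100.0):
    """Return (possibly rescaled) W_q, W_k so that the largest logit is capped at tau."""
    d_k = W_q.shape[1]
    S = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)   # logits S = QK^T / sqrt(d_k)
    s_max = S.max()
    if s_max > tau:
        gamma = tau / s_max
        W_q = np.sqrt(gamma) * W_q               # sqrt(gamma) on each side ...
        W_k = np.sqrt(gamma) * W_k               # ... so the bilinear logits scale by gamma
    return W_q, W_k

rng = np.random.default_rng(0)
X = 5.0 * rng.normal(size=(16, 32))              # exaggerated activations to force a clip
W_q, W_k = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
W_q, W_k = qk_clip(W_q, W_k, X)
```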
Per-head QK-clip in standard multi-head attention
In vanilla multi-head attention, each head \(h\) has its own projections:
\[
Q_h = X W_h^Q,
\qquad
K_h = X W_h^K,
\qquad
V_h = X W_h^V,
\qquad
S_h = \frac{Q_h K_h^\top}{\sqrt{d_h}},
\]
with per-head dimension \(d_h\).
If a single head blows up, we can clip that head alone. Let \(S_{h,\max}\) be the largest logit in head \(h\), define \(\gamma_h = \tau / S_{h,\max}\) when \(S_{h,\max} > \tau\), and rescale its query/key projections:
\[
W_h^Q \leftarrow \sqrt{\gamma_h}\, W_h^Q,
\qquad
W_h^K \leftarrow \sqrt{\gamma_h}\, W_h^K.
\]
We do not need to rescale \(W_h^V\): values do not appear in the logits, so scaling \(V_h\) does not prevent softmax saturation. It only rescales the output and can be absorbed by the output projection.
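A per-head sketch of the same rule (again with illustrative names, and assuming the per-head maxima were recorded during the forward pass):

```python
import numpy as np

def per_head_qk_clip(W_q, W_k, s_max_per_head, tau=100.0):
    """W_q, W_k: lists of per-head projection matrices.
    s_max_per_head: largest logit observed in each head."""
    for h, s_max in enumerate(s_max_per_head):
        if s_max > tau:                           # only offending heads are touched
            gamma_h = tau / s_max
            W_q[h] = np.sqrt(gamma_h) * W_q[h]
            W_k[h] = np.sqrt(gamma_h) * W_k[h]
    return W_q, W_k                               # the value projections are left alone
```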
MLA and decoupled RoPE
Multi-head latent attention (MLA) compresses queries and KV into low-rank latents:
\[
C^Q = X W^{DQ},
\qquad
C^{KV} = X W^{DKV},
\]
where the latent dimension is much smaller than \(d\).
Each head then uses up-projections to recover its own content queries/keys/values:
\[
Q_h^C = C^Q W_h^{UQ},
\qquad
K_h^C = C^{KV} W_h^{UK},
\qquad
V_h = C^{KV} W_h^{UV}.
\]
RoPE does not fit cleanly inside the shared low-rank KV path, because the rotation must act in each head's key space. The standard fix is decoupled RoPE: add rotary projections computed directly from \(X\),
\[
Q_h^R = \operatorname{RoPE}\!\left(X W_h^{Q_R}\right),
\qquad
K^R = \operatorname{RoPE}\!\left(X W^{K_R}\right),
\]
where all heads share \(W^{K_R}\) but each head has its own \(W_h^{Q_R}\). The head vectors are then concatenated:
\[
Q_h = \bigl[\, Q_h^C \;\; Q_h^R \,\bigr],
\qquad
K_h = \bigl[\, K_h^C \;\; K^R \,\bigr],
\qquad
S_h = \frac{Q_h K_h^\top}{\sqrt{d_h + d_R}},
\]
with rotary dimension \(d_R\).
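To keep the shapes straight, here is a shape-level sketch of MLA with decoupled RoPE. Every dimension and weight name is illustrative, and rope() is a placeholder rather than a real rotary embedding; the only point is where the shared \(W^{K_R}\) and the per-head \(W_h^{Q_R}\) enter the logits.

```python
import numpy as np

n, d, d_c, d_h, d_r, H = 8, 64, 16, 32, 16, 4     # tokens, model, latent, head, rope dims, heads
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

W_dq, W_dkv = rng.normal(size=(d, d_c)), rng.normal(size=(d, d_c))           # down-projections to latents
W_uq, W_uk = rng.normal(size=(H, d_c, d_h)), rng.normal(size=(H, d_c, d_h))  # per-head content up-projections
W_qr = rng.normal(size=(H, d, d_r))               # per-head rotary query projections
W_kr = rng.normal(size=(d, d_r))                  # shared rotary key projection

def rope(Z):
    return Z                                      # placeholder: real RoPE rotates feature pairs by position

c_q, c_kv = X @ W_dq, X @ W_dkv                   # low-rank latents
for h in range(H):
    q_c, k_c = c_q @ W_uq[h], c_kv @ W_uk[h]      # content query/key for head h
    q_r, k_r = rope(X @ W_qr[h]), rope(X @ W_kr)  # decoupled rotary parts; k_r is shared
    q_h = np.concatenate([q_c, q_r], axis=-1)     # concatenate content + rotary
    k_h = np.concatenate([k_c, k_r], axis=-1)
    S_h = q_h @ k_h.T / np.sqrt(d_h + d_r)        # per-head logits
```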
This is where the new clipping constraint appears: the rotary key projection is shared, so it cannot be rescaled separately for each head.
Muon clip: rescale the rotary queries, not the shared rotary key
We still clip per head, but we must respect what is shared. For head \(h\), define \(\gamma_h = \tau / S_{h,\max}\) when \(S_{h,\max} > \tau\). Then rescale the head-specific up-projections and the head-specific rotary query:
\[
W_h^{UQ} \leftarrow \sqrt{\gamma_h}\, W_h^{UQ},
\qquad
W_h^{UK} \leftarrow \sqrt{\gamma_h}\, W_h^{UK},
\qquad
W_h^{Q_R} \leftarrow \gamma_h\, W_h^{Q_R}.
\]
Because \(K^R\) is fixed, scaling \(Q_h^R\) by \(\gamma_h\) scales the rotary dot product by \(\gamma_h\) directly, matching the content term's \(\gamma_h\) factor from the paired \(\sqrt{\gamma_h}\) scalings.
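Spelling that out: the per-head logit splits into a content term and a rotary term, and after the rescale both pick up the same factor \(\gamma_h\) (the shared normalization is omitted):
\[
S_h = Q_h^C {K_h^C}^{\top} + Q_h^R {K^R}^{\top}
\;\longmapsto\;
\bigl(\sqrt{\gamma_h}\, Q_h^C\bigr)\bigl(\sqrt{\gamma_h}\, K_h^C\bigr)^{\top}
+ \bigl(\gamma_h\, Q_h^R\bigr){K^R}^{\top}
= \gamma_h S_h,
\]
so the offending head's maximum logit is pulled back to \(\tau\) without disturbing any other head.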
We do not rescale the value up-projection \(W_h^{UV}\), because values never enter the logits, and we do not rescale the shared \(W^{K_R}\): it is used by every head, so any per-head factor applied to it would leak into heads that need no clipping and would compound across heads. This is the muon clip rule: per-head scaling for the content up-projections and rotary queries, with the shared rotary key left alone.
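A minimal sketch of the whole rule, assuming the per-head maxima are tracked elsewhere and the weights are stored as per-head arrays (names follow the notation above but are otherwise illustrative):

```python
import numpy as np

def muon_clip(W_uq, W_uk, W_qr, s_max_per_head, tau=100.0):
    """W_uq, W_uk: per-head content up-projections; W_qr: per-head rotary query
    projections. The shared rotary key projection is deliberately not an argument,
    because it must never be rescaled."""
    for h, s_max in enumerate(s_max_per_head):
        if s_max > tau:
            gamma_h = tau / s_max
            W_uq[h] = np.sqrt(gamma_h) * W_uq[h]   # content query: sqrt(gamma_h)
            W_uk[h] = np.sqrt(gamma_h) * W_uk[h]   # content key:   sqrt(gamma_h)
            W_qr[h] = gamma_h * W_qr[h]            # rotary query: full gamma_h (K^R stays fixed)
    return W_uq, W_uk, W_qr
```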