← back to archive

Hadamard and Monarch: Compressing GPT-2 Small

Ethan TS. Liu | June 24, 2026 | code on GitHub

This post walks through how we can use two tools to make GPT-2 small smaller. The first is the Hadamard transform, which clears out the outliers that wreck low-bit quantization. The second is the Monarch matrix, a structured replacement for a dense layer that uses far fewer parameters.

building the Hadamard matrix+1−1H₁H₂H₄H₈H₁₆×2×2×2×2Hₙ = [[Hₙ/₂, Hₙ/₂], [Hₙ/₂, −Hₙ/₂]]

The Hadamard transform is a linear map whose matrix has only entries of ±1. When we divide this by √n its rows become orthonormal, so it can be thought of as a pure rotation. In particular, it changes the basis a vector is written in, rather than its length or the function a downstream matrix computes.

Two properties make this transform useful: 1) It is an isometry, meaning it can take a single large spike and spread it evenly across all coordinates (while remaining exactly invertible) and 2) it is extremely cheap. Similar to the FFT, a butterfly algorithm can apply the Hadamard transform in O(n log n) rather than O(n²), without every forming the matrix. This means we can place it in front of every matmul for almost nothing.

Quantization stores each number in a few bits instead of 16, or, in our case, 32. But typically it is 16. A naive scheme shares one scale across a group of values: take the largest magnitude in the group, divide by it to set a single step size, and round every value to the nearest multiple of that step. With four bits there are only sixteen levels, so that one largest value fixes the whole grid. Take a row of eight values with a single outlier, quantized to signed 4-bit. Signed 4-bit runs from −8 to 7, so the largest level is 7. Therefore the step is the largest magnitude divided by 7:

\[ x = [\,0.1,\ -0.3,\ 0.2,\ 0.0,\ -0.1,\ 0.25,\ -0.15,\ \mathbf{8.0}\,], \qquad \text{step} = \tfrac{8.0}{7} \approx 1.14 \]

Rounding each value to the nearest multiple of 1.14, the 8.0 lands on 7 steps and comes back as 8.0 exactly, while every other value rounds to 0:

\[ \hat{x} = [\,0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ \mathbf{8.0}\,] \]

The outlier survives and the rest of the row is wiped out. This is not good! In theory, if we drop the 8.0, the largest entry becomes 0.3, the step falls to \(0.3/7 \approx 0.043\), and we'd be fine. However, computing which values to drop is annoying in actual activation quantization, because we would need to find the outliers on the fly for every input (outlier channels in activations shift from token to token). And even once they are found, dropping them is not free. They are often the channels the model relies on most, so we cannot simply zero them. We would have to keep them in higher precision, run them through a separate matmul, and scatter the results back into place. But this data-dependent, irregular bookkeeping is precisely what dense hardware is slow at, which defeats the purpose of dropping to 4-bit in the first place.

Let's use the Hadamard to solve this! In particular, the Fast Walsh–Hadamard Transform (FWHT), which runs in only log₂n passes, is the perfect candidate:

inputpass 1 · stride 1pass 2 · stride 2pass 3 · stride 4output01234567each butterfly replaces a pair (a, b) with (a+b, a−b); O(n log n)

To be precise, the matrix we use is the Walsh–Hadamard matrix (consistent with the first diagram you saw). Hadamard matrices also exist at other sizes (any multiple of four, using Paley's construction), but those have no recursive halving and therefore no fast transform. We use the Walsh–Hadamard form precisely because it admits the FWHT: that is what makes placing a rotation in front of every matmul essentially free.

However, there is another issue. A plain Hadamard is not quite enough on its own. A vector that happens to line up with one of the matrix's own rows (that is, pointing in the same direction) comes through just as "spiky" as it went in. To fix this, let's randomize the transform. Before applying the hadamard, we can flip the sign of each coordinate using a fixed diagonal of random ±1s, so the map becomes HD with D = diag(±1). This is still an orthogonal rotation and still costs O(n log n).

This fix isn't a guarantee. Post-randomization, a vector can, in principle, still align. However, for any fixed input, it's extremely unlikely, and even less likely the larger the block we rotate over (256 or 1024 here). This same randomized Hadamard sits at the core of two established quantization methods: QuIP#, which adds lattice codebooks (rounding weights in batches to a shared, regular set of values instead of one at a time) to reach very low bit-widths, and QuaRot, which fuses the rotations into the network so that weights and activations alike become outlier-free, running the model end-to-end in 4-bit. These are mature, GPU-scale systems with extra machinery such as lattice codebooks, fused end-to-end pipelines, and learned rotations. I highly recommend checking them out if you would like to dive deeper into quantization! This blog contains only a single fixed Hadamard and plain round-to-nearest quantization, from scratch on an M1 Macbook Pro laptop (2020).

So how spiky is a vector, exactly? The metric, from QuIP#'s predecessor, QuIP, is called incoherence. It's computed by taking the vector's largest entry and dividing it by its RMS entry. Roughly, it computes how many times larger is the biggest value compared to a typical one. A lone outlier makes incoherence high; spreading the "energy" across all the coordinates drives it down.

We have one more issue. The FWHT needs a power-of-two length, but most dimensions are not in powers of two; in our case, GPT-2 small's are 768 and 3072. Rather than pad up to the next power of two, which would corrupt a coarse per-tensor scale, let's split each dimension into the largest power-of-two block that divides it (256 for 768, 1024 for 3072) and apply the transform block by block.

Applying the Hadamard transform to GPT-2 small

Let's apply the rotation. Taking a single GPT-2 small activation row from layer 3, we can see the rotation drop the incoherence from 48 to 4.

05100510activation magnitudeGPT-2 small FFN activation (layer 3)after the Hadamard transformincoherence (max/RMS) = 48incoherence (max/RMS) = 4.0

Let's look at GPT-2 small's matrices more closely. Rotating drops a layer's incoherence from about 70 to 6, and lowers 4-bit weight reconstruction error on all 48 of its weight matrices (24 attention, 24 MLP). To naively quantize, we divide every value by a scale, which is just the largest magnitude in whatever group of weights shares it, and that group can be the whole tensor (one coarse scale) or a small block of 128 weights (many fine ones). Naturally, outliers hurt the most under a coarse per-tensor scale, where one big entry sets the scale for everything, whereas a fine group-wise scale already softens this by giving each block its own scale. That is why the rotation helps a lot under a per-tensor scale (0.94 → 0.22) but almost nothing under group-128 (0.14 → 0.12).

1.01.21.41.64-bit error reduction (×)1030100layer incoherence (max / RMS, log scale)each dot is one of GPT-2 small’s 48 attention + MLP weight matricesno benefit 1.000.5004-bit weight errorbaseline+Hadamard0.940.22per-tensor0.240.15per-channel0.140.12group-128

Quantizing the whole model

For the data, we use Project Gutenberg. Quantization needs no training (we round the weights and activations after the fact and just measure how much quality we lose), so here we only need text to test on. Let's use Alice in Wonderland (circa 45k tokens).

Quantizing only the weights in the rotated basis barely changes end-to-end quality (the W4 row below, a factor of 1.4), even though it cuts weight error much more. The gain comes from quantizing the activations in the rotated, low-incoherence basis instead. Methods such as QuaRot and QuIP# do not rotate the weight and then rotate back. Instead, they rotate the shared input dimension that the matmul sums over, the one the activation outliers live on. Since H is orthonormal, rotating the activations along this axis and the weights back along it cancels inside the matmul, so the output is unchanged, but both the activations and the weights are now low-incoherence and quantize well.

1001000perplexity (log)baseline+Hadamard2929W8A84938W4184251W8A4215294W4A4fp 29

Perplexity of GPT-2 small on text, with full precision at 29. Here, WxAy means x-bit quantization for the weights and y-bit for the activations:

regime baseline + Hadamard improvement
W8A829.429.2
W4 (weight-only)48.637.61.3×
W8A418425136×
W4A421529423×

Four-bit activations make the model unusable, with perplexity near 1800. The rotation brings it back to 51, at a cost of O(n log n) and no extra parameters. The rest of the table shows where the trick runs out. At 8-bit there are no outliers to fight, so the rotation is neutral. Weight-only quantization is only mildly helped, because the activations, which are the real problem, are left untouched.

Monarch Matrices

Let's take a look at a different, related idea, the Monarch matrix, which replaces a dense n×n weight of cost n² with two block-diagonal matrices and a fixed permutation, at cost about 2n√n. We decompose M into P L Pᵀ R: here L and R are the two block-diagonal matrices, and P is the permutation that transposes the √n × √n grid. We apply R, then transpose, then apply L, then transpose back (the second transpose just returns the output to its original ordering). This saving comes at some loss of expressivity.

a dense map factors into two block-diagonals (L, R) and two transposes (P = Pᵀ) ··· dense Ptranspose Lblock-diag Pᵀtranspose Rblock-diag n² entries ≈ 2n√n entries (the two transposes are free)

First, why do we even want this to reach globally? The whole point of a dense layer is so that every output can depend on every input, so any cheaper replacement should keep that reach. Let's see how Monarch does it. First, it reshapes the length-n input into a √n × √n grid. A block-diagonal matrix is then just a small dense transform applied to each row on its own. This is cheap, but it can only mix coordinates that share a row, which is far too local. Let's add a permutation P. Sitting between the two block-diagonal matrices, it transposes the grid, so the second block-diagonal now mixes what used to be the columns. Simply two local mixes with a transpose in between allows every input coordinate to reach every output.

If this feels familiar to the reader, it is the same factorization the Fast Fourier Transform uses: mix down the columns, shuffle, then mix across the rows. Monarch just swaps the FFT's fixed butterflies for learnable dense blocks. Each block-diagonal holds about n√n entries instead of n², so a 1024×1024 weight needs roughly 2 × 1024 × 32 ≈ 65k parameters instead of a million! Splitting into √n blocks is just the natural choice. In general we can use any number of blocks b per factor, and the two factors then hold about 2n²⁄b entries, so fewer and larger blocks keep more parameters and fit the dense weight more closely. Naturally, there's a catch. Not every dense matrix can be written this way, so fitting a Monarch to a trained weight is only an approximation. We'll measure exactly how good that approximation is, and what it costs the model.

Let's trace it on a length-4 vector, with arbitrary learned blocks rather than the fixed ±1 of a Hadamard. R mixes each row of the grid by its own 2×2, we transpose, L mixes each row again, and we transpose back. Here R uses [[1, 2], [3, 1]] and [[2, 1], [1, 2]] on the two rows, and L uses [[1, 1], [2, 1]] and [[1, 2], [1, 1]]:

a Monarch matrix mapping x = [1, 2, 3, 4] → [15, 27, 20, 16] R: mix rows transpose L: mix rows transpose 1234 551011 510511 15202716 15272016 reshape x after R after Pᵀ after L after P

Even though we only ever multiplied 2×2 blocks, every output is a different weighted combination of all four inputs, and the weights are not all ±1, so this is a different map from the Hadamard transform. Had we instead used the fixed ±1 block [[1, 1], [1, −1]] everywhere, this would collapse to the Walsh–Hadamard transform. The Hadamard is the fixed special case, and Monarch is the general, learnable version. At n = 4 it is break-even, since L and R are each two 2×2 blocks, four in total, holding 16 numbers. At n = 9, L and R are each three 3×3 blocks, six in total, so they hold 54 numbers compared to 81 in the dense version. The gap only widens from there.

But what does it mean for the blocks to be learned? They are just parameters, exactly like the entries of an ordinary weight matrix. We initialize them, and then gradient descent tunes them during training to whatever fits the task. The only piece we never learn is the permutation P, which stays fixed as the transpose. Running the layer is simple too. We reshape the input into a √n × √n grid, multiply each row by its block (R), transpose, multiply each row by its block again (L), and transpose back. Each of those row-mixes is just a batched matmul, which a GPU is very happy to do, so the whole forward pass is two small matmuls and a couple of reshapes. That is exactly where the O(n√n) cost comes from.

There are then two ways to actually get the blocks. We can train them from scratch, dropping Monarch layers into the model in place of dense ones. Or, as we do later in this post, we can fit them to a weight that is already trained, so that the Monarch layer reproduces what the dense one was doing.

Monarch on a real model

One thought you might have is that rotating the input before fitting a Monarch would help, but fitting one of GPT-2 small's MLP weights with and without a Hadamard pre-rotation gives the same error. The reason is that a fixed rotation cannot make a target any easier for a structured family to fit: fitting Monarch(Hx) to W is the same as fitting Monarch to WHᵀ, which is no more "Monarch-shaped" than W itself.

Monarch, unlike the quantization above, needs training. Let's keep Alice in Wonderland, but we now also need training text, so we fine-tune on six other Project Gutenberg books (circa 1M tokens).

100100010000perplexity (log)Monarch MLP (58%)0200400600fine-tuning stepsfull GPT-2 small 29≈20,00097

However, Monarch is still useful as a compression method. The MLP blocks are about two-thirds of GPT-2 small's non-embedding weights, and replacing them with Monarch matrices brings the model to 58% of its parameters. However, this causes perplexity to balloon to 20,000, since Monarch is a lossy fit to trained weights. But if we do a short fine-tune on some old Apple Silicon using about 700 steps and ten minutes, we can recover perplexity to about 97 (vs 29 at full precision).

A Small New Result

Everything we've gone through thus far reproduces known work. The first half of this post was about rotating a dense weight so its outliers stop wrecking low-bit quantization. It turns out a Monarch layer never has those outliers to begin with, so the rotation barely helps. The current literature still chooses to rotate: CALDERA quantizes generic low-rank factors with a Hadamard, and ButterflyQuant uses a butterfly as the rotation on dense weights. However, none quantize a Monarch layer.

All of this was based on incoherence. Dense layers have it, from about 10 to 160 across layers and worst in the MLP output projections, which is why they need the rotation. However, if the quantized factors of a Monarch layer are also incoherent, Monarch would need the rotation as well. Let's examine this more closely.

04080120160incoherence (max / RMS)dense weights are spiky, but the Monarch factors fit to them are notdense weightMonarch factors

Fitting a Monarch layer to a real GPT-2 small weight gives factors with incoherence around 7, no matter how outlier-heavy the dense matrix was: across the MLP matrices the dense incoherence ranges from 11 to 162 while the factors stay near 7. Intuitively, the block structure absorbs the outliers; a product of two well-behaved block matrices can build a spiky dense map without either factor being spiky. So the rotation that helps dense 4-bit quantization, a median factor of 1.26, does almost nothing for the factors, a median of 0.97. There are no outliers left to spread!

0.91.11.31.54-bit error reduction (×)Hadamard helps the dense weights, not the Monarch factorsno benefitmedian 1.26×median 0.97×dense weightMonarch factors

So, if we can just compress the parameters using Monarch correctly, then we don't need to use Hadamard at all! Our objective is to replace a dense weight W with a Monarch layer whose two factors multiply out to a matrix M, and we would like M to act like W. The obvious target is to minimize ‖M − W‖, the size of their entrywise difference. However, the layer only ever sees a narrow slice of inputs, so what we actually care about is the error on the outputs it produces, ‖(M − W)x‖ averaged over real activations x. So we can fit M against those real activations rather than random inputs, which drives that error down directly. This is the standard move in low-rank compression (DRONE, ASVD, SVD-LLM). The closest is ProcrustesGPT, which compresses with a broader family of block-diagonal matrices sandwiched between permutations (called GS-matrices, in which Monarch can be considered a special case). However, none of these methods quantize the factors themselves.

2.401.200output errornaive fitdata-awaredata-aware 4-bit1.420.510.53block 02.240.790.81block 50.790.300.31block 11

The three bars are the same Monarch fit measured two ways, on three of GPT-2 small's blocks. The naive fit uses the ‖M − W‖ target from above, which is the same as fitting on Gaussian probes (if x ∼ 𝒩(0, I), then 𝔼‖Ax‖² = ‖A‖2F) The data-aware fit instead matches the layer on its real activations. Moving from those random probes to real activations cuts output error by a factor of about 2.8 at the same parameter budget. The data-aware Monarch's factors stay low in incoherence, around 6, so quantizing them to 4-bit (the third bar) adds only about 2% error.

This holds for the whole model, not just a single layer. We replace all MLPs with data-aware Monarch and set b = 8 blocks per factor, which puts the model at 60% of its parameters. We then fine-tune briefly and naively quantize the factors to 4-bit.

52260perplexity29dense124M46Monarch60%48Monarch 4-bit60%

Held-out perplexity on Alice in Wonderland goes from 29 for the dense model to 46 for the data-aware Monarch at 60% of parameters, and to 48 once the factors are 4-bit, with no rotation. The 46 beats the 97 from the earlier section, but two things changed, not one: this run is data-aware and at a looser budget (the factors here use b = 8 blocks each instead of the earlier b = 16, and since each factor holds about 2n²⁄b entries, fewer and larger blocks keep about 1.7 times the factor parameters), so part of the gain is just the extra budget. The equal-budget comparison earlier isolates the fit, where the effect is the factor of about 2.8 above. The 4-bit step adds almost nothing, so the structured and low-bit compressions compose.

The two Monarch fits also start from very different places. Fit naively to the weights, the model began near 20,000 perplexity (the earlier section); fit data-aware on its real activations, it begins around 680. Matching the real activations gets it most of the way there before any fine-tuning, which is why its curve below starts so much lower than the naive one did.

To see how much of that recovery came from in-domain adaptation rather than Monarch fitting, I fine-tuned the full dense model in the same way.

101001000perplexity (log)denseMonarch0200400600fine-tuning steps16.746.4≈680

The dense model drops from 29 to 17 on text, so there is definitely some in-domain adaptation. In comparison, however, the Monarch model drops by a much larger amount, so its recovery is the cost of the structure rather than adaptation alone. Note that I used SGD instead of AdamW, since fine-tuning the full 124M model with AdamW would exceed this laptop's memory (even though the 75M Monarch model fits); actually, SGD gives a higher and therefore conservative floor, so this understates the structural cost if anything.

The data-aware fitting idea is not mine, but borrowed from DRONE, ASVD, and ProcrustesGPT. The new idea is the measurement that Monarch factors intrinsically have low incoherence and so need no rotation, and that data-aware fitting of an exact Monarch composes with factor quantization. Low factor incoherence is partly expected anyway. Outliers in large models are an emergent property of training at scale, and a small factor fit in a few hundred steps has no reason to grow them. What is actually interesting is just that a structured map can reproduce an outlier-heavy dense layer with outlier-free factors.

Thanks for reading, and I hope to see you again next time!

References