
A Gentle Introduction to Autoencoders

Ethan TS. Liu | May 2026

I finally got around to finishing this piece after finals. Enjoy!

In its simplest form, an autoencoder is a neural network that learns to map an input back to itself.

\[ z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad L = \|x - \hat{x}\|^2 \]
Diagram: an encoder-decoder network (input \(x\) → encoder → latent code \(z\) → decoder → reconstruction \(\hat{x}\)). Many input activations compress into a narrow latent layer \(z\), then expand again through the decoder.

The encoder \(f_\theta\) compresses the input into a latent code \(z\), which resides in the latent space. The decoder \(g_\phi\) expands that code back into a reconstruction \(\hat{x}\). The idea is simple: if the latent space is constrained, the model is forced to keep the important information and discard what does not matter.

What structure is worth preserving?

Demo panels: original input, latent code \(z\), and autoencoder reconstruction, with live readouts for MSE, compression ratio, and interpretation.
This is a tiny pretrained neural autoencoder. The encoder maps the image to latent activations \(z=f_\theta(x)\); the slider keeps the first \(k\) prefix-trained latent units and masks the rest before the decoder forms \(\hat{x}=g_\phi(z)\).

To be clear, the latent space need not be smaller than the input/output space. However, if it is larger and we impose no additional constraints, we run the risk of the model simply approximating the identity function!

Let's add constraints. The simplest one is an undercomplete bottleneck:

\[ \mathbb{R}^n \rightarrow \mathbb{R}^k \rightarrow \mathbb{R}^n, \qquad k \ll n \]

This is what you saw in the diagram earlier. Now the model cannot preserve every microscopic detail; it must retain the information that best explains the data. Other useful constraints include sparsity penalties, noisy inputs, regularization, and stochastic latent variables. Before we dive into those modifications, though, let's first examine the geometry of the vanilla autoencoder a bit further.
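
Before that, here is a minimal PyTorch sketch of the undercomplete setup just described. The sizes (784 inputs, a 32-dimensional code, a 256-unit hidden layer) are illustrative assumptions, not values used by any demo on this page.

```python
# A minimal sketch of an undercomplete autoencoder: R^n -> R^k -> R^n with k << n.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n=784, k=32):
        super().__init__()
        # Encoder f_theta: R^n -> R^k (the bottleneck forces compression)
        self.encoder = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, k))
        # Decoder g_phi: R^k -> R^n
        self.decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, n))

    def forward(self, x):
        z = self.encoder(x)          # latent code z = f_theta(x)
        return self.decoder(z)       # reconstruction x_hat = g_phi(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)              # stand-in batch; real data would go here
opt.zero_grad()
x_hat = model(x)
loss = ((x - x_hat) ** 2).mean()     # L = ||x - x_hat||^2
loss.backward()
opt.step()
```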

Latent Geometry

The manifold hypothesis states that high-dimensional real-world data often resides on low-dimensional manifolds embedded within that high-dimensional space. Imagine a tasty Swiss roll: the intrinsic geometry is two-dimensional, but the data lives in a three-dimensional space.

Latent spaces are interesting because similar examples often map nearby, interpolation can look smooth, and some directions may line up with meaningful factors of variation. None of this is automatic; it happens only when the data structure, model, and training pressure make those factors useful for reconstruction.

Hover a point to preview interpolation; click it to pin the endpoint.

Demo panels: conceptual latent coordinates, the hovered decode, and the interpolated decode.
The interpolation follows \(\;z(t) = (1-t)z_1 + tz_2\;\). These are procedural glyphs, not MNIST reconstructions; the point is the geometry of interpolation.

You might have noticed that compression in autoencoders is closely related to principal component analysis (PCA). A linear autoencoder with one hidden layer and mean squared reconstruction loss learns the same principal subspace as PCA, up to a rotation of the basis.

Method | Strengths | Limitations
PCA | Linear, deterministic, fast, interpretable, closed-form. | Restricted to linear subspaces.
Deep autoencoder | Nonlinear, flexible, can model curved manifolds. | Optimization-based, less directly interpretable.

On the Swiss roll, PCA compresses by projecting along a fixed linear subspace. The autoencoder can learn a curved coordinate chart, which is why it can represent the roll more naturally.
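If you want to reproduce this comparison, here is a rough sketch using scikit-learn's Swiss roll generator: PCA gives the linear projection, while a small 3 → 2 → 3 autoencoder learns nonlinear latent coordinates. The layer widths, learning rate, and training length are arbitrary assumptions, and this is not the pretrained model embedded in the demo below.

```python
# Sketch: PCA projection vs. a small nonlinear autoencoder on a Swiss roll.
import torch
import torch.nn as nn
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

X_np, _ = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
X = torch.tensor(X_np, dtype=torch.float32)
X = (X - X.mean(0)) / X.std(0)          # normalize for stable training

# Linear baseline: project onto the top-2 principal directions.
Z_pca = PCA(n_components=2).fit_transform(X.numpy())

# Nonlinear alternative: a 3 -> 2 -> 3 autoencoder can learn a curved chart.
encoder = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))
decoder = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 3))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()   # reconstruction loss
    loss.backward()
    opt.step()

Z_ae = encoder(X).detach()              # 2D latent coordinates of the roll
```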

3D Swiss Roll Manifold Demo

Rotate the manifold, then compare a PCA projection with nonlinear manifold coordinates.

Demo panels: the original 3D Swiss roll (drag to rotate), its PCA projection to 2D, and the trained autoencoder's latent coordinates, with readouts for the hovered point and its intrinsic coordinate. Takeaway: nonlinear unrolling.

The PCA panel is computed from the synthetic 3D points. The autoencoder panel uses the encoder output of a small 3D → 2D → 3D neural autoencoder trained offline on this Swiss roll; its weights are embedded in this page.

Sparse Autoencoders

Instead of a tiny bottleneck, sparse autoencoders may use an overcomplete latent space with many more features than input dimensions, but only a small number of features are allowed to activate on each example.

\[ L = \|x-\hat{x}\|_2^2 + \lambda \cdot \text{penalty} \]

Common choices for the sparsity term include an \(L_1\) penalty on activations, a KL penalty that targets a low average firing rate, or explicit top-\(k\) sparsity. The representation can be overcomplete because sparsity prevents the trivial dense copy solution.

\[ L_1\text{ sparsity:}\quad L = \|x-\hat{x}\|_2^2 + \lambda \sum_j |z_j| \]
\[ \text{KL sparsity:}\quad L = \|x-\hat{x}\|_2^2 + \lambda \sum_j \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j\big) \]

Here \(\rho\) is the target average activation rate and \(\hat{\rho}_j\) is the observed average activation of latent unit \(j\). The KL term becomes large when a unit fires much more or much less often than the target.

\[ \mathrm{Top}\text{-}k\text{ sparsity:}\quad z'_j = \begin{cases} z_j, & j \in \mathrm{TopK}(z) \\ 0, & \text{otherwise} \end{cases} \]

This makes sparse autoencoders feel like learned dictionaries. Each feature is a reusable basis element; each example is a sparse combination of those elements.

Note that overcomplete does not mean unconstrained. A sparse code can live in a higher-dimensional latent space than the input as long as only a few coordinates are active per example.
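
Here is a short sketch of how the \(L_1\) and top-\(k\) penalties above might look in PyTorch, assuming an overcomplete ReLU encoder; the dimensions, \(\lambda\), and \(k\) are made-up values for illustration.

```python
# Sketch of sparsity penalties for an overcomplete autoencoder (m >> n latent units).
import torch
import torch.nn as nn

n, m, k, lam = 64, 512, 16, 1e-3
encoder = nn.Sequential(nn.Linear(n, m), nn.ReLU())   # ReLU keeps activations nonnegative
decoder = nn.Linear(m, n)

x = torch.rand(32, n)
z = encoder(x)

# L1 sparsity: reconstruction + lambda * sum_j |z_j|
loss_l1 = ((x - decoder(z)) ** 2).mean() + lam * z.abs().sum(dim=1).mean()

# Top-k sparsity: keep the k largest activations per example, zero the rest
topk = torch.topk(z, k, dim=1)
z_sparse = torch.zeros_like(z).scatter_(1, topk.indices, topk.values)
loss_topk = ((x - decoder(z_sparse)) ** 2).mean()
```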

Trained Sparse Autoencoder Viewer

Input pattern
Sparse autoencoder reconstruction

This demo uses a tiny sparse autoencoder trained offline with an \(L_1\) activation penalty. The slider applies additional top-\(k\) masking to the learned latent activations before decoding.

Learned latent units

Sparse autoencoders are especially important in mechanistic interpretability, where they may be trained on the internal activations of a larger model. The goal is to decompose dense, superposed activations into a sparse feature dictionary that may be easier to inspect.

Denoising Autoencoders

A denoising autoencoder receives a corrupted input \(\tilde{x}\) and learns to reconstruct the original clean \(x\):

\[ \tilde{x} = x + \varepsilon,\qquad \hat{x} = g(f(\tilde{x})),\qquad \mathcal{L} = \|x - \hat{x}\|^2 \]

The model is encouraged to learn the structure of clean data so that corrupted points are pushed back toward the data manifold. This is the conceptual bridge to score matching and diffusion: denoisers implicitly learn directions back toward high-density regions.

\[ r(\tilde{x}) = g(f(\tilde{x})), \qquad r(\tilde{x}) - \tilde{x} \approx \sigma^2 \nabla_{\tilde{x}} \log p_{\text{data}}(\tilde{x}) \]

For small Gaussian corruption, the denoising direction can be interpreted as an estimate of the score, the gradient of log density. This is only an intuition here, but it explains why denoising and diffusion are mathematically connected.
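
A single denoising training step is easy to sketch: corrupt the input with Gaussian noise, then reconstruct the clean target. The model and the noise level \(\sigma\) below are placeholders, not a specific architecture from this article.

```python
# Sketch of one denoising-autoencoder training step.
import torch

def denoising_step(model, x, optimizer, sigma=0.3):
    noise = sigma * torch.randn_like(x)     # epsilon ~ N(0, sigma^2 I)
    x_tilde = x + noise                     # corrupted input
    x_hat = model(x_tilde)                  # reconstruction from the corrupted input
    loss = ((x - x_hat) ** 2).mean()        # the target is the clean x, not x_tilde
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```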

Conceptual Noise Slider

Increase corruption and see how well the denoiser can still recover the clean structure.

Original
Noisy input
Smoothed denoiser output

Contractive Autoencoders

Contractive autoencoders penalize the sensitivity of the encoder to small input changes. The encoder Jacobian appears directly in the loss:

\[ L = \|x-\hat{x}\|_2^2 + \lambda \left\|\frac{\partial f(x)}{\partial x}\right\|_F^2 \]

This encourages local invariance: nearby inputs map to nearby codes. Geometrically, it smooths the latent map and discourages wildly unstable coordinates.
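
As a rough illustration, the contractive penalty can be computed with autograd by taking the Jacobian of the encoder at a single input. The tiny encoder and \(\lambda\) below are arbitrary assumptions, and doing this per example is expensive; it is shown only to make the loss concrete.

```python
# Sketch of the contractive loss for one example: reconstruction + lambda * ||df(x)/dx||_F^2.
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

encoder = nn.Sequential(nn.Linear(8, 3), nn.Tanh())
decoder = nn.Linear(3, 8)

def contractive_loss(x, lam=1e-2):
    recon = ((x - decoder(encoder(x))) ** 2).sum()
    J = jacobian(encoder, x, create_graph=True)   # shape (3, 8): d z_i / d x_j
    return recon + lam * (J ** 2).sum()           # Frobenius norm of the encoder Jacobian

loss = contractive_loss(torch.rand(8))
loss.backward()
```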

A plain autoencoder is not automatically generative. It learns reconstructions, not necessarily a latent space where random samples decode well.

Variational Autoencoders

Whereas a regular autoencoder encodes an input to a fixed latent vector, a variational autoencoder encodes it into a continuous probability distribution over the latent space. The idea is to first sample a latent variable from a prior, then decode it into an observation.

\[ z \sim p(z), \qquad x \sim p_\theta(x\mid z), \qquad p(z)=\mathcal{N}(0,I) \]

The decoder is now interpreted as a conditional distribution \(p_\theta(x\mid z)\), not just a deterministic function. To assign probability to an observed input \(x\), we integrate over every latent explanation that could have produced it.

\[ p_\theta(x) = \int p_\theta(x\mid z)p(z)\,dz \]

The reader's first instinct may be to train by maximum likelihood, maximizing \(\log p_\theta(x)\). However, this integral is usually intractable for a neural decoder with no closed form. The exact posterior over latents is also intractable because it depends on that same evidence term.

\[ p_\theta(z\mid x) = \frac{p_\theta(x\mid z)p(z)}{p_\theta(x)} \]

The VAE introduces an encoder distribution \(q_\phi(z\mid x)\) to approximate this true posterior. In principle, this approximate posterior could come from many distribution families. A Gaussian choice would be written generally as

\[ q_\phi(z\mid x) = \mathcal{N}\!\big(\mu_\phi(x), \Sigma_\phi(x)\big) \]

where \(\Sigma_\phi(x)\) is the covariance matrix. In the most common VAE implementation, however, this covariance is restricted to be diagonal. The encoder predicts one mean and one variance per latent coordinate:

\[ q_\phi(z\mid x) = \mathcal{N}\!\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big) \]

The diagonal assumption means \(q_\phi(z\mid x)\) treats latent coordinates as conditionally independent given \(x\). This is not the only mathematically valid choice, but it is the standard one in many VAEs because it is cheap, easy to parameterize, easy to sample from, and gives a simple closed-form KL term.

Let's go through a quick derivation. First insert the approximate posterior into the marginal likelihood and rewrite the integral as an expectation:

\[ \begin{aligned} \log p_\theta(x) &= \log \int q_\phi(z\mid x) \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\,dz \\[0.45em] &= \log \mathbb{E}_{q_\phi(z\mid x)} \left[ \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \end{aligned} \]

By Jensen's inequality, \(\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]\) (since \(\log\) is concave).

\[ \begin{aligned} \log p_\theta(x) &\ge \mathbb{E}_{q_\phi(z\mid x)} \left[ \log \frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \\[0.45em] &= \mathbb{E}_{q_\phi(z\mid x)} \left[ \log \frac{p_\theta(x\mid z)p(z)}{q_\phi(z\mid x)} \right] \\[0.45em] &= \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] + \mathbb{E}_{q_\phi(z\mid x)}[\log p(z)-\log q_\phi(z\mid x)] \\[0.45em] &= \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p(z)\big) \\[0.45em] &= \mathrm{ELBO}(x) \end{aligned} \]

Actually, the gap between the true log-likelihood and the ELBO is itself a KL divergence. This comes from subtracting the ELBO from \(\log p_\theta(x)\) and using Bayes' rule \(p_\theta(z\mid x)=p_\theta(x,z)/p_\theta(x)\):

\[ \begin{aligned} \log p_\theta(x)-\mathrm{ELBO}(x) &= \mathbb{E}_{q_\phi(z\mid x)} \left[ \log p_\theta(x) - \log\frac{p_\theta(x,z)}{q_\phi(z\mid x)} \right] \\[0.45em] &= \mathbb{E}_{q_\phi(z\mid x)} \left[ \log q_\phi(z\mid x) - \log p_\theta(z\mid x) \right] \\[0.45em] &= \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big) \ge 0 \end{aligned} \]

So the exact identity is

\[ \log p_\theta(x) = \mathrm{ELBO}(x) + \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big) \]

Since the exact log-likelihood is usually not available, VAEs maximize this lower bound. Equivalently, as a loss, they minimize the negative ELBO:

\[ L_{\mathrm{VAE}} = \underbrace{-\mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)]}_{\text{reconstruction term}} + \underbrace{\mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{prior regularization}} \]

The first term is the reconstruction term: latent samples from \(q_\phi(z\mid x)\) should decode back to \(x\). The second term is the prior regularization term: the encoder's posterior should stay close to \(p(z)=\mathcal{N}(0,I)\), so random prior samples are more likely to land in regions the decoder understands.

\[ \text{If }p_\theta(x\mid z)=\mathcal{N}(g_\theta(z),\sigma_x^2 I), \quad -\log p_\theta(x\mid z) \propto \|x-g_\theta(z)\|_2^2 \]

That is why mean squared reconstruction loss often appears in simple VAE examples. For binary images, one often chooses a Bernoulli likelihood instead, which makes the reconstruction term binary cross entropy.

To train the encoder with backpropagation, we still need to sample \(z\). Directly sampling from \(q_\phi(z\mid x)\) would hide the dependence on \(\mu_\phi(x)\) and \(\sigma_\phi(x)\) behind a random operation. Let's reparameterize.

\[ \varepsilon \sim \mathcal{N}(0,I), \qquad z = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon \]

Now \(z\) is a differentiable function of the encoder outputs and an external noise sample, so gradients can flow through \(\mu_\phi\) and \(\sigma_\phi\)!
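
In code, the reparameterization step is only a few lines. A common convention, assumed here, is to have the encoder predict \(\log\sigma^2\) rather than \(\sigma\) for numerical stability.

```python
# Sketch of the reparameterization trick.
import torch

def reparameterize(mu, logvar):
    sigma = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(sigma)     # external noise, eps ~ N(0, I)
    return mu + sigma * eps           # z is a differentiable function of mu and sigma

mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, logvar)        # gradients can flow back through mu and logvar
```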

\[ \mathrm{KL}\big(\mathcal{N}(\mu,\operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0,I)\big) = \frac{1}{2}\sum_i\left(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\right) \]

This closed form is one reason diagonal Gaussian VAEs are convenient. It penalizes posterior means far from zero and variances far from one.
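
Putting the pieces together, a sketch of the negative ELBO for a diagonal Gaussian VAE with a Gaussian (mean squared error) reconstruction term might look like this; constants from the Gaussian likelihood are dropped.

```python
# Sketch of the negative ELBO (reconstruction term + closed-form KL).
import torch

def vae_loss(x, x_hat, mu, logvar):
    # reconstruction term: -E_q[log p(x|z)] up to constants, for a Gaussian decoder
    recon = ((x - x_hat) ** 2).sum(dim=1)
    # closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)
    return (recon + kl).mean()
```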

We do have a structural tradeoff. Weak KL pressure can leave the latent space irregular, so samples from \(p(z)\) may decode poorly. Strong KL pressure can make the decoder receive too little information about \(x\), so reconstructions become blurrier or less detailed.

A VAE is trained as a generative latent model, but its practical objective is an ELBO surrogate: fit the data through the reconstruction term while keeping the approximate posterior close to the prior through the KL term.

Trained Toy VAE Gaussian Explorer

Gaussian latent cloud
The Gaussian clouds are actual encoder outputs from a tiny VAE trained offline on synthetic \(12\times12\) face-like images. The KL slider is a post-training illustration of stronger prior pressure, not a live retraining run. The epsilon sliders compare a posterior sample \(z=\mu+\sigma\varepsilon\) with a prior sample \(z=\varepsilon\).
Decoded posterior sample \(z\sim q_\phi(z\mid x)\)
Decoded prior sample \(z\sim p(z)\)

Beta-VAE and Disentanglement

Beta-VAE modifies the VAE objective to increase pressure toward a simpler latent representation:

\[ L = \|x-\hat{x}\|_2^2 + \beta \cdot \mathrm{KL}(q(z\mid x)\,\|\,p(z)) \]

Larger \(\beta\) often encourages different latent dimensions to specialize in different factors of variation. In idealized settings, one dimension might control rotation while another controls width or brightness.

Setting \(\beta > 1\) lowers the information capacity of \(z\), which often gives better factor separation at the cost of worse reconstruction.
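
As a loss, this is a one-line change to the VAE sketch above: scale the KL term by \(\beta\). The value \(\beta = 4\) below is an arbitrary choice for illustration.

```python
# Sketch of the Beta-VAE objective: the KL term is weighted by beta.
import torch

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    recon = ((x - x_hat) ** 2).sum(dim=1)
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1)
    return (recon + beta * kl).mean()   # beta > 1 makes latent information more expensive
```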

Procedural Disentanglement Sliders

A hand-coded decoder showing the basic intuition behind factorized latent controls.

Procedural object

This is not a trained Beta-VAE. It is a procedural illustration: each slider is deliberately wired to one visual factor so the intended idea of disentanglement is visible without pretending the model discovered it.

Vector Quantized VAE

VQ-VAE replaces continuous latent variables with a learned table of vectors. The encoder produces a continuous proposal, but the stored latent is the nearest vector from that table. This makes the representation discrete and token-like.

\[ z_e(x)=f_\phi(x),\qquad k=\arg\min_j \|z_e(x)-e_j\|_2^2,\qquad z_q(x)=e_k \]

Here \(z_e(x)\) is the encoder's continuous output, \(e_j\) are the learned vectors, and \(z_q(x)\) is the quantized latent actually passed to the decoder.

\[ \hat{x}=g_\theta(z_q(x)) \]

The basic VQ-VAE loss combines reconstruction, learning the vector table, and a commitment term that discourages the encoder from jumping between vectors too freely:

\[ L = \|x-\hat{x}\|_2^2 + \|\operatorname{sg}[z_e(x)]-e_k\|_2^2 + \beta\|z_e(x)-\operatorname{sg}[e_k]\|_2^2 \]

The stop-gradient operator \(\operatorname{sg}[\cdot]\) means "treat this value as constant for this term." It lets selected vectors move toward encoder outputs in one term, while the encoder learns to commit to selected vectors in the other.
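
A sketch of the quantization step, including the straight-through gradient trick commonly used to backpropagate through the \(\arg\min\), might look like the following; the codebook size, latent dimension, and \(\beta\) are arbitrary assumptions.

```python
# Sketch of VQ-VAE quantization with stop-gradients and a straight-through estimator.
import torch
import torch.nn as nn

codebook = nn.Embedding(512, 64)          # 512 learned vectors e_j of dimension 64

def quantize(z_e, beta=0.25):
    # nearest vector: k = argmin_j ||z_e - e_j||^2
    dist = torch.cdist(z_e, codebook.weight)          # (batch, 512) pairwise distances
    k = dist.argmin(dim=1)
    z_q = codebook(k)
    # codebook term pulls e_k toward z_e; commitment term pulls z_e toward e_k
    vq_loss = ((z_e.detach() - z_q) ** 2).mean() + beta * ((z_e - z_q.detach()) ** 2).mean()
    # straight-through estimator: forward uses z_q, gradient flows back to z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, vq_loss

z_e = torch.randn(8, 64)                  # stand-in encoder outputs
z_q, vq_loss = quantize(z_e)
```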

The reader might note that VQ-VAE looks similar to nearest-neighbors! Geometrically, VQ-VAE borrows the idea: snap the encoder output to the closest learned vector. The difference from plain nearest neighbors or k-means is that these vectors are not static cluster centers. They are trained end-to-end with the encoder and decoder, so the discrete tokens become useful for reconstruction.

This is precisely why VQ-style models became important for image tokenization and later token-based generative systems.

Toy Vector Quantization

Drag the encoder output point and watch nearest-vector regions determine the discrete latent token.

Demo panels: a toy latent plane with a fixed vector set, plus readouts for the encoder output coordinates, the nearest vector, and its interpretation as a discrete token.

Short version: VQ-VAE uses nearest-neighbor geometry, but the vectors are trained end-to-end as part of an autoencoder rather than fitted as a separate k-means step. This demo freezes the vectors only to make the snapping step easy to see.

Convolutional Autoencoders

Images have local structure, so fully connected layers are not ideal. Convolutional autoencoders preserve spatial organization through feature maps, downsampling, and upsampling.

Stage | Shape | Role
Input | 28 × 28 × 1 | Raw image
Conv 1 | 14 × 14 × 32 | Local edge features
Conv 2 | 7 × 7 × 64 | Higher-level spatial features
Latent | 128 | Compressed code
Decoder | Reverse path | Reconstruct image

Transpose convolutions can cause checkerboard artifacts. A common fix is resize plus convolution, which usually yields smoother upsampling.
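
Here is a sketch of a convolutional autoencoder whose shapes follow the table above, using resize-plus-convolution upsampling in the decoder to avoid checkerboard artifacts; kernel sizes and activations are arbitrary choices.

```python
# Sketch of a convolutional autoencoder for 28x28 grayscale images.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 28x28x1 -> 14x14x32
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 14x14x32 -> 7x7x64
    nn.Flatten(),
    nn.Linear(7 * 7 * 64, 128),                            # latent code of size 128
)
decoder = nn.Sequential(
    nn.Linear(128, 7 * 7 * 64), nn.ReLU(),
    nn.Unflatten(1, (64, 7, 7)),
    nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),  # resize + conv
    nn.Upsample(scale_factor=2), nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
)

x = torch.rand(8, 1, 28, 28)
x_hat = decoder(encoder(x))        # reconstruction with the same shape as x
```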

Masked Autoencoders

Masked autoencoders hide random patches and ask the model to reconstruct the missing content. In vision, this often means patch embeddings, a transformer encoder on visible patches, and a lightweight decoder for reconstruction.

The reason this teaches useful representations is that the model has to infer global structure rather than leaning on local pixel copying.
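
The masking step itself is simple to sketch: choose a random subset of patch indices and hand only those patches to the encoder. The patch count, patch dimension, and mask ratio below are typical values, assumed here for illustration.

```python
# Sketch of MAE-style random masking over patch embeddings.
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, patch_dim)
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]    # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                         # the encoder sees only `visible`

patches = torch.rand(4, 196, 768)                    # e.g. 14x14 patches of a 224x224 image
visible, keep_idx = random_masking(patches)
```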

Conceptual Mask Ratio Slider

As more patches disappear, the reconstruction problem gets harder.

Original
Visible patches
Neighbor fill-in

The reconstruction panel uses only visible neighboring patches for a crude fill-in. A real MAE learns a much stronger image prior from data; the visual here only shows why higher mask ratios make the task harder.

Diffusion Models

Denoising autoencoders and diffusion models are closely related conceptually: both learn how to reverse corruption. Diffusion models take this idea to a multi-step stochastic process, repeatedly denoising from pure noise back toward data.

In latent diffusion systems, an autoencoder first learns a compressed image space. During generation, diffusion usually starts from noise in that latent space and the decoder maps the denoised latent back to pixels.

Latent Diffusion Compression Pipeline

Autoencoder compression path: image (256 × 256 × 3) → encoder (compress) → latent space (32 × 32 × 4) → decoder (reconstruct) → reconstruction (256 × 256 × 3).

Latent diffusion generation path: sampling starts from a noisy latent; the diffusion denoiser runs in latent space; the decoder then expands the result into the generated image (256 × 256 × 3).

Readouts: pixel cells 196,608; latent cells 4,096; state-size reduction 48×.

The exact factor depends on architecture details, but the core idea is stable: latent diffusion dramatically reduces the state size the denoiser must process, making it computationally attractive.

Representation Learning

Autoencoders are just one family of self-supervised learning methods. The supervision comes from the input itself rather than an external label.

Method | Training signal | Learns by | Common use
Autoencoder | Reconstruction | Compressing and decoding | Dimensionality reduction
Denoising AE | Clean target from noisy input | Removing corruption | Robust features
VAE | Reconstruction + KL | Probabilistic latent modeling | Generation
Sparse AE | Reconstruction + sparsity | Feature decomposition | Interpretability
MAE | Masked reconstruction | Filling missing patches | Vision pretraining
Contrastive | Positive/negative pairs | Similarity structure | Representation learning

An Information-Theoretic View

There is also a clean information-theoretic way to say what we have been saying all along. The latent code \(z\) should preserve enough information to reconstruct \(x\), but not so much information that the network just stores the input. This is exactly the tension studied in rate-distortion theory.

\[ \min_{p(z\mid x),\,g} \mathbb{E}\big[d(x,g(z))\big] \quad\text{subject to}\quad I(X;Z) \le R \]

The distortion term is just reconstruction error. The rate constraint says the latent code is only allowed to carry \(R\) units of information about the input. If \(R\) is tiny, reconstructions collapse. If \(R\) is huge, the model can keep too much detail and the representation becomes less interesting.

The quantity \(I(X;Z)\) is mutual information: how much does knowing the latent code \(Z\) tell us about the original input \(X\)?

\[ I(X;Z) = \mathbb{E}_{p(x,z)} \left[ \log\frac{p(z\mid x)}{p(z)} \right] = H(Z)-H(Z\mid X) \]

This is the issue with a large deterministic autoencoder. Given \(x\), the code \(z\) may be almost completely determined, and if \(z\) has enough capacity, it can carry a huge amount of information. The diagram still says "encoder" and "decoder," but the model may effectively be using \(z\) as a memory buffer.

\[ \mathcal{L} = \mathbb{E}\big[d(x,\hat{x})\big] + \beta I(X;Z) \]

This is the same tradeoff written as a loss. We pay for bad reconstructions, but we also pay for letting \(z\) carry too much information. Increasing \(\beta\) makes information more expensive, so the model has to be more selective about what it keeps.

This is one reason the VAE objective is so natural. Its KL term is not literally \(I(X;Z)\) for a single example, but after averaging over the dataset it decomposes into an information term plus a prior-matching term:

\[ \mathbb{E}_{p_{\text{data}}(x)} \left[ \mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p(z)\big) \right] = I_q(X;Z) + \mathrm{KL}\big(q_\phi(z)\,\|\,p(z)\big) \]

Here \(q_\phi(z)=\int q_\phi(z\mid x)p_{\text{data}}(x)\,dx\) is the overall distribution of latent codes produced by the encoder. So the VAE KL is doing two things at once: it limits how much the latent can say about each individual input, and it pushes the full cloud of latent codes toward the prior.

The goal is not simply to make \(z\) as small as possible. The goal is to make \(z\) carry the right information: enough to reconstruct stable structure, but not so much that it memorizes every pixel-level detail.

Common Failure Modes

  • Identity mapping: the bottleneck or regularization is too weak.
  • Posterior collapse in VAEs: the decoder ignores the latent variables.
  • Blurry VAE samples: average-seeking likelihoods and strong regularization can wash out detail.
  • Dead sparse features: too much sparsity pressure can shut down neurons permanently.
  • Checkerboard artifacts: transpose convolutions can upsample unevenly.
  • Poor latent interpolation: the latent space may have holes or badly regularized regions.
  • Train/test mismatch: low training reconstruction does not imply robust representation quality.

Adjust bottleneck size, tune \(\beta\) or \(\lambda\), anneal KL in VAEs, normalize inputs, prefer resize-plus-convolution when artifacts appear, and use skip connections carefully so they do not bypass the intended compression.

Summary

With that, you've finished this article. Congratulations!

Autoencoders are more than compression tools. They are a lens into representation learning. By forcing a model to preserve what matters and discard what does not, autoencoders reveal the geometry of data.

Modern AI increasingly depends on learned representations. Autoencoders show one of the clearest paths from raw data to latent structure: compress, reconstruct, regularize, and discover meaning.

I hope the reader enjoyed this article; I certainly had fun writing it. More posts will come in the future!