
Representative Sampling of High‑Dimensional Point Sets

Ethan Liu | May 2025 · Updated October 2025

Background

High-dimensional point sets arise across numerous domains, presenting fundamental computational and statistical challenges:

  • Machine Learning: Feature vectors in modern ML systems can have thousands of dimensions (e.g., text embeddings, image descriptors, molecular fingerprints)
  • Computational Biology: Gene expression profiles containing tens of thousands of genes and single-cell RNA-seq data
  • Scientific Computing: Particle simulations and climate models with high-dimensional state spaces
  • Computer Graphics: 3D point clouds, mesh vertices, and spatial coordinates with various attributes

As dimensionality increases, data becomes exponentially sparse, distance metrics lose discriminative power, and computational costs grow rapidly. This motivates the need for representative sampling—selecting small subsets that preserve the essential characteristics of large datasets.

The challenge is twofold. Computationally, many downstream algorithms scale steeply with \(N\) (e.g., brute-force nearest-neighbor search \(O(Nd)\) per query, kernel methods \(O(N^2)\), optimal transport up to \(O(N^3\log N)\)), so reducing \(N\) to a compact \(m\) yields dramatic speedups. Statistically, we still want estimators computed on \(X\) to be faithful to those on \(Y\):

\[ \hat{\theta}_X \xrightarrow{p} \theta \quad \text{as} \quad m \to \infty \]

where \(\hat{\theta}_X\) is computed on the subset and \(\theta\) is the true parameter. The rate at which \(\hat{\theta}_X\) approaches \(\theta\) depends on how representative \(X\) is of \(Y\). Our goal is to find subsets that balance computational savings with statistical fidelity.

Formal Problem Statement

Given a dataset \(Y = \{y_i\}_{i=1}^N \subset \mathbb{R}^d\), we seek to select a subset \(X = \{x_j\}_{j=1}^m \subset Y\) with \(m \ll N\) that preserves key properties of \(Y\). Formally:

\[ X^* = \arg\min_{X \subset Y, |X| = m} \mathcal{L}(P_Y, P_X) + \lambda \cdot R(X) \]

where:

  • \(P_Y\) and \(P_X\) are probability distributions induced by \(Y\) and \(X\)
  • \(\mathcal{L}\) is a loss function measuring discrepancy (e.g., KL divergence, Wasserstein distance)
  • \(R(X)\) is a regularization term encouraging desirable properties (e.g., diversity, coverage)
  • \(\lambda\) balances fidelity against regularization

Method Categories

Distributional Methods

These methods match probability distributions by minimizing a divergence \(d(P_Y, P_X)\), which is ideal for inference, generative modeling, and risk analysis where tails and modality matter. The benefit is principled fidelity to the data‑generating process; the drawback is computational load—full KL or OT can be prohibitive in high‑D—so we rely on tractable proxies like sliced variants and kernel metrics that retain signal while scaling linearly or nearly so.

Geometric Methods

These emphasize spatial structure—extents, boundaries, and relative positions. They excel for visualization and downstream geometry (e.g., nearest neighbors) and are extremely fast, but they may under‑represent dense regions or subtle distributional features compared to distributional objectives.

Sampling‑Based Methods

Stochastic optimization (e.g., MCMC) explores the subset space via probabilistic moves that can escape local optima and quantify uncertainty. Modern variants use delayed‑acceptance and streaming surrogates to keep proposal evaluation cheap while preserving global search behavior.

Diversity‑Based Methods

These explicitly discourage redundancy to promote spread. k‑DPPs achieve this via kernel‑based repulsion; practical deployments rely on Nyström landmarking to approximate kernels and keep factorization and sampling within budget.

Let \(Y = \{y_i\}_{i=1}^N \subset \mathbb{R}^d\) be the full dataset and \(X = \{x_j\}_{j=1}^m \subset Y\) the selected subset with \(m \ll N\). We denote by \(P_Y\) and \(P_X\) the probability distributions induced by these sets. In practice, we work with either empirical distributions (\(P_Y = \frac{1}{N}\sum_{i=1}^N \delta_{y_i}\) and \(P_X = \frac{1}{m}\sum_{j=1}^m \delta_{x_j}\)) or smoothed distributions using kernel density estimates (KDE) to avoid atomic mass points.

Kullback-Leibler Divergence

The KL divergence is an information-theoretic measure of the information lost when \(Q\) is used to approximate \(P\). For continuous distributions with densities \(p\) and \(q\):

\[ D_{\mathrm{KL}}(P \Vert Q) = \int_{\mathbb{R}^d} p(x) \log\frac{p(x)}{q(x)} \, dx \]

For discrete distributions with probability masses \(\{p_i\}\) and \(\{q_i\}\):

\[ D_{\mathrm{KL}}(P \Vert Q) = \sum_i p_i \log\frac{p_i}{q_i} \]

\(D_{\mathrm{KL}}(P \Vert Q)\) penalizes regions where \(Y\) has mass but \(X\) does not. In our context, \(P\) can be thought of as the distribution of the full set \(Y\) and \(Q\) as that of the subsample \(X\). Intuitively, \(D_{\mathrm{KL}}(P \Vert Q)\) measures the information lost when \(Q\) (the subset's distribution) is used to approximate \(P\) (the full set's distribution). A smaller KL divergence means the subset \(X\) is a better stand-in for \(Y\).

However, direct estimation in high dimensions requires density estimation, which suffers from the curse of dimensionality. As dimension \(d\) increases, the volume of space grows exponentially (\(O(r^d)\)) while sample size grows only linearly, making data points exponentially more isolated and density estimates highly unreliable. Therefore, we need to approximate the KL divergence using tractable metrics that avoid explicit density estimation, which we will discuss in the following sections.

Optimal transport. Rather than comparing densities, optimal transport measures the minimal cost to move mass from one distribution to another, respecting geometry. The p‑Wasserstein distance is defined as:

\[ W_p(P,Q) = \left(\inf_{\gamma \in \Pi(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\gamma(x,y)\right)^{1/p} \]

where \(\Pi(P,Q)\) is the set of all couplings (joint distributions) with marginals \(P\) and \(Q\).

For \(p=1\) (Wasserstein‑1), this is the minimum work to move mass from \(P\) to \(Q\). In 1D it reduces to:

\[ W_1(P,Q) = \int_0^1 |F_P^{-1}(u) - F_Q^{-1}(u)| \, du \]

where \(F_P^{-1}\) and \(F_Q^{-1}\) are quantile functions. In practice, we use sliced variants to make OT scalable.

Kernel metrics. Maximum Mean Discrepancy (MMD) compares distributions via their embeddings in an RKHS (Gretton et al., 2012):

\[ \mathrm{MMD}^2_k(P,Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}^2 \]

where \(\mu_P = \mathbb{E}_{X \sim P}[\phi(X)]\) is the kernel mean embedding and \(\phi\) is the feature map associated with kernel \(k\).

An unbiased estimator using finite samples:

\[ \widehat{\mathrm{MMD}}^2_k = \frac{1}{m(m-1)}\sum_{i \neq j}^m k(x_i, x_j) + \frac{1}{N(N-1)}\sum_{i \neq j}^N k(y_i, y_j) - \frac{2}{mN}\sum_{i=1}^m\sum_{j=1}^N k(x_i, y_j) \]

Common kernels:

  • RBF (Gaussian) kernel:
    \( k(x,y) = \exp\!\left(-\frac{\|x - y\|^2}{2\ell^2}\right) \)
    where \(\ell > 0\) is the bandwidth parameter
  • Linear kernel: \(k(x,y) = x^\top y\)
  • Laplace kernel: \(k(x,y) = \exp(-\|x - y\|/\ell)\)
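To make the estimator concrete, here is a minimal NumPy sketch of the unbiased \(\widehat{\mathrm{MMD}}^2_k\) above with the RBF kernel; the function names and small numerical guards are illustrative, not the implementation used in our experiments.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale):
    """k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)) for all pairs."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def mmd2_unbiased(X, Y, lengthscale):
    """Unbiased estimate of MMD^2 between samples X (m, d) and Y (N, d)."""
    m, N = len(X), len(Y)
    Kxx = rbf_kernel(X, X, lengthscale)
    Kyy = rbf_kernel(Y, Y, lengthscale)
    Kxy = rbf_kernel(X, Y, lengthscale)
    # drop diagonal terms for the unbiased within-sample averages
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (N * (N - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```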

Sliced Approximations

The Slicing Framework

Direct computation of high-dimensional divergences is often intractable. The slicing approach (Bonnotte, 2013) projects distributions onto random 1D subspaces and aggregates the results:

For a unit vector \(v \in \mathbb{S}^{d-1}\), define the projection operator \(\pi_v(x) = \langle v, x \rangle\). The pushforward measure \((\pi_v)_* P\) is the distribution of projected values when \(X \sim P\).

Sliced Wasserstein Distance

\[ \mathrm{SW}_1^p(P,Q) = \frac{1}{p}\sum_{j=1}^p W_1\left((\pi_{v_j})_* P, (\pi_{v_j})_* Q\right) \]

where \(\{v_j\}_{j=1}^p\) are random unit directions sampled uniformly from the sphere.

Sliced KL Divergence

\[ \mathrm{SKL}^p(P,Q) = \frac{1}{p}\sum_{j=1}^p D_{\mathrm{KL}}\left((\pi_{v_j})_* P \,\Big\|\, (\pi_{v_j})_* Q\right) \]

Computational advantage: Each 1D computation is \(O(N \log N)\) for sorting (Wasserstein) or \(O(N)\) for binning (KL), compared to intractable high-dimensional operations.
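The slicing primitives are simple to implement. Below is a minimal NumPy sketch of sampling directions uniformly from the unit sphere and forming the 1D projections that the sliced metrics in the Methods section consume; the names are illustrative.

```python
import numpy as np

def random_directions(p, d, rng):
    """Sample p unit vectors uniformly from the sphere S^{d-1}."""
    V = rng.standard_normal((p, d))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def project(points, directions):
    """Return the (n_points, p) matrix of 1D projections <v_j, x_i>."""
    return points @ directions.T

# usage: each column of project(Y, V) is a 1D sample to compare against
# the corresponding column of project(X, V) via W1 or histogram KL.
```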

Methods

Kullback-Leibler Divergence

One principled way to measure how well a subset \(X\) of points represents a full set \(Y\) is to treat each set as defining a probability distribution over space (for example, an empirical distribution or a density estimate) and then compute the Kullback–Leibler (KL) divergence between these distributions. The KL divergence \(D_{\mathrm{KL}}(P\Vert Q)\) from distribution \(P\) to distribution \(Q\) is given by:

\[ D_{\mathrm{KL}}(P\Vert Q) = \int P(x)\, \log \frac{P(x)}{Q(x)}\, dx \]

or equivalently, for discrete distributions:

\[ D_{\mathrm{KL}}(P\Vert Q) = \sum_i p_i\,\log\frac{p_i}{q_i} \]

In our context, \(P\) can be thought of as the distribution of the full set \(Y\) and \(Q\) as that of the subsample \(X\). Intuitively, \(D_{\mathrm{KL}}(P\Vert Q)\) measures the information lost when \(Q\) (the subset's distribution) is used to approximate \(P\) (the full set's distribution). A smaller KL divergence means the subset \(X\) is a better stand-in for \(Y\).

This method starts with an empty subset and iteratively adds points from \(Y\) to \(X\) such that each addition yields the greatest reduction in \(D_{\mathrm{KL}}(P_Y\Vert P_X)\). Initially, \(P_X\) (with \(X\) empty or very small) is a poor approximation of \(P_Y\); as more points are added, \(P_X\) should become a closer approximation and the KL divergence decreases. To implement this, we need a way to estimate the distributions \(P_Y\) and \(P_X\) at each step. One approach is to use a discretization of the space into bins (a histogram) or a kernel density estimate. At each iteration, we evaluate which candidate point, if added to \(X\), would most decrease the divergence \(D_{\mathrm{KL}}(P_Y\Vert P_{X\cup\{p\}})\). That point is then selected into \(X\).

Here \(P_Y\) and \(P_X\) denote the (empirical or smoothed) distributions of \(Y\) and \(X\); \(D_{\mathrm{KL}}\) is Kullback–Leibler divergence.

\[\begin{aligned} &X_0 = \varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; D_{\mathrm{KL}}\!\big(P_Y\,\Vert\, P_{X_{t-1}\cup\{y\}}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

At each step, we score a candidate \(y\) by the KL divergence between the full set and the subset if \(y\) were added, then pick the minimizer \(x_t^{\ast}\) and update \(X_t\). This directly targets distributional fidelity in the full space.

A note on cluster allocation in the 3D demo: the uneven spread of representatives is expected given how the points were generated and selected. The synthetic 3D generator uses a 3‑cluster GMM with unequal weights (≈ 0.45, 0.35, 0.20), so with m=30 the "fair" split by mass is roughly 14, 11, and 6 representatives, and the two larger clusters naturally receive more red points. Distributional methods (KL, Sliced‑KL, Sliced‑W1, MMD) allocate more representatives where there is more probability mass, so bigger or denser clusters receive more points. Mean–Variance and Hyperplane can also bias toward clusters with larger spread or radius, which again favors the larger two clusters if the smaller one is tighter. Finally, greedy selection plus randomness in the seed introduces small deviations from the exact mass ratio.

This greedy procedure is computationally intensive for large \(Y\) (each iteration scans all remaining points to evaluate the KL divergence). However, it often yields a subset that very closely approximates the distribution of \(Y\). In practice, one can limit the resolution of the distribution estimate or use approximate updates to make this feasible. We implemented this method for both 2D and 3D point sets, using a fixed grid to estimate \(P_Y\) and \(P_X\) at each step. In the 3D case, the concepts are identical, but estimating \(P_Y\) and \(P_X\) requires a 3D grid or kernel, which increases computational cost. The result is a set \(X\) that minimizes the information loss (KL divergence) relative to \(Y\) by greedily covering regions of space in proportion to how much probability mass \(Y\) has in those regions.

Walkthrough. Estimate \(P_Y\) and the subset \(P_X\) (via KDE or grids). For each candidate \(y\), compute the drop in \(D_{\mathrm{KL}}(P_Y\Vert P_{X\cup\{y\}})\); add the minimizer. The 3D view shows how KL greedily concentrates reps in high‑density regions while still spanning the support.
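The following is a hedged NumPy sketch of the greedy grid-based KL selection described above; the bin count, smoothing constant, and function names are assumptions rather than the exact settings used in our runs.

```python
import numpy as np

def hist_probs(points, edges, eps=1e-12):
    """Normalized histogram of points on a fixed grid (one edge array per dimension)."""
    H, _ = np.histogramdd(points, bins=edges)
    return (H + eps) / (H.sum() + eps * H.size)   # smooth to avoid log(0)

def greedy_kl_select(Y, m, bins=16):
    """Greedily add the point that most reduces D_KL(P_Y || P_X) on a shared grid."""
    edges = [np.linspace(Y[:, j].min(), Y[:, j].max(), bins + 1) for j in range(Y.shape[1])]
    P_Y = hist_probs(Y, edges)
    chosen, remaining = [], list(range(len(Y)))
    for _ in range(m):
        best, best_score = None, np.inf
        for i in remaining:
            P_X = hist_probs(Y[chosen + [i]], edges)
            score = np.sum(P_Y * np.log(P_Y / P_X))   # D_KL(P_Y || P_{X ∪ {y}})
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return Y[chosen], chosen
```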

Sliced KL

Sliced KL provides a scalable approximation to full KL divergence by exploiting the fact that 1D distributions are much easier to estimate and compare than high-dimensional ones. Instead of working directly with the full d-dimensional distributions, we project the data onto random one-dimensional subspaces and aggregate the resulting 1D KL divergences. This approach transforms an intractable high-dimensional density estimation problem into a collection of tractable 1D problems.

The mathematical foundation rests on the pushforward measure: for a unit vector \(v\), the projection \(\pi_v(x) = \langle v, x \rangle\) creates a 1D distribution \((\pi_v)_* P\) that captures the behavior of \(P\) along direction \(v\). By sampling multiple directions \(\{v_j\}_{j=1}^p\), we obtain different "slices" or views of the original distribution. The sliced KL objective averages the 1D KL divergences across these directions:

\[ \mathrm{SKL}(P_Y,P_X)=\frac{1}{p}\sum_{j=1}^p D_{\mathrm{KL}}\!\left(\langle v_j,\cdot\rangle_{∗} P_Y\,\Big\Vert\,\langle v_j,\cdot\rangle_{∗} P_X\right). \]

Statistical interpretation: While full KL divergence captures the complete information loss between distributions, sliced KL focuses on preserving the distributional characteristics along diverse directions. This directional preservation often suffices for practical applications where the relative positions and densities along various axes matter more than the full joint distribution. The method is particularly effective when the underlying distribution has meaningful lower-dimensional structure or when correlations between dimensions are moderate.

Computational advantages: In 1D, we can use simple histogram estimates or kernel density estimation with much smaller bandwidth requirements. Each 1D KL computation is \(O(N)\) for histogram binning, compared to \(O(N^2)\) or worse for full multivariate KDE. The total complexity becomes \(O(p \cdot N)\) per evaluation, where typically \(p \ll N\) and \(p\) can be as small as 50-200 for good results.

Practical considerations: The number of projection directions \(p\) controls the approximation quality. Too few directions may miss important distributional features, while too many increase computation without proportional benefits. The directions are typically sampled uniformly from the unit sphere, though adaptive schemes that focus on informative directions are possible. For 1D density estimation, we use histograms with \(O(\sqrt{N})\) bins or simple KDE with adaptive bandwidth selection.

Behavioral characteristics: Sliced KL tends to produce subsets that preserve the relative densities and cluster structures visible along various projections. It's more robust to the curse of dimensionality than full KL and can handle moderate-dimensional problems (d ≤ 100) effectively. The method naturally adapts to anisotropic distributions since different directions capture different aspects of the data structure.

Here \(v_j\) are random unit directions and \(\langle v_j,\cdot\rangle_{∗}P\) denotes the 1D pushforward (projection law) of a distribution \(P\).

\[\begin{aligned} &X_0 = \varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; \frac{1}{p}\sum_{j=1}^p D_{\mathrm{KL}}\!\Big(\langle v_j,\cdot\rangle_{∗} P_Y\,\Big\Vert\, \langle v_j,\cdot\rangle_{∗} P_{X_{t-1}\cup\{y\}}\Big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

We draw random directions \(v_j\), compare projected 1D laws, average the 1D KLs over \(p\) slices, and choose \(x_t^{\ast}\) that minimizes this sliced score. This is a scalable proxy for full KL in high‑d.

Walkthrough. Draw \(p\) random directions \(v_j\); in each 1D slice compare \(\langle v_j,\cdot\rangle_{∗}P_Y\) vs \(\langle v_j,\cdot\rangle_{∗}P_X\) by KL; add the point that minimizes the sliced average. This scales KL‑like selection to high‑d.
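A minimal NumPy sketch of the sliced-KL score, assuming shared-edge 1D histograms per direction; the bin count, smoothing, and names are illustrative.

```python
import numpy as np

def kl_1d(a, b, n_bins=32, eps=1e-12):
    """D_KL between 1D samples a and b via histograms on shared bin edges."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return np.sum(p * np.log(p / q))

def sliced_kl(Y, X, directions):
    """Average 1D KL over the projections <v_j, .> of Y and X."""
    PY, PX = Y @ directions.T, X @ directions.T
    return np.mean([kl_1d(PY[:, j], PX[:, j]) for j in range(directions.shape[0])])

# greedy step: score each candidate y by sliced_kl(Y, np.vstack([X, y]), V)
# and add the minimizer, as in the update rule above.
```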

Wasserstein‑1

True \(W_1\) in 3D solves an optimal transport problem between \(P_Y\) and \(P_X\). It is accurate but expensive to compute at large \(N\). For the interactive demo below, we either use the true \(W_1\) at small \(N\) or a high‑\(p\) sliced proxy that is visually indistinguishable at \(N=6000\).

Sliced Wasserstein-1

Sliced Wasserstein-1 offers a transport-aware perspective on distributional matching that complements information-theoretic approaches like KL divergence. Instead of measuring information loss, Wasserstein distance quantifies the minimal "work" required to transform one distribution into another, where work is defined as the amount of probability mass moved multiplied by the distance it travels. This geometric interpretation makes W1 particularly suitable for applications where the spatial arrangement and support of the distribution matter.

Transport perspective: Imagine each point in the full set \(Y\) as a pile of sand of equal weight, and each point in the subset \(X\) as an empty hole. The Wasserstein distance finds the optimal way to move sand from the piles to fill the holes while minimizing the total transport cost. This optimal transport formulation captures both the differences in location (where mass needs to move) and the differences in quantity (how much mass needs to move).

1D simplification: In one dimension, the optimal transport plan has a beautiful property: it's simply given by matching sorted quantiles. If we sort the projected values of both distributions, the optimal matching pairs the smallest with smallest, second smallest with second smallest, and so on. The Wasserstein-1 distance then reduces to the average absolute difference between these sorted values:

\[ W_1(P,Q) = \frac{1}{n}\sum_{i=1}^n |F_P^{-1}(i/n) - F_Q^{-1}(i/n)| = \int_0^1 |F_P^{-1}(u) - F_Q^{-1}(u)|\,du, \]
where \(F_P^{-1}\) and \(F_Q^{-1}\) are the quantile functions. The sum form is exact for two empirical distributions with \(n\) points each, where it simply matches sorted values.

Sliced approximation: Extending this to high dimensions, we sample random directions and compute the 1D Wasserstein distances along each projection. The sliced W1 objective averages these directional distances:

\[ \mathrm{SW}_1(P_Y,P_X)=\frac{1}{p}\sum_{j=1}^p W_1\big(\langle v_j,\cdot\rangle_{∗} P_Y,\langle v_j,\cdot\rangle_{∗} P_X\big). \]

Statistical properties: Unlike KL divergence, Wasserstein distance is symmetric and respects the underlying metric of the space. It provides gradients even when distributions have disjoint support, making it numerically stable and well-behaved. The W1 metric is particularly robust to outliers and can handle distributions with different support without becoming infinite.

Computational efficiency: Each 1D Wasserstein computation requires only sorting \(O(N \log N)\) followed by a linear pass \(O(N)\). With \(p\) projection directions, the total complexity is \(O(p \cdot N \log N)\) per evaluation, which is comparable to sliced KL but with better constants since sorting is highly optimized in practice.

Practical behavior: Sliced W1 tends to preserve both the overall shape and the relative positions of probability mass across different regions. It's particularly effective at capturing differences in location, scale, and shape that might be missed by purely density-based methods. The method is less sensitive to precise density estimation and more focused on matching the cumulative distribution characteristics.

Parameter sensitivity: The number of projections \(p\) typically ranges from 100 to 500 for good results. The method is relatively insensitive to the exact choice of \(p\) - more projections reduce variance in the estimate but with diminishing returns. The random directions can be either fixed across iterations (for consistency) or resampled (for stochasticity).

\(v_j\) are random unit directions; \(\langle v_j,\cdot\rangle_{∗}P\) is the pushforward; \(W_1\) is the 1D Earth‑Mover distance between projected laws.

\[\begin{aligned} &X_0 = \varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; \frac{1}{p}\sum_{j=1}^p W_1\!\Big(\langle v_j,\cdot\rangle_{∗} P_Y,\; \langle v_j,\cdot\rangle_{∗} P_{X_{t-1}\cup\{y\}}\Big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

We average 1D Earth‑Mover distances across random projections and greedily add the point \(x_t^{\ast}\) that yields the smallest sliced‑\(W_1\). This emphasizes transport‑aware alignment of mass.

Walkthrough. Draw \(p\) directions, form 1D empirical laws of \(Y\) and \(X\), and greedily add the point that minimizes the average 1D \(W_1\). Sliced‑\(W_1\) emphasizes transport‑aware shape alignment.
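A minimal NumPy sketch of the sliced-\(W_1\) score; the 1D distance is computed by quantile matching on a fixed grid so that \(Y\) and \(X\) may have different sizes (the quantile grid size is an assumption).

```python
import numpy as np

def w1_1d(a, b, n_quantiles=256):
    """1D Wasserstein-1 via quantile matching on a common grid."""
    u = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.mean(np.abs(np.quantile(a, u) - np.quantile(b, u)))

def sliced_w1(Y, X, directions):
    """Average 1D W1 over random projections."""
    PY, PX = Y @ directions.T, X @ directions.T
    return np.mean([w1_1d(PY[:, j], PX[:, j]) for j in range(directions.shape[0])])

# greedy step: add the candidate y minimizing sliced_w1(Y, np.vstack([X, y]), V).
```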

Kernel Herding (Maximum Mean Discrepancy, MMD)

Kernel herding provides an elegant framework for representative selection by operating in a high-dimensional feature space where the structure of the data becomes more apparent. The key insight is that through an appropriate kernel function, we can map points to a reproducing kernel Hilbert space (RKHS) where matching moments becomes equivalent to matching the full distribution. This approach avoids explicit density estimation while still capturing complex distributional characteristics.

RKHS and mean embeddings: The kernel function \(k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\) implicitly defines a feature map \(\phi: \mathbb{R}^d \rightarrow \mathcal{H}\) into a Hilbert space \(\mathcal{H}\) where \(k(x,y) = \langle \phi(x), \phi(y) \rangle_\mathcal{H}\). The kernel mean embedding \(\mu_P = \mathbb{E}[\phi(X)]\) provides a complete representation of the distribution \(P\) (for characteristic kernels like the RBF). By matching mean embeddings between \(P_Y\) and \(P_X\), we implicitly match the full distributions.

Maximum Mean Discrepancy: The MMD squared measures the distance between mean embeddings in the RKHS:

\[ \mathrm{MMD}^2(P_Y, P_X) = \|\mu_{P_Y} - \mu_{P_X}\|_\mathcal{H}^2 = \mathbb{E}[k(Y,Y')] + \mathbb{E}[k(X,X')] - 2\mathbb{E}[k(Y,X)], \]
where the expectations are taken over independent samples from each distribution.

Herding algorithm: Instead of directly minimizing MMD (which would require recomputing all pairwise similarities), kernel herding uses a greedy strategy that sequentially selects points to best approximate the kernel mean. At step \(t\), we choose the point that maximizes the discrepancy between its kernel mean score and the average similarity to previously selected points:

\[ x_t \in \arg\max_{y\in Y} \; \mu(y) - \frac{1}{t-1}\sum_{j=1}^{t-1} k(y,x_j),\quad \mu(y)=\frac{1}{|Y|}\sum_{i} k(y,y_i). \]

Statistical interpretation: The term \(\mu(y)\) measures how typical the candidate point \(y\) is with respect to the full dataset - it's high in dense regions and low in sparse ones. The subtraction term \(\frac{1}{t-1}\sum_{j=1}^{t-1} k(y,x_j)\) penalizes points that are similar to already selected ones, promoting diversity. This balance between representativeness (high density) and diversity (low redundancy) is the hallmark of kernel herding.

Kernel choice and hyperparameters: The Radial Basis Function (RBF) kernel \(k(x,y) = \exp(-\|x-y\|^2/(2\ell^2))\) is commonly used due to its universal approximation properties. The lengthscale parameter \(\ell\) controls the smoothness: small \(\ell\) captures fine-grained local structure, while large \(\ell\) focuses on global patterns. In practice, \(\ell\) is often chosen as the median pairwise distance or via cross-validation.

Random Fourier Features acceleration: For large datasets, computing all \(N^2\) kernel evaluations becomes prohibitive. Random Fourier Features (RFF) provide an explicit approximation \(\hat{\phi}(x)\) such that \(k(x,y) \approx \hat{\phi}(x)^\top \hat{\phi}(y)\). This allows us to compute the mean embedding directly: \(\hat{\mu}_P = \frac{1}{|P|}\sum_{x \in P} \hat{\phi}(x)\), reducing complexity from \(O(N^2)\) to \(O(NR)\) where \(R\) is the number of random features (typically 512-2048).

Computational complexity: Without approximation, step \(t\) of the naive approach requires \(O(N \cdot t)\) kernel evaluations, giving \(O(m^2 N)\) overall. With RFF, precomputing the features costs \(O(N R)\) and each greedy step costs \(O(N R)\) (using a running sum of selected features), for \(O(m N R)\) total, making it scalable to millions of points.

Behavioral characteristics: Kernel herding tends to select points that are both representative (appearing in high-density regions) and diverse (covering different modes of the distribution). The method naturally adapts to multi-modal distributions and can capture complex non-linear structure. The RBF kernel emphasizes local neighborhoods, leading to good coverage of dense regions while avoiding oversampling of sparse areas.

Here \(k\) is a positive‑definite kernel (RBF in our runs) and \(\mu(y)=|Y|^{-1}\sum_i k(y,y_i)\) is the kernel mean score.

\[\begin{aligned} &X_0 = \varnothing,\quad \mu(y)=\frac{1}{|Y|}\sum_{i} k(y,y_i).\\ &\text{For } t=1,\ldots,m:\\ &\quad g_t(y) \;:=\; \mu(y) - \frac{1}{t-1}\sum_{z\in X_{t-1}} k(y,z)\;\; (t>1),\;\; g_1(y):=\mu(y).\\ &\quad x_t^{\ast} \,\in\, \arg\max_{y\in Y\setminus X_{t-1}}\; g_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

Kernel herding picks \(x_t^{\ast}\) that best matches the kernel mean embedding: start from \(\mu(y)\) and, for \(t>1\), subtract similarity to previously chosen points. This greedily reduces MMD.

Walkthrough. Compute the kernel mean score \(\mu(y)=\tfrac{1}{|Y|}\sum_i k(y,y_i)\). At step \(t\) pick \(x_t\) maximizing \(\mu(y) - \tfrac{1}{t-1}\sum_{z\in X} k(y,z)\) (for \(t{>}1\)). With RFF this is fast and memory‑bounded.
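A sketch of RFF-accelerated herding, assuming the standard random Fourier feature construction for the RBF kernel and a running sum of selected features so each step costs a few matrix-vector products; the defaults (lengthscale 0.5, R=1024) echo the sketch parameters quoted later but are otherwise illustrative.

```python
import numpy as np

def rff_features(points, lengthscale, R, rng):
    """Random Fourier features approximating the RBF kernel: k(x, y) ~ phi(x) @ phi(y)."""
    d = points.shape[1]
    W = rng.standard_normal((d, R)) / lengthscale
    b = rng.uniform(0, 2 * np.pi, R)
    return np.sqrt(2.0 / R) * np.cos(points @ W + b)

def kernel_herding_rff(Y, m, lengthscale=0.5, R=1024, seed=0):
    rng = np.random.default_rng(seed)
    Phi = rff_features(Y, lengthscale, R, rng)    # (N, R)
    mean_embedding = Phi.mean(axis=0)             # approximate kernel mean of P_Y
    chosen, running_sum = [], np.zeros(R)
    for t in range(1, m + 1):
        # mu(y) minus the average similarity to already chosen points
        scores = Phi @ mean_embedding
        if t > 1:
            scores -= (Phi @ running_sum) / (t - 1)
        scores[chosen] = -np.inf                  # no repeats
        i = int(np.argmax(scores))
        chosen.append(i)
        running_sum += Phi[i]
    return Y[chosen], chosen
```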

Mean-Variance

The Mean-Variance method provides a principled approach to representative selection by focusing on preserving the first two statistical moments along multiple random projections. This approach stems from the observation that many distributions can be reasonably characterized by their location (mean) and spread (variance) along various directions. By ensuring these moments match between the full set and the subset, we obtain a representative sample that captures both central tendency and dispersion.

Moment matching principle: In classical statistics, the method of moments seeks to match sample moments to population moments. Here we adapt this idea: instead of matching all moments in high dimensions (which would be computationally expensive), we match only the first two moments along randomly chosen 1D projections. This provides a good balance between computational efficiency and statistical fidelity.

Projection mechanics: For each random direction \(v_j\), we project all points onto that direction: \(p_i^{(j)} = \langle v_j, y_i \rangle\). The projected values form a 1D distribution characterized by its mean \(\mu_Y^{(j)} = \frac{1}{N}\sum_i p_i^{(j)}\) and variance \(\sigma_Y^{(j)2} = \frac{1}{N}\sum_i (p_i^{(j)} - \mu_Y^{(j)})^2\). Our goal is to select a subset \(X\) such that its projected statistics \(\mu_X^{(j)}\) and \(\sigma_X^{(j)}\) closely match those of the full set across all directions.

The tradeoff dilemma: There's an inherent tension between matching means and matching variances. Points near the center of the distribution help match the mean but contribute little to variance. Points in the tails increase variance but may shift the mean away from the target. The parameter \(\lambda\) in the error function controls this tradeoff:

\[ E(X) = \frac{1}{p}\sum_{j=1}^p \Big[(\mu_Y^{(j)}-\mu_X^{(j)})^2 + \lambda(\sigma_Y^{(j)}-\sigma_X^{(j)})^2\Big]. \]

Statistical interpretation: When \(\lambda = 1\), we give equal importance to mean and variance matching. Setting \(\lambda > 1\) emphasizes variance (spread) and leads to selecting more extreme points, while \(\lambda < 1\) prioritizes mean matching and results in more centrally concentrated subsets. The choice of \(\lambda\) should reflect the application: for tasks where capturing variability is crucial, use larger \(\lambda\); for applications focused on central tendency, use smaller \(\lambda\).

Computational advantages: Computing means and variances along projections is highly efficient - each projection requires \(O(N)\) operations for the dot products and \(O(N)\) for the moment calculations. With \(p\) directions, the total complexity is \(O(p \cdot N)\) per candidate evaluation, making it much faster than density-based methods.

Direction selection strategy: The random directions are typically sampled uniformly from the unit sphere in \(\mathbb{R}^d\). The number of directions \(p\) controls the approximation quality - more directions provide better coverage of the angular space but increase computation. In practice, \(p = 50\) to \(200\) directions often suffice for good results. The directions can be fixed for reproducibility or resampled for stochastic variation.

Behavioral characteristics: The Mean-Variance method tends to select a balanced mix of central and peripheral points. It naturally allocates points to maintain the shape of the distribution across multiple orientations. This method is particularly effective for unimodal or approximately elliptical distributions where first and second moments capture most of the distributional information.

Limitations and extensions: For highly multi-modal distributions, matching only means and variances may not capture all relevant structure. The method can be extended to include higher moments (skewness, kurtosis) or to use adaptive projections that focus on directions with maximum variance (principal components) rather than random ones.

Concretely, suppose we draw \(p\) random unit vectors \(\{v_1, v_2, \ldots, v_p\}\) in the plane (for 2D, these could be random angles) or in higher dimensions (random directions on the unit sphere). For each direction \(v_j\), we can compute the mean \(\mu_Y^{(j)}\) and standard deviation \(\sigma_Y^{(j)}\) of the projections of all points in \(Y\) onto \(v_j\). We would like our subset \(X\) to have similar statistics: i.e., for each \(j\), the mean \(\mu_X^{(j)}\) and standard deviation \(\sigma_X^{(j)}\) (computed over points in \(X\) projected onto \(v_j\)) should be as close as possible to those of the full set.

There is often a trade-off between matching the mean and matching the variance (spread) along any given direction, especially if the distribution of \(Y\) is not uniform. For example, picking too many extreme outlier points can quickly match the range or variance of \(Y\), but may shift the subset's mean away from the true center of \(Y\); conversely, picking points near the center will align the means but reduce the variance of \(X\) relative to \(Y\). The greedy algorithm for the mean-variance method balances these by explicitly considering both in its selection criterion. We define an error metric that quantifies the discrepancy in mean and variance along the set of projection directions \(V = \{v_1, \ldots, v_p\}\). One such error function can be written as:

\[ E(X) = \frac{1}{p}\sum_{j=1}^p \Big[(\mu_Y^{(j)}-\mu_X^{(j)})^2 + \lambda(\sigma_Y^{(j)}-\sigma_X^{(j)})^2\Big], \]

where \(\lambda\) is a weighting factor that can be tuned to give more or less emphasis to matching the variance relative to the mean. In our implementation, we took \(\lambda = 1\) for simplicity, weighing mean and variance equally.

\(\mu_X^{(j)}\) and \(\sigma_X^{(j)}\) are the mean and standard deviation of \(v_j^{\top}x\) over \(x\in X\); \(\lambda\) controls the mean/variance trade‑off.

\[\begin{aligned} &E(X)=\frac{1}{p}\sum_{j=1}^p\Big[(\mu_Y^{(j)}-\mu_X^{(j)})^2+\lambda\, (\sigma_Y^{(j)}-\sigma_X^{(j)})^2\Big].\\ &X_0=\varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; E\big(X_{t-1}\cup\{y\}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t=X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

We define a projection‑based error that balances mean and spread along random directions (controlled by \(\lambda\)), then add the point \(x_t^{\ast}\) that most reduces this error.

The greedy selection procedure starts with \(X\) empty and adds one point at a time from \(Y\) that most reduces the error \(E(X)\). After each point is added, the means and variances \(\mu_X^{(j)}, \sigma_X^{(j)}\) are recomputed for the new subset before the next selection. In this approach, the random projection directions serve as a proxy for overall shape. By using multiple random projections, we capture different aspects of the distribution of \(Y\) without needing to consider the full high-dimensional geometry directly. This makes the method scalable: one can choose \(p\) (the number of projections) to balance accuracy and computational cost. In practice, even a modest number of random directions (e.g. \(p = 10\) or 20) can yield a good approximation of the data's spread in various orientations. The chosen subset \(X\) will tend to include both central points (to keep the projected means aligned with \(Y\)) and some peripheral points (to account for the variance along those projections). Thus, it inherently balances bias (mean alignment) and variance (coverage of spread) in multiple directions.

Walkthrough. Sample random directions and keep \(X\) close to \(Y\) in projection means while preserving spread (variance). Each step adds the point that most reduces the directional moment error.
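A NumPy sketch of the Mean-Variance greedy step; for clarity it recomputes the projected moments of each candidate subset from scratch, whereas a practical implementation would update them incrementally.

```python
import numpy as np

def moment_error(X_proj, mu_Y, sigma_Y, lam=1.0):
    """E(X): squared mean gap plus lam * squared std-dev gap, averaged over directions."""
    mu_X = X_proj.mean(axis=0)
    sigma_X = X_proj.std(axis=0)
    return np.mean((mu_Y - mu_X) ** 2 + lam * (sigma_Y - sigma_X) ** 2)

def greedy_mean_variance(Y, m, directions, lam=1.0):
    P = Y @ directions.T                           # (N, p) projections
    mu_Y, sigma_Y = P.mean(axis=0), P.std(axis=0)
    chosen, remaining = [], list(range(len(Y)))
    for _ in range(m):
        scores = [moment_error(P[chosen + [i]], mu_Y, sigma_Y, lam) for i in remaining]
        best = remaining[int(np.argmin(scores))]
        chosen.append(best)
        remaining.remove(best)
    return Y[chosen], chosen
```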

Hyperplane Extent

The Hyperplane method provides a geometrically-motivated baseline for representative selection that focuses on preserving the spatial extent and boundary characteristics of the data distribution. Unlike distribution-aware methods that seek to match probability densities or moments, this approach emphasizes coverage of the geometric envelope and ensures that the selected subset spans all major directions in space.

Geometric intuition: Imagine the point cloud \(Y\) as a collection of points in space. For any direction \(v\), there exists a point in \(Y\) that is furthest along that direction (the supporting point). By selecting these extreme points across multiple directions, we approximate the convex hull of the distribution. This ensures that our subset captures the shape, extent, and boundaries of the original data.

Hyperplane formulation: In \(d\) dimensions, a hyperplane through the origin with normal vector \(n\) divides space into two half-spaces. The signed distance of a point \(x\) to this hyperplane is given by \(d(x,n) = \langle n, x \rangle\). Points with large absolute values of this distance are far from the hyperplane in either direction. By sampling many random normal vectors \(\{n_j\}_{j=1}^R\), we can assess the extent of the data along different orientations.

Extent matching objective: We define the average absolute projection for a set \(S\) as:

\[ A(S) = \frac{1}{|S|R}\sum_{x \in S}\sum_{j=1}^R |n_j^\top x|, \]
which measures how well the set \(S\) covers the space along the sampled directions. Our goal is to minimize the deviation between the full set and subset extents:

\[ J(S) = |A(S) - A(Y)|. \]

Selection strategy: The greedy algorithm starts with an empty set and iteratively adds points that most reduce the extent deviation \(J(S)\). At each step, we evaluate every candidate point and compute the new extent that would result from adding it to the current subset. This naturally favors points that lie in directions that are currently under-represented.

Statistical interpretation: While not explicitly statistical, this method can be viewed as matching the expected absolute value of projections under a uniform distribution over directions. This is related to the concept of mean width in convex geometry - the average distance between parallel supporting hyperplanes. By preserving this quantity, we maintain the overall size and scale of the point cloud.

Computational efficiency: The hyperplane method is one of the fastest approaches available. Computing projections requires \(O(d)\) operations per point per direction, and we need \(O(R \cdot N)\) operations to compute the initial extent of the full set. Each greedy step requires \(O(R \cdot (|X_{t-1}| + N))\) operations, making the total complexity \(O(m \cdot R \cdot N)\). In practice, \(R\) can be much smaller than for sliced methods (20-50 directions often suffice).

Geometric properties: The method naturally selects points from the convex hull or its vicinity, ensuring good coverage of the boundary. It's particularly effective for distributions with well-defined shapes or when the application requires preserving the support of the distribution. The selected points tend to be well-distributed across different regions of space.

Robustness characteristics: Since the method focuses on geometric extent rather than density, it's robust to outliers in terms of selection frequency (outliers are naturally selected as extreme points). However, it may over-represent sparse regions if they contribute significantly to the geometric extent. This makes it less suitable for applications where density preservation is important.

Relationship to other methods: The hyperplane approach can be seen as a geometric counterpart to moment-matching methods. While Mean-Variance preserves statistical moments along projections, Hyperplane preserves geometric extent. In practice, these methods complement each other - one captures internal structure, the other captures external boundaries.

Practical applications: This baseline is particularly useful for tasks involving spatial coverage, boundary detection, convex approximation, or when computational efficiency is paramount. It serves as a reference point for evaluating more sophisticated distribution-aware methods.

One implementation is as follows: sample a set of \(k\) random unit direction vectors \(\{v_1, v_2, \ldots, v_k\}\) (random angles in 2D, random directions on the unit sphere in higher dimensions). For each direction \(v_i\), find the point \(p_i \in Y\) that maximizes the signed distance \(v_i \cdot p\) (i.e. the projection of \(p\) onto \(v_i\)). Include each such \(p_i\) in the subset \(X\). (If the same point ends up selected by multiple directions, it need only be included once.) This yields up to \(k\) points that cover the extremes of \(Y\) in those sampled directions. We used a variant of this idea. To make sure both sides of each hyperplane are considered, one can also sample \(v_i\) and \(-v_i\) pairs or sample more directions and then take the top \(k\) extreme points overall.

This baseline does not attempt to match the overall distribution of points; rather, it focuses on the geometric boundary. It will ensure that the subset spans a similar bounding box or convex shape as \(Y\), but it may under-sample dense interior regions (since no explicit effort is made to pick points from high-density areas). The baseline is computationally cheap (linear in \(|Y|\) for a given set of directions) and provides a point of comparison for the more sophisticated strategies above.

\(n_r\) are random unit normals. We define \(A(S)=\tfrac{1}{|S|R}\sum_{x\in S}\sum_{r=1}^R |n_r^\top x|\) and match \(A(X)\) to \(A(Y)\) via \(J(X)=\lvert A(X)-A(Y)\rvert\).

\[\begin{aligned} &\text{Draw normals } \{n_r\}_{r=1}^R.\; A(S)=\frac{1}{|S|R}\sum_{x\in S}\sum_{r=1}^R |n_r^\top x|,\; J(S)=\big|A(S)-A(Y)\big|.\\ &X_0=\varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y)\;:=\;J\big(X_{t-1}\cup\{y\}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t = X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

We match the subset's average absolute projection to random normals to that of the full set by minimizing \(J(X)=\lvert A(X)-A(Y)\rvert\). Each step adds \(x_t^{\ast}\) that best reduces this deviation.

Walkthrough. Draw random normals and match the average absolute projection \(A(X)\) to \(A(Y)\). Greedily add the point that reduces \(|A(X)-A(Y)|\); this captures extent/extremes.
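A NumPy sketch of the extent-matching baseline using the \(A(S)\)/\(J(S)\) objective above; per-point average absolute projections are precomputed so each greedy step reduces to a cheap scalar update.

```python
import numpy as np

def avg_abs_projection(points, normals):
    """A(S): mean of |n_r . x| over points x in S and normals n_r."""
    return np.mean(np.abs(points @ normals.T))

def greedy_hyperplane(Y, m, normals):
    A_Y = avg_abs_projection(Y, normals)
    abs_proj = np.abs(Y @ normals.T).mean(axis=1)  # per-point mean |n_r . y|
    chosen, total = [], 0.0
    remaining = list(range(len(Y)))
    for t in range(1, m + 1):
        # A(X ∪ {y}) = (total + abs_proj[y]) / t ; pick y minimizing |A - A(Y)|
        scores = [abs((total + abs_proj[i]) / t - A_Y) for i in remaining]
        best = remaining[int(np.argmin(scores))]
        chosen.append(best)
        total += abs_proj[best]
        remaining.remove(best)
    return Y[chosen], chosen
```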

MCMC Subset Selection

Markov Chain Monte Carlo methods provide a fundamentally different approach to representative selection by exploring the space of possible subsets through stochastic transitions rather than deterministic greedy optimization. Instead of always picking the "best" point at each step, MCMC allows for probabilistic moves that can escape local optima and explore diverse solutions, potentially finding better global optima for complex distributions.

Sampling perspective: We view the selection problem as sampling from a probability distribution over subsets, where "good" subsets (those with low objective value) have higher probability. The target distribution is typically defined as \(\pi(X) \propto \exp(-\beta \cdot f(X))\), where \(f(X)\) is our objective (e.g., sliced KL divergence) and \(\beta\) controls the concentration around good solutions.

Delayed Acceptance optimization: For high-dimensional problems, computing the full objective for every proposed move becomes expensive. Delayed Acceptance MCMC addresses this by using a two-stage evaluation process. First, we compute a cheap approximation using a subset of directions or features. Only if this approximation looks promising do we compute the full expensive objective.

Sliced‑DA specific implementation: Our optimized MCMC uses sliced KL divergence as the target objective. The delayed acceptance mechanism works as follows:

  • Stage 1 (Coarse evaluation): Compute sliced KL using a small number of random directions (e.g., 64 instead of 384). This provides a quick estimate of whether the proposed move is promising.
  • Stage 2 (Fine evaluation): Only if the coarse evaluation passes a threshold, compute the full sliced KL with all directions. This two-stage process dramatically reduces computation while maintaining solution quality.

Proposal mechanisms: Several proposal strategies can be employed:

  • Add/Remove moves: Propose adding a new point to the subset or removing an existing one.
  • Swap moves: Exchange one point in the current subset with one from the candidate pool.
  • Adaptive proposals: Learn which points are more likely to be good candidates and bias proposals toward them.

Computational advantages: While greedy methods cost \(O(m \cdot N \cdot c_{\text{eval}})\), where \(c_{\text{eval}}\) is the per-candidate cost of evaluating the objective, MCMC can find good solutions in fewer evaluations because it doesn't exhaustively search all candidates at each step. The delayed acceptance mechanism can reduce computation by 5-10x while maintaining 90%+ of the solution quality.

Convergence and mixing: The key challenge in MCMC is ensuring the chain mixes well and converges to the target distribution. We monitor effective sample size, autocorrelation, and acceptance rates. The temperature parameter \(\beta\) can be gradually increased (simulated annealing) to focus the search on increasingly better solutions.

Quality vs. runtime tradeoffs: MCMC offers a tunable tradeoff between solution quality and computation time. Short runs (few thousand iterations) can find good-enough solutions quickly, while longer runs can explore more thoroughly and potentially find better optima than greedy methods.

Advantages over greedy: MCMC can escape local optima that greedy methods get trapped in, explore multiple diverse solutions rather than committing to early decisions, and provide uncertainty estimates by examining the distribution of visited states.

Practical considerations: The method requires careful tuning of proposal distributions, step sizes, and acceptance thresholds. Parallel tempering (running multiple chains at different temperatures) can improve exploration for complex distributions.

MCMC (Exact)

The exact MCMC approach evaluates the full objective function for every proposed subset modification. While computationally intensive, this provides the most accurate exploration of the solution space. The method follows the standard Metropolis-Hastings algorithm:

Given current subset \(X_t\), propose a new subset \(X'\) by randomly adding, removing, or swapping a point. Accept the proposal with probability:

\[ \alpha = \min\left(1, \frac{\pi(X')}{\pi(X_t)}\right) = \min\left(1, \exp\left(-\beta(f(X') - f(X_t))\right)\right) \]

Implementation details:

  • Uses sliced KL divergence with 384 projection directions as the objective
  • Employs symmetric proposals (equal probability of forward and reverse moves)
  • Implements annealing schedule: \(\beta_t = \beta_0 \cdot (1 + \gamma t)\) to gradually focus on better solutions
  • Runs 10,000 iterations with burn-in of 2,000 for convergence

Advantages: Guarantees convergence to the target distribution given infinite time; provides uncertainty estimates through the chain's stationary distribution.

Limitations: Each iteration requires \(O(p N)\) operations for sliced KL evaluation, making it impractical for very large datasets.
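A minimal sketch of the exact chain with symmetric swap proposals and the linear annealing schedule; `objective` stands in for the sliced-KL evaluation, and the hyperparameter defaults are illustrative.

```python
import numpy as np

def mcmc_exact(Y, m, objective, n_iters=10_000, beta0=1.0, gamma=1e-3, seed=0):
    """Metropolis-Hastings over fixed-size subsets with symmetric swap moves."""
    rng = np.random.default_rng(seed)
    current = set(rng.choice(len(Y), size=m, replace=False).tolist())
    f_cur = objective(Y[list(current)])
    best, f_best = set(current), f_cur
    for t in range(n_iters):
        beta = beta0 * (1 + gamma * t)                        # annealing schedule
        out = rng.choice(list(current))                       # swap: remove one ...
        inn = rng.choice(list(set(range(len(Y))) - current))  # ... add another
        proposal = (current - {out}) | {inn}
        f_prop = objective(Y[list(proposal)])
        if np.log(rng.uniform()) < -beta * (f_prop - f_cur):  # MH acceptance
            current, f_cur = proposal, f_prop
            if f_cur < f_best:
                best, f_best = set(current), f_cur
    return Y[sorted(best)], sorted(best)
```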

MCMC (Budgeted / Delayed‑Acceptance)

The optimized MCMC implementation shown here uses the delayed acceptance framework with sliced KL divergence. It achieves comparable or better quality than greedy methods with significantly less computation by avoiding exhaustive search at each step. The two-stage evaluation ensures that only promising moves receive the full computational budget.
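Below is a sketch of one delayed-acceptance step that could wrap the swap proposal above. It follows the standard two-stage construction (the second-stage ratio cancels the surrogate factor so the chain still targets the full objective); the budgeted implementation here may differ in details such as thresholds and direction counts.

```python
import numpy as np

def delayed_acceptance_step(current, proposal, coarse_obj, full_obj, beta, rng):
    """Two-stage Metropolis test with a cheap surrogate screen (delayed acceptance)."""
    # Stage 1: surrogate objective, e.g. sliced KL with only 64 directions
    d_coarse = coarse_obj(proposal) - coarse_obj(current)
    if np.log(rng.uniform()) >= -beta * d_coarse:
        return current                            # rejected cheaply
    # Stage 2: full objective, e.g. sliced KL with all 384 directions; the
    # +beta * d_coarse term cancels the surrogate factor from Stage 1
    d_full = full_obj(proposal) - full_obj(current)
    if np.log(rng.uniform()) < -beta * d_full + beta * d_coarse:
        return proposal
    return current
```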

k‑DPP Diversity-Based Selection

Determinantal Point Processes (DPPs) provide an elegant probabilistic framework for selecting diverse subsets by modeling the repulsive interactions between points. Unlike methods that explicitly match distributions or moments, k-DPPs focus on diversity as the primary selection criterion, ensuring that selected points are spread out and mutually informative.

Mathematical foundation: A DPP defines a probability distribution over subsets where the probability of selecting a subset \(S\) is proportional to the determinant of its kernel matrix:

\[ \mathbb{P}(S) \propto \det(K_S), \]
where \(K_S\) is the submatrix of the kernel \(K\) indexed by points in \(S\). The determinant measures the volume spanned by the selected points in feature space - larger determinants correspond to more diverse sets.

Kernel diversity interpretation: The kernel function \(k(x_i, x_j)\) measures similarity between points. The DPP probability \(\det(K_S)\) can be expanded as a sum over all matchings, where each term penalizes selecting points that are highly similar. This creates a natural repulsion effect - the process discourages selecting redundant points.

Fixed-size k-DPP: Standard DPPs allow subsets of variable size, which can be problematic when we need exactly \(m\) points. The k-DPP modification conditions on the subset size, giving:

\[ \mathbb{P}(S) = \frac{\det(K_S)}{\sum_{|T|=m} \det(K_T)}, \quad |S| = m. \]
This removes any bias toward larger or smaller subsets and focuses entirely on the diversity of the selected content.

Kernel choice and diversity semantics: The choice of kernel determines what "diversity" means:

  • RBF kernel: \(k(x,y) = \exp(-\|x-y\|^2/(2\ell^2))\) - promotes geometric diversity, selecting points that are spatially separated.
  • Linear kernel: \(k(x,y) = x^\top y\) - promotes linear independence, selecting points that span different directions.
  • Feature kernels: Any kernel defined on relevant features can encode domain-specific diversity.

Computational challenges: Computing exact k-DPP samples requires evaluating determinants of many submatrices, which is \(O(N^3)\) in naive form. Even with efficient algorithms using eigendecompositions, exact sampling remains expensive for large \(N\).

Nyström approximation: To scale k-DPPs to large datasets, we use the Nyström method to approximate the kernel matrix. We select \(r\) landmark points (where \(r \ll N\)) and approximate:

\[ K \approx C W^{-1} C^\top, \]
where \(W = K_{LL}\) is the kernel matrix on the landmarks and \(C = K_{YL}\) contains the cross-kernel values between all points in \(Y\) and the landmarks.

Greedy volume maximization: In the Nyström-approximated feature space \(B = C W^{-1/2}\), we select points by greedily maximizing the determinant of the selected submatrix. This is equivalent to greedily maximizing the volume spanned by the selected points in feature space.

Scalability and complexity: The Nyström approximation reduces complexity from \(O(N^3)\) to \(O(Nr^2)\) for the initial approximation plus \(O(m^2 r)\) for the greedy selection. With \(r\) typically set to 64-256 landmarks, this enables scaling to millions of points.

Diversity vs. representativeness: Pure k-DPP optimization may select diverse points that are not representative of the data distribution (e.g., selecting outliers because they're diverse). In practice, k-DPPs work best when combined with representativeness constraints or when the application specifically values diversity over density matching.

Behavioral characteristics: k-DPP methods tend to select points that are well-spread across the data space, with good coverage of different modes or clusters. The selected points often form a natural "skeleton" of the data distribution. The method is particularly effective when the data has clear structure or when we want to avoid redundancy in our selection.

Practical applications: k-DPPs excel in scenarios like experimental design (selecting diverse conditions), sensor placement (maximizing coverage), summarization (selecting diverse exemplars), and active learning (choosing informative samples).

k‑DPP (Exact)

The exact k-DPP implementation provides the gold standard for diversity-based selection by directly sampling from the determinantal distribution. It uses efficient algorithms based on eigendecomposition of the kernel matrix and the chain rule for DPP sampling. While computationally intensive, it provides the optimal diversity according to the chosen kernel.

The exact method involves computing the eigendecomposition of the kernel matrix \(K = U\Lambda U^\top\) and then sampling sequentially using the eigenvectors. This ensures that each selected point maximizes the incremental gain in determinant volume, leading to provably optimal diversity under the kernel metric.

The Nyström Approximation

The Nyström approximation makes k-DPP selection practical for large-scale problems. By working in a low-rank approximation of the kernel feature space, we can achieve near-exact diversity selection with dramatically reduced computation. The landmark count \(r\) serves as the computational budget parameter - smaller values are faster but less accurate.

This approach first selects \(r\) landmark points (typically using k-means or random sampling) to approximate the kernel structure. Then it performs greedy selection in the transformed space, selecting points that maximize the incremental determinant. The quality of approximation depends on how well the landmarks capture the structure of the full dataset.
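A hedged sketch of Nyström-based greedy volume maximization in the feature space \(B = C W^{-1/2}\); landmark selection by uniform random sampling and the small ridge term are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def nystrom_greedy_kdpp(Y, m, n_landmarks=128, lengthscale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    L = Y[rng.choice(len(Y), n_landmarks, replace=False)]      # landmark points
    W = rbf_kernel(L, L, lengthscale) + 1e-8 * np.eye(n_landmarks)
    C = rbf_kernel(Y, L, lengthscale)
    # feature map B with B @ B.T ~ K (Nystrom); W^{-1/2} via eigendecomposition
    evals, evecs = np.linalg.eigh(W)
    B = C @ evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
    # greedy volume maximization: pivoted Gram-Schmidt on the rows of B
    residual = B.copy()
    chosen = []
    for _ in range(m):
        norms = np.einsum("ij,ij->i", residual, residual)
        norms[chosen] = -np.inf
        i = int(np.argmax(norms))
        chosen.append(i)
        q = residual[i] / np.sqrt(norms[i])       # normalized new direction
        residual -= np.outer(residual @ q, q)     # project it out of all rows
    return Y[chosen], chosen
```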

Compute Budget

We present two profiles to balance fairness and realism:

  • Batch‑small (exactish, apples‑to‑apples): d=64, N=12,800 (=200·d), m∈{100,200,300}, sliced p=384, MMD with exact RBF kernel. Full KL is omitted at 64D; we use Sliced‑KL as the scalable KL proxy. (Include full KL for 2D/3D.)
  • Streaming‑large (scalable): the blog profile (scripts/run_blog_all.py) using streaming/sketch implementations and large N; finishes in minutes to tens of minutes per method and matches the figures here.

Streaming vs Batch

Batch: processes the full dataset \(Y\) in memory for every scoring step (exact 1D sorts/histograms for sliced metrics; full kernel or full RFF features for MMD; KDE for KL). Pros: exact(er) objectives; cons: time/memory scale with \(N\) and full‑KL becomes infeasible in high‑d.

Streaming: reads \(Y\) in passes and optimizes against small sketches (per‑direction 1D histograms/quantiles for sliced metrics; RFF mean embedding for MMD; aggregated projection moments for geometric methods). Pros: scales to millions with bounded memory; cons: small approximation bias governed by sketch size (\(p\), bins, R).

Sketch Parameters

Unless stated otherwise, sliced methods use \(p=384\) random directions; MMD uses RBF with lengthscale \(\ell=0.5\); Mean–Variance uses \(p=384\) projections with \(\alpha=0.2\); Hyperplane uses \(k=40\,d\) normals. Increase \(p\)/R for more stability (lower variance) at higher compute.

The careful reader may notice that full KL was not calculated in the experiments. This is because KL divergence is not tractable in high dimensions. The greedy step at iteration \(t\) evaluates candidates \(z\in Y\setminus X\) by scoring \(\Delta(z)=D_{\mathrm{KL}}\!\big(P_Y\,\Vert\,P_{X\cup\{z\}}\big).\) With KDE, \(P_{X}\) has density \(p_X(x)=\tfrac{1}{t}\sum_{j=1}^{t} K(x,x_j).\) A naive evaluation uses all \(N\) points:

\[\text{score}(z)\;=\;\sum_{i=1}^{N} p_Y(y_i)\,\log\frac{p_Y(y_i)}{p_{X\cup\{z\}}(y_i)}\quad\Rightarrow\quad O\big((t+1)\,N\big)\;\text{per candidate.}\]

Scanning all candidates costs \(O\big((N{-}t)\,(t{+}1)\,N\big)\) at step \(t\). Summing over \(t=1...m\) yields \(\sum_t O\big(N^2 t\big)=O\!\big(N^2 m^2\big)\) operations (up to constants from KDE fits). For \(d{=}64, N\approx 1.5\times10^5, m=300\), this is far beyond a single‑machine budget—weeks on a laptop—and so we adopt Sliced‑KL as a scalable proxy.

Experiments

Experimental Setup

Data Generation

We evaluate on synthetic data tailored for representative sampling: a 3D GMM for visualization and a 64D benchmark (\(N=12{,}800\)) derived from a higher‑dimensional Gaussian projection. Evaluation uses held‑out data with Sliced‑W1, Sliced‑KL, and MMD as primary metrics.

Method-Specific Analysis

KL (grid/KDE). In principle we’d minimize the true KL, but estimating densities on grids or with KDEs explodes with dimension (and grid resolution), making full KL impractical beyond very low‑dimensional settings. This motivates a proxy that preserves the spirit of KL while scaling to modern datasets.

Sliced KL. Instead of fitting a high‑dimensional density, we project to many 1D lines and sum KLs of 1D histograms—capturing KL’s behavior without its curse‑of‑dimensionality. Accuracy improves with the number of projections, and a histogram/streaming implementation processes points in batches, keeping memory and time linear in \(N\).

Sliced Wasserstein‑1. Full optimal transport scales poorly (e.g., cubic or worse in generic solvers), so we transport along many 1D projections where sorting gives \(O(p\,N\log N)\). This preserves Wasserstein’s geometry and outlier robustness while enabling batch/stream execution with per‑projection buffers.

Kernel Herding (MMD, RFF). Exact MMD requires an \(N\times N\) kernel (quadratic memory/time). Random Fourier Features compress the kernel map so herding works with linear passes in the feature space. With streaming accumulators, we update the mean on the fly and avoid building the full kernel.

Mean‑Variance. When speed dominates, we align first/second moments on random projections. It’s a single‑pass, streaming‑friendly heuristic that summarizes center and spread quickly—but, by design, it can miss fine multi‑modal structure.

Hyperplane. To capture geometric extent without heavy computation, we pick extremes along sampled directions. This one‑pass, memory‑light baseline excels at boundary coverage and integrates naturally with batched or streaming ingestion.

MCMC (Delayed Acceptance). Exact MCMC proposals are costly when every step evaluates an expensive objective. Delayed acceptance first screens proposals with a cheap surrogate (e.g., few projections), reserving the full evaluation for promising moves—retaining global search while cutting wall‑clock time.

k‑DPP (Nystrom). Vanilla k‑DPP sampling needs spectral decompositions that scale cubically with \(N\), which is untenable for large point sets. Nyström approximates the kernel with \(L\) landmarks, reducing factorization and sampling costs; with \(L\in[64,256]\) we keep diversity while staying within a practical runtime budget.

Computational Results

64D Unified Benchmark

Held-out metrics and runtime from the unified 64D benchmark (lower is better for metrics; runtime in seconds).

Results are reported for subset sizes m = 100, m = 200, and m = 300. Each table lists, per method, Sliced‑W1 (held‑out), Sliced‑KL (held‑out), MMD² (held‑out), and runtime in seconds. Source CSV: sample-reduction_results/unified/unified_stream_d64.csv.

Quality Trade-offs

Across held-out evaluations, distributional methods (Sliced‑KL, Sliced‑W1, MMD) consistently achieve lower discrepancy than purely geometric approaches. Sliced variants capture most of the benefit of their full counterparts while remaining tractable in high dimensions.

There are diminishing returns with increasing subset size: most gains appear by \(m\!\approx\!200\), after which improvements taper. Rankings remain stable across seeds and data splits, indicating the trends are robust rather than artifacts of randomness.

Conclusion

Key Takeaways

Method Performance Summary

Method | Strength | Best Use Case
Mean–Variance | Computational efficiency | Quick summaries, moments
Hyperplane | Boundary coverage | Geometric baselines, extent
Sliced‑W1 | Transport-aware, robust | Geometric structure, outliers
Sliced‑KL | Scalable KL approximation | High‑dim, distributional fidelity
Kernel Herding (MMD, RFF) | Fast kernel matching | Large‑scale, kernel similarity
MCMC‑DA | Global optimization | Complex, multi‑modal distributions
k‑DPP (Nystrom) | Diversity maximization | Experimental design, sensor placement

Computational-Statistical Trade-offs

Our experiments reveal clear trade-offs between computational efficiency and statistical fidelity:

  • Linear-time methods (Mean-Variance, Hyperplane) offer 10-100x speedups but sacrifice distributional accuracy
  • Sliced approximations retain 85-95% of performance with 10x reduction in cost
  • Kernel methods with RFF enable million-scale datasets with bounded memory
  • Dimensional scaling: Most methods transition from infeasible to tractable around \(d = 50\) when using appropriate approximations

Future Directions

Adaptive projection selection. Learning optimal projection directions rather than random sampling could significantly improve the quality of sliced approximations. Current methods rely on random projections, but data-driven approaches that learn the most informative directions could yield better approximations with fewer projections.

Hybrid objectives. Combining distributional and geometric criteria with learned weights offers a promising direction for balancing different aspects of representative sampling. Rather than choosing between distributional fidelity and geometric preservation, learned combinations could adapt to specific application requirements.

Dynamic subset selection. Adapting the subset as the data distribution evolves is crucial for streaming and online scenarios. Current methods assume static data, but real-world applications often involve data that changes over time, requiring adaptive algorithms that can update selections efficiently.

Deep learning integration. End-to-end differentiable subset selection for downstream tasks could enable joint optimization of sampling and model performance. This would allow the sampling strategy to be tailored specifically for the intended use case, potentially achieving better results than generic representative sampling approaches.

Final Thoughts

Representative sampling bridges fundamental concepts from statistics, optimization, and computer science. The methods presented here offer a spectrum of trade-offs, from theoretically optimal but computationally intensive approaches to fast approximations suitable for massive datasets. The key is to match the method to the application's requirements, considering not just accuracy but also computational constraints, interpretability, and scalability needs.

As data continues to grow in size and complexity, the importance of principled data reduction will only increase. The framework and methods described here provide a solid foundation for tackling these challenges across diverse domains.


All code is available here.

Acknowledgements

This project benefited from conversations with Luke Triplett and support from the Mathematics Directed Reading Program at UC Berkeley.

References