
Representative Sampling of High‑Dimensional Point Sets

Ethan Liu | May 2025 · Updated October 2025

Background

High-dimensional point sets arise in machine learning, genomics, simulation, and many other fields. A frequent goal is to select a compact subset \(X\subset Y\) of size \(|X|=m\ll|Y|\) that preserves properties of \(Y\): distributional similarity, geometry, and the like. This enables faster downstream computation while controlling approximation error.

As dimensionality increases, data becomes exponentially sparse, distance metrics lose discriminative power, and computational costs grow rapidly. This motivates the need for representative sampling—selecting small subsets that preserve the essential characteristics of large datasets.

The challenge is twofold. Computationally, many downstream algorithms scale linearly or worse in \(N\) (e.g., nearest-neighbor search \(O(Nd)\) per query, kernel methods \(O(N^2)\), optimal transport up to \(O(N^3\log N)\)), so reducing \(N\) to a compact \(m\) yields dramatic speedups. Statistically, we still want estimators computed on \(X\) to be faithful to those on \(Y\):

\[ \hat{\theta}_X \xrightarrow{p} \theta \quad \text{as} \quad m \to \infty \]

where \(\hat{\theta}_X\) is computed on the subset and \(\theta\) is the true parameter. The rate at which \(\hat{\theta}_X\) approaches \(\theta\) depends on how representative \(X\) is of \(Y\). Our goal is to find subsets that balance computational savings with statistical fidelity.

Formal Problem Statement

Given a dataset \(Y = \{y_i\}_{i=1}^N \subset \mathbb{R}^d\), we seek to select a subset \(X = \{x_j\}_{j=1}^m \subset Y\) with \(m \ll N\) that preserves key properties of \(Y\). Formally:

\[ X^* = \arg\min_{X \subset Y, |X| = m} \mathcal{L}(P_Y, P_X) + \lambda \cdot R(X) \]

where:

  • \(P_Y\) and \(P_X\) are probability distributions induced by \(Y\) and \(X\)
  • \(\mathcal{L}\) is a loss function measuring discrepancy (e.g., KL divergence, Wasserstein distance)
  • \(R(X)\) is a regularization term encouraging desirable properties (e.g., diversity, coverage)
  • \(\lambda\) balances fidelity against regularization

Method Categories

We explore four categories of methods for selecting representative subsets of high-dimensional point sets: geometric, distributional, diversity-based, and sampling-based.

Geometric Methods

Perhaps the first thought that comes to mind when thinking about representative sampling is geometric methods. Geometric methods emphasize spatial structure, such as extents, boundaries, and relative positions. They excel for visualization and downstream geometry (e.g., nearest neighbors) and are extremely fast, which means we save on computation! However, they may under‑represent dense regions or subtle features compared to objectives that explicitly match the underlying probability law.

Distributional Methods

The next natural choice is to look at probability distributions. We can match distributions by minimizing a divergence \(d(P_Y, P_X)\), which addresses the blind spot of the geometric methods. This is not always straightforward, though: with high-dimensional data the full distribution is rarely known and must be estimated from samples, and evaluating the divergence at every greedy step is intractable. Consequently, we must use proxies that retain signal while scaling near-linearly.

Diversity‑Based Methods

Diversity is another criterion we can use to select representative subsets. An ideal subset contains points that are both representative and diverse. Intuitively, we want to include points that appear in high-density regions, while avoiding redundant selections. Determinantal point processes (DPPs) formalize this: sets that span more volume in the kernel-induced feature space are preferred. In practice we approximate the kernel so we obtain a well‑spread “skeleton” of the data with reasonable compute. A kernel \(k(x,y)\) is simply a positive semi-definite similarity function whose choice (e.g., RBF with length‑scale \(\ell\) vs. linear) sets what “similar” means, and thus what “diverse” means. See k‑DPP below for details.

Sampling‑Based Methods

Sometimes the landscape of possible subsets is too rugged for a greedy rule to navigate cleanly. Sampling‑based methods treat subset selection as a stochastic search problem: we posit a probability distribution over candidate subsets and use Markov Chain Monte Carlo (MCMC) or related schemes to explore it. Rather than committing to the single best-looking point at each step, the chain proposes swaps, additions, or removals and accepts them in proportion to how much they improve—or occasionally worsen—the objective. This "probabilistic wandering" helps the algorithm escape local optima, provides a natural way to trade compute for quality, and even yields uncertainty estimates about which points truly matter!

The practical hurdle is that evaluating the objective for every proposal can be expensive. Modern variants therefore combine coarse, low-cost surrogates (fewer projection directions, sketch-based summaries, streaming buffers) with occasional full evaluations, keeping the exploration lightweight while still gravitating toward high-quality subsets.

Let \(Y = \{y_i\}_{i=1}^N \subset \mathbb{R}^d\) be the full dataset and \(X = \{x_j\}_{j=1}^m \subset Y\) the selected subset with \(m \ll N\). We denote by \(P_Y\) and \(P_X\) the probability distributions induced by these sets. In practice, we work with either empirical distributions (\(P_Y = \frac{1}{N}\sum_{i=1}^N \delta_{y_i}\) and \(P_X = \frac{1}{m}\sum_{j=1}^m \delta_{x_j}\)) or smoothed distributions using kernel density estimates (KDEs) to avoid atomic mass points. These conventions anchor the method discussions that follow!

Methods

Geometric Methods

Hyperplane Extent

This method is most useful when we care about geometry-driven tasks (nearest neighbors, collision checks, or boundary visualization). We care more about the shape of the cloud than the exact density inside it. Motivated by this, hyperplane extent focuses on which points mark the furthest reach of the cloud along different directions. Intuitively, we spin random hyperplanes through the origin and note which points sit at the front and back of each orientation; those extremes outline the convex hull. By sampling a number of random normals \(\{n_j\}_{j=1}^R\) and measuring how far each point projects, we collect the landmarks that preserve the outer envelope.

For any set \(S\) we summarize coverage with the average absolute projection:

\[ A(S) = \frac{1}{|S|R}\sum_{x \in S}\sum_{j=1}^R |n_j^\top x|, \]
and we keep the subset close to the full-set extent by minimizing \(J(S) = |A(S) - A(Y)|\). Starting from the empty set, the greedy procedure repeatedly chooses the point that most reduces this deviation, so directions that are currently underserved attract the next extreme point.

This routine is extremely light-weight. Computing the initial projections costs \(O(RNd)\), where \(R\) is the number of sampled directions and \(d\) is the ambient dimension; each additional point needs only updated dot products, so the overall complexity is \(O(mRN)\) with small constants because a few dozen directions already give smooth coverage. The selected subset hugs the convex hull, naturally surfaces outliers, and guarantees good boundary coverage even in high dimensions. The trade-off is that interior density can be under-represented.

In practice we sample direction pairs \(v\) and \(-v\), take the farthest projections in each, and deduplicate the chosen indices. If the pass yields fewer than \(m\) points, we backfill by adding the farthest points from the centroid so that the envelope remains inflated. The result is a subset that maintains the geometric silhouette of \(Y\) with linear-time compute and minimal tuning.

\(n_r\) are random unit normals (length-1 vectors pointing in uniformly sampled directions). We define \(A(S)=\tfrac{1}{|S|R}\sum_{x\in S}\sum_{r=1}^R |n_r^\top x|\) and match \(A(X)\) to \(A(Y)\) via \(J(X)=\lvert A(X)-A(Y)\rvert\). So, in summary, the full algorithm:

\[\begin{aligned} &\text{Draw normals } \{n_r\}_{r=1}^R.\; A(S)=\frac{1}{|S|R}\sum_{x\in S}\sum_{r=1}^R |n_r^\top x|,\; J(S)=\big|A(S)-A(Y)\big|.\\ &X_0=\varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y)\;:=\;J\big(X_{t-1}\cup\{y\}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t = X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

We match the subset's average absolute projection to random normals to that of the full set by minimizing \(J(X)=\lvert A(X)-A(Y)\rvert\). Each step adds \(x_t^{\ast}\) that best reduces this deviation.
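To make the greedy rule concrete, here is a minimal NumPy sketch of the \(J\)-matching loop (the function name and the incremental bookkeeping are illustrative, not the post's actual code; it implements the greedy objective above rather than the \(\pm v\) extremes shortcut mentioned earlier):

```python
import numpy as np

def hyperplane_extent_subset(Y, m, R=64, seed=0):
    """Greedy selection matching the average absolute projection A(X) to A(Y).

    Y: (N, d) data; m: subset size; R: number of random unit normals.
    A minimal, unoptimized sketch of the greedy rule described above.
    """
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    normals = rng.normal(size=(R, d))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)

    per_point = np.abs(Y @ normals.T).mean(axis=1)  # mean_r |n_r^T x| for each point
    A_Y = per_point.mean()                          # full-set extent A(Y)

    selected, running_sum = [], 0.0                 # running sum of per-point scores in X
    for t in range(m):
        cand_A = (running_sum + per_point) / (t + 1)  # A(X ∪ {y}) for every candidate
        cand_J = np.abs(cand_A - A_Y)
        cand_J[selected] = np.inf                     # exclude already-chosen points
        best = int(np.argmin(cand_J))
        selected.append(best)
        running_sum += per_point[best]
    return np.array(selected)
```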

The interactive below should be useful for gaining some geometric intuition. Using your mouse, rotate the 3D Gaussian mixture. The generator places three anisotropic clusters at (-1.5, 0, 0), (1.6, 0.9, 0.4), and (0, -1.2, 1.1), tuned so one blob is compact, another elongated, and the third mid-spread, with mixture weights about 0.45, 0.35, and 0.20.

Mean-Variance

Hyperplane Extent keeps the boundary honest, but it can starve the interior: dense clusters lose weight, the global centroid drifts, and variances shrink when we only chase extremes. We want a subset that preserves those first two moments, so we turn to Mean-Variance.

Intuitively, we reuse the random-direction machinery—project the cloud along many normals, compare the subset’s center and spread with the full set, and whenever a slice looks off we add the point that repairs it. Because the variance term still rewards scatter, the greedy steps keep a few of those edge points in play, which is exactly what you see near the periphery of the 3D plot.

Formally, we minimize the discrepancy

\[ E(X) = \frac{1}{p}\sum_{j=1}^p \Big[(\mu_Y^{(j)}-\mu_X^{(j)})^2 + \lambda(\sigma_Y^{(j)}-\sigma_X^{(j)})^2\Big], \]
where \(\mu^{(j)}\) and \(\sigma^{(j)}\) are the mean and standard deviation of the projected coordinates \(v_j^{\top}y\). The factor \(1/p\) simply averages over the \(p\) sampled directions; here we take \(p = 128\) probes, which is enough for stable moment estimates without blowing up runtime.

The weighting \(\lambda\) controls how loudly variance argues relative to the mean; for the 3D visualization we set \(\lambda = 0.2\), so means take the lead while variance still has a vote. At each step we add the point that most reduces \(E(X)\), automatically steering new selections toward whichever directions currently look under-served.

The mechanics stay lightweight. Projections are just dot products, directional moments update incrementally, and the whole loop runs in roughly \(O(pN)\) per step. The resulting subset blends central anchors with a few deliberate outliers—it mirrors the ellipsoidal core of the data while keeping the statistics that matter to downstream inference intact.

\(\mu_X^{(j)}\) and \(\sigma_X^{(j)}\) are the mean and standard deviation of \(v_j^{\top}x\) over \(x\in X\); we use \(p = 128\) directions and \(\lambda = 0.2\) for the rendered example.

\[\begin{aligned} &E(X)=\frac{1}{p}\sum_{j=1}^p\Big[(\mu_Y^{(j)}-\mu_X^{(j)})^2+\lambda\, (\sigma_Y^{(j)}-\sigma_X^{(j)})^2\Big].\\ &X_0=\varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; E\big(X_{t-1}\cup\{y\}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t=X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]
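A matching sketch for the Mean-Variance greedy step, under the same conventions (names and the running-sum bookkeeping are illustrative; population moments are used, as in the definition of \(\sigma_X^{(j)}\) above):

```python
import numpy as np

def mean_variance_subset(Y, m, p=128, lam=0.2, seed=0):
    """Greedy selection minimizing E(X): per-direction mean/std matching."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    V = rng.normal(size=(p, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    Z = Y @ V.T                                    # (N, p) projected coordinates
    mu_Y, sd_Y = Z.mean(axis=0), Z.std(axis=0)

    selected = []
    sum_z, sum_z2 = np.zeros(p), np.zeros(p)       # running per-direction sums for X
    for t in range(m):
        n = t + 1
        mu_X = (sum_z + Z) / n                     # moments if each candidate joined X
        var_X = (sum_z2 + Z**2) / n - mu_X**2
        sd_X = np.sqrt(np.maximum(var_X, 0.0))
        E = ((mu_Y - mu_X)**2 + lam * (sd_Y - sd_X)**2).mean(axis=1)
        E[selected] = np.inf                       # exclude already-chosen points
        best = int(np.argmin(E))
        selected.append(best)
        sum_z += Z[best]
        sum_z2 += Z[best]**2
    return np.array(selected)
```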

Oh no! We actually ended up selecting all edge points, which is worse than the hyperplane extent method!

To see why, look at the variance term inside \(E(X)\). For each direction \(v_j\) the sample variance of the subset is \(\sigma_X^{(j)2} = \tfrac{1}{m}\sum_{x\in X} \big(v_j^{\top}x - \mu_X^{(j)}\big)^2\). Matching \(\sigma_Y^{(j)}\) therefore demands large values of \(|v_j^{\top}x|\); with only \(m\) candidates, the greedy step gets the biggest drop in the \(\lambda(\sigma_Y^{(j)}-\sigma_X^{(j)})^2\) penalty by pulling in whichever point has the largest squared projection along that direction. Different slices claim different extremes, so even with \(\lambda = 0.2\) we eventually hoard the entire set of boundary points.

Could we fix this? The reader's first thought might be to shrink \(\lambda\). Doing this causes the variance term to stop dragging points to the perimeter—but then we are back to pure mean matching, and the interior collapses again. Alternatively we could regularize each projected variance by subtracting the contribution of any one point, e.g. replace \(\sigma_X^{(j)2}\) with a leave-one-out estimate:

\[ \tilde{\sigma}_{X}^{(j)2} = \frac{1}{m-1} \sum_{z \in X} \big(v_j^{\top}z - \mu_{X\setminus\{z\}}^{(j)}\big)^2,\qquad \Delta_j(x) = \tilde{\sigma}_{X \cup \{x\}}^{(j)2} - \tilde{\sigma}_{X}^{(j)2} = \frac{1}{m-1}(v_j^{\top}x)^2 + O\!\left(\frac{1}{m}\right). \]

The cross terms damp out at \(O(1/m)\), but the new squared projection stays \(O(1)\), so the greedy objective still rewards large \(|v_j^{\top}x|\) no matter how we rescale.

As soon as we try to match higher moments or the full histogram along each direction, we have crossed into distributional territory. In other words, tweaking Mean-Variance in the ways that would curb its pull toward the edges slowly morphs it into the sliced divergences we study next.

Clearly, the only way out is to constrain the entire projected distribution, by matching quantiles, bins, or transport plans.

Distributional Methods

Kullback-Leibler Divergence

A more principled way to measure how well a subset \(X\) of points represents a full set \(Y\) is to treat each set as defining a probability distribution over space (for example, an empirical distribution or a density estimate) and then compute the Kullback–Leibler (KL) divergence between these distributions. This information-theoretic divergence \(D_{\mathrm{KL}}(P\Vert Q)\) from distribution \(P\) to distribution \(Q\) is given by:

\[ D_{\mathrm{KL}}(P\Vert Q) = \int P(x)\, \log \frac{P(x)}{Q(x)}\, dx \]

or equivalently, for discrete distributions:

\[ D_{\mathrm{KL}}(P\Vert Q) = \sum_i p_i\,\log\frac{p_i}{q_i} \]

In our context, \(P\) can be thought of as the distribution of the full set \(Y\) and \(Q\) as that of the subsample \(X\). We instantiate \(P_Y\) and \(P_X\) either as histograms or as kernel density estimates (KDEs). A histogram partitions space into axis-aligned boxes of side length \(\Delta\) and assigns each cell probability mass \(\widehat{P}(\text{cell})=\tfrac{1}{|S|}\#\{y_i\in \text{cell}\}\); the KDE alternative smooths every point into a Gaussian bump \(K_h(x-y)=\exp(-\|x-y\|^2/2h^2)\) and normalizes the sum.

High-dimensional density estimation suffers from the curse of dimensionality: the volume of each bin increases like \(\Delta^d\) while the number of samples per bin shrinks, so raw empirical estimates become zero almost everywhere. To keep the KL inputs well-defined we deliberately enlarge \(\Delta\) and raise the bandwidth \(h\), ensuring every region retains non-zero mass instead of triggering the infinite penalty \(\log\tfrac{p_i}{q_i}\). Intuitively, \(D_{\mathrm{KL}}(P\Vert Q)\) measures the information lost when \(Q\) (the subset's distribution) is used to approximate \(P\) (the full set's distribution). A smaller KL divergence means the subset \(X\) is a better stand-in for \(Y\).

We start with an empty subset and iteratively add points from \(Y\) to \(X\) such that each addition yields the greatest reduction in \(D_{\mathrm{KL}}(P_Y\Vert P_X)\). Naturally \(P_X\) (with \(X\) empty or very small) is initially a very poor approximation of \(P_Y\); however, as more points are added, the KL divergence decreases, and \(P_X\) becomes a closer approximation. To implement this, we need a way to estimate the distributions \(P_Y\) and \(P_X\) at each step. One approach is to use a discretization of the space into bins (a histogram) or a kernel density estimate. At each iteration, we evaluate which candidate point, if added to \(X\), would most decrease the divergence \(D_{\mathrm{KL}}(P_Y\Vert P_{X\cup\{p\}})\). That point is then selected into \(X\).

Here \(P_Y\) and \(P_X\) denote those empirical or smoothed distributions of \(Y\) and \(X\); \(D_{\mathrm{KL}}\) is the resulting information-theoretic divergence.

\[\begin{aligned} &X_0 = \varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; D_{\mathrm{KL}}\!\big(P_Y\,\Vert\, P_{X_{t-1}\cup\{y\}}\big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

At each step, we score a candidate \(y\) by the KL divergence between the full set and the subset if \(y\) were added, then pick the minimizer \(x_t^{\ast}\) and update \(X_t\). This directly targets distributional fidelity in the full space.

This greedy procedure is computationally intensive for large \(Y\) (as each iteration scans all remaining points to evaluate the KL divergence). However, it often yields a subset that very closely approximates the distribution of \(Y\). In practice, one can limit the resolution of the distribution estimate or use approximate updates to make this feasible. We implemented this method for both 2D and 3D point sets, using a fixed grid to estimate \(P_Y\) and \(P_X\) at each step. In the 3D case, the concepts are identical, but estimating \(P_Y\) and \(P_X\) requires a 3D grid or kernel, which increases computational cost. The result is a set \(X\) that minimizes the information loss (KL divergence) relative to \(Y\) by greedily covering regions of space in proportion to how much probability mass \(Y\) has in those regions.
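For the low-dimensional (2D/3D) demonstrations, the grid-based greedy step can be sketched as follows. The bin count, the smoothing constant \(\varepsilon\), and the function name are illustrative assumptions; the loop is deliberately unoptimized and simply recomputes the KL for each candidate.

```python
import numpy as np

def greedy_kl_subset(Y, m, bins=16, eps=1e-6):
    """Greedy KL minimization on a fixed histogram grid (intended for 2D/3D only)."""
    N, d = Y.shape
    # Flat cell index of each point on a shared axis-aligned grid.
    edges = [np.linspace(Y[:, k].min(), Y[:, k].max(), bins + 1) for k in range(d)]
    cell = np.zeros(N, dtype=int)
    for k in range(d):
        cell = cell * bins + np.clip(np.digitize(Y[:, k], edges[k][1:-1]), 0, bins - 1)
    n_cells = bins**d

    P = np.bincount(cell, minlength=n_cells) / N      # full-set histogram P_Y
    counts_X = np.zeros(n_cells)                      # subset cell counts

    def kl(counts, size):
        Q = (counts + eps) / (size + eps * n_cells)   # smoothed subset histogram P_X
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

    taken = np.zeros(N, dtype=bool)
    selected = []
    for t in range(m):
        scores = np.full(N, np.inf)
        for i in np.where(~taken)[0]:                 # score each remaining candidate
            counts_X[cell[i]] += 1
            scores[i] = kl(counts_X, t + 1)
            counts_X[cell[i]] -= 1
        best = int(np.argmin(scores))
        selected.append(best)
        taken[best] = True
        counts_X[cell[best]] += 1
    return np.array(selected)
```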

This time, we got a much better subset! As you may notice, distribution-focused methods like KL naturally pour more picks into the heavier clusters, while Mean-Variance and Hyperplane Extent also lean toward wide clusters because they chase spread. The greedy choice and random seed add a little noise around those proportions, but the broad pattern stays the same.

High-dimensional divergences are punishing: estimating a full density or solving a transport plan explodes in cost long before the datasets we care about are finished loading. Instead, let's try slicing! By peering at the data from one direction at a time, we can solve the one-dimensional problem exactly, then average the answers.

Formally, for a unit vector \(v \in \mathbb{S}^{d-1}\) the projection \(\pi_v(x) = \langle v, x \rangle\) induces the pushforward measure \((\pi_v)_*P\). Averaging over directions \(\{v_j\}_{j=1}^p\) produces sliced analogues of familiar metrics:

\[ \mathrm{SW}_1^p(P,Q) = \frac{1}{p}\sum_{j=1}^p W_1\left((\pi_{v_j})_* P, (\pi_{v_j})_* Q\right), \qquad \mathrm{SKL}^p(P,Q) = \frac{1}{p}\sum_{j=1}^p D_{\mathrm{KL}}\left((\pi_{v_j})_* P \,\Big\|\, (\pi_{v_j})_* Q\right). \]

Sliced KL

For each random direction \(v\) we form the pushforward distribution \((\pi_v)_*P\) and compare it to its counterpart from the subset. Averaging over \(p\) directions gives the objective

\[ \mathrm{SKL}(P_Y,P_X)=\frac{1}{p}\sum_{j=1}^p D_{\mathrm{KL}}\!\left(\langle v_j,\cdot\rangle_{∗} P_Y\,\Big\Vert\,\langle v_j,\cdot\rangle_{∗} P_X\right). \]

This directional view preserves how mass is arranged along many axes rather than insisting on a perfect match of the full joint distribution. When the data has meaningful low-dimensional structure, those slices capture the salient features and make the approximation surprisingly faithful. The gains in tractability are substantial, too: one-dimensional KL estimates need only histograms or light kernel density estimates, so each slice costs \(O(N)\) time instead of the \(O(N^2)\) or worse demands of multivariate density estimation.

Choosing \(p\) between 50 and 200 usually suffices, and the directions come from uniform draws on the unit sphere, though they can also be steered toward informative axes. With enough slices the subset inherits the relative densities seen along most projections, which in turn preserves cluster structure while staying robust to dimensionality.
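A minimal scorer for the sliced objective might look like the following (shared per-slice bin edges, the smoothing constant, and the function name are assumptions, not the post's code):

```python
import numpy as np

def sliced_kl(Y, X, directions, bins=32, eps=1e-6):
    """Average 1D KL divergence D_KL(P_Y || P_X) over the given unit directions."""
    ZY, ZX = Y @ directions.T, X @ directions.T       # (N, p) and (m, p) projections
    total = 0.0
    for j in range(directions.shape[0]):
        lo = min(ZY[:, j].min(), ZX[:, j].min())      # shared bin edges per slice
        hi = max(ZY[:, j].max(), ZX[:, j].max())
        edges = np.linspace(lo, hi, bins + 1)
        P = np.histogram(ZY[:, j], bins=edges)[0] / len(Y)
        Q = (np.histogram(ZX[:, j], bins=edges)[0] + eps) / (len(X) + eps * bins)
        mask = P > 0
        total += np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    return total / directions.shape[0]
```

Plugging this scorer into the same add-one-point-at-a-time loop used for the grid-based KL gives the greedy sliced variant; in practice one caches the projections and per-slice histograms so each candidate is an \(O(p)\) update rather than a full recomputation.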

Hey, this looks pretty similar to the KL method! That is no accident: it is the same greedy KL objective, just evaluated slice by slice. In our implementation the main practical difference is that each one-dimensional slice uses a kernel density estimate rather than the full-space histogram grid.

Wasserstein‑1

Optimal transport measures the minimal cost required to morph or "transport" one distribution into another while respecting geometry. The \(p\)-Wasserstein distance between probability measures \(P\) and \(Q\) is

\[ W_p(P,Q) = \left(\inf_{\gamma \in \Pi(P,Q)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\gamma(x,y)\right)^{1/p}, \]

where \(\Pi(P,Q)\) denotes the set of couplings with marginals \(P\) and \(Q\). For \(p=1\) (known as the Earth-Mover distance), the objective captures the minimum work to move mass from \(P\) to \(Q\). In one dimension this simplifies to a quantile comparison:

\[ W_1(P,Q) = \int_0^1 \big|F_P^{-1}(u) - F_Q^{-1}(u)\big| \, du, \]

with \(F_P^{-1}\) and \(F_Q^{-1}\) the inverse CDFs.
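As a quick sanity check on the inverse-CDF form: for two equal-size empirical samples the integral reduces to the mean absolute difference of the sorted values. A tiny synthetic illustration (the distributions and sample size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=1000)    # samples from P
b = rng.normal(0.5, 1.2, size=1000)    # samples from Q

# Equal-size empirical measures: the quantile integral becomes a sorted-sample average.
w1 = np.abs(np.sort(a) - np.sort(b)).mean()
print(f"empirical 1D W1 ≈ {w1:.3f}")
```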

True \(W_1\) in 3D solves an optimal transport problem between \(P_Y\) and \(P_X\). It is accurate but expensive to compute at large \(N\). For the interactive below we either use the true \(W_1\) at small \(N\) or a high‑\(p\) sliced proxy that is visually indistinguishable at large \(N\).

This looks pretty similar to the KL method! Actually, if you look closely, you'll note that it has slightly less "center weighting" than the KL method, which is most apparent in the rightmost cluster. This is because Wasserstein is more sensitive to the overall shape of the distribution, while KL is more sensitive to where the probability mass is concentrated.

Sliced Wasserstein-1

Sliced Wasserstein-1 projects the dataset onto random directions \(v_j\), reuses the 1D Earth-Mover distance for each slice, and averages the costs. It keeps the transport intuition of Wasserstein while sidestepping the heavy high-dimensional solve.

Sorting per projection costs \(O(N \log N)\), so a few hundred directions already stabilize the average in practice. The resulting subset tracks cluster densities and relative geometry without computing a full transport plan.

Draw random unit directions, compute 1D \(W_1\) slices, and greedily add the point that minimizes the mean cost.

\[\begin{aligned} &X_0 = \varnothing.\\ &\text{For } t=1,\ldots,m:\\ &\quad s_t(y) \;:=\; \frac{1}{p}\sum_{j=1}^p W_1\!\Big(\langle v_j,\cdot\rangle_{∗} P_Y,\; \langle v_j,\cdot\rangle_{∗} P_{X_{t-1}\cup\{y\}}\Big).\\ &\quad x_t^{\ast} \,\in\, \arg\min_{y\in Y\setminus X_{t-1}}\; s_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]
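A sketch of the per-slice scorer, approximating the quantile integral on a fixed grid of levels so the two sets need not have the same size (grid size and names are illustrative); the greedy loop then plugs this score into the same add-one-point-at-a-time recipe as the other methods:

```python
import numpy as np

def sliced_w1(A, B, directions, n_quantiles=256):
    """Average 1D Wasserstein-1 between point sets A and B over the given directions."""
    u = (np.arange(n_quantiles) + 0.5) / n_quantiles   # quantile levels
    ZA, ZB = A @ directions.T, B @ directions.T        # projections onto each slice
    total = 0.0
    for j in range(directions.shape[0]):
        qa = np.quantile(ZA[:, j], u)                  # empirical inverse CDFs
        qb = np.quantile(ZB[:, j], u)
        total += np.abs(qa - qb).mean()                # ∫ |F_A^{-1}(u) - F_B^{-1}(u)| du
    return total / directions.shape[0]
```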

Again, this looks pretty similar to the original Wasserstein method! Not bad!

Kernel Herding (Maximum Mean Discrepancy, MMD)

Let's try one more distributional method. Kernel herding treats the dataset through its kernel mean embedding. For a feature map \(\phi\) induced by \(k(x,y)=\langle\phi(x),\phi(y)\rangle_\mathcal{H}\), the empirical mean \(\mu_{P_Y}=\tfrac{1}{|Y|}\sum_{y\in Y}\phi(y)\) summarizes the distribution. Matching this mean is sufficient to match all moments encoded by the kernel.

Different kernels emphasize different aspects: RBF kernels are sensitive to local geometry, linear kernels to first-order moments, and others capture higher-order structure. The squared distance between embeddings is the Maximum Mean Discrepancy (MMD):

\[ \mathrm{MMD}^2(P_Y, P_X) = \|\mu_{P_Y} - \mu_{P_X}\|_\mathcal{H}^2 = \mathbb{E}[k(Y,Y')] + \mathbb{E}[k(X,X')] - 2\mathbb{E}[k(Y,X)]. \]
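Written out with an explicit RBF kernel, the three-term estimator reads as follows (a biased V-statistic sketch with an assumed lengthscale; the pairwise kernel is quadratic in the set sizes, which is exactly what the RFF variant described below avoids):

```python
import numpy as np

def mmd2_rbf(A, B, lengthscale=0.5):
    """Biased (V-statistic) estimate of MMD^2 between point sets A and B (RBF kernel)."""
    def k(U, W):
        d2 = (U**2).sum(1)[:, None] + (W**2).sum(1)[None, :] - 2.0 * U @ W.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))
    # E[k(Y,Y')] + E[k(X,X')] - 2 E[k(Y,X)], with expectations replaced by means.
    return k(A, A).mean() + k(B, B).mean() - 2.0 * k(A, B).mean()
```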

Kernel herding builds the subset greedily. At iteration \(t\) each candidate is scored by how much it pulls the current embedding toward \(\mu_{P_Y}\) once we discount similarity to points already selected:

\[ x_t \in \arg\max_{y\in Y} \; \mu(y) - \frac{1}{t-1}\sum_{j=1}^{t-1} k(y,x_j),\quad \mu(y)=\frac{1}{|Y|}\sum_{i} k(y,y_i). \]

The first term drives the subset toward regions with large kernel mass, while the correction term enforces diversity by penalizing redundancy with previously chosen points.

Evaluating exact kernels scales quadratically in \(|Y|\). Random Fourier Features approximate \(k\) with a low-dimensional explicit feature map \(\hat\phi\), so embeddings reduce to simple averages of \(\hat\phi(y)\). This keeps the greedy sweep linear in both the subset size and the number of sampled features.

Here \(k\) is a positive-definite kernel (RBF in our runs) and \(\mu(y)=|Y|^{-1}\sum_i k(y,y_i)\) is the kernel mean score.

\[\begin{aligned} &X_0 = \varnothing,\quad \mu(y)=\frac{1}{|Y|}\sum_{i} k(y,y_i).\\ &\text{For } t=1,\ldots,m:\\ &\quad g_t(y) \;:=\; \mu(y) - \frac{1}{t-1}\sum_{z\in X_{t-1}} k(y,z)\;\; (t>1),\;\; g_1(y):=\mu(y).\\ &\quad x_t^{\ast} \,\in\, \arg\max_{y\in Y\setminus X_{t-1}}\; g_t(y).\\ &\quad X_t \;=\; X_{t-1}\cup\{x_t^{\ast}\}. \end{aligned}\]

Kernel herding picks \(x_t^{\ast}\) that best matches the kernel mean embedding: start from \(\mu(y)\) and, for \(t>1\), subtract similarity to previously chosen points. This greedily reduces MMD.
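A minimal herding sketch using Random Fourier Features is below; the feature dimension \(D\), the seed handling, and the function names are illustrative assumptions, but the greedy score follows \(g_t\) above.

```python
import numpy as np

def rff_features(X, D=256, lengthscale=0.5, seed=0):
    """Random Fourier Features approximating an RBF kernel with the given lengthscale."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def kernel_herding(Y, m, D=256, lengthscale=0.5, seed=0):
    """Greedy herding on RFF embeddings: approximately minimizes MMD to Y."""
    Phi = rff_features(Y, D=D, lengthscale=lengthscale, seed=seed)   # (N, D)
    target = Phi.mean(axis=0)                # approximate kernel mean embedding of P_Y
    running = np.zeros(D)                    # sum of features of selected points
    taken = np.zeros(len(Y), dtype=bool)
    selected = []
    for t in range(1, m + 1):
        scores = Phi @ target                     # mu(y) term
        if t > 1:
            scores -= (Phi @ running) / (t - 1)   # penalty for similarity to X_{t-1}
        scores[taken] = -np.inf
        best = int(np.argmax(scores))
        selected.append(best)
        taken[best] = True
        running += Phi[best]
    return np.array(selected)
```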

This looks pretty similar to Wasserstein!

Sampling-Based Methods

All right, let's move on from distributional methods. The KL, sliced-Wasserstein, and kernel-herding families all build subsets greedily—each step commits to the best-looking point under the current metric. Sampling-based methods instead explore many trajectories through subset space. They matter when the objective is rugged, when multiple near-optimal summaries are valuable, or when we want uncertainty estimates rather than a single deterministic outcome.

MCMC Subset Selection

Define a target distribution over size-\(m\) subsets, \(\pi(X) \propto \exp(-\beta f(X))\), with \(f\) the quality metric (sliced KL here). MCMC constructs a Markov chain \(X_{t}\) whose stationary law is \(\pi\) by proposing local edits (add/remove/swap) and accepting them via the Metropolis-Hastings ratio

\[ \alpha(X \to X') = \min\Bigl(1, \frac{\pi(X') \; q(X' \to X)}{\pi(X) \; q(X \to X')}\Bigr). \]

The chain samples the subset space rather than committing to a single greedy trajectory, so it can report multiple high-scoring subsets or posterior summaries.

MCMC (Exact)

Here each proposed edit evaluates the full objective, so \(\pi(X)\) is computed exactly and the acceptance step reduces to

\[ \alpha = \min\bigl(1, \exp(-\beta(f(X') - f(X)))\bigr). \]

This gives unbiased samples from \(\pi\) but costs \(O(pN)\) per step when the score uses all 384 sliced directions.

MCMC (Budgeted / Delayed‑Acceptance)

Delayed acceptance factors \(\pi(X)\) into a cheap surrogate and an exact correction. A proposal is first accepted using a 64-direction surrogate; only surviving moves trigger the full 384-direction evaluation. This preserves the exact stationary distribution while cutting wall-clock cost.
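A sketch of the two-stage chain is below, assuming the objective is handed in as two callables: `f_full` (e.g., the 384-direction score) and `f_cheap` (e.g., the 64-direction surrogate). The inverse temperature, step count, and swap proposal are illustrative choices, not the post's exact settings.

```python
import numpy as np

def mcmc_da_subset(Y, m, f_full, f_cheap, beta=50.0, steps=2000, seed=0):
    """Delayed-acceptance Metropolis over size-m index subsets with swap proposals.

    f_full, f_cheap: callables mapping an index array to a scalar (lower is better);
    the target is pi(X) ∝ exp(-beta * f_full(X)).
    """
    rng = np.random.default_rng(seed)
    N = len(Y)
    X = rng.choice(N, size=m, replace=False)
    cur_cheap, cur_full = f_cheap(X), f_full(X)
    best, best_full = X.copy(), cur_full

    for _ in range(steps):
        # Symmetric proposal: swap one selected index for one unselected index.
        Xp = X.copy()
        Xp[rng.integers(m)] = rng.choice(np.setdiff1d(np.arange(N), X))

        # Stage 1: screen with the cheap surrogate.
        prop_cheap = f_cheap(Xp)
        if np.log(rng.random()) >= -beta * (prop_cheap - cur_cheap):
            continue
        # Stage 2: exact correction so the chain still targets pi(X).
        prop_full = f_full(Xp)
        log_a2 = -beta * ((prop_full - cur_full) - (prop_cheap - cur_cheap))
        if np.log(rng.random()) < log_a2:
            X, cur_cheap, cur_full = Xp, prop_cheap, prop_full
            if cur_full < best_full:
                best, best_full = X.copy(), cur_full
    return best
```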

Diversity-Based Methods

k‑DPP Diversity-Based Selection

The size-\(m\) k-DPP draws subsets with probability \(\mathbb{P}(S) \propto \det(K_S)\), so the selected points maximize volume in the kernel-induced feature space and are automatically diverse.

k‑DPP (Exact)

Exact sampling eigendecomposes \(K = U \Lambda U^\top\), selects exactly \(k\) eigenvectors with inclusion probabilities derived from elementary symmetric polynomials of the eigenvalues (the \(\lambda_i/(\lambda_i+1)\) rule is the unconstrained-DPP special case), then performs the sequential volume sampler. This yields exact draws but costs \(O(N^3)\) in preprocessing.

k‑DPP (Nyström Approximation)

Nyström approximates \(K\) with \(\tilde K = C W^{-1} C^\top\) using \(r\) landmarks \(L\). Factoring \(\tilde K\) is \(O(N r^2)\) and sampling reduces to \(O(m r)\), so with \(r\approx 64{-}256\) we retain high-quality diversity at a fraction of the cost.
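A sketch of the Nyström factorization step (RBF kernel, uniformly sampled landmarks, small jitter for numerical stability; all names are illustrative):

```python
import numpy as np

def nystrom_features(Y, r=128, lengthscale=0.5, seed=0):
    """Rank-r Nyström feature map Phi with K ≈ C W^{-1} C^T = Phi @ Phi.T (RBF kernel)."""
    rng = np.random.default_rng(seed)
    landmarks = Y[rng.choice(len(Y), size=r, replace=False)]

    def rbf(A, B):
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale**2))

    C = rbf(Y, landmarks)                          # (N, r) cross-kernel block
    W = rbf(landmarks, landmarks)                  # (r, r) landmark kernel
    vals, vecs = np.linalg.eigh(W + 1e-8 * np.eye(r))
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    return C @ W_inv_sqrt                          # (N, r) features
```

From \(\Phi\), the spectral quantities needed for k-DPP sampling can be taken from the small \(r\times r\) matrix \(\Phi^\top\Phi\), which shares its nonzero eigenvalues with \(\tilde K\); that substitution is where the \(O(Nr^2)\) factorization cost comes from.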

Compute Budget

We present two profiles to balance fairness and realism:

  • Batch‑small (exactish, apples‑to‑apples): d=64, N=12,800 (=200·d), m∈{100,200,300}, sliced p=384, MMD with exact RBF kernel. Full KL is omitted at 64D; we use Sliced‑KL as the scalable KL proxy (full KL appears only in the 2D/3D demonstrations).

Streaming vs Batch

Batch: processes the full dataset \(Y\) in memory for every scoring step (exact 1D sorts/histograms for sliced metrics; full kernel or full RFF features for MMD; KDE for KL). Pros: exact(er) objectives; cons: time/memory scale with \(N\) and full‑KL becomes infeasible in high‑d.

Streaming: reads \(Y\) in passes and optimizes against small sketches (per‑direction 1D histograms/quantiles for sliced metrics; RFF mean embedding for MMD; aggregated projection moments for geometric methods). Pros: scales to millions with bounded memory; cons: small approximation bias governed by sketch size (\(p\), bins, R).
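As a concrete example of such a sketch, here is a minimal streaming per-direction histogram accumulator (the class name and the fixed-range assumption are ours; in practice the projection range would come from a first pass or known bounds). One sketch built over \(Y\) and one over the current subset are enough to evaluate sliced KL, or a quantile-based \(W_1\) approximation, without holding \(Y\) in memory.

```python
import numpy as np

class SlicedHistogramSketch:
    """Streaming per-direction histograms for sliced metrics (fixed range per slice)."""

    def __init__(self, directions, lo, hi, bins=64):
        self.V = directions                         # (p, d) unit directions
        self.lo, self.hi, self.bins = lo, hi, bins  # scalar or per-direction range
        self.counts = np.zeros((len(directions), bins))

    def update(self, batch):
        """Fold a batch of points (n, d) into every per-direction histogram."""
        Z = batch @ self.V.T                        # (n, p) projections
        scaled = (Z - self.lo) / (self.hi - self.lo)
        idx = np.clip((scaled * self.bins).astype(int), 0, self.bins - 1)
        for j in range(self.V.shape[0]):
            self.counts[j] += np.bincount(idx[:, j], minlength=self.bins)

    def probabilities(self, eps=1e-9):
        """Smoothed per-direction probability vectors, one row per slice."""
        totals = self.counts.sum(axis=1, keepdims=True)
        return (self.counts + eps) / (totals + eps * self.bins)
```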

Sketch Parameters

Unless stated otherwise, sliced methods use \(p=384\) random directions; MMD uses an RBF kernel with lengthscale \(\ell=0.5\); Mean-Variance uses \(p=384\) projections with \(\lambda=0.2\); Hyperplane Extent uses \(R=40\,d\) normals. Increase \(p\) or \(R\) for more stability (lower variance) at higher compute.

The careful reader may notice that full KL was not calculated in the experiments. This is because KL divergence is not tractable in high dimensions. The greedy step at iteration \(t\) evaluates candidates \(p\in Y\setminus X\) by scoring \(\Delta(p)=D_{\mathrm{KL}}\!\big(P_Y\,\Vert\,P_{X\cup\{p\}}\big).\) With KDE, \(P_{X}\) has density \(p_X(x)=\tfrac{1}{t}\sum_{j=1}^{t} K(x,x_j).\) A naive evaluation uses all \(N\) points:

\[\text{score}(p)\;=\;\sum_{i=1}^{N} p_Y(y_i)\,\log\frac{p_Y(y_i)}{p_{X\cup\{p\}}(y_i)}\quad\Rightarrow\quad O\big((t+1)\,N\big)\;\text{per candidate.}\]

Scanning all candidates costs \(O\big((N{-}t)\,(t{+}1)\,N\big)\) at step \(t\). Summing over \(t=1,\ldots,m\) yields \(\sum_t O\big(N^2 t\big)=O\!\big(N^2 m^2\big)\) operations (up to constants from KDE fits). For \(d{=}64, N\approx 1.5\times10^5, m=300\), this is far beyond a single‑machine budget—weeks on a laptop—and so we adopt Sliced‑KL as a scalable proxy.

Experiments

Experimental Setup

Data Generation

We evaluate on synthetic data tailored for representative sampling: a 64D benchmark (\(N=12{,}800\)) derived from a higher‑dimensional Gaussian projection. Evaluation uses held‑out data with Sliced‑W1, Sliced‑KL, and MMD as primary metrics.

Method-Specific Analysis

KL (grid/KDE). In principle we’d minimize the true KL, but estimating densities on grids or with KDEs explodes with dimension (and grid resolution), making full KL impractical beyond very low‑dimensional settings. This motivates a proxy that preserves the spirit of KL while scaling to modern datasets.

Sliced KL. Instead of fitting a high‑dimensional density, we project to many 1D lines and sum KLs of 1D histograms—capturing KL’s behavior without its curse‑of‑dimensionality. Accuracy improves with the number of projections, and a histogram/streaming implementation processes points in batches, keeping memory and time linear in \(N\).

Sliced Wasserstein‑1. Full optimal transport scales poorly (e.g., cubic or worse in generic solvers), so we transport along many 1D projections where sorting gives \(O(p\,N\log N)\). This preserves Wasserstein’s geometry and outlier robustness while enabling batch/stream execution with per‑projection buffers.

Kernel Herding (MMD, RFF). Exact MMD requires an \(N\times N\) kernel (quadratic memory/time). Random Fourier Features compress the kernel map so herding works with linear passes in the feature space. With streaming accumulators, we update the mean on the fly and avoid building the full kernel.

Mean‑Variance. When speed dominates, we align first/second moments on random projections. It’s a single‑pass, streaming‑friendly heuristic that summarizes center and spread quickly—but, by design, it can miss fine multi‑modal structure.

Hyperplane Extent. To capture geometric extent without heavy computation, we pick extremes along sampled directions. This one‑pass, memory‑light baseline excels at boundary coverage and integrates naturally with batched or streaming ingestion.

MCMC (Delayed Acceptance). Exact MCMC proposals are costly when every step evaluates an expensive objective. Delayed acceptance first screens proposals with a cheap surrogate (e.g., few projections), reserving the full evaluation for promising moves—retaining global search while cutting wall‑clock time.

k‑DPP (Nyström). Vanilla k‑DPP sampling needs spectral decompositions that scale cubically with \(N\), which is untenable for large point sets. Nyström approximates the kernel with \(r\) landmarks, reducing factorization and sampling costs; with \(r\in[64,256]\) we keep diversity while staying within a practical runtime budget.

Computational Results

64D Unified Benchmark

Held-out metrics and runtime from the unified 64D benchmark (lower is better for metrics; runtime in seconds).

m = 100

[Table: Method | Sliced‑W1 (held‑out) | Sliced‑KL (held‑out) | MMD² (held‑out) | Runtime (s)]

m = 200

[Table: Method | Sliced‑W1 (held‑out) | Sliced‑KL (held‑out) | MMD² (held‑out) | Runtime (s)]

m = 300

[Table: Method | Sliced‑W1 (held‑out) | Sliced‑KL (held‑out) | MMD² (held‑out) | Runtime (s)]

Quality Trade-offs

Across held-out evaluations, distributional methods (Sliced‑KL, Sliced‑W1, MMD) consistently achieve lower discrepancy than purely geometric approaches. Sliced variants capture most of the benefit of their full counterparts while remaining tractable in high dimensions.

There are diminishing returns with increasing subset size: most gains appear by \(m\!\approx\!200\), after which improvements taper. Rankings remain stable across seeds and data splits, indicating the trends are robust rather than artifacts of randomness.

Conclusion

Key Takeaways

Method Performance Summary

Method | Strength | Best Use Case
Mean-Variance | Computational efficiency | Quick summaries, moments
Hyperplane Extent | Boundary coverage | Geometric baselines, extent
Sliced‑W1 | Transport-aware, robust | Geometric structure, outliers
Sliced‑KL | Scalable KL approximation | High‑dim, distributional fidelity
Kernel Herding (MMD, RFF) | Fast kernel matching | Large‑scale, kernel similarity
MCMC‑DA | Global optimization | Complex, multi‑modal distributions
k‑DPP (Nyström) | Diversity maximization | Experimental design, sensor placement

Computational-Statistical Trade-offs

Our experiments reveal clear trade-offs between computational efficiency and statistical fidelity:

  • Linear-time methods (Mean-Variance, Hyperplane Extent) offer 10-100x speedups but sacrifice distributional accuracy
  • Sliced approximations retain 85-95% of performance with 10x reduction in cost
  • Kernel methods with RFF enable million-scale datasets with bounded memory
  • Dimensional scaling: Most methods transition from infeasible to tractable around \(d = 50\) when using appropriate approximations

Future Directions

Adaptive projection selection. Learning optimal projection directions rather than random sampling could significantly improve the quality of sliced approximations. Current methods rely on random projections, but data-driven approaches that learn the most informative directions could yield better approximations with fewer projections.

Hybrid objectives. Combining distributional and geometric criteria with learned weights offers a promising direction for balancing different aspects of representative sampling. Rather than choosing between distributional fidelity and geometric preservation, learned combinations could adapt to specific application requirements.

Dynamic subset selection. Adapting the subset as the data distribution evolves is crucial for streaming and online scenarios. Current methods assume static data, but real-world applications often involve data that changes over time, requiring adaptive algorithms that can update selections efficiently.

Deep learning integration. End-to-end differentiable subset selection for downstream tasks could enable joint optimization of sampling and model performance. This would allow the sampling strategy to be tailored specifically for the intended use case, potentially achieving better results than generic representative sampling approaches.

Final Thoughts

Representative sampling bridges fundamental concepts from statistics, optimization, and computer science. The methods presented here offer a spectrum of trade-offs, from theoretically optimal but computationally intensive approaches to fast approximations suitable for massive datasets. The key is to match the method to the application's requirements, considering not just accuracy but also computational constraints, interpretability, and scalability needs.

As data continues to grow in size and complexity, the importance of principled data reduction will only increase. The framework and methods described here provide a solid foundation for tackling these challenges across diverse domains.


All code is available here.

Acknowledgements

This project benefited from conversations with Luke Triplett and support from the Mathematics Directed Reading Program at UC Berkeley.
