Random features
Random embeddings
2016-12-05 — 2025-10-26
Wherein random, fixed cosine features are introduced, Gaussian frequencies are sampled to approximate the RBF kernel, and data is projected into a higher-dimensional space so that linear methods can be applied.
The random-features method (or family of methods) uses the idea that sometimes the “right” features are hard to find, but we can get “pretty good” features by random chance.
Classic random-feature methods use a non-linear transformation of the data, \(\mathbf{z}(\mathbf{x})\), whose parameters are not trained but drawn from a random distribution and then held fixed. A typical example is \(\mathbf{z}(\mathbf{x}) = \cos(W\mathbf{x} + \mathbf{b})\), where the weights \(W\) and biases \(\mathbf{b}\) are sampled once and never updated.
This simple technique is a remarkably powerful tool for making intractable non-linear problems tractable. By projecting data into a new (often higher-dimensional) space with these fixed random features, we can often simplify a complex problem into a much easier one—typically, one that is just linear.
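Here's a minimal sketch of that idea in NumPy (the dimensions, the Gaussian weights, and the helper name `z` are all arbitrary illustrative choices): draw \(W\) and \(\mathbf{b}\) once, then treat \(\mathbf{z}(\mathbf{x}) = \cos(W\mathbf{x} + \mathbf{b})\) as a fixed preprocessing step.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D = 5, 200                           # input dimension, number of random features
W = rng.normal(size=(D, d))             # random weights, sampled once and frozen
b = rng.uniform(0, 2 * np.pi, size=D)   # random phases, sampled once and frozen

def z(X):
    """Fixed random cosine features: X is (n, d), output is (n, D)."""
    return np.cos(X @ W.T + b)

X = rng.normal(size=(10, d))
print(z(X).shape)   # (10, 200): the data now lives in the random feature space
```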
1 In kernels
Kernel methods, like Support Vector Machines (SVMs) or Gaussian process regression, are powerful tools in machine learning. These are all based on the trick of computing the inner product of data points in a high-dimensional feature space, \(\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle\), using only the kernel function \(k(\mathbf{x}, \mathbf{y})\) in the original space.
On the plus side, this avoids an explicit (and potentially infinite-dimensional) mapping \(\phi(\mathbf{x})\). On the downside, to make predictions these methods typically require solving a linear system involving the Gram matrix \(K\), the \(n \times n\) matrix with entries \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\), where \(n\) is the number of data points (kernel ridge regression, spelled out below, is a concrete example).
- Storage: Storing this matrix requires \(O(n^2)\) memory.
- Computation: Calculating it takes \(O(n^2 d)\) time (where \(d\) is the data dimension), and solving the system (e.g., by inverting \(K\)) takes \(O(n^3)\) time.
These costs are prohibitive for large datasets. Many tricks have been proposed to fix this; one of them uses random features to approximate the kernel itself, so that we never need to build or invert \(K\) at all.
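To make these costs concrete, take kernel ridge regression as a representative member of the family (with a ridge penalty \(\lambda\), which is not specified anywhere above):

\[ \hat{\boldsymbol{\alpha}} = (K + \lambda I_n)^{-1} \mathbf{y}, \qquad \hat{f}(\mathbf{x}_*) = \sum_{i=1}^{n} \hat{\alpha}_i \, k(\mathbf{x}_i, \mathbf{x}_*) \]

We must build the \(n \times n\) Gram matrix and solve an \(n \times n\) linear system, which is exactly where the \(O(n^2)\) storage and \(O(n^3)\) computation come from.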
Rahimi and Recht (2007) bypass the \(n \times n\) matrix entirely. Instead of using the implicit map \(\phi\), they construct an explicit, low-dimensional feature map \(\mathbf{z}: \mathbb{R}^d \to \mathbb{R}^D\) such that:
\[ \langle \mathbf{z}(\mathbf{x}), \mathbf{z}(\mathbf{y}) \rangle \approx k(\mathbf{x}, \mathbf{y}) \]
Here, \(D\) is the new target dimension, which we control. The key is choosing \(D \ll n\).
If we can find such a \(\mathbf{z}\), we can transform our \(n \times d\) dataset \(X\) into a new \(n \times D\) dataset \(Z\). Then we can apply a fast linear model (for example, a linear SVM or standard ridge regression) to the transformed data \(Z\). This linear model, in the \(D\)-dimensional feature space, will approximate the powerful non-linear kernel machine in the original space.
The training cost drops from \(O(n^3)\) to roughly \(O(nD^2 + D^3)\): forming \(Z^\top Z\) takes \(O(nD^2)\) and solving the resulting \(D \times D\) system \(Z^\top Z \mathbf{w} = Z^\top \mathbf{y}\) takes \(O(D^3)\), which is a massive win when \(D \ll n\).
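Spelled out with the same hypothetical ridge penalty \(\lambda\) as above, the feature-space problem solves

\[ \hat{\mathbf{w}} = (Z^\top Z + \lambda I_D)^{-1} Z^\top \mathbf{y}, \qquad \hat{f}(\mathbf{x}_*) = \mathbf{z}(\mathbf{x}_*)^\top \hat{\mathbf{w}} \]

which touches only \(D \times D\) matrices, no matter how large \(n\) gets.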
So how do we find this magic map \(\mathbf{z}\)?
For shift-invariant kernels (where \(k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})\), such as the very common Gaussian RBF kernel), the answer lies in Bochner’s Theorem.
Conceptually, Bochner’s Theorem states that any such kernel is the Fourier transform of a non-negative measure \(p(\mathbf{w})\). This means we can write the kernel as an expectation:
\[ k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^d} p(\mathbf{w}) e^{i \mathbf{w}^\top (\mathbf{x} - \mathbf{y})} d\mathbf{w} = \mathbb{E}_{\mathbf{w} \sim p} \left[ e^{i \mathbf{w}^\top \mathbf{x}} e^{-i \mathbf{w}^\top \mathbf{y}} \right] \]
We approximate this integral using Monte Carlo sampling.
If we sample \(D\) random vectors (or “features”) \(\mathbf{w}_1, \dots, \mathbf{w}_D\) from the distribution \(p(\mathbf{w})\), we can form our feature map. A common construction that uses a random phase \(b\) to avoid complex numbers is:
\[ \mathbf{z}(\mathbf{x}) = \sqrt{\frac{2}{D}} \left[ \cos(\mathbf{w}_1^\top \mathbf{x} + b_1), \dots, \cos(\mathbf{w}_D^\top \mathbf{x} + b_D) \right]^\top \]
Here, each \(\mathbf{w}_j\) is drawn from \(p(\mathbf{w})\), and each \(b_j\) is drawn uniformly from \([0, 2\pi]\).
By the law of large numbers, as \(D \to \infty\), the inner product \(\langle \mathbf{z}(\mathbf{x}), \mathbf{z}(\mathbf{y}) \rangle\) converges to the true kernel value \(k(\mathbf{x}, \mathbf{y})\).
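A quick numerical check of that convergence, assuming a standard Gaussian \(p(\mathbf{w})\), which (as the next example shows) corresponds to an RBF kernel with \(\sigma = 1\); the values of \(D\) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
x, y = rng.normal(size=d), rng.normal(size=d)

# Target kernel: RBF with sigma = 1, whose spectral density is p(w) = N(0, I_d).
k_true = np.exp(-np.sum((x - y) ** 2) / 2)

for D in (10, 100, 10_000):
    W = rng.normal(size=(D, d))               # w_j ~ p(w)
    b = rng.uniform(0, 2 * np.pi, size=D)     # random phases
    zx = np.sqrt(2 / D) * np.cos(W @ x + b)
    zy = np.sqrt(2 / D) * np.cos(W @ y + b)
    print(D, zx @ zy, "vs", k_true)           # inner product approaches k(x, y)
```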
Example: the Gaussian (RBF) kernel. For the RBF kernel \(k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)\), the Fourier transform \(p(\mathbf{w})\) is also a Gaussian distribution: \(\mathbf{w} \sim \mathcal{N}(0, \frac{1}{\sigma^2} I_d)\).
So the algorithm is simple:
- Choose the number of features \(D\).
- Sample \(D\) vectors \(\mathbf{w}_j \sim \mathcal{N}(0, \frac{1}{\sigma^2} I_d)\).
- Sample \(D\) phases \(b_j \sim \text{Uniform}(0, 2\pi)\).
- Create the new \(n \times D\) data matrix \(Z\) where \(Z_{ij} = \sqrt{\frac{2}{D}} \cos(\mathbf{w}_j^\top \mathbf{x}_i + b_j)\).
- Train a linear model on \((Z, \mathbf{y})\).
This technique turns a non-linear problem into a linear one, making large-scale kernel methods practical.
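Here's a sketch of the five steps above in plain NumPy, compared against exact kernel ridge regression on the same data; the toy dataset, \(\sigma\), \(D\), and the ridge penalty \(\lambda\) are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D regression problem (illustrative only).
n, d, sigma, lam = 500, 1, 0.5, 1e-3
X = rng.uniform(-3, 3, size=(n, d))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Steps 1-3: choose D, sample frequencies and phases.
D = 300
W = rng.normal(scale=1 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

# Step 4: build the n x D random-feature matrix Z.
def features(X):
    return np.sqrt(2 / D) * np.cos(X @ W.T + b)

Z = features(X)

# Step 5: train a linear model on (Z, y) -- here, ridge regression.
w_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

# Predictions at new points reuse the same fixed features.
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_pred = features(X_test) @ w_hat

# Compare with exact kernel ridge regression on the same data.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))
alpha = np.linalg.solve(K + lam * np.eye(n), y)
K_test = np.exp(-np.sum((X_test[:, None, :] - X[None, :, :]) ** 2, axis=-1)
                / (2 * sigma ** 2))
y_exact = K_test @ alpha

print("max |RFF - exact| on test grid:", np.max(np.abs(y_pred - y_exact)))
```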
2 Nonlinear Embeddings Theory
The R&R feature map is a non-linear map, \(\mathbf{z}(\mathbf{x}) = \cos(W\mathbf{x} + \mathbf{b})\), that often increases the dimensionality of our data (e.g., from \(d=100\) to \(D=1000\)) to make a non-linear regression approximately linear.
This links back to a classic idea from Cover’s Theorem (Cover 1965).
> It was shown that, for a random set of linear inequalities in d unknowns, the expected number of extreme inequalities, which are necessary and sufficient to imply the entire set, tends to 2d as the number of consistent inequalities tends to infinity, thus bounding the expected necessary storage capacity for linear decision algorithms in separable problems. The results, even those dealing with randomly positioned points, have been combinatorial in nature, and have been essentially independent of the configuration of the set of points in the space.
Informally, a dataset that isn’t linearly separable in a low-dimensional space is likely to become linearly separable when non-linearly mapped to a sufficiently high-dimensional space.
While kernel approximation is a classic example, this principle of using fixed, non-linear maps extends to other modern problems in machine learning.
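A toy illustration of Cover's point, assuming an XOR-style dataset and measuring "separability" crudely via the training accuracy of a least-squares linear classifier (both arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)

# XOR-style data: not linearly separable in the original 2-D space.
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0] * X[:, 1])   # labels in {-1, +1}

def linear_accuracy(F, y):
    """Training accuracy of a least-squares linear classifier on features F."""
    F1 = np.hstack([F, np.ones((F.shape[0], 1))])   # add a bias column
    w, *_ = np.linalg.lstsq(F1, y, rcond=None)
    return np.mean(np.sign(F1 @ w) == y)

# Fixed random cosine lift to D dimensions.
D = 500
W = rng.normal(size=(D, 2))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.cos(X @ W.T + b)

print("linear accuracy, raw 2-D features:     ", linear_accuracy(X, y))
print("linear accuracy, random 500-D features:", linear_accuracy(Z, y))
```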
3 Wide Neural Networks
The Random Features (RF) model also helps us understand modern deep learning. An RF model can be viewed as a simplified framework corresponding to neural networks with random representations. Think of a neural network where all the hidden layers have fixed, random weights and only the final linear layer is trained.
For example, Akhtiamov, Ghane, and Hassibi (2025) use this framework to prove a surprising result: that for sufficiently wide models, one-bit quantization (compressing hidden weights to just +1 or -1) causes no loss in generalization error compared with the full-precision model.
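To make the "random representation" picture concrete, here is a toy NumPy sketch of such a model: a single hidden layer with frozen random weights and a trained linear readout, plus a sign-quantized copy of the hidden weights for comparison. This is only an illustration of the general setup, not the construction or the analysis in Akhtiamov, Ghane, and Hassibi (2025); the data, width, and ridge penalty are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression data with a train/test split (illustrative only).
d, width, lam = 10, 2000, 1e-1
n_train, n_test = 1000, 1000
X = rng.normal(size=(n_train + n_test, d))
y = np.tanh(X[:, 0] - X[:, 1]) + 0.05 * rng.normal(size=n_train + n_test)
Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

W = rng.normal(size=(width, d)) / np.sqrt(d)   # frozen random hidden weights

def test_error(W_hidden):
    """Train only the final linear layer; report held-out squared error."""
    Htr = np.maximum(Xtr @ W_hidden.T, 0.0)    # ReLU hidden layer, fixed weights
    Hte = np.maximum(Xte @ W_hidden.T, 0.0)
    w = np.linalg.solve(Htr.T @ Htr + lam * np.eye(width), Htr.T @ ytr)
    return np.mean((Hte @ w - yte) ** 2)

print("full-precision hidden weights:", test_error(W))
print("one-bit (sign) hidden weights:", test_error(np.sign(W) / np.sqrt(d)))
```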
3.1 Parameter-Finding in Complex Simulations
Another neat provocation is to use random features as nearly sufficient statistics in simulation-based inference (SBI).
Scientists in many fields (e.g., climatology, economics or ecology) build complex generative models where the likelihood function \(p(\mathbf{x} | \theta)\) is intractable: we can simulate data \(\tilde{\mathbf{X}}(\theta)\) given parameters \(\theta\), but can’t calculate the probability of our observed data \(\mathbf{x}_{\text{data}}\). This rules out standard statistical techniques.
The traditional approach is to hand-craft summary statistics (like the mean or variance) and tune \(\theta\) until the simulation summaries match the data summaries. This is “a lot of work” and ad hoc.
The random features approach, as proposed by Shalizi (2021), is to “replace the procedure of carefully selecting a very small number of summary statistics, instead using about twice as many random functions of the data”. We define a vector of \(k\) random features, \(\mathbf{z}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_k(\mathbf{x})]^\top\), and find the parameters \(\theta\) that minimize the distance between the features of our real data and the expected features from the simulation:
\[ \hat{\theta} = \arg\min_{\theta} || \mathbf{z}(\mathbf{x}_{\text{data}}) - \mathbb{E}[\mathbf{z}(\tilde{\mathbf{X}}(\theta))] ||^2 \]
Shalizi gives a really neat justification, in terms of theory from nonlinear dynamics (“embedology”), that for a \(d\)-dimensional parameter space we typically only need \(k = 2d+1\) “typical” random functions to create an “embedding” that uniquely identifies the parameters. Here, the random features act as an automatically generated set of summary statistics for inference.
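Here's a toy sketch of the recipe, using a simulator whose likelihood we could of course write down (a Gaussian with unknown mean and log-scale), random empirical-cosine features as the "typical" random functions, and a crude grid search as the optimizer; all of these are arbitrary illustrative choices rather than anything prescribed by Shalizi.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(theta, n=2000, rng=rng):
    """Toy simulator: theta = (mean, log-scale) of a Gaussian sample."""
    mu, log_s = theta
    return mu + np.exp(log_s) * rng.normal(size=n)

# d = 2 parameters, so draw k = 2 d + 1 = 5 random feature functions:
# z_j(x) = mean_i cos(w_j * x_i + b_j), i.e. random empirical cosine moments.
k = 5
w = rng.normal(size=k)
b = rng.uniform(0, 2 * np.pi, size=k)

def z(x):
    return np.cos(np.outer(x, w) + b).mean(axis=0)

# "Observed" data generated at an unknown true parameter.
theta_true = np.array([1.0, np.log(2.0)])
z_data = z(simulate(theta_true))

def objective(theta, n_rep=20):
    """|| z(data) - E[z(sim(theta))] ||^2, with the expectation by simulation."""
    z_sim = np.mean([z(simulate(theta)) for _ in range(n_rep)], axis=0)
    return np.sum((z_data - z_sim) ** 2)

# Crude grid search over theta (a stand-in for a real optimizer).
grid = [(mu, ls) for mu in np.linspace(0, 2, 21) for ls in np.linspace(0, 1.4, 15)]
theta_hat = min(grid, key=lambda th: objective(np.array(th)))
print("true:", theta_true, "estimate:", np.array(theta_hat))
```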
Note that this looks a lot like the method of Koopman operators in randomized form; is that a useful connection?
3.2 Low-dimensional embeddings
Linear projections are a different beast. Instead of approximating a nonlinear kernel, the goal is often to reduce dimensionality while preserving some structure, typically pairwise distances.
Over at compressed sensing we mention some useful tools for these problems. These include the Johnson-Lindenstrauss lemma. This lemma states that you can project \(n\) points in a high-dimensional space \(\mathbb{R}^d\) down to a much lower dimension \(D = O(\log n / \epsilon^2)\) using a random matrix, and all pairwise distances will be preserved up to a \((1 \pm \epsilon)\) factor.
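A quick numerical illustration of the lemma; the dimensions, \(\epsilon\), and the constant in \(D\) are arbitrary, and the projection is a plain i.i.d. Gaussian matrix scaled by \(1/\sqrt{D}\).

```python
import numpy as np

rng = np.random.default_rng(6)

n, d, eps = 200, 10_000, 0.2
D = int(np.ceil(8 * np.log(n) / eps ** 2))   # D = O(log n / eps^2); the constant 8 is a rough choice

X = rng.normal(size=(n, d))                  # n points in a high-dimensional space
P = rng.normal(size=(d, D)) / np.sqrt(D)     # random linear projection
Y = X @ P                                    # projected points in R^D

def pairwise_dists(A):
    sq = np.sum(A ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

iu = np.triu_indices(n, k=1)
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print("distance distortion range:", ratios.min(), ratios.max())  # roughly within (1 - eps, 1 + eps)
```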
See low-d projections.
4 Locality-sensitive hashing
TBD
