Random features

Random embeddings

2016-12-05 — 2025-10-26

Wherein random, fixed cosine features are introduced, Gaussian frequencies are sampled to approximate the RBF kernel, and data is projected into a higher-dimensional space so linear methods are applied.

feature construction
functional analysis
high d
linear algebra
kernel methods
measure
metrics
Monte Carlo
probabilistic algorithms
probability
signal processing
sparser than thou
statistics
Figure 1: Separation of inputs by random projection

The random-features method (or family of methods) uses the idea that sometimes the “right” features are hard to find, but we can get “pretty good” features by random chance.

Classic random-feature methods use a non-linear transformation of the data, \(\mathbf{z}(\mathbf{x})\), whose parameters are not trained but drawn from a random distribution and then held fixed. A typical example is \(\mathbf{z}(\mathbf{x}) = \cos(W\mathbf{x} + \mathbf{b})\), where the weights \(W\) and biases \(\mathbf{b}\) are sampled once and never updated.

This simple technique is a remarkably powerful tool for making intractable non-linear problems tractable. By projecting data into a new (often higher-dimensional) space with these fixed random features, we can often simplify a complex problem into a much easier one—typically, one that is just linear.

1 In kernels

Kernel methods, like Support Vector Machines (SVMs) or Gaussian process regression, are powerful tools in machine learning. These are all based on the trick of computing the inner product of data points in a high-dimensional feature space, \(\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle\), using only the kernel function \(k(\mathbf{x}, \mathbf{y})\) in the original space.

On the plus side, this avoids an explicit (and potentially infinite-dimensional) mapping \(\phi(\mathbf{x})\). On the downside, to make predictions these methods typically require solving a linear system involving the Gram matrix \(K\), an \(n \times n\) matrix where \(n\) is the number of data points and \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\). [TODO clarify]

  • Storage: Storing this matrix requires \(O(n^2)\) memory.
  • Computation: Calculating it takes \(O(n^2 d)\) time (where \(d\) is the data dimension), and solving the system (e.g., by inverting \(K\)) takes \(O(n^3)\) time.

These costs are prohibitive for large datasets. Many tricks have been proposed to fix this; one is using random features to approximate the inversion.

Rahimi and Recht (2007) bypassed the \(n \times n\) matrix. Instead of using the implicit map \(\phi\), they find an explicit, low-dimensional feature map \(\mathbf{z}(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}^D\) such that:

\[ \langle \mathbf{z}(\mathbf{x}), \mathbf{z}(\mathbf{y}) \rangle \approx k(\mathbf{x}, \mathbf{y}) \]

Here, \(D\) is the new target dimension, which we control. The key is choosing \(D \ll n\).

If we can find such a \(\mathbf{z}\), we can transform our \(n \times d\) dataset \(X\) into a new \(n \times D\) dataset \(Z\). Then we can apply a fast linear model (for example, a linear SVM or standard ridge regression) to the transformed data \(Z\). This linear model, in the \(D\)-dimensional feature space, will approximate the powerful non-linear kernel machine in the original space.

The training cost drops from \(O(n^3)\) to something like \(O(nD^2)\) (to solve \(Z^\top Z \mathbf{w} = Z^\top \mathbf{y}\)), which is a massive win when \(D \ll n\). [TODO clarify]

So how do we find this magic map \(\mathbf{z}\)?

For shift-invariant kernels (where \(k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})\), such as the very common Gaussian RBF kernel), the answer lies in Bochner’s Theorem.

Conceptually, Bochner’s Theorem states that any such kernel is the Fourier transform of a non-negative measure \(p(\mathbf{w})\). This means we can write the kernel as an expectation:

\[ k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^d} p(\mathbf{w}) e^{i \mathbf{w}^\top (\mathbf{x} - \mathbf{y})} d\mathbf{w} = \mathbb{E}_{\mathbf{w} \sim p} \left[ e^{i \mathbf{w}^\top \mathbf{x}} e^{-i \mathbf{w}^\top \mathbf{y}} \right] \]

We approximate this integral using Monte Carlo sampling.

If we sample \(D\) random vectors (or “features”) \(\mathbf {w}_1, \dots, \mathbf {w}_D\) from the distribution \(p(\mathbf {w})\), we can form our feature map. A common construction that uses a random phase \(b\) to avoid complex numbers is:

\[ \mathbf{z}(\mathbf{x}) = \sqrt{\frac{2}{D}} \left[ \cos(\mathbf{w}_1^\top \mathbf{x} + b_1), \dots, \cos(\mathbf{w}_D^\top \mathbf{x} + b_D) \right]^\top \]

Here, each \(\mathbf{w}_j\) is drawn from \(p(\mathbf{w})\), and each \(b_j\) is drawn uniformly from \([0, 2\pi]\).

By the law of large numbers, as \(D \to \infty\), the inner product \(\langle \mathbf{z}(\mathbf{x}), \mathbf{z}(\mathbf{y}) \rangle\) converges to the true kernel value \(k(\mathbf{x}, \mathbf{y})\).

Example: The Gaussian (RBF) Kernel For the RBF kernel \(k(\mathbf{x}, \mathbf{y}) = \exp(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2})\), its Fourier transform \(p(\mathbf{w})\) is also a Gaussian distribution: \(\mathbf{w} \sim \mathcal{N}(0, \frac{1}{\sigma^2} I_d)\).

So the algorithm is simple:

  1. Choose the number of features \(D\).
  2. Sample \(D\) vectors \(\mathbf{w}_j \sim \mathcal{N}(0, \frac{1}{\sigma^2} I_d)\).
  3. Sample \(D\) phases \(b_j \sim \text{Uniform}(0, 2\pi)\).
  4. Create the new \(n \times D\) data matrix \(Z\) where \(Z_{ij} = \sqrt{\frac{2}{D}} \cos(\mathbf{w}_j^\top \mathbf{x}_i + b_j)\).
  5. Train a linear model on \((Z, \mathbf{y})\).

This technique turns a non-linear problem into a linear one, making large-scale kernel methods practical.

2 Nonlinear Embeddings Theory

The R&R feature map is a non-linear map, \(\mathbf{z}(\mathbf{x}) = \cos(W\mathbf{x} + \mathbf{b})\), that often increases the dimensionality of our data (e.g., from \(d=100\) to \(D=1000\)) to make a non-linear regression approximately linear.

This links back to a classic idea from Cover’s Theorem (Cover 1965).

It was shown that, for a random set of linear inequalities in \(d\) unknowns, the expected number of extreme inequalities, which are necessary and sufficient to imply the entire set, tends to \(2d\) as the number of consistent inequalities tends to infinity, thus bounding the expected necessary storage capacity for linear decision algorithms in separable problems. The results, even those dealing with randomly positioned points, have been combinatorial in nature, and have been essentially independent of the configuration of the set of points in the space.

Informally, a dataset that isn’t linearly separable in a low-dimensional space is likely to become linearly separable when non-linearly mapped to a sufficiently high-dimensional space.

While kernel approximation is a classic example, this principle of using fixed, non-linear maps extends to other modern problems in machine learning.

3 Wide Neural Networks

The Random Features (RF) model also helps us understand modern deep learning. An RF model can be viewed as a simplified framework corresponding to neural networks with random representations. Think of a neural network where all the hidden layers have fixed, random weights and only the final linear layer is trained.

For example, Akhtiamov, Ghane, and Hassibi (2025) use this framework to prove a surprising result: that for sufficiently wide models, one-bit quantization (compressing hidden weights to just +1 or -1) causes no loss in generalization error compared with the full-precision model.

3.1 Parameter-Finding in Complex Simulations

Another neat provocation is to use random features as nearly sufficient statistics in simulation-based inference (SBI)

Scientists in many fields (e.g., climatology, economics or ecology) build complex generative models where the likelihood function \(p(\mathbf{x} | \theta)\) is intractable: we can simulate data \(\tilde{\mathbf{X}}(\theta)\) given parameters \(\theta\), but can’t calculate the probability of our observed data \(\mathbf{x}_{\text{data}}\). This rules out standard statistical techniques.

The traditional approach is to hand-craft summary statistics (like the mean or variance) and tune \(\theta\) until the simulation summaries match the data summaries. This is “a lot of work” and ad hoc.

The random features approach, as proposed by Shalizi (2021), is to “replace the procedure of carefully selecting a very small number of summary statistics, instead using about twice as many random functions of the data”. We define a vector of \(k\) random features, \(\mathbf{z}(\mathbf{x}) = [f_1(\mathbf{x}), \dots, f_k(\mathbf{x})]^\top\), and find the parameters \(\theta\) that minimize the distance between the features of our real data and the expected features from the simulation:

\[ \hat{\theta} = \arg\min_{\theta} || \mathbf{z}(\mathbf{x}_{\text{data}}) - \mathbb{E}[\mathbf{z}(\tilde{\mathbf{X}}(\theta))] ||^2 \]

Shalizi gives a really neat justification, in terms of theory from nonlinear dynamics (“embedology”), that for a \(d\)-dimensional parameter space we typically only need \(k = 2d+1\) “typical” random functions to create an “embedding” that uniquely identifies the parameters. Here, the random features act as an automatically generated set of summary statistics for inference.

Note that this looks a lot like the method of Koopman operators in randomized form; is that a useful connection?

3.2 Low-dimensional embeddings

Linear projections are a different beast. Instead of approximating a nonlinear kernel, the goal is often to reduce dimensionality while preserving some structure, typically pairwise distances.

Over at compressed sensing we mention some useful tools for these problems. These include the Johnson-Lindenstrauss lemma. This lemma states that you can project \(n\) points in a high-dimensional space \(\mathbb{R}^d\) down to a much lower dimension \(D = O(\log n / \epsilon^2)\) using a random matrix, and all pairwise distances will be preserved up to a \((1 \pm \epsilon)\) factor.

See low-d projections.

4 Locality-sensitive hashing

TBD

5 References

Achlioptas. 2003. Database-Friendly Random Projections: Johnson-Lindenstrauss with Binary Coins.” Journal of Computer and System Sciences, Special Issue on PODS 2001,.
Ailon, and Chazelle. 2009. The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors.” SIAM Journal on Computing.
Akhtiamov, Ghane, and Hassibi. 2025. One-Bit Quantization for Random Features Models.”
Alaoui, and Mahoney. 2014. Fast Randomized Kernel Methods With Statistical Guarantees.” arXiv:1411.0306 [Cs, Stat].
Andoni, A., and Indyk. 2006. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions.” In 47th Annual IEEE Symposium on Foundations of Computer Science, 2006. FOCS ’06.
Andoni, Alexandr, Indyk, Nguyen, et al. 2013. Beyond Locality-Sensitive Hashing.” arXiv:1306.1547 [Cs].
Andoni, Alexandr, and Razenshteyn. 2015. Optimal Data-Dependent Hashing for Approximate Near Neighbors.” arXiv:1501.01062 [Cs].
Auvolat, and Vincent. 2015. Clustering Is Efficient for Approximate Maximum Inner Product Search.” arXiv:1507.05910 [Cs, Stat].
Bach. 2015. On the Equivalence Between Kernel Quadrature Rules and Random Feature Expansions.”
Baraniuk, Davenport, DeVore, et al. 2008. A Simple Proof of the Restricted Isometry Property for Random Matrices.” Constructive Approximation.
Bingham, and Mannila. 2001. Random Projection in Dimensionality Reduction: Applications to Image and Text Data.” In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’01.
Brault, d’Alché-Buc, and Heinonen. 2016. Random Fourier Features for Operator-Valued Kernels.” In Proceedings of The 8th Asian Conference on Machine Learning.
Candès, and Tao. 2006. Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies? IEEE Transactions on Information Theory.
Casey, Rhodes, and Slaney. 2008. Analysis of Minimum Distances in High-Dimensional Musical Spaces.” IEEE Transactions on Audio, Speech, and Language Processing.
Celentano, Misiakiewicz, and Montanari. 2021. Minimum Complexity Interpolation in Random Features Models.”
Choromanski, Rowland, and Weller. 2017. The Unreasonable Effectiveness of Random Orthogonal Embeddings.” arXiv:1703.00864 [Stat].
Coleman, Baraniuk, and Shrivastava. 2020. Sub-Linear Memory Sketches for Near Neighbor Search on Streaming Data.” arXiv:1902.06687 [Cs, Eess, Stat].
Cover. 1965. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition.” IEEE Transactions on Electronic Computers.
Dasgupta. 2000. Experiments with Random Projection.” In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. UAI’00.
Dasgupta, and Gupta. 2003. An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures & Algorithms.
Datar, Immorlica, Indyk, et al. 2004. Locality-Sensitive Hashing Scheme Based on P-Stable Distributions.” In Proceedings of the Twentieth Annual Symposium on Computational Geometry. SCG ’04.
Dezfouli, and Bonilla. 2015. Scalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In Advances in Neural Information Processing Systems 28. NIPS’15.
Duarte, and Baraniuk. 2013. Spectral Compressive Sensing.” Applied and Computational Harmonic Analysis.
Eftekhari, Yap, Wakin, et al. 2016. Stabilizing Embedology: Geometry-Preserving Delay-Coordinate Maps.” arXiv:1609.06347 [Nlin, Stat].
Fodor. 2002. A Survey of Dimension Reduction Techniques.”
Freund, Dasgupta, Kabra, et al. 2007. Learning the Structure of Manifolds Using Random Projections.” In Advances in Neural Information Processing Systems.
Geurts, Ernst, and Wehenkel. 2006. Extremely Randomized Trees.” Machine Learning.
Ghojogh, Ghodsi, Karray, et al. 2021. Johnson-Lindenstrauss Lemma, Linear and Nonlinear Random Projections, Random Fourier Features, and Random Kitchen Sinks: Tutorial and Survey.” arXiv:2108.04172 [Cs, Math, Stat].
Gionis, Indyky, and Motwaniz. 1999. Similarity Search in High Dimensions via Hashing.” In.
Giryes, Sapiro, and Bronstein. 2016. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? IEEE Transactions on Signal Processing.
Gorban, Tyukin, and Romanenko. 2016. The Blessing of Dimensionality: Separation Theorems in the Thermodynamic Limit.” arXiv:1610.00494 [Cs, Stat].
Gottwald, and Reich. 2020. Supervised Learning from Noisy Observations: Combining Machine-Learning Techniques with Data Assimilation.” arXiv:2007.07383 [Physics, Stat].
Hall, and Li. 1993. On Almost Linearity of Low Dimensional Projections from High Dimensional Data.” The Annals of Statistics.
Heusser, Ziman, Owen, et al. 2017. HyperTools: A Python Toolbox for Visualizing and Manipulating High-Dimensional Data.” arXiv:1701.08290 [Stat].
Kammonen, Kiessling, Plecháč, et al. 2020. Adaptive Random Fourier Features with Metropolis Sampling.” arXiv:2007.10683 [Cs, Math].
Kane, and Nelson. 2014. Sparser Johnson-Lindenstrauss Transforms.” Journal of the ACM.
Kar, and Karnick. 2012. Random Feature Maps for Dot Product Kernels.” In Artificial Intelligence and Statistics.
Koltchinskii, and Giné. 2000. Random Matrix Approximation of Spectra of Integral Operators.” Bernoulli.
Koppel, Warnell, Stump, et al. 2016. Parsimonious Online Learning with Kernels via Sparse Projections in Function Space.” arXiv:1612.04111 [Cs, Stat].
Krummenacher, McWilliams, Kilcher, et al. 2016. Scalable Adaptive Stochastic Optimization Using Random Projections.” In Advances in Neural Information Processing Systems 29.
Kulis, and Grauman. 2012. Kernelized Locality-Sensitive Hashing.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Landweber, Lazar, and Patel. 2016. On Fiber Diameters of Continuous Maps.” American Mathematical Monthly.
Li, Ping, Hastie, and Church. 2006. Very Sparse Random Projections.” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’06.
Li, Zhu, Ton, Oglic, et al. 2019. “Towards a Unified Analysis of Random Fourier Features.” In.
McWilliams, Balduzzi, and Buhmann. 2013. Correlated Random Features for Fast Semi-Supervised Learning.” In Advances in Neural Information Processing Systems 26.
Moosmann, Triggs, and Jurie. 2006. Fast Discriminative Visual Codebooks Using Randomized Clustering Forests.” In Advances in Neural Information Processing Systems.
Oveneke, Aliosha-Perez, Zhao, et al. 2016. Efficient Convolutional Auto-Encoding via Random Convexification and Frequency-Domain Minimization.” In Advances in Neural Information Processing Systems 29.
Oymak, and Tropp. 2015. Universality Laws for Randomized Dimension Reduction, with Applications.” arXiv:1511.09433 [Cs, Math, Stat].
Rahimi, and Recht. 2007. Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems.
———. 2008. Uniform Approximation of Functions with Random Bases.” In 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
Saul. 2023. A Geometrical Connection Between Sparse and Low-Rank Matrices and Its Application to Manifold Learning.” Transactions on Machine Learning Research.
Scardapane, and Wang. 2017. Randomness in Neural Networks: An Overview.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Shalizi. 2021. A Note on Simulation-Based Inference by Matching Random Features.”
Shi, Sun, and Zhu. 2018. A Spectral Approach to Gradient Estimation for Implicit Distributions.” In.
Sinha, and Duchi. 2016. Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29.
Sterge, and Sriperumbudur. 2021. Statistical Optimality and Computational Efficiency of Nyström Kernel PCA.” arXiv:2105.08875 [Cs, Math, Stat].
Tang, Athreya, Sussman, et al. 2014. A Nonparametric Two-Sample Hypothesis Testing Problem for Random Dot Product Graphs.” arXiv:1409.2344 [Math, Stat].
Weinberger, Dasgupta, Langford, et al. 2009. Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09.
Zhang, Wang, Cai, et al. 2010. Self-Taught Hashing for Fast Similarity Search.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’10.