feature_construction on Dan MacKinlay
https://danmackinlay.name/tags/feature_construction.html
Recent content in feature_construction on Dan MacKinlay

Laplace approximations in inference
https://danmackinlay.name/notebook/laplace_approx.html
Thu, 13 Jan 2022 15:07:27 +1100

Contents: Learnable Laplace approximations · By stochastic weight averaging · For model selection · In function spaces · INLA · Generalized Gauss-Newton and linearization · Laplace in inverse problems · References

Second mode? I see no second mode.
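The basic recipe — find the mode, then fit a Gaussian whose variance comes from the negative inverse second derivative of the log-density there — fits in a few lines. A one-dimensional sketch on a toy target of my own choosing (not from the notebook), with derivatives estimated by finite differences:

```python
import numpy as np

# Toy unnormalized log-density with a single mode (a made-up example).
def log_p(x):
    return -0.5 * x**2 + np.sin(x)

# 1. Find the mode: coarse grid search, then Newton steps on the
#    derivative, using central finite differences.
xs = np.linspace(-3.0, 3.0, 2001)
mode = xs[np.argmax(log_p(xs))]
h = 1e-4
for _ in range(20):
    d1 = (log_p(mode + h) - log_p(mode - h)) / (2 * h)
    d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2
    mode -= d1 / d2

# 2. Laplace approximation: N(mode, -1/d2), i.e. a Gaussian matching
#    the curvature of the log-density at its mode.
d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2
var = -1.0 / d2
```

In higher dimensions the second derivative becomes a Hessian and the variance an inverse-Hessian covariance, but the recipe is the same.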
Approximating probability distributions by a Gaussian with the same mode. Thanks to limit theorems this is not always a terrible idea, especially since neural networks seem pretty keen to converge to Gaussians in various senses.

Overparameterization
https://danmackinlay.name/notebook/overparameterization.html
Wed, 08 Dec 2021 09:41:04 +1100

Contents: For smoothness · In the wide-network limit · For making optimisation nice · Double descent · Lottery ticket hypothesis · References

Notes on the general technique of increasing the number of slack parameters you have, especially in machine learning, especially especially in neural nets.
For smoothness: this insight is fresh. Bubeck and Sellke (2021) argue:
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied.

Random neural networks
https://danmackinlay.name/notebook/nn_random.html
Tue, 12 Oct 2021 07:08:39 +1100

Contents: Recurrent: Echo State Machines / Random reservoir networks · Random convolutions · References

If you do not bother to train your neural net, what happens? In the infinite-width limit you get a Gaussian process. There are a number of net architectures which do not make use of that argument and which are still random though.
Recurrent: Echo State Machines / Random reservoir networks. This sounds deliciously lazy; at a glance it sounds like the process is to construct a random recurrent network, i.e. …

Approximate Bayesian Computation
https://danmackinlay.name/notebook/approximate_bayesian_computation.html
Mon, 20 Sep 2021 08:08:00 +1000

Contents: SMC for ABC · Bayesian Synthetic Likelihood · Neural methods · SBC · References

Approximate Bayesian Computation is a terribly underspecified description. There are many ways that inference can be based upon simulations, many types of freedom from likelihood and many ways to approximate Bayesian computation. This page is about the dominant use of that term, which is the use of Simulation-based inference to do Bayes updates where the likelihood is not available but where we can simulate from the generative model.

Learning summary statistics
https://danmackinlay.name/notebook/learning_summary_statistics.html
Thu, 15 Jul 2021 10:38:42 +1000

Contents: References

A dimensionality reduction/feature engineering trick specific to the needs of likelihood-free inference methods such as indirect inference or approximate Bayes computation. In these contexts it is not just the summary statistic in isolation to be considered, but its relationship to a distance measure between this summary statistic for the observations and the model simulation.
We would like both of these to be tractable in combination.

Randomized low dimensional projections
https://danmackinlay.name/notebook/low_d_projections.html
Mon, 24 May 2021 14:16:37 +1000

Contents: Tutorials · Inner products · Random projections are kinda Gaussian · Random projections are distance preserving · Projection statistics · Concentration theorems for projections · References

One way I can get at the confusing behaviours of high dimensional distributions is to instead look at low dimensional projections of them. If I have a (possibly fixed) data matrix and a random dimensional projection, what distribution does the projection have?
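One concrete and easily checked answer: a Gaussian random projection approximately preserves pairwise distances, in the Johnson–Lindenstrauss flavour of result. A quick numpy sketch, with dimensions and point counts chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300          # points, ambient dim, projected dim
X = rng.normal(size=(n, d))

# Gaussian random projection, scaled so that E[|Rx|^2] = |x|^2.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

def pdists(Z):
    """All pairwise Euclidean distances between rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

iu = np.triu_indices(n, 1)
ratio = pdists(Y)[iu] / pdists(X)[iu]
# Every pairwise distance is preserved up to a small multiplicative distortion.
```

Shrinking `k` makes the distortion band visibly wider, which is the trade-off the concentration theorems quantify.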
This idea pertains to many others: matrix factorisations, restricted isometry properties, Riesz bases, randomised regression, compressed sensing.

Infinite width limits of neural networks
https://danmackinlay.name/notebook/nn_wide.html
Tue, 11 May 2021 11:36:54 +1000

Contents: Neural Network Gaussian Process · Neural Network Tangent Kernel · Implicit regularization · Dropout · As stochastic DEs · To files · References

Large-width limits of neural nets. An interesting way of considering overparameterization.
Neural Network Gaussian Process: for now, see Neural network Gaussian process on Wikipedia.
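The convergence is easy to poke at numerically. For a one-hidden-layer ReLU network with i.i.d. standard Gaussian weights and 1/√width scaling, the limiting covariance kernel is known in closed form (the first-order arc-cosine kernel of Cho and Saul), so a wide random net's empirical feature covariance should land close to it. A toy check of my own, not from the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 3, 200_000

# Two arbitrary inputs.
x = np.array([1.0, 0.5, -0.2])
y = np.array([0.3, -1.0, 0.8])

# Empirical NNGP kernel: mean product of ReLU features of a wide random layer.
W = rng.normal(size=(d, width))
k_emp = np.maximum(W.T @ x, 0) @ np.maximum(W.T @ y, 0) / width

# Closed-form infinite-width limit (first-order arc-cosine kernel):
# E[relu(w.x) relu(w.y)] = |x||y| (sin t + (pi - t) cos t) / (2 pi),
# where t is the angle between x and y.
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
k_lim = nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
```

At width 200,000 the Monte Carlo estimate sits within a few percent of the closed form.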
The field that sprang from the insight (Neal 1996a) that in the infinite limit, random neural nets with Gaussian weights and appropriate scaling asymptotically approach Gaussian processes, and that there are useful conclusions we can draw from that.

Random embeddings and hashing
https://danmackinlay.name/notebook/random_embedding.html
Tue, 01 Dec 2020 14:01:36 +1100

Contents: References

Separation of inputs by random projection.
See also matrix factorisations, for some extra ideas on why random projections have a role in motivating compressed sensing, randomised regressions etc.
Occasionally we might use non-linear projections to increase the dimensionality of our data in the hope of making a non-linear regression approximately linear, an idea which dates back to Cover (1965).
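A minimal numerical illustration of that hope (toy data and feature counts of my own choosing): concentric classes in the plane are not linearly separable, but become nearly so after a random ReLU lift into a higher-dimensional feature space.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 2))
y = np.linalg.norm(X, axis=1) > 1.1   # concentric classes: not linearly separable

def linear_fit_accuracy(F, y):
    """Least-squares fit of a linear classifier (with bias) on features F."""
    A = np.c_[F, np.ones(len(F))]
    w, *_ = np.linalg.lstsq(A, np.where(y, 1.0, -1.0), rcond=None)
    return ((A @ w > 0) == y).mean()

# Random non-linear lift: ReLU of random affine projections into 200 dims.
W = rng.normal(size=(2, 200))
b = rng.normal(size=200)
F = np.maximum(X @ W + b, 0)

acc_raw = linear_fit_accuracy(X, y)       # barely better than guessing
acc_lifted = linear_fit_accuracy(F, y)    # near-perfect on the training set
```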
Cover’s Theorem (Cover 1965):
It was shown that, for a random set of linear inequalities in \(d\) unknowns, the expected number of extreme inequalities, which are necessary and sufficient to imply the entire set, tends to \(2d\) as the number of consistent inequalities tends to infinity, thus bounding the expected necessary storage capacity for linear decision algorithms in separable problems.

Randomised regression
https://danmackinlay.name/notebook/randomised_regression.html
Tue, 01 Dec 2020 14:00:10 +1100

Contents: References

Tackling your regression by using random embeddings of the predictors (and/or predictions?). Usually this means using low-dimensional projections to reduce the dimensionality of a high dimensional regression. In this case it is not far from compressed sensing, except in how we handle noise. In this linear model case, this is of course random linear algebra, and may be a randomised matrix factorisation. You can do it the other way and project something into a higher dimensional space, which is a popular trick for kernel approximation.

Dimensionality reduction
https://danmackinlay.name/notebook/dimensionality_reduction.html
Fri, 11 Sep 2020 08:20:03 +1000

Contents: Bayes · Learning a summary statistic · Feature selection · PCA and cousins · Learning a distance metric · UMAP · For indexing my database · Locality Preserving projections · Diffusion maps · As manifold learning · Multidimensional scaling · Random projection · Stochastic neighbour embedding and other visualisation-oriented methods · Autoencoder and word2vec · Misc · References

🏗🏗🏗🏗🏗
I will restructure learning on manifolds and dimensionality reduction into a more useful distinction.
You have lots of predictors in your regression model!

(Approximate) matrix factorisation
https://danmackinlay.name/notebook/matrix_factorisation.html
Fri, 03 Jul 2020 19:51:38 +1000

Contents: Why does it ever work · Overviews · Non-negative matrix factorisations · As regression · Sketching · \([\mathcal{H}]\)-matrix methods · Randomized methods · Connections to kernel learning · Implementations · References

Forget QR and LU decompositions, there are now so many ways of factorising matrices that there are not enough acronyms in the alphabet to hold them, especially if you suspect your matrix is sparse, or could be made sparse because of some underlying constraint, or probably could, if squinted at in the right fashion, be seen as a graph transition matrix, or Laplacian, or noisy transform of some smooth object, or at least would be close to sparse if you chose the right metric, or…

Learning of manifolds
https://danmackinlay.name/notebook/learning_of_manifolds.html
Tue, 23 Jun 2020 09:34:49 +1000

Contents: Implementations · TTK · scikit-learn · tapkee · References

🏗🏗🏗🏗🏗
I will restructure learning on manifolds and dimensionality reduction into a more useful distinction.
Berger, Daniels and Yu on manifolds in Genome search
As in — handling your high-dimensional, or graphical, data by trying to discover a low(er)-dimensional manifold that contains it. That is, inferring a hidden constraint that happens to have the form of a smooth surface of some low-ish dimension.

Likelihood free inference
https://danmackinlay.name/notebook/likelihood_free_inference.html
Wed, 22 Apr 2020 17:36:41 +1000

Contents: References

Finding the target without directly inspecting the likelihood of the current guess.
A term which seems to have a couple of distinct uses.
Here, I mostly mean this in the sense of trying to approximate intractable likelihoods, possibly by Monte Carlo simulations from a generative model. There seems also to be a school which would like to use this term for methods which make no reference to probability densities whatever.

Learnable indexes and hashes
https://danmackinlay.name/notebook/learnable_indexes.html
Tue, 18 Feb 2020 12:20:29 +1100

Contents: Learnable hashes for similarity search · Learnable indexes for arbitrary search · References

Dr. Wu-Jun LI’s excellent lit review and practicalities supporting their own papers. Kevin Zakka’s kNN classification using Neighbourhood Components Analysis is an illustrated guide to a type of dimensionality reduction I had not heard of before that looks handy for nearest-neighbour search, which I suppose is the entry-level use here. (Dwibedi et al. 2019)

Non-negative matrix factorisation
https://danmackinlay.name/notebook/nnmf.html
Mon, 14 Oct 2019 15:56:01 +1100

Contents: References

A cute hack in the world of sparse matrix factorisation, where the goal is to decode an element-wise non-negative matrix into a product of two smaller matrices, which looks a lot like sparse coding if you squint at it.
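The classic way to compute such a factorisation is Lee and Seung's multiplicative updates, which keep both factors element-wise non-negative by construction. A bare-bones numpy sketch for the Frobenius (\(l_2\)) loss, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((20, 30))          # element-wise non-negative data matrix
r = 5                             # inner dimension of V ≈ W @ H

# Random strictly positive initialisation.
W = rng.random((20, r)) + 0.1
H = rng.random((r, 30)) + 0.1

eps = 1e-9                        # guard against division by zero
for _ in range(500):
    # Lee–Seung multiplicative updates for the Frobenius loss:
    # each factor is scaled by a ratio of non-negative terms,
    # so W and H stay non-negative throughout.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The updates monotonically decrease the reconstruction error; what they converge to is a local optimum, which is part of why initialisation schemes matter in practice.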
David Austin gives a simple introduction to the classic Non-negative matrix factorization for the American Mathematical Society.
This method is famous for decomposing things into parts in a sparse way using \(l_2\) loss.

Fourier interpolation
https://danmackinlay.name/notebook/fourier_interpolation.html
Wed, 19 Jun 2019 13:11:40 +0200

Contents: Minimum curvature interpolant · Derivatives · References

\[\renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\mm}[1]{\mathrm{#1}} \renewcommand{\mmm}[1]{\mathrm{#1}} \renewcommand{\cc}[1]{\mathcal{#1}} \renewcommand{\ff}[1]{\mathfrak{#1}} \renewcommand{\oo}[1]{\operatorname{#1}}\]
Video: Jezzamon’s Fourier hand.
a.k.a. spectral resampling/differentiation/integration.
Rick Lyons, How to Interpolate in the Time-Domain by Zero-Padding in the Frequency Domain. Also more classic Rick Lyons: FFT Interpolation Based on FFT Samples: A Detective Story With a Surprise Ending.
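The zero-padding trick itself fits in a few lines of numpy. A sketch of my own (the halving of the Nyquist bin for even-length inputs is the subtlety worth noting):

```python
import numpy as np

def fourier_interp(x, m):
    """Interpolate a real periodic signal x (length n) to length m > n
    by zero-padding its spectrum in the frequency domain."""
    n = len(x)
    X = np.fft.rfft(x)
    if n % 2 == 0:
        X[-1] *= 0.5          # split the Nyquist bin once it becomes interior
    Xp = np.zeros(m // 2 + 1, dtype=complex)
    Xp[: len(X)] = X
    # irfft normalises by m, so rescale to preserve the original amplitudes.
    return np.fft.irfft(Xp, m) * (m / n)

# Bandlimited example: 16 samples of sin(3t), upsampled to 64 points,
# recover sin(3t) on the finer grid essentially exactly.
t = np.linspace(0, 2 * np.pi, 16, endpoint=False)
t_fine = np.linspace(0, 2 * np.pi, 64, endpoint=False)
y = fourier_interp(np.sin(3 * t), 64)
```

For signals that are not bandlimited below the original Nyquist rate, the same code still runs but the interpolant rings, which is the usual caveat with spectral resampling.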
Steven Johnson’s Notes on FFT-based differentiation is all I need; it points out a couple of subtleties about DTFT-based differentiation of functions.

Entity embeddings
https://danmackinlay.name/notebook/entity_embeddings.html
Sat, 01 Apr 2017 09:56:50 +0800

Feature construction for inconvenient data; made famous by word embeddings such as word2vec being surprisingly semantic. Note that word2vec has a complex relationship to its documentation.
Entity embeddings of categorical variables (code)
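The mechanics are small enough to sketch without a deep-learning framework. A toy version of my own (not the paper's code): each category owns a trainable low-dimensional vector, and the embedding table plus a linear readout are fitted jointly by plain SGD.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical task: each of 10 categories has a latent effect on y;
# we learn a 2-d embedding per category plus a linear readout, jointly.
n_cat, emb_dim = 10, 2
true_effect = rng.normal(size=n_cat)
cats = rng.integers(0, n_cat, size=2000)
y = true_effect[cats] + 0.1 * rng.normal(size=cats.size)

E = 0.1 * rng.normal(size=(n_cat, emb_dim))   # embedding table
w = rng.normal(size=emb_dim)                  # linear readout

lr = 0.02
for _ in range(10):                           # epochs of SGD on squared error
    for c, target in zip(cats, y):
        err = E[c] @ w - target
        grad_w = err * E[c]                   # gradient w.r.t. the readout
        E[c] -= lr * err * w                  # only this category's row moves
        w -= lr * grad_w

mse = np.mean((E[cats] @ w - y) ** 2)
```

In a real network the embedding rows feed into further layers and similar categories end up near each other in the embedding space, which is the property the abstract below emphasises.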
We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables.

Clustering
https://danmackinlay.name/notebook/clustering.html
Tue, 07 Jun 2016 10:05:38 +1000

Contents: Clustering as matrix factorisation · References

Getting a bunch of data points and approximating them (in some sense) by their membership (possibly fuzzy) in some groups, or regions of feature space.
For certain definitions this can be the same thing as non-negative and/or low rank matrix factorisations if you use mixture models, and is only really different in emphasis from dimensionality reduction. If you start with a list of features then think about “distances” between observations, you have just implicitly induced a weighted graph from your hitherto non-graphy data and are now looking at a networks problem.

Indirect inference
https://danmackinlay.name/notebook/simulation_based_inference.html
Tue, 15 Dec 2015 14:12:55 +0800

Contents: References

A.k.a. the auxiliary method. At the moment I am mostly using the sub-flavour of this called Approximate Bayesian Computation, so that notebook is rather more developed.
In the (older?) frequentist framing you can get through an undergraduate program in statistics without simulation-based inference arising. However, I am pretty sure it is required for economists and ecologists.
Quoting Cosma:
[…] your model is too complicated for you to appeal to any of the usual estimation methods of statistics.
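When the likelihood is too complicated to evaluate but cheap to simulate from, the crudest simulation-based scheme is rejection ABC: draw parameters from the prior, simulate data, and keep the draws whose summary statistic lands close to the observed one. A toy sketch of my own (inferring a Gaussian mean, with the sample mean as summary statistic):

```python
import numpy as np

rng = np.random.default_rng(4)

# "Observed" data generated from an unknown mean we want to infer.
true_mu = 1.5
obs = rng.normal(true_mu, 1.0, size=100)
s_obs = obs.mean()                      # summary statistic

# Rejection ABC: sample from the prior, simulate, keep the near-misses.
n_draws, tol = 20_000, 0.1
mu_prior = rng.uniform(-5, 5, size=n_draws)
sims = rng.normal(mu_prior[:, None], 1.0, size=(n_draws, 100))
accepted = mu_prior[np.abs(sims.mean(axis=1) - s_obs) < tol]

posterior_mean = accepted.mean()        # crude approximate posterior mean
```

The accepted draws approximate the posterior only up to the tolerance and the information lost in the summary statistic, which is exactly why the choice of summary statistic and distance gets its own notebook above.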