Approximately reversible representations of complicated spaces as vectors that preserve semantic information.
Warning: this is not my current area, but it is a rapidly moving one.
Treat notes here with caution; many are outdated.
Modernised for transformer era
TBD
Not just words now!
Misc
Feature construction for inconvenient data, made famous by word embeddings such as word2vec turning out to be surprisingly semantic.
Note that word2vec has a complex relationship to its documentation.
Entity embeddings of categorical variables (code)
We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics is unknown. Thus it is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead. As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering.
Rutger Ruizendaal has a tutorial on learning embedding layers
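The core trick in the quoted abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the category count, embedding width, and batch are all made-up numbers, and the table here is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-cardinality categorical feature: 1000 distinct IDs.
n_categories, embed_dim = 1000, 8

# One-hot encoding would make each row n_categories wide.
# An entity embedding is a lookup table of shape (n_categories, embed_dim),
# whose rows would normally be learned during supervised training.
embedding = rng.normal(size=(n_categories, embed_dim))

# A batch of category indices is encoded by row lookup, not a matrix product.
batch = np.array([3, 17, 3, 999])
encoded = embedding[batch]  # shape (4, embed_dim)

assert encoded.shape == (4, 8)
# Identical categories map to identical vectors, so distances between
# rows act as a learned similarity measure on the raw categories.
assert np.allclose(encoded[0], encoded[2])
```

In a real model the lookup table is a trainable parameter (e.g. an embedding layer) updated by backpropagation; the point of the sketch is just the memory and lookup structure.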
Embedding vector databases
- Vector Database Primer
- What is a Vector Database? | Pinecone
- LangChain docs maintain a confusing but current list of backing stores: VectorStores
- Vector Databases as Memory for your AI Agents | by Ivan Campos
related: learnable indices.
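Stripped of the approximate index structures that make them scale, vector databases answer one query: nearest neighbours under some similarity. A brute-force sketch (all data here is invented for illustration):

```python
import numpy as np

def top_k(query, index, k=2):
    # Brute-force cosine-similarity search over a matrix of stored
    # vectors; real vector databases replace this linear scan with
    # approximate nearest-neighbour indices (HNSW, IVF, ...).
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

# Three stored 2-d embeddings.
index = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
hits = top_k(np.array([1.0, 0.1]), index)
assert hits[0] == 0  # the near-parallel vector ranks first
```

The production systems linked above layer persistence, metadata filtering, and approximate indexing on top of this primitive.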
Software
word2vec: “This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.”
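The skip-gram objective mentioned there predicts context words from a centre word; the training data is just (centre, context) pairs harvested with a sliding window. A sketch of that pair extraction (function name and window size are my own, for illustration):

```python
def skipgram_pairs(tokens, window=2):
    # For each position, pair the centre word with every context word
    # within `window` tokens on either side.
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
# Adjacent words appear as pairs in both orders.
assert ("cat", "sat") in pairs and ("sat", "cat") in pairs
```

word2vec then trains embeddings so that centre vectors score highly against the vectors of their observed context words; CBOW is the mirror image, predicting the centre from the averaged context.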
fastText is a library for efficient learning of word representations and sentence classification.
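fastText's distinctive move is representing a word as the bag of its character n-grams (plus the word itself, with boundary markers `<` and `>`), so rare and unseen words still get vectors from shared subword pieces. A sketch of the n-gram extraction, with invented parameter names:

```python
def char_ngrams(word, n_min=3, n_max=3):
    # Wrap the word in boundary markers, then slide windows of each
    # length from n_min to n_max across it.
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams

assert "<wh" in char_ngrams("where")
assert "ere>" in char_ngrams("where", 4, 4)
```

The word vector is then the sum of the vectors of these n-grams, which is why fastText handles morphology and typos more gracefully than whole-word lookup tables.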