Neural vector embeddings

Hyperdimensional Computing, Vector Symbolic Architectures, Holographic Reduced Representations



Representations of complicated spaces by vectors which preserve semantic information.

Warning: this is not my current area, but it is a rapidly moving one.

Treat notes here with caution; many are outdated.

Modernised for transformer era

TBD

Technical survey: Kleyko et al. (2022) cites back to the year 2000.
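
For orientation, the binding operation at the heart of Holographic Reduced Representations can be sketched in a few lines. This is a minimal illustration, not code from any of the cited works: the dimension, the seed, and the helper names (`random_vector`, `bind`, `involution`) are all assumptions made for the example. It binds a "role" vector to a "filler" vector by circular convolution, then recovers a noisy copy of the filler by convolving with the involution of the role.

```python
import math
import random

def random_vector(n, rng):
    """Draw i.i.d. N(0, 1/n) elements, so the expected squared norm is 1."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

def bind(a, b):
    """Circular convolution: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

def involution(a):
    """Approximate inverse under circular convolution: a*[j] = a[-j mod n]."""
    n = len(a)
    return [a[(-j) % n] for j in range(n)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

rng = random.Random(0)  # arbitrary seed, for repeatability only
n = 512                 # illustrative dimension
role, filler, unrelated = (random_vector(n, rng) for _ in range(3))

bound = bind(role, filler)                # same dimension as its inputs
decoded = bind(involution(role), bound)   # noisy reconstruction of `filler`

sim_filler = cosine(decoded, filler)        # clearly positive
sim_unrelated = cosine(decoded, unrelated)  # close to zero
print(round(sim_filler, 2), round(sim_unrelated, 2))
```

The decoded vector is only approximately the filler, so in practice it is compared against a memory of known vectors and the closest one is taken. The quadratic-time convolution here is for readability; FFT-based convolution is the usual choice at scale.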

Misc

Feature construction for inconvenient data, made famous when word embeddings such as word2vec turned out to be surprisingly semantic. Note that word2vec has a complex relationship to its documentation.

Entity embeddings of categorical variables (code)

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics is unknown. Thus it is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead. As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering.
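
The idea in the abstract above can be sketched without any deep learning library. The following is a toy illustration under stated assumptions, not the paper's code: the category names, targets, embedding dimension, learning rate, and the single linear read-out are all hypothetical. Each category gets a small vector, the vectors are trained by gradient descent on a supervised loss, and categories with similar targets end up close together.

```python
import math
import random

# Hypothetical toy data: (category, target). The "n*" categories have high
# targets and the "s*" categories low ones; the model is never told this.
data = [("n1", 1.0), ("n2", 0.9), ("n3", 1.1),
        ("s1", -1.0), ("s2", -0.9), ("s3", -1.1)]
categories = sorted({c for c, _ in data})
dim = 2

rng = random.Random(0)
# Embedding table: one small random vector per category.
emb = {c: [rng.gauss(0.0, 0.01) for _ in range(dim)] for c in categories}
head = [0.5, -0.5]  # linear read-out weights (fixed non-zero start for the toy)
bias = 0.0

def predict(c):
    return sum(w * e for w, e in zip(head, emb[c])) + bias

def mse():
    return sum((predict(c) - y) ** 2 for c, y in data) / len(data)

initial_loss = mse()
lr = 0.1
for _ in range(2000):
    # Full-batch gradients of the mean squared error.
    g_emb = {c: [0.0] * dim for c in categories}
    g_head = [0.0] * dim
    g_bias = 0.0
    for c, y in data:
        err = 2.0 * (predict(c) - y) / len(data)
        for k in range(dim):
            g_emb[c][k] += err * head[k]
            g_head[k] += err * emb[c][k]
        g_bias += err
    # Apply the update only after all gradients are accumulated.
    for c in categories:
        for k in range(dim):
            emb[c][k] -= lr * g_emb[c][k]
    for k in range(dim):
        head[k] -= lr * g_head[k]
    bias -= lr * g_bias
final_loss = mse()

def dist(a, b):
    return math.dist(emb[a], emb[b])

within = dist("n1", "n2")   # two categories with similar targets
between = dist("n1", "s1")  # two categories with opposite targets
print(final_loss < initial_loss, within < between)
```

In a real model the read-out would be a deeper network and the embedding table would be one of several inputs, but the mechanism is the same: the embedding rows are ordinary parameters updated by backpropagation.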

Rutger Ruizendaal has a tutorial on learning embedding layers.

Embedding vector databases

Related: learnable indices.

Built on top of popular vector search libraries including Faiss, Annoy, HNSW, and more, Milvus was designed for similarity search on dense vector datasets containing millions, billions, or even trillions of vectors. Before proceeding, familiarize yourself with the basic principles of embedding retrieval.

Milvus also supports data sharding, data persistence, streaming data ingestion, hybrid search between vector and scalar data, time travel, and many other advanced functions. The platform offers performance on demand and can be optimized to suit any embedding retrieval scenario. We recommend deploying Milvus using Kubernetes for optimal availability and elasticity.

Milvus adopts a shared-storage architecture featuring storage and computing disaggregation and horizontal scalability for its computing nodes. Following the principle of data plane and control plane disaggregation, Milvus comprises four layers: access layer, coordinator service, worker node, and storage. These layers are mutually independent when it comes to scaling or disaster recovery.
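
As a reminder of what such a system computes, here is a minimal brute-force sketch of embedding retrieval with an optional scalar filter (the "hybrid search" idea). It is not the Milvus API: the `search` function, the record layout, and the toy vectors are hypothetical, and it scans every vector exactly, whereas libraries such as Faiss or HNSW use index structures to avoid that linear scan.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def search(index, query, k=2, predicate=None):
    """Exact top-k by cosine similarity, optionally pre-filtered on scalar fields."""
    candidates = [row for row in index if predicate is None or predicate(row)]
    ranked = sorted(candidates,
                    key=lambda row: cosine(row["vector"], query),
                    reverse=True)
    return [row["id"] for row in ranked[:k]]

# Hypothetical 3-d "embeddings", each with a scalar attribute attached.
index = [
    {"id": "a", "vector": [1.0, 0.0, 0.0], "year": 2019},
    {"id": "b", "vector": [0.8, 0.6, 0.0], "year": 2023},
    {"id": "c", "vector": [0.0, 1.0, 0.0], "year": 2023},
    {"id": "d", "vector": [0.0, 0.0, 1.0], "year": 2021},
]
query = [1.0, 0.05, 0.0]

nearest = search(index, query, k=2)
recent = search(index, query, k=2, predicate=lambda row: row["year"] >= 2021)
print(nearest, recent)  # → ['a', 'b'] ['b', 'c']
```

Filtering before ranking is only one possible design; a real engine may instead filter after an approximate search or interleave the two, depending on how selective the scalar condition is.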

Milvus Lite is a simplified alternative to Milvus with several advantages:

  • You can integrate it into your Python application without adding extra weight.
  • It is self-contained and does not require any other dependencies, thanks to the standalone Milvus' ability to work with embedded Etcd and local storage.
  • You can import it as a Python library and use it as a command-line interface (CLI)-based standalone server.
  • It works smoothly with Google Colab and Jupyter Notebook.
  • You can safely migrate your work and write code to other Milvus instances (standalone, clustered, and fully-managed versions) without any risk of losing data.

Other Software

  • word2vec

    This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

  • fastText

    fastText is a library for efficient learning of word representations and sentence classification.
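
The two word2vec architectures named above differ mainly in how training examples are cut from text. The sketch below shows only that step, under simplifying assumptions: the function names and the fixed symmetric window are illustrative, and the real tool adds further machinery such as frequent-word subsampling and negative sampling or hierarchical softmax, which is omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """(centre, context) pairs: skip-gram predicts each context word from the centre."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context bag, centre) examples: CBOW predicts the centre from its context."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cb[1])   # → (['the', 'brown'], 'quick')
```

Skip-gram yields one example per (centre, context) pair, while continuous bag-of-words yields one example per position with the context words pooled together.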

References

Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3 (Feb): 1137–55.
Boykis, Vicki. 2023. What Are Embeddings?
Cancho, Ramon Ferrer i, and Ricard V. Solé. 2003. “Least Effort and the Origins of Scaling in Human Language.” Proceedings of the National Academy of Sciences 100 (3): 788–91.
Cao, Hui, George Hripcsak, and Marianthi Markatou. 2007. “A Statistical Methodology for Analyzing Co-Occurrence Data from a Large Sample.” Journal of Biomedical Informatics 40 (3): 343–52.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. “Indexing by Latent Semantic Analysis.”
Gayler, Ross W. 2004. “Vector Symbolic Architectures Answer Jackendoff’s Challenges for Cognitive Neuroscience.” arXiv.
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. “A Closer Look at Skip-Gram Modelling.” In.
Herremans, Dorien, and Ching-Hua Chuan. 2017. “Modeling Musical Context with Word2vec.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017.
Kanerva, Pentti. 2009. “Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.” Cognitive Computation 1 (2): 139–59.
Kiros, Ryan, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. “Skip-Thought Vectors.” arXiv:1506.06726 [Cs], June.
Kleyko, Denis, Dmitri A. Rachkovskij, Evgeny Osipov, and Abbas Rahimi. 2022. “A Survey on Hyperdimensional Computing Aka Vector Symbolic Architectures, Part I: Models and Data Transformations.” ACM Computing Surveys 55 (6): 130:1–40.
Lazaridou, Angeliki, Dat Tien Nguyen, Raffaella Bernardi, and Marco Baroni. 2015. “Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation.” arXiv:1506.03500 [Cs], June.
Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of The 31st International Conference on Machine Learning, 1188–96.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781 [Cs], January.
Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation.” arXiv:1309.4168 [Cs], September.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In arXiv:1310.4546 [Cs, Stat], 3111–19. Curran Associates, Inc.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” In HLT-NAACL, 746–51. Citeseer.
Mitra, Bhaskar, and Nick Craswell. 2017. “Neural Models for Information Retrieval.” arXiv:1705.01509 [Cs], May.
Narayanan, Annamalai, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. “Graph2vec: Learning Distributed Representations of Graphs.” arXiv:1707.05005 [Cs], July.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12.
Plate, Tony A. 2000. “Analogy Retrieval and Processing with Distributed Vector Representations.” Expert Systems 17 (1): 29–40.
