Text processing

Information retrieval via string metrics. Part-of-speech tagging. Vector spaces induced by document structures, with measures such as cosine similarity, and word2vec-style embeddings.
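To make two of these ideas concrete, here is a minimal sketch of one string metric (Levenshtein edit distance) and of cosine similarity between bag-of-words vectors. The function names and the whitespace tokenizer are illustrative assumptions, not taken from any of the tools listed further down; a real system would use a proper tokenizer and probably TF-IDF weighting.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def bag_of_words(text: str) -> Counter:
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(cosine_similarity(bag_of_words("the cat sat"),
                            bag_of_words("the cat ran")))
```

The edit distance compares strings character by character, while the cosine measure ignores word order entirely; which is appropriate depends on whether the retrieval task is about near-duplicate spellings or about topical overlap.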

Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
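As a rough illustration of what a co-occurrence statistic might look like, the sketch below counts unordered pairs of symbols that appear within a fixed window of each other. It works on any sequence of strings, so the same code applies to words or to note names; the function name and the `window` parameter are hypothetical choices made for this example, and this is only one of many possible definitions of co-occurrence.

```python
from collections import Counter
from typing import Sequence


def cooccurrence_counts(sequence: Sequence[str], window: int = 2) -> Counter:
    """Count unordered symbol pairs that occur at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(sequence):
        # Pair each symbol with the next `window` symbols after it.
        for right in sequence[i + 1 : i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    print(cooccurrence_counts(words, window=2).most_common(3))

    notes = ["C", "E", "G", "C"]  # a toy melody, purely illustrative
    print(cooccurrence_counts(notes, window=1))
```

The resulting counts could be normalised into a similarity or distance between symbols, but how to do that well is exactly the open question raised above.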

If I were to actually write this entry, it would be a big research project.


  • Luke

    “Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

    Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways…”

  • whoosh

    “Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.”

  • xapian

  • sphinx

  • lemur


Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, et al. 2012. “Large Scale Distributed Deep Networks.” In Advances in Neural Information Processing Systems, 1223–31.
Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of The 31st International Conference on Machine Learning, 1188–96.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781 [Cs], January.
Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation.” arXiv:1309.4168 [Cs], September.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” In HLT-NAACL, 746–51. Citeseer.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12.
Rousseau, François, and Michalis Vazirgiannis. 2013. “Graph-of-Word and TW-IDF: New Approach to Ad Hoc IR.” In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 59–68. ACM.
