Text processing

April 26, 2015 — July 13, 2016


Information retrieval via string metrics. Part-of-speech tagging. Vector spaces induced by document structures, compared using measures such as cosine similarity, and word2vec-style embeddings.
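To make the vector space idea concrete, here is a minimal sketch of cosine similarity between two documents represented as bag-of-words count vectors. The names `tokenize` and `cosine_similarity` are my own, and the tokenizer is a deliberately crude assumption (lowercased alphanumeric runs); a real system would probably add stemming, stop words and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two documents."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("the cat sat", "the cat ran"))
```

Because the counts are non-negative, the score lies between 0 (no shared terms) and 1 (identical term proportions), and it ignores word order entirely.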

Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
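As a rough illustration of what a co-occurrence statistic could look like, the sketch below counts how often two distinct symbols appear within a fixed window of each other. Nothing here is specific to text: the symbols could be words or note names. The function name `cooccurrence_counts` and the `window` parameter are hypothetical choices for this example, not taken from any library.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct symbols at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is visited once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts


if __name__ == "__main__":
    melody = ["C", "E", "G", "C"]
    for pair, n in cooccurrence_counts(melody).items():
        print(sorted(pair), n)
```

Raw counts like these are only a starting point; turning them into a metric would need some normalisation, which is left open here.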

Figure 1

If I were to actually write this entry, it would be a big research project.

1 Software

  • Luke

    “Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

    Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways…”

  • whoosh

    “Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.”

  • xapian

  • sphinx

  • lemur

2 References

Dean, Corrado, Monga, et al. 2012. “Large Scale Distributed Deep Networks.” In Advances in Neural Information Processing Systems.
Le, and Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of The 31st International Conference on Machine Learning.
Mikolov, Chen, Corrado, et al. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781 [Cs].
Mikolov, Le, and Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation.” arXiv:1309.4168 [Cs].
Mikolov, Yih, and Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” In HLT-NAACL.
Pennington, Socher, and Manning. 2014. “GloVe: Global Vectors for Word Representation.” Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014).
Rousseau, and Vazirgiannis. 2013. “Graph-of-Word and TW-IDF: New Approach to Ad Hoc IR.” In Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management.