stringology on The Dan MacKinlay family of variably-well-considered enterprises
https://danmackinlay.name/tags/stringology.html
Recent content in stringology on The Dan MacKinlay family of variably-well-considered enterprisesHugo -- gohugo.ioen-usWed, 10 Jun 2020 16:19:39 +1000Text data processing
https://danmackinlay.name/notebook/text_data_processing.html
Wed, 10 Jun 2020 16:19:39 +1000https://danmackinlay.name/notebook/text_data_processing.htmlGeneral Munging jq yq PowerShell Nushell pxi d2d fx awk tab Searching Getting data in a text-like format gets you a whole world of weird tools to manage and process it.
General Data Cleaner’s cookbook explicates dataframe processing by laundering through CSV/TSV and using command-line fu. Fz mentions various tools including CSV munger xsv.
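As a rough sketch of the "launder it through CSV/TSV" idea, here is a minimal Python version using only the standard-library `csv` module. The function name `tsv_to_csv`, the column names and the sample rows are all made up for illustration; they are not taken from the cookbook or from xsv.

```python
import csv
import io

def tsv_to_csv(tsv_text, keep_columns):
    """Read TSV text, keep only the named columns, and return CSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Project each row down to the requested columns.
        writer.writerow({name: row[name] for name in keep_columns})
    return out.getvalue()

# Hypothetical sample data: three tab-separated columns.
sample = "id\tword\tcount\n1\tfoo\t3\n2\tbar\t5\n"
result = tsv_to_csv(sample, ["id", "count"])
print(result)
```

The command-line tools do this sort of column selection in one line; the sketch only shows what is happening underneath.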
Munging Here are some popular tools.
jq jq allows one to parse json instead of TSV.Natural language processing
https://danmackinlay.name/notebook/nlp.html
Sun, 07 Jun 2020 14:40:08 +1000https://danmackinlay.name/notebook/nlp.htmlWhat is NLP? Software Stanza Blingfire SpaCy pytorch.text NLTK NLP4J Misc other Computation language translation, parsing, search, generation and understanding.
A mare’s nest of intersecting computational, philosophical and mathematical challenges (e.g. semantics, grammatical inference, learning theory) that humans seem to be able to handle subconsciously and which we therefore hope to train machines on. Moreover it is a problem of great commercial benefit so it is likely we can muster the resources to tackle it.Computational symbolic mathematics
https://danmackinlay.name/notebook/computational_symbolic_maths.html
Fri, 05 Jun 2020 12:50:54 +1000https://danmackinlay.name/notebook/computational_symbolic_maths.htmlHow it works Tools Maxima Sympy PARI/GP Javascript I could write about how it works, but for now I mostly care about implementations that are available to me.
How it works Long story of which I understand only tiny fragments.
However, let us consider how it might be solved with neural nets (Lample and Charton 2019). The linked method looks like a heinous hack at first glance, but maybe it is suggestive of the potential of differentiable search in the future?Knowledge geometry
https://danmackinlay.name/notebook/knowledge_topology.html
Fri, 22 May 2020 10:46:01 +1000https://danmackinlay.name/notebook/knowledge_topology.htmlWhat is the shape of collected human knowledge? To Investigate, possibly related Topic modelling in text databases Artificial chemistry Related links See also:
Innovation Is a material basis for technology plus a knowledge topology equal to a model of technology? I suspect not - surely there are emergent effects. But there must be a relationship. Spaces of strings String dynamics Related question: What is the shape of the vocabulary of communicating people?MAPLE
https://danmackinlay.name/notebook/maple.html
Tue, 19 May 2020 08:00:51 +1000https://danmackinlay.name/notebook/maple.htmlThe other major computer symbolic algebra system (apart from Mathematica) which seems to have not quite as much traction because of… not having a messianic CEO? Having awful branding? It seems to be OK now that I look at it. In particular, it does what I expect regarding transforms of random variables.
Since everyone seems to know Mathematica, I guess I should describe it in terms of that? It is imperative-emphasis rather than functional emphasis; the upshot seems to be that if you want functional behaviour it has to be defined using the “inert” form?Voice transcriptions
https://danmackinlay.name/notebook/speech_transcription.html
Sat, 16 May 2020 06:50:51 +1000https://danmackinlay.name/notebook/speech_transcription.htmlDictation Transcribing recordings The converse to voice fakes: generating text from speech, a.k.a. speech-to-text.
This is an older practice than I thought. Check out Volume 89 of Popular Science Monthly: Lloyd Darling, The Marvelous Voice Typewriter, for the state-of-the-art dictation machine of 1916 (PDF version).
Dictation Speaking as a realtime interactive textual input method. See the following roundups of dictation apps to start:
Zapier dictation roundup; the rather grimmer Linux-specific roundup.Statistical relational learning
https://danmackinlay.name/notebook/statistical_relational_learning.html
Mon, 27 Apr 2020 21:45:09 +1000https://danmackinlay.name/notebook/statistical_relational_learning.htmlPlaceholder.
I cannot help but notice that the discussions of changing probabilistic domain, and unusual assumptions about exchangeability, are reminiscent of inference on social graphs. Connections?
See the big book.
Braz, Rodrigo de Salvo, Eyal Amir, and Dan Roth. 2008. “A Survey of First-Order Probabilistic Models.” In Innovations in Bayesian Networks, edited by Dawn E. Holmes and Lakhmi C. Jain, 156:289–317. Studies in Computational Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg.Diff/merge tools
https://danmackinlay.name/notebook/diffing.html
Mon, 09 Mar 2020 14:18:27 +1100https://danmackinlay.name/notebook/diffing.htmlDiff/merge GUIs Recursive diffs Tools to compare and harmonise folders/files.
Diff/merge GUIs Handy as a complement to, e.g. git.
Meld is an open source GUI merge tool. Free, cross-platform for Linux/Windows. A Mac fork exists.
Diffmerge is a classic cross-platform nagware merge tool. USD19 for a licence. Can be set up as a git merge tool.
kdiff3 is a long-well-regarded GUI, but it is somewhat hard to find in the nature of esoteric coder tools.Mathematica
https://danmackinlay.name/notebook/mathematica.html
Wed, 29 Jan 2020 15:16:25 +1100https://danmackinlay.name/notebook/mathematica.htmlBasics Pros Cons Tips Links A computer symbolic algebra system.
Basics I’m all about open-source tools, as a rule. Mathematica is not that. But the fact remains that the best table of integrals that exists is Mathematica, that emergent computation of the cellular automaton that implements Stephen Wolfram’s mind. I should probably work out what else it does, while I have their seductively cheap student-license edition chugging away.Bandit problems
https://danmackinlay.name/notebook/bandit_problems.html
Tue, 28 Jan 2020 13:07:59 +1100https://danmackinlay.name/notebook/bandit_problems.htmlPseudopolitical diversion Intros Theory Practice Bandits-meet-optimisation Bandits-meet-evolution Details Delayed/sparse reward Multi-world testing Extensions Deep reinforcement learning Markov decision problems POMDP Practicalities Sequential surrogate interactive model optimisation Bandit problems, Markov decision processes, a smattering of dynamic programming, game theory, optimal control, and online learning of the solutions to such problems, esp. reinforcement learning.
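To make the bandit setting concrete, here is a minimal sketch of an epsilon-greedy strategy on a two-armed Bernoulli bandit. This is only one textbook heuristic, chosen here for brevity; the function name, the reward probabilities and the parameter values are assumptions for illustration, not anything prescribed by the notebook.

```python
import random

def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy action selection.

    reward_probs: hypothetical success probability of each arm.
    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

counts, values = epsilon_greedy([0.1, 0.9])
print(counts, values)
```

The point of the exercise is that the learner only ever sees rewards for the arms it actually pulls, so it has to trade off exploring against exploiting.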
Learning, where you must learn an optimal action in response to your stimulus, possibly an optimal “policy” of trying different actions over time, not just an MMSE-minimal prediction from complete data.Applied string mangling
https://danmackinlay.name/notebook/string_mangling.html
Thu, 23 Jan 2020 13:20:11 +1100https://danmackinlay.name/notebook/string_mangling.htmlRegexp A.k.a. Un-natural language processing.
Comby is a parsing/search-and-replace thing designed for code.
Regexp (Image used under CC licence from Marijn Haverbeke’s Eloquent JavaScript.)
A.k.a. regexes. A.k.a. “regular expressions”, from a principled origin they presumably had in the theory of syntax. However, regexes as commonly encountered encode a particular way of specifying a language, rather than some arbitrary class of regular languages.
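For concreteness, a small sketch using one flavour, Python's built-in `re` module. The patterns and sample strings are invented for illustration. The second pattern uses a backreference, a common regex feature that goes beyond what regular languages in the formal sense can express.

```python
import re

# Match ISO-style dates such as 2020-06-10 and capture year, month, day.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

text = "Updated 2020-06-10, first posted 2016-07-13."
dates = DATE.findall(text)
# Rewrite each date as day/month/year using the captured groups.
swapped = DATE.sub(r"\3/\2/\1", text)

# A backreference (\1) matches whatever group 1 matched: here, a doubled word.
DOUBLED = re.compile(r"\b(\w+) \1\b")
doubled = DOUBLED.search("it is is fine").group(1)

print(dates, swapped, doubled)
```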
The default flavour of string matching, available in a variety of flavours, all equally boring.*-omics
https://danmackinlay.name/notebook/star_omics.html
Wed, 22 Jan 2020 19:00:29 +1100https://danmackinlay.name/notebook/star_omics.htmlI do not truly understand the Roche biochemical pathways poster.
Proteomics, genomics, phenomics, connectomics. On the understanding and inference of networks of control in living systems using statistics. Generates lots of interesting problems at the nexus of various other statistical problems, like model selection, false discovery rates, causal graphs and so on.
Of course, there is a deep learning angle.
Is nilearn any good?
Gao, Chuan, Ian C. McDowell, Shiwen Zhao, Christopher D.Esoteric language zoo
https://danmackinlay.name/notebook/esolang.html
Fri, 27 Dec 2019 22:48:10 +1100https://danmackinlay.name/notebook/esolang.htmlIverson and Whitney Pure Brainfuck INTERCAL If you want to find more about the weird ends of this hobby, see retrocomputing or the esolang wiki.
I’m mostly documenting a couple of weird but not completely abstruse ones here.
Iverson and Whitney Arthur Whitney is, I think, the creator of the odd ill-explained k and b languages which make impressive claims about performance and unimpressive claims about community, support and longevity.Algorithmic statistics
https://danmackinlay.name/notebook/algorithmic_statistics.html
Tue, 15 Jan 2019 10:58:55 +1100https://danmackinlay.name/notebook/algorithmic_statistics.htmlInformation-based complexity theory The intersection between probability, ignorance, and algorithms, butting up against computational complexity, coding theory, dynamical systems, ergodic theory, and probability. When is the relation between things sufficiently unstructured that we may treat them as random? Stochastic approximations to deterministic algorithms. Kolmogorov complexity. Compressibility, Shannon information. Sideswipe at deterministic chaos. Chaotic systems treated as if stochastic. (Are “real” systems not precisely that?) Statistical mechanics and ergodicity.Models of computation
https://danmackinlay.name/notebook/models_of_computation.html
Sun, 18 Jun 2017 08:51:45 +0800https://danmackinlay.name/notebook/models_of_computation.htmlEverything is Turing-complete Weird stuff Rewriting Systems Everything is Turing-complete Surprisingly Turing Complete
Many configuration or special-purpose languages or tools or complicated games turn out to violate the Rule of least power & be “accidentally Turing-complete”, like MediaWiki templates, sed or repeated regexp/find-replace commands in an editor (any form of string substitution or templating or compile-time computation is highly likely to be TC on its own or when iterated since they often turn out to support a lambda calculus or a term-rewriting language or tag system eg esolangs “///” or Thue), XSLT, Infinite Minesweeper, Dwarf Fortress, Starcraft, Minecraft, Ant, Transport Tycoon, C++ templates & Java generics, DNA computing etc are TC but these are not surprising … On the other hand, the vein of computer security research called “weird machines” is a fertile ground of “that’s TC?Granger causation/Transfer Entropy
https://danmackinlay.name/notebook/transfer_entropy.html
Thu, 04 May 2017 00:19:48 +1000https://danmackinlay.name/notebook/transfer_entropy.htmlWhy do we care about this model of causation? Estimating from data Transfer entropy is one way of learning the arrow of time.
tl;dr I’m not currently using Transfer Entropy so should not be taken as an expert. But I have dumped some notes here from an email I was writing to a physicist, explaining why I don’t think it is, in general, a meaningful thing to estimate from data “non-parametrically”.State space reconstruction
https://danmackinlay.name/notebook/state_space_reconstruction.html
Tue, 02 Aug 2016 11:51:55 +1000https://danmackinlay.name/notebook/state_space_reconstruction.htmlSome stuff I saw that’s maybe related Stuff that I might actually use Disclaimer: I know next to nothing about this.
But I think it’s something like: Looking at the data from a, possibly stochastic, dynamical system, and hoping to infer cool things about the kinds of hidden states it has, in some general sense, such as some measure of statistical or computational complexity, or how complicated or “large” the underlying state space, in some convenient representation, is.Rummaging in string bags
https://danmackinlay.name/notebook/string_bags.html
Wed, 13 Jul 2016 13:50:58 +1000https://danmackinlay.name/notebook/string_bags.htmlBags of words, edit distance (as seen in bioinformatics), Hamming distances, cunning kernels and vector spaces over documents. Vector spaces induced by document structures. Metrics based on generation by finite state machines, *-omics. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
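Two of the string metrics mentioned here are short enough to sketch directly. The following is a plain-Python illustration of Hamming distance and of Levenshtein edit distance (the standard dynamic-programming version, with unit costs for insertion, deletion and substitution); the function names and test strings are chosen for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(hamming("GATTACA", "GACTATA"))     # 2
print(levenshtein("kitten", "sitting"))  # 3
```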
TBC.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022. http://www.Text processing
https://danmackinlay.name/notebook/information_retrieval.html
Wed, 13 Jul 2016 13:50:58 +1000https://danmackinlay.name/notebook/information_retrieval.htmlSoftware Information retrieval via string metrics. Speech tagging. Vector spaces induced by document structures, such as cosine similarity and word2vec-style embeddings.
Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
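As a minimal sketch of the "vector space induced by documents" idea, here is cosine similarity over bag-of-words counts in plain Python. Tokenising by lowercasing and splitting on whitespace is a crude assumption made for brevity, and the example sentences are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts (naive whitespace tokenisation)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty documents
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("the cat sat on the mat")
doc2 = bag_of_words("the cat ate the rat")
sim = cosine_similarity(doc1, doc2)
print(round(sim, 3))
```

Embedding methods in the word2vec style replace these sparse count vectors with learned dense ones, but the similarity computation is the same.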
If I were to actually write this entry, it would be a big research project.
Software Luke
“Lucene is an Open Source, mature and high-performance Java search engine.Syntax
https://danmackinlay.name/notebook/syntax.html
Wed, 22 Jun 2016 09:43:15 +1000https://danmackinlay.name/notebook/syntax.htmlWhat’s so special about speech anyway?
Sam Kriss calls the spamularity the language of god. See also Feral, Thomas Urquhart, natural language processing.
“They're using phrase-structure grammar, long-distance dependencies. FLN recursion, at least four levels deep and I see no reason why it won’t go deeper with continued contact. […] It doesn’t have a clue what I’m saying.”
“What?”
“It doesn’t even have a clue what it’s saying back,” she added.Stream processing and reactive programming
https://danmackinlay.name/notebook/stream_processing.html
Wed, 01 Jul 2015 13:42:25 +0200https://danmackinlay.name/notebook/stream_processing.htmlCSP/ FRP/ reactive programming Javascript Python Streaming data analysis To read Lazy bookmark for practical details to processing and transforming possibly-infinite streams of data, from signals to parse trees. Disambiguating “transducers”.
Used in parallel/offline processing of very large data sets that do not fit in core, or processing things that happen in realtime such as UI.
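A minimal sketch of the idea in Python, where generators give lazy streams for free: the transformation below consumes a possibly-infinite stream one item at a time and keeps only a bounded window in memory. The function name `windowed_mean` and the choice of a running mean are illustrative assumptions, not a reference to any particular streaming library.

```python
import collections
import itertools

def windowed_mean(stream, size):
    """Lazily yield the mean of the last `size` items seen so far."""
    window = collections.deque(maxlen=size)  # old items fall off the left
    for x in stream:
        window.append(x)
        yield sum(window) / len(window)

# itertools.count() is an infinite stream 0, 1, 2, ...; islice takes a prefix.
first = list(itertools.islice(windowed_mean(itertools.count(), 3), 5))
print(first)  # [0.0, 0.5, 1.0, 2.0, 3.0]
```

Nothing is computed until a consumer asks for the next value, which is what makes the same code usable for data that does not fit in core.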
I am imagining more general objects than singly-indexed real-valued signals; Tokens, maybe.Artificial chemistry
https://danmackinlay.name/notebook/artificial_chemistry.html
Sun, 31 May 2015 15:31:30 +0200https://danmackinlay.name/notebook/artificial_chemistry.htmlP-systems and membrane computing the Broadcast language Systems which allow interacting particles with string representations, that interact. Distributed or agent-based models for stringology. This might remind us of evolution, or chemistry, or computational learning agents or whatever. Is there a name for this family of systems?
These are popular as models for… Understanding what kind of computing nature might be doing? Or as a source of biomimetic algorithms.Grammatical inference
https://danmackinlay.name/notebook/grammatical_inference.html
Thu, 19 Feb 2015 16:51:01 +0100https://danmackinlay.name/notebook/grammatical_inference.htmlThings to read Mathematically speaking, inferring the “formal language” which can describe a set of expressions. In the slightly looser sense used by linguists studying natural human language, discovering the syntactic rules of a given language, which is kinda the same thing but with every term sloppier, and the subject matter itself messier.
This is already a crazy complex area, and being naturally perverse, I am interested in an especially esoteric corner of it, to wit, grammars of things that aren’t speech; inferring design grammars, say, could allow you to produce more things off the same “basic plan” from some examples of the thing; look at enough trees and you know how to build the rest of the forest, that kind of thing.Computational mechanics
https://danmackinlay.name/notebook/computational_mechanics.html
Fri, 02 Jan 2015 20:33:58 +0100https://danmackinlay.name/notebook/computational_mechanics.htmlTo read To understand See also:
informations algorithmic statistics computational complexity stochastic automata grammatical inference To read Decisional states
“This article introduces both a new algorithm for reconstructing epsilon-machines from data, as well as the decisional states. These are defined as the internal states of a system that lead to the same decision, based on a user-provided utility or pay-off function.”
CRS’s CSSRAlgebra I would like to learn
https://danmackinlay.name/notebook/group_theory.html
Sat, 22 Nov 2014 10:26:38 +0100https://danmackinlay.name/notebook/group_theory.htmlStringology Probabilistic Cycles in a random permutation Large prime factors of a random number Connection to symmetries Stringology Long story. Group theory for languages and automata.
Properties of the free group. (because of the stringology thing) Cayley Graphs. (because of the stringology thing) Probabilistic Probabilistic methods in algebra (as opposed to algebraic methods in probability).
Cycles in a random permutation The number of cycles in a random permutationDesign grammars
https://danmackinlay.name/notebook/design_grammars.html
Mon, 30 May 2011 04:36:58 +0000https://danmackinlay.name/notebook/design_grammars.htmlSee also grammatical inference, syntax.
In computer graphics these are also called “procedural design” (that being slightly more general), or L-systems.
Prusinkiewicz and Lindenmayer (the L in “L-systems” ) had success describing plants and seashells and other CGI-friendly lifeforms as grammars. Lerdahl and Jackendoff applied these ideas to music. Look around for applications to primatology, genetic programming, gene expression, dynamical systems, Barnsley et al and their fractal image compression…
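To show what "describing a lifeform as a grammar" means mechanically, here is a minimal sketch of a deterministic, context-free L-system: every symbol is rewritten in parallel at each generation. The rule set is Lindenmayer's well-known algae example; the function name `lsystem` is just an illustrative choice.

```python
def lsystem(axiom, rules, generations):
    """Apply the rewrite rules to every symbol in parallel, repeatedly."""
    s = axiom
    for _ in range(generations):
        # Symbols without a rule are copied through unchanged.
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Lindenmayer's algae system: A -> AB, B -> A.
ALGAE = {"A": "AB", "B": "A"}

for n in range(5):
    print(n, lsystem("A", ALGAE, n))
# String lengths follow the Fibonacci numbers: 1, 2, 3, 5, 8, ...
```

Graphical L-systems add one more step, interpreting the resulting string as drawing commands, which is omitted here.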