Research proposal: Grammars of music

2012-11-11 — 2012-11-11

Wherein the question of musical grammars is posed, and probabilistic context-free grammars are proposed to be inferred from symbolic corpora such as MIDI and the Million Song Dataset to compare against Markov models.

grammar

language

music

nlp

project

proposal

Historical note

This is a research proposal that never got off the ground. I shelved this project and moved on. It is included here as a kind of memorial to the months I spent studying the area ineffectually, and I would not start this project in remotely the same way these days.

1 Introduction

This is a step towards understanding the relationship between music and language, using tools from machine learning and linguistics. The specific question that I ask to gain this understanding is: Can we model music with an automatically-inferred context-free grammar?

Such a technical question requires background. It is necessary to unpick several big ideas in the scientific studies of language and music to define the terms in it. By trying to “infer a context-free grammar,” I am proposing to import some specific tools from modern linguistics to the understanding of music.

To understand the merits of this approach, consider the similarities between these domains. On the face of it, it seems hard to find universal rules that span all musics. Each musical tradition has its own set of rules and conventions; The Javanese slendro tuning might sound dissonant to a Western composer, but not to a Javanese gamelan player. Or Top 40 pop might sound simplistic and limited to a bebop musician, and yet sell millions of copies to the public. Music is structured into different traditions with different audiences that look not so different to the “languages” and “dialects” into which verbal communication is organised. Yet, that is precisely the situation with the modern study of language.

This is just one of the many parallels between music and language.

Taking this analogy between language and music further, we discover approaches that might work to cast light on this question: We might, for example, hope to discover a Universal Grammar to model musical traditions in the same way that linguists seek a Universal Grammar to model natural language.

In grammatical research programs, we infer a grammar, a set of syntactic rules that describe all the permitted sentences, for many different languages, then attempt to infer the universal rules that constrain the similarities and differences between the various grammars. Music has often been subjected to attempts to explain its construction in terms of language-like grammatical rules (e.g. 🚧TODO🚧 clarify).

Such earlier projects seem to offer the potential to discover whether the relationship between musical styles runs deeper than just these superficial similarities, by quantifying how similar the underlying organisational principles of each are. Comparing the structure of rule sets used in music and in language gives us evidence of the nature of cognitive capacities that it engages, and allows us to question whether it is the same cognitive constraints used in each, which is of interest in understanding the affordances of human cognition (🚧TODO🚧 clarify).

However, attempts thus far have been limited in several ways that I propose to address in this thesis to allow a true Universal Musical Grammar research program. Previously, the labour intensity of manually creating grammars has limited our ability to answer this question in the large, and we have been restricted to provocative but isolated hand-crafted efforts which out of necessity cannot hope to answer universal questions. (e.g. 🚧TODO🚧 clarify)

Moreover, these particular earlier attempts restricted themselves to non-probabilistic grammars, and yet the latest development in linguistics has benefited from modern probabilistic inference, which is often markedly easier than non-probabilistic inference, to the point that large-scale automated inference of language structure is possible from text.

There is another strand of research in the same realm, but with a slightly different choice of tool set: inferring probabilistic models of music, but restricting the set of models to simple Markovian models. (e.g. 🚧TODO🚧 clarify and 🚧TODO🚧 clarify).

Taken together, these two limitations of previous musical grammar work suggest a natural way to extend the research program: Can an automatically inferred context-free model do better, in the sense of giving us more compact and more general descriptions? Specifically, this thesis proposes to extend the tools of musical grammar research to include the latest developments in modern linguistics: Automated inference of stochastic grammars from musical data.

By creating automated processes to examine this question we can use large data sets and use fewer presuppositions, and hope thereby to create answers that are both more general and more precise. Answering the basic research question has many concrete benefits, both in terms of other questions that it will enable us to answer in the future, and in immediate practical use.

By inferring the “structure” of a given style of music in the form of a grammar we might hope to compare the styles of music in the same way that linguists can compare dialect differences. By training inference on different musical styles much as automated grammatical inference programs are trained on a given language corpus, we can identify rules that apply across a genre and quantify differences between them. We can measure how consistent those grammars are across composers, genres, and time.

Practically, if we can find compact, descriptive rule sets that characterise given musical styles, then we have a new tool to measure musical similarity, and improve the results of processes that depend upon them, such as musical recommendation systems. Creating such rule sets would enable novel methods for generating music in a given style, simply executing the rules with our desired input parameters.

From there it is a small step to generating new hybrid styles from empirical data by combining rule sets, as practised with manually-crafted design grammars in the visual (🚧TODO🚧 clarify) and musical (🚧TODO🚧 clarify) domains. Uncovering regular rules could improve the efficiency of musical machine-learning tasks, for example by informing machine listening algorithms with priors based on rule sets, thereby addressing the “semantic gap” in modern Music Information Retrieval applications between signal processing and cognitive understanding of music. In short, the question is of interest, for theoretical understanding of the human experience, and for practical use in the realms of musical information retrieval and digital creativity.

2 Literature review of generative grammar and music

2.1 Abstract formal languages and automata

2.1.1 Classic formal languages

The studies of languages as objects of mathematical curiosity have roots in formal logic (e.g. 🚧TODO🚧 clarify). These theorists created various approaches to understanding how strings of symbols, analogous to words or to letters, could be operated upon by various transformations, as a way of understanding how mathematical or logical propositions could interact in the process of logic itself. Among these systems, the most fully-featured ones, such as Turing’s universal machine, are posited to be capable of representing any symbolic calculation in mathematics.

The various abstract systems were combined in the modern setting by the framework of the linguist Noam Chomsky, who created a hierarchy of such systems in terms of their expressive power, ranging from simple deterministic robots up to full-powered general-purpose computing systems. The Chomsky hierarchy was developed in the hope of quantifying the complexity of “natural,” which is to say, typical human languages, on this scale of expressive power but has found far wider application.

In Chomsky’s formalism, a grammar is a tuple of:

a finite set of terminal symbols, which is also called an alphabet
a finite set of preterminal symbols
a finite set of production rules which instruct how a given combination of symbols can be rewritten as a different set.
a start symbol

A language is a list of all permitted strings of terminal symbols.

A given grammar is said to generate all the sentences in its corresponding language. By this we mean, a sentence belongs to a given language if, and only if, it may be constructed by applying the rules in that language’s grammar. In a given human language, such as English, the elements of the grammar correspond to

A dictionary of words or, more properly, morphemes
A set of phrase types, such as “Noun Phrase,” “Verb Phrase” and “Sentence”
The rules of English grammar. So, we might have the rule \(Sentence \rightarrow Noun Phrase + Verb Phrase\)
The root symbol in human languages is taken to be a “Sentence”

The language, English, in this formalism, is the list of all permissible English sentences.

By putting different restrictions on the type of rules in a given language’s grammar, we may constrain the complexity of that language, and it is this tactic that induces Chomsky’s hierarchy. In his system, for example, the most highly constrained class of languages, the regular languages, may only possess production rules of the form \(A\rightarrow a\) or \(A\rightarrow aB\). (Here preterminal symbols are written in capitals, and terminal symbols in lower case, and arbitrary lists of each are written as Greek letters).

As a concrete example, we can define a simple regular language, with just two rules:

\(A \rightarrow aA\)
\(A \rightarrow b\)

From the start symbol \(A\), this grammar can generate all the strings \(b, ab, aab, aaab, aaaab…\) by repeated application of the rules.

Some thought reveals, however, that no regular grammar can recognise language that looks like this: \(ab, aabb, aabb, aaabbb, aaaabbbb…\). i.e. we cannot find a finite number of rules that include all (and only) the strings of so-many a followed by the same number of b. To generate such a language, we need a more powerful class of grammar, the “context-sensitive” grammar, which allows any rule of the form \(A\rightarrow \gamma\). Then we may choose the following rules, where \(\lambda\) is, by convention, an empty string:

\(A \rightarrow aAb\)
\(A \rightarrow \lambda\)

More generally, Chomsky’s hierarchy contains 4 Types, with the regular languages, and the context-free languages as Type 4 and Type 3 respectively.

Formal language theory is closely linked to the complementary field automata theory, the study of abstract mathematical machines that transform one string of symbols into another.

For every language type in the hierarchy there is a corresponding automaton, an abstract mathematical machine, which can recognise the language — that is, which can answer the question of whether a given sentence belongs to the language of a given grammar, or not. Equivalently, such an automaton may generate permissible sentences from that language.

The most famous class of automata is likely the Turing machine, although there are many more. Generalisations of these basic automata to discrete systems other than strings exist, most famously cellular automata.

In terms of automata theory, the question: “How powerful a grammar is needed to represent a given language?” may be transformed into an equivalent question: “How powerful an automaton is needed to correctly recognise whether or not a sentence belongs to a given language?” or, “How powerful an automaton is needed to generate all the sentences in a given language?”

There exist many theorems linking particular automata constructions to particular grammatical constructions, and many questions that may be answered within either formalism.

Throughout this document, the equivalences of these systems will be taken as given except where noted, and the most convenient terminology for the topic in hand will be used.

The classic Chomsky hierarchy is summarised in the following table:

In fact, modern automata theory admits of many categories of automata between the classic types listed here, which can recognise, in turn, languages intermediate in complexity between these. For the current purposes the classic Chomsky hierarchy gives us enough landmarks to locate the complexity of a given production system of interest, and we will not attempt to exhaustively document every class of automaton considered by modern theory, except to note that various sub-types will turn out to be important in later applications, and that these will be introduced as necessary. (But see 🚧TODO🚧 clarify and 🚧TODO🚧 clarify for overviews.)

We are concerned not only with the use of automata to recognise languages, but also automata to produce examples of a language — that is, with constructing machines whose output is a valid “sentence” in our language of choice. General automata capable of such general processes as not simply accepting or rejecting a sentence but outputting arbitrary output such as whole new sentences are referred to as “transducers” in the literature.

I will not make much of this distinction here, however, as there exist reasonably straightforward translations between recognising and transducing automata.

The further one descends the hierarchy, the more powerful the class of automaton that is required to recognise it. Specifically, the lower one is on the hierarchy, the higher the upper bound on the memory required by that recognising automaton relative to the size of the input.

A regular language may be recognised by an automaton with no memory other than its current state.

Recognising a context-free language requires a “stack,” an internal list of states. Recognising a context-sensitive language requires storage capacity proportional to the length of the input sentence. And to recognise a general recursively enumerable language can require the unbounded storage capacity of a Turing machine.

Thus, in the study of human language, to recognise a sentence as, say, a valid English utterance, one has to know not only that the words within it are valid English words, but that the sentence may be “parsed” into a legal application of the grammatical rules of English.

Thus, if we posit that the following are grammatical rules in English:

\(S \rightarrow NP VP\)
\(N \rightarrow NAME\)
\(N \rightarrow DET N\)
\(VP \rightarrow V NP\)

If we further stipulate that “John” is of lexical type NAME, “fed” is of type V, “the” is of type DET and “numbat” is of type N, then the sentence “John fed the numbat” is in the English language, since it may be parsed according to those rules as follows:

By the same token, “the numbat fed John” is also permissible, but not, on the other hand, “numbat John the fed.”

The terminal symbols need not be words or classes of words, but all grammars possess these basic structures, and musical grammars are typically similar in appearance. Here, for example, is part of the musical grammar of de Haas’s model used on harmonic similarity of jazz tunes:

\(Piece \rightarrow P+\)
\(P \rightarrow PCP\)
\(P \rightarrow HCP\)
\(PCP \rightarrow d t+\)
\(PCP \rightarrow t d t\)
\(HCP \rightarrow t+ d\)
…

In this grammar, d and t are taken to indicate, approximately, dominant and tonic chords, rather than words, but the linear structure of the piece is posited to be understandable in terms of a hierarchical parse much like an English sentence.

For brevity, throughout this document I will use “parse” to mean “inferring a permissible application of grammatical rules for a given production using a given grammar,” without presupposing a particular degree of complexity for the grammar in question. That is, although “parse trees” are often used to imply that the grammar in question is at least context-free in power, I will not assume so here.

I will also use variously “production,” “sentence” or “piece” synonymously to indicate entities contained within a language. “Production” is the most general, but when I wish to emphasise grammars for natural language, or for music, the latter two terms are useful as well.

2.1.2 Stochastic Formal Languages

Generalising all these grammars are probabilistic, or stochastic, versions where production rules in the system are decorated with weights which give the probability that any matching rule will be chosen. These generalised systems are ascendant in modern machine processing of language for many reasons such as ease of inference from data and graceful degradation under poor input. Moreover, they are the foundational tool upon which this thesis hopes to improve over established work in the field, and so they will be expanded at length here.

We can assign explicit probabilities to each transition within a grammar (or, alternatively, to each possible generated string).

The field was formalised in modern terms for regular languages by Rabin and for general grammars by Booth and Salomaa. but practical uses of such probabilistic automata predate those. Notably, Shannon’s mathematical theory of communication

used finite-alphabet Markov processes, corresponding to stochastic finite-state automata, in his approximating models for human language (following Markov’s 1913 lead applying Markov process to analysing Pushkin).

Later, an influential paper by von Neumann

considered the problem of constructing deterministic machines from probabilistic ones.

The use of such objects has only proliferated since then, as has terminology to describe them. Vidal et al. identify a huge number of formally equivalent terms used to describe these objects, depending on the discipline and whether one wishes to emphasise, for example, recognition, generation, or inference.

For Type-3 probabilistic automata, we can refer to probabilistic finite-state automata, stochastic regular grammars, Markov chains, n-grams, probabilistic suffix trees, deterministic stochastic or probabilistic automata, or weighted automata.

Such grammars not only distinguish whether a given sentence is a possible production of a given grammar, but they also supply us with some notion of the probability that such a sentence could have been produced by one or another grammar. Or, upon our observation of a particular sentence with multiple possible parses, we can calculate the relative probability the sentence resulted from each of the possible candidate parses.

Regular Languages and Context-Free Languages, and their associated automata, Markov machines, and Stochastic Stack Automata, occur regularly in the study of languages. Stochastic generalisations can equally apply to more general automata, and even Turing machines have fully probabilistic formulations. More powerful machines are capable of general stochastic calculation just as their non-stochastic peers are capable of non-stochastic calculation, and one may even emulate the other.

These general versions are rarely used in a grammatical inference context because of the difficulties in general inference and will accordingly not be discussed further here.

Recently, stochastic automata and languages have received attention from outside linguistics, in the study of stochastic dynamical systems, in a field frequently called “computational mechanics”.

Dynamical systems include such objects as classic chaotic attractors like the Lorenz “butterfly”, the value of an American option in the Black-Scholes model, or the stochastic predator-prey model.

Classical, real-valued dynamical systems may be quantised into “finite alphabet” processes and modelled as outputs of stochastic automata. Such a quantisation can in certain cases be statistically equivalent to even a continuous measured process. For our purposes, a key point is to note this field has produced additional automated procedures for inferring such automata from the data,

and that such methods are frequently more parsimonious than those of the natural language literature, avoiding such complexities as phonetics and lexicons, while still being informed by the same underlying theoretical principles.

2.2 Formal languages supplemented by symbol interpretations

In natural language and dynamical systems, the list of symbols produced by a grammar has a straightforward interpretation, given as part of the definition of the system. In natural language, for example, the terminal nodes are morphemes — roughly, words. In dynamical systems, a symbol might correspond to a certain range of an observable output variable. There are important generalisations of these systems, however, where interpretations of the output symbol are not so straightforward, adding complexities to the inference process. In musical grammar studies, there are no universally accepted interpretations of terminals, and they may be, for example, complex timing and consonance rules, simple semi-tonal intervals, or whole chords with respect to some root note.

In the absence of such fixed conventions for musical grammar systems, it is worthwhile considering the broader class of grammatical systems which includes both grammatical rules and domain-specific interpretations of the terminals of those rules. There are a number of examples of these systems of note in the literature, usually in the visual domain.

2.2.1 L-systems

Named in honour of their originator, botanist Aristid Lindenmayer, L-systems,

have been influential as a compact representation of the growth and morphology of plants. Recently, they have been commercially successful in computer graphics as design systems for trees, and for, e.g. architectural design processes.

These systems form a parallel hierarchy of formal languages with close relations to the Chomsky hierarchy. They differ in details; L-systems make no distinction between terminal and preterminal symbols, and their rules are applied in parallel, simultaneously transforming all symbols by the matching rules rather than sequentially applying them as in Chomsky grammars.

There are also subtle terminological differences; a “deterministic” L-system (or “D0L-system”) is a non-stochastic L-system, rather than a deterministic automaton in the computer science sense. There are several formulations of the L-systems at different levels of grammatical complexity.

The original “D0L” formulation is closely related to, though slightly more general than, a Context-Free Grammar, while the “D1L, D2L, D3L…” versions support stronger degrees of context-sensitivity. They are associated with two other features of relevance to our purposes here which are less often seen in classic grammars:

Firstly, L-systems typically concern themselves not only with rewriting operations upon a string of symbols, but, in addition, the interpretation of such symbols. Typically, terminal symbols are expressed as instructions to a turtle.

Turtles are a convention in computer graphics, originating in the work of Seymour Papert on the Logo teaching programming language. Some symbols draw particular shapes, such as a leaf or petal, and others are, for example, instructions to rotate, scale, or alter the colour of subsequent drawing instructions.

The result is that while the symbol string might indicate the structure of a synthesised plant, it is this interpretation which determines the shape and appearance of the image representing it. In the study of musical grammars, although the interpretation of symbols must necessarily be different, the issue of choosing meaningful interpretations remains pertinent.

Secondly, L-systems possess a variant with no equivalent in automata-based languages, the bracketed form. Bracketed L-Systems bless a particular pair of symbols, typically the square brackets, as instructions to push or pop a graphics context stack, enabling the turtle cursor to draw one component of the L-system before returning to a prior location to draw another. These constructions are usually used to draw branches in branching systems, such as models of trees.

In classic sequential automata operating on linear strings, which have no notion of concurrent expression of rules, such constructions are unnecessary. In grammars intended to, potentially, express multiple concurrent streams of symbols, say, a melody and percussion line, in which the order in which the string is interpreted as graphical instructions changes the output, such constructions are important.

2.2.2 Shape Grammars

Related to the parallel literature in L-system synthesis, there exists a further parallel literature on grammars having elements other than abstract symbols. The so-called “shape grammars” operate upon geometric shapes and were conceived as an aid to generating novel shapes in design. Exemplars of this tradition include originators, Gips and Steiny,

who applied them to two-dimensional geometry,

who interpreted terminals as subdivisions of n-dimensional volumes, and Leyton

to deformation curves of plastic objects. This field has now, however, been largely absorbed on one side into the study of L-systems and on the other into Iterated Function systems, and will not be treated further here.

2.2.3 Iterated Function Systems and Superfractals

Equivalent in the limit to deterministic L-systems,

but constructed using formalism with some surface differences, the fractal forms of Iterated Function Systems or IFS

are also design grammars. The class of IFS includes the classic Barnsley Fern and a broad class of systems using iterative application of transformations on the basis of a design grammar. More recently, a generalised IFS called the “V-Variable fractal”, related to the random fractal,

has been introduced which has the same formal relation to stochastic L-systems as does its non-probabilistic peer to non-probabilistic L-systems. While much work has gone into the rapid production of L-systems, it is IFSs that have dominated the efforts of research into inference, and thus they are both deserving of coverage in this proposal.

2.3 Learning Grammars

Distinct from the generation or the recognition of a given language is the problem of learning, or inferring the language — working out from examples what the rules of the language might be. This topic has been the subject of intensive research since the origin of the field and is a central component of the project proposed here.

One of the earliest and most famous results is Gold’s theorem, that places bounds upon how accurately a given language may be learned, where learning means, roughly, to know the precise form of a language’s grammar in a finite time. Gold famously showed that if a language learner is able to both receive examples of valid sentences, and to ask and receive negative feedback, that is to ask if particular sentences are valid or invalid, then they may learn any grammar in the Chomsky hierarchy down to Type-1, the Context-Sensitive Languages. In the absence of negative feedback, though, not even the Type 3 regular languages are learnable. Although considered a significant problem at the time, Gold’s definition of learning is nowadays considered too strict, and it was only shortly afterward that the addition of probabilities to language rendered the results a curiosity. Horning, in his dissertation,

showed learnability of probabilistic context-free grammars with arbitrary precision, even in the absence of negative evidence. This result is considered (e.g. by

) as a retrospective turning point, in that it demonstrated the utility of probabilistic formulations of language, which formulations now dominate industrial natural language processing.

Modern probabilistic language learning does not just learn a simple binary classification of productions into those inside and outside of a language. In the framework of probabilistic parsing, one infers a likelihood for a given production, based on learned weightings for particular rules, and where possible, does not depend on supervised learning, but makes do with positive examples.

This said, non-probabilistic grammars have also seen major developments. Based on the observation that grammars were supposed to capture not just expressiveness but learnability, theorists have attempted to rank languages based on their learnability rather than just their complexity. Prominently, new categories such as Distributional Lattice Grammars and Substitutable Languages, which included even some context-sensitive languages,

can be learned with efficient polynomial time algorithms. Other tractably learnable classifications, such as reversible languages,

are also known, and overviews of these classes and others are available in

and

At lower levels of the Chomsky hierarchy, in the context-sensitive, and recursively enumerable languages, just as the recognising automaton grows more complex, so too does the process of “learning” the rules of the grammar also become more demanding.

Without additional constraints upon the form of the grammar, learning them can become fraught. Chaitin’s incompleteness theorem, for example,

proves that there exists a certain grammar size, such that no string of symbols can be shown to require a grammar of greater size to express it. This follows from the fact that higher-order grammars can in fact implement arbitrary Turing machines and thus run aground on the various shoals of computational complexity theory.

Because of these pathologies, no attempt will be made to formulate such high-complexity automata in this project.

For more tractable grammars, there exists a large literature on techniques of inference for the case of natural languages, including

and

For some grammatical constraints, convergence proofs are known, and for others heuristics are employed, with in-practice good performance despite their lack of theoretical guarantees.

Those results are from natural language processing; but in fact grammatical inference is involved in many fields now, and algorithms from, e.g. general time-series analysis are also prominent. Notably, techniques inferring minimal automata on infinite symbolic signals have been developed, such as Crutchfield and Young’s \(\epsilon\)-machine formalism, Nevill-Manning and Witten’s SEQUITUR algorithm, and Shalizi and Crutchfield’s Causal State Splitting Reconstruction. Each of these offers related, though different, guarantees about the statistical properties of the recovered grammar, and the formal properties of the inferred automata.

2.3.1 Learning grammars from interpreted productions

All the aforementioned grammatical inference techniques operate on simple symbol strings. There is a more general inference problem, which is to infer the grammar of the system after the symbol string representing it has been interpreted as some kind of, e.g. graphical instruction set, as in the case of L-systems, or, as in the case of this project, as a musical score. In a complex case, such as inferring a grammar for a 3-dimensional object from a two-dimensional projection, even basic parsing can become challenging. This particular problem is an active area of research in computational architecture (that is, in designing built structures using computers) where design schema are inferred from photos.

The inverse operation to interpretation is by convention encoding, and while there are a limited set of conventions for this in L-systems, in music, there exist many different conventions for doing so, e.g.

Given a sufficiently well-behaved interpretation, such as an isomorphic map, encoding and interpretation may both be well defined and equally easy. In the case that we are not given the encoding, but have to infer it as well, the situation is more difficult.

The domain of problems where one must not only infer the grammar, but also select the interpretation/encoding schema used is large and poorly understood. In L-system-based modelling of plants this problem is usually solved manually. A researcher hard-codes interpretations by assigning a conventional set of graphic operations to symbols output by the grammatical rules, such as mapping the symbol + to a turtle instruction such as “turn right 30 degrees”. (See, however,

for an example of a limited attempt to automate this.) Such sets of transformations are generally chosen by hand to be those most intuitively “natural” seeming for the system under consideration; for example, if branches frequently leave the trunk in a given tree at an angle of 30 degrees, it seems reasonable to have a symbol interpreted as a 30-degree turn.

Early work in the inference area is dominated by the problem of encoding images as Iterated Function Systems, typically for use in image compression.

For many limited classes of inference, efficient algorithms are known; so Davis

showed that a restricted version of Jacquin’s fractal encoder

using only affine transformations of image kernels, was equivalent to a Haar wavelet sub-tree quantisation, and could be solved by the well-known, and efficient, techniques of wavelet analysis. Or, to turn that idea on its head, for some classes of IFS we may infer the image grammar and terminal interpretation simultaneously as the output of an appropriate wavelet transform.

Alternatively, on the problem of inferring generating grammars of three-dimensional forms where we do not have guarantees of strictly affine transforms, solutions can be found by applying other strong constraints to the search space, for example by limiting terminal shapes to line segments and by disallowing recursion. Alternatively, by limiting production rules to rectangular surfaces, grammar induction can be handled by reinforcement learning.

Or, by assuming a near-Markovian form for a potentially context-free grammar, Bayesian inference using Markov-chain Monte Carlo methods can be made to converge rapidly with more general geometric transformations. Many other such constrained schemes are known and are used in object classification

and in the subcategory of machine vision known as syntactic pattern recognition.

For problems without convenient restrictions, less rapid search techniques such as genetic programming of the interpretations and rules are required, often concurrent evolution of both. In genetically programmed systems, pre-processing of the source data can still reduce the complexity of the search. There have been experiments in using alternate representations of the source data, such as wavelet transforms and Fourier-domain representation,

or images preprocessed for useful features.

More recently, work has been done in speeding search simply by harnessing the easy parallelisation of both L-systems and of evolutionary search on GPU.

None of these image-based techniques generalize immediately to the musical domain, but they each provide suggestive hints toward classes of techniques that might be applied; thus, for example, while the wavelet transform of an image is rather unlike the wavelet transform of any musical event signal, a simplified version of the former might be a useful step toward the latter.

2.3.2 Learning music grammar from music events

A related set of problems is to infer an underlying grammar when the production to be interpreted is a musical piece. There is a large body of work on applying to the analysis of music, but rather less on the inference of those rules.

Of those that do attempt to infer rules in an unsupervised fashion, most assume both Markovian grammars and essentially monophonic input. (e.g.

and

). Recently,

and

have extended analyses to restricted types of polyphony but have kept the Markovian restriction. On the other hand,

has successfully trained probabilistic CFGs on the Essen Folk-song collection but remained restricted to strict monophony.

This process is complicated to an extent by the necessity of choosing not just a symbol sequence, but an interpretation of that symbol sequence, as in general L-system inference problems. As with L-systems, the expression of a musical grammar must include not only a hierarchical tree parsed from the symbol string but a list of interpretations of symbols. Convention also supplies a stock of such interpretations in musical grammar studies, such as note transpositions and chord tension.

For the newer area of music, there is the choice of either examining the problem of inferring such interpretations by hand or using one of the default interpretations.

On one hand, the default interpretations are easy to infer, being natural, simple, and easy. On the other, they have been chosen largely on the basis of convenience and are heavily imbued with the presuppositions of Western music theory, and can only encode a limited subset of music data, being essentially monophonic and sequential.

A more general approach to inference might consider interpretations of symbols analogous to those inferred in IFS work, where affine, projective, or even general non-linear contracting maps are used.

In the spirit of such analogies, we would consider interpreting symbols not just as a pitch or time interval, or a relative degree of tension in a chord, but as a transformation upon a musical motif. Rules could encode, for example:

a tempo-doubling
a transposition with respect to a particular scale
transposition some number of semitone steps
a chord inversion
a phrase truncation

Given the complexity of inferring arbitrary encodings and interpretations of grammatical symbols, I will experiment with a hand-selected set of such encodings but do not plan to infer them without presupposing a limited set of forms for the rules at this stage.

2.3.3 Simplicity in grammatical inference

A crucial problem in grammatical inference is deciding what is meant by inferring a “better” grammar; there are many answers to this. Intuitively, if a smaller number of rules can describe the same corpus, or if the same number of rules can describe more of the corpus, then it might seem that we have found a “better” grammar. For example, one may easily infer a language that has a single rule for every example, and thus, trivially, every sentence in the language may be expressed within it. But this grammar would not be ideal, since it has a huge number of complex rules, one per input datum, and yet cannot generalize to new data at all. On the other hand, a grammar that accepted every piece would have massively overgeneralised and yet told us little. We wish to avoid over-fitting by degenerate rules such as \(S\rightarrow\) Beethoven’s Moonlight Sonata. The question is, rather, what order of language succeeds in both generalising, and in more compactly expressing, a given musical structure, by some metric. Concretely, if we can relax the common presumption that musical grammars must be Markovian, and infer, rather, a context-free grammar over some corpus, and then describe that corpus more effectively, then we might feel that this challenge has been met; it turns out that this can be formalised as the Minimum Description Length (MDL) principle,

This is a formalisation of the earlier idea of Kolmogorov-Chaitin complexity,

which measures the complexity of a phenomenon by the length of the description of that phenomenon. Historically, this idea has been associated with Solomonoff’s formal theory of inference. This idea is, loosely, the idea that we can formalize scientific inference as a process of finding and quantifying the patterns in phenomena.

Practically, Kolmogorov-Chaitin complexity in its original form is not often used; it is in general uncomputable, and has other deficiencies such as being dominated by the random component of stochastic processes. The MDL idea formalises a calculable measure of the complexity of the system; roughly the “coding” of the system, which quantifies how complex an automaton can describe the data, where both the description of the automaton and its parameters are included in this measure of “complexity”. It also provides us with the normative principle that it is this description length that we should aim to minimise.

This is closely related to the stochastic complexity of Crutchfield’s \(\epsilon\)-machines.

MDL techniques are fully developed, and have such attractive features as being able to gracefully handle sparse data; practically, if we have a complex model, but not enough data to learn it, then we are impelled to choose a simpler model which may fit the data better.

In practice, MDL procedures are already used in grammatical inference and have even been employed in the musical domain for Hidden Markov Models. Using them should therefore be straightforward.

There are other techniques available to select model classes; standard cross-validation techniques and standard model selection tasks, for example, can be used to similar ends.

The attraction of this case is description length is a natural fit to the framing of this research in terms of automata. It captures in an intuitive fashion the idea that the project will be successful if a small increase in the complexity of grammar, from Markov to probabilistic context-free, enables a more compact representation of the complexity of the patterns in music.

2.3.4 Classification using inferred grammars

One of the most foundational purposes of stochastic grammars is the classification of productions, and techniques exist to, for example, automatically infer which language a given sentence belongs to using such systems, e.g. Such systems are well-developed and in frequent commercial use in, e.g. Google Translate.

Less well-developed, there are a number of attempts to use generating automata to classify musical pieces using generating automata.

The simplest of these simply classify given pieces based on the probability that they were generated by a particular grammar. For grammars inferred from several different corpora, a given piece is simply supposed to belong to the grammar that was most likely to have produced it. More generally, the form of a grammar can be used as a source of additional features with which to train any other classifier system, such as a support vector machine, and providing better structural features is likely to produce better results in music classification tasks.

Such elaborations will not be considered further here, as the basic probability estimates are sufficient for a simple verification step.

2.4 Grammars and automata in art

For all the limited success for grammar inference in art and music, there is an extensive literature in the use of hand-made grammars in these areas. Although the advent of modern linguistics generated increased interest in this study formally, the interest in a syntax of music predates the modern study of syntax. Hugo Riemann’s work on the syntax of music, originally published in the late 9th century, initiated the field with his metaphors regarding musical syntax. A more current, but still venerable, approach is Schenkerian Analysis, which breaks music up into grammar-like nested hierarchies. These analyses are frequently effectively similar to parse trees for musical grammars. For our current purposes, these informal, historical notions of syntax have been superseded by the formal language approaches of contemporary analysis, and the discussion will focus on modern, formalised uses of these ideas.

2.4.1 Markov processes and regular languages

Starting with the least powerful languages, we note early work attempted to place the theory of harmony on an information-theoretic basis in the fifties and, soon after, Markov models were applied to composer style identification.

Taking things further, composers such as Xenakis actively employed Markov models in their compositions. Xenakis’s Analogique A and B, as described in his book ‘Formalized Music’, were based around such processes. Ames provides a summary of various works, including several algorithmic scores created using Markov models.

More recently, Pachet has created a real-time Markovian jazz improvisation system based on Markov models.

Recently there has been interest in using such systems as models of the state transition structure of the Markov graph to quantify song structure complexity in birds.

2.4.2 Partially Observable and Hidden Markov Models

A modest generalisation of the Markov model, but still more tractable than context-free systems in full generality, is the partially observable Markov model. This has received attention for its use in modelling birdsong, and the suppositions about the cognitive basis for such song.

has showed that one such language can explain the most complex birdsong, such as that of the Benegalese Finch, and workers such as Fee et al and Jin have tied this to the underlying neural and physiological mechanisms within the birds.

Authors such as Mavromatis and Noland have used Markov models in musical inference and generation, for key estimation.

Cruz-Alcázar and Vidal have identified melodic style and musical key using such systems.

2.4.3 Context-Free Grammars

Although Markovian models have been successful in some aspects of musical analysis by virtue of their simplicity and ease of use, it is likely that they fail to capture much of the structure of music, and much effort has been spent in applying higher-complexity systems. Rohrmeier argues that super-Markovian complexity rules are required to handle harmonic progressions in Western tonal music.

Accordingly, many schemes have been proposed.

In general, the grammars used as models of musical structure have been hand-designed. For example, Steedman designed a context-free grammar specifically for the delimited domain of Blues chord progressions, later extended by Chemillier for real-time use.

More exotic schemes, such as using context-free grammars to partition time and pitches, have also been employed.

There have been many models besides these — notably even Leonard Bernstein has weighed in. However, the most prominent at this date seems to be Lerdahl and Jackendoff’s Generative Theory of Tonal Music (GTTM). Their approach parses music sequential data as in classic grammars, and thus is essentially monophonic. However, it does offer a model of a parse tree with chords rather than single nodes as the leaves, when augmented with a theory of consonance. Lerdahl’s later Tonal Pitch Space is the preferred model for this.

Recently, parsing under this system has been partially automated. Hamanaka has developed software capable of parsing a given piece of music using Lerdahl’s GTTM grammar.

However, as the grammar in this case is supplied a priori, there is no attempt to solve the grammatical inference problem in generality. Notably, also, the grammar is not a probabilistic one, but a set of grammatical rules supplemented with a preference ordering over alternative parses; different parse trees cannot be assigned, for example, a likelihood in the system as envisaged by its creators, though such an extension would presumably be straightforward.

There have been many attempts to solve specific sub-problems in this grammatical inference domain with a more general grammatical model. The work of Bod et al on folk song melody databases uses a full probabilistic inference system recognisable from probabilistic NLP.

Though the model used can handle only melodies rather than chords, it has the most in common with the approaches discussed in this proposal. Bod et al argue that in inferring such grammars, similar regularisation parameters are necessary for probabilistic parsers both for natural language and for musical inference; such a coincidence, they argue, is suggestive of parallels between the mechanisms involved in each activity.

2.4.4 Constraint programming

At the highest level of generality and the lowest level of the Chomsky hierarchy, the arbitrarily powerful systems of context-sensitive grammar and constraint programming reside. Despite the challenges of such systems, their use in music is an expanding research area, since they make it easy to encode the rules of an a priori theory of music whose constraints must be satisfied, such as constraints on the consonance or dissonance of concurrently played notes, or weightings on bar boundaries.

For the above-mentioned reasons of difficulty of inference, however, there appear to be no attempts to infer rule sets for such systems empirically. In examples from this field, rules are typically constructed from hand-crafted theories of musical consonance or constraints. Despite the intuitive attractions of being able to directly express an explicit theory of music, it is an open question whether such complex systems are necessary to encode music in general. The trade-off for their generality and directness is extreme difficulty in inference, and these systems are riven with undecidability issues. This proposal attempts to make do with more parsimonious, and thus more falsifiable, means to model music. Whilst not ruling out the possibility of requiring general-purpose computing to be ascribed to musical sequences, it seems desirable to exhaust the possibilities of simpler systems first; moreover, the success in capturing so much of natural language with simple models leads us to suspect that such models might be similarly successful in music. As such, if we ultimately discover that no context-free model can adequately describe music, then we are left with a remarkable result inferring substantially different cognitive mechanisms might be at play in each.

2.4.5 L-systems

To return to the related L-system program, we find that they are in use in creating and analysing music in a diversity of fashions. On the microscopic scale, L-systems have been used in synthesis algorithms to create self-similar wave-forms. More commonly they are used to devise more macroscopic features, such as notes and phrases.

That said, the most active art form as regards the prominence of such models seems to be in architecture, where rule-based and procedural systems based on these are actively used both for individual buildings and for large-scale urban planning. In this context, they are prized for computational and storage efficiency with which they can generate architectural features.

3 Detailed proposal

Can we identify universals in music in the same spirit with which syntacticians attempt to identify the universals in human language in Chomsky’s program of Universal Grammar?

There have been various attempts to answer this question (e.g.

), but all have provided at best incomplete answers, for one of two kinds of reasons, best summarised as

the dependence upon manual analysis, and
excessively simple automata.

Exemplifying the first kind, Lerdahl and Jackendoff’s “Generative Theory of Tonal Music” (Hereafter GTTM), is the most thoroughly developed and prominent system of musical grammar. But even this one can only handle a small subset (monophonic, score-based) of music, from a particular (Western) tradition. It is not clear, for example, how to generalize GTTM’s hand-crafted phrasal rules to non-western music, nor to polyphonic or polyrhythmic source material, nor how to place their system in the data-driven, statistical frame that dominates modern natural language processing.

As an example from the second category, take Cruz-Alcázar and Vidal’s work. Harnessing recent computational advances, they have circumvented the need for hand-parsing, attempting to analyse polyphonic musical material with data-driven probabilistic approaches. However, they have restricted their models to simple stochastic finite state automata which, though convenient for purposes such as style identification and easy to estimate, are known to discard important musical structure, just as such models discard important structure in natural language.

I argue that both kinds of deficiencies can be addressed by importing more recent advances from computational linguistics. Since these various analyses were first proposed, there have been changes in the disciplinary landscape which have not yet been integrated into the state-of-the-art musical application of grammars.

Firstly, the ascendancy of statistical, unsupervised techniques for inference, and of probabilistic models, has revolutionised grammatical inference and shifted the focus of research to probabilistic models from the earlier discrete models. Rather than hand-created discrete rules to explain production structure, we can infer probabilistically weighted rules at scale without manual intervention.

Secondly, improvements in machine listening and musical information retrieval techniques have made available huge quantities of musical data pre-analysed into musical events, vastly reducing the barrier of entry to corpus-based learning approaches. At the same time, the diversity of music has expanded beyond the confines of Western classical music to contemporary folk and pop forms from diverse cultures. There is no in-principle reason that we cannot examine the consistency of a particular style, such as Irish folk music, 80s synth-pop, Javanese Gamelan, or even the evolution of any such genre over time.

Finally, developments in the theory of formal languages, both in natural language processing and in dynamical systems theory, have shown that rapid unsupervised inference techniques exist for approximating subsets of general grammatical systems, making automated analyses far more tractable, at least for the pertinent subclasses.

In short, in the changing contemporary landscape of tools and techniques, it is now plausible that we could infer more complete model structures over a wider variety of musical types, thanks to improvements in theory, techniques, and data. It remains, simply, to do the labour to knit these tools together and test them. I propose to do precisely this, to harness these recent developments to create an improved understanding of the generating rules of music.

Instrumental in this process will be attempting to create techniques to automatically infer rules for a given corpus of music data without manual parsing of the data and without presuming a particular (e.g. Western) musical tradition, and by relaxing the limitations on grammatical models. Ultimately, this process will provide evidence as to the complexity (or simplicity) of the generating rules of music across styles and cultures.

Without presuming that this latest attempt would necessarily find a single set of constraints that define the grammar of each musical style — indeed, there have been no definitive answers to this question even in the source field of natural language after more than half a century of research — even incremental progress toward such a goal would have valuable outcomes, shining light on various topics of note. I will be evaluating the research according to the primary criteria that I have already mentioned, to wit, how well context-free probabilistic grammars perform at this task compared to Markovian ones. Nonetheless, any of these wider issues could provide a useful way to validate the primary outputs of the research:

For example:

How sophisticated a grammar will current state-of-the-art allow us to infer from the available musical data?
What is the richest class of grammars that we can extract from music?
Based on the systems that we can infer, what can we say about the complexity of their governing principles?
How does the grammatical complexity of musical structure compare with complexity in natural human language?
Does the structural complexity of a song constrain its popularity, that is, are there some parse trees that are too simple or complex to be successful?
Are particular types of grammatical rules especially common in particular genres of music, or in the oeuvre of a particular artist?
What measures of similarity or difference most naturally fit this grammatical representation?
Can these measures identify commonalities or differences that other types of similarity detection cannot?
Can we recover the hand-designed rules of prior musical systems (e.g. GTTM) using automatic inference, and thereby discover how well these earlier attempts fit within a more universal framework?
Can a given grammar structure explain a particular genre, or the output of a particular artist?
Can we infer musical genre, or composer, more efficiently from a grammatical representation of the music than we can from, say, a Markovian representation?

I do not propose to answer all of these questions in this thesis. Rather I state them to illustrate the broad variety of questions that can be asked of the models of the kind that I aim to produce here. Any model I produce must be placed within the wider context of such unsolved questions as these. Any of these questions could serve as a touchstone with which to test the models, in addition to the standard statistical measures of goodness-of-fit and of generalisation error.

For the current purpose though, I note that, in addition to the primary question of inferring the simplest possible grammar, I intend to only answer the last of these questions as supporting evidence for the usefulness of my technique: Can these newer, simpler grammars not only produce more elegant, simpler models, but can they also identify the genre of music more efficiently than rougher, more approximate tools?

3.1 Datasets

This project will not attempt to do machine listening in the sense of inferring audio content from raw signal data. Digital-signal processing issues will be avoided by using musical structures which have already been symbolically encoded as musical events. There are several sources of these. Small, high-quality data sets are available in the form of MIDI files of various pop songs, which are available freely online, and may be freely used for research purposes. There are repositories of hand-encoded annotated scores, such as the Essen Folk-song Database. Most prominently, there is a large, although lower-quality, data set known as the Million Song Dataset, based on advanced machine listening techniques, made freely available for research purposes by Columbia University’s Laboratory for the Recognition and Organisation of Speech and Audio.

Thus, I hope to sidestep issues of machine listening and concentrate on structural inference, by starting with corpora encoding, for example, note pitches and onset times relative to a base key, and to a base tempo, avoiding the need to estimate pitches and tempos.

As grammatical inference is the process of inferring a generating structure from symbols, the particular choice of dataset is secondary, as long as I can convert it into an appropriate stream of symbols. Should the aforementioned datasets prove unsuitable for whatever reason, there are many alternative standard datasets of music, e.g. Real World Computing Music Database, the “CAL10K” dataset, and so on. There are also freely available tools to parse audio data into musical events, such as The Echo-nest Analyzer, Berkeley’s SPRACHcore, and the Omras2 project, should I wish to proceed from a corpus of musical event data of my own creation. The larger datasets are more attractive from the perspective of large sample estimates, but as they are the product of machine listening techniques, are more prone to containing classification and estimation errors themselves. Polyphonic pitch tracking and rhythm inference are each imperfect technologies and errors are to be expected. The trade-off between data noise and dataset size will be considered in the course of the various steps of the project. All datasets mentioned so far possess genre and composer metadata to a greater or lesser degree.

3.2 Methods

I will create tools to automatically infer probabilistic grammars for a corpus of music, then verify and validate their outputs when applied to different corpora with regard to standard statistical tests.

To this end, I will need to be able to

Express a grammar as music
Pre-process musical event data into symbols
Parse a musical piece using a musical grammar
Infer a grammar from a corpus of pieces
Compare similarities of grammars induced by different corpora
See if the similarities between grammars of different corpora are effective classifiers for song metadata, e.g. composer or genre.
Quantify range and variation of inferred parses and see what generalisations may be made about the features of grammars which persist across cultures

Each of these steps will depend to an extent upon the output of the previous and it is, of course, unknown at this stage how much of an improvement can be made in each of these over the state of the art.

3.2.1 Expressing a grammar as music

Expressing a musical grammar as music is reasonably straightforward; there exist many tools capable of handling reasonably complex stochastic grammatical expressions as music, such as the Smalltalk dialect SuperCollider, the Oz dialect Strasheela, or even non-musical systems such as Lisp-derivative Church.

I do not propose to presume any particular complexity class of grammar at this stage; as discussed earlier the increase in effort for expressing Turing-complete automata is only marginally greater than that of more basic systems. An ideal solution would be one that could be played audibly, to allow intuitive exploration, but which could also produce machine-readable output as input to other steps in the process and ideally, scores. All these goals appear straightforward with the tools to hand.

3.2.2 Pre-process musical event data into symbols

As discussed earlier, musical event data require an interpretation to map between sets of musical events and the strings of symbols that are the basis of grammatical inference theory.

Although there is much potential to explore inference techniques to deduce interpretations for grammars, at the initial stage I will focus on simple rules from the literature, such as the lists in @citation rather than inferring new rules, with an eye to exploring automatic interpretation inference at a later stage if the data merit it.

3.2.3 Parsing a musical piece using a musical grammar

Parsing is identifying the states that an automaton assumed in emitting a given production; that is, identifying which grammatical rules were responsible for each symbol in a production. In the case of a probabilistic grammar, this means identifying all likely possible states that could have produced a given piece, and their relative probabilities. Unlike expressing a grammar, parsing productions generated by more complex automata can be more computationally complex. It is this later step, of parsing, that will constrain the complexity class of grammars used in later stages, and which provides a motivation to avoid excessively complex models such as context-sensitive or unrestricted grammars.

For the classes of models here, the complexity of this step will be modest. To parse a production in full generality can be difficult, in the worst case. Inferring parse trees with, for example, unrestricted grammars, on general symbol strings can be a computationally, or even undecidable problem.

Fortunately, for more restricted systems, the problem is typically easy. For this project, where we hope to infer simple sequences of symbols from music, and context-free grammars for music, the problem is well-explored and in general computationally affordable.

For the moment, therefore, I proceed with the assumption that such simple parsers will be sufficient, and that using one of the many available parser-generators will be sufficient.

Creating and estimating likelihoods for probabilistic parses is a necessary pre-condition for inferring grammars; otherwise we cannot compare our predictions with our intuitions. For example, Bod was able to confirm the utility of his parser by comparing the emitted parse trees for folk songs in the Essen Dataset by comparing the parse-tree-inferred phrase structure with the manually-supplied annotations in the data set.

For other datasets under consideration, such as the Million Song Dataset, it might be necessary to supply a manually-annotated test-set, or to test the parse trees implicitly, by resynthesising new pieces using them, or comparing the effectiveness of the inferred rules as an input to grammatical inference. At this stage, no more complexity will be assumed for this step, though I note that the potential to revisit this assumption should the data necessitate it.

3.2.4 Inferring a grammar from a corpus of pieces

Grammatical inference is the most computationally expensive component of this process. It is distinct from the problem of parsing; while parsing attempts to recover the states a given automaton assumed in producing a piece, inferring attempts to recover the automaton itself, given many pieces. In plain language, inferring a good grammar requires that we are able to produce a probabilistic grammar that is optimal in some sense — e.g. that uses the most parsimonious possible rule set or that provides the highest-likelihood parses of a given corpus.

Some inference strategies, such as candidate grammars to a learnable subset of all possible grammars, such as “substitutable” grammars, can ease the computational burden, as can accepting probabilistic error. Fortunately, a wide variety of off-the-shelf solutions for probabilistic grammatical inference, such as the Stanford Lexicalized Parser and the Bikel Parser, and many others besides. This step therefore reduces to the question of trying and evaluating these until one adequate for my purposes is identified, and choosing appropriate parameters for its learning algorithms based on the corpora that we have available.

3.2.5 Inferring symbol interpretations

These discussions of difficulty of parses assume that we are already given a symbol string to parse, which, in the case of monophonic input is a reasonable assumption. If, however, I attempt to generalize the algorithm to handle polyphony or polyrhythm, where two concurrent symbols might reasonably be expected to be expressions of two different symbolic rules, the parsing procedure can become substantially harder. With concurrent symbols, this problem becomes equivalent to the general problem of branched L-system parsing.

Unlike the general problem of parsing, say, a shape grammar from raw image input, musical event representations are already low dimensional, and, e.g. Cruz-Alcázar and Vidal argue that appropriate pre-processing of data can reduce the problem to one only marginally more complex than basic inference, contingent on some mild restrictions of the complexity of the grammar to parse. The investment of time in this component of the project will be contingent upon the success achieved with the simplest possible rule interpretations; if inference results are not sensitive to choices of rule set, there would be little purpose investing extra time in this step.

3.2.6 Comparing effectiveness of grammars induced by different corpora

Finally, the project requires using inferred grammars to produce measurable improvements to existing schemes of classification and inference about music. To this end, it will be necessary to choose metrics between parses, and/or grammars. The primary point of comparison will be comparing restricted (Markov) inferred grammars to less restricted (context-free) grammars.

There are subtle difficulties here; for one, even in grammars as simple as context-free, the general equivalence problem is undecidable — that is, we cannot know in general that two alternative inferred grammars have identified the same language.

Such problems are, in fact, worst-case problems. Whilst this research proposes expanding the power of grammars inferred, there is no reason to suppose that a well-regularised selection process will produce such pathological cases. If it did indeed do so, then I would further hope that the pathological grammars would cast light on ambiguity in the music itself or are in some way reflective of human cognitive constraints.

Moreover, we can avoid this to a degree by comparing in terms of “simplicity”; that is, in terms of the minimum description length. All this requires is a consistent definition of our coding system, and a consistent measure of how successfully our grammar has explained the data. Standard techniques exist for estimating success in the statistical natural language processing literature by, e.g. assigning a likelihood of the generation of a given piece by a given grammar.

This process gives us P-values, that is, standard significance tests, for given pieces to estimate their suitability to a given grammar, giving us a mean likelihood when averaged across the test data.

3.2.7 Validating results by use of inferred grammars in classification

Rather than staking all the verification of the model discussed here in this rather abstract procedure of measuring description lengths and so on, it is desirable to measure performance of the algorithm on at least one practical, real-world application if there is available time. Classification seems the most straightforward task here. As discussed earlier, there are many attempts to classify pieces using probabilistic automata, most directly by rating the “probability” of a given piece given a given grammar. For my purposes, a simple validation of the model, in addition to standard significance tests, would be training grammars on different corpora. For most of the datasets, for example, we might divide the dataset into different genre corpora using the dataset metadata and train genre-specific grammars on these. On these outputs, I can measure performance at classification both with simple Markov automata and my more complex grammar systems to evaluate how much, if any, improved accuracy has been gained. Then I may compare the efficacy of different degrees of grammatical complexity at this task in terms of test-data error. It is straightforward to compare the efficacy of such classification with other published research in the field that uses related techniques — such as de la Higuera’s automated inference learning of Markovian musical grammars for classification.

3.3 Costs and benefits of my alternative

Prior research has not suffered from a lack of diligence. Earlier researchers have left only the more treacherous ground unploughed, and my plans rely on some degree of innovation on my part to circumvent well-known problems in the field.

As outlined in the methods section, there are potential difficulties in the parsing and inference steps, and many opportunities to founder upon poor choices of technique or upon computational constraints.

However, these difficulties are merited by the benefits. Compared to earlier approaches, the tools proposed here — low-assumption unsupervised probabilistic grammatical inference over large datasets — have many attractions.

Firstly, the ability to compactly express families of musical structure as grammatical structures opens the door to commercial applications in creating and composing audio, much as L-systems have found commercial application in creating computer graphics. Much earlier research has discussed this idea and some workers have created implementations.

This seems to be the first such scheme, however, that does not tie the grammatical rules used to an a priori set such as GTTM.

By eschewing culturally-specific rule sets, we additionally free the subject matter from a specific genre of music, to music generally, free from a specific tuning and able to be calibrated to alternate tuning conventions.

Indeed, there is no reason, contrary to the bulk of earlier attempts, that theories of music devoid of any tuning at all cannot be inferred, creating grammars, say, of untuned percussion, or of instrument-specific parameters such as cuts of samples of digital music, which other approaches to musical parameter learning have attempted, e.g.

All the techniques described thus far are blind to particular musical theories or specific event types and can support inference from any events available in the dataset. That is, there is no difference, from the perspective of the algorithm, between inferring structure in the sequence “kick-snare-kick-kick-snare” and in “Do-Re-Mi-Fa-So-La-Ti-Do”.

By devising probabilistic rules, rather than simple binary acceptance, we acquire an easy method to generate novel musical pieces whose frequency properties approximate their training data. As a natural consequence of parameterising the grammar with real number coefficients, we also acquire an easy way to interpolate between different rule sets to create hybrid genres. Such techniques are common in design grammars for visual systems (e.g.

but I have found no example of combining such interpolation with unsupervised style inference in the musical domain for anything except for low-complexity Markov models.

On the other hand, tools to do interpolation with manually-created grammars do exist,

, So we might hope for this process to be comparatively painless.

Automated rule inference across different music traditions provides us with a potential source of universal novel features and constraints not contingent upon any one musical theory. In turn, this provides new opportunities for musical information retrieval and novel means of dimensional reduction to feed into future learning algorithms.

Finally, representing musical structures as grammatical systems opens up the possibility of measuring the “complexity” of musical structure using the techniques by which grammatical complexity is measured in other domains e.g. with Crutchfield and Young’s \(\epsilon\)-machines , which represent the length of the shortest description of a system that can explain the non-random component of its behaviour.

While the ultimate goal stated here is ambitious, and there is of course no guarantee that it is attainable in whole, the many intermediate steps along the way are novel in themselves, and achieving partial success still provides novel outcomes and publishable research.