Bach to the Future (or, Humanising Music With Neural Nets)

10 Apr 2019|Technology

A tale of a harrowing failure and the worst bug the author thinks he ever wrote, followed by a glorious success – well, sort of, anyway.

"The Music Room" by Mihály Munkácsy, licensed under CC0 1.0.

Many companies have different perks and benefits for their employees, but when I tell my friends and acquaintances about Chilicorn Records they never fail to be impressed.

Chilicorn Records is Futurice’s in-house record label. More than that, it is an occasion for all Futuriceans who are interested in music to create, once a year, a song for the annual Chilicorn Records compilation, which we then release and open source. We make a song, and Futurice pays us to do it, because Chilicorn Records is a part of our Spice program, wherein we, the employees, are paid for any open source work we do in our spare time.¹

A colleague and I had already collaborated on a track for the first edition of the compilation in 2017, but for the second edition, which we all released with great fanfare during our annual summer party here in Berlin last June, I wanted to do something a little different. But before I describe the composing, performing and recording of that song, let us go back for a moment to the distant past of 2017.

Composing Music

When my future biographers write my life (hmm, hmm), they may well choose to divide it into a series of obsessions. Since spring of 2017, my chief obsession has been the study of music theory and composition, driven by the desire to learn how the masters of old created the music I love. And one type of composition attracted me more than any of the others – the fugue.

Point against point

The word texture comes from a Latin root related to 'weaving', and it probably brings to mind the tactile qualities of physical materials, like fabrics, papers or woods. Similarly, in music, it is used to describe how the melodies we hear in it appear to us. (As with so much in music, this is subjective, but not so subjective that we cannot make generalisations.) When listening, do we hear only a single melody? Or do we hear one melody at the surface, as it were, and several other melodies moving under the surface (but in harmony with that first melody)? Or do we hear several melodies none of which is more prominent than any of the others?

What we have just described are the three categories of musical texture:

monophony (from mono-‘one’ + Greek phonē ‘sound’), wherein we hear only a single melodic line.
Listen to this Kyrieleison by Hildegard von Bingen (1098—1179) and notice how, even after the second voice enters at the :15 mark, we still only perceive a single melodic line (because the two voices sing the same notes²). Embedded content: https://www.youtube.com/watch?v=98S9spYHtJc
homophony (from Greek homos ‘same’ + phonē ‘sound’), wherein we hear one main melodic line together with a (less prominent) accompaniment.
Now listen to the opening section (:00–:45) of Locus Iste, a motet by Anton Bruckner (1824—1896), and as beautiful an example of homophony as I could ever wish to hear. Embedded content: https://www.youtube.com/watch?v=vT2_jVvTvH4
polyphony (Greek poluphōnos, from polu- ‘many’ + phōnē ‘voice, sound’), wherein we hear multiple individual but related melodic lines.
Finally, listen to the madrigal Solo e pensoso, composed by Luca Marenzio (1553—1599), a lovely example of polyphonic singing, and you’ll notice how all the voices have a life of their own. Embedded content: https://www.youtube.com/watch?v=v0PPZSqvC9E

The means of achieving polyphony in practice is called counterpoint (from medieval Latin contra- ‘against’ + punctum, from pungere ‘to prick’), a word you may have heard used figuratively in other contexts. In counterpoint, the different parts form harmonies together, but have different contour and rhythm. For instance, one part may rise while another descends, or one part may consist of crotchets and another of semibreves (in English, that means one part is played or sung “four times as fast” as the other), and so on. John Rahn puts it more strikingly:

It is hard to write a beautiful song. It is harder to write several individually beautiful songs that, when sung simultaneously, sound as a more beautiful polyphonic whole. … The way that is accomplished in detail is ... ‘counterpoint’. (John Rahn, Music Inside Out: Going Too Far in Musical Essays, Taylor & Francis, 2001)

The fugue

A fugue is a special kind of complex contrapuntal composition, wherein two or more subjects (distinct and recognisable melodies or themes, usually quite short) are introduced in imitation in each part, and then reappear throughout the composition in a variety of combinations and variations. The word fugue comes from Latin fuga (flight), and is related to fugere (flee) and fugare (chase), illustrating, maybe, how the subjects weave in and out of the song, now played higher, now lower, now in the foreground, now in the background.

Let’s listen to one. Take, for instance, this splendid Ricercar a 6

Embedded content: https://www.youtube.com/watch?v=RUYDkAvs-Ko

from Johann Sebastian Bach’s Musikalisches Opfer, dedicated to Frederick the Great³, and try to pay attention to how the subject (the very first notes that you hear) is used to introduce each instrument in turn, and then how it reappears throughout the piece, sometimes in a new costume, but always recognisable.

The great attraction of the fugue, for me at least, is that it is not just a type of composition, but also a method for composing. Having written a subject, you use the rules of contrapuntal composition to write a countersubject that can be played alongside it, and then your composition will consist simply of combining these two subjects in different ways, more or less following a common scheme.⁴ So, in a sense, you write a subject, and then everything else follows. Or, as 16th-century composer and music theorist Nicola Vicentino put it, you “will always have material with which to compose without having to stop and reflect”.

Subliminal messaging

Because of its strictness and complexity, the fugue is often seen as a profound and cerebral style of music, as Bach too was a profound and cerebral composer. Apart from the purely musical complexities of his music, he would often integrate puzzles, ciphers and oblique references to extramusical topics (usually devotional) into his music.

For instance, if you look at his scores, you will every now and then come across the following four-note sequence.

The notes are, in the German system, B, A, C and H, in other words spelling out the name BACH. This, the B-A-C-H motif, is the most famous of many musical cryptograms.
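As a small illustration of the German naming convention (a sketch of my own, not something from the original project): in German, B denotes B-flat and H denotes B natural. The helper name `spell_motif` and the choice of octave are assumptions made purely for the example, and since every note is placed in the same octave, the melodic contour of the real motif is not reproduced.

```python
# Pitch classes (semitones above C) under the German naming convention,
# where "B" means B-flat and "H" means B natural.
GERMAN_PITCH_CLASSES = {
    "C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 10, "H": 11,
}


def spell_motif(letters, octave_start=60):
    """Turn a string of German note names into MIDI note numbers.

    `octave_start` is the MIDI number of the C that starts the octave
    (60 = middle C); it is an illustrative choice, not part of the motif.
    """
    return [octave_start + GERMAN_PITCH_CLASSES[letter]
            for letter in letters.upper()]


bach = spell_motif("BACH")      # B-flat, A, C, B natural
futurice = spell_motif("FCE")   # F, C, E
```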

For the second edition of the Chilicorn Records compilation, I wanted to try my hand at writing a fugue. Specifically, I composed a fugue in three voices on the theme F-C-E, for Futurice (Bach was lucky that his name could be written completely with musical notes), as spelled out by the first three notes of the subject.

I wrote the fugue, but of course I don’t know how to play any instrument. So as any reasonable person might be expected to do, I decided to try my hand at writing a neural network that would perform it for me.

Machine Learning, Attempt One

The software I used to compose the fugue, MuseScore, allows you to export a piece to MIDI. The MIDI format describes music as a collection of 'note on' and 'note off' events (corresponding to pressing and releasing a key on a piano) distributed through time.
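To make the event model concrete, here is a minimal sketch in plain Python. It is a simplification: real MIDI files store delta times in ticks and carry channels and other messages, and the `MidiEvent` class and `pair_notes` helper are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class MidiEvent:
    time: float     # seconds since the start of the piece (simplified)
    kind: str       # "note_on" or "note_off"
    pitch: int      # MIDI note number, 60 = middle C
    velocity: int   # 0-127; how hard the key was struck


def pair_notes(events):
    """Match each note_on with the next note_off of the same pitch.

    Returns (pitch, start, duration, velocity) tuples sorted by start time.
    """
    open_notes = {}
    notes = []
    for event in sorted(events, key=lambda e: e.time):
        if event.kind == "note_on":
            open_notes[event.pitch] = event
        elif event.kind == "note_off" and event.pitch in open_notes:
            start = open_notes.pop(event.pitch)
            notes.append((event.pitch, start.time,
                          event.time - start.time, start.velocity))
    return sorted(notes, key=lambda note: note[1])


# Three notes as a notation program might export them:
# every note_on carries exactly the same velocity.
events = [
    MidiEvent(0.0, "note_on", 65, 80), MidiEvent(0.5, "note_off", 65, 0),
    MidiEvent(0.5, "note_on", 60, 80), MidiEvent(1.0, "note_off", 60, 0),
    MidiEvent(1.0, "note_on", 64, 80), MidiEvent(2.0, "note_off", 64, 0),
]
notes = pair_notes(events)
```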

The problem with this is that all notes are played with equal emphasis, i.e. equally loudly, and with “perfect” timing – in other words, it lacks all of those qualities that distinguish different human interpretations of a composition. All the right notes are there, but they don’t sound right; they have loudness but no dynamics. Compare the following image, from a real performance of the first prelude, in C major, from the first volume of The Well-Tempered Clavier, again by Johann Sebastian Bach,

with the following, where the dynamic is totally flat:

And what is performing, anyway? Let’s ignore for a moment the acoustic qualities of the piano and the room in which it’s located, the calibre of the studio equipment and the skill of the studio engineer – what really differentiates a performance of, say, Franz Liszt’s Piano Sonata in B minor by, say, Alfred Brendel from one by, say, Martha Argerich? After all, the same notes are being sounded in the same order.

There is an art-critical answer to that question, which I will ignore, and a technical answer, which can be boiled down to two aspects:

the precise position of notes in time, and
the dynamics, that is to say the loudness of each note.

Of those, the second seemed to me the lower-hanging fruit. And besides, I had found a master’s thesis (by Iman Malik of the University of Bristol, who also wrote a very illuminating blog post about it) describing a machine learning solution for exactly that problem. (Most of what follows in this section is based on her method, and some of it is also based on her code.)

The dataset

You might think that this image shows an ordinary piano, but you’d be mistaken. It actually shows a Yamaha Disklavier e-piano, which is able to record performances and save them to be played back later. It’s used in the annual e-Piano Junior Competition for young pianists, and many of the performances from the final rounds are later uploaded as MIDI files on the competition’s website, happily providing an excellent data set for us to use.

Modelling the problem

I let each song consist of an arbitrary number of moments in time, starting at 0 and proceeding towards infinity. At each moment, any of the 88 keys on the piano (from the lowest A to the highest C) may be either played or not played. Actually, we’ll be a little more specific here and say that a key can be newly pressed down, held down (sustained) from an earlier moment, or not played at all. If we represent those 3 states with 2 integers each, namely (1, 0), (1, 1) and (0, 0), that gives us 88 times 2 equals 176 input features at each time step.
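The encoding can be sketched in a few lines of plain Python. This is an illustration rather than the project's actual code: the function name `encode_inputs`, the interleaved feature layout (two adjacent columns per key) and the exclusive end step are all assumptions.

```python
NUM_KEYS = 88
LOWEST_MIDI_PITCH = 21  # A0, the lowest key on an 88-key piano


def encode_inputs(notes, num_steps):
    """Encode notes as 176 binary features per time step.

    `notes` is a list of (midi_pitch, start_step, end_step) tuples, with
    end_step exclusive. Each key owns two adjacent columns:
    (1, 0) when pressed on this step, (1, 1) while held from an earlier
    step, and (0, 0) when not played.
    """
    inputs = [[0] * (NUM_KEYS * 2) for _ in range(num_steps)]
    for pitch, start, end in notes:
        key = pitch - LOWEST_MIDI_PITCH
        for step in range(start, min(end, num_steps)):
            inputs[step][2 * key] = 1                              # key is down
            inputs[step][2 * key + 1] = 0 if step == start else 1  # sustained?
    return inputs


# Middle C (MIDI 60) held for three steps, then the A above it for one step.
inputs = encode_inputs([(60, 0, 3), (69, 3, 4)], num_steps=4)
```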

For the output, we have the same number of moments in time as for the input – the length of the song does not change. We also have the same number of keys on the piano, and for each key at each point in time we want a velocity (meaning “loudness”, or the speed with which a piano key goes from resting position to being fully depressed), ranging from 0 (silent) to 1 (fortississimo). That gives us 88 output features at each time step.

The keen reader will have noticed that I lied just then. I wrote that there were 176 features, but the image of the inputs shows a number of additional rows (features) at the top of the graph. That’s because, in addition to the 176 features representing the notes being played, I added a number of additional features based on my (limited, I admit) understanding of music theory.

Feature engineering

By feature engineering, we try to use our domain knowledge (in this case, of music and music theory) to give the neural networks additional hints when predicting the velocities. (For an in-depth article on feature engineering, see Will Koehrsen's Feature Engineering: What Powers Machine Learning.) Here are some of the ones I used:

Marking the stress of each beat. In other words, representing whether we are on a strong (1), weak (0) or medium-strong (0.5) beat. For instance, music in 2/2 time, such as a march, generally has an OOM-pah rhythm (encoded as 1 0 1 0 ...) whereas music in 3/4 time, like a waltz or a minuet, usually has an OOM-pah-pah rhythm (1 0 0 1 0 0 ...).
Summing the number of notes being played or sustained at a given time step.
Averaging the pitch value of all notes being played or sustained at a given time step.
Calculating the nearness to the end of the song, going from 0 for the very first time step of the song to 1 at the very last one.
Determining the quality of the chord being played, in other words, answering questions such as: is the chord currently being played a major, minor, diminished, augmented or suspended chord? Is it a dyad, a triad or a seventh? And so on.
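A few of the features listed above can be sketched as follows. This is only an approximation of the idea, not the project's code: the function names are invented, beats are indexed directly rather than derived from time steps, and placing the medium-strong 0.5 on the third beat of a four-beat bar is my assumption about how it would be encoded.

```python
def beat_stress(beat_index, beats_per_bar):
    """Return 1.0 on the first beat of a bar, 0.5 on the third beat of a
    four-beat bar (an assumed encoding), and 0.0 on all other beats."""
    position = beat_index % beats_per_bar
    if position == 0:
        return 1.0
    if beats_per_bar == 4 and position == 2:
        return 0.5
    return 0.0


def step_features(sounding_pitches, step, num_steps):
    """Summary features for one time step.

    `sounding_pitches` holds the MIDI pitches pressed or sustained at this
    step. Returns (note count, mean pitch, nearness to the end of the song).
    """
    count = len(sounding_pitches)
    mean_pitch = sum(sounding_pitches) / count if count else 0.0
    nearness_to_end = step / (num_steps - 1) if num_steps > 1 else 0.0
    return count, mean_pitch, nearness_to_end


march = [beat_stress(i, 2) for i in range(4)]   # OOM-pah OOM-pah
waltz = [beat_stress(i, 3) for i in range(6)]   # OOM-pah-pah OOM-pah-pah
c_major_triad = step_features([60, 64, 67], step=0, num_steps=5)
```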

Looking again at the visual representation of the features

you may see that the values at the top represent these engineered features, derived from the 88 notes being played or not.

Implementing it in code

The final model, implemented in Keras, is very similar to that proposed in the aforementioned thesis.

I use recurrent layers (long short-term memory layers, to be precise) because how loud a note should be sounded will surely depend on which notes came before it. I make those recurrent layers bidirectional because, following the same logic, the loudness of a note will surely also depend on which other notes come after it.

Here it is in code:

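from keras.models import Sequential
from keras.layers import LSTM, Bidirectional
from keras.optimizers import Adam

# input_size, output_size and batch_size are assumed to be defined elsewhere.
dropout = 0.2
model = Sequential()
# Three stacked bidirectional LSTM layers; the forward and backward outputs are summed.
model.add(Bidirectional(LSTM(output_size, activation='relu', return_sequences=True, dropout=dropout),
                        merge_mode='sum',
                        input_shape=(None, input_size),
                        batch_input_shape=(batch_size, None, input_size)))
model.add(Bidirectional(LSTM(output_size, activation='relu', return_sequences=True, dropout=dropout),
                        merge_mode='sum'))
model.add(Bidirectional(LSTM(output_size, activation='relu', return_sequences=True, dropout=dropout),
                        merge_mode='sum'))
# Mean squared error loss, with the gradient norm clipped to 1.
model.compile(loss='mse', optimizer=Adam(lr=1e-3, clipnorm=1), metrics=['mse'])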

Training the model

So I had my data and my model. I trained it for some 130 epochs

and got a fine loss, although it should be noted that the baseline loss will be low, too, because the output matrix is so sparse. (One 'epoch' entails training the model with each sample in the data set exactly once.) I should note that the loss function being used here is mean squared error.
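The point about the sparse output can be checked with a little arithmetic. The sketch below uses made-up data (one note of velocity 0.5 struck on every fourth step of a 100-step piece) and a "model" that predicts silence everywhere; the numbers are purely illustrative.

```python
def mean_squared_error(predicted, actual):
    """Mean of the squared differences over every cell of two matrices."""
    total = 0.0
    count = 0
    for predicted_row, actual_row in zip(predicted, actual):
        for p, a in zip(predicted_row, actual_row):
            total += (p - a) ** 2
            count += 1
    return total / count


num_steps, num_keys = 100, 88
actual = [[0.0] * num_keys for _ in range(num_steps)]
# Hypothetical piece: one note of velocity 0.5 struck on every fourth step.
for step in range(0, num_steps, 4):
    actual[step][39] = 0.5

# A "model" that has learned nothing and always predicts silence.
all_silent = [[0.0] * num_keys for _ in range(num_steps)]
baseline_loss = mean_squared_error(all_silent, actual)
```

With 25 notes spread over 8,800 cells, the loss of this do-nothing baseline comes to 25 × 0.25 / 8800, or roughly 0.0007, which is why a low loss on its own says little here.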

I used this model to make predictions for the fugue, recorded it with velocity-sensitive synthesisers and sent the track off to be included in the compilation, which it was.

Time passed. About half a year later, we had a data science training in the Futurice Berlin office, which inspired me to see if I could use my newfound knowledge to improve my implementation, which I had gone on using regularly during the previous months.

Data augmentation

My main idea began with the following observation. If you transpose a song chromatically (or, in plain English, if you play all the notes of that song a certain number of keys further to the left or to the right on the piano), the dynamics (relative loudnesses) of the song should very nearly stay the same. But if you do such a transposition, the input into the first layer will look completely different from what it was before the transposition.

As an example, listen to the openings of these three versions of On Hearing the First Cuckoo in Spring by Frederick Delius. The first is the original

the second example is transposed down by a major second –

and the third example is transposed up by a major second:

They sound very much alike, don’t they? And that’s in spite of the fact that the actual notes being played are very different. In other words, we have two very different inputs which should give very nearly the same output. And that’s a ripe opportunity for data augmentation. So I took each of the tracks in the data set and created from it twelve variations, the original plus eleven different transpositions, and thus at a stroke grew my data set by a factor of twelve.
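In code, the augmentation might look something like the following sketch. The source does not say which eleven transpositions were used or what happened to notes pushed off the keyboard, so the offsets (five semitones down to six up) and the decision to drop out-of-range versions are assumptions, as are the function names.

```python
LOWEST_PITCH, HIGHEST_PITCH = 21, 108  # MIDI range of an 88-key piano


def transpose(notes, semitones):
    """Shift every pitch by `semitones`, leaving timing and velocity alone.

    `notes` is a list of (midi_pitch, start, duration, velocity) tuples.
    Returns None if any note would fall off the keyboard.
    """
    shifted = [(pitch + semitones, start, duration, velocity)
               for pitch, start, duration, velocity in notes]
    if any(not LOWEST_PITCH <= note[0] <= HIGHEST_PITCH for note in shifted):
        return None
    return shifted


def augment(notes, offsets=range(-5, 7)):
    """Return the original (offset 0) plus up to eleven transpositions."""
    versions = (transpose(notes, offset) for offset in offsets)
    return [version for version in versions if version is not None]


song = [(60, 0.0, 0.5, 0.6), (64, 0.5, 0.5, 0.7)]
variations = augment(song)
```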

But there was a fly in my sweet-scented ointment. As I implemented this, I came across something hideous, something grotesque even. I came across what may have been the worst bug that I ever wrote.

Yes – for half a year I had only ever trained my neural network on a single batch of four songs, entirely ignoring the rest of my data set. During those six months, I presented my work to my colleagues, told friends about it and used it in my music … It was a somewhat embarrassing discovery, one feels urged to admit. But now, I just find it sort of miraculous that the neural network was able to produce passable results in the first place.

Having fixed the bug and trained the new model, with data augmentation, for 50 epochs, I realised that the best it could do was to predict a sort of average velocity (reminder: the velocity is the force with which a piano key is depressed) for all notes – no crescendi or diminuendi here. (In a way, the model trained on a single batch produced more interesting results, as, being trained solely on a very specific subset of the data, it was more wild and opinionated than this second model, which regressed to a sort of bland mean.)

In other words, after half a year and countless hours spent implementing my model, I had to face up to the fact that the model was no better at predicting velocities than simply assigning a constant value to every note. And using a complex neural network to do that, one would have to consider a case of over-engineering.

Machine Learning, Attempt Two

Per aspera ad astra! While my original idea and implementation ended in catastrophic failure, I had the feeling that, in the end, it would serve only to sweeten my ultimate and well-deserved success. Rethinking the problem, and figuring that simpler is better, I decided to model it as a tabular problem.

Modelling the problem, pt. II

Seen in this way, every note event is a single row in a large table. Each note event has an associated time (specifically, the time passed since the beginning of the song) and pitch feature – that’s the basic data gleaned from the raw MIDI file, and they make up the first two feature columns in the table.

In addition to that pair, I engineered a large number of additional features, some of which the reader may recognise from the previous iteration. They include the average pitch value of all currently pressed keys, the nearness to the end and to the midpoint of the song, meta information describing the song itself (e.g. duration, total number of notes sounded and so on) plus various values describing how the note event is related to notes appearing simultaneously with, before and after it. I also calculated running means, sums, standard deviations and so on for many of these features.
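To give a feel for what such a table looks like, here is a rough sketch that builds a handful of these columns from bare (time, pitch) pairs. The column names imitate those that appear in the project's data, but the function names, the window size, the octave formula (MIDI pitch divided by 12) and the use of the last onset as the song's duration are all simplifying assumptions.

```python
def rolling_mean(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


def build_rows(notes, window=3):
    """Turn (time, pitch) note events, sorted by time, into feature rows."""
    times = [time for time, _ in notes]
    gaps = [0.0] + [later - earlier for earlier, later in zip(times, times[1:])]
    gap_means = rolling_mean(gaps, window)
    duration = times[-1] if times else 0.0  # simplification: last onset
    rows = []
    for (time, pitch), gap, gap_mean in zip(notes, gaps, gap_means):
        rows.append({
            "time": time,
            "pitch": pitch,
            "octave": pitch // 12,
            "nearness_to_end": time / duration if duration else 0.0,
            "time_since_last_pressed": gap,
            f"time_since_last_pressed_roll_mean_{window}": gap_mean,
        })
    return rows


rows = build_rows([(0.0, 60), (0.5, 64), (1.0, 67), (2.0, 72)])
```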

Golly! the reader may exclaim. That makes for a darned lot of features! Surely your model will overfit! As it turns out, no – using weight decay we give the model an incentive to have weights closer to zero, meaning that, in the end, if all goes well, it will end up using only a limited subset of all these features.

Implementing it in code, pt. II

Every personal software project is an occasion to learn a thing. At this time I’d just heard about the fastai library, which promised ease of use and excellent performance (as it turned out, both of those promises were founded in reality – my only complaint would be that it’s so far poorly documented). Here’s how my model looked – basically a neural network with two hidden layers:

$ rachel.learn.model
TabularModel(
  (embeds): ModuleList(…)
  (emb_drop): Dropout(p=0.1)
  (bn_cont): BatchNorm1d(142, eps=1e-05, momentum=0.1, affine=True)
  (layers): Sequential(
    (0): Linear(in_features=330, out_features=1000, bias=True)
    (1): ReLU(inplace)
    (2): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True)
    (3): Dropout(p=0.2)
    (4): Linear(in_features=1000, out_features=500, bias=True)
    (5): ReLU(inplace)
    (6): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True)
    (7): Dropout(p=0.5)
    (8): Linear(in_features=500, out_features=1, bias=True)
  )
)

The final training set consists of 1.77 million notes and the validation set of 0.22 million notes. The table has 213 columns, of which 1 is the output label – the velocity.

$ rachel.train_df.shape
(1765557, 213)

$ rachel.validate_df.shape
(215514, 213)

Here’s how the model is created in code:

data = (TabularList.from_df(self.midi_df,
                            path=self.data_folder,
                            cat_names=category_names,
                            cont_names=continuous_names,
                            procs=[Categorify, Normalize])
                .split_by_idx(valid_idx)
                .label_from_df(cols='velocity', label_cls=FloatList)
                .databunch())

learn = tabular_learner(data,
                        layers=[1000, 500],
                        ps=[0.2, 0.5],
                        emb_drop=0.1,
                        y_range=torch.tensor([0, 1.2], device=defaults.device),
                        metrics=exp_rmspe)

The metric reported here, exp_rmspe, is fastai's exponential root mean squared percentage error; the loss itself is mean squared error, although I suspect any variant of it would do the job. As you can imagine, fastai does quite a lot for us!

Intermezzo: a little look inside the data set

Having set it up as a tabular problem, it was easy to dig inside the data. Here is the feature that was most obviously related to the velocity (loudness), namely the octave (or, in English, how far to the left or right on the keyboard a note appears, where 0 is on the leftmost side and 10 on the very right).

There was another, minor relation in the follows_pause feature, which is 1 if no other note was pressed when the note event in question occurred, and 0 otherwise.

Or, if numbers speak to you and boxes don’t, here are the highest and lowest Pearson correlation coefficients:

$ rachel.train_df.corr().velocity.sort_values(ascending=False)
velocity                                     1.000000
interval_from_released_fwd_roll_std_50       0.232278
interval_from_released_roll_std_50           0.229318
pitch_fwd_roll_std_50                        0.220742
pitch                                        0.206995
                                               ...   
time_since_last_pressed_fwd_roll_mean_50    -0.246373
time_since_last_pressed_roll_mean_10        -0.246656
sustain_roll_min_50                         -0.272489
sustain_fwd_roll_min_50                     -0.273559
time_since_last_pressed_roll_mean_50        -0.287749
Name: velocity, Length: 168, dtype: float64

The strongest correlation here is with the rolling average time passed since the last pressed note event – the more time has passed since the previous note event, the quieter the following note will be played. We also see: 1) that the longer the surrounding notes are sustained (held pressed), the lower the velocity of the note event, and 2) that the higher the note event's pitch in comparison to the surrounding notes, the higher the velocity (an 'interval' is a difference in pitch between two notes).
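For readers who want to see what a Pearson coefficient measures, here is a bare-bones version in plain Python (pandas computes the same quantity in the output above). The gap and velocity values are toy numbers invented for the example, not taken from the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Toy data: the longer the gap before a note, the quieter it is played.
gaps = [0.1, 0.2, 0.4, 0.8]
velocities = [0.8, 0.7, 0.6, 0.3]
gap_correlation = pearson(gaps, velocities)
```

A coefficient near -1 or 1 means a strong linear relationship and one near 0 means little or none, so values around ±0.25, as in the real data, indicate real but modest trends.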

In general, there are two trends here:

The higher the pitch, the louder the note is played.
The slower the tempo, the quieter the note is played.

Results

Training the model for 4 epochs gave a training loss of 0.0498 and a validation loss of 0.0547 (both mean squared error). After that, the model began to overfit (the validation loss increased). This seems a lot higher than for the previous implementation, but remember that we are measuring different things – there the sparse output matrix made for low losses, but here we are making velocity predictions only for note events, and not for silent parts.

The following diagrams are a visual representation of the model’s performance when run on a selection of samples from the validation set. Each graph is a song, and each dot is a note being played. On the y-axis we have the velocity predicted by the model, and on the x-axis the actual velocity played by the human pianist. A perfect model, in other words, would produce a straight diagonal line from the bottom left to the top right.

As you can see in these examples, the model does better than a baseline – there is a noticeable positive correlation between the predicted velocities and the actual velocities.

For some songs, the relationship seems stronger, whereas for others there doesn’t seem to be much of one at all.

As one colleague pointed out when I gave a talk about this at our office, the dots appear rather expanded horizontally and rather compressed vertically. That is to say, the model is reserved, playing it safe by not predicting many extreme values, whereas the human performances have a much higher dynamic range.
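One simple way to put a number on that compression is to compare the spread of the predicted velocities with the spread of the actual ones. The sketch below uses the standard library's population standard deviation and invented toy values; the function name `dynamic_range_ratio` is hypothetical and this is not a measurement from the project.

```python
from statistics import pstdev


def dynamic_range_ratio(predicted, actual):
    """Spread of predicted velocities relative to the actual ones.

    A value below 1 means the predictions are more compressed than the
    human performance.
    """
    return pstdev(predicted) / pstdev(actual)


# Toy values: the predictions follow the trend but hug the middle.
actual = [0.2, 0.4, 0.6, 0.8]
predicted = [0.45, 0.5, 0.55, 0.6]
ratio = dynamic_range_ratio(predicted, actual)
```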

Spotting the difference

But in propria causa nemo debet esse iudex – that’s “no one should be their own esteemed and honourable justice”, for us non-Latin-speakers. I will let the reader judge its efficacy for themselves. For each of the two piano pieces below you will find three sound files – one of the original performance by a human pianist, one of the version with velocities set by the neural network, and one of a baseline version, where the velocity is always set to a constant value. The order has been determined via a random number generator, so you’ll have no way of psychologising your way to a correct answer. The correct answers, by the way, are provided at the end of this article.

J. S. Bach: Prelude and Fugue in C major, BWV 846

A

B

C

Joseph Haydn: Keyboard Sonata No. 47 in B Minor: I. Allegro Moderato

A

B

C

Vision of the future

The opposition to mechanised music performance is as old as the invention itself. In 1737, Jacques de Vaucanson constructed a life-size shepherd statue that, through a sophisticated system of pipes, weights and bellows, was able to play the flute. When it was brought to the attention of the court of Frederick the Great in Potsdam⁵, the Prussian king’s flute teacher, Johann Joachim Quantz, wrote the following:

With skill, a musical machine could be constructed that would play certain pieces with a quickness and exactitude so remarkable that no human being could equal it either with his fingers or with his tongue. Indeed it would excite astonishment, but it would never move you. (James R. Gaines, Evening in the Palace of Reason, Harper Perennial, 2006)

I do not imagine an AI will replace any Yuja Wang, Khatia Buniatishvili or Marc-André Hamelin any time soon. But I do think that once such an AI

is able to mimic the fluctuating tempi (via adjusting the time values of note events) of concert pianists,
is able to take into account all of the composer’s directives in the sheet music (including dynamics, rubato, expressive markings, etc.), and
does all of this adequately,

then it, the AI, will be immensely useful for composers in trying out written passages without the need of a real musician or real instruments.

The solution for the quizzes is as follows. Bach: A – original; B – machine-learned; C – baseline. Haydn: A – baseline; B – original; C – machine-learned.

Links and Attributions

The GitHub repository: github.com/erwald/rachel

Author

Erich Grunewald
Software Developer