Generating Music Notation Videos for Practice or Stage


When transferring some music to my mobile audio player for practising in 2006, I wondered whether it would be possible to have the musical notation available at the same time. Especially when using a video player, one could show some score extract and have the video player play the remaining voices.

Very often you can get some sort of MIDI sequencer for those devices, but this means an extra, often expensive program with special handling that is frequently not too reliable on those small systems. But since most audio players (and of course mobile phones and tablets) support videos, I had the idea to produce a video displaying the musical score while playing the music. This can be done fully automatically from a score written in an open-source notation program, using open-source tools.

One disadvantage compared to a MIDI sequencer is that you cannot freely select the voices to be shown or played in the final video. But at least there is a workaround: the video may contain several video and audio tracks, and you can select the ones relevant for your performance situation. For example, for my two-person band I have the vocal leadsheet with chords as the video track and "all voices", "all voices minus vocals" and "all voices minus guitar and vocals" as audio tracks.

The whole method is based on the notation program lilypond which can produce single score pages as bitmaps and a MIDI file at the same time.

As mentioned above, all this video production for any MPEG4-capable device can be done completely with open-source software. The resulting videos are only slightly larger than an AAC audio file without the score (at least with an MPEG-4 codec). But a word of caution: you must not be afraid of command-line programs, because we shall use several of them. I shall show the commands needed one at a time to illustrate the process, but it is highly recommended to put them into a command file (a helpful command file written in python for Unix and Windows is in the download archive).

As a little motivation for reading on, figure 1 shows one frame of the video for Bach's chorale BWV639 (scaled down by a factor of 2).

Fig. 1: video frame showing measures 11 and 12 from BWV639

The full video can be seen here.


For the description of the process I assume that we are starting from scratch, that is, no music and no score representation whatsoever exists yet. The process may be varied accordingly; for example, if you want to generate a score for a piece already existing as a MIDI file, you have to convert it somehow into the notation software's format and go on from there.

In the assumed context the process for getting a video in MP4 format is as follows:

  1. write a score text file in lilypond format with your favourite text editor (for example, notepad, emacs or vi are fine),
  2. have the music notation program lilypond produce single score pages as bitmaps and a MIDI file,
  3. convert the MIDI file into a WAV file with the player timidity and to an AAC audio file via ffmpeg,
  4. generate the MP4 video from fragments and combine it with the audio via ffmpeg.
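The steps above can be sketched as a small python driver. This is only an illustration: the file names follow the demo example used below, and all tools are assumed to be installed and in the search path.

```python
import subprocess

def build_commands(base):
    """Returns the command lines for typesetting (step 2) and audio
       rendering (step 3) for the lilypond file <base>.ly; the video
       assembly of step 4 is treated separately."""
    return [
        ["lilypond", "--png", base + ".ly"],
        ["timidity", base + ".midi", "-Ow", "-o", base + ".wav"],
        ["ffmpeg", "-i", base + ".wav", "-vbr", "5", base + ".aac"],
    ]

def run_pipeline(base):
    # execute the steps one after another, stopping on the first error
    for command in build_commands(base):
        subprocess.run(command, check=True)
```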

So for the plain generation of the music notation video you need:

  - the notation program lilypond,
  - the MIDI player timidity,
  - the audio and video converter ffmpeg and
  - a python interpreter for the automation script.

Once you have this all installed and have put the binary directories in your system's search path, we are ready to go. In this tutorial I can only show a toy example for space reasons, but of course the idea works for bigger and more complex files, too. In the download archive I provide a script written in python that does most of the described manual stuff automatically.

Manual Run-Through as a Proof of Concept

First we go through the process manually so that you can understand the principle. Later on the process is automated via a script; but you may need to adapt it, hence you need to know the details...

STEP 1: Preparing the Lilypond Score File

The first step is to write a lilypond text file, which contains the notes of the score. You can also generate that file from a MIDI file via tools included in the lilypond distribution, but let's assume we have to write it from scratch.

Lilypond uses a very technical text format for scores and I shall not go into details here, but the main idea is that notes are described by their names (like "c") together with durations, forming sequences. Those sequences can be given symbolic names and combined into systems within a score.

The score for a simple c-major scale up and down in quarter notes and a trailing half note will look like this:

    myMajorScale = \relative c' {
        c4 d e f |
        g a b c  |
        b a g f  |
        e d c2   |
    }

The digits after the note pitches are the durations: "4" means a quarter note, "2" a half note, and notes without an explicit duration last as long as the note before.

To make the file complete we add a reference to an include file (which sets up the output size and the font size) and the score section, which results in:

    \version "2.18.0"
    \include ""

    mySong = \relative c' { c4 d e f | g a b c | b a g f | e d c2 | }

    \score {
            \new Staff { \clef "treble"  \key c \major
                         \tempo 4 = 75 \mySong }
            \layout {}
            \midi {}
    }

As you can see there is an \include directive for a file setting the output page properties, which are specific for the desired output device. Make sure that a file with exactly that name exists in the same directory as the lilypond file above.

It looks as follows:

    % === settings of target device ===
    % -- set resolution to 132ppi, page size to 1024x768
    resolution = 132
    largeSizeInPixels = 1024
    smallSizeInPixels = 768

    % -- set page margins in millimeters
    topBottomMargin = 5
    leftRightMargin = 10

    % -- define size of musical system (standard is 20pt)
    #(set-global-staff-size 40)

    % === derived settings ===
    #(ly:set-option 'resolution resolution)
    largeDimension = #(/ (* largeSizeInPixels 25.4) resolution)
    smallDimension = #(/ (* smallSizeInPixels 25.4) resolution)

    \paper {
        % -- remove all markup --
        print-page-number = ##f
        print-first-page-number = ##f

        % set paper dimensions
        top-margin    = \topBottomMargin
        bottom-margin = \topBottomMargin
        paper-width   = \largeDimension
        paper-height  = \smallDimension
        line-width    = #(- paper-width (* 2 leftRightMargin))
    }

In our case it defines the page size and the target resolution for an old iPad 1 tablet (132ppi, 197mm by 148mm), adjusts some space dimensions on the page and suppresses all headers. Also for demonstration the font size is extremely enlarged.
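The millimetre values just quoted follow from the pixel sizes and the resolution via the conversion used in the include file (25.4 mm per inch); a quick check in python:

```python
# page size in millimetres = size in pixels * 25.4 / resolution in ppi
resolution = 132                 # ppi of the iPad 1 display
large_size_in_pixels = 1024
small_size_in_pixels = 768

width_in_mm  = large_size_in_pixels * 25.4 / resolution
height_in_mm = small_size_in_pixels * 25.4 / resolution
print(round(width_in_mm), round(height_in_mm))   # → 197 148
```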

You can download both in an archive here.

STEP 2: Typesetting the Lilypond File

Now - assuming that lilypond is in the command search path of your system - we can let lilypond typeset the file into bitmap files and generate a MIDI file. Assuming the file is called demo.ly, the command is

    lilypond --png demo.ly

This should produce four new files: a postscript file, demo-page1.png, demo-page2.png and demo.midi. The png files are the key frames of our video; the midi file will later on be converted to some form suitable for a video (for example, an AAC file). The postscript file is the basis for generating the PNG files.

Let's have a look at the frames generated (figure 2); they are scaled down by a factor of 4.

Fig. 2: Demo song PNG files

STEP 3: Generating the Audio File

The next step is to generate some WAV file with the audio of the demo song from the MIDI file generated in the step before. We will use the timidity player for that:

    timidity demo.midi -Ow -o demo.wav

The result is the WAV-file demo.wav, which is compressed into an AAC file demo.aac with ffmpeg (the vbr flag sets variable bitrate and a high quality encoding):

    ffmpeg -i demo.wav -vbr 5 demo.aac

STEP 4: Generating the Raw MP4 Video From Fragments

Now comes a complicated part in the notation video generation: We generate the page fragment videos manually via ffmpeg; later we discuss how to automate the approach.

It is simple to make a video from single pictures. For the generation we need the frame rate, the names of the PNG-files and the durations.

But how do we find out how long each picture must be shown? Fortunately it is simple arithmetic, because we have defined the tempo of the track in the lilypond file.

Assume for example that we have three pictures showing 2 measures, 1 measure and 3 measures, a tempo of 90 quarters per minute and a 3/4 time signature. Picture 1 has to be shown for

2 measure × 3 quarter/measure × 1/90 min/quarter × 60 s/min = 4s

By the same logic the times for the other picture pages would be 2s and 6s.
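In python the same computation is a one-liner; the numbers below are those of the three-picture example (a sketch, the names are mine):

```python
def page_durations(measures_per_page, quarters_per_measure, tempo_qpm):
    """display duration in seconds for each score page, given the
       number of measures shown on it, the measure length in quarters
       and the tempo in quarters per minute"""
    return [count * quarters_per_measure * 60.0 / tempo_qpm
            for count in measures_per_page]

# three pictures with 2, 1 and 3 measures, 3/4 time, 90 quarters per minute
print(page_durations([2, 1, 3], 3, 90))   # → [4.0, 2.0, 6.0]
```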

In our demo example the first picture shows eight measures, the second one shows two measures. With a tempo of 75 quarters per minute and a 4/4 time signature this leads to a time list of "(25.6s, 6.4s)".

Musicians often prefer notes to be visible a little bit before they have to be played, so we shorten each picture's display duration by some amount (each page transition then happens slightly early). Let's assume 0.4s is fine; this leads to a time list of "(25.2s, 6.0s)".

Unfortunately timidity also adds a long note decay to the end of the audio file depending on the last notes and instruments. In our case it is about two seconds; so we either have to add another 2 seconds to the last entry in the time list or accept that no score is shown for that period. For simplicity we'll do the latter.

Now let's generate the MP4 video fragments (each showing a single picture for the desired duration) with ffmpeg. We assume a frame rate of 25 fps and cut off each video after the calculated duration. The strange input framerate of one frame per 100,000 seconds makes the single input picture last long enough so that the video can simply be cut off after the calculated time:

    ffmpeg -framerate 1/100000 -i demo-page1.png -r 25 -t 25.2 demo-part1.mp4 
    ffmpeg -framerate 1/100000 -i demo-page2.png -r 25 -t  6.0 demo-part2.mp4 

Checking the videos shows that they have the correct length and show just a single score picture all the time. Great.

Now we have to concatenate the fragments. Unfortunately ffmpeg has no easy mechanism for that: we have to write the names into a text file and feed that file to ffmpeg. So here is the file called demo-concat.txt:

    FILE: demo-concat.txt
    file 'demo-part1.mp4'
    file 'demo-part2.mp4'
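Writing that list by hand does not scale, but its generation is trivial; a sketch:

```python
def write_concat_file(file_name, fragment_names):
    """writes the fragment list for the ffmpeg concat demuxer:
       one "file '<name>'" line per video fragment"""
    with open(file_name, "w") as f:
        for name in fragment_names:
            f.write("file '%s'\n" % name)

write_concat_file("demo-concat.txt", ["demo-part1.mp4", "demo-part2.mp4"])
```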

The ffmpeg-concatenation is done via the following command:

    ffmpeg -f concat -i demo-concat.txt -c copy demo-noaudio.mp4

A quick check reveals that the video indeed shows the first page for 25 seconds and the second one for 6. But we still lack the audio track...

STEP 5: Combining Video with Audio

The integration of the audio file is simple:

    ffmpeg -i demo-noaudio.mp4 -i demo.aac demo.mp4

Hooray! The video looks like the result we intended to have! Unfortunately this is just a toy example. So we're not done yet: we have to find a way to do this automagically for an arbitrary lilypond source file (using a standard tool chain)...

Automating the process

Of course, typing in the above commands manually every time is a nuisance. Most of the steps are fairly straightforward; unfortunately the calculation of the page transition times is tricky. So how can we handle that automatically or at least semi-automatically?

Lilypond does not log the page breaks anywhere (as far as I know). But as a hack we can parse the postscript file generated when the PNG files are rendered. The trick here is to find the first measure numbers by pattern matching. This approach is fragile, of course: it relies on the fact that the generated postscript file does not change too much during the evolution of lilypond...
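In python such a scan might look as follows. Note that both regular expressions are placeholders: the actual patterns depend on the postscript code your lilypond version emits and have to be determined by inspecting a generated file.

```python
import re

def first_measures_per_page(postscript_text, page_pattern, measure_pattern):
    """splits the postscript text at the page boundaries matched by
       page_pattern and returns the first measure number (captured by
       measure_pattern) found on each page"""
    pages = re.split(page_pattern, postscript_text)[1:]
    result = []
    for page in pages:
        match = re.search(measure_pattern, page)
        if match:
            result.append(int(match.group(1)))
    return result
```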

What is nearly impossible is to get at the tempo list: even though the tempo changes are shown in the postscript output, they are difficult to parse.

Well, when the tempo does not change too often, the tempo track can be put into a separate configuration text file. The syntax is simple: each tempo change is on a single line with the measure number followed by the number of quarters per measure and the tempo indication. For example, assume a song with 120bpm starting in 3/4 and changing to 60bpm with 4/4 at measure 74. The configuration file looks as follows:

    FILE: song-tempo.txt
     1  -> 3|120
    74  -> 4|60

You may also use fractional measure lengths and tempo indications. For example, a 7/8 measure has a length of 3.5 quarters.
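Reading that configuration file back is straightforward; a sketch of a parser returning (measure, quartersPerMeasure, tempo) triples:

```python
def read_tempo_track(file_name):
    """parses a tempo track file with lines like '74 -> 4|60' into a
       list of (measure number, quarters per measure, quarters per
       minute) triples; empty lines are skipped"""
    result = []
    with open(file_name) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            measure, data = line.split("->")
            quarters, tempo = data.split("|")
            result.append((float(measure), float(quarters), float(tempo)))
    return result
```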

That naive approach does not handle tempo changes within a measure or things like rallentando or accelerando. Well, it's at least an 80% approach...

So we have everything set up for the generation. The automation is done by a python script found here or as part of the download archive from the download section.

One additional feature is included: on demand the script generates a switchable subtitle with the measure numbers. This is helpful because it aids in finding the current location in the piece without being too intrusive.
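The subtitle format used is plain SRT: a numbered entry per measure with a start and an end timestamp. A minimal sketch of how such entries can be produced from a list of measure start times (which in turn follow from the tempo track):

```python
def srt_timestamp(seconds):
    """converts seconds into the SRT timestamp format HH:MM:SS,mmm"""
    milliseconds = int(round(seconds * 1000))
    s, ms = divmod(milliseconds, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def measure_subtitles(measure_start_times):
    """returns SRT text showing each measure number from its start
       time up to the start of the following measure"""
    lines = []
    for i in range(len(measure_start_times) - 1):
        lines.append("%d" % (i + 1))                      # entry number
        lines.append(srt_timestamp(measure_start_times[i]) + " --> "
                     + srt_timestamp(measure_start_times[i + 1]))
        lines.append("%d" % (i + 1))                      # subtitle text
        lines.append("")
    return "\n".join(lines)
```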

The parameters of the python program are as follows:

        --tempo tempoTrackFileName
        -r frameRate
        -o targetMp4FileName
        [--scalefactor scaleFactor]
        [--countin countInMeasures]
        [--subtitle [targetSubtitleFileName]]
        lilypondFileName

with the following meaning:

    lilypondFileName                      name of the input lilypond file
    --tempo tempoTrackFileName            name of the file with tempo track data
    -r frameRate                          frame rate for the output file
    -o targetMp4FileName                  name of the target MP4 file
    --scalefactor scaleFactor             integer factor to scale down the page images (default: 1)
    --countin countInMeasures             number of count-in measures before the first (default: 0.0); you might also use a negative offset here for advancing the video by some beats for easier continuation after page turning...
    --subtitle [targetSubtitleFileName]   name of the target subtitle file with measure numbers in SRT format; if given without a file name, a temporary SRT file is generated and added as subtitle to the final video
    --noaudio                             do not generate an audio file, just produce a plain notation video
    -v, --verbose                         set verbose mode
    --debug                               set debugging mode giving additional information about program flow
    -h, --help                            show this help message and exit

Let us take the Bach piece BWV639 as the example. The tempo is 60 quavers or 30 quarters per minute, so the tempo track file is

    FILE: bach_639-tempo.txt
    1 -> 4|30

We now collect the parameters for the command: a frame rate of 5 fps, a subtitle with measure numbers, page images scaled down by a factor of 2 and a count-in of a quarter measure. So the command is

    python --tempo bach_639-tempo.txt \
        -o bach_639.mp4 -r 5 --subtitle --scalefactor 2 --countin 0.25

And the output file bach_639.mp4 is here.


By modifying the tool chain several variations are possible: one can, for example, adapt the output to other target devices and resolutions or produce "music minus one" (MMO) videos with several selectable audio tracks.

A Case Study: Instrument MMO Video

Let me elaborate on one of those variations, the instrument MMO video. For my band I need notation videos of some pieces with several audio tracks, rendered both in landscape and portrait format (depending on how I use the tablet on stage).

Maintaining and configuring the lilypond files is tedious unless you use a modular approach: put the music data in some lilypond include file and have the root lilypond file generated from configuration data. The music data file simply contains the voices and some macro definitions like "keyAndTime" or "countIn" without any surrounding lilypond framework as follows:

  keyAndTime = { \key ... \time ... }
  initialTempo = { \tempo 4 = ... }
  tempoTrack = { \initialTempo }
  drumsCountIn = \drummode { ss4 ss ss ss | }
  countIn = { r1 | }
  vocalsIntro = { ... }
  vocalsVerseA = { ... }
  vocals = { \vocalsIntro \vocalsVerseA... }
  vocLyricsMidi = \lyricmode { ... }
  bassIntro =  { ... }
  bassVerseA = { ... }
  bass = { \bassIntro \bassVerseA... }

A simple generator (for example, written in a scripting language) now gets the voice names and whether to generate a score extract or some midi file plus additional parameters (like for example, the video size). It knows some conventions about the involved clefs and whether voices have to be combined into a system (for example, for a piano or an organ) and produces the framework code.
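The clef and system conventions of such a generator can be as simple as a lookup table; all names in this sketch are hypothetical and have to match your own voice naming:

```python
# hypothetical convention table: voice name -> (clef, staff kind)
voice_properties = {
    "vocals":   ("treble", "Staff"),
    "guitar":   ("G_8",    "Staff"),
    "bass":     ("bass_8", "Staff"),
    "keyboard": (None,     "PianoStaff"),
    "drums":    (None,     "DrumStaff"),
}

def staff_code(voice_name):
    """returns a lilypond framework line for a voice; staves without
       a fixed clef (piano, drums) are left to the caller"""
    clef, kind = voice_properties[voice_name]
    if clef is None:
        return '\\new %s = "%s" { ... }' % (kind, voice_name)
    return ('\\new %s = "%s" { \\clef "%s" \\keyAndTime \\%s }'
            % (kind, voice_name, clef, voice_name))
```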

For example, when having bass, keyboard and drums in the midi track and vocals and guitar in the score extract the generated file might look as follows:

  \version "2.18.0"
  \include ""
  \include ""

  % include note stuff
  \include ""

  \score {
      <<
          \new Staff = "vocals" {
            \new Voice = "songA"  { \clef "treble" \keyAndTime \vocals } }
          \new Lyrics \lyricsto "songA" { \vocLyricsMidi }
          \new Staff = "guitar" { \clef "G_8" \keyAndTime \guitar }
      >>
      \layout {}
  }

  \score {
      <<
          { \initialTempo \countIn \tempoTrack }
          \new Staff = "bass"
            \with { midiInstrument = "electric bass (pick)" } {
              \unfoldRepeats { \keyAndTime \countIn \bass } }
          \new PianoStaff = "keyboard"
            \with { midiInstrument = "rock organ" } <<
                  \new Staff { \unfoldRepeats { \keyAndTime \countIn
                                                \keyboardTop } }
                  \new Staff { \unfoldRepeats { \keyAndTime \countIn
                                                \keyboardBottom } }
          >>
          \new DrumStaff = "drums" \with { midiInstrument = "power kit" } {
              \unfoldRepeats { \keyAndTime \drumsCountIn \myDrums } }
      >>
      \midi {}
  }

I have not included the generator here, because it typically varies with the requirements. Nevertheless that generation of the boilerplate stuff is part of my video generation tool chain and it simplifies the process enormously.

Another possible improvement is the postprocessing of the audio tracks. timidity produces acceptable audio from soundfonts, but not really stunning audio. It is possible however to make timidity produce WAV files per track and postprocess those with some command-line audio processors (like for example, sox). This improves sound quality dramatically (especially for backing tracks), but for a fully automatic tool-chain you have to make those settings configurable, too. This is really tedious...

Once you have individual track audio files, it is also possible to do several submixes (for example, one with lead vocals and one without). Those submixes can be provided as audio tracks of a single video. In typical video players you can then select the required track on-the-fly via some identification. Simple players only allow language codes as identification ("the French audio track"); advanced players will show description texts with a detailed specification of the audio track.
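With ffmpeg such a multi-track file can be produced in a single call: one -map option per stream plus a title metadata entry per audio track. A sketch that only builds the command line (file names and titles are made up):

```python
def multi_track_command(video_file, audio_tracks, target_file):
    """builds an ffmpeg command line muxing one video file and several
       (audio file, title) pairs into a single MP4 file"""
    command = ["ffmpeg", "-i", video_file]
    for file_name, _ in audio_tracks:
        command += ["-i", file_name]
    command += ["-map", "0:v"]
    for i, (_, title) in enumerate(audio_tracks):
        command += ["-map", "%d:a" % (i + 1),
                    "-metadata:s:a:%d" % i, "title=" + title]
    command += ["-c", "copy", target_file]
    return command
```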


The article has demonstrated how to make music notation videos for portable video players with little effort. The necessary tools are all open-source; the one important step to master is the textual music notation language of lilypond.

All other processing steps are easy to do, and the resulting notation video shows high-quality score pages, limited only by the dimensions and resolution of the target device and the quality of the MIDI sound rendering.


You can download an archive with this page (as the manual), the demo and Bach639 lilypond and configuration files, the lilypond include, and the python file for your own use.

The files and the method described are put into the public domain, but both are unsupported. If you want to comment, you can contact me by electronic mail (see below), but I cannot promise an immediate answer or even any answer at all.


This text is an update of my original article on video generation from music notation, which dates back to 2006. The approach then was to use avisynth as the video frame generator; it has been replaced by ffmpeg in the current version.

Thanks to Rüdiger Murach for posing that challenge to me in a coffee-break discussion at work and to my partner Ulrike Gröttrup for her patience when I was always showing her yet another boring notation video...
