lilypondToBandVideoConverter - Generating Music Notation Videos for Practice or Stage


When transferring some music to my mobile audio player for practising in 2006, I wondered whether it would be possible to have the musical notation available at the same time. Especially when using a video player, one could show some score extract and have the video player play the remaining voices.

Very often you can get some sort of MIDI sequencer for those devices, but this means an extra, often expensive program with special handling that is frequently not too reliable on those small systems. But since most audio players (and of course mobile phones and tablets) support videos, I had the idea to produce a video displaying the musical score while playing the music. This can be done fully automatically from a score notated in an open-source program, using open-source tools.

One disadvantage compared to a MIDI sequencer is that you cannot select the voices to be shown or played in the final video. But at least there is a workaround: the video may contain several video and audio tracks, and you can select the ones relevant for your performance situation. For example, for my two-person band I have the vocal leadsheet with chords as the video track and "all voices", "all voices minus vocals" and "all voices minus guitar and vocals" as audio tracks.

The whole method is based on the notation program lilypond, which can produce single score pages as bitmaps and a MIDI file at the same time.

As mentioned above, all this video production for any video-capable device can be done completely with open-source software. The resulting videos are only slightly larger than AAC audio files without score (at least with a good video codec), because the presented frames do not change much and hence can be heavily compressed.

In this document I shall show you how to do this automatically with a Python tool called lilypondToBandVideoConverter that relies on other tools. If you want to know what is behind all of this, the same process is also walked through manually, in simplified form.

As a little motivation for reading on, figure 1 shows one frame of the video for Bach's chorale BWV639 (it is scaled down by a factor of 2; clicking reveals the original picture...).

video frame showing measures 11 and 12 from BWV639

Fig. 1: video frame showing measures 11 and 12 from BWV639

The full video can be seen here.

The full documentation for lilypondToBandVideoConverter can be found here; the automation example is taken from there.


The lilypondToBandVideoConverter program assumes you have a Python interpreter on your system (either 2.7 or 3.3 and above); most systems have one. Additionally you have to install the following open-source software to get going:

Additionally you have to install lilypondToBandVideoConverter itself from the PyPI repository via

  pip install lilypondToBandVideoConverter

When installation is done (and the Python script directory is on your executable path), you are ready to go...

Automatic Generation with the LilypondToBandVideoConverter


So how does this work? lilypondToBandVideoConverter uses an input lilypond file and a configuration file controlling the complete generation process. Its processing phases produce the various outputs incrementally. The phases have the following meanings:

Of course, those phases are not independent. Several phases rely on results produced by other phases. Figure 2 shows how the phases depend on each other. The files (in yellow) are generated by the phases (in magenta), the configuration file (in green) and the lilypond fragment file (in blue) are the only manual inputs into the processing chain.

For example, the phase rawaudio needs a MIDI file as input containing all voices to be rendered as audio files.
Data Flow of the LilypondToBandVideoConverter

Fig. 2: Data Flow of the LilypondToBandVideoConverter


As an example we take a twelve-bar blues in E with two verses and some intro and outro. Note that this song is just an example, its musical merit is limited.

In the following we shall work with three files:

In principle one only needs a single configuration file and a single lilypond fragment file, but with this approach we can keep global and song-specific settings separate.

In the following we explain the lilypond fragment file and configuration file in pieces; the complete versions are in the distribution of the tool.

Example Lilypond Fragment File

To understand this section you need at least a minimum knowledge of the music typesetting language lilypond (or of similar systems like MusicTeX). Lilypond is a very technical text file format for scores and I shall not go into details, but the main idea is that notes are described by their English names (like "c") together with durations, forming sequences. Those sequences can be assigned to variables using an equals sign and may be combined into systems within a score. Lilypond commands and variable references are introduced by backslashes.

The lilypond fragment file starts with the inclusion of the note name language file:

  \include ""

The first musical definition is the key and time designation of the song: it is in E major and uses common time.

  keyAndTime = { \key e \major  \time 4/4 }

The chords are those of a plain blues with a very simple intro and outro. Note that the chords differ between the extract/score renderings and the others: for the extract and score we use a volta repeat for the verses; hence in that case all verse lyrics are stacked vertically and we only have one pass of the verse.

All chords are generic: there is no distinction by instrument.

  chordsIntro = \chordmode { b1*2 | }
  chordsOutro = \chordmode { e1*2 | b2 a2 | e1 }
  chordsVerse = \chordmode { e1*4 | a1*2 e1*2 | b1 a1 e1*2 }

  allChords = {
    \chordsIntro  \repeat unfold 2 { \chordsVerse }
    \chordsOutro }

  chordsExtract = { \chordsIntro  \chordsVerse  \chordsOutro }
  chordsScore   = { \chordsExtract }

The vocals are simple, with a pickup measure. Because we want to keep the notation consistent across the voices, we have to use two alternate endings for \vocalsExtract and \vocalsScore.

  vocTransition = \relative c' { r4 b'8 as a g e d | }
  vocVersePrefix = \relative c' {
    e2 r | r8 e e d e d b a |
    b2 r | r4 e8 d e g a g | a8 g4. r2 | r4 a8 g a e e d |
    e2 r | r1 | b'4. a2 g8 | a4. g4 d8 d e~ | e2 r |
  }

  vocIntro = { r1 \vocTransition }
  vocVerse = { \vocVersePrefix \vocTransition }

  vocals = { \vocIntro \vocVerse \vocVersePrefix R1*5 }
  vocalsExtract = {
    \repeat volta 2 { \vocVersePrefix }
    \alternative {
        { \vocTransition }{ R1 }
    }
  }
  vocalsScore = { \vocalsExtract }

The lyrics of the demo song are really bad. Nevertheless, note the lilypond syllable separators and the stanza marks. For the video notation the lyrics are serialized. Because of the pickup measure, the lyrics have to be juggled around a bit.

  vocalsLyricsBPrefix = \lyricmode {
    \set stanza = #"2. " Don't you know I'll go for }

  vocalsLyricsBSuffix = \lyricmode {
    good, be- cause you've ne- ver un- der- stood,
    that I'm bound to leave this quar- ter,
    walk a- long to no- ones home:
    go down to no- where in the end. }

  vocalsLyricsA = \lyricmode {
    \set stanza = #"1. "
    Fee- ling lone- ly now I'm gone,
    it seems so hard I'll stay a- lone,
    but that way I have to go now,
    down the road to no- where town:
    go down to no- where in the end.
    \vocalsLyricsBPrefix }

  vocalsLyricsB = \lyricmode {
    _ _ _ _ _ _ \vocalsLyricsBSuffix }
  vocalsLyrics = { \vocalsLyricsA \vocalsLyricsBSuffix }
  vocalsLyricsVideo = { \vocalsLyrics }

The bass simply hammers out eighth notes. As before there is an extract and a score version with volta repeats and an unfolded version for the rest.

  bsTonPhrase  = \relative c, { \repeat unfold 7 { e8  } fs8 }
  bsSubDPhrase = \relative c, { \repeat unfold 7 { a'8 } gs8 }
  bsDomPhrase  = \relative c, { \repeat unfold 7 { b'8 } cs8 }
  bsDoubleTonPhrase = { \repeat percent 2 { \bsTonPhrase } }
  bsOutroPhrase = \relative c, { b8 b b b gs a b a | e1 | }

  bsIntro = { \repeat percent 2 { \bsDomPhrase } }
  bsOutro = { \bsDoubleTonPhrase  \bsOutroPhrase }
  bsVersePrefix = {
    \repeat percent 4 { \bsTonPhrase }
    \bsSubDPhrase \bsSubDPhrase \bsDoubleTonPhrase
    \bsDomPhrase \bsSubDPhrase \bsTonPhrase
  }
  bsVerse = { \bsVersePrefix \bsTonPhrase }

  bass = { \bsIntro  \bsVerse \bsVerse  \bsOutro }
  bassExtract = {
    \repeat volta 2 { \bsVersePrefix }
    \alternative {
      {\bsTonPhrase} {\bsTonPhrase}
    }
  }
  bassScore = { \bassExtract }

The guitar plays arpeggios. As can be seen here, very often the lilypond macro structure is similar for different voices.

  gtrTonPhrase  = \relative c { e,8 b' fs' b, b' fs b, fs }
  gtrSubDPhrase = \relative c { a8 e' b' e, e' b e, b }
  gtrDomPhrase  = \relative c { b8 fs' cs' fs, fs' cs fs, cs }
  gtrDoubleTonPhrase = { \repeat percent 2 { \gtrTonPhrase } }
  gtrOutroPhrase = \relative c { b4 fs' a, e | <e b'>1 | }

  gtrIntro = { \repeat percent 2 { \gtrDomPhrase } }
  gtrOutro = { \gtrDoubleTonPhrase | \gtrOutroPhrase }
  gtrVersePrefix = {
    \repeat percent 4 { \gtrTonPhrase }
    \gtrSubDPhrase  \gtrSubDPhrase  \gtrDoubleTonPhrase
    \gtrDomPhrase  \gtrSubDPhrase  \gtrTonPhrase
  }
  gtrVerse = { \gtrVersePrefix \gtrTonPhrase }

  guitar = { \gtrIntro  \gtrVerse  \gtrVerse  \gtrOutro }
  guitarExtract = {
    \repeat volta 2 { \gtrVersePrefix }
    \alternative {
      {\gtrTonPhrase} {\gtrTonPhrase}
    }
  }
  guitarScore = { \guitarExtract }

Finally the drums play some monotonic blues accompaniment. We have to use the name \myDrums here, because \drums is a predefined name in lilypond. There is no preprocessing of the lilypond fragment file: it is just included into some boilerplate code.

  drmPhrase = \drummode { <bd hhc>8 hhc <sn hhc> hhc }
  drmOstinato = { \repeat unfold 2 { \drmPhrase } }
  drmFill = \drummode { \drmPhrase tomh8 tommh toml tomfl }
  drmIntro = { \drmOstinato  \drmFill }
  drmOutro = \drummode {
    \repeat percent 6 { \drmPhrase } | <sn cymc>1 | }
  drmVersePrefix = {
    \repeat percent 3 { \drmOstinato }  \drmFill
    \repeat percent 2 { \drmOstinato  \drmFill }
    \repeat percent 3 { \drmOstinato }
  }
  drmVerse = { \drmVersePrefix \drmFill }

  myDrums = { \drmIntro  \drmVerse \drmVerse  \drmOutro }
  myDrumsExtract = { \drmIntro
    \repeat volta 2 {\drmVersePrefix}
    \alternative {
     {\drmFill} {\drmFill}
    }
    \drmOutro }
  myDrumsScore = { \myDrumsExtract }

So we are done with the lilypond fragment file. What we have defined are

All those definitions ensure that, in our case, the notation differs between the extract/score renderings and the other renderings.

Example Configuration Files

As mentioned above the configuration is split up into a file with global settings and one with the song settings.

As a convention we prefix auxiliary variables with an underscore to distinguish them from the real configuration variables.

Example Global Configuration

The first setup steps define the program locations. We assume that all programs are located together in some directory, but this depends on your environment. All definitions assume a Unix context; on Windows you may also use slashes as path separators.

  _programDirectory = "/usr/local"
  _soundFonts = "/usr/lib/soundfonts/FluidR3_GM.SF2"
  aacCommandLine = _programDirectory "/qaac -V100 -i ${infile} -o ${outfile}"
  ffmpegCommand = _programDirectory "/ffmpeg"
  lilypondCommand = _programDirectory "/lilypond"
  lilypondVersion = "2.18.2"
  midiToWavRenderingCommandLine = \
    _programDirectory "/fluidsynth.exe -n -i -g 1 -R 0" \
    " -F ${outfile} " _soundFonts " ${infile}"
  _sox = _programDirectory "/sox"
  audioProcessor = "{" \
    "mixingCommandLine: '" _sox \
        " -m [-v ${factor} ${infile} ] ${outfile}'," \
    "amplificationEffect: 'gain ${amplificationLevel}'," \
    "paddingCommandLine: '" _sox \
        " ${infile} ${outfile} pad ${duration}'," \
    "refinementCommandLine: '" _sox \
        " ${infile} ${outfile} ${effects}'" \
    "}"

We have not provided a definition for the mp4box command because, as a default, ffmpeg can also do the MP4 container packaging. Note also that the aac and sox command lines require more extensive definitions.

Other global settings define paths for files or directories. The generated PDF and MIDI files go to the subdirectory "generated" of the current directory, audio files into "/tmp/audiofiles".

  loggingFilePath = "/tmp/logs/ltbvc.log"
  targetDirectoryPath = "generated"
  tempAudioDirectoryPath = "/tmp/audiofiles"

For the notation we ensure that the drums use a drum staff, that the clefs for bass and guitar are transposed by an octave, and that the drums have no clef at all. Chords shall be shown for all extracts of melodic instruments and on the top voice "vocals" in the score and video.

  _voiceNameToStaffListMap = "{ drums : DrumStaff }"
  _voiceNameToClefMap = "{" \
    "bass : bass_8, drums : '', guitar : G_8" \
  "}"

  phaseAndVoiceNameToStaffListMap = "{"      \
    "extract :" _voiceNameToStaffListMap ","  \
    "midi    :" _voiceNameToStaffListMap ","  \
    "score   :" _voiceNameToStaffListMap ","  \
    "video   :" _voiceNameToStaffListMap "}"

  phaseAndVoiceNameToClefMap = "{"      \
    "extract :" _voiceNameToClefMap ","  \
    "midi    :" _voiceNameToClefMap ","  \
    "score   :" _voiceNameToClefMap ","  \
    "video   :" _voiceNameToClefMap "}"

  voiceNameToChordsMap = "{" \
    "vocals : s/v, bass : e, guitar : e" \
  "}"

The humanization for the MIDI and audio files is quite simple: we use a rock groove with tight hits on two and four and slight variations for the other measure positions. The timing variations are very subtle: at most 0.2 thirty-second notes.

As for the velocity variation, there is a hard accent on two and a lighter accent on four, while the other positions are much weaker.

We have not defined individual variation factors per instrument; hence all humanized instruments have similar variations in timing and velocity.

  countInMeasureCount = 2

  humanizationStyleRockHard  = \
    "{ 0.00: 1.0/0.1, 0.25: 1.15/0," \
    "  0.50: 0.95/0.1, 0.75: 1.1/0," \
    "  OTHER: 0.85/B0.2, SLACK:0.1," \
    "  RASTER: 0.03125 }"
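To get a feeling for the magnitude, the maximum timing variation of 0.2 thirty-second notes can be converted into milliseconds. This is a side calculation, not part of the tool; the tempo of 85 bpm is the one defined later in the song configuration:

```python
def max_timing_shift_ms(tempo_in_bpm: float,
                        variation_in_32nds: float = 0.2) -> float:
    """Convert a timing variation given in 1/32nd notes to milliseconds."""
    quarter_duration = 60.0 / tempo_in_bpm   # seconds per quarter note
    thirty_second = quarter_duration / 8.0   # a quarter contains eight 1/32nds
    return variation_in_32nds * thirty_second * 1000.0

print(round(max_timing_shift_ms(85), 1))  # → 17.6
```

So even the loosest positions are shifted by less than 18 ms, which is barely audible.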

Video generation is done for just a single video target called "tablet" with a portrait orientation and a classical 4:3 aspect ratio. The strange integer below for the subtitle color is a hexadecimal 8800FFFF, that is, a yellow with about 45% transparency. The videos show both vocals and guitar and are characterized as "Music Videos" in their media type.

  videoTargetMap = "{" \
      "tablet: { resolution: 132," \
               " height: 1024," \
               " width: 768," \
               " topBottomMargin: 5," \
               " leftRightMargin: 10," \
               " scalingFactor: 4," \
               " frameRate: 10.0," \
               " mediaType: 'Music Video'," \
               " systemSize: 25," \
               " subtitleColor: 2281766911," \
               " subtitleFontSize: 20," \
               " subtitlesAreHardcoded: true } }"

  videoFileKindMap = "{" \
      "tabletVocGtr: { target:         tablet,"      \
                     " fileNameSuffix: '-tblt-vg',"     \
                     " directoryPath:  './mediaFiles' ," \
                     " voiceNameList:  'vocals, guitar' } }"
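The subtitle color value can be double-checked with a few lines of Python. The byte layout assumed here (alpha in the top byte, then blue, green, red, as in ASS subtitle colors, which makes 0x00FFFF a yellow) is an interpretation based on the text above:

```python
subtitle_color = 2281766911
assert subtitle_color == 0x8800FFFF   # the hexadecimal stated in the text

alpha = subtitle_color >> 24          # top byte: 0x88 = 136
transparency = 1.0 - alpha / 255.0    # fraction that is see-through
print(round(transparency * 100))      # → 47, close to the "about 45%"
```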

For the transformation from MIDI tracks to audio files there are only two sound style definitions: an extreme bass and a crunchy guitar. Both use overdrive and some sound shaping; the guitar style also applies a bit of compression. We use sox for that, a command-line program for audio processing. Details of its parameters can be found in the sox documentation.

In principle it is possible to use another command-line refinement program by using the variable "audioRefinementCommandLine".

For all the other voices we shall specify later that they just use the raw audio files with some reverb added.

  soundStyleBassExtreme = \
      " norm -12 highpass -2 40 lowpass -2 2k" \
      " norm -10 overdrive 30 0" \
      " norm -24 equalizer  150 4o +10 lowpass -2 600 1.2o"

  soundStyleGuitarCrunch = \
      " highpass -1 100 norm -6" \
      " compand 0.04,0.5 6:-25,-20,-5 -6 -90 0.02" \
      " overdrive 10 40"

For the final audio files we have two variants: one with all voices, the other without vocals and background vocals (the "karaoke version"). The song and album names carry the appropriate info in brackets.

All songs and videos will go to the "mediaFiles" subdirectory of the current directory and have a JPEG file as their embedded album art. Audio and video files have "test-" as their prefix before the song name. So, for example, the audio file for "Wonderful Song" with all voices has the path "./mediaFiles/test-wonderful_song.m4a".

  targetFileNamePrefix = "test-"
  audioTargetDirectoryPath = "./mediaFiles"
  albumArtFilePath = "./mediaFiles/demo.jpg"
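The naming rule can be paraphrased in a few lines (a hypothetical helper for illustration, not code from the tool):

```python
def audio_file_path(directory: str, target_prefix: str,
                    file_name_prefix: str, suffix: str = "") -> str:
    """Assemble the path of a mixed audio file from its name parts."""
    return f"{directory}/{target_prefix}{file_name_prefix}{suffix}.m4a"

print(audio_file_path("./mediaFiles", "test-", "wonderful_song"))
# → ./mediaFiles/test-wonderful_song.m4a
```

The karaoke variant would analogously get the suffix "-v" from its audioFileTemplate '$-v' and end up as "./mediaFiles/test-wonderful_song-v.m4a".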

  audioGroupToVoicesMap = "{" \
      " base : bass/keyboard/strings/drums/percussion," \
      " voc  : vocals/bgVocals," \
      " gtr  : guitar" \
  "}"

Because the "_voiceNameToAudioLevelMap" definition (used in the following) is done in the song configuration, the following part must come after the song configuration file (e.g. in a separately included file).
  audioTrackList = "{" \
      "all :      { audioGroupList : base/voc/gtr," \
      "  audioFileTemplate : '$'," \
      "  songNameTemplate  : '$ [ALL]'," \
      "  albumName         : '$'," \
      "  description       : 'all voices'," \
      "  languageCode      : deu," \
      "  voiceNameToAudioLevelMap : "_voiceNameToAudioLevelMap" }," \
      "novocals : { audioGroupList : base/gtr," \
      "  audioFileTemplate : '$-v'," \
      "  songNameTemplate  : '$ [-V]'," \
      "  albumName         : '$ [-V]'," \
      "  description       : 'no vocals'," \
      "  languageCode      : eng," \
      "  voiceNameToAudioLevelMap : "_voiceNameToAudioLevelMap" }" \
  "}"

Example Song Configuration

There is not much left to define the song. First come the overall properties:

  title = "Wonderful Song"
  fileNamePrefix = "wonderful_song"
  year = 2017
  composerText = "arranged by Fred, 2017"
  trackNumber = 99
  artistName = "Fred"
  albumName = "Best of Fred"

The main information about a song is given in the table of voices with the voice names, MIDI data, audio and reverb levels and the sound variants. As mentioned before, only bass and guitar have audio postprocessing.

  voiceNameList      =  "vocals,    bass,  guitar,   drums"
  midiChannelList    =  "     2,       3,       4,      10"
  midiInstrumentList =  "    54,      35,      29,      18"
  midiVolumeList     =  "   100,     120,      70,     110"
  panPositionList    =  "     C,    0.3R,    0.8R,    0.1L"
  reverbLevelList    =  "   0.3,     0.2,     0.0,     0.4"
  soundVariantList   =  "  COPY, EXTREME,  CRUNCH,    COPY"
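Since those settings are parallel comma-separated lists, it may help to see them zipped into one record per voice. The following is merely a reading aid, not the tool's internal representation:

```python
voice_names    = "vocals,    bass,  guitar,   drums"
midi_volumes   = "   100,     120,      70,     110"
pan_positions  = "     C,    0.3R,    0.8R,    0.1L"
sound_variants = "  COPY, EXTREME,  CRUNCH,    COPY"

def entries(line):
    """Split a comma-separated configuration list into stripped entries."""
    return [entry.strip() for entry in line.split(",")]

voices = {name: {"volume": int(volume), "pan": pan, "variant": variant}
          for name, volume, pan, variant
          in zip(entries(voice_names), entries(midi_volumes),
                 entries(pan_positions), entries(sound_variants))}
print(voices["guitar"])  # → {'volume': 70, 'pan': '0.8R', 'variant': 'CRUNCH'}
```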

The audio levels are given as a mapping from voice name to volume factor in decibels, which is used in the audio track list. We use a single mapping for all targets, which means the relative levels are identical in all mixes.

  _voiceNameToAudioLevelMap = \
    "{ vocals : 0, bass : -1.6, guitar : -9.7, drums : 3.5 }"

Note that the above definition must come before the audioTrackList definition.
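For intuition: decibel values translate into the linear amplitude factors used by mixing commands (such as sox's -v option) via the standard formula factor = 10^(dB/20). This conversion is general audio knowledge, not something specific to the tool:

```python
def db_to_factor(level_in_db: float) -> float:
    """Convert a level in decibels into a linear amplitude factor."""
    return 10.0 ** (level_in_db / 20.0)

audio_level_map = {"vocals": 0.0, "bass": -1.6, "guitar": -9.7, "drums": 3.5}
for voice, level in audio_level_map.items():
    print(f"{voice}: {db_to_factor(level):.3f}")
# → vocals: 1.000, bass: 0.832, guitar: 0.327, drums: 1.496
```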

We also have lyrics: two lines of lyrics in the vocals extract and score, and one (serialized) line in the video.

  voiceNameToLyricsMap = "{ vocals : e2/s2/v }"

Humanization relies on the humanization style defined in the global configuration. It applies to all voices except vocals and starts in measure 1.

  styleHumanizationKind = "humanizationStyleRockHard"
  humanizedVoiceNameSet = "bass, guitar, drums"
  measureToHumanizationStyleNameMap = \
      "{ 1 : humanizationStyleRockHard }"

The overall tempo is 85 bpm throughout the song.

  measureToTempoMap = "{ 1 : 85 }"

Putting it All Together

Now we are set to start the tool chain. Assuming that the configuration is in file "wonderful_song-config.txt" and the lilypond stuff is in "", the command to produce everything is

  lilypondToBVC --phases all wonderful_song-config.txt

and it produces the following target files:

Figure 3 shows an extract page (a), one image of the target video (b) and the first score page (c) as an illustration. The generated (unimpressive!) humanized MIDI file is here and the final video with two audio tracks is here.

(a) example of notation page for voice extract (c) example of notation page for score
(b) example of notation page within video

Fig. 3: Examples for Target File Images

Detailed Manual Walk-Through

If you want to understand how this all works in principle, the following section manually walks through the different steps of the lilypondToBandVideoConverter. This is grossly simplified: e.g. there is no MIDI humanization, the resulting audio file is rendered in one go without fine-tuning the voices, and the final video has only two tracks. But hopefully you get the idea...

For the walk-through I assume that we are starting from scratch, that is, no music exists so far and no score representation whatsoever. The example lilypond file is reduced to the minimum, because it only has to illustrate the idea.

In the context assumed the process for getting a video in MP4 format is as follows:

  1. write a score text file in lilypond format with your favourite text editor (for example, notepad, emacs or vi are fine),
  2. have the music notation program lilypond produce single score pages as bitmaps and a MIDI file,
  3. convert the MIDI file into a WAV file with the player fluidsynth and to an AAC audio file via ffmpeg,
  4. generate the MP4 video from the fragments via ffmpeg, and
  5. combine the video with the AAC audio file into the final MP4 video via ffmpeg.

A word of caution in advance: when doing this you must not be afraid of command-line programs, because they provide the necessary services (standing on the shoulders of giants...). I shall show the commands needed one at a time to illustrate the process.

STEP 1: Preparing the Lilypond Score File

The first step is to write a lilypond text file containing the notes of the score. You could also generate that file from a MIDI file via tools included in the lilypond distribution, but let's assume we have to write it from scratch.

The lilypond score for a simple c-major scale up and down in quarter notes and a trailing half note will look like this:

    myMajorScale = \relative c' {
        c4 d e f |
        g a b c  |
        b a g f  |
        e d c2   |
    }

The digits behind the note pitches are the durations: "4" means a quarter note, "2" a half note, and notes without an explicit duration have the same duration as the note before.

To make the file complete we add a reference to an include file setting up the output size, some command to define the font size, and the score section, which results in:

    \version "2.18.2"
    \include ""

    mySong = \relative c' { c4 d e f | g a b c | b a g f | e d c2 | }

    \score {
        \new Staff { \clef "treble"  \key c \major
                     \tempo 4 = 75 \mySong }
    }

As you can see, there is an \include directive for a file setting the output page properties, which are specific to the desired output device. This file is called and is also in the same directory as the lilypond file above.

It looks as follows:

    % === settings of target device ===
    % -- set resolution to 132ppi, page size to 1024x768
    resolution = 132
    largeSizeInPixels = 1024
    smallSizeInPixels = 768

    % -- set page margins in millimeters
    topBottomMargin = 5
    leftRightMargin = 10

    % -- define size of musical system (standard is 20pt)
    #(set-global-staff-size 40)

    % === derived settings ===
    #(ly:set-option 'resolution resolution)
    largeDimension = #(/ (* largeSizeInPixels 25.4) resolution)
    smallDimension = #(/ (* smallSizeInPixels 25.4) resolution)

    \paper {
        % -- remove all markup --
        print-page-number = ##f
        print-first-page-number = ##f

        % set paper dimensions
        top-margin    = \topBottomMargin
        bottom-margin = \topBottomMargin
        paper-width   = \largeDimension
        paper-height  = \smallDimension
        line-width    = #(- paper-width (* 2 leftRightMargin))
    }

In our case it defines the page size and the target resolution for an old iPad 1 tablet (132 ppi, 197 mm by 148 mm), adjusts some space dimensions on the page and suppresses all headers. Also, for demonstration, the font size is extremely enlarged.
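The derived settings convert pixels into lilypond's millimeter dimensions via pixels × 25.4 / resolution; this can be checked independently:

```python
def pixels_to_mm(pixel_count: int, resolution_in_ppi: int) -> float:
    """Convert a pixel count into millimeters at a given resolution."""
    return pixel_count * 25.4 / resolution_in_ppi

# the iPad 1 page: 1024x768 pixels at 132 ppi
print(round(pixels_to_mm(1024, 132)), round(pixels_to_mm(768, 132)))  # → 197 148
```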

You can download both in an archive here.

STEP 2: Typesetting the Lilypond File

Now - assuming that lilypond is in the command search path of your system - we can let lilypond typeset the file into bitmap files and generate a MIDI file. The command is

    lilypond --png

This should produce four new files:, demo-page1.png, demo-page2.png and demo.midi. The png files are the key frames of our video; the midi file will later on be converted to some form suitable for a video (for example, an AAC file). The postscript file is the basis for generating the PNG files.

Let's have a look at the generated frames (figure 4). They are scaled down by a factor of 4 (clicking reveals the original picture...).

first page within generated video       second page within generated video
Fig. 4: Demo song PNG files

STEP 3: Generating the Audio File

The next step is to generate a WAV file with the audio of the demo song from the MIDI file generated in the previous step. We use the fluidsynth player for that and some soundfont file covering the complete set of General MIDI instruments:

    fluidsynth -ni FluidR3_GM.SF2 demo.midi -F demo.wav

The result is the WAV file demo.wav, which is compressed into an AAC file demo.aac with ffmpeg (the vbr flag selects variable bitrate with a high-quality encoding):

    ffmpeg -i demo.wav -vbr 5 demo.aac

STEP 4: Generating the Raw MP4 Video From Fragments

Now comes a complicated part in the notation video generation: We generate the page fragment videos manually via ffmpeg; later we discuss how to automate the approach.

It is simple to make a video from single pictures. For the generation we need the frame rate, the names of the PNG files and the display duration of each picture.

But how do we find out how long each picture must be shown? Fortunately this is simple arithmetic, because we have defined the tempo of the track in the lilypond file.

Assume, for example, that we have three pictures showing 2 measures, 1 measure and 3 measures, a tempo of 90 quarters per minute and a 3/4 time signature. Picture 1 then has to be shown for

2 measures × 3 quarters/measure × 1/90 min/quarter × 60 s/min = 4 s

By the same logic the durations for the other two pictures are 2 s and 6 s.

In our demo example the first picture shows eight measures, the second one shows two measures. With a tempo of 75 quarters per minute and a 4/4 time signature this leads to a time list of "(25.6 s, 6.4 s)".

Musicians often prefer notes to be visible a little before they have to be played, so we shorten each picture's display time by some amount. Let's assume 0.4 s is fine; this leads to a time list of "(25.2 s, 6.0 s)".
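All those numbers come from the same formula, so here it is as a small helper (a side calculation illustrating the arithmetic above):

```python
def page_duration(measure_count: int, beats_per_measure: int,
                  tempo_in_bpm: float) -> float:
    """Seconds a score page is shown: measures x beats x 60 / tempo."""
    return measure_count * beats_per_measure * 60.0 / tempo_in_bpm

# the 3/4 example at 90 bpm with pages of 2, 1 and 3 measures
print([page_duration(m, 3, 90) for m in (2, 1, 3)])  # → [4.0, 2.0, 6.0]

# the first demo page: eight measures in 4/4 at 75 bpm
print(page_duration(8, 4, 75))  # → 25.6
```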

Now let's generate the MP4 video fragments (each showing a single picture for the desired duration) with ffmpeg. We assume a frame rate of 25 fps and cut off each video after the calculated duration. The strange input framerate gives the single input picture a nominal duration of 100,000 seconds, which is more than enough material for cutting off the desired length:

    ffmpeg -framerate 1/100000 -i demo-page1.png -r 25 -t 25.2 demo-part1.mp4 
    ffmpeg -framerate 1/100000 -i demo-page2.png -r 25 -t  6.0 demo-part2.mp4 
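With more pages this quickly becomes tedious, so the command lines can just as well be generated, for example like this (a sketch using the file names and durations from above):

```python
durations_in_seconds = [25.2, 6.0]   # display duration per score page

commands = [f"ffmpeg -framerate 1/100000 -i demo-page{i}.png"
            f" -r 25 -t {duration} demo-part{i}.mp4"
            for i, duration in enumerate(durations_in_seconds, start=1)]
print(commands[0])
# → ffmpeg -framerate 1/100000 -i demo-page1.png -r 25 -t 25.2 demo-part1.mp4
```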

Checking the videos shows that they have the correct length and show just a single score picture all the time. Great.

Now we have to concatenate the fragments. Unfortunately ffmpeg has no easy command-line mechanism for that: we have to write the file names into a text file and feed that file to ffmpeg. So here is the file, called demo-concat.txt:

    FILE: demo-concat.txt
    file 'demo-part1.mp4'
    file 'demo-part2.mp4'
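This list, too, is trivial to generate from the fragment names:

```python
fragment_names = ["demo-part1.mp4", "demo-part2.mp4"]
concat_list = "".join(f"file '{name}'\n" for name in fragment_names)
print(concat_list, end="")
```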

The ffmpeg-concatenation is done via the following command:

    ffmpeg -f concat -i demo-concat.txt -c copy demo-noaudio.mp4

A quick check reveals that the video indeed shows the first page for 25 seconds and the second one for 6. But we still lack the audio track...

STEP 5: Combining Video with Audio

The integration of the audio file is simple:

    ffmpeg -i demo-noaudio.mp4 -i demo.aac demo.mp4

Hooray!! Looking at the video, it is exactly the result we intended to have!


This article has demonstrated how to make music notation videos for portable video players with only a little effort. The necessary tools are all open-source; nevertheless the important step to master is the textual music notation language lilypond.

The processing steps are easily automated via lilypondToBandVideoConverter with some configuration in a simple language. The resulting notation video shows high-quality score pages, confined only by the dimensions and resolution of the target device and by the quality of the MIDI sound rendering and postprocessing.

Where To Download It

If you are interested in the manual walk-through, you can download the demo archive as well as the BWV639 lilypond and configuration files and the lilypond include file.

The much better way is to go with the automated approach. The project is available as a PyPI project or at GitHub.


This text is an update of my original article on video generation from music notation, which dates back to 2006. The approach then was to use avisynth as the video frame generator; it has been replaced by ffmpeg in the current version, and the process has been highly automated since then.

Thanks to Rüdiger Murach for posing that challenge to me in 2006 in a coffee-break discussion at work, and to my partner Ulrike Gröttrup for her patience whenever I showed her yet another new boring notation video...
