New publication: Leveraging repetition for improved automatic lyric transcription in popular music

A bit outdated (a running theme of this blog), but you can now find my 2014 ICASSP paper on my publications page. In this paper, we (myself, Dan Ellis and Masataka Goto) took a novel approach to transcribing the lyrics of pop songs.

The main idea is that although transcribing lyrics is orders of magnitude harder than transcribing speech, one aspect yet to be exploited is that singing typically features repetition throughout a song (most notably in the choruses). We introduce one pre-processing technique and two post-processing techniques to leverage this information, and show that both post-processing techniques offer significant improvements over a baseline Hidden Markov Model.

Paper abstract:

Transcribing lyrics from musical audio is a challenging research problem which has not benefited from many advances made in the related field of automatic speech recognition, owing to the prevalent musical accompaniment and differences between the spoken and sung voice. However, one aspect of this problem which has yet to be exploited by researchers is that significant portions of the lyrics will be repeated throughout the song. In this paper we investigate how this information can be leveraged to form a consensus transcription with improved consistency and accuracy. Our results show that improvements can be gained using a variety of techniques, and that relative gains are largest under the most challenging and realistic experimental conditions.

Chords are important!

Very quick blog post: I spotted this a few weeks ago on Annie Mac’s show, but forgot to flag it up. Doesn’t it show so clearly how changing the chords of a song affects its mood? Also great to hear a remix that isn’t simply a speed-up to 120bpm with a kick drum slapped on top…

Finding gems in Acapella recordings

I’m doing some work on acapella music (I know this isn’t technically the correct way of writing this, but it’s certainly the most common, especially on the web) and thought I’d share some of the cool things I found whilst sifting through my data.

First up, here’s just a plain cool recording of Michael Jackson and Vincent Price goofing around when recording the Thriller intro:


Notice they decided to cut the end: “It’s great fun! Great take, cut…all right”. Probably would’ve killed the mood of the track I guess :D

Next, here’s another cool intro, this time from David Bowie. You’ll need to crank up your speakers, but whilst he’s waiting for the verse to start, you can hear him say “a little mouse fart? You went: eh!” and then start slapping his cheeks. You can keep listening until 0:38 to verify that it’s indeed the take that made it to the recording.


Gorillaz added some vocal count-ins to help them out when recording “Feel Good Inc.”: take a listen for “change, change, change, change” (x2) starting around 0:28. I guess they were hungover on that day?! Sounds like Damon Albarn to me. You can actually hear this on the original too, but it’s much clearer here:


To end, here’s a bit of Marvin Gaye. Nothing particularly novel I noticed here, except that with all the instrumentation stripped away you can really hear how great his voice is! Audio starts around 0:18:


PhD thesis: A Machine Learning Approach to Automatic Chord Extraction

To celebrate the 1-year anniversary of defending my thesis, I thought it was time to make it public! Writing and defending this weighty tome took the best part of 4 years, which were simultaneously some of the most enjoyable and stressful years of my life so far.

A full list of thanks and acknowledgements can be found in the text itself, but let me state that this would not have been possible without the support of my family, my advisor Tijl De Bie, the Bristol Centre for Complexity Sciences, and the Engineering and Physical Sciences Research Council. The salient details of the thesis are:

  • Author: Matt McVicar, University of Bristol
  • Title: A Machine Learning Approach to Automatic Chord Extraction
  • Supervisor: Dr Tijl De Bie, Intelligent Systems Laboratory, University of Bristol
  • Internal examiner: Dr Peter Flach, Intelligent Systems Laboratory, University of Bristol
  • External examiner: Dr Simon Dixon, Centre for Digital Music, Queen Mary University of London
  • Defence date: February 12th 2013

The main contributions of this work, and the associated publications, are:

  1. A thorough review of automatic chord extraction, in particular machine learning and expert systems [ 1 ]
  2. A new machine learning-based model for the automatic extraction of chords, chord inversions and musical keys [ 2 ]
  3. The application of this model to unseen data, in particular the use of large numbers of freely-available chord sequences from the internet [ 3, 4 ]

[ 1 ] M. McVicar, R. Santos-Rodríguez, Y. Ni and T. De Bie. Automatic Chord Estimation: A Review of the State of the Art. IEEE Transactions on Audio, Speech and Language Processing, Overview Article

[ 2 ] Y. Ni, M. McVicar, R. Santos-Rodríguez and T. De Bie. An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech and Language Processing

[ 3 ] M. McVicar, Y. Ni, R. Santos-Rodríguez and T. De Bie. Using Online Chord Databases to Enhance Chord Recognition. Journal of New Music Research, Special Issue on Music and Machine Learning

[ 4 ] M. McVicar, Y. Ni, R. Santos-Rodríguez and T. De Bie. Curriculum Learning on Large Chord Databases. In Proceedings of the 12th International Society for Music Information Retrieval (ISMIR), 2011

The full text (after corrections) and publications can be accessed at my Publications page.

I forgot about a publication!

I just realised I forgot to blog a paper! I’m second author, so I suppose it’s OK…promise you won’t be mad? A preprint of the paper, entitled “Understanding Effects of Subjectivity in measuring Chord Estimation Accuracy”, can be downloaded from my Publications page.

The main idea behind this paper is to investigate subjectivity in chord transcriptions and estimations. In the former, we were interested in finding out how consistently a set of musical experts annotated the chords to a given song. Crucially, if this agreement is less than 100%, then we cannot hope to ever design a ‘perfect’ chord estimation algorithm. Furthermore, if the maximum agreement between experts is, say, 95%, then any algorithm which scores higher than this must be modelling the nuances of a particular annotator or set of annotators, which we define as the annotators’ subjectivity. We see no scientific gain in doing this, and so 95% (in this case) upper-bounds automatic chord estimation performance.

To study this, we asked 5 experts (including myself!) to annotate a set of 20 songs. We then measured each annotator’s estimation against the consensus annotation, which we assume to be a less subjective truth and to converge to the ‘true’ annotations as the number of experts tends to infinity. The results, interestingly, show that the most skilled annotator (not me, unfortunately; I’m annotator A.3) was able to score 90% against the consensus, indicating an upper bound of 90% on automatic chord estimation algorithms.
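
To make the idea concrete, here is a minimal sketch of scoring annotators against a consensus. It assumes the annotations have already been sampled onto a common frame grid and that the consensus is a simple per-frame majority vote; both are assumptions for illustration and not necessarily what the paper does. The chord sequences are made-up toy data, and the function names are hypothetical.

```python
from collections import Counter


def consensus(annotations):
    """Per-frame majority vote over equal-length chord label sequences."""
    n_frames = len(annotations[0])
    if any(len(a) != n_frames for a in annotations):
        raise ValueError("all annotations must have the same number of frames")
    result = []
    for frame_labels in zip(*annotations):
        # most_common(1) picks the label with the highest count; on a tie,
        # the label encountered first (earliest annotator) is returned.
        result.append(Counter(frame_labels).most_common(1)[0][0])
    return result


def agreement(annotation, reference):
    """Fraction of frames on which two label sequences agree."""
    matches = sum(a == r for a, r in zip(annotation, reference))
    return matches / len(reference)


# Toy data (hypothetical): three annotators, six frames each.
annotators = {
    "A.1": ["C", "C", "G", "G", "Am", "F"],
    "A.2": ["C", "C", "G", "Em", "Am", "F"],
    "A.3": ["C", "Am", "G", "G", "Am", "Dm"],
}

reference = consensus(list(annotators.values()))
scores = {name: agreement(seq, reference) for name, seq in annotators.items()}

# Under the argument above, the best human score against the consensus
# acts as a ceiling on what an algorithm can meaningfully achieve.
upper_bound = max(scores.values())
```

With only three toy annotators the best one trivially matches the consensus on every frame; with more annotators and real songs the ceiling drops below 100%, which is the effect described above.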


Next, we moved on to study subjectivity in automatic methods. Here, we find that the best systems are already able to achieve accuracies close to that of trained humans. Also, we derive a Sequence Crowd Learning algorithm which is able to obtain an accurate consensus annotation from a set of examples. This type of post-processing/bootstrapping has been explored in speech recognition (see ROVER, Recognizer Output Voting Error Reduction) but also in recent work in beat tracking. I also have something in the works for this…stay tuned!

Paper abstract:

“To assess the performance of an automatic chord estimation system, reference annotations are indispensable. However, owing to the complexity of music and the sometimes ambiguous harmonic structure of polyphonic music, chord annotations are inherently subjective, and as a result any derived accuracy estimates will be subjective as well. In this paper we investigate the extent of the confounding effect of subjectivity in reference annotations. Our results show that this effect is important, and they affect different types of automatic chord estimation systems in different ways. Our results have implications for research on automatic chord estimation, but also on other fields that evaluate performance by comparing against human provided annotations that are confounded by subjectivity.”

Automatic Chord Estimation from Audio: A Review of the State of the Art

The final journal paper from my PhD thesis is now available! It’s called “Automatic Chord Estimation from Audio: A Review of the State of the Art”, is currently available on IEEE Xplore and will probably be published in the February issue.

Within, we discuss feature extraction, modelling techniques, training and datasets, evaluation strategies (including MIREX, the annual benchmarking evaluation in which our system outperformed all other systems for two years), and software packages for chord estimation.

I’m particularly pleased with two figures in this paper. The first shows the annual performance of algorithms in the MIREX evaluations, clearly showing a performance plateau and overfitting on the Beatles dataset, as well as the challenges and benefits of an unseen test set (the much-appreciated McGill SALAMI dataset) in 2012.


The second is a ‘visual literature review’, showing breakthroughs in various aspects of the automatic chord estimation research problem chronologically:


It feels particularly good to get this paper published, as it ties up my thesis work and is quite an in-depth study. It is an IEEE Overview Article; these are meant to have ‘solid technical depth and lasting value and should provide advanced readers with a thorough overview of various fields of interest’, and are published at most four times a year in the journal. It also gave us twice the usual page limit! A preprint pdf and bibtex link are available on my publications page. Abstract:

In this overview article, we review research on the task of Automatic Chord Estimation (ACE). The major contributions from the last 14 years of research are summarized, with detailed discussions of the following topics: feature extraction, modeling strategies, model training and datasets, and evaluation strategies. Results from the annual benchmarking evaluation Music Information Retrieval Evaluation eXchange (MIREX) are also discussed as well as developments in software implementations and the impact of ACE within MIR. We conclude with possible directions for future research.


“With or Without The City of Angels” – similarities in pop music

I was listening to Radio 1 last week and heard ‘City of Angels’ by Thirty Seconds to Mars. I thought it sounded familiar, but couldn’t put my finger on why. After some noodling on the piano on my desk (MIR job perks!) I realised it’s almost identical to ‘With or Without You’ by U2. Here are the videos side-by-side, in the awesome Turntubelist:

Click each track’s ‘play’ icon to load, then the large play button underneath each video to play. Mixing is controlled by the slider. These songs share the same key (D major) and the same chord sequence throughout (D, A, Bm, G), have almost identical tempi (105 vs 110 bpm), and the refrains even share the same melodies! (Compare the sections starting 3:24 in ‘City of Angels’ vs 3:04 in ‘With or Without You’.)

This got me thinking that it would be cool to run the two songs through the AutoMashUpper algorithm by my buddy and former colleague at AIST, Matthew Davies (and collaborators), which you can read about in his ISMIR paper from 2013.


The gist of the algorithm is that for a given input song, it searches for sections of similar pitch profiles throughout a target song (‘pitch-shifting’ to account for different keys, although this isn’t needed here), and then speeds them up/slows them down (again, not needed much in this instance) and layers them onto the original song, accounting for structure and changes in loudness etc. You can listen to the output of this mashup here:



It fits remarkably well! The algorithm lines up the coda sections so well (3:30 in the mix above) that it’s hard to make out Bono at all.
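
As a rough illustration of the key and tempo matching described above, the sketch below works out how much one track would need to be sped up and transposed to sit on top of another. This is only back-of-the-envelope arithmetic under my own assumptions, not the AutoMashUpper implementation; the function names and the pitch-class table are hypothetical.

```python
# Pitch classes as semitones above C (sharps only, for brevity).
PITCH_CLASSES = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}


def stretch_ratio(source_bpm, target_bpm):
    """Playback-rate factor that brings source_bpm up (or down) to target_bpm."""
    return target_bpm / source_bpm


def semitone_shift(source_key, target_key):
    """Smallest signed transposition, in semitones, from source_key to target_key."""
    diff = (PITCH_CLASSES[target_key] - PITCH_CLASSES[source_key]) % 12
    # Prefer shifting down by up to 5 semitones over shifting up by 7 or more.
    return diff - 12 if diff > 6 else diff


# The two songs discussed here: both in D major, at roughly 105 and 110 bpm.
ratio = stretch_ratio(105, 110)
shift = semitone_shift("D", "D")
```

For this pair the rate change is under 5% and no transposition is needed, which is why so little processing is required to make the two tracks line up.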

There have been various instances of artists being accused of ‘copying’ other songs, from Led Zeppelin and Spirit to Coldplay and Joe Satriani. I was never clear what the grounds for suing in these cases were, since finding one song similar to another doesn’t detract from my aural pleasure; it only enhances it. Can’t we (as music consumers) enjoy both equally, if not more, because of this relationship? Doesn’t it actually provide listeners with an enriched listening experience?



Japanese Tea Ceremony

Today at work I went to a traditional Japanese Tea Ceremony lesson. Something I’ve wanted to do for a while, so it’s great that I managed to do it on the AIST campus so early on in my stay. And for just 200 Yen!

The ceremony was held in the ‘Japanese Room’ in the welfare centre, tucked away between the hairdressers and a restaurant (obviously). After the obligatory shoe-removing we were shown to cushions to kneel on, which I have to say I find extremely uncomfortable. My Western knees aren’t up to it! More training required.

First off we had a display, where one main guest performed a kneeling bow (lots more of this to follow) at the entrance, then shuffled over next to us. Shuffling, it turns out, is preferred as it causes one to slow down and minimise disturbance. Her friend then joined her in the same fashion. After this, the tea-sensei came in and began the ceremony. This basically involves:

  1. Confectionery offering. Wagashi are offered to the main guest in a large bowl by the sensei’s assistant after a series of bows. Bow to the rest of your party and say “o-saki-ni” (“Excuse me while I go first”). Wooden chopsticks are used to pick up the sweets, which must then be cleaned by folding the corner of the kai-shi (paper napkin) over the chopsticks and wiping. After returning the bowl to the host, the confectionery is eaten in small bites with a ku-ro-mo-n-ji (small wooden pick).
  2. Washing the bowls. Hot water is taken from an iron pot with a bamboo ladle and poured into the (presumably already clean) bowls, known as cha-wa-n. A kai-shi is then used to wipe the cha-wa-n dry, first inside, then out.
  3. Preparing the tea. A wooden spatula (cha-sha-ku) is used to measure out 1.5 measures of powdered green tea which are added to each bowl. Hot water is added to the tea from an iron pot, and then mixed with a tea whisk (cha-se-n). This kind of looks like a shaving brush, but is made from a single piece of bamboo. The host whisks the tea counter-clockwise, slowly at first and then increasing in speed.
  4. Presenting the tea. After a series of bows and phrases I didn’t quite hear, the tea is offered to the first guest. The guest takes the tea across the mat in their right hand and then apologises to their neighbour for going first. The bowl is raised and placed in the left hand, before being rotated 180 degrees clockwise.
  5. Drinking the tea and admiring the cha-wa-n. The tea is drunk in sips, with both hands, with minimal noise except for a final ‘slurp’ to finish. Cleaning the bowl is then done with your fingers, which are then wiped on your kai-shi. After this, the bowl is placed in front of you to admire. You should first place it to your left side and kneel to the right to get a good view of the right side. The same procedure is then repeated for the left side of the bowl. Finally, lift the bowl to admire it from below, taking note of the personal seal of the maker. Bow to your host.

As you can see, it’s quite involved! No wonder that people spend their whole lifetimes studying cha-do (the way of tea). After the display, we had a feeble attempt at being served ourselves, and then making our own tea.

In our brief lesson I got an idea about the basics I listed above, but on further research it seems that there is much more to it than this, including proper etiquette for opening sliding doors, walking technique, how to enter and exit the room, who to bow to and when… Overall I really enjoyed learning about the tea ceremony, and think I’ll be picking up a set of basic equipment to study at home, or as souvenirs for friends/family. Don’t think I’ll ever look at builders’ tea the same way again!



Blog is alive!

The blog is alive and kicking! More updates to come soon, including some recently-available publications, and maybe even my PhD thesis. I’d quite like to do a post on my adventures in America at LabROSA with Dan Ellis too, and the work I’m currently conducting at AIST with Masataka Goto.

A lot has happened since my last post, but I thought I’d fire this up again with a very quick post highlighting an interview I did with Research Media Limited a while back. The (brief) transcript can be found here: