I’m doing some work on acapella music (I know ‘a cappella’ is technically the correct spelling, but ‘acapella’ is certainly the most common, especially on the web) and thought I’d share some of the cool things I found whilst sifting through my data.
Notice they decided to cut the end: “It’s great fun! Great take, cut…all right”. Probably would’ve killed the mood of the track I guess :D
Next, here’s another cool intro, this time from David Bowie. You’ll need to crank up your speakers, but whilst he’s waiting for the verse to start, you can hear him say “a little mouse fart? You went: eh!” and then start slapping his cheeks. You can keep listening until 0:38 to verify that it’s indeed the take that made it to the recording.
Gorillaz added some vocal count-ins to help them out when recording “Feel Good Inc.”: take a listen for “change, change, change, change” (x2) starting around 0:28. I guess they were hungover on that day?! Sounds like Damon Albarn to me. You can actually hear this on the original too, but it’s much clearer here:
To end, here’s a bit of Marvin Gaye. Nothing particularly novel I noticed here, except that with all the instrumentation stripped away you can really hear how great his voice is! Audio starts around 0:18:
To celebrate the one-year anniversary of defending my thesis, I thought it was time to make it public! Writing and defending this weighty tome took the best part of four years, which were simultaneously some of the most enjoyable and stressful years of my life so far.
Title: A Machine Learning Approach to Automatic Chord Extraction
Supervisor: Dr Tijl De Bie, Intelligent Systems Laboratory, University of Bristol
Internal examiner: Dr Peter Flach, Intelligent Systems Laboratory, University of Bristol
External examiner: Dr Simon Dixon, Centre for Digital Music, Queen Mary University of London
Defence date: February 12th 2013
The main contributions of this work, and the associated publications, are:
A thorough review of automatic chord extraction, in particular machine learning and expert systems approaches [1]
A new machine learning-based model for the automatic extraction of chords, chord inversions and musical keys [2]
The application of this model to unseen data, in particular the use of large numbers of freely-available chord sequences from the internet [3, 4]
[1] M. McVicar, R. Santos-Rodríguez, Y. Ni and T. De Bie. Automatic Chord Estimation: A Review of the State of the Art. IEEE Transactions on Audio, Speech and Language Processing, Overview Article
[2] Y. Ni, M. McVicar, R. Santos-Rodríguez and T. De Bie. An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech and Language Processing
[3] M. McVicar, Y. Ni, R. Santos-Rodríguez and T. De Bie. Using Online Chord Databases to Enhance Chord Recognition. Journal of New Music Research, Special Issue on Music and Machine Learning
[4] M. McVicar, Y. Ni, R. Santos-Rodríguez and T. De Bie. Curriculum Learning on Large Chord Databases. In Proceedings of the 12th International Society for Music Information Retrieval (ISMIR), 2011
The full text (after corrections) and publications can be accessed at my Publications page.
I just realised I forgot to blog a paper! I’m second author, so I suppose it’s OK…promise you won’t be mad? A preprint of the paper, entitled “Understanding Effects of Subjectivity in measuring Chord Estimation Accuracy”, can be downloaded from my Publications page.
The main idea behind this paper is to investigate subjectivity in chord transcriptions and estimations. For the former, we were interested in finding out how consistently a set of musical experts annotated the chords to a given song. Crucially, if this agreement is less than 100%, then we cannot hope to ever design a ‘perfect’ chord estimation algorithm. Furthermore, if the maximum agreement between experts is, say, 95%, then any algorithm which scores higher than this must be modelling the nuances of a particular annotator (or set of annotators), which we define as the annotators’ subjectivity. We see no scientific gain in doing this, and so 95% (in this case) upper-bounds automatic chord estimation performance.
To study this, we asked 5 experts (including myself!) to annotate a set of 20 songs. We then measured each annotator’s estimation against the consensus annotation, which we assume to be a less subjective truth, and to also converge to the ‘true’ annotations as the number of experts tends to infinity. The results, interestingly, show that the most skilled annotator (not me, unfortunately; I’m annotator A.3) was able to score 90% against the consensus, indicating an upper bound of 90% on automatic chord estimation algorithms.
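The consensus idea above is easy to sketch: take a majority vote per frame, then score each annotator against that vote. Here is a toy illustration (the annotator labels and chord sequences below are invented for this post, not data from the paper):

```python
from collections import Counter

# Toy frame-level chord labels from three hypothetical annotators
# (the real study used five experts and twenty songs; these labels
# are made up for illustration).
annotations = {
    "A.1": ["C", "C", "F", "G", "G", "C"],
    "A.2": ["C", "C", "F", "G", "Em", "C"],
    "A.3": ["C", "Am", "F", "G", "G", "C"],
}

# Majority-vote consensus, frame by frame.
frames = zip(*annotations.values())
consensus = [Counter(frame).most_common(1)[0][0] for frame in frames]

def accuracy(labels, reference):
    """Fraction of frames on which two annotations agree."""
    return sum(a == b for a, b in zip(labels, reference)) / len(reference)

for name, labels in annotations.items():
    print(name, f"{accuracy(labels, consensus):.0%}")
```

As more annotators are added, this consensus should drift towards the ‘true’ annotation, which is exactly why we score individuals against it rather than against each other.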
Next, we moved on to study subjectivity in automatic methods. Here, we find that the best systems are already able to achieve accuracies close to that of trained humans. Also, we derive a Sequence Crowd Learning algorithm which is able to obtain an accurate consensus annotation from a set of examples. This type of post-processing/bootstrapping has been explored in speech recognition (see ROVER, Recognizer Output Voting Error Reduction) but also in recent work in beat tracking. I also have something in the works for this…stay tuned!
“To assess the performance of an automatic chord estimation system, reference annotations are indispensable. However, owing to the complexity of music and the sometimes ambiguous harmonic structure of polyphonic music, chord annotations are inherently subjective, and as a result any derived accuracy estimates will be subjective as well. In this paper we investigate the extent of the confounding effect of subjectivity in reference annotations. Our results show that this effect is important, and they affect different types of automatic chord estimation systems in different ways. Our results have implications for research on automatic chord estimation, but also on other fields that evaluate performance by comparing against human provided annotations that are confounded by subjectivity.”
The final journal paper from my PhD thesis is now available! It’s called “Automatic Chord Estimation from Audio: A Review of the State of the Art”, is currently available on IEEE Xplore and will probably be published in the February issue.
Within, we discuss feature extraction, modelling techniques, training and datasets, evaluation strategies (including MIREX, the annual benchmarking evaluation in which our system outperformed all other systems for two years), and software packages for chord estimation.
I’m particularly pleased with two figures in this paper. The first shows the annual performance of algorithms in the MIREX evaluations, clearly showing a performance plateau and overfitting on the Beatles dataset, as well as the challenges and benefits of introducing an unseen test set (the much-appreciated McGill SALAMI dataset) in 2012.
The second is a ‘visual literature review’, showing breakthroughs in various aspects of the automatic chord estimation research problem chronologically:
It feels particularly good to get this paper published, as it ties up my thesis work and is a quite in-depth study. It qualifies as an IEEE Overview Article, which should have ‘solid technical depth and lasting value and should provide advanced readers with a thorough overview of various fields of interest’, and of which at most four are published in the journal each year. It also gave us twice the usual page limit! A preprint pdf and bibtex link are available on my publications page. Abstract:
“In this overview article, we review research on the task of Automatic Chord Estimation (ACE). The major contributions from the last 14 years of research are summarized, with detailed discussions of the following topics: feature extraction, modeling strategies, model training and datasets, and evaluation strategies. Results from the annual benchmarking evaluation Music Information Retrieval Evaluation eXchange (MIREX) are also discussed as well as developments in software implementations and the impact of ACE within MIR. We conclude with possible directions for future research.”
I was listening to Radio 1 last week and heard ‘City of Angels’ by Thirty Seconds to Mars. I thought it sounded familiar, but couldn’t put my finger on why. After some noodling on the piano on my desk (MIR job perks!) I realised it’s almost identical to ‘With or Without You’ by U2. Here are the videos side-by-side, in the awesome Turntubelist:
Click each track’s ‘play’ icon to load, then the large play button underneath each video to play. Mixing is controlled by the slider. These songs share the same key (D major) and the same chord sequence throughout (D, A, Bm, G), have almost identical tempi (105 vs 110 bpm), and the refrains even share the same melodies! (Compare the sections starting 3:24 in ‘City of Angels’ vs 3:04 in ‘With or Without You’.)
This got me thinking that it would be cool to run these songs through the AutoMashUpper algorithm by my buddy and former colleague at AIST, Matthew Davies (and collaborators), which you can read about in his ISMIR paper from 2013.
The gist of the algorithm is that for a given input song, it searches for sections of similar pitch profiles throughout a target song (‘pitch-shifting’ to account for different keys, although this isn’t needed here), and then speeds them up/slows them down (again, not needed much in this instance) and layers them onto the original song, accounting for structure and changes in loudness etc. You can listen to the output of this mashup here:
It fits remarkably well! The algorithm lines up the coda sections so well (3:30 in the mix above) that it’s hard to make out Bono at all.
This got me thinking more generally: there have been various instances of artists being accused of ‘copying’ other songs, from Led Zeppelin and Spirit to Coldplay and Joe Satriani. I was never clear what the grounds for suing in these cases were, since finding one song similar to another doesn’t detract from my aural pleasure; it only enhances it. Can’t we (as music consumers) enjoy both equally, if not more, because of this relationship? Doesn’t it actually provide listeners with an enriched listening experience?
Today at work I went to a traditional Japanese Tea Ceremony lesson. It’s something I’ve wanted to do for a while, so it’s great that I managed to do it on the AIST campus so early on in my stay. And for just 200 Yen!
The ceremony was held in the ‘Japanese Room’ in the welfare centre, tucked away between the hairdressers and a restaurant (obviously). After the obligatory shoe-removing we were shown to cushions to kneel on, which I have to say I find extremely uncomfortable. My Western knees aren’t up to it! More training required.
First off, we had a display, where one main guest performed a kneeling bow (lots more of this to follow) at the entrance, then shuffled over next to us. Shuffling, it turns out, is preferred as it causes one to slow down and minimise disturbance. Her friend then joined her in the same fashion. After this, the tea-sensei came in and began the ceremony. This basically involves:
Confectionery offering. Wagashi are offered to the main guest in a large bowl by the sensei’s assistant after a series of bows. Bow to the rest of your party and say “o-saki-ni” (“Excuse me while I go first”). Wooden chopsticks are used to pick up the sweets, which must then be cleaned by folding the corner of the kai-shi (paper napkin) over the chopsticks and wiping. After returning the bowl to the host, the confectionery is eaten in small bites with a ku-ro-mo-n-ji (small wooden pick).
Washing the bowls. Hot water is taken from an iron pot with a bamboo ladle and poured into the (presumably already clean) bowls, known as cha-wa-n. A kai-shi is then used to wipe the cha-wa-n dry, first inside, then out.
Preparing the tea. A wooden spatula (cha-sa-ku) is used to measure out 1.5 measures of powdered green tea which are added to each bowl. Hot water is added to the tea from an iron pot, and then mixed with a tea whisk (cha-se-n). This kind of looks like a shaving brush, but is made from a single piece of bamboo. The host whisks the tea counter-clockwise, slowly at first and then increasing in speed.
Presenting the tea. After a series of bows and phrases I didn’t quite hear, the tea is offered to the first guest. The guest takes the tea across the mat in their right hand and then apologises to their neighbour for going first. The bowl is raised and placed in the left hand, before being rotated 180 degrees clockwise.
Drinking the tea and admiring the cha-wa-n. The tea is drunk in sips, with both hands, with minimal noise except for a final ‘slurp’ to finish. Cleaning the bowl is then done with your fingers, which are then wiped on your kai-shi. After this, the bowl is placed in front of you to admire. You should first place it to your left side and kneel to the right to get a good view of the right side. The same procedure is then repeated for the left side of the bowl. Finally, lift the bowl to admire it from below, taking note of the personal seal of the maker. Bow to your host.
As you can see, it’s quite involved! No wonder people spend their whole lifetimes studying sa-do (the way of tea). After the display, we had a feeble attempt at being served ourselves, and then making our own tea.
In our brief lesson I got an idea of the basics I listed above, but on further research it seems that there is much more to it than this, including proper etiquette for opening the sliding doors, walking technique, how to enter and exit the room, who to bow to and when. Overall I really enjoyed learning about the tea ceremony, and think I’ll be picking up a set of basic equipment to study at home, or as souvenirs for friends/family. Don’t think I’ll ever look at builders’ tea the same way again!
The blog is alive and kicking! More updates to come soon, including some recently-available publications, and maybe even my PhD thesis. I’d quite like to do a post on my adventures in America at LabROSA with Dan Ellis too, and the work I’m currently conducting at AIST with Masataka Goto.
A lot has happened since my last post, but I thought I’d fire this up again with a very quick post highlighting an interview I did with Research Media Limited a while back. The (brief) transcript can be found here:
This weekend I went to my first meeting of the North East Music Information Special Interest Group, held at The Echo Nest in Cambridge, MA. NEMISIG is an informal get-together for everyone in the northeastern region to see what’s happening in the labs, the research that’s being conducted, and who’s new or moved on to greener pastures.
Saw some really great stuff, including talks by Jessica Thompson on neurosemantic encoding of musical (and more general sound/audio) events, and a neat demo of robot musicians by Eric Schmid:
I also gave a quick talk about lyrics-to-audio alignment. Our basic idea is to take a capella audio (without instruments), align automatically-scraped web lyrics to it using state-of-the-art speech recogniser software, and use a few neat tricks to boost the accuracy and pair the alignment with the full polyphonic recordings. Work in progress, but I had fun putting this little demo together:
Here you can see an automatic alignment of the lyrics to the audio. In the top pane is the word-level alignment of the lyrics, whilst in the bottom pane is the phoneme (sub-word unit) level alignment. It works pretty well! I’m hoping to scale this work up to a dataset of around 1,000 songs and write it up for a conference some place.
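The forced alignment itself is done inside the speech recogniser, but the underlying idea, warping one sequence onto another so that matching units line up in time, can be illustrated with classic dynamic time warping. Here’s a toy sketch (the data and function names are mine, not from our system):

```python
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path through a cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the bottom-right corner.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: 4 "phoneme" values aligned against 8 audio "frames"
# (each phoneme held for two frames in the audio).
phones = np.array([0.0, 1.0, 2.0, 3.0])
frames = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
cost = np.abs(phones[:, None] - frames[None, :])
path = dtw_path(cost)
print(path)
```

Each path entry (phoneme, frame) tells you which frames a phoneme occupies; word-level times then come from grouping phonemes back into words.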
Just a quick post – tonight a documentary about one of my research areas, music and emotion, was shown on BBC1. The film looks at why music evokes emotion, describes some of the most characteristic emotions humans experience when listening to music, and which musical techniques are used to draw out these feelings.
The film also features footage from CMMR 2012 (Computer Music Modelling and Retrieval) where I presented some of my work, which aims to discover which emotions are most easily described by combinations of audio, lyrics, and social tags. Find out more on my publications page.
The documentary itself can be seen on iPlayer here for the next seven days. Look out for a familiar face around 32 minutes in! (people outside the UK, you can view this video if you have access to a VPN….)
So before Sandy landed and caused chaos in the city, I went to a monthly music hackathon. Working on the weekend after a few too many cocktails in Chelsea might not sound like great fun, but I can tell you it’s a really great event.
The concept of a hackathon is to ‘hack together’ some code or an application in a few hours. It’s a cool way to explore some ideas you might have had on the back burner but never had a chance to code up, or some stuff that doesn’t really fit into your thesis/grant (here’s the obligatory wikipedia link). Hackathons are regularly hosted by Facebook, Spotify, The Echo Nest and Google for getting quick ideas tested fast – and sometimes failing fast too!
Our plan at LabROSA was to port a load of Dan Ellis’ Matlab scripts over to Python. For non-computer scientists, this basically means that it will all be available without an expensive Matlab license and can be used to foster more research.
So, at some point (stay tuned…) all of the above and much more will be free to researchers! However, to market this work, it was decided we needed something shiny to show off, especially for the end of day presentations!
I decided to see if I could use some of these scripts to automatically generate ‘gear shifts’ in pop music. A gear shift is basically a really cheesy key shift in a pop song where the chorus is repeated a semitone/tone up to add interest to the tune. It’s a great way of adding an extra minute to a song, and literally ‘lifts you up’ just as the song is becoming dull. They’re a staple for just about any X Factor Christmas or Westlife track, but the best example I could find is Whitney Houston’s ‘I Will Always Love You’ (skip to 3 minutes).
Boom! What a great floor tom hit. So, my plan was to automatically ‘gear-shift’ any song. Then any song can be made 20% more awesome! Turns out it’s quite tricky, but you can do a pretty good job using some of Dan’s code. I first extracted beat-synchronous chroma features (read as: description of pitch evolution at the beat level) and used these to automatically find the chorus. Below is a self-similarity matrix for each beat, so pixel (i,j) represents the (cosine) similarity between beat i and beat j.
Dark colours are high similarity, and I smoothed the matrix in the top pane to highlight long-term similarity and allow some local dissimilarity. Then I looked for strong diagonal stripes, which in theory represent large repeated sections (such as a chorus). Finding these is really the tough part, but in red I’ve highlighted the best candidate for this song (I biased it to prefer beats near the end of the song).
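The self-similarity-plus-diagonal-stripes idea is easy to sketch without any audio at all. Below is a minimal numpy mock-up on synthetic ‘beat features’ (a stand-in for the real beat-synchronous chroma; the song structure and all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for beat-synchronous chroma: one 12-d vector per
# beat. We fake a song whose 32-beat "chorus" (beats 32-63) repeats
# again at the end (beats 96-127).
verse_a = rng.standard_normal((32, 12))
chorus = rng.standard_normal((32, 12))
verse_b = rng.standard_normal((32, 12))
chroma = np.vstack([verse_a, chorus, verse_b, chorus])  # shape (128, 12)

# Cosine self-similarity: S[i, j] compares beat i with beat j.
unit = chroma / np.linalg.norm(chroma, axis=1, keepdims=True)
S = unit @ unit.T

def best_repeat_lag(S, min_lag=8):
    """Find the diagonal stripe with the highest mean similarity.

    A strong stripe at offset `lag` means a long section repeats
    `lag` beats later -- a good chorus candidate.
    """
    n = len(S)
    scores = [S.diagonal(lag).mean() for lag in range(min_lag, n - min_lag)]
    return min_lag + int(np.argmax(scores))

print(best_repeat_lag(S))  # the fake chorus repeats 64 beats later
```

Real songs are messier than this, which is why smoothing the matrix (and biasing towards late-song repeats) helps so much.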
After this it’s pretty simple to grab this section of the audio, fade out before, phase vocode the detected chorus up a semitone, add some compression for drama and, voilà!
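The hack used Dan’s phase vocoder, which shifts pitch without changing duration. As a back-of-the-envelope sketch of just the pitch maths (names and numbers mine): a semitone up multiplies every frequency by 2^(1/12), which even naive resampling demonstrates, at the cost of also shortening the audio, which is exactly what a phase vocoder avoids:

```python
import numpy as np

# In 12-tone equal temperament, one semitone up multiplies every
# frequency by 2**(1/12) ~= 1.0595.
SEMITONE = 2 ** (1 / 12)

def naive_pitch_shift(y, ratio):
    """Raise pitch by `ratio` via plain resampling.

    Unlike a phase vocoder, this also shortens the signal (plays it
    faster); a real gear shift time-stretches first so the duration
    is preserved.
    """
    n_out = int(len(y) / ratio)
    return np.interp(np.arange(n_out) * ratio, np.arange(len(y)), y)

def peak_hz(y, sr):
    """Frequency of the largest-magnitude bin in the spectrum."""
    return int(np.argmax(np.abs(np.fft.rfft(y)))) * sr / len(y)

sr = 8000
t = np.arange(sr) / sr                      # one second of audio
a440 = np.sin(2 * np.pi * 440 * t)          # concert A
shifted = naive_pitch_shift(a440, SEMITONE)

print(round(peak_hz(a440, sr)), "->", round(peak_hz(shifted, sr), 1))
```

A 440 Hz tone comes out at roughly 466 Hz (B-flat), one semitone up, just like Whitney’s chorus.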
Pretty neat huh?! Sure, it’s not perfect: the vocals get a little chipmunky as they’re already in quite a high register, but that’s the beauty of a hack!
Stay tuned for the release of some cool Python code for phase vocoding, structural segmentation (finding choruses) and more in the near future.