A bit outdated (a running theme of this blog), but you can now find my 2014 ICASSP paper on my publications page. In this paper, we (Dan Ellis, Masataka Goto, and I) took a novel approach to transcribing the lyrics of pop songs.
The main idea is that although transcribing lyrics is orders of magnitude harder than transcribing speech, one aspect yet to be exploited is that singing typically features repetition throughout a song (most notably in the choruses). We introduce one pre-processing technique and two post-processing techniques to leverage this repetition, and show that both post-processing techniques offer significant improvements over a baseline hidden Markov model.
“Transcribing lyrics from musical audio is a challenging research problem which has not benefited from many advances made in the related field of automatic speech recognition, owing to the prevalent musical accompaniment and differences between the spoken and sung voice. However, one aspect of this problem which has yet to be exploited by researchers is that significant portions of the lyrics will be repeated throughout the song. In this paper we investigate how this information can be leveraged to form a consensus transcription with improved consistency and accuracy. Our results show that improvements can be gained using a variety of techniques, and that relative gains are largest under the most challenging and realistic experimental conditions.”
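To give a feel for the consensus idea, here is a deliberately simplified sketch (not the method from the paper): given several noisy transcriptions of repeated sections such as choruses, take a majority vote on the word at each position. The example lyric and the assumption that the hypotheses are already position-aligned are both mine; in practice the hypotheses would need to be properly aligned first.

```python
from collections import Counter
from itertools import zip_longest

def consensus(transcripts):
    """Majority-vote consensus over transcriptions of repeated sections.

    Toy illustration only: it assumes the transcripts are already
    position-aligned, which real recognizer output is not, and it
    ignores the alignment and scoring a real system would need.
    """
    words = []
    # Walk the hypotheses column by column, padding short ones with None.
    for column in zip_longest(*(t.split() for t in transcripts)):
        votes = Counter(w for w in column if w is not None)
        words.append(votes.most_common(1)[0][0])  # keep the most frequent word
    return " ".join(words)

# Three noisy hypotheses of the same repeated line (made-up example):
hyps = [
    "hold me close tonight",
    "hold me clothes tonight",
    "old me close tonight",
]
print(consensus(hyps))  # each position's errors are outvoted
```

Even this crude vote recovers the correct line here, because each individual error occurs in only one of the three hypotheses; the paper's post-processing techniques exploit the same redundancy, but with proper alignment rather than a fixed word-position assumption.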