Some time ago I read a tweet from somebody who was comparing lyrics from Justin Bieber and Queen. I failed to find the original tweet but, paraphrasing, it was something along the lines of:
Spot the difference, Queen: “Misguided old mule with your pig headed rules/With your narrow minded cronies/Who are fools of the first division/” Justin Bieber: “Fa la la la la la la, la la la la la la la…”
The scientist in me then thought it would be interesting to assign some numbers to this: how do you quantify the difference? At the same time, in the spirit of the Pragmatic Programmer, I had been meaning to learn a new programming language. I wanted to learn something that was as different as possible from what I knew already. With a nudge from Paul Graham I quickly settled on a Lisp: Clojure. Having seen a number of his talks, I have great respect for Rich Hickey and was curious to experience what he came up with.
Disclaimer: the main goal behind this post was to learn Clojure; the application to lyrics is secondary and just a toy example. The scoring mechanism and workflow I use here fail for obvious reasons in many cases, so don’t take it all too seriously or complain that it’s broken. Issue pull requests instead 🙂
So the idea was: starting from scratch, can I put together a small Clojure program to rank the lyrical quality (in the lexical, non-semantic sense) of different music artists? A problem easy enough to code up, yet interesting enough for me to get a first feel for the language.
The workflow I decided on was as follows:
Given an artist, look up the most popular albums. Then, for each album, collect the lyrics for each track and run them through a scoring function. Finally, average over all tracks.
As a source of music information I used the Last.fm API, which they kindly allow free access to. Obtaining the lyrics for a given song turned out to be trickier. I had assumed some kind of free, easy lyrics API existed, but it turns out that’s not the case. At least, I did not find one. Instead I decided to scrape content off lyrics.wikia.com, as that was quick and easy to do. Since this is just a simple toy application I figured nobody would complain.
The core of the workflow is of course the scoring function. You could fill a dissertation with the merits of different approaches, and much research has been done on measuring the readability, complexity, and quality of prose. I did not delve too deeply into this; I just wanted something that provided a rough measure of duplication and readability. Again, the purpose of this was to learn Clojure, not to start flame wars between Rihanna and Metallica fans.
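To make the workflow concrete, here is a minimal sketch of it. It is written in Python purely for illustration and is not the Clojure code from the repository. The four callables (top_albums, album_tracks, fetch_lyrics, score) are hypothetical placeholders for the Last.fm lookups, the lyrics scraper and the scoring function; skipping tracks whose lyrics cannot be found is also my assumption.

```python
from statistics import mean


def artist_score(artist, top_albums, album_tracks, fetch_lyrics, score, n_albums=3):
    """Average a lyrics score over every track on an artist's top albums.

    All four callables are hypothetical stand-ins:
      top_albums(artist)           -> album names, most popular first
      album_tracks(artist, album)  -> track names on that album
      fetch_lyrics(artist, track)  -> lyrics text, or None if not found
      score(lyrics)                -> a number
    """
    scores = []
    for album in top_albums(artist)[:n_albums]:
        for track in album_tracks(artist, album):
            lyrics = fetch_lyrics(artist, track)
            if lyrics:  # assumption: tracks without lyrics are simply skipped
                scores.append(score(lyrics))
    # No usable tracks means there is nothing to average.
    return mean(scores) if scores else None
```

Passing the lookups in as functions keeps the averaging logic separate from the network calls, which makes it easy to try out with canned data before pointing it at a real API.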
Of the many readability metrics, the Flesch-Kincaid readability test seemed to be the most popular, so that is what I used (normalized to get it roughly in the same range as the other two terms). Using the Flesch test led to an interesting detour into syllabification, a problem for which there is no watertight algorithm. Instead I settled on using the Carnegie Mellon Pronouncing Dictionary and a simple approximation for words not in the corpus. Adding the Flesch score together with the other terms gives you a number roughly in the range [0, 3]: the higher the score, the less inventive, shall we say, the lyrics.
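As a rough illustration of the readability term, the sketch below computes the standard Flesch reading-ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). It is in Python rather than Clojure and is not the code from the repository. Several things here are assumptions on my part: each non-empty line of lyrics is treated as a "sentence", the vowel-group fallback is just one possible approximation for unknown words, the three-entry dictionary is a hand-typed stand-in for the real CMU file, and the normalization step is left out. In CMU-style pronunciations every vowel phoneme ends in a stress digit, so counting those gives the syllable count.

```python
import re

# Hand-typed stand-in for the CMU Pronouncing Dictionary (illustration only).
# Vowel phonemes carry a trailing stress digit (0, 1 or 2).
SAMPLE_PRONUNCIATIONS = {
    "queen": "K W IY1 N",
    "hello": "HH AH0 L OW1",
    "rules": "R UW1 L Z",
}


def syllables(word):
    """Count syllables via the dictionary, else approximate by vowel groups."""
    word = word.lower()
    phones = SAMPLE_PRONUNCIATIONS.get(word)
    if phones is not None:
        return sum(1 for p in phones.split() if p[-1].isdigit())
    # Fallback (an assumption, not the author's rule): one syllable per
    # run of consecutive vowels, with a minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_reading_ease(text):
    """Flesch reading ease, treating each non-empty line as a sentence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not lines or not words:
        return None
    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(lines))
            - 84.6 * (total_syllables / len(words)))
```

The word "rules" shows why the dictionary lookup is worth having: the vowel-group fallback would count two syllables because of the silent "e", while the pronunciation entry gives one.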
With the code done we can then see if our mystery tweeter was right with his comparison. Turns out he was. Running through a number of artists and looking at their top 3 albums (as measured by Last.fm) actually shows quite a logical trend:
Unfortunately for Mr. Bieber though, Rihanna takes the crown at 2.12!
From there it was easy to extend the code to calculate the bieberscore for a given Last.fm user. You just start from the user’s top artists. Turns out my current score is 0.85, which probably has a lot to do with the French music I have been listening to.
The code (all 200 lines of it) can be found on the GitHub repository and I encourage everybody to play with it and improve it further. There are many things you could add into the mix: rhyme density, SMOG/FOG metrics, words per second, multi-threaded computation, historical evolution, impact, etc. Maybe I will do so for a follow-up post.
Any tips on how to improve the code or make it more idiomatic are also more than welcome. Which brings me to the next point.
I thought the Lisp syntax would take a while to get used to; however, it’s actually very easy and quickly feels natural and succinct. Trickier was knowing which functions to use and how, but that comes with time and practice. What I haven’t yet cracked is a proper debugging method. The standard stacktraces are very unhelpful and I often wished I could just set a breakpoint at some nested function. There are lots of threads and suggestions about this on the web, but it would be nice if some of this made it into the official docs/distribution. The same goes for the standard REPL, which is pretty much useless. Luckily Leiningen works the way you would expect, though I haven’t been able to find a smooth way of using it with Vim (I followed this).
Another thing I noticed is that this functional style seems to encourage terse code and one-liners. This put me off initially when browsing through other Clojure code. However, once you get used to the language you start to understand how it fits together and the terse code makes sense. That being said, there are still a lot of idioms, functions, and syntax I need to wrap my head around.
On the whole though, I found the experience very positive and enjoyed using the language. It’s just a shame Java filters through sometimes, breaking the bubble as it were. But I understand the compromise that the designers needed to make. I look forward to learning more and would encourage everybody to do the same.