This blog follows Justin's adventure to becoming a data scientist.

His journey starts at Metis, a bootcamp, and will go [through good coffee and tea] to wherever it takes him.

NLP in Music: Time After Time

At the end of Week 8, we finished Project Fletcher! The assignment was to use natural language processing (NLP) techniques. Since NLP works on language, the data source had to contain lots of text.

Check out the GitHub repo for this project!

 Detective Fletcher!


The Question

I really wanted to do something interesting that people usually don't think about. With our level of knowledge and a two-week timeframe, it was difficult to frame an original question. Being a musician, I decided I wanted to do a study on lyrics. The songs my father listened to sound very different from the popular songs of today, but can we tell the difference based solely on lyrics, without the actual music? What if I could tell what year a song was written based solely on its words? Maybe a song was "ahead of its time," or maybe modern songs have that "old school" feeling.

Honestly, I thought that it wouldn't be possible to separate songs based off lyrics, but that would be a slightly interesting finding as well!

The Data

Unfortunately, lyrics are protected under copyright and a database of lyrics is not widely available for download. That means I had to scrape lyric sites which, again unfortunately, are crowd-sourced. More on that later.

Since I could not just get every single song in every year, I first scraped the artists and titles of the top 100 songs per year from 1950 to 2015 from Billboard's archived top 100 list. After cleaning those titles and artist names up a bit, I fed them into an automated BeautifulSoup scrape that first went to and, if that URL did not exist, to look on
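The "try one site, then fall back to another" step can be sketched roughly like this. It is only an illustration: the URL templates, the `slugify` rule, and the function names are all made up for this example, and `fetch` is a placeholder for whatever actually downloads a page (in the real scrape, an HTTP request followed by BeautifulSoup parsing).

```python
import re
from typing import Callable, List, Optional

# Hypothetical URL patterns; the real lyric sites and their URL schemes differ.
TEMPLATES = [
    "https://site-a.example/{artist}/{title}.html",
    "https://site-b.example/{artist}-{title}",
]


def slugify(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def candidate_urls(artist: str, title: str, templates: List[str]) -> List[str]:
    """Build one candidate URL per site, in order of preference."""
    return [
        t.format(artist=slugify(artist), title=slugify(title)) for t in templates
    ]


def fetch_first(urls: List[str], fetch: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the first page that exists, or None if every site misses.

    `fetch` is assumed to return the page text, or None when the URL
    does not exist.
    """
    for url in urls:
        page = fetch(url)
        if page is not None:
            return page
    return None
```

Passing the downloader in as `fetch` keeps the fallback logic easy to try out with a plain dictionary of fake pages before pointing it at real sites.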

Only taking the top 100 per year without splitting into genres, I was getting more and more skeptical about whether or not the language would be different enough between the years to separate.

The Model

After training a quick model, it became clear that trying to separate song lyrics by year would not be very reliable. Most of my models were getting an accuracy score of only around 0.03 to 0.04. I was expecting this, but then I started wondering if I could determine the DECADE in which a set of lyrics would have been popular.

Chaining transformers in a pipeline that feeds into a model, I started looking for features to engineer that could serve as clues to the age of a song. Here is an excerpt of the code for the pipeline I used:

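    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    import xgboost as xgb

    # LengthTransformer, LoveCount, CapTransformer, ApostropheCount and NumCount
    # are the custom transformers (not shown in this excerpt).
    xgboostTarg = Pipeline([
        ('feats', FeatureUnion([  # stack every feature block side by side
            ('TFIDF', TfidfVectorizer()),         # TF-IDF weighted bag of words
            ('CountVector', CountVectorizer()),   # raw word counts
            ('Length', LengthTransformer()),      # length of the song
            ('LoveCount', LoveCount()),           # occurrences of 'love'
            ('Caps', CapTransformer()),           # number of capital letters
            ('Apostrophes', ApostropheCount()),   # number of apostrophes
            ('NumCount', NumCount())              # number of numbers
        ])),
        ('xgboost', xgb.XGBClassifier(max_depth=5, n_estimators=1000, learning_rate=0.05))  # classifier
    ])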

I used some bag-of-words vectorizers and some custom transformers (length of the song, count of how many times 'love' was in the lyrics, number of capital letters, number of apostrophes, and number of numbers) "feature unioned" into an XGBoost model. The idea behind each of these custom transformers was that more recent songs would probably have fewer lyrics, try to find synonyms for "love," use more acronyms, use more slang, and do more counting, respectively. Of the different models I plugged into this pipeline, XGBoost gave the highest accuracy score.
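To give a feel for what those custom features compute, here is a minimal, dependency-free sketch. It is not the code from the repo: the class and function names are invented for this example, and the exact counting rules (length measured in words, 'love' matched as a whole word, a "number" being a run of digits) are assumptions. In a scikit-learn project, a class like this would typically also subclass `BaseEstimator` and `TransformerMixin`.

```python
import re


class CountFeature:
    """Turns each lyric string into one numeric column via `count_fn`."""

    def __init__(self, count_fn):
        self.count_fn = count_fn

    def fit(self, X, y=None):
        # Nothing to learn; the counts depend only on the text itself.
        return self

    def transform(self, X):
        # One row per song, one column per transformer.
        return [[self.count_fn(text)] for text in X]


def lyric_length(text):
    """Number of whitespace-separated words (assumed definition of length)."""
    return len(text.split())


def love_count(text):
    """Occurrences of 'love' as a whole word, ignoring case."""
    return len(re.findall(r"\blove\b", text.lower()))


def capital_count(text):
    """Number of uppercase characters."""
    return sum(ch.isupper() for ch in text)


def apostrophe_count(text):
    """Number of apostrophes, a rough proxy for contractions and slang."""
    return text.count("'")


def number_count(text):
    """Number of digit runs, so '1999' counts once."""
    return len(re.findall(r"\d+", text))
```

For example, `CountFeature(love_count).fit_transform`-style usage boils down to `CountFeature(love_count).fit(songs).transform(songs)`, which returns a single-column list of counts that can sit next to the bag-of-words features.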

Because this is a multiclass problem, accuracy will naturally be lower than it would be for a binary classifier. The 1950's to 2010's encompass 7 decades, so pure random guessing yields an accuracy of around 14%. On the test set, my model has an accuracy of 42%! I'd say that's not too shabby of an increase. It's a good thing that this model won't be used to make any important decisions anytime soon, though...
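For reference, the chance baselines can be worked out in a few lines. This is a back-of-the-envelope sketch that assumes all 100 charted songs per year ended up in the dataset, which the scrape may not have managed; the variable names are just for illustration.

```python
from collections import Counter

years = range(1950, 2016)   # 1950 through 2015 inclusive
songs_per_year = 100        # assumption: no song was lost during scraping

# Group songs into decades: 1950, 1960, ..., 2010 (the last one is only 6 years).
decade_sizes = Counter()
for year in years:
    decade_sizes[year // 10 * 10] += songs_per_year

total_songs = sum(decade_sizes.values())

# Guessing a decade uniformly at random.
uniform_decade_baseline = 1 / len(decade_sizes)

# Always guessing the largest decade.
majority_decade_baseline = max(decade_sizes.values()) / total_songs

# Guessing the exact year uniformly at random.
uniform_year_baseline = 1 / len(years)

print(f"decades: {len(decade_sizes)}")
print(f"uniform decade guess:  {uniform_decade_baseline:.1%}")
print(f"majority decade guess: {majority_decade_baseline:.1%}")
print(f"uniform year guess:    {uniform_year_baseline:.1%}")
```

Under those assumptions, a random decade guess lands at about 14.3%, always guessing a full decade at about 15.2%, and a random year guess at about 1.5%, so both the 42% decade accuracy and the 0.03 to 0.04 year accuracy sit above chance.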

The Visualization

After finishing the model that predicts the decade in which a given set of lyrics might have been popular, there were a couple of other questions I wanted to answer. Two of the main ones were:

  1. What words were used most often in each decade and
  2. What topics were the most popular songs of the decade about?

Using LSA (latent semantic analysis) on all the lyrics in each decade, I compiled the top 5 components (or topics) related to the words. I also ran bigram/trigram counts over the decades to find the top 5 phrases in each decade as well.
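The phrase-counting half of this can be sketched with nothing but the standard library. This is an illustration rather than the code from the repo: the function names, the tokenizing regex, and the toy lyrics are all made up for the example, and the LSA half is not shown.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Lowercase the text and return its n-word phrases, in order."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_phrases(songs, n=2, k=5):
    """The k most frequent n-word phrases across a list of lyric strings."""
    counts = Counter()
    for song in songs:
        counts.update(ngrams(song, n))
    return counts.most_common(k)


def shared_top_phrases(lyrics_by_decade, n=2, k=5):
    """Phrases that make the top k in every single decade."""
    per_decade = [
        {phrase for phrase, _ in top_phrases(songs, n, k)}
        for songs in lyrics_by_decade.values()
    ]
    return set.intersection(*per_decade)


# Toy data, invented for this example; not real scraped lyrics.
toy_lyrics = {
    1950: ["I can't stop loving you", "I can't help it"],
    1980: ["I can't wait", "time after time I can't"],
}
```

Calling `top_phrases(toy_lyrics[1950], n=2, k=1)` gives the single most common bigram for that decade, and `shared_top_phrases(toy_lyrics)` keeps only the phrases that every decade's top list has in common.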

One topic was number one across every single decade from the 1950's to 2015: love. Good sanity check.
The only phrase found in all the decades was: "I can't"
Conclusion: Musicians can't love.

Taking these top 5 topics and phrases, I threw them into a D3 visualization where the size of each circle is proportional to how strongly that topic or phrase is present within the decade. D3 code for this visualization is available in the GitHub repo for this project!

The Largest Struggle

After completing the model, I wanted to serve it in a Flask app so that anyone could input whatever lyrics they wanted and get an estimate of the decade in which those lyrics may have been popular. I spent a considerable amount of time trying to get a pickled model to work in Flask, but kept running into issues with the custom transformers. Unfortunately, time was running out and I had to abandon that venture to prepare for my presentation.
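One common cause of this kind of trouble, which may or may not be what bit me here, is that pickle does not store a class's code. It only stores the module and class name, and expects to import that class again at load time. The sketch below uses a stand-in `LoveCount` class (its body is hypothetical) to show what actually ends up in the pickle.

```python
import pickle


class LoveCount:
    """Stand-in for a custom transformer; the body is only illustrative."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[text.lower().count("love")] for text in X]


payload = pickle.dumps(LoveCount())

# The pickle records WHERE to find the class (module name + class name)...
module_name = LoveCount.__module__
stores_class_name = b"LoveCount" in payload
stores_module_name = module_name.encode() in payload

# ...but none of the method code, such as the call to str.lower().
stores_method_body = b"lower" in payload

# Loading works here only because LoveCount is importable under the same name.
restored = pickle.loads(payload)
```

If the model is pickled from a script or notebook, `module_name` is `__main__`, and a Flask app with a different `__main__` will not find the class when it unpickles. Keeping the custom transformers in their own importable module, and importing that module in both the training code and the app, is the usual way around it.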

The Future

As with our other projects, the biggest limiting factor for this one was the time available. Two weeks was not much time to get everything up and running. Given more time, I would have liked to:

  • Get the Flask app working!
  • Clean the lyrics more and account for more of the crowdsourcing variability
  • Try to predict the genre based on the words
  • Generate line-by-line lyrics (using Markov chains) in the style of the user's choice
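As a taste of that last idea, here is a minimal word-level Markov chain sketch. It is not something built for this project: the function names and the `<s>`/`</s>` start and end markers are choices made for this example, and "style" would simply mean training the chain on lyrics from the chosen decade.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"


def build_chain(lines):
    """Map each word to the list of words seen right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = [START] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate_line(chain, rng, max_words=12):
    """Walk the chain from START until END or the word limit is reached."""
    word, output = START, []
    while len(output) < max_words:
        # Repeated followers appear multiple times, so choice() is
        # automatically weighted by how often each transition occurred.
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)
```

Something like `generate_line(build_chain(lines_from_one_decade), random.Random())` would then produce one new line at a time, where every pair of neighboring words has appeared together somewhere in the training lyrics.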

Overall, I think my findings were pretty interesting and I definitely had fun wondering what kind of information could be wrung out of this data. I can't believe this was our penultimate project! One last one here at Metis!

P.S. The model thinks all gangsta rap would have been popular in the 2000's!

