At the end of Week 8, we finished Project Fletcher! The assignment was to use natural language processing (NLP) techniques. Because the project is language-based, the data source needed to contain lots of text.
Check out the GitHub repo for this project!
I really wanted to do something interesting that people usually don't think about. With our level of knowledge and a two-week timeframe, it was difficult to frame an original question. Being a musician, I decided to do a study on lyrics. The songs my father listened to sound very different from the popular songs of today, but can we tell the difference based solely on the lyrics, without the actual music? What if I could tell what year a song was written based solely on its words? Maybe a song was "ahead of its time," or maybe some modern songs have that "old school" feeling.
Honestly, I thought that it wouldn't be possible to separate songs based on lyrics alone, but that would be a slightly interesting finding as well!
Unfortunately, lyrics are protected under copyright and a database of lyrics is not widely available for download. That means I had to scrape lyric sites which, again unfortunately, are crowd-sourced. More on that later.
Since I could not just get every single song from every year, I first scraped the artists and titles of the top 100 songs per year, from 1950 to 2015, from Billboard's archived top 100 lists. After cleaning those titles and artist names up a bit, I fed them into an automated BeautifulSoup scrape that first tried SongLyrics.com and, if that URL did not exist, fell back to LyricsMode.com.
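To make the "try one site, then fall back to the other" step concrete, here is a minimal sketch of that logic. It is not the project's actual scraper: the function names (`slugify`, `candidate_urls`, `fetch_lyrics_page`) are made up for illustration, and both URL patterns are assumptions that would need to be checked against each site's real structure. It uses only the standard library to fetch the page; parsing the lyrics out of the returned HTML would be the BeautifulSoup step.

```python
import re
import urllib.request


def slugify(text):
    """Lowercase a name and join its words with hyphens (hypothetical scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def candidate_urls(artist, title):
    """Return the URLs to try, in priority order.

    Both patterns are ASSUMPTIONS for illustration only; verify each
    site's real URL structure before relying on them.
    """
    a, t = slugify(artist), slugify(title)
    return [
        f"https://www.songlyrics.com/{a}/{t}-lyrics/",
        f"https://www.lyricsmode.com/lyrics/{a[:1]}/{a}/{t}.html",
    ]


def http_get(url):
    """Return the page HTML, or None if the URL does not exist or the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # HTTP errors (e.g. 404), network failures, and timeouts are OSError subclasses.
        return None


def fetch_lyrics_page(artist, title, fetch=http_get):
    """Try each site in order and return (url, html) for the first hit."""
    for url in candidate_urls(artist, title):
        html = fetch(url)
        if html is not None:
            return url, html
    return None, None
```

Passing the fetcher in as an argument makes it easy to test the fallback order without hitting either site, and leaves room to add a polite delay between requests.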
Because I was only taking the top 100 per year, without splitting the songs into genres, I was getting more and more skeptical about whether the language would differ enough from year to year to separate the songs.
After training a quick model, it became clear very quickly that separating song lyrics by year would not be very reliable. Most of my models were getting accuracy scores of around 0.03 to 0.04. I was expecting this, but then I started wondering if I could instead determine the DECADE in which a set of lyrics would have been popular.
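For context on those numbers, here is a small sketch of how the switch from years to decades changes the problem. It assumes roughly the same number of songs in every year, which may not hold if some lyrics could not be scraped, and `decade_label` is a helper name invented for this example. Under that assumption, guessing among the 66 years from 1950 to 2015 would be right about 1.5% of the time, while guessing among the 7 decades would be right about 14% of the time.

```python
def decade_label(year):
    """Map a year to its decade label, e.g. 1957 -> '1950s'."""
    return f"{year // 10 * 10}s"


years = range(1950, 2016)  # 1950 through 2015 inclusive

n_year_classes = len(years)
n_decade_classes = len({decade_label(y) for y in years})

# Chance-level accuracy if every class were equally likely (an assumption).
chance_year = 1 / n_year_classes
chance_decade = 1 / n_decade_classes

print(f"{n_year_classes} year classes, chance accuracy ~ {chance_year:.3f}")
print(f"{n_decade_classes} decade classes, chance accuracy ~ {chance_decade:.3f}")
```

So an accuracy of 0.03 to 0.04 on years is roughly double random guessing, but still far too low to be useful, which is part of why the decade framing is more promising.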
I chained transformers together in a pipeline that fed into a model, and started looking for features to engineer that could serve as clues to the age of a song. Here is an excerpt of the code for the pipeline I used: