This blog follows Justin's adventure to becoming a data scientist.

His journey starts at Metis, a bootcamp, and will go [through good coffee and tea] to wherever it takes him.

Web Scraping & Regression: Bwaaah

Our second project, Project Luther, was to collect movie data by scraping publicly available websites such as Box Office Mojo, IMDb, and Rotten Tomatoes. After forming a question, we scraped the data needed to answer it.

Check out this project's GitHub repo!

 John Luther from the British crime show, [SPOILER ALERT] Luther.

The Question

Being a musician (and a huge fan of movie music), my question was whether or not the top 5 composers make a difference in the value of Domestic Total Gross. This YouTube video does a good job of illustrating the importance of music in movies:

So it's quite apparent that music is very important in a movie, but is it important that you have the right, iconic music? Or is just any music fine?

Okay, maybe that was a little extreme, but maybe just having a composer writing music doesn't translate to the same gross as if you had one of the top 5. For a little bit of context, the average gross per movie for my dataset was around $82,650,000.

In comparison, the average gross for movies of the top 5 composers:

1. Hans Zimmer: 110 movies averaging $95,800,000
2. John Williams: 70 movies averaging $142,990,000
3. James Newton Howard: 117 movies averaging $71,860,000
4. Danny Elfman: 77 movies averaging $93,050,000
5. Alan Silvestri: 88 movies averaging $76,840,000

It seems that 3 of the top 5 composers have an average movie gross higher than the overall average of the movies I looked at. However, something to take into consideration is that perhaps you can hire a good composer because your budget is high, and maybe the high budget is why your movie was good and made more money! That means I definitely had to put budget into the prediction model to take it into account.
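
As a quick check of the comparison above, the quoted figures can be plugged into a few lines of Python. This is only a sketch using the averages listed in this post (in millions of dollars); the variable names are my own, and the pooled average assumes no movie is counted under more than one composer.

```python
# Figures quoted in the post, in millions of dollars
overall_avg = 82.65

# composer -> (number of movies, average domestic gross)
top5 = {
    "Hans Zimmer": (110, 95.80),
    "John Williams": (70, 142.99),
    "James Newton Howard": (117, 71.86),
    "Danny Elfman": (77, 93.05),
    "Alan Silvestri": (88, 76.84),
}

# Composers whose average gross beats the overall average
above_overall = [name for name, (_, avg) in top5.items() if avg > overall_avg]

# Movie-weighted average across all top 5 composers
# (assumption: no movie is counted under two composers)
total_movies = sum(n for n, _ in top5.values())
pooled_avg = sum(n * avg for n, avg in top5.values()) / total_movies

print(above_overall)
print(round(pooled_avg, 2))
```

The first list contains three names, matching the count above. The pooled figure is a rough guide only, since it is rebuilt from rounded averages.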

The Data Scraping

Step one was to scrape what data I could. Using Beautiful Soup, I collected the URLs of all the Box Office Mojo top 100 pages from 1996 to 2016 (present) and went through the individual movie pages to find the information I could use in my prediction model. Thankfully, Box Office Mojo URLs have a very systematic and interpretable pattern. Here is an excerpt of the scrape code to find the relevant individual movie Box Office Mojo page URLs:

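# Get URLs of the top 100 grossing movies per year from 1996 to 2016
import re

import requests
from bs4 import BeautifulSoup

movie_urls = []

years = list(range(1996, 2017))

for year in years:
    # The base of the yearly chart URL is left out of this excerpt
    url = '' + str(year) + '&p=.htm'
    response = requests.get(url)
    page = response.text
    # Naming the parser explicitly avoids the bs4 "no parser specified" warning
    soup = BeautifulSoup(page, 'html.parser')
    tables = soup.find_all("table")
    # Capture the movie id from links of the form /movies/?id=<id>.htm
    url_name = re.findall(r'/movies/\?id=([\w\- ]+)\.htm', str(tables[3]))
    for movie_id in url_name:
        # Assumed loop body: collect each captured id
        movie_urls.append(movie_id)

# Point the Star Wars Special Editions at their original release pages
# (assumed: the fixed name replaces the entry in the list)
for i, movie in enumerate(movie_urls):
    if 'starwars' in movie:
        movie_urls[i] = movie.replace('se', '')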

You can see that at the end, I had to do a little fudging with Star Wars because the Special Editions (#HanShotFirst) were released in 1997, but I had to refer to their 1980's pages to scrape their information. This was important because they are scored by John Williams, one of the big composers on my top 5 list, and my favorite. :)
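
To make the pattern matching and the Star Wars fix easier to follow, here is a small offline sketch. The HTML snippet and the movie ids in it are made up for illustration (they are not copied from Box Office Mojo), and the regular expression is essentially the one used in the scrape.

```python
import re

# Made-up HTML standing in for one yearly chart table; the ids are illustrative
sample_table = (
    '<a href="/movies/?id=starwars4se.htm">Star Wars (Special Edition)</a>'
    '<a href="/movies/?id=titanic.htm">Titanic</a>'
)

# Capture whatever sits between "id=" and ".htm" in each movie link
movie_ids = re.findall(r'/movies/\?id=([\w\- ]+)\.htm', sample_table)

# Map Special Edition ids back to the original release pages
fixed_ids = [m.replace('se', '') if 'starwars' in m else m for m in movie_ids]

print(movie_ids)
print(fixed_ids)
```

Restricting the replacement to ids containing 'starwars' matters, because plenty of other ids could contain the letters "se" somewhere in the middle.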

Deciding what data to scrape mostly came down to which pieces of information seemed important for predicting domestic total gross and which I could actually get. In the end, I settled on:

1. adjusted rank
2. budget
3. domestic rank
4. domestic total gross
5. runtime
6. year of release
7. yearly rank
8. composer (duh!)

The most difficult part of the scrape was handling the different movie page formats. For example, if a movie ranked in the top 10 of that year, its rank would be in bold. To see the full project scrape code, visit my GitHub repo!
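
One way to cope with a value that is sometimes wrapped in bold tags is to ignore the markup and read only the text. The sketch below uses Python's built-in HTMLParser rather than the project's actual code; the helper names and the sample cells are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse_rank(fragment):
    """Return the rank as an int whether or not it is bold, else None."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    text = "".join(collector.parts).strip()
    return int(text) if text.isdigit() else None


# Hypothetical table cells: a bold top 10 rank, a plain rank, and a missing rank
print(parse_rank("<td><b>3</b></td>"))
print(parse_rank("<td>57</td>"))
print(parse_rank("<td>-</td>"))
```

Returning None for anything that is not a plain number keeps the odd page formats from crashing the scrape, at the cost of having to handle missing values later.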

The Data Check

After getting many, many errors, I finally got all the data scraped and made some scatter plots to get an initial understanding of what I was going to work with.

Figure: Budget vs Domestic Total Gross (scatter plot)

It seems that most low budget films end up with a low domestic gross. However, once you cross the $100 million budget threshold, there is an upward trend in the domestic total gross.

Remember that question of whether having a high budget means you were able to hire a top 5 composer? After making a dummy variable equal to 1 if a movie had a top 5 composer and 0 for every other composer, it seems that both groups were pretty evenly distributed across budgets. So a high budget doesn't necessarily mean you have a top 5 composer, and top 5 composers do not necessarily write just for high budget films! Good news for small film makers out there.
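
For anyone unfamiliar with dummy variables, the idea can be sketched in a few lines. The function name and the sample list are placeholders of mine, not rows from the real dataset.

```python
# The five composers named earlier in this post
TOP_5 = {
    "Hans Zimmer",
    "John Williams",
    "James Newton Howard",
    "Danny Elfman",
    "Alan Silvestri",
}


def top5_dummy(composer):
    """Return 1 if the composer is one of the top 5, else 0."""
    return int(composer in TOP_5)


# Placeholder composer values, not rows from the real dataset
composers = ["John Williams", "Some Other Composer", "Hans Zimmer"]
dummies = [top5_dummy(c) for c in composers]
print(dummies)
```

Encoding the composer as a single 0/1 column is what lets a regression estimate one "top 5" effect instead of a separate effect per composer.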

However, looking at the top 5 composer dummy variable against the total domestic gross also shows a decently even distribution with a couple more outliers near the top (namely Star Wars Episode VII: The Force Awakens). Just looking at this histogram made me expect my results to come back as "not really a correlation here, buddy."

The Analysis

At first, I used the adjusted all-time rank as the variable I was going to try and predict. Since the adjusted rank was adjusted by Box Office Mojo for inflation and changing movie ticket prices, I thought that it would be a good way to compare across all the years. However, that just left me with 128 movies that actually had that adjusted all-time rank, and all of my regression models came out with a score (R²) of around 0.05. That means only 5% of the variance could be explained with my model. Not good.
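
The score reported by scikit-learn regressors is R², which is where the "percent of variance explained" reading comes from. Here is a minimal sketch of the calculation in plain Python, with toy numbers that have nothing to do with the movie data.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only
actual = [10.0, 20.0, 30.0]

# Perfect predictions explain all of the variance
perfect = r_squared(actual, actual)

# Always predicting the mean explains none of it
baseline = r_squared(actual, [20.0, 20.0, 20.0])

print(perfect, baseline)
```

A score of 0.05 therefore sits barely above the "always guess the average" baseline.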

Then, I threw out the adjusted rank column and tried to predict Domestic Total Gross and came up with 1,122 movies. Not as large of a sample as I wanted, but it would have to do. After running that regression, I got scores around 0.96! 96% of the variation was accounted for in my model! THIS WAS A BREAKTHROUGH!

Not so fast.

I realized that I was using yearly rank and domestic rank as two of my features (variables used to predict domestic total gross) and the ranks were BASED on their gross, so of COURSE ranks would explain almost ALL of the domestic total gross. Sadly, I had to throw those columns out too. :(
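
To see why the ranks had to go, note that a rank can be rebuilt from gross alone, so feeding it to the model hands over the answer. The sketch below uses toy gross values and hypothetical column names that loosely mirror the scraped fields.

```python
def rank_by_gross(grosses):
    """Rank movies from highest gross (rank 1) to lowest."""
    order = sorted(range(len(grosses)), key=lambda i: grosses[i], reverse=True)
    ranks = [0] * len(grosses)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


# Toy gross values in millions of dollars, for illustration only
ranks = rank_by_gross([50.0, 300.0, 120.0])
print(ranks)

# Hypothetical column names; the rank columns leak the target into the inputs
features = ["budget", "runtime", "year", "top5_composer", "yearly_rank", "domestic_rank"]
leaky = {"yearly_rank", "domestic_rank"}
safe_features = [f for f in features if f not in leaky]
print(safe_features)
```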

After running a linear regression, lasso, ridge, elasticnet, random forest, and gradient boost model on my dataset, linear regression returned the highest score! Below is an excerpt from my linear regression code:

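# Linear Regression
import numpy as np
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import ShuffleSplit, cross_val_score

# X_train and y_train come from an earlier train/test split (not shown)
linear = linear_model.LinearRegression()
# Five random 80/20 splits of the training data
shuffler = ShuffleSplit(n_splits=5, test_size=.2, random_state=40)

score = cross_val_score(linear, X_train, y_train, n_jobs=1, cv=shuffler)
# Fit on the full training set (this call is assumed to be linear.fit)
results = linear.fit(X_train, y_train)

print('model: linear')
print('Scores: ' + str(score))
print('Average Score: ' + str(np.mean(score)))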

So applying my model to the testing data set and comparing against the actual values, we can see that my model mostly under-predicts. Below are the coefficients associated with each feature (the change in the predicted value for a one-unit increase in that feature, all else being equal. In this case, the predicted value is domestic gross in millions of dollars):

Budget (Millions): 0.89366
Runtime (Minutes): 0.32642
Year: 0.28257
Top 5 Composer (dichotomous): 28.02270

To answer my question, this model says that having a top 5 composer is associated with, all else being equal, a greater domestic gross by $28 million! However, this model gave me a score of 0.34, meaning only about 34% of the variance is explained. So although some of this story is explained, not all of it is accounted for.
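
The coefficients can be read as "change in predicted gross per one-unit change in a feature". The sketch below only computes differences between predictions, since the intercept is not reported here; the function name is mine, and it assumes gross and budget are both in millions of dollars.

```python
# Coefficients reported above (gross and budget assumed to be in $ millions)
COEFS = {
    "budget": 0.89366,
    "runtime": 0.32642,
    "year": 0.28257,
    "top5_composer": 28.02270,
}


def predicted_gross_change(**changes):
    """Change in predicted gross ($ millions) for the given feature changes."""
    return sum(COEFS[name] * delta for name, delta in changes.items())


# Switching from any other composer to a top 5 composer, all else equal
composer_effect = predicted_gross_change(top5_composer=1)

# Adding $10 million of budget, all else equal
budget_effect = predicted_gross_change(budget=10)

print(composer_effect, budget_effect)
```

By this reading, the top 5 composer coefficient is worth roughly as much predicted gross as an extra $31 million of budget, though with a score of 0.34 that comparison should be taken loosely.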

Next Steps

Bias and confounders could be present in the data. We know that director, actors, and story also play into the success of a movie, which should be accounted for as well. Also, this is by no means an exhaustive model, as illustrated by the score.

Interestingly, the model showed that each additional year is associated with about $0.28 million more in gross. This could be due to inflation and should be adjusted for. Also, since we are talking about music, including OST sales could also paint a better picture since not everyone goes to watch a movie to listen to the music.
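
The inflation adjustment mentioned above could look something like the sketch below: scale each gross by the ratio of a price index in a target year to the index in the release year. The function name is hypothetical and the index values are placeholders, not real CPI or ticket-price figures.

```python
def adjust_for_inflation(gross, index_release_year, index_target_year):
    """Express a gross in target-year dollars using a price index."""
    return gross * index_target_year / index_release_year


# Placeholder index values, not real CPI or ticket-price figures
adjusted = adjust_for_inflation(100.0, index_release_year=50.0, index_target_year=100.0)
print(adjusted)
```

With every gross expressed in the same year's dollars, the year coefficient would no longer have to absorb rising prices.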

Genre could also be another causal variable since some movies rely more on music than others, such as...musicals. It would also be nice to get the breakdown of the budget to see how much was used on production, post-production, and marketing. Some movies do well just because they were hyped while others didn't because no one really knew about them.

Composer experience could also be a factor in this. For example, Hans Zimmer tends to have "phases" of styles such as Pirates of the Caribbean sounding similar to The Rock soundtrack. Finally, franchise data would be nice to have as well since people may tend to want to watch a sequel if the previous movies did well. If it were a series that used the same themes (e.g. the Star Wars and Harry Potter series) the composer would be on his or her way to iconic status.
