This blog follows Justin's adventure to becoming a data scientist.

His journey starts at Metis, a bootcamp, and will go through good coffee and tea to wherever it takes him.

Classification: Pivot

Project McNulty is about understanding how to use classification methods. This means using a data set to classify or assign data points into groups based on a model we create. To present our model, we were to deploy it in a web app using Flask and D3.js as a visualization tool.

Check out the project on GitHub!

Jimmy McNulty from The Wire

The Original Idea

My group's initial idea was to classify whether a company would be "good" to work for or "not," using its Glassdoor ratings as a reference. After using the Glassdoor API to get data on most of the companies and building our initial model, we had a score of 0.988!

But real life isn't that nice, so we knew something had to be wrong. After some thinking, we realized that we were predicting the overall rating from other ratings. In other words, the features already contained the answer we were asking the model to predict. Of course it would perform well! The "Recommend to a friend" rating in particular had a huge impact, because it is essentially a one-number summary of how people rate a company overall!

It was back to the drawing board.

The Pivot

After some more brainstorming, we decided to do something similar, but classifying startups as "successful" or "not successful." We defined "successful" startups as those which went public (IPO), were bought out, or are still running. In contrast, "not successful" startups were simply the ones which closed. Going one level deeper, we also wanted to see if we could estimate each startup's chances of an IPO, being bought out, still running, or failing.
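The labeling rule above can be sketched as a small function. The status strings used here ("ipo", "acquired", "operating", "closed") are names assumed for illustration; the actual values in the data may be spelled differently.

```javascript
// Hypothetical status names; the real data set may use different spellings.
const SUCCESSFUL_STATUSES = new Set(["ipo", "acquired", "operating"]);

// Returns 1 for a "successful" startup and 0 for a "not successful" (closed) one.
function binaryLabel(status) {
  if (SUCCESSFUL_STATUSES.has(status)) return 1;
  if (status === "closed") return 0;
  // Anything else is unexpected, so fail loudly instead of guessing a label.
  throw new Error("Unknown status: " + status);
}
```

Throwing on unknown statuses is a design choice for this sketch: silently mapping them to 0 would quietly inflate the "not successful" group.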

The Data and Analysis

Our primary data set was archived CrunchBase data through the end of 2015. This gave us data on funding rounds, operating status, types/industries, and locations for a plethora of companies.

First, we tried several models for the initial binary classification: Logistic Regression, K-Nearest Neighbors, Gaussian Naive Bayes, Decision Tree, Random Forest, and XGBoost Classifier. The best of the group was XGBoost, with an AUC of 0.66.
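To put that AUC in context: AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 is chance level and 1.0 is a perfect ranking. Here is a minimal sketch of that definition. It is not how we computed our numbers, and the toy labels and scores are made up for illustration.

```javascript
// AUC as the fraction of (positive, negative) pairs that are ranked correctly.
// Ties count as half. `labels` holds 0/1 values, `scores` holds model scores.
function auc(labels, scores) {
  let correct = 0;
  let pairs = 0;
  for (let i = 0; i < labels.length; i++) {
    if (labels[i] !== 1) continue;          // i iterates over positives
    for (let j = 0; j < labels.length; j++) {
      if (labels[j] !== 0) continue;        // j iterates over negatives
      pairs += 1;
      if (scores[i] > scores[j]) correct += 1;
      else if (scores[i] === scores[j]) correct += 0.5;
    }
  }
  if (pairs === 0) {
    throw new Error("Need at least one positive and one negative example");
  }
  return correct / pairs;
}

// Toy data, made up for illustration only.
const toyLabels = [1, 0, 1, 0];
const toyScores = [0.9, 0.8, 0.4, 0.3];
console.log(auc(toyLabels, toyScores)); // 0.75 (3 of 4 pairs ranked correctly)
```

This pairwise version is quadratic in the number of examples, which is fine for a sketch but slow for large data sets.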

P-R and ROC curves for initial binary classification model

For the second part of our classification, we fit a separate XGBoost model for each outcome, each one predicting that outcome versus all the other outcomes (one-vs-rest). For reference, the binary "operating" result was coded as 1 for operating, so the Precision-Recall curve is skewed to the right.
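As a rough sketch of the "each outcome vs all other outcomes" setup, the targets for the per-outcome models could be built like this. The outcome names are assumed for illustration, and the actual model fitting is not shown.

```javascript
// Hypothetical outcome names; the real data set may use different spellings.
const OUTCOMES = ["ipo", "acquired", "operating", "closed"];

// For each outcome, build a 0/1 target array:
// 1 if the startup had that outcome, 0 for every other outcome.
function oneVsRestTargets(statuses) {
  const targets = {};
  for (const outcome of OUTCOMES) {
    targets[outcome] = statuses.map(s => (s === outcome ? 1 : 0));
  }
  return targets;
}

// Toy statuses, made up for illustration only.
const targets = oneVsRestTargets(["operating", "closed", "ipo", "operating"]);
console.log(targets.operating); // [1, 0, 0, 1]
```

Each of the four target arrays would then be paired with the same features to fit one binary model per outcome.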

P-R and ROC Curve for multi-classification

The Web App

Now that we had the model and results, our final task was to build a web app whose visualization adapted to the variables entered by the user. Unfortunately, we do not have a live demonstration of the app up right now.

As for the visualization, I thought it would be interesting to start with a bar graph showing the percent chance of each outcome side by side, and then have the bars transition into a stacked bar chart to show how much of the whole each outcome takes up. Here is an excerpt of the code:

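function draw_stacked(data){

      // width, height, and color are defined elsewhere in the app
      var chart = d3.select(".chart")
          .attr("width", width)
          .attr("height", height * data.length)

      // Color each existing bar by its value
      var bar = chart.selectAll("rect")
          .style("fill", function(d) { return color(d); });

      // Slide the bars into one row, staggering each by 100 ms
      bar.transition()
            .duration(500).delay(function (d, i) {return i*100} )
            .attr("width", function(d) { return d*5 })
            .attr("height", height - 1)
            // x = sum of all previous values, scaled by 5 px per unit
            .attr("x", function(d, i) {
                        return (typeof data[i-1] !== 'undefined') ? (data.slice(0,i)
                        .reduce(function(a,b) { return a + b}, 0))*5 : 0 } )
            .attr("y", 1)
}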
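The x-position logic in the excerpt above boils down to a running sum: each bar starts where the previous bars end. Here is a standalone sketch of just that calculation, without D3. The function name `stackedOffsets` is made up for this example, and the scale of 5 pixels per unit mirrors the `*5` in the excerpt.

```javascript
// Each bar's x-offset is the sum of all values before it, times a pixel scale.
function stackedOffsets(values, scale) {
  const offsets = [];
  let running = 0;
  for (const v of values) {
    offsets.push(running * scale); // this bar starts where the previous ones end
    running += v;
  }
  return offsets;
}

// Toy percentages, made up for illustration only.
console.log(stackedOffsets([10, 30, 60], 5)); // [0, 50, 200]
```

Keeping a running total avoids re-slicing and re-summing the array for every bar, though for a handful of bars either approach is fine.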

You can see a mock-up of the visualization with static data below! Click the "separate" and "stacked" tabs to see it in action!

As the icing on the cake, I also designed a quick logo with down and up arrows to represent the rise and fall of companies.

The Pivot logo!

Next Steps

There is no reason to believe that our model is exhaustive, especially since many unmeasurable and unpredictable variables go into whether or not a company succeeds. However, additional data that might help includes:

  • Budget breakdown of the companies - how do the companies use the money they receive?
  • Stock market data at company milestones - how is the rest of the economy doing when they receive funding?
  • Marketing avenues - how are the companies marketing their products/services?
  • Growth - how quickly are the companies growing?
  • Level of education of employees and founders
  • More granularity in "operating" - what kind of operating status are they in?

All in all, this project gave my team a good glimpse into how an original idea doesn't always pan out the way you want it to. Being flexible is key, and making the most of what you have can lead to something better than what you first had in mind!
