This blog follows Justin's adventure to becoming a data scientist.

His journey starts at Metis, a bootcamp, and will go [through good coffee and tea] to wherever it takes him.

Data Wrangling: CoQ

Originally written on July 8, 2016

The first "week" of bootcamp is over! I survived! Except it wasn't a full week, since our cohort didn't start until Tuesday because July 4th fell on Monday. The holiday also meant we had one fewer day to work on our first project, Project Benson.

Olivia Benson from Law and Order:  SVU

For this assignment, we had to use the MTA's publicly available turnstile count data and frame an analysis that would help an organization or company solve a problem. Our group wanted to do something a little different from the traditional approach to add some flair.

MTA Turnstiles

Our group, awesomely named "Group 4", made up a hypothetical emerging religion called the Church of Quant (CoQ), whose VC backers would excommunicate its supreme leader if the church could not increase its numbers (and generate revenue) during its next recruitment period in March 2017.

Our biggest problems came from the data cleaning, which took, quite literally, all of our time. Since the MTA turnstile data only reports cumulative counts, we had to take the difference between each reported value and the next. However, some weird values would pop up, or a counter would get reset. Instead of just throwing away all of the data that didn't fit our desired framework, we tried to clean it up so the counts were as accurate as we could make them.
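A minimal sketch of the differencing idea, not the group's actual code: it groups readings by turnstile, subtracts each cumulative value from the next, and marks an interval as `None` when the counter went backwards or jumped implausibly. The `interval_counts` function, the `MAX_PER_INTERVAL` cutoff, the turnstile ID `T1`, and the sample readings are all made up for illustration.

```python
from collections import defaultdict

# Assumed cutoff (hypothetical): treat anything above this per interval as bad data.
MAX_PER_INTERVAL = 10_000


def interval_counts(rows, max_per_interval=MAX_PER_INTERVAL):
    """Turn cumulative ENTRIES readings into per-interval counts.

    rows: iterable of (turnstile_id, timestamp, cumulative_entries) tuples.
    Returns {turnstile_id: [(timestamp, count_or_None), ...]} where None
    marks an interval we cannot trust (counter reset or implausible jump).
    """
    by_turnstile = defaultdict(list)
    for turnstile_id, timestamp, entries in rows:
        by_turnstile[turnstile_id].append((timestamp, entries))

    result = {}
    for turnstile_id, readings in by_turnstile.items():
        readings.sort()  # "YYYY-MM-DD HH:MM" strings sort chronologically
        counts = []
        # Pair each reading with the one that follows it.
        for (_, prev), (ts, curr) in zip(readings, readings[1:]):
            diff = curr - prev
            if diff < 0 or diff > max_per_interval:
                counts.append((ts, None))  # flag it instead of guessing
            else:
                counts.append((ts, diff))
        result[turnstile_id] = counts
    return result


# Made-up readings for one hypothetical turnstile, including a counter reset.
sample = [
    ("T1", "2016-03-01 03:00", 5000),
    ("T1", "2016-03-01 07:00", 5020),
    ("T1", "2016-03-01 11:00", 5150),
    ("T1", "2016-03-01 15:00", 12),
    ("T1", "2016-03-01 19:00", 300),
]
cleaned = interval_counts(sample)
```

Flagging a bad interval rather than dropping the whole turnstile keeps the good readings on either side of a reset, which is roughly the trade-off described above.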

['C/A' 'UNIT' 'SCP' 'STATION' 'LINENAME' 
'DIVISION' 'DATE' 'TIME' 'DESC' 
'ENTRIES' 'EXITS                                                               ']

Above are all the columns available in the data that the MTA provides, including the gargantuan gaping white space after 'EXITS' :(
We weren't interested in some of the columns, such as the control area (C/A) and subunit channel position (SCP). The ones we relied on most were station, date, time, entries, and exits.
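One way to deal with that trailing white space is to strip every header name right after reading the file and then keep only the columns of interest. This is a sketch using Python's built-in `csv` module; the data row below is invented, and `KEEP` is just a name chosen for this example.

```python
import csv
import io

# A header like the MTA's (note the padding after EXITS) plus one invented row.
raw = (
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS          \n"
    "X001,U001,00-00-00,EXAMPLE ST,1,IRT,03/01/2016,03:00:00,REGULAR,5000,1700\n"
)

KEEP = ["STATION", "DATE", "TIME", "ENTRIES", "EXITS"]

reader = csv.reader(io.StringIO(raw))
# Strip whitespace so 'EXITS          ' becomes plain 'EXITS'.
header = [name.strip() for name in next(reader)]

records = []
for row in reader:
    full = dict(zip(header, row))
    # Keep only the columns used in the analysis.
    records.append({name: full[name] for name in KEEP})
```

Without the `strip()`, looking up `"EXITS"` would raise a `KeyError`, because the real key would still carry all that padding.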

Using turnstile data from March 2015 and March 2016, we looked at foot traffic across the subway stations to find where the CoQ could interact with the greatest number of potential converts. Our findings were not unexpected: the two busiest stations were Penn Station and Grand Central. When we plotted our top 5 stations, they all clustered in the Midtown region of Manhattan, so we suggested the CoQ invest heavily in resources and evangelists in that area. Unfortunately, the data doesn't specify which 23rd St. station each record refers to, so the 23rd St. total could be a sum across all of them.
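The ranking step can be sketched like this, assuming "foot traffic" means entries plus exits summed per station name (an assumption, since the exact metric isn't spelled out here). The `busiest_stations` function and the numbers are hypothetical. Because the grouping key is only the station name, stations that share a name, like the several 23rd St. stops, get lumped into one total.

```python
from collections import Counter


def busiest_stations(interval_rows, n=5):
    """Rank stations by total traffic.

    interval_rows: iterable of (station_name, entries, exits) per-interval counts.
    Returns the top n (station_name, total) pairs, busiest first.
    """
    totals = Counter()
    for station, entries, exits in interval_rows:
        # Assumed metric: entries + exits; same-named stations merge here.
        totals[station] += entries + exits
    return totals.most_common(n)


# Invented per-interval counts for three placeholder stations.
sample_traffic = [
    ("STATION A", 100, 80),
    ("STATION B", 50, 40),
    ("STATION A", 30, 20),
    ("STATION C", 10, 5),
]
top_two = busiest_stations(sample_traffic, n=2)
```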

Top 5 Busiest Stations in the MTA Network

General Bootcamp Things

Apart from being at school all day every day, I think the hardest transition has been waking up early every day. I used to get around 9 hours of sleep a night, but now I'm averaging around 6. Coffee has also been a bit of an issue here. They only have Folgers, which is definitely NOT the best part of waking up. Going to have to buy some Philz to brew my own!

I am going to try to post a vlog every week with these blog posts as well. They're probably going to be more about everyday life, rather than the work we do. So far, it has been tough but rewarding! Pretty sure the work will ramp up as we get more used to this schedule.

Web Scraping & Regression: Bwaaah