Being a photographer, I knew coming into Metis that I wanted my final project to involve photos. Since I've already done two projects on one of my other passions, music, I decided to combine my love of images with another interest of mine - cars.
This is my cousin, Benny, and me. The picture is a bit old, but he's 8 years old at the time of this writing and can name the manufacturer of every car we drive or walk by at a glance. I wanted to see if I could build something that can recognize a car just as well as he can - or even better, down to the model-specific level!
When you see a car you're curious about, you might know it's a Ferrari, but finding the model is difficult and takes multiple Google searches. When you finally find the model name and throw it into Google, it does a good job with preliminary specs, but if you want something more, it takes even MORE searching.
To have a model that recognizes photos, you need a lot of photos to train it. After scouring the web for a data set, I came across the Stanford AI Cars data set, which consists of 16,185 images across 196 classes. I just wanted to start with a baseline binary classifier: BMW or not BMW. Being partial to BMWs, I hand-picked (with the help of Benny) the BMWs in both the test and training sets. However, even with balanced classes in the validation set (50% BMWs and 50% not BMWs), this manufacturer-level binary classifier ended up with around 62% accuracy (only 12 percentage points above pure chance!).
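To put that 62% in context, here's a small sketch of the comparison against chance. The function name and the image counts are made up for illustration; only the 50/50 split and the 62% figure come from the project.

```python
def majority_baseline(class_counts):
    """Accuracy you'd get by always guessing the most common class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Balanced validation set. The counts are placeholders (assumption);
# only the 50/50 ratio matters here.
validation_counts = {"bmw": 100, "not_bmw": 100}

baseline = majority_baseline(validation_counts)
model_accuracy = 0.62  # accuracy reported for the baseline classifier

# Difference expressed in percentage points, not as a relative increase.
lift_points = (model_accuracy - baseline) * 100
print(f"baseline {baseline:.0%}, model {model_accuracy:.0%}, "
      f"lift {lift_points:.0f} percentage points")
```

With balanced classes, always guessing one class is right half the time, so anything near 50% means the model has learned very little.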
That was definitely worse than Benny - and it was just a binary model on the manufacturer level! I needed to figure out a new approach.
Convolutional Neural Networks
I thought that the model could be getting confused because the photos showed many different cars from slightly different angles, so I tried to simplify the problem first. Could it tell the difference between 7 manufacturers, only from the front view? Using the Fatkun batch image downloader extension for Chrome, I pulled 7,100 images from Google Images results to train my model.
Modern image recognition is built on convolutional neural networks (CNNs), a relatively new deep learning technology. Training one from scratch would usually take a long time, but luckily there are a few sets of weights pre-trained on ImageNet that we can download to help us. I used the VGG-19 weights for my convolutional layers and trained the fully connected top layers to classify my cars. The Keras deep learning library makes this relatively simple, running on top of a TensorFlow or Theano backend. There is an amazing Keras blog post that goes into how you can do this!
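Here's a toy sketch of the "frozen base, trainable top" idea in plain Python. It is not the project's code: `frozen_base` is a hypothetical stand-in for the pre-trained convolutional layers, the 4-pixel "images" are invented, and the top layer is a tiny logistic regression rather than a real fully connected network. In Keras, the analogous step is roughly loading `VGG19` with `include_top=False` and marking its layers as non-trainable.

```python
import math

def frozen_base(pixels):
    """Stand-in for the pre-trained conv layers: fixed, never updated."""
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return [mean, contrast]

def train_top_layer(features, labels, epochs=500, lr=0.5):
    """Fit a tiny logistic-regression 'top layer' on precomputed features."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(top_layer, features):
    """Return class 1 if the top layer's score is positive, else class 0."""
    weights, bias = top_layer
    z = sum(w * xi for w, xi in zip(weights, features)) + bias
    return 1 if z > 0 else 0

# Invented 4-pixel "images": label 0 is dark, label 1 is bright.
images = [[0.1, 0.2, 0.1, 0.3], [0.8, 0.9, 0.7, 0.9],
          [0.2, 0.1, 0.3, 0.2], [0.9, 0.7, 0.8, 0.8]]
labels = [0, 1, 0, 1]

# Run every image through the frozen base once and cache the result.
cached = [frozen_base(img) for img in images]
top_layer = train_top_layer(cached, labels)

new_image = [0.85, 0.9, 0.8, 0.75]
print(predict(top_layer, frozen_base(new_image)))
```

Only `train_top_layer` ever updates any numbers. The base just transforms images, which is why its output can be computed once and reused.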
Also, since we are processing large batches of images, it's a good idea to run everything on an AWS GPU instance instead of your local machine.
Our input here will be a picture. In a nutshell, the convolutional and pooling layers take the image's RGB channels and convert them into a matrix of values, which feeds into the fully connected top layers (labelled FC in the diagram above); these act as the classifier that outputs your predictions.
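To make "convert them into a matrix of values" a bit more concrete, here's a minimal single-channel sketch of one convolution, a ReLU, and one max-pooling step in plain Python. The image and kernel are invented for illustration. A real network like VGG-19 stacks many such layers, learns its kernels, and sums over all three RGB channels.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like CNN layers, this does not flip the kernel (cross-correlation).
    """
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_cols)]
            for r in range(out_rows)]

def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# 6x6 single-channel image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1]] * 3

feature_map = relu(convolve2d(image, kernel))   # 4x4 map of edge responses
pooled = max_pool(feature_map)                  # 2x2 after pooling
flattened = [v for row in pooled for v in row]  # what the FC layers receive
print(feature_map[0], flattened)
```

The feature map lights up only where the edge is, and pooling shrinks it while keeping the strongest responses.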
The New Approach
When I saw that the binary manufacturer model didn't do well, I started thinking about how I could make a more accurate model. Since all of my photos have to go through the pre-trained VGG-19 convolutional layers, I knew I couldn't change those. However, I could train the fully connected top layers on different sets of images! If I could make multiple top layers and switch them out depending on what the previous top layer predicted, I could cut out a large number of possibilities along the way and, hopefully, end up with a less-confused model.
Let's call the output of the convolutional layers, after the photo has gone through the pre-trained VGG-19 weights, "convoluted". We can now start the decision tree-like structure by taking "convoluted" and throwing it through multiple top models.
Passing "convoluted" through the first top layer tells us whether the car is shown from the front or at an angle, with 92% accuracy. Then, depending on that prediction, I throw "convoluted" into either the front-view or the angled-view manufacturer model, which pick among my 7 trained manufacturers with 96% and 86% accuracy, respectively.
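Because the models are chained, their errors compound. Here's a rough back-of-the-envelope sketch, assuming the stages make independent errors and that a wrong view prediction always leads to a wrong manufacturer. Both are simplifying assumptions, so treat the result as a ballpark rather than a measured figure. The function name is made up for illustration.

```python
import math

def chained_accuracy(stage_accuracies):
    """Chance that every stage in a chain is right, assuming independent errors."""
    return math.prod(stage_accuracies)

view_accuracy = 0.92  # front vs. angled view

front_path = chained_accuracy([view_accuracy, 0.96])   # front-view manufacturer model
angled_path = chained_accuracy([view_accuracy, 0.86])  # angled-view manufacturer model

print(f"front: {front_path:.1%}, angled: {angled_path:.1%}")
```

Under those assumptions, getting both the view and the manufacturer right happens roughly 88% of the time for front views and 79% for angled views.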
If the model predicts the image to be an angled view of a BMW, I then throw "convoluted" into the top model that predicts which of 8 different BMW models it is, with 63% accuracy. Although this accuracy drops a lot from the other top models, I think it's a great start, seeing that there are 8 categories to choose from and the differences between BMWs are relatively subtle.
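The routing itself can be sketched in a few lines. This is a minimal illustration, not the project's code: the lambdas are hypothetical stand-ins for trained top models, and the dictionary layout is just one assumed way to organize them.

```python
def classify_car(convoluted, top_models):
    """Route one set of convolutional features through the chain of top models."""
    result = {"view": top_models["view"](convoluted)}

    # Pick the manufacturer model that matches the predicted view.
    make_model = top_models["make"][result["view"]]
    result["make"] = make_model(convoluted)

    # Only some (view, make) pairs have a model-level classifier trained.
    model_level = top_models["model"].get((result["view"], result["make"]))
    if model_level is not None:
        result["model"] = model_level(convoluted)
    return result

# Hypothetical stand-ins for trained top models: each maps features to a label.
top_models = {
    "view": lambda feats: "angled" if feats[0] > 0.5 else "front",
    "make": {
        "front": lambda feats: "Ferrari",
        "angled": lambda feats: "BMW",
    },
    "model": {
        ("angled", "BMW"): lambda feats: "5 Series",
    },
}

print(classify_car([0.9, 0.1], top_models))
print(classify_car([0.2, 0.7], top_models))
```

The same "convoluted" features are reused at every step, so the expensive convolutional pass only happens once per photo. Adding a new model-level classifier is just another entry in the `"model"` dictionary.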
Using Flask, I created a web app that uses the decision tree process and all of my different models. Here's a quick video demo of it!
The first photo URL I am pasting in is a front view of my dream car, the Ferrari 458 Italia Speciale. You can see that the model is very confident that it is a Ferrari!
For the second picture, I threw in a BMW 5 Series and it got it right! You can also see that I'm starting to compile a few specs of the car on the bottom.
Finally, the third picture is a BMW 4 Series. However, it thinks it's a 3 Series! I find that interesting because a 4 Series is just a 2-door 3 Series. Sometimes I get them confused at a quick glance as well. Although the model gets some other pictures of 4 Series right, this one seems to throw it off a bit. Some fine-tuning and tweaking is still needed to bring out the nuances of similar body styles.
Identifying cars isn't only for the enjoyment of gearheads; it could have some real-world applications too. A few use cases could benefit from integrating this approach:
- Integration within ride sharing apps - have you had difficulty locating your driver after requesting an Uber or Lyft right after coming out of an event? From talking to ride sharing drivers, a big problem seems to be passengers not being able to find them. Since the app gives you the make and model of the driver's car, this model could help passengers find their drivers.
- Amber alerts - similarly, amber alerts give you the make and model of a suspect's car. Using existing video infrastructure like traffic and security cameras, we could ping law enforcement when a car matches the description and help with the search.
- Gotta see 'em all - taking a page from Pokemon Go, you could take pictures of all the cool cars you saw and have an easy place to compare the specifications and show all your other gearhead friends!
For this project, I wanted to combine my passions for photos and cars, and to challenge myself to learn something we didn't cover much in class. I didn't want this project to just recycle old projects in a new way, but to push myself to learn something new. Given that we only had 3.5 weeks from brainstorming to presentation, I'm very happy with what I've learned and what the project has become. I didn't have much time to play around with fine-tuning the model(s), so I think that, given time, this could actually become decently robust!
Although it does well with a limited number of makes and models, the biggest weakness is the sheer amount of data needed. As more vehicles are added, the number of photos required to train the model grows rapidly. After all, your prediction is only as good as your data.
Was I able to make something just as good as Benny, if not better? Possibly. On these 7 makes, the two are probably about even. However, just the fact that this was possible blows my mind! Deep learning is truly powerful.
Now it's time to rest a bit, get on that job hunt, and edit those vlogs!