TL;DR version: This post walks through an image classification problem hosted on Kaggle for Yelp. I use Scala, DeepLearning4J and convolutional neural networks. For a self-guided tour, check out the project on Github here.
This project was motivated by a personal desire of mine to:
explore deep learning on a computer vision problem.
implement an end-to-end data science project in Scala.
build an image processing pipeline using real images.
Rather than using the MNIST or CIFAR datasets with pre-processed and standardized images, I wanted to go with a more “wild” dataset of “real-world” images.
I opted for the Kaggle Yelp Restaurant Photo Classification problem. The ~200,000 training images are raw uploads from Yelp users from mobile devices or cameras with a variety of sizes, dimensions, colors and quality.
What I did instead…
I was initially going to document this project end-to-end from image processing to training the convolutional neural networks. However, upon more research and practice actually tuning convolutional networks, I’ve reconsidered my process. While the Kaggle Yelp Photo Classification problem is a novel problem, it turns out not to be a great match with the deep learning techniques I wanted to explore. Thus, this article will focus mainly on the image processing pipeline using Scala. While I introduce DL4J here, I plan to discuss my experience with it in more detail in a forthcoming post.
The Kaggle problem
The Kaggle problem is this. Yelp wants to auto-classify restaurants on the 9 characteristics below:
Each restaurant has some number of images (from a couple to several hundred). However, there are no restaurant features beyond these images. Thus it is a multiple-instance learning problem where each business in the training data is represented by its bag of images.
To deal with the multiple-instance issue, I simply applied the labels of the restaurant to all of the images associated with it and treated each image as a separate record.
To deal with the multiple-label problem, I simply handled each class as a separate binary classification problem. While there are breeds of neural networks capable of classifying multiple labels, such as BP-MLL, these are not currently available in DL4J.
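The two simplifications above can be sketched in a few lines of Scala. This is a minimal illustration, not the project's actual code: the object name LabelExpansion, both function names and the map shapes (image ID to business ID, business ID to a set of label indices) are assumptions made for the example.

```scala
object LabelExpansion {

  /** Copy each business's label set onto every one of its images.
    * Images whose business has no entry in bizLabels are dropped. */
  def imageLabels(imgToBiz: Map[Int, String],
                  bizLabels: Map[String, Set[Int]]): Map[Int, Set[Int]] =
    imgToBiz.collect {
      case (img, biz) if bizLabels.contains(biz) => img -> bizLabels(biz)
    }

  /** Binary target per image for one class: 1 if the image carries it, else 0. */
  def binaryTarget(labels: Map[Int, Set[Int]], cls: Int): Map[Int, Int] =
    labels.map { case (img, ls) => img -> (if (ls.contains(cls)) 1 else 0) }
}
```

Calling binaryTarget once per class yields the separate binary problems, one model each.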
While I didn’t expect my initial approach would land me at the top of the Kaggle leaderboard, I did expect it would allow me to build a reasonable model while exploring new and untested (to me) tools and techniques: DeepLearning4j, Scala and convolutional nets. That assumption turned out to be bigger than I expected.
The noise-to-signal ratio turned out to be too high with the Yelp data to train a meaningful convolutional network given my self-imposed constraints. From what I’ve deduced from the Kaggle forum, most teams are using pre-trained neural networks to extract features from each image. From there it can be tackled as a classical (non-image) classification problem with crafty feature creation and aggregation from the image to restaurant level.
While this is far more computationally efficient and could yield better predictions, it cuts out exactly the part I wanted to explore. I eventually compromised with myself and decided to refactor the image pipeline I developed on this project for a similar but better-posed problem, using CIFAR or a dataset I create myself from ImageNet.
Images in the training set come in various shapes and sizes. See some examples below. My first pass at processing consists of:
resizing images to the same dimensions
Some images are tall…
Some images are wide…
Some images are outside…
Some images are inside…
Some images are food…
And some are random other things…
1. Square images
While images in the training set varied from portrait to landscape and the number of pixels, most were roughly square. Many were exactly 500 x 375, which was also the largest size, presumably the output of Yelp’s own image processing system.
To train a convolutional net, all images need to be the same shape and size. While there are likely fancier tricks and techniques that allow for different sized images, I started simple: make all images square, while preserving as much of the image as possible. I assume that the material of interest is centered, so I capture the middle-most square of each image.
original 500 x 375
squared 375 x 375
This example was created with the following code:
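A minimal sketch of this kind of center crop, assuming the image is already loaded as a java.awt.image.BufferedImage. The object and function names are illustrative and the body is a reconstruction of the idea, not necessarily the code used for the images above.

```scala
import java.awt.image.BufferedImage

object SquareImage {

  /** Crop the centered, largest-possible square out of an image. */
  def makeSquare(img: BufferedImage): BufferedImage = {
    val w = img.getWidth
    val h = img.getHeight
    val side = math.min(w, h)
    // Offsets that center the square along the longer dimension.
    val x = (w - side) / 2
    val y = (h - side) / 2
    // getSubimage shares pixel data with the original rather than copying it.
    img.getSubimage(x, y, side, side)
  }
}
```

For a 500 x 375 input this keeps columns 62 through 436 and all 375 rows.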
2. Re-size images
Now that images are squared, the re-sizing problem is relatively straightforward.
original 500 x 375
re-sized 256 x 256
re-sized 128 x 128
re-sized 64 x 64
re-sized 32 x 32
This example was created with the following code:
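A minimal sketch of resizing with the standard java.awt classes. The names ResizeImage and resizeImg are illustrative, and bilinear interpolation is my assumption for the example; the original code may have used a different scaling method.

```scala
import java.awt.RenderingHints
import java.awt.image.BufferedImage

object ResizeImage {

  /** Scale an image to width x height by drawing it onto a new canvas. */
  def resizeImg(img: BufferedImage, width: Int, height: Int): BufferedImage = {
    val out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB)
    val g = out.createGraphics()
    try {
      // Interpolation choice is an assumption; bilinear is a reasonable default.
      g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                         RenderingHints.VALUE_INTERPOLATION_BILINEAR)
      g.drawImage(img, 0, 0, width, height, null)
    } finally g.dispose()
    out
  }
}
```

Because the input is already square, scaling to 256, 128, 64 or 32 pixels per side does not distort the aspect ratio.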
While DL4J and convolutional nets can certainly handle color images, I decided to simplify computation and start with grayscale. This way a single 64 x 64 pixel image is represented by 4096 features rather than 4096*3 (one for each color channel: R, G, B). There is a good discussion of the numerous ways to do this here. I opted to start with the simplest of all (averaging) which appeared to work quite well.
Here’s an example:
original image (left); grayscale conversion using RGB averaging (right)
This example was created with the following code:
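A minimal sketch of grayscale conversion by RGB averaging, which also flattens the image into the feature vector a network would consume. The names Grayscale, pixelToGray and imageToGrayVector are illustrative, and the row-major ordering is an assumption.

```scala
import java.awt.image.BufferedImage

object Grayscale {

  /** Average the R, G and B channels of one packed RGB pixel. */
  def pixelToGray(rgb: Int): Int = {
    val r = (rgb >> 16) & 0xff
    val g = (rgb >> 8) & 0xff
    val b = rgb & 0xff
    (r + g + b) / 3
  }

  /** Flatten an image into a row-major vector of gray values in 0 to 255. */
  def imageToGrayVector(img: BufferedImage): Vector[Int] =
    (for {
      y <- 0 until img.getHeight
      x <- 0 until img.getWidth
    } yield pixelToGray(img.getRGB(x, y))).toVector
}
```

A 64 x 64 image produces a vector of length 4096, one value per pixel.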
Pipeline - images
Much of this section is specific to the Kaggle problem and discusses the data structures I created and used to store and manage images with their corresponding labels. It’s mainly an exploration of how to structure a data science project with Scala. If you’re primarily interested in DL4J, skip ahead to the Pipeline - DL4J section.
In my image processing pipeline, I converted the functions in the Gists above into methods on the java.awt.image.BufferedImage class.
This allows me to operate on images with chaining like this:
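One common way to add methods to an existing class in Scala is an implicit class. The sketch below assumes that mechanism; ImageSyntax, RichImage and the three method names are illustrative, and the method bodies are simplified stand-ins for the processing steps described earlier.

```scala
import java.awt.image.BufferedImage

object ImageSyntax {

  // Adds the methods below to BufferedImage wherever ImageSyntax._ is imported.
  implicit class RichImage(val img: BufferedImage) extends AnyVal {

    /** Centered, largest-possible square crop. */
    def makeSquare: BufferedImage = {
      val side = math.min(img.getWidth, img.getHeight)
      img.getSubimage((img.getWidth - side) / 2, (img.getHeight - side) / 2, side, side)
    }

    /** Scale to the given dimensions. */
    def resizeImg(width: Int, height: Int): BufferedImage = {
      val out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB)
      val g = out.createGraphics()
      try g.drawImage(img, 0, 0, width, height, null)
      finally g.dispose()
      out
    }

    /** Row-major vector of gray values, each the mean of R, G and B. */
    def toGrayVector: Vector[Int] =
      (for {
        y <- 0 until img.getHeight
        x <- 0 until img.getWidth
      } yield {
        val rgb = img.getRGB(x, y)
        (((rgb >> 16) & 0xff) + ((rgb >> 8) & 0xff) + (rgb & 0xff)) / 3
      }).toVector
  }
}
```

With `import ImageSyntax._` in scope, a call such as `img.makeSquare.resizeImg(64, 64).toGrayVector` reads left to right in the order the steps are applied.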
I’m not sure if this approach of extending an existing class with new methods is preferred to creating a new class, but it seemed to work well for my problem. I imagine it would be less clean if all instances of the original class do not need the newly defined methods. However, this wasn’t the case for me: all images need the new methods.
We need to load a couple of CSV files containing metadata about each image. There are some Scala CSV reader libraries out there like scala-csv, but I forwent these to get more experience testing out Scala. I defined a basic file-reader readcsv which is used by readBizLabels and readBiz2ImgLabels to read in text files containing the labels for each Yelp business and image-to-business mappings respectively.
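A minimal sketch of what a basic reader like readcsv could look like using only the standard library. The signature, the header-skipping behavior and the wrapping object CsvSketch are assumptions for illustration, and the naive split would not handle quoted fields that contain the separator.

```scala
import scala.io.Source

object CsvSketch {

  /** Read a delimited text file into rows of trimmed fields.
    * Assumes the first line is a header and skips it. */
  def readcsv(path: String, sep: String = ","): List[List[String]] = {
    val src = Source.fromFile(path)
    try src.getLines().drop(1).map(_.split(sep).map(_.trim).toList).toList
    finally src.close()
  }
}
```

The more specific readers can then map these rows into whatever shape each file needs, for example turning a space-separated label field into a Set[Int].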
I make heavy use of the Scala map class. Essentially we have three maps:
bizMap (imgID -> bizID)
dataMap (imgID -> img data)
labMap (bizID -> labels)
I suppose I could have made classes for each of these as well, but they’re really just intermediate data structures, so I didn’t bother.
readBizLabels from the code above creates the labMap and readBiz2ImgLabels creates the bizMap. processImages from the code below creates the dataMap. Next step: create a single data representation of these three separate but related data structures.
So there are four pieces of information to keep track of for each image:
The data is represented like this:
I defined a class alignedData to manage it all. When instantiating an instance of alignedData, the bizMap, dataMap and labMap are provided. I used Scala’s Option type for labMap since we don’t have this information when we score test data. None is provided in that case.
Under the hood, the primary data structure has the following type:
List[(Int, String, Vector[Int], Set[Int])]
which corresponds to a list of Tuple4s containing this information:
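Reading the type left to right, the four fields are presumably the image ID, the business ID, the pixel vector and the label set. The sketch below shows one way the three maps could be joined into that shape. It is not the alignedData class itself: AlignSketch, Row and align are hypothetical names, and giving unlabeled rows an empty label set is my assumption.

```scala
object AlignSketch {

  // (imgID, bizID, pixel vector, labels): field order assumed from the type above.
  type Row = (Int, String, Vector[Int], Set[Int])

  /** Join image data with its business ID and, when available, the labels.
    * Images without a business mapping are dropped. */
  def align(dataMap: Map[Int, Vector[Int]],
            bizMap: Map[Int, String],
            labMap: Option[Map[String, Set[Int]]]): List[Row] =
    for {
      (imgId, pixels) <- dataMap.toList
      bizId <- bizMap.get(imgId).toList
    } yield {
      // None (test data) or a missing business both yield an empty label set.
      val labels = labMap.flatMap(_.get(bizId)).getOrElse(Set.empty[Int])
      (imgId, bizId, pixels, labels)
    }
}
```

Passing None for labMap covers the scoring case, where every row simply carries an empty label set.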
I didn’t find many examples of DL4J applications in Scala… one of the reasons I’m documenting this project in detail. However, there are some useful examples here.
This took some exploring to figure out. The default specification for DL4J networks that run with the ND4J DataSet does not train in batches. That is, each epoch (full pass through the training examples) will train on all training examples in one computation step. So all images, their data and corresponding weights must be held in memory at once.
This hogs memory with all but the smallest of datasets. I was running into heap space errors with just ~2,000 images sized 128 x 128. After switching to batch-mode, I was able to train on tens of thousands of images without memory issues.
Before I discovered this was the cause of my heap space problem, I posed my problem to the project contributors on the DL4J Gitter. I was pleased to learn that the next release of DL4J (3.9) is planned to move some computational operations off heap.
How to use batches
It’s easier to figure this out now that the DL4J examples repo has a convolutional net example (MNIST) using batches, which as of a few weeks ago was not there.
The biggest difference from mini-batch to full-batch mode is that you need to pass a MultipleEpochsIterator object rather than an ND4J DataSet to the fit method of your MultiLayerNetwork object. My approach doesn’t fully embrace iterators for their intended purpose, but hey it works and made for a smooth transition using my pipeline. You also need to add .miniBatch(true) to your MultiLayerConfiguration.Builder.
The distinction between iterations and epochs can be slightly confusing when moving from full-batch to mini-batch mode. If you’re not using miniBatches, the iterations method is used to specify how many epochs you want. However, when using batches, this is done directly in MultipleEpochsIterator and iterations can be set to 1. Explained here in the DL4J documentation.
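The relationship between epochs and batches can be illustrated without any DL4J code. The sketch below is plain Scala and only mimics the order in which an epoch-wrapping iterator would hand batches to a network; BatchSketch, batches and updates are hypothetical names and none of this is DL4J API.

```scala
object BatchSketch {

  /** Stream of mini-batches: every epoch walks the full data set once,
    * nbatch records at a time (the last batch may be smaller). */
  def batches[A](data: Seq[A], nbatch: Int, nepochs: Int): Iterator[Seq[A]] =
    Iterator.range(0, nepochs).flatMap(_ => data.grouped(nbatch))

  /** Number of batches seen in total: ceil(nSamples / nbatch) per epoch. */
  def updates(nSamples: Int, nbatch: Int, nepochs: Int): Int = {
    val batchesPerEpoch = (nSamples + nbatch - 1) / nbatch
    batchesPerEpoch * nepochs
  }
}
```

With 10 records, a batch size of 4 and 2 epochs, the iterator yields batches of size 4, 4, 2, 4, 4, 2. Only one batch needs to be processed at a time, which is why memory use drops compared with full-batch training.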
Train convolutional network
This is the function that actually trains the convolutional net. It’s long, probably too long. Lots of hyper-parameters are hardcoded within that could be moved to function arguments or, better yet, a config file. However, I was running locally on my laptop with lots of tinkering, so this worked fine for me.
The CNN training function below does a lot: logging, test/train splitting, creating the MultipleEpochsIterator, training, reporting performance on test data, and saving the trained models to file.
I’ll save the intuition behind tuning for another post after I gain a better understanding myself on a more well posed problem. For now I’ll just ramble about what I tried and what happened.
Layers: I observed from some papers benchmarking convolutional nets solving the MNIST problem that a single convolutional layer generates decent results (certainly better than a benchmark of random, at least enough to get started with). I trained runs with up to three convolutional layers without errors, but my training was obviously slower (although not exponentially) and results were not any better than with one layer. Next time I plan to start with one convolutional layer, start tuning other parameters to get above benchmark results… and then explore additional layers.
# of samples: Training on all images took so much time, I didn’t have the patience to let it finish. It took me about 2 days to run a watered down CNN on my laptop with 50,000 images.
nepochs: This is the number of passes through all the training records. I’ve seen some networks with as few as 20 to as many as 1000 epochs. This is the main idea. There’s also an in-depth discussion about this here. I spent most of my tuning time trading off nepochs and the # of samples. I could tolerate training with lots of images to expose the network to a broader universe of features to learn… but only by cutting down the number of epochs to make run-time manageable.
nOut: This is the number of feature maps. I tried runs with 10 to 500, chosen mostly by reviewing configurations for other image problems and the example DL4J networks.
learningRate: From what I’ve read this is pretty important. I didn’t fiddle with this much though. I think I tried the commonly used .01 and .001.
nbatch: This is the # of records in each batch. I tried 32, 64 and 128. I’m not sure how much of a difference this makes for results vs. computation.
I won’t say much here, since I didn’t end up putting much emphasis on this step for reasons explained at the beginning of this post.
My scoring approach assigns business-level labels by averaging the image-level predictions. I classify a business as label “0” if the average of the probabilities across all of its images belonging to class “0” is greater than 0.5.
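The averaging rule can be sketched as follows for a single class. The input shape (one pair of business ID and predicted probability per image) and the names ScoreSketch, bizProbs and bizHasLabel are assumptions for the example, not the project's actual scoring code.

```scala
object ScoreSketch {

  /** Mean predicted probability per business.
    * preds holds one (bizId, probability) pair per image. */
  def bizProbs(preds: Seq[(String, Double)]): Map[String, Double] =
    preds.groupBy(_._1).map { case (biz, ps) =>
      biz -> ps.map(_._2).sum / ps.size
    }

  /** A business gets the label when its mean probability exceeds the threshold. */
  def bizHasLabel(preds: Seq[(String, Double)],
                  threshold: Double = 0.5): Map[String, Boolean] =
    bizProbs(preds).map { case (biz, p) => biz -> (p > threshold) }
}
```

Running this once per class model and collecting the classes that come back true gives the label set for each business.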
Also not much to say here, but this is how I aggregated image predictions to business scores for each model. And this is the code to generate the output CSV for Kaggle.
The whole project can be run from main.scala. Here it is:
This was my first foray into deep neural networks. I haven’t used theano or any of the other widely used implementations out there, so I unfortunately don’t have much to compare my experience to.
I will say that the current documentation will only take you so far. I spent a lot of time reading Neural Networks and Deep Learning to understand the concepts and reviewing the DL4J source code to try and figure out how to implement what I thought I wanted to do.
Discovering the DL4J Gitter was the single most useful moment I had. The creators are actively answering all sorts of questions in real-time. There’s also a room for earlyadopters discussing testing and feature requests which was interesting to browse. Very impressed with the commitment and willingness to help. I even got an email from someone on the DL4J team after I pushed this project to GitHub offering to help and pointing me to the CNN specialists.
Gitter is where the action is. There’s way more here than on StackOverflow. However, the content doesn’t appear to be indexed nearly as well on Google, so I found myself “Googling” in the Gitter search bar for keywords and perusing through conversations to get answers.
I recommend using the deeplearning4j-ui tool if you can. I unfortunately wasn’t able to get it working, but it looks super useful for understanding how your net training is going.
Other awesome resources I found for visualizing training for CNNs are ConvNetJS and this one.