If you follow any of the popular blogs like the Google Research blog, FastML, Smola's Adventures in Data Land, or one of the indie-pop ones like Edwin Chen's blog, you've probably also used ModelZoo. Actually, if you're like our boss, you affectionately call it "The Zoo". (Actually x 2, if you have interesting blogs that you read, feel free to let us know!)
Unfortunately, ModelZoo is only supported in Caffe. Fortunately, we've taken a look at the difference between the kernels in Keras, Theano, and Caffe for you, and after reading this blog, you'll be able to load models from ModelZoo into any of your favorite Python tools.
Why this post? Why not just download our Github code?
In short, it's better to figure out how these things work before you use them. That way, you're better armed to use the latest TensorFlow and Neon toolboxes for prototyping and then transition your code to Caffe.
So, there's Hinton's Dropout and then there's Caffe's Dropout...and they're different. You might be wondering, "What's the big deal?" Well sir, I have the name of a guy for you, and it's Willy...Mr. Willy Nilly. One thing Willy Nilly likes to do is write neural network layers in powers of 2, his favorite being 4096. Another thing he likes to do is introduce regularization (which includes Dropout) arbitrarily, and Bayesian theorists aren't fans. Those people try to fit their work into the probabilistic framework, and they're trying to hold onto what semblance of theoretical bounds exists for neural networks. For you as a practitioner, though, understanding who's doing what will save you hours of debugging code.
We singled out Dropout because the way people have implemented it runs the gamut. There's actually some history behind this variation, but no one really cared, because optimizing either way has almost universally produced similar results. Much of the discussion stems from how the chain rule is implemented, since randomly throwing stuff away is not really a differentiable operation. Passing gradients back (i.e., backpropagation) is a fun thing to do; there's a "technically right" way to do it, and then there's what works.
Back to ModelZoo, where we'd recommend you note the only sentence of any substance in this section, and it's this: while Keras (and perhaps other packages) multiplies activations by the retention probability at inference time, Caffe does not. That is to say, if you have a dropout level of 0.2, your retention probability is 0.8, and at inference time Keras will scale the output of that layer by 0.8. So, download the ModelZoo *.caffemodel files, but know that deploying them in Caffe will produce unscaled outputs, whereas Keras will scale them.
Hinton explains the reason why you need to scale, and the intuition is as follows. If you've only got a portion of your signal seeping through to the next layer during training, you should scale the expectation of what the energy of your final result should be. Seems like a weird thing to care about, right? The argument that minimizes x is still the same as the argument that minimizes 2x. This turns out to be a problem when you're passing multiple gradients back and don't implement your layers uniformly. Caffe works in instances like Siamese Networks or Bilinear Networks, but should you scale your networks on two sides differently, don't be surprised if you're getting unexpected results.
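Hinton's scaling argument can be written out in one line. If a unit's output is x and it's kept with retention probability p during training, then its expected contribution to the next layer is

```latex
\mathbb{E}[\tilde{x}] \;=\; p \cdot x \;+\; (1 - p) \cdot 0 \;=\; p\,x .
```

At inference time every unit is on and would contribute the full x, so multiplying by p keeps the inference-time output consistent with the training-time expectation. Skip the scaling on one branch of a two-branch network and the two sides' magnitudes no longer match.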
What does this look like in François's code? Look at the Dropout code on Github, or in your installation folder under keras/layers/core.py. If you want to make your own layer for loading in the Dropout module, just comment out the part of the code that does this inference-time scaling.
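Since old-Keras internals shift around, here's a NumPy sketch of what that Dropout code does rather than the literal source (which operates on Theano tensors); the flagged line is the inference-time scaling you'd comment out to match Caffe:

```python
import numpy as np

def dropout_output(X, p, train, rng=np.random):
    """NumPy sketch of old-Keras Dropout behavior; p is the drop probability."""
    if p > 0.0:
        retain_prob = 1.0 - p
        if train:
            # Zero out units independently with probability p.
            X = X * rng.binomial(1, retain_prob, size=X.shape)
        else:
            # Inference-time scaling -- the step Caffe omits.
            # Comment it out to mimic Caffe when loading *.caffemodel weights.
            X = X * retain_prob
    return X
```

With a dropout level of 0.2, `dropout_output(x, 0.2, train=False)` returns `0.8 * x`, exactly the scaling discussed above.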
You can modify the original code, or you can create your own custom layer. (We've opted to keep our installation of Keras clean and just implemented a new class that extends MaskedLayer.) BTW, you should be careful in your use of Dropout. Our experience is that Dropout layers regularize okay, but can contribute to vanishing gradients really quickly.
Every day except Sunday and some holidays, a select few machine learning professors and some signal processing leaders meet in an undisclosed location in the early hours of the morning. The topic of their discussion is almost universally, "How do we get researchers and deep learning practitioners to code bugs into their programs?" One of the conclusions a while back was that the definitions of convolution and of sweeping a dense matrix multiplication across an array (i.e., cross-correlation) should be exact opposites of each other. That way, when people build algorithms that call themselves "Convolutional Neural Networks", no one will know which implementation is actually being used for the convolution portion itself.
For those who don't know, convolution differs from sweeping a matrix multiplication across an array of data in that the convolution kernel is flipped before being slid across the array. Wikipedia gives the definition with the kernel index reversed.
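In discrete 1-D form, that definition reads:

```latex
(f * g)[n] \;=\; \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
```

The n − m index is the flip: g is traversed backwards as it slides across f.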
On the other hand, if you're sweeping matrix multiplications across the array of data, you're essentially doing cross-correlation, which Wikipedia defines the same way except without the flip.
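For real-valued signals (Wikipedia's general form also conjugates f), the discrete definition is:

```latex
(f \star g)[n] \;=\; \sum_{m=-\infty}^{\infty} f[m]\, g[n + m]
```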
Like we said, the only difference is that darned minus/plus sign, which caused us some headache.
We happen to know that Theano and Caffe follow different philosophies. Once again, Caffe doesn't bother with pleasantries and straight up codes efficient matrix multiplies (cross-correlation). Loading models from ModelZoo into either Keras or Theano requires a transformation, because both strictly follow the definition of convolution. The easy fix is to flip the kernels yourself when you're loading the weights into your model. For 2D convolution, this looks like:
weights = weights[:, :, ::-1, ::-1]
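As a sanity check, here's a 1-D stand-in for that 4-D flip (using NumPy's correlate/convolve pair): correlating with a flipped kernel gives the same result as a true convolution.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

# A sliding dot product (cross-correlation) with the flipped kernel
# matches true convolution -- the 1-D analogue of applying
# weights[:, :, ::-1, ::-1] to the spatial axes of a 4-D weight tensor.
corr_flipped = np.correlate(x, k[::-1], mode='valid')
conv = np.convolve(x, k, mode='valid')
assert np.allclose(corr_flipped, conv)
```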
Here, the variable "weights" will be inserted into your model's parameters. You can set weights by indexing into the model (set_weights expects a list of arrays, e.g., [W, b]). For example, to set the weights of the layer at index 9, you would type:
model.layers[9].set_weights(weights)
Incidentally, and this is important: when loading any *.caffemodel into Python, you may have to transpose the weight matrices in order to use them. You'll find out quickly when you load the model and hit a shape-mismatch error, but we thought it worth noting.
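Concretely, the transpose shows up in fully-connected layers: Caffe's InnerProduct blobs are stored as (num_output, input_dim), while old-Keras Dense layers expect (input_dim, num_output). A sketch with small stand-in shapes (VGG's fc6, for instance, is 4096 x 25088 on the Caffe side):

```python
import numpy as np

# Stand-in shapes; a real InnerProduct blob would be e.g. (4096, 25088).
caffe_W = np.random.randn(8, 16)   # (num_output, input_dim) in Caffe
caffe_b = np.random.randn(8)

keras_W = caffe_W.T                # (input_dim, num_output) for Keras Dense
# model.layers[...].set_weights([keras_W, caffe_b])
```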
Alright, alright, we know what you're really here for: just getting the code and running with it. So, we've got some example code that classifies using Keras and the VGG net from the web at our Git (see the link below). But let's go through it just a bit. Here's a step-by-step account of what you need to do to use the VGG caffe model.
The above simply takes in the network name (i.e., the prototxt file) and the parameter set (i.e., the *.caffemodel file). BTW, all this code is linked at our Git page. The output is an ordered dictionary of all the parameters in the variable "params". Really, we're just calling Caffe's loader, caffe.Net.
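For reference, the loading step is roughly the following (file names are examples; with pycaffe, net.params maps each layer name to a list of blobs whose .data attributes hold the arrays). The stand-in below mimics that structure so you can see what "params" looks like:

```python
import numpy as np
from collections import OrderedDict

# With pycaffe installed, the real call looks like:
#   import caffe
#   net = caffe.Net('VGG_ILSVRC_16_layers_deploy.prototxt',
#                   'VGG_ILSVRC_16_layers.caffemodel', caffe.TEST)
#   params = net.params  # layer name -> [weights blob, bias blob]
#   W = params['conv1_1'][0].data

# A NumPy stand-in with the same structure, for illustration:
params = OrderedDict()
params['conv1_1'] = [np.zeros((64, 3, 3, 3)), np.zeros(64)]  # [W, b]

for name, (W, b) in params.items():
    print(name, W.shape, b.shape)
```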
Really, all I did was look at the prototxt file and implement it in Keras. BTW, this is an older version of Keras, so the definitions are a bit wonky. For example, in Convolution and Dense layers, you don't need to specify the input dimension unless it's the first layer. Anyway, you get the point. (We're going to fix/update this when we get around to it; check our Git page.)
Here, "params" is the CNN parameters loaded in with caffe.Net. The variable "model1" is the Keras network. You can actually access the layers inside of your models. So, this is accessing the 0th and 1st layers in the model. Then, you're setting those parameters. Do this for all of the layers.
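Putting it together, the per-layer transfer loop can be sketched as follows (the layer_map alignment here is hypothetical; read the real pairing off your prototxt, and note the kernel flip from the convolution section):

```python
import numpy as np

def caffe_to_keras_conv(W):
    """Flip the two spatial axes so Caffe's cross-correlation kernels
    act as true convolutions in Theano-backed Keras."""
    return W[:, :, ::-1, ::-1]

# Hypothetical alignment of Caffe layer names to Keras layer indices.
layer_map = {'conv1_1': 0, 'conv1_2': 1}

# With 'params' from caffe.Net and 'model1' built in Keras:
# for caffe_name, idx in layer_map.items():
#     W = params[caffe_name][0].data  # weights blob
#     b = params[caffe_name][1].data  # bias blob
#     model1.layers[idx].set_weights([caffe_to_keras_conv(W), b])
```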
And now you have the basics! Go ahead and take a look at our Github for some goodies. Let us know!