If you’ve been reading for a while you’ve seen a few posts from the Lab talking about “deep learning”. In those posts we used deep learning to solve classification problems (e.g. “Which of my 1000 known classes is in this picture?”). The classification problem is great, but we think the real action in neural networks is the intermediate representation (feature vector) that gets created. Karl’s Anything2Vec post spoke about creating vector representations of text. A challenge when working with these feature vectors is how to index and search them efficiently. The most common type of search is looking up the vectors that are most similar to a given vector: its nearest neighbors. In this post, we'll explore how to use locality-sensitive hashing to do that search over vector representations of images.

This feature vector contains a numeric description of the data. That description isn’t easily explained in words but, in exchange, it can characterize things that the network has never seen before. We can use this property to find similar images, even if a picture isn’t like anything in the data the network was trained on. A quick test looking for “similar” images across a couple thousand pictures was interesting. Here’s an example below. The top picture is the picture we chose and the bottom 3 are its nearest neighbors:

We knew we wanted to try nearest neighbor search across a larger dataset, but we still had a problem (which is where my nearest neighbor odyssey began): every time we wanted to look for similar pictures we had to compare our picture against **every** item in our dataset. For the problem we were exploring, finding how similar two images were took 7 ms. That meant that checking 1,000 items took a reasonable 7 seconds, but comparing against 1 million pictures would take almost two hours.
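To make that cost concrete, here's a minimal brute-force sketch (the dimensions, helper names, and random data are illustrative, not from our actual pipeline): every query scans all *n* vectors, so lookup time grows linearly with the dataset.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_nn(query, vectors, k=3):
    # Compare the query against *every* vector -- O(n) work per lookup.
    sims = [cosine_sim(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -sims[i])[:k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128))
query = vectors[42] + 0.01 * rng.normal(size=128)  # near-duplicate of item 42
print(brute_force_nn(query, vectors, k=3))
```

At 1,000 vectors this is instant; at a million it is a million cosine computations per query, which is exactly the wall we hit.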

When we think about hashing we think about cryptographic hash functions such as MD5 and SHA-1/2, or we think about the hash functions that make hash tables work. These hash functions are generally designed so that a small change in the input produces a large change in the output. Locality-sensitive hash functions (LSH) have the exact opposite property; small changes in the input create small changes in the output. Put another way, similar items are likely to hash to similar values.
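A quick way to see the "small change in, large change out" property is to hash two nearly identical inputs with a cryptographic hash. This toy illustration uses Python's standard `hashlib`; changing a single character scrambles the MD5 digest:

```python
import hashlib

a = hashlib.md5(b"locality").hexdigest()
b = hashlib.md5(b"localitz").hexdigest()  # one character changed

# Count how many hex digits differ: a cryptographic hash
# scatters a one-character change across the whole digest.
diff = sum(x != y for x, y in zip(a, b))
print(a)
print(b)
print(f"{diff}/32 hex digits differ")
```

A locality-sensitive hash makes the opposite promise: those two near-identical inputs should usually land on the same, or nearly the same, output.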

Let’s back up a little and talk about what a hash function is. A hash function is any function that maps an input to a fixed output range (generally, but not necessarily, smaller than the input range). Locality-sensitive hash functions provide this mapping from the input space to a much smaller output space and go one step further: they promise that similar inputs will tend to map to similar outputs. How you define “similar” determines which LSH function you should use. I was interested in cosine similarity, so random binary projections were a good place to start.

We’ll walk through a 2D example of random binary projection to get the basic idea. First we draw a random line splitting our 2D space in two. Everything on one side is a 1 and on the other side is a 0. For the example below we take 4 random splits of our data.

Those splits result in a compressed representation of our data that looks like this:

Ok, but how do I use LSH? It turns out that the probability of two points having the same string of ones and zeros is proportional to the cosine similarity of the points. The same concept extends to higher dimensions: in the 3D case each split is a 2D plane, and in the N-dimensional case each split is an (N-1)-dimensional hyperplane.
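As a rough sketch (the names and random seed here are ours, not from any particular library), the whole scheme is just a matrix of random hyperplane normals and a sign test. The fact that ties the bit strings back to cosine similarity is that, for a random hyperplane, two vectors land on the same side with probability 1 − θ/π, where θ is the angle between them:

```python
import numpy as np

rng = np.random.default_rng(1)
n_planes, dim = 4, 2
# Each row is the normal vector of a random hyperplane through the origin.
planes = rng.normal(size=(n_planes, dim))

def lsh_bits(v):
    # Which side of each hyperplane does v fall on? One bit per plane.
    return tuple((planes @ v > 0).astype(int))

p1 = np.array([1.0, 0.2])
p2 = np.array([1.0, 0.25])   # close to p1 -> likely the same bit string
p3 = np.array([-1.0, -0.2])  # exactly opposite p1 -> every bit flips

print(lsh_bits(p1), lsh_bits(p2), lsh_bits(p3))
```

Because `p3` is `-p1`, every dot product changes sign and its bit string is the exact complement of `p1`'s; `p2` is only a few degrees away from `p1`, so each of its bits agrees with `p1`'s with high probability.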

Using this compressed representation and what we remember from our algorithms class, we can formulate a plan about what to do next (hint: the last 4 paragraphs have been about hashing).

We (might) remember that hash tables give constant-time lookup. Constant-time lookup means that how long it takes to get an item is fixed and doesn’t depend on whether you have 100 or 1 million items. Hash tables are built to find exact matches to a key. Their efficiency relies on minimizing how often two items “collide,” or have the same hash (see: bad things that happen when there are too many collisions).

Instead of exact matches we want near matches. If we can figure out how to use hash tables for this then we'll be in good shape. To accomplish this, roughly, we do the opposite of what traditional hash tables do—we use a locality-sensitive hash function (so close items map to close bins) and we undersize our hash table relative to what a normal application would use. We then use the collisions to find related items.
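Roughly, that idea looks like the following sketch (bucket counts, dimensions, and helper names are illustrative): hash every item into a deliberately small table keyed by its LSH bit string, then answer a query by checking only the items that collided with it.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
dim, n_planes = 64, 8            # 8 planes -> only 256 buckets: deliberately small
planes = rng.normal(size=(n_planes, dim))

def bucket_key(v):
    # The LSH bit string is the hash-table key.
    return tuple((planes @ v > 0).astype(int))

table = defaultdict(list)
vectors = rng.normal(size=(500, dim))
for i, v in enumerate(vectors):
    table[bucket_key(v)].append(i)   # collisions are the point, not a problem

def candidates(query):
    # Only items that landed in the query's bucket get checked.
    return table[bucket_key(query)]

q = vectors[7]
print(len(candidates(q)), "candidates instead of", len(vectors))
```

A real system would build several such tables with independent hyperplanes to boost recall, then rank the surviving candidates by exact cosine similarity.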

If you’re like me, you think this is pretty cool, but you don’t want to write all of this from scratch. Once I knew what I was looking for, I started to look for libraries to help. Since the rest of our work was in Python, Python bindings were a requirement. Among the packages we found were Annoy and NearPy (see here for a more complete comparison).

Each of these packages makes a different promise. Some are optimized for pure speed while others allow you to use scalable storage layers underneath them.

When you put these tools together, you can search a million images very fast. To really test this out, we took ≈1% of the Yahoo Flickr Creative Commons (YFCC) dataset and extracted feature vectors from that data using the output of VGGNet at the last convolutional layer.

The IPython notebook we used to try out nearest neighbor search is here. We tried nearest neighbor search using three different methods: two of the LSH-based frameworks (Annoy and NearPy) and a brute-force search that checks every item.

The HDF5 files we used aren't included, but each file has two datasets in it. One is called *filenames* and is a list of the filenames the image features are extracted from. The second is called *feats* and represents the features extracted using VGGNet at the last convolutional layer. The code we used to generate the HDF5 files borrows preprocessing code from Andrej Karpathy's NeuralTalk codebase.

Ultimately we found that this was an easy way to perform nearest neighbor search in a scalable way. While this post focused on feature vectors to represent images, there are lots of other ways that you can get vector representations of your data (word2vec, bag-of-words, tf-idf). The approach for scaling nearest neighbor search for those problems is similar, with the only potential difference being that you might use a different locality-sensitive hash function if “similarity” is defined differently for your problem.

Locality-sensitive hashing is a powerful way to search for similar items across large datasets. If you want to search through the content of images, and not just metadata tags, LSH represents the best approach we've seen so far. The ease-of-use of the LSH libraries we tried makes it possible to build an image content search system that will scale to millions of images.

Tags: ANN, approximate nearest neighbors, deep learning, locality-sensitive hash