One More Reason Not To Be Scared of Deep Learning
Just how data-hungry is deep learning? It is an important question for those of us who don’t have an ocean of data from somewhere like Google or Facebook and still want to see what this deep learning thing is all about. If you have a moderate amount of your own data and your fancy new model gets mediocre performance, it is often hard to tell whether the fault is in your model architecture or in the amount of data that you have. Learning curves and other techniques for diagnosing training-in-progress can help, and much ink has been spilled offering guidance to young deep learners. We wanted to add to this an empirical case study in the tradeoff between data size and model performance for sentiment analysis.
We asked that question ourselves in the course of our work on sunny-side-up, a project assessing deep learning techniques for sentiment analysis (check out our post on learning about deep learning, which also introduces the project). Most real-world text corpora have orders of magnitude fewer documents than, for instance, the popular Amazon Reviews dataset. Even one of the stalwart benchmark datasets for sentiment analysis, IMDB Movie Reviews, has a "mere" tens of thousands of reviews compared to the Amazon dataset's millions. While deep learning methods have claimed exceptional performance on the IMDB set, some of the top performers are trained on outside datasets. If you were trying to do sentiment analysis on small collections of documents in under-resourced languages like Hausa or Aymara, then 30 million Amazon reviews might not be a great analogue.
To look at the effects of data size in deep learning for text, I'll look at the performance of Zhang, Zhao and LeCun's Crepe convolutional network architecture on differently sized subsets of the Amazon reviews set. The arXiv manuscript for Crepe claims impressive performance on (a different set of) Amazon reviews, so this is an interesting test bed for examining how much data such an algorithm might need for a sentiment task. Their paper suggests that performance degrades on datasets numbering in the hundreds of thousands of documents (which is still pretty big). But the datasets they compare have many more differences than just size, so it is hard to know how much data size itself impacts performance. Let’s look at performance on differently sized samples from the same dataset.
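The Crepe architecture reads raw characters rather than words: each document is quantized into a one-hot matrix over a fixed alphabet and truncated or padded to a fixed frame length. Here is a minimal sketch of that encoding. The alphabet and the 1014-character frame below approximate the values in Zhang, Zhao and LeCun's paper; treat the exact constants as configurable assumptions rather than their code.

```python
import numpy as np

# Roughly the paper's alphabet: lowercase letters, digits, punctuation, newline.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
FRAME_LENGTH = 1014  # fixed input width from the paper

def quantize(text):
    """One-hot encode up to FRAME_LENGTH characters; unknown chars stay all-zero."""
    frame = np.zeros((FRAME_LENGTH, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:FRAME_LENGTH]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:
            frame[pos, idx] = 1.0
    return frame
```

One appeal of this character-level scheme for our small-data question is that it needs no vocabulary built from a large corpus; the input representation is the same no matter how few documents you have.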
To test the Crepe architecture, we used the full "Health and Personal" section of the Amazon Reviews release, which has 3.7 million reviews. Instead of having the sentiment model try to predict the actual star rating, we threw out all the milquetoast, wishy-washy 3-star reviews and called 4- and 5-star reviews positive, while 1- and 2-star reviews were counted as negative. In sentiment analysis this binarized scale is sometimes called polarity. To compare performance across dataset sizes, we trained Crepe for 5 epochs on subsets of the training set: the full 3 million, 500 thousand, 100 thousand, 50 thousand, and 25 thousand. We kept the size and composition of the test set fixed at 700 thousand reviews. We were a bit surprised—the final test accuracy after five epochs of training does not actually degrade as much as we would have expected. The two largest subsets offer almost identical validation performance after just one training epoch, and the third and fourth largest sets largely catch up by the end of five epochs. Only the smallest subset, with 25 thousand documents, fails to learn anything at all.
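In code, the binarization step looks roughly like this. The record format—a dict with "stars" and "text" keys—is a hypothetical schema for illustration, not the Amazon release's actual format.

```python
def binarize(reviews):
    """Map star ratings to polarity labels: drop 3-star reviews,
    label 4-5 stars positive (1) and 1-2 stars negative (0)."""
    labeled = []
    for review in reviews:
        stars = review["stars"]
        if stars == 3:
            continue  # discard neutral, wishy-washy reviews
        labeled.append((review["text"], 1 if stars >= 4 else 0))
    return labeled
```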
Because of class imbalances in this dataset, accuracy numbers can be misleading. The reality, which mirrors many real-world scenarios, is that negative reviews are heavily outnumbered by positive ones. This may be part of the reason the classifiers trained on the 25k and 50k samples were at first unable to learn much beyond categorically predicting the whole test set as one label or the other. Although such a classifier starts off at 17 percent accuracy (predicting everything as negative), in terms of what it has actually learned this is not materially different from 83 percent accuracy (predicting everything as positive). Nonetheless, by the end of five epochs, the four largest subsets have each shown some improvement, with the top three highly competitive with each other.
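A quick sanity check makes the point concrete. On a hypothetical test set with the same 17/83 split as ours, the two degenerate classifiers are equally uninformative, even though their accuracy numbers look very different:

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Toy test set mirroring the ~17% negative / 83% positive split.
labels = [0] * 17 + [1] * 83

all_negative = [0] * 100  # categorically predicts "negative"
all_positive = [1] * 100  # categorically predicts "positive"

print(accuracy(all_negative, labels))  # 0.17
print(accuracy(all_positive, labels))  # 0.83
```

Neither model has learned anything about the reviews; only the second happens to agree with the majority class.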
Even though room for improvement on the baseline of 83% is fairly narrow, many real-world datasets are imbalanced in just this way, so it is interesting to examine how sentiment analysis techniques do on this data. We can look at their performance on the rarer class—negative reviews—to get a good assessment.
Every happy review is alike. Every unhappy review mangles overused Tolstoy quotes in its own way.
Because negative reviews make up only 17% of the test set, it is likely harder for sentiment models trained on this data to identify them. But combining overall data scarcity with unbalanced class sizes gives us a chance to see how deep learning models might perform on real-world data, where the least commonly encountered material is often the most interesting.
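Precision and recall on the rare negative class are easy to compute directly. Here is a minimal plain-Python sketch; the function and argument names are our own, not from any particular library:

```python
def negative_class_metrics(preds, labels, negative=0):
    """Precision and recall for the minority (negative) class."""
    hits = sum(1 for p, l in zip(preds, labels) if p == negative and l == negative)
    predicted_neg = sum(1 for p in preds if p == negative)  # precision denominator
    actual_neg = sum(1 for l in labels if l == negative)    # recall denominator
    precision = hits / predicted_neg if predicted_neg else 0.0
    recall = hits / actual_neg if actual_neg else 0.0
    return precision, recall
```

High precision with low recall on the negative class is exactly the pattern we see from the models trained on smaller subsets: the negatives they do flag are usually right, but they miss many of them.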
As before, the corpus samples with only 25,000 and 50,000 observations did not learn enough to be an interesting point of comparison, but the larger samples quickly reached precision scores matching those of models trained on the whole corpus. Recall of negatives was slower to develop when training on smaller subsets, and the gap between the 500k model and the one trained on the full corpus is still around 5 points in the final epoch. Whether we can do without those five extra points is a problem-specific consideration. For some use cases, like driverless cars, every tenth of a percent counts for potential lives saved and probably a lot of dollar signs too. But for our needs—triaging the sentiment of foreign texts—we’ll gladly take the “close enough” solution from training on a 100k set if the expense of getting more data is too high.
Data size is sometimes listed alongside norm-constrained optimization and dropout as a dependable regularization tactic: more data usually helps your model generalize better to unseen input. Not only do the curves for accuracy, precision, and recall get higher the more data the model has seen, they also jump around much less from epoch to epoch. Finding out how to regularize appropriately in the presence of smaller amounts of data is worth exploring.
Our models here were only trained for five epochs, compared to 10 epochs in the Crepe Paper™. Doing fewer epochs through the data might seem like one way to avoid overfitting, but if you don’t wait for your model to converge, it is much harder to accurately estimate how well your model would do on unseen data.
Deep learning models have large numbers of parameters, so overfitting is always a potential concern, and with smaller numbers of observations it is an even more pressing one. Turning down the power of a deep model—by leaving out layers or cutting them down in size—could remove hundreds of thousands to millions of parameters and still deliver respectable performance. But even taking a highly parameterized model as a given, the famed "data hunger" of deep learning applies less strongly to text classification problems like sentiment analysis than to computer vision or more complicated NLP tasks like machine translation.
We think deep learning is an exciting way to develop robust representations of text data for many NLP tasks, including sentiment analysis, as we explored in sunny-side-up. It is encouraging to see that deep learning can still perform on smaller (or at least midsized) datasets. If you are intrigued and want to explore deep learning in greater detail, take a look at our posts on development environments for deep learning, using the Caffe Model Zoo in Keras, and understanding word2vec. We always have something cooking. Check out the Lab41 GitHub for the latest on our current projects, including deep learning for writer identification and unsupervised pattern discovery in semi-structured logs. We hope you'll visit often!
Tags: datasets, deep learning, metrics, natural language processing, sentiment analysis