Gab41, No. 38

Generative Methods are a Good Idea for Handwriting

It has been interesting to watch deep learning evolve over the past four years. Deep learning has made some significant advances, but the progress in unsupervised learning has caught my eye recently. I was academically birthed from the womb of a Frequentist, but the impact of deep Bayesian models cannot be ignored. At the Lab we recently completed the D*Script challenge on handwriting author identification. I found myself delving into deep generative models and actually enjoying it. However, I got hung up on some of the key components that differentiate generative architectures from discriminative ones. The purpose of this post is to highlight some of the differences between generative and discriminative auto-encoders, and to show how one could use generative models (specifically DRAW) to solve the problem of writer identification in handwritten documents.

The main method for deep unsupervised learning is the auto-encoder. The traditional auto-encoder (AE) is typically made up of dense, fully connected layers of a feed-forward neural network. AEs are unsupervised in that their target output is the same as their input. The goal of an AE is to compress the input data and then attempt to recreate it (see figure below), much like crushing a soda can and then trying to bend it back into its original form.

The first half of the AE is called the encoder, because it compresses the original input and encodes it into a latent space. Often this latent space has a lower dimensionality than the input. The second half is called the decoder, and its job is to undo the compression and recreate the original input from the encoding. The loss function simply measures how much information was lost when comparing the original input to the reconstruction; the better the reconstruction, the lower the loss. The encoder and decoder are arbitrary functions. More recently, other types of auto-encoders have been developed.
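As a concrete sketch, here is a minimal AE forward pass in NumPy. The weights are randomly initialized stand-ins for learned parameters, and the layer sizes (a 784-dimensional input compressed to a 32-dimensional code) are hypothetical choices, not anything prescribed by the architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 784-dim input (e.g. a flattened 28x28 image)
# squeezed into a 32-dim latent code.
input_dim, latent_dim = 784, 32

# Random weights stand in for parameters a real AE would learn.
W_enc = rng.normal(0.0, 0.01, (input_dim, latent_dim))
W_dec = rng.normal(0.0, 0.01, (latent_dim, input_dim))

def encode(x):
    # Compress the input into the lower-dimensional latent space.
    return np.tanh(x @ W_enc)

def decode(h):
    # Attempt to undo the compression and recreate the input.
    return h @ W_dec

def reconstruction_loss(x):
    # How much information was lost: mean squared error between
    # the input and its reconstruction.
    x_hat = decode(encode(x))
    return np.mean((x - x_hat) ** 2)

x = rng.normal(size=(1, input_dim))  # one fake input "image"
h = encode(x)                        # the crushed can: shape (1, 32)
loss = reconstruction_loss(x)
```

Training would adjust `W_enc` and `W_dec` by back-propagating this loss; with random weights the reconstruction is poor, but the shapes and the loss computation are the same as in a trained model.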

The variational auto-encoder (VAE) was developed with beautiful Bayesian theory to support its existence. VAEs differ from traditional AEs in that they have a generative component and want to say something about the distribution of the latent space $P(z|x)$. After data are compressed by the encoder, the encoding is used to define the parameters of a latent posterior distribution. In other words, we travel from a prior to a posterior in light of the data. From this distribution we can randomly draw a sample $z_{i}$, conditioned on input $x$, for the decoder to reconstruct back into the original input. The decoder is the generative portion of the VAE that makes sense of the latent sample. If we return to the crushed can analogy, this is like crushing a soda can, having it engulfed by The NeverEnding Story’s “Nothing,” which manifests itself on Earth in one of the many forms of human sorrow and suffering, and then reconstructing that back into a soda can (see figure below).

Two significant benefits emerge from this:

1) The input no longer has a direct route from beginning to end as with the AE, thus the decoder must learn how to disentangle the randomly drawn latent sample. This allows the decoder to model complex and even multi-modal distributions.
2) Given a randomly drawn sample $z_{i}$, not conditioned on input $x$, the decoder can “hallucinate” a reconstruction from pure randomness. This allows it to generate samples that look like real-world data.
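The second benefit can be sketched in a few lines: sample $z$ from the prior rather than from the encoder, and push it through the decoder. The decoder weights below are random stand-ins, so the “hallucination” here is just noise; with a trained decoder it would resemble real data:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, output_dim = 32, 784  # hypothetical sizes

# Random stand-in for a trained decoder's weights.
W_dec = rng.normal(0.0, 0.01, (latent_dim, output_dim))

def decode(z):
    # A trained decoder would turn latent samples into realistic outputs.
    return np.tanh(z @ W_dec)

# Draw z from the prior N(0, I) -- no input x involved -- and decode it.
z = rng.standard_normal(latent_dim)
hallucination = decode(z)
```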

However, there are a few obstacles to overcome with the latent posterior $P(z|x)$ from which $z_{i}$ is drawn. First, you won't be able to find it in the forest, because it is untrackable. Second, in practice, exact inference of the posterior is usually intractable. The latter means we cannot differentiate what we need to and must use approximate inference. One method is to approximate $P$ with a distribution we know, call it $Q$. How close we come using $Q$ is evaluated using the Kullback-Leibler divergence. This is the first part of the final loss function: the latent loss. But we still can't back-propagate through this architecture. Enter Diederik Kingma. In the enlightening paper Auto-Encoding Variational Bayes, Kingma et al. developed the “reparameterization trick,” which transforms $z_{i}$ from being defined probabilistically as $z_{i} \sim Q(z|x)$ to being defined deterministically as $z_{i} = \mu + \sigma \cdot \epsilon$, where $\epsilon$ is random noise drawn from a standard normal distribution. (Note: this specific reparameterization is for the normal distribution; others exist for other families of distributions.) $\mu$ and $\sigma$ are parameters learned through back-propagation. The encoder helps define the parameters of $Q(z|x)$, while the decoder attempts to recreate the input from the random sample $z_{i}$. How well we reconstruct the input from the latent sample provides the second part of the VAE's final loss function: the reconstruction loss.
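The reparameterization trick and the latent (KL) loss are compact enough to write out directly. This is a sketch for the diagonal-Gaussian case, where the KL divergence to a standard normal prior has a closed form; parameterizing by `log_sigma` is a common numerical convenience, not something the paper requires:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_sigma):
    # z = mu + sigma * epsilon, with epsilon ~ N(0, I). The randomness is
    # isolated in epsilon, so gradients can flow through mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def latent_loss(mu, log_sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian:
    # 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )
    return 0.5 * np.sum(np.exp(2.0 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)

# In a real VAE the encoder outputs mu and log_sigma; here we fake them.
mu = np.zeros(32)
log_sigma = np.zeros(32)  # sigma = 1, i.e. Q is exactly the prior

z = reparameterize(mu, log_sigma)  # a differentiable latent sample
kl = latent_loss(mu, log_sigma)    # 0 when Q(z|x) matches N(0, I)
```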

So far we have talked about auto-encoders in an unsupervised setting. How do we make the leap to supervised or semi-supervised learning using the auto-encoder architecture? The traditional way to use an AE in the supervised setting is to encode the data and use the latent representation, call it $h_{i}$, as a feature-rich replacement for the input. Then all of the $h$’s are used as a dataset to train a separate classifier. This approach uses two separate architectures (AE and classifier) and hence two separate loss functions. The VAE, on the other hand, already has two parts to its loss function in a single architecture. In another Kingma et al. paper, Semi-Supervised Learning with Deep Generative Models, a third loss is added to the final loss function: the classification loss (here is a presentation by Kingma and Welling that adds more clarity and visuals to their paper). This three-loss architecture is well poised to tackle handwriting challenges. But before we jump into a possible VAE end-to-end solution, let's briefly explore the handwriting authorship challenge.
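Schematically, the semi-supervised objective just sums the three terms. The weighting below (a single `alpha` on the classification term) is a simplification of the paper's objective, not its exact form:

```python
def semi_supervised_loss(reconstruction_loss, latent_loss,
                         classification_loss, alpha=0.1):
    # One loss function, three parts: the two generative terms from the
    # VAE plus a classification term, traded off by a hypothetical alpha.
    return reconstruction_loss + latent_loss + alpha * classification_loss

total = semi_supervised_loss(1.0, 0.5, 2.0)
```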

Author identification is a very difficult problem in handwriting which often gets confused with handwriting optical character recognition (OCR). OCR is concerned with which letter is written, whereas author identification is concerned with how a letter is written. There are a few obstacles that stand in the way of analyzing writing styles. First, unique features need to be captured about each author. Without an automated process and handwriting experts, this can be a very time-intensive problem. Second, any noise (lines, watermarks, stains, etc.) can throw off even the most sophisticated algorithm. Data for many of the handwriting competitions consist of blank white paper with handwriting in black ballpoint pen. See the ICDAR competition for an example. Third, in practice you don’t always know who wrote what. Many handwritten documents in the wild don’t have a signature, and even if they do there is no guarantee it belongs to the name signed. Thus, there is a classification part and a data retrieval part to this problem. So, what do VAEs have to do with all of this? I believe generative models have the end-to-end answer.
