Data Compression and Variational Inference with Variational Autoencoders
In this post, I’ll give an introduction to variational autoencoders, cover some basic machine learning applications, and finally look at their application in variational inference. UNDER CONSTRUCTION
Resources
As usual, here are some of the resources I’m using as references for this post. Feel free to explore them directly if you want more information or if my explanations don’t quite click for you.
- Stanford CS229: Machine Learning | Summer 2019 | Lecture 20 - Variational Autoencoder
- An Introduction to Variational Autoencoders - Diederik P. Kingma, Max Welling
- Variational Autoencoders - Arxiv Insights
- From Autoencoder to Beta-VAE - Lilian Weng
- Found this after I started making this blog post, it’s basically exactly what I wanted to do theory-wise and more. I’m going to focus a little more on implementation but otherwise I’d recommend going over there if you want extensions to the theory.
- GANs, AEs, and VAEs - Andy Casey
- Understanding Variational Autoencoders (VAEs) | Deep Learning - DeepBean
Table of Contents
- Motivation/Traditional Autoencoders
- Core Idea
- Construction of the Loss
- VI Example Application: Black Box VI
- Conclusion
Motivation/Traditional Autoencoders
A big thing in computer science is data compression: taking high-dimensional or simply large unlabelled inputs and reducing them to some smaller representation that is later interpretable or can be used to reconstruct the original inputs. One particularly successful approach is the autoencoder, a machine learning architecture made up of two neural networks: an encoder that takes the large input and reduces it to some small latent representation, and a decoder that takes this latent representation and reconstructs the original input. The general structure is shown below.

Where \(\vec{z}_i \in \mathbb{R}^K\) is the compressed latent representation for the \(i^{th}\) data point \(\vec{x}_i \in \mathbb{R}^D\) (\(K \ll D\)), \(E_\phi\) is the encoder and \(D_\theta\) is the decoder. The loss is then simply how well the output matches the input, as all we care about is whether a good output can come from the latent vector or ‘bottleneck’. If we just use the L2 norm, for a single datapoint \(\vec{x}_i\) this looks like,
\[\begin{align} L_i^{AE}(\phi, \theta) &= ||\vec{x}_i - \vec{y}_i||^2 \\ &=||\vec{x}_i - D_\theta(E_\phi(\vec{x}_i))||^2. \end{align}\]And then for a whole dataset we can either take the sum or the average; they’re equivalent up to a multiplicative constant, so I’ll just use an average over the dataset. For \(M\) datapoints this looks like,
\[\begin{align} L^{AE}(\phi, \theta) &= \frac{1}{M}\sum_{i}^M ||\vec{x}_i - \vec{y}_i||^2 \\ &=\frac{1}{M}\sum_{i}^M ||\vec{x}_i - D_\theta(E_\phi(\vec{x}_i))||^2. \end{align}\]And that’s really it. Very simple but it can do quite a lot. Coding that up real quick this is what it looks like.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(Encoder, self).__init__()
        self.layer_1 = nn.Linear(input_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_3 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_4 = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x_inp):
        x_int = F.relu(self.layer_1(x_inp))
        x_int = F.relu(self.layer_2(x_int))
        x_int = F.relu(self.layer_3(x_int))
        # It's not common to have a sigmoid here, but it makes the plotting later on
        # easier and doesn't change the salient features of the model
        x_int = torch.sigmoid(self.layer_4(x_int))
        return x_int


class Decoder(nn.Module):
    def __init__(self, latent_dim, hidden_dim, output_dim):
        super(Decoder, self).__init__()
        self.layer_1 = nn.Linear(latent_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_3 = nn.Linear(hidden_dim, hidden_dim)
        self.layer_4 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_inp):
        x_int = F.relu(self.layer_1(x_inp))
        x_int = F.relu(self.layer_2(x_int))
        x_int = F.relu(self.layer_3(x_int))
        # Sigmoid so the outputs live in [0, 1], matching the pixel range of the inputs
        x_hat = torch.sigmoid(self.layer_4(x_int))
        return x_hat


class AEModel(nn.Module):
    def __init__(self, input_dim, latent_dim, encoder_hidden_size, decoder_hidden_size):
        super(AEModel, self).__init__()
        self.E_encoder = Encoder(input_dim=input_dim, hidden_dim=encoder_hidden_size, latent_dim=latent_dim)
        self.D_decoder = Decoder(latent_dim=latent_dim, hidden_dim=decoder_hidden_size, output_dim=input_dim)

    def forward(self, x):
        z = self.E_encoder(x)
        x_hat = self.D_decoder(z)
        return x_hat, z
One of the best ways we can see how well it did is to simply check whether it can reproduce the inputs from a testing dataset. In the following I trained on the MNIST dataset, which comprises a bunch of handwritten digits and looks like the below.

We can train the autoencoder above with the number of latent dimensions equal to \(2\) and the number of input dimensions equal to \(28^2\), as the images are \(28\times28\). A rough sketch of the training loop is shown below, and after that we can look at how well the model does with a few grabs of inputs and outputs.
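Here’s a minimal training sketch, continuing from the classes above and assuming MNIST is loaded through torchvision and flattened into vectors of length \(28^2\). The hidden sizes, batch size, learning rate and epoch count are just placeholder choices on my part, not anything tuned.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Flatten each 28x28 image into a 784-dimensional vector with values in [0, 1]
transform = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda x: x.view(-1))])
train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

model = AEModel(input_dim=28 * 28, latent_dim=2, encoder_hidden_size=256, decoder_hidden_size=256)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for x, _ in train_loader:  # the labels are ignored, this is unsupervised
        x_hat, z = model(x)
        # The L2 reconstruction loss from above, averaged over the batch
        loss = ((x - x_hat) ** 2).sum(dim=1).mean()
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()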






You may be wondering why the post isn’t just on autoencoders, as this doesn’t look too bad. Well, the issue becomes more obvious when we investigate the latent space of the model. First, let’s see where images in the MNIST dataset land in the latent space (which, remember, is 2D so we can plot it like this).
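In case it’s useful, this is roughly the kind of code that produces the scatter plot below, continuing from the training sketch above and using a held-out test set loaded the same way as the training data.

import matplotlib.pyplot as plt

test_data = datasets.MNIST(root="data", train=False, download=True, transform=transform)
test_loader = DataLoader(test_data, batch_size=1000, shuffle=False)

model.eval()
all_z, all_labels = [], []
with torch.no_grad():
    for x, labels in test_loader:
        _, z = model(x)  # we only care about the 2D latent coordinates here
        all_z.append(z)
        all_labels.append(labels)
all_z = torch.cat(all_z)
all_labels = torch.cat(all_labels)

plt.scatter(all_z[:, 0], all_z[:, 1], c=all_labels, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.xlabel("$z_1$")
plt.ylabel("$z_2$")
plt.show()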


A few things to note:
- Most of the distributions are bimodal, with some parts mapped to one area of the parameter space and others to completely different areas
- Many of the numbers overlap in parameter space, and they don’t even necessarily look similar (e.g. observe 4, 7 and 9)
- Numbers that you would think are similar are not necessarily put together (e.g. I would put 0, 9 and 6 together but instead 0 and 3 are close?)
- There are areas of the parameter space that are empty; what do these values map to?
This is effectively looking at how the encoder is understanding the information, and the key point is that the latent space is not well structured. We can also see how the space is interpreted by the decoder, by taking inputs ranging from 0 to 1 in each dimension and seeing what they produce in the data space.
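A sketch of that scan, again continuing from the code above: decode a grid of latent points between 0 and 1 (the range the encoder’s sigmoid can reach) and tile the resulting images into one canvas. The grid resolution is arbitrary.

import numpy as np

n = 15  # number of grid points per latent dimension
grid = np.linspace(0.0, 1.0, n)
canvas = np.zeros((n * 28, n * 28))

model.eval()
with torch.no_grad():
    for i, z1 in enumerate(grid):
        for j, z2 in enumerate(grid):
            z = torch.tensor([[z1, z2]], dtype=torch.float32)
            x_hat = model.D_decoder(z).view(28, 28)
            # Place the decoded image so z1 increases left-to-right and z2 bottom-to-top
            canvas[(n - 1 - j) * 28:(n - j) * 28, i * 28:(i + 1) * 28] = x_hat.numpy()

plt.imshow(canvas, cmap="gray")
plt.axis("off")
plt.show()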

We can now kind of see what’s happening: the autoencoder isn’t mapping similarly shaped numbers together. It seems to merely be assigning the numbers areas and then phasing between them, with the stages of the phase not corresponding to anything interpretable. e.g. looking at how the 0 phases into the 5, the ‘in-between’ doesn’t look like anything a human would write. You can also clearly see how the ‘5’s show up in two entirely separate areas, where if you go between them you get a ‘3’?? Because of this arbitrary assignment, the spaces in between the numbers that actually populate the latent space are not interpretable: we can’t say that they will look close to the numbers nearby in the parameter space, they might just be some gobbledygook.
We need some way to make the space more structured. We can do this by, instead of mapping the numbers to points, mapping them to distributions and treating each latent vector as a draw from its distribution. Implicitly, this also makes the space continuous and allows us to generate new data, as we can sample these distributions to produce new and possibly realistic data. This is essentially a Variational Autoencoder.
VAE Core Idea
A Variational Autoencoder1 or VAE has a similar structure to a traditional autoencoder in that it reduces the size of some input, maps it to some latent space, and then maps this back into the data space. The key difference is that the encoder, instead of learning the map to the latent space directly, learns the map \(E_\phi\) from the inputs to the parameters that dictate the conditional distribution \(p_\phi(\vec{z}_i\vert\vec{x}_i)\). The decoder \(D_\theta\) then learns the map to the parameters of the conditional distribution of the data given the latent parameters, \(q_\theta(\vec{x}_i\vert\vec{z}_i)\). This is shown in the diagram below for the case where both \(p\) and \(q\) are normal distributions.

Infuriatingly, for the decoder it is common to fix the covariance values, as learning them makes the training more difficult. And since the decoder distribution is usually a normal, the loss eventually turns into the mean squared error, which then gets called the reconstruction loss. So the decoder loses its probabilistic interpretation, and then some PhD student spends a couple of hours trying to figure out why papers still refer to it as a probability distribution only to learn this fun fact at the end… anyways.
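Here’s a minimal sketch of what this looks like in code for the normal-distribution case, reusing the Decoder class from earlier; the layer structure and sizes are just my own choices, not a prescribed architecture. The encoder outputs a mean and a log-variance for \(p_\phi(\vec{z}_i\vert\vec{x}_i)\), a latent vector is sampled from that normal via the reparameterisation trick (\(\vec{z} = \vec{\mu} + \vec{\sigma}\odot\vec{\epsilon}\) with \(\vec{\epsilon}\sim\mathcal{N}(0, I)\), so gradients can flow back through the sampling step), and, following the fixed-covariance convention above, the decoder just outputs the mean of \(q_\theta(\vec{x}_i\vert\vec{z}_i)\).

class VAEEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(VAEEncoder, self).__init__()
        self.layer_1 = nn.Linear(input_dim, hidden_dim)
        self.layer_2 = nn.Linear(hidden_dim, hidden_dim)
        # Two heads: the mean and log-variance of the normal distribution over z given x
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x_inp):
        x_int = F.relu(self.layer_1(x_inp))
        x_int = F.relu(self.layer_2(x_int))
        return self.mu_head(x_int), self.log_var_head(x_int)


class VAEModel(nn.Module):
    def __init__(self, input_dim, latent_dim, hidden_dim):
        super(VAEModel, self).__init__()
        self.E_encoder = VAEEncoder(input_dim, hidden_dim, latent_dim)
        # The decoder only outputs the mean of the distribution over x given z,
        # with the covariance fixed as discussed above
        self.D_decoder = Decoder(latent_dim=latent_dim, hidden_dim=hidden_dim, output_dim=input_dim)

    def forward(self, x):
        mu, log_var = self.E_encoder(x)
        # Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps
        x_hat = self.D_decoder(z)
        return x_hat, mu, log_var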
Construction of the loss
VAE Black Box VI
Conclusion
I’ve actually found that the Wikipedia page is the best source to get an intuitive feel for VAEs, rather than the blog posts, videos and papers I’ve read. Give it a look if you have the time. ↩