Generative Modelling and Variational AutoEncoders
Up until now, our attention has been mostly focused on supervised learning tasks where we have access to a certain number
of training samples, in the form of input-target pairs, and we train a model (e.g., a NN) to learn the best possible mapping
between the two. These kinds of models are usually referred to as discriminative models, as they learn from the training samples
the underlying conditional probability distribution $p(\mathbf{y}|\mathbf{x})$ of the targets given the inputs.
In the last lecture, we have also seen how the general principles of supervised learning can be adapted to accomplish a number of different tasks where input-target pairs are not available. Dimensionality reduction is one such task; tasks of this kind are usually categorized under the umbrella of unsupervised learning.
Another very exciting area of statistics that has recently been heavily influenced by the deep learning revolution is the
so-called field of generative modelling. Here, instead of having access to input-target pairs, we are only able to gather
a (large) number of samples $\mathbf{x}^{(i)} \sim p(\mathbf{x})$, and we wish to either:
- Learn the underlying distribution $p(\mathbf{x})$, or
- Learn to sample from the underlying distribution $p(\mathbf{x})$.
Obviously, the first task is more general and usually more ambitious. Once you know a distribution, sampling from it is a rather easy task. In the next two lectures, we will however mostly focus on the second task and discuss two popular algorithms that have shown impressive capabilities to sample from high-dimensional, complex distributions.
To set the scene, let's take the simplest approach to generative modelling, one that has nothing to do with neural networks. Let's imagine
we are provided with $N_s$ training samples $\mathbf{x}^{(i)}$ drawn from a multi-dimensional Gaussian distribution. The procedure consists of two phases:
- Training
  - Compute the sample mean and covariance from the training samples: $\boldsymbol{\mu} = \frac{1}{N_s}\sum_{i=1}^{N_s} \mathbf{x}^{(i)}$, $\quad \boldsymbol{\Sigma} = \frac{1}{N_s-1}\sum_{i=1}^{N_s} \left(\mathbf{x}^{(i)}-\boldsymbol{\mu}\right)\left(\mathbf{x}^{(i)}-\boldsymbol{\mu}\right)^T$
  - Apply the Cholesky decomposition to the covariance matrix: $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T$
- Inference / Generation
  - Sample a vector from a unitary, zero-mean normal distribution: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  - Create a new sample from the true distribution: $\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$
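As a concrete illustration, here is a self-contained NumPy sketch of this recipe (the dataset, dimensions, and variable names are our own illustrative choices):

```python
import numpy as np

# Training samples: Ns realizations of a d-dimensional Gaussian (toy data)
rng = np.random.default_rng(0)
Ns, d = 1000, 3
X = rng.multivariate_normal(mean=[1., 2., 3.],
                            cov=[[2., .5, 0.], [.5, 1., .2], [0., .2, .5]],
                            size=Ns)

# Training: sample mean and covariance, followed by Cholesky factorization
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)
L = np.linalg.cholesky(Sigma)

# Inference / Generation: draw z ~ N(0, I) and map it through mu + L z
z = rng.standard_normal(d)
x_new = mu + L @ z
print(x_new)
```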
Unfortunately, the multi-dimensional distributions that we usually find in nature are hardly ever Gaussian, and this kind of simple
generative modelling procedure falls short. Nevertheless, the approach that we take with some of the more advanced generative modelling
methods that we are going to discuss later on in this lecture does not differ from what we have done so far: a training phase, where the
free parameters of the chosen parametric model (e.g., a NN) are learned from the available data, followed by a generation phase that uses
the trained model and some stochastic input (like the vector $\mathbf{z}$ above) to produce new samples.
Variational AutoEncoders (VAEs)
Variational AutoEncoders were proposed by Kingma and Welling in 2013. As the name implies, these networks take inspiration from the AutoEncoder networks that we presented in the previous lecture. However, some small, yet fundamental, changes are made to the network architecture as well as to the learning process (i.e., loss function) to turn this family of networks from dimensionality reduction tools into generative models.
Let's start by looking at a schematic representation of a VAE:
Even before we delve into the mathematical details, we can clearly see that one main change has been made to the network architecture:
instead of directly producing a latent vector $\mathbf{z}$, the encoder now outputs a mean vector $\boldsymbol{\mu}$ and a standard deviation vector $\boldsymbol{\sigma}$, from which the latent vector is obtained by sampling from $\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$; in practice this sampling is implemented as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
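A minimal PyTorch sketch of such an encoder head is shown below; the layer sizes and names are illustrative assumptions, not the exact architecture used in the original paper:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder that outputs the mean and log-variance of the latent distribution."""
    def __init__(self, in_dim=784, hidden_dim=128, latent_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mean vector
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log-variance (for numerical stability)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)
```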
Reparametrization trick
This rather simple trick is referred to as the reparametrization trick, and it is strictly needed every time we want to introduce a stochastic process within the computational graph of a neural network. In fact, by simply having a stochastic process parametrized by a mean and standard deviation that come from a previous part of the computational graph (as in VAEs), we lose the ability to perform backpropagation. If we instead decouple the stochastic component (which we are not interested in updating, and therefore do not need to backpropagate through) from the deterministic components, we retain access to backpropagation.
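A minimal, standalone sketch of the trick in PyTorch (our own example, with hypothetical variable names) is shown below; note how gradients reach $\boldsymbol{\mu}$ and $\log\boldsymbol{\sigma}^2$ even though a random draw is involved:

```python
import torch

def reparametrize(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The random draw (eps) carries no learnable parameters, so it sits outside
    the backpropagated path, whilst mu and logvar remain differentiable.
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # stochastic component (not backpropagated onto)
    return mu + std * eps         # deterministic in mu and std

# Minimal check that gradients reach mu and logvar
mu = torch.zeros(4, requires_grad=True)
logvar = torch.zeros(4, requires_grad=True)
z = reparametrize(mu, logvar)
z.sum().backward()
print(mu.grad, logvar.grad)       # both gradients are populated
```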
Why VAEs?
Before we proceed to discuss the loss function and training procedure of VAEs, a rather simple question may arise: 'Why can we not use AEs for generative modelling?'
In fact, this could be achieved by simply modifying the inference step, where instead of feeding the decoder a precomputed latent vector $\mathbf{z}$ (i.e., the encoding of one of the training samples), we feed it a randomly generated one.
Unfortunately, whilst this idea may sound reasonable, we are soon faced with a problem. In fact, the latent manifold learned by an AE may
not be regular; in other words, it may be hard to ensure that areas of the manifold that have not been properly sampled by the training data will
produce meaningful samples.
As we can see, if a part of the latent 1-d manifold is not rich in training data, the resulting generated sample may not be representative at all.
Whilst we have discussed techniques that can mitigate this form of overfitting (e.g., sparse AEs), VAEs take the learning process to a whole new level
by choosing a more appropriate regularization term.
Regularization in VAEs
In order to better understand the regularization choice in VAEs, let's look once again at a schematic representation of a VAE, but this time with a probabilistic mindset:
where we highlight the fact that the encoder and decoder can be seen as probability approximators. More specifically:
- $q_\phi(\mathbf{z}|\mathbf{x})$: the encoder learns to sample from the latent space distribution conditioned on a specific input $\mathbf{x}$;
- $p_\theta(\mathbf{x}|\mathbf{z})$: the decoder learns to sample from the true distribution conditioned on a specific latent sample $\mathbf{z}$.
By doing so, we can reinterpret the reconstruction loss as the negative log-likelihood of the decoder. And, provided that we have defined a
prior for the latent space (usually a zero-mean, unit-variance normal distribution, $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$), the regularization term can be expressed as the distance between the distribution produced by the encoder and this prior.
As in any statistical learning process, the overall loss of our VAE shows a trade-off between the likelihood (i.e., learning from data) and the prior (i.e., keeping close to the initial guess).
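Anticipating the mathematical derivation provided at the end of this lecture, this trade-off can be written compactly as (our notation, following the probabilistic view above):

$$
\mathcal{L}(\theta, \phi; \mathbf{x}) = \underbrace{KL\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)}_{\text{regularization (prior)}} \; \underbrace{- \, E_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]}_{\text{reconstruction (likelihood)}}
$$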
Before we provide a mathematical derivation supporting these claims, let's briefly try to build some intuition as to why adding this regularization
makes VAEs better behaved than AEs when it comes to generating representative samples of the input distribution.
The effect of the regularization term in VAEs is such that the probability density function of the latent space is forced to stay close to the chosen prior (e.g., a zero-mean, unit-variance normal distribution), so that latent vectors drawn from the prior at generation time fall in regions that the decoder has learned to map back to meaningful samples.
More precisely, the regularization term in VAEs ensures the following two properties for the latent space:
- continuity: two points that are close in the latent space are decoded into similar samples in the original space;
- completeness: any point sampled from the latent distribution is decoded into a meaningful sample in the original space.
Mathematics of VAEs
To conclude our lecture on VAEs, we would like to gain a stronger mathematical understanding of the inner workings of this model. In order to do so, we need to introduce a technique commonly used in statistics to estimate complex distributions. This technique goes under the name of Variational Inference (VI).
Let's begin from the classical setup of Bayesian inference. We are interested in a certain probability distribution that we want to sample from or characterize (e.g., in terms of its mean and standard deviation), for example the following posterior distribution in a general inverse problem setting:

$$
p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z}) \, p(\mathbf{z})}{p(\mathbf{x})}
$$

where $\mathbf{x}$ represents the observed data, $\mathbf{z}$ the unknown model (or latent) vector, $p(\mathbf{x}|\mathbf{z})$ the likelihood, $p(\mathbf{z})$ the prior, and $p(\mathbf{x})$ the so-called evidence, a normalization term that is usually intractable to compute as it requires integrating over all possible values of $\mathbf{z}$.
Variational Inference approaches the above problem in a special way. A parametric distribution $q_\phi(\mathbf{z})$, the so-called proposal, is chosen, and its free parameters $\phi$ are optimized such that the proposal becomes as close as possible to the posterior, where closeness is measured by the Kullback-Leibler (KL) divergence:

$$
\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \; KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x})\right)
$$
Let's now expand the expression of the KL divergence and show an equivalent formula for this optimization problem:

$$
\begin{aligned}
KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x})\right) &= E_{q_\phi(\mathbf{z})}\left[\log q_\phi(\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= E_{q_\phi(\mathbf{z})}\left[\log q_\phi(\mathbf{z}) - \log p(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -E_{q_\phi(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right] + KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right) + \log p(\mathbf{x})
\end{aligned}
$$

where we can eliminate the evidence $\log p(\mathbf{x})$ from the objective function since it does not depend on $\phi$:

$$
\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \; -E_{q_\phi(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right] + KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right)
$$

Here, the first term, $-E_{q_\phi(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right]$, is the negative log-likelihood of a traditional Maximum Likelihood estimation (i.e., the data misfit term); in the special case of Gaussian noise (i.e., data contaminated by noise $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$), this becomes the MSE loss discussed in one of our previous lectures. The second term, $KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right)$, is the regularization term encouraging the proposal distribution to stay close to the prior.
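To make the Gaussian-noise statement explicit, if $p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(g(\mathbf{z}), \sigma^2\mathbf{I})$ for some forward operator $g$ (a symbol we introduce here only for illustration), then

$$
-\log p(\mathbf{x}|\mathbf{z}) = \frac{1}{2\sigma^2}\,\|\mathbf{x} - g(\mathbf{z})\|_2^2 + \text{const},
$$

which is, up to scaling and an additive constant, the familiar MSE loss.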
Finally, let's slightly rearrange the terms of the above derivation:

$$
\underbrace{E_{q_\phi(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right)}_{ELBO} = \log p(\mathbf{x}) - KL\left(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathbf{x})\right)
$$

The left hand side of this equation is called the Evidence Lower Bound (ELBO). The name comes from the fact that the sum of these two terms is
always smaller than or equal to the (log) evidence, $ELBO \leq \log p(\mathbf{x})$, since the KL divergence on the right hand side is non-negative. Maximizing the ELBO is therefore equivalent to minimizing the KL divergence between the proposal and the posterior.
Whilst we now understand the theoretical foundations of VI, to make it practical we need to specify:
- A suitable proposal $q_\phi(\mathbf{z})$, where suitable means that we can easily evaluate such a probability, its KL divergence with a prior of choice, as well as sample from it. The simplest choice that is sometimes made in VI is named the mean-field approximation, where $q_\phi(\mathbf{z}) = \prod_{i=1}^{N} q_{\phi_i}(z_i)$ with $q_{\phi_i}(z_i) = \mathcal{N}(\mu_i, \sigma_i^2)$. This implies that there is no correlation over the different variables of the N-dimensional proposal distribution. Whilst this choice may be too simple in many practical scenarios, it is important to notice that this is not the same as assuming that the variables of the posterior itself are uncorrelated!
- A suitable optimizer. In the case where multiple $\mathbf{x}$ samples are available and $q_\phi(\mathbf{z})$, $p(\mathbf{x}|\mathbf{z})$, and $p(\mathbf{z})$ are differentiable, we can simply use a stochastic gradient method. This special case of VI is named ADVI (Automatic Differentiation Variational Inference); a minimal sketch is shown after this list.
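To make the second point concrete, here is a small, self-contained toy example (our own sketch, not a full ADVI implementation) that fits a mean-field Gaussian proposal to a simple linear-Gaussian posterior by stochastic gradient descent on the negative ELBO:

```python
import torch
import torch.distributions as dist

# Toy setup: prior z ~ N(0, I) and Gaussian likelihood x = z + noise (assumed model)
torch.manual_seed(0)
N = 2                                    # dimension of z
x_obs = torch.tensor([1.5, -0.5])        # observed data (toy values)
prior = dist.Normal(torch.zeros(N), torch.ones(N))

# Mean-field proposal q(z) = prod_i N(mu_i, sigma_i^2), parametrized by mu and log-sigma
mu = torch.zeros(N, requires_grad=True)
log_sigma = torch.zeros(N, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    q = dist.Normal(mu, log_sigma.exp())
    z = q.rsample()                      # reparametrized sample: gradients flow to mu, log_sigma
    log_lik = dist.Normal(z, 0.5).log_prob(x_obs).sum()   # log p(x|z), noise std = 0.5
    kl = dist.kl_divergence(q, prior).sum()               # closed-form KL(q || p(z))
    loss = -log_lik + kl                 # negative ELBO
    loss.backward()
    opt.step()

print(mu.detach(), log_sigma.exp().detach())
```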
Moving back to where we started, the VAE model, let's now rewrite the problem as a VI estimation, where the proposal distribution is now conditioned on the input, $q_\phi(\mathbf{z}|\mathbf{x})$, and the likelihood is parametrized by the decoder, $p_\theta(\mathbf{x}|\mathbf{z})$:

$$
\hat{\theta}, \hat{\phi} = \underset{\theta, \phi}{\mathrm{argmin}} \; KL\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) - E_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]
$$

where the first term is responsible for updating the encoder only, whilst the second term contributes to the update of both the encoder and the decoder.
The proposal distribution is here parametrized as a Gaussian with diagonal covariance, $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right)$, whose mean and standard deviation are produced by the encoder network; with a standard normal prior, the KL term can then be computed in closed form, whilst the expectation is approximated with one (or a few) reparametrized samples.
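Finally, a minimal sketch of how such a loss is typically implemented in practice (our own example, assuming a Gaussian decoder so that the reconstruction term reduces to an MSE, and a standard normal prior so that the KL term has a closed form):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar):
    """Negative ELBO for a VAE with a standard normal prior.

    KL(N(mu, sigma^2) || N(0, I)) has the well-known closed form below;
    the reconstruction term assumes a Gaussian decoder (i.e., an MSE misfit).
    """
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1.0)
    rec = F.mse_loss(x_rec, x, reduction="sum")
    return kl + rec

# Toy usage with random tensors standing in for encoder/decoder outputs
x = torch.rand(8, 784)
x_rec = torch.rand(8, 784, requires_grad=True)
mu = torch.zeros(8, 2, requires_grad=True)
logvar = torch.zeros(8, 2, requires_grad=True)
loss = vae_loss(x, x_rec, mu, logvar)
loss.backward()
```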