Probability refresher
In order to develop various machine learning algorithms (especially towards the end of the course, when we will focus on generative modelling), we need to be familiar with some basic mathematical tools from:
- Probability: a mathematical framework to handle uncertain statements;
- Information Theory: a scientific field focused on quantifying the amount of uncertainty in a probability distribution.
Probability
Random Variable: a variable whose value is unknown; all we know is that it can take on different values with given probabilities. It is generally denoted by an uppercase letter (e.g., $X$), whilst the values it can take are denoted by lowercase letters (e.g., $x$).
Probability distribution: description of how likely a random variable is to take on each of its possible states.
- Discrete distributions: called Probability Mass Function (PMF), denoted as $P(X=x)$; the variable can take on a discrete number of states $N$. A classical example is represented by a coin, where $N=2$ and $x \in \{\text{head}, \text{tail}\}$. For a fair coin, $P(X=\text{head}) = 0.5$ and $P(X=\text{tail}) = 0.5$.
- Continuous distributions: called Probability Density Function (PDF), denoted as $p(x)$; the variable can take on any value from a continuous space (e.g., $x \in \mathbb{R}$). A classical example is represented by the Gaussian distribution, where $x \sim \mathcal{N}(\mu, \sigma^2)$.
A probability distribution must satisfy the following conditions:
- each of the possible states must have a probability bounded between 0 (no chance of occurrence) and 1 (certainty of occurrence): $0 \le P(X=x) \le 1$ (or $p(x) \ge 0$, where the upper bound is removed because the integration step in the second condition can be smaller than 1: $p(x)\,dx \le 1$);
- the sum of the probabilities of all possible states must be equal to 1: $\sum_x P(X=x) = 1$ (or $\int p(x)\,dx = 1$); a numerical check of both conditions is shown below.
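To make these two conditions concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; all values are illustrative) that checks them for a fair-coin PMF and for a Gaussian PDF whose density exceeds 1:

```python
import numpy as np
from scipy.stats import norm

# Discrete case: PMF of a fair coin (two states, each with probability 0.5)
pmf = np.array([0.5, 0.5])
assert np.all((pmf >= 0) & (pmf <= 1))  # each probability lies in [0, 1]
assert np.isclose(pmf.sum(), 1.0)       # probabilities sum to 1

# Continuous case: PDF of a Gaussian with mean 0 and standard deviation 0.1.
# The density itself exceeds 1 near the mean, yet it still integrates to 1.
x = np.linspace(-1.0, 1.0, 10001)
dx = x[1] - x[0]
pdf = norm.pdf(x, loc=0.0, scale=0.1)
print(pdf.max())         # ~3.99 > 1, allowed for a PDF
print(np.sum(pdf) * dx)  # ~1.0, the normalization condition
```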
Joint and Marginal Probabilities: assuming we have a probability distribution acting over a set of variables (e.g., $X$ and $Y$), we can define the:
- Joint distribution: $P(X=x, Y=y)$ (or $p(x, y)$), which is the probability of the variables taking on specific values at the same time;
- Marginal distribution: $P(X=x) = \sum_y P(X=x, Y=y)$ (or $p(x) = \int p(x, y)\,dy$), which is the probability spanning one or a subset of the original variables (see the snippet below).
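As a minimal sketch of marginalization (the joint table below is made up for illustration), the marginal is simply a sum of the joint over the unwanted variable:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y), with X taking 2 states and Y taking 3 states
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
assert np.isclose(P_xy.sum(), 1.0)

# Marginal distributions are obtained by summing over the other variable
P_x = P_xy.sum(axis=1)  # P(X=x) = sum_y P(X=x, Y=y)
P_y = P_xy.sum(axis=0)  # P(Y=y) = sum_x P(X=x, Y=y)
print(P_x, P_y)         # each marginal sums to 1
```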
Conditional Probability: provides us with the probability of an event given the knowledge that another event has already occurred:
$$P(Y=y \,|\, X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$$
This formula can be used recursively to define the joint probability of $N$ variables as a product of conditional probabilities (the so-called Chain Rule of Probability):
$$P(x_1, x_2, \ldots, x_N) = P(x_1) \prod_{i=2}^{N} P(x_i \,|\, x_1, \ldots, x_{i-1})$$
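Reusing the same hypothetical joint table, a conditional distribution is obtained by slicing the joint and renormalizing by the corresponding marginal; this is only a sketch, not part of the original notes:

```python
import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
P_x = P_xy.sum(axis=1)

# Conditional P(Y=y | X=0): slice the joint at X=0 and divide by P(X=0)
P_y_given_x0 = P_xy[0, :] / P_x[0]
print(P_y_given_x0, P_y_given_x0.sum())  # a valid distribution, sums to 1

# Chain rule check for two variables: P(x, y) = P(x) P(y | x)
print(np.allclose(P_x[0] * P_y_given_x0, P_xy[0, :]))  # True
```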
Independence and Conditional Independence: two variables $X$ and $Y$ are said to be independent if
$$P(X=x, Y=y) = P(X=x) \, P(Y=y) \quad \forall x, y$$
If both variables are conditioned on a third variable $Z$ (i.e., $P(X=x, Y=y \,|\, Z=z)$), they are said to be conditionally independent if
$$P(X=x, Y=y \,|\, Z=z) = P(X=x \,|\, Z=z) \, P(Y=y \,|\, Z=z) \quad \forall x, y, z$$
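A quick numerical check of independence for discrete variables (again with made-up tables): a joint is independent if and only if it equals the outer product of its marginals:

```python
import numpy as np

# An independent joint, built as the outer product of two marginals
P_x = np.array([0.3, 0.7])
P_y = np.array([0.2, 0.5, 0.3])
P_xy_indep = np.outer(P_x, P_y)
print(np.allclose(P_xy_indep,
                  np.outer(P_xy_indep.sum(axis=1), P_xy_indep.sum(axis=0))))  # True

# The joint table used earlier is not independent
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
print(np.allclose(P_xy, np.outer(P_xy.sum(axis=1), P_xy.sum(axis=0))))  # False
```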
Bayes Rule: a probabilistic way to update our knowledge of a certain phenomenon (called prior) based on a new piece of evidence (called likelihood):
$$P(X \,|\, Y) = \frac{P(Y \,|\, X) \, P(X)}{P(Y)}$$
where $P(X)$ is the prior, $P(Y \,|\, X)$ is the likelihood, $P(X \,|\, Y)$ is the posterior, and $P(Y) = \sum_x P(Y \,|\, X=x) \, P(X=x)$ is the evidence, acting as a normalization constant.
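A minimal sketch of a Bayesian update for a binary variable; the prior and likelihood values are made up for illustration:

```python
import numpy as np

# Prior belief over a binary hypothesis X
prior = np.array([0.8, 0.2])       # P(X=0), P(X=1)

# Likelihood of observing the evidence Y=1 under each hypothesis
likelihood = np.array([0.1, 0.9])  # P(Y=1 | X=0), P(Y=1 | X=1)

# Evidence: P(Y=1) = sum_x P(Y=1 | X=x) P(X=x)
evidence = np.sum(likelihood * prior)

# Posterior: P(X | Y=1) via Bayes rule
posterior = likelihood * prior / evidence
print(posterior, posterior.sum())  # updated belief, sums to 1
```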
Mean (or Expectation): given a function $f(x)$ of a random variable $x$ with probability distribution $P(x)$, its mean (or expectation) is defined as
$$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x) \, f(x)$$
and for the continuous case
$$\mathbb{E}_{x \sim p}[f(x)] = \int p(x) \, f(x) \, dx$$
In most Machine Learning applications, we do not have knowledge of the full distribution to evaluate the mean; rather, we have access to $N$ equi-probable samples that we assume are drawn from the underlying distribution. We can then approximate the mean via the Sample Mean:
$$\mathbb{E}_{x \sim P}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
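A short sketch of the sample-mean approximation for a function of a Gaussian random variable, assuming NumPy (the choice $f(x) = x^2$ is arbitrary, but has a known analytical expectation):

```python
import numpy as np

rng = np.random.default_rng(0)

# For x ~ N(mu, sigma^2), the analytical mean of f(x) = x^2 is mu^2 + sigma^2
mu, sigma = 1.0, 2.0
analytical = mu**2 + sigma**2

# Sample mean: average f over N samples drawn from the distribution
x = rng.normal(mu, sigma, size=100_000)
sample_mean = np.mean(x**2)

print(analytical, sample_mean)  # the two values should be close for large N
```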
Variance (and Covariance): given a function $f(x)$ of a random variable $x$ with probability distribution $P(x)$, the variance measures how much the values of $f(x)$ vary around their mean:
$$\mathrm{Var}(f(x)) = \mathbb{E}\left[ \left( f(x) - \mathbb{E}[f(x)] \right)^2 \right]$$
Covariance is the extension of the variance to two or more variables, and it tells how much these variables are related to each other:
$$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[ \left( f(x) - \mathbb{E}[f(x)] \right) \left( g(y) - \mathbb{E}[g(y)] \right) \right]$$
Here, a large positive covariance indicates that the two variables tend to take on high (or low) values at the same time, a negative covariance indicates that one tends to be high when the other is low, and a covariance close to zero indicates that they are linearly unrelated.
Finally, the covariance of a multidimensional vector $\mathbf{x} \in \mathbb{R}^d$ is a $d \times d$ matrix, called the covariance matrix, with entries $\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j)$; its diagonal contains the variances of the individual components.
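The following sketch (assuming NumPy, with an arbitrary covariance matrix) estimates a 2x2 covariance matrix from samples and shows that its diagonal holds the variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated 2D samples from a multivariate Gaussian with a chosen covariance
true_cov = np.array([[1.0, 0.8],
                     [0.8, 2.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=50_000)

# Sample estimate of the covariance matrix (rows of `samples` are observations)
est_cov = np.cov(samples, rowvar=False)
print(est_cov)           # close to true_cov
print(np.diag(est_cov))  # the diagonal contains the variances of the two components
```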
Distributions: some of the most commonly used probability distributions in Machine Learning are listed in the following.
1. Bernoulli: single binary variable $x \in \{0, 1\}$ with probability:
$$P(X=1) = \pi, \qquad P(X=0) = 1 - \pi$$
and moments equal to:
$$\mathbb{E}[x] = \pi, \qquad \mathrm{Var}(x) = \pi (1 - \pi)$$
2. Multinoulli (or categorical): extension of the Bernoulli distribution to $K$ different states, with $P(X=k) = \pi_k$ and $\sum_{k=1}^{K} \pi_k = 1$.
3. Gaussian: most popular choice for continuous random variables (most distributions are close to a normal distribution, and the central limit theorem states that the sum of many independent variables is approximately normal):
$$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2 \pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right) = \sqrt{\frac{\beta}{2 \pi}} \exp\left( -\frac{\beta (x - \mu)^2}{2} \right)$$
where the second definition uses the precision $\beta = 1 / \sigma^2$ in place of the variance.
4. Multivariate Gaussian: extension of the Gaussian distribution to a multidimensional vector $\mathbf{x} \in \mathbb{R}^d$:
$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2 \pi)^d \det \boldsymbol{\Sigma}}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$
where again the precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ can be used in place of the covariance matrix $\boldsymbol{\Sigma}$.
5. Mixture of distributions: any smooth probability density function can be approximated by a weighted sum of simpler distributions:
$$p(x) = \sum_{i=1}^{K} \pi_i \, p_i(x)$$
where $\pi_i$ are the mixture weights, satisfying $\sum_{i=1}^{K} \pi_i = 1$. A special case is the so-called Gaussian Mixture, where each probability $p_i(x) = \mathcal{N}(x; \mu_i, \sigma_i^2)$ is a Gaussian with its own mean and variance (see the sketch after this list).
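Here is the promised sketch of a Gaussian Mixture (assuming NumPy and SciPy; weights, means, and standard deviations are made up for illustration), showing both how to sample from it and how to evaluate its density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical two-component Gaussian mixture
weights = np.array([0.3, 0.7])  # pi_i, summing to 1
means = np.array([-2.0, 1.0])
stds = np.array([0.5, 1.0])

# Sampling: first pick a component with probability pi_i, then sample from that Gaussian
comp = rng.choice(len(weights), size=10_000, p=weights)
samples = rng.normal(means[comp], stds[comp])
print(samples.mean())           # ~0.1, the mixture mean sum_i pi_i mu_i

# Density evaluation: weighted sum of the component densities
x = np.linspace(-5.0, 5.0, 1001)
pdf = sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))
print(np.sum(pdf) * (x[1] - x[0]))  # ~1.0, i.e., the mixture is a valid PDF
```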
Information theory
In Machine Learning, we are sometimes interested in quantifying how much information is contained in a signal or how much two signals (or probability distributions) differ from each other.
A large body of literature exists in the context of telecommunications, where it is necessary to study how to transmit signals from a discrete alphabet over a noisy channel. More specifically, a code must be designed so as to send the least amount of bits for the largest amount of useful information. Extensions of this theory to continuous variables are also available and are more commonly used in the context of ML systems.
Self-information: a measure of information designed so that likely events have low information content, less likely events have higher information content, and independent events have additive information:
$$I(x) = -\log P(x)$$
such that for $P(x) = 1$ we have $I(x) = 0$ (a certain event carries no information), and for independent events $I(x, y) = I(x) + I(y)$.
Shannon entropy: extension of self-information to an entire probability distribution, representing the expected amount of information in an event drawn from that distribution:
$$H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]$$
(for continuous variables, this quantity is usually called differential entropy).
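A small sketch computing the Shannon entropy of a Bernoulli distribution, which is largest for a fair coin and zero for a deterministic outcome:

```python
import numpy as np

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy (in nats) of a Bernoulli distribution with P(X=1) = p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]  # convention: 0 * log(0) = 0
    return float(-np.sum(probs * np.log(probs)))

print(bernoulli_entropy(0.5))  # ~0.693 nats (log 2), the maximum uncertainty
print(bernoulli_entropy(0.9))  # lower: the outcome is less uncertain
print(bernoulli_entropy(1.0))  # 0: a certain event carries no information
```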
Kullback-Leibler divergence: extension of entropy to two probability distributions $P(x)$ and $Q(x)$ defined over the same random variable, measuring how different they are from each other:
$$D_{KL}(P \,||\, Q) = \mathbb{E}_{x \sim P}\left[ \log \frac{P(x)}{Q(x)} \right]$$
which is always non-negative and equal to zero if and only if $P$ and $Q$ are the same distribution. Note that the KL divergence is not symmetric ($D_{KL}(P \,||\, Q) \neq D_{KL}(Q \,||\, P)$ in general), so it is not a proper distance measure.
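Finally, a sketch of the discrete KL divergence between two made-up categorical distributions, illustrating its non-negativity and asymmetry:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Discrete D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0  # terms with p(x) = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # a different value: KL is not symmetric
```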