By Z.H. Fu

## The Goal of This Tutorial

The EM (Expectation-Maximization) algorithm is a widely used method for fitting models with hidden variables that cannot be observed in the dataset. In this tutorial, we focus on the outline and intuition of the EM algorithm, using a Gaussian mixture model as a running example.

## Gaussian Mixture Model

In the Gaussian mixture model, we assume that there are several generators, each of which produces samples from a Gaussian distribution. Each sample (denoted $x^{(i)}=(x_1,x_2,\dots,x_d)$) in our data is drawn from one of the generators. We can only observe the position of each sample $x^{(i)}$; we do not know which generator $z^{(i)}$ produced it. Here, $z^{(i)}\in\{1,2,\dots,k\}$ stands for the ID of the generator of sample $i$. Each generator $j$ ($j\in[1,k]$) produces samples from a Gaussian distribution with parameters $\mu_j,\Sigma_j$. We use $\theta$ to denote all model parameters, $\theta=(\mu_1,\mu_2,...,\mu_k,\Sigma_1,\Sigma_2,...,\Sigma_k,\phi_1,\phi_2,...,\phi_k)$, where $\phi_j$ is the probability of selecting the $j$th generator and $\sum_{j=1}^k \phi_j=1$. The goal of the Gaussian mixture model is to find the parameter $\theta$ that best describes the data.
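The generative process described above can be sketched in a few lines. This is a hypothetical 1-D mixture with $k=2$; all parameter values are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters theta = (mu, sigma, phi) with k = 2.
phi = np.array([0.3, 0.7])     # mixing weights phi_j, sum to 1
mu = np.array([-2.0, 3.0])     # component means mu_j
sigma = np.array([0.5, 1.0])   # component standard deviations

m = 1000
z = rng.choice(len(phi), size=m, p=phi)  # hidden generator IDs z^(i)
x = rng.normal(mu[z], sigma[z])          # observed samples x^(i)
```

In practice we only get to see `x`; the array `z` is exactly the hidden variable the EM algorithm has to reason about.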

However, things get complicated, since we do not know which class each sample belongs to, i.e. $z^{(i)}=?$. People often confuse model parameters with hidden variables. The difference is that all samples share the same model parameters $\theta$, but each sample can have a different hidden variable $z^{(i)}$.

## General Problem

We abstract the above problem into a more general form. Suppose we have some samples, where each sample $i$ is composed of visible features $x^{(i)}$ and hidden features $z^{(i)}$. Our ultimate goal is to maximize the incomplete log likelihood: \begin{aligned} \log p(x|\theta)&=\log \int_z p(x,z|\theta)\,dz\\ &= \log \int_z q(z|x)\frac{p(x,z|\theta)}{q(z|x)}\,dz\\ &\ge \int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}\,dz=\mathcal{L}(q,\theta)\ \ \ \text{(Jensen's inequality)}\\ \mathcal{L}(q,\theta)&=\int_z q(z|x) \log p(x,z|\theta)\,dz-\int_z q(z|x) \log q(z|x)\,dz\ \ \ \ (\text{the second term does not depend on }\theta) \end{aligned} As shown above, $\int_z q(z|x) \log p(x,z|\theta)\,dz$ is called the expected complete log likelihood, and $\mathcal{L}(q,\theta)$ is a lower bound of the incomplete log likelihood that depends on $\theta$. If we maximize this lower bound, the incomplete log likelihood also increases.
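The bound can be checked numerically on a toy model with a single observation and a binary hidden variable. The joint probabilities below are made up purely for illustration:

```python
import numpy as np

# Toy model: one observed x, hidden z in {0, 1}, fixed theta.
# Illustrative joint values p(x, z | theta).
p_xz = np.array([0.1, 0.3])    # p(x, z=0 | theta), p(x, z=1 | theta)
log_px = np.log(p_xz.sum())    # incomplete log likelihood log p(x | theta)

def elbo(q):
    """L(q, theta) = sum_z q(z) log( p(x, z | theta) / q(z) )."""
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_uniform = np.array([0.5, 0.5])   # an arbitrary q(z|x): strict lower bound
q_posterior = p_xz / p_xz.sum()    # q(z|x) = p(z | x, theta): bound is tight
```

For any valid `q`, `elbo(q)` stays below `log_px`, and choosing the posterior makes the two coincide, which is exactly the equality case derived in the next section.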

In the next step, we show how to choose $q(z|x)$. We claim that when $q(z|x)=p(z|x,\theta)$, the lower bound holds with equality. We substitute $q(z|x)=p(z|x,\theta)$:

\begin{aligned} \mathcal{L}(q,\theta)&=\int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}\,dz\\ &=\int_z p(z|x,\theta) \log\frac{p(x,z|\theta)}{p(z|x,\theta)}\,dz\\ &=\int_z p(z|x,\theta) \log p(x|\theta)\,dz\\ &= \log p(x|\theta)\int_z p(z|x,\theta)\,dz\\ &= \log p(x|\theta) \end{aligned}

The above derivation shows that when $q(z|x)=p(z|x,\theta)$, Jensen's inequality holds with equality. After determining $q(z|x)$, the remaining task is to find a $\theta$ that maximizes the expected complete log likelihood. The step of determining $q(z|x)$ is called the E-step, and the procedure of maximizing the expected complete log likelihood is called the M-step.

## Back to the Gaussian Mixture Model

We use the EM algorithm to solve our Gaussian mixture model.

We denote \begin{aligned}w_j^{(i)}&=p(z^{(i)}=j|x^{(i)};\theta)\\ &=\frac{p(z^{(i)}=j)p(x^{(i)}|z^{(i)}=j;\theta)}{\sum_{l=1}^k p(z^{(i)}=l)p(x^{(i)}|z^{(i)}=l;\theta)}\\ p(z^{(i)}=j)&=\phi_j \end{aligned} $w_j^{(i)}$ can be viewed as the responsibility that component $j$ takes for 'explaining' the observation $x^{(i)}$. $\phi_j$ is the prior probability of selecting the $j$th Gaussian. The lower bound (whose $\theta$-dependent part is the expected complete log likelihood) is:
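The responsibilities follow directly from Bayes' rule as written above. A minimal 1-D sketch (the function names and vectorized layout are my own, not from the text):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def e_step(x, phi, mu, sigma):
    """Responsibilities w[i, j] = p(z^(i) = j | x^(i); theta) via Bayes' rule.
    x: (m,) observations; phi, mu, sigma: (k,) mixture parameters."""
    w = phi * gaussian_pdf(x[:, None], mu, sigma)  # numerator phi_j * N(x_i; mu_j, sigma_j)
    return w / w.sum(axis=1, keepdims=True)        # normalize over the k components
```

Each row of the returned matrix sums to 1, and a sample lying on top of one component's mean gets nearly all of its responsibility from that component.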

\begin{aligned} &\sum_{i=1}^m\sum_{z^{(i)}}q(z^{(i)}|x^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{q(z^{(i)}|x^{(i)})}\\ =&\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}} \end{aligned}

We can calculate its derivative with respect to each mean vector $\mu_l$.

\begin{aligned} &\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{d/2}|\Sigma_j|^{1/2}}\exp\left(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\right)\phi_j}{w_j^{(i)}}\\ =&-\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\\ =&\sum_{i=1}^m w_l^{(i)}\Sigma_l^{-1}(x^{(i)}-\mu_l) \end{aligned} Setting this gradient to zero, we get: $\mu_l=\frac{\sum_{i=1}^m{w_l^{(i)}}x^{(i)}}{\sum_{i=1}^m{w_l^{(i)}}}$
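The resulting update is a responsibility-weighted average of the samples. A minimal 1-D sketch (the helper name is illustrative):

```python
import numpy as np

def update_means(x, w):
    """mu_l = sum_i w_l^(i) x^(i) / sum_i w_l^(i)  (1-D case).
    x: (m,) observations; w: (m, k) responsibilities from the E-step."""
    return (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
```

With hard (0/1) responsibilities this reduces to the per-cluster sample mean, which matches the intuition of the formula.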

The other parameters in $\theta$ can also be solved for analytically by the same method.
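Putting the E-step and the closed-form M-step updates together gives a complete, if minimal, EM loop. This 1-D sketch uses synthetic data and illustrative initial values; the $\phi_j$ and $\Sigma_j$ updates are the standard closed forms obtained by the same zero-gradient method:

```python
import numpy as np

# Synthetic 1-D data from two Gaussians; all values are illustrative.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

k = 2
phi = np.full(k, 1.0 / k)      # mixing weights phi_j
mu = np.array([-1.0, 1.0])     # crude initialization of the means
sigma = np.ones(k)             # standard deviations

for _ in range(200):
    # E-step: w[i, j] = p(z^(i) = j | x^(i); theta)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    w = phi * dens
    w /= w.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the expected complete log likelihood
    n_j = w.sum(axis=0)                 # effective number of samples per component
    phi = n_j / len(x)
    mu = (w * x[:, None]).sum(axis=0) / n_j
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / n_j)
```

After convergence, the estimated means and mixing weights should sit close to the values used to generate the data.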

## Summary

• The EM algorithm can be used to find the parameters of models with hidden (unobserved) variables.
• The EM algorithm constructs a lower bound of the incomplete log likelihood and maximizes that lower bound to increase the original incomplete log likelihood.
• The EM algorithm has two steps, the E-step and the M-step. In the E-step, we set $q(z|x)$ to the posterior $p(z|x,\theta)$. In the M-step, we maximize the expected complete log likelihood over $\theta$ by setting the derivative with respect to each parameter to 0.
