## The Goal of This Tutorial

EM (Expectation Maximization) algorithm is a widely used method when a model has hidden variables that cannot be observed from the dataset. In this tutorial, we focus on the outline and intuition of the EM algorithm. A Gaussian mixture model is studied as an example.

## Gaussian Mixture Model

In the Gaussian mixture model, we assume that there are several generators which can generate samples from a Gaussian distribution. Each sample (denoted as \(x^{(i)}=(x_1,x_2,...,x_d)\)) of our data is sampled from one of the generators. We can only observe the position of each sample \(x\) and we don’t know for each sample \(x^{(i)}\), which generator \(z^{(i)}\) generate it. Here, \(z^{(i)}=\{1,2,..,k\}\) stand for the ID of each generator. For each generator \(j\)(\(j\in[1,k]\)), it generates samples by a Gaussian distribution with parameter \(\mu_j,\Sigma_j\). We use \(\theta\) to denote all model parameters, \(\theta=(\mu_1,\mu_2,...,\mu_k,\Sigma_1,\Sigma_2,...,\Sigma_k,\phi_1,\phi_2,...,\phi_k)\). Where \(\phi_j\) is the probability of selecting \(j\)th generator and \(\sum_{j=i}^k \phi_j=1\). The goal of Gaussian mixture model is to find a most suitable parameter \(\theta\) to describe the data.

However, things get complicated, since we don’t know each sample belongs to which class \(z^{(i)}=?\). Some people often doubt the relation between model parameters and hidden variables. The difference is that for each single sample, the model parameter is the same \(\theta\) but they can have different hidden viaravbles \(z^{(i)}\).

## General Problem

We abstract above problem into more general form. Suppose we have some samples, each sample \(i\) is composed of visible features \(x^{(i)}\) and hidden features \(z^{(i)}\). Our ultimate goal is to maximize the **incomplete log likelihood**: \[\begin{aligned}
\log p(x|\theta)&=\log \int_zp(x,z|\theta)dz\\
&= \log \int_z q(z|x)\frac{p(x,z|\theta)}{q(z|x)}\\
&\ge \int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}=\mathcal{L}(q,\theta)\ \ \ \text{(Jensen's inequality)}\\
&=\int_z q(z|x) \log p(x,z|\theta)-\int_z q(z|x) \log q(z|x)\ \ \ \ (\text{The second term is not related to }\theta)\\
\end{aligned}\] As shown above, \(\int_z q(z|x) \log p(x,z|\theta)\) is called **expected complete log likelihood**. \(\mathcal{L}(q,\theta)\) is the lower bound of the **incomplete log likelihood**. It is a lower bound of the incomplete log likelihood related to \(\theta\). If we maximize it, the incomplete log likelihood will also increase.

In the next step, we show how to choose \(q(z|x)\). We claim that when \(q(z|x)=p(z|x,\theta)\), the lower bound get the equal sign. We substitute \(q(z|x)=p(z|x,\theta)\):

\[\begin{aligned} &\mathcal{L}(q,\theta)=\int_z q(z|x) \log\frac{p(x,z|\theta)}{q(z|x)}\\ =&\int_z p(z|x,\theta) \log\frac{p(x,z,\theta)}{p(z|x,\theta)}\\ =&\int_z p(z|x,\theta) \log p(x|\theta)\\ =& \log p(x|\theta)\int_z p(z|x,\theta)\\ =& \log p(x|\theta) \end{aligned}\]

In above derivation, we show that when \(q(z|x)=p(z|x,\theta)\), Jensen’s inequality gets equal. After we determined \(q(z|x)\), the remaining issue is to find a \(\theta\) that maximize the **expected complete log likelihood**. The step of determine \(q(z|x)\) is call the E - step, and the procedure of maximizing the **expected complete log likelihood** is called the M-step.

## Back to Gaussians Mixture Model

We use the EM algorithm to solve our Gaussian mixture model.

We denote \[\begin{aligned}w_j^{(i)}&=p(z^{(i)}=j|x^{(j)};\theta)\\ &=\frac{p(z^{(i)}=j)p(x|z^{(i)}=j;\theta)}{\sum_{l=1}^k p(z^{(i)}=l)p(x|z^{(i)}=l;\theta)}\\ p(z^{(i)}=j)&=\phi_j \end{aligned}\] \(w_j^{(i)}\) can be viewed as the responsibility that component k takes for ‘explaining’ the observation x. \(\phi_j\) is the priori distribution of selecting \(j\)th Gaussians. The expected complete log likelihood is:

\[\begin{aligned} &\sum_{i=1}^m\sum_{z^{(i)}}q(z^{(i)}|x^{(i)})\log\frac{p(x^{(i)},z^{(i)};\theta)}{q(z^{(i)}|x^{(i)})}\\ =&\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}\exp(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma^{-1}(x^{(i)}-\mu_j))\phi_j}{w_j^{(i)}} \end{aligned}\]

We can calculate the derivative of it w.r.t. each element of \(\mu\) (denoted as \(\mu_l\)).

\[\begin{aligned} &\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\log\frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}\exp(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma^{-1}(x^{(i)}-\mu_j))\phi_j}{w_j^{(i)}}\\ =&-\nabla_{\mu_l}\sum_{i=1}^m\sum_{j=1}^k w_j^{(i)}\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma^{-1}(x^{(i)}-\mu_j)\\ =&\sum_{i=1}^m\sum_{j=1}^k w_l^{(i)}\Sigma^{-1}(x^{(i)}-\mu_l) \end{aligned}\] Set this term equal to 0. We get: \[\mu_l=\frac{\sum_{i=1}^m{w_l^{(i)}}x^{(i)}}{\sum_{i=1}^m{w_l^{(i)}}}\]

Other parameters in \(\theta\) can also be solved analytically by the same method.

## Summary

- EM algorithm can be used to find the parameters of models dealing with data with missing variables.
- EM algorithm constructs the lower bound of the incomplete log likelihood. It maximizes the lower bound to increase the original incomplete log likelihood.
- EM algorithm has two steps, i.e. E-step and M-step. In E-step, we use the best estimation \(p(z|x,\theta)\) to estimate \(q(z|x)\). In M-step, we maximize the parameters \(\theta\) by setting the derivative of each parameter to 0.

## Reference

- Ng, Andrew. “CS229 Lecture notes.” CS229 Lecture notes 1.1 (2000): 1-3.
- Xiaogang, Wang. “ENGG5202 Lecture notes.” Chapter4, Homework1-3.
- Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.