Wednesday, May 28, 2014

The Expectation Maximization (EM) Algorithm


So what's a Gaussian Mixture?


Sometimes when we are analyzing data, we encounter situations when the probability density function of our data looks something like below:

[Figure: empirical PDF of data drawn from a mixture of two Gaussian distributions]

As you might guess intuitively, this PDF looks like it is formed by two Gaussian distributions stuck together. Indeed, it was created from two Gaussian random variables of separate means and variances. You can kind of guess that the first distribution probably has a mean of around 5, while the second one has a mean of around 10. It also seems apparent that the variance of the first distribution is probably smaller than that of the second. 

Indeed, when I was generating the random variables that formed this PDF on Matlab, I used the parameters of $\mu_1=5$, $\sigma_1^2= 1$ for the first distribution, and $\mu_2=10$, $\sigma_2^2 = 2$ for the second distribution. 
But actually, there is one other parameter that I used to generate this weird PDF - the probability of being in each of the two states. That is, every time I draw my next random number, I generate it with one of the two random number generators. But whether I choose the first or second generator is also random, with some probability attached to it. I will call this parameter $\rho$; in this case I chose the first generator 50% of the time ($\rho_1=0.5$) and the second generator 50% of the time as well ($\rho_2=0.5$).
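To make this concrete, here is a minimal Matlab sketch of the generating process just described, using the parameter values above (the variable names are my own, not from the code at the end of the post):

 % Draw N samples from a two-state Gaussian mixture
 N = 10000;
 rho1 = 0.5;                    % probability of choosing generator 1
 mu1 = 5;  sigma1 = 1;          % state 1: mean 5, variance 1
 mu2 = 10; sigma2 = sqrt(2);    % state 2: mean 10, variance 2
 data = zeros(1,N);
 for n = 1:N
   if rand() < rho1
     data(n) = mu1 + sigma1*randn();   % draw from the first Gaussian
   else
     data(n) = mu2 + sigma2*randn();   % draw from the second Gaussian
   end
 end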



You can easily see how complicated this can get. We were able to guess the mean and variance parameters for the above probability density function quite easily because the two states happened with equal chance. But what does the PDF look like if we just tweak the $\rho$'s a bit? Below is the plot, using all the same parameters but different $\rho$'s: now $\rho_1=0.8$ and $\rho_2=0.2$:
[Figure: empirical PDF of the same mixture with $\rho_1=0.8$ and $\rho_2=0.2$]
Now it is much harder to guess the $\mu$'s and $\sigma^2$'s. The magic of the EM Algorithm is that we can recover all six parameters in one iterative process. The only assumptions we need to make about our data are that the distribution is a mixture of two separate Gaussian random variables, and that the observations are independent of each other.

Logic Behind the EM Algorithm 

There are two general steps in the EM Algorithm:
  1. Use current estimates of the parameters to find a percentage weight of each observation belonging to either of the two states. (For example, with the graph above, an observation from the data $x_n = 9.5$ would be weighted more towards the second state than the first.)  
  2. Using the weights, we then maximize the likelihood to find a new estimate for each of the six parameters.
  3. Repeat steps 1 and 2 until the parameters (hopefully) converge or approach some asymptotic limit. 

Getting the Maximum Likelihood Function


Let's assume that we have some estimate of the six parameters $\theta = (\mu_1,\mu_2,\sigma^2_1,\sigma^2_2, \rho_1,\rho_2)$. For every data point $x_n$, the probability density function should be:


\begin{align} P_{\theta}(x_n) &= P_{\theta}(x_n | state_n=1)P_\theta (state_n=1) + P_\theta (x_n | state_n=2)P_\theta (state_n=2) \label{eq:1} \\ &= \eta_{\mu_1,\sigma_1}(x_n) \rho_1 + \eta_{\mu_2,\sigma_2}(x_n) \rho_2 \end{align}
Where $\eta_{\mu,\sigma}$ is the Gaussian (normal) density:
\begin{align} \eta_{\mu,\sigma}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \end{align}
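As a side note, the built-in Matlab function NORMPDF computes exactly $\eta_{\mu,\sigma}(x)$, so the mixture density can be evaluated in one line. A small sketch, using the parameter values from the first example (the variable names are my own):

 % Evaluate the mixture density P_theta(x); normpdf(x,mu,sigma) computes eta_{mu,sigma}(x)
 mu1 = 5;  sigma1 = 1;       rho1 = 0.5;   % parameters of state 1
 mu2 = 10; sigma2 = sqrt(2); rho2 = 0.5;   % parameters of state 2
 x = 9.5;                                  % any point (or vector of points)
 Px = rho1*normpdf(x,mu1,sigma1) + rho2*normpdf(x,mu2,sigma2);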

Assuming independent observations, the log-likelihood of the data under this PDF is what we want to maximize:
\begin{align} \max \log P_{\theta}(x_1,x_2,x_3, \cdots,x_N) = \max \sum_{x_n \in D} \log P_{\theta}(x_n) \end{align}
Which we can write as,
\begin{align}
\max N \sum_{n=1}^{N} \frac{\log P_{\theta}(x_n) }{N} \label{eq:solve}
\end{align}
We introduce another notation to represent $\eqref{eq:solve}$, where $\langle_{x_n \in D} \cdot \rangle$ denotes the average over the data set $D$:
\begin{align}
N \sum_{n=1}^{N}\frac{\log P_{\theta}(x_n)}{N} :=  N \underbrace{\langle_{x_n \in D} \log P_{\theta}(x_n) \rangle}_{L(\theta)}
\end{align}
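In code, $L(\theta)$ is just the sample average of the log mixture density, which also makes it a convenient quantity to monitor for convergence. A minimal sketch, assuming data is the vector of observations and the six current parameter estimates are already in the workspace:

 % L(theta): average log-likelihood of the data under the current parameters
 % (assumes data, mu1, mu2, sigma1, sigma2, rho1, rho2 already exist)
 Ltheta = mean(log(rho1*normpdf(data,mu1,sigma1) + rho2*normpdf(data,mu2,sigma2)));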

Take the derivative of the likelihood function $\eqref{eq:solve}$ and set it equal to 0 to find the maximum likelihood estimators. The $N$ outside the sum is a constant and becomes irrelevant:
\begin{align}
0=\frac{d}{d \theta} L(\theta) = \langle_{x_n \in D} \frac{d}{d \theta} \log P_{\theta}(x_n) \rangle = \langle_{x_n \in D} \frac{1}{P_{\theta} (x_n)} \frac{d}{d\theta} P_{\theta}(x_n) \rangle
\end{align}
Plug in $P_\theta (x_n) = \sum_{k \in S} P_\theta (x_n, S_n=k)$ from $\eqref{eq:1}$.
\begin{align}
0=\frac{d}{d \theta} L(\theta) \\ &= \langle_{x_n \in D} \sum_{k \in S} \frac{1}{P_{\theta} (x_n)} \frac{d}{d\theta} P_{\theta}(x_n, S_n=k) \rangle \\ &= \langle_{x_n \in D} \sum_{k \in S} \underbrace{\frac{P_{\theta}(x_n, S_n=k)}{P_{\theta} (x_n)}}_{w_{k|n,\theta}} \underbrace{\frac{\frac{d}{d\theta} P_{\theta}(x_n, S_n=k)}{P_{\theta}(x_n, S_n=k)}}_{\frac{d}{d \theta} \log P_\theta(x_n,S_n=k)} \rangle \label{eq:10} \\ &= \langle_{x_n \in D} \sum_{k \in S} w_{k|n,\theta} \frac{d}{d \theta} \log P_\theta(x_n, S_n = k) \rangle \label{eq:mlf}
\end{align}

To get to $\eqref{eq:10}$, we multiplied and divided by $P_{\theta}(x_n, S_n=k)$. Equation $\eqref{eq:mlf}$ is the equation we would like to solve to obtain the maximum likelihood estimators. We call $w_{k|n,\theta}$ the membership weights, and it is equivalent to $P_{\theta}(S_n=k | x_n)$. The weight is the probability of being in state $k$ given the observed data point $x_n$.

Step 1: Calculate the Weights

The first step is to calculate the weights using the current best estimates of the six parameters. For the very first iteration, these estimates have to come from some other initial guess. The algorithm does this for every observed data point we have, which means if we have $(x_1,x_2, \cdots x_N)$, then we calculate $N$ membership weights for each state in every pass of the algorithm.

Calculating the weights is relatively straightforward. We need to compute $w_{k|n,\theta}$, and in our case there are only two states, $k \in S=\{1,2\}$, for the current set of parameters $\theta$.

\begin{align}
w_{k|n,\theta} = \frac{P_{\theta} (x_n,S_n=k)}{P_{\theta}(x_n)} \\
\end{align}

For k = 1,
\begin{align}
&P_\theta (x_n, S_n=1) = P_\theta (x_n | S_n=1)P_\theta(S_n=1) =  \rho_{1} \eta_{\mu_1,\sigma_1}(x_n) \\
&P_\theta (x_n) = \sum_{k \in S} P_{\theta} (x_n | S_n=k) P_\theta(S_n=k) = \rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n) \\
&w_{1|n ,\theta} = \frac{\rho_{1} \eta_{\mu_1,\sigma_1}(x_n)}{\rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n)}
\end{align}

Repeat to obtain the weight for k=2 in the same way:
\begin{align}
&w_{2|n ,\theta} = \frac{\rho_{2} \eta_{\mu_2,\sigma_2}(x_n)}{\rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n)}
\end{align}

Matlab has a very convenient shortcut for computing $\eta_{\mu,\sigma}$: the NORMPDF function, which is used in the code posted below. Note that the membership weights literally are weights in the sense that they sum to one, $w_{1|n,\theta}+w_{2|n,\theta}=1$, for every observation. Now that we have the $w_{k|n,\theta}$'s, we move on to the maximization step.
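Because NORMPDF accepts a vector of points, the membership weights for every observation can be computed at once. A small sketch, assuming data is a row vector of observations and the six current parameter estimates are in the workspace (variable names are my own):

 % Step 1: membership weights for all N observations under the current parameters
 p1 = rho1*normpdf(data,mu1,sigma1);   % rho_1 * eta_{mu_1,sigma_1}(x_n) for every n
 p2 = rho2*normpdf(data,mu2,sigma2);   % rho_2 * eta_{mu_2,sigma_2}(x_n) for every n
 w1 = p1./(p1 + p2);                   % w_{1|n,theta}
 w2 = p2./(p1 + p2);                   % w_{2|n,theta}; note w1 + w2 = 1 elementwise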


Step 2: Maximization

We need to maximize over each of the six parameters in $\theta$, which means we solve equation $\eqref{eq:mlf}$ six times, taking the derivative with respect to each element of $\theta = (\mu_1,\mu_2,\sigma_1,\sigma_2,\rho_1,\rho_2)$. From the previous step, the $w_{k|n}$'s are now predetermined constants, so we only need to work out $\dfrac{d \log P_{\theta}(x_n, S_n=k)}{d\theta}$.

Recalling that:
\begin{align}
&P_{\theta}(x_n, S_n=k) = P_\theta (x_n | S_n=k) P_\theta (S_n = k) = \rho_k \eta_{\mu_k,\sigma_k}(x_n) = \rho_k \frac{e^{-\frac{(x_n-\mu_k)^2}{2\sigma_k^2}}}{\sqrt{2 \pi \sigma_k^2}} \\
& \log P_{\theta} (x_n,S_n=k) = \log \rho_k - \log (\sqrt{2 \pi} \sigma_k) - \frac{(x_n - \mu_k)^2 }{2 \sigma_k^2}
\end{align}

Maximum Likelihood Estimates for $\rho_1$ and $\rho_2$

To solve for the MLE of $\rho_1$ and $\rho_2$, we first note that $\rho_2 = 1-\rho_1$ by definition. Thus $\frac{d \rho_2}{d \rho_1} = -1$.
\begin{align}
0 = \frac{d}{d \rho_1} L(\theta) \\
&= \langle_{x_n \in D} \sum_{k \in S} w_{k|n,\theta} \frac{d}{d \rho_1} \log P_\theta(x_n, S_n = k) \rangle \\
&= \langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1} \frac{d \rho_1}{d \rho_1} + \frac{w_{2|n,\theta}}{\rho_2} \frac{d \rho_2}{d \rho_1} \rangle \\
&= \langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1} + \frac{w_{2|n,\theta}}{\rho_2} (-1) \rangle
\end{align}

We have that,

\begin{align}
\langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1}\rangle &= \langle_{x_n \in D} \frac{w_{2|n,\theta}}{\rho_2}\rangle \\
\langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1}\rangle &= \langle_{x_n \in D} \frac{w_{2|n,\theta}}{1- \rho_1} \rangle \\
\frac{(1-\rho_1)}{\rho_1} \langle_{x_n \in D}w_{1|n,\theta} \rangle &= \langle_{x_n \in D} w_{2|n,\theta} \rangle\\
\hat{\rho_1} &= \frac{ \langle_{x_n \in D} w_{1|n} \rangle}{\langle_{x_n \in D} w_{1|n,\theta}+w_{2|n,\theta}\rangle} \end{align}

Noting that $w_{1|n}+w_{2|n}=1$ for every $x_n \in D$, this simplifies to
\begin{align}
\hat{\rho_1} = \langle_{x_n \in D} w_{1|n} \rangle && \hat{\rho_2} = \langle_{x_n \in D} w_{2|n}\rangle
\end{align}
Thus the MLE for $\rho_1$ and $\rho_2$ is just the average of the membership weights over the data set.
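In Matlab this update is a one-liner on the weight vectors from the Step 1 sketch above (again my own variable names, shown only as a sketch):

 % M-step update for the mixing probabilities: average of the membership weights
 rho1 = mean(w1);
 rho2 = mean(w2);   % equivalently 1 - rho1, since the weights sum to one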

MLE for $\mu_1 , \mu_2$

Only the $k=1$ term of the sum in $\eqref{eq:mlf}$ depends on $\mu_1$, so
\begin{align}
0 = \frac{d}{d \mu_1} L(\theta) \\
&= \langle_{x_n \in D} w_{1|n} \frac{d \log P_\theta (x_n,S_n=1)}{d \mu _1} \rangle \\
&= \langle_{x_n \in D}w_{1|n} \frac{x_n - \mu_1}{\sigma_1^2} \rangle
\end{align}
Solving for $\mu_1$,
\begin{align}
\hat{\mu_1} &= \frac{ \langle_{x_n \in D} w_{1|n} x_n \rangle}{\langle_{x_n \in D} w_{1|n} \rangle} \\
&= \frac{\sum_{n=1}^{N} w_{1|n} x_n}{\sum_{n=1}^{N} w_{1|n}}
\end{align}
Likewise for $\mu_2$,
\begin{align}
\hat{\mu_2} = \frac{\sum_{n=1}^N w_{2|n} x_n}{\sum_{n=1}^{N} w_{2|n}}
\end{align}
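As a sketch with the same assumed data, w1, and w2 vectors, the mean updates are weighted averages of the observations:

 % M-step update for the means: weighted averages of the data
 mu1 = sum(w1.*data)/sum(w1);
 mu2 = sum(w2.*data)/sum(w2);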

MLE for $\sigma_1,\sigma_2$

\begin{align}
0 = \frac{d}{d \sigma_k} L(\theta) \\
&= \langle_{x_n \in D} w_{k|n} \left( -\frac{1}{\sigma_k} + \frac{(x_n - \mu_k)^2}{\sigma_k^3} \right) \rangle \\
&= \frac{1}{\sigma_k^3} \langle_{x_n \in D} w_{k|n} (- \sigma_k^2 + (x_n - \mu_k)^2 ) \rangle
\end{align}
Therefore we have,
\begin{align}
\hat{\sigma}^2_k &= \frac{\langle_{x_n \in D} w_{k|n} (x_n - \mu_k)^2 \rangle}{\langle_{x_n \in D} w_{k|n}\rangle} \\
& = \frac{\sum_{n=1}^N w_{k|n} (x_n - \hat{\mu}_k)^2}{\sum_{n=1}^N w_{k|n}}
\end{align}
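And the corresponding variance updates, again as a sketch under the same assumptions (note that sigma1 and sigma2 here are standard deviations, which is what NORMPDF expects):

 % M-step update for the variances: weighted squared deviations from the new means
 sigma1 = sqrt(sum(w1.*(data - mu1).^2)/sum(w1));
 sigma2 = sqrt(sum(w2.*(data - mu2).^2)/sum(w2));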

Hence we now have maximum likelihood estimates for all six of the parameters we were trying to estimate. The EM Algorithm goes back and forth between steps 1 and 2 for a fixed number of iterations, or until some terminal condition (such as convergence of the parameter estimates) is met.


Code

Below is a very simple piece of code implementing the EM Algorithm outlined above. Note that it updates the parameters after each individual observation, using only the data seen so far, rather than after a full pass over the data set. For practical purposes such a naive implementation is probably not very useful, as the parameters often fail to converge.

function []=EMAlgorithm(data)
 %%
 %Generate random numbers if no data supplied
 if nargin == 0
   for i=1:1:10000
     x=rand();
     if x < 0.8
       G(i)=5+1*randn();        %state 1: mu=5, sigma^2=1
     else
       G(i)=10+sqrt(2)*randn(); %state 2: mu=10, sigma^2=2
     end
   end
   data=G;
 end
 data=data(:)';  %make sure data is a row vector

 %%
 %Initial guesses for the parameters, assuming only two states:
 %split the sorted data in half and use the sample moments of each half
 S=sort(data);
 S1=S(1:round(length(data)/2));
 S2=S(round(length(data)/2)+1:length(data));
 mu1(1)=mean(S1);
 mu2(1)=mean(S2);
 sigma1=std(S1);
 sigma2=std(S2);
 rho1(1)=0.5;
 rho2(1)=1-rho1(1);
 %%
 %Expectation Maximization Algo
 %Note: this version updates the parameters after every single observation,
 %using only the data seen so far, which is why it often fails to converge
 for i = 1:1:length(data) %for each observation
   %Step 1: membership weights of observation i under the current parameters
   w1=rho1(i)*normpdf(data(i),mu1(i),sigma1);
   w2=rho2(i)*normpdf(data(i),mu2(i),sigma2);
   A=w1+w2;
   weight1(i)=w1/A;
   weight2(i)=w2/A;
   %Step 2: update the six parameters using the weights accumulated so far
   rho1(i+1)=mean(weight1);
   rho2(i+1)=mean(weight2);
   mu1(i+1)=(data(1:i)*weight1')/sum(weight1);
   mu2(i+1)=(data(1:i)*weight2')/sum(weight2);
   sigma1=sqrt(((data(1:i)-mu1(i+1)).^2*weight1')/sum(weight1));
   sigma2=sqrt(((data(1:i)-mu2(i+1)).^2*weight2')/sum(weight2));
 end

 %Display the final parameter estimates
 rho1(end)
 rho2(end)
 mu1(end)
 mu2(end)
 sigma1
 sigma2
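For comparison, here is a minimal sketch of the "batch" version that the derivation above actually describes: the weights for all N observations are recomputed from the same parameter estimates before any parameter is updated. The function name, iteration count, and variable names are my own choices, not part of the original code:

 function [mu1,mu2,sigma1,sigma2,rho1,rho2]=EMAlgorithmBatch(data,maxIter)
  %Batch EM for a two-state Gaussian mixture (a sketch, not the code above)
  %data: row vector of observations; maxIter: number of E/M iterations
  S=sort(data);
  half=round(length(data)/2);
  mu1=mean(S(1:half));      mu2=mean(S(half+1:end));
  sigma1=std(S(1:half));    sigma2=std(S(half+1:end));
  rho1=0.5;                 rho2=0.5;
  for iter=1:maxIter
    %Step 1: membership weights for every observation
    p1=rho1*normpdf(data,mu1,sigma1);
    p2=rho2*normpdf(data,mu2,sigma2);
    w1=p1./(p1+p2);
    w2=1-w1;
    %Step 2: maximum likelihood updates derived above
    rho1=mean(w1);                 rho2=mean(w2);
    mu1=sum(w1.*data)/sum(w1);     mu2=sum(w2.*data)/sum(w2);
    sigma1=sqrt(sum(w1.*(data-mu1).^2)/sum(w1));
    sigma2=sqrt(sum(w2.*(data-mu2).^2)/sum(w2));
  end
 end

Calling it as, for example, [m1,m2,s1,s2,r1,r2]=EMAlgorithmBatch(data,200) runs 200 full passes over the data before returning the estimates.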



References:

Shen, J. (2014). Gaussian Mixtures and the EM Algorithm. Personal Collection of J. Shen, University of Illinois at Urbana-Champaign, Urbana, Illinois.   
