Wednesday, May 28, 2014

The Expectation Maximization (EM) Algorithm


So what's a Gaussian Mixture?


Sometimes when we are analyzing data, we encounter situations where the probability density function of our data looks something like the plot below:

[Figure: empirical PDF of data drawn from a mixture of two Gaussian distributions]

As you might guess intuitively, this PDF looks like it is formed by two Gaussian distributions stuck together. Indeed, it was created from two Gaussian random variables of separate means and variances. You can kind of guess that the first distribution probably has a mean of around 5, while the second one has a mean of around 10. It also seems apparent that the variance of the first distribution is probably smaller than that of the second. 

Indeed, when I was generating the random variables that formed this PDF on Matlab, I used the parameters of $\mu_1=5$, $\sigma_1^2= 1$ for the first distribution, and $\mu_2=10$, $\sigma_2^2 = 2$ for the second distribution. 
But there is actually one other parameter that I used to generate this PDF - the probability that we are in either of the two states. Every time I draw my next random number, I generate it with one of the two random number generators, and which generator I pick is itself random, with some probability attached to it. I will call this parameter $\rho$; in this case I chose the first generator 50% of the time ($\rho_1=0.5$) and the second generator 50% of the time as well ($\rho_2=0.5$).
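For concreteness, here is a minimal Matlab sketch of this generating process (the sample size of 10,000 and the variable names are arbitrary choices of mine):

 %Draw N samples from a two-state Gaussian mixture: pick state 1 with
 %probability rho1, otherwise state 2, then sample from that Gaussian
 N=10000;
 rho1=0.5; mu1=5; sigma1=1; mu2=10; sigma2=sqrt(2); %sigma2^2 = 2
 x=zeros(1,N);
 for n=1:N
   if rand() < rho1
     x(n)=mu1+sigma1*randn(); %state 1
   else
     x(n)=mu2+sigma2*randn(); %state 2
   end
 end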



You can easily see how complicated this can get. We were able to guess the mean and variance parameters for the above probability density function quite easily because the two states occurred with equal probability. But what does the PDF look like if we tweak the $\rho$'s a bit? Below is the plot using all the same parameters, except that now $\rho_1=0.8$ and $\rho_2=0.2$:
[Figure: empirical PDF of the same mixture, now with $\rho_1=0.8$ and $\rho_2=0.2$]
Now it is much harder to guess the $\mu$'s and $\sigma^2$'s. The magic of the EM Algorithm is that we can recover all six of the parameters in one iterative process. The only assumptions we need to make about our data are that the distribution is a mixture of two Gaussian random variables and that the observations are independent of each other.

Logic Behind the EM Algorithm 

The EM Algorithm alternates between two general steps:
  1. Use current estimates of the parameters to find a percentage weight of each observation belonging to either of the two states. (For example, with the graph above, an observation from the data $x_n = 9.5$ would be weighted more towards the second state than the first.)  
  2. Using the weights, we then maximize the likelihood to find a new estimate for each of the six parameters.
  3. Repeat steps 1 and 2 until the parameters (hopefully) converge or approach some asymptotic limit. 

Getting the Maximum Likelihood Function


Let's assume that we have some estimate of the six parameters $\theta = (\mu_1,\mu_2,\sigma^2_1,\sigma^2_2, \rho_1,\rho_2)$. For every data point $x_n$, the probability density function is then:


\begin{align} P_{\theta}(x_n) &= P_{\theta}(x_n \mid S_n=1)P_\theta (S_n=1) + P_\theta (x_n \mid S_n=2)P_\theta (S_n=2) \label{eq:1} \\ &= \rho_1\, \eta_{\mu_1,\sigma_1}(x_n) + \rho_2\, \eta_{\mu_2,\sigma_2}(x_n) \end{align}
where $\eta_{\mu,\sigma}$ is the Gaussian density:
\begin{align} \eta_{\mu,\sigma}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \end{align}
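In Matlab, $\eta_{\mu,\sigma}$ is exactly what the NORMPDF function evaluates (note that it takes the standard deviation $\sigma$, not the variance $\sigma^2$), so as a sketch, the mixture density at every point of a data vector x is:

 %Mixture density P_theta(x) = rho1*eta_1(x) + rho2*eta_2(x)
 p = rho1*normpdf(x,mu1,sigma1) + rho2*normpdf(x,mu2,sigma2);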

Assuming independent observations, we maximize the log-likelihood of the data under this PDF:
\begin{align} \max \log P_{\theta}(x_1,x_2,x_3, \cdots,x_N) = \max \sum_{x_n \in D} \log P_{\theta}(x_n) \end{align}
Which we can write as,
\begin{align}
\max N \sum_{n=1}^{N} \frac{\log P_{\theta}(x_n) }{N} \label{eq:solve}
\end{align}
We introduce angle-bracket notation for the sample average in $\eqref{eq:solve}$,
\begin{align}
N \sum_{n=1}^{N}\frac{\log P_{\theta}(x_n)}{N} :=  N \underbrace{\langle_{x_n \in D} \log P_{\theta}(x_n) \rangle}_{L(\theta)}
\end{align}
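In code, $L(\theta)$ is just the sample average of the log mixture density, which is also a handy quantity to monitor for convergence (a one-line sketch, assuming the data sit in a vector x):

 %L(theta): average log-likelihood of the data vector x
 L = mean(log(rho1*normpdf(x,mu1,sigma1) + rho2*normpdf(x,mu2,sigma2)));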

Take the derivative of the likelihood function $\eqref{eq:solve}$ and set it equal to 0 to find the maximum likelihood estimators. The $N$ outside the average is a constant and becomes irrelevant:
\begin{align}
0=\frac{d}{d \theta} L(\theta) = \langle_{x_n \in D} \frac{d}{d \theta} \log P_{\theta}(x_n) \rangle = \langle_{x_n \in D} \frac{1}{P_{\theta} (x_n)} \frac{d}{d\theta} P_{\theta}(x_n) \rangle
\end{align}
Plug in $P_\theta (x_n)$ from $\eqref{eq:1}$, written as the sum over both states, $P_{\theta}(x_n) = \sum_{k \in S} P_{\theta}(x_n, S_n=k)$.
\begin{align}
0 = \frac{d}{d \theta} L(\theta) &= \langle_{x_n \in D} \sum_{k \in S} \frac{1}{P_{\theta} (x_n)} \frac{d}{d\theta} P_{\theta}(x_n, S_n=k) \rangle \\
&= \langle_{x_n \in D} \sum_{k \in S} \underbrace{\frac{P_{\theta}(x_n, S_n=k)}{P_{\theta} (x_n)}}_{w_{k|n,\theta}} \underbrace{\frac{\frac{d}{d\theta} P_{\theta}(x_n, S_n=k)}{P_{\theta}(x_n, S_n=k)}}_{\frac{d}{d \theta} \log P_\theta(x_n,S_n=k)} \rangle \label{eq:10} \\
&= \langle_{x_n \in D} \sum_{k \in S} w_{k|n,\theta} \frac{d}{d \theta} \log P_\theta(x_n, S_n = k) \rangle \label{eq:mlf}
\end{align}

To get to $\eqref{eq:10}$, we multiplied and divided by $P_{\theta}(x_n, S_n=k)$. Equation $\eqref{eq:mlf}$ is the equation we would like to solve to obtain the maximum likelihood estimators. We call $w_{k|n,\theta}$ the membership weights; $w_{k|n,\theta}$ is equivalent to $P_{\theta}(S_n=k \mid x_n)$, the probability that we are in state $k$ given the observed data point $x_n$.

Step 1: Calculate the Weights

The first step is to calculate the weights using the current best estimates of the six parameters. For the very first iteration, these estimates have to come from an initial guess. The algorithm does this for every observed data point, which means that if we have $(x_1,x_2, \cdots, x_N)$, we calculate the membership weights exactly $N$ times per iteration.

Calculating the weights is relatively straightforward. We need to compute $w_{k|n,\theta}$, and in our case there are only two states, $k \in S=\{1,2\}$, for every set of parameters $\theta$.

\begin{align}
w_{k|n,\theta} = \frac{P_{\theta} (x_n,S_n=k)}{P_{\theta}(x_n)}
\end{align}

For k = 1,
\begin{align}
&P_\theta (x_n, S_n=1) = P_\theta (x_n \mid S_n=1)P_\theta(S_n=1) =  \rho_{1} \eta_{\mu_1,\sigma_1}(x_n) \\
&P_\theta (x_n) = \sum_{k \in S} P_{\theta} (x_n \mid S_n=k) P_\theta(S_n=k) = \rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n) \\
&w_{1|n ,\theta} = \frac{\rho_{1} \eta_{\mu_1,\sigma_1}(x_n)}{\rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n)}
\end{align}

Repeat to obtain the weight for k=2 in the same way:
\begin{align}
&w_{2|n ,\theta} = \frac{\rho_{2} \eta_{\mu_2,\sigma_2}(x_n)}{\rho_1 \eta_{\mu_1,\sigma_1}(x_n) + \rho_2 \eta_{\mu_2,\sigma_2}(x_n)}
\end{align}

(Matlab has a very convenient shortcut for computing $\eta_{\mu,\sigma}$: the NORMPDF function, which is used in the code posted below.) The membership weights literally are just weights in the sense that they add up to one for each observation, $w_{1|n,\theta}+w_{2|n,\theta}=1$. Now that we have the $w_{k|n,\theta}$'s, we can move on to the maximization step.
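As a vectorized sketch of this weight computation (the data vector x and the variable names are my own placeholders, with NORMPDF standing in for $\eta_{\mu,\sigma}$):

 %Step 1 (E-step): membership weights for every observation in x
 num1 = rho1*normpdf(x,mu1,sigma1); %rho_1 * eta_1(x_n)
 num2 = rho2*normpdf(x,mu2,sigma2); %rho_2 * eta_2(x_n)
 w1 = num1./(num1+num2);            %w_{1|n}
 w2 = 1 - w1;                       %w_{2|n}, since the two weights sum to one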


Step 2: Maximization

We need to maximize with respect to each of the six parameters in $\theta$, which means we solve equation $\eqref{eq:mlf}$ six times, taking the derivative with respect to each component of $\theta = (\mu_1,\mu_2,\sigma_1,\sigma_2,\rho_1,\rho_2)$. From the previous step, the $w_{k|n}$'s are now predetermined constants, so we only need to work out $\dfrac{d \log P_{\theta}(x_n, S_n=k)}{d\theta}$.

Recalling that:
\begin{align}
&P_{\theta}(x_n, S_n=k) = P_\theta (x_n \mid S_n=k) P_\theta (S_n = k) = \rho_k\, \eta_{\mu_k,\sigma_k}(x_n) = \frac{\rho_k}{\sqrt{2 \pi \sigma_k^2}}\, e^{-\frac{(x_n-\mu_k)^2}{2 \sigma_k^2}} \\
& \log P_{\theta} (x_n,S_n=k) = \log \rho_k - \log \sqrt{2 \pi} \sigma_k - \frac{(x_n - \mu_k)^2 }{2 \sigma_k^2}
\end{align}

Maximum Likelihood Estimates for $\rho_1$ and $\rho_2$

To solve for the MLE of $\rho_1$ and $\rho_2$, we first note that $\rho_2 = 1-\rho_1$ by definition. Thus $\frac{d \rho_2}{d \rho_1} = -1$.
\begin{align}
0 = \frac{d}{d \rho_1} L(\theta) &= \langle_{x_n \in D} \sum_{k \in S} w_{k|n,\theta} \frac{d}{d \rho_1} \log P_\theta(x_n, S_n = k) \rangle \\
&= \langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1} \frac{d \rho_1}{d \rho_1} + \frac{w_{2|n,\theta}}{\rho_2} \frac{d \rho_2}{d \rho_1} \rangle \\
&= \langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1} + \frac{w_{2|n,\theta}}{\rho_2} (-1) \rangle
\end{align}

We have that,

\begin{align}
\langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1}\rangle &= \langle_{x_n \in D} \frac{w_{2|n,\theta}}{\rho_2}\rangle \\
\langle_{x_n \in D} \frac{w_{1|n,\theta}}{\rho_1}\rangle &= \langle_{x_n \in D} \frac{w_{2|n,\theta}}{1- \rho_1} \rangle \\
\frac{(1-\rho_1)}{\rho_1} \langle_{x_n \in D}w_{1|n,\theta} \rangle &= \langle_{x_n \in D} w_{2|n,\theta} \rangle\\
\langle_{x_n \in D} w_{1|n,\theta} \rangle &= \rho_1 \langle_{x_n \in D} w_{1|n,\theta}+w_{2|n,\theta} \rangle\\
\hat{\rho}_1 &= \frac{ \langle_{x_n \in D} w_{1|n,\theta} \rangle}{\langle_{x_n \in D} w_{1|n,\theta}+w_{2|n,\theta}\rangle} \end{align}

Noting that $w_{1|n}+w_{2|n}=1$ for every $x_n \in D$, this simplifies to
\begin{align}
\hat{\rho_1} = \langle_{x_n \in D} w_{1|n} \rangle && \hat{\rho_2} = \langle_{x_n \in D} w_{2|n}\rangle
\end{align}
Thus the MLE for $\rho_1$ and $\rho_2$ is just the average of the corresponding membership weights over the data set.
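In code this is just the sample mean of each weight vector from the sketch above:

 %M-step update for the mixing probabilities
 rho1_hat = mean(w1);
 rho2_hat = mean(w2); %equivalently 1 - rho1_hat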

MLE for $\mu_1 , \mu_2$

Only the $k=1$ term of the sum in $\eqref{eq:mlf}$ depends on $\mu_1$, so it is the only one that survives the derivative:
\begin{align}
0 = \frac{d}{d \mu_1} L(\theta) &= \langle_{x_n \in D} w_{1|n} \frac{d \log P_\theta (x_n,S_n=1)}{d \mu _1} \rangle \\
&= \langle_{x_n \in D}w_{1|n} \frac{x_n - \mu_1}{\sigma_1^2} \rangle
\end{align}
Now we can solve for $\mu_1$,
\begin{align}
\hat{\mu}_1 &= \frac{ \langle_{x_n \in D} w_{1|n} x_n \rangle}{\langle_{x_n \in D} w_{1|n} \rangle} \\
&= \frac{\sum_{n=1}^{N} w_{1|n} x_n}{\sum_{n=1}^{N} w_{1|n}}
\end{align}
Likewise for $\mu_2$,
\begin{align}
\hat{\mu_2} = \frac{\sum_{n=1}^N w_{2|n} x_n}{\sum_{n=1}^{N} w_{2|n}}
\end{align}
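In code, each mean is a weighted average of the data, continuing the same placeholder variables:

 %M-step update for the means: weighted averages of the data
 mu1_hat = sum(w1.*x)/sum(w1);
 mu2_hat = sum(w2.*x)/sum(w2);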

MLE for $\sigma_1,\sigma_2$

\begin{align}
0 = \frac{d}{d \sigma_k} L(\theta) &= \langle_{x_n \in D} w_{k|n} \left( -\frac{1}{\sigma_k} + \frac{(x_n - \mu_k)^2}{\sigma_k^3} \right) \rangle \\
&= \frac{1}{\sigma_k^3} \langle_{x_n \in D} w_{k|n} (- \sigma_k^2 + (x_n - \mu_k)^2 ) \rangle
\end{align}
Therefore we have,
\begin{align}
\hat{\sigma}^2_k &= \frac{\langle_{x_n \in D} w_{k|n} (x_n - \mu_k)^2 \rangle}{\langle_{x_n \in D} w_{k|n}\rangle} \\
& = \frac{\sum_{n=1}^N w_{k|n} (x_n - \hat{\mu}_k)^2}{\sum_{n=1}^N w_{k|n}}
\end{align}
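In code, using the freshly updated means:

 %M-step update for the variances: weighted squared deviations from the new means
 sigma1_hat = sqrt(sum(w1.*(x-mu1_hat).^2)/sum(w1)); %stored as a standard deviation
 sigma2_hat = sqrt(sum(w2.*(x-mu2_hat).^2)/sum(w2));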

We now have the maximum likelihood updates for all six parameters we set out to estimate. The EM Algorithm goes back and forth between steps 1 and 2 for a fixed number of iterations, or until some terminal condition (such as convergence of the parameters) is met.


Code

Below is a very simple piece of code implementing the EM Algorithm outlined above. Note that it performs one update per observation rather than one per full pass over the data, so for practical purposes this naive implementation is probably not very useful: the parameters often fail to converge.

function [] = EMAlgorithm(data)
 %%
 %Generate random numbers from the mixture if no data is supplied
 if nargin == 0
   for i=1:1:10000
     x=rand();
     if x < 0.8                  %state 1 with probability rho1=0.8
       G(i)=5+1*randn();         %mu1=5,  sigma1^2=1
     else                        %state 2 with probability rho2=0.2
       G(i)=10+sqrt(2)*randn();  %mu2=10, sigma2^2=2
     end
   end
   data=G;
 end
 data=data(:)';                  %work with a row vector

 %%
 %Initial Guesses for Parameters, Assuming only Two States
 S=sort(data);
 S1=S(1:round(length(data)/2));
 S2=S(round(length(data)/2):length(data));
 mu1(1)=mean(S1);
 mu2(1)=mean(S2);
 sigma1=std(S1);
 sigma2=std(S2);
 rho1(1)=0.5;
 rho2(1)=1-rho1;
 %%
 %Expectation Maximization Algo (naive: one update per observation)
 for i = 1:1:length(data) %for each observation
   %E-step: membership weights assuming gaussian distributions
   w1=rho1(i)*normpdf(data(i),mu1(i),sigma1);
   w2=rho2(i)*normpdf(data(i),mu2(i),sigma2);
   A=w1+w2;
   weight1(i)=w1/A;
   weight2(i)=w2/A;
   %M-step: update the parameters using all weights computed so far
   rho1(i+1)=mean(weight1);
   rho2(i+1)=mean(weight2);
   mu1(i+1)=(data(1:i)*weight1')/sum(weight1);
   mu2(i+1)=(data(1:i)*weight2')/sum(weight2);
   sigma1=sqrt(((data(1:i)-mu1(i+1)).^2*weight1')/sum(weight1));
   sigma2=sqrt(((data(1:i)-mu2(i+1)).^2*weight2')/sum(weight2));
 end

 %Final parameter estimates
 rho1(end)
 rho2(end)
 mu1(end)
 mu2(end)
 sigma1
 sigma2
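
For comparison, here is a minimal batch variant that follows the derivation above more literally: each pass recomputes the weights for the entire data set (Step 1) and then updates all six parameters at once (Step 2), stopping when the average log-likelihood $L(\theta)$ stops improving. The function name, iteration cap, and tolerance are arbitrary choices of mine, and the initial guesses use the same sort-and-split heuristic as above.

 function [mu1,mu2,sigma1,sigma2,rho1,rho2] = EMAlgorithmBatch(data)
  %Initial guesses: split the sorted data in half
  data=data(:)';
  S=sort(data);
  half=round(length(data)/2);
  mu1=mean(S(1:half));    mu2=mean(S(half+1:end));
  sigma1=std(S(1:half));  sigma2=std(S(half+1:end));
  rho1=0.5;               rho2=0.5;
  prevL=-Inf;
  for iter=1:500
    %Step 1 (E-step): membership weights for every observation
    num1=rho1*normpdf(data,mu1,sigma1);
    num2=rho2*normpdf(data,mu2,sigma2);
    w1=num1./(num1+num2);
    w2=1-w1;
    %Step 2 (M-step): re-estimate all six parameters at once
    rho1=mean(w1);              rho2=1-rho1;
    mu1=sum(w1.*data)/sum(w1);  mu2=sum(w2.*data)/sum(w2);
    sigma1=sqrt(sum(w1.*(data-mu1).^2)/sum(w1));
    sigma2=sqrt(sum(w2.*(data-mu2).^2)/sum(w2));
    %Stop when the average log-likelihood stops improving
    L=mean(log(num1+num2));
    if abs(L-prevL) < 1e-8, break; end
    prevL=L;
  end
 end

Unlike the per-observation version above, every parameter update here uses the weights of the entire data set, which is what the maximization step in the derivation assumes.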




Monday, May 26, 2014

An Introduction

Hello. I am a student at the University of Illinois, doing my master's in Financial Engineering. I suppose I decided to begin this blog for several reasons, mostly out of my own personal interests:

Firstly, this blog is not necessarily targeted at any specific group of people. Financial Engineering is still a very opaque field, and for those who know the term, it can hold some negative connotations associated with the subprime mortgage crisis. Coming from an engineering background and still in the first year of my master's, I am far from holding a complete grasp of this relatively new field. So I guess this blog will serve as my travel log - my path to discovering the mysterious reasons why we even exist - as I look down every peculiar path and seek to position the discipline's place in society. For starters, I think most people would agree with me that much of financial engineering has to do with managing and maintaining the intricate web of the self-serving interests of powerful institutions and individuals in our society. However, this is of course not necessarily a new development in the field of finance. Beginning with the Rothschild family, finance has always had a strong association with the notion of power. The technological developments of the past decade have upped the ante, in many interesting ways that we still do not fully comprehend...

Secondly, I will introduce many computational models that I come across either in my classes or in my own time on the internet. As I try to make sense of these models, I cannot guarantee that any specific model is correctly implemented.

Lastly, on the off-chance that you randomly found your way here, I hope that you will find my blog to be interesting. In the end, my blog is about the things that I find to be interesting and worthwhile. Please enjoy and join me in my thoughts :)

Documentary on the discovery of the Black-Scholes formula: