Problem: Maximum Likelihood Estimation of the Covariance Matrix for MVN


Original Problem Statement (Exercise 1.13)

$X_{(1)}, X_{(2)}, \dots, X_{(n)}$ are samples from a multivariate normal distribution $N_p(\mu, \Sigma)$. The likelihood function is as follows. Using differentiation, find the maximum likelihood estimates (MLE) for the parameters $\mu$ and $\Sigma$.

Given Likelihood Function:

The problem provides the likelihood function $L(\mu, \Sigma)$ as:

$$ L(\mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \mathrm{etr}\left[-\frac{1}{2}\Sigma^{-1} \sum_{j=1}^n (X_{(j)} - \mu)(X_{(j)} - \mu)'\right] $$

where $\mathrm{etr}(A) = \exp(\mathrm{tr}(A))$. Let's use the notation $X_j$ instead of $X_{(j)}$ for simplicity. The likelihood function is:

$$ L(\mu, \Sigma) = (2\pi)^{-\frac{np}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left( -\frac{1}{2} \mathrm{tr}\left( \Sigma^{-1} \sum_{j=1}^n (X_j - \mu)(X_j - \mu)' \right) \right) $$

Objective

Maximize the log-likelihood function with respect to the covariance matrix $\Sigma$.

Step 1: Write down the Log-Likelihood Function

Assume we have $n$ independent and identically distributed (i.i.d.) samples $X_1, \dots, X_n$ from a $p$-dimensional multivariate normal distribution $N_p(\mu, \Sigma)$.

The log-likelihood function is:

$$ l(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{j=1}^n (X_j - \mu)' \Sigma^{-1} (X_j - \mu) $$

where:

  • $n$ is the number of samples
  • $p$ is the dimension of the data
  • $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$
  • $\Sigma^{-1}$ is the inverse of $\Sigma$ (the precision matrix)
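As a quick numerical sanity check (an illustrative sketch, not part of the original exercise), the log-likelihood formula above can be evaluated with NumPy. For a diagonal $\Sigma$ the joint density factorizes into univariate normals, so the matrix formula and the product of one-dimensional densities must agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
mu = X.mean(axis=0)
Sigma = np.diag([1.0, 2.0, 0.5])  # diagonal, so the density factorizes

# Log-likelihood via the matrix formula from Step 1
diffs = X - mu
quad = np.einsum('ij,jk,ik->', diffs, np.linalg.inv(Sigma), diffs)
ll_matrix = (-0.5 * n * p * np.log(2 * np.pi)
             - 0.5 * n * np.linalg.slogdet(Sigma)[1]
             - 0.5 * quad)

# For a diagonal Sigma the joint density is a product of univariate normals
var = np.diag(Sigma)
ll_univ = np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * diffs**2 / var)

assert np.isclose(ll_matrix, ll_univ)
```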

Step 2: Simplify the Summation Term using Trace

Notice that each term in the summation $(X_j - \mu)' \Sigma^{-1} (X_j - \mu)$ is a scalar. For a scalar $a$, we have $a = \operatorname{tr}(a)$. Using this and the cyclic property of the trace ($\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$):

$$ \begin{aligned} \sum_{j=1}^n (X_j - \mu)' \Sigma^{-1} (X_j - \mu) &= \sum_{j=1}^n \operatorname{tr}\left( (X_j - \mu)' \Sigma^{-1} (X_j - \mu) \right) \\ &= \sum_{j=1}^n \operatorname{tr}\left( \Sigma^{-1} (X_j - \mu)(X_j - \mu)' \right) \\ &= \operatorname{tr}\left( \Sigma^{-1} \sum_{j=1}^n (X_j - \mu)(X_j - \mu)' \right) \end{aligned} $$

Define the Scatter Matrix (around $\mu$) as $S_\mu = \sum_{j=1}^n (X_j - \mu)(X_j - \mu)'$.

The log-likelihood function becomes:

$$ l(\mu, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} S_\mu) $$
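The trace identity used above is easy to confirm numerically; the following NumPy sketch (not part of the exercise) compares the sum of scalar quadratic forms with a single trace involving the scatter matrix $S_\mu$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))
mu = rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)    # a generic SPD covariance matrix
Sinv = np.linalg.inv(Sigma)

# Left-hand side: sum of scalar quadratic forms
lhs = sum((x - mu) @ Sinv @ (x - mu) for x in X)

# Right-hand side: one trace with the scatter matrix S_mu
D = X - mu
S_mu = D.T @ D                     # sum_j (x_j - mu)(x_j - mu)'
rhs = np.trace(Sinv @ S_mu)

assert np.isclose(lhs, rhs)
```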

Step 3: Substitute the Sample Mean

The MLE for $\mu$ is the sample mean $\hat\mu = \bar{X} = \frac{1}{n}\sum_{j=1}^n X_j$. Substituting $\bar{X}$ for $\mu$ (or considering the profile likelihood for $\Sigma$), we define the sample scatter matrix:

$$ S = \sum_{j=1}^n (X_j - {\bar{X}})(X_j - {\bar{X}})' $$

The log-likelihood function (now viewed primarily as a function of $\Sigma$, given $\bar{X}$) is:

$$ l(\bar{X}, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\operatorname{tr}(\Sigma^{-1} S) $$

Step 4: Variable Substitution using the Precision Matrix $W$

To simplify differentiation, let the precision matrix be $W = \Sigma^{-1}$. This implies $\Sigma = W^{-1}$. We substitute terms involving $\Sigma$:

  • $$ \Sigma^{-1} = W $$
  • $$ \ln|\Sigma| = \ln|W^{-1}| = \ln(|W|^{-1}) = -\ln|W| $$

Substituting these into $l(\bar{X}, \Sigma)$, we obtain a function of $W$, let's call it $g(W)$:

$$ \begin{aligned} g(W) &= l(\bar{X}, W^{-1}) \\ &= -\frac{np}{2}\ln(2\pi) - \frac{n}{2}(-\ln|W|) - \frac{1}{2}\operatorname*{tr}(W S) \\ &= -\frac{np}{2}\ln(2\pi) + \frac{n}{2}\ln|W| - \frac{1}{2}\operatorname*{tr}(W S) \end{aligned} $$
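The substitution can also be verified numerically. This NumPy sketch (not part of the exercise) checks that $g(W)$, written in terms of the precision matrix, equals the log-likelihood evaluated at $\Sigma = W^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.normal(size=(n, p))
Xbar = X.mean(axis=0)
D = X - Xbar
S = D.T @ D                        # sample scatter matrix

A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)    # a generic SPD covariance matrix
W = np.linalg.inv(Sigma)           # precision matrix

# l(Xbar, Sigma) in the Sigma parameterization
ll = (-0.5 * n * p * np.log(2 * np.pi)
      - 0.5 * n * np.linalg.slogdet(Sigma)[1]
      - 0.5 * np.trace(np.linalg.inv(Sigma) @ S))

# g(W) in the W parameterization (note the sign flip on the log-determinant)
g = (-0.5 * n * p * np.log(2 * np.pi)
     + 0.5 * n * np.linalg.slogdet(W)[1]
     - 0.5 * np.trace(W @ S))

assert np.isclose(ll, g)
```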

Step 5: Differentiate with respect to $W$

Our goal is to maximize $g(W)$. We compute the derivative of $g(W)$ with respect to the matrix $W$ (using Definition 1.4.2) and set it to the zero matrix.

$$ \begin{aligned} \frac{\partial g(W)}{\partial W} &= \frac{\partial}{\partial W} \left(-\frac{np}{2}\ln(2\pi) + \frac{n}{2}\ln|W| - \frac{1}{2}\mathrm{tr}(W S)\right) \\ &= 0 + \frac{n}{2} \frac{\partial (\ln|W|)}{\partial W} - \frac{1}{2} \frac{\partial (\mathrm{tr}(WS))}{\partial W} \end{aligned} $$

We need two key matrix derivatives. Assuming $W$ is symmetric:

  1. Derivative of $\ln|W|$: Based on standard matrix calculus results (analogous to Property 1.4.16(2) and 1.4.9):
    $$ \frac{\partial \ln|W|}{\partial W} = W^{-1} $$
  2. Derivative of $\operatorname*{tr}(WS)$: Based on standard matrix calculus results (analogous to Property 1.4.6):
    $$ \frac{\partial (\mathrm{tr}(WS))}{\partial W} = S $$
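Both matrix derivatives can be spot-checked with central finite differences, treating each entry of $W$ as a free variable and evaluating at a symmetric point (a numerical sketch, not part of the exercise):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
A = rng.normal(size=(p, p))
W = A @ A.T + p * np.eye(p)        # SPD, hence symmetric
B = rng.normal(size=(p, p))
S = B @ B.T                        # symmetric, like a scatter matrix

def num_grad(f, M, eps=1e-6):
    # Central finite differences, perturbing one entry of M at a time
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = eps
            G[i, j] = (f(M + E) - f(M - E)) / (2 * eps)
    return G

# d(ln|W|)/dW = W^{-1} at a symmetric W
g1 = num_grad(lambda M: np.linalg.slogdet(M)[1], W)
assert np.allclose(g1, np.linalg.inv(W), atol=1e-5)

# d(tr(WS))/dW = S for symmetric S
g2 = num_grad(lambda M: np.trace(M @ S), W)
assert np.allclose(g2, S, atol=1e-5)
```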

Substituting these results back:

$$ \frac{\partial g(W)}{\partial W} = \frac{n}{2} W^{-1} - \frac{1}{2} S $$

Setting this derivative equal to the zero matrix:

$$ \frac{n}{2} {\hat W}^{-1} - \frac{1}{2} S = 0 $$

where $\hat W$ denotes the matrix $W$ that satisfies this equation.

$$ \frac{n}{2} {\hat W}^{-1} = \frac{1}{2} S $$

Multiply both sides by $2$:

$$ n {\hat W}^{-1} = S $$

Solve for $\hat W^{-1}$:

$$ {\hat W}^{-1} = \frac{1}{n} S $$

Step 6: Convert Back to $\Sigma$

Recall that $W = \Sigma^{-1}$, so $W^{-1} = (\Sigma^{-1})^{-1} = \Sigma$.

The resulting estimate $\hat W^{-1}$ must be symmetric since $S$ is symmetric.

Therefore, the Maximum Likelihood Estimate (MLE) $\hat\Sigma_{ML}$ is the $\hat W^{-1}$ we just found:

$$ {\hat\Sigma}_{ML} = {\hat W}^{-1} = \frac{1}{n} S $$

Substituting the definition of $S$:

$$ {\hat\Sigma}_{ML} = \frac{1}{n} \sum_{j=1}^n (X_j - \bar{X})(X_j - \bar{X})' $$

This is the MLE for the covariance matrix $\Sigma$ of a multivariate normal distribution.
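To tie the result together, here is a final NumPy check (an illustrative sketch, not part of the exercise): the estimate $\frac{1}{n}S$ matches NumPy's biased sample covariance, and it scores at least as high under the log-likelihood as nearby SPD alternatives, as a maximizer should:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 2
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.3], [0.0, 0.8]])
Xbar = X.mean(axis=0)
D = X - Xbar
Sigma_hat = D.T @ D / n            # (1/n) S, the MLE derived above

# Same estimate via NumPy's biased sample covariance (divides by n)
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))

def loglik(Sigma):
    # Log-likelihood l(Xbar, Sigma) from Step 3
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->', D, Sinv, D)
    return (-0.5 * n * p * np.log(2 * np.pi)
            - 0.5 * n * np.linalg.slogdet(Sigma)[1]
            - 0.5 * quad)

# The MLE should score at least as high as perturbed SPD covariances
for _ in range(5):
    P = rng.normal(size=(p, p)) * 0.1
    alt = Sigma_hat + P @ P.T      # SPD perturbation of the estimate
    assert loglik(Sigma_hat) >= loglik(alt)
```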