## 1. Gaussian Mixture Model

A Gaussian mixture is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

A Gaussian mixture model (GMM) is a clustering algorithm similar to k-means. Unlike k-means, GMM performs soft clustering: each instance is associated with every cluster with some probability. Formally, given a set of observations $$X = (x_1, x_2, ..., x_n)$$, $$x_i \in R^{d}$$, $$i=1, 2, ..., n$$, the Gaussian mixture model assumes:

1. there are $$K$$ mixture components, indexed by $$k \in \{1, 2, ..., K\}$$

2. each $$x_i$$ comes from component $$z_i$$, where $$Z = (z_1, z_2, ..., z_n)$$ are the latent variables

3. $$P(z_i=k) = \pi_k$$, s.t. $$\sum_{k=1}^K \pi_k = 1$$, where $$\pi = (\pi_1, \pi_2, ..., \pi_K)$$ are the mixing proportions

4. $$x_i \mid z_i=k \sim N(\mu_k, \Sigma_k)$$, where $$N(\mu_k, \Sigma_k)$$ is a multivariate Gaussian distribution whose density function is

$$$f(x_i) = \frac{1}{\sqrt{(2 \pi)^d \det \Sigma_k}} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right),$$$

where $$\Sigma_k$$ is the $$d \times d$$ covariance matrix for component $$k$$. The objective of GMM is to estimate the probabilities that each instance belongs to each component, that is

$$$P(z_i=k | x_i), k = 1, 2, ... K, i=1, 2, ... n \label{obj}$$$
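The generative process implied by these assumptions (draw a component with probability $$\pi_k$$, then draw from $$N(\mu_k, \Sigma_k)$$) can be sketched in numpy; the parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy parameters (made-up values, d=2, K=2)
pis = np.array([0.4, 0.6])                   # mixing proportions, sum to 1
mus = np.array([[0.0, 0.0], [5.0, 5.0]])     # component means
covs = np.array([np.eye(2), np.eye(2)])      # component covariances

n = 1000
# 1. draw latent components z_i with P(z_i = k) = pi_k
z = rng.choice(len(pis), size=n, p=pis)
# 2. draw x_i from the Gaussian of its component
samples = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
```

GMM inverts this process: given only `samples`, it recovers the component probabilities for each point.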

## 2. Bayes and Maximum Likelihood Estimation

With Bayes rule, we have

\begin{aligned} P(z_i=k|x_i) & = \frac{P(x_i|z_i=k)P(z_i=k)}{P(x_i)} \\ & = \frac{P(x_i|z_i=k)P(z_i=k)}{\sum_{j=1}^K P(x_i|z_i=j)P(z_i=j)} \\ & = \frac{\pi_k N(x_i|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j N(x_i|\mu_j, \Sigma_j)} \\ & = w_{ik} \end{aligned} \label{weights}
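As a numerical illustration of equation $$\eqref{weights}$$, here are the responsibilities for two toy points under made-up parameters, with scipy providing the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# toy parameters (made-up, d=2, K=2)
points = np.array([[0.1, -0.2], [5.1, 4.8]])
pis = np.array([0.5, 0.5])
mus = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]

# numerator of the weights: pi_k * N(x_i | mu_k, Sigma_k)
w = np.column_stack([pi*multivariate_normal.pdf(points, mu, cov)
                     for pi, mu, cov in zip(pis, mus, covs)])
# divide by the denominator: normalize each row to sum to 1
w /= w.sum(axis=1, keepdims=True)
```

Each row of `w` is a probability distribution over the components; the first point lands almost entirely in component 0, the second in component 1.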

If we knew the parameters $$\theta = (\pi_1, \pi_2, ..., \pi_K, \mu_1, \mu_2, ..., \mu_K, \Sigma_1, \Sigma_2, ..., \Sigma_K)$$, we could compute the objective directly. Therefore, the task becomes estimating the unknown parameter $$\theta$$. To estimate $$\theta$$, GMM uses maximum likelihood estimation (MLE),

\begin{aligned} L(\theta|X) & = P(X|\theta) \\ & = \prod_{i=1}^n \sum_{k=1}^K P(x_i|z_i=k)P(z_i=k) \\ & = \prod_{i=1}^n \sum_{k=1}^K \pi_k N(x_i|\mu_k, \Sigma_k) \end{aligned}

The log-likelihood becomes

\begin{aligned} l(\theta|X) & = \sum_{i=1}^n \log\left(\sum_{k=1}^K \pi_k N(x_i|\mu_k, \Sigma_k)\right) \end{aligned}

Normally, we could obtain

$$$\theta^{*} = \underset{\theta}{\operatorname{argmax}} \ l(\theta|X)$$$

by setting

$$$\frac{\partial l}{\partial \theta} = 0$$$

However, the sum inside the logarithm makes a closed-form solution for $$\theta$$ intractable. So GMM uses the expectation-maximization (EM) algorithm.

## 3. Expectation-Maximization Algorithm

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables.

EM iterates between the expectation step (E-step) and the maximization step (M-step) until convergence:

1. initialize $$\theta = (\pi_1, \pi_2, ..., \pi_K, \mu_1, \mu_2, ..., \mu_K, \Sigma_1, \Sigma_2, ..., \Sigma_K)$$ to $$\theta_0$$
2. iterate until convergence or until the maximum number of iterations is reached; at step $$t\geq0$$:

2.1 E-step: with $$\theta_{t}$$, calculate $$P(z_i=k|x_i)$$ through equation $$\eqref{weights}$$ for $$k=1, 2, ..., K$$ and $$i=1, 2, ..., n$$

2.2 M-step: estimate $$\theta_{t+1}$$ by maximum likelihood estimation

In the E-step, EM defines the expected value of the log-likelihood function of $$\theta$$ as

$$$\mathbf{Q}(\theta|\theta_{t}) = \mathbf{E}_{Z|X, \theta_{t}}\left[\log L(\theta;X,Z)\right]$$$

In the M-step, EM estimates the next $$\theta$$ as

$$$\theta_{t+1} = \underset{\theta}{\operatorname{argmax}} \ \mathbf{Q}(\theta|\theta_{t}) \quad s.t. \ \sum_{k=1}^K \pi_k = 1$$$
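The two steps can be sketched as a generic loop. In the sketch below, `e_step` and `m_step` are illustrative stand-ins fitted to a deliberately tiny 1-D model where only the mixing proportion is unknown, not the full GMM updates derived in the next section:

```python
import numpy as np

def em(data, theta0, e_step, m_step, max_iter=100):
    """Generic EM loop: alternate E-step and M-step until theta stabilizes.

    e_step(data, theta) -> responsibilities; m_step(data, resp) -> new theta.
    (The two callables are user-supplied; the names are illustrative.)
    """
    theta = theta0
    for _ in range(max_iter):
        resp = e_step(data, theta)
        theta_next = m_step(data, resp)
        if np.allclose(theta, theta_next):
            break
        theta = theta_next
    return theta

# Tiny demo: 1-D mixture of N(0,1) and N(4,1); only the weight pi of the
# first component is unknown.
def norm_pdf(x, mu):
    return np.exp(-0.5*(x - mu)**2) / np.sqrt(2*np.pi)

def e_step(data, pi):
    w1 = pi*norm_pdf(data, 0.0)
    w2 = (1 - pi)*norm_pdf(data, 4.0)
    return w1/(w1 + w2)          # responsibility of the first component

def m_step(data, resp):
    return resp.mean()           # MLE of pi given the responsibilities

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 700)])
pi_hat = em(data, 0.5, e_step, m_step)
```

Starting from $$\pi = 0.5$$, the loop recovers a value close to the true mixing proportion of 0.3.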

## 4. GMM with EM

In GMM situation, we have

\begin{aligned} L(\theta;x_i, z_i) & = P(x_i, z_i|\theta) \\ & = P(x_i|z_i, \theta) P(z_i | \theta) \\ & = \prod_{k=1}^K \left[ N(x_i|\mu_k, \Sigma_k) \, \pi_k \right]^{I(z_i=k)} \end{aligned}

where $$I(z_i=k)$$ is an indicator function. And

$$$L(\theta|X, Z) = \prod_{i=1}^n L(\theta;x_i, z_i)$$$

In E-step, the expected value of the log-likelihood function becomes

\begin{aligned} \mathbf{Q}(\theta|\theta_{t}) & = \mathbf{E}_{Z|X, \theta_{t}}[\log L(\theta;X,Z)] \\ & = \mathbf{E}_{Z|X, \theta_{t}}\left[\sum_{i=1}^n \log L(\theta;x_i, z_i)\right] \\ & = \sum_{i=1}^n \mathbf{E}_{z_i|x_i, \theta_{t}}[\log L(\theta;x_i, z_i)] \\ & = \sum_{i=1}^n \mathbf{E}_{z_i|x_i, \theta_{t}}\left[\sum_{k=1}^K I(z_i=k) (\log\pi_k + \log N(x_i|\mu_k, \Sigma_k))\right] \\ & = \sum_{i=1}^n \sum_{k=1}^K P(z_i=k|x_i) (\log\pi_k + \log N(x_i|\mu_k, \Sigma_k)) \end{aligned}

Combining the current $$\theta_{t}$$ with equation $$\eqref{weights}$$, we have

$$$w_{ik} = P(z_i=k|x_i)$$$

where $$w_{ik}$$ is a constant with respect to $$\theta$$. Thus

$$$\mathbf{Q}(\theta|\theta_{t}) = \sum_{i=1}^n \sum_{k=1}^K w_{ik} (\log\pi_k + \log N(x_i|\mu_k, \Sigma_k))$$$

In the M-step, the optimization problem becomes

\begin{aligned} \underset{\theta}{\operatorname{max}} \quad & \mathbf{Q}(\theta|\theta_{t}) \\ \text{s.t.} \quad & \sum_{k=1}^K \pi_k = 1 \end{aligned}

which can be solved with a Lagrange multiplier. The Lagrangian is

\begin{aligned} H(\theta, \lambda) = \mathbf{Q}(\theta|\theta_{t}) + \lambda(\sum_{k=1}^K \pi_k - 1) \end{aligned}

Taking partial derivatives, we have

\begin{aligned} \frac{\partial H}{\partial \mu_k} & = \sum_{i=1}^n w_{ik} \frac{\partial \log N(x_i|\mu_k, \Sigma_k)}{\partial \mu_k} \\ & = \sum_{i=1}^n w_{ik} \Sigma_{k}^{-1}(x_i - \mu_k) \end{aligned} \label{partialmu}

and

$$$\frac{\partial H}{\partial \Sigma_k} = \frac{1}{2} \sum_{i=1}^n w_{ik} \left( \Sigma_{k}^{-1} (x_i - \mu_k)(x_i-\mu_k)^{T} \Sigma_{k}^{-1} - \Sigma_{k}^{-1} \right) \label{partialSigma}$$$

and

$$$\frac{\partial H}{\partial \pi_k} = \sum_{i=1}^n w_{ik} / \pi_k + \lambda \label{pi_k}$$$

and

$$$\frac{\partial H}{\partial \lambda} = \sum_{k=1}^K \pi_k - 1 \label{lambda}$$$

Setting $$\eqref{partialmu}$$ to zero, we have

$$$\mu_k^* = \frac{\sum_{i=1}^n w_{ik}x_i}{\sum_{i=1}^n w_{ik}}$$$

Setting $$\eqref{partialSigma}$$ to zero and combining with $$\mu_k^*$$, we have

$$$\Sigma_k^* = \frac{\sum_{i=1}^n w_{ik}(x_i-\mu_k)(x_i-\mu_k)^{T} }{\sum_{i=1}^n w_{ik}}$$$

Setting $$\eqref{pi_k}$$ and $$\eqref{lambda}$$ to zero gives $$\sum_{i=1}^n w_{ik} = -\lambda \pi_k$$; summing this over $$k$$ and using $$\sum_{k=1}^K \pi_k = 1$$ yields $$\lambda = -\sum_{k=1}^K \sum_{i=1}^n w_{ik} = -n$$, so

$$$\pi_k^* = \frac{\sum_{i=1}^n w_{ik}}{\sum_{k=1}^K \sum_{i=1}^n w_{ik}} = \frac{1}{n}\sum_{i=1}^n w_{ik}$$$

Therefore, we obtain $$\theta_{t+1} = (\mu_1^*, \mu_2^*, ... \mu_K^*, \Sigma_1^*, \Sigma_2^*, ... \Sigma_K^*, \pi_1^*, \pi_2^*, ... \pi_K^*)$$.
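As a sanity check on the three closed-form updates, they can be evaluated with numpy on random toy responsibilities (all names and values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 2, 3
data = rng.normal(size=(n, d))
w = rng.random((n, K))
w /= w.sum(axis=1, keepdims=True)      # each row sums to 1, like posteriors

Nk = w.sum(axis=0)                     # effective sample count per component
pis_new = Nk / n                       # pi_k* = (1/n) sum_i w_ik
mus_new = (w.T @ data) / Nk[:, None]   # mu_k* as rows, shape (K, d)

# Sigma_k*: responsibility-weighted scatter around mu_k*
covs_new = np.stack([
    ((data - mus_new[k]).T * w[:, k]) @ (data - mus_new[k]) / Nk[k]
    for k in range(K)
])
```

By construction the new mixing proportions sum to 1 and each covariance estimate is symmetric, as the formulas require.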

## 5. A NumPy Implementation in Python

First, we need to initialize $$\theta_0$$. Here, we use k-means++, k-means with random initialization, or fully random weights to initialize the parameters. After initialization, we calculate the weights of equation $$\eqref{weights}$$, which is the E-step.

```python
# 0. import
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

X, y = make_blobs()

# 1. EM: initialization and E-step

def _kmeans_init(X, n_components, kmeans_method):
    n_samples = X.shape[0]
    if kmeans_method == 'kmeans++':
        # 'k-means++' is sklearn's default init for KMeans
        kmeans = KMeans(n_clusters=n_components).fit(X)
    else:
        kmeans = KMeans(n_clusters=n_components, init='random').fit(X)
    labels = kmeans.labels_

    weight_matrix = None

    for i in range(n_components):
        sub_component = X[labels == i]

        pi = sub_component.shape[0]/n_samples
        mean = np.mean(sub_component, axis=0)
        cov = np.cov(sub_component.T)

        pdf_value = multivariate_normal.pdf(X, mean, cov)
        weight = pi*pdf_value

        if weight_matrix is None:
            weight_matrix = weight
        else:
            weight_matrix = np.column_stack((weight_matrix, weight))

    weight_sum = np.sum(weight_matrix, axis=1)[:, np.newaxis]
    weights = weight_matrix/weight_sum
    return weights

def _initialize(X, n_components, method):
    # default: random
    n_samples = X.shape[0]
    if method == 'kmeans++':
        weights = _kmeans_init(X, n_components, 'kmeans++')
    elif method == 'kmeans_random':
        weights = _kmeans_init(X, n_components, 'kmeans_random')
    else:
        weights = np.random.rand(n_samples, n_components)
        weight_sum = weights.sum(axis=1)[:, np.newaxis]
        weights /= weight_sum

    return weights
```

Given the weights of equation $$\eqref{weights}$$, we update the value of $$\theta$$ through MLE, which is the M-step. To avoid a Python for-loop over samples, rewrite the numerator of $$\Sigma_k^*$$ as:

\begin{aligned} A & = \sum_{i=1}^n w_{ik}(x_i - \mu_k)(x_i - \mu_k)^T \\ \Rightarrow \quad A_{st} &= \sum_{i=1}^n w_{ik}x_{is}x_{it} - \sum_{i=1}^nw_{ik}\mu_{ks}x_{it} - \sum_{i=1}^nw_{ik}\mu_{kt}x_{is} + \sum_{i=1}^nw_{ik}\mu_{ks}\mu_{kt} \\ \Rightarrow \quad A_{st} & = A1_{st} - A2_{st} - A3_{st} + A4_{st} \end{aligned}

for $$1 \leq s, t \leq d$$, where

\begin{aligned} A1 & = (w_k*(X.T)).dot(X) \\ A2 & = A3.T \\ A3 & = \mu_k*((w_k*(X.T)).dot(H)) \\ A4 & = (\sum_{i=1}^n w_{ik})*np.outer(\mu_k, \mu_k) \\ w_k & = np.array([w_{1k}, w_{2k}, ..., w_{nk}]) \\ \mu_k & = np.array([\mu_{k1}, \mu_{k2}, ..., \mu_{kd}]) \\ H & = np.ones(shape=(n, d)) \end{aligned}
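A quick numerical check (on toy data, with the shapes assumed above) that the vectorized decomposition matches the naive per-sample loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X_toy = rng.normal(size=(n, d))
w_k = rng.random(n)
mu_k = rng.normal(size=d)

# naive loop: A = sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T
A_loop = sum(w_k[i]*np.outer(X_toy[i] - mu_k, X_toy[i] - mu_k)
             for i in range(n))

# vectorized: A = A1 - A2 - A3 + A4
H = np.ones((n, d))
weighted_X_T = w_k*(X_toy.T)             # shape (d, n)
A1 = weighted_X_T @ X_toy
A3 = mu_k*(weighted_X_T @ H)
A2 = A3.T
A4 = w_k.sum()*np.outer(mu_k, mu_k)
A_vec = A1 - A2 - A3 + A4

print(np.allclose(A_loop, A_vec))  # True
```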
```python
# 2. EM: M-step

def _update_means(X, weights):
    weight_sum = np.sum(weights, axis=0)
    means = X.T.dot(weights)/weight_sum          # shape (d, n_components)
    return means

def _update_covs(X, n_components, weights, means):
    n, d = X.shape
    covs = None
    H = np.ones(shape=(n, d))

    for k in range(n_components):
        weight = weights[:, k]
        weight_sum = np.sum(weight)
        mean = means[:, k]
        weighted_X_T = weight*(X.T)              # w_k * X.T, shape (d, n)

        a1 = weighted_X_T.dot(X)
        a3 = mean*(weighted_X_T.dot(H))
        a2 = a3.T
        mean_reshaped = mean.reshape(-1, 1)
        a4 = weight_sum*(mean_reshaped.dot(mean_reshaped.T))
        a = a1 - a2 - a3 + a4

        cov = a/weight_sum
        if covs is None:
            covs = cov
        else:
            covs = np.dstack((covs, cov))
    return covs

def _update_pis(weights):
    w_each_component = np.sum(weights, axis=0)
    w_sum = np.sum(w_each_component)             # equals n: rows sum to 1
    pis = w_each_component/w_sum
    return pis

def _update_weights(X, n_components, means, covs, pis):
    weighted_pdf_array = None
    for k in range(n_components):
        mean = means[:, k]
        cov = covs[:, :, k]
        pi = pis[k]
        pdf_values = multivariate_normal.pdf(X, mean, cov)
        weighted_pdf_values = pi*pdf_values
        if weighted_pdf_array is None:
            weighted_pdf_array = weighted_pdf_values
        else:
            weighted_pdf_array = np.column_stack((weighted_pdf_array,
                                                  weighted_pdf_values))

    weighted_sum = np.sum(weighted_pdf_array, axis=1)[:, np.newaxis]
    weights = weighted_pdf_array/weighted_sum
    return weights

# 3. gmm fit
def gmm_fit(X, n_components, init_method, max_iter=100):
    weights = _initialize(X, n_components, init_method)
    weights_next = weights
    iter_left = max_iter
    while iter_left > 0:
        means = _update_means(X, weights)
        covs = _update_covs(X, n_components, weights, means)
        pis = _update_pis(weights)
        weights_next = _update_weights(X, n_components, means, covs, pis)

        if np.allclose(weights, weights_next):
            print('converged at step', max_iter-iter_left)
            break
        else:
            weights = weights_next
            iter_left -= 1
    return means, covs, pis, weights_next

# 4. predict labels
def gmm_predict(X, n_components, means, covs, pis):
    weighted_pdf_values = None
    for k in range(n_components):
        mean = means[:, k]
        cov = covs[:, :, k]
        pdf_value = pis[k]*multivariate_normal.pdf(X, mean, cov)
        if weighted_pdf_values is None:
            weighted_pdf_values = pdf_value
        else:
            weighted_pdf_values = np.column_stack((weighted_pdf_values,
                                                   pdf_value))
    # MAP assignment: argmax_k pi_k * N(x | mu_k, Sigma_k)
    labels = np.argmax(weighted_pdf_values, axis=1)
    return labels

# 5. gmm
def gmm_fit_predict(X, n_components, init_method, max_iter):
    means, covs, pis, weights = gmm_fit(X, n_components, init_method, max_iter)
    labels = gmm_predict(X, n_components, means, covs, pis)
    return labels

# 6. result and demonstration
def plot_scatter(ax, X, labels, title):
    ax.scatter(X[:, 0], X[:, 1], c=labels)
    ax.set_title(title)

n_components = 3
gmm_labels_kmeans_plus_plus = gmm_fit_predict(X, n_components, 'kmeans++', max_iter=1000)
gmm_labels_kmeans_random = gmm_fit_predict(X, n_components, 'kmeans_random', max_iter=1000)
gmm_labels_random = gmm_fit_predict(X, n_components, 'random', max_iter=1000)

fig, axes = plt.subplots(2, 2, figsize=(16, 10))
plot_scatter(axes[0, 0], X, y, 'Blobs')
plot_scatter(axes[0, 1], X, gmm_labels_kmeans_plus_plus, 'kmeans++')
plot_scatter(axes[1, 0], X, gmm_labels_kmeans_random, 'kmeans random')
plot_scatter(axes[1, 1], X, gmm_labels_random, 'random')
```
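As a sanity check, scikit-learn's `GaussianMixture` can be run on a well-separated blobs dataset and compared with the true labels via the adjusted Rand index, which is invariant to label permutation; the same comparison can be applied to `gmm_fit_predict` above:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# well-separated blobs (centers chosen for illustration)
X_check, y_check = make_blobs(n_samples=300,
                              centers=[[0, 0], [5, 5], [-5, 5]],
                              cluster_std=1.0, random_state=0)
sk_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_check)

# a score near 1.0 means the clustering matches the ground truth
score = adjusted_rand_score(y_check, sk_labels)
```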