This isn't an "explain it like I'm 4"-feasible concept. You'll need to know some terminology, like multimodal, distribution, Bayes' theorem, etc. The following explanation is going to use all of that.
Suppose you have some multimodal data. Maybe the data has no upper bound, and no lower bound either. A Gamma distribution won't work, because the data can be negative. A Beta distribution won't work, because the data can fall outside [0, 1]. The closest you can get is the normal distribution. But since the data is multimodal, a single normal won't work either. What do we do?
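Before answering, here's a quick sketch of the problem (with made-up bimodal data, using NumPy) showing just how badly a single normal fails:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up bimodal data: two well-separated clumps, unbounded support
y = np.concatenate([rng.normal(-4.0, 1.0, 500),
                    rng.normal(3.0, 1.0, 500)])

# The best single-normal fit just uses the overall mean and variance...
mu_hat, sigma_hat = y.mean(), y.std()
print(mu_hat, sigma_hat)  # the mean lands near -0.5, between the two modes

# ...which puts most of the density in the valley where almost no data lives
```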
It turns out that one thing you could do is construct a finite mixture model. It's essentially a weighted combination of some number of normal distributions. Each normal distribution has a particular weight assigned to it, \(\pi_j\) for \(j = 1, \dots, K\), where \(K\) is the number of normal distributions used in the mixture model; the weights are non-negative and sum to 1. Each normal distribution is known as a mixture component, and has its own mean and variance parameters: \(\mu_j, \sigma^2_j\). So for a particular data set \(y\) where a single data point is \(y_i\), we can say that \(y_i\) follows a finite mixture if:
$$ \begin{align*} y_i &\sim \sum_{j=1}^K \pi_j N(\mu_j, \sigma^2_j) \end{align*} $$
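Generatively, this means: pick a component \(j\) with probability \(\pi_j\), then draw from that component's normal distribution. Here's a minimal sketch of that, assuming some hypothetical weights, means, and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: K = 2 components
pi = np.array([0.3, 0.7])       # weights pi_j, summing to 1
mu = np.array([-4.0, 3.0])      # component means mu_j
sigma = np.array([1.0, 1.5])    # component standard deviations sigma_j

def sample_mixture(n):
    """Draw n points from sum_j pi_j * N(mu_j, sigma_j^2)."""
    # Pick a component index for each draw with probability pi_j...
    j = rng.choice(len(pi), size=n, p=pi)
    # ...then sample from that component's normal distribution
    return rng.normal(mu[j], sigma[j])

y = sample_mixture(1000)  # this sample will show two modes, near -4 and 3
```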
Thus, the likelihood function for this set of data with parameter vector \(\theta = (\mu_1, \dots, \mu_K, \sigma^2_1, \dots, \sigma^2_K, \pi_1, \dots, \pi_K)\), remembering that there are \(N\) data points, would be:

$$ \begin{align*} f(y | \theta) &= \prod_{i=1}^N \sum_{j=1}^K \pi_j N(y_i | \mu_j, \sigma^2_j) \end{align*} $$

where \(N(y_i | \mu_j, \sigma^2_j)\) denotes the normal density evaluated at \(y_i\).
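One practical way to compute this is in log space, so the product of many small densities doesn't underflow; the inner sum then becomes a log-sum-exp over components. A minimal NumPy/SciPy sketch, with the same hypothetical parameters as above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(y, pi, mu, sigma):
    """log f(y | theta): sum over i of log sum over j of pi_j * N(y_i | mu_j, sigma_j^2)."""
    # (N, K) matrix: entry (i, j) holds log(pi_j) + log N(y_i | mu_j, sigma_j^2)
    log_terms = np.log(pi) + norm.logpdf(y[:, None], loc=mu, scale=sigma)
    # log-sum-exp over components j computes the log of the inner sum stably,
    # and summing over i replaces the product of densities
    return logsumexp(log_terms, axis=1).sum()

# Hypothetical parameters and a few made-up data points
pi = np.array([0.3, 0.7])
mu = np.array([-4.0, 3.0])
sigma = np.array([1.0, 1.5])
y = np.array([-3.8, 2.9, 3.4])
print(log_likelihood(y, pi, mu, sigma))
```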
It so happens that the computation is made easier if "latent component allocators" are introduced into the equation. If I said that in public, many people would probably run away. What I mean is that we can think of each data point as being "assigned" to the particular normal distribution it came from. That way, we can fit each \(\mu_j, \sigma^2_j\) using the data points assigned to that component. If we leave the likelihood as it is, we have no idea which \(y_i\) goes with which distribution.

This is why we introduce a set of \(N\) "z" parameters, one \(z_i\) for each \(y_i\), \(i = 1, \dots, N\). The value of each \(z_i\) is the component, an integer in \(\{1, \dots, K\}\), to which the \(i\)th data point is assigned. The z values serve as a kind of mapping - they map each data point to its respective normal distribution "mixture component."
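A sketch of why the \(z\) values help, again with made-up parameters (and 0-based component indices, versus the \(1, \dots, K\) of the math above): if you draw each \(z_i\) first, then fitting each component reduces to simple per-group statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, K = 2 components
pi = np.array([0.3, 0.7])
mu = np.array([-4.0, 3.0])
sigma = np.array([1.0, 1.5])
N = 1000

# Draw the latent allocations first: z_i picks a component with probability pi_j
z = rng.choice(len(pi), size=N, p=pi)

# Given z, each y_i comes from its assigned component's normal distribution
y = rng.normal(mu[z], sigma[z])

# With z in hand, fitting each component is just statistics within each group
for j in range(len(pi)):
    group = y[z == j]
    print(j, group.mean(), group.var())
```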
This means that we can rewrite the likelihood like so:
$$ \begin{align*} f(y | \theta, \pmb{z}) &= \prod_{i=1}^N N(y_i | \mu_{z_i}, \sigma^2_{z_i}) \end{align*} $$
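Sketched in code (again assuming 0-based component indices), each \(y_i\) is scored only against the single component its \(z_i\) picks out, so the sum over components disappears entirely:

```python
import numpy as np
from scipy.stats import norm

def complete_data_log_likelihood(y, z, mu, sigma):
    """log f(y | theta, z): each y_i is scored against component z_i only."""
    # mu[z] and sigma[z] select each point's assigned component,
    # so there is no sum over components left to compute
    return norm.logpdf(y, loc=mu[z], scale=sigma[z]).sum()
```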
To be continued...