A Guide to Maximum Likelihood Estimation in Supervised Learning

This article demystifies the machine learning modeling process through the prism of statistics.

We will see how our assumptions about the data enable us to formulate meaningful optimization problems. In fact, we will derive commonly used criteria such as cross-entropy in classification and mean squared error (MSE) in regression.

Finally, I will try to answer an interview question that I once encountered: what would happen if we used MSE on binary classification?


Likelihood vs. probability and probability density

To begin, let’s start with a fundamental question: what is the difference between likelihood and probability? The data x are connected to the possible models θ by means of a probability P(x,θ) or a probability density function (pdf) p(x,θ).

In short, a pdf gives the relative probability of occurrence of the different possible values; the probability of any single exact value is infinitesimally small, which is why we speak of a density. We’ll stick with the pdf notation here. For any given set of parameters θ, p(x,θ) is intended to be the probability density function of x. The likelihood p(x,θ) is defined as the joint density of the observed data, viewed as a function of the model parameters. That means that for any given x, p(x=fixed,θ) can be viewed as a function of θ. Thus, the likelihood function is a function of the parameters θ only, with the data held as a fixed constant.
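To make the distinction concrete, here is a minimal sketch, assuming for illustration a univariate Gaussian density (the choice of distribution is mine, not something fixed by the discussion above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density, standing in for a generic p(x, theta)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Density view: fix the parameters theta = (mu, sigma) and vary the data x.
mu, sigma = 0.0, 1.0
for x in (-1.0, 0.0, 1.0):
    print(f"p(x={x:+.1f}, theta fixed) = {gaussian_pdf(x, mu, sigma):.4f}")

# Likelihood view: fix the observed data x and vary the parameter mu.
x_observed = 1.0
for mu in (-1.0, 0.0, 1.0):
    print(f"L(mu={mu:+.1f}; x fixed)   = {gaussian_pdf(x_observed, mu, sigma):.4f}")
```

The same function is evaluated in both loops; only what we hold fixed changes. Notice that the likelihood view is maximized when mu matches the observed data point.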

Notations

We will consider the case where we are given a set X of m data instances X = {x^(1), …, x^(m)} that follow the empirical training data distribution pdata_train(x) = pdata(x), which is a good and representative sample of the unknown and broader data distribution pdata_real(x).

The Independent and identically distributed assumption

This brings us to the most fundamental assumption of ML: Independent and Identically Distributed (IID) data (random variables). Statistical independence means that for random variables A and B, the joint distribution PA,B(a,b) factors into the product of their marginal distributions: PA,B(a,b) = PA(a)PB(b). That is how multi-variable joint distributions are turned into products over individual instances. Note that the product can then be converted into a sum by taking the log, since log ∏ x = ∑ log x. Because log(x) is monotonic, this does not change the optimization problem.
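Applied to the m IID training instances of X, the joint density of the whole set factorizes, and taking the log turns the product into a sum:

$$
p(X, \theta) = \prod_{i=1}^{m} p\big(x^{(i)}, \theta\big)
\qquad\Longrightarrow\qquad
\log p(X, \theta) = \sum_{i=1}^{m} \log p\big(x^{(i)}, \theta\big)
$$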

Our estimator (model) will have some learnable parameters θ that define another probability distribution pmodel(x,θ). Ideally, pmodel(x,θ) ≈ pdata(x).


The essence of ML is to pick a good initial model that exploits the assumptions and the structure of the data. In other words, a model with a decent inductive bias. As the parameters are iteratively optimized, pmodel(x,θ) gets closer and closer to pdata(x).

In neural networks, because optimization happens in mini-batches rather than over the whole dataset at once, m will be the mini-batch size.
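To make this concrete before the formal derivation, here is a minimal sketch of mini-batch likelihood maximization, assuming for illustration a unit-variance Gaussian model with a single learnable mean (both the model and the hyperparameters are my choices for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(loc=2.0, scale=1.0, size=10_000)  # stand-in for samples from p_data

def gaussian_log_pdf(x, mu, sigma=1.0):
    """Log-density of a univariate Gaussian: our p_model(x, theta) with theta = mu."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

mu = 0.0          # learnable parameter theta
lr = 0.1
batch_size = 64   # here m is the mini-batch size, not the whole dataset

for step in range(200):
    batch = rng.choice(dataset, size=batch_size, replace=False)
    # Average negative log-likelihood of the mini-batch (IID -> mean of log densities).
    nll = -np.mean(gaussian_log_pdf(batch, mu))
    # Gradient of the average NLL w.r.t. mu for a unit-variance Gaussian: mean(mu - x).
    grad = np.mean(mu - batch)
    mu -= lr * grad
    if step % 50 == 0:
        print(f"step {step:3d}  NLL {nll:.3f}  mu {mu:.3f}")

print(f"estimated mu after mini-batch MLE: {mu:.3f}")  # should end up close to 2.0
```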

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is simply a common, principled method for deriving good estimators: it picks the parameters θ that best fit the data.

To disentangle this concept, let’s look at the formula in its most intuitive form:
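In standard notation, for the unconditional case (no labels), the MLE objective is:

$$
\theta_{ML} = \arg\max_{\theta} \; p_{model}(X, \theta)
            = \arg\max_{\theta} \; \prod_{i=1}^{m} p_{model}\big(x^{(i)}, \theta\big)
            = \arg\max_{\theta} \; \sum_{i=1}^{m} \log p_{model}\big(x^{(i)}, \theta\big)
$$

The last step uses the IID factorization and the log trick from the previous section, which leave the argmax unchanged.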

The optimization problem is to maximize the likelihood of the given data. In the probabilistic view, supervision enters through conditioning: unconditional MLE means there is no conditioning at all, so no labels are involved.

In a supervised ML context, the likelihood of the labels y^(i) is maximized, conditioned on the corresponding inputs x^(i).
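In the same notation, the supervised (conditional) MLE objective reads:

$$
\theta_{ML} = \arg\max_{\theta} \; \sum_{i=1}^{m} \log p_{model}\big(y^{(i)} \mid x^{(i)}, \theta\big)
$$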

Quantifying distribution closeness: KL-div

One way to interpret MLE is to view it as minimizing the “closeness” between the training data distribution pdata(x) and the model distribution pmodel(x,θ). A standard way to quantify this “closeness” between distributions is the KL divergence, defined as:
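In standard notation, with the expectation taken over x drawn from the data distribution:

$$
D_{KL}\big(p_{data} \,\|\, p_{model}\big)
= \mathbb{E}_{x \sim p_{data}}\big[\log p_{data}(x) - \log p_{model}(x, \theta)\big]
$$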

where E denotes the expectation over the training data distribution. In general, the expected value E is a weighted average of all possible outcomes. We will replace the expectation with a sum, multiplying each term by its “weight” of happening, that is, pdata(x).
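Since the term involving log pdata(x) does not depend on θ, it can be dropped from the optimization. Replacing the remaining expectation with the empirical average over the m training instances recovers exactly the negative log-likelihood, i.e., the cross-entropy between pdata and pmodel:

$$
\arg\min_{\theta} D_{KL}\big(p_{data} \,\|\, p_{model}\big)
= \arg\min_{\theta} \; \mathbb{E}_{x \sim p_{data}}\big[-\log p_{model}(x, \theta)\big]
\approx \arg\min_{\theta} \; -\frac{1}{m}\sum_{i=1}^{m} \log p_{model}\big(x^{(i)}, \theta\big)
$$

So minimizing the KL divergence between pdata and pmodel is the same optimization problem as maximizing the likelihood, which is how cross-entropy (and, under a Gaussian likelihood, MSE) emerges as a training criterion.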

Bonus: What would happen if we use MSE on binary classification?

So far, I have presented the basics. Here is a bonus question that I was asked during an ML interview: what happens if we use MSE on binary classification?
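To analyze it, I will assume the standard binary setup (the setup is my assumption, since the question leaves the details open): the network produces a logit z, the prediction is ŷ = σ(z) with σ the sigmoid, and the per-example MSE loss is L = (ŷ - y)^2. The chain rule then gives the gradient with respect to the logit:

$$
\frac{\partial L}{\partial z} = 2\,(\hat{y} - y)\,\sigma'(z) = 2\,(\hat{y} - y)\,\hat{y}\,(1 - \hat{y})
$$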

When the label y^(i)=0:

The gradient with respect to the logit reduces to 2ŷ^2(1 - ŷ). If the network is right, ŷ=0, the gradient is null. But if the network is confidently wrong, ŷ≈1, the (1 - ŷ) factor also drives the gradient towards zero.

When the label y^(i)=1:

The gradient reduces to -2ŷ(1 - ŷ)^2. If the network is right, ŷ=1, the gradient is null. And again, if the network is confidently wrong, ŷ≈0, the gradient vanishes as well.

In short, MSE on top of a sigmoid provides almost no learning signal exactly when the model is most wrong. Cross-entropy, the criterion that MLE prescribes for binary labels, does not suffer from this: its gradient with respect to the logit is simply ŷ - y.
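A quick numerical check (a minimal NumPy sketch; the particular logit value and helper names are mine, chosen only for illustration) makes the contrast concrete:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong prediction: the true label is 0,
# but the logit is large, so y_hat is very close to 1.
y = 0.0
z = 10.0
y_hat = sigmoid(z)

# Gradient of the MSE loss (y_hat - y)^2 w.r.t. the logit z:
# 2 * (y_hat - y) * y_hat * (1 - y_hat) -> crushed by the (1 - y_hat) factor.
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Gradient of the binary cross-entropy loss w.r.t. the logit z: simply y_hat - y.
grad_bce = y_hat - y

print(f"y_hat        = {y_hat:.6f}")
print(f"MSE gradient = {grad_mse:.6f}")  # tiny: almost no learning signal
print(f"BCE gradient = {grad_bce:.6f}")  # close to 1: strong learning signal
```

On this confidently wrong example, the MSE gradient is several orders of magnitude smaller than the cross-entropy gradient, which is exactly the failure mode described above.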
