Statistical Estimation - Maximum Likelihood Estimation (MLE)

 

📘 Statistical Estimation

When dealing with statistics, we usually have:

  • A population with unknown parameters (e.g., mean $\mu$, variance $\sigma^2$, probability $p$, etc.).

  • A sample of observations drawn from that population.

Since population parameters are unknown constants, we need to estimate them from sample data.


1. Point Estimation

A point estimator is a single statistic (function of sample observations) that provides a “best guess” of the parameter.

  • Example: The sample mean $\bar{X} = \frac{1}{n}\sum X_i$ is an estimator of the population mean $\mu$.

  • The obtained numerical value is called the point estimate.
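
For instance, a minimal Python sketch (the sample values are made up purely for illustration) that computes the sample mean as a point estimate of $\mu$:

```python
# Point estimation: the sample mean as a single "best guess" of mu.
sample = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4]    # illustrative data

x_bar = sum(sample) / len(sample)          # X-bar = (1/n) * sum(X_i)
print(f"Point estimate of mu: {x_bar:.3f}")
```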


2. Interval Estimation

Instead of one value, we provide an interval of plausible values with a given level of confidence.

  • Example:

    $$\bar{X} \pm Z_{\alpha/2}\cdot \frac{\sigma}{\sqrt{n}}$$

    is a $(1-\alpha)100\%$ confidence interval for $\mu$. Taking $\alpha = 0.05$, so that $Z_{\alpha/2} = 1.96$, gives a 95% confidence interval.
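
A minimal sketch of this computation in Python (assuming $\sigma$ is known; the data and $\sigma$ below are made up for illustration):

```python
import math

sample = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4]   # illustrative data
sigma = 0.8                                # assumed known population std. dev.
n = len(sample)
x_bar = sum(sample) / n

z = 1.96                                   # Z_{alpha/2} for alpha = 0.05
margin = z * sigma / math.sqrt(n)
print(f"95% CI for mu: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")
```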


3. Properties of Good Estimators

A good estimator should satisfy the following properties:

  1. Unbiasedness:

    $$E(\hat{\theta}) = \theta$$

    The expected value of the estimator equals the true parameter.

  2. Consistency:
    As $n \to \infty$, $\hat{\theta} \to \theta$ in probability (demonstrated in the simulation sketch after this list).

  3. Efficiency:
    Among unbiased estimators, the one with minimum variance is preferred.

  4. Sufficiency:
    An estimator is sufficient if it uses all available information in the sample about the parameter.
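
Unbiasedness and consistency can be seen empirically with a small simulation. In the sketch below, the true parameters $\mu = 10$, $\sigma = 2$ are arbitrary choices for the demo: sample means average out to $\mu$, and a single estimate tightens around $\mu$ as $n$ grows.

```python
import random

random.seed(0)
mu, sigma = 10.0, 2.0   # "true" parameters, known here only because we simulate

def sample_mean(n):
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

# Unbiasedness: the average of many independent estimates is close to mu.
estimates = [sample_mean(30) for _ in range(5000)]
print("mean of 5000 estimates:", sum(estimates) / len(estimates))  # ~10

# Consistency: estimates from larger samples fall closer to mu.
for n in (10, 100, 10_000):
    print(n, sample_mean(n))
```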


📘 Maximum Likelihood Estimation (MLE)

Idea

Proposed by R.A. Fisher (1922), MLE is one of the most powerful and widely used methods of estimation.
It works on the principle of choosing the parameter values that maximize the likelihood of observing the given data.


Step-by-Step Procedure

Suppose $X_1, X_2, \dots, X_n$ is a random sample from a distribution with pdf/pmf $f(x|\theta)$, where $\theta$ is an unknown parameter.

  1. Likelihood Function:

    $$L(\theta) = \prod_{i=1}^n f(x_i|\theta)$$

    This is the joint probability of the sample, considered as a function of $\theta$.

  2. Log-Likelihood:
    For easier calculations, take logs:

    $$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i|\theta)$$

  3. First Derivative (Likelihood Equation):

    $$\frac{d\ell(\theta)}{d\theta} = 0$$

    Solving this gives the MLE, $\hat{\theta}$.

  4. Second Derivative Test:
    Ensure

    $$\frac{d^2\ell(\theta)}{d\theta^2} < 0$$

    to confirm a maximum.
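
In practice the likelihood equation often has no closed form, and steps 1–4 are carried out numerically by minimizing $-\ell(\theta)$. Below is a minimal sketch of this approach (assuming NumPy and SciPy are available; the 0/1 data are made up) for the Bernoulli model, whose closed-form answer is derived in Example 1:

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # illustrative 0/1 sample

def neg_log_likelihood(p):
    # -l(p) = -[x ln p + (n - x) ln(1 - p)], summed over the observations
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE  :", res.x)          # ~0.7
print("closed form x/n:", data.mean())    # matches
```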


Example 1: MLE for Bernoulli / Binomial

Let $X \sim \text{Binomial}(n, p)$. Suppose $x$ successes are observed.

  1. Likelihood:

    $$L(p) = \binom{n}{x} p^x (1-p)^{n-x}$$

  2. Log-likelihood:

    $$\ell(p) = x \ln p + (n-x)\ln(1-p)$$

    (The constant $\ln \binom{n}{x}$ is omitted, since it does not depend on $p$.)

  3. Differentiate:

    $$\frac{d\ell}{dp} = \frac{x}{p} - \frac{n-x}{1-p} = 0$$

  4. Solve:

    $$\hat{p} = \frac{x}{n}$$

✅ Thus, the MLE of $p$ is the sample proportion.
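
A quick numerical check (grid search over $p$, with made-up counts) that $L(p)$ indeed peaks at $x/n$:

```python
from math import comb

n, x = 20, 14                        # illustrative: 14 successes in 20 trials

def likelihood(p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=likelihood)   # p maximizing L over the grid
print(p_best, x / n)                 # both ~0.7
```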


Example 2: MLE for Normal Mean ($\mu$)

Suppose $X_1, X_2, \dots, X_n \sim N(\mu, \sigma^2)$, with $\sigma^2$ known.

  1. Likelihood:

    $$L(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

  2. Log-likelihood:

    $$\ell(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

  3. Differentiate:

    $$\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0$$

  4. Solve:

    $$\hat{\mu} = \bar{X}$$

✅ Hence, the MLE of the population mean is the sample mean.
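
Again, a small check in Python (made-up data, $\sigma^2$ assumed known) that $\ell(\mu)$ is largest at $\mu = \bar{X}$:

```python
import math

data = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4]   # illustrative observations
sigma2 = 1.0                             # assumed known variance
n = len(data)

def log_likelihood(mu):
    return (-n / 2) * math.log(2 * math.pi * sigma2) \
           - sum((x - mu) ** 2 for x in data) / (2 * sigma2)

x_bar = sum(data) / n
for mu in (x_bar - 0.5, x_bar, x_bar + 0.5):
    print(f"mu = {mu:.3f}  l(mu) = {log_likelihood(mu):.4f}")  # peak at x_bar
```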


Advantages of MLE

  • Consistency: As $n \to \infty$, $\hat{\theta} \to \theta$.

  • Asymptotic normality: For large samples, the distribution of $\hat{\theta}$ tends to a normal distribution (see the simulation sketch after this list).

  • Efficiency: Attains the Cramér–Rao lower bound asymptotically.

  • General applicability: Works for discrete, continuous, and complex models.
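
A simulation sketch of asymptotic normality and efficiency for the Bernoulli model ($p = 0.3$ and $n = 500$ are arbitrary choices): standardizing $\hat{p} = x/n$ by the Cramér–Rao standard error $\sqrt{p(1-p)/n}$ should give values with mean ≈ 0 and standard deviation ≈ 1.

```python
import random
import statistics

random.seed(1)
p_true, n = 0.3, 500                          # illustrative values

z_values = []
for _ in range(2000):
    x = sum(random.random() < p_true for _ in range(n))
    p_hat = x / n                              # MLE from this simulated sample
    se = (p_true * (1 - p_true) / n) ** 0.5    # sqrt of Cramer-Rao variance
    z_values.append((p_hat - p_true) / se)

print(statistics.mean(z_values))    # ~0
print(statistics.stdev(z_values))   # ~1
```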


Limitations of MLE

  • Can be algebraically complicated (often requires iterative methods).

  • Sensitive to outliers.

  • For small samples, MLE may be biased.
