# Calculating maximum likelihood estimators

If the probability model that describes a population is completely known (including its parameters), then we can use it to obtain information about the population. In the real world, however, this is rarely the case. Instead, we have observed data. We may know that the observed data follow a particular distribution, but its parameters are unknown. In other words, the form of the distribution from which the observed data are drawn is known (perhaps as an assumption), but the specific parameter values are not. The option we have, then, is to use the observed data to estimate the values of the parameters.

One way to estimate the parameters is the method of moments, which is relatively easy to use (for the most part). This is the focus of the practice problem set. In this post, we discuss the method of maximum likelihood estimation.

The method of maximum likelihood estimation is to maximize the probability, or likelihood, of observing the data we collected. Suppose that the form of the distribution is known and its density function is $f(x; \theta_1, \theta_2, \cdots, \theta_k)$, but the $k$ parameters are not known. The goal is to choose the particular member of the assumed parametric distribution family $f(x; \hat{\theta}_1, \hat{\theta}_2, \cdots, \hat{\theta}_k)$ that gives the highest likelihood of the observed data. Let’s consider the exponential distribution as an example.

Exponential Example

Suppose it is known that the size of claims from a large group of insureds has an exponential distribution with unknown mean $\theta$. The density function is $f(x)=\frac{1}{\theta} e^{-x/\theta}$ where $x>0$. We observe $n$ claims $x_1,x_2,\cdots,x_n$. The method of maximum likelihood chooses the value of $\theta$ that makes these observations most likely. The likelihood of observing the data is:

\displaystyle \begin{aligned} L(\theta)&=f(x_1) \cdot f(x_2) \cdots f(x_n) \\&=\frac{1}{\theta} e^{\frac{-x_1}{\theta}} \cdot \frac{1}{\theta} e^{\frac{-x_2}{\theta}} \cdots \frac{1}{\theta} e^{-\frac{x_n}{\theta}} \\&=\frac{1}{\theta^n} \ e^{\frac{-\sum \limits_{i=1}^n x_i}{\theta}} \end{aligned}

The goal is to choose the value of $\theta$ so that the function $L(\theta)$ is as large as possible. In other words, the goal is to maximize the function $L(\theta)$, which is called the likelihood function. In many cases, it is easier to maximize the natural log of $L(\theta)$.

$\displaystyle l(\theta)=\ln[L(\theta)]=-n \ln(\theta)-\frac{\sum \limits_{i=1}^n x_i}{\theta}$

The function $l(\theta)$ is called the log-likelihood function. The $\theta$ for which $l(\theta)$ is maximum is also a value for which $L(\theta)$ is maximum. The following gives the first and second derivatives of $l(\theta)$.

$\displaystyle l'(\theta)=-\frac{n}{\theta}+\frac{\sum \limits_{i=1}^n x_i}{\theta^2}$

$\displaystyle l''(\theta)=\frac{n}{\theta^2}-2 \ \frac{\sum \limits_{i=1}^n x_i}{\theta^3}$

Setting the first derivative equal to zero and solving for $\theta$ gives

$\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i}{n}$

Plugging $\hat{\theta}$ into the second derivative produces a negative value. Thus $\hat{\theta}$ maximizes the log-likelihood $l(\theta)$ and hence the likelihood $L(\theta)$. The value $\hat{\theta}$ is called the maximum likelihood estimate (MLE) of the parameter $\theta$. Viewed as a function of the sample, it is also called the maximum likelihood estimator of $\theta$ (as the observations change, the estimate changes). Note that $\hat{\theta}$ is the mean of the sample $x_1,x_2,\cdots,x_n$. In this instance, the maximum likelihood estimate coincides with the method of moments estimate. Such agreement is the exception rather than the rule; several more examples where the two estimates coincide are discussed below.
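As a quick numerical sanity check (with made-up claim amounts, not data from the post), the exponential log-likelihood can be evaluated on a grid; its maximizer lands at the sample mean:

```python
import math

# Sketch with hypothetical claim amounts: evaluate the exponential
# log-likelihood l(theta) = -n*ln(theta) - sum(x)/theta on a grid of
# theta values and confirm the maximizer sits at the sample mean.
data = [1.2, 0.7, 3.4, 2.1, 0.9, 1.8]
n, total = len(data), sum(data)

def log_likelihood(theta):
    return -n * math.log(theta) - total / theta

grid = [0.01 * k for k in range(1, 1001)]  # theta from 0.01 to 10.00
best = max(grid, key=log_likelihood)
print(best, total / n)  # grid maximizer vs. sample mean
```

The grid maximizer agrees with the sample mean up to the grid spacing, matching the closed-form result $\hat{\theta}=\bar{x}$.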

MLE

As the above example suggests, the first step in maximum likelihood estimation is to write down the likelihood function and then the log-likelihood function (by taking the natural log of the likelihood function). If there is only one parameter, take the derivative of the log-likelihood function, set it equal to zero, and solve for the parameter. If there is more than one parameter in the log-likelihood function, take the partial derivative with respect to each parameter, set the resulting partial derivatives equal to zero, and solve the resulting system of equations.

The likelihood of a data point $x$ (if its value is completely known) is simply the density function evaluated at $x$ (for a continuous distribution) or the probability function evaluated at $x$ (for a discrete distribution). For a given sample $x_1,x_2,\cdots,x_n$, the likelihood function is simply the product of the likelihoods at the individual data points $x_i$.

Another point to keep in mind: when working with the likelihood or log-likelihood function, multiplicative constants that do not involve the parameters can be omitted. This is illustrated by the normal distribution example.
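The point about constants can be checked directly (a sketch with made-up observations): dropping a positive multiplicative constant from the likelihood does not move the maximizer.

```python
import math

# Sketch with hypothetical data: compare the full normal likelihood in mu
# (sigma held at 1) with the version that omits the 1/sqrt(2*pi)^n constant.
# A positive multiplicative constant cannot change where the maximum occurs.
x = [2.0, 3.5, 2.8, 4.1]
n = len(x)

def full_likelihood(mu):
    c = (1 / math.sqrt(2 * math.pi)) ** n
    return c * math.exp(-0.5 * sum((xi - mu) ** 2 for xi in x))

def essential_likelihood(mu):
    return math.exp(-0.5 * sum((xi - mu) ** 2 for xi in x))

grid = [0.01 * k for k in range(100, 501)]  # mu from 1.00 to 5.00
best_full = max(grid, key=full_likelihood)
best_essential = max(grid, key=essential_likelihood)
print(best_full, best_essential)  # both peak at the same grid point
```

Both versions peak at the same grid point, namely the sample mean.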

Normal Example

Observations: $x_1,x_2,\cdots,x_n$. We assume that the data are drawn from a normal distribution with parameters $\mu$ and $\sigma$. The following is the density function.

$\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \ \sigma} \ e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}} \ \ \ \ \ \ \ -\infty<x<\infty$

The following is the full likelihood function.

\displaystyle \begin{aligned} L(\mu,\sigma)&=f(x_1) \cdot f(x_2) \cdots f(x_n) \\&=\frac{1}{(\sqrt{2 \pi})^n} \ \frac{1}{\sigma^n} \ e^{\frac{-\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}} \end{aligned}

The constant $\frac{1}{(\sqrt{2 \pi})^n}$ in the last expression can be dropped: when taking the derivative of the log-likelihood function, the log of this constant becomes zero. Thus the essential likelihood function and log-likelihood function are the following:

$\displaystyle L(\mu,\sigma)=\frac{1}{\sigma^n} \ e^{\frac{-\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}}$

$\displaystyle l(\mu,\sigma)=\ln[L(\mu,\sigma)]=-n \ln(\sigma)-\frac{\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}$

Now take partial derivatives of $l(\mu,\sigma)$, first with respect to $\mu$ and then with respect to $\sigma$.

$\displaystyle \frac{\partial \ l(\mu,\sigma)}{\partial \ \mu}=\frac{\sum \limits_{i=1}^n (x_i-\mu)}{\sigma^2}=\frac{\biggl(\sum \limits_{i=1}^n x_i \biggr) - n \mu }{\sigma^2}=0$

$\displaystyle \frac{\partial \ l(\mu,\sigma)}{\partial \ \sigma}=-\frac{n}{\sigma}+\frac{\sum \limits_{i=1}^n (x_i-\mu)^2}{\sigma^3}=0$

Solving the first equation gives $\hat{\mu}$. Substituting $\hat{\mu}$ into the second equation and using $\sum \limits_{i=1}^n (x_i-\hat{\mu})^2=\sum \limits_{i=1}^n x_i^2 - n \hat{\mu}^2$ gives $\hat{\sigma}^2$.

$\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n x_i}{n} \ \ \ \ \ \ \ \ \ \ \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n x_i^2}{n}- \hat{\mu}^2$

The MLE of the mean $\mu$ of the normal distribution is the sample mean, and the MLE of $\sigma^2$ is the sample variance with divisor $n$ (the biased form of the sample variance).
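As a quick check (hypothetical numbers, not data from the post), the closed-form estimates can be compared with Python's `statistics` module, whose `pvariance` divides by $n$ and therefore agrees with the MLE:

```python
import statistics

# Sketch with made-up observations: compute the normal MLEs from the
# closed-form expressions above and compare with the statistics module.
x = [4.1, 5.6, 3.9, 6.2, 5.0, 4.8]
n = len(x)

mu_hat = sum(x) / n
sigma2_hat = sum(xi ** 2 for xi in x) / n - mu_hat ** 2

# statistics.pvariance divides by n (population variance), matching the MLE;
# statistics.variance would divide by n - 1 and not match.
print(mu_hat, statistics.mean(x))
print(sigma2_hat, statistics.pvariance(x))
```

Both pairs of values agree up to floating-point rounding.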

Formulas

The MLE method does not always yield a closed-form solution. For some distributions, the only way to obtain MLE estimates is through a software package. The following list gives several distributions for which the MLE calculation is accessible. The list is by no means exhaustive. For several of the distributions in the list (the exponential and the Poisson, for example), the MLE coincides with the method of moments estimate.


Exponential Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\theta} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ \ x>0$
MLE Estimate $\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Inverse Exponential Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{\theta \ e^{-\theta / x} }{x^2} \ \ \ \ \ \ \ \ x>0$
MLE Estimate $\displaystyle \hat{\theta}=\frac{n }{\sum \limits_{i=1}^n 1/x_i}$

Normal Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \ \sigma} \ e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}} \ \ \ \ -\infty<x<\infty$
MLE Estimate $\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n x_i}{n}$
MLE Estimate $\displaystyle \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n x_i^2}{n}- \hat{\mu}^2$

Lognormal Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \ \sigma \ x} \ e^{-\frac{1}{2} \frac{(\ln(x)-\mu)^2}{\sigma^2}} \ \ \ \ 0<x<\infty$
MLE Estimate $\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n \ln(x_i)}{n}$
MLE Estimate $\displaystyle \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n [\ln(x_i)]^2}{n}- \hat{\mu}^2$

Pareto Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{\alpha \ \theta^\alpha}{(x+\theta)^{\alpha+1}} \ \ \ \ \ \ x>0, \ \ \theta \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\alpha}=\frac{n }{\ln \biggl(\prod \limits_{i=1}^n (\theta+x_i) \biggr)- n \ \ln(\theta) }$

Weibull Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\tau \ \frac{x^{\tau-1}}{\theta^\tau} \ e^{- x^\tau / \theta^\tau} \ \ \ \ \ \ x>0, \ \ \tau \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\theta}=\biggl[\frac{\sum \limits_{i=1}^n x_i^\tau}{n} \biggr]^{1/\tau}$

Uniform Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\theta} \ \ \ \ \ \ 0<x<\theta$
MLE Estimate $\displaystyle \hat{\theta}=\text{max}(x_1,x_2,\cdots,x_n)$

Gamma Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \frac{1}{\theta^\alpha} \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ x>0, \ \ \alpha \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{\alpha \ n}$
Fitted Mean $\displaystyle \alpha \ \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Binomial Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\binom{m}{x} \ p^x \ (1-p)^{m-x} \ \ \ \ \ \ x=0,1,2,\cdots,m, \ \ m \ \text{fixed}$
MLE Estimate $\displaystyle \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{m \cdot n}$
Fitted Mean $\displaystyle m \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Poisson Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ x=0,1,2,3,\cdots$
MLE Estimate $\displaystyle \hat{\lambda}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Negative Binomial Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\binom{r+x-1}{x} \ p^r \ (1-p)^{x} \ \ \ \ \ \ x=0,1,2,3,\cdots, \ \ r \ \text{fixed}$
MLE Estimate $\displaystyle \hat{p}=\frac{r }{r+\frac{1}{n} \sum \limits_{i=1}^n x_i}=\frac{r }{r+\overline{x}}$
Fitted Mean $\displaystyle \frac{r \ (1-\hat{p})}{\hat{p}}=\frac{1}{n} \sum \limits_{i=1}^n x_i=\overline{x}$
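As a usage sketch (made-up claim data, not from the post), several of the closed-form entries above can be applied directly; the Pareto estimate, which is less obvious, is also cross-checked with a crude golden-section maximization of its log-likelihood:

```python
import math

# Sketch with hypothetical claim sizes: apply closed-form MLE formulas from
# the list above, then cross-check the Pareto entry (theta fixed) numerically.
x = [1.5, 2.3, 0.9, 4.1, 3.2]
n = len(x)

theta_exp = sum(x) / n                              # exponential
theta_invexp = n / sum(1 / xi for xi in x)          # inverse exponential
mu_ln = sum(math.log(xi) for xi in x) / n           # lognormal mu
tau = 2.0                                           # Weibull, tau fixed
theta_wb = (sum(xi ** tau for xi in x) / n) ** (1 / tau)

# Pareto with theta fixed: closed form vs. numeric maximization.
theta_p = 2.0
S = sum(math.log(theta_p + xi) for xi in x)
alpha_closed = n / (S - n * math.log(theta_p))

def pareto_loglik(alpha):
    # log of prod alpha * theta^alpha / (theta + x_i)^(alpha+1)
    return n * math.log(alpha) + n * alpha * math.log(theta_p) - (alpha + 1) * S

# Golden-section search; the Pareto log-likelihood is concave in alpha.
lo, hi = 1e-6, 50.0
phi = (math.sqrt(5) - 1) / 2
for _ in range(200):
    m1, m2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
    if pareto_loglik(m1) < pareto_loglik(m2):
        lo = m1
    else:
        hi = m2
alpha_numeric = (lo + hi) / 2

print(theta_exp, theta_invexp, theta_wb, alpha_closed, alpha_numeric)
```

The numeric maximizer agrees with the closed-form $\hat{\alpha}$ to many decimal places, which is a useful way to verify any entry in the list.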

Remarks

The observed data discussed in all the above examples and formulas are complete data (or individual data). In this scenario, each data point in the data set $x_1,x_2,\cdots,x_n$ is known. In other words, the data are not grouped (not summarized in any way), not censored, and not truncated. For claims data in the form of individual data, no deductible or other insurance coverage modification has been applied. Complete data, or individual data, is exactly as recorded. The next post discusses how to calculate the MLE for grouped data and for censored or truncated data.


$\copyright$ 2018 – Dan Ma
