Practice Problem Set 4 – estimating parameters of discrete distributions

This practice problem set reinforces the topic discussed in this post: estimating parameters of discrete distributions.

Other posts on parameter estimation focus on continuous distributions – this one and this one. Two practice problem sets, Practice Problem Set 2 and Practice Problem Set 3, reinforce these two previous posts.

.

Practice Problem 4-A

The following table gives information on claim frequency data of a group of insureds.

Frequency # of Insureds
0 39
1 25
2 20
3 8
4 4
5 3
6 1
7+ 0

A Poisson distribution with mean $\lambda$ is fitted to the claim frequency data.

• Determine the maximum likelihood estimate of the parameter $\lambda$.
• Determine the probability of having at least one claim.

.

Practice Problem 4-B

The following table gives information on claim frequency data of a group of insureds.

Frequency # of Insureds
0 39
1 25
2 20
3 7
4+ 9

A Poisson distribution with mean $\lambda$ is fitted to the claim frequency data using maximum likelihood estimation.

• Determine the log-likelihood function.
• Determine the equation obtained by setting the derivative of the log-likelihood function equal to zero. Note that this is the equation for determining the maximum likelihood estimate of $\lambda$. However, solving this equation requires using numerical methods.

.

Practice Problem 4-C

The probability function for the number of losses for a given insured in a year is given by the following:

$\displaystyle P[X=x]=\binom{r+x-1}{x} \ p^r \ (1-p)^x \ \ \ \ \ \ x=0,1,2,\cdots$

where $r=3$ and the parameter $p$ is unknown. The following shows the numbers of losses for five insureds in one year: 0, 2, 3, 1, 3.

• Use the method of maximum likelihood estimation to estimate the parameter $p$.
• Determine the probability of observing zero claims according to the fitted distribution.

.

Practice Problem 4-D

The observed claim frequency data of a group of policyholders is given in the following table.

Frequency # of Insureds
0 94
1 64
2 32
3 7
4 3

A binomial distribution with parameters $m=4$ and $p$ is fitted to the given claim frequency data.

• Estimate the parameter $p$ using maximum likelihood estimation.
• Determine the probability of observing zero claims or 1 claim according to the fitted distribution.

.

Practice Problem 4-E

A large group of insureds is made up of two groups – a low risk group and a high risk group. The annual claim frequency for an insured in the low risk group has a Poisson distribution with mean $\lambda$. The annual claim frequency for an insured in the high risk group has a Poisson distribution with mean $2 \lambda$. Ten insureds are observed for 5 years (5 insureds in each group). Their claim counts are as follows:

Low Risk Group: 0, 2, 1, 0, 3
High Risk Group: 1, 0, 2, 3, 1

Estimate the parameter $\lambda$ using maximum likelihood estimation.

.

Practice Problem 4-F

The number of claims in a year for an insured has a distribution whose probability function is given by the following:

$\displaystyle P[X=x]=\biggl(\frac{1}{1+\theta} \biggr) \ \biggl(\frac{\theta}{1+\theta} \biggr)^x \ \ \ \ \ \ x=0,1,2,3,\cdots$

Out of a group of 100 insureds that have been observed for one year, 55 of them have no claims, 25 of them have exactly 1 claim and 20 of them have 2 or more claims.

• Estimate the parameter $\theta$ using maximum likelihood estimation.
• Determine the probability that there are two or more claims in a year for a randomly chosen insured.

.

Practice Problem 4-G

Two groups of insureds are pooled for the purpose of maximum likelihood estimation. The number of claims per year for Group 1 follows a binomial distribution with parameters $n=12$ and $p$. The number of claims per year for Group 2 follows a binomial distribution with parameters $n=20$ and $p$. In observing these two groups for 3 years, there are 15 claims from Group 1 and 28 claims from Group 2.

Estimate the parameter $p$ using maximum likelihood estimation.

.

Practice Problem 4-H

The numbers of claims from 5 policyholders are: 2, 3, 1, 5, 5.

A zero-truncated geometric distribution is fitted to the claim data using maximum likelihood estimation. Determine the estimated probability that the number of claims is at least 2.

.

Practice Problem 4-I

The observed claim frequency data for a group of 105 insureds is given below.

Frequency # of Insureds
0 40
1 24
2 20
3 9
4 5
5 4
6 3
7+ 0

Which of the (a,b,0) distributions (Poisson, binomial, negative binomial) is the most appropriate fit to the claim frequency data? Answer this question from the following two angles.

• Comparing the sample mean and the sample variance.
• Compute the ratio $k \frac{n_k}{n_{k-1}}$ where $n_k$ is the number of insureds having $k$ claims. Plot these ratios against $k$. Observe the slope of the plot.

.

Practice Problem 4-J

When fitting a binomial distribution with both parameters unknown, maximum likelihood estimation using a log-likelihood profile is demonstrated in Example 7 and Example 8 in this post. Show that this approach does not work for the claim frequency data in Problem 4-D.

Problem Answer
4-A
• $\displaystyle \hat{\lambda}=\frac{126}{100}=1.26$
• $\displaystyle 1-e^{-1.26}=0.7163$
4-B
• $\displaystyle \ln(L)=86 \ \ln(\lambda)-91 \ \lambda+9 \ \ln \biggl(1-e^{-\lambda}-\lambda e^{-\lambda}-\frac{\lambda^2}{2} e^{-\lambda}-\frac{\lambda^3}{6} e^{-\lambda} \biggr)$
• $\displaystyle \frac{86}{\lambda}-91+\frac{9}{1-e^{-\lambda}-\lambda e^{-\lambda}-\frac{\lambda^2}{2} e^{-\lambda}-\frac{\lambda^3}{6} e^{-\lambda}} \biggl(\frac{\lambda^3}{6} e^{-\lambda} \biggr)=0$
4-C
• $\displaystyle \hat{p}=\frac{5}{8}$
• $\displaystyle \biggl(\frac{5}{8} \biggr)^3=0.24414$
4-D
• $\displaystyle \hat{p}=\frac{161}{800}=0.20125$
• $\displaystyle P(X=0,1)=0.8172770109$
4-E
• $\displaystyle \hat{\lambda}=\frac{13}{15}$
4-F
• $\displaystyle \hat{\theta}=\frac{13}{16}$
• $\displaystyle \biggl(\frac{13}{29}\biggr)^2=0.20095$
4-G
• $\displaystyle \hat{p}=\frac{43}{96}$
4-H
• $\displaystyle \hat{p}=\frac{5}{16}$
• $\displaystyle \frac{11}{16}$

actuarial practice problems

Dan Ma actuarial

Daniel Ma actuarial

Daniel Ma Math

Daniel Ma Mathematics

Actuarial exam

$\copyright$ 2019 – Dan Ma

Estimating parameters of discrete distributions

Previous posts focus on maximum likelihood estimation for continuous distributions (this post and this post). In this post we shift the attention to parameter estimation for discrete distributions, in particular, the three commonly used discrete distributions – Poisson, binomial and negative binomial.

Practice problems to reinforce concepts discussed here are found here.

Practice problems for maximum likelihood estimation for continuous distributions are found here and here.

In fitting a discrete distribution to observed data, we focus on two procedures – method of moments and maximum likelihood estimation.

For method of moments estimation, we equate the sample mean with the population mean for distributions with one parameter (e.g. Poisson). For distributions with two parameters (e.g. negative binomial), we equate the sample mean with the population mean and the sample variance with the population variance. For two-parameter distributions, we can also equate the sample second moment with the population second moment instead of equating the variances.

For maximum likelihood estimation (MLE), the idea is similar to MLE for continuous distributions. In the discrete case, use the probability function (or probability mass function) to set up a likelihood function instead of the probability density function. The rest of the procedure works similarly – take the natural log of the likelihood function, take derivative(s) and solve the equation(s) resulting from equating the derivative(s) to zero. In addition to using examples, we point out the issues in implementing MLE for negative binomial distribution and binomial distribution.

Poisson Distribution

The Poisson distribution has only one parameter $\lambda$, which is the mean of the distribution. When complete data is available, the method of moments estimate of $\lambda$ would be the sample mean and the maximum likelihood estimate of $\lambda$ is also the sample mean. Thus for the Poisson distribution, the method of moments estimate coincides with the maximum likelihood estimate in the presence of complete data. However, when the sample data is not complete data (e.g. grouped data, censored data or truncated data), the maximum likelihood estimate of $\lambda$ does not equal the method of moments estimate.

Example 1
The claim frequency data of 100 insureds is given in the following table.

# of Claims # of Insureds
0 40
1 24
2 20
3 8
4 5
5 3
6+ 0
Total 100

A Poisson distribution is fitted to the claim frequency data using maximum likelihood estimation. Determine the resulting estimate of the probability of having zero claims.

The sample mean frequency is:

$\displaystyle \overline{x}=\frac{0 \cdot 40+1 \cdot 24+2 \cdot 20+3 \cdot 8+4 \cdot 5+5 \cdot 3}{100}=\frac{123}{100}=1.23$

The method of moments estimate of the mean $\lambda$ is $\hat{\lambda}=1.23$, which is also the maximum likelihood estimate. The estimated probability of having zero claims is $e^{-1.23}=0.29229$.
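The calculation in Example 1 can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the original example.

```python
import math

# Claim frequency data from Example 1: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

n = sum(counts.values())                                  # sample size
lam_hat = sum(k * n_k for k, n_k in counts.items()) / n   # sample mean = Poisson MLE
p_zero = math.exp(-lam_hat)                               # fitted P(X = 0)

print(f"n          = {n}")
print(f"lambda_hat = {lam_hat:.2f}")
print(f"P(X = 0)   = {p_zero:.5f}")
```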

Example 2

The following table gives the claim frequency data of a group of insureds.

# of Claims # of Insureds
0 or 1 26
2 12
3 3
4+ 0

Fit the Poisson distribution to the claim frequency data using maximum likelihood. Determine the estimated probability of observing 0 or 1 claim.

Since the given observed claim frequency data is not complete data, do not equate the maximum likelihood estimate with the sample mean. In any case, the sample mean is a little murky since we do not know how many of the 26 insureds have zero claims. The probability of 0 or 1 claim is $P(X=0,1)=e^{-\lambda}(1+\lambda)$. The likelihood function is given by the following.

$\displaystyle \begin{aligned} L(\lambda)&=\biggl(e^{-\lambda}(1+\lambda) \biggr)^{26} \biggl(\frac{1}{2!} \ \lambda^2 e^{-\lambda} \biggr)^{12} \biggl(\frac{1}{3!} \ \lambda^3 e^{-\lambda}\biggr)^3 \\&=C \ (1+\lambda)^{26} \ \lambda^{33} \ e^{-41 \lambda} \end{aligned}$

The $C$ in $L(\lambda)$ is a multiplicative constant term that can be ignored. The following gives the log-likelihood function and its derivative.

$\displaystyle l(\lambda)=26 \ \ln(1+\lambda) +33 \ \ln(\lambda)-41 \ \lambda$

$\displaystyle \frac{d}{d \lambda} \ l(\lambda)=\frac{26}{1+\lambda}+\frac{33}{\lambda}-41=0$

Setting the derivative equal to zero leads to the quadratic equation $41 \lambda^2-18 \lambda -33=0$. Solving this equation produces the following estimate of $\lambda$ and the estimated probability.

$\hat{\lambda}=1.14313$.

$P(X=0,1)=e^{-\hat{\lambda}} (1+\hat{\lambda})=0.68327$
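As a check on Example 2, the sketch below solves the quadratic equation $41 \lambda^2-18 \lambda -33=0$ for its positive root and confirms that the log-likelihood is lower on either side of that root. The function name `log_likelihood` is an illustrative choice, and the step size 0.01 used in the check is arbitrary.

```python
import math

def log_likelihood(lam):
    # l(lambda) = 26 ln(1 + lambda) + 33 ln(lambda) - 41 lambda (constant dropped)
    return 26 * math.log(1 + lam) + 33 * math.log(lam) - 41 * lam

# Positive root of 41 lambda^2 - 18 lambda - 33 = 0
a, b, c = 41.0, -18.0, -33.0
lam_hat = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# Fitted probability of observing 0 or 1 claim
p_0_or_1 = math.exp(-lam_hat) * (1 + lam_hat)

# Sanity check: the log-likelihood drops when moving away from the root
is_local_max = (
    log_likelihood(lam_hat) > log_likelihood(lam_hat - 0.01)
    and log_likelihood(lam_hat) > log_likelihood(lam_hat + 0.01)
)

print(f"lambda_hat   = {lam_hat:.5f}")
print(f"P(X = 0, 1)  = {p_0_or_1:.5f}")
print(f"local max?   = {is_local_max}")
```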

Negative Binomial Distribution

The negative binomial distribution has two parameters. We consider two parametrizations of the negative binomial distribution.

(1)……$\displaystyle P(X=k)=\binom{r+k-1}{k} \ p^r \ (1-p)^k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ k=0,1,2,3,\cdots$

(2)……$\displaystyle P(X=k)=\binom{r+k-1}{k} \ \biggl(\frac{1}{1+\theta} \biggr)^r \ \biggl(\frac{\theta}{1+\theta} \biggr)^k \ \ \ \ \ \ k=0,1,2,3,\cdots$

Depending on the version, the negative binomial parameters are either $r$ and $p$ or $r$ and $\theta$. To get ready for method of moments estimation, note the population mean and variance in the two versions.

(3)……$\displaystyle \mu=r \ \frac{1-p}{p} \ \ \ \ \ \ \ \ \ \ \sigma^2=r \ \frac{1-p}{p^2}$

(4)……$\displaystyle \mu=r \ \theta \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma^2=r \ \theta (1+\theta)$

Equating the sample mean $\overline{x}$ with $\mu$ and the sample variance $\hat{\sigma}^2$ with $\sigma^2$ produces the following method of moments estimates.

(5)……$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}} \ \ \ \ \ \ \ \ \ \ \ \hat{p}=\frac{\overline{x}}{\hat{\sigma}^2}$

(6)……$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}} \ \ \ \ \ \ \ \ \ \ \ \hat{\theta}=\frac{\hat{\sigma}^2-\overline{x}}{\overline{x}}$

The estimates in (5) are the method of moments estimates for the negative binomial distribution as described in (1). The estimates in (6) are the method of moments estimates for the negative binomial distribution as described in (2). For both cases to work, the sample variance must exceed the sample mean, i.e. $\hat{\sigma}^2>\overline{x}$. In both (5) and (6), $\hat{\sigma}^2$ is the biased sample variance, i.e. the one obtained by dividing by the sample size $n$ rather than $n-1$.

Example 3
Use the sample data in Example 1. Fit a negative binomial distribution to the observed claim frequency data using the method of moments. Determine the probability of observing zero claims according to the fitted distribution.

From Example 1, $\overline{x}=1.23$. The sample second moment is

$\displaystyle \frac{24 \cdot 1+20 \cdot 2^2+8 \cdot 3^2+5 \cdot 4^2+3 \cdot 5^2}{100}=\frac{331}{100}=3.31$

The sample variance is the second moment minus the square of the sample mean.

$\displaystyle \hat{\sigma}^2=3.31-1.23^2=3.31-1.5129=1.7971$

According to (5), the estimates of $r$ and $p$ are:

$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}}=\frac{1.23^2}{1.7971-1.23}=\frac{1.5129}{0.5671}=2.66778$

$\displaystyle \hat{p}=\frac{\overline{x}}{\hat{\sigma}^2}=\frac{1.23}{1.7971}=0.68444$

Then $P_0$, the probability of observing zero claims, is $\hat{p}^{\hat{r}}=0.36367$.
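The method of moments formulas in (5) can be sketched in a few lines. The script below is an illustration with assumed variable names. Note that the biased sample variance equals the second raw moment (3.31 for this data) minus the square of the sample mean.

```python
# Claim frequency data from Example 1: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

n = sum(counts.values())
mean = sum(k * n_k for k, n_k in counts.items()) / n
second_moment = sum(k * k * n_k for k, n_k in counts.items()) / n
variance = second_moment - mean ** 2     # biased sample variance (divide by n)

# The formulas in (5) only make sense when the variance exceeds the mean
if variance <= mean:
    raise ValueError("method of moments fails: sample variance <= sample mean")

r_hat = mean ** 2 / (variance - mean)
p_hat = mean / variance
p_zero = p_hat ** r_hat                  # P(X = 0) = p^r under version (1)

print(f"mean = {mean:.4f}, second moment = {second_moment:.4f}, variance = {variance:.4f}")
print(f"r_hat = {r_hat:.5f}, p_hat = {p_hat:.5f}, P(X = 0) = {p_zero:.5f}")
```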

When both parameters are unknown, maximum likelihood estimation for the negative binomial distribution requires using a numerical software package. The following example demonstrates why.

Example 4
The observed claim counts for three insureds: 0, 1, 2. Fit a negative binomial distribution to the observed data.

The likelihood function is based on the probability function in (1).

$\displaystyle \begin{aligned} L=L(r,p)&=p^r \cdot \binom{r}{1} p^r (1-p)^1 \cdot \binom{r+1}{2} \ p^r \ (1-p)^2 \\&=\frac{1}{2} \ r^2 \ (r+1) \ p^{3r} \ (1-p)^3 \end{aligned}$

$\displaystyle l=\ln L=2 \ \ln r+\ln (r+1)+3 r \ln p+3 \ln (1-p)$

Take the partial derivatives with respect to both parameters and set them equal to zero.

$\displaystyle \frac{\partial}{\partial r} \ l=\frac{2}{r}+\frac{1}{r+1}+3 \ln p=0$

$\displaystyle \frac{\partial}{\partial p} \ l=\frac{3 r}{p}-\frac{3}{1-p}=0$

The second equation gives $r$ in terms of $p$. Substituting this into the first equation produces an equation in $p$ alone.

$\displaystyle r =\frac{p}{1-p}$

$\displaystyle 2 \frac{1-p}{p}+1-p+3 \ln p =0$

Solving for $p$ in the last equation would require numerical techniques.

In light of Example 4, we do not focus on MLE for the case that both of the negative binomial parameters are unknown. When the $r$ parameter is known, maximum likelihood estimation works like method of moments in that the product of the two parameters $r$ and $\theta$ is the sample mean.

Example 5
Using maximum likelihood estimation, fit the negative binomial distribution with parameters $r=2$ and $\theta$ to the claim frequency data in Example 1. Determine the probability of observing zero claims according to the fitted distribution.

With $r \theta=\overline{x}$, we have $\hat{\theta}=\frac{\overline{x}}{2}=\frac{1.23}{2}=0.615$. Then the probability of observing zero claims is $P(X=0)=(\frac{1}{1+\hat{\theta}})^{2}=\frac{1}{1.615^2}=0.3834$.
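The relation $r \theta=\overline{x}$ used above can be verified directly. As a brief sketch, suppose $r$ is known and $x_1,x_2,\cdots,x_n$ is the sample. Using version (2) and dropping the additive constant that does not involve $\theta$, the log-likelihood and its derivative are:

$\displaystyle l(\theta)=\biggl(\sum \limits_{i=1}^n x_i \biggr) \ln(\theta)-\biggl(n \ r+\sum \limits_{i=1}^n x_i \biggr) \ln(1+\theta)$

$\displaystyle \frac{d}{d \theta} \ l(\theta)=\frac{\sum \limits_{i=1}^n x_i}{\theta}-\frac{n \ r+\sum \limits_{i=1}^n x_i}{1+\theta}=0$

Clearing the denominators gives $\sum \limits_{i=1}^n x_i=n \ r \ \theta$, so that $r \ \hat{\theta}=\overline{x}$.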

Binomial Distribution

The binomial distribution has two parameters $m$ and $p$ where $m$ is a positive integer and $p$ is a real number between 0 and 1. This is a model for counting the number of successes in performing a series of $m$ independent Bernoulli trials (a Bernoulli trial is a random experiment in which there are two distinct outcomes called success and failure). Usually the $m$ parameter is denoted by $n$. However, we already use $n$ to mean the sample size. So the parameters of the binomial distribution are $m$ and $p$. The following is the probability function.

(7)……$\displaystyle P(X=k)=\binom{m}{k} \ p^k \ (1-p)^{m-k} \ \ \ \ \ \ \ \ \ \ \ k=0,1,2,\cdots,m$

The mean of the binomial distribution is $\mu=m \ p$ and its variance is $\sigma^2=m \ p \ (1-p)$. When both parameters $m$ and $p$ are unknown, we can use the method of moments estimation. However, the estimate $\hat{m}$ will usually not be an integer. In that case, the compromise is to round $\hat{m}$ to the nearest integer. This is one pitfall of working with an integer-valued parameter.

For maximum likelihood estimation, let’s start with the simpler case that $m$ is known. In this case the parameter $p$ is the only one that needs to be estimated. Suppose that $x_1,x_2,\cdots,x_n$ is the sample data where $0 \le x_i \le m$ for each $i$. Then the maximum likelihood estimator of $p$ is given by

(8)……$\displaystyle \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{m \ n}$

There is a handy way to interpret the maximum likelihood estimate $\hat{p}$. Each data point $x_i$ is an observed number of successes when performing $m$ Bernoulli trials. In the sample of size $n$, $m \ n$ is the total number of trials. The sum of all the $x_i$ would be the total number of successes out of the $m \ n$ trials. Thus $\hat{p}$ is the sample proportion of successes.

Based on (8), $m \hat{p}=\overline{x}$. When the parameter $m$ is known, the maximum likelihood estimate $\hat{p}$ is also the method of moments estimate.

When both $m$ and $p$ are not known, the maximum likelihood estimation of $m$ and $p$ is done by creating a likelihood profile for various possible values of $m$. A possible value of $m$ has to be at least as large as the largest binomial observation. The steps for creating a likelihood profile are as follows:

1. Start with the value of $m$ that is the largest observed value.
2. Using the chosen $m$, calculate $\hat{p}$ according to (8).
3. Evaluate the log-likelihood at $\hat{p}$.
4. Increase $m$ by 1.
5. Repeat Step 2 to Step 4 until a maximum in log-likelihood is found.

For the likelihood profile approach to work, the sample variance must be less than the sample mean. Otherwise, the log-likelihood values keep increasing as $m$ increases and no maximum is found (see Problem 4-J here).

Example 6
Claim frequency data has been collected from 100 insureds and is given in the following table.

# of Claims # of Insureds
0 30
1 40
2 25
3 5
4+ 0

Fit the binomial distribution to the given claim frequency data using the method of moments.

The following gives the sample mean and sample variance.

$\displaystyle \overline{x}=\frac{0 \cdot 30+1 \cdot 40+2 \cdot 25+3 \cdot 5}{100}=\frac{105}{100}=1.05$

$\displaystyle \begin{aligned} \hat{\sigma}^2&=\frac{30 \ (0-1.05)^2+40 \ (1-1.05)^2 +25 \ (2-1.05)^2+5 \ (3-1.05)^2}{100} \\&=\frac{74.75}{100}=0.7475 \end{aligned}$

Note that the sample variance is less than the sample mean. It is then possible to fit a binomial distribution to the observed data. This fact is crucial for performing maximum likelihood estimation (the next two examples). The following steps give the method of moments estimates.

$\displaystyle m \ p=1.05$

$\displaystyle m \ p \ (1-p)=0.7475$

$\displaystyle 1-\hat{p}=\frac{0.7475}{1.05} \ \ \ \ \rightarrow \ \ \ \ \hat{p}=1-\frac{0.7475}{1.05}=0.28810$

$\displaystyle \hat{m}=\frac{1.05}{\hat{p}}=3.6446$

Since the calculated $\hat{m}$ is not an integer, round $\hat{m}$ to 4. As a result, the method of moments estimates are $\hat{m}=4$ and $\hat{p}=0.2625$.
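The method of moments steps in Example 6 can be sketched as follows. The variable names are illustrative. Rounding $\hat{m}$ to the nearest integer and then recomputing $\hat{p}$ from the sample mean follows the compromise described above.

```python
# Claim frequency data from Example 6: number of claims -> number of insureds
counts = {0: 30, 1: 40, 2: 25, 3: 5}

n = sum(counts.values())
mean = sum(k * n_k for k, n_k in counts.items()) / n
variance = sum(n_k * (k - mean) ** 2 for k, n_k in counts.items()) / n  # biased

# Solve m p = mean and m p (1 - p) = variance
p_mom = 1 - variance / mean
m_mom = mean / p_mom

# m must be an integer: round it, then re-estimate p so that m p = mean
m_hat = round(m_mom)
p_hat = mean / m_hat

print(f"mean = {mean:.4f}, variance = {variance:.4f}")
print(f"unrounded: m = {m_mom:.4f}, p = {p_mom:.5f}")
print(f"rounded:   m = {m_hat}, p = {p_hat:.4f}")
```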

Example 7
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood estimation. Assume that $m$ is known with $m$ ranging from 3 to 8.

The maximum likelihood estimate of $p$ can be obtained by formula (8). The estimates are:

$\displaystyle m=3 \ \ \ \ \ \ \hat{p}=\frac{105}{3 \cdot 100}=\frac{105}{300}=0.35$

$\displaystyle m=4 \ \ \ \ \ \ \hat{p}=\frac{105}{4 \cdot 100}=\frac{105}{400}=0.2625$

$\displaystyle m=5 \ \ \ \ \ \ \hat{p}=\frac{105}{5 \cdot 100}=\frac{105}{500}=0.21$

$\displaystyle m=6 \ \ \ \ \ \ \hat{p}=\frac{105}{6 \cdot 100}=\frac{105}{600}=0.175$

$\displaystyle m=7 \ \ \ \ \ \ \hat{p}=\frac{105}{7 \cdot 100}=\frac{105}{700}=0.15$

$\displaystyle m=8 \ \ \ \ \ \ \hat{p}=\frac{105}{8 \cdot 100}=\frac{105}{800}=0.13125$

Example 8
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood. Assume that both parameters $m$ and $p$ are unknown. The maximum likelihood estimation is performed by creating a likelihood profile as described above.

The largest observation in the sample is 3 (there are 5 such observations). In creating the likelihood profile, the starting value of $m$ is 3. Use this $m$ value to set up the likelihood function $L$ and the corresponding log-likelihood function $l$. Then evaluate $l$ at $\hat{p}=0.35$ (0.35 is found in Example 7).

$\displaystyle \begin{aligned} L&=\biggl((1-p)^3 \biggr)^{30} \biggl(3 \cdot p^1 \cdot (1-p)^2 \biggr)^{40} \biggl(3 \cdot p^2 \cdot (1-p)^1 \biggr)^{25} \biggl(p^3 \biggr)^{5} \\&=3^{40} \cdot 3^{25} \cdot p^{105} \cdot (1-p)^{195} \end{aligned}$

$\displaystyle l=\ln(L)=40 \ \ln(3)+25 \ \ln(3)+105 \ \ln(p)+195 \ \ln(1-p)$

$\displaystyle \begin{aligned} l(\hat{p})=l(0.35)&=40 \ \ln(3)+25 \ \ln(3)+105 \ \ln(0.35)+195 \ \ln(0.65) \\&=-122.8241929 \end{aligned}$

Next, perform the same process using $m=4$. The process is continued until a maximum in log-likelihood is found. The following table shows the results.

$\hat{m}$ $\hat{p}$ log-likelihood
3 0.35 -122.8241929
4 0.2625 -123.0850
5 0.21 -123.523266
6 0.175 -123.8856
7 0.15 -124.171543
8 0.13125 -124.4007318

The log-likelihood is the greatest at the starting value of $m=3$. The log-likelihood decreases as $m$ increases. Thus the maximum likelihood estimates are $\hat{m}=3$ and $\hat{p}=0.35$.
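The likelihood profile in Example 8 can be sketched in code. This is a minimal illustration, not a general-purpose routine: the function name `profile_log_likelihood` is an assumption, and the scan simply covers $m=3$ to $m=8$ as in the table rather than stopping automatically at the first maximum. It needs Python 3.8 or later for `math.comb`.

```python
import math

# Claim frequency data from Example 6: number of claims -> number of insureds
counts = {0: 30, 1: 40, 2: 25, 3: 5}
n = sum(counts.values())
total_claims = sum(k * n_k for k, n_k in counts.items())

def profile_log_likelihood(m):
    """Return (p_hat, log-likelihood at p_hat) for a binomial with m trials."""
    p_hat = total_claims / (m * n)       # formula (8)
    ll = 0.0
    for k, n_k in counts.items():
        log_pk = (
            math.log(math.comb(m, k))
            + k * math.log(p_hat)
            + (m - k) * math.log(1 - p_hat)
        )
        ll += n_k * log_pk
    return p_hat, ll

# Start at the largest observed value and scan upward
m_start = max(counts)
profile = {m: profile_log_likelihood(m) for m in range(m_start, 9)}
m_hat = max(profile, key=lambda m: profile[m][1])

for m, (p_hat, ll) in profile.items():
    print(f"m = {m}  p_hat = {p_hat:.5f}  log-likelihood = {ll:.4f}")
print(f"maximum at m = {m_hat}")
```

For data whose sample variance is not less than the sample mean, a scan like this would show the log-likelihood still rising at the last value of $m$ tried, which is the signal that the approach does not apply.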

Other Considerations

Poisson, binomial and negative binomial are three commonly used discrete distributions. One important distinction among these three distributions is that the mean and variance are identical for the Poisson distribution, the mean is greater than the variance for the binomial distribution and the mean is less than the variance for the negative binomial distribution. Thus we have the following observation.

In examining sample data for discrete distributions, we should compare the sample mean and sample variance. If the sample mean and the sample variance are roughly the same, then Poisson might be a good fit. If the sample mean is greater than the sample variance, the binomial distribution might be a good fit. If the sample mean is less than the sample variance, the negative binomial distribution might be a good fit.

The universe of discrete distributions is larger than the three commonly used discrete distributions. However, the guideline described in the above paragraph is a good starting point in the modeling process.

For the sample claim frequency data in Example 1, the sample mean is 1.23 and the sample variance is $3.31-1.23^2=1.7971$. Since the sample variance exceeds the sample mean, among the three distributions of Poisson, binomial and negative binomial, the negative binomial distribution best represents the data. For the sample claim frequency data in Example 6, the binomial distribution best represents the data since the sample variance is significantly less than the sample mean.

The above observation about comparing the sample mean and sample variance is a useful one. When fitting a Poisson, binomial or negative binomial distribution, there is another technique that is more refined. The key is to consider these distributions as members of the (a,b,0) class of distributions (the (a,b,0) class is introduced here). The distributions in the (a,b,0) class are characterized by the following recursive relation.

(9)……$\displaystyle \frac{P_k}{P_{k-1}}=a+\frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The notation $P_k$ refers to the probability that the distribution takes on the value of $k$. For any member of the (a,b,0) class, the probabilities can be generated according to (9) for some constants $a$ and $b$. The three commonly used discrete distributions – Poisson, binomial, and negative binomial – are (a,b,0) distributions. This means that any one of these distributions can be generated recursively using (9). See Table 1 in this post for the $a$ and $b$ associated with each of the three distributions. The relation (9) can be rearranged as follows:

(10)……$\displaystyle k \ \frac{P_k}{P_{k-1}}=a k +b \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The relation (10) says that the ratio $\frac{k \ P_k}{P_{k-1}}$ is a linear function of $k$ with the slope being $a$ and the y-intercept being $b$. If the (a,b,0) distribution is a Poisson distribution, then $a=0$. If the (a,b,0) distribution is a negative binomial distribution, then $a>0$. If the (a,b,0) distribution is a binomial distribution, then $a<0$. Thus the slope in (10) is an indicator of the (a,b,0) distribution.

Using observed data, $P_k$ is estimated by the ratio $\frac{n_k}{n}$ where $n$ is the sample size and $n_k$ is the number of observations that equal $k$. Then relation (10) is approximated by the following.

(11)……$\displaystyle k \ \frac{n_k}{n_{k-1}}=a k +b \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

If the sample data is drawn from an (a,b,0) distribution, the quantity on the left-hand side of (11) should have a linear pattern when plotted against $k$. If the plot is roughly horizontal, it is an indication that the (a,b,0) distribution is a Poisson distribution. If the plot has a positive slope, it is an indication that the (a,b,0) distribution is a negative binomial distribution. If the plot has a negative slope, it is an indication that the (a,b,0) distribution is a binomial distribution. This is further discussed in the following example.

Example 9
Consider the sample claim frequency data in Example 1. The quantities $\frac{k \ n_k}{n_{k-1}}$ are shown in the following table.

$k$ $n_k$ $\displaystyle \frac{k \ n_k}{n_{k-1}}$
0 40
1 24 0.6
2 20 1.67
3 8 1.2
4 5 2.5
5 3 3
6+ 0

[Plot of the ratio $\frac{k \ n_k}{n_{k-1}}$ against $k$]

The plot shows roughly a linear pattern. The slope is clearly positive. This suggests that the negative binomial distribution is a good fit.

When fitting an (a,b,0) distribution, it is a good idea to construct a plot according to relation (11). A couple of caveats. Any category with $n_k=0$ cannot be used in the plot. The plot is less reliable if there is an insufficient amount of data.
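The ratios in Example 9 can be computed with the sketch below. The least-squares line fitted at the end is an added illustration for summarizing the slope numerically. It is not part of the original example, which relies on inspecting the plot, and with only five points it should be read as a rough indicator.

```python
# Claim frequency data from Example 1: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

# Ratio k * n_k / n_{k-1}; categories with a zero or missing count are skipped
ratios = {
    k: k * counts[k] / counts[k - 1]
    for k in sorted(counts)
    if counts[k] > 0 and counts.get(k - 1, 0) > 0
}

# Least-squares line through the points (k, ratio), as in relation (11)
ks = list(ratios)
ys = [ratios[k] for k in ks]
k_bar = sum(ks) / len(ks)
y_bar = sum(ys) / len(ys)
slope = (
    sum((k - k_bar) * (y - y_bar) for k, y in zip(ks, ys))
    / sum((k - k_bar) ** 2 for k in ks)
)
intercept = y_bar - slope * k_bar

for k, ratio in ratios.items():
    print(f"k = {k}  ratio = {ratio:.2f}")
print(f"fitted slope (estimate of a)     = {slope:.4f}")
print(f"fitted intercept (estimate of b) = {intercept:.4f}")
```

A clearly positive slope points to the negative binomial distribution, a slope near zero to Poisson, and a clearly negative slope to binomial.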

More on calculating maximum likelihood estimators

This post continues the preceding post on maximum likelihood estimation. The preceding post focuses on calculating MLE when there is complete data (or individual data). This post focuses on calculating MLE for the other data scenarios such as grouped data, censored data and truncated data.

Individual data refers to a data set where the exact value of every data point in the data set is completely known. Grouped data refers to a summarized data set that consists of frequency data, i.e. the counts that fall into a set of intervals.

Censored data refers to a data set where information on some of the data points is only partially known. For example, a data point exceeding a limit $u$ is recorded as $u$ (the data point is right censored or censored from above). A data point lower than a limit $l$ is recorded as $l$ (the data point is left censored or censored from below). A handy example of a censored data set is a reliability study where the times at failure for machines are recorded during a 5-year period. In this study, the time at failure for any machine that is still operating at the end of the study is recorded as 5 even though the machine may continue to work for a number of more years.

Truncated data refers to a data set where data values in some intervals are not observed and are thus ignored. For example, in an insurance coverage with a deductible $d$, when considering payment data, any loss that is below $d$ is not included into the calculation. This is an example of a data set that is truncated below. Any data set such that values above a certain threshold are not observed or collected is truncated above.

For censored data and truncated data, we focus on claim data with a policy limit (censored from above) or on claim data with a deductible (truncated from below) or on claim data with both a policy limit and a deductible.

Several examples (Example 3, Example 4, Example 6 and Example 7) concern the Pareto distribution. The Pareto distribution used here is also called Pareto type II distribution. For useful facts about Pareto type II, see this post in a companion blog.

Grouped Data

In this scenario, the data points are not available individually. Instead, we know the counts of the data points that fall into a set of intervals. Unlike the case for complete data, the likelihood is not the value of the density function. It is the difference of two values of the cumulative distribution function (CDF) to account for the probability of a data point falling into an interval. The rest of the procedure is the same as before – finding the likelihood function, and then taking log to get the log-likelihood function. Then take the derivative or partial derivatives and set the derivative or partial derivative equal to zero. The maximum likelihood estimates are then the solutions of the resulting equations. This is illustrated in Example 1.

Example 1
The following claim data has been collected from a large group of insureds.

Interval Frequency
$(0,5)$ 10
$(5,10)$ 2
$(10,15)$ 6
$(15,20)$ 1
$(20,\infty)$ 1
Total 20

The exponential distribution with mean $\theta$ is fitted to the grouped data. Calculate the maximum likelihood estimate of the parameter $\theta$.

Note that the density is $f(x)=\frac{1}{\theta} \ e^{-x/\theta}$. The CDF is $F(x)=1-e^{-x/\theta}$.

Any observation that falls into the interval (0, 5) has likelihood $1-e^{-5/\theta}$, which is $F(5)$, accounting for the probability of an observation being in that interval. The likelihood for the interval (0, 5) is then $(1-e^{-5/\theta})^{10}$. Any observation that falls into the interval (5, 10) has likelihood $e^{-5/\theta}-e^{-10/\theta}$, which is $F(10)-F(5)$. The likelihood for the interval is $(e^{-5/\theta}-e^{-10/\theta})^2$. Continue on with the same process. The likelihood function $L(\theta)$ is the product of the likelihood of the intervals.

$\displaystyle L(\theta)=\biggl[1-e^{-5/\theta} \biggr]^{10} \ \biggl[e^{-5/\theta}-e^{-10/\theta} \biggr]^2 \ \biggl[e^{-10/\theta}-e^{-15/\theta} \biggr]^6 \ \biggl[e^{-15/\theta}-e^{-20/\theta} \biggr] \ e^{-20/\theta}$

The likelihood function can be further simplified before obtaining the log-likelihood function.

$\displaystyle L(\theta)=e^{-105/\theta} \ \biggl[1-e^{-5/\theta} \biggr]^{19}$

$\displaystyle l(\theta)=\ln L(\theta)=-\frac{105}{\theta}+19 \ \ln (1-e^{-5/\theta})$

Solving the equation obtained by setting the derivative of the log-likelihood function equal to zero gives the maximum likelihood estimate.

$\displaystyle \frac{d \ l(\theta)}{d \ \theta}=\frac{105}{\theta^2}-\frac{19}{1-e^{-5/\theta}} \ \frac{5 e^{-5/\theta}}{\theta^2}=0$

$\displaystyle e^{-5/\theta}=\frac{105}{200}=\frac{21}{40}$

$\displaystyle -5/\theta=\ln \biggl[\frac{21}{40}\biggr]$

$\displaystyle \hat{\theta}=\frac{-5}{\ln \biggl[\frac{21}{40}\biggr]}=7.7597$

The most obvious difference from the case of individual data is that the likelihood function is a product of differences of values of the CDF. Otherwise, the same process applies. For some distributions, maximum likelihood estimation is hard to carry out for grouped data because the CDF is hard to manipulate mathematically. The method of moments is also difficult to carry out for grouped data.
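When the CDF does not simplify as nicely as in Example 1, the grouped-data log-likelihood can be maximized numerically. The sketch below builds the log-likelihood directly from differences of CDF values and maximizes it with a simple ternary search, then compares the result with the closed form. For this data the first-order condition reduces to $105 \ (1-u)=95 \ u$ with $u=e^{-5/\theta}$, i.e. $u=21/40$. The helper names and the search interval $[1,50]$ are assumptions made for illustration, and the ternary search relies on the log-likelihood having a single peak on that interval.

```python
import math

# Grouped data from Example 1: (lower, upper, frequency); math.inf marks the open interval
groups = [(0, 5, 10), (5, 10, 2), (10, 15, 6), (15, 20, 1), (20, math.inf, 1)]

def cdf(x, theta):
    """CDF of the exponential distribution with mean theta."""
    return 1.0 if math.isinf(x) else 1.0 - math.exp(-x / theta)

def log_likelihood(theta):
    # Each interval contributes frequency * ln(F(upper) - F(lower))
    return sum(f * math.log(cdf(hi, theta) - cdf(lo, theta)) for lo, hi, f in groups)

def ternary_search_max(func, lo, hi, iters=200):
    """Maximize a single-peaked function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if func(m1) < func(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

theta_numeric = ternary_search_max(log_likelihood, 1.0, 50.0)
theta_closed = -5 / math.log(21 / 40)

print(f"numerical MLE   = {theta_numeric:.4f}")
print(f"closed-form MLE = {theta_closed:.4f}")
```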

Censored Data

An example of censored data is an insurance coverage with a policy limit. Any loss exceeding the limit $u$ is recorded as $u$. The likelihood of such a data point is then $1-F(u)$, the probability of a loss exceeding $u$. The rest of the MLE procedure works the same as before. By contrast, if a data point is censored from below at a threshold $m$ (it is only known to be below $m$), then the likelihood of the data point is $F(m)$.

Example 2
Observed claims are: 5, 6, 9, 15, 23. In addition, there are two claims exceeding the policy limit of 25.

An exponential distribution with mean $\theta$ is fitted to the claim data. Calculate the maximum likelihood estimate of $\theta$.

For the individual data points, the likelihood is $f(x)=\frac{1}{\theta} \ e^{-x/\theta}$. For the censored data points, the likelihood is $1-F(x)=e^{-x/\theta}$, the probability of exceeding the limit. The following is the likelihood function.

$\displaystyle L(\theta)=\frac{1}{\theta^5} \ e^{-\frac{5+6+9+15+23}{\theta}} \ e^{-\frac{25+25}{\theta}}=\frac{1}{\theta^5} \ e^{-\frac{108}{\theta}}$

The following derivation gives the maximum likelihood estimate $\hat{\theta}$.

$\displaystyle l(\theta)= \ln L(\theta)=-5 \ln(\theta)-\frac{108}{\theta}$

$\displaystyle \frac{d \ l(\theta)}{d \ \theta}=-\frac{5}{\theta}+\frac{108}{\theta^2}=0$

$\displaystyle -5+\frac{108}{\theta}=0$

$\displaystyle \hat{\theta}=\frac{108}{5}=21.6$
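The derivation in Example 2 generalizes: for the exponential distribution with right-censored data, the estimate is the total of all recorded amounts (censored claims counted at the limit) divided by the number of uncensored claims. A minimal Python sketch, where the function name `exp_mle_right_censored` is our own:

```python
def exp_mle_right_censored(exact, censored_at):
    """MLE of the exponential mean with right-censored observations.

    exact: fully observed claim amounts (each contributes f(x)).
    censored_at: limits recorded for claims known only to exceed them
                 (each contributes 1 - F(u)).
    """
    if not exact:
        raise ValueError("at least one uncensored observation is needed")
    # Log-likelihood: -k ln(theta) - (sum of all recorded amounts) / theta,
    # where k is the number of uncensored observations.
    return (sum(exact) + sum(censored_at)) / len(exact)

# Data of Example 2: five exact claims and two claims censored at 25.
theta_hat = exp_mle_right_censored([5, 6, 9, 15, 23], [25, 25])
print(theta_hat)
```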

Truncated Data

We center the discussion on the scenario of a coverage with a deductible. Truncation is due to the fact that payment on a claim is conditional on the loss exceeding the deductible. Suppose that the insurance coverage has a deductible $d$. Suppose that claims $x_1,x_2,\cdots,x_n$ have been observed (individual data). We assume that losses below $d$ are not submitted. So all observations $x_i$ are above the deductible $d$. There are two ways to apply maximum likelihood estimation to such truncated claim data.

1. Work with the claim data $x_1,x_2,\cdots,x_n$ as is without any modification. Then the resulting maximum likelihood fitted distribution would be for claim data before applying any deductible. The mean of this fitted distribution would be the mean claim cost without a deductible. Of course, we can then estimate from this fitted distribution the claim cost of imposing a deductible.
2. This approach is called shifting since the approach is to subtract the deductible $d$ from each observed claim $x_i$. The resulting maximum likelihood fitted distribution would be for the claim payment reflecting a deductible of $d$. The mean of this fitted distribution would be the mean claim cost per payment (over all losses exceeding the deductible of $d$). For this reason, the original mean claim cost (without a deductible) cannot be recovered from this fitted distribution. However, imposing a deductible of $d$ on this fitted distribution would be equivalent to imposing a deductible of $2 d$ on the original loss distribution.

Essentially in approach 1, we fit a distribution to the truncated claim data (but unmodified by the deductible). The resulting maximum likelihood fitted distribution is for the original loss distribution before any deductible is applied. In the second approach we fit a distribution to the claim payment data (after subtracting the deductible from the data). The resulting maximum likelihood fitted distribution is for the claim payment distribution reflecting the deductible used in the shifting. Which approach to use depends on whether we want to fit a distribution to the truncated claim data including the deductible or fit a distribution to the claim payment data (with the deductible not included).

To illustrate how these two approaches work, we fit the Pareto distribution to a set of claim data in both ways (Example 3 and Example 4). We round out the discussion on truncated data with an example using exponential distribution (Example 5).

Example 3
An insurance coverage has a deductible of 5. The following claims are observed:

12, 8, 14, 17, 13

A Pareto distribution with parameters $\alpha$ and $\theta=20$ is fitted to these data. Determine the maximum likelihood estimate of $\alpha$. We wish that the fitted Pareto distribution is an estimated model for claim cost before the deductible. So we do not subtract the deductible of 5 from the data points (i.e. approach 1). We discuss several ways of using this fitted Pareto distribution to estimate claim costs.

The density function and the CDF of the Pareto distribution are:

$\displaystyle f(x)=\frac{\alpha \ 20^\alpha}{(x+20)^{\alpha+1}} \ \ \ \ \ \ x>0$

$\displaystyle F(x)=1-\biggl(\frac{20}{x+20} \biggr)^\alpha \ \ \ \ \ \ \ x>0$

Because we assume that we do not have any information about claims below 5, observing a claim $x$ is conditional on the fact that the loss underlying that claim exceeds 5. Thus the likelihood of a claim $x$ is a conditional probability. The likelihood of a claim amount $x$ is $\frac{f(x)}{1-F(5)}$. Plugging in the Pareto information, the following is the likelihood of a claim $x$.

$\displaystyle \frac{f(x)}{1-F(5)}=\frac{\frac{\alpha \ 20^\alpha}{(x+20)^{\alpha+1}}}{\biggl(\frac{20}{5+20} \biggr)^\alpha}=\frac{\alpha \ 25^\alpha}{(x+20)^{\alpha+1}}$

There are 5 data points. The likelihood function is then the product of these 5 likelihood values.

$\displaystyle L(\alpha)=\frac{\alpha^5 \ 25^{5 \alpha}}{\prod \limits_{i=1}^5 (x_i+20)^{\alpha+1}}=\frac{\alpha^5 \ 25^{5 \alpha}}{37196544^{\alpha+1}}$

The usual steps produce the maximum likelihood estimate for $\alpha$.

$\displaystyle l(\alpha)=\ln L(\alpha)=5 \ln(\alpha)+5 \alpha \ln(25)-(\alpha+1) \ \ln (37196544)$

$\displaystyle \frac{d \ l(\alpha)}{d \ \alpha}=\frac{5}{\alpha}+5 \ln(25)-\ln(37196544)=0$

$\displaystyle \hat{\alpha}=\frac{5}{\ln(37196544)-5 \ln(25)}=3.7387$

The Pareto distribution with $\hat{\alpha}=3.7387$ and $\theta=20$ is the fitted distribution for claim data in this insurance coverage. The deductible of 5 is not factored into this Pareto distribution. So this is the fitted distribution for the claim cost before applying the deductible. Thus, the mean claim cost without any deductible is $E[X]=\frac{20}{\hat{\alpha}-1}=7.3027$. Solving the equation $F(x)=0.5$ gives the median. Thus, the median claim cost without any deductible is 4.0739. When imposing a deductible of 5, here are the estimated claim costs:

Limited Expected value………..$\displaystyle E[X \wedge 5]=\frac{\theta}{\hat{\alpha}-1} \ \biggl[1-\biggl(\frac{\theta}{5+\theta} \biggr)^{\hat{\alpha}-1} \biggr]=3.3392$

Claim Cost Per Loss……………….$\displaystyle E[X]-E[X \wedge 5]=3.9635$

Claim Cost Per Payment………..$\displaystyle \frac{E[X]-E[X \wedge 5]}{1-F(5)}=9.1284$

When imposing a deductible of 10, here are the estimated claim costs based on the fitted Pareto distribution.

Limited Expected value………..$\displaystyle E[X \wedge 10]=\frac{\theta}{\hat{\alpha}-1} \ \biggl[1-\biggl(\frac{\theta}{10+\theta} \biggr)^{\hat{\alpha}-1} \biggr]=4.8971$

Claim Cost Per Loss……………….$\displaystyle E[X]-E[X \wedge 10]=2.4056$

Claim Cost Per Payment………..$\displaystyle \frac{E[X]-E[X \wedge 10]}{1-F(10)}=10.9541$

The claim cost without a deductible is $E[X]=7.3027$ over all losses. When imposing a deductible of 5, the claim cost per loss is reduced to 3.9635. When imposing a deductible of 10, the claim cost per loss is further reduced to 2.4056. Note that the claim costs per payment are conditional means (calculated over all losses exceeding the deductible). So they are higher than the claim cost per loss.
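The numbers in Example 3 can be reproduced with a short Python sketch. It assumes the same setup as the example (Pareto with $\theta=20$ fixed, deductible 5, approach 1). The function names are our own, and small differences in the last decimal place can arise from rounding $\hat{\alpha}$.

```python
import math

THETA = 20.0                   # fixed Pareto scale parameter
CLAIMS = [12, 8, 14, 17, 13]   # observed claims from Example 3
D = 5.0                        # deductible that truncates the data

def pareto_sf(x, alpha, theta=THETA):
    """Survival function 1 - F(x) of the Pareto distribution."""
    return (theta / (x + theta)) ** alpha

def alpha_mle_truncated(claims, d, theta=THETA):
    """Approach 1: each claim has likelihood f(x) / (1 - F(d))."""
    n = len(claims)
    log_sum = sum(math.log(x + theta) for x in claims)
    return n / (log_sum - n * math.log(d + theta))

def limited_ev(u, alpha, theta=THETA):
    """Limited expected value E[X ^ u] (requires alpha != 1)."""
    return theta / (alpha - 1) * (1 - (theta / (u + theta)) ** (alpha - 1))

alpha_hat = alpha_mle_truncated(CLAIMS, D)
mean = THETA / (alpha_hat - 1)
median = THETA * (2 ** (1 / alpha_hat) - 1)   # solves F(x) = 0.5
print(f"alpha_hat={alpha_hat:.4f} mean={mean:.4f} median={median:.4f}")

for ded in (5.0, 10.0):
    per_loss = mean - limited_ev(ded, alpha_hat)
    per_payment = per_loss / pareto_sf(ded, alpha_hat)
    print(f"d={ded:g}: per loss {per_loss:.4f}, per payment {per_payment:.4f}")
```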

Example 4
We now show how to estimate MLE using the second approach for truncated data. We continue to use the claim data from Example 3. We still wish to fit a Pareto distribution with parameters $\alpha$ and $\theta=20$ to the same data. This time we subtract the deductible of 5 from the claims. The resulting fitted Pareto distribution is for the distribution of claim payments based on the deductible of 5.

After subtracting the deductible of 5, the data are: 7, 3, 9, 12, 8. The maximum likelihood estimation is based on this shifted data. This data set is a complete data set. We can use the formula shown in the preceding post.

$\displaystyle \hat{\alpha}=\frac{n}{\ln\biggl(\prod \limits_{i=1}^n (\theta+x_i) \biggr)-n \ln(\theta)}=\frac{5}{\ln(16136064)-5 \ln(20)}=3.0904$

The Pareto distribution with parameters $\hat{\alpha}=3.0904$ and $\theta=20$ is the fitted distribution for claim payments. The deductible of 5 is baked into this Pareto distribution. The mean of this distribution is $E[X]=\frac{20}{\hat{\alpha}-1}=9.5675$. This mean is the mean claim payment with a deductible of 5 baked in. So we cannot recover the claim cost without deductible from this fitted distribution. This fitted Pareto distribution is modified from the original Pareto distribution describing the losses without the deductible. If we impose a deductible of 5 on this modified distribution, the result would be equivalent to imposing a deductible of 10 on the original distribution.

Limited Expected value………..$\displaystyle E[X \wedge 5]=\frac{\theta}{\hat{\alpha}-1} \ \biggl[1-\biggl(\frac{\theta}{5+\theta} \biggr)^{\hat{\alpha}-1} \biggr]=3.5666$

Claim Cost Per Payment………..$\displaystyle \frac{E[X]-E[X \wedge 5]}{1-F(5)}=11.9594$

The mean claim cost per payment of 11.9594 corresponds to imposing a deductible of 10 on the claim data before the deductible. Note that 11.9594 is in line with the corresponding figure of 10.9541 in Example 3. The two figures estimate the same quantity, but because they come from two different fitted distributions, they usually do not agree exactly.
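The shifting approach of Example 4 can be sketched in Python in the same way. The sketch assumes $\theta=20$ is fixed and uses the complete-data Pareto formula on the shifted data. The names `pareto_alpha_mle` and `payments` are our own.

```python
import math

THETA = 20.0                   # fixed Pareto scale parameter
CLAIMS = [12, 8, 14, 17, 13]   # observed claims from Example 3
D = 5.0                        # deductible subtracted in the shifting

def pareto_alpha_mle(data, theta=THETA):
    """Complete-data MLE of alpha for a Pareto with fixed theta."""
    n = len(data)
    return n / (sum(math.log(theta + x) for x in data) - n * math.log(theta))

# Approach 2: subtract the deductible, then treat the data as complete.
payments = [x - D for x in CLAIMS]
alpha_shift = pareto_alpha_mle(payments)

mean_payment = THETA / (alpha_shift - 1)
# Impose a further deductible of 5 on the fitted payment distribution.
lev5 = mean_payment * (1 - (THETA / (5 + THETA)) ** (alpha_shift - 1))
sf5 = (THETA / (5 + THETA)) ** alpha_shift
per_payment = (mean_payment - lev5) / sf5

print(f"alpha={alpha_shift:.4f} mean payment={mean_payment:.4f}")
print(f"E[X ^ 5]={lev5:.4f} cost per payment={per_payment:.4f}")
```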

Example 5
This example deals with the same coverage as in Example 3 (a deductible of 5), with observed claims 7, 10, 12, 16, 22. This time we fit the exponential distribution with mean $\theta$ to the claim data. We apply the maximum likelihood estimation using the first approach (without subtracting the deductible from the claim data). Observing a claim $x$ is conditional on it exceeding the deductible 5. The likelihood of a claim $x$ is

$\displaystyle \frac{\frac{1}{\theta} e^{-x/\theta}}{e^{-5/\theta}}=\frac{1}{\theta} \ e^{-(x-5)/\theta}$

Thus the likelihood function is:

$\displaystyle \begin{aligned} L(\theta)&=\frac{1}{\theta^5} \ e^{-(7-5)/\theta} \ e^{-(10-5)/\theta} \ e^{-(12-5)/\theta} \ e^{-(16-5)/\theta} \ e^{-(22-5)/\theta} \\&=\frac{1}{\theta^5} \ e^{-42/\theta} \end{aligned}$

The maximum likelihood estimate is derived as follows:

$\displaystyle l(\theta)=\ln [L(\theta)]=-5 \ \ln(\theta)-\frac{42}{\theta}$

$\displaystyle \frac{d \ l(\theta)}{d \ \theta}=-\frac{5}{\theta}+\frac{42}{\theta^2}=0$

$\displaystyle \hat{\theta}=\frac{42}{5}=8.4$

On careful examination, note that if we use the shifted approach (the second approach) on the exponential distribution, we get the same maximum likelihood estimate $\hat{\theta}=8.4$. Because the exponential distribution is memoryless, either approach for truncated data leads to the same likelihood function $L(\theta)$. The exponential distribution is the only case where the maximum likelihood fitted distribution is both for claim data without a deductible and for claim payment with a deductible. Any other distribution would lead to two different fitted distributions when using both approaches for truncated claim data (just like the Pareto distribution in Example 3 and Example 4).
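The claim that both approaches give the same likelihood for the exponential distribution can be checked numerically. The Python sketch below uses the claim amounts that appear in the likelihood function of Example 5 and a deductible of 5. The two function names are our own.

```python
import math

claims = [7, 10, 12, 16, 22]   # claim amounts in the likelihood of Example 5
d = 5                          # deductible

def loglik_truncated(theta):
    """Approach 1: ln of the product of f(x_i) / (1 - F(d))."""
    return sum(-math.log(theta) - x / theta + d / theta for x in claims)

def loglik_shifted(theta):
    """Approach 2: ln of the product of f(x_i - d)."""
    return sum(-math.log(theta) - (x - d) / theta for x in claims)

# Both log-likelihoods reduce to -n ln(theta) - sum(x_i - d) / theta.
theta_hat = sum(x - d for x in claims) / len(claims)

for theta in (2.0, theta_hat, 30.0):
    print(theta, loglik_truncated(theta), loglik_shifted(theta))
```

The two columns printed agree (up to floating-point rounding) at every value of $\theta$, which is the memoryless property at work.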

One comment about the two approaches. If there are two approaches in handling truncated claim data, how do we know which approach to use in an exam problem? The answer depends on the goal of the problem. If the goal is to generate a fitted distribution to answer questions about the loss distribution or the claim data before applying any deductible, the first approach is used. Possible wordings: applying MLE on the original claim data, the fitted distribution is the loss distribution, or the loss distribution is fitted to a distribution.

If the goal is to generate a fitted distribution to answer questions about claim payment reflecting a certain deductible, then use approach 2 by subtracting the deductible from the claim data. Possible wordings: shifting the data by some amount, a certain distribution is fitted to the claim payment data, or claim payment data is fitted to this certain distribution. The idea is that we should look for instruction in the problem.

Censoring and Truncation Combined

We can also apply maximum likelihood estimation on claim data arising from insurance coverage with both a deductible and a policy limit. The addition of the policy limit poses no new challenge. The deductible is already taken care of by the two approaches discussed in the preceding section. The only new piece of information we need is on how to handle the censored limit. Any data point that is above the maximum covered loss $u$ is represented as $u$. Its likelihood is one of the following depending on the approach.

Approach 1………..$\displaystyle \frac{1-F(u)}{1-F(d)}$

Approach 2………..$\displaystyle 1-F(u-d)$

In Approach 1, the denominator is $1-F(d)$ indicating that the likelihood is a conditional probability. The numerator is $1-F(u)$ indicating that the original data point is not known but is above the limit $u$. In Approach 2, we use the limit $u$ to stand in for the actual data point but subtract the deductible from it to make $u-d$ the claim payment.

For any individual data point in the claim data (any data point above the deductible and below the limit), the likelihood has already been described in the preceding section (in one of two approaches). We now close with two more examples demonstrating combining truncation and censoring.

Example 6
An insurance coverage has a deductible of 5 and a maximum covered loss of 25. The following claims are observed:

12, 8, 14, 17, 13, 25*, 25*

The first 5 data points are individual data, the same data set found in Example 3. The last two claims with asterisk are claims that exceed 25 and are recorded as 25. Just like Example 3, we fit the Pareto distribution with parameters $\alpha$ and $\theta=20$ to these data in order to estimate the claim cost without a deductible.

For the 2 data points 25, the following is the likelihood:

$\displaystyle \frac{1-F(25)}{1-F(5)}=\frac{\biggl(\frac{20}{45} \biggr)^\alpha}{\biggl(\frac{20}{25} \biggr)^\alpha}= \frac{25^\alpha}{45^\alpha}$

The individual data points are the same as in Example 3. We only need to multiply the $L(\alpha)$ in Example 3 by the above likelihood twice.

$\displaystyle L(\alpha)=\frac{\alpha^5 \ 25^{5 \alpha}}{\prod \limits_{i=1}^5 (x_i+20)^{\alpha+1}} \ \frac{25^\alpha}{45^\alpha} \ \frac{25^\alpha}{45^\alpha}=\frac{\alpha^5 \ 25^{7 \alpha}}{37196544^{\alpha+1} \ 45^{2 \alpha}}$

The usual steps produce the maximum likelihood estimate for $\alpha$.

$\displaystyle l(\alpha)=\ln L(\alpha)=5 \ln(\alpha)+7 \alpha \ln(25)-(\alpha+1) \ \ln (37196544)-2 \alpha \ln(45)$

$\displaystyle \frac{d \ l(\alpha)}{d \ \alpha}=\frac{5}{\alpha}+7 \ln(25)-\ln(37196544)-2 \ln(45)=0$

$\displaystyle \hat{\alpha}=\frac{5}{\ln(37196544)+2 \ln(45)-7 \ln(25)}=1.9897$

The fitted Pareto distribution with parameters $\hat{\alpha}=1.9897$ and $\theta=20$ is a model for the claim cost without a deductible.

Example 7
Use the same data set as in Example 6 but use the shifting approach (the second approach described in the preceding section). The fitted Pareto distribution will be a model for claim payments for the insurance coverage with a deductible of 5.

For the two data points of 25, the likelihood is $1-F(25-5)=(20/40)^\alpha$. The likelihood function is obtained by multiplying this likelihood (two times) with the likelihood of the individual data points.

$\displaystyle L(\alpha)=\frac{\alpha^5 \ 20^{5 \alpha}}{16136064^{\alpha+1}} \ \biggl(\frac{20}{40}\biggr)^\alpha \ \biggl(\frac{20}{40}\biggr)^\alpha=\frac{\alpha^5 \ 20^{7 \alpha}}{16136064^{\alpha+1} \ 40^{2 \alpha}}$

The usual steps produce the maximum likelihood estimate for $\alpha$.

$\displaystyle l(\alpha)=\ln L(\alpha)=5 \ln(\alpha)+7 \alpha \ln(20)-(\alpha+1) \ \ln (16136064)-2 \alpha \ln(40)$

$\displaystyle \frac{d \ l(\alpha)}{d \ \alpha}=\frac{5}{\alpha}+7 \ln(20)-\ln(16136064)-2 \ln(40)=0$

$\displaystyle \hat{\alpha}=\frac{5}{\ln(16136064)+2 \ln(40)-7 \ln(20)}=1.6643$

The fitted Pareto distribution with parameters $\hat{\alpha}=1.6643$ and $\theta=20$ is a model for the claim payment after a deductible of 5 is met.
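Examples 6 and 7 differ only in which likelihood is assigned to each data point, so both estimates can be computed side by side. The Python sketch below assumes the setup of the two examples ($\theta=20$ fixed, deductible 5, maximum covered loss 25). The function names are our own.

```python
import math

THETA = 20.0                  # fixed Pareto scale parameter
EXACT = [12, 8, 14, 17, 13]   # claims between the deductible and the limit
N_CENSORED = 2                # claims recorded at the maximum covered loss
D, U = 5.0, 25.0              # deductible and maximum covered loss

def alpha_approach_1():
    """Truncated likelihoods: f(x)/(1-F(d)) and (1-F(u))/(1-F(d))."""
    n, m = len(EXACT), N_CENSORED
    denom = (sum(math.log(x + THETA) for x in EXACT)
             + m * math.log(U + THETA)
             - (n + m) * math.log(D + THETA))
    return n / denom

def alpha_approach_2():
    """Shifted likelihoods: f(x-d) and 1-F(u-d)."""
    n, m = len(EXACT), N_CENSORED
    denom = (sum(math.log(x - D + THETA) for x in EXACT)
             + m * math.log(U - D + THETA)
             - (n + m) * math.log(THETA))
    return n / denom

alpha_1 = alpha_approach_1()   # Example 6
alpha_2 = alpha_approach_2()   # Example 7
print(f"approach 1: {alpha_1:.4f}, approach 2: {alpha_2:.4f}")
```

In both functions the numerator is the number of uncensored claims, since only those data points contribute a factor of $\alpha$ to the likelihood.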

actuarial practice problems

Dan Ma actuarial

Daniel Ma actuarial

Daniel Ma Math

Daniel Ma Mathematics

Actuarial exam

$\copyright$ 2018 – Dan Ma

Calculating maximum likelihood estimators

If the probability model that describes a population is completely known (along with its parameters), then we can use it to obtain information about the population. However, in the real world, this is rarely the case. Instead, we have observed data. We may have the information that the observed data follows a particular distribution but its parameters are not known. In other words, the form of the distribution from which the observed data is drawn is known (perhaps it is an assumption) but the specific values of the parameters are not known. Then the option we have is to use the observed data to estimate the values of the parameters.

One way to estimate the parameters is the method of moments, which is relatively easy to use (for the most part). This is the focus of the practice problem set. In this post, we discuss the method of maximum likelihood estimation.

The method of maximum likelihood estimation is to maximize the probability or likelihood of observing the data we collected. Suppose that the form of the distribution is known and its density function is $f(x; \theta_1, \theta_2, \cdots, \theta_k)$. But the $k$ parameters are not known. The goal is to choose one particular member of the assumed parametric distribution family $f(x; \hat{\theta_1}, \hat{\theta_2}, \cdots, \hat{\theta_k})$ that gives the highest likelihood of the observed data. Let’s consider the exponential distribution as an example.

Exponential Example

Suppose it is known that the size of claims from a large group of insureds has an exponential distribution with unknown mean $\theta$. The density function is $f(x)=\frac{1}{\theta} e^{-x/\theta}$ where $x>0$. We observe $n$ claims $x_1,x_2,\cdots,x_n$. The method of maximum likelihood is to choose the value of $\theta$ under which the observed data have the highest likelihood. The likelihood of observing the data is:

$\displaystyle \begin{aligned} L(\theta)&=f(x_1) \cdot f(x_2) \cdots f(x_n) \\&=\frac{1}{\theta} e^{\frac{-x_1}{\theta}} \cdot \frac{1}{\theta} e^{\frac{-x_2}{\theta}} \cdots \frac{1}{\theta} e^{-\frac{x_n}{\theta}} \\&=\frac{1}{\theta^n} \ e^{\frac{-\sum \limits_{i=1}^n x_i}{\theta}} \end{aligned}$

The goal is to choose the value of $\theta$ so that the function $L(\theta)$ is as large as possible. In other words, the goal is to maximize the function $L(\theta)$, which is called the likelihood function. In many cases, it is easier to maximize the natural log of $L(\theta)$.

$\displaystyle l(\theta)=\ln[L(\theta)]=-n \ln(\theta)-\frac{\sum \limits_{i=1}^n x_i}{\theta}$

The function $l(\theta)$ is called the log-likelihood function. The $\theta$ for which $l(\theta)$ is maximum is also a value for which $L(\theta)$ is maximum. The following gives the first and second derivatives of $l(\theta)$.

$\displaystyle l'(\theta)=-\frac{n}{\theta}+\frac{\sum \limits_{i=1}^n x_i}{\theta^2}$

$\displaystyle l''(\theta)=\frac{n}{\theta^2}-2 \ \frac{\sum \limits_{i=1}^n x_i}{\theta^3}$

Setting the first derivative equal to zero and solving for $\theta$ gives

$\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i}{n}$

Plugging $\hat{\theta}$ into the second derivative produces a negative value. Thus $\hat{\theta}$ gives the maximum log-likelihood $l(\theta)$ and thus the maximum likelihood $L(\theta)$. The value $\hat{\theta}$ is called the maximum likelihood estimate (MLE) of the parameter $\theta$. It is also called the maximum likelihood estimator of the parameter $\theta$ since $\hat{\theta}$ is also a function (as the observations change, the estimate will change). Note that $\hat{\theta}$ is the mean of the sample $x_1,x_2,\cdots,x_n$. In this instance, the maximum likelihood estimate coincides with the method of moments estimate. Though such examples are the exception, several more examples of MLE = method of moments estimates are discussed below.
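To make the exponential example concrete, here is a small Python sketch that evaluates the log-likelihood and its second derivative at the sample mean. The claim amounts are hypothetical values made up for illustration, and the function names are our own.

```python
import math

claims = [3.2, 11.5, 0.8, 7.4, 5.1]   # hypothetical claim sizes

def loglik(theta, data):
    """l(theta) = -n ln(theta) - sum(x_i) / theta."""
    return -len(data) * math.log(theta) - sum(data) / theta

def second_derivative(theta, data):
    """l''(theta) = n / theta^2 - 2 sum(x_i) / theta^3."""
    return len(data) / theta**2 - 2 * sum(data) / theta**3

theta_hat = sum(claims) / len(claims)   # the sample mean

print("theta_hat:", theta_hat)
print("l'' at theta_hat:", second_derivative(theta_hat, claims))
for theta in (theta_hat - 0.5, theta_hat, theta_hat + 0.5):
    print(theta, loglik(theta, claims))
```

At $\hat{\theta}$ the second derivative simplifies to $-n/\hat{\theta}^2$, which is negative for any sample, so the check is not specific to these made-up numbers.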

MLE

As the above example suggests, the first step in maximum likelihood estimation is to come up with the likelihood function and then the log-likelihood function (by taking the natural log of the likelihood function). If there is only one parameter, take the derivative of the log-likelihood function and then set it equal to zero and solve for the parameter. If there is more than one parameter in the log-likelihood function, take the partial derivative with respect to each parameter. Then set the resulting partial derivatives equal to zero and solve the resulting system of equations.

The likelihood of a data point $x$ (if its value is completely known) is simply the density function evaluated at $x$ (for a continuous distribution) or the probability function evaluated at $x$ (for a discrete distribution). For a given sample $x_1,x_2,\cdots,x_n$, the likelihood function is simply the product of the likelihoods at the individual data points $x_i$.

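Another point to keep in mind: when working with the likelihood function or the log-likelihood function, positive multiplicative constants (factors that do not involve the parameters) can be omitted, since they do not change where the maximum occurs. This is illustrated by the example of the normal distribution.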

Normal Example

Observations: $x_1,x_2,\cdots,x_n$. We assume that the data are drawn from a normal distribution with parameters $\mu$ and $\sigma$. The following is the density function.

$\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \sigma} \ e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}} \ \ \ \ \ \ \ -\infty<x<\infty$

The following is the full likelihood function.

$\displaystyle \begin{aligned} L(\mu,\sigma)&=f(x_1) \cdot f(x_2) \cdots f(x_n) \\&=\frac{1}{(\sqrt{2 \pi})^n} \ \frac{1}{\sigma^n} \ e^{\frac{-\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}} \end{aligned}$

The constant $\frac{1}{(\sqrt{2 \pi})^n}$ in the last expression can be skipped. In the log-likelihood function, this factor becomes an additive constant, and its derivative is zero. Thus the essential likelihood function and the log-likelihood function are the following:

$\displaystyle L(\mu,\sigma)=\frac{1}{\sigma^n} \ e^{\frac{-\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}}$

$\displaystyle l(\mu,\sigma)=\ln[L(\mu,\sigma)]=-n \ln(\sigma)-\frac{\frac{1}{2} \sum \limits_{i=1}^n (x_i-\mu)^2 }{\sigma^2}$

Now take partial derivatives of $l(\mu,\sigma)$, first with respect to $\mu$ and then with respect to $\sigma$.

$\displaystyle \frac{\partial \ l(\mu,\sigma)}{\partial \ \mu}=\frac{\sum \limits_{i=1}^n (x_i-\mu)}{\sigma^2}=\frac{\biggl(\sum \limits_{i=1}^n x_i \biggr) - n \mu }{\sigma^2}=0$

$\displaystyle \frac{\partial \ l(\mu,\sigma)}{\partial \ \sigma}=-\frac{n}{\sigma}+\frac{\sum \limits_{i=1}^n (x_i-\mu)^2}{\sigma^3}=0$

Solving the first equation, we obtain the solution $\hat{\mu}$. Plug that into the second equation, use the identity $\sum \limits_{i=1}^n (x_i-\hat{\mu})^2=\sum \limits_{i=1}^n x_i^2 - n \hat{\mu}^2$, and we produce $\hat{\sigma}^2$.

$\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n x_i}{n} \ \ \ \ \ \ \ \ \ \ \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n x_i^2}{n}- \hat{\mu}^2$

The MLE estimate for the mean $\mu$ of the normal distribution is the sample mean, and the MLE estimate for $\sigma^2$ is the sample variance computed with divisor $n$ (the biased version, not the unbiased version with divisor $n-1$).
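A short Python sketch illustrates the two normal estimates. The observations are hypothetical values made up for illustration. The comparison with `statistics.pvariance` (divisor $n$) and `statistics.variance` (divisor $n-1$) from the standard library shows which version of the variance the MLE corresponds to.

```python
import statistics

data = [4.1, 5.3, 2.8, 6.0, 4.7, 5.5]   # hypothetical observations
n = len(data)

# MLE formulas from the normal example above.
mu_hat = sum(data) / n
sigma2_hat = sum(x * x for x in data) / n - mu_hat ** 2

print("mu_hat     :", mu_hat)
print("sigma2_hat :", sigma2_hat)
print("pvariance  :", statistics.pvariance(data))  # divisor n
print("variance   :", statistics.variance(data))   # divisor n - 1
```

The MLE agrees with `pvariance` and is smaller than `variance` by the factor $(n-1)/n$.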

Formulas

The MLE method does not always have a closed-form solution. For some distributions, the only way to obtain MLE estimates is numerically, through a software package. The following list gives several distributions for which the MLE calculation is accessible. The list is by no means exhaustive. For the exponential, normal, gamma, binomial, Poisson and negative binomial distributions in the list, the MLE estimates coincide with the method of moments estimates.

.

Exponential Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\theta} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ \ \ x>0$
MLE Estimate $\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Inverse Exponential Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{\theta \ e^{-\theta / x} }{x^2} \ \ \ \ \ \ \ \ x>0$
MLE Estimate $\displaystyle \hat{\theta}=\frac{n }{\sum \limits_{i=1}^n 1/x_i}$

Normal Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \ \sigma} \ e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}} \ \ \ \ -\infty<x<\infty$
MLE Estimate $\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n x_i}{n}$
MLE Estimate $\displaystyle \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n x_i^2}{n}- \hat{\mu}^2$

Lognormal Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\sqrt{2 \pi} \ \sigma \ x} \ e^{-\frac{1}{2} \frac{(\ln(x)-\mu)^2}{\sigma^2}} \ \ \ \ 0<x<\infty$
MLE Estimate $\displaystyle \hat{\mu}=\frac{\sum \limits_{i=1}^n \ln(x_i)}{n}$
MLE Estimate $\displaystyle \hat{\sigma}^2=\frac{\sum \limits_{i=1}^n [\ln(x_i)]^2}{n}- \hat{\mu}^2$

Pareto Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{\alpha \ \theta^\alpha}{(x+\theta)^{\alpha+1}} \ \ \ \ \ \ x>0, \ \ \theta \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\alpha}=\frac{n }{\ln \biggl(\prod \limits_{i=1}^n (\theta+x_i) \biggr)- n \ \ln(\theta) }$

Weibull Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\tau \ \frac{x^{\tau-1}}{\theta^\tau} \ e^{- x^\tau / \theta^\tau} \ \ \ \ \ \ x>0, \ \ \tau \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\theta}=\biggl[\frac{\sum \limits_{i=1}^n x_i^\tau}{n} \biggr]^{1/\tau}$

Uniform Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\theta} \ \ \ \ \ \ 0<x<\theta$
MLE Estimate $\displaystyle \hat{\theta}=\text{max}(x_1,x_2,\cdots,x_n)$

Gamma Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle f(x)=\frac{1}{\Gamma(\alpha)} \ \frac{1}{\theta^\alpha} \ x^{\alpha-1} \ e^{-\frac{x}{\theta}} \ \ \ \ \ \ x>0, \ \ \alpha \ \text{fixed}$
MLE Estimate $\displaystyle \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{\alpha \ n}$
Fitted Mean $\displaystyle \alpha \ \hat{\theta}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Binomial Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\binom{m}{x} \ p^x \ (1-p)^{m-x} \ \ \ \ \ \ x=0,1,2,\cdots,m, \ \ m \ \text{fixed}$
MLE Estimate $\displaystyle \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{m \cdot n}$
Fitted Mean $\displaystyle m \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Poisson Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ x=0,1,2,3,\cdots,$
MLE Estimate $\displaystyle \hat{\lambda}=\frac{\sum \limits_{i=1}^n x_i }{n}$

Negative Binomial Distribution
Data $x_1,x_2,\cdots,x_n$
Density $\displaystyle P[X=x]=\binom{r+x-1}{x} \ p^r \ (1-p)^{x} \ \ \ \ \ \ x=0,1,2,3,\cdots, \ \ r \ \text{fixed}$
MLE Estimate $\displaystyle \hat{p}=\frac{r }{r+\frac{1}{n} \sum \limits_{i=1}^n x_i}=\frac{r }{r+\overline{x}}$
Fitted Mean $\displaystyle \frac{r \ (1-\hat{p})}{\hat{p}}=\frac{1}{n} \sum \limits_{i=1}^n x_i=\overline{x}$
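Several of the closed-form estimates in the list translate directly into code. The following Python sketch implements four of them. The function names and the sample values are hypothetical and only serve as an illustration of the formulas.

```python
import math

def mle_inverse_exponential(data):
    """theta_hat = n / sum(1/x_i)."""
    return len(data) / sum(1.0 / x for x in data)

def mle_pareto_alpha(data, theta):
    """alpha_hat for a Pareto with fixed theta."""
    n = len(data)
    return n / (sum(math.log(theta + x) for x in data) - n * math.log(theta))

def mle_weibull_theta(data, tau):
    """theta_hat for a Weibull with fixed tau."""
    return (sum(x ** tau for x in data) / len(data)) ** (1.0 / tau)

def mle_negative_binomial_p(data, r):
    """p_hat = r / (r + sample mean) for a negative binomial with fixed r."""
    xbar = sum(data) / len(data)
    return r / (r + xbar)

sizes = [1.2, 3.4, 0.7, 5.9, 2.1]   # hypothetical claim sizes
counts = [1, 0, 4, 2, 3]            # hypothetical claim counts

print("inverse exponential:", mle_inverse_exponential(sizes))
print("Pareto (theta=10)  :", mle_pareto_alpha(sizes, theta=10.0))
print("Weibull (tau=2)    :", mle_weibull_theta(sizes, tau=2.0))
print("neg. binomial (r=3):", mle_negative_binomial_p(counts, r=3))
```

With $\tau=1$ the Weibull distribution reduces to the exponential distribution, so `mle_weibull_theta(sizes, 1.0)` returns the sample mean, which is a quick sanity check on the implementation.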

Remarks

The observed data discussed in all the above examples and formulas are complete data (or individual data). In this scenario, each data point in the data set $x_1,x_2,\cdots,x_n$ is known. In other words, the data is not grouped data (not summarized in any way), not censored and not truncated. For claims data in the form of individual data, no deductible or other insurance coverage modification has been applied. So complete data or individual data is exactly as it is recorded. The next post discusses how to calculate MLE for grouped data and censored or truncated data.

Practice Problem Set 1 – method of moments estimation

The post presents basic practice problems for the topic of parametric model selection, in particular estimating distributional parameters using the method of moments.

The task at hand is that of estimating an unknown parameter (or parameters) of a distribution. A relatively easy and conceptually clear approach is the method of moments. The approach is to set the sample moments equal to distributional moments and solve for the parameters in the resulting equations. For example, if the distribution has a single parameter $\theta$, then set the sample mean equal to the expression in terms of $\theta$ (the mean of the distribution). If there are two parameters, then set the sample mean and sample second moment equal to the appropriate expressions for the first and second moments of the distribution. The method of moments estimates are the solutions of the two equations.

The method of moments estimates are not unique: the estimates depend on which moments are matched, so the same sample data can produce different estimates. For a 2-parameter estimation, we can obtain the estimates by setting the first two sample moments equal to the first two distributional moments. We could instead equate the third and fourth moments, but doing so would produce different numerical estimates. We take the natural approach of equating the first $k$ moments if there are $k$ parameters. For some distributions whose mean and variance do not exist, we may have to equate the -1 or -2 moments (e.g. inverse exponential distribution).
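As an illustration of matching the first two moments, the Python sketch below fits a gamma distribution with parameters $\alpha$ and $\theta$ (scale). It uses the standard gamma moments $E[X]=\alpha \theta$ and $E[X^2]=\alpha (\alpha+1) \theta^2$. The sample values are hypothetical, and the function name `gamma_method_of_moments` is our own.

```python
def gamma_method_of_moments(data):
    """Match the first two sample moments to the gamma moments.

    E[X] = alpha * theta and E[X^2] = alpha * (alpha + 1) * theta^2,
    so E[X^2] - E[X]^2 = alpha * theta^2.
    """
    n = len(data)
    m1 = sum(data) / n                    # first sample moment
    m2 = sum(x * x for x in data) / n     # second sample moment
    spread = m2 - m1 ** 2                 # equals alpha * theta^2
    theta = spread / m1
    alpha = m1 ** 2 / spread
    return alpha, theta

sample = [4.0, 9.5, 2.5, 7.0, 12.0]       # hypothetical claim sizes
alpha_hat, theta_hat = gamma_method_of_moments(sample)
print("alpha_hat:", alpha_hat, "theta_hat:", theta_hat)

# The fitted distribution reproduces the two sample moments.
print("fitted mean         :", alpha_hat * theta_hat)
print("fitted second moment:", alpha_hat * (alpha_hat + 1) * theta_hat ** 2)
```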

The topic of method of moments was covered in the old Exam C but is no longer part of the syllabus of the short-term actuarial math (STAM) exam. However, the method of maximum likelihood estimation (MLE) is still in the STAM syllabus. The method of moments is a great contrast to MLE. For certain distributions (e.g. binomial, Poisson, negative binomial, exponential, gamma and normal), the MLE estimates are identical to the method of moments estimates. It will be advantageous to know the method of moments for these distributions since the method of moments is, in many cases, easier to compute. The method of moments requires the use of moments of various probability distributions. Thus it is also a great review of basic information (moments of various distributions and other distributional quantities) in addition to being background information for MLE.

The method of moments is also discussed here in a companion site.

.

Practice Problem 1-A

An examination of 1,000 insurance claims produces the following summary.

$\displaystyle \sum \limits_{i=1}^{1000} x_i=5476.51$

$\displaystyle \sum \limits_{i=1}^{1000} x_i^2=126450.53$

A Pareto distribution is fitted to the data using the method of moments. Determine the estimates of the Pareto parameters $\alpha$ and $\theta$. Calculate the probability that a claim exceeds 5 using the fitted distribution.

.

Practice Problem 1-B

A sample of size 5 produced the values 1.76, 39.37, 5.81, 7.49 and 0.92. You fit a lognormal distribution using the method of moments. Determine the estimates of the lognormal parameters $\mu$ and $\sigma$. Use these estimates to estimate the probability of observing a value exceeding 2.5.

.

Practice Problem 1-C

The claim size follows a gamma distribution with shape parameter $\alpha=2$ and scale parameter $\theta$. A random sample of 8 claims is obtained and their values are as follows:

5.72, 12.75, 14.51, 8.65, 7.41, 12.55, 9.44, 4.86

Use the method of moments to estimate the scale parameter $\theta$. Determine the mean and variance of the fitted gamma distribution.

.

Practice Problem 1-D

In the current year, there are 500 claims with a total amount of 475,000 from a large pool of policyholders. Claim size is subject to an increase of 10% from the previous year due to inflation. A gamma distribution with shape parameter $\alpha=2$ and unknown scale parameter $\theta$ is used to model the claim size distribution. Use the method of moments to estimate the scale parameter $\theta$ for the next year. Determine the probability of observing a claim next year in excess of 1045.

.

Practice Problem 1-E

A claim size distribution is a mixture of two exponential distributions, one with mean $\theta$ (weight 75%) and one with mean $\tau$ (weight 25%). A sample of 20 claims is observed and is summarized below.

$\displaystyle \sum \limits_{i=1}^{20} x_i=550$

$\displaystyle \sum \limits_{i=1}^{20} x_i^2=99370$

• Use the method of moments to estimate the parameters $\theta$ and $\tau$.
• Determine the probability of observing a claim in excess of 10.
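The two moment equations in Problem 1-E reduce to a quadratic in $\theta$. The sketch below is one way to solve it, assuming the mixture moments $E[X]=0.75 \theta+0.25 \tau$ and $E[X^2]=0.75 (2 \theta^2)+0.25 (2 \tau^2)$. The function name `mixture_mom` is hypothetical. Roots that make either mean non-positive are discarded.

```python
import math

def mixture_mom(m1, m2, w1):
    """Solve w1*theta + w2*tau = m1 and 2*w1*theta^2 + 2*w2*tau^2 = m2.

    Substituting tau = (m1 - w1*theta)/w2 gives the quadratic
    2*w1*theta^2 - 4*m1*w1*theta + (2*m1^2 - m2*w2) = 0.
    """
    w2 = 1 - w1
    a = 2 * w1
    b = -4 * m1 * w1
    c = 2 * m1 ** 2 - m2 * w2
    disc = math.sqrt(b * b - 4 * a * c)
    solutions = []
    for theta in ((-b - disc) / (2 * a), (-b + disc) / (2 * a)):
        tau = (m1 - w1 * theta) / w2
        if theta > 0 and tau > 0:      # both exponential means must be positive
            solutions.append((theta, tau))
    return solutions

m1 = 550 / 20
m2 = 99370 / 20
solutions = mixture_mom(m1, m2, w1=0.75)
theta_hat, tau_hat = solutions[0]

# Survival function of the mixture at 10
p_exceed = 0.75 * math.exp(-10 / theta_hat) + 0.25 * math.exp(-10 / tau_hat)
print(solutions, p_exceed)
```

For this data only one root gives positive values for both $\theta$ and $\tau$, so the fit is unique.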

.

Practice Problem 1-F

The distribution of claim size is a gamma distribution with parameters $\alpha$ and $\theta$ (scale). The following is a random sample of 6 claims.

33, 29, 21, 54, 12, 3

Use the method of moments to estimate the parameters $\alpha$ and $\theta$.
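With both gamma parameters unknown, as in Problem 1-F, the fit uses the mean $\alpha \theta$ and the variance $\alpha \theta^2$. The following is a minimal sketch. The helper name `gamma_mom` is hypothetical, and the variance is the divide-by-$n$ version implied by matching raw moments, not the unbiased sample variance.

```python
def gamma_mom(sample):
    """Return (alpha, theta) matching the first two raw sample moments."""
    n = len(sample)
    m1 = sum(sample) / n
    m2 = sum(x * x for x in sample) / n
    var = m2 - m1 ** 2        # divide-by-n variance
    theta = var / m1          # alpha*theta^2 / (alpha*theta)
    alpha = m1 / theta
    return alpha, theta

alpha_hat, theta_hat = gamma_mom([33, 29, 21, 54, 12, 3])
print(alpha_hat, theta_hat)
```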

.

Practice Problem 1-G

The following is a random sample of losses.

2, 4, 3, 6, 50, 4, 7, 1

The losses $X$ are assumed to follow a Pareto distribution with parameters $\alpha$ and $\theta$.

• Use the method of moments to estimate the parameters $\alpha$ and $\theta$.
• Determine $E[X \wedge 50]$, the limited expected value at 50.
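For Problem 1-G, the sketch below fits the Pareto parameters from the raw sample and then evaluates the limited expected value. It assumes the Type II Pareto formula $\displaystyle E[X \wedge d]=\frac{\theta}{\alpha-1} \biggl[1-\biggl(\frac{\theta}{d+\theta}\biggr)^{\alpha-1}\biggr]$ for $\alpha \ne 1$. The function name `pareto_lev` is a hypothetical helper.

```python
def pareto_lev(d, alpha, theta):
    """Limited expected value E[min(X, d)] for a Type II Pareto (alpha != 1)."""
    return theta / (alpha - 1) * (1 - (theta / (d + theta)) ** (alpha - 1))

losses = [2, 4, 3, 6, 50, 4, 7, 1]
n = len(losses)
m1 = sum(losses) / n
m2 = sum(x * x for x in losses) / n

# m2/m1^2 = 2*(alpha-1)/(alpha-2), solved for alpha
ratio = m2 / m1 ** 2
alpha_hat = 2 * (ratio - 1) / (ratio - 2)
theta_hat = m1 * (alpha_hat - 1)

lev_50 = pareto_lev(50, alpha_hat, theta_hat)
print(alpha_hat, theta_hat, lev_50)
```

The limited expected value is below the fitted mean of 9.625 because losses above 50 are capped at 50.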

.

Practice Problem 1-H

Observations $x_1,x_2,\cdots,x_n$ are made about a certain distribution. The following is the summary.

$\displaystyle \frac{1}{n} \sum \limits_{i=1}^{n} x_i=0.825$

$\displaystyle \frac{1}{n} \sum \limits_{i=1}^{n} x_i^2=0.720$

The beta distribution with the following density function

$\displaystyle f(x)=\frac{\Gamma(a+b)}{\Gamma(a) \ \Gamma(b)} \ x^{a-1} \ (1-x)^{b-1} \ \ \ \ \ \ 0<x<1$

is fitted to the observed data. Use the method of moments to estimate the parameters $a$ and $b$.
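A sketch of the beta fit in Problem 1-H follows. It assumes the standard beta moments, mean $a/(a+b)$ and variance $ab/[(a+b)^2 (a+b+1)]$, from which $a+b=E[X] (1-E[X])/Var[X]-1$. The function name `beta_mom` is hypothetical.

```python
def beta_mom(m1, m2):
    """Return (a, b) matching the first two raw moments of a beta distribution."""
    var = m2 - m1 ** 2
    k = m1 * (1 - m1) / var - 1     # k equals a + b
    return m1 * k, (1 - m1) * k

a_hat, b_hat = beta_mom(0.825, 0.720)
print(a_hat, b_hat)
```

Solving for the sum $a+b$ first keeps the algebra to a single division, after which $a$ and $b$ follow from the mean.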

.

Practice Problem 1-I

You are given the following claim data for a large group of insurance policyholders.

Year # of Claims Total Claim Amount
1 200 35,000
2 250 48,000

Sizes of claims from this group of policyholders are subject to an annual inflation of 10%. Claim size is to be modeled by a Pareto distribution with parameters $\alpha=3.5$ and $\theta$.

Use the method of moments to estimate the parameter $\theta$ in year 3.
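The trending step in Problem 1-I can be sketched as follows. It assumes each year's total is inflated to the year 3 level at 10% per year before pooling, and that the Pareto mean is $\theta/(\alpha-1)$. The variable names are illustrative only.

```python
inflation = 0.10
alpha = 3.5                       # known Pareto shape parameter
target_year = 3

# year -> (number of claims, total claim amount)
claims = {1: (200, 35_000), 2: (250, 48_000)}

total_count = sum(count for count, _ in claims.values())

# Bring each year's total to the year 3 cost level
trended_total = sum(
    total * (1 + inflation) ** (target_year - year)
    for year, (_, total) in claims.items()
)

trended_mean = trended_total / total_count
theta_hat = trended_mean * (alpha - 1)    # from theta/(alpha-1) = mean
print(trended_mean, theta_hat)
```

Year 1 amounts are inflated for two years and year 2 amounts for one year, so the pooled average is on the year 3 basis.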

.

Practice Problem 1-J

The following 5 claims are sampled from a lognormal distribution with parameters $\mu$ and $\sigma$.

5, 43, 8, 11, 3

• Use the method of moments to estimate the parameters $\mu$ and $\sigma$.
• Determine the 80th percentile of the fitted lognormal distribution.
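The percentile part of Problem 1-J can be sketched as below, assuming the same lognormal moment equations $E[X]=e^{\mu+\sigma^2/2}$ and $E[X^2]=e^{2 \mu+2 \sigma^2}$. The 80th percentile is $e^{\mu+z \sigma}$, where $z$ is the 80th percentile of the standard normal distribution. Here $z$ comes from `statistics.NormalDist().inv_cdf` rather than a printed table.

```python
import math
from statistics import NormalDist

claims = [5, 43, 8, 11, 3]
n = len(claims)
m1 = sum(claims) / n
m2 = sum(x * x for x in claims) / n

# Match ln(m1) = mu + sigma^2/2 and ln(m2) = 2*mu + 2*sigma^2
sigma2 = math.log(m2) - 2 * math.log(m1)
sigma_hat = math.sqrt(sigma2)
mu_hat = math.log(m1) - sigma2 / 2

z80 = NormalDist().inv_cdf(0.80)            # about 0.8416
p80 = math.exp(mu_hat + z80 * sigma_hat)    # 80th percentile of the fit
print(mu_hat, sigma_hat, p80)
```

With the unrounded $z$ the percentile is about 19.94. A table value of $z=0.84$ gives a slightly smaller result.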

Problem Answer
1-A
• $\hat{\alpha}=2.9025$ and $\hat{\theta}=10.4189$
• $\displaystyle P[X > 5]=0.321$
1-B
• $\hat{\mu}=1.9108$ and $\hat{\sigma}=0.9934$
• $\displaystyle P[X > 2.5]=0.8413$
1-C
• $\hat{\theta}=4.743125$
• $E[X]=9.48625$ and $Var[X]=44.99446953$
1-D
• $\hat{\theta}=522.5$
• $\displaystyle P[X > 1045]=3 e^{-2}=0.4060$
1-E
• $\hat{\theta}=3.5$ and $\hat{\tau}=99.5$
• $\displaystyle P[X > 10]=0.26917$
1-F
• $\hat{\alpha}=2.4228$ and $\hat{\theta}=10.4561$
1-G
• $\hat{\alpha}=3.2903$ and $\hat{\theta}=22.0443$
• $E[X \wedge 50]=8.9861$
1-H
• $\displaystyle \hat{a}=\frac{11}{5}=2.2$ and $\displaystyle \hat{b}=\frac{7}{15}=0.4667$
1-I
• $\displaystyle \hat{\theta}=528.6111$
1-J
• $\displaystyle \hat{\mu}=2.2657$ and $\displaystyle \hat{\sigma}=0.8642$
• 80th percentile = $e^{2.991628}=19.918$

$\copyright$ 2018 – Dan Ma