# Estimating parameters of discrete distributions

Previous posts focus on maximum likelihood estimation for continuous distributions (this post and this post). In this post, we shift our attention to parameter estimation for discrete distributions, in particular the three commonly used discrete distributions – Poisson, binomial and negative binomial.

Practice problems to reinforce concepts discussed here are found here.

Practice problems for maximum likelihood estimation for continuous distributions are found here and here.

In fitting a discrete distribution to observed data, we focus on two procedures – method of moments and maximum likelihood estimation.

For method of moments estimation, we equate the sample mean with the population mean for distributions with one parameter (e.g. Poisson). For distributions with two parameters (e.g. negative binomial), we equate the sample mean with the population mean and the sample variance with the population variance. Of course, for two-parameter distributions, instead of equating the sample variance with the population variance, we can instead equate the sample second moment with the population second moment.

For maximum likelihood estimation (MLE), the idea is similar to MLE for continuous distributions. In the discrete case, use the probability function (or probability mass function) to set up a likelihood function instead of the probability density function. The rest of the procedure works similarly – take the natural log of the likelihood function, take derivative(s) and solve the equation(s) resulting from equating the derivative(s) to zero. In addition to using examples, we point out the issues in implementing MLE for negative binomial distribution and binomial distribution.

## Poisson Distribution

The Poisson distribution has only one parameter $\lambda$, which is the mean of the distribution. When complete data is available, the method of moments estimate of $\lambda$ would be the sample mean and the maximum likelihood estimate of $\lambda$ is also the sample mean. Thus for the Poisson distribution, the method of moments estimate coincides with the maximum likelihood estimate in the presence of complete data. However, when the sample data is not complete data (e.g. grouped data, censored data or truncated data), the maximum likelihood estimate of $\lambda$ does not equal the method of moments estimate.

Example 1
The claim frequency data of 100 insureds is given in the following table.

| # of Claims | # of Insureds |
|:---:|:---:|
| 0 | 40 |
| 1 | 24 |
| 2 | 20 |
| 3 | 8 |
| 4 | 5 |
| 5 | 3 |
| 6+ | 0 |
| Total | 100 |

A Poisson distribution is fitted to the claim frequency data using maximum likelihood estimation. Determine the resulting estimate of the probability of having zero claims.

The sample mean frequency is:

$\displaystyle \overline{x}=\frac{0 \cdot 40+1 \cdot 24+2 \cdot 20+3 \cdot 8+4 \cdot 5+5 \cdot 3}{100}=\frac{123}{100}=1.23$

The method of moments estimate of the mean $\lambda$ is $\hat{\lambda}=1.23$, which is also the maximum likelihood estimate. The estimated probability of having zero claims is $e^{-1.23}=0.29229$.
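As a quick check, the sample mean and the fitted probability can be computed in a few lines of Python (a minimal sketch; the frequency table is keyed by number of claims):

```python
import math

# Example 1 frequency table: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

n = sum(counts.values())                            # sample size (100)
xbar = sum(k * nk for k, nk in counts.items()) / n  # sample mean = MLE of lambda

p0 = math.exp(-xbar)  # fitted Poisson probability of zero claims
```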

Example 2

The following table gives the claim frequency data of a group of insureds.

| # of Claims | # of Insureds |
|:---:|:---:|
| 0 or 1 | 26 |
| 2 | 12 |
| 3 | 3 |
| 4+ | 0 |

Fit the Poisson distribution to the claim frequency data using maximum likelihood. Determine the estimated probability of observing 0 or 1 claim.

Since the given observed claim frequency data is not complete data, do not equate the maximum likelihood estimate with the sample mean. In any case, the sample mean is murky here since we do not know how many of the 26 insureds have zero claims. The probability of 0 or 1 claim is $P(X=0 \text{ or } 1)=e^{-\lambda}(1+\lambda)$. The likelihood function is given by the following.

$\displaystyle \begin{aligned} L(\lambda)&=\biggl(e^{-\lambda}(1+\lambda) \biggr)^{26} \biggl(\frac{1}{2!} \ \lambda^2 e^{-\lambda} \biggr)^{12} \biggl(\frac{1}{3!} \ \lambda^3 e^{-\lambda}\biggr)^3 \\&=C \ (1+\lambda)^{26} \ \lambda^{33} \ e^{-41 \lambda} \end{aligned}$

The $C$ in $L(\lambda)$ is a multiplicative constant term that can be ignored. The following gives the log-likelihood function and its derivative.

$\displaystyle l(\lambda)=26 \ \ln(1+\lambda) +33 \ \ln(\lambda)-41 \ \lambda$

$\displaystyle \frac{d}{d \lambda} \ l(\lambda)=\frac{26}{1+\lambda}+\frac{33}{\lambda}-41=0$

Setting the derivative equal to zero leads to the quadratic equation $41 \lambda^2-18 \lambda -33=0$. Solving this equation produces the following estimate of $\lambda$ and the estimated probability.

$\hat{\lambda}=1.14313$.

$P(X=0 \text{ or } 1)=e^{-\hat{\lambda}} (1+\hat{\lambda})=0.68327$
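The root of the quadratic and the resulting probability can be verified numerically; a small Python sketch:

```python
import math

# The first-order condition in Example 2 reduces to the quadratic
# 41*lam^2 - 18*lam - 33 = 0; take the positive root.
a, b, c = 41.0, -18.0, -33.0
lam_hat = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# estimated probability of observing 0 or 1 claim
p01 = math.exp(-lam_hat) * (1 + lam_hat)
```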

## Negative Binomial Distribution

The negative binomial distribution has two parameters. We consider two parametrizations of the negative binomial distribution.

(1)……$\displaystyle P(X=k)=\binom{r+k-1}{k} \ p^r \ (1-p)^k \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ k=0,1,2,3,\cdots$

(2)……$\displaystyle P(X=k)=\binom{r+k-1}{k} \ \biggl(\frac{1}{1+\theta} \biggr)^r \ \biggl(\frac{\theta}{1+\theta} \biggr)^k \ \ \ \ \ \ k=0,1,2,3,\cdots$

Depending on the version, the negative binomial parameters are either $r$ and $p$ or $r$ and $\theta$. To get ready for method of moments estimation, note the population mean and variance in the two versions.

(3)……$\displaystyle \mu=r \ \frac{1-p}{p} \ \ \ \ \ \ \ \ \ \ \sigma^2=r \ \frac{1-p}{p^2}$

(4)……$\displaystyle \mu=r \ \theta \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \sigma^2=r \ \theta (1+\theta)$

Equating the sample mean $\overline{x}$ with $\mu$ and the sample variance $\hat{\sigma}^2$ with $\sigma^2$ produces the following method of moments estimates.

(5)……$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}} \ \ \ \ \ \ \ \ \ \ \ \hat{p}=\frac{\overline{x}}{\hat{\sigma}^2}$

(6)……$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}} \ \ \ \ \ \ \ \ \ \ \ \hat{\theta}=\frac{\hat{\sigma}^2-\overline{x}}{\overline{x}}$

The estimates in (5) are the method of moments estimates for the negative binomial distribution as described in (1). The estimates in (6) are the method of moments estimates for the negative binomial distribution as described in (2). For both cases to work, the sample variance must exceed the sample mean, i.e. $\hat{\sigma}^2>\overline{x}$. In both (5) and (6), the sample variance $\hat{\sigma}^2$ is the biased sample variance, i.e. the one obtained by dividing by the sample size $n$ rather than $n-1$.

Example 3
Use the sample data in Example 1. Fit negative binomial distribution to the observed claim frequency data using method of moments. Determine the probability of observing zero claims according to the fitted distribution.

From Example 1, $\overline{x}=1.23$. The following gives the sample variance.

$\displaystyle \hat{\sigma}^2=\frac{24 \cdot 1+20 \cdot 2^2+8 \cdot 3^2+5 \cdot 4^2+3 \cdot 5^2}{100}-1.23^2=\frac{331}{100}-1.5129=1.7971$

Note that the biased sample variance is the second raw moment minus the square of the sample mean. According to (5), the estimates of $r$ and $p$ are:

$\displaystyle \hat{r}=\frac{\overline{x}^2}{\hat{\sigma}^2-\overline{x}}=\frac{1.23^2}{1.7971-1.23}=\frac{1.5129}{0.5671}=2.66778$

$\displaystyle \hat{p}=\frac{\overline{x}}{\hat{\sigma}^2}=\frac{1.23}{1.7971}=0.68444$

Then $P_0$, the probability of observing zero claims, is $\hat{p}^{\hat{r}}=0.36367$.
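The method of moments fit can be reproduced in a few lines of Python (a sketch; the biased sample variance is computed as the second raw moment minus the squared sample mean):

```python
# Example 1 frequency table: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}
n = sum(counts.values())

xbar = sum(k * nk for k, nk in counts.items()) / n
m2 = sum(k * k * nk for k, nk in counts.items()) / n  # second raw moment
var = m2 - xbar ** 2                                  # biased sample variance

# method of moments estimates (5); they require var > xbar
r_hat = xbar ** 2 / (var - xbar)
p_hat = xbar / var
p0 = p_hat ** r_hat   # fitted probability of zero claims, p^r
```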

When both parameters are unknown, maximum likelihood estimation for the negative binomial distribution requires using a numerical software package. The following example demonstrates why.

Example 4
The observed claim counts for three insureds are 0, 1 and 2. Fit a negative binomial distribution to the observed data using maximum likelihood.

The likelihood function is based on the probability function in (1).

$\displaystyle \begin{aligned} L=L(r,p)&=p^r \cdot \binom{r}{1} p^r (1-p)^1 \cdot \binom{r+1}{2} \ p^r \ (1-p)^2 \\&=\frac{1}{2} \ r^2 \ (r+1) \ p^{3r} \ (1-p)^3 \end{aligned}$

$\displaystyle l=\ln L=2 \ \ln r+\ln (r+1)+3 r \ln p+3 \ln (1-p)$

Take the partial derivatives with respect to both parameters and set each equal to zero.

$\displaystyle \frac{\partial}{\partial r} \ l=\frac{2}{r}+\frac{1}{r+1}+3 \ln p=0$

$\displaystyle \frac{\partial}{\partial p} \ l=\frac{3 r}{p}-\frac{3}{1-p}=0$

The second equation gives $r$ in terms of $p$. Substituting into the first equation produces the following.

$\displaystyle r =\frac{p}{1-p}$

$\displaystyle 2 \frac{1-p}{p}+1-p+3 \ln p =0$

Solving for $p$ in the last equation would require numerical techniques. In fact, for this particular sample the biased sample variance ($2/3$) is less than the sample mean ($1$), and the left-hand side of the last equation is strictly positive for $0<p<1$, approaching zero only as $p \rightarrow 1$. The likelihood has no interior maximum; it keeps increasing toward the Poisson limit. This is all the more reason to rely on software when fitting both parameters.
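When the sample variance does exceed the sample mean, a numerical fit of both parameters is straightforward. The sketch below (Python, using the Example 1 data) exploits the fact that for a fixed $r$, the likelihood under parametrization (1) is maximized at $p = r/(r+\overline{x})$, so only $r$ needs to be searched over; a crude grid search stands in for a proper optimizer:

```python
from math import lgamma, log

# Example 1 frequency table: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}
n = sum(counts.values())
xbar = sum(k * nk for k, nk in counts.items()) / n

def profile_loglik(r):
    """Log-likelihood with p profiled out: p = r / (r + xbar).

    ln C(r+k-1, k) = lgamma(r+k) - lgamma(k+1) - lgamma(r),
    which also works for non-integer r.
    """
    p = r / (r + xbar)
    return sum(
        nk * (lgamma(r + k) - lgamma(k + 1) - lgamma(r)
              + r * log(p) + k * log(1 - p))
        for k, nk in counts.items()
    )

# crude grid search over r; a real application would use an optimizer
grid = [i / 100 for i in range(10, 1001)]   # r from 0.10 to 10.00
r_hat = max(grid, key=profile_loglik)
p_hat = r_hat / (r_hat + xbar)
```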

In light of Example 4, we do not focus on MLE for the case where both negative binomial parameters are unknown. When the $r$ parameter is known, maximum likelihood estimation works like the method of moments in that the fitted mean $r \ \theta$ is set equal to the sample mean.

Example 5
Using maximum likelihood estimation, fit the negative binomial distribution with parameters $r=2$ and $\theta$ to the claim frequency data in Example 1. Determine the probability of observing zero claims according to the fitted distribution.

With $r \theta=\overline{x}$, we have $\hat{\theta}=\frac{\overline{x}}{2}=\frac{1.23}{2}=0.615$. Then the probability of observing zero claims is $P(X=0)=(\frac{1}{1+\hat{\theta}})^{2}=\frac{1}{1.615^2}=0.3834$.
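In code this calculation is a one-liner; a quick Python check:

```python
xbar = 1.23   # sample mean from Example 1
r = 2         # r is assumed known

theta_hat = xbar / r                 # MLE: fitted mean r * theta equals xbar
p0 = (1 / (1 + theta_hat)) ** r      # fitted probability of zero claims
```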

## Binomial Distribution

The binomial distribution has two parameters $m$ and $p$ where $m$ is a positive integer and $p$ is a real number between 0 and 1. This is a model for counting the number of successes in performing a series of $m$ independent Bernoulli trials (a Bernoulli trial is a random experiment in which there are two distinct outcomes called success and failure). Usually the $m$ parameter is denoted by $n$. However, we already use $n$ to mean the sample size. So the parameters of the binomial distribution are $m$ and $p$. The following is the probability function.

(7)……$\displaystyle P(X=k)=\binom{m}{k} \ p^k \ (1-p)^{m-k} \ \ \ \ \ \ \ \ \ \ \ k=0,1,2,\cdots,m$

The mean of the binomial distribution is $\mu=m \ p$ and its variance is $\sigma^2=m \ p \ (1-p)$. When both parameters $m$ and $p$ are unknown, we can use method of moments estimation. However, the estimate $\hat{m}$ will typically not be an integer. In that case, the compromise is to round $\hat{m}$ to the nearest integer. This is one pitfall of working with an integer parameter.

For maximum likelihood estimation, let’s start with the simpler case that $m$ is known. In this case the parameter $p$ is the only one that needs to be estimated. Suppose that $x_1,x_2,\cdots,x_n$ is the sample data where $0 \le x_i \le m$ for each $i$. Then the maximum likelihood estimator of $p$ is given by

(8)……$\displaystyle \hat{p}=\frac{\sum \limits_{i=1}^n x_i }{m \ n}$

There is a handy way to interpret the estimate $\hat{p}$. Each data point $x_i$ is an observed number of successes in $m$ Bernoulli trials. In a sample of size $n$, $m \ n$ is the total number of trials, and the sum of the $x_i$ is the total number of successes out of the $m \ n$ trials. Thus $\hat{p}$ is the sample proportion of successes.

Based on (8), $m \hat{p}=\overline{x}$. When the parameter $m$ is known, the maximum likelihood estimate $\hat{p}$ is also the method of moments estimate.
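Formula (8) translates directly to code. A small Python sketch with a made-up sample (four observations of successes out of $m=2$ trials each; the data is purely illustrative):

```python
def mle_binomial_p(data, m):
    """MLE of p when m is known: total successes divided by total trials."""
    return sum(data) / (m * len(data))

# hypothetical sample: successes out of m = 2 trials for each of 4 subjects
p_hat = mle_binomial_p([0, 1, 2, 1], m=2)   # 4 successes out of 8 trials
```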

When both $m$ and $p$ are unknown, the maximum likelihood estimation of $m$ and $p$ is done by creating a likelihood profile for various possible values of $m$. A possible value of $m$ has to be at least as large as the largest observation in the sample. The steps for creating a likelihood profile are as follows:

1. Start with the value of $m$ that is the largest observed value.
2. Using the chosen $m$, calculate $\hat{p}$ according to (8).
3. Evaluate the log-likelihood at $\hat{p}$.
4. Increase $m$ by 1.
5. Repeat Step 2 to Step 4 until a maximum in log-likelihood is found.

For the likelihood profile approach to work, the sample variance must be less than the sample mean. Otherwise, the log-likelihood values will increase without bound (see Problem 4-J here).

Example 6
Claim frequency data has been collected from 100 insureds and is given in the following table.

| # of Claims | # of Insureds |
|:---:|:---:|
| 0 | 30 |
| 1 | 40 |
| 2 | 25 |
| 3 | 5 |
| 4+ | 0 |

Fit the binomial distribution to the given claim frequency data using the method of moments.

The following gives the sample mean and sample variance.

$\displaystyle \overline{x}=\frac{0 \cdot 30+1 \cdot 40+2 \cdot 25+3 \cdot 5}{100}=\frac{105}{100}=1.05$

$\displaystyle \begin{aligned} \hat{\sigma}^2&=\frac{30 \ (0-1.05)^2+40 \ (1-1.05)^2 +25 \ (2-1.05)^2+5 \ (3-1.05)^2}{100} \\&=\frac{74.75}{100}=0.7475 \end{aligned}$

Note that the sample variance is less than the sample mean. It is then possible to fit a binomial distribution to the observed data. This fact is crucial for performing maximum likelihood estimation (the next two examples). The following steps give the method of moments estimates.

$\displaystyle m \ p=1.05$

$\displaystyle m \ p \ (1-p)=0.7475$

$\displaystyle 1-\hat{p}=\frac{0.7475}{1.05} \ \ \ \ \rightarrow \ \ \ \ \hat{p}=1-\frac{0.7475}{1.05}=0.28810$

$\displaystyle \hat{m}=\frac{1.05}{\hat{p}}=3.6446$

Since the calculated $\hat{m}$ is not an integer, round it to 4 and re-solve $m \ p=1.05$ with $m=4$, which gives $\hat{p}=1.05/4=0.2625$. As a result, the method of moments estimates are $\hat{m}=4$ and $\hat{p}=0.2625$.
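The method of moments steps can be scripted; a Python sketch for the Example 6 data:

```python
# Example 6 frequency table: number of claims -> number of insureds
counts = {0: 30, 1: 40, 2: 25, 3: 5}
n = sum(counts.values())

xbar = sum(k * nk for k, nk in counts.items()) / n
var = sum(nk * (k - xbar) ** 2 for k, nk in counts.items()) / n

# method of moments: m p = xbar and m p (1 - p) = var
p_raw = 1 - var / xbar
m_raw = xbar / p_raw          # 3.6446..., not an integer

m_hat = round(m_raw)          # round to the nearest integer
p_hat = xbar / m_hat          # re-solve m p = xbar with the integer m
```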

Example 7
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood estimation. Assume that $m$ is known with $m$ ranging from 3 to 8.

The maximum likelihood estimate of $p$ can be obtained by formula (8). The estimates are:

$\displaystyle m=3 \ \ \ \ \ \ \hat{p}=\frac{105}{3 \cdot 100}=\frac{105}{300}=0.35$

$\displaystyle m=4 \ \ \ \ \ \ \hat{p}=\frac{105}{4 \cdot 100}=\frac{105}{400}=0.2625$

$\displaystyle m=5 \ \ \ \ \ \ \hat{p}=\frac{105}{5 \cdot 100}=\frac{105}{500}=0.21$

$\displaystyle m=6 \ \ \ \ \ \ \hat{p}=\frac{105}{6 \cdot 100}=\frac{105}{600}=0.175$

$\displaystyle m=7 \ \ \ \ \ \ \hat{p}=\frac{105}{7 \cdot 100}=\frac{105}{700}=0.15$

$\displaystyle m=8 \ \ \ \ \ \ \hat{p}=\frac{105}{8 \cdot 100}=\frac{105}{800}=0.13125$

Example 8
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood. Assume that both parameters $m$ and $p$ are unknown. The maximum likelihood estimation is performed by creating a likelihood profile as described above.

The largest observation in the sample is 3 (there are 5 such observations). In creating the likelihood profile, the starting value of $m$ is 3. Use this $m$ value to set up the likelihood function $L$ and the corresponding log-likelihood function $l$. Then evaluate $l$ at $\hat{p}=0.35$ (found in Example 7).

$\displaystyle \begin{aligned} L&=\biggl((1-p)^3 \biggr)^{30} \biggl(3 \cdot p^1 \cdot (1-p)^2 \biggr)^{40} \biggl(3 \cdot p^2 \cdot (1-p)^1 \biggr)^{25} \biggl(p^3 \biggr)^{5} \\&=3^{40} \cdot 3^{25} \cdot p^{105} \cdot (1-p)^{195} \end{aligned}$

$\displaystyle l=\ln(L)=40 \ \ln(3)+25 \ \ln(3)+105 \ \ln(p)+195 \ \ln(1-p)$

$\displaystyle \begin{aligned} l(\hat{p})=l(0.35)&=40 \ \ln(3)+25 \ \ln(3)+105 \ \ln(0.35)+195 \ \ln(0.65) \\&=-122.8242 \end{aligned}$

Next, perform the same process using $m=4$. The process is continued until a maximum in log-likelihood is found. The following table shows the results.

| $\hat{m}$ | $\hat{p}$ | log-likelihood |
|:---:|:---:|:---:|
| 3 | 0.35 | -122.8242 |
| 4 | 0.2625 | -123.0849 |
| 5 | 0.21 | -123.5232 |
| 6 | 0.175 | -123.8856 |
| 7 | 0.15 | -124.1701 |
| 8 | 0.13125 | -124.3953 |

The log-likelihood is the greatest at the starting value of $m=3$. The log-likelihood decreases as $m$ increases. Thus the maximum likelihood estimates are $\hat{m}=3$ and $\hat{p}=0.35$.
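The whole profile can be generated with a short loop; a Python sketch reproducing the calculation:

```python
from math import comb, log

# Example 6 frequency table: number of claims -> number of insureds
counts = {0: 30, 1: 40, 2: 25, 3: 5}
n = sum(counts.values())
total = sum(k * nk for k, nk in counts.items())   # 105 claims in all

def loglik(m):
    """Binomial log-likelihood evaluated at the MLE p = total / (m n)."""
    p = total / (m * n)
    return sum(
        nk * (log(comb(m, k)) + k * log(p) + (m - k) * log(1 - p))
        for k, nk in counts.items()
    )

profile = {m: loglik(m) for m in range(3, 9)}
m_hat = max(profile, key=profile.get)   # maximizing m
p_hat = total / (m_hat * n)             # corresponding p
```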

## Other Considerations

Poisson, binomial and negative binomial are three commonly used discrete distributions. One important distinction among these three distributions is that the mean and variance are identical for the Poisson distribution, the mean is greater than the variance for the binomial distribution, and the mean is less than the variance for the negative binomial distribution. Thus we have the following observation.

In examining sample data for discrete distributions, we should compare the sample mean and the sample variance. If the sample mean is roughly the same as the sample variance, then Poisson might be a good fit. If the sample mean is greater than the sample variance, the binomial distribution might be a good fit. If the sample mean is less than the sample variance, the negative binomial distribution might be a good fit.

The universe of discrete distributions is larger than the three commonly used discrete distributions. However, the guideline described in the above paragraph is a good starting point in the modeling process.

For the sample claim frequency data in Example 1, the sample mean is 1.23 and the sample variance is 1.7971. Among the three distributions of Poisson, binomial and negative binomial, the negative binomial distribution best represents the data. For the sample claim frequency data in Example 6, the binomial distribution best represents the data since the sample variance is significantly less than the sample mean.

The above observation about comparing the sample mean and sample variance is a useful one. When fitting a Poisson, binomial or negative binomial distribution, there is another technique that is more refined. The key is to consider these distributions as members of the (a,b,0) class of distributions (the (a,b,0) class is introduced here). The distributions in the (a,b,0) class are characterized by the following recursive relation.

(9)……$\displaystyle \frac{P_k}{P_{k-1}}=a+\frac{b}{k} \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The notation $P_k$ refers to the probability that the distribution takes on the value of $k$. For any member of the (a,b,0) class, the probabilities can be generated according to (9) for some constants $a$ and $b$. The three commonly used discrete distributions – Poisson, binomial, and negative binomial – are (a,b,0) distributions. This means that any one of these distributions can be generated recursively using (9). See Table 1 in this post for the $a$ and $b$ associated with each of the three distributions. The relation (9) can be rearranged as follows:

(10)……$\displaystyle k \ \frac{P_k}{P_{k-1}}=a k +b \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The relation (10) says that the ratio $\frac{k \ P_k}{P_{k-1}}$ is a linear function of $k$ with the slope being $a$ and the y-intercept being $b$. If the (a,b,0) distribution is a Poisson distribution, then $a=0$. If the (a,b,0) distribution is a negative binomial distribution, then $a>0$. If the (a,b,0) distribution is a binomial distribution, then $a<0$. Thus the slope in (10) is an indicator of the (a,b,0) distribution.

Using observed data, $P_k$ is estimated by the ratio $\frac{n_k}{n}$ where $n$ is the sample size and $n_k$ is the number of observations that equal $k$. Then relation (10) is approximated by the following.

(11)……$\displaystyle k \ \frac{n_k}{n_{k-1}} \approx a k +b \ \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

If the sample data is drawn from an (a,b,0) distribution, the quantity on the left-hand side of (11) should have a linear pattern when plotted against $k$. If the plot is roughly horizontal, it is an indication that the (a,b,0) distribution is a Poisson distribution. If the plot has a positive slope, it is an indication that the (a,b,0) distribution is a negative binomial distribution. If the plot has a negative slope, it is an indication that the (a,b,0) distribution is a binomial distribution. This is further discussed in the following example.

Example 9
Consider the sample claim frequency data in Example 1. The quantities $\frac{k \ n_k}{n_{k-1}}$ are shown in the following table.

| $k$ | $n_k$ | $\displaystyle \frac{k \ n_k}{n_{k-1}}$ |
|:---:|:---:|:---:|
| 0 | 40 | |
| 1 | 24 | 0.6 |
| 2 | 20 | 1.67 |
| 3 | 8 | 1.2 |
| 4 | 5 | 2.5 |
| 5 | 3 | 3 |
| 6+ | 0 | |

A plot of the ratio $\frac{k \ n_k}{n_{k-1}}$ against $k$ shows roughly a linear pattern, and the slope is clearly positive. This suggests that the negative binomial distribution is a good fit.

When fitting an (a,b,0) distribution, it is a good idea to construct a plot according to relation (11). A couple of caveats: any category with $n_k=0$ cannot be used in the plot, and the plot is less reliable when there is an insufficient amount of data.
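The ratios and the sign of the fitted slope can be computed directly; a Python sketch using the Example 1 data (the slope comes from a simple least-squares fit, an informal diagnostic rather than a formal test):

```python
# Example 1 frequency table: number of claims -> number of insureds
counts = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

# ratios k * n_k / n_{k-1} from relation (11); skip k with n_{k-1} = 0
ks = [k for k in sorted(counts) if k >= 1 and counts.get(k - 1, 0) > 0]
ys = [k * counts[k] / counts[k - 1] for k in ks]

# least-squares slope of the ratios against k: roughly zero suggests
# Poisson, positive suggests negative binomial, negative suggests binomial
k_mean = sum(ks) / len(ks)
y_mean = sum(ys) / len(ys)
slope = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys)) \
        / sum((k - k_mean) ** 2 for k in ks)
```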


© 2019 – Dan Ma
