Previous posts focus on maximum likelihood estimation for continuous distributions (this post and this post). In this post we shift the attention to parameter estimation for discrete distributions, in particular, the three commonly used discrete distributions – Poisson, binomial and negative binomial.
Practice problems to reinforce concepts discussed here are found here.
Practice problems for maximum likelihood estimation for continuous distributions are found here and here.
In fitting a discrete distribution to observed data, we focus on two procedures – method of moments and maximum likelihood estimation.
For method of moments estimation, we adopt the approach of equating the sample mean with the population mean for distributions with one parameter (e.g. Poisson) and equating the sample mean with the population mean and equating the sample variance with the population variance for distributions with two parameters (e.g. negative binomial). Of course, for two-parameter distributions, instead of equating sample variance with population variance, we can instead equate sample second moment with population second moment.
For maximum likelihood estimation (MLE), the idea is similar to MLE for continuous distributions. In the discrete case, the likelihood function is set up using the probability function (or probability mass function) instead of the probability density function. The rest of the procedure works similarly: take the natural log of the likelihood function, take the derivative(s), set the derivative(s) equal to zero and solve the resulting equation(s). In addition to working through examples, we point out the issues in implementing MLE for the negative binomial distribution and the binomial distribution.
Poisson Distribution
The Poisson distribution has only one parameter $\lambda$, which is the mean of the distribution. When complete data is available, the method of moments estimate of $\lambda$ would be the sample mean and the maximum likelihood estimate of $\lambda$ is also the sample mean. Thus for the Poisson distribution, the method of moments estimate coincides with the maximum likelihood estimate in the presence of complete data. However, when the sample data is not complete data (e.g. grouped data, censored data or truncated data), the maximum likelihood estimate of $\lambda$ does not equal the method of moments estimate.
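To see why the two estimates agree for complete data, the short derivation below may help. It assumes a sample $x_1, \dots, x_n$ of individual claim counts with sample mean $\bar{x}$; this notation is introduced here for illustration.

```latex
\begin{aligned}
L(\lambda) &= \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} \\
\ln L(\lambda) &= -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big) \ln \lambda - \sum_{i=1}^{n} \ln (x_i!) \\
\frac{d}{d\lambda} \ln L(\lambda) &= -n + \frac{1}{\lambda} \sum_{i=1}^{n} x_i = 0
\quad \Longrightarrow \quad \hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\end{aligned}
```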
Example 1
The claim frequency data of 100 insureds is given in the following table.
# of Claims | # of Insureds
0 | 40
1 | 24
2 | 20
3 | 8
4 | 5
5 | 3
6+ | 0
Total | 100
A Poisson distribution is fitted to the claim frequency data using maximum likelihood estimation. Determine the resulting estimate of the probability of having zero claims.
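The sample mean frequency is:

$$\bar{x} = \frac{0(40) + 1(24) + 2(20) + 3(8) + 4(5) + 5(3)}{100} = \frac{123}{100} = 1.23$$

The method of moments estimate of the mean $\lambda$ is $\hat{\lambda} = 1.23$, which is also the maximum likelihood estimate. The estimated probability of having zero claims is $e^{-1.23} \approx 0.2923$.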
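The arithmetic in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the dictionary `claims` and the variable names are introduced here for illustration.

```python
import math

# Claim frequency data from Example 1: number of claims -> number of insureds.
claims = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

n = sum(claims.values())
sample_mean = sum(k * count for k, count in claims.items()) / n

# With complete data, the Poisson MLE of lambda is the sample mean.
lam_hat = sample_mean

# Estimated probability of zero claims under the fitted Poisson model.
p_zero = math.exp(-lam_hat)

print(f"lambda_hat = {lam_hat:.4f}, P(0 claims) = {p_zero:.4f}")
```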
Example 2
The following table gives the claim frequency data of a group of insureds.
# of Claims | # of Insureds
0 or 1 | 26
2 | 12
3 | 3
4+ | 0
Fit the Poisson distribution to the claim frequency data using maximum likelihood. Determine the estimated probability of observing 0 or 1 claim.
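Since the given observed claim frequency data is not complete data, do not equate the maximum likelihood estimate with the sample mean. In any case, the sample mean is a little murky since we do not know how many of the 26 insureds have zero claims. The probability of 0 or 1 claim is $e^{-\lambda} + \lambda e^{-\lambda} = (1+\lambda) e^{-\lambda}$. The likelihood function is given by the following.

$$L(\lambda) = \left[ (1+\lambda) e^{-\lambda} \right]^{26} \left[ \frac{\lambda^2 e^{-\lambda}}{2} \right]^{12} \left[ \frac{\lambda^3 e^{-\lambda}}{6} \right]^{3} = \frac{1}{2^{12} \, 6^{3}} (1+\lambda)^{26} \lambda^{33} e^{-41 \lambda}$$

The factor $\frac{1}{2^{12} \, 6^{3}}$ in $L(\lambda)$ is a multiplicative constant term that can be ignored. The following gives the log-likelihood function and its derivative.

$$\ln L(\lambda) = 26 \ln(1+\lambda) + 33 \ln \lambda - 41 \lambda$$

$$\frac{d}{d\lambda} \ln L(\lambda) = \frac{26}{1+\lambda} + \frac{33}{\lambda} - 41$$

Setting the derivative equal to zero leads to the quadratic equation $41 \lambda^2 - 18 \lambda - 33 = 0$. Solving this equation (and taking the positive root) produces the following estimate of $\lambda$ and the estimated probability.

$$\hat{\lambda} = \frac{18 + \sqrt{5736}}{82} \approx 1.1431$$

$$P(0 \text{ or } 1 \text{ claim}) = (1 + \hat{\lambda}) e^{-\hat{\lambda}} \approx 0.6833$$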
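As a cross-check on Example 2, the sketch below solves the quadratic $41\lambda^2 - 18\lambda - 33 = 0$ that results from setting the derivative of the grouped-data log-likelihood to zero, and confirms that the root is a local maximum of the log-likelihood. The function and variable names are illustrative assumptions, not part of the original example.

```python
import math

# Grouped claim counts from Example 2.
n_0_or_1, n_2, n_3 = 26, 12, 3
n = n_0_or_1 + n_2 + n_3  # 41 insureds in total


def log_likelihood(lam):
    """Poisson log-likelihood for the grouped data, dropping additive constants."""
    return (
        n_0_or_1 * math.log(1 + lam)          # cell "0 or 1": (1 + lam) * exp(-lam)
        + (2 * n_2 + 3 * n_3) * math.log(lam)  # lam^2 and lam^3 terms
        - n * lam                              # exp(-lam) from every observation
    )


# Positive root of 41*lam^2 - 18*lam - 33 = 0.
a, b, c = 41.0, -18.0, -33.0
lam_hat = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# Estimated probability of observing 0 or 1 claim.
p_0_or_1 = (1 + lam_hat) * math.exp(-lam_hat)

print(f"lambda_hat = {lam_hat:.4f}, P(0 or 1) = {p_0_or_1:.4f}")
```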

Negative Binomial Distribution
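The negative binomial distribution has two parameters. We consider two parametrizations of the negative binomial distribution.

(1) $\displaystyle P(X=k) = \binom{r+k-1}{k} \left( \frac{1}{1+\beta} \right)^{r} \left( \frac{\beta}{1+\beta} \right)^{k}, \quad k = 0, 1, 2, \dots$

(2) $\displaystyle P(X=k) = \binom{r+k-1}{k} \, p^{r} (1-p)^{k}, \quad k = 0, 1, 2, \dots$

Depending on the version, the negative binomial parameters are either $r$ and $\beta$ or $r$ and $p$. To get ready for method of moments estimation, note the population mean and variance in the two versions.

(3) $\displaystyle E(X) = r \beta, \quad Var(X) = r \beta (1+\beta)$

(4) $\displaystyle E(X) = \frac{r(1-p)}{p}, \quad Var(X) = \frac{r(1-p)}{p^2}$

Equating the sample mean $\bar{x}$ with $E(X)$ and the sample variance $s^2$ with $Var(X)$ produces the following method of moments estimates.

(5) $\displaystyle \hat{r} = \frac{\bar{x}^2}{s^2 - \bar{x}}, \quad \hat{\beta} = \frac{s^2 - \bar{x}}{\bar{x}}$

(6) $\displaystyle \hat{r} = \frac{\bar{x}^2}{s^2 - \bar{x}}, \quad \hat{p} = \frac{\bar{x}}{s^2}$

The estimates in (5) are the method of moments estimates for the negative binomial distribution as described in (1). The estimates in (6) are the method of moments estimates for the negative binomial distribution as described in (2). For both cases to work, the sample variance must exceed the sample mean, i.e. $s^2 > \bar{x}$. In both (5) and (6), the sample variance $s^2$ is the biased sample variance, i.e. the one obtained by dividing by the sample size $n$ rather than $n-1$.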
Example 3
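Use the sample data in Example 1. Fit the negative binomial distribution to the observed claim frequency data using method of moments. Determine the probability of observing zero claims according to the fitted distribution.
From Example 1, $\bar{x} = 1.23$. The following gives the sample variance.

$$s^2 = \frac{0^2(40) + 1^2(24) + 2^2(20) + 3^2(8) + 4^2(5) + 5^2(3)}{100} - 1.23^2 = 3.31 - 1.5129 = 1.7971$$

According to (5), the estimates of $r$ and $\beta$ are:

$$\hat{r} = \frac{1.23^2}{1.7971 - 1.23} = \frac{1.5129}{0.5671} \approx 2.6678$$

$$\hat{\beta} = \frac{1.7971 - 1.23}{1.23} = \frac{0.5671}{1.23} \approx 0.4611$$

Then $P(X=0)$, the probability of observing zero claims, is $(1+\hat{\beta})^{-\hat{r}} = 1.4611^{-2.6678} \approx 0.3637$.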
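The method of moments fit in Example 3 can be reproduced with the following sketch. It assumes the parametrization with mean $r\beta$ and variance $r\beta(1+\beta)$ and uses the biased sample variance (dividing by $n$); all names are illustrative.

```python
# Claim frequency data from Example 1: number of claims -> number of insureds.
claims = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}

n = sum(claims.values())
mean = sum(k * c for k, c in claims.items()) / n
second_moment = sum(k * k * c for k, c in claims.items()) / n
variance = second_moment - mean ** 2  # biased sample variance (divide by n)

# The method of moments fit requires the sample variance to exceed the sample mean.
if variance <= mean:
    raise ValueError("sample variance must exceed sample mean")

# Solve r * beta = mean and r * beta * (1 + beta) = variance.
beta_hat = (variance - mean) / mean
r_hat = mean ** 2 / (variance - mean)

# Fitted probability of zero claims: (1 + beta)^(-r).
p_zero = (1 + beta_hat) ** (-r_hat)

print(f"r_hat = {r_hat:.4f}, beta_hat = {beta_hat:.4f}, P(0 claims) = {p_zero:.4f}")
```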
When both parameters are unknown, maximum likelihood estimation for the negative binomial distribution requires using a numerical software package. The following example demonstrates why.
Example 4
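The observed claim counts for three insureds are 0, 1 and 2. Fit a negative binomial distribution to the observed data.
The likelihood function is based on the probability function in (1).

$$L(r, \beta) = \frac{1}{(1+\beta)^{r}} \cdot \frac{r \beta}{(1+\beta)^{r+1}} \cdot \frac{r(r+1)}{2} \frac{\beta^2}{(1+\beta)^{r+2}} = \frac{r^2 (r+1) \beta^3}{2 (1+\beta)^{3r+3}}$$

$$\ln L(r, \beta) = 2 \ln r + \ln(r+1) - \ln 2 + 3 \ln \beta - (3r+3) \ln(1+\beta)$$

Taking the partial derivatives with respect to both parameters and setting them equal to zero gives the following.

$$\frac{\partial \ln L}{\partial \beta} = \frac{3}{\beta} - \frac{3r+3}{1+\beta} = 0$$

$$\frac{\partial \ln L}{\partial r} = \frac{2}{r} + \frac{1}{r+1} - 3 \ln(1+\beta) = 0$$

Solving these two equations produces the following equations.

$$r \beta = 1$$

$$\frac{2}{r} + \frac{1}{r+1} - 3 \ln \left( 1 + \frac{1}{r} \right) = 0$$

Solving for $r$ in the last equation would require numerical techniques.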
In light of Example 4, we do not focus on MLE for the case that both of the negative binomial parameters are unknown. When the $r$ parameter is known, maximum likelihood estimation works like method of moments in that the product of the two parameters $r$ and $\beta$ is the sample mean.
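The following derivation sketches why this holds. It assumes the parametrization with mean $r\beta$, i.e. $P(X=k) = \binom{r+k-1}{k} (1+\beta)^{-r} \left( \frac{\beta}{1+\beta} \right)^{k}$, a known $r$, and a complete sample $x_1, \dots, x_n$ (notation introduced here for illustration).

```latex
\begin{aligned}
\ln L(\beta) &= \text{constant} + \Big(\sum_{i=1}^{n} x_i\Big) \ln \beta
  - \Big(n r + \sum_{i=1}^{n} x_i\Big) \ln (1+\beta) \\
\frac{d}{d\beta} \ln L(\beta) &= \frac{\sum x_i}{\beta} - \frac{n r + \sum x_i}{1+\beta} = 0 \\
(1+\beta) \sum x_i &= \beta \Big(n r + \sum x_i\Big)
\quad \Longrightarrow \quad r \hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\end{aligned}
```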
Example 5
Using maximum likelihood estimation, fit the negative binomial distribution with a known parameter $r$ and an unknown parameter $\beta$ to the claim frequency data in Example 1. Determine the probability of observing zero claims according to the fitted distribution.
With $r$ known, we have $r \hat{\beta} = \bar{x} = 1.23$, i.e. $\hat{\beta} = 1.23 / r$. Then the probability of observing zero claims is $(1+\hat{\beta})^{-r}$.
Binomial Distribution
The binomial distribution has two parameters $m$ and $p$ where $m$ is a positive integer and $p$ is a real number between 0 and 1. This is a model for counting the number of successes in performing a series of $m$ independent Bernoulli trials (a Bernoulli trial is a random experiment in which there are two distinct outcomes called success and failure). Usually the $m$ parameter is denoted by $n$. However, we already use $n$ to mean the sample size. So the parameters of the binomial distribution are $m$ and $p$. The following is the probability function.

(7) $\displaystyle P(X=k) = \binom{m}{k} p^{k} (1-p)^{m-k}, \quad k = 0, 1, \dots, m$

The mean of the binomial distribution is $mp$ and its variance is $mp(1-p)$. When both parameters $m$ and $p$ are unknown, we can use the method of moments estimation. However, the resulting estimate of $m$ is likely not to be an integer. In that case, the compromise is to round $m$ to the nearest integer. This is one pitfall of working with an integer parameter.
For maximum likelihood estimation, let's start with the simpler case that $m$ is known. In this case the parameter $p$ is the only one that needs to be estimated. Suppose that $x_1, x_2, \dots, x_n$ is the sample data where $0 \le x_i \le m$ for each $i$. Then the maximum likelihood estimator of $p$ is given by

(8) $\displaystyle \hat{p} = \frac{\sum_{i=1}^{n} x_i}{m n}$

There is a handy way to interpret the maximum likelihood estimate of $p$. Each data point $x_i$ is an observed number of successes when performing $m$ Bernoulli trials. In the sample of size $n$, $mn$ is the total number of trials. The sum of all the $x_i$ would be the total number of successes out of the $mn$ trials. Thus $\hat{p}$ is the sample proportion of successes.
Based on (8), $m \hat{p} = \bar{x}$. When the parameter $m$ is known, the maximum likelihood estimate $\hat{p}$ is also the method of moments estimate.
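For readers who want the intermediate steps, the estimator of the success probability can be derived as follows. The sketch assumes the binomial probability function $P(X=k) = \binom{m}{k} p^{k} (1-p)^{m-k}$ with known $m$ and a complete sample $x_1, \dots, x_n$.

```latex
\begin{aligned}
\ln L(p) &= \text{constant} + \Big(\sum_{i=1}^{n} x_i\Big) \ln p
  + \Big(m n - \sum_{i=1}^{n} x_i\Big) \ln (1-p) \\
\frac{d}{dp} \ln L(p) &= \frac{\sum x_i}{p} - \frac{m n - \sum x_i}{1-p} = 0 \\
(1-p) \sum x_i &= p \Big(m n - \sum x_i\Big)
\quad \Longrightarrow \quad \hat{p} = \frac{\sum_{i=1}^{n} x_i}{m n}
\end{aligned}
```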
When both $m$ and $p$ are not known, the maximum likelihood estimation of $m$ and $p$ is done by creating a likelihood profile for various possible values of $m$. A possible value of $m$ has to be at least as large as the largest binomial observation. The steps for creating a likelihood profile are as follows:
1. Start with the value of $m$ that is the largest observed value.
2. Using the chosen $m$, calculate $\hat{p}$ according to (8).
3. Evaluate the log-likelihood at $(m, \hat{p})$.
4. Increase $m$ by 1.
5. Repeat Step 2 to Step 4 until a maximum in log-likelihood is found.
For the likelihood profile approach to work, the sample variance must be less than the sample mean. Otherwise, the log-likelihood values keep increasing as $m$ increases and never reach a maximum (see Problem 4-J here).
Example 6
Claim frequency data has been collected from 100 insureds and is given in the following table.
# of Claims | # of Insureds
0 | 30
1 | 40
2 | 25
3 | 5
4+ | 0
Fit the binomial distribution to the given claim frequency data using the method of moments.
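The following gives the sample mean and sample variance.

$$\bar{x} = \frac{0(30) + 1(40) + 2(25) + 3(5)}{100} = 1.05$$

$$s^2 = \frac{0^2(30) + 1^2(40) + 2^2(25) + 3^2(5)}{100} - 1.05^2 = 1.85 - 1.1025 = 0.7475$$

Note that the sample variance is less than the sample mean. It is then possible to fit a binomial distribution to the observed data. This fact is crucial for performing maximum likelihood estimation (the next two examples). The following steps give the method of moments estimates.

$$mp = 1.05, \quad mp(1-p) = 0.7475$$

$$1 - p = \frac{0.7475}{1.05} \approx 0.7119, \quad p \approx 0.2881$$

$$m = \frac{1.05}{p} \approx 3.6446$$

Since the calculated $m$ is not an integer, round $m$ to 4. As a result, the method of moments estimates are $\hat{m} = 4$ and $\hat{p} = 1.05 / 4 = 0.2625$.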
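The method of moments steps in Example 6 can be sketched in Python as follows. The sketch assumes the biased sample variance and re-solves for the success probability after the number-of-trials estimate is rounded, so that the fitted mean still matches the sample mean; all names are illustrative.

```python
# Claim frequency data from Example 6: number of claims -> number of insureds.
claims = {0: 30, 1: 40, 2: 25, 3: 5}

n = sum(claims.values())
mean = sum(k * c for k, c in claims.items()) / n
variance = sum(k * k * c for k, c in claims.items()) / n - mean ** 2  # biased

# A binomial fit needs the sample variance to be below the sample mean.
if variance >= mean:
    raise ValueError("sample variance must be less than sample mean")

# Solve m * p = mean and m * p * (1 - p) = variance.
p_raw = 1 - variance / mean
m_raw = mean / p_raw

# The number of trials must be an integer: round, then match the sample mean again.
m_hat = round(m_raw)
p_hat = mean / m_hat

print(f"m (unrounded) = {m_raw:.4f}, m_hat = {m_hat}, p_hat = {p_hat:.4f}")
```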
Example 7
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood estimation. Assume that $m$ is known with $m$ ranging from 3 to 8.
The maximum likelihood estimate of $p$ can be obtained by formula (8), which here reduces to $\hat{p} = 1.05 / m$. The estimates are:
$m$ | $\hat{p}$
3 | 0.35
4 | 0.2625
5 | 0.21
6 | 0.175
7 | 0.15
8 | 0.13125
Example 8
Use the same data in Example 6. Fit the binomial distribution to the observed claim frequency data using maximum likelihood. Assume that both parameters $m$ and $p$ are unknown. The maximum likelihood estimation is performed by creating a likelihood profile as described above.
The largest observation in the sample is 3 (there are 5 such observations). In creating the likelihood profile, the starting value of $m$ is 3. Use this $m$ value to set up the likelihood function $L(p)$ and the corresponding log-likelihood function $\ln L(p)$. Then evaluate $\ln L(p)$ at $\hat{p} = 0.35$ (0.35 is found in Example 7).
Next, perform the same process using $m = 4$. The process is continued until a maximum in log-likelihood is found. The following table shows the results.
$m$ | $\hat{p}$ | log-likelihood
3 | 0.35 | -122.8242
4 | 0.2625 | -123.0850
5 | 0.21 | -123.5233
6 | 0.175 | -123.8856
7 | 0.15 | -124.1702
8 | 0.13125 | -124.3953
The log-likelihood is the greatest at the starting value of $m = 3$. The log-likelihood decreases as $m$ increases. Thus the maximum likelihood estimates are $\hat{m} = 3$ and $\hat{p} = 0.35$.
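The likelihood profile in this example can be generated programmatically. The sketch below evaluates the binomial log-likelihood, including the binomial coefficients, at the estimate of the success probability from formula (8) for each candidate number of trials from 3 to 8. The upper limit of 8 and all names are choices made here for illustration.

```python
import math

# Claim frequency data from Example 6: number of claims -> number of insureds.
claims = {0: 30, 1: 40, 2: 25, 3: 5}

n = sum(claims.values())
total_successes = sum(k * c for k, c in claims.items())


def p_hat(m):
    """MLE of p for a known number of trials m: successes over total trials."""
    return total_successes / (m * n)


def log_likelihood(m, p):
    """Binomial log-likelihood of the grouped data, including binomial coefficients."""
    return sum(
        c * (math.log(math.comb(m, k)) + k * math.log(p) + (m - k) * math.log(1 - p))
        for k, c in claims.items()
    )


# Candidate m starts at the largest observation; the upper limit 8 is arbitrary.
profile = {m: log_likelihood(m, p_hat(m)) for m in range(max(claims), 9)}
best_m = max(profile, key=profile.get)

for m, ll in profile.items():
    print(f"m = {m}, p_hat = {p_hat(m):.5f}, log-likelihood = {ll:.4f}")
print(f"profile maximum at m = {best_m}, p_hat = {p_hat(best_m):.4f}")
```

In practice the loop would stop as soon as the log-likelihood starts to decrease; a fixed range is used here only to keep the sketch short.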
Other Considerations
Poisson, binomial and negative binomial are three commonly used discrete distributions. One important distinction among these three distributions is that the mean and variance are identical for the Poisson distribution, the mean is greater than the variance for the binomial distribution and the mean is less than the variance for the negative binomial distribution. Thus we have the following observation.
In examining sample data for discrete distributions, we should compare the sample mean and sample variance. If the sample mean and the sample variance are roughly the same, then the Poisson distribution might be a good fit. If the sample mean is greater than the sample variance, the binomial distribution might be a good fit. If the sample mean is less than the sample variance, the negative binomial distribution might be a good fit.
The universe of discrete distributions is larger than the three commonly used discrete distributions. However, the guideline described in the above paragraph is a good starting point in the modeling process.
For the sample claim frequency data in Example 1, the sample mean is 1.23 and the sample variance is 1.7971 (the sample second moment is 3.31). Among the three distributions of Poisson, binomial and negative binomial, the negative binomial distribution best represents the data since the sample variance exceeds the sample mean. For the sample claim frequency data in Example 6, the binomial distribution best represents the data since the sample variance (0.7475) is significantly less than the sample mean (1.05).
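The comparison of sample mean and sample variance is easy to automate. The helper below is a rough screening sketch, not a formal test: the function names are hypothetical, the variance is the biased one, and an exact equality check stands in for "roughly the same", which in practice calls for judgment.

```python
def mean_and_variance(counts):
    """Return the sample mean and biased sample variance of grouped count data."""
    n = sum(counts.values())
    mean = sum(k * c for k, c in counts.items()) / n
    variance = sum(k * k * c for k, c in counts.items()) / n - mean ** 2
    return mean, variance


def suggested_family(counts):
    """Suggest a starting distribution by comparing sample mean and variance."""
    mean, variance = mean_and_variance(counts)
    if variance > mean:
        return "negative binomial"
    if variance < mean:
        return "binomial"
    return "Poisson"


example_1 = {0: 40, 1: 24, 2: 20, 3: 8, 4: 5, 5: 3}
example_6 = {0: 30, 1: 40, 2: 25, 3: 5}

for name, data in [("Example 1", example_1), ("Example 6", example_6)]:
    mean, variance = mean_and_variance(data)
    print(f"{name}: mean = {mean:.4f}, variance = {variance:.4f}, "
          f"suggestion = {suggested_family(data)}")
```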
The above observation about comparing the sample mean and sample variance is a useful one. When fitting a Poisson, binomial or negative binomial distribution, there is another technique that is more refined. The key is to consider these distributions as members of the (a,b,0) class of distributions (the (a,b,0) class is introduced here). The distributions in the (a,b,0) class are characterized by the following recursive relation.

(9) $\displaystyle \frac{p_k}{p_{k-1}} = a + \frac{b}{k}, \quad k = 1, 2, 3, \dots$

The notation $p_k$ refers to the probability that the distribution takes on the value of $k$. For any member of the (a,b,0) class, the probabilities can be generated according to (9) for some constants $a$ and $b$. The three commonly used discrete distributions (Poisson, binomial and negative binomial) are (a,b,0) distributions. This means that any one of these distributions can be generated recursively using (9). See Table 1 in this post for the $a$ and $b$ associated with each of the three distributions. The relation (9) can be rearranged as follows:

(10) $\displaystyle k \, \frac{p_k}{p_{k-1}} = a k + b$

The relation (10) says that the ratio $k \, p_k / p_{k-1}$ is a linear function of $k$ with the slope being $a$ and the y-intercept being $b$. If the (a,b,0) distribution is a Poisson distribution, then $a = 0$. If the (a,b,0) distribution is a negative binomial distribution, then $a > 0$. If the (a,b,0) distribution is a binomial distribution, then $a < 0$. Thus the slope in (10) is an indicator of the (a,b,0) distribution.
Using observed data, $p_k$ is estimated by the ratio $n_k / n$ where $n$ is the sample size and $n_k$ is the number of observations that equal $k$. Then relation (10) is approximated by the following.

(11) $\displaystyle k \, \frac{n_k}{n_{k-1}} \approx a k + b$

If the sample data is drawn from an (a,b,0) distribution, the quantity on the left-hand side of (11) should have a linear pattern when plotted against $k$. If the plot is roughly horizontal, it is an indication that the (a,b,0) distribution is a Poisson distribution. If the plot has a positive slope, it is an indication that the (a,b,0) distribution is a negative binomial distribution. If the plot has a negative slope, it is an indication that the (a,b,0) distribution is a binomial distribution. This is further discussed in the following example.
Example 9
Consider the sample claim frequency data in Example 1. The quantities $k \, n_k / n_{k-1}$ are shown in the following table.
$k$ | $n_k$ | $k \, n_k / n_{k-1}$
0 | 40 |
1 | 24 | 0.6
2 | 20 | 1.67
3 | 8 | 1.2
4 | 5 | 2.5
5 | 3 | 3
6+ | 0 |
The following is a plot of the ratio $k \, n_k / n_{k-1}$ against $k$.
[Figure: plot of $k \, n_k / n_{k-1}$ against $k$ for $k = 1, \dots, 5$]

The plot shows roughly a linear pattern. The slope is clearly positive. This suggests that the negative binomial distribution is a good fit.
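The ratios in the table can be computed directly, and the sign of the slope can be checked with an ordinary least squares line. Fitting a line is an extra step added here for illustration (the example itself only inspects the plot), and the names below are assumptions.

```python
# Observed counts n_0, n_1, ..., n_5 from Example 1.
counts = [40, 24, 20, 8, 5, 3]

# Ratio k * n_k / n_{k-1}; skip any k where either count is zero.
ratios = {
    k: k * counts[k] / counts[k - 1]
    for k in range(1, len(counts))
    if counts[k] > 0 and counts[k - 1] > 0
}

# Ordinary least squares line through the points (k, ratio).
ks = list(ratios)
ys = [ratios[k] for k in ks]
k_bar = sum(ks) / len(ks)
y_bar = sum(ys) / len(ys)
slope = sum((k - k_bar) * (y - y_bar) for k, y in zip(ks, ys)) / sum(
    (k - k_bar) ** 2 for k in ks
)
intercept = y_bar - slope * k_bar

for k, ratio in ratios.items():
    print(f"k = {k}, k * n_k / n_(k-1) = {ratio:.4f}")
print(f"fitted slope = {slope:.4f}, intercept = {intercept:.4f}")
```

A positive fitted slope points toward the negative binomial distribution, in line with the visual reading of the plot. With only five points, the fitted values are rough indications rather than parameter estimates.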
When fitting an (a,b,0) distribution, it is a good idea to construct a plot according to relation (11). There are a couple of caveats. Any category with $n_k = 0$ cannot be used in the plot. The plot is less reliable if there is an insufficient amount of data.
2019 – Dan Ma