Estimating Population Size of Criminals: A New Horvitz–Thompson Estimator under One-Inflated Positive Poisson

Abstract

Many crime datasets display an excess of “1” counts, which arises when arrested criminals have the desire and ability to avoid subsequent arrests. In this study, a new Horvitz–Thompson (HT) estimator under the one-inflated positive Poisson–Lindley (OIPPL) distribution, which allows for one-inflation and heterogeneity in the data, is developed to estimate the hidden population size of criminals. In the simulation study and the applications to real crime datasets, the OIPPL distribution provides an adequate fit to the datasets considered, and the proposed HT estimator produces a more precise estimate of the population size, with a narrower 95% confidence interval, than the other contending estimators considered in this study.

Keywords

capture-recapture; inflated models; large number of ones; number of criminals; zero-truncated Poisson–Lindley

Introduction

The history of studies on estimating population size can be traced back as early as the 1600s to John Graunt (Hald, 2003) and the 1800s to Pierre Laplace (Cochran, 1978). As described by White et al. (1982), one of the earliest estimators for the population size dates to work in the late 19th century and is known as the Petersen–Lincoln estimator, which equates the proportion of marked individuals in the recaptured sample with the proportion of the captured sample in the population. The estimator is named after Petersen and Lincoln, who both made great contributions to ecology. Chapman (1951) modified the biased Petersen–Lincoln estimator to obtain an unbiased estimator. In the 20th century, the estimation of population size was pioneered by M’Kendrick (1925), who studied the movement of people or cells that causes the transmission of infectious diseases.

The determination of population size under the capture-recapture framework involves considering both observed and unobserved members of the population. However, the number of unobserved members is unknown, and thus it needs to be estimated before a reliable population size can be determined. Estimators of the population size based on the capture-recapture framework have been widely studied in biological science. Good (1953) pointed out that if a particular species is observed or represented $r$ times in a sample of size $N$ , then the direct empirical estimate $r / N$ is not a good estimate of the population frequency when $r$ is small. The author proceeded to propose an estimator, known as the Good–Turing estimator, based on smoothed values of observed counts that are drawn randomly from an infinite population. Recently, Hwang et al. (2015) proposed an adjusted Good–Turing estimator by assuming that the sample data are drawn from a finite population without replacement.

Besides the Good–Turing and adjusted Good–Turing estimators, other estimators for the population size can be found in many past studies. In the late 1980s, several estimators for different models were proposed (Chao, 1987, 1989; Zelterman, 1988). Chao (1987) proposed a lower bound for the estimate of population size, known as the Chao estimator, for a model that incorporates heterogeneity of capture probability by making use of the “1” and “2” counts only. Under moment restrictions, another estimator for population size with a correction factor can be obtained, known as the corrected Chao estimator. This corrected Chao estimator requires the values of the first and second sample moments. Chao (1989) proceeded to propose a bias-corrected Chao estimator under a time-variation model for sparse capture-recapture data. In the presence of one-inflation, the Chao estimator can severely overestimate the population size (Böhning et al., 2019), and thus the authors proposed a modified Chao estimator that avoids overestimation by incorporating the “2” and “3” counts instead. To ensure that the estimator of population size is robust, Zelterman (1988) proposed a new estimator called the Zelterman estimator. The Zelterman estimator has a similar representation to the Horvitz–Thompson estimator except that the probability of “0” counts is obtained using the Zelterman estimator rather than from the untruncated distribution. Besides these estimators, Chao and Lee (1992) proposed several other estimators for the population size, known as Chao–Lee estimators, which make use of sample coverage and the coefficient of variation. Inspired by the Chao–Lee estimators, Cecconi et al. (2012) introduced a new and flexible estimator, which involves simultaneous estimation of the population size, the coefficient of variation and the parameter of a Dirichlet prior.

Several studies have proposed the estimator of species population size using a parametric approach, involving a statistical distribution or a regression model. One of the earlier works on estimating the population size using parametric approach can be traced back to the work by Edwards and Eberhardt (1967), Horvitz and Thompson (1952) as well as Nixon et al. (1967). The Horvitz–Thompson estimator, following the name of the authors, makes use of information on truncated and untruncated distributions of members in a population for estimating the population size. Edwards and Eberhardt (1967) have estimated the population size of cottontails by employing two geometric models based on the maximum likelihood estimator and the simple linear regression. The same techniques were employed in the study of Nixon et al. (1967) for the estimation of the population size of squirrels. In both studies, the geometric models give a closer approximation to the actual population size than other estimators considered. Since the Chao (1987) estimator disregards any information from counts larger than “2,” Niwitpong et al. (2013) incorporated censoring technique at “1” counts and zero-truncated geometric distribution to overcome problems in the Chao estimator. The resulting estimator, known as the censored estimator, depends only on the “1” counts. By using the ratio of successive probability, Anan et al. (2016) proposed an estimator for the population size based on Conway–Maxwell–Poisson distribution, which does not depend on the complex normalizing constant term in the Conway–Maxwell–Poisson distribution. The same group of authors, Anan et al. (2019) recently introduced a Good–Turing-type estimator under geometric distribution and found that the population size estimated using the new estimator will always be greater than those obtained from the Good–Turing estimator.

Several positive or zero-truncated models that are inflated at “1” counts have recently attracted the attention of researchers (Godwin, 2017, 2019; Godwin & Böhning, 2017; Kaskasamkul & Böhning, 2018) because an extra inflation parameter captures the effects on the “1” counts. Godwin and Böhning (2017) pointed out that an excess of “1” counts appears naturally when the captured and marked subjects have the inclination and the ability to avoid being recaptured. It is important to note that the information on how to avoid being recaptured is already present and is not learned after the first capture (Godwin & Böhning, 2017). The authors then proposed a Horvitz–Thompson estimator for the population size based on a one-inflated positive Poisson model, which describes positive count data with an excess of ones. Following the work of Godwin and Böhning (2017), Godwin (2017) proposed another model that can incorporate both one-inflation and unobserved heterogeneity, known as the one-inflated zero-truncated negative binomial. Recently, Godwin (2019) estimated the population size by making use of the one-inflated positive Poisson mixture model, which also takes into account both unobserved heterogeneity and excess ones. Kaskasamkul and Böhning (2018) studied the zero-truncated one-inflated geometric model for estimating the population size of criminals. The model assumes that the inflation parameter is imposed on all counts, including the “0” counts, with truncation at zero happening later (Kaskasamkul & Böhning, 2018). The authors also considered obtaining estimates by truncating singletons completely (Kaskasamkul & Böhning, 2018). When the truncated count data are inflated or deflated at any point, one can refer to the work of Böhning and Ogden (2020) for statistical inference as well as for estimating the size of the hidden population.

The capture-recapture technique used in biological science has also been applied to research in quantitative criminology. In the context of criminology, the capture-recapture framework can be described in terms of the number of times a person is caught by the authorities: the first time (capture) and any subsequent times (recapture). The idea of estimating the population size of a species can thus be applied to estimating the population size of criminals for a specific crime. Rossmo and Routledge (1990) stressed the importance of knowing the population size of criminals because it may affect the formulation and development of criminal justice policy. The authors proposed a model that incorporates heterogeneity in the form of inverse Gaussian and target response.

Another way of incorporating heterogeneity is by employing a truncated regression model. A group of researchers has employed the truncated Poisson regression model to estimate the population size of criminals who committed illegal firearm possession and drunk driving (Van der Heijden et al., 2003a), of illegal immigrants (Van der Heijden et al., 2003b) and of perpetrators of domestic violence (Van der Heijden et al., 2014). Mixtures of truncated Poisson distributions can also be used to estimate population size, as shown in the work of Böhning et al. (2004), in which the number of drug users is estimated. Besides truncated Poisson regression and truncated Poisson mixtures, the truncated negative binomial regression model, which can explain additional unobserved heterogeneity, has been applied to determine the population size of opiate users (Cruyff & Van der Heijden, 2008) and of perpetrators of domestic violence (Van der Heijden et al., 2014). The estimated population size based on the truncated Poisson regression model can be used as a lower bound for the true population size (Van der Heijden et al., 2014). Bouchard et al. (2019) used the number of arrests in covariate-adjusted regression models, and the resulting estimates refer to those who are at risk of arrest. The authors considered only the case in which the sample data consist of the number of arrests, with rearrests occurring in the 5 days following the initial or previous arrest (Bouchard et al., 2019). The size and the trend of crimes can also be estimated by evaluating the effect of individual covariates on the probabilities of survival and capture for an open population using the Cormack–Jolly–Seber model (Cai & Xia, 2018).

We note that the Poisson assumption holds only if the Poisson parameter, which governs the number of arrests experienced by an individual criminal, remains constant over time and is not influenced by prior apprehension by the authorities. However, this assumption is too restrictive and does not allow the extra flexibility needed to take into account the presence of heterogeneity and contamination (Godwin & Böhning, 2017). Our proposed model deals with one form of heterogeneity, namely one-inflation. We note that one-inflation may arise from two scenarios. First, the first-time offender has the ability and desire to avoid being re-apprehended. In some cases, the offender may already have either the ability or the desire to avoid being re-apprehended (Godwin & Böhning, 2017). The knowledge that gives the offender this ability or desire may be learned prior to the first apprehension. For example, prevention programs may target pre-offenders before their first offense in child sexual abuse (Levine & Dandamudi, 2016). Even if such individuals go on to offend, which is not the intention of the prevention program, there is a chance that they will stop after the first apprehension based on the knowledge they gained from the program. For prostitution, Rossmo and Routledge (1990) noted that prostitutes gain knowledge from other prostitutes on how to avoid being arrested. Based on Rossmo and Routledge (1990), if they are arrested for the first time, there is a high chance that they will be arrested only once, contributing to a large number of single arrests.

Secondly, the apprehensions of previously arrested offenders may be misclassified as a first offense or placed in a different category. Although rare, this misclassification may happen because of clerical or data collection errors. The police may misclassify criminal offenses intentionally or accidentally, which causes overcounting or undercounting of certain offenses and in turn leads to inflation or deflation (Nolan et al., 2011). For example, the Uniform Crime Report defines robbery as a combination of assault and larceny, so a robber may be wrongly charged with either assault or larceny, which shows how a misclassification can happen (Nolan et al., 2011).

The type of population, whether open or closed, is another issue that should be considered when estimating the population size of criminals. A closed population requires the number of criminals to be constant, but this is not the case most of the time (Van der Heijden et al., 2003a). The size of the whole population may be constant, but the population of offenders is open in the sense that offenders may enter and leave it. New offenders enter the population by becoming first-time offenders, whereas current offenders are considered to leave the population if they are incarcerated. It is worth noting that new offenders who exit the population after their first arrest can also contribute to one-inflation. Therefore, it is acceptable to assume that the population is open.

Information on the behavior of criminals is not easily obtainable, especially when dealing with secondary data, because not all information is disclosed. For a given period of study, an open population allows new offenders, or offenders previously arrested outside the period of study, to enter. The question that arises is whether the apprehension of a previously arrested offender is categorized as a first offense. If this is the case, then we can expect misclassifications in the criminal counts. Misclassification will not happen for new offenders who commit one or multiple offenses within the period of study. If the apprehension of previously arrested offenders is categorized based on the total number of offenses, including those from outside the period of study, then the conclusions drawn from the study cannot be attributed entirely to that particular period, which further complicates the statistical analysis.

When modeling count data, one important property that needs to be taken into account is overdispersion. Wagh and Kamalja (2018) mentioned that overdispersion in the data can be caused by either the existence of heterogeneity in the population or excess of zeroes. For positive count data, the “0” counts are unobserved and the majority of the data comes from “1” counts. Therefore, it is believed that an excess of “1” counts can contribute to the dispersion (over or under) in the data. Hence this paper is motivated by the need for a model that can handle dispersion which occurs due to heterogeneity in the data as well as the excess of “1” counts. The population of interest is assumed to be open with the existence of heterogeneity and large number of ones. It is believed and will be demonstrated based on the simulation study as well as real data applications that the model is pertinent for describing the population with those characteristics mentioned.

In this study, the positive Poisson–Lindley model inflated at “1” will be developed. Note that the positive Poisson–Lindley distribution, which is also known as zero-truncated Poisson–Lindley distribution, is able to adequately fit either overdispersed or underdispersed data (Ghitany et al., 2008). In other words, the positive Poisson–Lindley distribution can be thought of as a truncated Poisson count, which is adjusted for Lindley-wise distributed heterogeneity. When the inflation factor is imposed on positive Poisson–Lindley, an inflated positive Poisson–Lindley is produced that can accommodate both excess in “1” counts and heterogeneity in the data, and thus is believed to provide a better estimate for the population size.

Firstly, the positive Poisson–Lindley model inflated at “1” is developed and the maximum likelihood estimators for the parameters of the model are described. Next, the variance of the proposed estimator is derived, along with notes on some alternative estimators available in the literature. A simulation study is conducted to compare the performance of the proposed population size estimator with several other estimators. In addition, the performance of the proposed estimator in estimating the population size of criminals for two crime datasets is investigated. Finally, some remarks on the proposed model as well as the proposed estimator for the population size are discussed.

Excess “1” Counts in Positive Poisson-Lindley Distribution

There are two ways to introduce extra “1” counts into any positive count data distribution. In the first, the inflation parameter $ω$ is introduced to the nonzero counts of the untruncated distribution, and the distribution is then truncated at 0. In the second, the inflation parameter $p$ is introduced for all counts of the untruncated distribution, and the distribution is then truncated at 0 (for further explanation, see Godwin & Böhning (2017)). In this study, we consider Poisson–Lindley as the untruncated distribution and introduce the extra “1” counts via the first way. The resulting distribution is denoted as the one-inflated positive Poisson–Lindley (OIPPL) distribution.

One-Inflated Positive Poisson–Lindley (OIPPL) Distribution

Let $f (y | θ)$ be the probability mass function (pmf) of Poisson–Lindley distribution (Sankaran, 1970), then the pmf of a one-inflated Poisson–Lindley with inflation at nonzero counts is given as

f_{1} (y | ω, θ) = \begin{cases} f (0 | θ), & y = 0 \\ ω [1 - f (0 | θ)] + (1 - ω) f (1 | θ), & y = 1 \\ (1 - ω) f (y | θ), & y \geq 2, \end{cases}

(1)

where $0 < ω < 1, θ > 0$ . In this manner, the excess probability affects only the nonzero counts. The unobserved population of “0” counts is unaffected. To ensure that (1) is a valid pmf, one can easily show that $\sum_{y = 1}^{\infty} f_{1} (y | ω, θ) = 1 - f (0 | θ)$ , which when added with $f (0 | θ)$ when $y = 0$ will equate to one. The distribution in (1) is then truncated at 0, giving the pmf for OIPPL distribution as

f_{2} (y | ω, θ) = \begin{cases} ω + (1 - ω) h (1 | θ), & y = 1 \\ (1 - ω) h (y | θ), & y \geq 2, \end{cases}

(2)

where $h (y | θ)$ is hereon known as positive Poisson–Lindley, which also refers to zero-truncated Poisson-Lindley (ZTPL) distribution, proposed by Ghitany et al. (2008). The full pmf for OIPPL distribution is given as

f_{2} (y | ω, θ) = \begin{cases} ω + (1 - ω) \frac{θ^{2} (θ + 3)}{(θ + 1) (θ^{2} + 3 θ + 1)}, & y = 1 \\ (1 - ω) \frac{θ^{2} (θ + y + 2)}{{(θ + 1)}^{y} (θ^{2} + 3 θ + 1)}, & y \geq 2 . \end{cases}

(3)
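As a minimal sketch, the pmf in equation (3) can be coded directly. The function names `ztpl_pmf` and `oippl_pmf` and the parameter values used in the check are illustrative assumptions, not part of the original study.

```python
def ztpl_pmf(y: int, theta: float) -> float:
    """Positive (zero-truncated) Poisson-Lindley pmf h(y | theta), y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return theta**2 * (theta + y + 2) / (
        (theta + 1) ** y * (theta**2 + 3 * theta + 1)
    )


def oippl_pmf(y: int, omega: float, theta: float) -> float:
    """OIPPL pmf f_2(y | omega, theta) as written in equation (3)."""
    base = (1 - omega) * ztpl_pmf(y, theta)
    # The extra mass omega is added at y = 1 only.
    return omega + base if y == 1 else base


# Sanity check with arbitrary illustrative parameters: the probabilities
# over y = 1, 2, ... should sum to one (the tail beyond 500 is negligible).
total = sum(oippl_pmf(y, 0.4, 1.5) for y in range(1, 500))
```

The check on `total` mirrors the validity argument given for equation (1), applied after truncation at 0.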

In order for OIPPL to cater for deflation as well, $ω$ can be set to $ξ / (1 + ξ)$ , where $ξ \in ℝ$ . However, in this study, we focus on estimating the population size with a large number of ones. When an offense by a repeat offender is misclassified as a first offense, or the offender is wrongly charged with a different felony, the resulting data will be inflated at one. The OIPPL distribution can be developed similarly if the truncation at “0” counts is done first and the inflation factor is then introduced at “1” counts. The maximum likelihood estimators of the parameters of OIPPL can be obtained by maximizing the log-likelihood function $l$ below, which can be written in terms of the frequency of the $y$ -count, $n_{y}$ , as

l = \ln L (ω, θ) = \sum_{y = 1}^{\infty} n_{y} \ln f_{2} (y | ω, θ),

where $\sum_{y = 1}^{\infty} n_{y} = n$ . Böhning and Ogden (2020) provided inference on models with general flation, that is, inflation or deflation of counts with a zero-truncated distribution acting as the baseline distribution. By differentiating $l$ with respect to $ω$ and $θ$ , and setting the derivatives to 0, the maximum likelihood estimators for $ω$ and $θ$ are given respectively as

\begin{array}{l} \hat{ω} = 1 - \frac{(n - n_{1}) (\hat{θ} + 1) ({\hat{θ}}^{2} + 3 \hat{θ} + 1)}{n ({\hat{θ}}^{2} + 4 \hat{θ} + 1)}, \\ (n - n_{1}) [\frac{\hat{θ} (\hat{θ} + 2) ({\hat{θ}}^{2} + 6 \hat{θ} + 3)}{{({\hat{θ}}^{2} + 4 \hat{θ} + 1)}^{2}} + \frac{3 \hat{θ} + 2}{\hat{θ} ({\hat{θ}}^{2} + 3 \hat{θ} + 1)}] \\ + n m_{1} - n_{1} + \sum_{y = 1}^{\infty} \frac{n_{y}}{\hat{θ} + y + 2} = 0, \end{array}

(4)

where $\hat{θ}$ is the ML estimator for $θ$ , $\hat{ω}$ is the ML estimator for $ω$ and $m_{1}$ is the sample mean. The latter equation in equation (4) can be solved numerically. Note that the flation parameter used by Böhning and Ogden (2020) is the complement of the $ω$ used here. However, one can reach a similar estimator for $ω$ by taking one minus the flation parameter estimator of Böhning and Ogden (2020). A simple formula for the flation (inflation or deflation) parameter is given in Proposition 1 of Böhning and Ogden (2020).
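The numerical step can be sketched as follows. Instead of solving the second equation in (4), this sketch plugs the closed form for $\hat{ω}$ into the log-likelihood and maximizes the resulting profile log-likelihood over $θ$ with a golden-section search, which is a swapped-in technique and assumes a single interior maximum on the search interval. The function names, the search bounds and the frequency table are all hypothetical and for illustration only.

```python
import math


def ztpl_pmf(y, theta):
    """Zero-truncated Poisson-Lindley pmf h(y | theta)."""
    return theta**2 * (theta + y + 2) / (
        (theta + 1) ** y * (theta**2 + 3 * theta + 1)
    )


def omega_hat(theta, n, n1):
    """Closed-form estimator for omega from the first line of equation (4)."""
    return 1 - (n - n1) * (theta + 1) * (theta**2 + 3 * theta + 1) / (
        n * (theta**2 + 4 * theta + 1)
    )


def profile_loglik(theta, freq):
    """Log-likelihood l with omega replaced by omega_hat(theta)."""
    n = sum(freq.values())
    omega = omega_hat(theta, n, freq.get(1, 0))
    ll = 0.0
    for y, n_y in freq.items():
        p = (1 - omega) * ztpl_pmf(y, theta) + (omega if y == 1 else 0.0)
        ll += n_y * math.log(p)
    return ll


def fit_oippl(freq, lo=1e-3, hi=50.0, tol=1e-10):
    """Golden-section search for theta on [lo, hi] (assumed bounds)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if profile_loglik(c, freq) > profile_loglik(d, freq):
            b, d = d, c  # maximum lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d  # maximum lies in [c, b]
            d = a + inv_phi * (b - a)
    theta = (a + b) / 2
    n = sum(freq.values())
    return omega_hat(theta, n, freq.get(1, 0)), theta


# Made-up frequencies {count: number of individuals}, for illustration only.
freq = {1: 700, 2: 150, 3: 80, 4: 40, 5: 20, 6: 10}
omega_est, theta_est = fit_oippl(freq)
```

With $\hat{ω}$ in closed form, the fitted probability of a “1” count equals the observed proportion $n_{1} / n$, so the search over $θ$ only has to fit the counts of two or more.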

It can be noted that by using the reparameterization in equation (5), a new distribution named positive one-inflated Poisson-Lindley distribution can be obtained.

p = 1 - \frac{(1 - ω) {(θ + 1)}^{3}}{{(θ + 1)}^{3} - ω θ (θ + 2)} .

(5)

However, describing the new distribution using the reparameterization in equation (5) is beyond the scope of this study. One can refer to Godwin and Böhning (2017) for the development of positive one-inflated Poisson distribution using reparameterization for one-inflated positive Poisson distribution based on a similar idea.

The OIPPL distribution can cater for overdispersed or underdispersed data; however, obtaining the exact dispersion index is somewhat tedious. Therefore, the sample dispersion index is calculated as the ratio of the sample variance to the sample mean of random data generated from the OIPPL distribution with different parameter values, and the results are given in Table 1 (see steps 1-3 in the algorithms for the first simulation study). From Table 1, it is clear that the OIPPL distribution can cater for both overdispersed and underdispersed data. As $θ$ or $ω$ or both increase, the sample dispersion index values tend to decrease.

Table 1.

Sample Dispersion Values Based on 1,000 Random Data Generated from OIPPL Distribution ω = 0.2, 0.4, 0.6, and 0.8 and θ = 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0.

ω \ θ	0.5	1.0	1.5	2.0	2.5	3.0
0.2	3.18	1.56	0.88	0.59	0.51	0.32
0.4	3.05	1.06	0.72	0.47	0.38	0.37
0.6	2.64	1.03	0.50	0.33	0.30	0.19
0.8	2.29	0.58	0.48	0.17	0.13	0.08
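A sample dispersion index of this kind can be reproduced in spirit with the sketch below. It draws OIPPL variates by returning an extra “1” with probability $ω$ and otherwise sampling the zero-truncated Poisson–Lindley part by inverse CDF; this sampling scheme, the function names and the seed are assumptions and may differ from the algorithm used in the study, so the values will not match Table 1 exactly.

```python
import random


def ztpl_pmf(y, theta):
    """Zero-truncated Poisson-Lindley pmf h(y | theta)."""
    return theta**2 * (theta + y + 2) / (
        (theta + 1) ** y * (theta**2 + 3 * theta + 1)
    )


def sample_oippl(omega, theta, rng, y_max=200):
    """Draw one OIPPL variate; y_max is an assumed safety cap on the search."""
    if rng.random() < omega:
        return 1  # extra singleton
    u, y, cdf = rng.random(), 1, ztpl_pmf(1, theta)
    while u > cdf and y < y_max:
        y += 1
        cdf += ztpl_pmf(y, theta)
    return y


def dispersion_index(sample):
    """Ratio of the sample variance to the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return var / mean


rng = random.Random(2024)  # arbitrary seed for reproducibility
over = [sample_oippl(0.2, 0.5, rng) for _ in range(1000)]
under = [sample_oippl(0.8, 3.0, rng) for _ in range(1000)]
index_over = dispersion_index(over)    # small theta and omega
index_under = dispersion_index(under)  # large theta and omega
```

The two parameter pairs correspond to the top-left and bottom-right cells of Table 1, where the index is expected to lie above and below one, respectively.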

Estimation of Population Size using OIPPL Distribution

The common estimator of population size, $\hat{N}$ is in the form of the Horvitz–Thompson formula, which can be written as

\hat{N} = n / [1 - f (0 | \hat{λ})],

(6)

where $n$ is the truncated sample size with nonzero data, $\hat{λ}$ is the estimator of $λ$ under truncated distribution and $f (0 | \hat{λ})$ is the probability of “0” counts under untruncated distribution. Therefore, the estimator of population size under OIPPL distribution is given by

{\hat{N}}_{H T - O I P P L} = \frac{n}{1 - f (0 | \hat{θ})} = \frac{n {(\hat{θ} + 1)}^{3}}{{\hat{θ}}^{2} + 3 \hat{θ} + 1},

(7)

and hence can be denoted as the Horvitz–Thompson estimator under the OIPPL distribution (HT-OIPPL). In the context of criminology, for instance, the probability of “0” counts estimated from the untruncated distribution may be interpreted as the proportion of criminals who are yet to be caught by the authorities. The resulting estimated population size represents the total number of criminals of a specific crime in the area of interest. In the presence of one-inflation, Böhning and Van der Heijden (2019) stressed the importance of up-weighting only the nonextra-singletons and addressed this by removing the singletons completely when estimating the hidden population size. However, in the case of drunk driving, for example, offenders may be incarcerated for the offense, which makes them exit the population. The situation is even worse if accidental deaths are involved and the offenders are charged with murder rather than driving under the influence, which can lead to one-inflation. Therefore, with these fair assumptions, it is reasonable to keep the singletons in model fitting, and the estimated parameter $\hat{θ}$ is used for estimating the population size.
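Equation (7) amounts to a one-line computation once $\hat{θ}$ is available. The sketch below uses a hypothetical function name and made-up inputs ($n = 1000$, $\hat{θ} = 1.5$) purely for illustration.

```python
def ht_oippl(n: int, theta_hat: float) -> float:
    """Horvitz-Thompson estimate of N under OIPPL, following equation (7)."""
    # Probability of a zero count under the untruncated Poisson-Lindley.
    p0 = theta_hat**2 * (theta_hat + 2) / (theta_hat + 1) ** 3
    return n / (1 - p0)


# Illustrative values only: 1000 observed offenders and theta_hat = 1.5.
n_hat = ht_oippl(1000, 1.5)
```

Note that $\hat{ω}$ does not enter the formula directly; it affects the estimate only through its influence on $\hat{θ}$ during model fitting.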

Variance Estimation and Confidence Interval

A simple and general formula for the variance of the population size estimator has been proposed by Böhning (2008), which makes use of the idea that the variance of the population size estimator comes from two sources of variation. The first term in equation (8) explains the binomial variation involved in sampling $n$ data with population size $N$ and probability $g (θ)$ where $g (θ) = 1 - f (0 | θ)$ . The second term in equation (8) explains the variation due to the estimation of the parameter $θ$ based on $n$ observed data. Generally, the variance can be written as

Var(\hat{N}) = {Var}_{\hat{θ}, n} [\frac{n}{g (\hat{θ})}] = {Var}_{n} {E_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} + E_{n} {{Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} .

(8)

From equation (8), it is clear that the variation in the estimated population size consists of two sources of variation (Böhning, 2008). Consider the first term and by using the delta method,

E_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}] \approx \frac{n}{g (θ)} .

Since $n ~ B i n o m i a l (N, g (θ))$ , we obtain

{Var}_{n} {E_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} = {Var}_{n} [\frac{n}{g (θ)}] = \frac{N g (θ) [1 - g (θ)]}{g {(θ)}^{2}} .

(9)

The term in equation (9) can be further estimated by substituting $θ$ with $\hat{θ}$ and $N g (θ)$ with $n$ , leading to

V \hat{a} r_{n} {E_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} \approx \frac{n [1 - g (\hat{θ})]}{g {(\hat{θ})}^{2}} .

(10)

For OIPPL distribution, $g (θ) = 1 - f (0 | θ) = (θ^{2} + 3 θ + 1) / {(θ + 1)}^{3}$ . Therefore,

V \hat{a} r_{n} {E_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} \approx \frac{n {\hat{θ}}^{2} (\hat{θ} + 2) {(\hat{θ} + 1)}^{3}}{{({\hat{θ}}^{2} + 3 \hat{θ} + 1)}^{2}} .

(11)

Consider the second term in equation (8), assume that

E_{n} {{Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} \approx {Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}] .

(12)

Note that by using the delta method,

\begin{array}{l} {Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}] = n^{2} {Var}_{\hat{θ} | n} [\frac{1}{g (\hat{θ})}] \approx n^{2} {[\frac{g^{'} (θ)}{g {(θ)}^{2}}]}^{2} \\ {Var}_{\hat{θ} | n} (\hat{θ}) = {[\frac{n θ (θ + 4) {(θ + 1)}^{2}}{{(θ^{2} + 3 θ + 1)}^{2}}]}^{2} {Var}_{\hat{θ} | n} (\hat{θ}), \end{array}

(13)

where $V a r_{\hat{θ} | n} (\hat{θ})$ is the variance of the estimator for the parameter $θ$ of Poisson–Lindley distribution given in Theorem 4 of Ghitany and Al-Mutairi (2009). The second term in equation (8) can be written as

E_{n} {{Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} = {[\frac{n θ (θ + 4) {(θ + 1)}^{2}}{{(θ^{2} + 3 θ + 1)}^{2}}]}^{2} {Var}_{\hat{θ} | n} (\hat{θ}),

(14)

where $V a r_{\hat{θ} | n} (\hat{θ}) = I^{- 1} (θ) / n$ , where $I (θ)$ is the Fisher’s information about $θ$ for Poisson–Lindley distribution (Ghitany & Al-Mutairi, 2009) and

\begin{array}{l} I (θ) = \frac{2}{θ^{2}} - \frac{3 θ^{2} + 4 θ + 2}{θ {(θ + 1)}^{3}} + \frac{θ^{2}}{{(θ + 1)}^{2}} \int_{0}^{1} \frac{t^{θ + 1}}{θ + 1 - t} d t \\ = - \frac{θ^{3} - 2 θ^{2} - 4 θ - 2}{θ^{2} {(θ + 1)}^{3}} + \frac{θ^{2}}{{(θ + 1)}^{2}} \int_{0}^{1} \frac{t^{θ + 1}}{θ + 1 - t} d t . \end{array}

(15)

By substituting $\hat{θ}$ for $θ$ , the second term in equation (8) can be estimated by

\begin{array}{l} {\hat{E}}_{n} {{Var}_{\hat{θ} | n} [\frac{n}{g (\hat{θ})}]} = {[\frac{n \hat{θ} (\hat{θ} + 4) {(\hat{θ} + 1)}^{2}}{{({\hat{θ}}^{2} + 3 \hat{θ} + 1)}^{2}}]}^{2} \\ {[- \frac{{\hat{θ}}^{3} - 2 {\hat{θ}}^{2} - 4 \hat{θ} - 2}{{\hat{θ}}^{2} {(\hat{θ} + 1)}^{3}} + \frac{{\hat{θ}}^{2}}{{(\hat{θ} + 1)}^{2}} \int_{0}^{1} \frac{t^{\hat{θ} + 1}}{\hat{θ} + 1 - t} d t]}^{- 1} / n . \end{array}

(16)

Therefore, the estimated variance of the population size estimator after some simplification and by combining equations (11) and (16), can be written as

\begin{array}{l} V \hat{a} r ({\hat{N}}_{H T - O I P P L}) = \frac{n {\hat{θ}}^{2} (\hat{θ} + 2) {(\hat{θ} + 1)}^{3}}{{({\hat{θ}}^{2} + 3 \hat{θ} + 1)}^{2}} \\ + {[\frac{n \hat{θ} (\hat{θ} + 4) {(\hat{θ} + 1)}^{2}}{{({\hat{θ}}^{2} + 3 \hat{θ} + 1)}^{2}}]}^{2} \\ {[- \frac{{\hat{θ}}^{3} - 2 {\hat{θ}}^{2} - 4 \hat{θ} - 2}{{\hat{θ}}^{2} {(\hat{θ} + 1)}^{3}} + \frac{{\hat{θ}}^{2}}{{(\hat{θ} + 1)}^{2}} \int_{0}^{1} \frac{t^{\hat{θ} + 1}}{\hat{θ} + 1 - t} d t]}^{- 1} / n \end{array}

(17)

Under the assumption that the sampling distribution of the estimator is approximately normal, a 95% confidence interval for the population size can be obtained by using

{\hat{N}}_{H T - O I P P L} \pm z_{0.975} S E ({\hat{N}}_{H T - O I P P L}),

(18)

where $S E ({\hat{N}}_{H T - O I P P L}) = \sqrt{V a \hat{r} ({\hat{N}}_{H T - O I P P L})}$ and $z_{0.975} = 1.96$ .
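The variance and interval calculation can be sketched as below. The integral in the Fisher information is approximated with a composite Simpson rule, which is an implementation choice of this sketch rather than something prescribed by the study; the function names, the number of Simpson steps and the inputs ($n = 1000$, $\hat{θ} = 1.5$) are likewise assumptions.

```python
import math


def integral_term(theta, steps=2000):
    """Composite Simpson approximation of the integral of
    t**(theta + 1) / (theta + 1 - t) over [0, 1]; steps must be even."""
    f = lambda t: t ** (theta + 1) / (theta + 1 - t)
    h = 1.0 / steps
    total = f(0.0) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3


def fisher_info_pl(theta):
    """Fisher information I(theta) of the Poisson-Lindley distribution."""
    return (
        2 / theta**2
        - (3 * theta**2 + 4 * theta + 2) / (theta * (theta + 1) ** 3)
        + theta**2 / (theta + 1) ** 2 * integral_term(theta)
    )


def ht_oippl_interval(n, theta_hat, z=1.96):
    """Point estimate, standard error and normal-approximation interval."""
    q = theta_hat**2 + 3 * theta_hat + 1
    n_hat = n * (theta_hat + 1) ** 3 / q
    # First variance term: binomial variation in n.
    binomial_part = (
        n * theta_hat**2 * (theta_hat + 2) * (theta_hat + 1) ** 3 / q**2
    )
    # Second variance term: delta-method slope squared times Var(theta_hat).
    slope = n * theta_hat * (theta_hat + 4) * (theta_hat + 1) ** 2 / q**2
    estimation_part = slope**2 / (n * fisher_info_pl(theta_hat))
    se = math.sqrt(binomial_part + estimation_part)
    return n_hat, se, (n_hat - z * se, n_hat + z * se)


# Illustrative inputs only.
n_hat, se, (lower, upper) = ht_oippl_interval(n=1000, theta_hat=1.5)
```

The two named parts correspond to the two sources of variation described above: sampling $n$ out of $N$, and estimating $θ$ from the observed data.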

Alternative Estimators

Several estimators for the population size have been introduced in the context of the capture-recapture framework, but this study focuses on the following estimators. For each, the estimator and its variance are given to ease the calculation of a 95% confidence interval, which has a similar form to that in equation (18).

Horvitz–Thompson via zero-truncated Poisson–Lindley

A Horvitz–Thompson estimator based on zero-truncated Poisson–Lindley (HT-ZTPL) will serve as a benchmark which can be defined as

{\hat{N}}_{H T - Z T P L} = \frac{n}{1 - f (0 | {\hat{θ}}_{Z T P L})} = \frac{n {({\hat{θ}}_{Z T P L} + 1)}^{3}}{{\hat{θ}}_{Z T P L}^{2} + 3 {\hat{θ}}_{Z T P L} + 1},

(19)

where ${\hat{θ}}_{Z T P L}$ is the estimated parameter based on the zero-truncated Poisson–Lindley distribution. The variance of this estimator is similar to the variance of ${\hat{N}}_{H T - O I P P L}$ in equation (17) except that $\hat{θ}$ is substituted with ${\hat{θ}}_{Z T P L}$ .

Good–Turing estimator

Good (1953) proposed the Good–Turing estimator (GT) for the population size, which can be obtained by considering the Poisson distribution as the underlying distribution. The estimated probability of a zero count can be written as ${\hat{p}}_{0} = f_{1} / S$ , where $S = \sum_{y = 1}^{\infty} y f_{y}$ , and $f_{y}$ is the observed frequency of count $y$ . The resulting estimator for the population size can be written as

{\hat{N}}_{G T} = \frac{n}{1 - f_{1} / S},

(20)

with variance (Lerdsuwansri, 2012)

V a \hat{r} ({\hat{N}}_{G T}) = \frac{n f_{1} / S}{{(1 - f_{1} / S)}^{2}} + \frac{n^{2}}{{(1 - f_{1} / S)}^{4}} [\frac{f_{1} (1 - f_{1} / {\hat{N}}_{G T})}{S^{2}} + \frac{f_{1}^{2}}{S^{3}}] .

(21)
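A minimal sketch of equations (20) and (21) follows. The function name and the frequency table are hypothetical and serve only to show how the quantities $f_{1}$, $S$ and $n$ are formed from a table of counts.

```python
def good_turing(freq):
    """Good-Turing estimate and its variance, equations (20) and (21).

    freq maps each count y to its observed frequency f_y.
    """
    n = sum(freq.values())
    f1 = freq.get(1, 0)
    s = sum(y * f_y for y, f_y in freq.items())  # S = sum of y * f_y
    p0 = f1 / s                                  # estimated P(zero count)
    n_hat = n / (1 - p0)
    var = n * p0 / (1 - p0) ** 2 + n**2 / (1 - p0) ** 4 * (
        f1 * (1 - f1 / n_hat) / s**2 + f1**2 / s**3
    )
    return n_hat, var


# Made-up frequencies {count: number of individuals}, for illustration only.
freq = {1: 700, 2: 150, 3: 80, 4: 40, 5: 20, 6: 10}
n_hat_gt, var_gt = good_turing(freq)
```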

Zelterman estimator

Zelterman estimator (Zt) is a robust estimator for the population size under unobserved heterogeneity which can be obtained by considering truncated Poisson distribution as the underlying distribution (Zelterman, 1988). The estimator can be written as

{\hat{N}}_{Z t} = \frac{n}{1 - \exp (- 2 f_{2} / f_{1})} .

(22)

The estimated variance for the estimator above has been studied by Böhning (2008) and is given as

\widehat{\mathrm{Var}} ({\hat{N}}_{Z t}) = n G (\hat{λ}) \left[ 1 + n G (\hat{λ}) {\hat{λ}}^{2} \left( \frac{1}{f_{1}} + \frac{1}{f_{2}} \right) \right],

(23)

where

G (\hat{λ}) = \frac{\exp (- \hat{λ})}{{[1 - \exp (- \hat{λ})]}^{2}}; \hat{λ} = \frac{2 f_{2}}{f_{1}} .
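A minimal sketch of equations (22) and (23) follows. The function name and the toy inputs are assumptions made for illustration only; the estimator needs just $n$, $f_{1}$ and $f_{2}$.

```python
import math


def zelterman(n, f1, f2):
    """Return (N_hat, variance) for the Zelterman estimator, equations (22) and (23)."""
    if f1 <= 0 or f2 <= 0:
        raise ValueError("f1 and f2 must both be positive")
    lam = 2.0 * f2 / f1                                       # lambda_hat = 2 f2 / f1
    n_hat = n / (1.0 - math.exp(-lam))
    g = math.exp(-lam) / (1.0 - math.exp(-lam)) ** 2          # G(lambda_hat)
    var = n * g * (1.0 + n * g * lam**2 * (1.0 / f1 + 1.0 / f2))
    return n_hat, var


# Toy inputs (hypothetical): n = 100, f1 = 50, f2 = 30
n_hat, var = zelterman(100, 50, 30)
print(n_hat, var**0.5)
```

Because only the frequencies of ones and twos enter, the estimate is sensitive to any excess of ones in the data.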

Chao estimator

Chao (1987, 1989) proposed the Chao estimator for the population size under unobserved heterogeneity, where the data are generated from a Poisson mixture distribution. The estimator can be written as

{\hat{N}}_{C h} = n + \frac{f_{1}^{2}}{2 f_{2}},

(24)

with its variance studied by Böhning (2008), given as

\widehat{\mathrm{Var}} ({\hat{N}}_{C h}) = \frac{1}{4} \frac{f_{1}^{4}}{f_{2}^{3}} + \frac{f_{1}^{3}}{f_{2}^{2}} + \frac{1}{2} \frac{f_{1}^{2}}{f_{2}} - \frac{1}{4} \frac{f_{1}^{4}}{n f_{2}^{2}} - \frac{1}{2} \frac{f_{1}^{4}}{f_{2} (2 n f_{2} + f_{1}^{2})} .

(25)
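Equations (24) and (25) translate directly into code. The sketch below uses a hypothetical function name and toy inputs; it is meant only to make the term-by-term structure of the variance formula explicit.

```python
def chao(n, f1, f2):
    """Return (N_hat, variance) for the Chao estimator, equations (24) and (25)."""
    if f2 <= 0:
        raise ValueError("f2 must be positive")
    n_hat = n + f1**2 / (2.0 * f2)
    var = (0.25 * f1**4 / f2**3
           + f1**3 / f2**2
           + 0.5 * f1**2 / f2
           - 0.25 * f1**4 / (n * f2**2)
           - 0.5 * f1**4 / (f2 * (2.0 * n * f2 + f1**2)))
    return n_hat, var


# Toy inputs (hypothetical): n = 100, f1 = 50, f2 = 30
n_hat, var = chao(100, 50, 30)
print(n_hat, var**0.5)
```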

To avoid overestimating the population size in the presence of one-inflated data, Böhning et al. (2019) proposed a modified Chao estimator that excludes the information from one-valued observations. The modified Chao estimator (MCh) and its variance are given by

{\hat{N}}_{M C h} = n + {\hat{f}}_{0} = n + \frac{a_{0} a_{3}^{2}}{a_{2}^{3}} \frac{f_{2}^{3}}{f_{3}^{2}}

(26)

and

\mathrm{Var} ({\hat{f}}_{0}) = {\hat{f}}_{0} \left( \frac{1}{f_{2} + f_{3}} \right) \left[ 1 + \frac{{(2 f_{2} + 3 f_{3})}^{2}}{f_{2} f_{3}} \right],

(27)

respectively, where $a_{x}$ is the constant term in the discrete power series distribution with probability mass function $p_{x} (θ) = a_{x} θ^{x} / η (θ)$ for $x = 0, 1, 2, 3, \dots$, and $η (θ)$ is the normalizing constant. To keep the estimator from suffering severe bias, Böhning et al. (2019) also proposed a bias-corrected modified Chao estimator (BCMCh), given as

{\hat{N}}_{B C M C h} = n + {\hat{f}}_{0, b} = n + \frac{a_{0} a_{3}^{2}}{a_{2}^{3}} \frac{f_{2}^{3} - 3 f_{2}^{2} + 2 f_{2}}{(f_{3} + 1) (f_{3} + 2)},

(28)

with variance of ${\hat{f}}_{0, b}$ which is given by

\mathrm{Var} ({\hat{f}}_{0, b}) = {\hat{f}}_{0, b}^{2} \left( \frac{1}{f_{2} + f_{3}} \right) \left[ 1 + \frac{{(2 f_{2} + 3 f_{3})}^{2}}{(f_{2} + 1) (f_{3} + 1)} \right],

(29)

which is equivalent to the estimated variance of the bias-corrected modified Chao estimator, $\widehat{\mathrm{Var}} ({\hat{N}}_{B C M C h})$. Under the same assumption as in the Chao estimator, namely that the data are generated from a Poisson mixture, the bias-corrected modified Chao estimator reduces to

{\hat{N}}_{B C M C h} = n + {\hat{f}}_{0, b} = n + \frac{2}{9} \frac{f_{2}^{3} - 3 f_{2}^{2} + 2 f_{2}}{(f_{3} + 1) (f_{3} + 2)},

(30)

by taking $a_{x} = 1 / x!$ for a Poisson distribution. The estimated variance is obtained from equation (29) with ${\hat{f}}_{0, b}$ computed under the same assumption.
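The constant $2 / 9$ in equation (30) follows from $a_{0} a_{3}^{2} / a_{2}^{3}$ with $a_{x} = 1 / x!$. The sketch below checks this with exact fractions and then evaluates equations (30) and (29); the function names and toy frequencies are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def poisson_coefficient():
    """Compute a_0 * a_3^2 / a_2^3 exactly for a_x = 1 / x! (Poisson case)."""
    a = lambda x: Fraction(1, factorial(x))
    return a(0) * a(3) ** 2 / a(2) ** 3


def bcmch_poisson(n, f2, f3):
    """Return (N_hat, variance of f0_b) from equations (30) and (29)."""
    coef = float(poisson_coefficient())
    f0_b = coef * (f2**3 - 3 * f2**2 + 2 * f2) / ((f3 + 1) * (f3 + 2))
    var = (f0_b**2 / (f2 + f3)
           * (1.0 + (2 * f2 + 3 * f3) ** 2 / ((f2 + 1) * (f3 + 1))))
    return n + f0_b, var


print(poisson_coefficient())          # exact value of the constant
print(bcmch_poisson(100, 30, 20))     # toy inputs (hypothetical)
```

Note that the numerator $f_{2}^{3} - 3 f_{2}^{2} + 2 f_{2}$ factors as $f_{2} (f_{2} - 1) (f_{2} - 2)$, so the estimated number of unobserved individuals is zero whenever $f_{2} \le 2$.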

Simulation Study

The proposed population size estimator, HT-OIPPL, is assessed in a simulation study by comparing its performance with the known estimators described in Section 2.4, namely the HT-ZTPL, GT, Zt, and BCMCh estimators. The bias-corrected modified Chao estimator is chosen over the conventional Chao estimator and the modified Chao estimator because it both accounts for one-inflation in the data and reduces bias. The simulation study compares the performance of all estimators under different sample sizes and different parameter combinations in the data-generating process.

Simulation Scenario

The simulation study is conducted to investigate the conditions under which the proposed estimator is most suitable. The data are generated from the OIPPL distribution with parameters $θ = 0.5, 1.0, 1.5$ and $ω = 0.1, 0.3, 0.5, 0.7$. This setting specifically generates data with excess ones. The performance of HT-OIPPL is compared with the HT-ZTPL, GT, Zt, and BCMCh estimators in equations (19), (20), (22), and (30), respectively, in terms of the percentage of relative absolute bias and the percentage of relative standard deviation. The simulation algorithm is given below.

Step 1: Generate a small random sample of size $N = 100$ from the Poisson–Lindley distribution with parameter $θ$.

Step 2: Remove $n_{0}$ zero-valued data from $N$ random data to get $n$ nonzero-valued data.

Step 3: Randomly alter $k$ data from $n$ nonzero-valued data into “one” at a fixed proportion $ω$ to reflect the property of one-inflation such that $k = ω n$ .

Step 4: Fit the new altered count data with $n$ sample size to OIPPL distribution.

Step 5: Estimate the population size based on all the estimators considered in equations (7), (19), (20), (22), and (30).

Step 6: Repeat Steps 1 to 5 2,000 times to obtain the estimated values of $N$, and compare them with the original simulated population size $N$ in terms of the percentage of relative absolute bias (RAB), defined as

\mathrm{RAB} = \frac{100}{N} \left| \bar{\hat{N}} - N \right|,

and the percentage of relative standard deviation (RSd), defined as

\mathrm{RSd} = \frac{100}{N} {\left[ \frac{1}{2000} \sum_{i = 1}^{2000} {({\hat{N}}_{i} - \bar{\hat{N}})}^{2} \right]}^{1 / 2},

where ${\hat{N}}_{i}$ is the estimate of $N$ in the $i$th replication for the estimator considered and $\bar{\hat{N}}$ is the sample mean of the estimated population sizes.

Step 7: Repeat the simulation study from the data-generating process for $N = 1{,}000$ and $N = 10{,}000$ to represent medium and large population sizes.
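The data-generating part of the algorithm (Steps 1 to 3) and the two performance measures in Step 6 can be sketched as follows. This is a hedged illustration, not the authors' code: the Poisson–Lindley draws use the standard representation of the Lindley distribution as a mixture of an exponential and a gamma distribution, the OIPPL fit of Step 4 is omitted, and the Good–Turing estimator of equation (20) stands in for Step 5 because it needs no model fitting. All function names, the seed, and the reduced number of replications (200 rather than 2,000) are assumptions.

```python
import numpy as np


def rpoisson_lindley(rng, size, theta):
    """Draw Poisson-Lindley counts as a Poisson mixture over a Lindley rate."""
    # Lindley(theta) = Exp(rate theta) with weight theta / (theta + 1),
    #                  Gamma(shape 2, rate theta) with weight 1 / (theta + 1).
    shape = np.where(rng.random(size) < theta / (theta + 1.0), 1.0, 2.0)
    rate = rng.gamma(shape, 1.0 / theta)
    return rng.poisson(rate)


def one_inflated_sample(rng, N, theta, omega):
    """Steps 1 to 3: simulate, drop zeros, then set k = omega * n counts to one."""
    y = rpoisson_lindley(rng, N, theta)
    y = y[y > 0]                                   # Step 2: keep nonzero counts
    k = int(round(omega * y.size))                 # Step 3: number of altered counts
    y[rng.choice(y.size, size=k, replace=False)] = 1
    return y


def good_turing(y):
    """Equation (20), used here only as a stand-in estimator for Step 5."""
    f1 = np.sum(y == 1)
    return y.size / (1.0 - f1 / y.sum())


def rab_rsd(estimates, N):
    """Percentage relative absolute bias and relative standard deviation (Step 6)."""
    est = np.asarray(estimates, dtype=float)
    rab = 100.0 / N * abs(est.mean() - N)
    rsd = 100.0 / N * np.sqrt(np.mean((est - est.mean()) ** 2))
    return rab, rsd


rng = np.random.default_rng(2024)
N, theta, omega = 1000, 1.0, 0.3
estimates = [good_turing(one_inflated_sample(rng, N, theta, omega))
             for _ in range(200)]
print(rab_rsd(estimates, N))
```

Replacing `good_turing` with a function that fits the OIPPL model and applies equation (7) would give the corresponding measures for HT-OIPPL.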

Step 3 is important when generating data from the OIPPL distribution because it shows how one-inflation occurs. One-inflation may arise because offenders exit the population due to incarceration or, more rarely, because offenders are misclassified. The relative absolute bias measures how close the average estimated population size over the replications is to the actual population size, whereas the relative standard deviation measures how close the estimates from the individual replications are to one another, relative to the actual population size. It is important that the estimated population size is close to the actual population size. A lower relative standard deviation indicates that the estimator behaves consistently, because the estimates from different replications take similar values.

The results of the simulation are plotted in Figures 1 and 2. Figure 1 shows the relative absolute bias of the estimated population size under the different estimation methods for varying parameter values. Figure 2 shows the corresponding relative standard deviation.

Figure 1.

Graph of percentage relative absolute bias values of HT-OIPPL, HT-ZTPL, GT, BCMCh, and Zt estimators with respect to N = 100, 1,000, and 10,000 when the data-generating process is OIPPL for parameters ω = 0.1, 0.3, 0.5, and 0.7 and θ = 0.5, 1.0, and 1.5.

Figure 2.

Graph of percentage relative standard deviation values of HT-OIPPL, HT-ZTPL, GT, BCMCh, and Zt estimators with respect to N = 100, 1,000, and 10,000 when the data-generating process is OIPPL for parameters ω = 0.1, 0.3, 0.5, and 0.7 and θ = 0.5, 1.0, and 1.5.

From Figures 1 and 2, it can be observed that for any given values of $ω$ and $θ$, as the sample size increases, the percentage of relative absolute bias and the percentage of relative standard deviation for HT-OIPPL approach zero, suggesting that HT-OIPPL is asymptotically unbiased and consistent. It is not surprising that HT-OIPPL gives the smallest percentage of relative absolute bias and of relative standard deviation among the estimators considered when the data-generating process is the OIPPL distribution. It is also worth noting that when the data are generated from the OIPPL distribution, the other estimators do not perform well. Even the Zelterman estimator, which is regarded as robust, tends not to estimate the population size adequately. Similarly, the bias-corrected modified Chao estimator is unable to estimate the population size adequately in the presence of one-inflation. Despite yielding a smaller percentage of relative standard deviation, the Good–Turing estimator cannot be regarded as a good estimator for data generated from OIPPL because its percentage of relative absolute bias is very high compared with the proposed estimator. HT-ZTPL is expected to be inaccurate because it does not take the one-inflation component into account. Therefore, when the data are generated under the assumption that the population follows the OIPPL distribution, HT-OIPPL is the best estimator on both measures used in the simulation study.

Alternative Scenario

Both the one-inflated positive Poisson (Godwin & Böhning, 2017) and the one-inflated zero-truncated negative binomial (Godwin, 2017) distributions have an inflation parameter $ω$ similar to that of the OIPPL distribution, whereas the OIPPL distribution has an additional Lindley component for heterogeneity. In contrast, the zero-truncated negative binomial (ZTNB) distribution has neither an inflation parameter nor a Lindley component as a mixing distribution for heterogeneity. Estimating the population size using HT-OIPPL when ZTNB is the data-generating process is therefore expected to be the most challenging case. Accordingly, the count data are generated from the ZTNB distribution with parameters $λ = 1.0, 2.0, 3.0$ and $α = 0.5, 1.0, 1.5$. The Horvitz–Thompson estimator for the population size under the ZTNB distribution (Godwin, 2017, 2019), denoted as HT-ZTNB, is given by

{\hat{N}}_{H T - Z T N B} = \frac{n}{[1 - {(1 + \frac{\hat{λ}}{\hat{α}})}^{- \hat{α}}]},

(31)

where $λ$ is the shape parameter and $α$ is the scale parameter for ZTNB distribution. The performance of HT-OIPPL is compared with HT-ZTNB in terms of the percentage of relative absolute bias and the percentage of relative standard deviation. The simulation procedure is as follows:

Step 1: Generate a small random sample of size $N = 100$ from the negative binomial distribution with parameters $λ$ and $α$.

Step 2: Remove $n_{0}$ zero-valued data from $N$ random data to get $n$ nonzero-valued data.

Step 3: Fit the $n$ nonzero-valued data to the OIPPL distribution (and to the ZTNB distribution, whose estimates $\hat{λ}$ and $\hat{α}$ are needed in equation (31)).

Step 4: Estimate the population size based on estimators considered in equations (7) and (31).

Step 5: Repeat Steps 1 to 4 2,000 times to obtain the estimated values of $N$, and compare them with the original simulated population size $N$ in terms of the percentage of relative absolute bias (RAB) and the percentage of relative standard deviation (RSd) as defined in Section 3.1.

Step 6: Repeat the simulation study from the data-generating process for $N = 1{,}000$ and $N = 10{,}000$ to represent medium and large population sizes.
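Steps 1 and 2 and equation (31) can be sketched as below. The sketch assumes a parameterization in which the untruncated negative binomial has mean $λ$ and size $α$, which is the reading consistent with the zero probability ${(1 + λ / α)}^{- α}$ in equation (31). The true parameter values are plugged into the estimator purely for illustration; in the study they are replaced by fitted values. Function names and the seed are hypothetical.

```python
import numpy as np


def truncated_nb_sample(rng, N, lam, alpha):
    """Steps 1 and 2: draw N negative binomial counts, then remove the zeros."""
    # NumPy parameterizes by (n, p); n = alpha and p = alpha / (alpha + lam)
    # give mean lam and P(Y = 0) = (1 + lam / alpha) ** (-alpha).
    p = alpha / (alpha + lam)
    y = rng.negative_binomial(alpha, p, size=N)
    return y[y > 0]


def ht_ztnb(n, lam, alpha):
    """Horvitz-Thompson estimate under ZTNB, equation (31)."""
    p0 = (1.0 + lam / alpha) ** (-alpha)
    return n / (1.0 - p0)


rng = np.random.default_rng(7)
y = truncated_nb_sample(rng, 10_000, lam=2.0, alpha=1.0)
# Plugging in the true (lam, alpha) should recover roughly N = 10,000.
print(y.size, ht_ztnb(y.size, 2.0, 1.0))
```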

The results of the simulation are plotted in Figures 3 and 4. Figure 3 shows the relative absolute bias of the estimated population size under the two estimation methods for varying parameter values. Figure 4 shows the corresponding relative standard deviation.

Figure 3.

Graph of percentage relative absolute bias values of HT-OIPPL and HT-ZTNB with respect to N = 100, 1,000, and 10,000 when the data-generating process is ZTNB distribution for parameters λ = 1.0, 2.0, and 3.0 and α = 0.5, 1.0, and 1.5.

Figure 4.

Graph of percentage relative standard deviation values of HT-OIPPL and HT-ZTNB with respect to N = 100, 1,000, and 10,000 when the data-generating process is ZTNB distribution for parameters λ = 1.0, 2.0, and 3.0 and α = 0.5, 1.0, and 1.5.

From Figures 3 and 4, it is not surprising that for any given $λ$ and $α$, the percentage of relative absolute bias and the percentage of relative standard deviation for HT-ZTNB decrease as the sample size increases. However, when the sample size is small ($N = 100$), HT-OIPPL gives a smaller percentage of relative absolute bias and of relative standard deviation. The cost of excluding OIPPL from consideration therefore tends to be highest when the data are generated from the ZTNB distribution with a small sample size. Although the percentage of relative absolute bias for HT-OIPPL remains nearly unchanged across sample sizes, HT-OIPPL still compares reasonably with HT-ZTNB, as it always gives a smaller percentage of relative standard deviation.

Applications

To illustrate the capability of the OIPPL distribution in model fitting, two real crime-related datasets are examined. For both datasets, the number of ones fitted by ZTPL is compared with the observed number. Since the fitted number of ones under ZTPL is lower than the observed number of ones, we conclude that one-inflation exists in the data relative to the ZTPL distribution. The same conclusion is reached with the score test proposed by Godwin and Böhning (2017), which requires the maximum likelihood estimator of $λ$ from the positive Poisson (zero-truncated Poisson) distribution, with the null hypothesis $H_{0}$ that the data have no one-inflation. Based on the score test, $H_{0}$ is rejected at the 5% significance level, which supports the use of OIPPL for model fitting. For each dataset, the population size is estimated using the HT-OIPPL, HT-ZTPL, HT-ZTNB, GT, Zt, and BCMCh estimators. The 95% confidence interval is calculated for all estimators except HT-ZTNB, because the variance of HT-ZTNB is not given by Godwin (2017, 2019). The chi-square goodness-of-fit statistic and its associated p-value are also reported to assess the adequacy of the model fitting using the OIPPL distribution.

Prostitutes in Vancouver

Rossmo and Routledge (1990) analyzed data on street prostitutes in Vancouver using both target response and auxiliary information. Apprehending prostitutes has a minute deterrence effect (Rossmo & Routledge, 1990), and thus it is reasonable to assume that the contagion effect is negligible. Furthermore, the authors noted that the prudent ones learn from the experienced how to avoid being arrested, rather than to avoid prostitution, which results in a large number of one-time arrests. The prostitutes may already have the desire to avoid being arrested, but the ability to avoid arrest, which some prudent ones learn from the experienced, is acquired prior to any apprehension (Godwin & Böhning, 2017). Since the avoidance ability is not learned after the first arrest, it is reasonable to assume that it will not be learned after subsequent arrests either, which contributes to one-inflation (Godwin & Böhning, 2017). Therefore, it is acceptable to estimate the population size of the street prostitutes in Vancouver using the OIPPL model, which takes the one-inflation component into account.

The results of fitting the models to the number of times the prostitutes were arrested, together with the estimated population sizes, are given in Table 2. The OIPPL provides an adequate fit to the data, as does ZTNB. However, the chi-square statistic for OIPPL is smaller than that for ZTNB, making the population size estimated from HT-ZTNB less favorable than that from HT-OIPPL. Furthermore, the standard deviations of all estimators except the Zelterman estimator are considerably small. Relying solely on the standard deviation is therefore not wise for this dataset; once goodness of fit is also taken into account, HT-OIPPL is the most suitable. The population size of the prostitutes in Vancouver based on HT-OIPPL is estimated to be 1,687 with a 95% confidence interval of 1,578 to 1,796. This interval is much narrower than that of Rossmo and Routledge (1990), who estimated the population size of prostitutes in Vancouver as 1,610 with a 95% confidence interval of 1,380 to 2,000, implying that the OIPPL model improves the estimation of the population size of the prostitutes in Vancouver, at least in terms of standard error.

Table 2.

The Results of Model Fitting for OIPPL, ZTPL, and ZTNB Models, the Estimated Population Size and the Standard Error of the Estimators with Its Associated 95% Confidence Interval.

Dataset / Estimator	$\hat{N}$	$SD (\hat{N})$	95% lower	95% upper	$\hat{θ}$ ($\hat{ω}$)	$\hat{λ}$ ($\hat{α}$)	$χ^{2}$ (p-value)
Prostitutes in Vancouver
HT-OIPPL	1,687	56	1,578	1,796	1.3739 (0.2299)	-	3.696 (.296)
HT-ZTPL	1,962	75	1,816	2,108	1.7341	-	43.312 (<.001)
HT-ZTNB	4,085	-	-	-	-	0.3787 (0.2960)	6.835 (.088)
GT	1,359	41	1,279	1,440	-	-	-
Zt	1,903	129	1,649	2,156	-	-	-
BCMCh	997	35	929	1,066	-	-	-
Drunk drivers
HT-OIPPL	80,137	2,267	75,694	84,579	8.3227 (0.5067)	-	0.670 (.413)
HT-ZTPL	153,721	6,202	141,566	165,877	16.2134	-	23.457 (<.001)
HT-ZTNB	948,606	-	-	-	-	0.0106 (0.0868)	15.172 (<.001)
GT	81,811	8,930	64,308	99,313	-	-	-
Zt	91,710	4,162	83,552	99,868	-	-	-
BCMCh	18,006	2,655	12,801	23,211	-	-	-
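The interval bounds in Table 2 appear consistent with a normal-approximation interval, $\hat{N} \pm 1.96 \, SD (\hat{N})$. The sketch below assumes that this is the form of equation (17), which is not reproduced in this section, and recomputes the HT-OIPPL intervals from the rounded table entries; small discrepancies of about one unit are expected because the published standard deviations are rounded. The function name is hypothetical.

```python
def wald_interval(n_hat, sd, z=1.96):
    """Normal-approximation interval N_hat +/- z * SD (assumed form of equation (17))."""
    half_width = z * sd
    return n_hat - half_width, n_hat + half_width


# Rounded HT-OIPPL entries taken from Table 2
for label, n_hat, sd in [("Prostitutes in Vancouver", 1687, 56),
                         ("Drunk drivers", 80137, 2267)]:
    lower, upper = wald_interval(n_hat, sd)
    print(label, round(lower), round(upper))
```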

Drunk Drivers

Police records on drunk drivers were considered by Van der Heijden et al. (2003) to estimate the size of the drunk-driver population. The authors employed both the zero-truncated Poisson null model and a regression model in estimating the population size. The unobserved heterogeneity in the data is explained by adding covariates to the regression model (Van der Heijden et al., 2003). Godwin and Böhning (2017) ruled out a contagion effect due to police behavior because the police stop and check each passing driver and arrest those who are drunk driving. In addition, when drunk drivers are locked up, they exit the population for some time, which contributes to one-inflation in the arrest data. On that account, the OIPPL model is a suitable candidate for modeling data on drunk drivers with a large number of ones and, consequently, for estimating the population size.

The results of the model fitting and the estimated population sizes are given in Table 2. The OIPPL model provides an adequate fit to the data, whereas ZTPL and ZTNB fail to fit the data adequately. The population size estimated from HT-OIPPL can be compared directly with the estimate based on the null model of Van der Heijden et al. (2003), who reported an estimate of 78,710 with a 95% confidence interval of 72,738 to 84,682. Under our proposed model, the estimated population size is 80,137 with a tighter 95% confidence interval of 75,694 to 84,579. The authors did mention that the lower bound of the estimated population size of drunk drivers is 113,771 because the covariates added to the zero-truncated Poisson regression reduce the deviance of the model significantly (Van der Heijden et al., 2003). However, since the OIPPL model gives a better estimate of the population size of drunk drivers than the null model of Van der Heijden et al. (2003) in terms of standard error and confidence interval, it is presumed that a regression model based on the OIPPL would improve the lower bound of the population size. This hypothesis remains to be tested in a future study. Even though the BCMCh estimator has a standard deviation relatively close to that of HT-OIPPL and a considerably smaller estimated population size, these values are questionable because their validity cannot be tested statistically, unlike the estimate based on OIPPL, for which the chi-square goodness-of-fit test serves as a model adequacy test. The other estimators considered give very high standard errors, making them less desirable when selecting the best estimator for the population size of drunk drivers.

Conclusions

In many cases, the hidden population size is unknown and needs to be estimated. In criminology, knowing the size of the criminal population for a given crime is important in establishing and formulating new rules. A new distribution, named the one-inflated positive Poisson–Lindley distribution, is proposed. It incorporates a one-inflation component, a form of heterogeneity that exists when the data have a large number of “1” counts. The flexibility of this distribution is seen in that datasets generated randomly from it can exhibit either overdispersion or underdispersion, together with a large number of ones.

To estimate the hidden population size, a new estimator in the form of a Horvitz–Thompson estimator is proposed, with the one-inflated positive Poisson–Lindley distribution as the underlying model. The performance of this new estimator is investigated in terms of unbiasedness and consistency and compared with other commonly used empirical estimators, namely the Good–Turing estimator, the bias-corrected modified Chao estimator, and the Zelterman estimator. The Horvitz–Thompson estimator based on the zero-truncated Poisson–Lindley distribution, also known as the positive Poisson–Lindley distribution, is also included in the simulation study for comparison. It is not surprising that when the simulated data are generated under the proposed model, the estimated population size is closer to the original values, as seen from the small percentages of relative absolute bias and relative standard deviation. These results suggest that the proposed estimator is asymptotically unbiased and consistent.

To exhibit the usefulness of the proposed estimator under a different data-generating process, the zero-truncated negative binomial distribution is selected because it has neither a one-inflation component nor heterogeneity in the form of a Lindley distribution. The performance of the proposed estimator is compared with the Horvitz–Thompson estimator based on the zero-truncated negative binomial distribution. The latter estimator is expected to give better estimates when the zero-truncated negative binomial distribution is the data-generating process. However, for a small sample size, the proposed estimator gives a lower percentage of relative absolute bias, suggesting that the proposed model is useful for small samples generated from the zero-truncated negative binomial distribution. Furthermore, the percentage of relative standard deviation of the proposed estimator is always lower than that of the Horvitz–Thompson estimator based on the zero-truncated negative binomial distribution, suggesting consistent estimation over different sample sizes. It can therefore be concluded that ignoring the proposed estimator can be costly even under a different data-generating process.

Two datasets with a large number of ones are used for model fitting to investigate the performance of the proposed estimator in estimating the population size from real crime data. The results show that the proposed model describes both datasets adequately. The standard deviation of the estimated population size based on the OIPPL is found to be considerably small for both datasets when compared with other estimators, which results in a narrower 95% confidence interval. The estimated population size can be a useful indicator for authorities to understand the number of criminals who have not yet been apprehended. In reality, the true population size remains unknown, and it is believed that the estimated population size can serve as a starting value for it.

In some cases, criminals do learn to avoid arrest after the initial arrest, and this deterrence affects the probability of re-apprehension. In this case, our model may not fully capture the effects of the deterrence, which is a form of contagion. An in-depth analysis can also be carried out to further understand the behavior of criminals by incorporating demographic, economic, and social factors as covariates in a regression model, which can be investigated in a future study. Covariates that could be included are age, gender, race, education, poverty, involvement in gang violence, and others. The covariates vary with the crime of interest. By selecting appropriate factors for the regression model, a better view of the behavior of criminals for a specific crime in a society can be obtained, and this will provide a more accurate estimate of the population size.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is funded by Ministry of Education, Malaysia, grant number FRGS/1/2019/STG06/UKM/01/5 and by Universiti Kebangsaan Malaysia, grant number GUP-2019-031.

ORCID iD

Razik Ridzuan Mohd Tajuddin

Author Biographies

Razik Ridzuan Mohd Tajuddin is a new researcher in the area of count data distributions, currently focusing on quantitative criminology, especially the estimation of criminal population sizes.

Noriszura Ismail is a Professor in, and Head of, the Actuarial Science program at Universiti Kebangsaan. Her research specializes in applied statistics and risk modeling.

Kamarulzaman Ibrahim is a Professor at Universiti Kebangsaan. His research specializes in applied statistics.

References

Anan, Böhning, & Maruotti (2016). Population size estimation and heterogeneity in capture-recapture data: A linear regression estimator based on the Conway-Maxwell-Poisson distribution. Statistical Methods and Applications, 26, 49–79.

Anan, Böhning, & Maruotti (2019). On the Turing estimator in capture-recapture count data under the geometric distribution. Metrika, 82, 149–172.

Böhning (2008). A simple variance formula for population size estimators by conditioning. Statistical Methodology, 5, 410–423.

Böhning, Kaskasamkul, & Van der Heijden (2019). A modification of Chao’s lower bound estimator in the case of one-inflation. Metrika, 82, 361–384.

Böhning, & Ogden, H. E. (2020). General flation models for count data. Metrika: International Journal for Theoretical and Applied Statistics, 84(2), 245–261.

Böhning, Suppawattanabodee, Kusolvisitkul, & Viwatwongkasem (2004). Estimating the number of drug users in Bangkok 2001: A capture-recapture approach using repeated entries in one list. European Journal of Epidemiology, 19, 1075–1083.

Böhning, & Van der Heijden (2019). The identity of the zero-truncated, one-inflated likelihood and the zero-one-truncated likelihood for general count data with an application to drunk-driving in Britain. The Annals of Applied Statistics, 13, 1198–1211.

Bouchard, Morselli, Macdonald, Gallupe, Zhang, & Farabee (2019). Estimating risks of arrest and criminal populations: Regression-adjustments to capture-recapture models. Crime & Delinquency, 65, 1764–1797.

Cai, & Xia (2018). Estimating size of drug users in Macau: An open population capture-recapture model with data augmentation using public registration data. Asian Journal of Criminology, 13, 193–206.

Cecconi, Gandolf, & Sastri, C. C. A. (2012). A new estimator for the number of species in a population. Sankhya A, 74, 80–100.

Chao (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics, 43, 783–791.

Chao (1989). Estimating population size for sparse data in capture-recapture experiments. Biometrics, 45, 427–438.

Chao, & Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of American Statistical Association, 87, 210–217.

Chapman, D. G. (1951). Some properties of the hypergeometric distribution with applications to zoological census. University of California Publications in Statistics, 1, 131–160.

Cochran, W. G. (1978). Laplace’s ratio estimator. In David, H. A. (Ed.), Contributions to survey sampling and applied statistics. Papers in Honor of H. O. Hartley (pp. 3–10). Academic Press.

Cruyff, & Van der Heijden (2008). Point and interval estimation of the population size using a zero-truncated negative binomial regression model. Biometrical Journal, 50, 1035–1050.

Edwards, W. R., & Eberhardt (1967). Estimating cottontail abundance from livetrapping data. The Journal of Wildlife Management, 31, 87–96.

Ghitany, M. E., & Al-Mutairi, D. K. (2009). Estimation methods for the discrete Poisson-Lindley distribution. Journal of Statistical Computation and Simulation, 79, 1–9.

Ghitany, M. E., Al-Mutairi, D. K., & Nadarajah (2008). Zero-truncated Poisson-Lindley distribution and its application. Mathematics and Computers in Simulation, 79, 279–287.

Godwin, R. T. (2017). One-inflation and unobserved heterogeneity in population size estimation. Biometrical Journal, 59, 1–15.

Godwin, R. T. (2019). The one-inflated positive Poisson mixture model for use in population size estimation. Biometrical Journal, 61, 1–16.

Godwin, R. T., & Böhning (2017). Estimation of the population size by using the one-inflated positive Poisson model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 66, 425–448.

Good, I. J. (1953). The population frequency of species and the estimation of population parameters. Biometrika, 40, 234–264.

Hald (2003). A history of probability and statistics and their applications before 1750. Wiley.

Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.

Hwang, W. H., Lin, C. W., & Shen, T. J. (2015). Good-Turing frequency estimation in a finite population. Biometrical Journal, 57, 321–329.

Kaskasamkul, & Böhning (2018). Population size estimation for one-inflated count data based upon the geometric distribution. In Böhning, Van der Heijden, P. G. M., & Bunge (Eds.), Capture-recapture methods for the social and medical sciences (pp. 191–209). Chapman & Hall/CRC.

Lerdsuwansri (2012). Generalisation of the Lincoln-Peterson approach to non-binary source variable [Ph.D. thesis]. University of Reading.

Levine, J. A., & Dandamudi (2016). Prevention of child sexual abuse by targeting pre-offenders before first offense. Journal of Child Sexual Abuse, 25, 719–737.

M’Kendrick, A. G. (1925). Applications of mathematics to medical problems. Proceedings of the Edinburgh Mathematical Society, 44, 98–130.

Niwitpong, Böhning, Van der Heijden, & Holling (2013). Capture-recapture estimation based upon the geometric distribution allowing for heterogeneity. Metrika, 76, 495–519.

Nixon, C. M., Edwards, W. R., & Eberhardt (1967). Estimating squirrel abundance from livetrapping data. The Journal of Wildlife Management, 31, 96–101.

Nolan, J. J., Haas, S. M., & Napier, J. S. (2011). Estimating the impact of classification error on the “statistical accuracy” of uniform crime reports. Journal of Quantitative Criminology, 27, 497–519.

Rossmo, D. K., & Routledge (1990). Estimating the size of criminal populations. Journal of Quantitative Criminology, 6, 293–314.

Sankaran (1970). The discrete Poisson-Lindley distribution. Biometrics, 26, 145–149.

Van der Heijden, Bustami, Cruyff, Engbersen, & Van Houwelingen, H. C. (2003b). Point and interval estimation of the population size using truncated Poisson regression model. Statistical Modelling, 3, 305–322.

Van der Heijden, Cruyff, & Böhning (2014). Capture recapture to estimate criminal populations. In Bruinsma, G. J. N. & Weisburd, D. L., Encyclopedia of criminology and criminal justice (pp. 267–276). Springer.

Van der Heijden, Cruyff, & Van Houwelingen, H. C. (2003a). Estimating the size of a criminal population from police records using the truncated Poisson regression model. Statistica Neerlandica, 57, 289–304.

Wagh, Y. S., & Kamalja, K. K. (2018). Zero-inflated models and estimation in zero-inflated Poisson distribution. Communications in Statistics–Simulation and Computation, 47, 2248–2265.

White, G. C., Anderson, D. R., Burnham, K. P., & Otis, D. L. (1982). Capture-recapture and removal methods for sampling closed populations (pp. 18–19). Los Alamos National Lab.

Zelterman (1988). Robust estimation in truncated discrete distributions with application to capture recapture experiments. Journal of Statistical Planning and Inference, 18, 225–237.
