Sampling frames are frequently incomplete in large-scale surveys. This paper proposes the probability proportional to size with replacement (PPSWR) sampling scheme for estimating the population mean from an incomplete frame. The variance of the proposed estimator is obtained, and its efficiency is compared with that of the Agarwal and Gupta (2008) estimator when the frame is not complete. The results are illustrated with hypothetical data. The problem of determining the optimum sample size and retainment factor is also discussed using a suitable linear cost function.
While sampling from a finite population, the value of some auxiliary character x that is closely related to the main character y of interest is frequently available for all units of the population. The variable x, when suitably chosen, can serve as a measure of the size of a unit. For example, in socioeconomic surveys, population figures from a previous census may be used as the size measure of villages; in agricultural surveys, data on the area under the crop, if available, may be used as the size measure of the farm. In such cases, Mukhopadhyay (2008) suggested that, instead of sampling the units with equal probability with or without replacement, the units could be sampled with probability proportional to the size measure x (PPS), with or without replacement.
Planning any sample survey requires a sampling frame, that is, a list of all sampling units. In some situations the sampling frame is incomplete, and the question of how to select the sample size then frequently arises. Hansen and Hurwitz (1943) addressed the problem of incomplete samples in mailed surveys. Hansen, Hurwitz, and Jabine (1963) suggested a predecessor-successor procedure for obtaining data on the units missing from the frame.
Agarwal and Gupta (2007, 2008) contributed significantly to the theory of estimation with an incomplete frame by providing a detailed review of the problem. The problem was extended to a two-stage sampling design by Gupta and Agarwal (2013), in which the incomplete frame is that of the second-stage units. Singh (1985) gave a mathematical formulation of the predecessor-successor method for estimating the total number of units missing from the frame and the total of the character under study for the target population.
In this paper, an estimator of the population mean is proposed in which units are selected from the available frame by PPSWR, within the sampling procedure developed by Agarwal and Gupta (2008).
Sampling procedure
Let ‘N’ be the units in the target population. Among these are units in the sampling frame and are units not in the frame, such that . To obtain the value of characteristics under study Y, first, a sample of units is selected from available units in the frame through PPSWR. Then a frame is prepared of those units which occur in between the selected units and next to them. Let such units be . From these units are selected by SRSWOR considering .
Let , and Y are population total of the value of characteristics of units in the frame, units not in the frame and units of the target population respectively. Also, , , and are total for characteristics under the study of sample values of , and units sampled from , , and units respectively.
Let, , , , .
Method of estimation
Select a sample of size units out of units which are available in the frame. If units are selected in the sample, we examine the next unit in the population whether it is in frame or not, and continue till we get a unit that is in the frame. These units are denoted by , so it is the number of units between and unit in the frame which are missing such that,
The unbiased estimator of is then given by , where;
When we select , , …, in the sample , , …, will automatically be selected in the sample as , which is a random variable. Therefore, is also a random variable.
Proposed estimator
The proposed estimator is a combination of a weighted Hansen Hurwitz (HH) estimator based on units sampled from the complete frame and units sub-sampled from the incomplete frame. The proposed estimator will be represented as:
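The Hansen-Hurwitz component mentioned above can be illustrated with a minimal sketch. It covers only the frame part (a PPSWR draw followed by the standard Hansen-Hurwitz mean) and does not reproduce the paper's weighting with the sub-sampled non-frame units. The function names and the toy data are assumptions made for illustration, not taken from the paper.

```python
import random


def ppswr_sample(sizes, n, rng):
    """Draw n unit indices with replacement, with probability proportional to size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=n)


def hansen_hurwitz_mean(y, sizes, sample):
    """Hansen-Hurwitz estimate of the mean of y over the units of the sampled frame."""
    total_size = sum(sizes)
    # Each draw contributes y_i / p_i, where p_i = x_i / X is its selection probability.
    total_hat = sum(y[i] * total_size / sizes[i] for i in sample) / len(sample)
    return total_hat / len(sizes)


# Toy frame (hypothetical values, not the paper's data): y is exactly 2 * x,
# so every draw gives the same y_i / p_i and the estimate equals the true mean.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
sample = ppswr_sample(x, n=3, rng=random.Random(0))
hh_mean = hansen_hurwitz_mean(y, x, sample)
print(hh_mean)
```

When y is exactly proportional to x, as in the toy data, the estimate does not depend on which units are drawn; this is the usual motivation for choosing a size measure closely related to y.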
where, and . The expectation of the proposed estimator is:
This is an unbiased estimator of the population mean. The variance of the proposed estimator is:
Substituting the value of as in Eq. (3) and differentiating it w.r.t. , we obtain:
The optimum values of and are:
Substituting the values of and in Eq. (3), we obtain the optimum variance as:
Cost function
According to Sukhatme et al. (1984), the purpose of a sample survey is to obtain the estimator with minimum variance for a fixed cost, or the estimator with a predetermined variance for minimum cost. The total cost of the survey depends upon many factors, e.g., the overhead cost, travelling cost, and enumeration cost. In PPS sampling, the cost function C in its simplest form is given by the linear model:
where, is the cost of establishment, is the cost of evaluating unit from frame, is the cost of evaluating unit not in the frame. Also, is a retainment factor that is constant, and .
To determine the optimum values of and , we consider the cost function as:
Minimizing the cost function w.r.t. and and using Lagrange’s constant multiplier method, we obtain the optimum values of and as:
Substituting these optimum values of and in Eq. (3), we obtain the minimum variance of for given fixed cost as:
The estimator of variance
Substituting consistent estimators of all the terms in Eq. (3), we get a consistent estimator of the variance of the proposed estimator as
The above results can be summarized in the following theorems:
Theorem 1.
If a target population of size N consists of units that are present in sampling frame and number of non-included units between each of the units included in the frame , , , …, . That is, () are the number of non-included units till the next unit is included in the frame. Then the average number of non-included units in between selected units in the sample is given by:
, which is an unbiased estimator of the average number of non-included units between any two included units in the frame given by with variance and
Theorem 2.
Under the given layout, the weighted average estimator of Y, using the information on X, is given as:
where is the Hansen Hurwitz estimator based on a sample of size drawn from units that are in the sampling frame and is an average mean based on a sample of size drawn from units that are occurring in between the selected units and next to it.
Numerical illustration
The numerical illustration is taken from Gupta et al. (2021). The data are hypothetical, as no real example of an incomplete frame is available in the literature.
Suppose there are 100 households in a city, of which 70 are listed in the available sampling frame and the remaining 30 are not listed in the city corporation record. The aim is to estimate the mean of Y using the auxiliary variable X.
The situation is summarized as follows:
Target number of households (N): 100
Total family size (X): 423
Total expenditure of target population (Y): 840.2 thousand rupees
Number of households included in the frame: 70
Number of households not included in the frame: 30
Data on the 100 households, both those included in the frame and those not included, are given in Table 1.
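Table 1. Data with included and non-included units in the sampling frame

| S.N. | 1 | 2† | 3† | 4 | 5 | 6 | 7† | 8 | 9 | 10 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 2 | 1 | 5 | 3 | 2 | 1 | 1 | 9 | 2 | 29 |
| Y | 7 | 4 | 2 | 10 | 6 | 4.5 | 1.5 | 1.2 | 18 | 4.2 | 58.4 |

| S.N. | 11 | 12 | 13† | 14 | 15† | 16 | 17† | 18† | 19 | 20 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 8 | 1 | 7 | 1 | 4 | 1 | 9 | 1 | 3 | 38 |
| Y | 6.2 | 6 | 2 | 14 | 2 | 8 | 2.2 | 15 | 2.4 | 6 | 63.8 |

| S.N. | 21 | 22 | 23 | 24† | 25 | 26 | 27 | 28† | 29 | 30 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 6 | 7 | 1 | 4 | 3 | 7 | 3 | 7 | 5 | 7 | 50 |
| Y | 12 | 14 | 2 | 8.5 | 5.5 | 15 | 5.8 | 13 | 10 | 14.2 | 99.5 |

| S.N. | 31 | 32 | 33† | 34† | 35† | 36 | 37 | 38 | 39 | 40 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 1 | 2 | 5 | 6 | 2 | 4 | 7 | 4 | 2 | 4 | 37 |
| Y | 2 | 4 | 10 | 12 | 4.4 | 8.2 | 15 | 7.8 | 3.6 | 8 | 74.1 |

| S.N. | 41 | 42 | 43† | 44 | 45 | 46 | 47 | 48 | 49 | 50 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 2 | 6 | 4 | 6 | 6 | 2 | 5 | 1 | 1 | 36 |
| Y | 6 | 4 | 12 | 8 | 12 | 13 | 4.3 | 10 | 2 | 2.5 | 72.5 |

| S.N. | 51 | 52 | 53† | 54† | 55 | 56† | 57† | 58 | 59 | 60† | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 7 | 8 | 5 | 5 | 7 | 1 | 2 | 4 | 7 | 49 |
| Y | 5.7 | 14 | 16 | 10 | 9.6 | 15 | 2.2 | 3.8 | 7.6 | 15 | 98.4 |

| S.N. | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 3 | 4 | 7 | 7 | 3 | 5 | 4 | 2 | 3 | 7 | 45 |
| Y | 5.8 | 8.5 | 13 | 14 | 7 | 10 | 8 | 4 | 6 | 14 | 90.3 |

| S.N. | 71 | 72 | 73† | 74† | 75† | 76 | 77 | 78† | 79 | 80† | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 9 | 5 | 6 | 7 | 5 | 2 | 8 | 5 | 6 | 6 | 59 |
| Y | 20 | 11 | 13 | 13 | 9 | 5 | 17 | 11 | 13 | 12 | 124 |

| S.N. | 81† | 82† | 83 | 84 | 85† | 86† | 87† | 88 | 89 | 90 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 1 | 2 | 6 | 4 | 8 | 2 | 6 | 3 | 3 | 2 | 37 |
| Y | 1.5 | 2.6 | 13 | 7.5 | 15 | 4.6 | 12 | 7 | 6 | 4 | 72.7 |

| S.N. | 91 | 92 | 93† | 94 | 95 | 96 | 97 | 98† | 99 | 100 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 5 | 3 | 8 | 6 | 5 | 5 | 6 | 2 | 1 | 2 | 43 |
| Y | 9.5 | 7 | 16 | 12 | 10 | 11 | 11 | 4 | 2 | 4 | 86.5 |

†: Units which are not included in the sampling frame.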
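As a quick consistency check on Table 1, the block totals can be added up and compared with the summary figures given before the table. The sketch below copies the block totals from the "Total" entries and lists the serial numbers that Table 1 identifies as not included in the frame; the variable names are assumptions made for illustration.

```python
# Block totals copied from the "Total" entries of Table 1 (ten blocks of ten households).
x_block_totals = [29, 38, 50, 37, 36, 49, 45, 59, 37, 43]
y_block_totals = [58.4, 63.8, 99.5, 74.1, 72.5, 98.4, 90.3, 124, 72.7, 86.5]

# Serial numbers of the households that Table 1 identifies as not in the frame.
not_in_frame = [2, 3, 7, 13, 15, 17, 18, 24, 28, 33, 34, 35, 43, 53, 54,
                56, 57, 60, 73, 74, 75, 78, 80, 81, 82, 85, 86, 87, 93, 98]

N = 100
x_total = sum(x_block_totals)
y_total = round(sum(y_block_totals), 1)  # rounded to the precision used in the table
n_not_in_frame = len(not_in_frame)
n_in_frame = N - n_not_in_frame

print(x_total, y_total, n_in_frame, n_not_in_frame)
```

The four printed values agree with the summary: total family size 423, total expenditure 840.2 thousand rupees, 70 households in the frame and 30 outside it.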
Calculations
To estimate the total number of non-included units, the characteristic mean, the variance, etc., a 10% sample of houses is selected by SRSWOR, i.e., seven houses are selected out of the 70 listed houses.
Case 1 – When non-included units behave as included units:
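Table 2. Random sample of included units

| R.N. | 38 | 26 | 34 | 69 | 57 | 47 | 41 |
|---|---|---|---|---|---|---|---|
| Non-included units before the next listed unit | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| X | 3 | 4 | 2 | 1 | 6 | 3 | 2 |
| Y | 5.7 | 7.8 | 4.3 | 2 | 13 | 7 | 3.8 |

R.N.: Random Number.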
The sample size of included units is 7, as given in Table 2.
Estimate of total number of non-included units is:
Estimate of target population with the help of variable X is given by sample values:
persons in target population.
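The Case 1 arithmetic can be retraced with a short sketch. It rests on three assumptions that are not spelled out in the text as it stands: each random number in Table 2 is read as a position among the 70 listed households, the number of non-included units is estimated as the number of listed units times the mean gap observed in the sample, and the number of persons is estimated as the estimated number of households times the sample mean of X. Under these assumptions the sketch reproduces the values 30 and 300 reported in the Conclusion. All variable and function names are hypothetical.

```python
N = 100
# Serial numbers of the households that Table 1 identifies as not in the frame.
not_in_frame = {2, 3, 7, 13, 15, 17, 18, 24, 28, 33, 34, 35, 43, 53, 54,
                56, 57, 60, 73, 74, 75, 78, 80, 81, 82, 85, 86, 87, 93, 98}
frame = [sn for sn in range(1, N + 1) if sn not in not_in_frame]  # 70 listed households


def gap_after(sn):
    """Count non-listed households between sn and the next listed household."""
    count = 0
    nxt = sn + 1
    while nxt <= N and nxt in not_in_frame:
        count += 1
        nxt += 1
    return count


random_numbers = [38, 26, 34, 69, 57, 47, 41]  # R.N. row of Table 2
x_sample = [3, 4, 2, 1, 6, 3, 2]               # X row of Table 2

# Assumption: a random number r points at the r-th listed household.
selected = [frame[r - 1] for r in random_numbers]
gaps = [gap_after(sn) for sn in selected]

n2_hat = len(frame) * sum(gaps) / len(gaps)          # estimated non-included households
n_hat = len(frame) + n2_hat                          # estimated target population size
x_total_hat = n_hat * sum(x_sample) / len(x_sample)  # estimated number of persons

print(gaps, n2_hat, n_hat, x_total_hat)
```

Only the household at serial number 79 is followed by non-listed households (80, 81 and 82), so the sample mean gap is 3/7 and the estimate of the non-included units is 70 × 3/7 = 30.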
Case 2 – When non-included units do not behave as included units:
The value for our estimate can be calculated as:
We take a sample of included units of size 6. There are only four non-included units between the selected units and the next listed units, as shown in Table 3.
Table 3. Random sample of non-included units

| R.N. | 1 | 13 | 20 | 29 | Mean |
|---|---|---|---|---|---|
| X | 2 | 6 | 7 | 8 | 5.75 |
| Y | 4 | 12 | 13 | 16 | 11.25 |
Thus, 4. Now, from one digit random number table, the selected units are at serial numbers 3, 4. Hence, 2.
thousand
Average expenditure thousand.
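The sub-sampling step of Case 2 can be sketched as follows. The sketch assumes that the two value rows of Table 3 are X and Y, and that "serial numbers 3, 4" refers to the third and fourth units listed in Table 3; both are interpretations of the text. It only computes the means of the four units and of the two retained units, and is not meant to reproduce the final weighted estimate.

```python
from statistics import mean

# X and Y values of the four non-included units in Table 3 (assumed row order: X, then Y).
x_gap_units = [2, 6, 7, 8]
y_gap_units = [4, 12, 13, 16]

# Assumption: positions 3 and 4 (1-based) in Table 3 are the units retained by SRSWOR.
chosen_positions = [3, 4]
x_sub = [x_gap_units[p - 1] for p in chosen_positions]
y_sub = [y_gap_units[p - 1] for p in chosen_positions]

print(mean(x_gap_units), mean(y_gap_units))  # means over all four units
print(mean(x_sub), mean(y_sub))              # means over the two retained units
```

The first line of output matches the "Mean" column of Table 3 (5.75 and 11.25); the second gives the sub-sample means under the stated reading of the selection.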
Estimate of variance
Table 4 shows the numerical results for the PPS and AG estimators. Of the two, the proposed PPS estimator is more efficient than the Agarwal and Gupta (AG) estimator.
Table 4. Variance and S.E. of estimates (columns: PPS estimator, AG estimator, relative efficiency)
Conclusion
The paper contributes to the development of unbiased estimators by utilizing the units not included in the sampling frame. The findings show that using PPS sampling with an incomplete sampling frame improves the estimation of the population total, mean and variance.
The theoretical concepts developed are supported by the hypothetical example. The estimated total number of households not included in the frame is 30, with S.D. 0.3012. The estimated total number of persons in the target population is 300. The estimated average expenditure is Rs. 25.83 thousand, with S.E. 0.6101.
The proposed estimator is more efficient than the one proposed by Agarwal and Gupta (2008).
References
1. Agarwal, B., & Gupta, P. C. (2007). Synergism in incomplete sampling design. Management and Change, 1, 183-190.
2. Agarwal, B., & Gupta, P. C. (2008). Estimation from incomplete sampling frames in case of simple random sampling. Model Assisted Statistics and Applications, 3(2), 113-117.
3. Agarwal, B., & Gupta, P. C. (2012). Estimation from incomplete sampling frame for two-stage sampling design. Indian Journal of Statistical Application, 1(2), 52-58.
4. Gupta, P. C., Joshi, V., Nagar, P., & Singh, A. K. (2021). Use of ratio estimation in incomplete sampling frames. International Journal of Agricultural and Statistical Sciences, 21(1).
5. Hansen, M. H., & Hurwitz, W. N. (1943). On the theory of sampling from finite populations. Annals of Mathematical Statistics, 14, 333-362.
6. Hansen, M. H., Hurwitz, W. N., & Jabine, T. B. (1963). The use of imperfect lists for probability sampling at the U.S. Bureau of Census. Bulletin of the International Statistical Institute, 40(1), 497-517.
7. Mukhopadhyay, P. (2008). Theory and methods of survey sampling. PHI Learning Pvt. Ltd.
8. Singh, R. (1985). Estimation from incomplete data in longitudinal surveys. Journal of Statistical Planning and Inference, 11(2), 163-170.
9. Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S., & Asok, C. (1984). Sampling theory of surveys with applications (3rd ed.). Iowa State University Press.