Abstract
The estimation of finite population characteristics, particularly the mean, is central to survey sampling because it directly affects the validity of conclusions drawn from sample data. As the volume and complexity of available data increase, so does the demand for more robust and precise estimators. Traditional estimators frequently fail to take full advantage of auxiliary information, which can be critical for improving estimation accuracy. This paper presents a novel log-ratio estimator that incorporates auxiliary information in a non-linear manner using logarithmic transformations of the study and auxiliary variables. The performance of the log-ratio estimator is assessed against nine classical and modern estimators using two metrics: mean squared error (MSE) and percentage relative efficiency (PRE). Our empirical analysis covers two applications, estimating the area under wheat from the cultivated area and estimating peppermint oil production from the field area, together with three real-world datasets: breast cancer deaths, cancer deaths by gender (male and female), and brain tumor survival rates. To supplement these applications, a simulation study with a population size of 100,000 and a sample size of 1,000 was run up to 100,000 times to assess the resilience and stability of the estimators. The empirical and simulation results consistently show that the log-ratio estimator produces lower MSE and higher PRE values across all datasets, indicating greater accuracy and efficiency than traditional estimators. These findings suggest that the log-ratio estimator gives more trustworthy and efficient estimates, particularly for complicated data. The estimator is therefore a promising tool for future survey sampling, supporting estimation methodologies that can cope with the challenges posed by modern, large-scale datasets.
Introduction
Sampling theory, a basic component of statistical inference, provides the foundation for drawing inferences about a whole population from a smaller sample. Sampling is a vital technique in research across many domains because, in practice, gathering data from an entire population is either impractical or prohibitively expensive. Sampling theory covers the design of estimators, the selection of samples, and the statistical inferences that can be made from them. The objective is to draw trustworthy and accurate conclusions about population parameters from data gathered on a subset of the population. Effective estimators are essential to this procedure: because they determine the quality and dependability of inferences made from sample data, and therefore directly affect decision-making, their development is essential. The ratio estimator is one of the most effective estimators in survey sampling, especially when the correlation between the study and auxiliary variables is positive and high.
Cochran (1940) first developed a ratio estimator for survey sampling that estimates the population mean of a study variable using information from a related variable. Sisodia and Dwivedi (1981) suggested a modified ratio estimator for cases where the population coefficient of variation is known. Prasad (1989) provided a class of ratio-type estimators for the population mean in finite population sample surveys under simple random sampling without replacement, applicable when data on an auxiliary variable positively correlated with the study variable are available. Upadhyaya and Singh (1999) constructed a modified ratio-type estimator that takes into account both the kurtosis and the coefficient of variation. Kadilar and Cingi (2003), who used ratio estimators to examine apple production and the number of apple trees in Turkish villages, found the traditional ratio estimator to be the most effective of the estimators covered in their study. Singh (2003) discussed several ideas and methods for creating an effective estimator. Triveni and Danish (2023) created a ratio-type estimator of the population mean of a study variable using information from a correlated variable. To estimate the population mean of the study variate, Singh and Tailor (2005) employed a ratio estimator that uses the population mean and coefficient of variation of an auxiliary character. Singh and Tailor (2003) presented a ratio estimator based on a known correlation coefficient. Kadilar and Cingi (2004) proposed an efficient ratio estimator in simple random sampling. Singh et al. (2004) provided an improved estimator of the population mean using a power transformation. For the population mean, Yan and Tian (2010), Singh and Rani (2006), and Singh and Kumar (2011) proposed ratio-type estimators that depend on the known skewness of the auxiliary variable.
Subramani and Kumarapandiyan (2012a, 2012b) proposed two modified ratio estimators of the population mean of the study variable that linearly combine the known population median and the coefficient of variation of the auxiliary variable. Koyuncu and Kadilar (2009) developed a family of estimators that takes advantage of auxiliary data in the SRS design. Khan et al. (2015) used ratio-type estimators to estimate the finite population mean under maximum and minimum values, relying on knowledge of the auxiliary variable. To estimate the population mean of the study variable, Triveni and Danish (2024a, 2024b) developed an improved separate ratio estimator that uses a highly correlated variable. Audu et al. (2021) presented three difference-cum-ratio estimators that use the known population mean and variance of an auxiliary variable to estimate the finite population coefficient of variation of a study variable. Ounrittichai et al. (2024) proposed three ratio estimators for estimating population ratios under simple random sampling without replacement (SRSWOR). Kadilar and Cingi (2007) developed a modified ratio estimator based on the correlation coefficient, building on the estimators of Upadhyaya and Singh (1999). Zaman et al. (2022) suggested a ratio estimator of the population mean in simple random sampling that uses robust methods to get the most out of the auxiliary variable. According to Yadav et al. (2024), using known information on an auxiliary variable can improve the estimation of population means. By including bivariate auxiliary data, Triveni and Danish (2024a, 2024b) proposed a novel combined ratio-type estimator of the population mean of the study variable. Lakshmi et al. (2025) proposed a log and power transformation for ratio estimation and applied it to a solar radiation dataset. Singh et al. (2025) provided three exponential estimators for handling fuzzy data.
Javed et al. (2025) proposed a different type of estimator using the exponential function approach to forecast the population mean.
Although many estimators exist for survey sampling, a large number of them still rely on simplifying assumptions that do not fully account for the complexity of modern data. Traditional linear models frequently fail to capture non-linear relationships between the study and auxiliary variables, which limits their applicability in real-world settings. Furthermore, most existing research focuses on a single estimator or does not compare multiple approaches across several data sources. This study seeks to address these shortcomings by introducing a log-ratio estimator in the framework of simple random sampling (SRS) that incorporates auxiliary information more flexibly and handles complex data structures more successfully. By comparing the suggested estimator against nine alternatives, we gain a more comprehensive understanding of its performance across various datasets and scenarios and confirm its efficiency and accuracy in estimating population parameters. The findings shed light on the limitations of standard estimators and the advantages of more robust, non-linear techniques.
The Terms and Notations
Let N denote the population size and n the sample size. In this study, Y is the study variable and X is the auxiliary variable. The population means of X and Y are defined as
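As a minimal sketch of the quantities this notation refers to, the function below computes the population means together with the coefficients of variation and the correlation coefficient that first-order MSE formulas for ratio-type estimators usually require. The function name `population_parameters`, the dictionary keys, and the use of the divisor N - 1 for variances and covariances are assumptions made for illustration, not definitions taken from this paper.

```python
from math import sqrt
from statistics import mean


def population_parameters(y, x):
    """Summarise a finite population of paired values (y_i, x_i).

    Variances and the covariance use the divisor N - 1 (an assumed
    convention, common in the survey sampling literature).
    """
    N = len(y)
    Y_bar, X_bar = mean(y), mean(x)
    s_y2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    s_x2 = sum((v - X_bar) ** 2 for v in x) / (N - 1)
    s_xy = sum((a - Y_bar) * (b - X_bar) for a, b in zip(y, x)) / (N - 1)
    return {
        "N": N,
        "Y_bar": Y_bar,                       # population mean of Y
        "X_bar": X_bar,                       # population mean of X
        "C_y": sqrt(s_y2) / Y_bar,            # coefficient of variation of Y
        "C_x": sqrt(s_x2) / X_bar,            # coefficient of variation of X
        "rho": s_xy / sqrt(s_y2 * s_x2),      # correlation between Y and X
    }


if __name__ == "__main__":
    # Toy population, made up purely for illustration.
    print(population_parameters([2, 4, 6, 8], [1, 2, 3, 4]))
```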
Literature Review
The classical ratio estimator of Cochran (1940) is defined as
The MSE of (1) is
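For orientation, the sketch below uses the standard textbook form of the classical ratio estimator, sample mean of Y times the ratio of the population mean of X to the sample mean of X, and its first-order MSE under simple random sampling without replacement, ((1 - f)/n) * Ybar^2 * (Cy^2 + Cx^2 - 2*rho*Cx*Cy) with f = n/N. These forms are assumed here to correspond to equations (1) and (2); the function and argument names are illustrative.

```python
from statistics import mean


def ratio_estimate(y_sample, x_sample, x_pop_mean):
    """Classical ratio estimate of the population mean of Y.

    Assumed textbook form: y_bar * X_bar / x_bar.
    """
    return mean(y_sample) * x_pop_mean / mean(x_sample)


def ratio_mse_first_order(N, n, y_pop_mean, cy, cx, rho):
    """First-order MSE of the classical ratio estimator under SRSWOR.

    cy, cx are the population coefficients of variation of Y and X,
    rho is their correlation, and f = n / N is the sampling fraction.
    """
    f = n / N
    return (1 - f) / n * y_pop_mean ** 2 * (cy ** 2 + cx ** 2 - 2 * rho * cx * cy)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(ratio_estimate([2, 4], [1, 2], x_pop_mean=3))
    print(ratio_mse_first_order(N=100, n=10, y_pop_mean=10, cy=0.5, cx=0.5, rho=0.8))
```

The MSE expression shows why the ratio estimator benefits from a high positive correlation: the term -2*rho*Cx*Cy grows in magnitude with rho and offsets the added Cx^2 term.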
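When the population coefficient of variation of the auxiliary variable is known, Sisodia and Dwivedi (1981) suggested a modified ratio estimator as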
The MSE of (4) is
Also, the bias of (4) is
A modified ratio estimator was proposed by Singh and Tailor (2003) as
The MSE of (7) is
Also, the bias of (7) is
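Many modified ratio estimators of this kind share one algebraic template, y_bar * (alpha*X_bar + beta) / (alpha*x_bar + beta), where alpha and beta are known constants of the auxiliary variable. The sketch below codes that template with its standard first-order bias and MSE, written in terms of theta = alpha*X_bar / (alpha*X_bar + beta). It is an illustration under stated assumptions: taking alpha = 1 with beta equal to the coefficient of variation of X gives the usual form of the Sisodia-Dwivedi estimator, and alpha = 1 with beta equal to the correlation coefficient gives the usual form of the Singh-Tailor estimator, but the mapping to the numbered equations of this paper is assumed, and all names are hypothetical.

```python
def modified_ratio_estimate(y_bar, x_bar, X_bar, alpha=1.0, beta=0.0):
    """Generic modified ratio estimate of the population mean of Y.

    alpha, beta: known constants of the auxiliary variable (assumed
    template); beta = 0 recovers the classical ratio estimator.
    """
    return y_bar * (alpha * X_bar + beta) / (alpha * x_bar + beta)


def first_order_bias_mse(N, n, Y_bar, X_bar, cy, cx, rho, alpha=1.0, beta=0.0):
    """First-order bias and MSE of the template under SRSWOR."""
    lam = (1 - n / N) / n                        # (1 - f) / n
    theta = alpha * X_bar / (alpha * X_bar + beta)
    bias = lam * Y_bar * (theta ** 2 * cx ** 2 - theta * rho * cx * cy)
    mse = lam * Y_bar ** 2 * (cy ** 2 + theta ** 2 * cx ** 2 - 2 * theta * rho * cx * cy)
    return bias, mse


if __name__ == "__main__":
    # Made-up parameter values, for illustration only.
    cx, rho = 0.4, 0.8
    for label, beta in [("classical", 0.0), ("beta = C_x", cx), ("beta = rho", rho)]:
        bias, mse = first_order_bias_mse(200, 20, 50.0, 10.0, 0.3, cx, rho, beta=beta)
        print(label, round(bias, 5), round(mse, 5))
```

Because beta only enters through theta, comparing two members of the family reduces to comparing their theta values against rho*Cy/Cx, the value of theta that minimises the first-order MSE.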
A ratio-type estimator was suggested by Yan and Tian (2010) as
The MSE of (10) is
Also, the bias of (10) is
Yan and Tian (2010) also suggested a ratio-type estimator that uses both the coefficient of variation and the coefficient of skewness of the auxiliary variable:
The MSE of (13) is
Also, the bias of (13) is
Where
The modified ratio estimator proposed by Subramani and Kumarapandiyan (2012a) is given by
The MSE of (16) is
Also, the bias of (16) is
Where
A second modified ratio estimator proposed by Subramani and Kumarapandiyan (2012a) is given by
The MSE of (19) is
Also, the bias of (19) is
Where
The modified ratio estimator proposed by Subramani and Kumarapandiyan (2012b) is given by
The MSE of (22) is
Also, the bias of (22) is
Where
A second modified ratio estimator proposed by Subramani and Kumarapandiyan (2012b) is given by
The MSE of (25) is
Also, the bias of (25) is
Where
Because the logarithmic function handles variability across the spread of the data more efficiently, we develop a ratio estimator based on a linear combination of logarithmic functions, as follows
Expressing the above estimator in terms of the following error terms, we get
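First-order derivations of this kind conventionally use the relative errors e0 = (y_bar - Y_bar)/Y_bar and e1 = (x_bar - X_bar)/X_bar, with E(e0) = E(e1) = 0, E(e0^2) = lambda*Cy^2, E(e1^2) = lambda*Cx^2 and E(e0*e1) = lambda*rho*Cx*Cy, where lambda = (1 - f)/n. Assuming these are the error terms intended here, the sketch below checks the identities numerically by enumerating every simple random sample without replacement from a tiny made-up population. The function names and the toy data are illustrative only.

```python
from itertools import combinations
from statistics import mean, variance


def enumerated_moments(y_pop, x_pop, n):
    """Exact moments of e0 and e1 over all SRSWOR samples of size n."""
    Y_bar, X_bar = mean(y_pop), mean(x_pop)
    samples = list(combinations(range(len(y_pop)), n))
    e0 = [mean(y_pop[i] for i in s) / Y_bar - 1 for s in samples]
    e1 = [mean(x_pop[i] for i in s) / X_bar - 1 for s in samples]
    return {
        "E(e0)": mean(e0),
        "E(e1)": mean(e1),
        "E(e0^2)": mean(a * a for a in e0),
        "E(e1^2)": mean(b * b for b in e1),
        "E(e0e1)": mean(a * b for a, b in zip(e0, e1)),
    }


def theoretical_moments(y_pop, x_pop, n):
    """Closed forms lambda*Cy^2, lambda*Cx^2 and lambda*rho*Cx*Cy.

    Variances and covariance use the divisor N - 1, so that
    rho*Cx*Cy equals S_xy / (X_bar * Y_bar).
    """
    N = len(y_pop)
    lam = (1 - n / N) / n
    Y_bar, X_bar = mean(y_pop), mean(x_pop)
    s_xy = sum((y - Y_bar) * (x - X_bar) for y, x in zip(y_pop, x_pop)) / (N - 1)
    return {
        "E(e0^2)": lam * variance(y_pop) / Y_bar ** 2,
        "E(e1^2)": lam * variance(x_pop) / X_bar ** 2,
        "E(e0e1)": lam * s_xy / (X_bar * Y_bar),
    }


if __name__ == "__main__":
    y = [3, 5, 8, 10, 12, 16]   # toy study variable
    x = [2, 4, 5, 7, 9, 11]     # toy auxiliary variable
    print(enumerated_moments(y, x, n=3))
    print(theoretical_moments(y, x, n=3))
```

The enumerated and closed-form second moments agree up to floating-point rounding, which is what licenses replacing expectations of e0 and e1 by population parameters in bias and MSE expressions.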
Solving the above equation, we get
The bias of the proposed estimator is given by
The MSE of the proposed estimator is given by
Now, differentiating equation (31) for
In survey sampling, the MSE is an important metric for determining estimator efficiency. We use the MSE expressions of all estimators in the literature review, together with that of the proposed estimator, to compare the efficiency of the competing estimators with that of the proposed one. This section describes the mathematical conditions under which the suggested estimator is more efficient than the other estimators used for comparison.
On comparing (2) with (31), we get
Comparing the MSE of the proposed estimator with other estimators, we get the following equations (33) and (34)
When i = 2, 3, 4, 5, 6 and 7, we have the following efficiency condition
When i = 8 and 9, we have the following efficiency condition
When conditions (32), (33), and (34) are satisfied, the proposed estimator is more efficient than the other estimators used in the comparative study.
Data Set-I
The summary statistics in Table 1 are based on the estimation of the wheat area in 1974 (study variable) from the wheat cultivated area in 1973 (auxiliary variable), taken from Singh and Chaudhary (1986).
Population Parameters for Data Set-I.
Data Set-II
The summary statistics in Table 2 are based on the data set of Yadav et al. (2019), in which the output of peppermint oil is the study variable and the field area in Bigha is the auxiliary variable.
Population Parameters for Data Set-II.
Table 2 lists all parameters required to calculate the MSE values of respective estimators. The MSE values can be calculated based on their corresponding MSE equations, and the PRE values, which are provided in Table 3, are computed using the formula below.
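A common definition of percentage relative efficiency in this literature, assumed here, is PRE = 100 * MSE(reference estimator) / MSE(estimator), so that values above 100 indicate an improvement over the reference. The sketch below applies that definition to a dictionary of MSE values. The choice of reference estimator, the function name, and the numbers in the usage example are hypothetical and are not taken from Tables 1 to 3.

```python
def percent_relative_efficiency(mse_by_estimator, reference):
    """Return PRE (in percent) of each estimator relative to `reference`.

    Assumed definition: PRE = 100 * MSE(reference) / MSE(estimator).
    """
    ref_mse = mse_by_estimator[reference]
    return {name: 100.0 * ref_mse / mse for name, mse in mse_by_estimator.items()}


if __name__ == "__main__":
    # Placeholder MSE values, for illustration only.
    mse_values = {"classical_ratio": 4.0, "modified_ratio": 3.2, "proposed": 2.0}
    for name, pre in percent_relative_efficiency(mse_values, "classical_ratio").items():
        print(f"{name}: {pre:.1f}")
```

Under this definition a lower MSE and a higher PRE are two views of the same ranking, so the estimator with the smallest MSE always has the largest PRE.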
Values of PRE Based on Datasets I and II.
We used different data sets taken from Kaggle to conduct a real data application. The following is a discussion of the three distinct data sets that describe various forms of cancer.
Data Set-III
Using breast cancer fatalities as the auxiliary variable and breast cancer cases as the study variable, we examined the data presented by Gupta (2025); the required parameters are provided in Table 4.
Parameters for Data Set-III.
Data Set-IV
This data set is taken from the American Cancer Society (2024). The study variable is the estimated number of new invasive cancer cases in the US in 2024 (both sexes), and the auxiliary variable is the estimated number of cancer deaths in the US in 2024. The summary statistics are provided in Table 5.
Parameters for Data Set-IV.
Data Set-V
Using the data provided by Miadul (2025), we calculated the required statistics, shown in Table 6, where the survival rate from brain tumours serves as the study variable and the tumour development rate as the auxiliary variable. The calculated PRE values are provided in Table 7.
Parameters for Data Set-V.
PRE Values of Estimators for Data Sets III, IV and V.
Along with the suggested estimator, we run a simulation study for every estimator used in the study. Using a population of size 100,000 and a sample of size 1,000, we repeat the sampling 100,000 times. The average PRE values over these 100,000 simulations are given in Table 8.
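The general shape of such a simulation can be sketched as follows, under several stated assumptions: the population is drawn from a bivariate normal model with a chosen correlation, samples are taken by simple random sampling without replacement, and the PRE is computed from empirical MSEs. Only the sample mean and the classical ratio estimator are compared, as stand-ins for the estimators in the study. The default sizes are deliberately small so that the sketch runs in seconds, and all names, distributional choices, and parameter values are illustrative rather than those used for Table 8.

```python
import math
import random
from statistics import mean


def simulate_pre(N=2000, n=50, reps=300, rho=0.9, seed=1):
    """Empirical PRE of the classical ratio estimator relative to the
    sample mean, from repeated SRSWOR draws on one synthetic population.
    """
    rng = random.Random(seed)

    # Synthetic population: X ~ N(50, 10^2), Y ~ N(100, 20^2), corr(X, Y) = rho.
    xs, ys = [], []
    for _ in range(N):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(50 + 10 * z1)
        ys.append(100 + 20 * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    X_bar, Y_bar = mean(xs), mean(ys)

    sq_err_mean = sq_err_ratio = 0.0
    for _ in range(reps):
        idx = rng.sample(range(N), n)            # one SRSWOR sample
        x_bar = mean(xs[i] for i in idx)
        y_bar = mean(ys[i] for i in idx)
        sq_err_mean += (y_bar - Y_bar) ** 2
        sq_err_ratio += (y_bar * X_bar / x_bar - Y_bar) ** 2

    mse_mean = sq_err_mean / reps
    mse_ratio = sq_err_ratio / reps
    return 100.0 * mse_mean / mse_ratio


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        print(rho, round(simulate_pre(rho=rho), 1))
```

With equal coefficients of variation for X and Y, as in this synthetic population, the ratio estimator is expected to beat the sample mean only when rho exceeds 0.5, which is why reporting PRE across several correlation values is informative.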
PRE Values of Estimators Under Simulation Based on Different Values of the Correlation Coefficient.
The suggested log-ratio estimator outperforms nine other estimators on numerous datasets in terms of both mean squared error (MSE) and percentage relative efficiency (PRE). In the empirical study (Table 3), the suggested estimator performed well in estimating wheat area from cultivated area and peppermint oil production from field area. It had the lowest MSE values and the highest PRE values, significantly outperforming classical ratio-type estimators. This shows that the log-ratio estimator is both accurate and efficient, leveraging auxiliary information more effectively than linear techniques. The suggested estimator also performed well in real-world applications involving complicated datasets, such as breast cancer deaths based on cancer cases, projected new cancer cases based on estimated fatalities, and survival rates based on brain tumour development. The estimator attained an extraordinarily high PRE of more than 50% in the tumour survival rate dataset (Table 7), demonstrating its robustness in highly variable and non-linear medical data. Many standard estimators exhibited considerable decreases in efficiency, with some producing extremely low PREs, implying limited applicability in such complicated circumstances. The logarithmic transformation in the log-ratio estimator appears to adapt well to skewed and kurtotic data structures, making it a useful tool for a variety of real-world survey sampling problems. Furthermore, in the simulation exercise, where data were generated and evaluated across 100,000 iterations with a large population size (100,000) and sample size (1,000), the log-ratio estimator not only maintained its efficiency but also outperformed all other estimators, as shown by the PRE values in Table 8, which were examined under different correlation coefficients. This gives strong evidence that it is stable under large-scale sampling and repeated trials.
Overall, the results show that the log-ratio estimator is an advanced, non-linear alternative to existing techniques that can provide highly accurate and efficient estimates in both controlled and real-world contexts. Its consistent dominance in MSE and PRE across empirical, applied, and simulated data indicates that it is a dependable and generalizable method for estimating finite population means, especially for modern, complex datasets where traditional estimators may fall short.
Conclusion
In this paper, we present a log-ratio estimator and obtain its bias and MSE expressions up to the first order of approximation. A detailed study of empirical, real-world, and simulated datasets indicates that the proposed log-ratio estimator outperforms standard and modified ratio estimators. Its ability to incorporate auxiliary information via a logarithmic transformation allows it to handle the complex relationships and distributional anomalies found in modern datasets. The estimator consistently showed lower MSE and higher PRE values, demonstrating its stability, precision, and efficiency. In both controlled empirical settings and real-world cancer and agricultural datasets, it outperformed rival estimators. The simulation analysis supports its advantages under repeated and large-sample settings.
These findings not only demonstrate the efficiency of the log-ratio estimator but also illustrate the growing importance of nonlinear, data-adaptive techniques in survey sampling. As data complexity grows, so does the demand for estimators that can handle a variety of features without sacrificing performance. Because of its design and proven efficiency, the log-ratio estimator stands out as a viable contender for wider use in modern survey methods. Future research might look into its adaptability in stratified or cluster sampling frameworks, as well as its performance across various domains, to cement its position as an important tool in statistical estimation.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
