Review of how the generalized regression estimators contribute to estimating the financial and economic data with missing observations under unequal probability sampling

Purpose – Knowing financial and economic information in advance helps every country plan and develop policies, especially developing countries such as Thailand and other Asian countries. Unfortunately, missing data or nonresponse is a common problem in many areas of study, including finance and economics. Handling missing data properly before further analysis can improve results and make policy planning more effective. This review of generalized regression estimators for the population total can be applied to financial, economic and other data when missing data are present. Design/methodology/approach – The generalized regression estimators for the population total, together with their variance estimators under unequal probability sampling without replacement with missing data, are explored under the reverse framework. Applications to financial and economic data in Thailand are also reviewed. Findings – The reviewed literature shows that the proposed estimators perform best, giving smaller variances in all scenarios. Originality/value – The generalized regression estimators can assist in estimating financial and economic data that contain missing values under different missingness mechanisms, and can be used in other applications to obtain more efficient estimators.


Introduction
Generalized regression (GREG) estimation is designed for design-based estimation of population totals in survey sampling. It is often applied to financial data, which are seldom complete, so missing data are an inherent issue requiring a solution. Economic advancement is imperative in every country to maintain infrastructure and citizens' quality of life. This calls for statistical analysis of data, where the problems of missing data and of choosing suitable estimators arise. Measures covering many aspects have been put in place to ensure economic development in Thailand, as seen in the sustainable development plans of "Thailand 4.0", because Thailand is highly dependent on revenue from tourism. For this reason, the economy is liable to fluctuations, especially recently due to the coronavirus pandemic. After the loss of revenue from foreign tourists, the economy became more focused on citizens' assets, income and cash flow within the country. Many policies have been enforced to support individuals' financial stability and their capability to manage their assets during a pandemic. Analysis of the population's financial issues is vital for properly addressing the crisis and for initiating solutions and support for citizens in need throughout the pandemic. Data on the population's expenses are required to gain insight into the financial obstacles being faced, and to analyze and then address these concerns suitably.
Furthermore, the government has introduced many measures to stimulate tourism within the country, such as the "Thai Travel Together" campaign, which supports cash flow within the country and mitigates the hardships inflicted on the economy by the COVID-19 crisis. Moreover, additional factors affect the economy, including insufficient investment, which afflicts the economy on a large scale. Sustainable development plans have been enforced to target ten industries and to address the production efficiency and competitiveness problems afflicting Thailand's industrial economic structure.
However, missing data or nonresponse often occurs in real-world data and can obscure the facts used for decision making in business and economics, so opportunities are lost due to incomplete data. Missing data arise, for instance, from nonresponse or from participants choosing not to answer specific questions. When the missingness depends on neither the missing values nor the observed values, it is called missing completely at random (MCAR) or uniform nonresponse; when the missingness is related to the observed values but not to the missing values, it is called missing at random (MAR). Difficulties in acquiring accurate data can result from a lack of records or from nonresponse in surveys. Resolving nonresponse is therefore imperative for appropriate financial planning, and statistical methods that tackle nonresponse are vital to solving this problem. The nonresponse issue was first addressed by Hansen and Hurwitz (1946) in the context of mail surveys. They introduced an unbiased estimator for the population mean that used data from a sample survey on both respondents and non-respondents under unequal probability sampling without replacement (UPWOR). Horvitz and Thompson (1952) suggested using weights to create an unbiased population total estimator under unequal probability sampling, both with and without replacement. The first-order inclusion probability is used as the weight to correct the bias. Unfortunately, calculating the variance of the Horvitz and Thompson estimator requires joint inclusion probabilities, which are hard to find in some complex survey designs. Later, Hajek (1964) proposed a new estimator to address this issue in the variance estimator; it produces a smaller variance than the Horvitz and Thompson (1952) estimator, but only when there is no relationship between the study variable and the inclusion probabilities. This estimator of the population total is an approximately unbiased ratio estimator, that is, the ratio of the sample means of two random variables.
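As a minimal sketch of the two estimators just described, the Python snippet below computes the Horvitz and Thompson (1952) total and the Hajek (1964) ratio-type total from a sample. The function names and the four-unit sample are hypothetical and serve only as an illustration; the first-order inclusion probabilities are assumed to be known.

```python
def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimate: sum of y_i / pi_i over the sampled units."""
    return sum(yi / p for yi, p in zip(y, pi))


def hajek_total(y, pi, N):
    """Hajek ratio-type estimate: N times the weighted sample mean of y."""
    estimated_size = sum(1.0 / p for p in pi)  # sample-based estimate of N
    return N * horvitz_thompson_total(y, pi) / estimated_size


# Hypothetical sample of n = 4 units from a population of N = 10 units.
y_sample = [12.0, 30.0, 18.0, 45.0]
pi_sample = [0.2, 0.5, 0.3, 0.6]
N = 10

ht_estimate = horvitz_thompson_total(y_sample, pi_sample)
hajek_estimate = hajek_total(y_sample, pi_sample, N)
print(ht_estimate, hajek_estimate)
```

The Hajek form divides by the estimated population size (the sum of the weights 1/pi_i) instead of relying on the weights summing to N, which is why it behaves like a ratio estimator.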
The GREG estimator is a special type of calibration estimator and improves on this method of estimation by using auxiliary information. It has the form of the Horvitz and Thompson (1952) estimator combined with a weighting approach, which can help reduce nonresponse bias. Bethlehem and Keller (1987) introduced a new weighting method based on linear models that can be used in person-based estimation. Many works based on GREG exploit the relationship between the study and auxiliary variables to improve the efficiency of population total or population mean estimators and of their variance estimators (see, e.g. Montanari, 1987; Särndal et al., 1992; Estevao and Särndal, 2003; Särndal and Lundström, 2005; Särndal, 2007). Under nonresponse, the two-phase framework treats the selected sample as the first phase and nonresponse as the second phase. It is a popular technique for studying the variance of GREG estimators (see, e.g. Rao, 1990; Särndal, 1992; Deville and Särndal, 1994; Särndal and Lundström, 2005). Fay (1991) proposed an alternative to the two-phase approach, the reverse framework. The name comes from the order being reversed: nonresponse is considered in the first phase and sampling in the second phase (see, e.g. Shao and Steel, 1999; Haziza and Rao, 2006; Haziza, 2010). Under this reverse framework, population total estimators and GREG estimators, along with their variance estimators, were investigated under the MCAR and MAR nonresponse mechanisms and under different assumptions on the response probabilities and the sampling fractions (Lawson, 2017; Lawson and Ponkaew, 2019; Lawson and Siripanich, 2022; Ponkaew and Lawson, 2023).
In this paper, the GREG estimators under the reverse framework are reviewed. The structure of this paper is as follows. The literature review is given in section 2. The basic setup and the generalized regression estimators with missing data are reviewed in sections 3 and 4, respectively. Examples of applications to financial and economic data in Bangkok, Thailand are presented in section 5. Lastly, some conclusions and discussion are presented in section 6.

Literature review
First, we consider how the generalized regression estimators have been developed and how they can be useful for estimating financial, economic and other data. The generalized regression estimator can estimate the population mean or total. It has the form of the Horvitz and Thompson (1952) estimator, a very well-known population total estimator under unequal probability sampling both with and without replacement. Nevertheless, the Horvitz and Thompson variance estimator requires known joint inclusion probabilities, also called second-order inclusion probabilities, which are the probabilities that two different population units are both selected in the sample. These values are difficult to calculate in complex survey designs, and therefore the Horvitz and Thompson variance estimator is not easy to use in practice. Under unequal probability sampling with replacement, the variance estimators have simple forms because these probabilities are not needed, unlike the variance formula under UPWOR, which requires joint inclusion probabilities.
Some researchers have tried to solve this issue in variance estimation (Sen, 1953; Yates and Grundy, 1953), but their estimators still require the joint inclusion probabilities, which are unknown or hard to find. Therefore, some methods have been suggested for estimating the joint inclusion probabilities (Hartley and Rao, 1962; Hajek, 1964, 1981; Brewer, 2002; Brewer and Donadio, 2003).
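To illustrate why the joint inclusion probabilities matter, the following sketch implements the Sen (1953) and Yates and Grundy (1953) form of the variance estimator for the Horvitz and Thompson total under a fixed-size design. The function name and the toy inputs are assumptions made for illustration. The example uses simple random sampling of n = 2 units from N = 5, where the first-order probabilities n/N and joint probabilities n(n - 1)/(N(N - 1)) are known in closed form.

```python
def syg_variance_estimate(y, pi, pi_joint):
    """Sen-Yates-Grundy variance estimate for the Horvitz-Thompson total.

    y        : study values of the n sampled units
    pi       : first-order inclusion probabilities of those units
    pi_joint : n x n table with pi_joint[i][j] = joint inclusion probability
               of sampled units i and j (only entries with i != j are used)
    """
    n = len(y)
    total = 0.0
    # Summing over pairs i < j equals half of the sum over all i != j.
    for i in range(n):
        for j in range(i + 1, n):
            factor = (pi[i] * pi[j] - pi_joint[i][j]) / pi_joint[i][j]
            difference = y[i] / pi[i] - y[j] / pi[j]
            total += factor * difference ** 2
    return total


# Toy check: simple random sampling without replacement, N = 5, n = 2.
N, n = 5, 2
y_sample = [3.0, 7.0]
pi = [n / N] * n
joint = n * (n - 1) / (N * (N - 1))
pi_joint = [[pi[0], joint], [joint, pi[1]]]

variance_estimate = syg_variance_estimate(y_sample, pi, pi_joint)
print(variance_estimate)
```

For this equal-probability case the result agrees with the familiar simple random sampling formula N^2 (1 - n/N) s^2 / n, which gives 60 for these two values. Under a complex unequal probability design the same code applies, but the table of joint probabilities is exactly the input that is hard to obtain.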
The GREG estimators assist in estimating the population mean and total when information is available on an auxiliary variable related to the study variable. The GREG estimator has the structure of the Horvitz and Thompson (1952) estimator with an additional adjustment calculated from the auxiliary variable. Optimal GREG estimators were developed using the known value of the regression coefficient in the population (Montanari, 1987; Berger et al., 2003) under different sampling plans such as stratified two-stage cluster sampling. Because the GREG estimator is nonlinear, the Taylor linearization method is used to transform it into a linear form in order to study its variance and the associated variance estimator. A drawback of the GREG variance estimator in this situation is that it is complex to calculate under UPWOR because it requires known joint inclusion probabilities, as does the Horvitz and Thompson (1952) method. With nonresponse, Särndal and Lundström (2005) introduced an almost unbiased GREG estimator of the population total and a variance estimator under the two-phase framework, which requires the response propensities. Under the reverse framework, some studies have explored GREG estimators with missing data. A GREG estimator based on a population total estimator has been suggested for unit nonresponse in the study variable under an unstratified, one-stage, unequal probability sample with a negligible sampling fraction when the nonresponse mechanism is MCAR. This is quite a restrictive assumption, since a constant response probability tends not to occur in practice, and the estimator is also nonlinear (Lawson and Ponkaew, 2019). However, they proposed a modified automated linearization method to deal with this problem and showed that their estimator is unbiased and does not require the response probability. In 2023, under the same assumptions as the previous work, the ratio method of estimation was applied to create new GREG estimators (Ponkaew and Lawson, 2023). Their estimators are more efficient than those of the previous work, giving smaller relative bias and root mean square error. In an application to the Thai maize industry in 2019, based on data from the Office of the Agricultural Economics, their estimators also gave a smaller variance when estimating the total yield of maize in Thailand, which could help in planning future policies for the agricultural part of Thailand's economy.
Under a more flexible nonresponse mechanism such as MAR, which is more practical in realistic situations, an approximately unbiased GREG estimator and its variance under UPWOR have been suggested under less restrictive conditions: the response probabilities may be known or unknown, the nonresponse mechanism is non-uniform, and the sampling fraction may be small or arbitrary. This type of nonresponse mechanism is called MAR or the ignorable nonresponse mechanism. The less restrictive conditions of this estimator help in acquiring data that are vital for financial and economic projects in many areas where missingness occurs in the study variable. For example, farm profitability and resilience, which bring in revenue for the country, can be investigated with the GREG estimators by estimating liabilities and net worth using variables such as farm type, farm size, region, tenure and economic performance. Similarly, the GREG estimator can be applied to economic data from, e.g., the agricultural industry, such as total yield, total profit and total income, to obtain these values in advance for effective decision making that can develop economic wealth for the whole nation. Handling missingness appropriately improves the reliability of the data used for planning in Thailand and other countries around the world (Lawson and Siripanich, 2022).

Basic setup
The notation and the basic notions under the reverse framework are introduced in this section. Let $y$ be the study variable, with population total $Y = \sum_{i \in U} y_i$, where $U = \{1, 2, \ldots, N\}$ and $N$ is the population size. Let $x$ be an auxiliary variable, with population total $X = \sum_{i \in U} x_i$. The paired values of the study variable $y$ and the auxiliary variable $x$ for the $i$th unit are $(y_i, x_i)$, $i = 1, 2, \ldots, N$. For the ratio estimator, the variable $x$ is an auxiliary variable. The auxiliary variables $k$ and $w$ are used, respectively, to define the first-order and joint inclusion probabilities under UPWOR and to construct the ratio estimator. A sample $s$ of size $n$ is drawn using UPWOR. The known, nonzero probability of selecting population unit $i \in U$ is denoted by $P_i = X_i / X$, and the first-order and second-order (joint) inclusion probabilities, $\pi_i$ and $\pi_{ij}$, are determined by the sampling design $P(s)$. Assume that the $n \times (q + 1)$ matrix of values of $x$ is available. The expectation and variance with respect to UPWOR sampling are denoted by $E_S$ and $V_S$, respectively. The population total GREG estimator, $\hat{Y}_{GREG}$, is built from the column vectors of auxiliary variables $\mathbf{x}_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{im})'$, $i = 1, 2, \ldots, n$, and from constants $q_i$ specified through the linear assisting model $\xi$. Under nonresponse, $R$ denotes the response mechanism and $r_i$ denotes the response indicator for $y_i$.
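For orientation, the complete-data GREG estimator is commonly written (see, e.g. Särndal et al., 1992) in the form below. This is a sketch of the standard textbook expression and not a reproduction of the display equations of the reviewed papers. Here $\mathbf{X}$ denotes the vector of known population totals of the auxiliary variables, $\pi_i$ the first-order inclusion probability of unit $i$, and $q_i$ is taken to be a known positive constant attached to unit $i$ by the assisting model; this reading of $q_i$ is an assumption about the notation.

```latex
\hat{Y}_{GREG}
  = \sum_{i \in s} \frac{y_i}{\pi_i}
  + \Big( \mathbf{X} - \sum_{i \in s} \frac{\mathbf{x}_i}{\pi_i} \Big)' \hat{\boldsymbol{\beta}},
\qquad
\hat{\boldsymbol{\beta}}
  = \Big( \sum_{i \in s} \frac{q_i \mathbf{x}_i \mathbf{x}_i'}{\pi_i} \Big)^{-1}
    \sum_{i \in s} \frac{q_i \mathbf{x}_i y_i}{\pi_i}
```

The first term is the Horvitz and Thompson (1952) estimator of $Y$; the second term is the regression adjustment, which vanishes when the weighted sample totals of the auxiliary variables already match their known population totals.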
Let $p_i$ be the response probability, $p_i = P(r_i = 1)$. Let $E_R$ and $V_R$ be the expectation and variance operators with respect to the response mechanism, and let $E$ and $V$ be the overall expectation and variance operators, respectively. Under the reverse framework, the variance of the GREG estimator $\hat{Y}_{GREG}$ is therefore split into a component due to sampling, conditional on the response mechanism, and a component due to the response mechanism.

Generalized regression estimators with missing data
Numerous works have investigated the GREG estimators with missing data under the two-phase framework, in which the variance of the GREG estimators is studied by examining only the selected sample in the first phase and only the nonresponse in the second phase.
Under the two-phase framework, the GREG estimator and its variance were studied in the presence of nonresponse (Särndal and Lundström, 2005). They also recommended an automated linearization method for finding the variance of the GREG estimator, which does not require the partial derivatives needed in Taylor series linearization (see, e.g. Estevao and Särndal, 2003; Särndal and Lundström, 2005; Särndal, 2007). A GREG estimator for the population total with nonresponse under the two-phase framework was given by Särndal and Lundström (2005).

Asian Journal of Economics and Banking
The variance of this estimator is expressed in terms of the residuals $e_i = y_i - \mathbf{x}_i' \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is the population regression coefficient, and it is estimated using the corresponding sample residuals $\hat{e}_i$. Apart from the two-phase framework, the reverse framework of Fay (1991) has also been used to investigate the variance of GREG estimators, with the order of sample selection and nonresponse reversed in the phases of sampling. Again, the same issue arises: the estimator is nonlinear and therefore needs to be transformed into a linear function. Under the reverse framework, a new GREG estimator has been suggested under MCAR, or the uniform nonresponse mechanism, where the response probability is constant. Most researchers (Lawson and Ponkaew, 2019; Ponkaew and Lawson, 2023) worked under this assumption because of its simplicity. A new GREG estimator for nonresponse under UPWOR was developed based on the estimator of Lawson (2017), a nonlinear and almost unbiased estimator of the population total or mean under probability proportional to size sampling with replacement. The benefit of the Lawson estimator is that the response probability is not required in the estimation, but it assumes that the response probabilities are the same for all units and that the sampling fraction is negligible. Lawson (2017) also gave the associated variance estimator for $\hat{Y}_r$. Under the same assumptions, namely that the nonresponse mechanism is MCAR and the sampling fraction is negligible under UPWOR, Lawson and Ponkaew (2019) suggested a new GREG estimator based on the Lawson (2017) estimator.
When the population size $N$ is known, they also gave a corresponding population total GREG estimator.

They also assumed that $\hat{\boldsymbol{\beta}}_r - \boldsymbol{\beta} = O_p(n_r^{-1/2})$ and $r_n \to 0$ as $n \to \infty$, where $\{r_n\}$ is a sequence of positive real numbers. For the variance of the GREG estimators, they considered two situations, replacing $\sum_{i \in s} r_i / \pi_i$ by a population-level quantity and using the Taylor linearization approach; the resulting variance expressions are written in terms of the residuals $y_i - \mathbf{x}_i' \boldsymbol{\beta}$. They also studied the properties of their estimators theoretically. Later, a new GREG estimator derived from the ratio method was proposed based on the work of Lawson and Ponkaew (2019), using the same assumption that the nonresponse mechanism is MCAR, and extended to cover the situation where the sampling fraction is large and therefore cannot be neglected. They also covered cases where the response probabilities are known and unknown, making use of the known auxiliary variable under nonresponse. Usually under the reverse framework the second component of the variance is omitted, but they considered the case where this component cannot be ignored (Ponkaew and Lawson, 2023). Therefore, the variance of $\hat{Y}_{GREG.LP}$ retains both components, including $V_R E_S(\hat{Y}_{GREG.LP} \mid R)$. Again, they used the automated linearization approach to transform $\hat{Y}_{GREG.LP}$ into a less complex form. They made three assumptions in their study, including one on the response mechanism. They gave GREG estimators for the population mean and total, and obtained $V(\hat{Y}^{*}_{GREG.R})$ under the reverse framework. The variances and their estimators were given separately for the cases where the response probability $p$ is known and where it is unknown. Unfortunately, the works mentioned above rely on the strong assumption that the nonresponse mechanism is MCAR, where the response probability is constant.
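The two variance components referred to above correspond to the usual reverse-framework decomposition (see, e.g. Shao and Steel, 1999). It is sketched here for a generic estimator $\hat{Y}$, using the operators $E_S$, $V_S$, $E_R$ and $V_R$ from the basic setup; the generic symbol $\hat{Y}$ is introduced only for this illustration.

```latex
V(\hat{Y})
  = E_R\big[ V_S(\hat{Y} \mid R) \big]
  + V_R\big[ E_S(\hat{Y} \mid R) \big]
```

The first term is the sampling variance conditional on the response pattern. The second term is the part that is typically dropped when the sampling fraction is negligible and that has to be kept when the sampling fraction is large.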
Novel GREG estimators for the population mean and total were proposed for the more flexible and more practical situation where nonresponse is missing at random (MAR), building on the previous works and using a known auxiliary variable to improve the efficiency of the estimators (Lawson and Siripanich, 2022). In their study, they assumed condition C1: $r_n \to 0$ as $n \to \infty$, where $\{r_n\}$ is a sequence of positive real numbers, and their estimators involve $\hat{Y}_r = \sum_{i \in s} r_i y_i$. Because the estimator is nonlinear, they suggested two techniques for variance estimation, called the modified automated linearization approaches. They proposed replacing $\sum_{i \in s} r_i / (\pi_i p_i)$ by $\sum_{i \in U} r_i / p_i$ in their estimators and used the Taylor linearization approach to transform the nonlinear estimator into a linear form.
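The replacement step can be motivated by a short calculation. It is given here as a sketch that treats the response indicators as fixed when averaging over the sampling design and assumes $p_i > 0$ for all units.

```latex
E_S\Big( \sum_{i \in s} \frac{r_i}{\pi_i p_i} \,\Big|\, R \Big)
  = \sum_{i \in U} \frac{r_i}{p_i},
\qquad
E_R\Big( \sum_{i \in U} \frac{r_i}{p_i} \Big)
  = \sum_{i \in U} \frac{p_i}{p_i}
  = N
```

The sample-based sum is therefore design-unbiased for its population counterpart, which in turn has expectation $N$ under the response mechanism.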
Their variance formulas are given separately for the case where $p_i$ is known for all $i \in s$ and the case where $p_i$ is unknown for all $i \in s$.

The corresponding variance estimators are likewise given for the cases where $p_i$ is known for all $i \in s$ and where $p_i$ is unknown for all $i \in s$. These GREG estimators can be calculated using any statistical package, e.g. R, which was used in the reviewed studies. Because these GREG estimators are new estimators for missing data under unequal probability sampling, there is unfortunately no R function that can be used directly, although the estimators are not complex to compute.
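To give a sense of how little code such a computation needs, the sketch below implements a generic propensity-weighted GREG-type estimate of a population total. It is written in Python rather than R purely for illustration, and it is not the estimator of any of the reviewed papers. It assumes a single auxiliary variable, a regression through the origin with $q_i = 1$, respondent weights $1/(\pi_i p_i)$ and response probabilities that are known or already estimated. The function name and the data are hypothetical.

```python
def greg_total_nonresponse(y, x, pi, p, r, x_total):
    """Propensity-weighted GREG-type estimate of the population total of y.

    y, x    : study and auxiliary values of the sampled units (y is not read
              where r == 0, so a placeholder such as None may be stored there)
    pi      : first-order inclusion probabilities
    p       : response probabilities (assumed known or already estimated)
    r       : response indicators (1 = responded, 0 = missing)
    x_total : known population total of x
    Returns the estimated total and the fitted regression slope.
    """
    respondents = [i for i in range(len(r)) if r[i] == 1]
    w = {i: 1.0 / (pi[i] * p[i]) for i in respondents}

    y_weighted = sum(w[i] * y[i] for i in respondents)  # weighted total of y
    x_weighted = sum(w[i] * x[i] for i in respondents)  # weighted total of x
    # Weighted least-squares slope for a regression through the origin.
    beta = (sum(w[i] * x[i] * y[i] for i in respondents)
            / sum(w[i] * x[i] ** 2 for i in respondents))
    return y_weighted + (x_total - x_weighted) * beta, beta


# Hypothetical sample of five units; units 3 and 5 did not respond.
x_sample = [4.0, 10.0, 6.0, 8.0, 12.0]
y_sample = [8.0, 20.0, None, 16.0, None]
pi_sample = [0.2, 0.5, 0.3, 0.4, 0.6]
p_sample = [0.7] * 5
r_sample = [1, 1, 0, 1, 0]

estimate, beta_hat = greg_total_nonresponse(
    y_sample, x_sample, pi_sample, p_sample, r_sample, x_total=100.0)
print(estimate, beta_hat)
```

In this toy data every respondent satisfies y = 2x exactly, so the fitted slope is 2 and the estimate equals twice the known total of x up to rounding. This shows how the regression adjustment compensates for the units lost to nonresponse when the auxiliary variable is strongly related to the study variable.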

Examples of application to financial and economic data
The GREG estimator was applied to estimate the total monthly household income of five communities in Bang Sue district, Bangkok, Thailand (Lawson and Siripanich, 2022). The results were based on a sample of 195 households drawn using UPWOR with Midzuno's (1952) scheme from 1,181 households, with 30% nonresponse in monthly income. Monthly expenditure, age and hours worked per week were used as auxiliary variables to assist in estimating the total income and its variance. A logistic regression model with age as the covariate was used to estimate the unknown response probabilities.
Their results showed that their suggested GREG estimator gave an estimated total income for all households of 36,068,543 baht, with smaller variances than the Särndal and Lundström (2005) estimator.
Data on total monthly household income are key to understanding a core part of a country's economy. Information on the financial status of citizens sheds light on money flow in the economy and provides invaluable insights for designing policies to overcome economic inequalities. Estimation of these statistics allows policymakers to identify income disparities within the nation and to introduce measures that promote equality and stabilize the economy, leading to improvements in quality of life in many respects.
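The step of estimating unknown response probabilities with a logistic regression on age can be sketched as follows. This is a generic Newton-Raphson fit in plain Python and not the code of the reviewed study. The ages and response indicators are hypothetical, and age is centered before fitting for numerical stability.

```python
import math


def fit_logistic(x, r, iterations=25):
    """Fit P(r = 1 | x) = 1 / (1 + exp(-(a + b * x))) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            prob = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            g0 += ri - prob             # score for the intercept
            g1 += (ri - prob) * xi      # score for the slope
            v = prob * (1.0 - prob)     # weight in the information matrix
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical ages and response indicators (1 = income reported).
ages = [22, 25, 31, 38, 44, 47, 52, 58, 63, 70]
responded = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

mean_age = sum(ages) / len(ages)
centered = [age - mean_age for age in ages]
intercept, slope = fit_logistic(centered, responded)
p_hat = [1.0 / (1.0 + math.exp(-(intercept + slope * c))) for c in centered]
print(intercept, slope)
```

The fitted values in `p_hat` play the role of estimated response probabilities and could be used in weights of the form 1/(pi_i p_i). At the fitted values the estimated probabilities sum to the observed number of respondents, which is a quick check that the iteration has converged.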
Another example comes from Thailand's agriculture, which is one of the sources of income that support Thailand's economy (Ponkaew and Lawson, 2023). Data on Thai maize in 2019 from the Office of the Agricultural Economics were studied, based on a sample of 25 provinces selected from 63 provinces using UPWOR with Midzuno's (1952) scheme. The data contained a 30% nonresponse rate. The total yield of maize for all provinces in Thailand in 2019 was estimated using their suggested GREG estimator, with the cultivated area and the harvested area in 2019 as auxiliary variables and the cultivated area in 2018 as the size variable. The estimated total yield of maize for all provinces in Thailand was 525,124, with the smallest variance compared with the existing estimator.
Statistical estimation of agricultural yield is imperative for agricultural countries such as Thailand and a large part of Asia. Agriculture has long been part of these nations' histories, as their geography and climate favor the successful growing of crops. Today, exports play a key role as one of the major sources of income, and a large amount of land is used for farming. Farmers are often short on resources and must go to great lengths to save time and money and to ensure that their yields bring profit rather than loss. Predicting crop yields can help policymakers working with farmers to anticipate food shortages that lead to losses, as well as the potential risks of farming strategies. As many countries depend on agriculture, accurate estimation of yields is an essential component of their economies.
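Both applications draw the sample with Midzuno's (1952) scheme, which is convenient because its inclusion probabilities have closed forms. The sketch below assumes the scheme as it is commonly described: the first unit is drawn with probability proportional to a size variable and the remaining n - 1 units by simple random sampling without replacement. The function name and the size values are hypothetical.

```python
def midzuno_inclusion_probabilities(sizes, n):
    """First-order and joint inclusion probabilities under Midzuno's scheme.

    sizes : positive size measures for the N population units (N > 2)
    n     : sample size, with 2 <= n <= N
    Returns (pi, pi_joint); pi_joint[i][i] is set to pi[i] for convenience.
    """
    N = len(sizes)
    total = sum(sizes)
    p = [z / total for z in sizes]  # probabilities used for the first draw
    a = (N - n) / (N - 1)
    b = (n - 1) / (N - 1)
    pi = [a * pk + b for pk in p]
    pi_joint = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i == j:
                pi_joint[i][j] = pi[i]
            else:
                pi_joint[i][j] = b * ((N - n) / (N - 2) * (p[i] + p[j])
                                      + (n - 2) / (N - 2))
    return pi, pi_joint


# Hypothetical size measures for N = 6 units and a sample of n = 3.
sizes = [10, 20, 30, 40, 50, 50]
n = 3
pi, pi_joint = midzuno_inclusion_probabilities(sizes, n)
print(pi)
```

Two identities that hold for any fixed-size design serve as sanity checks: the first-order probabilities sum to n, and for each unit i the joint probabilities over all other units sum to (n - 1) times pi_i.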
