Robust Estimation in Stratification Sampling

Ajiteru S. Oyeyemi

ICT Centre, Federal Polytechnic Offa, Offa, Kwara State, Nigeria

Abstract: The estimate of any variable of interest, such as the monthly allowance and expenditure of university students considered in this study, depends on the sampling scheme used. This research compares estimation under simple random sampling and stratification sampling. The Statistical Package for the Social Sciences (SPSS) was used to compute the required estimates. The stratification sampling scheme gave the smaller variance and is therefore recommended.

Keywords: Estimation, Variable, Population, Sampling, Stratification

INTRODUCTION

Sample design has two aspects: a selection process, the rules and operations by which some members of the population are included in the sample; and an estimation process (or estimator) for computing the sample statistics, which are sample estimates of population values (Hidiroglou and Rao, 1983).

Kish (1965) opined that survey objectives should determine the sample design; but the determination is actually a two-way process, because the problems of sample design often influence and change the survey objectives. We shall encounter examples of the ways in which survey objectives and sample design interact to produce overall survey designs.

A dialogue between the researcher and the sampler must occur before any aspect of survey design is frozen, because a change in one aspect may dictate a change in others. Instead of a dialogue, the decisions may involve a larger cast: sampler, researcher, and consumer; and the last, perhaps the grantor of the project, may feel behind him the silent pressure of the ultimate consumers of the data: the members of a profession or, perhaps, a wider public. The dialogue may occur silently within one head, if the researcher and sampler are one; but the dialogue should nevertheless take place (Kuk, 1988; Okafor, 2002).

Most samples are prepared by statisticians and other researchers who are not primarily sampling specialists. Nevertheless, it is helpful, although sometimes difficult, to separate sampling design from the related activities involved in survey research. The sample design covers the tasks of selection and estimation for making inferences from sample values to population values. Beyond this are the problems of making inferences from the survey population to another and generally broader population, with measurements free from error (Rao, 1994).

AIM AND OBJECTIVES

The aim of this study is to compare the variances of linear combinations under simple random sampling and stratification sampling. The objectives are to:

  1. Estimate the mean monthly income and expenses of students in the university

  2. Present a better procedure for estimation when the mean monthly incomes and expenses of students in the university are involved.

Characteristics of population elements are transformed to variables Yi by the survey operations of measurement. Some literature deals directly with the statistical populations of the variables Yi. But I prefer to say that the ith element has the variable Yi. This permits us to talk of the many variables (Yi, Xi, Zi, Wi, Pi and so on) of the same element (Rao, Kovar and Mantel, 1990). We can also consider relationships between variables of an element, changes of variables, and accuracy of measurements of variables. A statistic computed from the variables found in a sample is itself a random variable, which we call a variate (Kendall and Buckland, 1957).

METHODOLOGY

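Let $y_j$ be the value of the $Y_i$ variable for the $j$th sample element, and let $y = \sum_{j}^{n} y_j$ be the sample total. The sample mean and the sample element variance are

$$\bar{y} = \frac{1}{n}\sum_{j}^{n} y_j \tag{1}$$

$$s_y^2 = \frac{1}{n-1}\sum_{j}^{n} (y_j - \bar{y})^2$$

Writing $\bar{Y}$ for the population mean, the variance of the sample mean is

$$\mathrm{Var}(\bar{y}) = E\left\{\left(\frac{1}{n}\sum_{j}^{n} y_j - \bar{Y}\right)^2\right\} = \frac{1}{n^2}\, E\left\{\sum_{j}^{n} (y_j - \bar{Y})^2 + \sum_{j \neq k}^{n} (y_j - \bar{Y})(y_k - \bar{Y})\right\}$$

$$\mathrm{Var}(\bar{y}) = \frac{1}{n^2}\left[\frac{n}{N}\sum_{i}^{N} (Y_i - \bar{Y})^2 + \frac{n(n-1)}{N(N-1)}\sum_{i \neq h}^{N} (Y_i - \bar{Y})(Y_h - \bar{Y})\right] = \frac{1}{n^2}\left[\frac{n}{N}\sum_{i}^{N} (Y_i - \bar{Y})^2 + \frac{n(n-1)}{N(N-1)}\left\{\left(\sum_{i}^{N} (Y_i - \bar{Y})\right)^2 - \sum_{i}^{N} (Y_i - \bar{Y})^2\right\}\right] = (1 - f)\,\frac{S_y^2}{n} \tag{2}$$

where $f = n/N$ and $S_y^2 = \frac{1}{N-1}\sum_{i}^{N} (Y_i - \bar{Y})^2$. The last step uses $\sum_{i}^{N} (Y_i - \bar{Y}) = 0$. Also,

$$\mathrm{Var}(N\bar{y}) = N^2\,\mathrm{Var}(\bar{y}) = N^2\,\frac{1 - f}{n}\, S_y^2 \tag{3}$$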
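Equation (2) can be checked by brute force on a small population. The sketch below is illustrative only: the five population values are hypothetical, and the function names are not from the original study. It enumerates every equally likely sample of size n and compares the exact variance of the sample mean with $(1 - f) S_y^2 / n$.

```python
from itertools import combinations
from statistics import mean


def srs_mean_variance_by_enumeration(population, n):
    """Exact variance of the sample mean over all equally likely SRS samples."""
    pop_mean = mean(population)
    sample_means = [mean(s) for s in combinations(population, n)]
    return mean((m - pop_mean) ** 2 for m in sample_means)


def srs_mean_variance_by_formula(population, n):
    """(1 - f) * S^2 / n, with S^2 computed using the N - 1 divisor."""
    N = len(population)
    pop_mean = mean(population)
    S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)
    return (1 - n / N) * S2 / n


population = [2, 5, 7, 10, 16]  # hypothetical values, N = 5
exact = srs_mean_variance_by_enumeration(population, n=2)
formula = srs_mean_variance_by_formula(population, n=2)
print(exact, formula)
```

Enumeration is feasible only for tiny populations, but it confirms the algebra without relying on simulation.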

MEAN AND VARIANCE OF SIMPLE RANDOM SAMPLING

The simple mean of an SRS selection is the SRS mean, and we distinguish it with the subscript 0:

$$\bar{y}_0 = \frac{1}{n}\sum_{j}^{n} y_j = \frac{1}{n}\left[y_1 + y_2 + \dots + y_n\right] \tag{4}$$

The results of an SRS selection may be used for other estimators also, for example, with post-stratification or with a ratio estimator (Sarndal and Wright, 1992). But we treat those separately as other designs. Simple random sampling is a sample design specifying both the SRS selection and the simple mean estimate. The variance of the SRS mean $\bar{y}_0$ is computed as

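$$\mathrm{var}(\bar{y}_0) = (1 - f)\,\frac{s^2}{n}$$

where

$$s^2 = \frac{1}{n-1}\sum_{j}^{n} (y_j - \bar{y})^2 = \frac{n\sum_{j}^{n} y_j^2 - y^2}{n(n-1)}$$

and $y = \sum_{j}^{n} y_j$ is the sample total.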

The standard error of $\bar{y}_0$ is the square root of its variance:

$$\mathrm{se}(\bar{y}_0) = \sqrt{\mathrm{var}(\bar{y}_0)} = \sqrt{1 - f}\,\frac{s}{\sqrt{n}} \tag{5}$$

Sometimes we may want to estimate $Y = N\bar{Y}$, the aggregate or total of the $Y_i$ variable in the population. A simple estimator of $Y$ is $N\bar{y}_0$, and its standard error is estimated by

$$\mathrm{se}(N\bar{y}_0) = N\sqrt{1 - f}\,\frac{s}{\sqrt{n}} \tag{6}$$
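Equations (4) to (6) translate directly into a few lines of code. The following is a minimal sketch, assuming the sample is held in a plain list and the population size N is known; the function name `srs_estimates` and the sample values are hypothetical.

```python
import math


def srs_estimates(sample, N):
    """SRS mean, element variance, and standard errors of the mean and total."""
    n = len(sample)
    f = n / N                                   # sampling fraction
    y_total = sum(sample)                       # sample total y
    y_bar = y_total / n                         # equation (4)
    # Computational form of s^2: (n * sum(y_j^2) - y^2) / (n (n - 1))
    s2 = (n * sum(v * v for v in sample) - y_total ** 2) / (n * (n - 1))
    var_mean = (1 - f) * s2 / n
    se_mean = math.sqrt(var_mean)               # equation (5)
    return {
        "mean": y_bar,
        "s2": s2,
        "var_mean": var_mean,
        "se_mean": se_mean,
        "total": N * y_bar,
        "se_total": N * se_mean,                # equation (6)
    }


sample = [12, 15, 9, 20, 14]  # hypothetical observations
est = srs_estimates(sample, N=50)
print(est)
```

The computational form of $s^2$ avoids a second pass over the data, at the cost of some numerical precision when the values are large relative to their spread.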

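We can also point out that the expected value of $s^2$ in SRS is

$$E(s^2) = \frac{N}{N-1}\,\sigma^2 \tag{7}$$

where $\sigma^2 = \frac{1}{N}\sum_{i}^{N} (Y_i - \bar{Y})^2$ is the population variance with divisor $N$.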

It follows that the expected value of the sample estimate of the variance of the mean is

$$E\left(\frac{1 - f}{n}\, s^2\right) = \frac{N - n}{N - 1}\cdot\frac{\sigma^2}{n} \tag{8}$$
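Equations (7) and (8) can also be verified exactly on a small population. The sketch below is an illustration with hypothetical values; it averages $s^2$ and $(1 - f)s^2/n$ over every possible SRS sample and compares the results with the right-hand sides of (7) and (8).

```python
from itertools import combinations
from statistics import mean, pvariance, variance

population = [2, 5, 7, 10, 16]  # hypothetical values
N, n = len(population), 3

sigma2 = pvariance(population)  # population variance, divisor N
S2 = variance(population)       # population variance, divisor N - 1

samples = list(combinations(population, n))

# Average of s^2 over all equally likely samples, i.e. E(s^2)
expected_s2 = mean(variance(s) for s in samples)

# Average of the estimated variance of the mean, i.e. E((1 - f) s^2 / n)
expected_var_hat = mean((1 - n / N) * variance(s) / n for s in samples)

print(expected_s2, N / (N - 1) * sigma2)
print(expected_var_hat, (N - n) / (N - 1) * sigma2 / n)
```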

For the difference $(\bar{y} - \bar{x})$ of two means, the variance is simply the sum of the two variances if the two samples are independent (Thompson, 1992). But if the two means are not independent, a covariance term must be subtracted from the sum of the variances:

$$\mathrm{var}(\bar{y} - \bar{x}) = \mathrm{var}(\bar{y}) + \mathrm{var}(\bar{x}) - 2\,\mathrm{cov}(\bar{y}, \bar{x})$$

For n pairs of values, each pair selected with SRS, the difference has the variance:

$$\mathrm{var}(\bar{y} - \bar{x}) = \frac{1 - f}{n}\left(s_y^2 + s_x^2 - 2 s_{yx}\right) \tag{9}$$

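Note also the use of the covariance of the two variables. This statistic resembles the variance, but contains cross-product terms instead of the squared terms of the variance:

$$\mathrm{cov}(\bar{y}_0, \bar{x}_0) = \frac{1 - f}{n}\, s_{yx}, \qquad s_{yx} = \frac{1}{n-1}\left[\sum_{j}^{n} y_j x_j - \frac{y\,x}{n}\right] \tag{10}$$

where $y = \sum_{j}^{n} y_j$ and $x = \sum_{j}^{n} x_j$ are the sample totals.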

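Note also that for the pairs of elements

$$(\bar{y} - \bar{x}) = \frac{1}{n}\sum_{j}^{n} y_j - \frac{1}{n}\sum_{j}^{n} x_j = \frac{1}{n}\sum_{j}^{n} (y_j - x_j) = \frac{1}{n}\sum_{j}^{n} d_j$$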

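Hence we may treat this as the mean of a sample of n elements $d_j = (y_j - x_j)$. The variance can also be computed as

$$\mathrm{var}\left(\frac{1}{n}\sum_{j}^{n} d_j\right) = \frac{1 - f}{n}\, s_d^2 \tag{11}$$

where

$$s_d^2 = \frac{1}{n-1}\left[\sum_{j}^{n} d_j^2 - \frac{\left(\sum_{j}^{n} d_j\right)^2}{n}\right],$$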

which is numerically equal to (9). The covariance is absent for two independent samples, but present for two overlapping samples. The variance of the difference becomes more complicated if the two samples are neither completely independent nor completely overlapping.
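The agreement between the two routes to the variance of a paired difference can be illustrated numerically. This is a sketch under stated assumptions: the paired values and the population size are hypothetical, and `sample_cov` is a helper written for this example, not a library function.

```python
from statistics import variance


def sample_cov(y, x):
    """Sample covariance s_yx with divisor n - 1, using sample totals."""
    n = len(y)
    cross = sum(a * b for a, b in zip(y, x))
    return (cross - sum(y) * sum(x) / n) / (n - 1)


def var_of_difference(y, x, N):
    """Return the variance of (y_bar - x_bar) computed two ways."""
    n = len(y)
    f = n / N
    # Route 1: separate variances minus twice the covariance
    via_components = (1 - f) / n * (variance(y) + variance(x) - 2 * sample_cov(y, x))
    # Route 2: variance of the element-wise differences d_j = y_j - x_j
    d = [a - b for a, b in zip(y, x)]
    via_differences = (1 - f) / n * variance(d)
    return via_components, via_differences


y_values = [20, 15, 30, 25, 10]  # hypothetical paired measurements
x_values = [18, 14, 27, 20, 11]
via_components, via_differences = var_of_difference(y_values, x_values, N=100)
print(via_components, via_differences)
```

Working with the differences $d_j$ is usually simpler in practice, since it needs only one variance and no cross-product sum.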

The subclass mean $\bar{y}_m = \sum_{j}^{m} y_j / m$ from an SRS of n elements can be treated as an SRS of m elements. That is, we consider the variance of the sample conditional on obtaining a sample of m elements:

$$\mathrm{var}(\bar{y}_m) = \frac{1 - f}{m(m-1)}\sum_{j}^{m} (y_j - \bar{y}_m)^2 = \frac{1 - f}{m}\, s_m^2 \tag{12}$$

We can use $f = m/M$ if we know $M$, the population size of the subclass. If we do not know $M$, we can use $f = n/N$, neglecting the difference from $m/M$. But if we want to estimate $Y_m = M\bar{Y}_m$, the population total of the subclass, then knowledge of the subclass size $M$ becomes important. If $M$ is known, then $M\bar{y}_m$ has the variance $M^2\,\mathrm{var}(\bar{y}_m)$. If we do not know $M$ and use $(mN/n)\,\bar{y}_m$ to estimate $Y_m$, then the element variance $s_m^2$ is increased to $[s_m^2 + (1 - \bar{m})\,\bar{y}_m^2]$, where $\bar{m} = m/n$ is the proportion of the sample falling in the subclass.

The formulas above take the familiar form, with the possible exception of the factor $(1 - f) = (1 - n/N)$. This factor is usually called the finite population correction, briefly fpc. When sampling without replacement, it appears as a correction factor to the main portion of the variance terms, which is $s^2/n$ for SRS. If we think of a fixed sample size n being applied to larger and larger populations, the sampling fraction $f = n/N$ tends to zero, and the factor $(1 - f)$ approaches 1. Multiplication by one has no effect, and the fpc can be omitted when the population is much larger than the sample. For an infinite population the factor disappears from the variance formula; hence, its name. Also, when selecting with replacement, the factor $(1 - f)$ becomes 1 and disappears. The effect is similar to selection from an infinite population.

The sampling fraction is usually small, because the population is large. The aims of research generally concern inferences about large populations; even when the selection is confined to a small population, that population is often hopefully considered a sample for making inferences about some much larger actual population or theoretical universe. But surveys aimed specifically at small populations do occur, and sometimes these run into larger fractions of 10 percent and more. In these rare cases the fpc is needed. Note that the variance can be written as $\mathrm{var}(\bar{y}) = s^2/n'$, where $n' = n/(1 - n/N) = nN/(N - n)$. From this we easily note that $n = n'/(1 + n'/N)$. In other words, the effect of $(1 - n/N)$ is to increase the effective sample size from $n$ to $n'$. It might be convenient to write all the variance formulas with this convention.
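The behaviour of the fpc and of the effective sample size $n' = nN/(N - n)$ is easy to tabulate. The sketch below uses hypothetical population sizes with a fixed sample of 50; the function names are chosen for this illustration only.

```python
def fpc(n, N):
    """Finite population correction (1 - n/N)."""
    return 1 - n / N


def effective_sample_size(n, N):
    """n' = nN / (N - n), so that (1 - f) s^2 / n equals s^2 / n'."""
    return n * N / (N - n)


n = 50
for N in (100, 1_000, 100_000):  # hypothetical population sizes
    n_eff = effective_sample_size(n, N)
    print(f"N={N:>7}  fpc={fpc(n, N):.4f}  n'={n_eff:.2f}")
```

With N = 100 the sample covers half the population and the effective sample size doubles; with N = 100,000 the correction is negligible and can be dropped.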

RELATIVE ERROR

In some situations it is useful to consider relative measures instead of absolute measures of variation. The absolute measures, the standard deviation and the standard error, appear in the units of measurement of the variable, and this causes difficulties in some comparisons. Common relative measures are the coefficients of variation, in which the unit of measurement is cancelled by dividing by the mean. The element coefficient of variation is derived from the standard deviation:

$$C_y = \frac{S_y}{\bar{Y}}, \quad \text{estimated by} \quad c_y = \frac{s_y}{\bar{y}} \tag{13}$$

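The coefficient of variation of the mean $(\bar{y})$ is derived similarly from the standard error:

$$CV(\bar{y}) = \frac{SE(\bar{y})}{\bar{Y}}, \quad \text{estimated by} \quad cv(\bar{y}) = \frac{\mathrm{se}(\bar{y})}{\bar{y}} \tag{14}$$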

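The squares of these quantities correspond, respectively, to the variances of the element and of the mean:

$$C_y^2 = \frac{S_y^2}{\bar{Y}^2}, \quad \text{estimated by} \quad c_y^2 = \frac{s_y^2}{\bar{y}^2},$$

is an element relative variance, and

$$CV^2(\bar{y}) = \frac{\mathrm{Var}(\bar{y})}{\bar{Y}^2}, \quad \text{estimated by} \quad cv^2(\bar{y}) = \frac{\mathrm{var}(\bar{y})}{\bar{y}^2},$$

is a relative variance of the mean $(\bar{y})$.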

Coefficients of variation are useful for variables that are always or mostly positive; these occur frequently in surveys, especially as count data. Comparison of the variability of these items often becomes more meaningful when expressed in relative terms. For example, in comparing the income spread in two countries, the use of the two standard deviations would be confused by the different monetary units as well as by different standards of living; but coefficients of variation may provide a reasonable comparison in terms of average income.

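Because the constant N cancels in the ratio, the estimated total has the same coefficient of variation as the mean:

$$cv(N\bar{y}) = cv(\bar{y}) \tag{15}$$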

The general expression holds for different sample designs. Specifically, for SRS samples we can use

$$cv(\bar{y}_0) = cv(N\bar{y}_0) = \sqrt{1 - f}\,\frac{s_y}{\bar{y}_0\sqrt{n}} \tag{16}$$
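The unit-free nature of the coefficient of variation in equation (16) can be shown with a short example. This sketch assumes a hypothetical sample recorded in two different units; the function name is illustrative.

```python
import math
from statistics import mean, stdev


def cv_of_srs_mean(sample, N):
    """Coefficient of variation of the SRS mean, following equation (16)."""
    n = len(sample)
    return math.sqrt(1 - n / N) * stdev(sample) / (mean(sample) * math.sqrt(n))


sample_units = [12_000, 15_000, 9_000, 20_000, 14_000]  # hypothetical amounts
sample_thousands = [v / 1000 for v in sample_units]     # same data, rescaled

cv_units = cv_of_srs_mean(sample_units, N=50)
cv_thousands = cv_of_srs_mean(sample_thousands, N=50)
print(cv_units, cv_thousands)
```

Both calls return the same value, whereas the standard errors of the two versions differ by a factor of 1000.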

In some situations the coefficients of variation should be used only with caution, or not at all, for the following reasons: (1) if the mean of the variable is close to zero, the coefficients of variation are large and unusable; and (2) for a binomial variable, the element variance is the same, $P(1 - P)$, for both $P$ and $1 - P$, but the coefficients of variation differ, depending on the arbitrary decision of which side of the binomial is regarded as $P$ and which as $Q$. That is:

$$C_y^2 = \frac{P(1 - P)}{P^2} = \frac{1 - P}{P}$$

and

$$cv^2(p) = (1 - f)\,\frac{1 - p}{p\,(n - 1)} \tag{17}$$

The element relative variance $C^2 = (1 - P)/P = 1$ for $P = 0.5$. It increases rapidly for small values of $P$.

LINEAR COMBINATIONS USING STRATIFICATION SAMPLING

Mean of Linear Combinations Using Stratification Sampling

In stratified sampling, the population of N units is first divided into subpopulations of N1, N2, …, NL units respectively. These subpopulations are non-overlapping and together they comprise the whole of the population, so that

$$N_1 + N_2 + \dots + N_L = N$$

Then

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h\,\bar{y}_h \tag{18}$$

is a linear function of the stratum means $\bar{y}_h$ with fixed weights $W_h$.
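Equation (18) can be sketched in code as a weighted sum of stratum means. The example assumes the common choice of weights $W_h = N_h/N$, which the text does not state explicitly; the stratum labels, sizes, and values are hypothetical.

```python
def stratified_mean(strata_samples, strata_sizes):
    """Weighted sum of stratum sample means, assuming W_h = N_h / N."""
    N = sum(strata_sizes.values())
    total = 0.0
    for h, values in strata_samples.items():
        W_h = strata_sizes[h] / N             # assumed weight for stratum h
        y_bar_h = sum(values) / len(values)   # stratum sample mean
        total += W_h * y_bar_h
    return total


# Hypothetical data for two strata
strata_samples = {"male": [10, 14, 12], "female": [20, 22]}
strata_sizes = {"male": 60, "female": 40}

y_st = stratified_mean(strata_samples, strata_sizes)
print(y_st)
```

Because the weights are fixed in advance, the estimate does not depend on how many elements happen to be drawn from each stratum.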

Variances for Linear Combinations Using Stratification Sampling

We obtain variances for some more complicated linear combinations that we shall need later. The sum of H random variables $y_h$, weighted by the constant factors $W_h$, has the variance:

$$\mathrm{Var}\left(\sum_{h}^{H} W_h y_h\right) = \sum_{h}^{H} W_h^2\,\mathrm{Var}(y_h) + 2\sum_{h<g}^{H} W_h W_g\,\mathrm{Cov}(y_h, y_g) \tag{19}$$

A common example is the sum or difference of two random variables $y_1$ and $y_2$, when $W_1 = 1$ and $W_2$ is either 1 or -1:

$$\mathrm{Var}(y_1 \pm y_2) = \mathrm{Var}(y_1) + \mathrm{Var}(y_2) \pm 2\,\mathrm{Cov}(y_1, y_2)$$

The covariance vanishes if $y_1$ and $y_2$ are uncorrelated. Another important special case arises when all H variates are uncorrelated, because they are based on independent samples from H strata. Then all the covariances vanish and

$$\mathrm{Var}\left(\sum_{h}^{H} W_h y_h\right) = \sum_{h}^{H} W_h^2\,\mathrm{Var}(y_h) \tag{20}$$
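Applied to stratum means, equation (20) gives the variance of the stratified mean as a weighted sum of within-stratum variances. The sketch below assumes SRS within each stratum, weights $W_h = N_h/N$, and hypothetical data; none of these specifics are taken from the study's data set.

```python
from statistics import variance


def stratified_mean_variance(strata_samples, strata_sizes):
    """Sum of W_h^2 * var(y_bar_h), assuming independent SRS within strata."""
    N = sum(strata_sizes.values())
    total = 0.0
    for h, values in strata_samples.items():
        N_h = strata_sizes[h]
        n_h = len(values)
        W_h = N_h / N                                       # assumed weight
        var_mean_h = (1 - n_h / N_h) * variance(values) / n_h
        total += W_h ** 2 * var_mean_h
    return total


# Hypothetical data for two strata
strata_samples = {"male": [10, 14, 12], "female": [20, 22]}
strata_sizes = {"male": 60, "female": 40}

var_st = stratified_mean_variance(strata_samples, strata_sizes)
print(var_st)
```

Only within-stratum variation enters the result, which is why stratification helps most when strata differ strongly in their means but are internally homogeneous.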

We can consider the covariance of the sums $\sum W_h y_h$ and $\sum V_h x_h$ of two sets of random variables, again assuming independence between the H sets; for example, these could be pairs of measurements on H independently selected elements:

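$$\mathrm{Cov}\left(\sum_{h}^{H} W_h y_h,\ \sum_{h}^{H} V_h x_h\right) = \sum_{h}^{H} W_h V_h\,\mathrm{Cov}(y_h, x_h) \tag{21}$$

When the constants $W_h$ and $V_h$ are all unity, we have

$$\mathrm{Cov}\left(\sum_{h}^{H} y_h,\ \sum_{h}^{H} x_h\right) = \sum_{h}^{H} \mathrm{Cov}(y_h, x_h) \tag{22}$$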

The formulas for variances and covariances of linear combinations were developed for population values, written with capital letters as Var and Cov. But they apply also to their sample estimates, which we write with lower case letters as var and cov. Summation of the estimated variances and covariances for sample totals within strata is simple and frequently needed.

Therefore, we employ the brief notation $dy_h^2$, $dx_h^2$ and $dy_h\,dx_h$. When the $y_h$ and $x_h$ represent two variates for selections that are independent between strata, we have

$$\mathrm{var}\left(\sum_{h}^{H} y_h\right) = \sum_{h}^{H} \mathrm{var}(y_h) = \sum_{h}^{H} dy_h^2$$

$$\mathrm{var}\left(\sum_{h}^{H} x_h\right) = \sum_{h}^{H} dx_h^2$$

$$\mathrm{cov}\left(\sum_{h}^{H} y_h,\ \sum_{h}^{H} x_h\right) = \sum_{h}^{H} dy_h\,dx_h \tag{23}$$

RESULTS

Here, the estimation of the mean monthly allowance and expenditure of students in the Department of Mathematics and Statistics, Federal University of Technology, Minna is considered, where X denotes monthly allowance and Y denotes monthly expenditure. The data gathered are stratified into two strata: male (stratum 1) and female (stratum 2). The results for simple random sampling and stratification sampling are summarized in Tables 1 and 2 respectively:

Table 1: Estimates for Simple Random Sampling

Estimate / Parameter      X                   Y
n                         38                  38
Mean                      19,750.00           18,170.00
s2                        1,540,000,000.00    1,299,998,914.00
Standard Error of Mean    6,357.58            5,848.97

Table 2: Estimates for Stratification Sampling

Estimate / Parameter      X                   Y
n                         38                  38
Mean                      12,263.16           11,282.11
s2                        62,469,417.00       52,874,114.37
Standard Error of Mean    1,282.16            1,179.59

DISCUSSION OF RESULTS

From Tables 1 and 2, the variance and the corresponding standard error under stratification sampling are smaller than those under simple random sampling. For simple random sampling, the means of x and y are 19,750.00 and 18,170.00, with standard errors of 6,357.58 and 5,848.97 respectively. That is, the average monthly allowance of a student is 19,750.00 while the average monthly expenditure is 18,170.00. For stratification sampling, the means of x and y are 12,263.16 and 11,282.11, with standard errors of 1,282.16 and 1,179.59 respectively. That is, the average monthly allowance of a student is 12,263.16 while the average monthly expenditure is 11,282.11.
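The standard errors quoted for stratification sampling can be reproduced from the element variances in Table 2. The sketch below assumes that the standard error was computed as $\sqrt{s^2/n}$ with n = 38, that the fpc was ignored (the population size is not reported), and that the first variance entry reads 62,469,417.00.

```python
import math


def se_of_mean(s2, n):
    """Standard error sqrt(s2 / n), ignoring the finite population correction."""
    return math.sqrt(s2 / n)


n = 38
# Element variances for stratification sampling, as read from Table 2
se_x = se_of_mean(62_469_417.00, n)
se_y = se_of_mean(52_874_114.37, n)
print(round(se_x, 2), round(se_y, 2))
```

Under these assumptions the results round to 1,282.16 and 1,179.59, the values quoted in the discussion.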

CONCLUSIONS AND RECOMMENDATIONS

From the findings above, estimation using stratification sampling performed better than estimation using simple random sampling for the linear combinations considered. The better-performing sampling technique should be adopted whenever estimates of such variables of interest are required. In this study, the stratification sampling scheme gave the smaller variance and standard error of the mean. Hence, estimation using the stratification sampling scheme is recommended.

REFERENCES

  1. Hidiroglou, M. and Rao, J. (1983). On two sample schemes of unequal probability sampling without replacement. Journal of the Indian Statistical Association, 3: 173-180.

  2. Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47: 663-685.

  3. Kish, L. (1965). Survey Sampling. John Wiley and Sons.

  4. Kuk, A. (1988). Estimation of distribution functions and medians under sampling with unequal probabilities. Biometrika, 75: 97-103.

  5. Okafor, F. C. (2002). Sample Survey Theory with Applications, Afro-Orbis Publications, Nigeria.

  6. Rao, J. (1994). Estimating totals and distributions functions using auxiliary information in the estimation stage. Journal of Official Statistics, 10: 153-166.

  7. Rao, J., Kovar, J. and Mantel, H. (1990). On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika, 77: 365-375.

  8. Sarndal, C. E. and Wright, R. L. (1992). Design-based and model-based inference in survey sampling. Scandinavian Journal of Statistics, 5: 27-52.

  9. Thompson, S. K. (1992). Sampling. John Wiley and Sons, New York.
