The Sample Mean

Statisticians are often trying to answer questions such as: What percentage of U.S. voters prefer the Democratic candidate for President? How many hours, on the average, will a light bulb produced by a certain manufacturer last? What is the average amount of garbage produced per week by an American household? What is the average number of miles driven by the owner of a Nissan in one year?

All these problems can be modelled as questions about the mean of certain random variables on large populations. THEORETICALLY, the mean of such a random variable can be determined by recording the values of the random variable for each member of the population, adding up the results, and dividing the sum by the number of observations. Presidential elections are an attempt to apply exactly this procedure to settle the first question. However, in most cases it is too impractical and expensive to answer questions of this kind by determining all values of the random variables involved.

SAMPLING is a less expensive way of solving problems such as the above. In sampling a random variable X, only the values of X for members of the sample need to be recorded. Then the mean for the sample is calculated, in the hope that this mean will be a good approximation to the mean of the whole population. In developing sampling techniques, statisticians have to answer the following question:

"How good an approximation to the population mean is the sample mean likely to be?"

It seems intuitively clear that larger samples should yield better approximations than smaller ones. But, for a given sample size, how good is the approximation? And, conversely, if we need a reliable approximation of the population mean within, say, 3% of the actual value, how large a sample should be taken?

To answer these questions, we need to introduce some precise mathematical terminology. There are basically two ways of conducting sampling: sampling with replacement and sampling without replacement. In SAMPLING WITH REPLACEMENT, the same member of the population can be included in the sample more than once; in SAMPLING WITHOUT REPLACEMENT, each member of the population can be included in the sample at most once.

To understand the origin of this terminology, consider the following situation: Suppose you have an urn with a large number of balls in it. Each ball is either black or white. You want to sample ten balls to get an idea of the percentage of black balls in the urn. In sampling with replacement, you repeat the following procedure 10 times: You randomly pick a ball from the urn, record its color, and put it back into the urn ("replace it"). In sampling without replacement, you repeat the following procedure 10 times: You randomly pick a ball from the urn, record its color, and do NOT put it back into the urn (you put it onto a pile on the side, for example).

If the sample size is very small relative to the population size, it makes little practical difference whether you sample with or without replacement; but if the sample size is of the same order of magnitude as the population size, it does matter which of the two methods is used.
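The urn procedure can be tried out on a computer. The following is only a minimal sketch: the urn contents (30 black and 70 white balls) and the fixed random seed are assumptions made for illustration and do not come from the tutorial.

```python
import random

# Hypothetical urn: 30 black and 70 white balls (numbers chosen for illustration).
urn = ["black"] * 30 + ["white"] * 70

rng = random.Random(0)  # fixed seed so that the run is repeatable

# With replacement: every draw sees the full urn, so a ball can be picked again.
with_replacement = rng.choices(urn, k=10)

# Without replacement: a drawn ball is set aside and cannot be picked again.
without_replacement = rng.sample(urn, k=10)

# Estimated share of black balls from each sample of ten.
share_with = with_replacement.count("black") / 10
share_without = without_replacement.count("black") / 10
print(share_with, share_without)
```

Both estimates will usually differ somewhat from the true share of 0.3, which is exactly the kind of sampling variation discussed in the rest of this tutorial.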

Suppose the population has size 1000. How many different samples of size 5 can be taken if the sampling is done without replacement?

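Answer: In sampling without replacement, the samples are combinations of elements of the population, so the number of different samples of size 5 is the number of 5-element combinations that can be chosen from 1000 elements.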
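The number of combinations can be evaluated directly. The short sketch below assumes that samples are counted without regard to the order in which members are drawn; the variable names are illustrative.

```python
import math

population_size = 1000
sample_size = 5

# Unordered samples without replacement are combinations: N! / (n! (N - n)!).
n_samples = math.comb(population_size, sample_size)
print(n_samples)  # 8250291250200
```

Even for a sample of only five members there are more than eight trillion possible samples, which is why the set of all samples is treated as a probability space rather than listed explicitly.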

Sampling has a variety of uses, but we are only concerned with sampling as a tool to estimate the mean of a random variable X. The population on which X is defined is often called the PARENT POPULATION. The mean of X on the parent population is called the POPULATION MEAN and is denoted by m; the standard deviation of X on the parent population is called the POPULATION STANDARD DEVIATION and is denoted by s.

Now let us suppose that samples of size n are taken. Each such sample has a SAMPLE MEAN that usually differs from sample to sample and from the population mean m. Thus the sample mean is a new random variable. It is denoted by X with a bar on top (for typographical reasons, in this tutorial we will use X' instead). Note that the probability space on which X' is defined is not the parent population, but the set of all possible samples of size n. The mean of this new random variable will be denoted by m_X', and its standard deviation by s_X'. The standard deviation of X' is often called the STANDARD ERROR, since it tells us how far off the sample mean typically is from the population mean that it is supposed to approximate.

If the sample size is n and the population size is N, then we have the following formulas for the mean of the sample mean and for the standard error:

m_X' = m,     no matter whether we sample with or without replacement;

s_X'^2 = s^2/n,    if the sampling is done with replacement;

s_X'^2 = (s^2/n) (N - n)/(N - 1),    if the sampling is done without replacement.

Note that these formulas nicely correspond to our intuitions about sampling:

The first tells us that if we average the sample means for all possible samples, then we get exactly the population mean. The denominator n in the formulas for the variance of the sample mean implies that the larger the sample size, the smaller this variance will be, that is, the closer a typical sample mean will be to the population mean. If the sample size n is very small relative to the population size N, then the factor (N - n)/(N - 1) is close to 1, so the difference between sampling with and without replacement will be negligible.
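For a very small population, these statements can be checked by listing every possible sample. The sketch below is an illustration under stated assumptions: the six population values are made up, the population variance is taken with divisor N, and the without-replacement case is compared against the standard finite population correction factor (N - n)/(N - 1).

```python
from itertools import combinations, product
from statistics import fmean, pvariance

population = [1, 2, 3, 4, 5, 6]   # hypothetical values of X, one per member
N, n = len(population), 2

m = fmean(population)             # population mean
var = pvariance(population)       # population variance s^2 (divisor N)

# With replacement: every ordered n-tuple of members is equally likely.
means_with = [fmean(sample) for sample in product(population, repeat=n)]
# Without replacement: every n-element subset is equally likely.
means_without = [fmean(sample) for sample in combinations(population, n)]

# The average of all sample means equals the population mean in both cases.
print(fmean(means_with), fmean(means_without), m)
# Variance of the sample mean with replacement versus s^2 / n.
print(pvariance(means_with), var / n)
# Variance without replacement versus s^2 / n times (N - n) / (N - 1).
print(pvariance(means_without), var / n * (N - n) / (N - 1))
```

Each printed line should show matching numbers, up to floating-point rounding.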

The amount of liquid in a large Coke bottle (in liters) is a random variable with mean 2 and standard deviation 0.05. Out of a consignment of 1000 Coke bottles, 16 are sampled for the amount of liquid they contain. What is the standard deviation of the sample mean?

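Answer: The standard deviation of the sample mean is s divided by the square root of n, that is, 0.05/4 = 0.0125.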
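The standard error for the Coke bottle example can be computed in a few lines. This is a sketch only: it treats 0.05 as the standard deviation of the consignment of 1000 bottles, and it compares the with-replacement value against the value obtained with the standard finite population correction factor (N - n)/(N - 1) for sampling without replacement.

```python
import math

s = 0.05   # population standard deviation (liters)
n = 16     # sample size
N = 1000   # size of the consignment

se_with = s / math.sqrt(n)                            # with replacement
se_without = se_with * math.sqrt((N - n) / (N - 1))   # without replacement

print(se_with)
print(round(se_without, 5))
```

The two values agree to about two significant digits, which illustrates that the sampling method matters little when the sample is small relative to the population.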

In the above example, the mean was known beforehand. In real applications, the mean is not known, and sampling is used to get an approximate idea of how large this unknown mean is. For example, sampling might be used for quality control of consignments of Coke bottles: If the sample mean is too low, the consignment should be rejected. But how low is "too low"? After all, a low sample mean could be purely a result of an unlucky choice of the sample, even if the mean for the consignment is exactly 2 liters, as it should be. So we want to be able to answer questions like: "If a sample of size 16 is drawn from a population with unknown mean, what is the probability that the sample mean is more than 0.1 units smaller than the population mean?"

In order to answer such questions, we need to know the probability distribution of the sample mean. The Central Limit Theorem tells us that if samples of size n are drawn from practically any population, and if n is large, then the sample mean is approximately normally distributed with mean m and standard deviation s divided by the square root of the sample size n, where m is the population mean and s is the population standard deviation.
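A small simulation can make the Central Limit Theorem plausible. The sketch below is not part of the original tutorial: the choice of a skewed exponential population (mean 1, standard deviation 1), the sample size of 100, the number of repetitions, and the random seed are all assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)  # fixed seed so that the run is repeatable

m, s = 1.0, 1.0         # mean and standard deviation of the exponential population
n = 100                 # sample size

def draw_sample_mean(size):
    """Draw one sample of the given size and return its mean."""
    return fmean([rng.expovariate(1.0) for _ in range(size)])

sample_means = [draw_sample_mean(n) for _ in range(5000)]

center = fmean(sample_means)    # should be close to m
spread = pstdev(sample_means)   # should be close to s / sqrt(n) = 0.1

# For a normal distribution, roughly 95% of values lie within two
# standard deviations of the mean.
standard_error = s / n ** 0.5
within_two = sum(abs(x - m) < 2 * standard_error for x in sample_means) / len(sample_means)
print(center, spread, within_two)
```

Although the population itself is far from normal, the simulated sample means cluster around m with a spread near s divided by the square root of n, as the theorem predicts.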

The amount of liquid in a large Coke bottle (in liters) is a random variable with mean 2 and standard deviation 0.05. Out of a consignment of 1000 Coke bottles, 16 are sampled for the amount of liquid they contain. What is the probability that the sample mean is less than 1.99?

Answer: Here m = 2 and s = 0.05. Let X' denote the sample mean. We want to find P(X' < 1.99). Since X' has an approximately normal distribution with a mean of 2 and a standard deviation of 0.05/4 = 0.0125, this probability can be expressed in terms of the standard normal random variable Z as P(Z < (1.99 - 2)/0.0125) = P(Z < -0.8), which is approximately 0.2119.
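The conversion to the standard normal variable Z can be sketched as follows. The code assumes the normal approximation for the sample mean and uses NormalDist from Python's standard library in place of a printed table of the normal distribution.

```python
from statistics import NormalDist

m, s, n = 2.0, 0.05, 16
standard_error = s / n ** 0.5      # 0.05 / 4 = 0.0125

z = (1.99 - m) / standard_error    # standardized cutoff
p = NormalDist().cdf(z)            # P(Z < z) for the standard normal Z
print(round(z, 2), round(p, 4))    # -0.8 0.2119
```

So under these assumptions, about one sample in five would show a mean below 1.99 liters even though the consignment is filled correctly on average.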

Suppose the mean number of miles driven by the owner of a Nissan in one year is 14,000, with a standard deviation of 1,000 miles. If 400 Nissan owners are polled about their mileage in 1999, what is the probability that the sample mean will be between 13,900 and 14,100 miles?

Answer: Here m = 14,000 and s = 1,000. Let X' denote the sample mean. We want to find P(13,900 < X' < 14,100). Since X' has an approximately normal distribution with a mean of 14,000 and a standard deviation of 1,000/20 = 50, this probability can be expressed in terms of the standard normal random variable Z as P(-2 < Z < 2), which is approximately 0.9545.
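The same probability can also be computed without standardizing by hand, by working with the approximate normal distribution of the sample mean in miles. This is a sketch under the normal approximation; the variable names are illustrative.

```python
from statistics import NormalDist

# Approximate distribution of the sample mean: mean 14,000, standard error 1,000 / 20 = 50.
sample_mean_dist = NormalDist(mu=14_000, sigma=1_000 / 400 ** 0.5)

# Probability that the sample mean falls between 13,900 and 14,100 miles.
p = sample_mean_dist.cdf(14_100) - sample_mean_dist.cdf(13_900)
print(round(p, 4))  # 0.9545
```

With 400 owners polled, the sample mean is therefore very likely to land within 100 miles of the population mean.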

Congratulations, you have successfully completed the last tutorial!

Good luck on the final!


© Winfried Just
Last modified November 14, 1999.

This tutorial was compiled by WJ TUTORIALMAKER 0.21, a program written by Winfried Just.