
n-1 misunderstandings in the calculation of empirical variance and standard deviation - bias-corrected variance estimation

In statistics, variance is the expectation of the squared deviation of a random variable from its mean. It measures how far a set of (random) numbers is spread out from its average value. The variance therefore has a central role in statistics, e.g. in descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. In short, variance is an important tool in all sciences where statistical analysis of data is common.

When we talk about empirical variance, we have to consider which convention or definition applies in the given context. Neither the names of the definitions nor the corresponding notation are used consistently in the literature. This often leads to misunderstandings and communication mistakes!

In general, there are two ways to compute the empirical variance. If the entire population is known, it is computed from the whole population with the denominator "n". If that is not possible, we estimate it from a random sample drawn from the population and use the bias-corrected denominator "n-1" (the degrees of freedom).
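Written out, the two conventions differ only in the denominator. The symbols below are an assumption made for illustration, since the text does not fix a notation: x_1, ..., x_n are the observations, \bar{x} is their mean, \sigma_n^2 is the variance with denominator n, and s^2 is the bias-corrected variance.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n}{n-1}\,\sigma_n^2
```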

The same applies to the square root of the variance, the standard deviation. Here we can also distinguish a population standard deviation (the standard deviation of the entire population) from an estimated sample standard deviation (the standard deviation of a sample). As above, the population standard deviation formula has the denominator n instead of n-1.
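In R, the built-in functions var() and sd() use the denominator n-1. The following is a minimal sketch of the two conventions side by side. The helper names pop_var and pop_sd and the data vector are made up for illustration and are not part of base R.

```r
# Hypothetical example data (not taken from the article)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Population convention: denominator n (helper names are illustrative)
pop_var <- function(x) mean((x - mean(x))^2)
pop_sd  <- function(x) sqrt(pop_var(x))

# Base R convention: var() and sd() divide by n - 1
pop_var(x)   # denominator n
var(x)       # denominator n - 1
pop_sd(x)    # square root of the denominator-n variance
sd(x)        # square root of the denominator-(n - 1) variance

# The two variances differ exactly by the factor n / (n - 1)
all.equal(var(x), pop_var(x) * n / (n - 1))
```

The last line should return TRUE for any numeric vector with at least two values.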

Because measurements can rarely be taken for an entire population, statistical software packages like SPSS calculate the sample standard deviation by default. The same applies to journal articles, where we should assume that a sample standard deviation is reported unless explicitly stated otherwise.

I have often been asked how fundamental the difference is from a mathematical perspective. Does it really make a big difference whether the variance or standard deviation is calculated with n or with n-1 in the denominator?

To get a better feeling for how fundamental the problem is, I tried to visualize it with the following graphs:

The figure shows that we underestimate the variance, especially when we estimate it for the total population from small samples. For example, if we have a sample of 5 participants and calculate the variance without the bias-corrected denominator, we underestimate the population variance by about 20% on average, and with n=10 by about 10%. That is not negligible!
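These percentages follow from the expected value of the uncorrected estimator, which is (n-1)/n times the population variance, so the relative underestimation is 1/n. A short sketch that tabulates this for a few sample sizes; the chosen sizes and variable names are illustrative.

```r
# Sample sizes to inspect (chosen for illustration)
n <- c(2, 5, 10, 30, 100, 1000)

# Expected value of the uncorrected variance relative to the true variance
ratio <- (n - 1) / n

# Relative underestimation in percent, equal to 100 / n
underestimation_pct <- 100 * (1 - ratio)

data.frame(n = n, ratio = ratio, underestimation_pct = underestimation_pct)
```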

We can also see that the deviation shrinks as the sample size grows. As sample size increases, the amount of bias decreases: we obtain more information, and the difference between the two calculations becomes smaller. Once the sample is larger than 100, the deviation is below 1% and the bias problem is reduced to a formality.

The same applies to the standard deviation. Of course, the relative difference is smaller here, since it's the square root of the variance:
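The uncorrected standard deviation is smaller than the corrected one by the factor sqrt((n-1)/n). The sketch below tabulates that relative difference for a few illustrative sample sizes. It only compares the two formulas with each other and says nothing about the bias that remains in the corrected standard deviation.

```r
# Sample sizes to inspect (chosen for illustration)
n <- c(2, 5, 10, 30, 100, 1000)

# Ratio of uncorrected to corrected standard deviation
sd_ratio <- sqrt((n - 1) / n)

# Relative difference in percent
sd_diff_pct <- 100 * (1 - sd_ratio)

round(data.frame(n = n, sd_ratio = sd_ratio, sd_diff_pct = sd_diff_pct), 3)
```

For n=5 the difference is roughly 10.6%, about half of the 20% seen for the variance.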

In conclusion, the variance and standard deviation of an entire population should always be computed with the denominator n when every member of the population has been measured. When that cannot be done, the variance and standard deviation have to be estimated from a random sample taken from the population, using the correction n-1.

However, unlike in the case of estimating the population mean, for which the sample mean is a simple estimator with unbiased, efficient, maximum likelihood properties, there is no single estimator for the variance or the standard deviation that has all of these properties. Unbiased estimation of the standard deviation in particular is a technically involved problem.

Calculations are almost always carried out with the denominator n-1. This has become more or less a standard, and often it is the right choice. In some cases, however, other estimators are better. For normally distributed variables, for example, the denominator n-1.5 removes most of the bias when estimating the standard deviation, and the denominator n+1 minimizes the mean squared error of the variance estimate.
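For normally distributed data, both statements can be checked without simulation. The sketch below relies on two standard results: the expected value of the n-1 standard deviation is c4(n) times the true standard deviation, and the sum of squared deviations divided by the true variance follows a chi-squared distribution with n-1 degrees of freedom. The function names and the choice n = 10 are illustrative.

```r
# c4(n): expected value of the (n - 1)-based standard deviation divided by
# the true sigma, for normal data
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

# Expected value (in units of sigma) of sqrt(SS / d), where SS is the sum of
# squared deviations from the sample mean and d is the chosen denominator
expected_sd <- function(n, d) c4(n) * sqrt((n - 1) / d)

# MSE (in units of sigma^4) of the variance estimator SS / d for normal data:
# variance 2 * (n - 1) / d^2 plus squared bias ((n - 1) / d - 1)^2
mse_var <- function(n, d) 2 * (n - 1) / d^2 + ((n - 1) / d - 1)^2

n <- 10
expected_sd(n, n)        # denominator n
expected_sd(n, n - 1)    # denominator n - 1: still below 1
expected_sd(n, n - 1.5)  # denominator n - 1.5: very close to 1

# Denominator on a small grid that minimizes the MSE of the variance estimate
d_grid <- seq(n - 2, n + 4, by = 0.5)
d_grid[which.min(mse_var(n, d_grid))]
```

Because gamma() overflows for large arguments, this sketch is only meant for small to moderate n.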

For post hoc corrections of uncorrected variances and standard deviations I wrote a simple R-script:

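n <- 3 # Enter sample size

varvalue <- .672 # Enter uncorrected (denominator n) variance or standard deviation

corr <- 1 # Enter correction value (e.g. "1" for n-1)

is_sd <- TRUE # TRUE if varvalue is a standard deviation, FALSE if it is a variance

eq1 <- n/(n-corr) # Correction factor for a variance

eq2 <- sqrt(eq1) # Correction factor for a standard deviation

result <- varvalue * (if (is_sd) eq2 else eq1) # Apply the matching factor

result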

It is also possible to "remove" a correction post hoc by inverting the factor, i.e. by moving the correction value (named "corr") from the denominator to the numerator:

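n <- 3 # Enter sample size

varvalue <- .823 # Enter corrected (sample) variance or standard deviation

corr <- 1 # Enter correction value (e.g. "1" for n-1)

is_sd <- TRUE # TRUE if varvalue is a standard deviation, FALSE if it is a variance

eq1 <- (n-corr)/n # Inverse correction factor for a variance

eq2 <- sqrt(eq1) # Inverse correction factor for a standard deviation

result <- varvalue * (if (is_sd) eq2 else eq1) # Apply the matching factor

result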