2022-03-26

Unbiased Variance as an Estimator of Population Variance

Population and Sample Variance

In statistics, variance is a measure of how dispersed the data points are within a dataset. It is an important concept when analyzing and comparing different datasets.

Population variance represents the true variance of the entire population. Ideally, when conducting research or analyzing data, we would have access to the entire population. However, in most cases, it is impossible or impractical to collect data from every individual in a population. This is where sample variance comes into play.

Sample variance is calculated using a subset of the population, known as a sample. The sample is used to make inferences about the entire population, including estimating the population variance. It is crucial to select a representative sample to ensure accurate estimations.

Estimating Population Variance

Formula for Population Variance

Population variance, denoted by \sigma^2, is a measure of dispersion in a dataset. It is calculated by taking the mean of squared differences between each data point and the population mean (\mu). The formula for population variance is:

\sigma^2 = \frac{\sum(x - \mu)^2}{N}

where x represents each data point, \mu is the population mean, and N is the population size.
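As a quick check of the formula, here is a minimal sketch that applies it to a small, made-up population of five values (the data are an assumption chosen only for illustration) and compares the result with `statistics.pvariance` from the Python standard library.

```python
import statistics

# Hypothetical population of N = 5 values, chosen only for illustration.
population = [2, 4, 4, 5, 10]

N = len(population)
mu = sum(population) / N                               # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N    # mean of squared deviations

print("mu =", mu, "sigma^2 =", sigma2)
# Cross-check against the standard library's population variance.
print("statistics.pvariance:", statistics.pvariance(population))
```

Here the mean is 5 and the squared deviations sum to 36, so the population variance is 36/5 = 7.2.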

Formula for Sample Variance

Sample variance, denoted by s^2, is calculated in much the same way as population variance. The main difference is that the sample mean (\bar{x}) is used instead of the population mean, and the sample size (n) is used instead of the population size:

s^2 = \frac{\sum(x - \bar{x})^2}{n}

Since this uncorrected sample variance tends to underestimate the population variance, we need to apply a correction to obtain an unbiased estimator for the population variance. This correction is known as Bessel's correction, and it involves using n-1 in the denominator instead of n. The symbol s^2 is normally reserved for this corrected form:

s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}

By using this corrected formula for sample variance, we obtain an estimate whose expected value equals the population variance, that is, an unbiased estimate.
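The two denominators can be compared side by side. The sketch below uses a small made-up sample (an assumption for illustration) and checks the hand-computed values against NumPy, where `ddof=0` divides by n and `ddof=1` divides by n-1.

```python
import numpy as np

# Hypothetical sample of n = 4 observations, chosen only for illustration.
sample = np.array([4.0, 7.0, 13.0, 16.0])
n = sample.size
x_bar = sample.mean()
ss = np.sum((sample - x_bar) ** 2)    # sum of squared deviations from the sample mean

var_n = ss / n                        # divide by n (uncorrected)
var_n_minus_1 = ss / (n - 1)          # divide by n - 1 (Bessel's correction)

# np.var exposes the same choice through its ddof argument.
assert np.isclose(var_n, np.var(sample, ddof=0))
assert np.isclose(var_n_minus_1, np.var(sample, ddof=1))
print(var_n, var_n_minus_1)
```

For this sample the sum of squared deviations is 90, so the uncorrected value is 22.5 and the corrected value is 30. Note that `np.var` defaults to `ddof=0`, so the corrected form has to be requested explicitly.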

Mathematical Foundation for Using n-1

This section explores the mathematical foundation for using n-1 in the denominator of the sample variance formula. This adjustment helps provide an unbiased estimate of population variance by accounting for the loss of one degree of freedom due to the use of the sample mean.

Mathematical Expectation and Unbiased Estimators

In statistics, an estimator is a function that calculates an estimate of a population parameter based on a sample. An unbiased estimator is one that, on average, provides an accurate estimate of the population parameter, meaning its expected value is equal to the true parameter value. Mathematically, for an unbiased estimator \hat{\theta} of a population parameter \theta, we have:

E[\hat{\theta}] = \theta

The sample variance, calculated using n in the denominator, is a biased estimator for the population variance. To see why, consider the following:

Let X_1, X_2, ..., X_n be a random sample of independent, identically distributed observations from a population with a mean of \mu and variance of \sigma^2. The biased sample variance, S_n^2, can be defined as:

S_n^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}

where \bar{X} is the sample mean. To show that this is a biased estimator, we need to calculate the expectation of S_n^2:

E[S_n^2] = E\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}\right]

After some algebraic manipulations (which are beyond the scope of this summary), it can be shown that:

E[S_n^2] = \frac{n-1}{n}\sigma^2
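For readers who want to see the omitted steps, here is a brief sketch. It assumes the X_i are independent and identically distributed with finite variance. Independence is what gives \mathrm{Var}(\bar{X}) = \sigma^2 / n.

```latex
\begin{aligned}
\sum_{i=1}^{n}(X_i - \bar{X})^2
  &= \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2 \\
E\left[(X_i - \mu)^2\right] &= \sigma^2,
  \qquad
  E\left[(\bar{X} - \mu)^2\right] = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \\
E[S_n^2]
  &= \frac{1}{n}\left(n\sigma^2 - n \cdot \frac{\sigma^2}{n}\right)
   = \frac{n-1}{n}\sigma^2
\end{aligned}
```

The first line follows from writing X_i - \mu = (X_i - \bar{X}) + (\bar{X} - \mu) and expanding the square. The cross term vanishes because the deviations X_i - \bar{X} sum to zero.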

Thus, the expected value of the biased sample variance is smaller than the true population variance by a factor of (n-1)/n, which is why it is a biased estimator.

Deriving the Unbiased Sample Variance Formula

To derive an unbiased estimator for population variance, we must adjust the sample variance formula by using n-1 in the denominator instead of n. The unbiased sample variance, S_{n-1}^2, is defined as:

S_{n-1}^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}

By calculating the expectation of this new estimator, we can show that it is unbiased:

E[S_{n-1}^2] = E\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}\right]

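Because S_{n-1}^2 = \frac{n}{n-1} S_n^2, the earlier result together with linearity of expectation gives E[S_{n-1}^2] = \frac{n}{n-1} \cdot \frac{n-1}{n}\sigma^2, so we find that: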

E[S_{n-1}^2] = \sigma^2

This result confirms that the unbiased sample variance formula, which uses n-1 in the denominator, provides an accurate estimate of the true population variance on average.
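Both expectations can also be checked exactly, without simulation, by averaging over every possible sample from a tiny population. The sketch below assumes a made-up population of three equally likely values and samples of size 3 drawn with replacement. It uses exact fractions so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population of three equally likely values (illustration only).
values = [1, 2, 6]
n = 3  # sample size

N = len(values)
mu = Fraction(sum(values), N)
sigma2 = sum((v - mu) ** 2 for v in values) / N    # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)

# Every ordered sample drawn with replacement is equally likely,
# so each expectation is a plain average over all N**n samples.
samples = list(product(values, repeat=n))
expected_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
expected_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

assert expected_biased == Fraction(n - 1, n) * sigma2
assert expected_unbiased == sigma2
print(sigma2, expected_biased, expected_unbiased)
```

For these values the population variance is 14/3. The n-denominator estimator averages to 28/9, which is 2/3 of it, and the n-1 version averages to exactly 14/3.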

Why Sample Variance Tends to Be Smaller Than Population Variance

Sample variance has a tendency to underestimate the population variance. The reason is that the sample mean, \bar{x}, is calculated from the same data points used to compute the sample variance, and \bar{x} is exactly the value that minimizes the sum of squared differences for that sample. Consequently, the squared differences from the sample mean are, in total, never larger than the squared differences from the population mean \mu, and are usually smaller, resulting in a smaller variance value. This systematic underestimation is referred to as bias.
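The size of the shortfall can be made concrete for a single sample. The sketch below assumes a normal population with a known mean of 50, and a seed chosen arbitrarily so the run is repeatable. It checks the identity \sum(x - \mu)^2 = \sum(x - \bar{x})^2 + n(\bar{x} - \mu)^2, whose last term is never negative.

```python
import random

random.seed(0)   # arbitrary fixed seed so the sketch is repeatable
mu = 50.0        # population mean, assumed known here for comparison
sample = [random.gauss(mu, 10.0) for _ in range(30)]

n = len(sample)
x_bar = sum(sample) / n
ss_about_sample_mean = sum((x - x_bar) ** 2 for x in sample)
ss_about_pop_mean = sum((x - mu) ** 2 for x in sample)

# The gap between the two sums is n * (x_bar - mu)^2, which is >= 0.
gap = n * (x_bar - mu) ** 2
assert abs(ss_about_pop_mean - (ss_about_sample_mean + gap)) < 1e-6
assert ss_about_sample_mean <= ss_about_pop_mean

print("about sample mean:    ", ss_about_sample_mean)
print("about population mean:", ss_about_pop_mean)
```

Dividing the smaller sum by n therefore gives a value that is too low on average. Dividing by n-1 compensates for the missing term.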

Here's a Python example that demonstrates the bias in sample variance:

```python
import numpy as np

population = np.random.normal(50, 10, 10000)  # Simulate a population with mean=50 and std_dev=10
biased_variances = []

for _ in range(1000):
    sample = np.random.choice(population, 30)  # Draw a sample of size 30
    sample_variance = np.var(sample, ddof=0)  # Compute biased sample variance
    biased_variances.append(sample_variance)

mean_biased_variance = np.mean(biased_variances)
population_variance = np.var(population)

print("Mean Biased Variance:", mean_biased_variance)
print("Population Variance:", population_variance)
```

Output:

```
Mean Biased Variance: 97.12419679509834
Population Variance: 99.90071273636632
```

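This example shows that, on average, the biased sample variance underestimates the true population variance. The ratio of the two printed values is about 0.972, close to the theoretical factor (n-1)/n = 29/30 \approx 0.967 for samples of size 30.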

Unbiased Variance as an Estimator of Population Variance

To obtain an unbiased estimator for the population variance, Bessel's correction is applied to the sample variance formula. This correction involves using n-1 in the denominator instead of n, which compensates for the bias introduced by using the sample mean instead of the population mean:

s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}

By using this corrected formula for sample variance, we obtain an unbiased estimate of the population variance. This adjustment matters most when working with small sample sizes, because the bias factor (n-1)/n is furthest from 1 when n is small.
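To see how quickly the bias shrinks, the sketch below tabulates the factor (n-1)/n for a few sample sizes. The particular sizes are an arbitrary choice for illustration.

```python
# Expected value of the uncorrected estimator as a fraction of sigma^2,
# i.e. the factor (n - 1) / n, for a few illustrative sample sizes.
sample_sizes = [2, 5, 10, 30, 100, 1000]
factors = {n: (n - 1) / n for n in sample_sizes}

for n, factor in factors.items():
    # factor - 1 is the relative bias of the uncorrected estimator
    print(f"n={n:>4}  (n-1)/n = {factor:.4f}  relative bias = {factor - 1:+.2%}")
```

With n = 2 the uncorrected estimator is too low by 50% on average. With n = 30 the shortfall is about 3.3%, and with n = 1000 it is 0.1%.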

To further illustrate the effectiveness of Bessel's correction, let's modify the Python example from the previous section to compute the unbiased sample variances:

```python
import numpy as np

population = np.random.normal(50, 10, 10000)  # Simulate a population with mean=50 and std_dev=10
unbiased_variances = []

for _ in range(1000):
    sample = np.random.choice(population, 30)  # Draw a sample of size 30
    sample_variance = np.var(sample, ddof=1)  # Compute unbiased sample variance
    unbiased_variances.append(sample_variance)

mean_unbiased_variance = np.mean(unbiased_variances)
population_variance = np.var(population)

print("Mean Unbiased Variance:", mean_unbiased_variance)
print("Population Variance:", population_variance)
```

Output:

```
Mean Unbiased Variance: 101.69170154048732
Population Variance: 101.5789887073244
```

As demonstrated by this example, the mean of the unbiased sample variances (about 101.69) lies very close to the population variance (about 101.58). By using n-1 instead of n in the denominator, we can effectively correct for the bias introduced by using the sample mean in our calculations.
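The two examples above each generate their own unseeded population, which is why their printed population variances differ (about 99.90 and 101.58) and the two runs cannot be compared directly. As a variation, the sketch below applies both estimators to the same samples from one population. The seed and the vectorized sampling are assumptions of this sketch, not part of the original examples.

```python
import numpy as np

rng = np.random.default_rng(42)            # arbitrary seed, for repeatability only
population = rng.normal(50, 10, 10_000)    # same setup as above: mean=50, std_dev=10
n, trials = 30, 1000

# Draw all samples at once: one row per trial, sampling with replacement.
samples = rng.choice(population, size=(trials, n), replace=True)
mean_biased = np.var(samples, axis=1, ddof=0).mean()
mean_unbiased = np.var(samples, axis=1, ddof=1).mean()

print("Population variance:   ", np.var(population))
print("Mean biased (ddof=0):  ", mean_biased)
print("Mean unbiased (ddof=1):", mean_unbiased)
```

Because both estimators see identical samples, their means differ by exactly the factor (n-1)/n. Any remaining gap between the unbiased mean and the population variance is sampling noise from the finite number of trials.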

Ryusei Kakujo

