Population and Sample Variance
In statistics, variance is a measure of how dispersed the data points are within a dataset. It is an important concept when analyzing and comparing different datasets.
Population variance represents the true variance of the entire population. Ideally, when conducting research or analyzing data, we would have access to the entire population. However, in most cases, it is impossible or impractical to collect data from every individual in a population. This is where sample variance comes into play.
Sample variance is calculated using a subset of the population, known as a sample. The sample is used to make inferences about the entire population, including estimating the population variance. It is crucial to select a representative sample to ensure accurate estimations.
Estimating Population Variance
Formula for Population Variance
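Population variance, denoted by $\sigma^2$, is calculated using the following formula:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
where $N$ is the population size, $x_i$ is the $i$-th data point, and $\mu$ is the population mean.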
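As a minimal sketch of the definition (the average squared deviation from the population mean), the snippet below computes the population variance by hand for a small, made-up dataset and compares it with `np.var`, whose default `ddof=0` divides by the number of data points. The helper name `population_variance` and the data values are illustrative assumptions, not part of the original text.

```python
import numpy as np


def population_variance(values):
    """Average squared deviation from the mean, dividing by N (hypothetical helper)."""
    data = np.asarray(values, dtype=float)
    mu = data.mean()  # population mean
    return np.sum((data - mu) ** 2) / data.size


# Made-up population of eight values, for illustration only
population = [2, 4, 4, 4, 5, 5, 7, 9]
pop_var = population_variance(population)

print(pop_var)             # 4.0
print(np.var(population))  # 4.0 (np.var divides by N when ddof=0, the default)
```

The mean of these values is 5 and the squared deviations sum to 32, so dividing by the 8 data points gives 4.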
Formula for Sample Variance
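Sample variance is computed from the $n$ observations in a sample. The uncorrected form, denoted here by $s_n^2$, divides the sum of squared deviations from the sample mean by $n$:
$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $n$ is the sample size, $x_i$ is the $i$-th observation, and $\bar{x}$ is the sample mean.
Since this uncorrected sample variance tends to underestimate the population variance, we need to apply a correction factor to obtain an unbiased estimator for the population variance. This correction is known as Bessel's correction, and it involves using $n-1$ instead of $n$ in the denominator:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$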
By using this corrected formula for sample variance, we can obtain a more accurate and unbiased estimate of the population variance.
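To make the difference between the two denominators concrete, here is a small sketch that computes the sample variance both ways for a made-up sample. The helper `sample_variance` is hypothetical; its `ddof` argument is chosen to mirror the meaning of the `ddof` parameter of `np.var`.

```python
import numpy as np


def sample_variance(values, ddof=1):
    """Sum of squared deviations from the sample mean, divided by (n - ddof).

    ddof=0 gives the uncorrected form; ddof=1 applies Bessel's correction.
    Hypothetical helper, written for illustration only.
    """
    data = np.asarray(values, dtype=float)
    squared_deviations = (data - data.mean()) ** 2
    return squared_deviations.sum() / (data.size - ddof)


sample = [4, 7, 9, 10, 15]  # made-up sample, for illustration only
uncorrected = sample_variance(sample, ddof=0)
corrected = sample_variance(sample, ddof=1)

print(uncorrected, corrected)  # 13.2 16.5
```

The squared deviations sum to 66 in both cases; only the divisor changes (5 versus 4), so the corrected value is always the larger of the two.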
Mathematical Foundation for Using n-1
We explore the mathematical foundation for using $n-1$ instead of $n$ in the denominator of the sample variance formula.
Mathematical Expectation and Unbiased Estimators
In statistics, an estimator is a function that calculates an estimate of a population parameter based on a sample. An unbiased estimator is one whose expected value is equal to the true parameter value: individual estimates may be too high or too low, but they are correct on average. Mathematically, for an unbiased estimator $\hat{\theta}$ of a parameter $\theta$:
$$E[\hat{\theta}] = \theta$$
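The sample variance, calculated using $n$ as the denominator, is a biased estimator of the population variance. To see this, we compute its expected value.
Let $s_n^2$ denote the sample variance with $n$ in the denominator:
$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $x_1, \dots, x_n$ are the sample observations and $\bar{x}$ is the sample mean.
After some algebraic manipulations (which are beyond the scope of this summary), it can be shown that:
$$E[s_n^2] = \frac{n-1}{n}\sigma^2$$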
Thus, the expected value of the biased sample variance is smaller than the true population variance, indicating that it is a biased estimator.
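For readers who want the omitted steps, the algebra can be sketched as follows. The sketch assumes (this is not stated explicitly above) that the observations $x_1, \dots, x_n$ are independent draws from a population with mean $\mu$ and finite variance $\sigma^2$, and it writes $s_n^2$ for the sample variance with $n$ in the denominator.

```latex
\begin{align*}
\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \sum_{i=1}^{n}(x_i-\mu)^2 - n(\bar{x}-\mu)^2, \\
E\!\left[\sum_{i=1}^{n}(x_i-\mu)^2\right] &= n\sigma^2, \qquad
E\!\left[n(\bar{x}-\mu)^2\right] = n\,\mathrm{Var}(\bar{x})
  = n\cdot\frac{\sigma^2}{n} = \sigma^2, \\
E\!\left[s_n^2\right]
  &= \frac{1}{n}\left(n\sigma^2-\sigma^2\right) = \frac{n-1}{n}\,\sigma^2 .
\end{align*}
```

The first line comes from writing $x_i - \bar{x} = (x_i - \mu) - (\bar{x} - \mu)$ and expanding the square. The step $\mathrm{Var}(\bar{x}) = \sigma^2/n$ is where the independence assumption is used.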
Deriving the Unbiased Sample Variance Formula
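To derive an unbiased estimator for population variance, we must adjust the sample variance formula by using $n-1$ instead of $n$ as the denominator:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
By calculating the expectation of this new estimator, we can show that it is unbiased:
$$E[s^2] = \frac{1}{n-1}E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]$$
After similar algebraic manipulations as before, we find that:
$$E[s^2] = \frac{1}{n-1}\cdot(n-1)\sigma^2 = \sigma^2$$
This result confirms that the unbiased sample variance formula, which uses $n-1$ as the denominator, is an unbiased estimator of the population variance.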
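The result can also be checked exactly, without simulation noise, on a tiny example. The sketch below assumes samples are drawn with replacement from a small, made-up population, enumerates every possible ordered sample, and averages each estimator using exact fractions. The population, the sample size, and the helper name `variance` are illustrative choices.

```python
from fractions import Fraction
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations from the mean of `values`, divided by (len - ddof)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((Fraction(v) - mean) ** 2 for v in values) / (n - ddof)


population = [1, 2, 3, 4, 5, 6]  # made-up population (faces of a die)
sample_size = 3

pop_var = variance(population, ddof=0)

# Every ordered sample drawn with replacement is equally likely, so a plain
# average over all of them is the exact expected value of each estimator.
samples = list(product(population, repeat=sample_size))
mean_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)
mean_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(pop_var, mean_biased, mean_unbiased)  # 35/12 35/18 35/12
```

The average of the corrected estimator equals the population variance exactly, while the uncorrected one comes out at $(n-1)/n = 2/3$ of it. If samples were drawn without replacement from a finite population, the observations would no longer be independent and this exact equality would not hold in the same form.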
Why Sample Variance Tends to Be Smaller Than Population Variance
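Sample variance has a tendency to underestimate the population variance. This is due to the fact that the sample mean, $\bar{x}$, is calculated from the same data points whose spread is being measured. Of all possible reference values $c$, the sample mean is the one that minimizes the sum of squared deviations $\sum_{i=1}^{n}(x_i - c)^2$, so the deviations measured from $\bar{x}$ are, in total, never larger than the deviations measured from the true population mean $\mu$. Dividing that smaller sum by $n$ therefore gives a value that is, on average, below the population variance.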
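This effect can be checked on a single sample. The sketch below draws one sample from a normal distribution with a known mean (the seed, sample size, and distribution parameters are arbitrary choices for illustration) and compares the sum of squared deviations about the sample mean with the sum about the true mean. The two differ by exactly $n(\bar{x} - \mu)^2$, so the first can never exceed the second.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
true_mean, true_std, n = 50.0, 10.0, 30   # illustrative parameters
sample = rng.normal(true_mean, true_std, n)
sample_mean = sample.mean()

ss_about_sample_mean = np.sum((sample - sample_mean) ** 2)
ss_about_true_mean = np.sum((sample - true_mean) ** 2)

# Identity: sum (x_i - mu)^2 = sum (x_i - x_bar)^2 + n * (x_bar - mu)^2
gap = n * (sample_mean - true_mean) ** 2

print(ss_about_sample_mean <= ss_about_true_mean)                   # True
print(np.isclose(ss_about_true_mean, ss_about_sample_mean + gap))  # True
```

Because the gap term is a square, it is never negative, which is why the shortfall always goes in the same direction.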
Here's a Python example that demonstrates the bias in sample variance:
import numpy as np

# Simulate a population with mean=50 and std_dev=10
population = np.random.normal(50, 10, 10000)

biased_variances = []
for _ in range(1000):
    sample = np.random.choice(population, 30)  # Draw a sample of size 30 (with replacement)
    sample_variance = np.var(sample, ddof=0)   # Compute biased sample variance (divide by n)
    biased_variances.append(sample_variance)

mean_biased_variance = np.mean(biased_variances)
population_variance = np.var(population)  # ddof=0 by default, i.e. divide by N

print("Mean Biased Variance:", mean_biased_variance)
print("Population Variance:", population_variance)

Output:
Mean Biased Variance: 97.12419679509834
Population Variance: 99.90071273636632
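This example shows that, on average, the biased sample variance underestimates the true population variance. With $n = 30$, theory predicts a ratio of $(n-1)/n = 29/30 \approx 0.967$ between the two; the ratio in the run shown, about 0.972, is close to that value.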
Unbiased Variance as an Estimator of Population Variance
To obtain an unbiased estimator for the population variance, Bessel's correction is applied to the sample variance formula. This correction involves using $n-1$ instead of $n$ in the denominator:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
By using this corrected formula for sample variance, we can obtain a more accurate and unbiased estimate of the population variance. This adjustment is particularly important when working with small sample sizes, where the bias can be more pronounced.
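To see how the size of the bias depends on the sample size, recall that for independent observations the expected value of the uncorrected sample variance is $(n-1)/n$ times the population variance. The short sketch below tabulates that factor for a few arbitrarily chosen sample sizes; the helper name `bias_factor` is hypothetical.

```python
def bias_factor(n):
    """Ratio of E[uncorrected sample variance] to the population variance."""
    return (n - 1) / n


for n in (2, 5, 10, 30, 100, 1000):
    shortfall = 1 - bias_factor(n)  # fraction by which the variance is underestimated
    print(f"n={n:>4}: factor={bias_factor(n):.3f}, shortfall={shortfall:.1%}")
```

With two observations the uncorrected formula recovers only half of the population variance on average, whereas with a thousand observations the shortfall is about a tenth of a percent.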
To further illustrate the effectiveness of Bessel's correction, let's modify the Python example from the previous section to compute the unbiased sample variances:
import numpy as np

# Simulate a population with mean=50 and std_dev=10
population = np.random.normal(50, 10, 10000)

unbiased_variances = []
for _ in range(1000):
    sample = np.random.choice(population, 30)  # Draw a sample of size 30 (with replacement)
    sample_variance = np.var(sample, ddof=1)   # Compute unbiased sample variance (divide by n - 1)
    unbiased_variances.append(sample_variance)

mean_unbiased_variance = np.mean(unbiased_variances)
population_variance = np.var(population)  # ddof=0 by default, i.e. divide by N

print("Mean Unbiased Variance:", mean_unbiased_variance)
print("Population Variance:", population_variance)

Output:
Mean Unbiased Variance: 101.69170154048732
Population Variance: 101.5789887073244
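As demonstrated by this example, using the unbiased sample variance formula with Bessel's correction provides a more accurate estimate of the true population variance. By using $n-1$ instead of $n$ in the denominator, we remove the systematic underestimation that the uncorrected formula introduces.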