2022-11-22

Central limit theorem

What is the Central Limit Theorem

The central limit theorem states that, for a random sample of size n drawn from a population with mean \mu and finite variance \sigma^2, the distribution of the sample mean \overline{X}_n is approximately normal with mean \mu and variance \frac{\sigma^2}{n} when the sample size n is sufficiently large.
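For reference, the same statement can be written in symbols. This is the usual textbook form, and it assumes that the observations X_1, \dots, X_n are independent and identically distributed with \sigma^2 > 0, which the sentence above leaves implicit.

```latex
\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\bigl(\overline{X}_n - \mu\bigr)}{\sigma}
\;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad (n \to \infty),
\qquad\text{so that}\qquad
\overline{X}_n \approx \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)
\text{ for large } n.
```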

The remarkable point of this theorem is that the normal approximation holds regardless of the shape of the population distribution, as long as its variance is finite. Whatever the original probability distribution is, the distribution of the sample mean approaches a normal distribution as n grows.

It is important to note that it is the distribution of the sample mean that is approximately normal, not the distribution of the sample values themselves. The distribution of the sample mean is the distribution obtained when the process of drawing a sample from the population and computing its mean is repeated many times.
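The difference between the two distributions can be seen in a small sketch. The exponential population, the sample size n = 50, the repeat count, and the helper name skewness are all illustrative choices, not part of the original experiment. Skewness is 0 for a symmetric distribution such as the normal, so it serves as a rough measure of how far each distribution is from normal.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 50             # size of each sample (illustrative choice)
n_repeats = 20000  # how many times the sampling is repeated (illustrative choice)

# Each row is one sample of size n from an exponential population
# (mean 1, variance 1), which is strongly skewed.
samples = rng.exponential(scale=1.0, size=(n_repeats, n))


def skewness(x):
    """Sample skewness: third central moment divided by variance^(3/2)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


raw_skew = skewness(samples.ravel())  # the sample values themselves
sample_means = samples.mean(axis=1)   # one mean per sample
mean_skew = skewness(sample_means)    # the distribution of the sample mean

print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {mean_skew:.2f}")
print(f"variance of sample means: {sample_means.var():.4f} "
      f"(sigma^2/n = {1.0 / n:.4f})")
```

The raw values should stay clearly skewed however many are drawn, while the sample means should be much closer to symmetric, with a variance near \frac{\sigma^2}{n}.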

Check the Central Limit Theorem in Python

Dice example

Let's experiment with the distribution of the total value obtained when a die is thrown N (1, 2, 5, 10, 50, 100) times. A single throw follows a discrete uniform distribution, since each number from 1 to 6 appears with probability \frac{1}{6}. The following is the code.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline

sns.set()
sns.set_context(rc = {'patch.linewidth': 0.2})
sns.set_style('dark')

numIterations = np.asarray([1, 2, 5, 10, 50, 100])  # number of i.i.d. RVs
experiment = 'dice'  # valid values: 'dice', 'coins'
maxNumForExperiment = {'dice': 6, 'coins': 2}  # max numbers represented on dice or coins
nSamp = 100000  # number of repetitions of the experiment

k = maxNumForExperiment[experiment]

fig, fig_axes = plt.subplots(ncols=3, nrows=2, constrained_layout=True, figsize=(12, 8))

for i, N in enumerate(numIterations):
    # Throw N dice nSamp times and total each column (randint excludes the upper bound)
    y = np.random.randint(low=1, high=k + 1, size=(N, nSamp)).sum(axis=0)
    row, col = divmod(i, 3)
    bins = np.arange(start=min(y), stop=max(y) + 2, step=1)  # one bin per integer total
    fig_axes[row, col].hist(y, bins=bins, density=True)
    fig_axes[row, col].set_title('N={} {}'.format(N, experiment))
plt.show()

dice

As N increases (i.e., as more throws are added together), the distribution of the total value of the dice approaches a normal distribution. The total is N times the sample mean, so its distribution has the same shape as the distribution of the sample mean.
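As a rough numerical check of the histograms, the total of N fair dice should have mean N\mu and variance N\sigma^2, where \mu = \frac{7}{2} and \sigma^2 = \frac{35}{12} are the mean and variance of one die. The sketch below computes these two values exactly and compares them with simulated totals. It uses NumPy's Generator API with a fixed seed and a smaller number of repetitions than the code above, and the dictionary name results is illustrative.

```python
from fractions import Fraction

import numpy as np

# Exact mean and variance of a single fair die.
faces = range(1, 7)
p = Fraction(1, 6)
mu = sum(p * x for x in faces)                   # 7/2
sigma2 = sum(p * (x - mu) ** 2 for x in faces)   # 35/12

rng = np.random.default_rng(0)  # fixed seed for reproducibility
nSamp = 20000                   # fewer repetitions than above, to keep memory small

results = {}
for N in (1, 2, 5, 10, 50, 100):
    # rng.integers excludes the upper bound, so high=7 gives the faces 1..6.
    totals = rng.integers(low=1, high=7, size=(N, nSamp)).sum(axis=0)
    results[N] = (totals.mean(), totals.var())
    print(f"N={N:3d}  mean {results[N][0]:8.3f} (theory {float(N * mu):6.1f})  "
          f"var {results[N][1]:8.3f} (theory {float(N * sigma2):8.3f})")
```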

Next, let us experiment with the distribution of the mean value of a die thrown N (1, 2, 5, 10, 50, 100) times.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline

sns.set()
sns.set_context(rc = {'patch.linewidth': 0.2})
sns.set_style('dark')

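numIterations = np.asarray([1, 2, 5, 10, 50, 100])  # number of i.i.d. RVs
experiment = 'dice'  # valid values: 'dice', 'coins'
maxNumForExperiment = {'dice': 6, 'coins': 2}  # max numbers represented on dice or coins
nSamp = 100000  # number of repetitions of the experiment

k = maxNumForExperiment[experiment]

# Create a fresh 2x3 grid of axes for this experiment
fig, fig_axes = plt.subplots(ncols=3, nrows=2, constrained_layout=True, figsize=(12, 8))

for i, N in enumerate(numIterations):
    # Mean of N throws, repeated nSamp times
    y = np.random.randint(low=1, high=k + 1, size=(N, nSamp)).sum(axis=0) / N
    row, col = divmod(i, 3)
    bins = np.arange(start=1, stop=7, step=0.1)  # same axis range for every N
    fig_axes[row, col].hist(y, bins=bins, density=True)
    fig_axes[row, col].set_title('N={} {}'.format(N, experiment))
plt.show()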

dice mean

We can see that as N increases, the distribution of the mean of the dice rolls approaches a normal distribution. Also, since the variance of the sample mean is \frac{\sigma^2}{n} (with n = N here), the histogram becomes narrower around the mean 3.5 as N increases.
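Both observations can be checked numerically with a short sketch. It standardizes the simulated means with \mu = 3.5 and \sigma^2 = \frac{35}{12} and reports the fraction that falls within 1.96 standard deviations of \mu, which is about 0.95 for a normal distribution. The seed, the repetition count, and the name coverage are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
nSamp = 20000                   # repetitions per value of N (illustrative choice)

mu = 3.5                  # mean of one fair die
sigma = np.sqrt(35 / 12)  # standard deviation of one fair die

coverage = {}
for N in (1, 2, 5, 10, 50, 100):
    # Mean of N throws, repeated nSamp times.
    means = rng.integers(1, 7, size=(N, nSamp)).mean(axis=0)
    # Standardize with the theoretical standard deviation sigma / sqrt(N).
    z = (means - mu) / (sigma / np.sqrt(N))
    # For a standard normal variable this fraction is about 0.95.
    coverage[N] = float(np.mean(np.abs(z) <= 1.96))
    print(f"N={N:3d}  var of mean {means.var():.4f}  "
          f"sigma^2/N {sigma ** 2 / N:.4f}  "
          f"P(|z| <= 1.96) ~ {coverage[N]:.3f}")
```

For small N the fraction can be far from 0.95 because the mean takes only a few distinct values. For N = 1 every face lies within 1.96 standard deviations of 3.5, so the fraction is exactly 1.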

Coin example

Let us experiment with the distribution of the total value of a coin tossed N (1, 2, 5, 10, 50, 100) times, scoring 1 for heads and 2 for tails. Heads and tails each occur with probability \frac{1}{2}, so the value of a single toss is a Bernoulli variable with p = \frac{1}{2} shifted by 1. The following is the code.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline

sns.set()
sns.set_context(rc = {'patch.linewidth': 0.2})
sns.set_style('dark')

numIterations = np.asarray([1, 2, 5, 10, 50, 100])  # number of i.i.d. RVs
experiment = 'coins'  # valid values: 'dice', 'coins'
maxNumForExperiment = {'dice': 6, 'coins': 2}  # max numbers represented on dice or coins
nSamp = 100000  # number of repetitions of the experiment

k = maxNumForExperiment[experiment]

fig, fig_axes = plt.subplots(ncols=3, nrows=2, constrained_layout=True, figsize=(12, 8))

for i, N in enumerate(numIterations):
    # Toss N coins nSamp times and total each column (randint excludes the upper bound)
    y = np.random.randint(low=1, high=k + 1, size=(N, nSamp)).sum(axis=0)
    row, col = divmod(i, 3)
    bins = np.arange(start=min(y), stop=max(y) + 2, step=1)  # one bin per integer total
    fig_axes[row, col].hist(y, bins=bins, density=True)
    fig_axes[row, col].set_title('N={} {}'.format(N, experiment))
plt.show()

coins

As N increases (i.e., as more tosses are added together), the distribution of the sum of the coin values approaches a normal distribution. The sum is N times the sample mean, so its distribution has the same shape as the distribution of the sample mean.
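For the coin, the exact distribution of the total is easy to write down: the total equals N plus the number of tails, and the number of tails is binomial with parameters N and \frac{1}{2}. The sketch below compares this exact probability with the normal density with mean 1.5N and variance 0.25N. The totals are spaced one apart, so the two can be compared directly. The function names and the dictionary max_gap are illustrative.

```python
import math


def coin_total_pmf(N, s):
    """Exact P(total = s) for N fair coins scored 1 (heads) or 2 (tails)."""
    tails = s - N  # each tail adds one extra point on top of N
    if tails < 0 or tails > N:
        return 0.0
    return math.comb(N, tails) / 2 ** N


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


max_gap = {}
for N in (1, 2, 5, 10, 50, 100):
    mean, var = 1.5 * N, 0.25 * N  # mean and variance of the total
    # Largest difference between exact probability and normal density
    # over all possible totals N, N+1, ..., 2N.
    max_gap[N] = max(
        abs(coin_total_pmf(N, s) - normal_pdf(s, mean, var))
        for s in range(N, 2 * N + 1)
    )
    print(f"N={N:3d}  largest |pmf - normal density| = {max_gap[N]:.5f}")
```

The gap does not have to shrink at every step for very small N, but for the larger values of N it should be small compared with the probabilities themselves.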

Next, let us experiment with the distribution of the mean of the coin values when a coin is tossed N (1, 2, 5, 10, 50, 100) times.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline

sns.set()
sns.set_context(rc = {'patch.linewidth': 0.2})
sns.set_style('dark')

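numIterations = np.asarray([1, 2, 5, 10, 50, 100])  # number of i.i.d. RVs
experiment = 'coins'  # valid values: 'dice', 'coins'
maxNumForExperiment = {'dice': 6, 'coins': 2}  # max numbers represented on dice or coins
nSamp = 100000  # number of repetitions of the experiment

k = maxNumForExperiment[experiment]

# Create a fresh 2x3 grid of axes for this experiment
fig, fig_axes = plt.subplots(ncols=3, nrows=2, constrained_layout=True, figsize=(12, 8))

for i, N in enumerate(numIterations):
    # Mean of N tosses, repeated nSamp times
    y = np.random.randint(low=1, high=k + 1, size=(N, nSamp)).sum(axis=0) / N
    row, col = divmod(i, 3)
    bins = np.arange(start=1, stop=3, step=0.1)  # same axis range for every N
    fig_axes[row, col].hist(y, bins=bins, density=True)
    fig_axes[row, col].set_title('N={} {}'.format(N, experiment))
plt.show()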

coins mean

As N increases, we see that the distribution of the mean of the coin values approaches a normal distribution. Also, since the variance of the sample mean is \frac{\sigma^2}{n} (with n = N here), the histogram becomes narrower around the mean 1.5 as N increases.
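The \frac{\sigma^2}{n} scaling can also be checked without a plot. A single coin value (1 or 2 with equal probability) has variance \sigma^2 = 0.25, so the simulated variance of the mean multiplied by N and divided by 0.25 should stay close to 1 for every N. This is a minimal sketch with an assumed seed and repetition count; the name ratio is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
nSamp = 20000                   # repetitions per value of N (illustrative choice)
sigma2 = 0.25                   # variance of one coin value (1 or 2, each with p = 1/2)

ratio = {}
for N in (1, 2, 5, 10, 50, 100):
    # high=3 is excluded, so the values are 1 (heads) or 2 (tails).
    means = rng.integers(1, 3, size=(N, nSamp)).mean(axis=0)
    # If Var(mean) = sigma^2 / N holds, this ratio is close to 1.
    ratio[N] = float(means.var() * N / sigma2)
    print(f"N={N:3d}  var of mean {means.var():.5f}  "
          f"sigma^2/N {sigma2 / N:.5f}  ratio {ratio[N]:.3f}")
```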

References

https://www.gaussianwaves.com/2010/01/central-limit-theorem-2/

Ryusei Kakujo
