2022-12-23

Statistical model

What is a statistical model

A statistical model is a mathematical model that explains patterns or laws about observed data. It selects a probability distribution according to the characteristics of the data and estimates the parameters of the probability distribution that can successfully explain the observed data.

Parameter estimation

I will perform parameter estimation in Python for the following sample data with a sample size of 50.

data = [2,2,4,6,4,5,2,3,1,2,0,4,3,3,3,3,4,2,7,2,4,3,3,3,4,
        3,7,5,3,1,7,6,4,6,5,2,4,7,2,2,6,2,4,5,4,5,1,3,2,3]

Data investigation

First, let us look at a histogram of the data.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.style.use('ggplot')
plt.figure(figsize=(10,5))

data = [2,2,4,6,4,5,2,3,1,2,0,4,3,3,3,3,4,2,7,2,4,3,3,3,4,
        3,7,5,3,1,7,6,4,6,5,2,4,7,2,2,6,2,4,5,4,5,1,3,2,3]

plt.hist(data, bins = [-0.5 + v for v in range(max(data) + 2)], alpha=0.5, rwidth=0.95)
plt.xlabel('data')
plt.ylabel('Frequency')

Histogram

Next let us look at a summary of the data.

import pandas as pd

df = pd.DataFrame(data, columns=['x'])
df.describe()

	x
count	50.00000
mean	3.56000
std	1.72804
min	0.00000
25%	2.00000
50%	3.00000
75%	4.75000
max	7.00000

Let us look at the sample variance.

# sample variabce
df.var()

2.986122

The survey confirmed the following:

Mountain-like distribution
Data are count data (non-negative integers)
Sample mean is 3.56
Sample variance is 2.99

Assumptions of probability distribution

From the sample data, we assume a probability distribution that the population follows. We know the following about the sample data.

The data is count data
The lower bound is 0 but the upper bound is unknown
The sample mean is 3.56 and the sample variance is 2.99, which are relatively close

From the above characteristics, we assume a Poisson distribution with $\lambda=3.56$ .

Poisson distribution vs. sample distribution

Visualize the Poisson distribution of $\lambda=3.56$ and the sample data distribution superimposed.

fig, ax1 = plt.subplots()
plt.hist(data, bins = [-0.5 + v for v in range(max(data) + 2)], alpha=0.5, rwidth=0.95)
poisson_values = poisson.rvs(3.56, size=10000)
prod = (pd.value_counts(poisson_values) / 10000).sort_index()
ax2 = ax1.twinx()
ax2.plot(prod)
plt.show()

Poisson distribution with histogram

It appears that the distribution of the sample data can be represented by a Poisson distribution with $\lambda=3.56$ .

Parameter estimation of the assumed probability distribution

The parameters of the probability distribution are estimated based on the sample data. Here we use a parameter estimation method called maximum likelihood estimation.

Maximum likelihood estimation is a parameter estimation method that searches for the parameter that maximizes the likelihood, a statistic that expresses how well it fits the observed data. In this case, the parameter is $\lambda$ . In other words, we are estimating the value of $\lambda$ that maximizes the fit to the data.

The likelihood is the product of the probabilities for all the sample data. The likelihood is a function of the parameter, denoted by $L$ . Since the parameter in this case is $\lambda$ , the expression for the likelihood is as follows

L(\lambda) = \prod_i p(y_i | \lambda) = \prod_i \frac{e^{-\lambda} \lambda^y_i}{y_i!}

where $y_i$ is the observed value. In this example, $y_1=2, y_2=2, y_3=4, \cdots, y_{50}=3$ .

We examine $\lambda$ that maximizes the likelihood $L(\lambda)$ . However, since the likelihood as it is is multiplicative and difficult to handle, the log-transformed log-likelihood is used for the maximum likelihood estimation of the parameters.

\log L(\lambda) = \sum_i (y_i \log \lambda - \lambda - \sum^{y_i}_k \log k)

The following figure shows the shape of the distribution and the value of $log L(\lambda)$ when the parameter $\lambda$ is varied from 3.2 to 4.4.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sympy
from scipy.stats import poisson
%matplotlib inline

plt.style.use('ggplot')
plt.figure(figsize=(10,5))

lams = np.arange(3.2,4.5,0.4)

for i, lam in enumerate(lams):
    log_l = 0
    for y in data:
        f = (lam**y)*sympy.exp(-lam)/sympy.factorial(y)
        logf=sympy.log(f)
        log_l = log_l + logf

    fig, ax = plt.subplots()
    ax.hist(data, bins = [-0.5 + v for v in range(max(data) + 2)], alpha=0.5, rwidth=0.95)
    poisson_values = poisson.rvs(lam, size=10000)
    prod = (pd.value_counts(poisson_values) / 10000).sort_index()
    ax2 = ax.twinx()
    ax2.plot(prod, label=f'lambda={round(lam, 1)}, log L={round(log_l, 1)}')
    plt.legend()
    plt.show()

lambda and log maximum likelihood

We can see that the log-likelihood is likely to be maximized around $\lambda=3.6$ . Since $log L$ is a monotonically increasing function of the likelihood $L$ , the log likelihood is also maximum at $\lambda$ where the log likelihood is maximum.

Type of data	Range of data	Relationship between sample variance $V(X)$ and sample mean $E(X)$	Probability distribution
Discrete	$\{0, 1, 2, \cdots ,\infty \}$	$E(X) \approx V(X)$	Poisson distribution
Discrete	$\{0, 1, 2, \cdots ,\infty \}$	$E(X) < V(X)$	Negative binomial distribution
Discrete	$\{0, 1, 2, \cdots ,N \}$		Binomial distribution
Continuous	$[−\infty, \infty]$		Normal distribution
Continuous	$[0, \infty]$		Log-normal distribution
Continuous	$[0, \infty]$		Gamma distribution
Continuous	$[0, 1]$		Beta distribution

Statistical model

What is a statistical model

Parameter estimation

Data investigation

Assumptions of probability distribution

Poisson distribution vs. sample distribution

Parameter estimation of the assumed probability distribution

Which probability distribution should be used for statistical models

P-Value Hacking

Generalized linear model

Ryusei Kakujo