What is the Chi-Square Test?
The Chi-square test is a tool in the field of hypothesis testing, primarily used to determine if there is a significant association between two categorical variables.
Unlike some other tests, the chi-square test does not require assumptions about the distribution of the population, hence it's termed "non-parametric." It's specifically designed to analyze categorical or nominal data, such as gender, color, brand preference, and so on, rather than numerical data.
The chi-square test compares the observed frequencies in each category of a contingency table with the expected frequencies, which are the frequencies we would anticipate if there were no association between the variables. By comparing these two sets of frequencies, the test helps researchers decide whether or not to reject their null hypothesis about the relationship between the variables under study.
Concept of Chi-Square Test
The Chi-square test is a statistical procedure used to determine whether there is a significant difference between an expected distribution and an actual distribution. It is named after the chi-square distribution, a family of distributions that take only positive values and are skewed to the right, which the test statistic follows under the null hypothesis.
The Chi-square test compares the difference between the observed and the expected frequencies in the data. The "observed frequencies" are the actual data you have collected. The "expected frequencies" are the frequencies that you would expect in each cell of a contingency table if there were no relationship between the variables.
Observed frequencies:
| | ● | ▲ | Total |
|---|---|---|---|
| ✖︎ | a | c | i |
| ■ | b | d | j |
| Total | x | y | N |
Expected frequencies:
| | ● | ▲ |
|---|---|---|
| ✖︎ | (i × x) / N | (i × y) / N |
| ■ | (j × x) / N | (j × y) / N |
The chi-square statistic is calculated using the following formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where:

- $\chi^2$ = Chi-square statistic
- $O_i$ = Observed frequency
- $E_i$ = Expected frequency
- $\sum$ indicates that we sum this calculation over all cells of the contingency table.

The result of this calculation, $\chi^2$, is then compared against the chi-square distribution to judge whether the differences between observed and expected frequencies are larger than chance alone would explain.
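To make the formula concrete, here is a minimal sketch that computes the expected frequencies and the chi-square statistic by hand for a small, hypothetical 2×2 contingency table (the counts are invented purely for illustration):

```python
import numpy as np

# Hypothetical observed 2x2 contingency table
# (rows: categories of variable 1; columns: categories of variable 2)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)      # i, j
col_totals = observed.sum(axis=0)      # x, y
grand_total = observed.sum()           # N

# Expected frequency for each cell: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi_square)
```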
Hypotheses of Chi-Square Test
Like all hypothesis tests, the chi-square test uses a null hypothesis and an alternative hypothesis.
The null hypothesis ($H_0$) states that there is no association between the variables; they are independent.
The alternative hypothesis ($H_1$) states that there is an association between the variables; they are not independent.
Types of Chi-Square Tests
Chi-square tests are not a single test but a group of statistical tests that follow the chi-square distribution under the null hypothesis. Although there are several different types of chi-square tests, three of them are particularly common: the chi-square test for independence, the chi-square test for goodness of fit, and the chi-square test for homogeneity.
Chi-Square Test for Independence
The chi-square test for independence, also known as the chi-square test for association, is used to determine if there is a significant association between two categorical variables. In other words, it tests whether the variables are independent or related.
The null hypothesis for this test states that the variables are independent, while the alternative hypothesis states that the variables are not independent, i.e., there is an association or relationship between them.
Chi-Square Test for Goodness of Fit
The chi-square test for goodness of fit is used to determine if a set of observed categorical data matches an expected distribution. This test is commonly used to test the hypothesis that an observed frequency distribution fits a particular theoretical distribution.
The null hypothesis for the chi-square goodness of fit test is that the observed data fit the expected distribution, while the alternative hypothesis is that the observed data do not fit the expected distribution.
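As a brief, hypothetical sketch of a goodness-of-fit test, suppose we roll a die 60 times and want to check whether it is fair (the observed counts below are made up for illustration). The `scipy.stats.chisquare` function compares the observed counts against the counts expected under a uniform distribution:

```python
from scipy.stats import chisquare

# Hypothetical counts of each face after 60 rolls of a die
observed = [8, 9, 13, 7, 12, 11]

# Under a fair die, each face is expected 60 / 6 = 10 times
expected = [10, 10, 10, 10, 10, 10]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(statistic, p_value)
```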
Chi-Square Test for Homogeneity
The chi-square test for homogeneity is used to determine whether different samples (populations) have the same distribution of a single categorical variable. For example, a chi-square test for homogeneity could be used to test whether the distribution of political party preference is the same for three different age groups.
The null hypothesis for the test for homogeneity is that the populations have the same distribution - they are homogeneous. The alternative hypothesis is that the populations do not have the same distribution - they are not homogeneous.
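As a sketch of how a homogeneity test might look in practice: mechanically it uses the same computation as the independence test, with each sample (here, a hypothetical age group) as a row and the categories of the variable (party preference) as columns. The counts below are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: three hypothetical age groups; columns: preference for party A, B, C
counts = np.array([[35, 45, 20],
                   [40, 40, 20],
                   [30, 50, 20]])

chi2, p, dof, expected = chi2_contingency(counts)
print(chi2, p, dof)
```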
Steps in Conducting Chi-Square Test
Conducting a chi-square test involves a series of steps. The general process is as follows.
Defining Hypotheses
The first step is to define the null and alternative hypotheses. For a chi-square test, the null hypothesis typically states that the variables are independent, while the alternative hypothesis states that the variables are not independent.
Calculating Test Statistic
Next, construct a contingency table of observed frequencies. From this table, calculate the expected frequencies. The expected frequency for each cell in the table is calculated as (row total × column total) / grand total.
The chi-square statistic is then calculated using the formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where $\chi^2$ is the chi-square statistic, $O_i$ is the observed frequency, $E_i$ is the expected frequency, and $\sum$ indicates that the calculation is summed over all cells of the contingency table.
Determining Critical Value
The next step is to determine the critical value from the chi-square distribution. This requires knowing the degrees of freedom for the test. For a chi-square test for independence, the degrees of freedom are calculated as (number of rows - 1) × (number of columns - 1).
Once you have the degrees of freedom, you can find the critical value from a chi-square distribution table. The critical value is the number that the test statistic must exceed to reject the null hypothesis.
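Instead of a printed chi-square table, the critical value can be looked up with `scipy.stats.chi2.ppf`. A small sketch, assuming a significance level of 0.05 and a 2×3 table (so 2 degrees of freedom); the test statistic used for the comparison is just a placeholder:

```python
from scipy.stats import chi2

alpha = 0.05                 # significance level
dof = (2 - 1) * (3 - 1)      # (rows - 1) * (columns - 1) for a 2x3 table

# Critical value: the point beyond which the upper tail has probability alpha
critical_value = chi2.ppf(1 - alpha, dof)
print(critical_value)        # approximately 5.99 for dof = 2

# Hypothetical test statistic, compared against the critical value
test_statistic = 3.79
if test_statistic > critical_value:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```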
Interpreting Result
Finally, compare the chi-square test statistic to the critical value to decide whether or not to reject the null hypothesis.
- If the chi-square test statistic is greater than the critical value, we reject the null hypothesis. This indicates that the observed frequencies significantly differ from the expected frequencies, suggesting an association between the variables.
- If the chi-square test statistic is less than or equal to the critical value, we fail to reject the null hypothesis. This suggests that the observed frequencies do not significantly differ from the expected frequencies, providing no evidence of an association between the variables.
Python Implementation of Chi-Square Test
Implementing the Chi-Square test in Python can be quite straightforward. Here is an example of how to do it.
First, let's import the necessary libraries.
```python
import numpy as np
import scipy.stats as stats
```
Let's assume we have observed frequencies in a contingency table:
```python
# Observed data in each category
observed = np.array([[10, 15, 20], [20, 15, 15]])
```
We can perform the chi-square test using the `chi2_contingency` function from `scipy.stats`. This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table `observed`.
```python
chi2, p, dof, expected = stats.chi2_contingency(observed)
```
The function returns four values:
- `chi2`: The test statistic.
- `p`: The p-value of the test.
- `dof`: Degrees of freedom.
- `expected`: The expected frequencies, based on the marginal sums of the table.
Finally, we can print out these values:
print("Chi-square statistic = ", chi2)
print("p-value = ", p)
print("Degrees of freedom = ", dof)
print("Expected contingency table: \n", expected)
```
Chi-square statistic = 3.7949735449735464
p-value = 0.14994499194861846
Degrees of freedom = 2
Expected contingency table:
 [[14.21052632 14.21052632 16.57894737]
 [15.78947368 15.78947368 18.42105263]]
```
Based on the output of the Chi-square test, we can interpret the results as follows:

- **Chi-square Statistic**: The chi-square statistic of 3.79 measures the difference between your observed data and the values you would expect to obtain if the null hypothesis were true (i.e., if the variables were independent).
- **p-value**: The p-value of 0.1499 is the probability of obtaining the observed data (or data more extreme) if the null hypothesis were true. In most cases, a threshold ($\alpha$) of 0.05 is used. If the p-value is less than $\alpha$, you would reject the null hypothesis. In this case, the p-value is greater than 0.05, so you would fail to reject the null hypothesis. This means that, based on your data, there is not enough evidence to conclude that there is a significant association between the two variables.
- **Degrees of Freedom**: The degrees of freedom for the test are 2. This is calculated as (number of rows - 1) × (number of columns - 1) in the contingency table.
- **Expected Contingency Table**: This is the contingency table you would expect if the null hypothesis were true. These values are calculated based on the marginal totals of your observed data. You can compare these values to your observed data to see where the biggest discrepancies lie.
The Chi-square test results suggest that there is no significant association between the variables in question, as the p-value is greater than 0.05.
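If you prefer to have the decision made in code rather than by eye, a small follow-up to the example above (reusing the `p` value returned by `chi2_contingency`, with the conventional 0.05 threshold) might look like this:

```python
alpha = 0.05  # conventional significance level

if p < alpha:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no significant association detected.")
```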