What is the Chi-Square Test?
The Chi-square test is a tool in the field of hypothesis testing, primarily used to determine if there is a significant association between two categorical variables.
Unlike some other tests, the chi-square test does not require assumptions about the distribution of the population, hence it's termed "non-parametric." It's specifically designed to analyze categorical or nominal data, such as gender, color, brand preference, and so on, rather than numerical data.
The chi-square test compares the observed frequencies in each category of a contingency table with the expected frequencies, which are the frequencies we would anticipate if there were no association between the variables. By comparing these two sets of frequencies, the test helps researchers decide whether or not to reject their null hypothesis about the relationship between the variables under study.
Concept of Chi-Square Test
The Chi-square test is a statistical procedure used to determine whether there is a significant difference between an expected distribution and an actual distribution. It is named after the chi-square distribution, a family of distributions that take only positive values and are skewed to the right, which the test statistic follows under the null hypothesis.
The Chi-square test compares the difference between the observed and the expected frequencies in the data. The "observed frequencies" are the actual data you have collected. The "expected frequencies" are the frequencies that you would expect in each cell of a contingency table if there were no relationship between the variables.
Observed frequencies:
| | ● | ▲ | Total |
|---|---|---|---|
| ✖︎ | a | c | i |
| ■ | b | d | j |
| Total | x | y | N |
Expected frequencies:
| | ● | ▲ |
|---|---|---|
| ✖︎ | (i × x) / N | (i × y) / N |
| ■ | (j × x) / N | (j × y) / N |
The chi-square statistic is calculated using the following formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where:

- $\chi^2$ = Chi-square statistic
- $O_i$ = Observed frequency
- $E_i$ = Expected frequency
- $\sum$ indicates that we sum this calculation over all cells of the contingency table.

The result of this calculation, $\chi^2$, is then compared against the chi-square distribution to judge whether the differences between observed and expected frequencies are larger than chance alone would explain.
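To make the formula concrete, here is a minimal sketch that computes the expected frequencies and the chi-square statistic by hand for a small, hypothetical 2×2 contingency table (the counts are invented purely for illustration):

```python
import numpy as np

# Hypothetical observed 2x2 contingency table
# (rows: categories of variable 1; columns: categories of variable 2)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)      # i, j
col_totals = observed.sum(axis=0)      # x, y
grand_total = observed.sum()           # N

# Expected frequency for each cell: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi_square)
```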
Hypotheses of Chi-Square Test
Like all hypothesis tests, the chi-square test uses a null hypothesis and an alternative hypothesis.
The null hypothesis ($H_0$) states that there is no association between the variables; they are independent.
The alternative hypothesis ($H_1$) states that there is an association between the variables; they are not independent.
Types of Chi-Square Tests
Chi-square tests are not a single test but a group of statistical tests that follow the chi-square distribution under the null hypothesis. Although there are several different types of chi-square tests, three of them are particularly common: the chi-square test for independence, the chi-square test for goodness of fit, and the chi-square test for homogeneity.
Chi-Square Test for Independence
The chi-square test for independence, also known as the chi-square test for association, is used to determine if there is a significant association between two categorical variables. In other words, it tests whether the variables are independent or related.
The null hypothesis for this test states that the variables are independent, while the alternative hypothesis states that the variables are not independent, i.e., there is an association or relationship between them.
Chi-Square Test for Goodness of Fit
The chi-square test for goodness of fit is used to determine if a set of observed categorical data matches an expected distribution. This test is commonly used to test the hypothesis that an observed frequency distribution fits a particular theoretical distribution.
The null hypothesis for the chi-square goodness of fit test is that the observed data fit the expected distribution, while the alternative hypothesis is that the observed data do not fit the expected distribution.
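As a brief, hypothetical sketch of a goodness-of-fit test, suppose we roll a die 60 times and want to check whether it is fair (the observed counts below are made up for illustration). The `scipy.stats.chisquare` function compares the observed counts against the counts expected under a uniform distribution:

```python
from scipy.stats import chisquare

# Hypothetical counts of each face after 60 rolls of a die
observed = [8, 9, 13, 7, 12, 11]

# Under a fair die, each face is expected 60 / 6 = 10 times
expected = [10, 10, 10, 10, 10, 10]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
print(statistic, p_value)
```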
Chi-Square Test for Homogeneity
The chi-square test for homogeneity is used to determine whether different samples (populations) have the same distribution of a single categorical variable. For example, a chi-square test for homogeneity could be used to test whether the distribution of political party preference is the same for three different age groups.
The null hypothesis for the test for homogeneity is that the populations have the same distribution - they are homogeneous. The alternative hypothesis is that the populations do not have the same distribution - they are not homogeneous.
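As a sketch of how a homogeneity test might look in practice: mechanically it uses the same computation as the independence test, with each sample (here, a hypothetical age group) as a row and the categories of the variable (party preference) as columns. The counts below are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: three hypothetical age groups; columns: preference for party A, B, C
counts = np.array([[35, 45, 20],
                   [40, 40, 20],
                   [30, 50, 20]])

chi2, p, dof, expected = chi2_contingency(counts)
print(chi2, p, dof)
```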
Steps in Conducting Chi-Square Test
Conducting a chi-square test involves a series of steps. The general process is as follows.
Defining Hypotheses
The first step is to define the null and alternative hypotheses. For a chi-square test, the null hypothesis typically states that the variables are independent, while the alternative hypothesis states that the variables are not independent.
Calculating Test Statistic
Next, construct a contingency table of observed frequencies. From this table, calculate the expected frequencies. The expected frequency for each cell in the table is calculated as (row total × column total) / grand total.
The chi-square statistic is then calculated using the formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where $\chi^2$ is the chi-square statistic, $O_i$ is the observed frequency, $E_i$ is the expected frequency, and $\sum$ indicates that the calculation is summed over all cells of the contingency table.
Determining Critical Value
The next step is to determine the critical value from the chi-square distribution. This requires knowing the degrees of freedom for the test. For a chi-square test for independence, the degrees of freedom are calculated as (number of rows - 1) × (number of columns - 1).
Once you have the degrees of freedom, you can find the critical value from a chi-square distribution table. The critical value is the number that the test statistic must exceed to reject the null hypothesis.
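Instead of a printed chi-square table, the critical value can be looked up with `scipy.stats.chi2.ppf`. A small sketch, assuming a significance level of 0.05 and a 2×3 table (so 2 degrees of freedom); the test statistic used for the comparison is just a placeholder:

```python
from scipy.stats import chi2

alpha = 0.05                 # significance level
dof = (2 - 1) * (3 - 1)      # (rows - 1) * (columns - 1) for a 2x3 table

# Critical value: the point beyond which the upper tail has probability alpha
critical_value = chi2.ppf(1 - alpha, dof)
print(critical_value)        # approximately 5.99 for dof = 2

# Hypothetical test statistic, compared against the critical value
test_statistic = 3.79
if test_statistic > critical_value:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```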
Interpreting Result
Finally, compare the chi-square test statistic to the critical value to decide whether or not to reject the null hypothesis.
- If the chi-square test statistic is greater than the critical value, we reject the null hypothesis. This indicates that the observed frequencies significantly differ from the expected frequencies, suggesting an association between the variables.
- If the chi-square test statistic is less than or equal to the critical value, we fail to reject the null hypothesis. This suggests that the observed frequencies do not significantly differ from the expected frequencies, providing no evidence of an association between the variables.
Python Implementation of Chi-Square Test
Implementing the Chi-Square test in Python can be quite straightforward. Here is an example of how to do it.
First, let's import the necessary libraries.
```python
import numpy as np
import scipy.stats as stats
```
Let's assume we have observed frequencies in a contingency table:
```python
# Observed data in each category
observed = np.array([[10, 15, 20], [20, 15, 15]])
```
We can perform the chi-square test using the `chi2_contingency` function from `scipy.stats`. This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table `observed`.
```python
chi2, p, dof, expected = stats.chi2_contingency(observed)
```
The function returns four values:
- `chi2`: The test statistic.
- `p`: The p-value of the test.
- `dof`: Degrees of freedom.
- `expected`: The expected frequencies, based on the marginal sums of the table.
Finally, we can print out these values:
print("Chi-square statistic = ", chi2)
print("p-value = ", p)
print("Degrees of freedom = ", dof)
print("Expected contingency table: \n", expected)
```
Chi-square statistic = 3.7949735449735464
p-value = 0.14994499194861846
Degrees of freedom = 2
Expected contingency table:
 [[14.21052632 14.21052632 16.57894737]
 [15.78947368 15.78947368 18.42105263]]
```
Based on the output of the Chi-square test, we can interpret the results as follows:

- **Chi-square Statistic**: The chi-square statistic of 3.79 measures the difference between your observed data and the values you would expect to obtain if the null hypothesis were true (i.e., if the variables were independent).
- **p-value**: The p-value of 0.1499 is the probability of obtaining the observed data (or data more extreme) if the null hypothesis were true. In most cases, a threshold ($\alpha$) of 0.05 is used. If the p-value is less than $\alpha$, you would reject the null hypothesis. In this case, the p-value is greater than 0.05, so you would fail to reject the null hypothesis. This means that, based on your data, there is not enough evidence to conclude that there is a significant association between the two variables.
- **Degrees of Freedom**: The degrees of freedom for the test are 2. This is calculated as (number of rows - 1) × (number of columns - 1) in the contingency table.
- **Expected Contingency Table**: This is the contingency table you would expect if the null hypothesis were true. These values are calculated based on the marginal totals of your observed data. You can compare these values to your observed data to see where the biggest discrepancies lie.
The Chi-square test results suggest that there is no significant association between the variables in question, as the p-value is greater than 0.05.
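If you prefer to have the decision made in code rather than by eye, a small follow-up to the example above (reusing the `p` value returned by `chi2_contingency`, with the conventional 0.05 threshold) might look like this:

```python
alpha = 0.05  # conventional significance level

if p < alpha:
    print("Reject the null hypothesis: the variables appear to be associated.")
else:
    print("Fail to reject the null hypothesis: no significant association detected.")
```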