2022-11-20

Z-Test

What is Z-Test

The Z-test is one of the most commonly used statistical procedures, a fundamental tool for statisticians, researchers, and data analysts alike. It belongs to the family of hypothesis tests that provide a method for making inferences or conclusions about the characteristics of populations based on the analysis of sample data.

The term 'Z-test' originates from the standard normal distribution, also known as the 'Z-distribution', which is a special case of the normal distribution where the mean is 0 and the standard deviation is 1. The 'Z' in Z-test is derived from Z-score, which is a way of standardizing random variables in statistics.
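To make the idea of standardization concrete, here is a minimal sketch of computing Z-scores with NumPy. The `values` array is hypothetical and used only for illustration; after standardization the scores should have mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical measurements, used only for illustration
values = np.array([4.0, 8.0, 6.0, 5.0, 7.0])

# Standardize: subtract the mean, then divide by the standard deviation
# (ddof=0 treats these values as the whole population)
z_scores = (values - values.mean()) / values.std(ddof=0)

print(z_scores.mean())  # approximately 0
print(z_scores.std())   # approximately 1
```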

The basic idea behind the Z-test is to see how far the observed sample mean is from the hypothesized population mean, measured in units of the standard error (the standard deviation of the sample mean). If the sample mean is sufficiently far from the hypothesized mean, we reject the null hypothesis in favor of the alternative hypothesis. In other words, we conclude that the true population mean differs from the hypothesized value by a statistically significant amount.

For example, imagine a shoe manufacturer claims that their shoes last an average of 12 months. You suspect this might not be accurate, so you collect a sample of shoes and test their lifespan. The Z-test can help you determine whether the average lifespan in your sample is statistically different from the claimed 12 months.

Assumptions of the Z-Test

The Z-test, like many statistical procedures, is based on certain assumptions. If these assumptions are not met, the results obtained from the test may not be valid or reliable. Therefore, before conducting a Z-test, it's essential to confirm that your data meets the following assumptions:

Normally Distributed Population
Ideally, the population from which the samples are taken is normally distributed. However, thanks to the Central Limit Theorem, this requirement can be relaxed when the sample size is large enough (a common rule of thumb is more than 30): the distribution of the sample means will still be approximately normal even if the population is not. Strongly skewed or heavy-tailed populations may need larger samples for the approximation to hold.
Known Population Standard Deviation
This is perhaps the most stringent assumption of the Z-test: the standard deviation of the population must be known. In practice this is often not the case, and the sample standard deviation is used as an estimate instead. When that happens, particularly with small samples, a T-test is more appropriate than a Z-test.
Interval or Ratio Level of Measurement
The Z-test requires that the variables being tested are measured on either an interval or ratio scale. This means that the data should be continuous or quantitative, with consistent intervals between measurements.
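The Central Limit Theorem claim in the first assumption can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample size of 50, and the seed are arbitrary choices, not values from any real study. It draws many samples from a strongly skewed population and compares the spread of the sample means with the theoretical value $\sigma / \sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample (arbitrary choice)
n_samples = 10_000  # number of repeated samples

# Exponential(scale=1) is strongly right-skewed, with mean 1 and std 1
draws = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# CLT prediction: the sample means centre on 1 with std close to 1 / sqrt(n)
print(sample_means.mean())
print(sample_means.std(ddof=1), 1 / np.sqrt(n))
```

If the two printed spreads are close, the normal approximation behind the Z-test is reasonable for this sample size and population.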

Types of Z-Test

Depending on the nature of the data and the question we are trying to answer, there are different types of Z-tests that can be used. Here are the three most common types:

One-Sample Z-Test
The one-sample Z-test is used when you want to know whether your sample comes from a particular population. The population mean and standard deviation are known from previous studies. You then take a sample from this population, find the sample's mean, and then compare this with the population mean. If the sample mean is sufficiently different from the population mean, you can conclude that your sample probably didn't come from that population.
Two-Sample Z-Test
The two-sample Z-test is used when you are interested in comparing two independent groups to see if they come from the same population. For example, you might want to know whether the average height of men differs from the average height of women. You would randomly sample men and women, and compare the mean height of the men with the mean height of the women. If the difference is statistically significant, you would conclude that men and women probably come from different populations with regard to height.
Paired Z-Test
The paired Z-test is used when you are interested in the difference between two variables for the same group of subjects. For example, you might be interested in whether a training program improves performance. You measure performance before and after the training program, and compare the two sets of scores. If the mean difference is statistically significant, you would conclude that the training program probably had an effect.
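The two-sample and paired variants can be sketched as small helper functions. The function names, the data, and the "known" standard deviations below are hypothetical and chosen only to keep the arithmetic easy to follow. The two-sample statistic divides the difference in means by $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$, while the paired statistic is a one-sample Z-test applied to the differences.

```python
import numpy as np

def two_sample_z(x1, x2, sigma1, sigma2):
    """Z statistic for H0: the two population means are equal."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    standard_error = np.sqrt(sigma1**2 / len(x1) + sigma2**2 / len(x2))
    return (x1.mean() - x2.mean()) / standard_error

def paired_z(before, after, sigma_diff):
    """Z statistic for H0: the mean of the paired differences is zero."""
    diffs = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    return diffs.mean() / (sigma_diff / np.sqrt(len(diffs)))

# Hypothetical data with assumed known standard deviations
group_a = [10.0, 12.0, 14.0, 16.0]  # mean 13
group_b = [9.0, 11.0, 13.0, 15.0]   # mean 12
z_two = two_sample_z(group_a, group_b, sigma1=2.0, sigma2=2.0)

before = [5.0, 6.0, 7.0, 8.0]
after = [6.0, 8.0, 8.0, 10.0]       # differences: 1, 2, 1, 2
z_paired = paired_z(before, after, sigma_diff=1.0)

print(z_two, z_paired)
```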

Steps in Performing a Z-Test

Performing a Z-test involves a series of steps. Here, we will outline the general process for conducting a one-sample Z-test. Remember, this is a simplified example and the specific steps might vary based on the nature of your data and the type of Z-test.

State the Null and Alternative Hypotheses

The null hypothesis ( $H_0$ ) states that the population mean equals the hypothesized value, i.e., there is no effect or no difference. The alternative hypothesis ( $H_1$ ) states that there is an effect or a difference.

For example:

$H_0: \mu = \mu_0$

$H_1: \mu \neq \mu_0$

where $\mu$ is the population mean and $\mu_0$ is the hypothesized population mean.

Choose the Significance Level

The significance level (denoted as $\alpha$ ) is the probability of rejecting the null hypothesis when it is true. Typically, a significance level of 0.05 is used, which means that we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis.
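One way to see what the significance level means is to simulate many experiments in which the null hypothesis really is true and count how often a two-tailed Z-test rejects it. The sketch below assumes a normal population with hypothetical parameters (`mu0 = 0`, `sigma = 1`, `n = 25`); the rejection rate should land near $\alpha$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

alpha = 0.05
mu0, sigma, n = 0.0, 1.0, 25    # hypothetical population and sample size
n_experiments = 20_000

# Every sample is drawn with the null hypothesis true (mean equals mu0)
samples = rng.normal(loc=mu0, scale=sigma, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))

# Two-tailed rejection rule at level alpha
z_critical = stats.norm.ppf(1 - alpha / 2)
rejection_rate = np.mean(np.abs(z) > z_critical)

print(rejection_rate)  # should be close to alpha
```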

Calculate the Z-Score

The Z-score is calculated using the following formula:

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

where:

$\bar{X}$ is the sample mean
$\mu_0$ is the hypothesized population mean
$\sigma$ is the population standard deviation
$n$ is the sample size
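
As a worked illustration, take the shoe example with hypothetical numbers (none of these figures come from real data): a sample of $n = 64$ pairs with mean lifespan $\bar{X} = 11.5$ months, tested against the claimed $\mu_0 = 12$ months with an assumed known $\sigma = 2$ months.

$$Z = \frac{11.5 - 12}{2 / \sqrt{64}} = \frac{-0.5}{0.25} = -2$$

Under these assumptions, the sample mean sits two standard errors below the claimed mean.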

Determine the Critical Z-Score

The critical Z-score is the value that separates the rejection region from the rest of the area under the standard normal curve, and it is determined by the significance level. For a two-tailed test at a 0.05 significance level, the critical Z-score is approximately ±1.96.
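
Critical values for other settings can be looked up with `scipy.stats.norm.ppf`, the inverse of the standard normal CDF. This is a small sketch for $\alpha = 0.05$; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05

# Two-tailed: alpha is split between both tails, reject if |Z| exceeds this
two_tailed = stats.norm.ppf(1 - alpha / 2)   # about 1.96

# Right-tailed: all of alpha is in the upper tail, reject if Z exceeds this
right_tailed = stats.norm.ppf(1 - alpha)     # about 1.645

# Left-tailed: all of alpha is in the lower tail, reject if Z is below this
left_tailed = stats.norm.ppf(alpha)          # about -1.645

print(two_tailed, right_tailed, left_tailed)
```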

Make a Decision

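If |Z| > critical Z-score, we reject the null hypothesis.
If |Z| ≤ critical Z-score, we fail to reject the null hypothesis.

The absolute value is used because this is a two-tailed test: a sample mean far above or far below the hypothesized mean counts as evidence against the null hypothesis.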

Remember, failing to reject the null hypothesis is not the same as accepting it. It merely means that there is not enough evidence to support the alternative hypothesis.

Report the Results

Finally, you should report your results. This includes the Z-value, the p-value (the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed), and whether you rejected or failed to reject the null hypothesis.
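
The p-value can be computed from the Z-score with the standard normal survival function `scipy.stats.norm.sf` (equal to 1 minus the CDF). The sketch below uses an illustrative Z-score of -2, which is an assumed value rather than a result from real data, and shows how the p-value depends on the form of the alternative hypothesis.

```python
from scipy import stats

z = -2.0  # illustrative Z-score, not taken from real data

# Two-tailed: probability of a statistic at least this far from 0, either side
p_two_tailed = 2 * stats.norm.sf(abs(z))

# Left-tailed (H1: mean is smaller): probability of a statistic this low or lower
p_left_tailed = stats.norm.cdf(z)

# Right-tailed (H1: mean is larger): probability of a statistic this high or higher
p_right_tailed = stats.norm.sf(z)

print(p_two_tailed, p_left_tailed, p_right_tailed)
```

A p-value below the chosen significance level leads to the same decision as comparing the Z-score with the critical Z-score.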

Python Implementation of Z-Test

In this section, I will implement a one-sample Z-test using Python.

First, you'll need to import the necessary libraries.

python

import numpy as np
from scipy import stats

Now, suppose you have a sample data set and you want to test whether its mean differs from the hypothesized population mean. Let's assume the population standard deviation is known.

python

# Sample data
sample_data = np.array([2, 3, 5, 6, 9, 11, 12, 15, 18, 20, 22, 24, 26, 27, 30])

# Known population standard deviation
population_std_dev = 5.0

# Hypothesized population mean
population_mean = 20.0

Now, we can calculate the sample mean and size:

python

# Calculate sample mean and size
sample_mean = np.mean(sample_data)
sample_size = len(sample_data)

Next, we calculate the Z-score:

python

# Calculate Z-score
Z = (sample_mean - population_mean) / (population_std_dev / np.sqrt(sample_size))

We can find the critical Z-score using the scipy.stats module. For a two-tailed test at a significance level of 0.05, the critical Z-score is approximately ±1.96.

python

# Two-tailed test at the 0.05 significance level
# ppf is the inverse CDF; 0.975 = 1 - (0.05 / 2) leaves 2.5% in each tail
Z_critical = stats.norm.ppf(0.975)

Finally, we make a decision based on the comparison between the calculated Z-score and the critical Z-score:

python

# Make a decision
if np.abs(Z) > Z_critical:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

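This will print "Reject the null hypothesis." or "Fail to reject the null hypothesis." depending on whether the absolute value of the calculated Z-score is greater than the critical Z-score. For the sample data above, the sample mean is about 15.33 and the Z-score is about -3.61, so the script prints "Reject the null hypothesis."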
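
The steps above can also be gathered into one function that returns the Z-score, the two-tailed p-value, and the decision together. This is a sketch rather than a library API: `one_sample_z_test` is a hypothetical helper name, and it reuses the sample data and the assumed population parameters from the walkthrough.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(data, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample Z-test with a known population std (sigma)."""
    data = np.asarray(data, dtype=float)
    z = (data.mean() - mu0) / (sigma / np.sqrt(len(data)))
    # Probability of a statistic at least as extreme as z under H0
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value, bool(p_value < alpha)

# Same sample and assumed population parameters as in the walkthrough
sample_data = [2, 3, 5, 6, 9, 11, 12, 15, 18, 20, 22, 24, 26, 27, 30]
z, p_value, reject = one_sample_z_test(sample_data, mu0=20.0, sigma=5.0)

print(f"Z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
```

Rejecting when the p-value is below `alpha` gives the same decision as comparing the absolute Z-score with the critical Z-score.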

Ryusei Kakujo
