What is a Z-Test?
The Z-test is one of the most commonly used statistical procedures, a fundamental tool for statisticians, researchers, and data analysts alike. It belongs to the family of hypothesis tests that provide a method for making inferences or conclusions about the characteristics of populations based on the analysis of sample data.
The term 'Z-test' originates from the standard normal distribution, also known as the 'Z-distribution', which is a special case of the normal distribution where the mean is 0 and the standard deviation is 1. The 'Z' in Z-test is derived from Z-score, which is a way of standardizing random variables in statistics.
The basic idea behind the Z-test is to see how far the observed sample mean is from the hypothesized population mean, measured in units of the standard error (the standard deviation of the sample mean). If the sample mean is sufficiently far from the hypothesized population mean, we reject the null hypothesis in favor of the alternative hypothesis. In other words, we conclude that there is a statistically significant difference between the sample mean and the hypothesized population mean.
For example, imagine a shoe manufacturer claims that their shoes last an average of 12 months. You suspect this might not be accurate, so you collect a sample of shoes and test their lifespan. The Z-test can help you determine whether the average lifespan in your sample is statistically different from the claimed 12 months.
Assumptions of the Z-Test
The Z-test, like many statistical procedures, is based on certain assumptions. If these assumptions are not met, the results obtained from the test may not be valid or reliable. Therefore, before conducting a Z-test, it's essential to confirm that your data meets the following assumptions:
- Normally Distributed Population
  The population from which the samples are taken should be normally distributed for the Z-test to be valid. However, thanks to the Central Limit Theorem, this requirement can be relaxed when the sample size is large enough (a common rule of thumb is more than 30): the distribution of the sample means will then approximate a normal distribution even if the population itself is not normal, although heavily skewed populations may need larger samples.
- Known Population Standard Deviation
  This is perhaps the most stringent assumption of the Z-test: the standard deviation of the population must be known. In practice, it is often unknown, and the sample standard deviation is used as an estimate of the population standard deviation. When this happens, it’s more appropriate to use a T-test instead of a Z-test.
- Interval or Ratio Level of Measurement
  The Z-test requires that the variables being tested are measured on either an interval or ratio scale. This means that the data should be continuous or quantitative, with consistent intervals between measurements.
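The Central Limit Theorem claim above can be checked with a small simulation. The following is only an illustrative sketch: the exponential population, the sample size of 40, the number of repetitions, and the seed are all assumptions chosen for the demonstration, not values from this article.

```python
import math
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed (an arbitrary choice) so the run is repeatable

SAMPLE_SIZE = 40     # assumed sample size, above the "more than 30" rule of thumb
NUM_SAMPLES = 5000   # assumed number of repeated samples

# Exponential(1) population: strongly right-skewed, with mean 1 and standard deviation 1
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of the true mean;
# for a normal sampling distribution this would be about 0.95
within = sum(abs(m - 1.0) <= 1.96 * expected_se for m in sample_means) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {expected_se:.3f})")
print(f"share within 1.96 SE: {within:.3f}")
```

Even though the population is far from normal, the sample means should cluster around the true mean with a spread close to the theoretical standard error, which is what the Z-test relies on.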
Types of Z-Test
Depending on the nature of the data and the question we are trying to answer, there are different types of Z-tests that can be used. Here are the three most common types:
- One-Sample Z-Test
  The one-sample Z-test is used when you want to know whether your sample comes from a particular population. The population mean and standard deviation are known from previous studies. You then take a sample from this population, find the sample's mean, and then compare this with the population mean. If the sample mean is sufficiently different from the population mean, you can conclude that your sample probably didn't come from that population.
- Two-Sample Z-Test
  The two-sample Z-test is used when you are interested in comparing two independent groups to see if they come from the same population. For example, you might want to know whether the average height of men differs from the average height of women. You would randomly sample men and women, and compare the mean height of the men with the mean height of the women. If the difference is statistically significant, you would conclude that men and women probably come from different populations with regard to height.
- Paired Z-Test
  The paired Z-test is used when you are interested in the difference between two variables for the same group of subjects. For example, you might be interested in whether a training program improves performance. You measure performance before and after the training program, and compare the two sets of scores. If the mean difference is statistically significant, you would conclude that the training program probably had an effect.
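To make the two-sample and paired variants more concrete, here is a minimal sketch of their test statistics using only the Python standard library. The function names, the height and score values, and the "known" standard deviations are all hypothetical and chosen purely for illustration. The paired version assumes the standard deviation of the differences is known.

```python
import math
from statistics import fmean

def two_sample_z(x, y, sigma_x, sigma_y):
    """Z statistic for the difference between two independent means,
    assuming both population standard deviations are known."""
    standard_error = math.sqrt(sigma_x**2 / len(x) + sigma_y**2 / len(y))
    return (fmean(x) - fmean(y)) / standard_error

def paired_z(before, after, sigma_diff):
    """Z statistic for the mean of paired differences (after - before),
    assuming the standard deviation of the differences is known."""
    diffs = [a - b for a, b in zip(after, before)]
    return fmean(diffs) / (sigma_diff / math.sqrt(len(diffs)))

# Hypothetical heights in cm, with assumed known population standard deviations
men = [178, 182, 175, 180]
women = [165, 170, 162, 167]
z_two = two_sample_z(men, women, sigma_x=7.0, sigma_y=6.0)

# Hypothetical scores before and after a training program
before = [10, 12, 9, 11]
after = [12, 13, 11, 12]
z_paired = paired_z(before, after, sigma_diff=1.0)

print(f"two-sample Z = {z_two:.3f}")
print(f"paired Z     = {z_paired:.3f}")
```

In both cases the resulting Z value is compared against a critical value in the same way as for the one-sample test.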
Steps in Performing a Z-Test
Performing a Z-test involves a series of steps. Here, we will outline the general process for conducting a one-sample Z-test. Remember, this is a simplified example and the specific steps might vary based on the nature of your data and the type of Z-test.
State the Null and Alternative Hypotheses
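The null hypothesis ($H_0$) states that the population mean equals the hypothesized value, while the alternative hypothesis ($H_1$) states that it does not.
For example:
- $H_0: \mu = \mu_0$
- $H_1: \mu \neq \mu_0$
where $\mu$ is the population mean and $\mu_0$ is the hypothesized population mean.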
Choose the Significance Level
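The significance level (denoted as $\alpha$) is the probability of rejecting the null hypothesis when it is actually true. A common choice is 0.05.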
Calculate the Z-Score
The Z-score is calculated using the following formula:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
where:
- $\bar{X}$ is the sample mean
- $\mu_0$ is the hypothesized population mean
- $\sigma$ is the population standard deviation
- $n$ is the sample size
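As a worked illustration, the calculation can be carried out by hand with the figures from the Python example later in this article: 15 observations summing to 230, a hypothesized mean of 20, and a known population standard deviation of 5. The rounding to two decimals is for readability only.

$$\bar{X} = \frac{230}{15} \approx 15.33$$

$$Z = \frac{15.33 - 20}{5 / \sqrt{15}} \approx \frac{-4.67}{1.29} \approx -3.61$$

The sample mean therefore lies about 3.6 standard errors below the hypothesized population mean.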
Determine the Critical Z-Score
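The critical Z-score marks the boundary of the rejection region: it is the value beyond which the area under the standard normal curve equals the significance level (split equally between both tails for a two-tailed test). For a two-tailed test at a 0.05 significance level, the critical Z-score is approximately ±1.96.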
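Critical values for other significance levels can be obtained from the inverse of the standard normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `critical_z` and the list of significance levels are illustrative choices, not part of any library.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Positive critical Z-score for a given significance level.
    (critical_z is a hypothetical helper defined here for illustration.)"""
    # A two-tailed test splits alpha between the two tails
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.10, 0.05, 0.01):
    print(
        f"alpha = {alpha:.2f}: "
        f"two-tailed ±{critical_z(alpha):.3f}, "
        f"one-tailed {critical_z(alpha, two_tailed=False):.3f}"
    )
```

For a two-tailed test the rejection region is everything beyond the positive and negative critical values. For a one-tailed test it lies in a single tail.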
Make a Decision
- If |Z| is greater than the critical Z-score, we reject the null hypothesis.
- If |Z| is less than or equal to the critical Z-score, we fail to reject the null hypothesis.
Remember, failing to reject the null hypothesis is not the same as accepting it. It merely means that there is not enough evidence to support the alternative hypothesis.
Report the Results
Finally, you should report your results. This includes the Z-value, the p-value (the probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true), and whether you rejected or failed to reject the null hypothesis.
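The p-value mentioned above can be computed directly from the Z-score. The following sketch, again using only the standard library, shows a two-sided p-value and checks that comparing it with the significance level gives the same decision as comparing |Z| with the critical Z-score. The function name and the example Z values are assumptions made for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z,
    counting both tails. (Hypothetical helper for illustration.)"""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Example Z values chosen arbitrarily for the demonstration
decisions = []
for z in (0.5, 1.96, 2.5, -3.0):
    p = two_sided_p_value(z)
    reject_by_p = p < alpha
    reject_by_z = abs(z) > z_critical
    decisions.append((reject_by_p, reject_by_z))
    print(f"Z = {z:+.2f}, p = {p:.4f}, reject H0: {reject_by_p}")
```

Reporting the p-value alongside the decision lets readers judge the strength of the evidence at significance levels other than the one you chose.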
Python Implementation of Z-Test
In this section, we will implement a one-sample Z-test using Python.
First, you'll need to import the necessary libraries.
import numpy as np
from scipy import stats
Now, suppose you have a sample data set and you want to test whether its mean differs from the hypothesized population mean. Let's assume the population standard deviation is known.
# Sample data
sample_data = np.array([2, 3, 5, 6, 9, 11, 12, 15, 18, 20, 22, 24, 26, 27, 30])
# Known population standard deviation
population_std_dev = 5.0
# Hypothesized population mean
population_mean = 20.0
Now, we can calculate the sample mean and size:
# Calculate sample mean and size
sample_mean = np.mean(sample_data)
sample_size = len(sample_data)
Next, we calculate the Z-score:
# Calculate Z-score
Z = (sample_mean - population_mean) / (population_std_dev / np.sqrt(sample_size))
We can find the critical Z-score using the scipy.stats module. For a two-tailed test at a significance level of 0.05, the critical Z-score is approximately ±1.96.
# Two-tailed test for 95% confidence level
Z_critical = stats.norm.ppf(0.975) # 0.975 = 1 - (0.05/2)
Finally, we make a decision based on the comparison between the calculated Z-score and the critical Z-score:
# Make a decision
if np.abs(Z) > Z_critical:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
This will print "Reject the null hypothesis." or "Fail to reject the null hypothesis." depending on whether the absolute value of the calculated Z-score is greater than the critical Z-score.
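The walkthrough above stops at the decision and does not compute the p-value that the reporting step asks for. One possible way to collect the statistic, the p-value, and the decision in a single reusable function is sketched below. It uses only the standard library rather than NumPy and SciPy so that it runs without extra dependencies. The function name `one_sample_z_test` and its return format are illustrative choices, not an established API.

```python
import math
from statistics import NormalDist, fmean

def one_sample_z_test(sample, mu_0, sigma, alpha=0.05):
    """Two-tailed one-sample Z-test with a known population standard deviation.
    (Hypothetical helper; the name and return format are illustrative.)"""
    n = len(sample)
    sample_mean = fmean(sample)
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"z": z, "p_value": p_value, "reject_null": p_value < alpha}

# Same sample data and parameters as in the walkthrough above
sample_data = [2, 3, 5, 6, 9, 11, 12, 15, 18, 20, 22, 24, 26, 27, 30]
result = one_sample_z_test(sample_data, mu_0=20.0, sigma=5.0)

print(f"Z = {result['z']:.3f}")
print(f"p-value = {result['p_value']:.5f}")
print("Reject the null hypothesis." if result["reject_null"]
      else "Fail to reject the null hypothesis.")
```

For this sample the Z-score is about -3.61, well beyond ±1.96, so the null hypothesis is rejected at the 0.05 level.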