What is Linear Regression?
Linear regression is a machine learning algorithm used to model the relationship between a dependent variable (also known as the response, outcome, or target variable) and one or more independent variables (also known as predictors, features, or input variables). The primary goal of linear regression is to predict the value of the dependent variable from the values of the independent variables. This is achieved by fitting a linear equation to the observed data points, which can be represented as a straight line in the case of simple linear regression or as a hyperplane in multiple linear regression.
The underlying principle of linear regression is to minimize the difference between the predicted values and the actual values. This difference, known as the residual, is the vertical distance between the data points and the fitted line or hyperplane. By minimizing the sum of the squared residuals, we obtain the best-fitting line or hyperplane that can be used to make predictions about the dependent variable.
Assumptions of Linear Regression
To ensure that the linear regression model provides accurate and reliable predictions, several assumptions must be met:
- Linearity: There should be a linear relationship between the dependent variable and the independent variables. This can be checked using scatter plots or correlation coefficients.
- Independence: The independent variables should not be highly correlated with each other. Multicollinearity can lead to unstable estimates and can be addressed by removing redundant variables or using regularization techniques.
- Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables. Heteroscedasticity can be detected using scatter plots or diagnostic tests and can be addressed by using weighted least squares or transforming the dependent variable.
- Normality: The residuals should be normally distributed. This can be checked using histograms, Q-Q plots, or statistical tests such as the Shapiro-Wilk test. Non-normality can be addressed by transforming the dependent variable or using robust regression techniques.
- Independence of Errors: The residuals should be independent of each other. This can be checked using the Durbin-Watson test or by plotting the residuals against time or the predicted values. Autocorrelation can be addressed by using time series models or incorporating lagged variables. Two of these checks are sketched in code after this list.
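As a quick illustration of the last two checks, here is a minimal sketch using statsmodels and scipy; the data is a synthetic stand-in for your own dataset.

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Synthetic stand-in data: two predictors and a linear response with noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

# Fit an OLS model with an intercept and extract the residuals
model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

# Normality: Shapiro-Wilk test on the residuals (p > 0.05 suggests normality is plausible)
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", shapiro_p)

# Independence of errors: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print("Durbin-Watson statistic:", durbin_watson(residuals))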
Simple Linear Regression
In simple linear regression, we aim to establish a linear relationship between a single independent variable ($X$) and the dependent variable ($Y$). This relationship is modeled by the equation:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

where:
- $Y$ is the dependent variable
- $X$ is the independent variable
- $\beta_0$ is the intercept of the line (the value of $Y$ when $X$ is 0)
- $\beta_1$ is the slope of the line (the change in $Y$ for a unit change in $X$)
- $\epsilon$ is the error term (the difference between the actual and predicted values of $Y$)

The best-fit line is the one that minimizes the sum of squared residuals (the squared differences between the actual and predicted values of $Y$).
Least Squares Method
The least squares method is a mathematical approach to finding the best-fit line by minimizing the sum of squared residuals. The estimates for the slope and intercept of the best-fit line can be calculated using the following formulas:

$$\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\beta_0 = \bar{Y} - \beta_1 \bar{X}$$

where:
- $n$ is the number of data points
- $X_i$ and $Y_i$ are the individual data points
- $\bar{X}$ and $\bar{Y}$ are the means of the independent and dependent variables, respectively
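These formulas translate directly into NumPy. The following is a minimal sketch on synthetic data; the toy dataset is purely illustrative.

import numpy as np

# Toy data: Y roughly follows 2 + 3X plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=50)
Y = 2 + 3 * X + rng.normal(scale=2, size=50)

# Closed-form least squares estimates for the slope and intercept
x_bar, y_bar = X.mean(), Y.mean()
beta_1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar
print("Intercept:", beta_0, "Slope:", beta_1)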
Evaluating Model Performance
Once the best-fit line is obtained, we need to evaluate the model's performance to ensure that it is a good fit for the data. Some common metrics used to assess the performance of a simple linear regression model are:
Coefficient of Determination (R^2)
This metric measures the proportion of the variance in the dependent variable that can be explained by the independent variable. An $R^2$ value close to 1 indicates that the model explains most of the variance, while a value close to 0 indicates a poor fit:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

where:
- $\hat{Y}_i$ is the predicted value of $Y$ for the $i^{th}$ data point
Mean Squared Error (MSE)
This metric calculates the average squared difference between the actual and predicted values of the dependent variable:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

A lower MSE indicates a better fit of the model to the data.
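Both metrics are simple to compute by hand. The sketch below assumes NumPy arrays y_true and y_pred of equal length; scikit-learn's r2_score and mean_squared_error, used later in this chapter, implement the same definitions.

import numpy as np

def r_squared(y_true, y_pred):
    # 1 minus the ratio of the residual sum of squares to the total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mse(y_true, y_pred):
    # Average squared residual
    return np.mean((y_true - y_pred) ** 2)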
Multiple Linear Regression
Multiple linear regression extends the concept of simple linear regression to include multiple independent variables. The equation for the relationship between the dependent variable ($Y$) and the independent variables ($X_1, X_2, ..., X_p$) is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon$$

where:
- $Y$ is the dependent variable
- $X_1, X_2, ..., X_p$ are the independent variables
- $\beta_0$ is the intercept of the hyperplane
- $\beta_1, \beta_2, ..., \beta_p$ are the coefficients for the independent variables
- $\epsilon$ is the error term (the difference between the actual and predicted values of $Y$)
Matrix Approach
In multiple linear regression, we use the matrix approach to estimate the coefficients of the independent variables. The least squares estimates can be obtained by solving the following matrix equation:

$$\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$$

where:
- $\boldsymbol{\beta}$ is a vector containing the coefficients ($\beta_0, \beta_1, ..., \beta_p$)
- $\mathbf{X}$ is the matrix of independent variables, with each row representing a data point and each column representing an independent variable
- $\mathbf{Y}$ is the vector of dependent variable values
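As a sketch, the normal equations can be solved directly with NumPy; the data here is synthetic, and np.linalg.solve is used instead of an explicit matrix inverse for numerical stability.

import numpy as np

# Synthetic data: three predictors with known coefficients plus noise
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 1.0 + X_raw @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

# Prepend a column of ones so the first estimated coefficient is the intercept
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Solve (X^T X) beta = X^T y rather than forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("Estimated coefficients:", beta)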
Handling Multicollinearity
Multicollinearity arises when two or more independent variables are highly correlated. It can lead to unstable estimates and make it difficult to interpret the coefficients of the independent variables. To detect multicollinearity, we can calculate the variance inflation factor (VIF) for each independent variable:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination obtained by regressing the $j^{th}$ independent variable on all the other independent variables. As a common rule of thumb, a VIF above 5 (or, more leniently, 10) signals problematic multicollinearity.
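In practice, VIFs can be computed with statsmodels. Here is a minimal sketch assuming the predictors live in a pandas DataFrame X; the helper name compute_vifs is our own.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vifs(X: pd.DataFrame) -> pd.Series:
    # Add an intercept so each auxiliary regression includes a constant term
    X_const = sm.add_constant(X)
    return pd.Series({
        col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)
        if col != "const"
    })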
To address multicollinearity, we can:
- Remove one of the correlated variables
- Combine correlated variables into a single variable (e.g., by taking their average)
- Apply regularization techniques, such as ridge or lasso regression
Feature Selection and Scaling
In multiple linear regression, it is essential to select the most relevant independent variables to avoid overfitting and improve model interpretability. Feature selection techniques, such as stepwise regression, recursive feature elimination, and LASSO, can be used to identify the most important variables.
Additionally, when the independent variables have different scales, it can be challenging to compare their coefficients. In such cases, feature scaling methods, such as normalization or standardization, can be applied to bring all variables to a similar scale.
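As a brief sketch with scikit-learn (assuming training and test splits named X_train and X_test), standardization would look like this; note that the scaler is fit on the training data only, so no information leaks from the test set.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)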
Implementing Linear Regression in Python
In this chapter, we will implement both simple and multiple linear regression using Python and the California Housing dataset. This dataset is a popular choice for regression tasks and is available in the scikit-learn library.
First, let's import the necessary libraries and load the California Housing dataset from the scikit-learn library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the California Housing dataset
dataset = fetch_california_housing()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target
# Display the first few rows of the dataset
print(X.head())
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
Simple Linear Regression
We will start by implementing simple linear regression, using the MedInc feature (the median income in a given area) to predict the median house price.
Split the data into training and testing sets.
X_simple = X[["MedInc"]]
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.2, random_state=42)
Create a simple linear regression model and fit it to the training data.
simple_lr = LinearRegression()
simple_lr.fit(X_train_simple, y_train)
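The fitted intercept and slope correspond to the $\beta_0$ and $\beta_1$ of the theory section above and can be read off the model's attributes:

print("Intercept (beta_0):", simple_lr.intercept_)
print("Slope (beta_1):", simple_lr.coef_[0])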
Evaluate the model's performance using MSE and $R^2$.
y_pred_simple = simple_lr.predict(X_test_simple)
mse_simple = mean_squared_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)
print("Simple Linear Regression - MSE:", mse_simple)
print("Simple Linear Regression - R² Score:", r2_simple)
Simple Linear Regression - MSE: 0.7091157771765549
Simple Linear Regression - R² Score: 0.45885918903846656
Plot the best-fit line using matplotlib and seaborn.
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_test_simple["MedInc"], y=y_test, alpha=0.6, label="Actual")
sns.lineplot(x=X_test_simple["MedInc"], y=y_pred_simple, color="red", label="Prediction")
plt.xlabel("Median Income")
plt.ylabel("Median House Price")
plt.title("Simple Linear Regression: Median Income vs. Median House Price")
plt.legend()
plt.show()
Multiple Linear Regression
Now let's implement multiple linear regression using all the features in the dataset.
Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Create a multiple linear regression model and fit it to the training data.
multiple_lr = LinearRegression()
multiple_lr.fit(X_train, y_train)
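To see how each feature contributes to the prediction, we can pair the fitted coefficients with the feature names; this step is optional and simply aids interpretation.

coef_table = pd.Series(multiple_lr.coef_, index=X.columns)
print("Intercept:", multiple_lr.intercept_)
print(coef_table.sort_values())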
Evaluate the model's performance using mean squared error (MSE) and $R^2$.
y_pred_multiple = multiple_lr.predict(X_test)
mse_multiple = mean_squared_error(y_test, y_pred_multiple)
r2_multiple = r2_score(y_test, y_pred_multiple)
print("Multiple Linear Regression - MSE:", mse_multiple)
print("Multiple Linear Regression - R² Score:", r2_multiple)
Multiple Linear Regression - MSE: 0.5558915986952444
Multiple Linear Regression - R² Score: 0.5757877060324508
Now let's compare the performance of the simple and multiple linear regression models using their MSE and $R^2$ scores, rounded to three decimal places for readability.
print("Simple Linear Regression - MSE:", mse_simple)
print("Simple Linear Regression - R² Score:", r2_simple)
print("Multiple Linear Regression - MSE:", mse_multiple)
print("Multiple Linear Regression - R² Score:", r2_multiple)
Simple Linear Regression - MSE: 0.709
Simple Linear Regression - R² Score: 0.459
Multiple Linear Regression - MSE: 0.556
Multiple Linear Regression - R² Score: 0.576
Based on these results, the multiple linear regression model achieves a lower MSE and a higher $R^2$ score than the simple model: using all eight features explains noticeably more of the variance in house prices than median income alone.