2022-08-02

Random Forests with the Titanic Dataset

Introduction

In this article, I will demonstrate how to implement a random forest classifier using the Titanic dataset from the seaborn library.

Preparing the dataset

First, let's import the necessary libraries and load the Titanic dataset.

python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load the dataset
data = sns.load_dataset('titanic')

# Drop unnecessary columns
data = data.drop(['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class'], axis=1)

# Handle missing values
data['age'] = data['age'].fillna(data['age'].median())
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])

# Encode categorical variables
encoder = LabelEncoder()
data['sex'] = encoder.fit_transform(data['sex'])
data['embarked'] = encoder.fit_transform(data['embarked'])

# Split the dataset into training and testing sets
X = data.drop('survived', axis=1)
y = data['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
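
Before moving on, an optional sanity check (a minimal sketch that reuses the variables defined above) can confirm that no missing values remain and that every column is in a form scikit-learn can consume directly.

python
# Optional sanity check: no missing values should remain, and all columns
# should be numeric or boolean after the preprocessing above.
print(data.isnull().sum())
print(data.dtypes)
print(X_train.shape, X_test.shape)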

Building the model

Next, we will create a random forest classifier using scikit-learn.

python
# Create a random forest classifier with additional hyperparameters
rf_clf = RandomForestClassifier(
    n_estimators=100,      # Number of trees in the forest
    criterion='gini',      # Function to measure the quality of a split ('gini' or 'entropy')
    max_depth=None,        # Maximum depth of the tree (None means nodes are expanded until all leaves are pure)
    min_samples_split=2,   # Minimum number of samples required to split an internal node
    min_samples_leaf=1,    # Minimum number of samples required to be at a leaf node
    max_features='sqrt',   # Number of features to consider when looking for the best split ('sqrt', 'log2', None, or an integer; 'auto' is deprecated)
    bootstrap=True,        # Whether bootstrap samples are used when building trees
    oob_score=False,       # Whether to use out-of-bag samples to estimate the generalization accuracy
    n_jobs=None,           # Number of jobs to run in parallel for both fit and predict (-1 means using all processors)
    random_state=42,       # Controls both the randomness of the bootstrapping and feature sampling
    verbose=0,             # Controls the verbosity when fitting and predicting
    warm_start=False,      # Reuse the solution of the previous call to fit and add more estimators to the ensemble
    class_weight=None      # Weights associated with classes (None or 'balanced')
)

Here is a brief explanation of the additional hyperparameters:

  • n_estimators
    The number of decision trees in the random forest ensemble. It controls the size of the forest by specifying how many individual trees are built and combined. The default value is 100, meaning the forest consists of 100 decision trees.

  • criterion
    The function used to measure the quality of a split. Supported criteria are "gini" for Gini impurity and "entropy" for information gain. By default, it's set to "gini".

  • max_depth
    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. A higher value may lead to overfitting, while a lower value may result in underfitting.

  • min_samples_split
    The minimum number of samples required to split an internal node. Increasing this value may reduce overfitting but could result in a less accurate model.

  • min_samples_leaf
    The minimum number of samples required to be at a leaf node. Increasing this value may reduce overfitting but could result in a less accurate model.

  • max_features
    The number of features to consider when looking for the best split. It can be set to 'sqrt', 'log2', None, or an integer. If 'sqrt', then max_features=sqrt(n_features) is used; if 'log2', then max_features=log2(n_features); if None, all features are considered; if an integer, that many features are considered at each split. (The older 'auto' option was an alias for 'sqrt' and has since been deprecated.)

  • bootstrap
    Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

  • oob_score
    Whether to use out-of-bag samples to estimate the generalization accuracy. Out-of-bag samples are those not included in the bootstrap sample for a particular tree; see the sketch after this list for a usage example.

  • n_jobs
    The number of jobs to run in parallel for both fit and predict. -1 means using all processors.

  • verbose
    Controls the verbosity when fitting and predicting. A higher value will output more information during the process.

  • warm_start
    When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. This can save time when tuning hyperparameters iteratively, as it reuses the previously trained trees and adds new ones, rather than training all trees from scratch.

  • class_weight
    Weights associated with classes. If None, all classes are assumed to have equal weight. If 'balanced', the class weights are adjusted inversely proportional to the class frequencies in the data, which can be useful when dealing with imbalanced datasets; the sketch after this list shows this option in use.
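
To make the oob_score and class_weight options concrete, here is a short illustrative sketch that reuses the training data from earlier. The specific values are for demonstration only and are not tuned.

python
# Illustrative only: enable out-of-bag scoring and balanced class weights.
# bootstrap=True is required for out-of-bag samples to exist.
rf_oob = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',        # consider sqrt(n_features) at each split
    bootstrap=True,
    oob_score=True,             # estimate accuracy from out-of-bag samples
    class_weight='balanced',    # reweight classes by inverse frequency
    random_state=42,
)
rf_oob.fit(X_train, y_train)

# oob_score_ is an internal estimate of generalization accuracy that does
# not use the held-out test set.
print(f"OOB accuracy: {rf_oob.oob_score_:.2f}")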

Training and evaluation

Now, let's train the random forest classifier on the training data and evaluate its performance on the testing data.

python
# Train the model
rf_clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Accuracy: 0.78
Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.83      0.81       157
           1       0.74      0.70      0.72       111

    accuracy                           0.78       268
   macro avg       0.77      0.77      0.77       268
weighted avg       0.77      0.78      0.78       268

Confusion Matrix:
[[130  27]
 [ 33  78]]
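
A single train/test split can look slightly better or worse depending on how the rows happen to be divided. As an optional complement, k-fold cross-validation gives a more stable accuracy estimate; a minimal sketch using the same features:

python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the full dataset; each fold trains a fresh
# copy of the classifier, so the scores vary only with the fold assignment.
cv_scores = cross_val_score(rf_clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validated accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")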

Visualizing feature importance

Finally, we will visualize the feature importance of the random forest model.

python
# Calculate feature importances
importances = rf_clf.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [X.columns[i] for i in indices]

# Create a bar plot
plt.figure(figsize=(10, 5))
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])

# Add feature names as x-axis labels
plt.xticks(range(X.shape[1]), names, rotation=90)

# Show the plot
plt.show()

[Figure: Feature importance bar plot]

Plotting the importances this way lets us see at a glance which features matter most. Sex, age, and fare stand out as highly important, which is a convincing result, since these factors made the difference between life and death on the Titanic.
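
For exact numbers rather than a chart, the same importances can also be printed as a sorted series; this is simply another view of the values plotted above.

python
# Print the feature importances in descending order for exact values.
importance_series = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(importance_series)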
