2022-08-02

Decision Tree Algorithms

Introduction

Decision trees employ various algorithms to decide how to split a node into two or more sub-nodes. The goal of splitting is to make the resulting sub-nodes more homogeneous; in other words, the purity of each node with respect to the target variable increases. The decision tree examines all candidate variables for splitting and chooses the one that produces the most homogeneous sub-nodes.

In this article, I will introduce five key decision tree algorithms: ID3, C4.5, CART, CHAID, and MARS.

ID3 Algorithm

The Iterative Dichotomiser 3 (ID3) algorithm is one of the earliest and most well-known decision tree algorithms. It is primarily used for classification tasks with categorical features. The ID3 algorithm operates by selecting the best attribute for splitting the dataset based on the Information Gain criterion, which measures the reduction in entropy achieved by partitioning the data. The algorithm is recursive, continually splitting the dataset until all instances belong to a single class or a stopping criterion is met. The Information Gain (IG) is calculated as follows:

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where S is the dataset, A is the attribute, S_v represents the subset of S with attribute value v, and H is the entropy function:

H(S) = - \sum_{c \in Classes} p(c) \log_2 p(c)
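
To make the criterion concrete, here is a minimal NumPy sketch of the entropy and Information Gain calculations. The entropy and information_gain helpers and the toy weather-style data are purely illustrative, not part of any particular library.

import numpy as np

def entropy(labels):
    # H(S) = -sum_c p(c) log2 p(c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    weighted = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Toy example: a categorical "outlook" attribute vs. play/no-play labels
labels = np.array(["yes", "yes", "no", "no", "yes"])
outlook = np.array(["sunny", "rain", "sunny", "sunny", "rain"])
print(information_gain(labels, outlook))

ID3 evaluates this quantity for every candidate attribute and splits on the one with the largest Information Gain.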

C4.5 Algorithm

C4.5 is an extension of the ID3 algorithm. It addresses some of the limitations of ID3, such as its inability to handle continuous attributes and missing data. The C4.5 algorithm uses the Gain Ratio criterion, a modification of Information Gain, to choose the best attribute for splitting. The Gain Ratio (GR) is defined as:

GR(S, A) = \frac{IG(S, A)}{IV(S, A)}

where IV(S, A) is the Intrinsic Value of attribute A:

IV(S, A) = - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
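
A corresponding sketch for the Gain Ratio follows, again with invented toy data; the split_info and gain_ratio helpers are assumptions of this example rather than an existing API.

import numpy as np

def split_info(counts):
    # Entropy-style term over group sizes: -sum p log2 p
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute_values):
    # GR(S, A) = IG(S, A) / IV(S, A)
    values, val_counts = np.unique(attribute_values, return_counts=True)
    _, class_counts = np.unique(labels, return_counts=True)
    ig = split_info(class_counts)  # H(S)
    for v, n_v in zip(values, val_counts):
        _, sub_counts = np.unique(labels[attribute_values == v], return_counts=True)
        ig -= n_v / len(labels) * split_info(sub_counts)  # subtract weighted H(S_v)
    iv = split_info(val_counts)  # IV(S, A)
    return ig / iv if iv > 0 else 0.0

labels = np.array(["yes", "yes", "no", "no", "yes"])
outlook = np.array(["sunny", "rain", "sunny", "sunny", "rain"])
print(gain_ratio(labels, outlook))

Dividing by the Intrinsic Value penalizes attributes with many distinct values, which would otherwise be favored by plain Information Gain.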

CART Algorithm

Classification and Regression Trees (CART) is another popular decision tree algorithm that can handle both classification and regression tasks. Unlike ID3 and C4.5, CART produces binary trees, with each internal node having exactly two child nodes. The CART algorithm uses the Gini Impurity criterion for classification tasks and Mean Squared Error (MSE) for regression tasks to select the best attribute for splitting. The Gini Impurity (GI) is calculated as follows:

GI(S) = 1 - \sum_{c \in Classes} p(c)^2

For regression tasks, the MSE is defined as:

MSE(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y})^2

where y_i is the target value for instance i, and \bar{y} is the mean target value of the dataset.
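
Below is a small sketch of the Gini Impurity calculation together with a scikit-learn fit. scikit-learn's DecisionTreeClassifier implements an optimized CART-style tree and uses Gini impurity by default; the toy data here is made up purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(labels):
    # GI(S) = 1 - sum_c p(c)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = np.array(["yes", "yes", "no", "no", "yes"])
print(gini_impurity(labels))  # 1 - (0.6^2 + 0.4^2) = 0.48

# CART-style binary tree on a single numeric feature (criterion="gini" is the default)
X = np.array([[2.0], [3.5], [1.0], [0.5], [4.0]])
tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(tree.predict([[3.0]]))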

CHAID Algorithm

The Chi-squared Automatic Interaction Detector (CHAID) algorithm is a decision tree algorithm that relies on the Chi-squared test of independence to determine the best attribute for splitting. CHAID is particularly useful for exploring interactions between categorical variables and can create multiway trees, in which an internal node may have more than two child nodes. The algorithm can also handle missing data, typically by treating missing values as a separate category of the attribute.

The Chi-squared statistic is computed as:

\chi^2 = \sum_{i=1}^n \sum_{j=1}^m \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

where O_{ij} is the observed frequency of instances in category i of the attribute and class j, E_{ij} is the expected frequency under the assumption of independence, n is the number of categories for the attribute, and m is the number of classes. The CHAID algorithm chooses the attribute whose Chi-squared test is most significant for splitting, provided it meets a predefined significance level.
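
The statistic can be computed from a contingency table of observed counts, for example with scipy.stats.chi2_contingency. The sketch below shows only this split-selection test, not a full CHAID implementation, and the counts are invented for illustration.

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = categories of a candidate attribute,
# columns = classes of the target; entries are observed counts O_ij.
observed = np.array([
    [30, 10],  # category 1
    [20, 25],  # category 2
    [ 5, 40],  # category 3
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # test statistic and its p-value
print(expected)       # E_ij under independence

# A CHAID-style rule keeps the attribute only if p_value is below a chosen
# significance threshold (e.g. 0.05), comparing candidate attributes by their tests.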

MARS Algorithm

Multivariate Adaptive Regression Splines (MARS) is an algorithm developed by Jerome H. Friedman that is closely related to decision trees. MARS is primarily used for regression tasks and is designed to handle high-dimensional and noisy data. It builds a model by fitting piecewise linear regression functions called basis functions to the data. MARS combines forward selection, backward elimination (pruning), and knot selection to create an interpretable, flexible, and accurate model. While MARS is not a traditional decision tree algorithm, it shares some similarities in its structure and approach to modeling complex relationships.

The MARS model can be represented as:

y(x) = \beta_0 + \sum_{i=1}^M \beta_i B_i(x)

where y(x) is the predicted target value for input x, \beta_0 is the constant term, M is the number of basis functions B_i(x) in the model, and \beta_i are the corresponding coefficients. The basis functions are piecewise linear functions of the form:

B_i(x) = \max(0, x - t) or B_i(x) = \max(0, t - x)

where t is a knot location selected by the algorithm. MARS optimizes the coefficients and knot locations to minimize the residual sum of squares:

RSS = \sum_{i=1}^N (y_i - y(x_i))^2

where N is the number of instances in the dataset, and y_i and x_i are the target values and input features for instance i, respectively.
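
The sketch below illustrates how the hinge basis functions enter the model by fitting the coefficients with ordinary least squares on hand-picked knots. The real MARS procedure also searches over knot locations and performs forward selection and backward pruning, which this toy example omits.

import numpy as np

def hinge_basis(x, knots):
    # For each knot t, add the pair max(0, x - t) and max(0, t - x),
    # plus a leading column of ones for the constant term beta_0.
    cols = [np.ones_like(x)]
    for t in knots:
        cols.append(np.maximum(0.0, x - t))
        cols.append(np.maximum(0.0, t - x))
    return np.column_stack(cols)

# Toy 1-D data with a kink at x = 5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 2 * x, 10 - 3 * (x - 5)) + rng.normal(0, 0.5, x.shape)

# Fixed knot chosen by hand; real MARS searches over knot locations
B = hinge_basis(x, knots=[5.0])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # minimize RSS
y_hat = B @ beta
print(beta)
print(np.sum((y - y_hat) ** 2))  # residual sum of squares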
