Random Forests

Posted on September 20, 2022

Neville Dubash headshot

A decision tree is a simple classification algorithm. For a given data point, it starts by looking at one feature and comparing it to a threshold and following the corresponding path down the tree until a terminal node (a “leaf”) is reached, which provides the classification value.

Often, however, a single decision tree is not sufficient to obtain good results. A single decision tree might perform well on the training data but generalize poorly to new data (i.e., become overfit). A better approach is to use an ensemble of decision trees. One such ensemble model is the Random Forest model. Random forests are considered a relatively simple, accurate, and versatile model that can be used for both classification and regression tasks.

The idea behind random forests is referred to as bootstrap aggregation or bagging. Instead of training a single tree on the whole training data, many trees are trained but each on a random subset of the training data. Furthermore, the features that are considered when searching for the best way to make a decision/split are chosen from a random subset of all the features. To make a prediction with the ensemble, the predictions of all the individual trees are calculated and an average is returned. The inclusion of randomness when generating the trees results in a model that has less variance and is less prone to overfitting than a single tree. Consequently, the ensemble usually generalizes better.

The one trade-off from using an ensemble is interpretability. A single decision tree is easily interpretable – by simply following the path the decision tree takes to arrive at the decision. A random forests model can have hundreds of trees being averaged together, which makes interpretation very difficult. Random forests do however provide a means to estimate the relative importance of each of the input features (and are sometimes even used to perform feature selection for more complex models).

Random forests are just one of the models we consider when developing predictive models.