Seeing the Random Forest in Decision Trees

A Random Forest classifier combines the predictions of many Decision Tree classifiers, typically by majority vote (for regression tasks, the forest's prediction is the mean of the trees' predictions). To understand Random Forest models, the Decision Tree classifier is a good starting point.

A Decision Tree classifier guides each record through a treelike structure of nodes; at each node, a decision determines whether the record proceeds down the left or the right branch. Each decision compares a specific explanatory variable against a benchmark value, and both the variable and the benchmark are chosen by an optimization procedure during training. Every record from the training dataset is guided through the tree and ends up in one of the leaf bins at the bottom of the tree.
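This traversal can be sketched in a few lines of Python. The variable names ("height", "weight"), the benchmark values, and the class labels below are all hypothetical; in a real tree they would be learned from the training data by the optimization procedure:

```python
# Illustrative sketch of a record traversing a small decision tree.
# Variables, benchmarks, and labels are made up for demonstration.

def classify(record):
    # Root node: compare the "height" variable against its benchmark.
    if record["height"] <= 1.5:
        # Left branch: the next node uses a different variable, "weight".
        if record["weight"] <= 40.0:
            return "class A"  # leaf bin
        return "class B"      # leaf bin
    # Right branch ends directly in a leaf bin.
    return "class B"

print(classify({"height": 1.2, "weight": 35.0}))  # → class A
print(classify({"height": 2.0, "weight": 35.0}))  # → class B
```

Each `if` corresponds to one node of the tree; a record's path from the root to a leaf is fully determined by these comparisons.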

Decision trees, although intuitive, are called weak predictors because they respond sensitively to small changes in the data or parameters: a slightly different training set can produce a tree with very different decisions. This problem can be mitigated by combining many decision trees that differ from one another at each node in both the chosen variables and the benchmark values.
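A toy sketch of this combination, with each "tree" reduced to a single decision node whose benchmark is randomly jittered (a real forest varies the chosen variables as well), shows how a majority vote over many differing trees yields a stable prediction:

```python
import random
from collections import Counter

random.seed(0)

def make_tree(benchmark):
    # Each "tree" here is just one decision node with its own benchmark;
    # a real Random Forest also varies which variable each node uses.
    return lambda x: "class A" if x <= benchmark else "class B"

# Many slightly different trees: benchmarks jittered around 1.5.
forest = [make_tree(1.5 + random.gauss(0, 0.2)) for _ in range(101)]

def forest_predict(x):
    # Majority vote across all trees in the forest.
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

print(forest_predict(1.0))  # → class A, despite the jittered benchmarks
```

Although individual trees disagree near the decision boundary, the vote smooths out their sensitivity to small perturbations.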

The idea that a combination of weak predictors can lead to a strong prediction can be compared to a competition sometimes held at county fairs. Visitors to the fair, who likely have limited agricultural knowledge, try to estimate the weight of a pig. Although most predictions will be off, the mean of all predictions (surprisingly) will be very close to the real weight of the pig. This is the basic concept of Random Forest.
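The pig-weighing intuition can be checked numerically. Assuming a hypothetical true weight and modeling each visitor's guess as that weight plus substantial random error, the mean of many guesses lands far closer to the truth than a typical individual guess:

```python
import random
from statistics import mean

random.seed(42)

TRUE_WEIGHT = 250.0  # hypothetical pig weight in kg

# Each visitor's guess is off by a large random error (std. dev. 40 kg).
guesses = [TRUE_WEIGHT + random.gauss(0, 40) for _ in range(1000)]

crowd_estimate = mean(guesses)
print(round(crowd_estimate, 1))  # close to 250, unlike most single guesses
```

The averaging cancels out the independent errors of the individual guesses, which is the same mechanism that makes an ensemble of weak trees a strong predictor.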