by Jonathan Widarsa

Trees Can Predict?

·

Most statistical models begin with some assumption about the world. In linear regression, we assume the relationship between inputs and output is a weighted sum, while in logistic regression, we assume the log-odds of a class is linear in the features. However, while they do give us interpretable coefficients and elegant closed-form solutions, they’re still assumptions. And reality often doesn’t cooperate.

Decision trees take a different philosophical stance. Rather than imposing a functional form on the data, they ask a simpler question at every step: what is the best way to split this data into two groups such that each group becomes more homogeneous? As we repeat this question recursively, we end up with a tree. In other words, decision trees let data determine the structure, rather than constraining the structure upfront (i.e., non-parametric).

***

The core mechanic of a decision tree is the recursive binary splitting. Basically, starting from the full dataset, the algorithm selects a feature xjx_j and a threshold ss, splitting the input space into two half-planes:

R1(j,s)=x|xj<sandR2(j,s)=x|xjs.R_1(j, s) = { x \mid x_j < s } \qquad \text{and} \qquad R_2(j, s) = { x \mid x_j \geq s }.

Within each region, we simply make a single prediction (i.e., a constant), and the split is chosen by minimizing some loss function \mathcal{L} that measures how well the predictions within each region describe the data. Formally, at each node, we solve

minj,,s[(R1(j,s))+(R2(j,s))].\min_{j,, s} \left[ \mathcal{L}\!\left(R_1(j, s)\right) + \mathcal{L}\!\left(R_2(j, s)\right) \right].

Next, once a region is split, each child region becomes the new starting point and the same procedure applies, thereby making it recursive. The algorithm continues until some stopping criterion is met (e.g., minimum node size, maximum depth, improvement threshold), and the result is a partition of the feature space into rectangular regions, with each region assigned a single predicted value.

Intuitively, this is all the tree does: split, predict, repeat. There are no parameters being estimated globally, no matrix inversions, no distributional assumptions; everything is local!

Regression v.s. Classification

Decision trees see much use in both regression and classification tasks. Fortunately, the tree structure (i.e., how splits are formed and how predictions are made) is identical for both. What distinguishes the two settings is the objective function \mathcal{L} used to evaluate a split.

Regression Trees: Variance Reduction

In a regression setting, we have a continuous response yy \in \mathbb{R} and nmn_m observations in region RmR_m. This way, the natural prediction within a region is the mean of the observed responses, i.e.,

y^m=1nmiRmyi.\hat{y}_m = \frac{1}{n_m} \sum_{i \in R_m} y_i.

But why the mean? Because it minimizes the residual sum of squares (RSS) within the region:

(Rm)=iRm(yiy^m)2.\mathcal{L}(R_m) = \sum_{i \in R_m} (y_i – \hat{y}_m)^2.

Note that this is precisely the within-region variance (up to a constant factor of nmn_m), meaning choosing the best split is equivalent to maximizing variance reduction: we want the observations in each child node to be as tightly clustered around their respective means as possible.

Therefore, the total loss we minimize across both children is simply

minj,,s[iR1(yiy1)2+iR2(yiy2)2],\min_{j,, s} \left[ \sum_{i \in R_1} (y_i – \bar{y}_1)^2 + \sum_{i \in R_2} (y_i – \bar{y}_2)^2 \right],

where y1\bar{y}_1 and y2\bar{y}_2 are the means in each child region. Interpretability-wise, the reduction from the parent’s RSS to the sum of children’s RSS tells us how much the split “helped.”

Classification Trees: Impurity and Entropy

Now, in classification, the response y1,2,,Ky \in {1, 2, \dots, K} is categorical. First, within a region RmR_m, we define the class proportion for class kk as

p^mk=1nmiRm𝟏(yi=k),\hat{p}_{mk} = \frac{1}{n_m} \sum_{i \in R_m} \mathbf{1}(y_i = k),

where 𝟏(yi=k)\mathbf{1}(y_i = k) is just an indicator that equals 1 if observation ii belongs to class kk, and 0 otherwise. The predicted class is simply

k^m=argmaxkp^mk,\hat{k}_m = \arg\max_k \hat{p}_{mk},

i.e., the majority class. Here, the analogue of variance is impurity, which is a measure of how mixed the class labels are within a node. The most intuitive impurity measure is called the misclassification error, defined as

err(Rm)=1maxkp^mk.\mathcal{L}_{\text{err}}(R_m) = 1 – \max_k \hat{p}_{mk}.

As you can tell, it’s just the fraction of observations that would be misclassified if we predicted the majority class. While natural, it’s actually a poor criterion for growing trees because it isn’t sensitive to changes in class proportions that don’t change the majority class.

Instead, we measure Gini impurity:

Gini(Rm)=k=1Kp^mk(1p^mk)=1k=1Kp^mk2,\mathcal{L}_{\text{Gini}}(R_m) = \sum_{k=1}^{K} \hat{p}_{mk}(1 – \hat{p}_{mk}) = 1 – \sum_{k=1}^{K} \hat{p}_{mk}^2,

which measures the probability that two randomly drawn observations from the node belong to different classes. It’s observable that a pure node (i.e., all one class) gives Gini=0\mathcal{L}_{\text{Gini}} = 0, while a maximally mixed node gives Gini=11/K\mathcal{L}_{\text{Gini}} = 1 – 1/K. Unlike misclassification error, it’s sensitive to the full distribution of class probabilities.

Another common choice also preferred to misclassification error is cross-entropy (also called deviance or log-loss):

CE(Rm)=k=1Kp^mklogp^mk,\mathcal{L}_{\text{CE}}(R_m) = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk},

which is more rooted in information theory. The idea is that it measures the expected number of bits needed to encode a class label drawn from the node’s distribution, whereby a pure node has entropy zero; a maximally mixed binary node has entropy one (bit).

The reduction in entropy between the parent node and the weighted sum of children’s entropies after a split is called the information gain — which is precisely the mutual information between the splitting feature and the class label, given the current node. This connection gives cross-entropy a principled probabilistic interpretation beyond the purely algorithmic.

Growing the Tree

Now that we understand the splitting criterion, how does the tree actually grow? Greedily. Basically, at each node, the algorithm exhaustively searches all features xjx_j and all candidate thresholds ss, selecting the pair that minimizes the combined loss of the two child nodes. It then recurses on each child independently.

This is essentially a greedy algorithm: it makes the locally optimal decision at each node without considering how that decision affects future splits. It doesn’t backtrack and globally optimize over all possible trees.

Consequently, the tree structure is highly sensitive to the data, where small changes in the training set can produce a completely different sequence of splits — a property that reflects both its flexibility and instability.

Overfitting and Pruning

Unsurprisingly, as a consequence of the classic bias-variance tradeoff, a fully grown tree (i.e., one that continues splitting until every leaf contains a single observation) will fit the training data perfectly but generalize poorly. A deeper tree has lower bias but higher variance.

A common solution is cost-complexity pruning (weakest-link pruning). Rather than stopping growth early by some ad hoc rule, we first grow a large tree T0T_0 and then consider all subtrees obtained by collapsing internal nodes. For a subtree TT with |T||T| leaves, we define the cost-complexity criterion as

α(T)=m=1|T|(Rm)+α|T|,\mathcal{L}_\alpha(T) = \sum_{m=1}^{|T|} \mathcal{L}(R_m) + \alpha |T|,

where α0\alpha \geq 0 is a regularization parameter that penalizes tree size. When α=0\alpha = 0, we recover the original tree, and as α\alpha increases, the penalty for complexity grows and the optimal subtree becomes progressively smaller. Typically, the value of α\alpha is chosen by cross-validation, selecting the model that balances training fit against complexity.

This framing is deliberate: pruning is not a heuristic but a principled penalty method, analogous to Ridge or LASSO regularization in linear models.

Strengths and Limitations

Below we take a look at several compelling reasons why we’d want to deploy a decision tree for modeling:

  • They are highly interpretable, in that a tree of moderate depth can be drawn and read by a non-statistician.
  • They handle both continuous and categorical features naturally, require no feature scaling, and are robust to monotone transformations of inputs.
  • They implicitly perform feature selection and can model complex interactions and nonlinear boundaries without explicit specification.

At the same time, as we’ve hinted throughout the text, it suffers from several notable limitations as well:

  • A single tree has high variance, meaning small changes in training data can radically alter the fitted tree, making individual trees unreliable.
  • Even with pruning, they remain prone to overfitting in high-dimensional settings.

However, as I’ve always mentioned in my articles, these limitations are good news. They simply serve as motivation for ensemble methods such as bagging, random forests, and gradient boosting, which address the variance problem by aggregating many trees. Of course, these are topics for another post.

***

I hope that you can see how much potential this classical machine learning model has in both regression and classification tasks. Even today, they actually remain one of the most widely used methods, and they still work really well for many use cases. Granted, maybe not a single decision tree — but just wait until we go over ensemble methods, where we combine multiple models (typically trees) to engineer a superior predictor.


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *


More Posts