Using Tree Leaf Indexes as Neural Network Embeddings

Exploring how tree-leaf IDs can be embedded to inject tree-learned structure into neural networks.

11/30/2025 · 3 min read



Tree-based models are incredibly good at uncovering non-linear interactions, but neural networks don’t get those insights for free. What if we could hand those interactions directly to a neural network instead of asking it to rediscover them from scratch?

When a tree ensemble makes a prediction, each sample is funneled into exactly one leaf per tree - effectively assigning it a categorical leaf ID for each tree. In a depth-5 CatBoost tree, that's up to 2^5 = 32 possible leaves, and with 100 trees you end up with 100 separate leaf-index features, each an integer between 0 and 31.
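
As a concrete illustration, here's a minimal sketch of pulling those leaf IDs out of a fitted model with CatBoost's calc_leaf_indexes method; the toy dataset and parameter choices are placeholders, not recommendations:

```python
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import make_classification

# Toy data standing in for a real tabular dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Depth-5 trees -> at most 2**5 = 32 leaves per tree; 100 trees -> 100 leaf-ID columns.
model = CatBoostClassifier(iterations=100, depth=5, verbose=False, random_seed=0)
model.fit(X, y)

# calc_leaf_indexes returns one leaf index per (sample, tree):
# shape (n_samples, n_trees), each value in [0, 2**depth).
leaf_ids = model.calc_leaf_indexes(Pool(X))
print(leaf_ids.shape)        # (2000, 100)
print(leaf_ids.max() < 32)   # True for depth-5 symmetric trees
```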

Treat those leaf IDs as categorical features, embed them, and suddenly you have a compact, high-level representation of everything the trees have already learned - ready to feed into a neural network. It’s a simple idea, but potentially a powerful way to inject rich structure into an NN from the very beginning.
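
One way to wire this up is a small PyTorch module that gives each tree its own slice of a shared embedding table, concatenates the per-tree vectors, and passes them to an MLP head. This is only a sketch of the idea, not a tuned architecture; the layer sizes and embedding dimension are arbitrary:

```python
import torch
import torch.nn as nn

class LeafEmbeddingNet(nn.Module):
    """Embeds per-tree leaf IDs and feeds the concatenated vectors to an MLP."""

    def __init__(self, n_trees=100, n_leaves=32, emb_dim=4, hidden=64):
        super().__init__()
        # One shared table; tree t owns rows [t * n_leaves, (t + 1) * n_leaves).
        self.embedding = nn.Embedding(n_trees * n_leaves, emb_dim)
        self.register_buffer("offsets", torch.arange(n_trees) * n_leaves)
        self.mlp = nn.Sequential(
            nn.Linear(n_trees * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, leaf_ids):
        # leaf_ids: (batch, n_trees) integer tensor of per-tree leaf indexes.
        x = self.embedding(leaf_ids + self.offsets)   # (batch, n_trees, emb_dim)
        return self.mlp(x.flatten(1))                 # (batch, 1) logits

# Usage with the leaf_ids array from the previous sketch:
# net = LeafEmbeddingNet()
# logits = net(torch.as_tensor(leaf_ids, dtype=torch.long))
```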

Why Leaf Indexes Are Useful

Leaf IDs capture the implicit feature interactions that tree models naturally construct. A path from the root to a leaf might represent:

  • “feature_3 < 12, feature_7 > 0.8, feature_2 in [A, C]”

  • or some other multi-step non-linear split.

The beauty is that we don't need to decode the logic behind the leaf - the leaf ID itself stands in for that entire interaction pattern. By embedding these IDs, the neural network gets a dense vector representing these interactions without having to engineer anything manually.

Generating Leaf Features

There are several ways to generate these leaf features:

  • Training leaf indexes - Fast and uses the full training set, but the leaf IDs are in-sample, which can lead to overfitting.

  • Out-of-fold (OOF) leaf indexes - Split the data into A and B, train a tree-based model on A, and compute leaf IDs only for B. Since the model never saw B’s labels, these features are leakage-free, though you get fewer samples for training the neural network.

  • Hybrid (training + OOF) leaf indexes - Split into A and B, train on A, and compute leaf IDs for both A and B using the same model. Because all leaf IDs come from the same trees, they align perfectly and can be concatenated, giving the neural network more data while still mixing in some leakage-free signal. A sketch of this variant follows the list.
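
Here's a rough sketch of the hybrid variant (the pure OOF variant falls out of it by keeping only the B block); it assumes the same hypothetical CatBoost setup as above:

```python
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

def hybrid_leaf_features(X, y, depth=5, iterations=100, seed=0):
    """Train trees on split A, then compute leaf IDs for both splits with
    the same model so the IDs refer to the same trees and stay aligned."""
    X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, random_state=seed)

    model = CatBoostClassifier(iterations=iterations, depth=depth,
                               verbose=False, random_seed=seed)
    model.fit(X_a, y_a)

    leaves_a = model.calc_leaf_indexes(Pool(X_a))   # in-sample (some leakage risk)
    leaves_b = model.calc_leaf_indexes(Pool(X_b))   # out-of-fold (leakage-free)

    # Pure OOF variant: return only leaves_b / y_b.
    leaves = np.vstack([leaves_a, leaves_b])
    targets = np.concatenate([y_a, y_b])
    return leaves, targets

# leaves, targets = hybrid_leaf_features(X, y)  # X, y as in the first sketch
```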

Extending the Idea: Multiple Tree Models

Once the pipeline is set up, nothing stops you from expanding the feature space:

  • Train multiple CatBoost models with different depths

  • Try different random seeds

  • Use different subsets of features

  • Mix in LightGBM or XGBoost models

  • Combine leaf indexes from all of them

Each tree model contributes its own “block” of categorical leaf features. Concatenate all blocks, embed them, and feed everything into the neural network.
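
A minimal sketch of that concatenation step might look like this; the configs, depths, and seeds are placeholders, and LightGBM or XGBoost blocks would come from their own leaf-index APIs (e.g. predict(..., pred_leaf=True) or apply) and be stacked the same way:

```python
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import make_classification

def leaf_blocks(X, y, configs):
    """Train one CatBoost model per config and concatenate their leaf-ID blocks."""
    blocks, cardinalities = [], []
    for cfg in configs:
        model = CatBoostClassifier(verbose=False, **cfg)
        model.fit(X, y)
        ids = model.calc_leaf_indexes(Pool(X))   # (n_samples, n_trees)
        blocks.append(ids)
        # Max leaves per tree for symmetric (oblivious) trees of this depth;
        # the embedding layers need these cardinalities to size their tables.
        cardinalities += [2 ** cfg["depth"]] * ids.shape[1]
    return np.hstack(blocks), cardinalities

# Placeholder configs: vary depth and seed; different feature subsets
# or other tree libraries could contribute further blocks.
configs = [
    {"iterations": 100, "depth": 4, "random_seed": 1},
    {"iterations": 100, "depth": 6, "random_seed": 2},
]
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
leaf_matrix, cardinalities = leaf_blocks(X, y, configs)
print(leaf_matrix.shape)   # (2000, 200) - one column per tree per model
```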

This approach gives the NN a diverse set of high-level, learned representations - almost like having multiple specialists extract different kinds of structure before the neural network sees anything.

Why Bother?

Neural networks often need a lot of data - or increasingly elaborate architectures - to learn strong feature interactions. Tree models, on the other hand, discover these interactions naturally. Turning leaf IDs into embeddings is a lightweight way to let a neural network start from the structure the trees have already uncovered.

If the idea works as hoped, it might offer:

  • Slightly faster convergence

  • More informative initial representations

  • Small gains on certain tabular problems

  • A simple path toward hybrid architectures

It probably won’t deliver dramatic improvements - and a neural network trained only on leaf-index features is very unlikely to beat a well-tuned tree model. But combining leaf indexes from multiple ensembles with the original features, and letting a neural network learn from the entire mix, could provide modest but meaningful benefits in the right situations.

Closing Thoughts

Any improvements from this technique will likely be modest, but it does offer a simple way to let models benefit from the structure that trees naturally learn. And it doesn’t have to be a neural network at all: you can feed the leaf-index features into another tree model and still get some of the same benefits.

I put together a small Kaggle example demonstrating that idea here: https://www.kaggle.com/code/ern711/tree-leaf-indexes-as-features

In a follow-up article, I’ll take this idea further and explore it inside a neural network architecture that actually resembles a boosting process - a concept I call Boosted LeafNet. It’ll be a chance to experiment with how these leaf-based features behave when integrated step-by-step into a boosting-style neural model.