Data Science

Grandmaster Pro Tip: Winning First Place in a Kaggle Competition with Stacking Using cuML

What does it take to win a Kaggle competition in 2025? In the April Playground challenge, the goal was to predict how long users would listen to a podcast—and the top solution wasn’t just accurate, it was fast. In this post, Kaggle Grandmaster Chris Deotte will break down the exact stacking strategy that powered his first-place finish using GPU-accelerated modeling with cuML. You’ll learn a powerful pro tip: how to explore hundreds of diverse models quickly and combine them into a layered ensemble that outperforms the rest.

Overview of stacking

Stacking is an advanced tabular data modeling technique that achieves high performance by combining the predictions of multiple diverse models. Leveraging the computational speed of GPUs, a large number of models can be efficiently trained. These include gradient boosted decision trees (GBDT), deep learning neural networks (NN), and other machine learning (ML) models such as support vector regression (SVR) and k-nearest neighbors (KNN). These individual models are referred to as Level 1 models.

Level 2 models are then trained, which use the output of Level 1 models as their input. The Level 2 models learn to use different combinations of Level 1 models to predict the target in different scenarios. Finally, simple Level 3 models are trained, which average the Level 2 models’ outputs. The result of this process is a three-level stack.

Figure 1 depicts 12 Level 1 models. The final solution uses 75 Level 1 models chosen from 500 experimental models. The secret to building a strong stack is to explore many Level 1 models, so it is crucial to train and evaluate models as fast as possible: GBDT on GPU, NN on GPU, and classical ML on GPU with cuML.

Figure 1. The winning entry in the Kaggle April 2025 Playground competition used stacking with three levels of models, with the results of each level used in subsequent levels
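
Classical ML models such as SVR and KNN train directly on GPU with cuML. The following is a minimal sketch, not the winning code; it assumes train and valid cuDF DataFrames, a list FEATURES of already-imputed numeric feature columns, and the competition target column.

# Minimal sketch: train two diverse Level 1 models on GPU with cuML.
# Assumes `train` and `valid` are cuDF DataFrames and FEATURES is a list of
# already-imputed numeric feature columns.
import numpy as np
import cuml
from cuml.svm import SVR
from cuml.neighbors import KNeighborsRegressor

cuml.set_global_output_type('numpy')  # return predictions as NumPy arrays

TARGET = 'Listening_Time_minutes'
level1_models = {
    'svr': SVR(C=1.0),
    'knn': KNeighborsRegressor(n_neighbors=50),
}

y_valid = valid[TARGET].to_numpy()
for name, model in level1_models.items():
    model.fit(train[FEATURES], train[TARGET])   # trains on GPU
    preds = model.predict(valid[FEATURES])
    rmse = float(np.sqrt(np.mean((y_valid - preds) ** 2)))
    print(f'{name} validation RMSE: {rmse:.4f}')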

Diverse approaches to predict target 

Diverse models are built by solving a problem in different ways. Different types of models can be used as well as different architectures and hyperparameters. Different preprocessing methods and feature engineering can also be used. 

In the April 2025 Kaggle Playground competition, there were at least four different ways to predict the target, Podcast Listening Time. The target approximately follows the linear relationship Listening_Time_minutes ≈ 0.72 × Episode_Length_minutes. For more details, see the discussion posts Strong Feature Interaction Exists, Direct versus Indirect – Relationship with Target, and Strong Correlation Between Features and Target. The other nine features modulate this linear relationship.

Figure 2. Scatter plot showing the strong linear relationship between the feature Episode_Length_minutes and the target Listening_Time_minutes; the residual is the vertical distance between each point and the regression line

Based on the insight depicted in Figure 2, you can predict the target in at least the following four ways:

  • Predict TARGET as is
  • Predict RATIO of the target divided by Episode_Length_minutes
  • Predict RESIDUAL from the linear relationship
  • Predict MISSING Episode_Length_minutes

Each of these four approaches has two cases (a quick check is shown after the list):

  • 88% of rows have Episode_Length_minutes
  • 12% of rows are missing Episode_Length_minutes
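
You can confirm this split directly in cuDF and handle the two cases separately. This is a small illustrative check; the file name is a placeholder for the competition data.

import cudf

train = cudf.read_csv('train.csv')  # placeholder path for the competition data
has_len = train['Episode_Length_minutes'].notna()

print('rows with Episode_Length_minutes:   ', float(has_len.mean()))    # roughly 0.88
print('rows missing Episode_Length_minutes:', float((~has_len).mean())) # roughly 0.12

# The two cases can then be handled separately
train_with_len = train[has_len]
train_missing_len = train[~has_len]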

Predict target

The common way to build a model is to use the given target. From the 10 given feature columns, you can engineer additional feature columns, then train your model using all these features to predict the given target: 

# Train a regressor on the given target using all engineered features
model = Regressor().fit(train[FEATURES], train['Listening_Time_minutes'])
# In practice, generate out-of-fold predictions with K-Fold to avoid leakage
PREDICTION = model.predict(train[FEATURES])

Using this approach alone won first place in the February 2025 Kaggle Playground competition. For more details, see Grandmaster Pro Tip: Winning First Place in Kaggle Competition with Feature Engineering Using cuDF pandas.

Predict ratio

An alternative to predicting the given target is to predict the ratio between target and feature Episode_Length_minutes:

train['new_target'] = train.Listening_Time_minutes / train.Episode_Length_minutes

You can train models to predict this new target. Then multiply this prediction by Episode_Length_minutes or an imputed value of Episode_Length_minutes.
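
The following sketch shows the full round trip using a generic cuML regressor; it assumes FEATURES is a list of already-imputed numeric columns, and mean imputation of missing episode lengths is only a stand-in for the model-based imputation described later.

import cuml
from cuml.ensemble import RandomForestRegressor  # any cuML regressor works here

cuml.set_global_output_type('numpy')  # return predictions as NumPy arrays

# Train only on rows where the ratio target is defined
has_ratio = train['new_target'].notna()
fit_df = train[has_ratio]
ratio_model = RandomForestRegressor(n_estimators=100)
ratio_model.fit(fit_df[FEATURES], fit_df['new_target'])

# Convert the predicted ratio back to listening time. Mean imputation is a
# simple stand-in; the winning solution predicted the missing feature instead.
episode_len = test['Episode_Length_minutes'].fillna(
    train['Episode_Length_minutes'].mean()).to_numpy()
pred_listening_time = ratio_model.predict(test[FEATURES]) * episode_len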

Predict residual

Another alternative to predicting the given target is to predict the residual between the target and a linear regression model. First, train a linear regression model and generate its predictions. Then create a new target by subtracting the linear regression predictions from the existing target:

# Fit a simple linear regression on the features
model = LinearRegressor().fit(train[FEATURES], train['Listening_Time_minutes'])
LINEAR_REGRESSION_PREDS = model.predict(train[FEATURES])
# The new target is the residual: actual target minus the linear prediction
train['new_target'] = train.Listening_Time_minutes - LINEAR_REGRESSION_PREDS

You can train models to predict this new target. Then add this prediction to your linear regression prediction.
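
A short sketch of the same round trip for the residual target, reusing the linear regression model from above; the KNN is only an example of a residual model, and FEATURES is again assumed to be already-imputed numeric columns.

from cuml.neighbors import KNeighborsRegressor  # any cuML regressor works here

# Train a model on the residual target defined above
residual_model = KNeighborsRegressor(n_neighbors=100)
residual_model.fit(train[FEATURES], train['new_target'])

# Add the predicted residual back onto the linear regression baseline
LINEAR_PREDS_TEST = residual_baseline = model.predict(test[FEATURES])  # linear model from above
RESIDUAL_PREDS = residual_model.predict(test[FEATURES])
PREDICTION = LINEAR_PREDS_TEST + RESIDUAL_PREDS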

Predict missing features

Episode_Length_minutes is the most important feature, but it is missing in 12% of training rows. You can build a separate model to predict this feature. Furthermore, you can train that model on all rows of both train and test data where Episode_Length_minutes is present:

# FEATURES here excludes Episode_Length_minutes; use only rows where it is present
combined = cudf.concat([train, test], axis=0)
combined = combined[combined['Episode_Length_minutes'].notna()]
model = Regressor().fit(combined[FEATURES], combined['Episode_Length_minutes'])
Episode_Length_minutes_IMPUTED = model.predict(train[FEATURES])

After predicting Episode_Length_minutes, you can use it to predict the actual target in at least three ways (the first option is sketched after the list):

  • Impute missing values with the predicted Episode_Length_minutes, then train a model
  • Replace the entire Episode_Length_minutes column with the predicted values, then train a model
  • Multiply the predicted Episode_Length_minutes by the predicted ratio
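
A sketch of the first option: store the model's imputation as a new column and fill only the missing values, then train on the actual target as usual. Regressor() is the same placeholder used in the snippets above.

# Store the model's imputation as a new column (aligned positionally with train)
train['Episode_Length_minutes_pred'] = Episode_Length_minutes_IMPUTED

# Option 1: fill only the missing values, keeping observed values unchanged
train['Episode_Length_minutes'] = train['Episode_Length_minutes'].fillna(
    train['Episode_Length_minutes_pred'])

# Then train on the actual target as usual
model = Regressor().fit(train[FEATURES], train['Listening_Time_minutes'])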

Predict pseudo labels

The preceding section shows one way to use the feature columns of test data to help impute missing values in train data. This is a powerful technique that extracts information from test columns, and it is useful in the real world when dealing with additional unlabeled data. Another way to use test columns is pseudo labeling.

Pseudo labeling is a three-step process: 

  • Train a model as usual 
  • Infer labels for test data
  • Concatenate the pseudo-labeled test data with the train data, then train another model

The following code is a simple example. Note that if you use K-Fold or Nested K-Fold, you can make this generalize better and avoid leakage.

model1 = Regressor().fit(train[FEATURES], train['Listening_Time_minutes'])
test['Listening_Time_minutes'] = model1.predict(test[FEATURES])
combine = cudf.concat([train, test], axis=0)
model2 = Regressor().fit(combine[FEATURES], combine['Listening_Time_minutes'])

Building the stack

After building hundreds of diverse models using GBDT, NN, ML and using the variety of previously mentioned techniques, the next step is to build the stack. Using forward feature selection, the out-of-fold (OOF) predictions from the Level 1 models are added as features for the Level 2 models. Additional features from the original dataset may also be included, along with engineered features derived from OOF predictions, such as model confidence or average prediction:

df['confidence'] = df[OOF].std(axis=1)   # spread (disagreement) across Level 1 OOF predictions
df['consensus'] = df[OOF].mean(axis=1)   # average Level 1 OOF prediction
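
The forward selection itself can be a simple greedy loop: start with no OOF columns, repeatedly add whichever candidate most improves a validation RMSE, and stop when nothing helps. The following sketch uses a fast cuML Ridge model as the scorer; df_train, df_valid, and OOF (the list of Level 1 OOF column names) are assumed to exist.

import numpy as np
import cuml
from cuml.linear_model import Ridge  # fast scorer for candidate feature subsets

cuml.set_global_output_type('numpy')
TARGET = 'Listening_Time_minutes'
y_valid = df_valid[TARGET].to_numpy()

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

selected, best_score = [], float('inf')
improved = True
while improved:
    improved = False
    for col in OOF:
        if col in selected:
            continue
        cols = selected + [col]
        m = Ridge(alpha=1.0).fit(df_train[cols], df_train[TARGET])
        score = rmse(y_valid, m.predict(df_valid[cols]))
        if score < best_score:   # keep the single best addition found this pass
            best_score, best_col, improved = score, col, True
    if improved:
        selected.append(best_col)

print('selected OOF columns:', selected, 'validation RMSE:', best_score)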

For variety, it is best to train multiple Level 2 models. A good choice is a GBDT Level 2 model and an NN Level 2 model. For Level 3, perform a weighted average of the Level 2 predictions. The result is the final prediction. 
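
Putting the top of the stack together might look like the following sketch. LEVEL2_FEATURES (the selected OOF columns plus any extra features), df_train, and df_test are assumed; a cuML KNN stands in for the neural network Level 2 model to keep the example short, and the 0.6/0.4 weights are placeholders to be tuned on validation data.

import cuml
from xgboost import XGBRegressor
from cuml.neighbors import KNeighborsRegressor

cuml.set_global_output_type('numpy')
TARGET = 'Listening_Time_minutes'

# Level 2: two diverse models trained on the Level 1 OOF features
gbdt_l2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)  # add device='cuda' on XGBoost 2.x for GPU training
gbdt_l2.fit(df_train[LEVEL2_FEATURES], df_train[TARGET])

knn_l2 = KNeighborsRegressor(n_neighbors=100)
knn_l2.fit(df_train[LEVEL2_FEATURES], df_train[TARGET])

# Level 3: a simple weighted average of the Level 2 predictions
pred_gbdt = gbdt_l2.predict(df_test[LEVEL2_FEATURES])
pred_knn = knn_l2.predict(df_test[LEVEL2_FEATURES])
final_prediction = 0.6 * pred_gbdt + 0.4 * pred_knn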

This stacking approach achieved a cross-validation RMSE of 11.54 and a private leaderboard RMSE of 11.44, which won first place in the April 2025 Kaggle Playground competition predicting podcast listening times.

Summary

cuML enables you to train classical ML models at GPU speed. You can now build GBDT on GPU, NN on GPU, and classical ML models on GPU with cuML. By creating many diverse models quickly, you can build advanced tabular data solutions like stacks.

Fast experimentation enables discovery of the most accurate solutions. This was demonstrated by the cuML stack achieving first place in the April 2025 Kaggle Playground competition predicting podcast listening times. For more details about the competition entry, see First Place – RAPIDS cuML Stack – 3 Levels.

To learn more tips on using cuML, check out the NVIDIA GTC 2025 workshop Feature Engineering or Bring Accelerated Computing to Data Science in Python, or enroll in the NVIDIA DLI Learning Path courses for data science.
