Training XGBoost Models with GPU-Accelerated Polars DataFrames

One of the many strengths of the PyData ecosystem is interoperability, which makes it easy to move data seamlessly between libraries that specialize in exploratory analysis, training, and inference. The latest release of XGBoost introduces exciting new capabilities, including a category re-coder and integration with Polars DataFrames, providing a streamlined approach to data handling.

This post guides you through how to leverage the Polars GPU engine with the XGBoost machine learning library. It highlights the seamless integration of categorical features, including the new category re-coder within XGBoost.

Using XGBoost with the Polars GPU engine

Polars is a high-performance DataFrame library written in Rust, offering a lazy evaluation model and GPU acceleration that can significantly optimize data processing workflows.

One of the key aspects of using Polars with XGBoost in a GPU-accelerated pipeline is understanding lazy evaluation. Polars operations are often lazy, meaning that they build a query plan but don't execute it unless explicitly directed to do so. To execute the query plan on a GPU, call the collect method of the LazyFrame and specify the engine="gpu" parameter.
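
For example, the following minimal sketch (using a toy frame, not the dataset from this post) builds a query lazily and only executes it, on the GPU, when collect is called:

import polars as pl

# Build a query plan; nothing executes yet
lf = pl.LazyFrame({"a": [1, 2, 3]}).with_columns((pl.col("a") * 2).alias("b"))
# Execute the plan on the GPU and materialize the result as a DataFrame
df = lf.collect(engine="gpu")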

This tutorial uses a small subset of the Microsoft Malware Prediction dataset for illustration. The dataset, available through Kaggle, has a moderate size with both numerical and categorical features. In the following snippets, only a few columns are selected to demonstrate the use of categorical features with XGBoost.

Setting up the environment 

Before diving into the code, ensure you have the following libraries installed: xgboost, polars[gpu], and pyarrow. The [gpu] extra installs the GPU-enabled version of Polars:

pip install xgboost polars[gpu] pyarrow

When consuming Polars inputs, XGBoost uses the zero-copy to_arrow method from Polars DataFrames. As a result, PyArrow is required to pass data between Polars and XGBoost. As this example shows, it’s also used as a data exchange format for exporting categories from XGBoost models.
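
As a quick illustration (with a toy frame), the Arrow export is a single method call:

import polars as pl

# Zero-copy export: the Arrow table shares buffers with the Polars DataFrame
df = pl.DataFrame({"f0": [1, 2, 3]})
table = df.to_arrow()  # returns a pyarrow.Table
print(table.schema)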

Data preparation and model training

First, import the necessary libraries:

import polars as pl
import xgboost as xgb

Three features will be used from the dataset, two of which are categorical. The HasDetections column is the prediction target with binary values. To utilize the Polars execution engine for optimal performance, create a LazyFrame object with scan_csv:

columns = [
    "ProductName", # Categorical
    "IsBeta",  # Boolean
    "Census_OSArchitecture", # Categorical
    "HasDetections",  # Binary target
]

# ignore_errors=True keeps the scan going when rows don't match the inferred schema
df_lazy = pl.scan_csv(
    "./microsoft-malware-prediction/train.csv",
    ignore_errors=True,
).select(columns)
# Cast the categorical features to the Polars Enum type
df_lazy = df_lazy.with_columns(
    [
        pl.col("ProductName").cast(
            pl.Enum(
                ["fep", "mseprerelease", "win8defender", "scep", "mse", "windowsintune"]
            )
        ),
        pl.col("Census_OSArchitecture").cast(pl.Enum(["amd64", "x86", "arm64"])),
    ]
)

After reading the data, train an XGBoost binary classification model by feeding the DataFrame into an XGBClassifier:

X = df_lazy.drop("HasDetections")
y = df_lazy.select("HasDetections")

# Use GPU to fit the classification model. Let XGBoost handle the
# categorical features by setting the `enable_categorical` parameter.
clf = xgb.XGBClassifier(device="cuda", enable_categorical=True)
# Validation data is not created for simplicity
clf.fit(X, y)

This snippet uses the GPU only for model training; the DataFrame loading and processing still run on the CPU. When fit is called, XGBoost issues a warning recommending that the lazy frame be converted into a concrete DataFrame for optimal performance. To do so, and to choose the execution engine on a per-frame basis, use the following snippet:

# Convert the lazy frame into a concrete dataframe using a GPU
df = df_lazy.collect(engine="gpu")

X = df.drop("HasDetections")
y = df.select("HasDetections")

clf.fit(X, y)

Alternatively, to enable global GPU acceleration for Polars in addition to model training, set the engine affinity to GPU:

import polars as pl
import xgboost as xgb

# set the engine before using polars
pl.Config.set_engine_affinity("gpu")
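
With the affinity set, subsequent collect calls default to the GPU engine, so the per-frame engine="gpu" argument is no longer needed. For example, re-running the earlier collection:

# collect() now uses the GPU engine by default
df = df_lazy.collect()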

Automatically re-code categorical data with XGBoost

The latest XGBoost release significantly enhances its handling of categorical features with the introduction of the re-coder. Polars encodes categorical and enum data types into integers based on the ordering of input values. For example, given three categories ["aa", "bb", "cc"], Polars might store the data as follows:

Values  Encoding  Categories
"cc"    2         "aa"
"cc"    2         "bb"
"bb"    1         "cc"
"aa"    0

Table 1. Example encoding of categories

The scheme is shared among DataFrame implementations, including pandas and cuDF. For an in-depth explanation of the categorical type and the enum type in Polars, refer to the Polars documentation.
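
To inspect this encoding directly, cast a column to the Enum type and look at its physical representation. The following minimal sketch (a toy series, not part of the original example) reproduces Table 1:

import polars as pl

# Integer codes follow the order of the Enum categories
s = pl.Series(["cc", "cc", "bb", "aa"], dtype=pl.Enum(["aa", "bb", "cc"]))
print(s.to_physical())  # [2, 2, 1, 0]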

In prior XGBoost versions, users had to manually re-code categorical features before inference. Aside from being error-prone, manual re-coding can be challenging and inefficient.

For example, given a feature with three categories in the training dataset, ["aa", "bb", "cc"], Polars would encode them into numerical values [0, 1, 2] with a mapping {"aa": 0, "bb": 1, "cc": 2}. However, during inference, the test dataset might contain only a subset of categories: ["bb", "cc"], which would be encoded as [0, 1] with a mapping {"bb": 0, "cc": 1}, resulting in an invalid test-time encoding.
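
The mismatch can be observed directly with the default Categorical type, whose codes are assigned per column in order of appearance (a minimal sketch; exact codes depend on Polars' string-cache settings):

import polars as pl

train = pl.Series(["aa", "bb", "cc"], dtype=pl.Categorical)
test = pl.Series(["bb", "cc"], dtype=pl.Categorical)
print(train.to_physical())  # [0, 1, 2]
print(test.to_physical())   # [0, 1] -- "bb" maps to 0 instead of 1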

With the latest XGBoost, the booster object can remember the encoding from the training dataset, and use it in the predict method to re-code the categories automatically. The following example demonstrates how to use this with a synthetic dataset containing categorical features:

import numpy as np
import polars as pl
import xgboost as xgb

# Create a dataframe with a categorical feature (f1)
f0 = [1, 3, 2, 4, 4]
cats = ["aa", "cc", "bb", "ee", "ee"]

df = pl.DataFrame(
    {"f0": f0, "f1": cats},
    schema=[("f0", pl.Int64()), ("f1", pl.Categorical(ordering="lexical"))],
)
rng = np.random.default_rng(2025)
y = rng.normal(size=(df.shape[0]))

# Train a regression model
reg = xgb.XGBRegressor(enable_categorical=True, device="cuda")
reg.fit(df, y)
predt_0 = reg.predict(df)

# Use a subset of rows to create a different encoding; "aa" and "ee" are removed
df_new = pl.DataFrame(
    {"f0": f0[1:3], "f1": cats[1:3]},
    schema=[("f0", pl.Int64()), ("f1", pl.Categorical(ordering="lexical"))],
)
predt_1 = reg.predict(df_new)

# Check the resulting predictions are the same with the original encoding
np.testing.assert_allclose(predt_0[1:3], predt_1)

In this snippet, a test DataFrame is created with a subset of categories from the training DataFrame. The assertion verifies that the output predictions remain the same despite the different encoding schemes. This feature removes the need for a separate transformation pipeline.

In addition, the re-coder inside XGBoost is more efficient than re-coding with the DataFrame directly when dealing with a large number of features. Internally, the re-coding is performed in-place and on-the-fly. XGBoost can handle all categorical columns simultaneously using a GPU without copying the DataFrame.

Exporting the categories (experimental)

As previously mentioned, the XGBoost model can now remember the categories. Advanced users can export the saved categories to a list of arrow arrays by accessing the underlying booster object from the high-level model. This can be useful for verifying whether the model is trained with the expected categories. Continuing with the example:

# Get the underlying booster object
booster = reg.get_booster()
# Export the categories from the booster
categories = booster.get_categories(export_to_arrow=True)
# Convert the result to a list of (feature name, arrow array) pairs
print(categories.to_arrow())

The export_to_arrow option is required to exchange data with PyArrow:

[('f0', None), ('f1', <pyarrow.lib.StringArray object at 0x735ba5407be0>
[
  "aa",
  "cc",
  "bb",
  "ee"
])]

The complete list of categories in f1 is stored in the booster. The interface for exporting categories is experimental as of XGBoost 3.1. Note that the examples in this post use the scikit-learn interface, as it handles most configurations automatically. Using the native interface (the booster) can be more involved, especially when working with training continuation. For more details, see the XGBoost documentation.

GPU acceleration for Polars is currently limited to query execution: the resulting DataFrame is still stored in CPU memory. As a result, XGBoost needs to copy the data back to the GPU during inference, and users will see a one-time warning about the performance impact.

Get started with Polars and XGBoost

You can build highly efficient and robust GPU-accelerated pipelines by understanding how to materialize lazy Polars DataFrames and effectively utilize the new XGBoost categorical feature handling, including its re-coder. This approach not only streamlines your workflow, but also unlocks new performance levels for your machine learning models. 

For more information, see GPU Acceleration with Polars and NVIDIA RAPIDS. You can also provide feedback or ask questions about training XGBoost on the dmlc/xgboost GitHub repo. 
