Categorical Features in XGBoost Without Manual Encoding

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. Until recently, however, it didn’t natively support categorical data: categorical features had to be manually encoded before they could be used for training or inference.

In the case of ordinal categories, for example school grades, this is often done with label encoding, where each category is assigned an integer corresponding to its position in the ordering. The grades A, B, and C could be assigned the integers 1, 2, and 3, respectively.
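In pandas, one lightweight way to do this is with an ordered categorical dtype, which exposes integer codes directly. The snippet below is only a sketch with a hypothetical grades series; note that pandas assigns 0-based codes, so A, B, and C map to 0, 1, and 2 here:

>>> import pandas as pd
>>> grades = pd.Series(["B", "A", "C", "A"]).astype(
...     pd.CategoricalDtype(categories=["A", "B", "C"], ordered=True)
... )
>>> print(grades.cat.codes)
0    1
1    0
2    2
3    0
dtype: int8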

In the case of nominal categories, where there’s no ordinal relationship between the categories, for example with colors, this is often done using one-hot encoding. This creates a new binary feature for every category that a categorical feature contains. A single categorical feature with the categories red, green, and blue would be one-hot encoded into three binary features, one to represent each of the colors.

>>> import pandas as pd
>>> df = pd.DataFrame({"id":[1,2,3,4,5],"color":["red","green","blue","green","blue"]})
>>> print(df)
   id  color
0   1    red
1   2  green
2   3   blue
3   4  green
4   5   blue

>>> print(pd.get_dummies(df))
   id  color_blue  color_green  color_red
0   1           0            0          1
1   2           0            1          0
2   3           1            0          0
3   4           0            1          0
4   5           1            0          0

This means that categorical features with a large number of categories can result in dozens or even hundreds of extra features. As a result, it’s common to run into both memory pool and maximum DataFrame size limitations.
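To make the blow-up concrete, here is a hypothetical example: one-hot encoding a single column containing 10,000 distinct ZIP codes turns one column into 10,000 columns.

>>> zips = pd.DataFrame({"zip_code": [f"{i:05d}" for i in range(10_000)]})
>>> print(pd.get_dummies(zips).shape)
(10000, 10000)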

It’s also an especially poor fit for tree learners like XGBoost. Decision trees are trained by searching, over all features and their possible values, for the split point that yields the greatest increase in purity.

As one-hot encoded categorical features with many categories tend to be sparse, the splitting algorithm often ignores the one-hot features in favor of less sparse features that can contribute a larger gain in purity.
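The effect is easy to see with a toy purity calculation. The sketch below uses plain Gini impurity rather than XGBoost’s actual gain formula, and the labels and feature columns are made up; the point is simply that splitting on a mostly-zero one-hot column isolates very few rows and therefore yields a small gain.

>>> import numpy as np
>>> def gini(y):
...     # Gini impurity of a label array
...     _, counts = np.unique(y, return_counts=True)
...     p = counts / counts.sum()
...     return 1.0 - np.sum(p ** 2)
...
>>> def split_gain(feature, y):
...     # decrease in weighted Gini impurity from splitting on a binary feature
...     left, right = y[feature == 1], y[feature == 0]
...     child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
...     return gini(y) - child
...
>>> y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
>>> sparse_onehot = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # rare one-hot column
>>> dense_feature = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])  # informative dense column
>>> print(round(split_gain(sparse_onehot, y), 3), round(split_gain(dense_feature, y), 3))
0.056 0.5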

Now, XGBoost 1.7 includes an experimental feature that enables you to train and run models directly on categorical data, without having to manually encode it first. This includes options for letting XGBoost automatically label encode or one-hot encode the data, as well as an optimal partitioning algorithm for efficiently performing splits on categorical data while avoiding the pitfalls of one-hot encoding. Version 1.7 also includes support for missing values and a maximum category threshold to avoid overfitting.
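If you use the native API rather than the scikit-learn wrapper shown below, the same switches apply when you construct the DMatrix and the training parameters. This is only a sketch: X and y are placeholders for a feature DataFrame with category-dtype columns and a numeric label, and the parameter values are illustrative.

>>> import xgboost as xgb
>>> dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)  # X, y are placeholders
>>> params = {
...     "tree_method": "hist",    # "gpu_hist" also supports categorical splits
...     "max_cat_to_onehot": 4,   # one-hot splits only for low-cardinality features
... }
>>> booster = xgb.train(params, dtrain, num_boost_round=100)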

This post provides a quick overview of how to use the new feature in practice on an example dataset that includes several categorical features.

Using XGBoost’s categorical support to predict star types

To use the new feature, you must first load some data. For this example, I used the Kaggle star type prediction dataset.

>>> import pandas as pd
>>> import xgboost as xgb
>>> from sklearn.model_selection import train_test_split
>>> data = pd.read_csv("6 class csv.csv")
>>> print(data.head())
   Temperature (K)  Luminosity(L/Lo)  Radius(R/Ro)  Absolute magnitude(Mv)  \
0             3068          0.002400        0.1700                   16.12  
1             3042          0.000500        0.1542                   16.60  
2             2600          0.000300        0.1020                   18.70  
3             2800          0.000200        0.1600                   16.65  
4             1939          0.000138        0.1030                   20.06  

   Star type Star color Spectral Class 
0          0        Red              M 
1          0        Red              M 
2          0        Red              M 
3          0        Red              M 
4          0        Red              M

Then, extract the target column (Star type) into its own series and split the dataset into training and test datasets.

>>> X = data.drop("Star type", axis=1)
>>> y = data["Star type"]
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)

Next, convert the categorical features to the category dtype.

>>> y_train = y_train.astype("category")
>>> X_train["Star color"] = X_train["Star color"].astype("category")
>>> X_train["Spectral Class"] = X_train["Spectral Class"].astype("category")

Now, to use the new feature, you must set the enable_categorical parameter to True when creating the XGBClassifier object. After that, continue as you normally would when training an XGBoost model. This works with both CPU and GPU tree_methods.

>>> clf = xgb.XGBClassifier(
...     tree_method="gpu_hist", enable_categorical=True, max_cat_to_onehot=1
... )
>>> clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=True,
              eval_metric=None, gamma=0, gpu_id=0, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, 
              objective='multi:softprob', predictor='auto', random_state=0, 
              reg_alpha=0, ...)

Finally, align the test set’s categorical columns with the categories seen during training, then use your model to generate predictions, all without ever needing to one-hot encode or otherwise manually encode the categorical features.

>>> X_test["Star color"] = X_test["Star color"]
    .astype("category")
    .cat.set_categories(X_train["Star color"].cat.categories)
>>> X_test["Spectral Class"] = X_test["Spectral Class"]
    .astype("category")
    .cat.set_categories(X_train["Spectral Class"].cat.categories)
>>> print(clf.predict(X_test))
[1 0 3 3 2 5 1 1 2 1 4 3 4 0 0 4 1 5 2 4 4 1 4 5 5 3 1 4 5 2 0 2 5 5 4 2 5
 0 3 3 0 2 3 3 1 0 4 2 0 4 5 2 0 0 3 2 3 4 4 4]

Summary

We demonstrated how you can use XGBoost’s experimental support for categorical features to improve the training and inference experience on categorical data.

We are excited to hear how this new feature makes your life easier. We have more work coming in the next few months that will help you better understand the underlying tree-based algorithms, and we expect it to pay dividends in your future work.

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU acceleration to your project, reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would affect your work.
