Accelerating Time Series Forecasting with RAPIDS cuML

Time series forecasting is a powerful data science technique used to predict future values based on data points from the past.

Open source Python libraries like skforecast make it easy to run time series forecasts on your data. They allow you to “bring your own” regressor that is compatible with the scikit-learn API, giving you the flexibility to work seamlessly with the model of your choice.

With growing datasets and techniques like direct multi-step forecasting that require you to run several models at once, forecasts can quickly become computationally expensive when running on CPU-based infrastructure.

RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuML is a GPU-accelerated machine learning library for Python with a scikit-learn compatible API.

In this blog post, we show how RAPIDS cuML can be used with skforecast to accelerate time series forecasting, allowing you to work with larger datasets and forecast windows.

Why time series forecasting?

In today’s data-driven world, enterprises rely on time series forecasting to make informed decisions, optimize processes, and mitigate risks. Whether it’s predicting stock market trends, sudden changes in supply or demand, or the spread of diseases, accurate forecasting is essential for planning and strategy.

Historically, monthly or weekly forecasting may have been adequate to support decision making. But with the exponential growth of data and rise in global uncertainty, organizations now need to be able to run forecasts in near real-time to make proactive decisions about their business.

Multi-step forecasting

One popular technique used in time series forecasting is recursive multi-step forecasting, in which you train a single model and apply it recursively to predict the next n values in the series.
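As a rough illustration of the recursive approach, the sketch below fits one one-step-ahead model and feeds each prediction back in as the newest lag. It uses a plain least-squares linear model from NumPy as a stand-in for a real regressor, a noiseless toy sine series, and hypothetical helper names (make_lag_matrix, fit_one_step_model, recursive_forecast); none of these come from skforecast.

```python
import numpy as np

def make_lag_matrix(y, n_lags):
    """Each row holds n_lags consecutive values; the target is the value that follows."""
    X = np.column_stack([y[i : len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

def fit_one_step_model(y, n_lags):
    """Least-squares linear model with an intercept (stand-in for any regressor)."""
    X, target = make_lag_matrix(y, n_lags)
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

def recursive_forecast(y, coef, n_lags, n_steps):
    """Apply the single model repeatedly, feeding each prediction back as an input."""
    window = list(y[-n_lags:])
    preds = []
    for _ in range(n_steps):
        next_value = float(np.dot(coef[:-1], window) + coef[-1])
        preds.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest lag, append the prediction
    return np.array(preds)

# Toy data (assumption for illustration): a noiseless sine wave with a 24-step period
t = np.arange(240)
y = np.sin(2 * np.pi * t / 24)

coef = fit_one_step_model(y, n_lags=6)
preds = recursive_forecast(y, coef, n_lags=6, n_steps=12)
print(preds)
```

Only one model is trained here, no matter how long the forecast horizon is. The trade-off is that every prediction after the first depends on earlier predictions rather than on observed values.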

In contrast, direct multi-step forecasting uses a separate model to predict each future value in your forecast horizon. In other words, you are “directly” trying to forecast n steps ahead, rather than getting there via recursion. Because no prediction is fed back in as an input, errors do not compound from one step to the next, which can produce better results in certain situations. It is also more computationally expensive, since it requires training one model per step.
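The direct approach can be sketched in the same spirit: one model per step of the horizon, each trained to map the same lag window to a target that lies h steps ahead. As before, this is a minimal NumPy sketch with a least-squares linear model standing in for a real regressor; the helper names (fit_direct_models, direct_forecast) and the toy sine series are assumptions for illustration, not skforecast internals.

```python
import numpy as np

def fit_direct_models(y, n_lags, n_steps):
    """Fit one least-squares model per horizon step h = 1..n_steps."""
    models = []
    for h in range(1, n_steps + 1):
        n_rows = len(y) - n_lags - h + 1
        # Row j holds y[j], ..., y[j + n_lags - 1]
        X = np.column_stack([y[i : i + n_rows] for i in range(n_lags)])
        # The target for row j is the value h steps after the end of its window
        target = y[n_lags + h - 1 : n_lags + h - 1 + n_rows]
        A = np.column_stack([X, np.ones(n_rows)])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        models.append(coef)
    return models

def direct_forecast(y, models, n_lags):
    """Each model predicts its own step from the last observed window only."""
    last_window = np.append(y[-n_lags:], 1.0)  # trailing 1.0 multiplies the intercept
    return np.array([float(last_window @ coef) for coef in models])

# Toy data (assumption for illustration): a noiseless sine wave with a 24-step period
t = np.arange(240)
y = np.sin(2 * np.pi * t / 24)

models = fit_direct_models(y, n_lags=6, n_steps=12)
preds = direct_forecast(y, models, n_lags=6)
print(len(models), preds)
```

The training loop runs once per step, so the cost grows linearly with the forecast horizon. With an expensive regressor and a long horizon, that loop is where most of the compute time goes.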

Bringing accelerated computing to direct multi-step forecasting

RAPIDS cuML can be dropped into existing skforecast workflows. In the example below, we create a synthetic hourly time series with a 24-hour seasonal cycle and positive drift. We then use skforecast’s ForecasterDirect class for direct multi-step forecasting and replace the scikit-learn regressor with cuML’s RandomForestRegressor:

import numpy as np
import pandas as pd
from skforecast.direct import ForecasterDirect

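USE_GPU = False  # set to True to run on the GPU with cuML

# Import only the regressor that will be used, so the script also runs
# on machines without a RAPIDS installation
if USE_GPU:
    import cuml
else:
    from sklearn.ensemble import RandomForestRegressor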

# Parameters
n_records = 100000
drift_rate = 0.001
seasonality_period = 24
start_date = '2010-01-01'

# Create synthetic dataset with positive drift
date_rng = pd.date_range(start=start_date, periods=n_records, freq='h')
np.random.seed(42)
noise = np.random.randn(n_records)
drift = np.cumsum(np.ones(n_records) * drift_rate)
# Sine wave that repeats exactly every seasonality_period hours
seasonality = np.sin(2 * np.pi * np.arange(n_records) / seasonality_period)

data = noise + drift + seasonality
df = pd.DataFrame(data, index=date_rng, columns=['y'])

if USE_GPU:
    forecaster = ForecasterDirect(
        regressor=cuml.ensemble.RandomForestRegressor(
            n_estimators=200,
            max_depth=13,
        ),
        steps=100,
        lags=100,
        n_jobs=1,
    )
else:
    forecaster = ForecasterDirect(
        regressor=RandomForestRegressor(
            n_estimators=200,
            max_depth=13,
            n_jobs=-1  # parallelize Random Forest to use all CPU cores
        ),
        steps=100,
        lags=100,
        n_jobs=1,
    )

forecaster.fit(y=df['y'])
predictions = forecaster.predict()

With large datasets like this one, CPU-based regressors can take a long time to churn through each forecast. Recall that direct multi-step forecasting trains a separate model for every step in the forecast horizon, which means 100 models in this example (steps=100). Running this forecast on the CPU took over 43 minutes.

Switching to cuML’s GPU-accelerated regressor allows the entire forecast to finish in just 103 seconds, a 25x speedup with minimal code changes.
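The speedup figure can be checked with quick arithmetic from the two timings quoted above. The per-model averages in this sketch are an assumption: they simply divide the total wall time by the 100 models implied by steps=100, so they also absorb overhead such as building the lag matrices and predicting.

```python
# Timings quoted in the text; the CPU figure is "over 43 minutes", so it is a lower bound
cpu_seconds = 43 * 60
gpu_seconds = 103
n_models = 100  # one model per step, from steps=100 in the example

speedup = cpu_seconds / gpu_seconds
cpu_per_model = cpu_seconds / n_models  # rough average, includes non-training overhead
gpu_per_model = gpu_seconds / n_models

print(f"speedup: about {speedup:.0f}x")
print(f"average per model: {cpu_per_model:.1f} s on CPU, {gpu_per_model:.2f} s on GPU")
```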

Because the forecast runs faster, we can iterate much more quickly and perform hyperparameter optimization to find the best fit, or try out other regressors supported by cuML.

Conclusion

Time series forecasting has been around for decades but remains incredibly important today. Techniques like direct multi-step forecasting can be useful for improving forecasts, but become much more computationally expensive as your data and forecast horizon grow.

Using accelerated computing libraries like RAPIDS cuML with skforecast lets you speed up your forecasting jobs with minimal code changes.

To learn more about accelerated machine learning, visit the cuML documentation, or take the Fundamentals of Accelerated Data Science course from NVIDIA Deep Learning Institute.