This post is part of a series on accelerated data analytics.
Because a standard exploratory data analysis (EDA) workflow is generally constrained to a single CPU core, it benefits from accelerated computing with RAPIDS cuDF, an accelerated data analytics library with a pandas-like interface. Time series data notoriously requires additional processing that adds time and complexity to your workflow, making it another great use case for RAPIDS.
With RAPIDS cuDF, you can speed up time series processing for “Goldilocks” datasets that are not too big and not too small: onerous for pandas, but not large enough to need fully distributed computing tools like Apache Spark or Dask.
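As a minimal illustration of the pandas-like interface (the file and column names here are hypothetical placeholders, not part of the weather analysis below), the same operation reads nearly identically in both libraries:

# Hypothetical CSV with "ticker" and "price" columns, used only for illustration
import pandas as pd
import cudf

pdf = pd.read_csv('trades.csv')    # CPU, single-threaded pandas
gdf = cudf.read_csv('trades.csv')  # GPU-accelerated cuDF

# Identical syntax on either DataFrame
print(pdf.groupby('ticker')['price'].mean())
print(gdf.groupby('ticker')['price'].mean())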
What is time series data?
This section covers machine learning (ML) use cases that rely on time series data and when to consider accelerated data processing.
Time series data is ubiquitous. Timestamps are a variable in many types of data sources, from weather measurements and asset pricing to product purchase information and more.
Timestamps come in all levels of granularity, such as millisecond readings or monthly readings. When timestamp data is eventually leveraged in complex modeling, it becomes time series data, indexing the other variables to make patterns observable.
The following prevalent ML use cases rely heavily on time series data, as do many more:
- Anomaly detection for fraud in the financial services industry
- Predictive analytics in the retail industry
- Sensor readings for weather forecasting
- Recommender systems for content suggestions
Complex modeling use cases often require the processing of large datasets with high-resolution historical data that can span years to decades, as well as processing real-time streaming data. Time series data goes through transformations, such as resampling data up and down to make the period of time consistent among datasets and being smoothed into rolling windows that denoise patterns.
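As a quick sketch of these two transformations, here is what resampling and rolling windows look like in pandas on a made-up minute-level series (not the weather data used later in this post):

import pandas as pd

# Hypothetical minute-level readings, used only to illustrate the transformations
ts = pd.Series(
    range(120),
    index=pd.date_range('2018-01-01', periods=120, freq='min'),
)

hourly = ts.resample('1h').mean()      # resample to a consistent hourly period
smoothed = ts.rolling('10min').mean()  # rolling window that denoises short-term variation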
pandas provides simple, expressive functions to manage these operations, but as you may have observed in your own work, its single-threaded design can quickly be overwhelmed by the amount of processing required. This is especially true for larger datasets and for use cases that need fast data processing turnaround, which time series analysis often does. The resulting wait for pandas to process the data can be frustrating and can delay insights.
These scenarios make cuDF a uniquely good fit for time series data analysis. Using a pandas-like API, you can process tens of gigabytes of data with up to 40x speedups, saving the most valuable asset in any data project: your time.
Time series with RAPIDS cuDF
To showcase the benefit of acceleration for exploring data with RAPIDS cuDF and how readily it can be adopted, this post walks through a subset of time series–processing operations from Time Series Data Analysis, a robust notebook analysis of a publicly available dataset of real weather readings that is available in the RAPIDS GitHub repository.
In the complete analysis, RAPIDS cuDF executed with a 13x speedup (see the benchmarking section later in this post for exact numbers). Speedups typically increase as the full workflow becomes more complex.
Extrapolating to the real world, this gain has real impact. When an hour-long workload can be completed in under 5 minutes, you can meaningfully add time back to your day.
Dataset
Meteonet is a realistic weather dataset that aggregates readings from weather stations located throughout Paris from 2016 to 2018, including missing and invalid data. It is about 12.5 GB in size.
Analysis approach
For this post, imagine you are a data scientist who has received this aggregate data for the first time and must prepare it for a meteorological use case. The specific use case is open-ended: it could be a forecast, a report, or an input to a climate model.
As you review this post, most of the functions should be familiar as they are designed to resemble operations in pandas. This analysis is structured to perform the following tasks:
- Format the DataFrame.
- Resample the time series.
- Run a rolling-window analysis.
This post disregards several data inconsistencies that are addressed in the end-to-end workflow demonstrated in the notebook, Time Series Data Analysis Using cuDF.
Step 1. Format the DataFrame
First, import the packages used in this analysis with the following command:
# Import the necessary packages
import cudf
import cupy as cp
import pandas as pd
Next, read in the CSV data.
## Read in data
gdf = cudf.read_csv('./SE_data.csv')
Begin by focusing on the meteorological parameters of interest: wind speed, temperature, and humidity.
gdf = gdf.drop(columns=['dd','precip','td','psl'])
After the parameters of interest are isolated, perform a series of quick checks. Start the first transformation by converting the date column to the datetime data type. Then, print out the first five rows to visualize what you are working with and assess the size of the tabular dataset.
# Change the date column to the datetime data type. Look at the DataFrame info
gdf['date'] = cudf.to_datetime(gdf['date'])
gdf.head()
gdf.shape
|   | number_sta | lat | lon | height_sta | date | ff | hu | t |
|---|---|---|---|---|---|---|---|---|
| 0 | 1027003 | 45.83 | 5.11 | 196.0 | 2016-01-01 | <NA> | 98.0 | 279.05 |
| 1 | 1033002 | 46.09 | 5.81 | 350.0 | 2016-01-01 | 0.0 | 99.0 | 278.35 |
| 2 | 1034004 | 45.77 | 5.69 | 330.0 | 2016-01-01 | 0.0 | 100.0 | 279.15 |
| 3 | 1072001 | 46.20 | 5.29 | 260.0 | 2016-01-01 | <NA> | <NA> | 276.55 |
| 4 | 1089001 | 45.98 | 5.33 | 252.0 | 2016-01-01 | 0.0 | 95.0 | 279.55 |
The DataFrame shape (127515796, 8) shows 127,515,796 rows by eight columns. Now that the size and shape of the dataset are known, you can start investigating a bit deeper to see how frequently the data is sampled.
## Investigate the sampling frequency: diff() computes the time difference between
## consecutive readings, dt.seconds converts each difference to seconds, and max()
## returns the largest gap in the series.
delta_mins = gdf['date'].diff().dt.seconds.max()/60
print(f"The dataset collection covers from {gdf['date'].min()} to {gdf['date'].max()} with {delta_mins} minute sampling interval")
The dataset covers sensor readings from 2016-01-01T00:00:00.000000000 to 2018-12-31T23:54:00.000000000, at a 6-minute sampling interval. This confirms that the expected dates and times are represented in the dataset.
After a basic review of the dataset is complete, get started with time series-specific formatting. Begin by separating the time increments into separate columns.
gdf['year'] = gdf['date'].dt.year
gdf['month'] = gdf['date'].dt.month
gdf['day'] = gdf['date'].dt.day
gdf['hour'] = gdf['date'].dt.hour
gdf['mins'] = gdf['date'].dt.minute
gdf.tail()
The year, month, day, hour, and minute values are now available in separate columns at the end of the DataFrame, which makes slicing the data by different time increments much simpler.
|   | number_sta | lat | lon | height_sta | date | ff | hu | t | year | month | day | hour | mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 127515791 | 84086001 | 43.811 | 5.146 | 672.0 | 2018-12-31 23:54:00 | 3.7 | 85.0 | 276.95 | 2018 | 12 | 31 | 23 | 54 |
| 127515792 | 84087001 | 44.145 | 4.861 | 55.0 | 2018-12-31 23:54:00 | 11.4 | 80.0 | 281.05 | 2018 | 12 | 31 | 23 | 54 |
| 127515793 | 84094001 | 44.289 | 5.131 | 392.0 | 2018-12-31 23:54:00 | 3.6 | 68.0 | 280.05 | 2018 | 12 | 31 | 23 | 54 |
| 127515794 | 84107002 | 44.041 | 5.493 | 836.0 | 2018-12-31 23:54:00 | 0.6 | 91.0 | 270.85 | 2018 | 12 | 31 | 23 | 54 |
| 127515795 | 84150001 | 44.337 | 4.905 | 141.0 | 2018-12-31 23:54:00 | 6.7 | 84.0 | 280.45 | 2018 | 12 | 31 | 23 | 54 |
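For example, with these columns in place, slicing out a single month becomes a simple boolean filter. Here is a minimal sketch (not part of the original notebook) using the gdf built above:

# Select all readings from February 2017 using the new year and month columns
feb_2017 = gdf[(gdf['year'] == 2017) & (gdf['month'] == 2)]
print(feb_2017.shape)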
Experiment with the updated DataFrame by selecting a specific time range and station to analyze.
# Use cupy.logical_and(...) to select the data from a specific time range and station.
start_time = pd.Timestamp('2017-02-01T00')
end_time = pd.Timestamp('2018-11-01T00')
station_id = 84086001

gdf_period = gdf.loc[
    cp.logical_and(
        cp.logical_and(gdf['date'] > start_time, gdf['date'] < end_time),
        gdf['number_sta'] == station_id,
    )
]
gdf_period.shape
(146039, 13)
The DataFrame has been successfully prepared, with 13 variables and 146,039 rows.
Step 2. Resample the time series
Now that the DataFrame has been set up, run a simple resampling operation. Although the data updates every 6 minutes, for this use case it must be reshaped into a daily cadence.
First, set the date as the index so that the remaining variables are organized along the time axis. Then downsample the data from one record every 6 minutes to one record per day, yielding one record for each variable on each day. Retain the maximum value of each variable within each day as the record for that day.
## Set "date" as the index so that resampling operates on the timestamps.
gdf_period.set_index("date", inplace=True)
## Resample into day-long intervals and keep the maximum value observed in each day.
## Backfill any empty days and use .reset_index() to move "date" back into a regular column.
gdf_day_max = gdf_period.resample('D').max().bfill().reset_index()
gdf_day_max.head()
The data is now available in daily increments. Refer to the following table to confirm that the operation yielded the desired result.
|   | date | number_sta | lat | lon | height_sta | ff | hu | t | year | month | day | hour | mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-01 | 84086001 | 43.81 | 5.15 | 672.0 | 8.1 | 98.0 | 283.05 | 2017 | 2 | 1 | 23 | 54 |
| 1 | 2017-02-02 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 98.0 | 283.85 | 2017 | 2 | 2 | 23 | 54 |
| 2 | 2017-02-03 | 84086001 | 43.81 | 5.15 | 672.0 | 10.1 | 99.0 | 281.45 | 2017 | 2 | 3 | 23 | 54 |
| 3 | 2017-02-04 | 84086001 | 43.81 | 5.15 | 672.0 | 12.5 | 99.0 | 284.35 | 2017 | 2 | 4 | 23 | 54 |
| 4 | 2017-02-05 | 84086001 | 43.81 | 5.15 | 672.0 | 7.3 | 99.0 | 280.75 | 2017 | 2 | 5 | 23 | 54 |
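The max() aggregation is just one choice. Assuming gdf_period is still indexed by date as in the previous step, other reductions can be applied to the same resampled object; a brief sketch:

# Downsample to daily means instead of daily maximums; min() and sum() work the same way
gdf_day_mean = gdf_period.resample('D').mean().bfill().reset_index()
gdf_day_mean.head()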
Step 3. Run a rolling-window analysis
In the previous resampling example, points are aggregated into fixed time bins. You can also smooth the data with a window that rolls across the series.
In the following example, take a 3-day rolling window over the daily data. Again, retain the maximum value of each variable within the window.
# A time-based rolling window requires a DatetimeIndex, so set "date" as the index,
# then take the maximum over a 3-day window.
gdf_3d_max = gdf_day_max.set_index('date').rolling('3d', min_periods=1).max()
gdf_3d_max.reset_index(inplace=True)
gdf_3d_max.head()
Rolling windows can be used to denoise data and assess data stability over time.
|   | date | number_sta | lat | lon | height_sta | ff | hu | t | year | month | day | hour | mins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-01 | 84086001 | 43.81 | 5.15 | 672.0 | 8.1 | 98.0 | 283.05 | 2017 | 2 | 1 | 23 | 54 |
| 1 | 2017-02-02 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 98.0 | 283.85 | 2017 | 2 | 2 | 23 | 54 |
| 2 | 2017-02-03 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 99.0 | 283.85 | 2017 | 2 | 3 | 23 | 54 |
| 3 | 2017-02-04 | 84086001 | 43.81 | 5.15 | 672.0 | 14.1 | 99.0 | 283.35 | 2017 | 2 | 4 | 23 | 54 |
| 4 | 2017-02-05 | 84086001 | 43.81 | 5.15 | 672.0 | 12.5 | 99.0 | 283.35 | 2017 | 2 | 5 | 23 | 54 |
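As a brief sketch of the denoising use case (not part of the original notebook), the same pattern computes a 7-day rolling mean of the temperature column, assuming gdf_day_max from the previous step:

# Smooth the daily temperature readings with a 7-day rolling mean
t_smooth = (
    gdf_day_max.set_index('date')['t']
    .rolling('7d', min_periods=1)
    .mean()
)
t_smooth.head()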
This post walked you through the common steps of time series data processing. While a weather dataset was used for the example, the steps are applicable to all forms of time series data, and the code closely mirrors the time series analysis you write in pandas today.
Performance speedup
When running through the complete notebook with the Meteonet weather dataset, we achieved a 13x speedup on an NVIDIA RTX A6000 GPU using RAPIDS 23.02. The timings are summarized in the following table:

| Configuration | User | Sys | Total |
|---|---|---|---|
| pandas on CPU (Intel Core i7-7800X) | 2 min 32 sec | 27.3 sec | 3 min |
| RAPIDS cuDF on NVIDIA RTX A6000 GPU | 5.33 sec | 8.67 sec | 14 sec |
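To reproduce a comparison like this on your own hardware, one simple approach is to time the same operation in pandas and cuDF. The sketch below uses an illustrative per-station aggregation; the exact numbers will vary with your CPU, GPU, and RAPIDS version:

import time

import cudf
import pandas as pd

def elapsed(fn):
    # Return the wall-clock seconds taken by fn()
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Load the same file into a pandas DataFrame and a cuDF DataFrame
pdf = pd.read_csv('./SE_data.csv')
gdf = cudf.read_csv('./SE_data.csv')

# Time an identical aggregation on each: mean temperature per station
cpu_secs = elapsed(lambda: pdf.groupby('number_sta')['t'].mean())
gpu_secs = elapsed(lambda: gdf.groupby('number_sta')['t'].mean())
print(f"pandas: {cpu_secs:.2f} s | cuDF: {gpu_secs:.2f} s")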
Key takeaways
Time series analysis is a core part of analytics that requires additional processing compared to other variables. With RAPIDS cuDF, you can manage that processing faster and reduce time to insight using the pandas functions you are accustomed to.
To further investigate cuDF in time series analysis, see rapidsai-community/notebooks-contrib on GitHub. To revisit cuDF in EDA applications, see Accelerated Data Analytics: Speed Up Data Exploration with RAPIDS cuDF.
Register for NVIDIA GTC 2023 for free and join us March 20–23 for related data science sessions.
Acknowledgments
Meiran Peng, David Taubenheim, Sheng Luo, and Jay Rodge contributed to this post.