The International Society of Automation (ISA) reports that 5% of plant production is lost annually due to downtime. Put differently, manufacturers across all industry segments surrender roughly $647B globally each year, the corresponding portion of nearly $13T in production. The challenge at hand is predicting the maintenance needs of these machines to minimize downtime, reduce operational costs, and optimize maintenance schedules.
This problem is especially prevalent for companies that offer Desktop as a Service (DaaS), leasing computing devices for commercial use with tight SLAs to meet. The DaaS industry is valued at US$3B and is expected to grow by 12%.
In this post, we discuss a case where we built a predictive model to estimate the remaining useful life (RUL) of computing assets based on various operational parameters, sensor data, and historical maintenance records.
LatentView Analytics
LatentView Analytics supports multiple DaaS clients with operationally complex services, offering advanced data analytics consulting in areas like business intelligence, data analytics and science, data engineering, machine learning, and AI.
We’ve found that organizations can avoid costly downtime by detecting equipment failure before it happens with predictive maintenance algorithms. Advances in data science have made prediction and forecasting widespread within enterprises. Compared to standard operating procedures like routine- or time-based preventive maintenance, predictive maintenance gets ahead of the problem.
LatentView built a solution called PULSE, which is an advanced predictive maintenance solution designed to redefine manufacturing efficiency. By connecting IoT-enabled assets, PULSE uses cutting-edge analytics to provide real-time insights, enabling your team to take proactive measures.
PULSE helps reduce or eliminate unplanned downtime, excessive maintenance costs, and operational inefficiencies. You can predict machine failure with precision, eliminate downtime hassles, and enhance manufacturing efficiency.
Remaining useful life use case
A leading computing device manufacturing client wanted to implement effective preventive maintenance. Part failure on millions of leased computing devices resulted in customer churn and dissatisfaction. The ability to identify failure earlier and make recommendations on repair and replacement would reduce that churn, increasing customer loyalty as well as profitability.
To address the customer’s pain point, we decided to use a predictive maintenance model to forecast the RUL of each machine. The model would help identify how long each machine would run before repair or replacement was required, eliminating part failure when machines went to customers.
To build this predictive maintenance model for computing devices, we first needed to aggregate data from the key thermal, battery, fan, disk, and CPU sensors, which measured temperature, cycle counts, and multiple other aspects of each machine. That data was aggregated and fed into a forecasting model.
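As a simplified illustration, not the production pipeline, the following sketch shows how per-device sensor readings could be aggregated into model-ready features with pandas. The column names, sensors, and aggregations are hypothetical.

import pandas as pd

# Hypothetical raw telemetry: one row per device per 5-minute reading
telemetry = pd.read_csv("telemetry.csv", parse_dates=["timestamp"])

# Aggregate each device's readings into features for the forecasting model
features = (
    telemetry.sort_values(["device_id", "timestamp"])
    .groupby("device_id")
    .agg(
        thermal_mean=("thermal_temp_c", "mean"),
        thermal_max=("thermal_temp_c", "max"),
        battery_cycles=("battery_cycle_count", "max"),
        fan_rpm_std=("fan_rpm", "std"),
        disk_errors=("disk_error_count", "sum"),
        cpu_util_mean=("cpu_util_pct", "mean"),
    )
    .reset_index()
)

# These features, joined with historical maintenance records, feed the RUL model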
The following sections describe our initial attempt, what we learned, and how GPU-accelerated data science helped us speed up implementation and deliver a successful project to our client.
Challenges faced
In our first attempt to build the proof-of-concept for the client, we faced many challenges that centered on computational bottlenecks and extended processing cycle times with our predictive maintenance platform offering, PULSE. This was primarily due to the high volume and constant stream of data required to make effective predictions, which in turn demanded more and larger compute nodes and images to meet the computational requirements.
While these challenges span the entire problem, we primarily wanted tool and library integration with a solution that could scale to dynamic operational conditions. The solution also had to minimize the time to results and optimize the TCO, including infrastructure cost.
Some of the problems we faced included the following:
- Large, real-time datasets
- Sparse and noisy sensor data
- Multivariate relationships
- Protracted timeline
- Cost aspect
- Inferencing challenges
Large, real-time datasets
Over 1 TB of data is collected daily from the millions of machines deployed across multiple locations, with multiple sensors in each machine and readings collected every 5 minutes. This made data processing and cleansing the most time-consuming and tedious task; we spent almost 60% of our time preparing the data.
Continuously retraining the model on the latest data, cleaning data, adding new features, and experimenting with multiple models to finalize the production model also increased the total effort, time, and compute power used.
Sparse and noisy sensor data
In a manufacturing or DaaS environment, sensor data from each machine was often sparse (most of the values being zero or empty), collected at irregular intervals, and prone to noise.
Multivariate relationships
The number of sensor types that had to be considered in a single model for the use case created a complex multivariate situation that increased computational needs.
Protracted timeline
Creating accurate predictions requires a large dataset with a lot of examples to train the model.
As big data use cases continue to grow, CPU performance becomes a major bottleneck. These limitations increased cycle time and costs and became evident in our PoC results.
Cost aspect
Infrastructure must be scaled to reduce cycle time. Large-scale CPU infrastructure incurs significant costs, reducing the return on investment for data-driven enterprises.
Inferencing challenges
Deploying a large-scale prediction process is arduous. It usually requires significant software refactoring, or sometimes rewriting code optimized for the use case, along with hand-offs between teams. In that case, insight generation can be delayed substantially.
An accelerated predictive maintenance solution with RAPIDS
PULSE was built to run on CPU infrastructure using the PyData ecosystem. With the introduction of RAPIDS, we wanted to offer an accelerated PULSE platform to our customers, with RAPIDS as a drop-in replacement for the PyData libraries.
We saw the following high-level benefits of using NVIDIA RAPIDS for the PULSE system:
- Creating faster data pipelines
- Working in a known platform
- Navigating dynamic operational conditions
- Addressing sparse and noisy sensor data
- Benefiting from faster data loading, preprocessing, and model training
Creating faster data pipelines
With the power of the GPU, workloads are parallelized, making the processing of large volumes of near-real-time data less of a burden on CPU infrastructure. With improved performance, we anticipated cost savings as the infrastructure performs more efficiently with fewer resources.
Working in a known platform
Data scientists on our team use Python packages like pandas and scikit-learn. RAPIDS offers syntactically similar or identical packages with the RAPIDS cuDF and cuML libraries that run workloads on the GPU, helping to speed up development time without requiring new skill development.
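As a rough sketch of what this looks like in practice, the following example mirrors a typical scikit-learn-style workflow with cuDF and cuML. The file name, column names, and model choice are placeholders for illustration, not the actual PULSE code.

import cudf
from cuml.ensemble import RandomForestRegressor
from cuml.model_selection import train_test_split

# cuDF mirrors the pandas API; the file and column names here are hypothetical
df = cudf.read_csv("features.csv")
X = df.drop(columns=["rul_hours"]).astype("float32")  # assumes numeric feature columns
y = df["rul_hours"].astype("float32")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# cuML mirrors the scikit-learn estimator API
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)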
Navigating dynamic operational conditions
With GPU acceleration at our disposal, the model seamlessly adjusted to dynamic conditions with additional training data, ensuring that it remained robust and responsive to the evolving patterns and recent trends.
Addressing sparse and noisy sensor data
RAPIDS proved instrumental in tackling the intricacies of sparse and noisy sensor data. Using the GPU-accelerated cuDF library, we experienced a significant boost in data preprocessing speed. Missing values were imputed with unprecedented efficiency, noise was filtered out, and irregularities in data collection were addressed with precision.
RAPIDS laid the foundation for a clean and reliable dataset, setting the stage for more accurate predictive models.
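The exact cleaning logic is client-specific, but a minimal cuDF sketch of the kind of imputation and smoothing involved might look like the following; the column names and window size are illustrative assumptions.

import cudf

# Hypothetical sparse, noisy telemetry with gaps between readings
df = cudf.read_csv("raw_sensor_data.csv")
df = df.sort_values(["device_id", "timestamp"])

# Impute missing readings (zero-fill here; the production logic is more involved)
df["fan_rpm"] = df["fan_rpm"].fillna(0)

# Smooth noise with a rolling mean over the last 12 readings (~1 hour at 5-minute intervals)
# In practice this is applied per device rather than across the whole frame
df["thermal_temp_c_smooth"] = df["thermal_temp_c"].rolling(window=12, min_periods=1).mean()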
Faster data loading, preprocessing, and model training
The RAPIDS data loading, preprocessing, and ETL features are built on Apache Arrow for loading, joining, aggregating, filtering, and otherwise manipulating data, all with a pandas-like API that delivers a >10x speedup. This reduced model iteration time and enabled us to try multiple models in a short time.
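For example, a typical ETL step expressed directly against cuDF reads almost identically to its pandas equivalent. The file and column names below are hypothetical and are shown only to illustrate the API.

import cudf

# pandas-like API on the GPU: load, filter, join, and aggregate
readings = cudf.read_csv("sensor_readings.csv")
devices = cudf.read_csv("device_metadata.csv")

hot = readings[readings["thermal_temp_c"] > 45]                     # filter
joined = hot.merge(devices, on="device_id", how="left")             # join
summary = joined.groupby("model_family")["thermal_temp_c"].mean()   # aggregate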
Figure 2 shows a high-level, representative view of where NVIDIA GPU-accelerated libraries are consumed. We integrated RAPIDS and the RAPIDS Accelerator for Apache Spark into the existing PULSE solution to accelerate the AI/ML layer, as well as data processing and advanced analytics. The integration was seamless, requiring only no-code or low-code changes.
CPU and RAPIDS performance comparison
To evaluate GPU acceleration for the project, we planned a proof-of-concept to benchmark accelerated performance by comparing our CPU-only model to RAPIDS on GPUs. These were the solutions we compared:
- CPU-only test run of the solution with pandas
- Test run with RAPIDS cuDF to accelerate pandas
This pilot was an on-premises setup with the following GPU infrastructure:
- CPU Sockets: 2
- CPU Model: AMD EPYC 7742
- CPU Cores: 128 cores (256 threads with SMT)
- System Memory: 512GiB
- GPU: NVIDIA A100 PCIe
- GPU VRAM: 80GiB
Dataset
For this pilot project, we could not use the production data due to security and data compliance issues, so these benchmarks were done outside of the customer’s secure infrastructure.
We had to fall back to synthetic data with stand-in columns. The current production deployment uses Databricks on Microsoft Azure. For our test, we wanted to keep the synthetic data as close to the actual data as possible, so we used the Spark-based dbldatagen utility from Databricks Labs to generate a 12-GB dataset.
The following example code shows the dataset details:
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas
import numpy
import os

pandas.set_option('display.max_columns', None)
import dbldatagen as dg

# Column definitions are stubs only - modify to generate correct data
generation_spec = (
    dg.DataGenerator(sparkSession=spark, name='synthetic_data',
                     rows=350000000, random=True, partitions=8)
    .withColumn('Unique_Key', 'string', template=r'\\w')
    .withColumn('Month', 'string',
                values=[201908, 202106, 201910, 202102, 202010, 202009, 201907, 202012,
                        202006, 202207, 201909, 202004, 201906, 202206, 202008, 202104,
                        201912, 202204, 202001, 202110, 202011, 202108, 202007, 201911,
                        202208, 202002, 202003, 202203, 202101, 202109, 202103, 202005,
                        202112, 202210, 202105, 202205, 202202, 202209, 202301, 202201,
                        202107, 202302, 202111, 202211, 202212])
    .withColumn('Part_No', 'bigint', minValue=1000000000, maxValue=10000000000)
    # ... additional column definitions elided ...
)
df1 = generation_spec.build()

# Write the generated data to a single CSV file, then read it back for inspection
df1.coalesce(1).write.format("csv").mode("overwrite").save(
    "/rapids/notebooks/host/hostdata1/lv_data/final_data/final_2", header=True)
data = spark.read.csv("/rapids/notebooks/host/hostdata1/lv_data/final_data/final/header.csv",
                      header=True, inferSchema=True)
data.head(1)
[Row(Unique_Key='laboris', Month=201908, Part_No=4821032423, Region='LA', classification='Erratic', order_type='SVC&RPB', WIB=12226712, CPIB=1723240, COIB=122034, Order_Qty=1111, Business_Segment='other', Market_Segment='EMEA_Consumer', Exception_Flag='N', Load_date=202202, Product_Line='5X', Part_Commodity='Service/Support Kit', commoditygroup='nan', Active_flag='Y', Part_Life='Sustaining', FCS_date=41.62, RV='A', ABC='X', XYZ='N', Roll_Flag=52705, Order_Qty_agg=None)]
The code example generates synthetic data of the required size, defined by the rows parameter. The data generation spec defined earlier was built to align with the customer’s production data.
The generated data contains various data types that must pass through preprocessing and feature engineering steps. These steps extract and transform variables from the generated data for use in further advanced analytics and supervised or unsupervised learning.
Data processing, transformation, and feature engineering
In the current deployment for the RUL use case, we looked at various parameters from the machines by collecting data from the thermal, battery, fan, disk, and CPU sensors. Our test included the following tasks, as these are the prerequisites to prepare a dataset to train a machine-learning model:
- Reading data
- Feature engineering
- Drop duplicates
- Drop NA
- One-hot encoding
- Data normalization
- Correlation analysis
- Group-by operations (calculating mean, std, and max for a group)
We ran these tasks with RAPIDS cuDF using the pandas accelerator mode. This new mode brings acceleration with zero changes to the pandas code: you only add the %load_ext cudf.pandas magic command in a Jupyter notebook.
# Magic command for using original pandas code with no-code change with RAPIDS
%load_ext cudf.pandas

# Imports (the extension must be loaded before pandas is imported)
import pandas as pd
import cudf
import cuxfilter
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from cuml.preprocessing import StandardScaler

%%time
# Read/Load data
df = pd.read_csv("/rapids/notebooks/hostdata1/lv_data/q0/t3/part1.csv")

# Feature engineering
# Drop duplicates
df = df.drop_duplicates()
# Drop rows with null values
df = df.dropna()

# Select categorical columns for one-hot encoding
categorical_columns = ['cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4']
# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_columns)

# Normalize values between 0 and 1
scaler = MinMaxScaler()
columns_to_normalize = ['cont_0', 'cont_1', 'cont_2', 'cont_3', 'cont_4', 'cont_5',
                        'cont_6', 'cont_7', 'cont_8', 'cont_9', 'cont_10', 'cont_11',
                        'cont_12', 'cont_13', 'target', 'new_feature_2',
                        'new_feature_3', 'new_feature_1']
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])

# Group-by operations
grouped = df.groupby('cat_0').agg({'target': ['mean', 'std'], 'new_feature_1': 'max'})

# Compute and plot pairwise correlation of columns
# Select continuous features for correlation analysis
continuous_features = ['cont_0', 'cont_1', 'cont_2', 'cont_3', 'cont_4', 'cont_5',
                       'cont_6', 'cont_7', 'cont_8', 'cont_9', 'cont_10', 'cont_11',
                       'cont_12', 'cont_13', 'target', 'new_feature_1',
                       'new_feature_2', 'new_feature_3']

# Create a cuxfilter DataFrame
cux_df = cuxfilter.DataFrame.from_dataframe(df)

# Standardize the data using the cuML StandardScaler
scaler = StandardScaler()
df_std = scaler.fit_transform(cux_df.data[continuous_features])

# Add the standardized data to cux_df
for i, col in enumerate(continuous_features):
    cux_df.data[col] = df_std[i]

# Calculate the correlation matrix
correlation_matrix = cux_df.data[continuous_features].corr()

# Convert the correlation matrix to a cuDF DataFrame
correlation_matrix_cudf = cudf.DataFrame(correlation_matrix)

# Use seaborn for plotting
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_cudf.to_pandas(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()
This workflow saved code migration effort thanks to the zero-code-change feature. It also provided a speedup of ~171x over CPU, calculated as the average of the improvement ratios of the four tasks, for the end-to-end workflow, including data preparation, visualization, model training, and tuning (Figure 3).
Table 1 provides more details on the exact runtime for each step from Figure 3.
| Category | Step | CPU Runtime | GPU Runtime | Improvement Ratio |
|---|---|---|---|---|
| Data Prep | Reading Data | 87 | 14.9 | 5.8 |
| Data Prep | Feature Engineering | 382 | 0.6 | 637.7 |
| Data Prep | Group-by operation | 703 | 17.7 | 39.8 |
| Correlation | Correlation Analysis | 27.1 | 8.3 | 3.3 |
Table 1 shows that the compute-intensive feature engineering and group-by operations achieved tremendous speedups of roughly 638x and 39.8x over pandas. The workflow covered 10 GB of data (43,780,407 rows and 23 columns).
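For reference, the ~171x end-to-end figure quoted earlier is the simple average of the four improvement ratios in Table 1:

ratios = [5.8, 637.7, 39.8, 3.3]  # improvement ratios from Table 1
print(sum(ratios) / len(ratios))  # ~171.7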
Conclusion
Our client found these simulated results compelling, and the solution is now deployed as a proof-of-concept in the customer environment. We anticipate it will be taken to production to improve the RUL prediction model by Q4 2024. After this success, LatentView is excited to continue applying RAPIDS acceleration to modeling projects across manufacturing portfolios.
To try RAPIDS cuDF yourself, get started with Google Colab.