
NVIDIA Hackathon Winners Share Strategies for RAPIDS-Accelerated ML Workflows

Approximately 220 teams gathered at the Open Data Science Conference (ODSC) West 2024 to compete in the NVIDIA hackathon, a 24-hour machine learning (ML) competition. Data scientists and engineers designed models that were evaluated on both accuracy and processing speed. The top three teams walked away with prize packages that included NVIDIA RTX Ada Generation GPUs, Google Colab credits, and more. To earn these top spots, the winning teams leveraged RAPIDS Python APIs to produce the most accurate and performant solutions.

During his talk at ODSC, Nick Becker, product lead for RAPIDS AI at NVIDIA, highlighted that the computational demands of AI, coupled with ever-increasing volumes of generated data, are making data processing the next phase of accelerated computing. Today, approximately 403 million terabytes of data are generated per day, putting immense pressure on data centers to process data more efficiently while delivering higher accuracy, stronger privacy, and faster response times.

As businesses operationalize and streamline AI systems end-to-end, they need to address related data processing bottlenecks. Accelerated computing enables more efficient processing for today’s increasingly complex workflows.

The NVIDIA hackathon demonstrated how data scientists can tackle growing volumes of data and process them faster by leveraging GPU acceleration through PyData libraries, all while using the syntax they already know, with no code changes required.

Participants were provided with approximately 10 GB of synthetic tabular data containing information on 12 million subjects, each described by over 100 anonymous features, both categorical and numerical. Their task was to build a regression model that predicts the target variable, y, minimizing root mean squared error (RMSE); submissions were scored on both accuracy and processing speed. They had 24 hours to solve the problem and optimize their solutions.
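For reference, the evaluation metric is the square root of the mean squared prediction error. A minimal NumPy sketch (variable names are illustrative, not from the competition harness):

import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the average squared residual
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))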

Participants leveraged RAPIDS cuDF through pandas or Polars, and some used RAPIDS cuML or XGBoost to optimize data processing and model training. Participants were encouraged to apply Exploratory Data Analysis (EDA) and feature engineering, and to ensemble multiple ML algorithms. 
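As a quick illustration of that zero-code-change workflow (not code from a winning entry), the cuDF pandas accelerator is loaded in a notebook before importing pandas, after which existing pandas code runs on the GPU where supported and falls back to the CPU otherwise:

# Illustrative only: enable the cuDF pandas accelerator in a Jupyter notebook,
# then use pandas exactly as before. The file name is hypothetical.
%load_ext cudf.pandas

import pandas as pd

train_df = pd.read_csv("train.csv")
train_df.groupby("trickortreat")["y"].mean()   # executed on the GPU via cuDF

Outside a notebook, the same effect is available by launching a script with python -m cudf.pandas script.py.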

This post features insights and strategies from the top three winners: Shyamal Shah, Feifan Liu with teammates Himalaya Dua and Sara Zare, and Lorenzo Mondragon. In their own words, they share how they approached the challenge and some tips and tricks for how they produced the fastest, most accurate solutions. 

Figure 1. More than 1,000 people participated in the NVIDIA hackathon at ODSC West 2024

First place winner: Shyamal Shah

The NVIDIA hackathon challenged me to analyze an extensive tabular dataset using powerful NVIDIA GPUs through Google Colab. My approach prioritized both computational efficiency and predictive accuracy through several key optimizations. First, I leveraged the NVIDIA RAPIDS ecosystem by utilizing the cuDF pandas extension, which automatically accelerated pandas operations on the GPU. Through detailed feature analysis, I discovered that 20 numerical features were effectively duplicates, sharing identical statistical properties when normalized. This insight led me to select just one representative numerical feature, the “magical” column, which had the lowest number of null values.
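Shah doesn't show the duplicate-detection step itself; a hypothetical sketch of the idea, grouping numeric columns by their summary statistics after z-score normalization, might look like this:

# Hypothetical sketch: columns whose normalized summary statistics match are
# treated as effective duplicates, so only one representative needs to be kept.
numeric = train_df.select_dtypes("number").drop(columns=["y"], errors="ignore")
normalized = (numeric - numeric.mean()) / numeric.std()
profiles = normalized.describe().round(3).T        # one row of summary stats per column
duplicate_groups = profiles.groupby(list(profiles.columns)).groups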

# Calculate statistics from the training data for the selected base feature
base_feature = 'magical'
base_median = train_df[base_feature].median()
Q1 = train_df[base_feature].quantile(0.25)
Q3 = train_df[base_feature].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Process the base feature: impute nulls with the median, then clip outliers to the IQR bounds
df_processed['magical'] = df['magical'].fillna(base_median).clip(lower_bound, upper_bound)

For the high-cardinality categorical variables, I implemented target mean encoding with smoothing instead of traditional one-hot encoding, which would have significantly increased the feature dimensionality. By narrowing down the original 106 features to just three key predictors, I substantially reduced the computational overhead while maintaining predictive power. 

# Calculate robust target encodings for high-cardinality categorical variables
cat_encodings = {}
global_mean = train_df['y'].mean()

for col in ['trickortreat', 'kingofhalloween']:
    # Group by category and calculate stats
    cat_stats = (train_df.groupby(col)['y']
                 .agg(['mean', 'count'])
                 .reset_index())

    # Only keep categories that appear more than once
    frequent_cats = cat_stats[cat_stats['count'] > 1].copy()

    # Strong smoothing factor due to high cardinality
    smoothing = 100

    # Calculate smoothed means with stronger regularization
    frequent_cats['encoded'] = (
        (frequent_cats['count'] * frequent_cats['mean'] + smoothing * global_mean) /
        (frequent_cats['count'] + smoothing)
    )

    # Create dictionary only for frequent categories
    cat_encodings[col] = dict(zip(frequent_cats[col], frequent_cats['encoded']))

# Process categorical features
for col in ['trickortreat', 'kingofhalloween']:
    # Map categories to encodings, with special handling for rare/unseen categories
    df_processed[f'{col}_encoded'] = (
        df[col].map(cat_encodings[col])
        .fillna(global_mean)  # Use global mean for rare/unseen categories
    )

The implementation used Microsoft’s LightGBM framework, chosen specifically for its GPU support and strong gradient-boosting performance on large datasets.
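The exact training configuration wasn't published; a minimal sketch of GPU-enabled LightGBM regression on the three engineered features, assuming hypothetical train_processed and test_processed frames produced by the preprocessing above and placeholder hyperparameters, could look like this:

# Illustrative sketch only; hyperparameters are placeholders, not the winning values.
import lightgbm as lgb

features = ['magical', 'trickortreat_encoded', 'kingofhalloween_encoded']

model = lgb.LGBMRegressor(
    objective='regression',
    device='gpu',        # requires a GPU-enabled LightGBM build
    n_estimators=500,    # placeholder
    learning_rate=0.05,  # placeholder
)
model.fit(train_processed[features], train_df['y'])
predictions = model.predict(test_processed[features])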

Through careful parameter tuning and experimental iterations, I optimized the model’s hyperparameters to balance training speed and accuracy. The final solution completed the training and prediction cycle in just 1 minute and 47 seconds while achieving high accuracy. This experience demonstrated how combining GPU-accelerated computing with thoughtful feature engineering and algorithm selection can lead to both efficient and accurate solutions when working with large-scale datasets.

Figure 2. NVIDIA headquarters tour with hackathon winners (left to right) Himalaya Dua, Sara Zare, and Shyamal Shah
Figure 3. NVIDIA headquarters tour with hackathon winner Feifan Liu

Second place winner: Feifan Liu, PhD, and teammates Himalaya Dua and Sara Zare

From my perspective, I think cuDF pandas is really efficient and easy to use. There is no need to learn new APIs for people who are already familiar with the original pandas. It makes loading and manipulating large volumes of data possible. 

One tip is to avoid complex preprocessing such as imputation. Directly assigning missing values as -1 (that is, creating an additional dimension in the feature space) is effective for both model performance and efficiency.

import numpy as np

train_df = df.copy()
# train_df = sample_20_df.copy()

# Split columns by dtype
categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = train_df.select_dtypes(include=['number']).columns.tolist()

# Numeric columns whose only negative value is -1 (used as a missing-value marker)
num_col_only_minus_one = [
    col for col in numerical_cols
    if (train_df[col] < 0).sum() > 0
    and (train_df[col] < 0).sum() == (train_df[col] == -1).sum()
]

# Mark categoricals for XGBoost and convert the -1 markers to NaN so XGBoost treats them as missing
train_df[categorical_cols] = train_df[categorical_cols].astype('category')
train_df[num_col_only_minus_one] = train_df[num_col_only_minus_one].replace(-1, np.nan)

test_df[categorical_cols] = test_df[categorical_cols].astype('category')
test_df[num_col_only_minus_one] = test_df[num_col_only_minus_one].replace(-1, np.nan)

Another tip is to leverage the CUDA support inside XGBoost for accelerated training.

# Baseline parameters, with GPU training enabled via device='cuda'
import xgboost as xgb

xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', eval_metric='rmse',
                                 max_depth=5, n_estimators=500, random_state=42,
                                 device='cuda', enable_categorical=True)
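Training and prediction then run on the GPU without further changes; a hedged usage sketch (the exact feature/target split wasn't shared, so the id handling below is an assumption):

# Illustrative usage; dropping 'id' is an assumption about the feature set.
X_train = train_df.drop(columns=['id', 'y'], errors='ignore')
y_train = train_df['y']

xgb_regressor.fit(X_train, y_train)
test_preds = xgb_regressor.predict(test_df.drop(columns=['id'], errors='ignore'))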

Third place winner: Lorenzo Mondragon

To tackle the challenge, I leveraged RAPIDS to integrate GPU acceleration into both Polars and pandas DataFrames. This enabled efficient preprocessing of the 12 million rows of the tabular data, including handling missing values, encoding categorical features, and sampling data to optimize for model training.
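The Polars GPU engine is selected per query at collect time; a minimal sketch of GPU-backed loading (the file name is hypothetical) looks like this:

# Illustrative sketch: build a lazy Polars query and execute it on the GPU engine
# (requires Polars installed with GPU support, backed by RAPIDS cuDF).
import polars as pl

train_data = pl.scan_parquet("train.parquet").collect(engine="gpu")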

For the regression task, I utilized XGBoost with GPU support (gpu_hist tree method) to train a model with hyperparameters fine-tuned for both accuracy and performance. I focused on:

  • Filling numeric features with column means and categorical features with "Unknown".
  • Encoding categorical data into compact UInt32 formats to improve memory efficiency.
  • Experimenting with lazy loading and sampling through Polars for faster data ingestion and manipulation.
import polars as pl
import polars.selectors as cs

# 1. Handle missing values
numeric_cols = train_data.select(cs.numeric()).columns
categorical_cols = [
    col for col in train_data.columns
    if col not in numeric_cols and col not in ['id', 'y']
]

# Fill missing values
df = train_data.with_columns([
    # Fill numeric columns with the column mean
    *[
        pl.col(col).fill_null(pl.col(col).mean()).alias(col)
        for col in numeric_cols
    ],
    # Fill categorical columns with 'Unknown'
    *[
        pl.col(col).fill_null("Unknown").alias(col)
        for col in categorical_cols
    ]
])
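The categorical encoding and GPU training steps weren't part of the shared snippet; a hedged sketch of those steps, with categorical columns cast to their UInt32 physical codes as described above and placeholder hyperparameters, might look like this:

# Illustrative sketch only; hyperparameters are placeholders, not the submitted values.
import xgboost as xgb

# 2. Encode categorical columns into compact UInt32 codes
df = df.with_columns([
    pl.col(col).cast(pl.Categorical).to_physical().cast(pl.UInt32).alias(col)
    for col in categorical_cols
])

# 3. Train GPU-accelerated XGBoost (newer releases prefer tree_method='hist' with device='cuda')
X = df.drop(['id', 'y']).to_pandas()
y = df['y'].to_pandas()

model = xgb.XGBRegressor(
    tree_method='gpu_hist',
    n_estimators=300,   # placeholder
    max_depth=8,        # placeholder
)
model.fit(X, y)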

In the evaluation phase, the combination of Polars for preprocessing and GPU-accelerated XGBoost allowed me to strike a balance between model accuracy and inference speed. While my model ranked ninth in terms of accuracy, the efficiency gains from RAPIDS boosted my solution to third place overall once performance metrics were factored in. My key takeaways from the hackathon:

  • GPU acceleration is a game-changer: Using RAPIDS significantly reduced data preprocessing and model training times, making it feasible to process massive datasets within tight time constraints.
  • Seamlessly integrate with familiar tools: Adopting RAPIDS required minimal changes to existing pandas and Polars workflows, highlighting the accessibility of GPU-accelerated libraries for data science practitioners.
  • Optimization requires balance: While accuracy is crucial, optimizing for speed can be equally impactful in real-world scenarios where latency and resource efficiency are critical.
  • Community and support matter: The resources and expert advice available during the hackathon were invaluable, especially when navigating cutting-edge tools like the Polars GPU engine and RAPIDS.

Learn more

If you’re new to RAPIDS, check out these resources to get started, and test drive these tutorials for cuDF pandas and Polars. You can watch the webinar, Unlock Hackathon Success with NVIDIA: Tools and Q&A with NVIDIA Kaggle Grandmaster Jiwei Liu, to learn how to leverage GPU acceleration using cuDF pandas or Polars, explore feature engineering techniques, and gain insights from this notebook. Additionally, you can have a look at the sample notebooks created for the hackathon: one for cuDF pandas and one for the Polars GPU Engine, both created by NVIDIA Kaggle Grandmasters.
