Federated XGBoost Made Practical and Productive with NVIDIA FLARE

Jun 28, 2024

By Yuan-Ting Hsieh and Yan Cheng

XGBoost is a highly effective and scalable machine learning algorithm widely employed for regression, classification, and ranking tasks. Building on the principles of gradient boosting, it combines the predictions of multiple weak learners, typically decision trees, to produce a robust overall model.

XGBoost excels with large datasets and complex data structures, thanks to its efficient implementation and advanced features such as regularization, parallel processing, and handling missing values. Its versatility and high performance have made it a popular choice in data science competitions and practical applications across various industries.

The XGBoost 1.7.0 release introduced Federated XGBoost, which enables multiple institutions to jointly train XGBoost models without needing to move data. In the XGBoost 2.0.0 release, this capability was further enhanced to support vertical federated learning. OSS Federated XGBoost provides Python APIs for simulations of XGBoost-based federated training.
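
To make "without needing to move data" concrete, here is a minimal sketch of the idea behind horizontal, histogram-based federated boosting: each site reduces its own rows to per-bin gradient and Hessian sums, and only those sums are combined. The function names, bin edges, and toy values are assumptions for illustration. They are not XGBoost or NVIDIA FLARE APIs.

```python
from bisect import bisect_right

# Shared bin boundaries for one feature (assumed to be agreed on in advance).
BIN_EDGES = [0.0, 1.0, 2.0, 3.0]


def local_histogram(rows):
    """Sum gradients and Hessians per feature bin for one site's rows.

    rows: iterable of (feature_value, gradient, hessian) tuples.
    """
    hist = [[0.0, 0.0] for _ in range(len(BIN_EDGES) + 1)]
    for value, grad, hess in rows:
        b = bisect_right(BIN_EDGES, value)  # index of the bin holding `value`
        hist[b][0] += grad
        hist[b][1] += hess
    return hist


def aggregate(histograms):
    """Element-wise sum of per-site histograms; raw rows are never needed here."""
    total = [[0.0, 0.0] for _ in histograms[0]]
    for hist in histograms:
        for b, (grad_sum, hess_sum) in enumerate(hist):
            total[b][0] += grad_sum
            total[b][1] += hess_sum
    return total


# Toy per-site data: (feature_value, gradient, hessian).
site_a = [(0.5, 0.25, 1.0), (2.5, -0.5, 1.0)]
site_b = [(0.75, 0.5, 1.0), (3.5, 0.125, 1.0)]

global_hist = aggregate([local_histogram(site_a), local_histogram(site_b)])
pooled_hist = local_histogram(site_a + site_b)  # what pooling the data would give
```

In this toy case, the aggregated histogram matches the one computed on pooled data, which is why split finding can proceed without any site sharing its rows.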

Since 2023, NVIDIA Federated Learning Application Runtime Environment (FLARE) has introduced built-in integration with Federated XGBoost features: horizontal histogram-based and tree-based XGBoost, as well as vertical XGBoost. We have also added support for Private Set Intersection (PSI) for sample alignment as a pre-step for vertical training.
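
As a rough illustration of what PSI achieves for sample alignment, the following toy sketch uses a Diffie-Hellman-style double blinding so that two parties can find shared sample IDs while only exchanging blinded values. This is an assumption-laden teaching example, not the PSI implementation that NVIDIA FLARE uses: the prime is far too small for real security, both parties run in one process, and all names are hypothetical.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, not secure for real use


def hash_to_group(sample_id):
    """Map a sample ID to an integer in [1, P - 1]."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret():
    """Pick a secret exponent coprime to P - 1 so that blinding is one-to-one."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, secret):
    """Raise each value to the secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]


def psi(ids_a, ids_b):
    """Return the IDs of party A that party B also holds, in A's order."""
    key_a, key_b = new_secret(), new_secret()  # each party would keep its own key
    # A blinds its hashed IDs and sends them to B, which blinds them again.
    a_double = blind(blind([hash_to_group(i) for i in ids_a], key_a), key_b)
    # B blinds its hashed IDs and sends them to A, which blinds them again.
    b_double = set(blind(blind([hash_to_group(i) for i in ids_b], key_b), key_a))
    # Doubly blinded values match when the IDs match (barring hash collisions).
    return [i for i, d in zip(ids_a, a_double) if d in b_double]


site_1_ids = ["p001", "p002", "p003", "p005"]
site_2_ids = ["p003", "p004", "p005", "p006"]
shared_ids = psi(site_1_ids, site_2_ids)
```

Each site can then keep only the rows whose IDs appear in the shared list, in the same order, before vertical training starts.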

With these integrations, you can prepare and run Federated XGBoost jobs in production or simulation without writing code. You only provide the dataset location, training parameters, and NVIDIA FLARE configurations for the job and data loading function.

In this post, we highlight the key features of NVIDIA FLARE 2.4.1 for real-world federated XGBoost learning. To conduct real-world federated training productively, you must be able to do the following:

Run multiple concurrent XGBoost training experiments with different hyperparameters, feature combinations, or datasets.
Handle potential experiment failures due to unreliable network conditions or interruptions.
Monitor experiment progress through tracking systems such as MLflow or Weights & Biases.

Running multiple experiments concurrently

Data scientists often have to assess the impact of various hyperparameters or features on models. They experiment with different features or combinations of features using the same model.

NVIDIA FLARE parallel job execution capabilities enable you to conduct these experiments concurrently, significantly reducing the time required for training. NVIDIA FLARE manages communication multiplexing on behalf of the users and does not require opening new ports (typically done by IT support) for each job.

Figure 1 shows the execution of two Federated XGBoost jobs in NVIDIA FLARE.
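
The multiplexing idea can be sketched as follows: messages from several jobs share one connection, and a job ID in each message decides which job receives it. This is a conceptual sketch only. The class, message layout, and job names are hypothetical and do not reflect how NVIDIA FLARE implements it internally.

```python
from collections import defaultdict


class JobMultiplexer:
    """Routes messages from one shared channel to per-job handlers (hypothetical)."""

    def __init__(self):
        self._handlers = {}

    def register(self, job_id, handler):
        """Attach the handler that consumes payloads for a running job."""
        self._handlers[job_id] = handler

    def dispatch(self, message):
        """Deliver a message to the handler of the job named in its header."""
        job_id = message["job_id"]
        handler = self._handlers.get(job_id)
        if handler is None:
            raise KeyError(f"no running job with id {job_id!r}")
        return handler(message["payload"])


received = defaultdict(list)
mux = JobMultiplexer()
mux.register("xgb-features-a", received["xgb-features-a"].append)
mux.register("xgb-features-b", received["xgb-features-b"].append)

# Interleaved traffic from two jobs arriving on the same connection.
shared_channel = [
    {"job_id": "xgb-features-a", "payload": "round-1 update"},
    {"job_id": "xgb-features-b", "payload": "round-1 update"},
    {"job_id": "xgb-features-a", "payload": "round-2 update"},
]
for msg in shared_channel:
    mux.dispatch(msg)
```

Because routing happens on the job ID rather than on a network port, starting another job only adds a handler and leaves the network setup unchanged.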

Fault-tolerant XGBoost training

When dealing with cross-region or cross-border training, the network can be less reliable than on a corporate network, leading to periodic interruptions. These interruptions can cause communication failures, resulting in job interruptions and necessitating a restart from the beginning or from a saved snapshot.

The reliability features of the NVIDIA FLARE communicator for XGBoost automatically handle message retries during network interruptions, ensuring resilience and maintaining learning continuity and data integrity (Figure 2).
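
The general pattern behind this kind of reliability can be sketched as retry with backoff on the sender plus duplicate detection on the receiver, so that a resent message is never applied twice. The classes, message fields, and delay values below are hypothetical and are not the NVIDIA FLARE communicator's actual implementation.

```python
import time


class Receiver:
    """Applies each message at most once, keyed by message ID."""

    def __init__(self):
        self.seen = {}
        self.applied = []

    def handle(self, message):
        msg_id = message["id"]
        if msg_id not in self.seen:  # ignore duplicates caused by resends
            self.applied.append(message["payload"])
            self.seen[msg_id] = "ack"
        return self.seen[msg_id]


class FlakyLink:
    """Simulated network link that drops the first `failures` sends."""

    def __init__(self, receiver, failures):
        self.receiver = receiver
        self.failures = failures

    def send(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network interruption")
        return self.receiver.handle(message)


def send_with_retry(link, message, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Resend on connection errors, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return link.send(message)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


receiver = Receiver()
link = FlakyLink(receiver, failures=2)
message = {"id": "round-3/site-1", "payload": "histogram-update"}

ack = send_with_retry(link, message)            # two drops, then delivered
duplicate_ack = send_with_retry(link, message)  # resend, for example after a lost ack
```

The sender recovers from the interruption without restarting the job, and the receiver's ID check keeps the resend from being applied a second time.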

NVIDIA FLARE handles the communication between XGBoost federated servers and clients, providing a robust and efficient solution for federated learning.

For more information and an end-to-end example, see the /NVIDIA/NVFlare GitHub repo.

Video 1. Federated XGBoost with NVIDIA FLARE

Federated experiment tracking

When you’re conducting machine learning training, especially in distributed settings like federated learning, it’s crucial to monitor training and evaluation metrics closely.

NVIDIA FLARE provides built-in integration with experiment tracking systems—MLflow, Weights & Biases, and TensorBoard—to facilitate comprehensive monitoring of these metrics.

With NVIDIA FLARE, you can choose between decentralized and centralized tracking configurations:

Decentralized tracking: Each client manages its own metrics and experiment tracking server locally, maintaining training metric privacy. However, this setup limits the ability to compare data across different sites.
Centralized tracking: All metrics are streamed to a central FL server, which then pushes the data to a selected tracking system. This setup supports effective cross-site metric comparisons.
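
The difference between the two configurations comes down to where each metric event is delivered. The following sketch routes the same events both ways. The store class, mode names, and metric values are hypothetical stand-ins for illustration, not NVIDIA FLARE components or real results.

```python
from collections import defaultdict


class TrackingStore:
    """Stand-in for an experiment tracking server (hypothetical)."""

    def __init__(self):
        self.records = defaultdict(list)  # metric name -> [(site, step, value)]

    def log(self, site, name, value, step):
        self.records[name].append((site, step, value))

    def sites_for(self, name):
        """Sites whose values for this metric can be compared in this store."""
        return sorted({site for site, _, _ in self.records[name]})


def route_metrics(events, mode):
    """Deliver metric events to stores and return {store owner: TrackingStore}."""
    if mode not in ("centralized", "decentralized"):
        raise ValueError(f"unknown tracking mode: {mode!r}")
    stores = defaultdict(TrackingStore)
    for site, name, value, step in events:
        # Centralized: everything goes to the server. Decentralized: stays on site.
        owner = "server" if mode == "centralized" else site
        stores[owner].log(site, name, value, step)
    return dict(stores)


# Made-up metric events: (site, metric name, value, step).
events = [
    ("site-1", "auc", 0.81, 1),
    ("site-2", "auc", 0.78, 1),
    ("site-1", "auc", 0.84, 2),
]
central = route_metrics(events, "centralized")
local = route_metrics(events, "decentralized")
```

In the centralized case, one store holds values from both sites and can compare them. In the decentralized case, each store only ever sees its own site.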

The NVIDIA FLARE job configuration enables you to choose the tracking scenario or system that best fits your needs. When you need to migrate from one experiment tracking system to another, you can modify the job configuration without rewriting the experiment tracking code.

Adding MLflow, Weights & Biases, or TensorBoard logging to efficiently stream metrics to the respective server requires just three lines of code:

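from nvflare.client.tracking import MLflowWriter

# Create a writer that streams metrics through NVIDIA FLARE
mlflow = MLflowWriter()

# running_loss and global_step come from your existing training loop
mlflow.log_metric("loss", running_loss / 2000, global_step)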

The nvflare.client.tracking API enables you to flexibly redirect your logged metrics to any destination. Whether your code uses MLflow, Weights & Biases, or TensorBoard syntax does not determine where the metrics end up, because you can stream the collected metrics to any supported experiment tracking system. Choose MLflowWriter, WandBWriter, or TBWriter based on your existing code and requirements.

MLflowWriter uses the MLflow API operation log_metric.
TBWriter uses the TensorBoard SummaryWriter operation add_scalar.
WandBWriter uses the Weights & Biases API operation log.
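
To show why the choice of syntax is independent of the destination, the following sketch defines three stand-in writers that mimic the call styles listed above and forward every metric to the same sink. These classes are hypothetical and are not the nvflare.client.tracking writers. Their method signatures are assumptions modeled on the operations named in the list.

```python
class MetricSink:
    """Collects (name, value, step) events; stands in for the streaming layer."""

    def __init__(self):
        self.events = []

    def emit(self, name, value, step):
        self.events.append((name, float(value), step))


class MLflowStyleWriter:
    """Accepts MLflow-like calls: log_metric(key, value, step)."""

    def __init__(self, sink):
        self.sink = sink

    def log_metric(self, key, value, step=None):
        self.sink.emit(key, value, step)


class TBStyleWriter:
    """Accepts TensorBoard-like calls: add_scalar(tag, scalar, global_step)."""

    def __init__(self, sink):
        self.sink = sink

    def add_scalar(self, tag, scalar, global_step=None):
        self.sink.emit(tag, scalar, global_step)


class WandBStyleWriter:
    """Accepts Weights & Biases-like calls: log({name: value}, step)."""

    def __init__(self, sink):
        self.sink = sink

    def log(self, metrics, step=None):
        for name, value in metrics.items():
            self.sink.emit(name, value, step)


sink = MetricSink()
MLflowStyleWriter(sink).log_metric("loss", 0.5, 1)
TBStyleWriter(sink).add_scalar("loss", 0.25, 2)
WandBStyleWriter(sink).log({"loss": 0.125}, 3)
```

All three call styles produce the same kind of event, so the receiving side can forward them to whichever tracking system the job configuration selects.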

After you’ve modified the training code, you can use the NVIDIA FLARE job configuration to select which tracking system receives the streamed metrics.

For more information, see the FedAvg with SAG workflow with MLflow tracking tutorial.

Summary

In this post, we described the concurrent job execution, reliable XGBoost, and experiment tracking features of NVIDIA FLARE 2.4.x. For more information, see the /NVIDIA/NVFlare 2.4 branch on GitHub and the NVIDIA FLARE 2.4 documentation.

Any questions or comments? Reach out to us at federatedlearning@nvidia.com.

About the Authors

About Yuan-Ting Hsieh
Yuan-Ting is a senior software engineer at NVIDIA. He is working on NVIDIA FLARE, an application runtime environment designed for NVIDIA federated learning initiatives. Before his work on NVIDIA FLARE, Yuan-Ting was an integral part of the team that developed the Clara Train SDK and AIAA (Artificial Intelligence Assisted Annotation), which has since been integrated into MONAI (Medical Open Network for AI). Yuan-Ting holds a master’s degree in Computer Science from the University of Wisconsin-Madison and a bachelor’s degree in Electrical Engineering from National Taiwan University. He’s particularly interested in the intersection of machine learning and distributed systems.

About Yan Cheng
Yan Cheng is the lead of the engineering team that works closely with the DLMED researchers to architect and implement the Clara Train SDK. He has many years of experience building industry-quality software systems. Before joining NVIDIA, he served as a chief architect for AOL, and did IT consulting for the federal government.
