Technical Walkthrough

Federated Learning from Simulation to Production with NVIDIA FLARE

Discuss (0)

NVIDIA FLARE 2.2 includes a host of new features that reduce development time and accelerate deployment for federated learning, helping organizations cut costs for building robust AI. Get the details about what’s new in this release.

An open-source platform and software development kit (SDK) for Federated Learning (FL), NVIDIA FLARE continues to evolve to enable its end users to leverage distributed, multiparty collaboration for more robust AI development from simulation to production.

The release of FLARE 2.2 brings numerous updates that simplify the research and development workflow for researchers and data scientists, streamline deployment for IT practitioners and project leaders, and strengthen security to ensure data privacy in real-world deployments. These include:

Simplifying the researcher and developer workflow

  • FL Simulator for rapid development and debugging
  • Federated statistics
  • Integration with MONAI and XGBoost

Streamlining deployment, operations, and security

  • FLARE Dashboard
  • Unified FLARE CLI
  • Client-side privacy policies
Diagram of the end-to-end NVIDIA FLARE 2.2 workflow showing the progression from application development using FL Simulator, deployment with the FLARE Dashboard, to simplified operations and security with the FLARE console and site security policies.
Figure 1.  The end-to-end NVIDIA FLARE 2.2 workflow 

FL Simulator: Rapid development and debugging

One of the key features to enable research and development workflows is the new FL Simulator. The Simulator allows researchers and developers to run and debug a FLARE application without the overhead of provisioning and deploying a project. The Simulator provides a lightweight environment with a FLARE server and any number of connected clients on which an application can be deployed. Debugging is possible by leveraging the Simulator Runner API, allowing developers to drive an application with simple Python scripts to create breakpoints within the FLARE application code.

The Simulator is designed to accommodate systems with limited resources, such as a researcher’s laptop, by running client processes sequentially in a limited number of threads. The same simulation can be easily run on a larger system with multiple GPUs by allocating a client or multiple clients per GPU. This gives the developer or researcher a flexible environment to test application scalability. Once the application has been developed and debugged, the same application code can be directly deployed on a production, distributed FL system without change.

Screen captures of the FLARE Dashboard showing the Project Admin user management panel and the User panel for self-service downloads of project configuration and client software.
Figure 2. The FLARE Dashboard showing the Project Admin user management panel (left), and the User panel for self-service downloads of project configuration and client software (right)

Federated learning workflows and federated data science

FLARE 2.2 also introduces new integrations and federated workflows designed to simplify application development and enable federated data science and analytics.

Federated statistics

When working with distributed datasets, it is often important to assess the data quality and distribution across the set of client datasets. FLARE 2.2 provides a set of federated statistics operators (controllers and executors) that can be used to generate global statistics based on individual client-side statistics.  

The workflow controller and executor are designed to allow data scientists to quickly implement their own statistical methods (generators) based on the specifics of their datasets of interest. Commonly used statistics are provided out-of-the box, including count, sum, mean, standard deviation, and histograms, along with routines to visualize the global and individual statistics. The built-in visualization tools can be used to view statistics across all datasets at all sites as well as global aggregates, for example in a notebook utility as shown in Figure 3.

Example histograms from the federated statistics image example using the built-in FLARE statistics visualization class.
Figure 3. Example histograms from the federated statistics image example using the built-in FLARE statistics visualization class

In addition to these new workflows, the existing set of FLARE examples have been updated to integrate with the FL Simulator and leverage new privacy and security features. These example applications leverage common Python toolkits like NumPy, PyTorch, and Tensorflow, and highlight workflows in training, cross validation, and federated analysis.

Integration of FLARE and MONAI

MONAI, the Medical Open Network for AI, recently released an abstraction that allows MONAI models packaged in the MONAI Bundle (MB) format to be easily extended for federated training on any platform that implements client training algorithms in these new APIs. FLARE 2.2 includes a new client executor that makes this integration, allowing MONAI model developers to easily develop and share models using the bundle concept, and then seamlessly deploy these models in a federated paradigm using NVIDIA FLARE.

Diagram showing MONAI FL integration with ClientAlgo API
Figure 4. MONAI FL integration with ClientAlgo API

To see an example of using FLARE to train a medical image analysis model using federated averaging (FedAvg) and MONAI Bundle, visit NVFlare on GitHub. The example shows how to download the dataset, download the spleen_ct_segmentation bundle from the MONAI Model Zoo, and how to execute it with FLARE using either the FL simulator or POC mode. 

MONAI also allows computing summary data statistics on the datasets defined in the bundle. These can be shared and visualized in FLARE using the federated statistics operators described above. The use of federated statistics and MONAI is included in the GitHub example above.

Federated spleen segmentation in abdominal CT using MONAI bundle from the Model Zoo. For details, visit NVFlare on GitHub.
Figure 5. Federated spleen segmentation in abdominal CT using MONAI bundle from the Model Zoo. For details, visit NVFlare on GitHub.

XGBoost integration

A common request from the federated learning user community is support for more traditional machine learning frameworks in a federated paradigm. FLARE 2.2 provides examples that illustrate horizontal federated learning using two approaches: histogram-based collaboration and tree-based collaboration.  

The community DMLC XGBoost project recently released an adaptation of the existing distributed XGBoost training algorithm that allows federated clients to act as distinct workers in the distributed algorithm. This distributed algorithm is used in a reference implementation of horizontal federated learning that demonstrates the histogram-based approach.

FLARE 2.2 also provides a reference federated implementation of tree-based boosting using two methods: Cyclic Training and Bagging Aggregation. In the Cyclic Training method, multiple sites execute tree boosting on their own local data, forwarding the resulting tree sequence to the next client in the federation for the subsequent round of boosting. In the method of Bagging Aggregation, all sites start from the same global model and boost a number of trees based on their local data. The resulting trees are then aggregated by the server for the next round’s boosting.

Real-world federated learning

The new suite of tools and workflows available in FLARE 2.2 allow developers and data scientists to quickly build applications and more easily bring them to production in a distributed federated learning deployment. When moving to a real-world distributed deployment, there are many considerations for security and privacy that must be addressed by both the project leader and developers, as well as the individual sites participating in the federated learning deployment.

FLARE Dashboard: Streamlined deployment

New in 2.2 is the FLARE Dashboard, designed to simplify project administration and deployment for lead researchers and IT practitioners supporting real-world FL deployments. The FLARE Dashboard allows a project administrator to deploy a website that can be used to define project details, gather information about participant sites, and distribute the startup kits that are used to connect client sites.

The FLARE Dashboard is backed by the same provisioning system in previous versions of the platform and allows users the flexibility to choose either the web UI or the classic command line provisioning, depending on project requirements. Both the Dashboard and provisioning CLI now support dynamic provisioning, allowing project administrators to add federated and admin clients on-demand. This ability to dynamically allocate new training and admin clients without affecting existing clients dramatically simplifies management of the FL system over the lifecycle of the project.

Unified FLARE CLI

The FLARE command-line interface (CLI) has been completely rewritten to consolidate all commands under a common top-level nvflare CLI and introduce new convenience tools for improved usability. 

$ nvflare -h
 
usage: nvflare [-h] [--version] {poc,preflight_check,provision,simulator,dashboard,authz_preview} ...

Subcommands include all of the pre-existing standalone CLI tools like poc, provision, and authz_preview, as well as new commands for launching the FL Simulator and the FLARE Dashboard.  The nvflare command now also includes a preflight_check that provides administrators and end-users a tool to verify system configuration, connectivity to other FLARE subsystems, proper storage configuration, and perform a dry-run connection of the client or server.

Improved site security

The security framework of NVIDIA FLARE has been redesigned in 2.2 to improve both usability and overall security. The roles that are used to define privileges and system operation policies have been streamlined to include: Project Admin, Org Admin, Lead Researcher, and Member Researcher.  The security framework has been strengthened based on these roles, to allow individual organizations and sites to implement their own policies to protect individual privacy and intellectual property (IP) through a Federated Authorization framework.

Federated Authorization shifts both the definition and enforcement of privacy and security policies to individual organizations and member sites, allowing participants to define their own fine-grained site policy:

  • Each organization defines its policy in its own authorization.json configuration
  • This locally defined policy is loaded by FL clients owned by the organization
  • The policy is also enforced by these FL clients

The site policies can be used to control all aspects of the federated learning workflow, including:

  • Resource management: The configuration of system resources that are solely the decisions of local IT
  • Authorization policy: Local authorization policy that determines what a user can or cannot do on the local site
  • Privacy policy: Local policy that specifies what types of studies are allowed and how to add privacy protection to the learning results produced by the FL client on the local site
  • Logging configuration: Each site can now define its own logging configuration for system generated log messages

These site policies also allow individual sites to enforce their own data privacy by defining custom filters and encryption applied to any information passed between the client site and the central server.

This new security framework provides project and organization administrators, researchers, and site IT the tools required to confidently take a federated learning project from proof-of-concept to a real-world deployment.

Getting started with NVIDIA FLARE 2.2

We’ve highlighted just some of the new features in FLARE 2.2 that allow researchers and developers to quickly adopt the platform to prototype and deploy federated learning workflows. Tools like the FL Simulator and FLARE Dashboard for streamlined development and deployment, along with a growing set of reference workflows, make it easier and faster than ever to get started and save valuable development time.

In addition to the examples detailed in this post, FLARE 2.2 includes many other enhancements that increase power and flexibility of the platform, including:

  • Examples for Docker compose and Helm deployment
  • Preflight checks to help identify and correct connectivity and configuration issues
  • Simplified POC commands to test distributed deployments locally
  • Updated example applications

To learn more about these features and get started with the latest examples, visit the NVIDIA FLARE documentation. As we are actively developing the FLARE platform to meet the needs of researchers, data scientists, and platform developers, we welcome any suggestions and feedback in the NVIDIA FLARE GitHub community.  

Join us for the webinar, Federated Learning with NVIDIA Flare: From Simulation to Real World to see an overview of the platform and some of these new features in action.