How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks

Today’s demanding AI developer workloads often need more memory than desktop systems provide, or require software that laptops and PCs lack. This forces the work into the cloud or the data center.

NVIDIA DGX Spark provides an alternative to cloud instances and data-center queues. The Blackwell-powered compact supercomputer delivers 1 petaflop of FP4 AI compute performance, 128 GB of coherent unified system memory, 273 GB/s of memory bandwidth, and the NVIDIA AI software stack preinstalled. With DGX Spark, you can run large, compute-intensive tasks locally, without moving to the cloud or data center.

We’ll walk you through how DGX Spark’s compute performance, large memory, and preinstalled AI software accelerate fine-tuning, image generation, data science, and inference workloads. Keep reading for some benchmarks.

Fine-tuning workloads on DGX Spark

Tuning pre-trained models is a common task for AI developers. To show how DGX Spark performs at this workload, we ran three tuning tasks using different methodologies: full fine-tuning, LoRA, and QLoRA. 

In full fine-tuning of a Llama 3.2 3B model, we reached a peak of 82,739.2 tokens per second. Tuning a Llama 3.1 8B model using LoRA on DGX Spark reached a peak of 53,657.6 tokens per second. Tuning a Llama 3.3 70B model using QLoRA on DGX Spark reached a peak of 5,079.04 tokens per second.

Since fine-tuning is so memory-intensive, none of these tuning workloads can run on a 32 GB consumer GPU.
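A quick back-of-the-envelope check makes the memory claim concrete. Using the common rule of thumb for mixed-precision Adam training (not an NVIDIA measurement), full fine-tuning needs roughly 16 bytes per parameter before activations are even counted:

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision:
# ~2 bytes (BF16 weights) + 2 (BF16 gradients) + 4 (FP32 master weights)
# + 8 (FP32 Adam moments) = ~16 bytes per parameter, excluding activations.
params = 3e9  # Llama 3.2 3B
print(f"{params * 16 / 1e9:.0f} GB")  # ~48 GB, already past a 32 GB GPU
```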

Fine-tuning

| Model | Method | Backend | Configuration | Peak tokens/sec |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B | Full fine-tuning | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, BF16 | 82,739.20 |
| Llama 3.1 8B | LoRA | PyTorch | Sequence length: 2048, batch size: 4, epochs: 1, steps: 125, BF16 | 53,657.60 |
| Llama 3.3 70B | QLoRA | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, FP4 | 5,079.04 |

Table 1. Fine-tuning performance
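
If you want to try a run like this yourself, here is a minimal LoRA fine-tuning sketch using Hugging Face transformers and peft. It mirrors the Table 1 LoRA configuration (sequence length 2048, batch size 4, 1 epoch, 125 steps, BF16), but the checkpoint, dataset, and adapter settings are illustrative assumptions, not NVIDIA's exact benchmark harness.

```python
# Minimal LoRA fine-tuning sketch (illustrative, not NVIDIA's benchmark code).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_id = "meta-llama/Llama-3.1-8B"  # assumed checkpoint; requires HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only these train.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Small sample dataset, assumed for illustration only
ds = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)
ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, max_steps=125, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```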

DGX Spark’s image-generation capabilities

Image generation models are always pushing for greater accuracy, higher resolutions, and faster performance. Creating high-resolution images, or multiple images per prompt, demands more memory as well as more compute.

DGX Spark’s large GPU memory and strong compute performance let you work with larger-resolution images and higher-precision models for higher image quality. Support for the FP4 data format enables DGX Spark to generate images quickly, even at high resolutions.

Using the Flux.1 12B model at FP4 precision, DGX Spark can generate a 1K image every 2.6 seconds (see Table 2 below). DGX Spark’s large system memory provides the capacity necessary to run a BF16 SDXL 1.0 model and generate seven 1K images per minute.

Image generation

| Model | Precision | Backend | Configuration | Images/min |
| --- | --- | --- | --- | --- |
| Flux.1 12B Schnell | FP4 | TensorRT | Resolution: 1024×1024, denoising steps: 4, batch size: 1 | 23 |
| SDXL 1.0 | BF16 | TensorRT | Resolution: 1024×1024, denoising steps: 50, batch size: 2 | 7 |

Table 2. Image-generation performance
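
For reference, here is a minimal sketch of the SDXL 1.0 configuration from Table 2 using Hugging Face diffusers at BF16. The prompt is illustrative; the FP4 Flux.1 row relies on a TensorRT-optimized pipeline that this sketch doesn't cover.

```python
# Minimal BF16 SDXL sketch mirroring the Table 2 configuration
# (1024x1024, 50 denoising steps, batch size 2). Illustrative only.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

images = pipe(
    prompt=["a photo of a compact supercomputer on a desk"] * 2,  # batch size 2
    height=1024, width=1024,
    num_inference_steps=50,  # denoising steps from Table 2
).images
for i, img in enumerate(images):
    img.save(f"sdxl_{i}.png")
```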

Using DGX Spark for data science

DGX Spark supports foundational CUDA-X libraries like NVIDIA cuML and cuDF. NVIDIA cuML accelerates machine-learning algorithms in scikit-learn, as well as UMAP and HDBSCAN, on GPUs with zero code changes required.

For computationally intensive ML algorithms like UMAP and HDBSCAN, DGX Spark can process 250 MB datasets in seconds. (See Table 3 below.) NVIDIA cuDF significantly speeds up common pandas data analysis tasks like joins and string methods. cuDF pandas operations on datasets with tens of millions of records run in just seconds on DGX Spark.

Data science

| Library | Benchmark | Dataset size | Time |
| --- | --- | --- | --- |
| NVIDIA cuML | UMAP | 250 MB | 4 secs |
| NVIDIA cuML | HDBSCAN | 250 MB | 10 secs |
| NVIDIA cuDF pandas | Key data analysis operations (joins, string methods, UDFs) | 0.5 to 5 GB | 11 secs |

Table 3. Data-science performance
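
The zero-code-change workflow is worth illustrating. Below is a minimal sketch of the cuDF pandas accelerator mode; the dataset and operations are illustrative, not the Table 3 benchmark. cuML offers the analogous `cuml.accel` mode for scikit-learn, UMAP, and HDBSCAN.

```python
# Minimal sketch of zero-code-change acceleration with cuDF's pandas
# accelerator mode. In a notebook, run `%load_ext cudf.pandas` before
# importing pandas; for a script, launch it as:
#   python -m cudf.pandas analysis.py
# The pandas code itself is unchanged.
import pandas as pd  # transparently GPU-backed once the accelerator is loaded

# Illustrative data, not the Table 3 benchmark dataset
df = pd.DataFrame({
    "key": ["a", "b", "c", "d"] * 2_500_000,  # 10M rows
    "value": range(10_000_000),
})

# Operations of the kind cuDF accelerates: joins, groupbys, string methods
summary = df.groupby("key", as_index=False)["value"].mean()
df["key_upper"] = df["key"].str.upper()
merged = df.merge(summary, on="key", suffixes=("", "_mean"))
print(merged.head())
```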

Using DGX Spark for inference

DGX Spark’s Blackwell GPU supports the FP4 data format, specifically NVFP4, which provides near-FP8 accuracy (<1% degradation). This enables the use of smaller models without sacrificing accuracy, and FP4’s smaller data footprint also improves performance. Table 4 below provides inference performance data for DGX Spark.

DGX Spark supports multiple 4-bit data formats (NVFP4 and MXFP4) and a range of backends, including TRT-LLM, llama.cpp, and vLLM. The system’s 1 petaflop of AI performance enables fast prompt processing, as shown in Table 4. Quick prompt processing yields a faster time to first token, which delivers a better experience for users and speeds up end-to-end throughput.

Inference (ISL|OSL = 2048|128, batch size = 1)

| Model | Precision | Backend | Prompt processing throughput (tokens/sec) | Token generation throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| Qwen3 14B | NVFP4 | TRT-LLM | 5,928.95 | 22.71 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 3,670.42 | 82.74 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1,725.47 | 55.37 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 10,256.9 | 38.65 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 65,831.77 | 41.71 |
| Qwen3 235B (on dual DGX Spark) | NVFP4 | TRT-LLM | 23,477.03 | 11.73 |

Table 4. Inference performance

NVFP4: A 4-bit floating-point format introduced with the NVIDIA Blackwell GPU architecture.
MXFP4: Microscaling FP4, a 4-bit floating-point format created by the Open Compute Project (OCP).
ISL (input sequence length): The number of tokens in the input prompt (a.k.a. prefill tokens).
OSL (output sequence length): The number of tokens generated by the model in response (a.k.a. decode tokens).
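
To try local inference yourself, the sketch below uses vLLM, one of the backends listed above. The checkpoint name is an assumption, and the sampling setting simply mirrors the OSL=128 configuration from Table 4; the published numbers come from NVIDIA's own TRT-LLM and llama.cpp harnesses.

```python
# Minimal local-inference sketch with vLLM (illustrative, not the
# benchmark harness behind Table 4).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint
params = SamplingParams(max_tokens=128)  # OSL = 128, mirroring Table 4

prompt = "Summarize the benefits of 4-bit quantized inference."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```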

We also connected two DGX Sparks via their ConnectX-7 chips to run the Qwen3 235B model. The model uses over 120 GB of memory, including overhead. Such models typically run on large cloud or data-center servers, but the fact that they can run on dual DGX Spark systems shows what’s possible for developer experimentation. As shown in the last row of Table 4, the token generation throughput on dual DGX Sparks was 11.73 tokens per second.

The new NVFP4 version of the NVIDIA Nemotron Nano 2 model also performs well on DGX Spark. With the NVFP4 version, you can achieve up to 2x higher throughput with little to no accuracy degradation. Download the model checkpoints from Hugging Face or as an NVIDIA NIM.

Get your DGX Spark, join the DGX Spark developer community, and start your AI-building journey today.
