How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks

Today’s demanding AI developer workloads often need more memory than desktop systems provide, or require software that laptops and PCs lack. This forces the work into the cloud or the data center.

NVIDIA DGX Spark provides an alternative to cloud instances and data-center queues. The Blackwell-powered compact supercomputer delivers 1 petaflop of FP4 AI compute performance, 128 GB of coherent unified system memory, 273 GB/s of memory bandwidth, and the NVIDIA AI software stack preinstalled. With DGX Spark, you can run large, compute-intensive tasks locally, without moving to the cloud or data center.

We’ll walk you through how DGX Spark’s compute performance, large memory, and preinstalled AI software accelerate fine-tuning, image generation, data science, and inference workloads. Keep reading for some benchmarks.

Fine-tuning workloads on DGX Spark

Tuning pre-trained models is a common task for AI developers. To show how DGX Spark performs at this workload, we ran three tuning tasks using different methodologies: full fine-tuning, LoRA, and QLoRA. 

In full fine-tuning of a Llama 3.2 3B model, we reached a peak of 82,739.2 tokens per second. Tuning a Llama 3.1 8B model using LoRA on DGX Spark reached a peak of 53,657.6 tokens per second. Tuning a Llama 3.3 70B model using QLoRA on DGX Spark reached a peak of 5,079.04 tokens per second.

Since fine-tuning is so memory-intensive, none of these tuning workloads can run on a 32 GB consumer GPU.
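A quick back-of-the-envelope check makes the memory claim concrete. Using the common rule of thumb for mixed-precision Adam training (not an NVIDIA measurement), full fine-tuning needs roughly 16 bytes per parameter before activations are even counted:

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision:
# ~2 bytes (BF16 weights) + 2 (BF16 gradients) + 4 (FP32 master weights)
# + 8 (FP32 Adam moments) = ~16 bytes per parameter, excluding activations.
params = 3e9  # Llama 3.2 3B
print(f"{params * 16 / 1e9:.0f} GB")  # ~48 GB, already past a 32 GB GPU
```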

Fine-tuning

| Model | Method | Backend | Configuration | Peak tokens/sec |
| --- | --- | --- | --- | --- |
| Llama 3.2 3B | Full fine-tuning | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, BF16 | 82,739.20 |
| Llama 3.1 8B | LoRA | PyTorch | Sequence length: 2048, batch size: 4, epochs: 1, steps: 125, BF16 | 53,657.60 |
| Llama 3.3 70B | QLoRA | PyTorch | Sequence length: 2048, batch size: 8, epochs: 1, steps: 125, FP4 | 5,079.04 |

Table 1. Fine-tuning performance
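
If you want to try a run like this yourself, here is a minimal LoRA fine-tuning sketch using Hugging Face transformers and peft. It mirrors the Table 1 LoRA configuration (sequence length 2048, batch size 4, 1 epoch, 125 steps, BF16), but the checkpoint, dataset, and adapter settings are illustrative assumptions, not NVIDIA's exact benchmark harness.

```python
# Minimal LoRA fine-tuning sketch (illustrative, not NVIDIA's benchmark code).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_id = "meta-llama/Llama-3.1-8B"  # assumed checkpoint; requires HF access
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only these train.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Small sample dataset, assumed for illustration only
ds = load_dataset("tatsu-lab/alpaca", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)
ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, max_steps=125, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```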

DGX Spark’s image-generation capabilities

Image generation models are always pushing for greater accuracy, higher resolutions, and faster performance. Creating high-resolution images, or multiple images per prompt, demands more memory as well as more compute.

DGX Spark’s large GPU memory and strong compute performance let you work with larger-resolution images and higher-precision models for higher image quality. Support for the FP4 data format enables DGX Spark to generate images quickly, even at high resolutions.

Using the Flux.1 12B model at FP4 precision, DGX Spark can generate a 1K image every 2.6 seconds (see Table 2 below). DGX Spark’s large system memory provides the capacity necessary to run a BF16 SDXL 1.0 model and generate seven 1K images per minute.

Image generation

| Model | Precision | Backend | Configuration | Images/min |
| --- | --- | --- | --- | --- |
| Flux.1 12B Schnell | FP4 | TensorRT | Resolution: 1024×1024, denoising steps: 4, batch size: 1 | 23 |
| SDXL 1.0 | BF16 | TensorRT | Resolution: 1024×1024, denoising steps: 50, batch size: 2 | 7 |

Table 2. Image-generation performance
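
For reference, here is a minimal sketch of the SDXL 1.0 configuration from Table 2 using Hugging Face diffusers at BF16. The prompt is illustrative; the FP4 Flux.1 row relies on a TensorRT-optimized pipeline that this sketch doesn't cover.

```python
# Minimal BF16 SDXL sketch mirroring the Table 2 configuration
# (1024x1024, 50 denoising steps, batch size 2). Illustrative only.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

images = pipe(
    prompt=["a photo of a compact supercomputer on a desk"] * 2,  # batch size 2
    height=1024, width=1024,
    num_inference_steps=50,  # denoising steps from Table 2
).images
for i, img in enumerate(images):
    img.save(f"sdxl_{i}.png")
```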

Using DGX Spark for data science

DGX Spark supports foundational CUDA-X libraries like NVIDIA cuML and cuDF. NVIDIA cuML accelerates machine-learning algorithms in scikit-learn, as well as UMAP and HDBSCAN, on GPUs with zero code changes required.

For computationally intensive ML algorithms like UMAP and HDBSCAN, DGX Spark can process 250 MB datasets in seconds. (See Table 3 below.) NVIDIA cuDF significantly speeds up common pandas data analysis tasks like joins and string methods. cuDF pandas operations on datasets with tens of millions of records run in just seconds on DGX Spark.

Data science

| Library | Benchmark | Dataset size | Time |
| --- | --- | --- | --- |
| NVIDIA cuML | UMAP | 250 MB | 4 secs |
| NVIDIA cuML | HDBSCAN | 250 MB | 10 secs |
| NVIDIA cuDF pandas | Key data analysis operations (joins, string methods, UDFs) | 0.5 to 5 GB | 11 secs |

Table 3. Data-science performance
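
The zero-code-change workflow is worth illustrating. Below is a minimal sketch of the cuDF pandas accelerator mode; the dataset and operations are illustrative, not the Table 3 benchmark. cuML offers the analogous `cuml.accel` mode for scikit-learn, UMAP, and HDBSCAN.

```python
# Minimal sketch of zero-code-change acceleration with cuDF's pandas
# accelerator mode. In a notebook, run `%load_ext cudf.pandas` before
# importing pandas; for a script, launch it as:
#   python -m cudf.pandas analysis.py
# The pandas code itself is unchanged.
import pandas as pd  # transparently GPU-backed once the accelerator is loaded

# Illustrative data, not the Table 3 benchmark dataset
df = pd.DataFrame({
    "key": ["a", "b", "c", "d"] * 2_500_000,  # 10M rows
    "value": range(10_000_000),
})

# Operations of the kind cuDF accelerates: joins, groupbys, string methods
summary = df.groupby("key", as_index=False)["value"].mean()
df["key_upper"] = df["key"].str.upper()
merged = df.merge(summary, on="key", suffixes=("", "_mean"))
print(merged.head())
```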

Using DGX Spark for inference

DGX Spark’s Blackwell GPU supports the FP4 data format, specifically NVFP4, which provides near-FP8 accuracy (<1% degradation). This enables the use of smaller models without sacrificing accuracy, and FP4’s smaller data footprint also improves performance. Table 4 below provides inference performance data for DGX Spark.

DGX Spark supports multiple 4-bit data formats (NVFP4 and MXFP4) and a range of backends, including TRT-LLM, llama.cpp, and vLLM. The system’s 1 petaflop of AI performance enables fast prompt processing, as shown in Table 4. Quick prompt processing yields a faster time to first token, which delivers a better experience for users and speeds up end-to-end throughput.

Inference (ISL|OSL = 2048|128, batch size = 1)

| Model | Precision | Backend | Prompt processing throughput (tokens/sec) | Token generation throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| Qwen3 14B | NVFP4 | TRT-LLM | 5,928.95 | 22.71 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 3,670.42 | 82.74 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1,725.47 | 55.37 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 10,256.9 | 38.65 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 65,831.77 | 41.71 |
| Qwen3 235B (on dual DGX Spark) | NVFP4 | TRT-LLM | 23,477.03 | 11.73 |

Table 4. Inference performance

NVFP4: A 4-bit floating-point format introduced with the NVIDIA Blackwell GPU architecture.
MXFP4: Microscaling FP4, a 4-bit floating-point format created by the Open Compute Project (OCP).
ISL (input sequence length): The number of tokens in the input prompt (a.k.a. prefill tokens).
OSL (output sequence length): The number of tokens generated by the model in response (a.k.a. decode tokens).
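
To try local inference yourself, the sketch below uses vLLM, one of the backends listed above. The checkpoint name is an assumption, and the sampling setting simply mirrors the OSL=128 configuration from Table 4; the published numbers come from NVIDIA's own TRT-LLM and llama.cpp harnesses.

```python
# Minimal local-inference sketch with vLLM (illustrative, not the
# benchmark harness behind Table 4).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint
params = SamplingParams(max_tokens=128)  # OSL = 128, mirroring Table 4

prompt = "Summarize the benefits of 4-bit quantized inference."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```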

We also connected two DGX Sparks via their ConnectX-7 chips to run the Qwen3 235B model. The model uses over 120 GB of memory, including overhead. Such models typically run on large cloud or data-center servers, but the fact that they can run on dual DGX Spark systems shows what’s possible for developer experimentation. As shown in the last row of Table 4, the token generation throughput on dual DGX Sparks was 11.73 tokens per second.

The new NVFP4 version of the NVIDIA Nemotron Nano 2 model also performs well on DGX Spark. With the NVFP4 version, you can achieve up to 2x higher throughput with little to no accuracy degradation. Download the model checkpoints from Hugging Face or as an NVIDIA NIM.

Get your DGX Spark, join the DGX Spark developer community, and start your AI-building journey today.
