
Open-Source AI Tool Upgrades Speed Up LLM and Diffusion Models on NVIDIA RTX PCs


AI developer activity on PCs is exploding, driven by the rising quality of small language models (SLMs) and diffusion models such as FLUX.2, GPT-OSS-20B, and Nemotron 3 Nano. At the same time, AI PC frameworks, including ComfyUI, llama.cpp, Ollama, and Unsloth, are making functional advances, doubling in popularity over the past year as the number of developers using PC-class models has grown tenfold. Developers are no longer just experimenting with generative AI workflows; they're building the next-generation software stack on NVIDIA GPUs, from the data center to NVIDIA RTX AI PCs.

At CES 2026, NVIDIA is announcing several new updates for the AI PC developer ecosystem, including:

  • Acceleration for the top open-source tools on PC: llama.cpp and Ollama for SLMs, along with ComfyUI for diffusion models.
  • Optimizations to the top open source models for NVIDIA GPUs, including the new LTX-2 audio-video model. 
  • A suite of tools to accelerate agentic AI workflows on RTX PCs and NVIDIA DGX Spark.

Accelerated inference through open source AI frameworks

NVIDIA collaborated with the open-source community to boost inference performance across the AI PC stack. 

Continued performance improvements on ComfyUI 

On the diffusion front, ComfyUI optimized performance on NVIDIA GPUs through PyTorch-CUDA and added support for the NVFP4 and FP8 formats. These quantized formats reduce memory use by 60% and 40%, respectively, and accelerate inference: developers will see an average 3x speedup with NVFP4 and 2x with FP8.

Figure 1. ComfyUI performance increase on NVIDIA GPUs between September 2025 and January 2026

Updates to ComfyUI include:

  • NVFP4 support: Linear layers can run using the NVFP4 format with optimized kernels, delivering 3–4x higher throughput compared to FP16 and BF16 linear layers.
  • Fused FP8 quantization kernels: Boost model performance by eliminating memory-bandwidth-bound operations.
  • Fused FP8 de-quantization kernels: Further improve performance for FP8 workloads on NVIDIA RTX GPUs without fourth-generation Tensor Cores (prior to the NVIDIA Ada generation).
  • Weight streaming: Leveraging concurrent system memory and CPU compute streams, weight streaming hides memory latency and increases throughput, especially on GPUs with limited VRAM.
  • Mixed precision support: Models can combine multiple numerical formats within a single network, enabling fine-grained tuning for optimal accuracy and performance.
  • RMS & RoPE Fusion: Common, memory-bandwidth-limited operators in diffusion transformers are fused to reduce memory usage and latency. This optimization benefits all DiT models across data types.

The sample code for the optimizations is available under the ComfyUI kitchen repository. NVFP4 and FP8 checkpoints are also available on Hugging Face, including the new LTX-2, FLUX.2, FLUX.1-dev, FLUX.1-Kontext, Qwen-Image, and Z-Image.
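To make the memory math concrete, here is a minimal, self-contained sketch of per-tensor FP8 (E4M3) weight quantization in PyTorch. It is not the ComfyUI kernel code (that lives in the repository above and fuses these steps on the GPU); it simply shows why FP8 roughly halves the footprint of FP16/BF16 linear weights. The shapes and helper names are illustrative only.

```python
# Illustrative sketch (not the ComfyUI kernels): per-tensor FP8 quantization of
# a linear layer's weights, showing the memory savings behind the FP8 numbers.
import torch

def quantize_fp8(weight: torch.Tensor):
    """Quantize a BF16/FP16 weight tensor to FP8 (E4M3) with a per-tensor scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = weight.abs().max().clamp(min=1e-12) / fp8_max
    w_fp8 = (weight / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a BF16 approximation of the original weights."""
    return w_fp8.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)   # one large linear layer
w_fp8, scale = quantize_fp8(w)
print(f"BF16 weights: {w.element_size() * w.numel() / 2**20:.1f} MiB")
print(f"FP8 weights:  {w_fp8.element_size() * w_fp8.numel() / 2**20:.1f} MiB")
```

The fused ComfyUI kernels perform the quantize/de-quantize steps on the fly inside the linear operation, which is what removes the memory-bandwidth-bound passes described above.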

Acceleration on RTX AI PCs for llama.cpp and Ollama

For SLMs, token generation throughput on mixture-of-experts (MoE) models has increased by 35% with llama.cpp on NVIDIA GPUs, and by 30% with Ollama on RTX PCs.

Figure 2. Token generation performance improvements on GPT-OSS-20B, Nemotron Nano V2, and Qwen 3 30B with NVIDIA RTX on llama.cpp

Jan '26 builds are run with the following environment variables and flags: GGML_CUDA_GRAPH_OPT=1, FA=ON, and --backend-sampling.

Updates to llama.cpp include:

  • GPU token sampling: Offloads several sampling algorithms (TopK, TopP, Temperature, minK, minP, and multi-sequence sampling) to the GPU, improving the quality, consistency, and accuracy of responses while also increasing performance (see the sketch after this list).
  • Concurrency for QKV projections: Support for running concurrent CUDA streams to speed up model inference. To use this feature, set the GGML_CUDA_GRAPH_OPT=1 environment variable.
  • MMVQ kernel optimizations: Pre-loads data into registers and hides memory latency by keeping the GPU busy with other work, speeding up the kernel.
  • Faster model loading time: Up to 65% model load time improvements on DGX Spark, and 15% on RTX GPUs.
  • Native MXFP4 support on NVIDIA Blackwell GPUs: Up to 25% faster prompt processing on LLMs using the hardware-level FP4 support in the fifth-generation Tensor Cores on Blackwell GPUs.
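As background for the GPU token sampling item above, the sketch below shows what temperature, top-k, and top-p sampling do to a logits vector. It is a plain PyTorch illustration of the algorithms being offloaded, not llama.cpp's CUDA implementation; keeping these steps on the GPU avoids copying the full logits back to the CPU for every generated token.

```python
# Illustrative temperature / top-k / top-p (nucleus) sampling over a logits
# vector. This mirrors the algorithms llama.cpp now offloads to the GPU; it is
# not the actual CUDA kernel code.
import torch

def sample_next_token(logits: torch.Tensor, temperature=0.8, top_k=40, top_p=0.95) -> int:
    # Temperature: sharpen (<1) or flatten (>1) the distribution.
    logits = logits / max(temperature, 1e-5)

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth_value,
                             torch.full_like(logits, float("-inf")), logits)

    # Top-p: keep the smallest set of tokens whose cumulative probability
    # exceeds p, then renormalize.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = cumulative - sorted_probs > top_p
    sorted_probs[outside_nucleus] = 0.0
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    probs = probs / probs.sum()

    # Draw one token from the filtered distribution.
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(32_000)   # stand-in for one step of vocabulary logits
print(sample_next_token(logits))
```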

Updates to Ollama include:

  • Flash attention by default: Now standard on many models. This technique uses “tiling” to compute attention in smaller blocks, reducing reads and writes between the GPU’s VRAM and its on-chip memory to boost inference speed and memory efficiency.
  • Memory management scheme: A new scheme allocates additional memory to the GPU, increasing token generation and processing speeds.
  • LogProbs added to the API: Unlocks additional developer capabilities for use cases like classification, perplexity calculations, and self-evaluation (see the sketch after this list).
  • The latest optimizations from the upstream GGML library.
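Here is a sketch of how the new log-probability output could be used from Python to estimate perplexity over a completion. The /api/generate endpoint is Ollama's standard REST API, but the exact option and response field names for log probabilities are assumptions based on this announcement, so check the Ollama API reference before relying on them.

```python
# Hedged sketch: request log probabilities from a local Ollama server and
# compute perplexity over the generated tokens. The "logprobs" option and the
# shape of the response field are assumptions; consult the Ollama API docs.
import math
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",        # any local model pulled with `ollama pull`
        "prompt": "Briefly explain flash attention.",
        "stream": False,
        "logprobs": True,              # assumed option name
    },
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

# Assumed response shape: one log probability per generated token.
token_logprobs = [t["logprob"] for t in data.get("logprobs", [])]
if token_logprobs:
    perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
    print(f"{len(token_logprobs)} tokens, perplexity ~ {perplexity:.2f}")
```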

Check out the llama.cpp repository and the Ollama repository to get started, and test them in apps like LM Studio or the Ollama App.

New advanced audio-video model on RTX AI PC

NVIDIA and Lightricks are releasing the LTX-2 model weights, an advanced audio-video model that competes with cloud models and runs on your RTX AI PC or DGX Spark. This open, production-ready audio-video foundation model delivers up to 20 seconds of synchronized AV content at 4K resolution, offers frame rates of up to 50 fps, and provides multi-modal control for high extensibility for developers, researchers, and studios.

The model weights are available in BF16 and FP8. The quantized checkpoint delivers a 30% memory reduction, enabling the model to run efficiently on RTX GPUs and DGX Spark.

Over the past few weeks, we’ve also seen dozens of new models released, each pushing the frontier of generative AI.

Figure 3. Example 4K, 50 fps LTX-2 output showing a person flying through a train station

An Agentic AI toolkit for local AI

The use cases for private, local agents are endless. But building reliable, repeatable, and high-quality private agents remains a challenge. LLM quality deteriorates when you distill and quantize a model to fit within a limited VRAM budget on a PC. At the same time, the need for accuracy increases, as agentic workflows require reliable and repeatable answers when interfacing with other tools and actions.

To address this, developers typically use two tools to increase accuracy: fine-tuning and retrieval-augmented generation (RAG). NVIDIA has released updates that accelerate tools across this agentic AI workflow.

Nemotron 3 Nano is a 32B parameter MoE model optimized for agentic AI and fine-tuning. With 3.6B active parameters and a 1M context window, it tops several benchmarks across coding, instruction-following, long-context reasoning, and STEM tasks. The model is optimized for RTX PCs and DGX Spark via Ollama and llama.cpp, and can be fine-tuned using Unsloth.

This model stands out as one of the most open available, with weights, recipes, and datasets widely accessible. Open models and datasets make customization easier for developers, prevent redundant fine-tuning, and eliminate data leakage so benchmarking stays objective. Get started with LoRA-based fine-tuning of the model.
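A minimal LoRA fine-tuning sketch with Unsloth is shown below. The FastLanguageModel and get_peft_model calls follow Unsloth's standard recipe; the model ID, dataset path, and target module names are placeholders, so substitute the actual Nemotron 3 Nano checkpoint from Hugging Face and your own training data.

```python
# Hedged sketch of LoRA fine-tuning with Unsloth + TRL. The model ID, dataset
# path, and target modules are placeholders for illustration.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/nemotron-3-nano",   # placeholder ID; check Hugging Face
    max_seq_length=4096,
    load_in_4bit=True,                      # fit within a PC-class VRAM budget
)

# Attach LoRA adapters so only a small fraction of weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],   # adjust per model
)

# Placeholder dataset: JSONL with a "text" field of agent traces or instructions.
dataset = load_dataset("json", data_files="my_agent_traces.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="nemotron-nano-lora",
    ),
)
trainer.train()
```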

For RAG, NVIDIA partnered with Docling, a package that ingests, analyzes, and processes documents into a machine-understandable format for RAG pipelines. Docling is optimized for RTX PCs and DGX Spark and delivers 4x performance compared to CPUs.

There are two ways of using Docling:

  1. Traditional OCR pipeline: This is a pipeline of libraries and models that is accelerated via PyTorch-CUDA on RTX.
  2. VLM-based pipeline: An advanced pipeline for complex multi-modality documents, available for use via vLLM within WSL and Linux environments.

Docling is developed at IBM and contributed to the Linux Foundation. Start now on RTX with this easy-to-use guide.
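To try the traditional pipeline from Python, a minimal conversion looks like the sketch below. It uses Docling's standard DocumentConverter API; the file path is a placeholder, and GPU acceleration is picked up through PyTorch-CUDA when a compatible build is installed.

```python
# Minimal Docling sketch: convert a PDF into Markdown ready for chunking and
# embedding in a RAG pipeline. The file path is a placeholder.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("reports/q4_financials.pdf")   # placeholder path

# Export to Markdown (or export_to_dict() for structured JSON).
markdown = result.document.export_to_markdown()
print(markdown[:500])
```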

SDKs for audio and video effects

The NVIDIA Video and Audio Effects SDKs enable developers to apply quality-enhancing AI effects in multimedia pipelines, such as background noise removal, virtual backgrounds, and eye contact.

The latest updates at CES 2026 enhance the video relighting feature to produce more natural and stable results across diverse environments, while improving performance by 3x (reducing the minimum GPU required to an NVIDIA GeForce RTX 3060 or above) and decreasing the model size by up to 6x. To see the Video Effects SDK with AI relighting in action, check out the new release of the NVIDIA Broadcast app.

We’re excited to collaborate with the open-source community of AI PC tools to deliver models, optimizations, tools, and workflows for developers. Start developing for RTX PCs and DGX Spark today!
