Development & Optimization

Aug 22, 2025
NVIDIA Hardware Innovations and Open Source Contributions Are Shaping AI
Open source AI models such as Cosmos, DeepSeek, Gemma, GPT-OSS, Llama, Nemotron, Phi, Qwen, and many more are the foundation of AI innovation. These models are...
8 MIN READ

Aug 13, 2025
Streamline CUDA-Accelerated Python Install and Packaging Workflows with Wheel Variants
If you’ve ever installed an NVIDIA GPU-accelerated Python package, you’ve likely encountered a familiar dance: navigating to pytorch.org, jax.dev,...
15 MIN READ

Aug 06, 2025
What’s New and Important in CUDA Toolkit 13.0
The newest update to the CUDA Toolkit, version 13.0, features advancements to accelerate computing on the latest NVIDIA CPUs and GPUs. As a major release, it...
19 MIN READ

Aug 04, 2025
Navigating GPU Architecture Support: A Guide for NVIDIA CUDA Developers
If you’ve used the NVIDIA CUDA Compiler (NVCC) for your NVIDIA GPU application recently, you may have encountered a warning message like the following: nvcc...
6 MIN READ

Jul 24, 2025
Double PyTorch Inference Speed for Diffusion Models Using Torch-TensorRT
NVIDIA TensorRT is an AI inference library built to optimize machine learning models for deployment on NVIDIA GPUs. TensorRT targets dedicated hardware in...
8 MIN READ
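The post above is about ahead-of-time compilation with Torch-TensorRT. As a minimal, hedged sketch only, the snippet below compiles a toy convolutional module with torch_tensorrt.compile; the module, shapes, and FP16 settings are illustrative assumptions, not the diffusion-model benchmark setup from the post.

```python
# Hedged sketch: compile a toy module with Torch-TensorRT for inference.
# The model and input shapes are placeholders, not the diffusion pipeline from the post.
import torch
import torch_tensorrt

model = (
    torch.nn.Sequential(
        torch.nn.Conv2d(4, 8, kernel_size=3, padding=1),
        torch.nn.ReLU(),
    )
    .half()
    .eval()
    .cuda()
)

example_input = torch.randn(1, 4, 64, 64, dtype=torch.float16, device="cuda")

# Ahead-of-time compilation; the returned module is a drop-in replacement for inference.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[example_input],
    enabled_precisions={torch.float16},  # allow FP16 TensorRT tactics
)

with torch.no_grad():
    output = trt_model(example_input)
```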

Jul 18, 2025
Optimizing for Low-Latency Communication in Inference Workloads with JAX and XLA
Running inference with large language models (LLMs) in production requires meeting stringent latency constraints. A critical stage in the process is LLM decode,...
6 MIN READ
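The post above focuses on hiding collective-communication latency during LLM decode. As a general, hedged illustration only, the sketch below shards a decode-style matmul across the available devices under jax.jit, leaving XLA to insert and schedule the resulting collectives; the 1-D mesh, shapes, and dtypes are assumptions for illustration, not the configuration from the post.

```python
# Hedged sketch: a tensor-parallel, decode-style matmul under jax.jit.
# XLA inserts and schedules the collectives implied by the shardings below;
# all sizes, dtypes, and the 1-D mesh are illustrative assumptions.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over whatever accelerators are visible.
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("model",))

# Shard the weight column-wise over the "model" axis; keep the activation replicated.
w = jax.device_put(
    jnp.ones((4096, 4096), jnp.bfloat16), NamedSharding(mesh, P(None, "model"))
)
x = jax.device_put(
    jnp.ones((1, 4096), jnp.bfloat16), NamedSharding(mesh, P(None, None))
)

@jax.jit
def decode_step(x, w):
    return x @ w  # output stays sharded over "model"; XLA handles any communication

y = decode_step(x, w)
```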

Jul 16, 2025
CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design
GEMM optimization on GPUs is a modular problem. Performant implementations need to specify hyperparameters such as tile shapes, math and copy instructions, and...
12 MIN READ

Jul 16, 2025
CUTLASS: Principled Abstractions for Handling Multidimensional Data Through Tensors and Spatial Microkernels
In the era of generative AI, utilizing GPUs to their maximum potential is essential to training better models and serving users at scale. Often, these models...
12 MIN READ

Jul 15, 2025
NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale
Amazon Web Services (AWS) developers and solution architects can now take advantage of NVIDIA Dynamo on NVIDIA GPU-based Amazon EC2, including Amazon EC2 P6...
4 MIN READ

Jul 09, 2025
Delivering the Missing Building Blocks for NVIDIA CUDA Kernel Fusion in Python
C++ libraries like CUB and Thrust provide high-level building blocks that enable NVIDIA CUDA application and library developers to write speed-of-light code...
5 MIN READ
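The post above introduces Python building blocks that expose CUB/Thrust-style primitives; those APIs are not reproduced here. As a loosely related, hedged illustration of the general idea of kernel fusion from Python, the sketch below uses CuPy's @fuse decorator, which is a different library from the one the post covers.

```python
# Hedged illustration of elementwise kernel fusion from Python using CuPy.
# NOTE: this is CuPy's @fuse decorator, not the CUB/Thrust-based Python
# building blocks discussed in the post above.
import cupy as cp

@cp.fuse()
def scaled_add(x, y, alpha):
    # These elementwise operations are fused into a single GPU kernel launch.
    return alpha * x + y

x = cp.arange(1 << 20, dtype=cp.float32)
y = cp.ones_like(x)
out = scaled_add(x, y, cp.float32(2.0))
```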

Jul 09, 2025
Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO
Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental for teaching agents to reason and learn from human preferences, enabling...
5 MIN READ

Jul 07, 2025
Think Smart and Ask an Encyclopedia-Sized Question: Multi-Million Token Real-Time Inference for 32X More Users
Modern AI applications increasingly rely on models that combine huge parameter counts with multi-million-token context windows. Whether it is AI agents...
8 MIN READ

Jul 03, 2025
New Video: Build Self-Improving AI Agents with the NVIDIA Data Flywheel Blueprint
AI agents powered by large language models are transforming enterprise workflows, but high inference costs and latency can limit their scalability and user...
2 MIN READ

Jul 02, 2025
Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX
As accelerated computing continues to drive application performance in all areas of AI and scientific computing, there's a renewed interest in GPU optimization...
11 MIN READ

Jun 25, 2025
How to Streamline Complex LLM Workflows Using NVIDIA NeMo-Skills
A typical recipe for improving LLMs involves multiple stages: synthetic data generation (SDG), model training through supervised fine-tuning (SFT) or...
10 MIN READ

Jun 18, 2025
Improved Performance and Monitoring Capabilities with NVIDIA Collective Communications Library 2.26
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL...
11 MIN READ
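For readers new to NCCL, here is a minimal, hedged sketch of driving an NCCL all-reduce from Python through torch.distributed. It shows generic NCCL usage only, none of the 2.26 performance or monitoring features the post covers, and it assumes a launch via torchrun so the rank environment variables are set.

```python
# Hedged sketch: an NCCL-backed all-reduce via torch.distributed.
# Assumes launch with `torchrun --nproc_per_node=<num_gpus> this_script.py`,
# which sets RANK, WORLD_SIZE, and LOCAL_RANK for init_process_group.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank value; NCCL sums the tensors across all GPUs.
    t = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```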