NVIDIA Developer Zone

CUDA FAQ

Get answers to your questions about CUDA and GPU Computing.

Please visit again - we will be putting up answers to frequently asked questions periodically.

General Questions

Q: What is CUDA?

A: CUDA™ is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). 

Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and supported by an installed base of over 300 million CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers. Applications used in astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and other computationally intense fields are increasingly using CUDA to deliver the benefits of GPU acceleration.
 

Q: What is NVIDIA Tesla™?

With the world’s first teraflop many-core processor, NVIDIA® Tesla™ computing solutions enable the necessary transition to energy efficient parallel computing power. With 448 CUDA cores per processor (C2070) and a standard C compiler that simplifies application development, Tesla scales to solve the world’s most important computing challenges—quickly and accurately.

Q: What is OpenACC?

OpenACC is an open industry standard for compiler directives, or hints, that can be inserted into code written in C or Fortran so that the compiler can generate code that runs in parallel on multi-CPU and GPU-accelerated systems. OpenACC directives are an easy and powerful way to leverage the power of GPU Computing while keeping your code compatible with non-accelerated, CPU-only systems.
Founding members of the OpenACC standards organization include Cray, Portland Group, CAPS and NVIDIA. Find out more at http://www.openacc-standard.org/ .

Q: What is the relationship between OpenCL and CUDA?

A: CUDA is the name of NVIDIA’s parallel computation architecture. CUDA encompasses the hardware and software that implement compute on NVIDIA GPUs. The name CUDA also refers to the programming language extensions to C/C++. CUDA C/C++ provides both runtime and driver API level access to the GPU hardware. CUDA also has a large and active ecosystem, including a number of supporting libraries and development tools, which have made CUDA C/C++ the solution of choice for most developers. CUDA x86 has been announced as a project by PGI to allow CUDA C/C++ code to run on CPU-based servers. NVIDIA is the chair of the Khronos group that developed the OpenCL specification and is one of the active members of the group defining this emerging standard. The CUDA extensions have many similarities to the OpenCL API design. The OpenCL API is similar to the CUDA "driver API", while the CUDA language integration allows users to simply use the higher-level abstraction referred to as the "CUDA Runtime API". CUDA C/C++ is NVIDIA's platform for innovation, used to quickly deliver new features and capabilities requested by our customers.

Q: What kind of performance increase can I expect using GPU Computing over CPU-only code?

This depends on how well the problem maps onto the architecture. For data parallel applications, accelerations of more than two orders of magnitude have been seen. You can browse research, developer, applications and partners on our CUDA In Action Page.
 

Q: What operating systems does CUDA support?

CUDA supports Windows 7, Windows XP, Windows Vista, Linux and Mac OS (including 32-bit and 64-bit versions). For the full list, see the latest CUDA Toolkit Download Release Notes.

Q: What GPUs does CUDA run on?

GPU Computing is a standard feature in all of NVIDIA's latest discrete GPUs. A full list can be found on our Supported GPUs Page.

Q: What is the "compute capability"?

The compute capability indicates the version of the compute hardware included in the GPU.

Compute capability 1.0 corresponds to the original G80 architecture.

Compute capability 1.1 (introduced in later G8x parts) adds support for atomic operations on global memory. See "What are atomic operations?" in the programming section below.

Compute capability 1.2 (introduced in the GT200 architecture) adds the following new features:

  • Support for atomic functions operating in shared memory and atomic functions operating on 64-bit words in global memory
  • Support for warp vote functions
  • The number of registers per multiprocessor is 16384
  • The maximum number of active warps per multiprocessor is 32
  • The maximum number of active threads per multiprocessor is 1024

Compute capability 1.3 adds support for double precision floating point numbers.

Compute capability 2.0 (introduced in the Fermi architecture) adds many new features including:

  • Support for concurrent kernel execution
  • 64-bit addressing
  • Unified Virtual Addressing (UVA)
  • GPUDirect peer-to-peer communication

See the latest CUDA Programming Guide for a full list of GPUs and their compute capabilities.
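
As an illustration, the compute capability of each installed GPU can be queried at run time through the CUDA runtime API. The following is a minimal sketch, not an official sample; the `hasDouble` check simply encodes the statement above that double precision arrives with compute capability 1.3.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // prop.major and prop.minor together form the compute capability,
        // e.g. major = 2, minor = 0 for compute capability 2.0.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);

        // Double precision is available from compute capability 1.3 onward.
        bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("  double precision in hardware: %s\n", hasDouble ? "yes" : "no");
    }
    return 0;
}
```

An application can use the same check to choose between kernels written for different compute capabilities.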

Q: Where can I find a good introduction to parallel programming?

There are several online university courses, technical webinars, article series and excellent books on parallel computing. These can be found on our CUDA Education Page.

Q: How do I pronounce CUDA? 

koo-duh.

Hardware and Architecture

Q: Will I have to re-write my CUDA Kernels when the next new GPU architecture is released?

A: No. CUDA C/C++ provides an abstraction; it is a means for you to express how you want your program to execute. The compiler generates PTX code, which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU. This is the responsibility of the driver, which is updated every time a new GPU is released. It is possible that changes in the number of registers or the size of shared memory may open up opportunities for further optimization, but that is optional. So write your code now, and enjoy it running on future GPUs.

Q: Are GPUDirect 2.0 and peer-to-peer transfers supported on GeForce GPUs?

A: Yes. Some early pre-release versions of the CUDA 4.0 Toolkit had this restriction, but CUDA 4.0 now enables peer-to-peer communication between any Fermi (or higher) GPUs.

Q: Does CUDA support multiple graphics cards in one system?
Yes. Applications can distribute work across multiple GPUs. This is not done automatically, however, so the application has complete control. See the "multiGPU" example in the GPU Computing SDK for an example of programming multiple GPUs.
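
The following is a minimal sketch of one way to split work across GPUs by hand; it is not the SDK "multiGPU" sample. It assumes CUDA 4.0 or later, where a single host thread can switch devices with cudaSetDevice(). The kernel name `scale` and the even chunking scheme are illustrative choices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Multiplies each element of one device's chunk by `factor`.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("No CUDA devices found\n");
        return 1;
    }

    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    std::vector<float *> devPtr(deviceCount, (float *)0);
    const int chunk = (n + deviceCount - 1) / deviceCount;

    // Pass 1: give every GPU its slice and start its kernel.
    // Kernel launches are asynchronous, so the GPUs work concurrently.
    for (int dev = 0; dev < deviceCount; ++dev) {
        int offset = dev * chunk;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        cudaSetDevice(dev);
        cudaMalloc((void **)&devPtr[dev], count * sizeof(float));
        cudaMemcpy(devPtr[dev], &host[offset], count * sizeof(float),
                   cudaMemcpyHostToDevice);
        scale<<<(count + 255) / 256, 256>>>(devPtr[dev], count, 2.0f);
    }

    // Pass 2: collect the results. Each copy waits for that device's kernel.
    for (int dev = 0; dev < deviceCount; ++dev) {
        int offset = dev * chunk;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        cudaSetDevice(dev);
        cudaMemcpy(&host[offset], devPtr[dev], count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(devPtr[dev]);
    }

    printf("first = %.1f, last = %.1f\n", host[0], host[n - 1]);
    return 0;
}
```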

Q: Where can I find more information on NVIDIA GPU architecture? 

Programming Questions

Q: I think I've found a bug in CUDA, how do I report it?
If you are reasonably sure it is a bug, you can post a message on the forums. If you are an active CUDA developer, please sign up as a CUDA registered developer to get additional support and the ability to file your own bugs directly.

Your bug report should include a simple, self-contained piece of code that demonstrates the bug, along with a description of the bug and the expected behavior.
Please include the following information with your bug report:

  • Machine configuration (CPU, Motherboard, memory etc.)
  • Operating system
  • CUDA Toolkit version
  • Display driver version
  • For Linux users, please attach an nvidia-bug-report.log, which is generated by running "nvidia-bug-report.sh".

 

Q: How does CUDA structure computation?
CUDA broadly follows the data-parallel model of computation. Typically each thread executes the same operation on different elements of the data in parallel.
The data is split up into a 1D, 2D or 3D grid of blocks. Each block can be 1D, 2D or 3D in shape, and can consist of over 512 threads on current hardware. Threads within a thread block can cooperate via the shared memory.
Thread blocks are executed as smaller groups of threads known as "warps".
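
As a minimal sketch of this structure, the example below uses a 1D grid of 1D blocks in which each thread handles one array element. The kernel name `addVectors` and the block size of 256 threads are illustrative choices, not requirements.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one output element. The global index is built from
// the block index, the block size and the thread index within the block.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partial
}

int main()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da = 0, *db = 0, *dc = 0;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Round the block count up so that every element is covered.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hc[i] == ha[i] + hb[i]);
    printf("%s\n", ok ? "results match" : "results differ");

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```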

Q: What are the advantages of CUDA vs. graphics-based GPGPU?
CUDA is designed from the ground-up for efficient general purpose computation on GPUs. Developers can compile C for CUDA to avoid the tedious work of remapping their algorithms to graphics concepts.

CUDA exposes several hardware features that are not available via graphics APIs. The most significant of these is shared memory, an area of on-chip memory which can be accessed in parallel by blocks of threads. This allows caching of frequently used data and can provide large speedups over using textures to access data. Combined with a thread synchronization primitive, this allows cooperative parallel processing of on-chip data, greatly reducing the expensive off-chip bandwidth requirements of many parallel algorithms. This benefits a number of common applications such as linear algebra, Fast Fourier Transforms, and image processing filters.

Whereas fragment programs in graphics APIs are limited to outputting 32 floats (RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered writes, i.e. an unlimited number of stores to any address. This enables many algorithms that were not possible with graphics APIs to run efficiently using CUDA.

Graphics APIs force developers to store data in textures, which requires packing long arrays into 2D textures. This is cumbersome and imposes extra addressing math. CUDA can perform loads from any address.

CUDA also offers highly optimized data transfers to and from the GPU.

Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.

The "cudaThreadSynchronize()" API call (superseded by "cudaDeviceSynchronize()" as of CUDA 4.0) should be used when measuring performance to ensure that all device operations have completed before stopping the timer.

CUDA functions that perform memory copies and that control graphics interoperability are synchronous, and implicitly wait for all kernels to complete.
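
The sketch below illustrates the asynchronous launch and a synchronized measurement. It makes two assumptions beyond the text above: it times the kernel with CUDA events instead of a host timer, and it calls cudaDeviceSynchronize(), the name that replaced cudaThreadSynchronize() in CUDA 4.0. The kernel `busyKernel` is a made-up workload.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Artificial workload so that the kernel takes a measurable amount of time.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 0.5f + 1.0f;
        data[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);  // returns to the CPU right away
    cudaEventRecord(stop, 0);

    // This CPU loop can run while the GPU is still executing the kernel.
    unsigned long long cpuSum = 0;
    for (int i = 0; i < 1000000; ++i) cpuSum += i;

    // Wait for all device work to finish before reading the timing.
    cudaDeviceSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms, CPU sum computed meanwhile: %llu\n", ms, cpuSum);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```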


Q: Can I transfer data and run a kernel in parallel (for streaming applications)?
Yes, CUDA supports overlapping GPU computation and data transfers using streams. See the programming guide for more details.
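
A minimal sketch of the idea, assuming two streams and a made-up `scale` kernel: each stream copies its chunk to the GPU, processes it and copies it back, so work in one stream can overlap with transfers in the other on hardware that supports it. Asynchronous copies require page-locked host memory, which is why the host buffer is allocated with cudaMallocHost().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int nStreams = 2;
    const int chunk = 1 << 20;
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    // Page-locked host memory is needed for truly asynchronous copies.
    float *hostData = 0;
    cudaMallocHost((void **)&hostData, n * sizeof(float));
    for (int i = 0; i < n; ++i) hostData[i] = 1.0f;

    float *devData = 0;
    cudaMalloc((void **)&devData, n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Operations within one stream run in order; different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(devData + offset, hostData + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(devData + offset, chunk, 2.0f);
        cudaMemcpyAsync(hostData + offset, devData + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();  // wait for every stream to finish

    printf("first = %.1f, last = %.1f\n", hostData[0], hostData[n - 1]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(devData);
    cudaFreeHost(hostData);
    return 0;
}
```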

Q: Is it possible to DMA directly into GPU memory from another PCI-E device?

GPUDirect allows you to DMA directly to GPU host memory.  See the GPUDirect technology page for details.

Q: Is it possible to write the results from a kernel directly to texture (for multi-pass algorithms)?
Not currently, but you can copy from global memory back to the array (texture). Device to device memory copies are fast.

Q: Can I write directly to the framebuffer?
No. In OpenGL you have to write to a mapped pixel buffer object (PBO), and then render from this. The copies are in video memory and fast, however. See the "postProcessGL" sample in the SDK for more details.

Q: Can I read directly from textures created in OpenGL/Direct3D?
You cannot read directly from OpenGL textures in CUDA. You can copy the texture data to a pixel buffer object (PBO) and then map this buffer object for reading in CUDA.

In Direct3D it is possible to map D3D resources and read them in CUDA. This may involve an internal copy from the texture format to linear format.

Q: How do I get the best performance when transferring data to and from OpenGL pixel buffer objects (PBOs)?
For optimal performance when copying data to and from PBOs, you should make sure that the format of the source data is compatible with the format of the destination. This will ensure that the driver doesn't have to do any format conversion on the CPU and can do a direct copy in video memory. When copying 8-bit color data from the framebuffer using glReadPixels we recommend using the GL_BGRA format and ensuring that the framebuffer has an alpha channel (e.g. glutInitDisplayMode(GLUT_RGBA | GLUT_ALPHA) if you're using GLUT).

Q: What texture features does CUDA support?
CUDA supports 1D, 2D and 3D textures, which can be accessed with normalized (0..1) or integer coordinates. Textures can also be bound to linear memory and accessed with the "tex1Dfetch" function.
Cube maps, texture arrays, compressed textures and mip-maps are not currently supported.
The hardware only supports 1, 2 and 4-component textures, not 3-component textures.
The hardware supports linear interpolation of texture data, but you must use "cudaReadModeNormalizedFloat" as the "ReadMode".

Q: Are graphics operations such as z-buffering and alpha blending supported in CUDA?
No. Access to video memory in CUDA is done via the load/store mechanism, and doesn't go through the normal graphics raster operations like blending. We don't have any plans to expose blending or any other raster ops in CUDA.

Q: What are the peak transfer rates between the CPU and GPU?
The performance of memory transfers depends on many factors, including the size of the transfer and type of system motherboard used.
We recommend NVIDIA nForce motherboards for best transfer performance. On PCI-Express 2.0 systems we have measured up to 6.0 GB/sec transfer rates.
You can measure the bandwidth on your system using the bandwidthTest sample from the SDK.
Transfers from page-locked memory are faster because the GPU can DMA directly from this memory. However allocating too much page-locked memory can significantly affect the overall performance of the system, so allocate it with care.

Q: What is the precision of mathematical operations in CUDA?
All current NVIDIA GPUs, from GT200 onward, support double precision floating point. See the programming guide for more details. All compute-capable NVIDIA GPUs support 32-bit integer and single precision floating point arithmetic. They follow the IEEE-754 standard for single-precision binary floating-point arithmetic, with some minor differences, notably that denormalized numbers are not supported.


Q: Why are the results of my GPU computation slightly different from the CPU results?
There are many possible reasons. Floating point computations are not guaranteed to give identical results across any set of processor architectures. The order of operations will often be different when implementing algorithms in a data parallel way on the GPU.

This is a very good reference on floating point arithmetic:
What Every Computer Scientist Should Know About Floating-Point Arithmetic

The GPU also has several deviations from the IEEE-754 standard for binary floating point arithmetic. These are documented in the CUDA Programming Guide, section A.2.
 

Q: Does CUDA support double precision arithmetic?
Yes. GPUs with compute capability 1.3 and higher (those based on the GT200 architecture, such as the Tesla C1060, and those based on Fermi and later) support double precision floating point in hardware.

Q: How do I get double precision floating point to work in my kernel?
You need to add the switch "-arch sm_13" or "-arch sm_20" to your nvcc command line, otherwise doubles will be silently demoted to floats. See the "Mandelbrot" sample in the CUDA SDK for an example of how to switch between different kernels based on the compute capability of the GPU.

You should also be careful to suffix all floating point literals with "f" (for example, "1.0f") otherwise they will be interpreted as doubles by the compiler.

Q: Can I read double precision floats from texture?
The hardware doesn't support double precision float as a texture format, but it is possible to use int2 and cast it to double as long as you don't need interpolation:

texture<int2, 1> my_texture;

static __inline__ __device__ double fetch_double(texture<int2, 1> t, int i)
{
    int2 v = tex1Dfetch(t, i);
    return __hiloint2double(v.y, v.x);
}

Q: Does CUDA support long integers?
Yes, CUDA supports 64-bit integers (long longs). Operations on these types compile to multiple instruction sequences on some GPUs, depending on compute capability.
 

Q: When should I use the __mul24 and __umul24 functions?

G8x hardware supports integer multiply with only 24-bit precision natively (add, subtract and logical operations are supported with 32 bit precision natively). 32-bit integer multiplies compile to multiple instruction sequences and take around 16 clock cycles.

You can use the __mul24 and __umul24 built-in functions to perform fast multiplies with 24-bit precision.

Be aware that future hardware may switch to 32-bit native integers, in which case __mul24 and __umul24 may actually be slower. For this reason we recommend using a macro so that the implementation can be switched easily.
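
One possible shape for such a macro is sketched below. The names `IMUL` and `USE_MUL24` are illustrative assumptions; the 24-bit version is only selected when `USE_MUL24` is defined at compile time (for example with "nvcc -DUSE_MUL24").

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Switch the multiply implementation in one place.
// Only valid while both operands fit in 24 bits.
#ifdef USE_MUL24
#define IMUL(a, b) __mul24((a), (b))
#else
#define IMUL(a, b) ((a) * (b))
#endif

// Writes each element's own global index, computed through the macro.
__global__ void fillIndices(int *out, int n)
{
    int i = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int n = 1024;
    int h[n];
    int *d = 0;
    cudaMalloc((void **)&d, n * sizeof(int));

    fillIndices<<<n / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[%d] = %d\n", n - 1, h[n - 1]);
    cudaFree(d);
    return 0;
}
```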

Q: Does CUDA support 16-bit (half) floats?
All floating point computation is performed with 32 or 64 bits.

The driver API supports textures that contain 16-bit floats through the CU_AD_FORMAT_HALF array format. The values are automatically promoted to 32-bit during the texture read.

16-bit float textures are planned for a future release of CUDART.

Other support for 16-bit floats, such as enabling kernels to convert between 16- and 32-bit floats (to read/write float16 while processing float32), is also planned for a future release.

Q: Where can I find documentation on the PTX assembly language?
This is included in the CUDA Toolkit documentation.

Q: How can I see the PTX code generated by my program? 
Add "-keep" to the nvcc command line (or custom build setup in Visual Studio) to keep the intermediate compilation files. Then look at the ".ptx" file. The ".cubin" file also includes useful information including the actual number of hardware registers used by the kernel.

Q: How can I find out how many registers / how much shared/constant memory my kernel is using?
Add the option "--ptxas-options=-v" to the nvcc command line. When compiling, this information will be output to the console.

Q: Is it possible to see PTX assembly interleaved with C code?
Yes! Add the option "--opencc-options -LIST:source=on" to the nvcc command line.

Q: What is CUTIL? 
CUTIL is a simple utility library used in the CUDA SDK samples. Note that CUTIL is not part of the CUDA Toolkit and is not supported by NVIDIA. It exists only for the convenience of writing concise and platform-independent example code.

It provides functions for:

  • parsing command line arguments
  • reading and writing binary files and PPM format images
  • comparing arrays of data (typically used for comparing GPU results with CPU)
  • timers
  • macros for checking error codes

Q: Does CUDA support operations on vector types?
CUDA defines vector types such as float4, but doesn't include any operators on them by default. However, you can define your own operators using standard C++. The CUDA SDK includes a header "cutil_math.h" that defines some common operations on the vector types.
Note that since the GPU hardware uses a scalar architecture there is no inherent performance advantage to using vector types for calculation.
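
As a minimal sketch of defining your own operators, the example below adds component-wise addition and scalar multiplication for float4 and uses them in a kernel. It does not reproduce "cutil_math.h"; the kernel name `axpy4` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Component-wise addition of two float4 values.
__host__ __device__ inline float4 operator+(float4 a, float4 b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Scalar times float4.
__host__ __device__ inline float4 operator*(float s, float4 v)
{
    return make_float4(s * v.x, s * v.y, s * v.z, s * v.w);
}

// out = a * x + y, four floats at a time.
__global__ void axpy4(float a, const float4 *x, const float4 *y, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(float4);
    float4 hx[n], hy[n], hout[n];
    for (int i = 0; i < n; ++i) {
        hx[i] = make_float4((float)i, (float)i, (float)i, (float)i);
        hy[i] = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
    }

    float4 *dx = 0, *dy = 0, *dout = 0;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMalloc((void **)&dout, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    axpy4<<<1, n>>>(2.0f, dx, dy, dout, n);
    cudaMemcpy(hout, dout, bytes, cudaMemcpyDeviceToHost);

    // The same operators also work on the host for checking the result.
    float4 expected = 2.0f * hx[10] + hy[10];
    printf("gpu = %.1f, cpu = %.1f\n", hout[10].x, expected.x);

    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}
```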

Q: Does CUDA support swizzling? 
CUDA does not support swizzling (e.g. "vector.wzyx", as used in the Cg/HLSL shading languages), but you can access the individual components of vector types.


Q: Is it possible to run multiple CUDA applications and graphics applications at the same time? 
CUDA is a client of the GPU in the same way as the OpenGL and Direct3D drivers are - it shares the GPU via time slicing. It is possible to run multiple graphics and CUDA applications at the same time, although currently CUDA only switches at the boundaries between kernel executions.

The cost of context switching between CUDA and graphics APIs is roughly the same as switching graphics contexts. This isn't something you'd want to do more than a few times each frame, but is certainly fast enough to make it practical for use in real time graphics applications like games.

Q: Can CUDA survive a mode switch?
If the display resolution is increased while a CUDA application is running, the CUDA application is not guaranteed to survive the mode switch. The driver may have to reclaim some of the memory owned by CUDA for the display.

Q: Is it possible to execute multiple kernels at the same time? 
Yes. With CUDA 3.2 and later, GPUs based on the Fermi architecture support concurrent kernel execution and launches.

Q: What is the maximum length of a CUDA kernel?
The maximum kernel size is 2 million PTX instructions. This may change in the future, so please check the release notes for the latest release of the CUDA Toolkit.

Q: How can I debug my CUDA code?
There are several powerful debugging tools which allow the creation of break points and traces. Tools exist for all the major operating systems and multi-GPU solutions and clusters. Please visit the CUDA Tools and Ecosystem Page for the latest lists.

Q: How can I optimize my CUDA code?
Here are some basic tips, but please review some of the optimization webinars and tutorials delivered at GTC. You will find more links on our CUDA Education Pages.

  • We recommend using the CUDA Visual Profiler to profile your application.
  • Make sure global memory reads and writes are coalesced where possible (see programming guide section 6.1.2.1).
  • If your memory reads are hard to coalesce, try using 1D texture fetches (tex1Dfetch) instead.
  • Make as much use of shared memory as possible (it is much faster than global memory).
  • Avoid large-scale bank conflicts in shared memory.
  • Use types like float4 to load 128 bits in a single load.
  • Avoid divergent branches within a warp where possible.

Q: How do I choose the optimal number of threads per block?
For maximum utilization of the GPU you should carefully balance the number of threads per thread block, the amount of shared memory per block, and the number of registers used by the kernel.

You can use the CUDA occupancy calculator tool to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. This is included as part of the latest CUDA Toolkit.
 

Q: What is the maximum kernel execution time? 
On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually will cause a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases can hang the entire machine, requiring a hard reset.

This is caused by the Windows "watchdog" timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.

For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
 

Q: Why do I get the error message: "The application failed to initialize properly"? 
This problem is associated with improper permissions on the DLLs (shared libraries) that are linked with your CUDA executables. All DLLs must be executable. The most likely problem is that you unzipped the CUDA distribution with cygwin's "unzip", which sets all permissions to non-executable. Make sure all DLLs are set to executable, particularly those in the CUDA_BIN directory, by running the cygwin command "chmod +x *.dll" in the CUDA_BIN directory. Alternatively, right-click on each DLL in the CUDA_BIN directory, select Properties, then the Security tab, and make sure "read & execute" is set. For more information see:

http://www.cygwin.co...2/msg00686.html
 

Q: I get the CUDA error "invalid argument" when executing my kernel
You might be exceeding the maximum size of the arguments for the kernel. Parameters to __global__ functions are currently passed to the device via shared memory and the total size is limited to 256 bytes.

As a workaround you can pass arguments using constant memory.

Q: What are atomic operations? 
Atomic operations allow multiple threads to perform concurrent read-modify-write operations in memory without conflicts. The hardware serializes accesses to the same address so that the behaviour is always deterministic. The functions are atomic in the sense that they are guaranteed to be performed without interruption from other threads. Atomic operations must be associative (i.e. order independent).

Atomic operations are useful for sorting, reduction operations and building data structures in parallel.

Devices with compute capability 1.1 support atomic operations on 32-bit integers in global memory. This includes logical operations (and, or, xor), increment and decrement, min and max, exchange and compare and swap (CAS). To compile code using atomics you must add the option "-arch sm_11" to the nvcc command line.

Compute capability 1.2 and later devices also support atomic operations in shared memory.

Floating point atomic operations are supported by GPUs with compute capability 2.* and higher.
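
A histogram is a common illustration: many threads may increment the same bin at once, and atomicAdd() keeps every increment. The sketch below uses 32-bit integer atomics in global memory, so it needs compute capability 1.1 or higher. The kernel name `histogram` and the test data are made up for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread reads one byte and atomically increments the matching bin.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[data[i]], 1u);
}

int main()
{
    const int numBins = 256;
    const int n = numBins * 1000;

    // Test input in which every byte value occurs exactly 1000 times.
    std::vector<unsigned char> h_data(n);
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % numBins);

    unsigned char *d_data = 0;
    unsigned int *d_bins = 0;
    cudaMalloc((void **)&d_data, n);
    cudaMalloc((void **)&d_bins, numBins * sizeof(unsigned int));
    cudaMemcpy(d_data, &h_data[0], n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, numBins * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[numBins];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int b = 0; b < numBins; ++b) ok = ok && (h_bins[b] == 1000u);
    printf("%s\n", ok ? "all bins hold 1000" : "unexpected bin count");

    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}
```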

There is no radiation risk from atomic operations :)
 

Q: Does CUDA support function pointers? 
Function pointers are supported by the latest CUDA releases on GPUs with compute capability 2.* and higher. Additional C++ features were introduced as part of compute capability 2.0. See the latest CUDA Toolkit release notes and programming guides.

Q: How do I compute the sum of an array of numbers on the GPU?
This is known as a parallel reduction operation. See the "reduction" sample in the CUDA SDK for more details.
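
For orientation, here is a deliberately simple sketch of the technique; it is not the optimized SDK sample. Each block sums its elements in shared memory with a tree reduction and writes one partial sum, and the host adds up the partial sums. The kernel name `blockSum` is made up, and the block size is assumed to be a power of two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Tree reduction within one block; blockDim.x must be a power of two.
__global__ void blockSum(const float *in, float *partial, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread, padding with zero past the end.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads at each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

int main()
{
    const unsigned int n = 1 << 16;
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f), h_partial(blocks);
    float *d_in = 0, *d_partial = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, &h_in[0], n * sizeof(float), cudaMemcpyHostToDevice);

    // The third launch parameter is the dynamic shared memory size in bytes.
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
    cudaMemcpy(&h_partial[0], d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Finish on the host by adding the per-block partial sums.
    float total = 0.0f;
    for (unsigned int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.1f (input was %u ones)\n", total, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```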

Q: How do I output a variable amount of data from each thread?
This can be achieved using a parallel prefix sum (also known as "scan") operation. The CUDA Data Parallel Primitives library (CUDPP) includes highly optimized scan functions:

http://www.gpgpu.org/developer/cudpp/

The "marchingCubes" sample in the CUDA SDK demonstrates the use of scan for variable output per thread.

Q: How do I sort an array on the GPU? 
The "particles" sample in the CUDA SDK includes a fast parallel radix sort.
To sort an array of values within a block, you can use a parallel bitonic sort. See the "bitonic" sample in the SDK.
The Thrust and CUDPP libraries also include sort functions. See the latest GPU Computing SDK for details.

Q: What do I need to distribute my CUDA application? 
Applications that use the driver API only need the CUDA driver library ("nvcuda.dll" under Windows), which is included as part of the standard NVIDIA driver install.

Applications that use the runtime API also require the runtime library ("cudart.dll" under Windows), which is included in the CUDA Toolkit. It is permissible to distribute this library with your application under the terms of the End User License Agreement included with the CUDA Toolkit.
 

Q: Why can't I use all of the shared memory?
16 bytes of shared memory are always taken to store the blockIdx, blockDim and gridDim built-in variables (threadIdx is stored in a special register).
Shared memory is also used by the compiler to pass parameters to __global__ functions. To avoid this you can use constant memory to pass parameters to your kernels, like this:

struct KernelParams {
    float a;
    int b;
};
__constant__ KernelParams params;

KernelParams hostParams;
// cudaMemcpyToSymbol expects a pointer to the host source data
cudaMemcpyToSymbol(params, &hostParams, sizeof(KernelParams));
 

Q: How can I get information on GPU temperature from my application?
On Microsoft Windows platforms, NVIDIA's NVAPI gives access to GPU temperature and many other low-level GPU functions:
http://developer.nvi...ject/nvapi.html

Under Linux, the "nvidia-smi" utility, which is included with the standard driver install, also displays GPU temperature for all installed devices.

 

Tools, Libraries and Solutions

Q: What is CUFFT?
CUFFT is a Fast Fourier Transform (FFT) library for CUDA. See the CUFFT documentation for more information.
 

Q: What types of transforms does CUFFT support?
The current release supports complex to complex (C2C), real to complex (R2C) and complex to real (C2R).

Q: What is the maximum transform size?
For 1D transforms, the maximum transform size is 16M elements in the 1.0 release.

Q: What is CUBLAS?
CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver. It allows access to the computational resources of NVIDIA GPUs. The library is self contained at the API level, that is, no direct interaction with the CUDA driver is necessary.

Q: Does NVIDIA have a CUDA debugger on Linux and Mac?
Yes. CUDA-GDB is the CUDA debugger for Linux distros and Mac OS X platforms.

Q: Does CUDA-GDB support any UIs?
CUDA-GDB is a command line debugger, but it can be used with GUI frontends such as DDD (Data Display Debugger), Emacs and XEmacs. There are also third party solutions; see the list of options on our Tools & Ecosystem Page.

Q: Does CUDA-GDB work in Eclipse?
CUDA-GDB doesn't plug into Eclipse yet, but it will in the future.

Q: What are the main differences between Parallel Nsight and CUDA-GDB?
Both share the same features except for the following:
Parallel Nsight runs on Windows and can debug both graphics and CUDA code on the GPU (no CPU code debugging).
CUDA-GDB runs on Linux and Mac OS and can debug both CPU code and CUDA code on the GPU (no graphics debugging on the GPU).

Q: How does one debug OGL+CUDA application with an interactive desktop?
You can use ssh, nxclient or VNC to remotely debug an OGL+CUDA application. This requires disabling the interactive session in the X server config file. For details, refer to the CUDA-GDB user guide.

Q: Which debugger do I use for Cluster debugging?
NVIDIA works with its partners to provide cluster debuggers. Two cluster debuggers support CUDA: DDT from Allinea and the TotalView debugger from Rogue Wave Software.

Q: Is there an OpenCL debugger?
There is some support for OpenCL in Parallel Nsight.

Q: What impact does the -G flag have on code optimizations?
The -G flag turns off most of the compiler optimizations on the CUDA code. Some optimizations cannot be turned off because they are required for the application to keep running properly. For instance: local variables will not be spilled to local memory, and instead are preserved in registers which the debugger tracks live ranges for. It is required to ensure that an application will not run out of memory when compiled in debug mode when it could be launched without incident without the debug flag.

Q: Is there a way to reach the debugger team for additional questions or issues?
Anyone interested can send email to cuda-debugger-bugs@nvidia.com

Engaging with NVIDIA

Q: How can I send suggestions for improvements to the CUDA Toolkit and SDK?

Become a registered developer. You can then use our bug reporting system directly to make suggestions and requests, in addition to reporting bugs.

Q: I would love to be able to ask the CUDA team some questions directly. How can I do that?
You can get direct face-to-face time with our team at GTC, which we hold every year. Find out when the next one is at www.gputechconf.com.

You can also attend one of our live Q&A webinars, where you can put questions directly to some of our leading CUDA engineers. To attend, become a registered developer.

Note: OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.