Q: What is CUDA?
CUDA® is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and is supported by an installed base of hundreds of millions of CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers. Applications in astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and other computationally intense fields are increasingly using CUDA to deliver the benefits of GPU acceleration.
Q: What is NVIDIA Tesla™?
With the world's first teraflop many-core processor, NVIDIA® Tesla™ computing solutions enable the necessary transition to energy-efficient parallel computing power. With thousands of CUDA cores per processor, Tesla scales to solve the world's most important computing challenges quickly and accurately.
Q: What is OpenACC?
OpenACC is an open industry standard for compiler directives, or hints, which can be inserted in code written in C or Fortran, enabling the compiler to generate code that runs in parallel on multi-CPU and GPU-accelerated systems. OpenACC directives are an easy and powerful way to leverage the power of GPU computing while keeping your code compatible with non-accelerated, CPU-only systems. Learn more at /openacc.
Q: What kind of performance increase can I expect using GPU Computing over CPU-only code?
This depends on how well the problem maps onto the architecture. For data-parallel applications, speedups of more than two orders of magnitude have been seen. You can browse research, developer, applications and partners on our CUDA In Action Page.
Q: Which GPUs support running CUDA-accelerated applications?
CUDA is a standard feature in all NVIDIA GeForce, Quadro, and Tesla GPUs as well as NVIDIA GRID solutions. A full list can be found on the CUDA GPUs Page.
Q: What is the "compute capability"?
The compute capability of a GPU determines its general specifications and available features. For details, see the Compute Capabilities section in the CUDA C Programming Guide.
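As an illustration, the compute capability can also be queried at run time through the CUDA runtime API. The following is a minimal sketch, not an official sample; the helper name listComputeCapabilities is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the compute capability of every visible device.
// Returns the device count, or -1 if the runtime reports an error.
int listComputeCapabilities() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // prop.major and prop.minor together form the compute capability, e.g. 3.5
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    dev, prop.name, prop.major, prop.minor);
    }
    return count;
}

int main() {
    return listComputeCapabilities() < 0 ? 1 : 0;
}
```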
Q: Where can I find a good introduction to parallel programming?
There are several university courses online, technical webinars, article series and also several excellent books on parallel computing. These can be found on our CUDA Education Page.
Q: Will I have to re-write my CUDA Kernels when the next new GPU architecture is released?
No. CUDA C/C++ provides an abstraction; it's a means for you to express how you want your program to execute. The compiler generates PTX code, which is also not hardware specific. At run time the PTX is compiled for a specific target GPU. This is the responsibility of the driver, which is updated every time a new GPU is released. Changes in the number of registers or the size of shared memory may open up opportunities for further optimization, but that is optional. So write your code now, and enjoy it running on future GPUs.
Q: Does CUDA support multiple graphics cards in one system?
Yes. Applications can distribute work across multiple GPUs. This is not done automatically, however, so the application has complete control. See the "multiGPU" example in the GPU Computing SDK for an example of programming multiple GPUs.
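As a rough illustration of what "the application has complete control" means, the sketch below splits one array across all visible GPUs with cudaSetDevice. It is not the SDK sample; the kernel and helper names (scale, scaleAcrossGpus) are invented for this example, and error handling is kept minimal.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: each GPU scales one contiguous slice of `data`.
bool scaleAcrossGpus(std::vector<float>& data, float factor) {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) return false;

    const int n = static_cast<int>(data.size());
    const int chunk = (n + deviceCount - 1) / deviceCount;  // slice size per GPU
    std::vector<float*> devPtr(deviceCount, nullptr);
    std::vector<int> count(deviceCount, 0);

    // Pass 1: select each device in turn, upload its slice and launch its kernel.
    for (int dev = 0; dev < deviceCount; ++dev) {
        int offset = dev * chunk;
        count[dev] = std::max(0, std::min(chunk, n - offset));
        if (count[dev] == 0) continue;
        cudaSetDevice(dev);
        size_t bytes = count[dev] * sizeof(float);
        if (cudaMalloc(&devPtr[dev], bytes) != cudaSuccess) return false;
        cudaMemcpy(devPtr[dev], data.data() + offset, bytes, cudaMemcpyHostToDevice);
        scale<<<(count[dev] + 255) / 256, 256>>>(devPtr[dev], count[dev], factor);
    }
    // Pass 2: collect results; the copy waits for that device's kernel to finish.
    for (int dev = 0; dev < deviceCount; ++dev) {
        if (count[dev] == 0) continue;
        cudaSetDevice(dev);
        cudaMemcpy(data.data() + dev * chunk, devPtr[dev],
                   count[dev] * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(devPtr[dev]);
    }
    return true;
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    if (!scaleAcrossGpus(data, 2.0f)) { std::printf("No usable CUDA device\n"); return 1; }
    std::printf("data[0] = %.1f\n", data[0]);
    return 0;
}
```

Launching on every device before reading any results back lets the kernels on different GPUs run at the same time.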
Q: Where can I find more information on NVIDIA GPU architecture?
Two good places to start are:
Q: I think I've found a bug in CUDA, how do I report it?
Sign up as a CUDA registered developer. Once your application has been approved, you can file bugs, which will be reviewed by NVIDIA engineering.
Your bug report should include a simple, self-contained piece of code that demonstrates the bug, along with a description of the bug and the expected behavior.
Please include the following information with your bug report:
Q: How does CUDA structure computation?
CUDA broadly follows the data-parallel model of computation. Typically each thread executes the same operation on different elements of the data in parallel.
The data is split up into a 1D, 2D or 3D grid of blocks. Each block can be 1D, 2D or 3D in shape, and can contain up to 1024 threads on GPUs of compute capability 2.0 and higher (512 on earlier hardware). Threads within a thread block can cooperate via the shared memory.
Thread blocks are executed as smaller groups of threads known as "warps".
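The grid and block structure described above can be sketched with a small 2D example. This is an illustrative sketch only; the kernel name addOne, the helper addOneOnGpu and the 16x16 block shape are arbitrary choices, not recommendations.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one element, located from its block and thread indices.
__global__ void addOne(float* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {  // the grid may overhang the data
        img[y * width + x] += 1.0f;
    }
}

// Hypothetical helper: adds 1 to every element of a width x height image.
bool addOneOnGpu(std::vector<float>& img, int width, int height) {
    float* dev = nullptr;
    size_t bytes = img.size() * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,    // round up so every
              (height + block.y - 1) / block.y);  // element is covered
    addOne<<<grid, block>>>(dev, width, height);

    cudaMemcpy(img.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return cudaGetLastError() == cudaSuccess;
}

int main() {
    const int width = 1000, height = 600;
    std::vector<float> img(width * height, 0.0f);
    if (!addOneOnGpu(img, width, height)) return 1;
    std::printf("first = %.1f, last = %.1f\n", img.front(), img.back());
    return 0;
}
```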
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaDeviceSynchronize()" API call (which replaces the deprecated "cudaThreadSynchronize()") should be used when measuring performance to ensure that all device operations have completed before stopping the timer.
CUDA functions that perform memory copies and that control graphics interoperability are synchronous, and implicitly wait for all kernels to complete.
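The effect of asynchronous launches on timing can be seen with a sketch like the one below, which reads a host timer right after the launch and again after synchronizing. The kernel (busy) and helper (timeKernel) are hypothetical names, and the actual numbers depend on the system.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;  // artificial work
        d[i] = v;
    }
}

// Hypothetical helper: reports time until the launch call returns, and time
// until the kernel has actually finished, both in milliseconds.
bool timeKernel(double* launchMs, double* totalMs) {
    using clock = std::chrono::steady_clock;
    const int n = 1 << 22;
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d, 0, n * sizeof(float));

    busy<<<(n + 255) / 256, 256>>>(d, n);  // warm-up launch
    cudaDeviceSynchronize();

    auto t0 = clock::now();
    busy<<<(n + 255) / 256, 256>>>(d, n);  // returns before the kernel finishes
    auto t1 = clock::now();
    cudaDeviceSynchronize();               // wait for the GPU before stopping the timer
    auto t2 = clock::now();

    *launchMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    *totalMs = std::chrono::duration<double, std::milli>(t2 - t0).count();
    cudaFree(d);
    return true;
}

int main() {
    double launchMs = 0.0, totalMs = 0.0;
    if (!timeKernel(&launchMs, &totalMs)) return 1;
    std::printf("launch returned after %.3f ms, kernel finished after %.3f ms\n",
                launchMs, totalMs);
    return 0;
}
```

Stopping the timer at the first reading would measure only the launch overhead, not the kernel itself.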
Q: Can I transfer data and run a kernel in parallel (for streaming applications)?
Yes, CUDA supports overlapping GPU computation and data transfers using CUDA streams. See the Asynchronous Concurrent Execution section of the CUDA C Programming Guide for more details.
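A minimal sketch of this pattern, assuming a device that supports copy/compute overlap: the data is cut into chunks, and each chunk is copied, processed and copied back in its own stream. Page-locked host memory (cudaMallocHost) is needed for the copies to be truly asynchronous. The names doubleValues and doubleWithStreams are invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleValues(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical helper: processes nStreams chunks, one stream per chunk.
bool doubleWithStreams(int chunk) {
    const int nStreams = 2;
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float* h = nullptr;
    float* d = nullptr;
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return false;  // page-locked
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Work queued in different streams may overlap: while one chunk is being
    // computed, the next chunk can be in flight over the bus.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d + offset, h + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        doubleValues<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h[i] == 2.0f);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    bool ok = doubleWithStreams(1 << 20);
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```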
Q: Is it possible to DMA directly into GPU memory from another PCI-E device?
GPUDirect allows other PCI-E devices to DMA directly into GPU memory. See the GPUDirect technology page for details.
Q: What are the peak transfer rates between the CPU and GPU?
The performance of memory transfers depends on many factors, including the size of the transfer and type of system motherboard used.
On PCI-Express 2.0 systems we have measured up to 6.0 GB/sec transfer rates.
You can measure the bandwidth on your system using the "bandwidthTest" sample from the SDK.
Transfers from page-locked memory are faster because the GPU can DMA directly from this memory. However, allocating too much page-locked memory can significantly affect the overall performance of the system, so allocate it with care.
Q: What is the precision of mathematical operations in CUDA?
All NVIDIA GPUs since GT200 support double precision floating point. See the programming guide for more details. All compute-capable NVIDIA GPUs support 32-bit integer and single precision floating point arithmetic. They follow the IEEE-754 standard for single-precision binary floating-point arithmetic, with some minor differences.
Q: Why are the results of my GPU computation slightly different from the CPU results?
There are many possible reasons. Floating point computations are not guaranteed to give identical results across any set of processor architectures. The order of operations will often be different when implementing algorithms in a data parallel way on the GPU.
This is a very good reference on floating point arithmetic:
Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs
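The order-of-operations effect can be shown with three numbers. In the sketch below (kernel and helper names are illustrative), the same three floats are summed in two different orders on the GPU. Because floating point addition is not associative, the two results differ even though both are correctly rounded.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds the same three values in two different orders.
__global__ void sumTwoWaysKernel(float a, float b, float c, float* out) {
    out[0] = (a + b) + c;
    out[1] = a + (b + c);
}

// Hypothetical host wrapper around the kernel above.
bool sumTwoWays(float a, float b, float c, float out[2]) {
    float* d = nullptr;
    if (cudaMalloc(&d, 2 * sizeof(float)) != cudaSuccess) return false;
    sumTwoWaysKernel<<<1, 1>>>(a, b, c, d);
    cudaError_t err = cudaMemcpy(out, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}

int main() {
    float out[2] = {0.0f, 0.0f};
    // Near 1e8 adjacent floats are 8 apart, so -1e8 + 1 rounds back to -1e8.
    if (!sumTwoWays(1e8f, -1e8f, 1.0f, out)) return 1;
    std::printf("(a + b) + c = %.1f\n", out[0]);  // prints 1.0
    std::printf("a + (b + c) = %.1f\n", out[1]);  // prints 0.0
    return 0;
}
```

A parallel sum groups its additions differently from a sequential CPU loop, so small discrepancies of this kind are expected rather than a sign of a bug.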
Q: Does CUDA support double precision arithmetic?
Yes. GPUs with compute capability 1.3 and higher support double precision floating point in hardware.
Q: How do I get double precision floating point to work in my kernel?
You need to add the switch "-arch sm_13" (or a higher compute capability) to your nvcc command line; otherwise doubles will be silently demoted to floats. See the "Mandelbrot" sample included in the CUDA Installer for an example of how to switch between different kernels based on the compute capability of the GPU.
Q: Can I read double precision floats from texture?
The hardware doesn't support double precision float as a texture format, but it is possible to use int2 and cast it to double as long as you don't need interpolation:
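static __inline__ __device__ double fetch_double(texture<int2, 1> t, int i)
{
    // Fetch the 64-bit value as two 32-bit integers (no interpolation).
    int2 v = tex1Dfetch(t, i);
    // Reassemble the double from its high (v.y) and low (v.x) words.
    return __hiloint2double(v.y, v.x);
}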
Q: Does CUDA support long integers?
Yes, CUDA supports 64-bit integers (long longs). Operations on these types compile to multiple-instruction sequences on some GPUs, depending on compute capability.
Q: Where can I find documentation on the PTX assembly language?
This is included in the CUDA Toolkit documentation.
Q: How can I see the PTX code generated by my program?
Add "-keep" to the nvcc command line (or custom build setup in Visual Studio) to keep the intermediate compilation files. Then look at the ".ptx" file.
Q: How can I find out how many registers / how much shared/constant memory my kernel is using?
Add the option "--ptxas-options=-v" to the nvcc command line. When compiling, this information will be output to the console.
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
Q: What is the maximum length of a CUDA kernel?
This depends on the compute capability of your GPU. The definitive answer can be found in the Features and Technical Specifications section of the CUDA C Programming Guide.
Q: How can I debug my CUDA code?
There are several powerful debugging tools which allow the creation of break points and traces. Tools exist for all the major operating systems and multi-GPU solutions and clusters. Please visit the CUDA Tools and Ecosystem Page for the latest debugging tools.
Q: How can I optimize my CUDA code?
There are now extensive guides and examples on how to optimize your CUDA code. Find some useful links below:
Q: How do I choose the optimal number of threads per block?
For maximum utilization of the GPU you should carefully balance the number of threads per thread block, the amount of shared memory per block, and the number of registers used by the kernel.
You can use the CUDA Occupancy Calculator tool to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. This is included as part of the latest CUDA Toolkit.
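Besides the calculator tool, newer versions of the CUDA runtime (6.5 and later) expose occupancy functions that can be called from code. The sketch below uses that runtime API instead of the calculator; the saxpy kernel and the suggestBlockSize helper are placeholders for this example, and the suggested block size is only a starting point for tuning.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: asks the runtime for a block size that maximizes
// occupancy for the saxpy kernel. Returns -1 on error.
int suggestBlockSize(int* minGridSize) {
    int blockSize = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        minGridSize, &blockSize, saxpy, 0 /* dynamic shared memory */, 0 /* no limit */);
    return err == cudaSuccess ? blockSize : -1;
}

int main() {
    int minGridSize = 0;
    int blockSize = suggestBlockSize(&minGridSize);
    if (blockSize <= 0) return 1;

    // How many blocks of that size fit on one multiprocessor at once?
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, saxpy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = static_cast<double>(blocksPerSm * blockSize) /
                       prop.maxThreadsPerMultiProcessor;

    std::printf("suggested block size: %d (min grid size %d)\n", blockSize, minGridSize);
    std::printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0);
    return 0;
}
```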
Q: What is the maximum kernel execution time?
On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually will cause a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases can hang the entire machine, requiring a hard reset.
This is caused by the Windows "watchdog" timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.
For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
Q: How do I compute the sum of an array of numbers on the GPU?
This is known as a parallel reduction operation. See the "reduction" sample for more details.
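For orientation, here is a simplified sketch of the idea, not the optimized SDK sample: each block sums its slice in shared memory with a tree reduction, and the per-block partial sums are added on the host. It assumes the block size is a power of two; blockSum and sumOnGpu are names chosen for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces blockDim.x elements to one partial sum.
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* partial, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Hypothetical helper: returns the sum of `values`, or -1 on allocation failure.
double sumOnGpu(const std::vector<float>& values) {
    const int n = static_cast<int>(values.size());
    if (n == 0) return 0.0;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&dPartial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1.0;
    }
    cudaMemcpy(dIn, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    double total = 0.0;  // final pass on the host, for simplicity
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> ones(1 << 20, 1.0f);
    std::printf("sum = %.0f\n", sumOnGpu(ones));  // prints 1048576
    return 0;
}
```

The SDK sample goes considerably further, for example by reducing several elements per thread and by running the final pass on the GPU as well.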
Q: How do I output a variable amount of data from each thread?
This can be achieved using a parallel prefix sum (also known as "scan") operation. The CUDA Data Parallel Primitives library (CUDPP) includes highly optimized scan functions:
The "marchingCubes" sample demonstrates the use of scan for variable output per thread.
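To make the connection between scan and variable output concrete, the sketch below uses Thrust (rather than CUDPP) to turn per-thread output counts into write offsets with an exclusive scan. Each thread then writes its items at its own offset. The kernel emit and helper expand are hypothetical; thread i simply writes the value i, counts[i] times.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Thread i writes counts[i] items, starting at its precomputed offset.
__global__ void emit(const int* counts, const int* offsets, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < counts[i]; ++k) out[offsets[i] + k] = i;
    }
}

// Hypothetical helper: returns the compacted output of all threads.
std::vector<int> expand(const std::vector<int>& hostCounts) {
    const int n = static_cast<int>(hostCounts.size());
    if (n == 0) return std::vector<int>();

    thrust::device_vector<int> counts(hostCounts.begin(), hostCounts.end());
    thrust::device_vector<int> offsets(n);

    // Exclusive scan: offsets[i] = counts[0] + ... + counts[i-1]
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    int lastOffset = offsets.back();
    int lastCount = counts.back();
    int total = lastOffset + lastCount;  // total number of output items

    thrust::device_vector<int> out(total);
    emit<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(counts.data()),
                                   thrust::raw_pointer_cast(offsets.data()),
                                   thrust::raw_pointer_cast(out.data()), n);

    std::vector<int> result(total);
    thrust::copy(out.begin(), out.end(), result.begin());
    return result;
}

int main() {
    std::vector<int> counts = {2, 0, 3, 1};   // offsets become 0, 2, 2, 5
    std::vector<int> out = expand(counts);    // 0 0 2 2 2 3
    for (int v : out) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```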
Q: How do I sort an array on the GPU?
The provided "particles" sample includes a fast parallel radix sort.
To sort an array of values within a block, you can use a parallel bitonic sort. Also see the "bitonic" sample.
The Thrust library also includes sort functions. See more sample info on our online sample documentation.
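For example, sorting with Thrust takes only a few lines. This is a minimal sketch; the helper name sortOnGpu is made up, and the choice of sorting algorithm is left to Thrust.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical helper: copies the values to the GPU, sorts them there,
// and returns the sorted values.
std::vector<int> sortOnGpu(std::vector<int> values) {
    thrust::device_vector<int> d(values.begin(), values.end());  // host to device
    thrust::sort(d.begin(), d.end());                            // sort on the GPU
    thrust::copy(d.begin(), d.end(), values.begin());            // device to host
    return values;
}

int main() {
    std::vector<int> sorted = sortOnGpu({5, 3, 9, 1, 7});
    for (int v : sorted) std::printf("%d ", v);  // prints 1 3 5 7 9
    std::printf("\n");
    return 0;
}
```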
Q: What do I need to distribute my CUDA application?
Applications that use the driver API only need the CUDA driver library ("nvcuda.dll" under Windows), which is included as part of the standard NVIDIA driver install.
Applications that use the runtime API also require the runtime library ("cudart.dll" under Windows), which is included in the CUDA Toolkit. It is permissible to distribute this library with your application under the terms of the End User License Agreement included with the CUDA Toolkit.
Q: How can I get information on GPU temperature from my application?
On Microsoft Windows platforms, NVIDIA's NVAPI gives access to GPU temperature and many other low-level GPU functions.
Under Linux, the "nvidia-smi" utility, which is included with the standard driver install, also displays GPU temperature for all installed devices.
Q: What is CUFFT?
CUFFT is a Fast Fourier Transform (FFT) library for CUDA. See the CUFFT documentation for more information.
Q: What types of transforms does CUFFT support?
The current release supports complex to complex (C2C), real to complex (R2C) and complex to real (C2R).
Q: What is the maximum transform size?
For 1D transforms, the maximum transform size is 16M elements in the 1.0 release.
Q: What is CUBLAS?
CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver. It allows access to the computational resources of NVIDIA GPUs. The library is self contained at the API level, that is, no direct interaction with the CUDA driver is necessary.
Q: Does NVIDIA have a CUDA debugger on Linux and Mac?
Yes. CUDA-GDB is the CUDA debugger for Linux and Mac OS X platforms.
Q: Does CUDA-GDB support any UIs?
CUDA-GDB is a command line debugger, but it can be used with GUI front ends such as DDD (Data Display Debugger), Emacs and XEmacs. There are also third-party solutions; see the list of options on our Tools & Ecosystem Page.
Q: What are the main differences between Parallel Nsight and CUDA-GDB?
Both share the same features except for the following:
Parallel Nsight runs on Windows and can debug both graphics and CUDA code on the GPU (no CPU code debugging).
CUDA-GDB runs on Linux and Mac OS and can debug both CPU code and CUDA code on the GPU (no graphics debugging on the GPU).
Q: How does one debug OGL+CUDA application with an interactive desktop?
You can use ssh, nxclient or vnc to remotely debug an OGL+CUDA application. This requires disabling the interactive session in the X server config file. For details, refer to the CUDA-GDB user guide.
Q: Which debugger do I use for Cluster debugging?
NVIDIA works with its partners to provide cluster debuggers. There are two cluster debuggers that support CUDA: DDT from Allinea and the TotalView debugger from Rogue Wave Software.
Q: What impact does the -G flag have on code optimizations?
The -G flag turns off most of the compiler optimizations on the CUDA code. Some optimizations cannot be turned off because they are required for the application to keep running properly. For instance, local variables are not spilled to local memory; they are kept in registers, and the debugger tracks their live ranges. This ensures that an application that launches without incident in a normal build does not run out of memory when compiled in debug mode.
Q: Is there a way to reach the debugger team for additional questions or issues?
Anyone interested can send email to email@example.com.
Q: How can I send suggestions for improvements to the CUDA Toolkit?
Become a registered developer. You can then use our bug reporting system directly to make suggestions and requests, in addition to reporting bugs.
Q: How can I ask the CUDA team questions directly?
You can get direct face-to-face time with our team at GTC, which we hold every year. Find out when the next one is at www.gputechconf.com.
You can also attend one of our live Q&A webinars, where you can put questions directly to some of our leading CUDA engineers. To attend, become a registered developer.