CUDA in Visual Studio 2010

CUDA 4.0 and later versions work fine with Visual Studio 2010. The only noticeable change from Visual Studio 2008 is that Build Rules are replaced by Build Customizations in Visual Studio 2010.

Adding CUDA capability to any Visual Studio 2010 solution is easy:

1. Install the latest version of CUDA Toolkit.
2. Right-click the project name in Solution Explorer and choose Build Customizations.
2. The Visual C++ Build Customization Files dialog should display the one or more CUDA toolkits you have installed. Choose the one you want to use and click OK.
3. To change the CUDA compile options, in the project properties look for the section CUDA C/C++.

You should be able to build CUDA source files and run CUDA programs with these changes.

Tried with: CUDA 4.1 and Visual Studio 2010

FMAD on CUDA

PTX instructions produced with FMAD off (left) and FMAD on (right)

If you are using CUDA to perform any sort of non-graphics floating-point computation, be aware of the FMAD (floating-point multiply-add) instruction. Since CUDA hardware needs to straddle not only the world of computation, but also graphics and gaming, it has lots of FMAD units. So, by default the CUDA compiler will try to replace as much of your floating-point computation code with FMAD instructions.

This is fine if you do not rely on the precision of your results. However, this can lead to hard-to-find bugs if you do rely on the precision. If you need the CUDA computation to mimic the floating-point computation on the CPU, then you are better off without the FMAD instructions.

The CUDA compiler (nvcc) is configured to produce FMAD instructions by default. To request it to stop producing FMAD instructions and use the normal floating-point instructions use the compiler directive --fmad=false

Note that turning off FMAD can hurt performance quite a bit. I found that the time spent on my computations increased by about 20% with FMAD turned off.

Tried with: CUDA 4.1

David Kirk on CUDA and Fermi

There is a good talk by David Kirk up on Youtube about Fermi. I found the talk quite useful, with Kirk explaining various concepts in CUDA and Fermi with better abstractions than what I had in my mind.

A few notes from the talk:

1. The CPU die area is mostly cache, with a bit of execution-control logic. The GPU die is the other way around. This is because the CPU is built for a cache hit, while the GPU is built for a cache miss.

2. L1 or shared memory latency: ~10 cycles
Global memory latency: ~200-400 cycles

3. Even though a Fermi SM has 32K registers in its register file, the maximum number of registers allowed per thread is a mere 63. This constraint is due to the number of bits NVIDIA used for the addressing from a thread to its registers in the register file.

4. Core clock speeds remain around 1GHz, which is significantly lesser than CPU speeds. This is intentional, because power consumption increases super-linearly with clock speed. GPUs are already consuming a lot of power, it would not be good to see this increase further.

5. The minimum size of a global memory read is 128 bytes. This minimum is dictated by DRAM technology or manufacturers and will only get larger every year due to the way the DRAM technology is progressing.

6. The reason the CUDA profiler needs to run the application multiple times is that there are a limited number of counters per SM. (Kirk says 4, but CUDA programming guide says 8 counters.)

Doxygen for CUDA

Doxygen does not generate documentation for CUDA source files by default. The screenshot above shows how to enable Doxywizard to generate for .cu files. In the Expert tab, Input topic, find the FILE_PATTERNS section. Type *.cu and press the + button to add.

(To learn how to use Doxywizard to generate documentation for your code see here.)

Tried with: Doxygen 1.7.6.1

CUDA on Tesla over Remote Desktop

Problem

Typically, CUDA programs cannot be executed over a Remote Desktop session. (For more on this problem, see here.) However, if the CUDA device is a Tesla and if it does not need to be the primary graphics device, then you are in luck! A typical setup for Tesla is to have another Quadro device handle the desktop graphics, while the Tesla device handles the computation. In such setups, CUDA programs can be executed on the Tesla device over Remote Desktop! :-)

Solution

To be able to successfully execute CUDA programs on a Tesla device, you need to use a Tesla Compute Cluster (TCC) driver and set the Tesla device to TCC mode.

1. Install TCC Driver

CUDA programs can execute on Tesla devices with the CUDA developer driver you might have downloaded off the CUDA webpage. However, the developer driver is not the same as the device specific driver. (For more on the difference, see here.) To be able to use the TCC mode of the Tesla device, the TCC driver specific to the Tesla device is needed.

To get the Tesla driver, use the NVIDIA Drivers webpage and choose the driver for your Tesla device. This is its TCC driver. Install it.

2. Switch to TCC Mode

By default, the Tesla device will be working in a Windows Display Driver Model (WDDM) mode. This is the mode used by most CUDA devices, since they might need to support desktop graphics. If the Tesla device is in the WDDM mode, then it will not be able to execute CUDA programs over Remote Desktop.

To switch the Tesla device to TCC mode, we can use the NVIDIA System Management Interface (SMI) tool. On Windows, this is provided as an executable file named nvidia-smi.exe. On my system, this was found in the C:\Program Files\NVIDIA Corporation\NVSMI directory. Open a Windows command prompt with elevated (Administrator) privileges and change to this directory. (One way to open an Administrator command prompt is described here.)

To list the available NVIDIA devices use the command:

nvidia-smi -q

To view the details of a specific device use the command:

nvidia-smi -q -i 0

The -i specifies the ID of the device. Device IDs begin from 0 upwards.

Finally, to flip the driver model of a device from WDDM to TCC mode:

nvidia-smi -dm 1 -i 0

The -dm parameter takes two values: 0 for WDDM and 1 for TCC.

Changing the driver model mode of the Tesla requires a restart of the system. After the restart, you should be able to execute CUDA programs remotely on the Tesla computer! :-)

Tried with: Tesla C2050 and Quadro 4000

CUDA: Exception due to GPU Architecture Mismatch

Symptom

Your CUDA program executes, but the computed result is wrong. You run the program in Debug mode and it spews out a bunch of first-chance exceptions on cudaError_enum and cudaError of this form in the Output window:

First-chance exception at 0x74dcb727 in HelloCUDA.exe: Microsoft C++ exception: cudaError_enum at memory location 0x002af650..
First-chance exception at 0x74dcb727 in HelloCUDA.exe: Microsoft C++ exception: cudaError at memory location 0x002af4ec..
First-chance exception at 0x74dcb727 in HelloCUDA.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..

Diagnosis

One of the reasons for this behaviour is if the device you are using cannot support the compute capability the program was compiled for. Ideally, such a program should be able to detect the device capability, compare and exit with a meaningful error. I have no idea why NVIDIA does not do this.

Anyway, if your device is of compute capability 1.1 and the program was compiled for compute capability 2.0, that is sm_20 GPU architecture, then it can result in such silent failures. Recompile the program for the compute capability of your device and the error should be gone.

Tried with: CUDA 4.0

Thrust: Compact Multiple Vectors Using Predicate

From this earlier post, we can see that the zip_iterator makes it really easy to compact multiple vectors (of same size) based on the duplicate values in one of those vectors.

Another scenario that arises frequently is the need to compact multiple vectors (of same size) based on testing the value in one of those vectors (using a predicate). We may want to remove all the elements for which the predicate is true.

For example, say I have 2 vectors. A previous kernel might have invalidated some of the values in the first vector by setting them to -1. Now, I want to compact these 2 vectors such that the elements corresponding to -1 in the first vector is removed from both vectors.

The code required to the above compaction is very similar to that in the earlier blog post:

// Many vectors
thrust::device_vector< int > vec0;
thrust::device_vector< int > vec1;

// Make zip_iterator easy to use
typedef thrust::device_vector< int >::iterator  IntDIter;
typedef thrust::tuple< IntDIter, IntDIter >     IntDIterTuple2;
typedef thrust::zip_iterator< IntDIterTuple2 >  ZipDIter;

// Remove elements in many vectors if element in vec0 is negative
ZipDIter newEnd = thrust::remove_if(    thrust::make_zip_iterator( thrust::make_tuple( vec0.begin(), vec1.begin() ) ),
                                        thrust::make_zip_iterator( thrust::make_tuple( vec0.end(), vec1.end() ) ),
                                        isTuple2Negative() );

// Erase the removed elements from the vectors
IntDIterTuple2 endTuple = newEnd.get_iterator_tuple();
vec0.erase( thrust::get<0>( endTuple ), vec0.end() );
vec1.erase( thrust::get<1>( endTuple ), vec1.end() );

The only extra work needed is to carefully write a predicate that does what we want:

// Make predicate easy to write
typedef thrust::tuple< int, int > IntTuple2;

// Predicate
struct isTuple2Negative
{
    __host__ __device__ bool operator() ( const IntTuple2& tup )
    {
        const int x = thrust::get<0>( tup );
        return ( x < 0 );
    }
};

That is it, the compaction works like magic! :-)

Tried with: CUDA 4.0 and Thrust 1.4.0

CUDA: Occupancy Calculator

The CUDA Occupancy Calculator is an Excel spreadsheet that ships with the CUDA Toolkit. It can be used to determine if the number of threads per block being used to launch a kernel is optimal. This spreadsheet can be found at %ProgramData%\NVIDIA Corporation\GPU SDK\C\tools

The spreadsheet requires 4 inputs from you specific to the kernel you are analyzing:

  1. The compute capability of the CUDA device
  2. Threads per block you are using for the kernel
  3. Registers per thread for the kernel
  4. Shared memory per block

You already know (1) and (2) since you are the author of the kernel. (3) and (4) can be found by compiling the code with the option --ptxas-options=-v. This information can be found in the Output window of Visual Studio during compilation. Another alternative is to run the CUDA program with the Compute Visual Profiler and this information can be found in the Profiler Output sheet.

Once the above 4 numbers are entered, the 3 charts on the spreadsheet update to show the position of your kernel on them. The 3 charts deal with the parameters threads per block, registers per block and shared memory respectively. Look for the red triangle on the chart whose parameter you have the flexibility to change.

For example, say for a given kernel I have no say in the number of registers and shared memory it uses. However, I have the ability to change the number of threads per block it launches with. Assume that I am currently using 200 threads per block for this kernel. For this case, I look at Chart 1 (Varying Block Size), and check if the red triangle is on any of the global maxima of the curve. If it is not, I look at the threads per block that will put this kernel at the global maxima and try my kernel with that (say 256). In most cases, my CUDA program should execute a bit faster due to this change since the occupancy of the GPU by the threads of this kernel has been improved.

Tried with: CUDA 4.0

CUDA: Launch Bounds

The CUDA compiler decides on the number of registers to use for a kernel based on its complexity. Such a compiled kernel is flexible enough to be launched with any number of threads or blocks. However, if an approximate idea of the number of threads and blocks is known at compile-time, then this can be used to optimize the kernel for such launches. This is done by informing the compiler of this launch configuration, so that it has a better chance of tweaking the number of registers it will use for the kernel during compilation.

Such a launch bound is set for a kernel as follows:

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2

__global__ void
__launch_bounds__( MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP )
fooKernel( int* inArr, int* outArr )
{
    // ... Computation of kernel
}

The MAX_THREADS_PER_BLOCK parameter is mandatory and the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with the thread number more than MAX_THREADS_PER_BLOCK, that launch will fail for obvious reasons.

Tried with
: CUDA 4.0

CUDA: Prefer L1 Cache for Kernel

On CUDA devices of compute capability 1.x, the amount of shared memory and L1 cache for each multiprocessor was fixed. In devices of compute capability 2.0 and later, there is 64 KB of memory for each multiprocessor. This per-multiprocessor on-chip memory is split and used for both shared memory and L1 cache. By default, 48 KB is used as shared memory and 16 KB as L1 cache.

As CUDA kernels get more complex, they start to behave like CPU programs. There is lesser need to share data between kernels and more pressure for L1 caching. The cudaFuncSetCacheConfig runtime function can be used to set any kernel to prefer the usage of the per-multiprocessor memory for either shared memory or L1 cache. The option cudaFuncCachePreferShared prefers shared memory, i.e., 48 KB for shared memory and 16 KB for L1 cache. cudaFuncCachePreferL1 prefers L1, i.e., 16 KB for shared memory and 48 KB for L1 cache. cudaFuncCachePreferNone uses the preference set for the device or thread.

In this example, the preference for the kernel is set to L1:

// Kernel
__global__ void fooKernel( int* inArr, int* outArr )
{
    // ... Computation of kernel
    return;
}

int main()
{
    // Set preference for above kernel to L1
    cudaFuncSetCacheConfig( fooKernel, cudaFuncCachePreferL1 );

    // Call kernel anytime after cache preference is set
    fooKernel<<< T, B >>>( inArr, outArr );

    return 0;
}