GPU-Z: Sensors

The Sensors tab in GPU-Z is useful if you are tracking the GPU load or memory usage. The real-time graphs are updated once per second.

  • The current value of a sensor is useful, but its minimum, maximum and average are often more useful. There are 2 ways to view these: click on the dropdown arrow beside a sensor name and make your pick, or click on the sensor value (number) to cycle through these readings, one per click.
  • Finally, if you want a log file of the sensor values, click on Log to file. You can selectively remove certain sensors from showing up in the log by clicking their dropdown arrow and disabling Log to file.

Tried with: GPU-Z 0.5.1

VLC: Audio-Video Synchronization

Sometimes the audio or video might lag behind the other during playback of a video file. Fixing this so that the audio and video are synchronized is pretty easy in VLC.

Play the video file in VLC. Choose Tools → Track Synchronization. VLC pops up an Adjustments and Effects window with the Synchronization tab open. Increase or decrease the Advance of audio over video value until you feel the audio and video are in sync. It takes a couple of seconds for new settings to take effect.

This is a per-session setting, as it rightly should be. It is lost when VLC is closed, so it needs to be set again in the next session if required.

 

Tried with: VLC 1.1.7

Using CUDPP with Visual Studio

The CUDPP (CUDA Data Parallel Primitives) library provides the essential primitives necessary for any parallel data processing on the GPU. This includes functions to scan, compact and sort data. Source code of CUDPP can be downloaded from here and compiled using Visual Studio. Beware that the compilation can take quite a while.
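
For readers new to these primitives, here is a minimal CPU sketch in C++ of what scan and compact compute. It is not CUDPP code and does not use the CUDPP API. The function names exclusiveScan and compact are hypothetical, chosen only for illustration. A reference like this can also be handy for checking GPU results on small inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive scan (prefix sum): out[ i ] is the sum of in[ 0 ] .. in[ i - 1 ],
// so out[ 0 ] is always 0.
std::vector<int> exclusiveScan( const std::vector<int>& in )
{
    std::vector<int> out( in.size() );
    int runningSum = 0;

    for ( std::size_t i = 0; i < in.size(); ++i )
    {
        out[ i ]    = runningSum;
        runningSum += in[ i ];
    }

    return out;
}

// Compact: keep only the elements whose flag is non-zero, preserving order.
// The exclusive scan of the 0/1 flags gives each kept element its output slot.
std::vector<int> compact( const std::vector<int>& in, const std::vector<int>& flags )
{
    assert( in.size() == flags.size() );

    // Normalize flags to 0 or 1
    std::vector<int> isKept( flags.size() );
    for ( std::size_t i = 0; i < flags.size(); ++i )
        isKept[ i ] = ( flags[ i ] != 0 ) ? 1 : 0;

    const std::vector<int> outPos = exclusiveScan( isKept );
    const std::size_t outNum =
        in.empty() ? 0 : static_cast<std::size_t>( outPos.back() + isKept.back() );

    std::vector<int> out( outNum );
    for ( std::size_t i = 0; i < in.size(); ++i )
        if ( isKept[ i ] )
            out[ outPos[ i ] ] = in[ i ];

    return out;
}
```

For the input 3, 1, 4, 1, 5 the exclusive scan is 0, 3, 4, 8, 9. Compacting 10, 20, 30, 40 with flags 1, 0, 0, 1 gives 10, 40.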

A version of CUDPP is shipped by NVIDIA inside its GPU Computing SDK. Using this version of CUDPP with your Visual Studio project is pretty easy:

  1. Include cudpp.h in source code that uses CUDPP.
  2. Add the CUDPP include directory to the Additional Include Directories of your CUDA Build Rule. Typically this directory is $(NVSDKCOMPUTE_ROOT)/C/common/inc/cudpp
  3. Link with cudpp32.lib or cudpp64.lib. This library file can be provided in Linker → Input → Additional Dependencies
  4. Add the CUDPP library directory to Linker → General → Additional Library Directories. Typically this directory is $(NVSDKCOMPUTE_ROOT)/C/common/lib
  5. When an executable linked with CUDPP is executed, it looks for a CUDPP DLL file (cudpp32_32_16.dll for example). This is typically found in the $(NVSDKCOMPUTE_ROOT)/C/common/bin directory. The GPU Computing SDK adds this directory to the %PATH% environment variable when it is installed. If, for some reason, the directory is not in %PATH%, either add it there or copy the DLL file to the directory of the executable.

Tried with: CUDPP 1.1.1, CUDA 3.2 and Visual Studio 2008

GPU-Z

 

GPU-Z is a must-have Windows utility if you are using an NVIDIA or ATI graphics card. Much like CPU-Z, it displays almost every last interesting detail of your graphics card. Hover your mouse cursor over any detail and a helpful popup is displayed with information on that detail.

CUDA or OpenCL programmers might find the details in the Sensors tab useful. It displays the GPU load and memory usage in real-time.

 

Tried with: GPU-Z 0.5.1

Windows 7: Start Any Application as Administrator from Keyboard

It is well known that you can start any application as Administrator by typing its name at the start menu, right-clicking its name from the list and choosing Run as Administrator.

Charon, one of the readers of this blog, has contributed a far cooler way to start any application as Administrator:

  1. Type the name of the application in the Start menu.
  2. Use the arrow keys if needed to choose the application from the displayed list.
  3. Instead of pressing Enter, press Ctrl+Shift+Enter and the application is invoked with Administrator privileges.

This is an awesome trick to know if you like to open applications from the keyboard without using the mouse. Thanks Charon! :-)

CUDA: Work Allocation Techniques

Threads in CUDA are the workers who work on data. In the CUDA architecture, threads are grouped into blocks and blocks are grouped into a grid. Given such an architecture, there are 2 common techniques to allocate the data among the threads. Or put another way, there are 2 ways for each thread to pick the data it should work on.

Technique 1: Chunks of Data Per Thread

In this technique, data is broken into many contiguous chunks and each thread works on one (or none) such chunk. Here is sample code for illustration:

__global__ void fooKernel( const int* dataArray, int dataNum )
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Work allocation
    const int dataPerThread      = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;
    const int curThreadDataBegin = dataPerThread * curThreadIdx;
    const int curThreadDataEnd   = curThreadDataBegin + dataPerThread;

    // Iterate data chunk of this thread
    for ( int idx = curThreadDataBegin; idx < curThreadDataEnd; ++idx )
    {
        // Check if data out of bounds
        if ( idx >= dataNum )
            continue;

        // Do something with data
        int val = dataArray[ idx ];
    }

    return;
}

Note that the work allocation calculation is:

const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;

It is not a mere division of available work (data) by available labour (threads). Integer division rounds down, so such a simple calculation would leave the last few elements unprocessed if dataNum is not a multiple of totalThreadNum. The calculation above rounds up instead.

So, if dataNum is not a multiple of totalThreadNum, at least one thread at the end gets a smaller chunk or no work at all. This is an unavoidable fact of life for this technique! :-)

This is also why we need to ensure we are always accessing something inside of the input data:

if ( idx >= dataNum )
    continue;

This work allocation technique is advantageous if the work done by the thread benefits from having access to elements lying in the same chunk. If this is not the case, then Technique 2 is far easier to write and understand.
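
To see how the chunks fall out for a given launch, here is a small host-side C++ sketch of the same calculation. The function name chunkOfThread is hypothetical and this is a variation, not the kernel above: it clamps the chunk to dataNum up front instead of checking every index inside the loop.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Returns the half-open range [begin, end) of data indices that the given
// thread would work on under Technique 1. An empty range (begin == end)
// means the thread has no work.
std::pair<int, int> chunkOfThread( int curThreadIdx, int totalThreadNum, int dataNum )
{
    assert( totalThreadNum > 0 && dataNum >= 0 );
    assert( curThreadIdx >= 0 && curThreadIdx < totalThreadNum );

    // Round up so that no trailing data is left unassigned
    const int dataPerThread = ( dataNum + ( totalThreadNum - 1 ) ) / totalThreadNum;

    // Clamp both ends to the data size instead of checking each index
    const int begin = std::min( dataPerThread * curThreadIdx, dataNum );
    const int end   = std::min( begin + dataPerThread, dataNum );

    return std::make_pair( begin, end );
}
```

With 5 elements and 4 threads, dataPerThread is 2. Threads 0 and 1 get [0, 2) and [2, 4), thread 2 gets only [4, 5) and thread 3 gets the empty range [5, 5).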

Technique 2: Iterate With a Large Increment

Simply put, in each iteration each thread accesses the data that is a long distance away from its current data. How far away? A distance equal to the total number of threads. This iteration is very simple to write:

__global__ void fooKernel( const int* dataArray, int dataNum )
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Iterate over data
    for ( int idx = curThreadIdx; idx < dataNum; idx += totalThreadNum )
    {
        // Do something with data
        int val = dataArray[ idx ];
    }

    return;
}

The work allocation calculation and the bounds check of Technique 1 are both not needed. All the magic is in the details of the loop:

for (
    int idx = curThreadIdx; // Use thread index as beginning location
    idx < dataNum;          // Thread never goes outside the data
    idx += totalThreadNum   // Jump a long way
    )

You can easily see that these techniques lie at the ends of a spectrum of possible work allocation techniques. For example, a hybrid technique would handle chunks of size n and then increment by n * totalThreadNum. Look at your application closely and use what works best for you! :-)
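
The hybrid technique can be checked on the CPU before writing the kernel. Below is a minimal C++ sketch that simulates every thread in turn and counts how often each element is visited. The function name countHybridVisits and the parameter chunkSize (the n mentioned above) are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Simulates the hybrid work allocation on the CPU. Each simulated thread
// handles a chunk of chunkSize elements, then jumps ahead by
// chunkSize * totalThreadNum. Returns the number of visits per element.
std::vector<int> countHybridVisits( int dataNum, int totalThreadNum, int chunkSize )
{
    assert( dataNum >= 0 && totalThreadNum > 0 && chunkSize > 0 );

    std::vector<int> visitNum( dataNum, 0 );
    const int stride = chunkSize * totalThreadNum;

    for ( int curThreadIdx = 0; curThreadIdx < totalThreadNum; ++curThreadIdx )
    {
        // Outer loop: jump a long way, as in Technique 2
        for ( int chunkBegin = curThreadIdx * chunkSize; chunkBegin < dataNum; chunkBegin += stride )
        {
            // Inner loop: walk a contiguous chunk, as in Technique 1
            for ( int idx = chunkBegin; ( idx < chunkBegin + chunkSize ) && ( idx < dataNum ); ++idx )
                ++visitNum[ idx ];
        }
    }

    return visitNum;
}
```

Every element should be visited exactly once. With a chunkSize of 1 this reduces to Technique 2, and with a chunkSize equal to dataPerThread it reduces to Technique 1.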

CUDA: Thread Information

This is boilerplate code I tend to use in every CUDA kernel to calculate thread information:

__global__ void fooKernel()
{
    // Thread info
    const int blocksPerGrid   = gridDim.x;
    const int threadsPerBlock = blockDim.x;
    const int totalThreadNum  = blocksPerGrid * threadsPerBlock;
    const int curThreadIdx    = ( blockIdx.x * threadsPerBlock ) + threadIdx.x;

    // Rest of kernel
}

This is written for a kernel launched with a 1-dimensional grid and a 1-dimensional block. Adapting it for grid-blocks of different dimensions should be easy.
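
As one possible adaptation, here is a host-side C++ sketch of the index calculation for a 2-dimensional grid of 2-dimensional blocks. Vec3 is a hypothetical stand-in for the CUDA built-in types so that the arithmetic can be tried without a GPU, and flatThreadIdx2D is a name made up for this example. Inside a real kernel the same expressions would use the built-in gridDim, blockDim, blockIdx and threadIdx directly.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for the CUDA built-in vector types
struct Vec3
{
    unsigned int x, y, z;
};

// Flattens a 2D block index and a 2D thread index into one unique thread
// index in the range [0, total number of threads).
unsigned int flatThreadIdx2D( Vec3 gridDim, Vec3 blockDim, Vec3 blockIdx, Vec3 threadIdx )
{
    assert( blockIdx.x < gridDim.x && blockIdx.y < gridDim.y );
    assert( threadIdx.x < blockDim.x && threadIdx.y < blockDim.y );

    const unsigned int threadsPerBlock = blockDim.x * blockDim.y;

    // Row-major position of this block inside the grid
    const unsigned int curBlockIdx = ( blockIdx.y * gridDim.x ) + blockIdx.x;

    // Row-major position of this thread inside its block
    const unsigned int idxInBlock = ( threadIdx.y * blockDim.x ) + threadIdx.x;

    return ( curBlockIdx * threadsPerBlock ) + idxInBlock;
}
```

For a 2 x 2 grid of 4 x 2 blocks there are 32 threads in total. The last thread, at block (1, 1) and thread (3, 1), gets index 31.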

Visual Studio: Dependency for .cu Files

Problem

Visual Studio understands that if a C++ file includes a header file, then it means that the C++ file is dependent on that header file. That is, when the .h file is changed, Visual Studio recompiles the .cpp file.

Much like in C++, it is common in CUDA to spread the code across .cu and .h files. However, Visual Studio cannot understand the dependency between a .cu file and a .h file that it might include. Thus, when the .h file is changed, the .cu file is not recompiled!

Solution

The dependency between a given .cu file and the .h files it includes needs to be specified explicitly to Visual Studio. Consider the Visual Studio solution shown below:

Assume that Kernels/Foo.cu includes Kernels/FooKernel.h. To make the dependency between these files explicit:

  1. Right-click on Kernels/Foo.cu and choose Properties.
  2. In the Properties dialog, choose CUDA Runtime API → Source Dependencies.
  3. Add the files that Kernels/Foo.cu depends on. For us, this is Kernels/FooKernel.h. Make sure the path is correct, like I have done here by using Kernels/
  4. If there are multiple dependencies, separate them using semicolons (;). Separating them with a space, comma or anything else does not work, and the entries are silently ignored with no errors!

Foo.cu should now be recompiled whenever FooKernel.h is changed! :-)

Tried with: CUDA 3.2 and Visual Studio 2008

CUDA: dim3

dim3 is an integer vector type that can be used in CUDA code. Its most common application is to pass the grid and block dimensions in a kernel invocation. It can also be used in any user code for holding values of 3 dimensions.

For example:

dim3 grid( 512 );         // 512 x 1 x 1
dim3 block( 16, 16 );     // 16 x 16 x 1 (256 threads, within the per-block thread limit)
fooKernel<<< grid, block >>>();

The only facts to know about dim3 are:

  • dim3 is a simple structure that is defined in %CUDA_INC_PATH%/vector_types.h
  • dim3 has 3 elements x, y and z.
  • In C code, dim3 can be initialized as dim3 grid = { 512, 512, 1 };
  • In C++ code, dim3 can be initialized as dim3 grid( 512, 512, 1 );
  • In C++ code, not all 3 elements need to be provided. Any element not provided to the constructor is initialized to 1. Please note that they are initialized to 1, not 0!
  • dim3 can be converted to and from uint3, another similar CUDA data type.

dim3 is modeled after similar vector types that are available in shader languages like Cg, GLSL or HLSL. However, unlike them, dim3 is disappointingly simple and incapable of anything useful. It cannot be used directly in any arithmetic operations ( grid + block ) or in any sort of vector swizzling ( grid.xyz = block.zyx ). :-(
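
If such operations are needed, small helper functions can fill the gap. Here is a minimal C++ sketch using Dim3, a hypothetical host-side stand-in that mimics the defaults-to-1 behaviour described above. The helpers operator+, zyx and volume are made-up names for this example and are not part of CUDA.

```cpp
#include <cassert>

// Hypothetical stand-in for dim3: 3 unsigned elements that default to 1
struct Dim3
{
    unsigned int x, y, z;

    Dim3( unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1 )
        : x( vx ), y( vy ), z( vz )
    {
    }
};

// Component-wise addition, the equivalent of ( grid + block )
inline Dim3 operator+( const Dim3& a, const Dim3& b )
{
    return Dim3( a.x + b.x, a.y + b.y, a.z + b.z );
}

// Reversed copy, the equivalent of the swizzle block.zyx
inline Dim3 zyx( const Dim3& v )
{
    return Dim3( v.z, v.y, v.x );
}

// Total number of elements described, for example threads in a block
inline unsigned int volume( const Dim3& v )
{
    assert( v.x > 0 && v.y > 0 && v.z > 0 );
    return v.x * v.y * v.z;
}
```

For example, Dim3( 512 ) has y and z equal to 1, and volume( Dim3( 16, 16 ) ) is 256.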

 

Tried with: CUDA 3.2

CUDPP: LNK1181 Linker Error

Problem

I got an error while building CUDPP 1.1.1 with CUDA 3.2. The linker error occurs on building cudpp_vc90.sln:

LINK : fatal error LNK1181: cannot open input file 'cudart.lib'

Reason

NVIDIA is notorious for frequently changing their environment variables and their paths. I have commented on this before here and here. They have done it again with CUDA 3.2! :-| cudpp_vc90.vcproj looks for cudart.lib in the directory $(CUDA_LIB_PATH)/../lib. CUDA 3.2 has moved these files to $(CUDA_LIB_PATH).

Solution

In the CUDPP solution, change the additional library directory used by the linker from $(CUDA_LIB_PATH)/../lib to $(CUDA_LIB_PATH).

Note

I have reported this error to CUDPP here. A committer has merged the fix. But, given the fact that the last release was a year ago, it might be a while before this fix appears in a stable release.

Tried with: CUDPP 1.1.1, CUDA 3.2, Visual Studio 2008 and Windows 7 64-bit.