CUDA: Memory Usage

CUDA kernel calls or runtime memory allocation calls can sometimes fail due to insufficient memory. The real-time memory usage on the CUDA device can be tracked by using applications like GPU-Z.

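If you want to know the exact device memory usage at particular points in your program, the cudaMemGetInfo runtime call can be used. It returns the free and total device memory in bytes. Subtract the free amount from the total to get the memory currently in use. Note that this figure covers everything resident on the device, not only the allocations made by your own program: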

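#include <iostream>
#include <cuda_runtime.h>

size_t avail;
size_t total;
cudaMemGetInfo( &avail, &total );   // Both values are reported in bytes
size_t used = total - avail;
std::cout << "Device memory used: " << used << " bytes" << std::endl;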
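To compare usage at two points in a program, the arithmetic can be wrapped in small helpers. The following is a minimal host-side sketch, not part of the CUDA runtime: the names usedBytes, consumedBytes and toMiB are hypothetical, and the byte counts are assumed to be the free and total values reported by cudaMemGetInfo.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes in use, computed from the free and total
// byte counts that cudaMemGetInfo reports.
std::size_t usedBytes( std::size_t freeBytes, std::size_t totalBytes )
{
    assert( freeBytes <= totalBytes );
    return totalBytes - freeBytes;
}

// Hypothetical helper: bytes consumed between two snapshots of free memory.
// The result is negative if memory was released between the snapshots.
long long consumedBytes( std::size_t freeBefore, std::size_t freeAfter )
{
    return static_cast< long long >( freeBefore ) - static_cast< long long >( freeAfter );
}

// Whole mebibytes (rounded down), which are easier to read than raw bytes.
std::string toMiB( std::size_t bytes )
{
    return std::to_string( bytes / ( 1024 * 1024 ) ) + " MiB";
}
```

Taking one snapshot of the free memory before an allocation and one after it, and passing both to consumedBytes, gives a rough idea of what that allocation cost on the device.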

Tried with: CUDA 3.2

CUDA-Z

CUDA-Z is a useful utility to keep around while programming in CUDA. It is essentially a simple GUI display of the CUDA device information that you can get by running the deviceQuery and bandwidthTest projects that ship with the GPU Computing SDK. (For information on deviceQuery go here.)

CUDA-Z is pretty limited for usage statistics. It does not display the current GPU and GPU memory usage statistics like GPU-Z can.

Tried with: CUDA-Z 0.5.95

Firefox: Disable the NYTimes Article Popup

Problem

Scroll to the bottom of any article on The New York Times website and a popup window slides in at the bottom right. It shows one of many articles NYTimes wants you to read. I find such dynamic behaviour on long-form content webpages highly irritating.

Solution

The solution for this is similar to Disabling the NYTimes Lookup Word Feature. The JavaScript file that is invoked for this popup is: http://graphics8.nytimes.com/js/app/article/upNext.js

To stop this popup, add this filter to your Adblock Plus addon: http://*.nytimes.com/js/app/article/upNext.js

Tried with: Adblock Plus 1.3.3 and Firefox 4.0 on Windows 7

Firefox: Missing Subscribe to RSS

Problem

Firefox 4 has removed the Subscribe to RSS icon that was in Firefox 3.x. This icon would appear in the Location Bar when the webpage had RSS/Atom feeds. It was a useful way to see whether a webpage had feeds and, if it did, to subscribe to them.

Solution

Right-click in the Navigation Toolbar and choose Customize to get the Customize dialog. A Subscribe button is available in this dialog that provides the same functionality as the older icon did in Firefox 3.x. Drag the button to any location in the Navigation Toolbar and drop it there.

Tried with: Firefox 4.0 on Windows 7

Firefox: Disable the NYTimes Lookup Word Feature

Problem

The article webpages of The New York Times have a feature that I find particularly irritating. If I highlight some text on the webpage, it immediately pops up a question-mark (?) window offering to look up the words for me! This supposed feature also prevents me from right-clicking the highlighted text and performing any operation on it, like copying it or searching for it on Google.

Solution

By examining the HTML code of a NYTimes article webpage, the following JavaScript file is found to be the entry point for this Lookup Word feature: http://graphics8.nytimes.com/js/common/screen/altClickToSearch.js

With this information, disabling this feature is easy if you have the Adblock Plus addon installed on Firefox. Just add the following path to Adblock: http://*.nytimes.com/js/common/screen/altClickToSearch.js

Tried with: Adblock Plus 1.3.3 and Firefox 4.0 on Windows 7

Firefox: Disable Window Resize or Move

Some webpages resize or move the browser window. To prevent such behaviour, go to Tools > Options. In the Content tab, click on the Advanced button to the right of Enable JavaScript. In the dialog that comes up, disable the option Move or resize existing windows.

Alternatively, the same can be achieved in about:config by setting dom.disable_window_move_resize to true.

Tried with: Firefox 4.0 on Windows 7

Firefox: Open Popup Windows in New Tab

I hate it when clicking certain links pops open a new window. It is irritating and it also takes away the focus from the current window.

To stop this behaviour and open all popup windows in a new tab, go to about:config and set browser.link.open_newwindow.restriction to 0. More info on this setting can be found here.

Tried with: Firefox 4.0 on Windows 7

CUDA: Device Function in Header File

Template kernels are very useful to write generic kernels that can handle multiple data types. (For more on template kernels go here.) However, they come with one drawback: any device function that is called by a template kernel needs to be defined in a header file. It cannot be hidden away in a different CUDA compilation unit. This is because the CUDA compiler needs to have the flexibility to inline the device function (if necessary) and to do that the device function definition needs to be accessible in the same compilation unit as the template kernel that calls it.

Here is an example. The template kernel below calls a device function and the definition of the device function needs to be in the header file for successful compilation:

__device__ int getCurThreadIdx()
{
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}

template< typename T >
__global__ void fooKernel( const T* inArr, int num, T* outArr )
{
    const int threadNum = ( gridDim.x * blockDim.x );

    for ( int idx = getCurThreadIdx(); idx < num; idx += threadNum )
        outArr[ idx ] = inArr[ idx ];

    return;
}

One wishes that this were enough, but there seems to be an additional problem with CUDA. If the header file containing these functions is merely included into multiple CUDA compilation units, one faces a multiple definition error on the device function:

FooDevice.cu.obj : error LNK2005: "int __cdecl getCurThreadIdx(void)" (?getCurThreadIdx@@YAHXZ) already defined in Main.cu.obj

The only solution for this seems to be that the device function being called by a template kernel also needs to be explicitly inlined! The explicit inlining can be specified using the __forceinline__ qualifier on the __device__ function:

__forceinline__ __device__ int getCurThreadIdx()
{
    return ( ( blockIdx.x * blockDim.x ) + threadIdx.x );
}
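One possible way to package such a helper, so that the header stays safe to include from several compilation units, is sketched below. This is only an illustration under assumptions: FOO_INDEX_H, FOO_INLINE and globalThreadIdx are hypothetical names, and the indices are passed in as arguments so that the same header also compiles with a plain C++ compiler, where ordinary inline plays the role that __forceinline__ plays under nvcc.

```cpp
#ifndef FOO_INDEX_H
#define FOO_INDEX_H

// FOO_INLINE is a hypothetical macro for this sketch, not part of CUDA.
#ifdef __CUDACC__
    // Compiled by nvcc: force inlining and allow calls from host and device.
    #define FOO_INLINE __forceinline__ __host__ __device__
#else
    // Compiled by a plain C++ compiler: ordinary inline is enough to avoid
    // multiple-definition errors when several units include this header.
    #define FOO_INLINE inline
#endif

// Global index of a thread, given its block index, the block size and its
// index within the block. A kernel would pass blockIdx.x, blockDim.x and
// threadIdx.x here.
FOO_INLINE int globalThreadIdx( int blockId, int blockSize, int threadId )
{
    return ( blockId * blockSize ) + threadId;
}

#endif // FOO_INDEX_H
```

Because the function is inline in every unit that includes the header, the linker no longer sees competing definitions of it.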

I further investigated by checking the .ptx files when the device function is __forceinline__ and when it is not. As I guessed, the template kernel is inlining the device function anyway in both these cases. This is what I expected since the device function is very simple and should be inlined for optimization.

However, when the device function is not __forceinline__, the CUDA compiler seems to be generating the device function definition anyway, even though it is not being called by the template function! :-)

The problem with this behaviour by CUDA is that it means that only explicitly inlined device functions can be called from template kernels. This places severe restrictions on the functionality that can be achieved in a template kernel. Device functions that are long or computationally intensive are poor candidates for forced inlining, so they cannot easily be used with a template kernel. This defeats the whole point of the Fermi architecture, which was to allow true function calls, not mere inlining! :-|

Note: More discussion on this topic is at StackOverflow here and NVIDIA Forums here.

Tried with: CUDA 3.2

CUDA: Template Kernels

CUDA kernel functions can be made generic by writing them as template kernel functions. Doing this is very similar to writing template functions in C++.

Consider this kernel that merely writes the input integer array to the output integer array:

//////////////////////////////////////////
// FooDevice.cu
__global__ void fooKernel( const int* inArr, int* outArr, int num )
{
    const int curThreadIndex = ( blockIdx.x * blockDim.x ) + threadIdx.x;
    const int threadNum      = gridDim.x * blockDim.x;
    for ( int index = curThreadIndex; index < num; index += threadNum )
        outArr[ index ] = inArr[ index ];
    return;
}
//////////////////////////////////////////

//////////////////////////////////////////
// FooDevice.h
__global__ void fooKernel( const int*, int*, int );
//////////////////////////////////////////

Template kernels are a good solution to enable this kernel to deal with arrays of any type. Just like in C++, CUDA template kernels need to be defined in a header file. This is because the compiler generates the code for a function that is specialized for a given type. To be able to do this at compile time inside a compilation unit, the function definition needs to be completely visible.

Rewriting the above kernel as a template kernel in the header file:

//////////////////////////////////////////
// FooDevice.cu
// Nothing here
//////////////////////////////////////////

//////////////////////////////////////////
// FooDevice.h
template< typename T >
__global__ void fooKernel( const T* inArr, T* outArr, int num )
{
    const int curThreadIndex = ( blockIdx.x * blockDim.x ) + threadIdx.x;
    const int threadNum      = gridDim.x * blockDim.x;
    for ( int index = curThreadIndex; index < num; index += threadNum )
        outArr[ index ] = inArr[ index ];
    return;
}
//////////////////////////////////////////
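To see what the template buys, here is a host-only analogue of the kernel above that can be compiled without nvcc. It is a sketch under assumptions: fooThread and fooHost are hypothetical names, and the GPU threads are simulated by a plain loop. Each simulated thread runs the same strided loop as the kernel, and the same template is reused for different element types.

```cpp
#include <cassert>
#include <vector>

// Host-only stand-in for one thread of fooKernel: thread curThreadIndex
// out of threadNum copies every threadNum-th element, starting at its own
// index.
template< typename T >
void fooThread( const T* inArr, T* outArr, int num, int curThreadIndex, int threadNum )
{
    for ( int index = curThreadIndex; index < num; index += threadNum )
        outArr[ index ] = inArr[ index ];
}

// Runs all simulated threads one after another. On the device, the real
// kernel would instead be launched once, for example as
// fooKernel<<< blocks, threads >>>( inArr, outArr, num ), with T deduced
// from the argument types.
template< typename T >
std::vector< T > fooHost( const std::vector< T >& in, int threadNum )
{
    assert( threadNum > 0 );

    std::vector< T > out( in.size() );
    const int num = static_cast< int >( in.size() );

    for ( int t = 0; t < threadNum; ++t )
        fooThread( in.data(), out.data(), num, t, threadNum );

    return out;
}
```

Calling fooHost with a vector of int and again with a vector of double makes the compiler generate two specializations from the one definition, which is why that definition has to be visible in the header.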

Tried with: CUDA 3.2

CUDPP: Sorting a Structure Array

Problem

The cudppSort function from the CUDPP library can be used to sort an array of keys and an array of values along with it. However, the value types that are supported are only 32-bit. Thus, an array of structures cannot directly be sorted by using cudppSort.

Solution

One way to sort such a structure array is as follows:

  1. Create an index array. Fill it with indices of the array using a simple kernel.
  2. Use this index array as the value array and sort the key and value arrays using cudppSort.
  3. Pass the sorted index array to a simple kernel and use it to write the structures of the array to their sorted destinations.

This solution assumes that the structure array has an additional array of keys. More details and code on this method can be seen in my post on Compacting a Structure Array.
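The three steps above can be sketched on the host as follows. This is an illustration only, not CUDPP code: std::stable_sort stands in for the key-value sort that cudppSort would perform on the device, and Particle and sortByKey are hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical structure, too large to be used as a 32-bit sort value.
struct Particle
{
    float x, y, z;
    int   id;
};

std::vector< Particle > sortByKey( const std::vector< unsigned int >& keys,
                                   const std::vector< Particle >&     items )
{
    assert( keys.size() == items.size() );

    // Step 1: create the index array 0, 1, 2, ...
    std::vector< unsigned int > indices( items.size() );
    std::iota( indices.begin(), indices.end(), 0u );

    // Step 2: sort the indices by their keys. On the device this would be
    // the key-value sort, with the indices as the 32-bit values.
    std::stable_sort( indices.begin(), indices.end(),
        [ &keys ]( unsigned int a, unsigned int b ) { return keys[ a ] < keys[ b ]; } );

    // Step 3: use the sorted indices to write each structure to its
    // sorted destination.
    std::vector< Particle > sorted( items.size() );
    for ( std::size_t i = 0; i < indices.size(); ++i )
        sorted[ i ] = items[ indices[ i ] ];

    return sorted;
}
```

Only the small indices are moved during the sort itself, and each structure is copied exactly once in the final step.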

Tried with: CUDPP 1.1.1 and CUDA 3.2