CUDA Memory

CUDA Memory Allocation

When I first learned CUDA, I was introduced to cudaMallocManaged, which allocates unified (managed) memory that is accessible from both the CPU and the GPU.

CPU Version

int N = 2<<20;                  // 2<<20 == 2^21 elements (about 2 million)
size_t size = N * sizeof(int);
 
int *a;
a = (int *)malloc(size);        // Pageable host allocation
... // Do stuff
free(a);

GPU Version

int N = 2<<20;
size_t size = N * sizeof(int);
 
int *a;
cudaMallocManaged(&a, size);    // Unified memory, accessible from host and device
... // Do stuff
cudaFree(a);
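
In practice, “Do stuff” usually means launching a kernel and synchronizing before the host reads the data again. A minimal sketch, where the doubleElements kernel is a hypothetical example of mine, not from the original:

__global__ void doubleElements(int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] *= 2;
}
 
int N = 2<<20;
size_t size = N * sizeof(int);
 
int *a;
cudaMallocManaged(&a, size);
 
for (int i = 0; i < N; ++i) a[i] = i;   // Initialize on the host
 
int threads = 256;
int blocks = (N + threads - 1) / threads;
doubleElements<<<blocks, threads>>>(a, N);
cudaDeviceSynchronize();                // Required before the host touches the results
 
cudaFree(a);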

Manual Device Memory Management

However, it’s often better to manage memory manually, because it gives you explicit control over where data lives and when copies happen:

  • cudaMalloc allocates memory directly on the active GPU; the resulting pointer refers to device memory and cannot be dereferenced on the host
  • cudaMallocHost allocates pinned (page-locked) memory on the CPU, which is what makes fast, asynchronous transfers to and from the GPU possible
  • cudaMemcpy copies memory (the source stays intact, nothing is migrated), either from host to device or from device to host

Example

int *host_a, *device_a;
cudaMalloc(&device_a, size);    // Device (GPU) allocation
cudaMallocHost(&host_a, size);  // Pinned (page-locked) host allocation
 
cudaMemcpy(device_a, host_a, size, cudaMemcpyHostToDevice);
 
kernel<<<blocks, threads, 0, someStream>>>(device_a, N);
cudaStreamSynchronize(someStream); // Make sure the kernel is done before copying back
 
cudaMemcpy(host_a, device_a, size, cudaMemcpyDeviceToHost);
 
cudaFree(device_a);
cudaFreeHost(host_a); // Pinned memory is freed with cudaFreeHost, not free
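
One thing these snippets leave out: every CUDA runtime call returns a cudaError_t that is worth checking. A common pattern, sketched here with a macro name of my own choosing (CUDA_CHECK is not a library macro):

#include <cstdio>
#include <cstdlib>
 
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)
 
CUDA_CHECK(cudaMalloc(&device_a, size));
CUDA_CHECK(cudaMallocHost(&host_a, size));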

What is the point of cudaMallocHost when we have the new keyword?

In CUDA programming, when you’re optimizing for performance, you’d typically use cudaMallocHost to allocate host memory that you plan to transfer between the CPU and GPU frequently. Because pinned memory cannot be paged out by the OS, the GPU’s DMA engine can access it directly, so transfers run faster, and pinned memory is required for cudaMemcpyAsync to actually run asynchronously. For memory that you don’t intend to transfer, or for a typical C++ application that does not interface with a GPU, you would use new.
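
For instance, here is a minimal sketch of queuing transfers and kernel work on one stream, reusing size, N, blocks, threads, and kernel from the example above (the stream setup is my own addition):

int *host_a, *device_a;
cudaMallocHost(&host_a, size); // Pinned: required for async copies to truly overlap
cudaMalloc(&device_a, size);
 
cudaStream_t stream;
cudaStreamCreate(&stream);
 
// Each call returns immediately; the stream keeps them in order,
// and they can overlap with work submitted to other streams
cudaMemcpyAsync(device_a, host_a, size, cudaMemcpyHostToDevice, stream);
kernel<<<blocks, threads, 0, stream>>>(device_a, N);
cudaMemcpyAsync(host_a, device_a, size, cudaMemcpyDeviceToHost, stream);
 
cudaStreamSynchronize(stream); // Wait for copy, kernel, copy to finish
 
cudaStreamDestroy(stream);
cudaFree(device_a);
cudaFreeHost(host_a);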

The new keyword is also more flexible: it invokes constructors, so it can build C++ objects, which cudaMallocHost (a raw byte allocator) cannot do.
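
If you want both pinned memory and constructed objects, one option is placement new on top of cudaMallocHost. A hedged sketch, where the Particle type is hypothetical:

#include <new> // placement new
 
struct Particle {
    float x, y, z;
    Particle() : x(0), y(0), z(0) {}
};
 
Particle *p;
cudaMallocHost(&p, sizeof(Particle)); // Raw pinned bytes, no constructor runs
new (p) Particle();                   // Placement new: construct in the pinned buffer
// ... use p ...
p->~Particle();                       // Destroy explicitly before freeing
cudaFreeHost(p);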

Can you use cudaMemcpy if the memory was allocated using the new keyword?

Yes, you can use cudaMemcpy with memory allocated using the new keyword, but that memory is pageable: the driver first stages it through an internal pinned buffer before the DMA transfer, so the copy is slower than one from pinned memory, and cudaMemcpyAsync on pageable memory generally cannot overlap with other work.
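
For example, something like the following works, reusing device_a and N from the example above; it is just slower because of the internal staging copy:

int *a = new int[N]; // Pageable host memory from the C++ runtime
// ... fill a ...
cudaMemcpy(device_a, a, N * sizeof(int), cudaMemcpyHostToDevice); // Works, but staged through a pinned bounce buffer
delete[] a;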