CUDA Stream
In CUDA programming, a stream is a series of commands that execute in order. In CUDA applications, kernel executions, as well as some memory transfers, occur within CUDA streams. Until now, you have not interacted explicitly with CUDA streams, but your CUDA code has in fact been executing its kernels inside a stream called the default stream.
CUDA programmers can create and utilize non-default CUDA streams in addition to the default stream, and in doing so, perform multiple operations, such as executing multiple kernels, concurrently, in different streams. Using multiple streams can add an additional layer of parallelization to your accelerated applications, and offers many more opportunities for application optimization.
Resources
- https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
- https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DRIVER.html
  - The types CUstream and cudaStream_t are identical and may be used interchangeably.
  - The types CUevent and cudaEvent_t are identical and may be used interchangeably.
Rules Governing the Behavior of CUDA Streams
There are a few rules concerning the behavior of CUDA streams that should be learned in order to use them effectively:
- Operations within a given stream occur in order.
- Operations in different non-default streams are not guaranteed to execute in any specific order relative to each other.
- The default stream is blocking: it will wait for all other streams to complete before running, and it will block other streams from running until it completes.
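These rules can be seen in a minimal sketch. The kernel, its launch sizes, and the use of printf here are illustrative assumptions, and the default-stream behavior shown is the legacy (blocking) semantics described above:

```cuda
#include <cstdio>

__global__ void work(int id)
{
    // Trivial placeholder kernel; real work would go here.
    printf("kernel launched with id %d\n", id);
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Rule 2: launches in different non-default streams may overlap;
    // no ordering is guaranteed between s1 and s2.
    work<<<1, 1, 0, s1>>>(1);
    work<<<1, 1, 0, s2>>>(2);

    // Rule 3: a launch in the default stream waits for s1 and s2 to
    // complete, and blocks later launches in s1/s2 until it completes.
    work<<<1, 1>>>(0);

    // Rule 1: within a single stream, operations run in launch order.
    work<<<1, 1, 0, s1>>>(3);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```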
The Default Stream
Out of the box, if you use only the default stream, all GPU operations run serially.
Creating, Utilizing, and Destroying Non-Default CUDA Streams
The following code snippet demonstrates how to create, utilize, and destroy a non-default CUDA stream. You will note that to launch a CUDA kernel in a non-default CUDA stream, the stream must be passed as the optional 4th argument of the execution configuration. Until now you have utilized only the first 2 arguments of the execution configuration:
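A minimal sketch of this pattern (the kernel name someKernel and its launch sizes are illustrative assumptions) might look like:

```cuda
#include <cstdio>

__global__ void someKernel()
{
    printf("running in a non-default stream\n");
}

int main()
{
    cudaStream_t stream;               // CUDA streams are of type cudaStream_t.
    cudaStreamCreate(&stream);         // Note that a pointer is passed to cudaStreamCreate.

    // The stream is the optional 4th execution configuration argument.
    someKernel<<<1, 1, 0, stream>>>();

    cudaStreamSynchronize(stream);     // Wait for work in the stream to complete.
    cudaStreamDestroy(stream);         // Note that a value, not a pointer, is passed to cudaStreamDestroy.
    return 0;
}
```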
What is the 3rd argument?
This third argument is the number of bytes in shared memory (an advanced topic that will not be covered presently) to be dynamically allocated per block for this kernel launch.
The default number of bytes of shared memory dynamically allocated per block is 0.
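As a sketch of how this 3rd argument is used (the kernel name and sizes here are assumptions), a kernel can declare dynamic shared memory with extern __shared__ and receive its size from the launch:

```cuda
#include <cstdio>

__global__ void sumShared(const int *in, int *out, int n)
{
    // Sized at launch time by the 3rd execution configuration argument.
    extern __shared__ int tile[];

    int i = threadIdx.x;
    tile[i] = (i < n) ? in[i] : 0;
    __syncthreads();

    if (i == 0) {
        int s = 0;
        for (int j = 0; j < n; ++j) s += tile[j];
        *out = s;
    }
}

int main()
{
    const int n = 32;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    // 3rd argument: bytes of dynamic shared memory per block (default 0).
    sumShared<<<1, n, n * sizeof(int)>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("sum = %d\n", *out);  // 32

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```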