CUDA Best Practices
Tips
From https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations
High Priority: Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU.
- Peak theoretical bandwidth between device memory and the GPU: 898 GB/s (e.g., on an NVIDIA Tesla V100)
- Peak theoretical bandwidth between host memory and device memory: 16 GB/s on PCIe x16 Gen3
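Because device memory bandwidth is roughly 50x higher than the PCIe link, the structure below matters more than any individual kernel's speedup. A minimal sketch (the array `x` and the two trivial element-wise kernels `scale` and `offset` are illustrative assumptions, not from the guide): data is copied to the device once, both steps run as kernels, and only the final result comes back. Even if `offset` alone would be no faster on the GPU than on the CPU, keeping it on the device avoids an extra round trip over PCIe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

__global__ void offset(float *x, int n, float o) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += o;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);

    // One host-to-device transfer for the whole pipeline.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_x, n, 2.0f);
    // Intermediate data stays resident on the device:
    // no copy back to the host between kernels.
    offset<<<grid, block>>>(d_x, n, 1.0f);

    // One device-to-host transfer for the final result.
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    printf("h_x[0] = %f\n", h_x[0]);  // expect 3.0

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```

The anti-pattern this avoids is copying `x` back to the host after `scale`, running the `offset` step on the CPU, and copying it to the device again for later work: each extra round trip costs two PCIe transfers that can easily dwarf the kernel time.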