CUDA Nsight Systems (nsys)
nsys is the Nsight Systems command line tool. It ships with the CUDA toolkit and is a powerful tool for profiling accelerated applications.
Basic Command
Resources
Exercise: Profile an Application with nsys
nsys profile will generate a qdrep report file which can be used in a variety of ways. We use the --stats=true flag here to indicate that we would like summary statistics printed. There is quite a lot of information printed:
- Profile configuration details
- Report file(s) generation details
- CUDA API Statistics
- CUDA Kernel Statistics
- CUDA Memory Operation Statistics (time and size)
- OS Runtime API Statistics
In this lab you will primarily be using the three CUDA sections above (CUDA API, CUDA Kernel, and CUDA Memory Operation Statistics). In the next lab, you will open the generated report files in the Nsight Systems GUI for visual profiling.
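As a concrete illustration, here is a minimal sketch (an assumption, not this lab's starter code) of the kind of application you might profile with nsys profile --stats=true; the comments note which of the three statistics sections each call shows up in.

```cuda
// Minimal sketch (assumed example, not the lab's starter code).
// Compile and profile with, for example:
//   nvcc -o saxpy saxpy.cu
//   nsys profile --stats=true ./saxpy
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // shows up under CUDA Kernel Statistics
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));   // cudaMalloc: CUDA API Statistics
    cudaMalloc((void **)&y, n * sizeof(float));

    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Explicit copies show up under CUDA Memory Operation Statistics.
    cudaMemcpy(x, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
    cudaMemcpy(hy, y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(x); cudaFree(y);
    delete[] hx; delete[] hy;
    return 0;
}
```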
GUI
This is pretty cool. There are two things you can expand, and you don't want to expand the threads. You want to expand the kernels and check that the functions are running the way you expect (across multiple CUDA streams, for example).
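As an example, here is a hedged sketch (kernel name and sizes are assumptions) of launching the same kernel on several CUDA streams, which is the kind of pattern you would want to confirm visually by expanding the kernel rows in the timeline:

```cuda
// Hedged sketch: one kernel launched on several CUDA streams, so the
// Nsight Systems GUI timeline can show overlapping kernel rows per stream.
#include <cuda_runtime.h>

__global__ void busyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc((void **)&buf[s], n * sizeof(float));
        // Each launch goes to its own stream, so the work can overlap
        // instead of serializing on the default stream.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```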
Nsys Profile
nsys profile provides output describing UM behavior for the profiled application. In this exercise, you will make several modifications to a simple application, and make use of nsys profile after each change, to explore how UM data migration behaves.
In order to test your hypotheses, compile and profile your code using the code execution cells below (a sketch of this kind of experiment is given after this list). In the output of nsys profile --stats=true you should be looking for the following:
- Is there a CUDA Memory Operation Statistics section in the output?
- If there is, that means UM migrations occurred
- If so, does it indicate host to device (HtoD) or device to host (DtoH) migrations?
- When there are migrations, what does the output say about how many operations there were? Many small memory migration operations are a sign that on-demand page faulting is occurring, with a small migration happening each time there is a page fault in the requested location.
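For reference, here is a hedged reconstruction of an application consistent with the report below: the kernel signature deviceKernel(int*, int) comes from the report, while the kernel body, launch configuration, and the host-side init/verify loops are assumptions chosen to explain why roughly 128 MiB (131072 KiB) of Unified Memory migrations appear in each direction.

```cuda
// Hedged reconstruction; only the kernel signature is taken from the report.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void deviceKernel(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < N; i += stride)
        a[i] *= 2;                              // device touches every page
}

int main()
{
    const int N = 1 << 25;                      // 2^25 ints = 128 MiB = 131072 KiB
    int *a;
    cudaMallocManaged(&a, N * sizeof(int));     // Unified Memory allocation

    for (int i = 0; i < N; ++i) a[i] = i;       // host touches first -> HtoD when the kernel runs

    deviceKernel<<<256, 256>>>(a, N);
    cudaDeviceSynchronize();

    long long sum = 0;
    for (int i = 0; i < N; ++i) sum += a[i];    // host touches after -> DtoH migrations

    printf("%lld\n", sum);
    cudaFree(a);
    return 0;
}
```

Removing or re-adding the host init loop, the kernel launch, or the host read-back, and re-profiling after each change, is one way to test each hypothesis above.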
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
WARNING: The command line includes a target application therefore the CPU context-switch scope has been set to process-tree.
Collecting data...
Processing events...
Saving temporary "/tmp/nsys-report-25a0-9c88-9ba0-2b0d.qdstrm" file to disk...
Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-25a0-9c88-9ba0-2b0d.qdrep"
Exporting 2990 events: [==================================================100%]
Exported successfully to
/tmp/nsys-report-25a0-9c88-9ba0-2b0d.sqlite
CUDA API Statistics:
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
------- --------------- --------- ----------- --------- --------- ---------------------
76.2 231079594 1 231079594.0 231079594 231079594 cudaMallocManaged
20.7 62739750 1 62739750.0 62739750 62739750 cudaDeviceSynchronize
3.0 9119394 1 9119394.0 9119394 9119394 cudaFree
0.0 130941 1 130941.0 130941 130941 cudaLaunchKernel
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ---------- -------- -------- -----------------------
100.0 62737327 1 62737327.0 62737327 62737327 deviceKernel(int*, int)
CUDA Memory Operation Statistics (by time):
Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
------- --------------- ---------- ------- ------- ------- ---------------------------------
52.6 23556910 1101 21395.9 2207 170076 [CUDA Unified Memory memcpy HtoD]
47.4 21243678 768 27661.0 1631 165212 [CUDA Unified Memory memcpy DtoH]
CUDA Memory Operation Statistics (by size in KiB):
Total Operations Average Minimum Maximum Operation
---------- ---------- ------- ------- -------- ---------------------------------
131072.000 768 170.667 4.000 1020.000 [CUDA Unified Memory memcpy DtoH]
131072.000 1101 119.048 4.000 1020.000 [CUDA Unified Memory memcpy HtoD]
Operating System Runtime API Statistics:
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
------- --------------- --------- ---------- ------- --------- --------------------------
83.4 1114691473 59 18893075.8 55491 100136881 poll
9.2 123291205 51 2417474.6 13262 20664569 sem_timedwait
6.3 84353996 666 126657.7 1020 17482972 ioctl
0.9 11818059 92 128457.2 1264 8910102 mmap
0.1 1553954 82 18950.7 5125 38141 open64
0.0 203989 4 50997.3 35366 72846 pthread_create
0.0 171572 25 6862.9 1614 28292 fopen
0.0 168785 3 56261.7 53337 61366 fgets
0.0 92995 11 8454.1 4360 17604 write
0.0 44740 5 8948.0 1326 12412 pthread_rwlock_timedwrlock
0.0 38490 5 7698.0 3756 10110 open
0.0 37483 7 5354.7 3298 8327 munmap
0.0 35962 6 5993.7 1318 16299 fgetc
0.0 33454 18 1858.6 1097 5396 fclose
0.0 23634 12 1969.5 1017 3235 read
0.0 18607 2 9303.5 8056 10551 socket
0.0 16896 12 1408.0 1008 4647 fcntl
0.0 12527 1 12527.0 12527 12527 sem_wait
0.0 10224 1 10224.0 10224 10224 connect
0.0 9610 1 9610.0 9610 9610 pipe2
0.0 9417 2 4708.5 4321 5096 fread
0.0 9341 4 2335.3 1775 2909 mprotect
0.0 2756 1 2756.0 2756 2756 bind
0.0 2255 1 2255.0 2255 2255 listen
Report file moved to "/dli/task/report9.qdrep"
Report file moved to "/dli/task/report9.sqlite"
My understanding:
- With CUDA, there is this layer of abstraction where you can just access the same memory from CPU and GPU if you use cudaMallocManaged. However, if you go down a layer of abstraction, you will see that there is this concept of Unified Memory: every time you switch between CPU access and GPU access, the memory actually has to be migrated, but it gives you the “illusion” that you are accessing the same piece of memory. They show you this lower layer of abstraction so that you can write faster code (see the sketch after this list).
- Every time you see host to device (HtoD) or device to host (DtoH) migrations, that is the copying being done under the hood
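One way the lower-level view can lead to faster code (this goes beyond what the lab output above shows, but it is a standard Unified Memory technique; the kernel and sizes here are assumptions): prefetch the managed allocation to whichever processor is about to use it with cudaMemPrefetchAsync, so the profiler reports a few large migrations instead of many small fault-driven ones.

```cuda
// Hedged sketch: explicit prefetching with cudaMemPrefetchAsync.
// With prefetching, nsys typically shows a handful of large HtoD/DtoH
// memory operations instead of many small page-fault-driven ones.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleElements(int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] *= 2;
}

int main()
{
    const int N = 1 << 24;
    const size_t bytes = N * sizeof(int);
    int deviceId;
    cudaGetDevice(&deviceId);

    int *a;
    cudaMallocManaged(&a, bytes);
    for (int i = 0; i < N; ++i) a[i] = i;            // initialize on the host

    cudaMemPrefetchAsync(a, bytes, deviceId);        // move pages to the GPU up front
    doubleElements<<<(N + 255) / 256, 256>>>(a, N);

    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId); // move pages back to the host
    cudaDeviceSynchronize();

    long long sum = 0;
    for (int i = 0; i < N; ++i) sum += a[i];         // host reads without page faults

    printf("%lld\n", sum);
    cudaFree(a);
    return 0;
}
```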