Profiling¶
When testing a shader compiler and/or a shader under testing a profile of where the GPU tends to spend time can be generated with the umr “–profiler” command.
The command repeatedly issues SQ_CMD halt/resume commands to see where any waves end up halting. This results in a GPU lockup temporarily which allows umr to read the rings and find IBs and shaders. A ring is considered “halted” if the read and write pointers do not move for 500 uSeconds which typically is enough for most pixel and vertex shaders but may not be enough for compute tasks resulting in race conditions trying to read GPU virtual memory.
The command is:
--profiler [pixel= | vertex= | compute=]<nsamples> [ring]
Which will capture ‘nsamples’-many wave samples. Optionally, a ring can be specified to profile shaders stored in different rings. This defaults to the ‘gfx’ ring. Additionally, the type of shader can be selcted for as well to only profile a given type of shader.
The output then contains the sorted list of addresses and opcodes in descending order. For example,
Shader 1@0x100002000 (88 bytes): total hits: 64665
shader[0x100002000 + 0x0000] = 0xbefc0005 s_mov_b32 m0, s5 ( 7 hits, 0.0 %)
shader[0x100002000 + 0x0004] = 0xd4000008 v_interp_p1_f32_e32 v0, v8, attr0.x
shader[0x100002000 + 0x0008] = 0xd4040108 v_interp_p1_f32_e32 v1, v8, attr0.y ( 2 hits, 0.0 %)
shader[0x100002000 + 0x000c] = 0xbe800003 s_mov_b32 s0, s3 ( 1 hits, 0.0 %)
shader[0x100002000 + 0x0010] = 0xbe810080 s_mov_b32 s1, 0
shader[0x100002000 + 0x0014] = 0xd4010009 v_interp_p2_f32_e32 v0, v9, attr0.x ( 2 hits, 0.0 %)
shader[0x100002000 + 0x0018] = 0xd4050109 v_interp_p2_f32_e32 v1, v9, attr0.y
shader[0x100002000 + 0x001c] = 0xc00e0200 s_load_dwordx8 s[8:15], s[0:1], 0x200
shader[0x100002000 + 0x0020] = 0x00000200 ;;
shader[0x100002000 + 0x0024] = 0x7e001100 v_cvt_i32_f32_e32 v0, v0
shader[0x100002000 + 0x0028] = 0x7e021101 v_cvt_i32_f32_e32 v1, v1
shader[0x100002000 + 0x002c] = 0xbf8c007f s_waitcnt lgkmcnt(0) ( 5 hits, 0.0 %)
shader[0x100002000 + 0x0030] = 0xf0001f00 image_load v[0:3], v0, s[8:15] dmask:0xf unorm ( 4 hits, 0.0 %)
shader[0x100002000 + 0x0034] = 0x00020000 ;;
shader[0x100002000 + 0x0038] = 0xbf8c0f70 s_waitcnt vmcnt(0) ( 184 hits, 0.2 %)
shader[0x100002000 + 0x003c] = 0xd2960000 v_cvt_pkrtz_f16_f32 v0, v0, v1
shader[0x100002000 + 0x0040] = 0x00020300 ;;
shader[0x100002000 + 0x0044] = 0xd2960001 v_cvt_pkrtz_f16_f32 v1, v2, v3 ( 2 hits, 0.0 %)
shader[0x100002000 + 0x0048] = 0x00020702 ;;
shader[0x100002000 + 0x004c] = 0xc4001c0f exp mrt0 v0, v0, v1, v1 done compr vm (64450 hits, 99.6 %)
shader[0x100002000 + 0x0050] = 0x00000100 ;;
shader[0x100002000 + 0x0054] = 0xbf810000 s_endpgm ( 8 hits, 0.0 %)
Indicates that the opcode at VMID 1 offset 0x10000204c had waves halted there 64450 times (99.6% of all captured wave data). The other columns indicate the raw opcode data and the last columns are the LLVM disassembly of the opcode.
When testing a known shader this can be used to determine where the bulk of the processing time is spent.