1. Tools
nsight
- on Windows (Visual Studio) or Linux (Eclipse)
nvprof
- on the Linux command line
// on a GTX 1060
nvprof --metrics ipc,issued_ipc,achieved_occupancy,global_hit_rate,local_hit_rate,l2_tex_read_hit_rate,gld_transactions,gst_transactions,local_load_transactions,local_store_transactions,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,sysmem_read_transactions,sysmem_write_transactions ./wave
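The metric list in the command above is long enough to be error-prone when typed by hand. The following is a minimal sketch of assembling the same command from a list. `build_nvprof_command` is a hypothetical helper introduced here for illustration, not part of nvprof; it only uses the `--metrics` flag and the `./wave` binary that already appear above.

```python
import shlex

# Metrics from the GTX 1060 command above, ordered by memory level.
METRICS = [
    "ipc", "issued_ipc", "achieved_occupancy",
    "global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate",
    "gld_transactions", "gst_transactions",
    "local_load_transactions", "local_store_transactions",
    "l2_tex_read_transactions", "l2_tex_write_transactions",
    "l2_read_transactions", "l2_write_transactions",
    "dram_read_transactions", "dram_write_transactions",
    "sysmem_read_transactions", "sysmem_write_transactions",
]


def build_nvprof_command(binary, metrics=METRICS):
    """Return the argv list for `nvprof --metrics m1,m2,... binary`."""
    return ["nvprof", "--metrics", ",".join(metrics), binary]


if __name__ == "__main__":
    # Print a shell-ready line; the list can also be passed to subprocess.run.
    print(shlex.join(build_nvprof_command("./wave")))
```

Keeping the metrics in a list makes it easier to swap in the Fermi/Kepler names from section 2 when profiling on an older card.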
2. Metrics
2.1 Performance
ipc
- Instructions executed per cycle
issued_ipc
- Instructions issued per cycle
achieved_occupancy
- Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
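The achieved_occupancy definition above is a plain ratio, which a short worked example may make concrete. This is a sketch only: the function name is hypothetical, and the limit of 64 resident warps per multiprocessor is an assumption that holds for compute capability 6.1 parts such as the GTX 1060 but differs on other architectures.

```python
def achieved_occupancy(avg_active_warps, max_warps_per_sm):
    """Average active warps per active cycle divided by the SM's warp limit."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    return avg_active_warps / max_warps_per_sm


if __name__ == "__main__":
    # Assumption: 64 resident warps per SM (compute capability 6.1).
    # The value 48 is an illustrative input, not a measurement.
    print(achieved_occupancy(48, 64))  # prints 0.75
```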
Note: this article focuses on the data cache, so every mention of L1 Cache below refers to the L1 data cache.
2.2 Cache Hit Rate
L1 Cache
Fermi/Kepler (Capability 2.x/3.x)
l1_cache_global_hit_rate
- Hit rate in L1 cache for global loads
l1_cache_local_hit_rate
- Hit rate in L1 cache for local loads and stores
nc_cache_global_hit_rate
- only for Kepler
- Hit rate in non coherent cache for global loads
Maxwell/Pascal (Capability 5.x/6.x)
global_hit_rate
- Hit rate for global loads
local_hit_rate
- Hit rate for local loads and stores
L2 Cache
Fermi/Kepler (Capability 2.x/3.x)
l2_l1_read_hit_rate
- Hit rate at L2 cache for all read requests from L1 cache
l2_tex_read_hit_rate
- Hit rate at L2 cache for all read requests from texture cache
Maxwell/Pascal (Capability 5.x/6.x)
l2_tex_read_hit_rate
- Hit rate at L2 cache for all read requests from texture cache
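Because the hit-rate metric names differ between architecture generations, it can help to pick them from the compute capability. The sketch below encodes only the names listed in this section; `hit_rate_metrics` is a hypothetical helper, and the major version is assumed to be supplied by the caller (for example, 6 for a GTX 1060).

```python
def hit_rate_metrics(major):
    """Return the hit-rate metric names from section 2.2 for a compute capability major version."""
    if major in (2, 3):  # Fermi / Kepler
        metrics = [
            "l1_cache_global_hit_rate",
            "l1_cache_local_hit_rate",
            "l2_l1_read_hit_rate",
            "l2_tex_read_hit_rate",
        ]
        if major == 3:
            # The non coherent cache metric is listed as Kepler-only.
            metrics.append("nc_cache_global_hit_rate")
        return metrics
    if major in (5, 6):  # Maxwell / Pascal
        return ["global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate"]
    raise ValueError(f"compute capability {major}.x is not covered by these notes")


if __name__ == "__main__":
    print(",".join(hit_rate_metrics(6)))
```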
2.3 Transactions
L1 Cache
Global data
gld_transactions
- Number of global memory load transactions
gld_transactions_per_request
- Average number of global memory load transactions performed for each global memory load
gst_transactions
- Number of global memory store transactions
gst_transactions_per_request
- Average number of global memory store transactions performed for each global memory store
Local data
local_load_transactions
- Number of local memory load transactions
local_load_transactions_per_request
- Average number of local memory load transactions performed for each local memory load
local_store_transactions
- Number of local memory store transactions
local_store_transactions_per_request
- Average number of local memory store transactions performed for each local memory store
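The *_transactions_per_request metrics above are easiest to read with a simplified model in mind: one warp-wide load or store is a request, and it is served by as many transactions as there are distinct aligned memory segments the warp touches. The sketch below implements that model only. The function name is hypothetical, and the 32-byte segment size is an assumption; the real granularity depends on the architecture and cache configuration.

```python
WARP_SIZE = 32


def transactions_for_request(addresses, access_bytes=4, segment_bytes=32):
    """Count the distinct aligned segments touched by one warp-wide access."""
    segments = set()
    for addr in addresses:
        first = addr // segment_bytes
        last = (addr + access_bytes - 1) // segment_bytes
        segments.update(range(first, last + 1))
    return len(segments)


if __name__ == "__main__":
    # Each lane reads consecutive 4-byte words: 128 contiguous bytes in total.
    coalesced = [4 * lane for lane in range(WARP_SIZE)]
    # Each lane reads a 4-byte word 128 bytes away from its neighbour.
    strided = [128 * lane for lane in range(WARP_SIZE)]
    print(transactions_for_request(coalesced))  # prints 4
    print(transactions_for_request(strided))    # prints 32
```

Under this model a high transactions-per-request value points at scattered or strided access patterns rather than at the amount of data requested.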
L2 Cache
Fermi/Kepler (Capability 2.x/3.x)
l2_l1_read_transactions
- Memory read transactions seen at L2 cache for all read requests from L1 cache
l2_l1_write_transactions
- Memory write transactions seen at L2 cache for all write requests from L1 cache
Maxwell/Pascal (Capability 5.x/6.x)
l2_tex_read_transactions
- Memory read transactions seen at L2 cache for read requests from the texture cache
l2_tex_write_transactions
- Memory write transactions seen at L2 cache for write requests from the texture cache
Both
l2_read_transactions
- Memory read transactions seen at L2 cache for all read requests
l2_write_transactions
- Memory write transactions seen at L2 cache for all write requests
Only in Kepler
nc_l2_read_transactions
- Memory read transactions seen at L2 cache for non coherent global read requests
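Notes
- Since the Kepler architecture, the default L1 Cache policy for global data has been bypassing. Only on Fermi is the L1 Cache both readable and writable for global data, but it cannot maintain cache coherence.
- To guarantee cache coherence, NVIDIA took a fairly extreme approach: bypassing the L1 Cache. In the Maxwell and Pascal architectures the L1 Cache is additionally merged with the Tex Cache and made read-only, though in my opinion this does not work well. The latest architecture, Volta, returns to the Fermi design in which L1 Cache and Shared memory are configurable.
- Consequently, on Maxwell and Pascal we treat the tex cache as the L1 Data Cache.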
GDRAM
dram_read_transactions
- Device memory read transactions
dram_write_transactions
- Device memory write transactions
DRAM
sysmem_read_transactions
- System memory read transactions
sysmem_write_transactions
- System memory write transactions
Influenced by L2 Hit Rate
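Device memory traffic depends on how many requests miss in L2, so the L2 and DRAM read counters can be compared for a rough cross-check. The sketch below is a back-of-the-envelope estimate, not a definition taken from the profiler. It assumes that every transaction moves 32 bytes and that every DRAM read is caused by an L2 read miss, neither of which is guaranteed. Both function names are hypothetical.

```python
TRANSACTION_BYTES = 32  # assumption: bytes moved per L2/DRAM transaction


def traffic_bytes(transactions, transaction_bytes=TRANSACTION_BYTES):
    """Convert a transaction count into bytes under the assumed transaction size."""
    return transactions * transaction_bytes


def estimated_l2_read_hit_rate(l2_read_transactions, dram_read_transactions):
    """Rough estimate: the fraction of L2 read transactions not forwarded to DRAM."""
    if l2_read_transactions == 0:
        return None  # no L2 reads, so the ratio is undefined
    miss_ratio = min(dram_read_transactions / l2_read_transactions, 1.0)
    return 1.0 - miss_ratio


if __name__ == "__main__":
    # Illustrative counter values, not measurements.
    print(traffic_bytes(1000))                    # prints 32000
    print(estimated_l2_read_hit_rate(1000, 250))  # prints 0.75
```

If the estimate disagrees strongly with a reported L2 hit rate, the assumptions above are the first thing to question.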
Reference
NVIDIA Profiler User's Guide: http://docs.nvidia.com/cuda/profiler-users-guide/index.html