GPU – Chuanqiz’s blog

1. tools

Nsight: Visual Studio Edition on Windows, Eclipse Edition on Linux

nvprof: on the Linux command line

Example run on a GTX 1060:
nvprof --metrics ipc,issued_ipc,achieved_occupancy,global_hit_rate,local_hit_rate,l2_tex_read_hit_rate,gld_transactions,gst_transactions,local_load_transactions,local_store_transactions,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,sysmem_read_transactions,sysmem_write_transactions ./wave
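
The metric list in that command is long enough that typos are easy to make. Below is a minimal sketch, assuming Python 3 is available on the profiling machine, that assembles the same command from a list. `build_nvprof_command` is a hypothetical helper name introduced here for illustration; the only nvprof option used is the `--metrics` flag from the command above.

```python
import shlex

# Metrics copied from the command above (GTX 1060, Pascal).
METRICS = [
    "ipc", "issued_ipc", "achieved_occupancy",
    "global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate",
    "gld_transactions", "gst_transactions",
    "local_load_transactions", "local_store_transactions",
    "l2_tex_read_transactions", "l2_tex_write_transactions",
    "l2_read_transactions", "l2_write_transactions",
    "dram_read_transactions", "dram_write_transactions",
    "sysmem_read_transactions", "sysmem_write_transactions",
]


def build_nvprof_command(metrics, app):
    """Return the nvprof argument list for the given metrics and application."""
    return ["nvprof", "--metrics", ",".join(metrics), app]


if __name__ == "__main__":
    # Print the command, shell-quoted, so it can be pasted into a terminal.
    command = build_nvprof_command(METRICS, "./wave")
    print(" ".join(shlex.quote(arg) for arg in command))
```

Returning a list rather than a single string keeps the result usable with `subprocess.run` without extra quoting.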

2. Metrics

2.1 Performance

ipc
- Instructions executed per cycle
issued_ipc
- Instructions issued per cycle
achieved_occupancy
- Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
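
The definitions above are simple ratios. The following sketch spells out the arithmetic; the counter values are hypothetical and chosen only for illustration, and the function names are not part of nvprof. The limit of 64 warps per multiprocessor is the value for compute capability 6.1 (for example the GTX 1060).

```python
def ipc(instructions_executed, active_cycles):
    """Instructions executed per cycle."""
    return instructions_executed / active_cycles


def achieved_occupancy(avg_active_warps, max_warps_per_sm):
    """Average active warps per active cycle divided by the SM's warp limit."""
    return avg_active_warps / max_warps_per_sm


if __name__ == "__main__":
    # Hypothetical counter values, not measurements.
    print(ipc(1_500_000, 1_000_000))      # 1.5
    # 48 active warps on average, out of at most 64 per SM.
    print(achieved_occupancy(48.0, 64))   # 0.75
```

`issued_ipc` follows the same pattern as `ipc`, with issued instructions in place of executed instructions.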

Note: this post focuses on the data cache, so every L1 cache mentioned below refers to the L1 data cache.

2.2 Cache Hit Rate

`L1 Cache`

Fermi/Kepler (Capability 2.x/3.x)

l1_cache_global_hit_rate
- Hit rate in L1 cache for global loads
l1_cache_local_hit_rate
- Hit rate in L1 cache for local loads and stores
nc_cache_global_hit_rate
- Kepler only
- Hit rate in the non-coherent cache for global loads

Maxwell/Pascal (Capability 5.x/6.x)

global_hit_rate
- Hit rate for global loads
local_hit_rate
- Hit rate for local loads and stores

`L2 Cache`

Fermi/Kepler (Capability 2.x/3.x)

l2_l1_read_hit_rate
- Hit rate at L2 cache for all read requests from L1 cache
l2_tex_read_hit_rate
- Hit rate at L2 cache for all read requests from texture cache

Maxwell/Pascal (Capability 5.x/6.x)

l2_tex_read_hit_rate
- Hit rate at L2 cache for all read requests from texture cache
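
Because the hit-rate metric names change between architectures, it can help to keep them in one lookup table. The sketch below records the names exactly as listed in this section, keyed by compute capability major version; `hit_rate_metrics` is a hypothetical helper, and the table makes no claim about architectures not covered here.

```python
# Hit-rate metric names exactly as listed in this section,
# keyed by compute capability major version.
_FERMI = [
    "l1_cache_global_hit_rate", "l1_cache_local_hit_rate",
    "l2_l1_read_hit_rate", "l2_tex_read_hit_rate",
]
_KEPLER = _FERMI + ["nc_cache_global_hit_rate"]  # non-coherent cache: Kepler only
_MAXWELL_PASCAL = ["global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate"]

HIT_RATE_METRICS = {
    2: _FERMI,
    3: _KEPLER,
    5: _MAXWELL_PASCAL,
    6: _MAXWELL_PASCAL,
}


def hit_rate_metrics(cc_major):
    """Return a comma-separated metric list suitable for nvprof --metrics."""
    if cc_major not in HIT_RATE_METRICS:
        raise ValueError(f"no metric list recorded for compute capability {cc_major}.x")
    return ",".join(HIT_RATE_METRICS[cc_major])


if __name__ == "__main__":
    print(hit_rate_metrics(6))  # global_hit_rate,local_hit_rate,l2_tex_read_hit_rate
```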

2.3 Transactions

`L1 Cache`

Global data

gld_transactions
- Number of global memory load transactions
gld_transactions_per_request
- Average number of global memory load transactions performed for each global memory load
gst_transactions
- Number of global memory store transactions
gst_transactions_per_request
- Average number of global memory store transactions performed for each global memory store

Local data

local_load_transactions
- Number of local memory load transactions
local_load_transactions_per_request
- Average number of local memory load transactions performed for each local memory load
local_store_transactions
- Number of local memory store transactions
local_store_transactions_per_request
- Average number of local memory store transactions performed for each local memory store
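
The `*_transactions_per_request` metrics are easiest to read with a small model of how one warp-wide access splits into transactions. The sketch below is a simplification, not the hardware's exact behaviour: it assumes 32 threads per warp, 32-byte aligned segments, and elements that never straddle a segment boundary. All names are hypothetical.

```python
WARP_SIZE = 32


def warp_addresses(base, stride_bytes):
    """Byte address accessed by each of the 32 lanes of one warp."""
    return [base + lane * stride_bytes for lane in range(WARP_SIZE)]


def transactions_for_warp(byte_addresses, segment_bytes=32):
    """Count the distinct aligned segments touched by one warp-wide access.

    Assumes each accessed element lies entirely inside one segment.
    """
    return len({addr // segment_bytes for addr in byte_addresses})


if __name__ == "__main__":
    # Contiguous 4-byte elements: 128 bytes in total, so 4 segments.
    print(transactions_for_warp(warp_addresses(0, 4)))    # 4
    # One element every 128 bytes: each lane hits its own segment.
    print(transactions_for_warp(warp_addresses(0, 128)))  # 32
```

Under this model a well-coalesced kernel shows a low transactions-per-request value, while a strided access pattern pushes it toward one transaction per thread.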

`L2 Cache`

Fermi/Kepler (Capability 2.x/3.x)

l2_l1_read_transactions
- Memory read transactions seen at L2 cache for all read requests from L1 cache
l2_l1_write_transactions
- Memory write transactions seen at L2 cache for all write requests from L1 cache

Maxwell/Pascal (Capability 5.x/6.x)

l2_tex_read_transactions
- Memory read transactions seen at L2 cache for read requests from the texture cache
l2_tex_write_transactions
- Memory write transactions seen at L2 cache for write requests from the texture cache

Both

l2_read_transactions
- Memory read transactions seen at L2 cache for all read requests
l2_write_transactions
- Memory write transactions seen at L2 cache for all write requests

Kepler only

nc_l2_read_transactions
- Memory read transactions seen at L2 cache for non-coherent global read requests

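Notes

Since Kepler, the L1 cache bypasses global data by default. Only on Fermi could global data be both read and written through the L1 cache, but Fermi's L1 cache could not maintain cache coherence.
To guarantee cache coherence, NVIDIA took a fairly drastic approach: bypass the L1 cache for global data. On Maxwell and Pascal the L1 cache is merged with the Tex cache and made read-only, though in my view this does not work well. The latest architecture, Volta, returns to the Fermi design, in which the split between L1 cache and shared memory is configurable.

So on Maxwell and Pascal, we treat the tex cache as the L1 data cache.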

`GDRAM`

dram_read_transactions
- Device memory read transactions
dram_write_transactions
- Device memory write transactions

`DRAM`

sysmem_read_transactions
- System memory read transactions
sysmem_write_transactions
- System memory write transactions

Influenced by the L2 hit rate.
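
One way to read the dependence on the L2 hit rate is a first-order model in which every L2 read miss becomes one device memory read. This is an assumption made for illustration only: it ignores write-backs, system memory traffic, and any difference in transaction size between levels. The function name is hypothetical, and the hit rate is passed as a fraction between 0 and 1 rather than as a percentage.

```python
def expected_dram_reads(l2_read_transactions, l2_read_hit_rate):
    """First-order estimate: each L2 read miss turns into one DRAM read."""
    if not 0.0 <= l2_read_hit_rate <= 1.0:
        raise ValueError("hit rate must be a fraction between 0 and 1")
    return l2_read_transactions * (1.0 - l2_read_hit_rate)


if __name__ == "__main__":
    # Hypothetical counters: 1,000,000 L2 reads at a 75% hit rate.
    print(expected_dram_reads(1_000_000, 0.75))  # 250000.0
```

Comparing this estimate with the measured `dram_read_transactions` gives a rough check on how far a kernel departs from the simple model.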

Reference

Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#ixzz4t4vGKod8
