Chuanqiz’s blog

标签: GPU

1. tools

  • nsight in WIN(vs) or Linux (eclipse)
  • nvprof in linux cmd line
    
    //in gtx1060 
    nvprof --metrics ipc,issued_ipc,achieved_occupancy,global_hit_rate,local_hit_rate,l2_tex_read_hit_rate,gld_transactions,gst_transactions,local_load_transactions,local_store_transactions,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,sysmem_read_transactions,sysmem_write_transactions ./wave
    

2. 度量标准 metrics

2.1 Performance

  • ipc
    • Instructions executed per cycle
  • issued_ipc
    • Instructions issued per cycle
  • achieved_occupancy
    • Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

说明:本文研究点在 Data Cache,那么一下的提到的L1 Cache 都为 Data Cache

2.2 Cache Hit Rate

L1 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l1_cache_global_hit_rate
    • Hit rate in L1 cache for global loads
  • l1_cache_local_hit_rate
    • Hit rate in L1 cache for local loads and stores
  • nc_cache_global_hit_rate
    • only for Kepler
    • Hit rate in non coherent cache for global loads

Maxwell/Pascal(Capability 5.x/6.x)

  • global_hit_rate
    • Hit rate for global loads
  • local_hit_rate
    • Hit rate for local loads and stores

L2 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l2_l1_read_hit_rate
    • Hit rate at L2 cache for all read requests from L1 cache
  • l2_tex_read_hit_rate
    • Hit rate at L2 cache for all read requests from texture cache

Maxwell/Pascal(Capability 5.x/6.x)

  • l2_tex_read_hit_rate
    • Hit rate at L2 cache for all read requests from texture cache

2.3 Transactions

L1 Cache

Global data

  • gld_transactions
    • Number of global memory load transactions
  • gld_transactions_per_request
    • Average number of global memory load transactions performed for each global memory load
  • gst_transactions
    • Number of global memory store transactions
  • gst_transactions_per_request
    • Average number of global memory store transactions performed for each global memory store

Local data

  • local_load_transactions
    • Number of local memory load transactions
  • local_load_transactions_per_request
    • Average number of local memory load transactions performed for each local memory load
  • local_store_transactions
    • Number of local memory store transactions
  • local_store_transactions_per_request
    • Average number of local memory store transactions performed for each local memory store

L2 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l2_l1_read_transactions
    • Memory read transactions seen at L2 cache for all read requests from L1 cache
  • l2_l1_write_transactions
    • Memory write transactions seen at L2 cache for all write requests from L1 cache

Maxwell/Pascal(Capability 5.x/6.x)

  • l2_tex_read_transactions
    • Memory read transactions seen at L2 cache for read requests from the texture cache
  • l2_tex_write_transactions

    Both

  • l2_read_transactions
    • Memory read transactions seen at L2 cache for all read requests
  • l2_write_transactions
    • Memory write transactions seen at L2 cache for all write requests

Only in Kepler

  • nc_l2_read_transactions
    • Memory read transactions seen at L2 cache for non coherent global read requests

备注

  • Kepler架构以来,L1 Cacheglobal data 的默认策略是 bypassing ,只有Fermi架构L1 Cache对 global data 是既可读又可写的,但是不能保持cache coherence
  • 那么为了保证 cache coherence,nvidia 采取了较为极端的做法,那就是bypassing L1 Cache ,并且在MaxwellPascal 架构中,与Tex Cache 合并,设置为 Read Only , 但我认为其效果并不佳。最新架构volta又将其架构改为 FermiL1 CacheShared memory 可配置的模式。
  • 可知,在MaxwellPascal 架构中,我们就将 tex cache 看成 L1 Data Cache

GDRAM

  • dram_read_transactions
    • Device memory read transactions
  • dram_write_transactions
    • Device memory write transactions

DRAM

  • sysmem_read_transactions
    • System memory read transactions
  • sysmem_write_transactions
    • System memory write transactions

Influence by L2 Hit Rate

Reference

Read more at: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#ixzz4t4vGKod8
Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

Zero copy in TK1 and TX1 and TX2

Zero copy in TK1and TX1 and TX2 t…

concurrent kernel and dynamic parallelism

concurrent kernel and dynamic par…

PTX ISA special-registers

PTX ISA Special Registers 综述 PTX …

GPGPU-Sim ispass2009 编译问题0

GPGPU-Sim ispass2009 编译问题0 最早接触GP…

CUDA Programming 之 Launch Bounds

Launch Bounds 1.概述 As discussed i…

GPGPU-Sim Notes —— Cache

GPGPU-Sim笔记整理 cache gpgpu-cache.c…

PTX ISA 之 Control Flow Instructions

Control Flow Instructions The fol…

Jetson TX2 安装 JetPack 3.0 小记 与 申请教育版链接

前言 本文主要参考YouTube视频,《JetPack 3.0 &…

CUDA-MEMCHECK

CUDA-MEMCHECK 1.Introduction CUDA…

旧的 文章 »
Page 1 of 2