Chuanqiz’s blog

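Category: GPGPU

NVCC step-by-step compilation

First, one large figure: the compilation flow diagram from NVIDIA's official documentation, which fully explains how to go from…

NVIDIA Ampere Architecture analysis

Official blog: https://devblogs.nvidia.com/…

Formulas for computing the compute throughput of NVIDIA edge SoCs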

Reference https://www.nvidia.com/…

Learning TensorRT

https://mp.weixin.qq.com/s/F_VvLT…

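GPGPU-Sim 1: Environment setup

GPGPU-Sim 1: Environment setup. In the previous post we briefly introduced GPGP…

GPGPU-Sim 0: Overview

GPGPU-Sim Overview. A few words up front: back in the day, as…

Jetson TX2 performance-mode tool nvpmodel

Jetson TX2 performance-mode tool nvpmodel [TOC] J…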

1. Tools

  • Nsight on Windows (Visual Studio) or Linux (Eclipse)
  • nvprof on the Linux command line
    
    # on a GTX 1060
    nvprof --metrics ipc,issued_ipc,achieved_occupancy,global_hit_rate,local_hit_rate,l2_tex_read_hit_rate,gld_transactions,gst_transactions,local_load_transactions,local_store_transactions,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,sysmem_read_transactions,sysmem_write_transactions ./wave
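
The metric list in that command is long enough to be error-prone when typed by hand. A minimal sketch of generating it from grouped lists follows; the helper name `build_nvprof_cmd` and the group names are my own (hypothetical), while the metric names and the `--metrics` flag are taken from the command above.

```python
import shlex

# Metric names copied from the nvprof command above (Pascal, GTX 1060).
# The grouping is only for readability and is an assumption of this sketch.
METRICS = {
    "performance": ["ipc", "issued_ipc", "achieved_occupancy"],
    "hit_rate": ["global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate"],
    "l1": ["gld_transactions", "gst_transactions",
           "local_load_transactions", "local_store_transactions"],
    "l2": ["l2_tex_read_transactions", "l2_tex_write_transactions",
           "l2_read_transactions", "l2_write_transactions"],
    "memory": ["dram_read_transactions", "dram_write_transactions",
               "sysmem_read_transactions", "sysmem_write_transactions"],
}


def build_nvprof_cmd(app, groups=METRICS):
    """Return the nvprof argument list for `app` with all metrics joined by commas."""
    names = [name for group in groups.values() for name in group]
    return ["nvprof", "--metrics", ",".join(names), app]


if __name__ == "__main__":
    # Print a shell-ready command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_nvprof_cmd("./wave")))
```

Keeping the groups separate makes it easy to swap in the Fermi/Kepler metric names when profiling an older card.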
    

2. Metrics

2.1 Performance

  • ipc
    • Instructions executed per cycle
  • issued_ipc
    • Instructions issued per cycle
  • achieved_occupancy
    • Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
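
The definition of achieved_occupancy can be written out as a one-line ratio. This is only a sketch: the function name is hypothetical, and the default of 64 warps per multiprocessor is my assumption for a Pascal card such as the GTX 1060 (2048 resident threads / 32 threads per warp); check the limit for your own device.

```python
def achieved_occupancy(avg_active_warps, max_warps_per_sm=64):
    """Ratio of average active warps per active cycle to the SM's warp limit.

    max_warps_per_sm=64 is an assumption for compute capability 6.x.
    """
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    return avg_active_warps / max_warps_per_sm


if __name__ == "__main__":
    # 32 warps active on average out of a possible 64
    print(achieved_occupancy(32))  # prints 0.5
```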

Note: this post focuses on the data cache, so every "L1 Cache" mentioned below means the L1 data cache.

2.2 Cache Hit Rate

L1 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l1_cache_global_hit_rate
    • Hit rate in L1 cache for global loads
  • l1_cache_local_hit_rate
    • Hit rate in L1 cache for local loads and stores
  • nc_cache_global_hit_rate
    • only for Kepler
    • Hit rate in non coherent cache for global loads

Maxwell/Pascal (Capability 5.x/6.x)

  • global_hit_rate
    • Hit rate for global loads
  • local_hit_rate
    • Hit rate for local loads and stores

L2 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l2_l1_read_hit_rate
    • Hit rate at L2 cache for all read requests from L1 cache
  • l2_tex_read_hit_rate
    • Hit rate at L2 cache for all read requests from texture cache

Maxwell/Pascal (Capability 5.x/6.x)

  • l2_tex_read_hit_rate
    • Hit rate at L2 cache for all read requests from texture cache
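
Since the hit-rate metric names differ by architecture, a small lookup keyed on the compute capability major version can pick the right set. This is a sketch built only from the lists above; the function name `hit_rate_metrics` is hypothetical, and architectures not covered in this post are rejected rather than guessed.

```python
def hit_rate_metrics(cc_major):
    """Return the cache hit-rate metric names for a compute capability major version."""
    if cc_major in (2, 3):  # Fermi / Kepler
        names = ["l1_cache_global_hit_rate", "l1_cache_local_hit_rate"]
        if cc_major == 3:
            # The non-coherent cache metric is listed for Kepler only.
            names.append("nc_cache_global_hit_rate")
        return names + ["l2_l1_read_hit_rate", "l2_tex_read_hit_rate"]
    if cc_major in (5, 6):  # Maxwell / Pascal
        return ["global_hit_rate", "local_hit_rate", "l2_tex_read_hit_rate"]
    raise ValueError(f"compute capability {cc_major}.x is not covered here")


if __name__ == "__main__":
    print(",".join(hit_rate_metrics(6)))
```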

2.3 Transactions

L1 Cache

Global data

  • gld_transactions
    • Number of global memory load transactions
  • gld_transactions_per_request
    • Average number of global memory load transactions performed for each global memory load
  • gst_transactions
    • Number of global memory store transactions
  • gst_transactions_per_request
    • Average number of global memory store transactions performed for each global memory store

Local data

  • local_load_transactions
    • Number of local memory load transactions
  • local_load_transactions_per_request
    • Average number of local memory load transactions performed for each local memory load
  • local_store_transactions
    • Number of local memory store transactions
  • local_store_transactions_per_request
    • Average number of local memory store transactions performed for each local memory store
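
To get a feel for what a "transactions per request" value means, the sketch below counts how many memory segments one warp-wide load touches for a given stride. It is a simplified model, not the profiler's actual accounting: the 32-byte segment size, the 32-thread warp, the 4-byte element, and the function name are all my assumptions, and real hardware behaviour depends on the architecture and the cache path taken.

```python
def transactions_per_request(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=32):
    """Count distinct segments touched when each lane loads one element.

    Lane i reads elem_bytes bytes starting at byte i * stride_elems * elem_bytes.
    segment_bytes=32 is an assumed transaction granularity.
    """
    segments = set()
    for lane in range(warp_size):
        start = lane * stride_elems * elem_bytes
        end = start + elem_bytes - 1
        # An element may straddle a segment boundary, so add every segment it covers.
        segments.update(range(start // segment_bytes, end // segment_bytes + 1))
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 8):
        print(stride, transactions_per_request(stride))
    # prints:
    # 1 4
    # 2 8
    # 8 32
```

Under this model a fully coalesced float load costs 4 transactions per request, while a stride of 8 floats costs 32, one per thread.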

L2 Cache

Fermi/Kepler (Capability 2.x/3.x)

  • l2_l1_read_transactions
    • Memory read transactions seen at L2 cache for all read requests from L1 cache
  • l2_l1_write_transactions
    • Memory write transactions seen at L2 cache for all write requests from L1 cache

Maxwell/Pascal (Capability 5.x/6.x)

  • l2_tex_read_transactions
    • Memory read transactions seen at L2 cache for read requests from the texture cache
  • l2_tex_write_transactions
    • Memory write transactions seen at L2 cache for write requests from the texture cache

Both

  • l2_read_transactions
    • Memory read transactions seen at L2 cache for all read requests
  • l2_write_transactions
    • Memory write transactions seen at L2 cache for all write requests

Only in Kepler

  • nc_l2_read_transactions
    • Memory read transactions seen at L2 cache for non coherent global read requests

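Notes

  • Since Kepler, the default L1 cache policy for global data has been bypassing. Only on Fermi is the L1 cache both readable and writable for global data, but it cannot maintain cache coherence.
  • To guarantee cache coherence, NVIDIA took a rather extreme approach: bypassing the L1 cache. In Maxwell/Pascal the L1 cache is also merged with the texture cache and made read-only, although in my opinion this does not work well. The latest architecture, Volta, returns to a Fermi-like design in which L1 cache and shared memory are configurable.
  • It follows that on Maxwell/Pascal we treat the tex cache as the L1 data cache.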

GDRAM (device memory)

  • dram_read_transactions
    • Device memory read transactions
  • dram_write_transactions
    • Device memory write transactions

DRAM (system memory)

  • sysmem_read_transactions
    • System memory read transactions
  • sysmem_write_transactions
    • System memory write transactions

Influence of the L2 Hit Rate
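
Device-memory traffic depends on how often requests miss in L2. As a rough cross-check against the reported hit rates, the sketch below estimates an L2 read hit rate from two transaction counters. It assumes that every L2 read miss produces exactly one DRAM read transaction of the same size, which is my simplification and not something the profiler guarantees; the function name is hypothetical.

```python
def estimated_l2_read_hit_rate(l2_read_transactions, dram_read_transactions):
    """Estimate the L2 read hit rate as 1 - (DRAM reads / L2 reads).

    Assumption: one DRAM read transaction per L2 read miss, equal transaction sizes.
    """
    if l2_read_transactions <= 0:
        raise ValueError("l2_read_transactions must be positive")
    # Clamp so that extra DRAM traffic cannot push the estimate below zero.
    misses = min(dram_read_transactions, l2_read_transactions)
    return 1.0 - misses / l2_read_transactions


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(estimated_l2_read_hit_rate(1000, 250))  # prints 0.75
```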

Reference

http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Jetson TX1 or TX2: configuring package sources and setting up a remote desktop with Ubuntu xfce4

To use Ubuntu's remote desktop I tried many approaches; only using xfce…

Zero copy in TK1 and TX1 and TX2

Zero copy in TK1 and TX1 and TX2 t…
