Technische Universität München
Memory-Bound Performance
CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering
Michael Bader
July 8–19, 2013
Michael Bader: Memory-Bound Performance Computer Simulations in Science and Engineering, July 8–19, 2013
Part I Parallel Architectures – A Memory Perspective
The Memory Gap
Graham et al., 2005:
• annual increase of CPU performance (“Flop/s”): 59 %
• annual increase of memory bandwidth: ∼25 %
• annual improvement of memory latency: ∼5 %
• memory is the bottleneck (already, and getting worse)
• remedy: introduce cache memory!
Multicore CPUs – Intel’s Nehalem Architecture (from 2008/2009)
• quad-core CPU with shared and private caches
• simultaneous multi-threading: 8 threads on 4 cores
• memory architecture: QuickPath Interconnect (replaced the Front Side Bus)
Multicore CPUs – Intel’s Nehalem Architecture (2) (from 2008/2009)
• NUMA (non-uniform memory access) architecture: each CPU has “private” (local) main memory, but can also access “remote” memory (at higher latency)
• max. 25 GB/s bandwidth
(source: Intel – Nehalem Whitepaper)
GPGPU – NVIDIA Fermi (from 2009/2010)
From the NVIDIA Fermi Whitepaper:
“The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.”
[Figure: Fermi’s 16 SMs are positioned around a common L2 cache; each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).]
(source: NVIDIA – Fermi Whitepaper)
GPGPU – NVIDIA Fermi (3) (from 2009/2010)
General Purpose Graphics Processing Unit:
• 512 CUDA cores
• improved double precision performance
• shared vs. global memory
• new: L1 and L2 cache (768 KB)
• trend from GPU towards CPU?
Memory Subsystem Innovations – NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache (from the Fermi Whitepaper):
“Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior.
Adding a true cache hierarchy for load / store operations presented significant challenges. Traditional GPU architectures support a read-only ‘load’ path for texture operations and a write-only ‘export’ path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read after write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write / ‘export’ path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data. The Fermi architecture addresses this challenge by implementing a single unified memory …”
Manycore CPU – Intel Knights Ferry (from 2011)
Intel® MIC Architecture – An Intel Co-Processor Architecture
[Figure: many vector IA cores, each with a coherent cache, connected by an interprocessor network; fixed function logic; memory and I/O interfaces]
• many cores and many, many more threads
• standard IA programming and memory model
(source: Intel/K. Skaugen – SC’10 keynote presentation)
Manycore CPU – Intel Knights Ferry (2) (from 2011)
Knights Ferry
• Software development platform
• Growing availability through 2010
• 32 cores, 1.2 GHz
• 128 threads at 4 threads / core
• 8 MB shared coherent cache
• 1–2 GB GDDR5
• Bundled with Intel HPC tools
Software development platform for Intel® MIC architecture
(source: Intel/K. Skaugen – SC’10 keynote presentation)
Heterogeneous CPU – Cell BE (from 2007 - already forgotten??)
(source: Wikipedia.de “Cell (Prozessor)”)
Cell BE and Roadrunner
Cell Broadband Engine (IBM, Sony, Toshiba):
• used for the Playstation (and for supercomputing)
• PowerPC core plus 7–8 “Synergistic Processing Elements” (SPEs)
• small local store (256 KB) for each SPE
• explicit memory transfers required (no real caches)
Roadrunner (Los Alamos Nat. Lab.):
• #1 of the Top 500, years 2008–2009
• first PetaFlop supercomputer: 1.026 PFlop/s (Linpack)
• costs of installation: ∼100 Mio. $
• hybrid design: 6,562 dual-core Opterons, 12,240 Cell BEs
• power consumption: 2.35 MW ⇒ 437 MFlop/s per Watt(!)
http://www.lanl.gov/roadrunner/rrtechnicalseminars2008.shtml
Caches and The Memory Gap
Complicated Memory Hierarchies:
• multiple levels of caches, NUMA, main memory, . . .
• sometimes explicit local store (Cell BE, GPU)
• regarding performance: “moving memory is evil!”
Caching Ideas and Strategies:
• temporal locality: if we access address x, we will probably access address x again
• replacement strategy: least frequently/recently used
• spatial locality: if we access address x, we will probably access addresses x + δ or x − δ
• cache lines (with limited associativity)
Latency of Cache Misses
Example: Intel Sandy Bridge

Level                              Latency (cycles)   Bandwidth (per core per cycle)
L1 Data                            4                  2 × 16 bytes
L2 (Unified)                       12                 1 × 32 bytes
Third Level (LLC)                  26–31
L2 and L1 DCache in other cores    43–60

Further Influences:
• prefetching due to stream detection
• eviction of cache lines bandwidth-limited
(source: Intel® 64 and IA-32 Architectures Optimization Reference Manual)
Types of Cache Misses – “Three C’s”
Capacity Miss:
• cache cannot hold the current memory set of the algorithm
• has to evict cache lines that are still needed → cache miss
Conflict Miss:
• memory line can only be stored in certain cache lines (“cache associativity”)
• evicts cache line of data that is still required → cache miss
• “false sharing”: two cores write-access the same cache line
Compulsory Miss:
• first access of a memory line: “cold miss”
• cache miss that cannot be avoided(?)
• each required piece of data is read at least once
Caches – Examples
Intel Core i7:
• 32 KB L1 data cache, 8-way associative
• 32 KB L1 instruction cache
• 256 KB L2 cache, 8-way associative
• 8 MB shared L3 cache, 16-way associative
• 64 Byte cache-line size
AMD Phenom II X4:
• 64 KB L1 data cache, 2-way associative
• 64 KB L1 instruction cache
• 512 KB L2 cache, 16-way associative
• 6 MB shared L3 cache, 48-way associative
• 64 Byte cache-line size
Part II Towards Memory & Cache Efficient Algorithms → (Parallel) Memory Models
I/O Model and Cache Oblivious Model
I/O Model:
[Figure: CPUs with private local caches (M words per cache, L words per line) and an external shared memory]
• main memory vs. hard disc; block-wise transfers
• algorithm contains explicit block transfers
• goal: algorithms with minimal number of transfers
I/O Model and Cache Oblivious Model
Cache Oblivious Model:
[Figure: CPU with a private local cache (M words per cache, L words per line) and an external shared memory]
• cache lines vs. main memory
• assume a perfect cache-replacement strategy
• goal: cache efficient algorithms that are “oblivious” of the exact cache architecture
Example: Matrix Multiplication
Consider: Cache Oblivious Model
• number of memory transfers for the following program?

for i from 1 to n do
  for k from 1 to n do
    for j from 1 to n do
      C[i,k] = C[i,k] + A[i,j] ∗ B[j,k]

• each j-loop reads 2n elements (from arrays A and B)
• thus 2n/L transfers per j-loop (cache line holds L words)
→ only n/L transfers, if column B[:,k] stays in cache
• with n² iterations of the j-loop: O(n³/L) transfers
• how to better exploit the local memory?
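The loop nest above can be run as-is; here is a minimal sketch in plain Python (0-based indices, nested lists standing in for the arrays):

```python
def matmul_ikj(A, B, n):
    # C[i][k] += A[i][j] * B[j][k], with j innermost as on the slide
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][k] += A[i][j] * B[j][k]
    return C

# usage: multiplying by the identity returns the matrix unchanged
A = [[1.0, 2.0], [3.0, 4.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
print(matmul_ikj(A, I, 2))  # → [[1.0, 2.0], [3.0, 4.0]]
```

Exchanging loop orders changes which array is streamed with unit stride in the innermost loop, which is exactly the memory effect analysed above.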
Block-Oriented Matrix Multiplication
Block-oriented formulation:

( A11 A12 A13 )   ( B11 B12 B13 )   ( C11 C12 C13 )
( A21 A22 A23 ) · ( B21 B22 B23 ) = ( C21 C22 C23 )
( A31 A32 A33 )   ( B31 B32 B33 )   ( C31 C32 C33 )

→ each Aij, Bjk, Cik a square matrix block
Block operations:
• C11 = A11 B11 + A12 B21 + A13 B31
• C12 = A11 B12 + A12 B22 + A13 B32
• ...
• three blocks need to fit in local memory!
Block-Oriented Matrix Multiplication – Code

for i from 0 to n/b−1 do
  for k from 0 to n/b−1 do
    ! read/initialise required block of C
    for j from 0 to n/b−1 do {
      ! read required blocks of A and B
      ! and perform block multiplication:
      for ii from 1 to b do
        for kk from 1 to b do
          for jj from 1 to b do
            C[i∗b+ii, k∗b+kk] = C[i∗b+ii, k∗b+kk] + A[i∗b+ii, j∗b+jj] ∗ B[j∗b+jj, k∗b+kk]
    }
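The blocked scheme can be sketched as runnable plain Python (0-based indices; assumes the block size b divides n):

```python
def matmul_blocked(A, B, n, b):
    # b-by-b blocks; for each block C_ik, accumulate A_ij * B_jk over j
    C = [[0.0] * n for _ in range(n)]
    for i in range(0, n, b):          # block row of C
        for k in range(0, n, b):      # block column of C
            for j in range(0, n, b):  # block multiplication C_ik += A_ij * B_jk
                for ii in range(i, i + b):
                    for kk in range(k, k + b):
                        for jj in range(j, j + b):
                            C[ii][kk] += A[ii][jj] * B[jj][kk]
    return C
```

The result is identical to the unblocked triple loop; only the order of the updates changes, so that the three active blocks stay cache-resident.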
Blocked MatrixMult – Memory Transfers
Number of Memory Transfers:
• local memory can hold M words; choose b such that 3b² < M (ideal: 3b² = M)
• read/write local C block: 2b² words per i-/k-iteration
• read all n/b A blocks and n/b B blocks: 2(n/b) · b² words per i-/k-iteration
• for all (n/b)² i-/k-iterations: (n/b)² · (2n/b + 2) · b² = 2n³/b + 2n² words
• b ∈ Θ(√M); move L words per transfer; therefore: Θ(n³/(L√M)) memory transfers
Parallel External Memory – Memory Scheme
[Figure: multiple CPUs, each with a private local cache (M words per cache), all connected to a large external shared memory]
[Arge, Goodrich, Nelson, Sitchinava, 2008]
Parallel External Memory – History
Extension of the classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• CPU can only access smaller working memory (cache, main memory, etc.) of M words each
• both organised as cache lines of size B words
• algorithmic complexity determined by memory transfers
Extension of the PRAM:
• multiple CPUs access global shared memory (but locally distributed)
• EREW, CREW, CRCW classification (exclusive/concurrent read/write to external memory)
• similar programming model (synchronised execution, e.g.)
“Flops are For Free”
Consider a memory-bandwidth intensive algorithm:
• you can do a lot more flops than can be read from cache or memory
• computational intensity of a code: number of flops per byte
Memory-Bound Performance:
• computational intensity smaller than critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses
Compute-Bound Performance:
• enough computational work to “hide” memory latency
• speedup only possible by reducing operations
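A small worked example (the machine numbers below are illustrative assumptions, not measurements from the lecture): a 5-point stencil update performs about 5 flops; if all operands must come from memory in double precision, roughly 6 · 8 = 48 bytes move per update, so

```latex
I \approx \frac{5~\text{flops}}{48~\text{bytes}} \approx 0.1~\text{flops/byte},
\qquad
P = \min\bigl(P_\text{peak},\; b_\text{stream}\cdot I\bigr)
\approx \min\bigl(20~\text{GFlop/s},\; 20~\text{GB/s}\cdot 0.1~\text{flops/byte}\bigr)
= 2~\text{GFlop/s}.
```

With these assumed numbers the kernel is clearly memory-bound: it attains only 10 % of peak, and additional flops at the same operational intensity would indeed be “for free”.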
The Roofline Model
[Figure: log-log roofline plot – GFlop/s vs. Operational Intensity (Flops/Byte, 1/8 … 64); horizontal ceiling: peak FP performance, lowered without vectorization and without instruction-level parallelism; slanted ceiling: peak stream bandwidth, lowered by NUMA effects and non-unit-stride access; marked kernels: SpMV, 5-pt stencil, SWE?, matrix mult. (100×100)]
[Williams, Waterman, Patterson, 2008]
Part III Structured Grids and Stencil Computations
Dwarf #5 – Structured Grids
1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo
A Wiremesh Model for the Heat Equation
• consider rectangular plate as fine mesh of wires
• compute temperature x_ij at nodes of the mesh
[Figure: mesh node x_{i,j} with neighbours x_{i−1,j}, x_{i+1,j}, x_{i,j−1}, x_{i,j+1}; mesh widths h_x, h_y]
Wiremesh Model for the Heat Equation (2)
• model assumption: temperatures in equilibrium at every mesh node
• for all temperatures x_ij:

x_ij = (1/4) (x_{i−1,j} + x_{i+1,j} + x_{i,j−1} + x_{i,j+1})

• temperature known at (part of) the boundary; for example: x_{0,j} = T_j
• task: solve system of linear equations
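A minimal iterative solver for this system can be sketched in plain Python (Jacobi iteration; the fixed boundary temperature of 1.0 on all four edges is an assumption for the example):

```python
def jacobi_step(x, n):
    # one Jacobi sweep over the interior of an n x n mesh:
    # x_ij = (x_{i-1,j} + x_{i+1,j} + x_{i,j-1} + x_{i,j+1}) / 4
    new = [row[:] for row in x]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1])
    return new

# usage: 6x6 mesh, boundary held at 1.0, interior starting at 0.0;
# the interior converges towards the boundary temperature
n = 6
x = [[1.0 if i in (0, n-1) or j in (0, n-1) else 0.0 for j in range(n)]
     for i in range(n)]
for _ in range(200):
    x = jacobi_step(x, n)
```

Each sweep is itself a stencil traversal, which is why the memory analysis of the later slides applies directly to such solvers.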
A Finite Volume Model
• object: a rectangular metal plate (again)
• model as a collection of small connected rectangular cells
[Figure: grid of rectangular cells with mesh widths h_x, h_y]
• examine the heat flow across the cell edges
Heat Flow Across the Cell Boundaries
• heat flow across a given edge is proportional to
– the temperature difference (T1 − T0) between the adjacent cells
– the length h of the edge
• heat flow across all edges determines the change of heat energy:

q_ij = k_x (T_ij − T_{i−1,j}) h_y + k_x (T_ij − T_{i+1,j}) h_y + k_y (T_ij − T_{i,j−1}) h_x + k_y (T_ij − T_{i,j+1}) h_x

• in equilibrium, with source term: q_ij + F_ij = 0, i.e.

f_ij h_x h_y = −k_x h_y (2T_ij − T_{i−1,j} − T_{i+1,j}) − k_y h_x (2T_ij − T_{i,j−1} − T_{i,j+1})

• apply iterative solver
Towards a Time Dependent Model
• idea: set up an ODE for each cell
• simplification: no external heat sources or drains, i.e. f_ij = 0
• change of temperature per time is proportional to the heat flow into the cell (no longer 0):

Ṫ_ij(t) = −(κ_x/h_x) (2T_ij(t) − T_{i−1,j}(t) − T_{i+1,j}(t)) − (κ_y/h_y) (2T_ij(t) − T_{i,j−1}(t) − T_{i,j+1}(t))

• solve the system of ODEs → using Euler time stepping, e.g.:

T_ij^(n+1) = T_ij^(n) − τ (κ_x/h_x) (2T_ij^(n) − T_{i−1,j}^(n) − T_{i+1,j}^(n)) − τ (κ_y/h_y) (2T_ij^(n) − T_{i,j−1}^(n) − T_{i,j+1}^(n))
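The Euler scheme above fits in a few lines of plain Python (the values chosen below for τ, κ and h are illustrative assumptions):

```python
def euler_step(T, n, tau, kx, ky, hx, hy):
    # one explicit Euler step for the cell ODEs; interior cells only,
    # boundary values are kept fixed
    new = [row[:] for row in T]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (T[i][j]
                         - tau * (kx / hx) * (2 * T[i][j] - T[i-1][j] - T[i+1][j])
                         - tau * (ky / hy) * (2 * T[i][j] - T[i][j-1] - T[i][j+1]))
    return new

# usage: a single hot cell cools down while its neighbours warm up
T = [[0.0] * 5 for _ in range(5)]
T[2][2] = 1.0
T = euler_step(T, 5, 0.1, 1.0, 1.0, 1.0, 1.0)
```

Note that explicit Euler is only stable for a sufficiently small time step τ; each step is again one stencil traversal over the grid.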
General Pattern: Stencil Computation
• update of unknowns, elements, etc., according to a fixed pattern
• pattern usually defined by neighbours in a structured grid/lattice
• task: “update all unknowns/elements” → traversal
• multiple traversals for iterative solvers (for systems of equations) or time stepping (for time-dependent problems)
• example: Poisson equation (h² factors ignored):
1D: ( 1  −2  1 )
2D: 5-point stencil with −4 at the centre and 1 at the four direct neighbours
• our example: shallow water equation on Cartesian grid (Finite Volume Model)
Structured Grids – Characterisation
• construction of points or elements follows a regular process
• geometric (coordinates) and topological information (neighbour relations) can be derived (i.e. are not stored)
• memory addresses can be easily computed
Regular Structured Grids
• rectangular/Cartesian grids: rectangles (2D) or cuboids (3D)
• triangular meshes: triangles (2D) or tetrahedra (3D)
• often: row-major or column-major traversal and storage
[Figure: regular Cartesian grid with mesh widths h_x, h_y]
Transformed Structured Grids
• transformation of the unit square to the computational domain
• regular grid is transformed likewise
[Figure: unit square with corners (0,0), (1,0), (0,1), (1,1) mapped to a curvilinear domain with corners (ξ(0),η(0)), (ξ(1),η(0)), (ξ(0),η(1)), (ξ(1),η(1))]
Variants:
• algebraic: interpolation-based
• PDE-based: solve a system of PDEs to obtain ξ(x,y) and η(x,y)
Composite and Block Structured Grids
• subdivide (complicated) domain into subdomains of simpler form
• and use regular meshes on each subdomain
• at interfaces:
– conforming at the interface (“glue” required?)
– overlapping grids (chimera grids)
Block Adaptive Grids
• retain regular structure
• refinement of entire blocks
• similar to block structured grids
• efficient storage and processing
• but limited w.r.t. adaptivity
Part IV (Cache-)Efficient (Parallel) Algorithms for Structured Grids
Analysis of Cache-Usage for 2D/3D Stencil Computation
We will assume:
• 2D or 3D Cartesian mesh with N = n^d grid points
• stencil only accesses nearest neighbours → typically cM := 2d or cM := 3^d accesses per stencil
• cF floating-point operations per stencil, cF ∈ O(cM)
We will examine:
• number of memory transfers in the Parallel External Memory model (equivalent to cache misses)
• for different implementations and algorithms
• similarly, the ratio of communication to computation
Straight-Forward, Loop-Based Implementation
Example:

for i from 1 to n do
  for j from 1 to n do {
    x[i,j] = 0.25 ∗ ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )
  }

Number of cache line transfers:
• x[i−1,j], x[i,j], and x[i+1,j] are stored consecutively in memory
→ loaded as one cache line (of size L)
• question: is the cache size M large enough to hold n floats?
• if n > M: cache misses for x[i,j−1] and x[i,j+1]
• thus: 3N/L = 3n²/L transfers; no impact of cache size M
Loop-Based Implementation with Blocking
Example:

for ii from 1 to n by b do
  for jj from 1 to n by b do
    for i from ii to ii+b−1 do
      for j from jj to jj+b−1 do {
        x[i,j] = 0.25 ∗ ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )
      }

Number of cache line transfers:
• choose b such that the cache can hold 3 rows of a block: M > 3b
• then: N/L transfers; still independent of cache size M (besides the condition for b)
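The blocked loop nest translates directly into plain Python (0-based indices; like the slide's version it updates x in place, so the traversal order influences intermediate values):

```python
def stencil_blocked(x, n, b):
    # blocked in-place sweep over the interior of an n x n grid
    # (interior indices 1 .. n-2), block size b
    for ii in range(1, n - 1, b):
        for jj in range(1, n - 1, b):
            for i in range(ii, min(ii + b, n - 1)):
                for j in range(jj, min(jj + b, n - 1)):
                    x[i][j] = 0.25 * (x[i-1][j] + x[i+1][j]
                                      + x[i][j-1] + x[i][j+1])

# usage: 4x4 grid, boundary 1.0, interior 0.0, block size 2
n = 4
x = [[1.0 if i in (0, n-1) or j in (0, n-1) else 0.0 for j in range(n)]
     for i in range(n)]
stencil_blocked(x, n, 2)
```

The `min(...)` bounds are a small addition so the sketch also works when b does not divide the interior size.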
Extension to 3D Stencils
Simple loops:
• if the cache holds 3 planes of x, M > 3n², then N/L transfers
• if the cache only holds 1 plane, M > n², then 3N/L transfers
• if the cache holds less than 1 row, M < n, then 5N/L transfers (if cM = 6) or 9N/L transfers (if cM = 3³ = 27)
With blocking:
• cache needs to hold 3 planes of a b³ block: M > 3b²
• then: N/L transfers; again independent of cache size M (besides the condition for b)
Further Increase of Cache Reuse
Requires multiple stencil evaluations:

for t from 1 to m do
  for i from 1 to n do
    for j from 1 to n do {
      x[i,j] = 0.25 ∗ ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )
    }

→ for multiple iterations or time steps, e.g.
Possible approaches:
• blocking in space and time?
• what about precedence conditions of stencil updates?
Region of Influence for Stencil Updates
1D Example:
[Figure: space-time diagram (x vs. t) of a trapezoidal update region]
• area of “valid” points narrows by stencil size in each step
• leads to trapezoidal update regions
• similar, but more complicated, in 2D and 3D
Option: Time Skewing
1D Example:
[Figure: space-time diagram (x vs. t) with a sliding trapezoidal window]
• sliding trapezoidal update “window”
• question: optimal size of window in x- and t-direction?
• can be extended to 2D and 3D
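The principle can be illustrated in plain Python for a 1D three-point stencil (the averaging stencil below is an illustrative assumption, not from the lecture): enlarging a tile by m halo points on each side lets m time steps be computed locally, with the valid region narrowing by one point per step, exactly the trapezoid above.

```python
def step(u):
    # one global time step of a 3-point averaging stencil; end values fixed
    return ([u[0]]
            + [0.25 * u[i-1] + 0.5 * u[i] + 0.25 * u[i+1]
               for i in range(1, len(u) - 1)]
            + [u[-1]])

def advance_tile(u, lo, hi, m):
    # compute m time steps for points lo..hi-1 from the enlarged tile
    # lo-m .. hi+m-1 alone: the valid region shrinks by one point per
    # step (the trapezoidal region of influence)
    tile = u[lo - m : hi + m]
    for _ in range(m):
        tile = [0.25 * tile[i-1] + 0.5 * tile[i] + 0.25 * tile[i+1]
                for i in range(1, len(tile) - 1)]
    return tile  # values of time step m at indices lo .. hi-1
```

Time skewing slides such a window across the domain, reusing the cached tile for all m steps before moving on.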
Divide & Conquer Algorithm: Space Split
1D Example:
[Figure: space-time trapezoid split along the space direction]
• applied if the spatial domain is at least “twice as large” as the number of time steps
• note the precedence condition for the left vs. right subdomain
Divide & Conquer Algorithm: Time Split
1D Example:
[Figure: space-time trapezoid split along the time direction]
• applied if the spatial domain is less than “twice as large” as the number of time steps
• a space split is likely the next split for the lower subdomain
Cache Oblivious Algorithms for Structured Grids
Algorithm by Frigo & Strumpen:
• divide & conquer approach using time and space splits
• O(N/M^(1/d)) cache misses in the “cache oblivious” model (“Parallel External Memory” with only 1 CPU and an “ideal cache”)
References/Literature:
• Matteo Frigo and Volker Strumpen: Cache Oblivious Stencil Computations. Int. Conf. on Supercomputing, ACM, 2005.
• Matteo Frigo and Volker Strumpen: The Memory Behavior of Cache Oblivious Stencil Computations. J. Supercomputing 39 (2), 2007.
Part V Workshop – Memory Access in SWE
Towards a Roofline Model for SWE
[Figure: log-log roofline plot – GFlop/s vs. Operational Intensity (Flops/Byte, 1/8 … 64); horizontal ceiling: peak FP performance, lowered without vectorization and without instruction-level parallelism; slanted ceiling: peak stream bandwidth, lowered by NUMA effects and non-unit-stride access; marked kernels: SpMV, 5-pt stencil, SWE?, matrix mult. (100×100)]
Towards a Roofline Model for SWE
Components of the Roofline Model:
• peak floating point performance and peak bandwidth? (on a single core/socket/node)
• operational complexity of SWE?
• test performance for non-unit-stride memory access (exchange i-/j-loops, e.g.)
• test performance with and without vectorization
• test performance implications on serial and parallel code!
Possible improvements of SWE:
• examine memory access pattern of SWE → possibilities for improvement? (loop fusion? storage of net updates?)
• improve cache behaviour of SWE (loop skewing, e.g.?)
Roofline Ingredient #1: Stream Benchmark
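The STREAM idea can be sketched in plain Python just to show what is measured (interpreter overhead dominates here, so for real roofline numbers one would use the original C STREAM benchmark or at least numpy):

```python
import array
import time

def stream_triad(n, q=3.0):
    # STREAM "triad" kernel: a[i] = b[i] + q * c[i]
    b = array.array('d', [1.0] * n)
    c = array.array('d', [2.0] * n)
    a = array.array('d', [0.0] * n)
    t0 = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + q * c[i]
    t1 = time.perf_counter()
    bytes_moved = 3 * 8 * n  # read b and c, write a (8 bytes per double)
    return a, bytes_moved / (t1 - t0) / 1e9  # effective GB/s

# usage: every entry becomes 1.0 + 3.0 * 2.0 = 7.0
a, gbs = stream_triad(100_000)
```

The measured bandwidth is the slanted ceiling of the roofline plot; the kernel's 2 flops per 24 bytes also illustrate a very low operational intensity.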
An Improvement to Memory Usage of SWE
Consider the time-stepping scheme:

Q_{i,j}^{n+1} = Q_{i,j}^n − (Δt/Δx) (A⁺ΔQ_{i−1/2,j}^n + A⁻ΔQ_{i+1/2,j}^n) − (Δt/Δy) (B⁺ΔQ_{i,j−1/2}^n + B⁻ΔQ_{i,j+1/2}^n)

Implementation w.r.t. memory:
1. compute and store all net updates A⁺ΔQ_{i−1/2,j}^n, etc.
2. update quantities Q_{i,j}^{n+1}
Possible improvement:
1. accumulate (1/Δx) A⁺ΔQ_{i−1/2,j}^n, etc., for each quantity
2. add these accumulated net updates to Q_{i,j}^n
References and Literature
• L. Arge, M. T. Goodrich, M. Nelson, N. Sitchinava: Fundamental parallel algorithms for private-cache chip multiprocessors. Proc. 20th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA), 2008.
• K. Datta et al.: Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors. SIAM Review 51 (1), 2009.
• Graham, Snir, Patterson: Getting Up to Speed: The Future of Supercomputing. The National Academies Press, 2005.
• S. W. Williams, A. Waterman, D. A. Patterson: Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. UC Berkeley, Tech. Report No. UCB/EECS-2008-134.