Memory-Bound Performance

Technische Universität München

Memory-Bound Performance
CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering
Michael Bader, July 8–19, 2013

Michael Bader: Memory-Bound Performance Computer Simulations in Science and Engineering, July 8–19, 2013


Part I
Parallel Architectures – A Memory Perspective


The Memory Gap
Graham et al., 2005:
• annual increase of CPU performance (“Flop/s”): 59 %
• annual increase of memory bandwidth: ∼25 %
• annual improvement of memory latency: ∼5 %
⇒ memory is the bottleneck (already, and getting worse)
⇒ remedy: introduce cache memory!


Multicore CPUs – Intel’s Nehalem Architecture (from 2008/2009)

• quad-core CPU with shared and private caches • simultaneous multi-threading: 8 threads on 4 cores • memory architecture: Quick Path Interconnect

(replaced Front Side Bus)


Multicore CPUs – Intel’s Nehalem Architecture (2) (from 2008/2009)
• NUMA (non-uniform memory access) architecture: each CPU has “private” main memory, but slower, non-uniform access to the “remote” memory of the other CPUs
• max. 25 GB/s bandwidth

(source: Intel – Nehalem Whitepaper)


GPGPU – NVIDIA Fermi (from 2009/2010)
The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating-point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to the SM thread schedulers.
(figure: Fermi’s 16 SMs are positioned around a common L2 cache; each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache))
(source: NVIDIA Fermi Whitepaper)

GPGPU – NVIDIA Fermi (3) (from 2009/2010)
General Purpose Graphics Processing Unit:
• 512 CUDA cores
• improved double-precision performance
• shared vs. global memory
• new: L1 and L2 cache (768 KB)
• trend from GPU towards CPU?

From the whitepaper, on the memory subsystem innovations (NVIDIA Parallel DataCache™ with configurable L1 and unified L2 cache): “Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior. Adding a true cache hierarchy for load/store operations presented significant challenges. Traditional GPU architectures support a read-only ‘load’ path for texture operations and a write-only ‘export’ path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read-after-write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write/‘export’ path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data. The Fermi architecture addresses this challenge by implementing a single unified memory …”
(source: NVIDIA Fermi Whitepaper)


Manycore CPU – Intel Knights Ferry (from 2011)
Intel® MIC architecture:
(figure: block diagram with many vector IA cores, each with a coherent cache, linked by an interprocessor network to memory and I/O interfaces, plus fixed-function logic)
• an Intel co-processor architecture
• many cores, and many, many more threads
• standard IA programming and memory model
(source: Intel/K. Skaugen – SC’10 keynote presentation)


Manycore CPU – Intel Knights Ferry (2) (from 2011)
Knights Ferry: software development platform for the Intel® MIC architecture
• growing availability through 2010
• 32 cores, 1.2 GHz
• 128 threads at 4 threads/core
• 8 MB shared coherent cache
• 1–2 GB GDDR5
• bundled with Intel HPC tools
(source: Intel/K. Skaugen – SC’10 keynote presentation)


Heterogeneous CPU – Cell BE (from 2007 - already forgotten??)

(source: Wikipedia.de “Cell (Prozessor)”)


Cell BE and Roadrunner
Cell Broadband Engine (IBM, Sony, Toshiba):
• used for the PlayStation (and for supercomputing)
• PowerPC core plus 7–8 “Synergistic Processing Elements” (SPEs)
• small local store (256 KB) for each SPE
• explicit memory transfers required (no real caches)

Roadrunner (Los Alamos Nat. Lab.):
• #1 of the Top 500 in 2008–2009
• first PetaFlop supercomputer: 1.026 PFlop/s (Linpack)
• cost of installation: ∼100 million $
• hybrid design: 6,562 dual-core Opterons, 12,240 Cell BEs
• power consumption: 2.35 MW ⇒ 437 MFlop/s per Watt(!)
http://www.lanl.gov/roadrunner/rrtechnicalseminars2008.shtml


Caches and the Memory Gap
Complicated memory hierarchies:
• multiple levels of caches, NUMA, main memory, …
• sometimes an explicit local store (Cell BE, GPU)
• regarding performance: “moving memory is evil!”

Caching ideas and strategies:
• temporal locality: if we access address x, we will probably access address x again
• replacement strategy: least frequently/recently used
• spatial locality: if we access address x, we will probably access addresses x + δ or x − δ
• cache lines (with limited associativity)


Latency of Cache Misses
Example: Intel Sandy Bridge

Level                             Latency (cycles)   Bandwidth (per core per cycle)
L1 Data                           4                  2 × 16 bytes
L2 (Unified)                      12                 1 × 32 bytes
Third Level (LLC)                 26–31
L2 and L1 DCache in other cores   43–60

Further influences:
• prefetching due to stream detection
• eviction of cache lines is bandwidth-limited
(source: Intel® 64 and IA-32 Architectures Optimization Reference Manual)


Types of Cache Misses – the “Three C’s”
Capacity miss:
• the cache cannot hold the current memory set of the algorithm
• has to evict cache lines that are still needed ⇒ cache miss

Conflict miss:
• a memory line can only be stored in certain cache lines (“cache associativity”)
• evicts a cache line of data that is still required ⇒ cache miss
• “false sharing”: two cores write-access the same cache line

Compulsory miss:
• first access of a memory line: “cold miss”
• a cache miss that cannot be avoided(?)
• each required piece of data is read at least once


Caches – Examples
Intel Core i7:
• 32 KB L1 data cache, 8-way associative
• 32 KB L1 instruction cache
• 256 KB L2 cache, 8-way associative
• 8 MB shared L3 cache, 16-way associative
• 64-byte cache-line size

AMD Phenom II X4:
• 64 KB L1 data cache, 2-way associative
• 64 KB L1 instruction cache
• 512 KB L2 cache, 16-way associative
• 6 MB shared L3 cache, 48-way associative
• 64-byte cache-line size


Part II
Towards Memory- & Cache-Efficient Algorithms → (Parallel) Memory Models


I/O Model and Cache Oblivious Model
I/O Model:
(figure: a CPU with a private local cache of M words, organised in lines of L words, in front of a large external shared memory)
• main memory vs. hard disc; block-wise transfers
• the algorithm contains explicit block transfers
• goal: algorithms with a minimal number of transfers


I/O Model and Cache Oblivious Model (2)
Cache Oblivious Model:
(figure: the same memory scheme: a private local cache of M words, lines of L words, external shared memory)
• cache lines vs. main memory
• assume perfect (optimal) cache replacement
• goal: cache-efficient algorithms that are “oblivious” of the exact cache architecture


Example: Matrix Multiplication
Consider, in the cache oblivious model: how many memory transfers does the following program cause?

for i from 1 to n do
  for k from 1 to n do
    for j from 1 to n do
      C[i,k] = C[i,k] + A[i,j] * B[j,k]

• each j-loop reads 2n elements (from arrays A and B)
• thus 2n/L transfers per j-loop (a cache line holds L words); only n/L transfers if column B[:,k] stays in cache
• with n² iterations of the j-loop: O(n³/L) transfers
• how to better exploit the local memory?

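The loop nest above can be written out as a short, runnable sketch (plain Python lists, 0-based indexing instead of the slides' 1-based notation):

```python
def matmul_naive(A, B, n):
    """C[i][k] += A[i][j] * B[j][k] with the j-loop innermost, as in
    the slide: each j-iteration streams a row of A and a column of B
    (the column access is the cache-unfriendly part)."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][k] += A[i][j] * B[j][k]
    return C

# tiny usage example
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_naive(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```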

Block-Oriented Matrix Multiplication
Block-oriented formulation:

  ( A11 A12 A13 )   ( B11 B12 B13 )   ( C11 C12 C13 )
  ( A21 A22 A23 ) · ( B21 B22 B23 ) = ( C21 C22 C23 )
  ( A31 A32 A33 )   ( B31 B32 B33 )   ( C31 C32 C33 )

→ each Aij, Bjk, Cik is a square matrix block

Block operations:
• C11 = A11 B11 + A12 B21 + A13 B31
• C12 = A11 B12 + A12 B22 + A13 B32
• …
• three blocks need to fit into local memory!


Block-Oriented Matrix Multiplication – Code

for i from 0 to n/b−1 do
  for k from 0 to n/b−1 do
    ! read/initialise the required block of C
    for j from 0 to n/b−1 do {
      ! read the required blocks of A and B
      ! and perform the block multiplication:
      for ii from 1 to b do
        for kk from 1 to b do
          for jj from 1 to b do
            C[i*b+ii, k*b+kk] = C[i*b+ii, k*b+kk] + A[i*b+ii, j*b+jj] * B[j*b+jj, k*b+kk]
    }

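A runnable version of the blocked scheme (0-based indices; for simplicity the sketch assumes that the block size `b` divides `n`):

```python
def matmul_blocked(A, B, n, b):
    """Blocked matrix multiplication: the three b-by-b blocks of C, A
    and B that are active at any time are meant to fit into the local
    memory/cache (3*b*b <= M)."""
    assert n % b == 0, "for simplicity, b must divide n"
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):          # block row of C
        for k0 in range(0, n, b):      # block column of C
            for j0 in range(0, n, b):  # blocks of A and B
                # multiply block A[i0:i0+b, j0:j0+b] by B[j0:j0+b, k0:k0+b]
                for i in range(i0, i0 + b):
                    for k in range(k0, k0 + b):
                        s = C[i][k]
                        for j in range(j0, j0 + b):
                            s += A[i][j] * B[j][k]
                        C[i][k] = s
    return C
```

The result is identical to the naive triple loop; only the order in which the index space is traversed changes, which is exactly what the transfer analysis on the next slide exploits.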

Blocked MatrixMult – Memory Transfers
Number of memory transfers:
• the local memory can hold M words; choose b such that 3b² ≤ M (ideal: 3b² = M)
• read/write the local C block: 2b² words per i-/k-iteration
• read all n/b A blocks and n/b B blocks: 2(n/b) · b² words per i-/k-iteration
• over all (n/b)² i-/k-iterations: (n/b)² · (2n/b + 2) · b² = 2n³/b + 2n² words
• b ∈ Θ(√M); each transfer moves L words; therefore: Θ(n³/(L√M)) memory transfers


Parallel External Memory – Memory Scheme
(figure: several CPUs, each with a private local cache of M words, all connected to a large external shared memory)
[Arge, Goodrich, Nelson, Sitchinava, 2008]


Parallel External Memory – History
Extension of the classical I/O model:
• large, global memory (main memory, hard disk, etc.)
• each CPU can only access a smaller working memory (cache, main memory, etc.) of M words
• both organised as cache lines of size B words
• algorithmic complexity determined by memory transfers

Extension of the PRAM:
• multiple CPUs access a global shared memory (which may be physically distributed)
• EREW, CREW, CRCW classification (exclusive/concurrent read/write to external memory)
• similar programming model (e.g., synchronised execution)


“Flops are For Free”
Consider a memory-bandwidth-intensive algorithm:
• you can execute far more flops than operands can be read from cache or memory
• computational intensity of a code: the number of flops per byte transferred

Memory-bound performance:
• computational intensity smaller than this critical ratio
• you could execute additional flops “for free”
• speedup only possible by reducing memory accesses

Compute-bound performance:
• enough computational work to “hide” the memory latency
• speedup only possible by reducing operations


The Roofline Model
(figure: log-log plot of attainable GFlop/s over operational intensity [Flops/Byte]; horizontal ceilings for peak FP performance, without vectorization, and without instruction-level parallelism; slanted ceilings for peak stream bandwidth, reduced by NUMA effects and non-unit-stride access; sample kernels at increasing intensity: SpMV, 5-pt stencil, SWE?, matrix mult. (100×100))
[Williams, Waterman, Patterson, 2008]

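The roofline bound itself is a one-liner: attainable performance is the minimum of the peak floating-point rate and the product of stream bandwidth and operational intensity. A small sketch (the peak and bandwidth figures are made-up placeholders, not measurements from the slides):

```python
def roofline(peak_gflops, bandwidth_gb_s, intensity_flop_byte):
    """Attainable GFlop/s under the roofline model:
    min(peak FP performance, stream bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_byte)

# hypothetical machine: 100 GFlop/s peak, 20 GB/s stream bandwidth
ridge = 100.0 / 20.0  # intensity where the two ceilings meet: 5 Flops/Byte
print(roofline(100.0, 20.0, 0.5))   # memory-bound kernel: 10.0 GFlop/s
print(roofline(100.0, 20.0, 8.0))   # compute-bound kernel: 100.0 GFlop/s
```

Kernels left of the ridge point are memory-bound (extra flops are “for free”); kernels right of it are compute-bound.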

Part III
Structured Grids and Stencil Computations


Dwarf #5 – Structured Grids

1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo


A Wiremesh Model for the Heat Equation
• consider a rectangular plate as a fine mesh of wires
• compute the temperature xij at the nodes of the mesh
(figure: node xi,j with its four neighbours xi−1,j, xi+1,j, xi,j−1, xi,j+1 and mesh widths hx, hy)


Wiremesh Model for the Heat Equation (2)
• model assumption: temperatures are in equilibrium at every mesh node
• for all temperatures xij:

  xij = 1/4 · (xi−1,j + xi+1,j + xi,j−1 + xi,j+1)

• temperature known at (part of) the boundary; for example: x0,j = Tj
• task: solve this system of linear equations

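The system can be solved, for example, with Jacobi iteration: repeatedly replace every interior value by the average of its four neighbours while the boundary stays fixed. A minimal sketch (grid size, boundary values and iteration count are illustrative, not from the slides):

```python
def jacobi(x, iterations):
    """Jacobi iteration for the equilibrium condition
    x[i][j] = 0.25 * (x[i-1][j] + x[i+1][j] + x[i][j-1] + x[i][j+1]);
    the boundary (first/last row and column) is kept fixed."""
    n = len(x)
    for _ in range(iterations):
        new = [row[:] for row in x]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (x[i - 1][j] + x[i + 1][j]
                                    + x[i][j - 1] + x[i][j + 1])
        x = new
    return x

# example: left edge of the plate held at temperature 1.0, rest at 0.0
n = 8
x = [[0.0] * n for _ in range(n)]
for i in range(n):
    x[i][0] = 1.0
x = jacobi(x, 500)  # interior settles between 0 and 1, hottest near the left edge
```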

A Finite Volume Model
• object: a rectangular metal plate (again)
• model it as a collection of small connected rectangular cells (of size hx × hy)
• examine the heat flow across the cell edges


Heat Flow Across the Cell Boundaries
• the heat flow across a given edge is proportional to
  • the temperature difference (T1 − T0) between the adjacent cells
  • the length h of the edge
• the heat flow across all edges determines the change of heat energy:

  qij = kx (Tij − Ti−1,j) hy + kx (Tij − Ti+1,j) hy + ky (Tij − Ti,j−1) hx + ky (Tij − Ti,j+1) hx

• in equilibrium, with source term Fij = fij hx hy: qij + Fij = 0, i.e.

  fij hx hy = −kx hy (2Tij − Ti−1,j − Ti+1,j) − ky hx (2Tij − Ti,j−1 − Ti,j+1)

• apply an iterative solver


Towards a Time-Dependent Model
• idea: set up an ODE for each cell
• simplification: no external heat sources or drains, i.e. fij = 0
• the change of temperature per time is proportional to the heat flow into the cell (no longer 0):

  T˙ij(t) = −(κx/hx) (2Tij(t) − Ti−1,j(t) − Ti+1,j(t)) − (κy/hy) (2Tij(t) − Ti,j−1(t) − Ti,j+1(t))

• solve the system of ODEs, e.g. using Euler time stepping:

  Tij^(n+1) = Tij^(n) − τ (κx/hx) (2Tij^(n) − Ti−1,j^(n) − Ti+1,j^(n)) − τ (κy/hy) (2Tij^(n) − Ti,j−1^(n) − Ti,j+1^(n))

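The Euler update can be sketched directly from the formula; the grid, the step size τ and the combined coefficients κx/hx and κy/hy are illustrative, and boundary cells are simply held fixed:

```python
def euler_step(T, tau, kx_hx, ky_hy):
    """One explicit Euler step for the semi-discrete heat equation:
    T[i][j] <- T[i][j] - tau*kx/hx*(2*T[i][j] - T[i-1][j] - T[i+1][j])
                       - tau*ky/hy*(2*T[i][j] - T[i][j-1] - T[i][j+1])
    (interior cells only; boundary values are kept fixed)."""
    n = len(T)
    new = [row[:] for row in T]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (T[i][j]
                         - tau * kx_hx * (2 * T[i][j] - T[i - 1][j] - T[i + 1][j])
                         - tau * ky_hy * (2 * T[i][j] - T[i][j - 1] - T[i][j + 1]))
    return new
```

A single hot cell in the middle of a cold plate cools down while its neighbours warm up, as expected of diffusion.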

General Pattern: Stencil Computation
• update of unknowns, elements, etc., according to a fixed pattern
• the pattern is usually defined by the neighbours in a structured grid/lattice
• task: “update all unknowns/elements” → traversal
• multiple traversals for iterative solvers (for systems of equations) or time stepping (for time-dependent problems)
• example: Poisson equation (h² factors ignored):

  1D:  ( 1 −2 1 )        2D:  (    1    )
                               ( 1 −4 1 )
                               (    1    )

• our example: the shallow water equations on a Cartesian grid (finite volume model)


Structured Grids – Characterisation
• the construction of points or elements follows a regular process
• geometric (coordinates) and topological information (neighbour relations) can be derived (i.e., it need not be stored)
• memory addresses can be computed easily


Regular Structured Grids
• rectangular/Cartesian grids: rectangles (2D) or cuboids (3D)
• triangular meshes: triangles (2D) or tetrahedra (3D)
• often: row-major or column-major traversal and storage
(figure: a Cartesian grid with mesh widths hx, hy)


Transformed Structured Grids
• transformation of the unit square to the computational domain
• the regular grid is transformed likewise
(figure: the unit square with corners (0,0), (1,0), (0,1), (1,1) mapped to a curvilinear domain with corners (ξ(0),η(0)), (ξ(1),η(0)), (ξ(0),η(1)), (ξ(1),η(1)))

Variants:
• algebraic: interpolation-based
• PDE-based: solve a system of PDEs to obtain ξ(x, y) and η(x, y)


Composite and Block Structured Grids
• subdivide a (complicated) domain into subdomains of simpler form
• and use regular meshes on each subdomain
• at the interfaces:
  • conforming at the interface (“glue” required?)
  • overlapping grids (chimera grids)


Block Adaptive Grids
• retain the regular structure
• refinement of entire blocks
• similar to block structured grids
• efficient storage and processing
• but limited w.r.t. adaptivity


Part IV
(Cache-)Efficient (Parallel) Algorithms for Structured Grids


Analysis of Cache Usage for 2D/3D Stencil Computations
We will assume:
• a 2D or 3D Cartesian mesh with N = n^d grid points
• the stencil only accesses nearest neighbours → typically cM := 2d or cM := 3^d accesses per stencil
• cF floating-point operations per stencil, cF ∈ O(cM)

We will examine:
• the number of memory transfers in the Parallel External Memory model (equivalent to cache misses)
• for different implementations and algorithms
• the analysis is similar for the ratio of communication to computation


Straightforward, Loop-Based Implementation
Example:

for i from 1 to n do
  for j from 1 to n do
    x[i,j] = 0.25 * ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )

Number of cache line transfers:
• x[i−1,j], x[i,j], and x[i+1,j] are stored consecutively in memory ⇒ loaded as one cache line (of size L)
• question: is the cache size M large enough to hold n floats?
• if n > M: cache misses for x[i,j−1] and x[i,j+1]
• thus: 3N/L = 3n²/L transfers; no impact of the cache size M


Loop-Based Implementation with Blocking
Example:

for ii from 1 to n by b do
  for jj from 1 to n by b do
    for i from ii to ii+b−1 do
      for j from jj to jj+b−1 do
        x[i,j] = 0.25 * ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )

Number of cache line transfers:
• choose b such that the cache can hold 3 rows of a block: M > 3b
• then: N/L transfers; still independent of the cache size M (besides the condition for b)

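The blocked traversal can be sketched as runnable code (0-based indices; to keep the result independent of the block size, this sketch writes to a separate output array instead of updating in place as the pseudocode above does):

```python
def stencil_blocked(x, b):
    """Apply the 5-point averaging stencil to all interior points,
    visiting them block by block in b-by-b tiles; the boundary rows
    and columns are copied unchanged."""
    n = len(x)
    y = [row[:] for row in x]
    for ii in range(1, n - 1, b):
        for jj in range(1, n - 1, b):
            for i in range(ii, min(ii + b, n - 1)):
                for j in range(jj, min(jj + b, n - 1)):
                    y[i][j] = 0.25 * (x[i - 1][j] + x[i + 1][j]
                                      + x[i][j - 1] + x[i][j + 1])
    return y
```

Every interior point is visited exactly once, so the output is identical for any block size; only the traversal order, and with it the cache behaviour, changes.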

Extension to 3D Stencils
Simple loops:
• if the cache holds 3 planes of x, M > 3n², then N/L transfers
• if the cache only holds 1 plane, M > n², then 3N/L transfers
• if the cache holds less than 1 row, M < n, then 5N/L transfers (if cM = 6) or 9N/L transfers (if cM = 3³ = 27)

With blocking:
• the cache needs to hold 3 planes of a b³ block: M > 3b²
• then: N/L transfers; again independent of the cache size M (besides the condition for b)


Further Increase of Cache Reuse
Requires multiple stencil evaluations, e.g. for multiple iterations or time steps:

for t from 1 to m do
  for i from 1 to n do
    for j from 1 to n do
      x[i,j] = 0.25 * ( x[i−1,j] + x[i+1,j] + x[i,j−1] + x[i,j+1] )

Possible approaches:
• blocking in space and time?
• what about the precedence conditions of stencil updates?


Region of Influence for Stencil Updates
1D example: (figure: space-time diagram, x horizontal, t vertical)
• the area of “valid” points narrows by the stencil size in each step
• this leads to trapezoidal update regions
• similar, but more complicated, in 2D and 3D


Option: Time Skewing
1D example: (figure: space-time diagram with a slanted tiling)
• sliding trapezoidal update “window”
• question: what is the optimal size of the window in x- and t-direction?
• can be extended to 2D and 3D


Divide & Conquer Algorithm: Space Split
1D example: (figure: a trapezoid in space-time cut into a left and a right subregion)
• applied if the spatial domain is at least “twice as large” as the number of time steps
• note the precedence condition between the left and the right subdomain


Divide & Conquer Algorithm: Time Split
1D example: (figure: a trapezoid in space-time cut into a lower and an upper time slab)
• applied if the spatial domain is less than “twice as large” as the number of time steps
• a space split is likely the next split for the lower subdomain


Cache Oblivious Algorithms for Structured Grids
Algorithm by Frigo & Strumpen:
• divide & conquer approach using time and space splits
• O(N/M^(1/d)) cache misses in the “cache oblivious” model (“Parallel External Memory” with only 1 CPU and an “ideal cache”)

References/Literature:
• Matteo Frigo and Volker Strumpen: Cache Oblivious Stencil Computations. Int. Conf. on Supercomputing, ACM, 2005.
• Matteo Frigo and Volker Strumpen: The Memory Behavior of Cache Oblivious Stencil Computations. J. Supercomput. 39 (2), 2007.

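The recursive space/time splits can be made concrete in a few lines. The sketch below follows the structure of Frigo & Strumpen's 1D `walk` routine for a 3-point stencil with periodic boundaries; the array size, step count and the averaging kernel are illustrative choices, not from the slides:

```python
def run_trapezoid(u0, steps):
    """Advance a periodic 1D 3-point stencil by `steps` time steps,
    traversing space-time recursively with space cuts (slope -1)
    and time cuts, in the style of Frigo & Strumpen (2005)."""
    n = len(u0)
    u = [list(u0), [0.0] * n]  # two time levels, selected by parity

    def kernel(t, x):
        i = x % n  # periodic wrap-around
        u[(t + 1) % 2][i] = 0.25 * (u[t % 2][(i - 1) % n]
                                    + 2.0 * u[t % 2][i]
                                    + u[t % 2][(i + 1) % n])

    def walk(t0, t1, x0, dx0, x1, dx1):
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
                # wide trapezoid: space cut along a line of slope -1
                xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -1)
                walk(t0, t1, xm, -1, x1, dx1)
            else:
                # tall trapezoid: time cut into two halves
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    walk(0, steps, 0, 1, n, 1)  # parallelogram covering the periodic domain
    return u[steps % 2]
```

The recursion visits each space-time point exactly once and always computes a point's three dependencies first, so the result matches plain step-by-step time stepping.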

Part V
Workshop – Memory Access in SWE


Towards a Roofline Model for SWE
(figure: the roofline plot from before, attainable GFlop/s over operational intensity [Flops/Byte], with ceilings for peak FP performance, missing vectorization or instruction-level parallelism, and stream bandwidth reduced by NUMA and non-unit-stride access; where does SWE fall relative to SpMV, the 5-pt stencil, and matrix mult. (100×100)?)


Towards a Roofline Model for SWE (2)
Components of the roofline model:
• peak floating-point performance and peak bandwidth? (on a single core/socket/node)
• operational intensity of SWE?
• test performance for non-unit-stride memory access (e.g., exchange the i-/j-loops)
• test performance with and without vectorization
• test the performance implications on serial and parallel code!

Possible improvements of SWE:
• examine the memory access pattern of SWE → possibilities for improvement? (loop fusion? storage of the net updates?)
• improve the cache behaviour of SWE (e.g., loop skewing?)


Roofline Ingredient #1: Stream Benchmark


An Improvement to the Memory Usage of SWE
Consider the time-stepping scheme:

  Q_ij^(n+1) = Q_ij^n − (∆t/∆x) (A⁺∆Q_{i−1/2,j}^n + A⁻∆Q_{i+1/2,j}^n) − (∆t/∆y) (B⁺∆Q_{i,j−1/2}^n + B⁻∆Q_{i,j+1/2}^n)

Implementation w.r.t. memory:
1. compute and store all net updates A⁺∆Q_{i−1/2,j}^n, etc.
2. update the quantities Q_ij^(n+1)

Possible improvement:
1. accumulate (1/∆x) A⁺∆Q_{i−1/2,j}^n, etc., for each quantity
2. add these accumulated net updates to Q_ij^n

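The two variants can be contrasted on a 1D toy analogue (illustrative code, not the actual SWE implementation; the "net update" here is just the difference of neighbouring cell values):

```python
def step_stored(q, dt_dx):
    """Variant 1: compute and store all per-edge net updates first,
    then update the cell quantities in a second sweep."""
    n = len(q)
    upd = [q[i + 1] - q[i] for i in range(n - 1)]  # all edge updates, stored
    new = list(q)
    for i in range(1, n - 1):
        new[i] = q[i] + dt_dx * (upd[i] - upd[i - 1])
    return new

def step_accumulated(q, dt_dx):
    """Variant 2: accumulate each edge's contribution into the two
    adjacent cells as soon as it is computed; the per-edge updates
    are never stored as arrays."""
    n = len(q)
    acc = [0.0] * n
    for i in range(n - 1):
        upd = q[i + 1] - q[i]      # edge i+1/2, used immediately
        acc[i] += dt_dx * upd      # contributes to the left cell ...
        acc[i + 1] -= dt_dx * upd  # ... and to the right cell
    new = list(q)
    for i in range(1, n - 1):
        new[i] = q[i] + acc[i]
    return new
```

Both variants produce the same result; the second one trades the stored net-update arrays for a single accumulator per quantity, reducing the memory traffic per time step.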

References and Literature
• L. Arge, M. T. Goodrich, M. Nelson, N. Sitchinava: Fundamental Parallel Algorithms for Private-Cache Chip Multiprocessors. Proc. 20th ACM Symp. on Parallelism in Algorithms and Architectures (SPAA), 2008.
• K. Datta et al.: Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors. SIAM Review 51 (1), 2009.
• S. L. Graham, M. Snir, C. A. Patterson (eds.): Getting Up to Speed: The Future of Supercomputing. The National Academies Press, 2005.
• S. W. Williams, A. Waterman, D. A. Patterson: Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. UC Berkeley, Tech. Report No. UCB/EECS-2008-134, 2008.
