Toward High Performance Matrix Multiplication for Exact Computation

Pascal Giorgi
Joint work with Romain Lebreton (U. Waterloo)
Funded by the French ANR project HPAC

Séminaire CASYS - LJK, April 2014

Motivations

Matrix multiplication plays a central role in computer algebra:
- algebraic complexity of O(n^ω) with ω < 2.3727 [Williams 2011]

Modern processors provide many levels of parallelism: superscalar execution, SIMD units, multiple cores.

High performance matrix multiplication
✓ numerical computing = classic algorithm + hardware arithmetic
✗ exact computing ≠ numerical computing:
  - the algebraic algorithm is not the most efficient (different complexity model)
  - the arithmetic is not directly in the hardware (e.g. ℤ, F_q, ℤ[x], ℚ[x, y, z])

Motivation : Superscalar processor with SIMD

Hierarchical memory:
- L1 cache: 32 kB, 4 cycles
- L2 cache: 256 kB, 12 cycles
- L3 cache: 8 MB, 36 cycles
- RAM: 32 GB, 36 cycles + 57 ns

Motivations : practical algorithms

High performance algorithms (rules of thumb)
- the best asymptotic complexity is not always faster: constants matter
- a better arithmetic count is not always faster: caches matter
- processing multiple data at the same time matters: vectorization
- fine/coarse-grain task parallelism matters: multicore


Our goal: try to incorporate these rules into exact matrix multiplication.

Outline
1. Matrix multiplication with small integers
2. Matrix multiplication with multi-precision integers
3. Matrix multiplication with polynomials


Matrix multiplication with small integers

This corresponds to the case where each entry of the result fits in one processor register: A, B ∈ ℤ^{n×n} such that ‖AB‖∞ < 2^s, where s is the register size.

Main interests:
- ring isomorphism: computation over ℤ/pℤ can be carried through ℤ/2^sℤ when n(p − 1)² < 2^s
- it is a building block for matrix multiplication with larger integers

Two possibilities for hardware support:
- use the floating-point mantissa, i.e. s = 53
- use native integers, i.e. s = 64

Using floating point [Dumas, Gautier, Pernet 2002]
- historically, the first approach in computer algebra
✓ out-of-the-box performance from optimized BLAS
✗ handles matrices with entries < 2^26

Using native integers [Goto, Van De Geijn 2008]
✓ apply the same optimizations as BLAS libraries
✓ handles matrices with entries < 2^32

Matrix multiplication with small integers

Number of vector operations per cycle (pipelined):

                                      floating point    integers
Nehalem (2008), SSE4, 128 bits        1 mul + 1 add     1 mul + 2 add
Sandy Bridge (2011), AVX, 256 bits    1 mul + 1 add     1 mul + 2 add
Haswell (2013), AVX2, 256 bits        2 FMA

[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS double (SSE, AVX, AVX2) and our integer code (SSE, AVX2); benchmark on Intel i7-4960HQ @ 2.60GHz]

Matrix multiplication with small integers

Matrix multiplication modulo a small integer
Let p such that (p − 1)² × n < 2^53:
1. perform the multiplication in ℤ using BLAS
2. reduce the result modulo p

[Figure: GFLOPS vs. matrix dimension (32 to 8192) for OpenBLAS and the mod p products (classic and Winograd); benchmark on Intel i7-4960HQ @ 2.60GHz]
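The two steps above map directly to code. A minimal sketch (not the actual library code), assuming a CBLAS dgemm is available, with row-major n × n matrices already stored in doubles, entries in [0, p), and (p − 1)² × n < 2^53:

// Minimal sketch: modular product via one exact dgemm over doubles.
#include <cmath>
#include <vector>
#include <cblas.h>

void matmul_mod_p(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, double p) {
    // step 1: C = A * B exactly over Z (every entry < 2^53 fits a mantissa)
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    // step 2: one reduction mod p per entry
    for (double& c : C) c = std::fmod(c, p);
}

One fmod per output entry is all the modular arithmetic needed, which is why the BLAS product dominates the running time.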


Matrix multiplication with multi-precision integers

Direct approach
Let M(k) be the bit complexity of k-bit integer multiplication, and let A, B ∈ ℤ^{n×n} with ‖A‖∞, ‖B‖∞ ∈ O(2^k). Computing AB with the direct algorithm costs O(n^ω M(k)) bit operations.
✗ not the best possible complexity, since M(k) is super-linear
✗ not efficient in practice

Remark: use an evaluation/interpolation technique for better performance!

Multi-modular matrix multiplication

Multi-modular approach
Let ‖AB‖∞ < M = ∏_{i=1}^{k} m_i, with primes m_i ∈ O(1); then AB can be reconstructed with the CRT from (AB) mod m_i:
1. for each m_i compute A_i = A mod m_i and B_i = B mod m_i
2. for each m_i compute C_i = A_i B_i mod m_i
3. reconstruct C = AB from (C_1, …, C_k)

Bit complexity: O(n^ω k + n² R(k)), where R(k) is the cost of reduction/reconstruction.

R(k) = O(M(k) log k) using a divide-and-conquer strategy
R(k) = O(k²) using the naive approach
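In code, the three-step driver has the following shape. This is only a sketch with placeholder types and routines: BigIntMatrix, ResidueMatrix, reduce_mod, mul_mod and crt_reconstruct stand for what the next slides describe.

#include <cstdint>
#include <vector>

struct BigIntMatrix {};    // multi-precision entries (e.g. GMP-based)
struct ResidueMatrix {};   // word-size entries modulo one m_i

ResidueMatrix reduce_mod(const BigIntMatrix& A, uint64_t m);
ResidueMatrix mul_mod(const ResidueMatrix& A, const ResidueMatrix& B, uint64_t m);
BigIntMatrix crt_reconstruct(const std::vector<ResidueMatrix>& C,
                             const std::vector<uint64_t>& m);

BigIntMatrix multimod_mul(const BigIntMatrix& A, const BigIntMatrix& B,
                          const std::vector<uint64_t>& m) {
    const std::size_t k = m.size();
    std::vector<ResidueMatrix> Ai, Bi, Ci;
    for (std::size_t i = 0; i < k; ++i) {   // step 1: multi-modular reductions
        Ai.push_back(reduce_mod(A, m[i]));
        Bi.push_back(reduce_mod(B, m[i]));
    }
    for (std::size_t i = 0; i < k; ++i)     // step 2: k small modular products
        Ci.push_back(mul_mod(Ai[i], Bi[i], m[i]));
    return crt_reconstruct(Ci, m);          // step 3: CRT reconstruction
}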

Multi-modular matrix multiplication

Improving the naive approach with linear algebra
The reduction/reconstruction of n² values corresponds to a matrix multiplication:
✓ improves the bit complexity from O(n²k²) to O(n²k^{ω−1})
✓ benefits from optimized matrix multiplication, i.e. SIMD

Remark: a similar approach has been used in a non-distributed code [Doliskani, Schost 2010].

Multi-modular reductions of an integer matrix

Let us assume M = ∏_{i=1}^{k} m_i < β^k with m_i < β.

Multi-reduction of a single entry
Let a = a_0 + a_1 β + ⋯ + a_{k−1} β^{k−1} be a value to reduce mod the m_i; then

$$
\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}
=
\begin{pmatrix}
1 & |\beta|_{m_1} & \cdots & |\beta^{k-1}|_{m_1} \\
\vdots & \vdots & & \vdots \\
1 & |\beta|_{m_k} & \cdots & |\beta^{k-1}|_{m_k}
\end{pmatrix}
\times
\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
- Q \times
\begin{pmatrix} m_1 \\ \vdots \\ m_k \end{pmatrix}
$$

with ‖Q‖∞ < kβ².

Lemma: if kβ² ∈ O(1), then the reduction of n² integers modulo the m_i's costs O(n²k^{ω−1}) + O(n²k) bit operations.
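A sketch of this reduction as one dgemm, assuming the limb matrix and the precomputed power table are stored in doubles and kβ² < 2^53 so the floating-point products are exact; here one fmod per entry plays the role of the "− Q × (m_1 … m_k)" correction and computes the same residues:

#include <cmath>
#include <vector>
#include <cblas.h>

// T, k x k row-major: T[i*k + j] = beta^j mod m_i (precomputed once)
// L, k x n2 row-major: column e holds the k limbs a_0..a_{k-1} of entry e
// R, k x n2 row-major: on output R[i*n2 + e] = (entry e) mod m_i
void multimod_reduce(const std::vector<double>& T, const std::vector<double>& L,
                     std::vector<double>& R, const std::vector<double>& m,
                     int k, int n2) {
    // R = T * L : R[i][e] = sum_j (beta^j mod m_i) * a_j(e), each < k*beta^2
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                k, n2, k, 1.0, T.data(), k, L.data(), n2, 0.0, R.data(), n2);
    for (int i = 0; i < k; ++i)
        for (int e = 0; e < n2; ++e)
            R[i * n2 + e] = std::fmod(R[i * n2 + e], m[i]);
}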

Multi-modular reconstruction of an integer matrix

Let us assume M = ∏_{i=1}^{k} m_i < β^k with m_i < β, and let M_i = M/m_i.

CRT formula: $a = \Big(\sum_{i=1}^{k} |a|_{m_i} \cdot M_i \cdot |M_i^{-1}|_{m_i}\Big) \bmod M$

Reconstruction of a single entry
Let $M_i |M_i^{-1}|_{m_i} = \alpha_0^{(i)} + \alpha_1^{(i)} \beta + \cdots + \alpha_{k-1}^{(i)} \beta^{k-1}$ be the CRT constants; then

$$
\begin{pmatrix} a_0 \\ \vdots \\ a_{k-1} \end{pmatrix}
=
\begin{pmatrix}
\alpha_0^{(1)} & \cdots & \alpha_0^{(k)} \\
\vdots & & \vdots \\
\alpha_{k-1}^{(1)} & \cdots & \alpha_{k-1}^{(k)}
\end{pmatrix}
\times
\begin{pmatrix} |a|_{m_1} \\ \vdots \\ |a|_{m_k} \end{pmatrix}
$$

with a_j < kβ² and a = (a_0 + ⋯ + a_{k−1} β^{k−1}) mod M the CRT solution.

Lemma: if kβ² ∈ O(1), then the reconstruction of n² integers from their images modulo the m_i's costs O(n²k^{ω−1}) + Õ(n²k) bit operations.
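The reconstruction admits the dual dgemm formulation; a sketch (the final carry propagation and reduction mod M need big-integer arithmetic, e.g. GMP, and are left out):

#include <vector>
#include <cblas.h>

// Alpha, k x k row-major: Alpha[j*k + (i-1)] = j-th base-beta digit of
//   the CRT constant M_i * |M_i^{-1}|_{m_i}
// Residues, k x n2: Residues[(i-1)*n2 + e] = (entry e) mod m_i
// Digits, k x n2: receives the pseudo-digits a_j < k*beta^2 of each entry
void crt_pseudo_digits(const std::vector<double>& Alpha,
                       const std::vector<double>& Residues,
                       std::vector<double>& Digits, int k, int n2) {
    // Digits = Alpha * Residues : a_j(e) = sum_i alpha_j^(i) * |a(e)|_{m_i}
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                k, n2, k, 1.0, Alpha.data(), k, Residues.data(), n2,
                0.0, Digits.data(), n2);
}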

Matrix multiplication with multi-precision integers

Implementation of the multi-modular approach
- choose β = 2^16 to optimize β-adic conversions
- choose the m_i such that nβm_i < 2^53 and use BLAS dgemm (see the sketch below)
- use a linear storage for multi-modular matrices
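As a toy illustration of the second bullet (not the library's code), one way to collect primes below the 2^53/(nβ) bound; trial division keeps the sketch short, any primality test works:

#include <cstdint>
#include <vector>

static bool is_prime(uint64_t x) {
    if (x < 2) return false;
    for (uint64_t d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

// Largest primes m_i with n * beta * m_i < 2^53, so that the inner
// dgemm accumulations stay exact in doubles.
std::vector<uint64_t> choose_moduli(uint64_t n, uint64_t beta, std::size_t count) {
    const uint64_t bound = (uint64_t(1) << 53) / (n * beta);
    std::vector<uint64_t> ms;
    for (uint64_t m = bound - 1; m >= 2 && ms.size() < count; --m)
        if (is_prime(m)) ms.push_back(m);
    return ms;
}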

Compare sequential performances with:
- FLINT library¹: uses divide and conquer
- Mathemagix library²: uses divide and conquer
- Doliskani's code³: uses dgemm for reductions only

1. www.flintlib.org
2. www.mathemagix.org
3. courtesy of J. Doliskani

Matrix multiplication with multi-precision integers

[Figures: time (s, log scale) vs. entry bitsize (2^4 to 2^16) for FLINT, Mathemagix, Doliskani's code and our code, at matrix dimensions 32, 128, 512 and 1024; benchmark on Intel Xeon-2620 @ 2.0GHz]

Parallel multi-modular matrix multiplication

Steps 1 and 3:
1. for i = 1…k compute A_i = A mod m_i and B_i = B mod m_i
3. reconstruct C = AB from (C_1, …, C_k)

Parallelization of multi-modular reduction/reconstruction: each thread reduces (resp. reconstructs) a chunk of the given matrix.
[Diagram: with 4 threads and k = 7, each thread handles a chunk of the reductions A_0 = A mod m_0, …, A_6 = A mod m_6]

Parallel multi-modular matrix multiplication

Step 2: for i = 1…k compute C_i = A_i B_i mod m_i

Parallelization of modular multiplication: each thread computes a bunch of matrix multiplications mod m_i, e.g. with 4 threads and k = 7:
- thread 0: C_0 = A_0 B_0 mod m_0, C_1 = A_1 B_1 mod m_1
- thread 1: C_2 = A_2 B_2 mod m_2, C_3 = A_3 B_3 mod m_3
- thread 2: C_4 = A_4 B_4 mod m_4, C_5 = A_5 B_5 mod m_5
- thread 3: C_6 = A_6 B_6 mod m_6
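A minimal OpenMP sketch of this step; mul_mod is assumed to be the BLAS-based product modulo m_i from the small-integer section (the talk's actual code uses OpenMP tasks plus affinity tuning):

#include <cstdint>
#include <vector>

template <class Matrix>
Matrix mul_mod(const Matrix& A, const Matrix& B, uint64_t m);

// Each modular product is independent, so a parallel-for over the
// k moduli distributes the work across threads.
template <class Matrix>
void parallel_modular_products(const std::vector<Matrix>& A,
                               const std::vector<Matrix>& B,
                               std::vector<Matrix>& C,
                               const std::vector<uint64_t>& m) {
    const int k = static_cast<int>(m.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < k; ++i)
        C[i] = mul_mod(A[i], B[i], m[i]);   // C_i = A_i * B_i mod m_i
}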

Parallel matrix multiplication with multi-precision integers
- based on OpenMP tasks
- CPU affinity (hwloc-bind), custom allocator (tcmalloc)
- work on a better memory strategy is still in progress

[Figure: speedup of parallel integer matrix multiplication vs. number of cores (2 to 12) for (dim = 1024, bitsize = 1024), (dim = 1024, bitsize = 4096) and (dim = 256, bitsize = 8192); benchmark on Intel Xeon-2620 @ 2.0GHz (2 NUMA nodes with 6 cores each)]


Matrix multiplication over F_p[x]

We consider the "easiest" case: A, B ∈ F_p[x]^{n×n} such that deg(AB) < k = 2^t, where
- p is a Fourier prime, i.e. p = 2^t q + 1
- p is such that n(p − 1)² < 2^53

Complexity: O(n^ω k + n² k log k) operations in F_p using evaluation/interpolation with the FFT.

Remark: using a Vandermonde matrix one can get an approach similar to the integer case, i.e. O(n^ω k + n² k^{ω−1}).

Matrix multiplication over F_p[x]

Evaluation/Interpolation scheme
Let θ be a primitive k-th root of unity in F_p.
1. for i = 1…k compute A_i = A(θ^{i−1}) and B_i = B(θ^{i−1})
2. for i = 1…k compute C_i = A_i B_i over F_p
3. interpolate C = AB from (C_1, …, C_k)

Steps 1 and 3: O(n²) calls to FFT_k over F_p
Step 2: k matrix multiplications modulo a small prime p
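A sketch of the whole scheme; fft_eval, fft_interp and mul_mod are hypothetical helpers standing for the FFT over F_p at the powers of θ and the BLAS-based modular product described earlier:

#include <cstdint>
#include <vector>

template <class Matrix>   // Matrix: n x n over F_p
std::vector<Matrix> fft_eval(const std::vector<Matrix>& coeffs, int k,
                             uint64_t theta, uint64_t p);
template <class Matrix>
std::vector<Matrix> fft_interp(const std::vector<Matrix>& values, int k,
                               uint64_t theta, uint64_t p);
template <class Matrix>
Matrix mul_mod(const Matrix& A, const Matrix& B, uint64_t p);

template <class Matrix>
std::vector<Matrix> polymatmul(const std::vector<Matrix>& A,  // A[j]: coeff of x^j
                               const std::vector<Matrix>& B,
                               int k, uint64_t theta, uint64_t p) {
    std::vector<Matrix> Ae = fft_eval(A, k, theta, p);   // step 1: evaluate
    std::vector<Matrix> Be = fft_eval(B, k, theta, p);
    std::vector<Matrix> Ce(k);
    for (int i = 0; i < k; ++i)                          // step 2: k pointwise
        Ce[i] = mul_mod(Ae[i], Be[i], p);                // products over F_p
    return fft_interp(Ce, k, theta, p);                  // step 3: interpolate
}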

FFT with SIMD over F_p

Butterfly operation modulo p: compute X + Y mod p and (X − Y)·θ^{2^i} mod p.
- Barrett's modular multiplication with a constant (NTL)
- compute in [0, 2p) to remove two conditionals [Harvey 2014]

Let X, Y ∈ [0, 2p), W ∈ [0, p), p < β/4 and W′ = ⌈Wβ/p⌉.

Algorithm: Butterfly(X, Y, W, W′, p)
1: X′ := X + Y mod 2p
2: T := X − Y + 2p
3: Q := ⌈W′T/β⌉
4: Y′ := (WT − Qp) mod β
5: return (X′, Y′)

→ 1 high short product, 2 low short products

✓ SSE/AVX provide 16-bit and 32-bit low short products
✗ no high short product available (use a full product instead)
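A scalar C++ sketch of the butterfly with β = 2^32; this uses the floor variant of the precomputation (as in NTL's MulModPrecon), under which Harvey's analysis keeps all values in [0, 2p). The SIMD versions vectorize exactly these lines, emulating the high short product with full products as noted above.

#include <cstdint>

struct Butterfly {
    uint32_t p;   // odd prime, p < beta/4 = 2^30
    uint32_t W;   // twiddle factor, W < p
    uint32_t Wp;  // precomputed W' = floor(W * 2^32 / p)

    Butterfly(uint32_t p_, uint32_t W_)
        : p(p_), W(W_), Wp(uint32_t((uint64_t(W_) << 32) / p_)) {}

    // X, Y in [0, 2p) on input and on output
    void apply(uint32_t& X, uint32_t& Y) const {
        uint32_t Xp = X + Y;                              // X' = X + Y mod 2p
        if (Xp >= 2 * p) Xp -= 2 * p;
        uint32_t T = X - Y + 2 * p;                       // T in (0, 4p), no branch
        uint32_t Q = uint32_t((uint64_t(Wp) * T) >> 32);  // 1 high short product
        uint32_t Yp = W * T - Q * p;                      // 2 low short products, mod 2^32
        X = Xp;
        Y = Yp;                                           // (X - Y)*W mod p, in [0, 2p)
    }
};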

Matrix multiplication over F_p[x]

Implementation
- radix-4 FFT with 128-bit SSE (29-bit primes)
- BLAS-based matrix multiplication over F_p [FFLAS-FFPACK library]

[Figure: polynomial matrix multiplication performances; time (s, log scale) for FLINT, Mathemagix (MMX) and our code on matrices of dimension n = 16 to 2048 with degrees from 32 to 8192; benchmark on Intel Xeon-2620 @ 2.0GHz]

Matrix multiplication over ℤ[x]

A, B ∈ ℤ[x]^{n×n} such that deg(AB) < d and ‖AB‖∞ < k.

Complexity:
- Õ(n^ω d log(d) log(k)) bit operations using Kronecker substitution
- O(n^ω d log(k) + n² d log(d) log(k)) bit operations using CRT + FFT

Remark: if the result's degree and bitsize are not too large, CRT with Fourier primes may suffice.

Matrix multiplication over ℤ[x]

Implementation
- use CRT with Fourier primes
- re-use the multi-modular reduction/reconstruction with linear algebra
- re-use the multiplication in F_p[x]

[Figure: time (s, log scale) for FLINT vs. our code with log(k) = 64 to 512, on (n = 16, deg = 8192) and (n = 128, deg = 512); benchmark on Intel Xeon-2620 @ 2.0GHz]

Parallel matrix multiplication over ℤ[x]

Very first attempt (work still in progress):
- parallel CRT with linear algebra (same code as in the ℤ case)
- each multiplication over F_p[x] is performed in parallel
- some parts of the code are still sequential

  n  |   d   | log(k) | seq. time | 6 cores | 12 cores
 64  | 1024  |  600   |   61.1s   |  ×3.52  |  ×4.88
 32  | 4096  |  600   |   64.4s   |  ×3.68  |  ×5.02
 32  | 2048  | 1024   |   54.5s   |  ×3.95  |  ×5.73
128  |  128  | 1024   |   53.9s   |  ×3.76  |  ×5.55

Polynomial Matrix in LinBox (proposition)

Generic handler class for polynomial matrices:

template < /* … */ > class PolynomialMatrix;

Specializations for different memory strategies:

// Matrix of polynomials
template < /* … */ > class PolynomialMatrix< /* … */ >;
// Polynomial of matrices
template < /* … */ > class PolynomialMatrix< /* … */ >;
// Polynomial of matrices (partial view on monomials)
template < /* … */ > class PolynomialMatrix< /* … */ >;

Conclusion

High performance tools for exact linear algebra:
- matrix multiplication through floating point
- multi-dimensional CRT
- FFT for polynomials over word-size prime fields
- adaptive matrix representation

We provide in the LinBox library (www.linalg.org):
- efficient sequential/parallel matrix multiplication over ℤ
- efficient sequential matrix multiplication over F_p[x] and ℤ[x]