On parallel block algorithms for exact triangularizations

Jean-Guillaume Dumas (a), Jean-Louis Roch (b)

(a) Laboratoire de Modélisation et Calcul, B.P. 53, 51 av. des Mathématiques, 38041 Grenoble, France.

(b) Laboratoire Informatique et Distribution, projet APACHE, CNRS–INRIA–INPG–UJF, ZIRST, 51 av. Jean Kuntzmann, 38330 Montbonnot Saint-Martin, France.

Email addresses: [email protected] (Jean-Guillaume Dumas), [email protected] (Jean-Louis Roch).

Abstract

We present a new parallel algorithm to compute an exact triangularization of large square or rectangular, dense or sparse matrices over any field. Using fast matrix multiplication, our algorithm has the best known sequential arithmetic complexity. Furthermore, on distributed architectures, it drastically reduces the total volume of communication compared to previously known algorithms. The resulting matrix can be used to compute the rank or to solve a linear system. Over finite fields, for instance, our method has proven useful in the computation of large Gröbner bases arising in robotic problems or wavelet image compression.

Key words: Parallel block triangularization, Exact LU factorization of rectangular matrices, Symbolic rank computation, Sparse matrices, Galois finite fields, BLAS, Fast symbolic matrix multiplication.

1 Introduction

In this article, we study the parallelization of the exact LU factorization of matrices with arbitrary field elements. The matrix can be singular or even rectangular, of dimension m × n. Our main purpose is to compute the rank of large matrices. Therefore, we relax the conditions on L in order to obtain a TU factorization, where U is m × n and upper triangular as usual, and T is m × m and block sparse (with some "T" patterns).

Exact triangularization arises in various applications and especially in computer algebra. For instance, one of the main tools for solving algebraic systems is the computation of Gröbner bases [1]. Current methods to compute such standard bases use modular triangularization of large sparse rectangular matrices [2]. Among other applications of symbolic LU are combinatorics [3], fast determinant computation, Diophantine analysis, group theory and algebraic topology via the computation of the integer Smith normal form (an integer diagonal canonical form [4]).

A first idea is to use a parallel direct method [5, chapter 11] on matrices stored by rows (respectively columns). There, at stage k of the classical Gaussian elimination algorithm, eliminations are executed in parallel on the n − k − 1 remaining rows. This method does not directly enable efficient modular computations, which are of low cost; gathering the operations to obtain a larger grain size is therefore necessary. The next idea is then to mimic numerical methods and use sub-matrices. This way, as the computations are local, it is possible to take advantage of cache effects as in the BLAS numerical libraries [6]. The problem now is that, for symbolic computation, these blocks are usually singular. To solve this problem one has mainly two alternatives.

One is to perform a dynamic cutting of the matrix and to adjust it so that the blocks are reduced and become invertible. Such a method is presented by Ibarra et al. in [7; 8], and studied in detail in [9, Chapter 2]. Their algorithm (LSP) groups rows into two regions [10, Problem 2.7c]. A recursive process is then used to compute the rank of the first region. Then, depending on this rank, the cutting is modified and the algorithm proceeds with a new region. This way, Ibarra et al. were able to build the first algorithm computing the rank of an m × n matrix with arithmetic complexity O(m^{ω−1} n), where the complexity of matrix multiplication is O(m^ω). Unfortunately, their method is not so efficient in parallel: it induces synchronizations and significant communications at each stage in order to compute the block redistribution.

We therefore propose another method, called TURBO, using static blocks (which might be singular) in order to avoid these synchronizations and redistributions. Our algorithm also has an optimal sequential arithmetic complexity and is able to avoid as much as a third of the communications. A preliminary version of this paper appeared in [11]. Here we include a complete asymptotic analysis, give sharper bounds on the number of needed arithmetic operations and offer more experimental results.

The paper is organized as follows. In Section 2, we detail this new recursive block algorithm. This presentation is followed, in Sections 3 and 4, by asymptotic arithmetic and communication cost analyses. Finally, in Section 5, practical performance is shown on matrices involved in the computation of Gröbner bases and Homology groups of simplicial complexes [12; 13].

2 A new block algorithm

2.1 TURBO algorithm

In TURBO, the elementary operation is a block operation and not a scalar one. In addition, the cutting of the matrix into blocks is carried out before the execution of the algorithm and is not modified afterwards, in order to limit the volume of communications. We choose to describe the algorithm in a recursive way to simplify its theoretical study; the threshold of the recursive cutting is then set when the initial structure of the matrix is decided. Now take a 2m × 2n matrix A over a field IF. Our method recursively divides the matrix into four regions:



N W N E A=   SW SE Local T U factorizations are then performed on each block. The method is applied recursively until the region size reaches a given threshold. We show here the algorithm for only one iteration. The factorization is done in place, i.e. the matrix A in input is not copied: the algorithm modifies its elements progressively. The following code computes the upper triangular form U . It can easily be completed in order to also compute the matrix T such that A = T U . To illustrate the algorithm we show figures representing the matrix after each step. In these figures we supposed that the initial cutting of the matrix was in 10 × 10 = 100 blocks. Algorithm TURBO: TU Recursive factorization with Blocks.

Input: a matrix A ∈ IF^{2m×2n}, of rank r.
Output: A, modified in place into an upper triangular matrix with an invertible r × r leading principal minor.

Step 1. Recursive TU triangularization in NW

Compute L1 ∈ IF^{m×m} lower triangular, U1 ∈ IF^{q×q} upper triangular with rank q, and G1 such that

L1 × NW = \begin{pmatrix} U1 & G1 \\ 0 & 0 \end{pmatrix}

Fig. 1. Matrix A after step 1

The NE region can now be updated: B1 = L1 × NE. Simultaneously, zeroes under U1 in SW are computed. Let SW(,1..q) stand for the first q columns of SW; then N1 = −SW(,1..q) × U1^{-1} ∈ IF^{m×q} is computed. Finish with the update of the SW region, I1 = SW(,q+1..n) + N1 × G1, and the first rows of NE are multiplied by N1 in order to update SE: E1 = SE + N1 × B1(1..q,).

Step 2. Recursive TU triangularization in SE

Compute L2 ∈ IF^{m×m} lower triangular, V2 ∈ IF^{p×p} upper triangular with rank p, and E2 such that

L2 × E1 = \begin{pmatrix} V2 & E2 \\ 0 & 0 \end{pmatrix}


Then, as in step 1, SW is updated using

\begin{pmatrix} I2 \\ F2 \end{pmatrix} = L2 × I1

and zeroes over V2 are computed, i.e. N2 = −B1(q+1..m,1..p) × V2^{-1}. Then H2 = B1(q+1..m,p+1..n) + N2 × E2 and O2 = NW(q+1..m,q+1..n) = N2 × I2 (this block of NW was zeroed at step 1, so it is simply overwritten).

Fig. 2. Matrix A after step 2

Step 3. Parallel recursive TU in SW and NE

Compute L3, D3 ∈ IF^{d×d} with rank d, F3, and M3, C3 ∈ IF^{c×c} with rank c, H3 such that

L3 × F2 = \begin{pmatrix} D3 & F3 \\ 0 & 0 \end{pmatrix}  and  M3 × H2 = \begin{pmatrix} C3 & H3 \\ 0 & 0 \end{pmatrix}

Then NW is updated:

\begin{pmatrix} O3 \\ X2 \;\; T2 \end{pmatrix} = M3 × O2

and zeroes over D3 are computed using N3 = −X2 × D3^{-1}.

Fig. 3. Matrix A after step 3

This step ends with T3 = T2 + N3 × F3. Of course, it is possible to create zeroes to the left of C3 instead of over D3 in NW. On the one hand, the choice can be made in view of the respective dimensions of the blocks: the smallest block, i.e. the one inducing the maximum number of zeroes and consequently the minimum volume of communication, is chosen. On the other hand, a parallel version could choose the first ready block instead.

Step 4. Small recursive TU in NW again

Compute L4, Z4 ∈ IF^{z×z} with rank z, and T4 such that

L4 × T3 = \begin{pmatrix} Z4 & T4 \\ 0 & 0 \end{pmatrix}

Fig. 4. Matrix A after step 4

Step 5. Virtual row permutations

I2–V2–E2 can be inserted between U1–G1–B2 and O3–C3–H3. Z4–T4 can be moved under F3.

Step 6. Virtual column permutations

B2–V2–E2–C3–H3 can be inserted between U1 and G1–I2–O3–D3–F3–Z4–T4.

Step 7. Rank

r = q + p + c + d + z.
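To make the block operations concrete, here is a minimal Python sketch of step 1 over a prime field GF(p). It is not the authors' implementation: it uses a plain Gaussian elimination as a stand-in for the recursive local TU factorization, assumes a small prime p (so that int64 products do not overflow), and assumes that the q pivots of NW fall in its leading columns, the general case being handled by the virtual permutations of steps 5 and 6. The helper names (echelon_mod_p, inv_upper_mod_p, turbo_step1) are hypothetical.

```python
import numpy as np

def echelon_mod_p(A, p):
    """Return (T, R, q) with T @ A = R (mod p), T invertible and R having
    its q nonzero rows on top (q is the rank). Stand-in for the local TU step."""
    m, n = A.shape
    R = (A % p).astype(np.int64)
    T = np.eye(m, dtype=np.int64)
    q = 0
    for j in range(n):
        piv = next((i for i in range(q, m) if R[i, j]), None)
        if piv is None:
            continue
        R[[q, piv]], T[[q, piv]] = R[[piv, q]], T[[piv, q]]   # row swap
        inv = pow(int(R[q, j]), p - 2, p)                     # modular inverse (p prime)
        for i in range(q + 1, m):                             # eliminate below the pivot
            if R[i, j]:
                f = (R[i, j] * inv) % p
                R[i] = (R[i] - f * R[q]) % p
                T[i] = (T[i] - f * T[q]) % p
        q += 1
        if q == m:
            break
    return T, R, q

def inv_upper_mod_p(U, p):
    """Invert a q x q upper-triangular matrix with nonzero diagonal over GF(p)."""
    q = U.shape[0]
    X = np.zeros((q, q), dtype=np.int64)
    for j in range(q):
        e = np.zeros(q, dtype=np.int64)
        e[j] = 1
        for i in range(q - 1, -1, -1):                        # back substitution
            s = (e[i] - U[i, i + 1:] @ X[i + 1:, j]) % p
            X[i, j] = (s * pow(int(U[i, i]), p - 2, p)) % p
    return X

def turbo_step1(A, p):
    """One TURBO step 1 on a 2m x 2n matrix A over GF(p) (sketch only)."""
    two_m, two_n = A.shape
    m, n = two_m // 2, two_n // 2
    NW, NE = A[:m, :n], A[:m, n:]
    SW, SE = A[m:, :n], A[m:, n:]
    L1, R, q = echelon_mod_p(NW, p)                  # L1 @ NW = [[U1, G1], [0, 0]]
    U1, G1 = R[:q, :q], R[:q, q:]
    B1 = (L1 @ NE) % p                               # update NE
    N1 = (-SW[:, :q] @ inv_upper_mod_p(U1, p)) % p   # eliminate below U1
    I1 = (SW[:, q:] + N1 @ G1) % p                   # update the remaining SW columns
    E1 = (SE + N1 @ B1[:q, :]) % p                   # update SE with the pivot rows of B1
    return q, U1, G1, B1, N1, I1, E1
```

A quick sanity check on such a sketch is that (L1 @ NW) % p has zeroes in its last m − q rows and that q equals the rank of NW over GF(p).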

The numbers in Figures 1, 2, 3, 4 and 5 match the last modification step. The whole algorithm is a variant of this one preserving the intermediate matrices L_i. Moreover, as steps 5 and 6 are only virtual, a permutation vector is computed in order to enable the next recursive phases; Figure 5 shows the matrix if those two steps are actually performed. Now, Figure 6 shows the data dependency graph between the different computations. This graph can be recursive for the whole algorithm, as for the large multiplication tasks; besides, for those, fast parallel matrix multiplication is applied. The major interest of this algorithm, apart from enabling fast matrix arithmetic, is the communication savings.

Fig. 5. Matrix A after virtual step 6


Fig. 6. Data dependency graph for the parallel block recursive algorithm


Indeed, most of the operations are local and only the updating matrices are sent to the other regions. Further, as the computations are local and not redistributed, it is possible to efficiently take advantage of cache effects inside the blocks (in a BLAS way, see [14], where an efficient implementation of finite field linear algebra subroutines on top of numerical BLAS is proposed).
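As a toy illustration of that BLAS-based idea (a sketch only, not the code of [14]): products over GF(p) can be delegated to a numerical matrix product and reduced afterwards, which is exact as long as every dot product fits in the 53-bit mantissa of a double.

```python
import numpy as np

def matmul_mod_p(A, B, p):
    """Multiply two matrices over GF(p) through a floating-point (BLAS) product.
    Exact provided n * (p - 1)^2 < 2^53, where n is the inner dimension,
    so that every dot product is computed without rounding."""
    n = A.shape[1]
    assert n * (p - 1) ** 2 < 2 ** 53, "p or n too large for an exact float64 product"
    C = (A.astype(np.float64) % p) @ (B.astype(np.float64) % p)
    return (C % p).astype(np.int64)
```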

2.2 Effective weight of the steps

In practice, Table 1 shows the successive ranks for different matrices arising in Gröbner bases computations (robot24 m5, rkat7 m5, f855 m9 and cyclic8 m11 [15]) and, using the integer matrix Smith normal form, arising when computing Homology groups of simplicial complexes (mki.bk are matching complexes and chi-j.bk are chess-board complexes [16; 17]).

Matrix         2m × 2n      q     p      c     d     z
ch5-5.b2       600x200      100   55     21    0     0
mk9.b2         1260x378     189   102    52    0     0
ch6-6.b2       2400x450     225   105    85    0     0
ch4-4.b2       96x72        24    23     7     3     0
ch5-5.b3       600x600      156   158    70    40    0
mk9.b3         945x1260     286   334    136   119   0
robot24 m5     404x302      54    141    10    57    0
rkat7 m5       694x738      95    239    130   108   39
f855 m9        2456x2511    123   1064   189   164   791
cyclic8 m11    4562x5761    392   1927   521   354   709

Table 1. Successive ranks in the block recursive algorithm

We see that the last block is often all zero (z = 0), and that the other four ranks are fairly evenly balanced on average. The next sections study the arithmetic complexities and communication costs of our technique.

3 Arithmetic complexity

In this section, we show that our algorithm has an arithmetic complexity similar to the best known ones. We also show that its parallel arithmetic complexity is analogous to that of the most efficient parallel algorithms.

3.1 Sequential cost

ω is the exponent of the complexity of matrix multiplication (i.e. 3 for the classical multiplication or log2(7) ≈ 2.807355 for Strassen's [18], the current record being below 2.375477 [19] and the lower bound being 2). We will therefore denote the cost of the multiplication of two matrices, of respective dimensions 2m × 2n and 2n × 2l, by M(h) = O(h^ω) for h = max{2m; 2n; 2l}.

Theorem 3.1. Let T1(h) be the sequential arithmetic complexity of algorithm TURBO for a rectangular matrix of larger dimension h. Then,

T1(h) ≤ \frac{29}{4(2^{ω−1} − 1)} M(h) + 2h^2 = O(h^ω).

Proof. To prove this theorem, the 5 triangularizations have to be gathered into two groups such that each one of these two groups has a total rank at most h/2. An induction process can then be applied to the respective costs in order to achieve the given upper bound. Following the dependency graph of Figure 6 and the algorithm, we see that TURBO requires 4 multiplications, 1 triangular inversion and 2 additions for step 1. On step 2, 4 multiplications, 1 triangular inversion and only 1 addition are required. We end with step 3, where 2 multiplications, 1 triangular inversion and 1 addition are needed. This sums up to 10 multiplications, 3 inversions and 4 additions of matrices of size at most h/2. Moreover, if the multiplication cost is M(h) ≤ K h^ω, triangular inversion is bounded by (3/2) M(h) [20, theorem 6.2]. Then the cost of our algorithm is bounded as follows (q, p, c, d and z are the respective ranks of U1, V2, C3, D3 and Z4):

T1(h) ≤ T1(p) + T1(q) + T1(c) + T1(d) + T1(z) + \frac{29}{2} K \frac{h^ω}{2^ω} + h^2.   (1)

Now suppose that T1(x) ≤ K' x^ω + 2x^2 for all x < h, for some K'. Then, by induction we have:

T1(h) ≤ K' (q^ω + c^ω + z^ω + p^ω + d^ω) + \frac{29}{2} K \frac{h^ω}{2^ω} + 2(q^2 + c^2 + z^2 + p^2 + d^2) + h^2.   (2)

Also, as ω ≥ 2 and all the intermediate ranks are non-negative, we bound q^ω + c^ω + z^ω by (q + c + z)^ω and p^ω + d^ω by (p + d)^ω. Besides, using the locality of the ranks, we have:

q + c + z ≤ h/2  and  p + d ≤ h/2,   (3)
q + d + z ≤ h/2  and  p + c ≤ h/2.   (4)

Thus, with Relations 3, we get

T1(h) ≤ \left(\frac{4K' + 29K}{2^{ω+1}}\right) h^ω + 2h^2.

Therefore we can take K' = \frac{29K}{2^{ω+1} − 4} and, as T1(2) = 3, the induction and the theorem are proven.

The theorem shows that, when using classical multiplication, our algorithm then requires the equivalent of only 29/12 ≈ 2.416667 matrix multiplications. On the other hand, the arithmetic complexity of our algorithm is the best known complexity for this problem in terms of "big-Oh" [8, Theorem 2.1]. But Ibarra's algorithm (LSP) has a different bound: 3/(2^{ω−1} − 2). We compare these two bounds for different kinds of matrix multiplication in Table 2.

ω           TURBO       LSP
3           2.416667    1.5
2.807375    2.899943    1.999935
2.375477    4.546775    5.045945

Table 2. Number of arithmetic operations for TURBO and LSP
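The entries of Table 2 follow directly from the two closed forms 29/(2^{ω+1} − 4) and 3/(2^{ω−1} − 2); a quick numerical check (a sketch using only these two expressions):

```python
def turbo_constant(w):
    # leading constant of Theorem 3.1 with M(h) = h^w: 29 / (4 (2^(w-1) - 1))
    return 29 / (2 ** (w + 1) - 4)

def lsp_constant(w):
    # corresponding constant for the LSP bound: 3 / (2^(w-1) - 2)
    return 3 / (2 ** (w - 1) - 2)

for w in (3.0, 2.807375, 2.375477):
    print(f"omega = {w}: TURBO {turbo_constant(w):.6f}, LSP {lsp_constant(w):.6f}")
```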

The table shows that for most of the practical algorithms, our method requires a few more operations. But, unlike our method, Ibarra’s algorithm groups rows into two regions. Then, depending on the rank of the first region, the matrix structure is modified. Using our block cutting, we instead guarantee that all the accesses are local, thus enabling faster computations.

3.2 Parallel cost

In the parallel case, the situation is different: we obtain only a linear complexity. Considering that parallel triangular matrix multiplication and inversion costs are polylogarithmic (bounded by K∞ log₂²(h)), we have the following result:

Theorem 3.2. Let T∞(h) be the parallel arithmetic complexity of algorithm TURBO for a rectangular matrix of larger dimension h ≥ 22. Then,

T∞(h) ≤ 3 K∞ h = O(h).

As in the sequential case, the idea of this theorem is to gather triangularizations in order to have groups of total rank at most h/2.

Proof. Here, T∞(c) and T∞(d) are parallel. Without loss of generality, we suppose that c ≥ d, and then the difference with the sequential inequality is that T∞(d) is shadowed by T∞(c). Then, let h1 = q, h2 = p, h3 = c and h4 = z. As the parallel multiplication cost is bounded by K∞ log₂²(h), we have:

T∞(h) = K∞ log₂²(h) + T∞(q) + T∞(p) + T∞(c) + T∞(z) = K∞ log₂²(h) + \sum_{i=1}^{4} T∞(h_i).

Therefore, developing one more recursive phase, we get, assuming for now that the four ranks are nonzero:

T∞(h) = K∞ ( log₂²(h) + log₂²(q) + log₂²(c) + log₂²(p) + log₂²(z) ) + \sum_{i,j=1}^{4} T∞(h_{i;j}),   (5)

where T∞(h_{2;3}), for instance, is the cost associated with the rank of block C of the second recursive phase, inside block V of the first recursive phase. Now, as in the sequential theorem, costs are gathered in order to have groups where the rank sum is at most h/2. Thus, we have q + c ≤ h/2. First, we suppose that q ≥ 1 and c ≥ 1. Then log₂²(q) + log₂²(c) is maximal when q = c = h/4, as soon as h ≥ 16. Else, if one of these two ranks is zero, then the cost of the group is bounded by log₂²(h/2). But log₂²(h/2) ≤ 2 log₂²(h/4) as soon as h ≥ 2^{3+√2} ≈ 21.3212. Therefore, asymptotically, log₂²(q) + log₂²(c) ≤ 2 log₂²(h/4). In the same manner we have log₂²(p) + log₂²(z) ≤ 2 log₂²(h/4). Therefore, Formula 5 becomes

T∞(h) ≤ K∞ ( log₂²(h) + 4 log₂²(h/4) ) + \sum_{i,j=1}^{4} T∞(h_{i,j}).   (6)

We finish the proof by gathering and bounding again recursively in the same manner:

T∞(h) ≤ K∞ \sum_{i=0}^{\log_4(h)} 4^i log₂²(h/4^i) ≤ \frac{80}{27} K∞ h ≤ 3 K∞ h.
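The final numerical bound can be spot-checked; the short sketch below (an illustration only) evaluates the left-hand sum for powers of 4 and compares it to (80/27) h:

```python
import math

def critical_path_sum(h):
    """Evaluate sum_{i=0}^{log_4(h)} 4^i * log_2(h / 4^i)^2 for h a power of 4."""
    top = round(math.log(h, 4))
    return sum(4 ** i * math.log2(h / 4 ** i) ** 2 for i in range(top + 1))

for h in (4 ** 4, 4 ** 6, 4 ** 8):
    print(h, critical_path_sum(h), 80 / 27 * h)   # the sum stays below (80/27) h
```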

Therefore, the theoretical complexity is linear, while rank computation is in NC²: using O(n^{4.5}) processors, the computation can be performed in parallel time O(log₂²(n)) [21]. However, the best known parallel algorithm with optimal sequential time also achieves a parallel linear time [8; 10]. But, in practice, our technique is more interesting as it preserves locality. Further, we will see that it reduces the volume of communications on distributed architectures.

4 Scheduling, blocking and communications

We first consider the execution of algorithm TURBO on a PRAM [22] with P processors. For a given matrix, once the computational dependency graph of Figure 6 is known, Brent's principle [22; 10] ensures that the execution can be performed in time TP(h) ≤ T∞(h) + T1(h)/P. However, here, the graph is dynamically generated, since the rank of the related sub-matrices is only known at execution time. Thus, Brent's bound cannot be applied as for classical rank algorithms: the overhead σ [23] introduced for the scheduling (allocation of tasks to idle processors [24]) has to be considered and leads to:

TP(h) ≤ T∞(h) + \frac{T1(h) + σ(h)}{P}.   (7)

In order to achieve asymptotically fast execution on P processors, σ must be reduced. A blocking technique is classically used to this end: when a block of size smaller than a given value k is encountered, its rank is computed using the optimal sequential algorithm. Then the number of tasks is clearly bounded by T1(h/k). Hence, the scheduling overhead is

σ(h/k) = O((h/k)^ω).

Choosing k large enough with respect to T1(h)/h^ω ensures asymptotically optimal execution in time

TP(h) ≤ T1(k) T∞(h) + \frac{T1(h)}{P} + σ(h/k).   (8)

Now let us consider a distributed architecture. We show that in this case, due to the blocking technique, algorithm TURBO requires fewer communications than previously known optimal rank algorithms. Indeed, in order to achieve an optimal number of operations (T1 = O(h^ω)), those algorithms redesign the block structure of the matrix after each elimination step (k is modified). This redistribution involves 2m·2n communications at each step on a 2m × 2n matrix; then, adding the pivot row communications, we obtain a total of 8mn communications. In particular, we consider the block row algorithm of Ibarra et al. [8]. Its communication volume for a 2m × 2n matrix is denoted by I(2m, 2n). Recall that it groups rows into two regions; the whole number of communications performed is then

I(2m, 2n) = 2 I(m, 2n) + 8mn.   (9)

The TURBO algorithm avoids such a redistribution. We denote our communication volume by C(2m, 2n). This volume is a function of the five intermediate ranks (q, p, c, d, z, the ranks of the matrices U1, V2, C3, D3, Z4, with r = q + p + c + d + z). Furthermore, each matrix Xi is computed with the owner-compute rule, i.e. on the processor where it is supposed to be at the end. Now, the analysis of the dependency graph of Figure 6 shows that each one of the following blocks must be communicated once: L1, U1, B1(1..q,), N1, V2, L2, E2, N2, I2, M3, D3, and F3. This leads to the following volume (*):

C(2m, 2n) = 2 C(m, n) + C(m − q, n − p) + C(m − p, n − q) + C(m − q − c, n − q − d) + mn + 2qm + 2pn + q^2 + pm + dn − pq − dq.   (10)

(*) Remark that C3 can be chosen instead of D3 at step 3 of the algorithm; in that case, d must be replaced by c in the formula.

Thus, at the current step, the number of communications is mn + 2qm + 2pn + q^2 + pm + dn − pq − dq; since p, q and d are smaller than min(m, n), this number is always less than the previous 8mn. For our algorithm, the worst case occurs when the matrix is invertible. Then, if the first two pivot blocks are of full rank (p = q = min(m, n) and d = 0), our number of communications is less than 6mn instead of 8mn. In addition, still in a full rank case, when the ranks are more evenly distributed over the blocks, say p = q = d = c = min(m, n)/2, then our number of communications is less than 3.75mn. We can therefore expect very good performance on average. However, solving the previous recurrence equations in the general case is quite complex. Therefore, in the following section, we compare those communications in practice, on specific dense and sparse matrices at a given step.
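These two closed-form cases can be verified directly from the per-step count above; a small sketch (hypothetical helper name, square case m = n):

```python
def step_volume(m, n, q, p, d):
    """One-step TURBO communication count: mn + 2qm + 2pn + q^2 + pm + dn - pq - dq."""
    return m * n + 2 * q * m + 2 * p * n + q * q + p * m + d * n - p * q - d * q

n = 1000                                   # any block dimension, with m = n
full = step_volume(n, n, q=n, p=n, d=0)    # first two pivot blocks of full rank
balanced = step_volume(n, n, q=n // 2, p=n // 2, d=n // 2)
print(full / (n * n), balanced / (n * n))  # prints 6.0 and 3.75, versus 8 for the row scheme
```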

5 Practical communication performance

In this section, we compare the communication volumes between the row and block strategies. We first consider invertible matrices and then the general case. ∗

Remark that C3 can be chosen instead of D3 at step 3 of the algorithm. In that case, d must be replaced by c in the formula.

14

5.1 Invertible principal minors case

Consider a dense matrix. We study the differences in communication volume on a square 2n × 2n invertible matrix, on P processors. There are then two possible cuttings. Suppose that the rows are cyclically distributed on the P processors. At each step, the pivot row must be communicated to every other processor. This leads to the following volume of communications:

L(2n, 2n, P) = \sum_{k=1}^{2n} (P − 1)(2n − k) = n(2n − 1)(P − 1).   (11)

Another way is to consider a cutting of the 2n × 2n matrix into Q square blocks Bij of size (2n/√Q) × (2n/√Q). Blocks are assumed to be cyclically distributed on the processors. We also suppose here that there exists at least one invertible block at each step (which is very restrictive in exact computations). At step k, the block pivot row must be communicated, i.e. each one of the √Q − k + 1 blocks of this row (Bkj for j from k to √Q) must be sent to the √Q − k other remaining blocks in its column. Then each remaining row performs the multiplication by the inverse of block Bkk and communicates the product −Bik · Bkk^{-1} to the √Q − k other blocks of its row. Since 1 block out of P is local, the volume of communications is then:

B(2n, 2n, P, Q) = \left(1 − \frac{1}{P}\right) \sum_{k=1}^{\sqrt{Q}} \left( (\sqrt{Q} − k + 1)(\sqrt{Q} − k) + (\sqrt{Q} − k)^2 \right) \frac{2n}{\sqrt{Q}} \cdot \frac{2n}{\sqrt{Q}},

which gives

B(2n, 2n, P, Q) = \frac{2}{3} (2n)^2 \sqrt{Q} \left(1 − \frac{1}{P}\right) − O(n^2).   (12)

Let ρ be the difference between the volume of one-dimensional communications and the volume of two-dimensional communications, divided by the volume of one-dimensional communications. Then ρ measures the gain in communications between the two strategies; it is normalized in order to compare different kinds of matrices. Now, by taking P = Q, communications are reduced when using a block cutting. As a matter of fact, the gain between rows and blocks is as follows:

ρ = \frac{L(2n, 2n, P) − B(2n, 2n, P, P)}{L(2n, 2n, P)} ≤ 1 − \frac{4}{3\sqrt{P}}.   (13)
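For reference, the right-hand side of Equation 13, the gain aimed at, can be tabulated for a few processor counts (a trivial sketch):

```python
import math

def target_gain(P):
    # upper bound on the row-versus-block gain of Equation 13: 1 - 4 / (3 sqrt(P))
    return 1 - 4 / (3 * math.sqrt(P))

for P in (4, 16, 64):
    print(P, f"{target_gain(P):.2%}")   # about 33%, 67% and 83%
```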

Such a gain is then our goal for the singular case. We show in the next section that, with our new algorithm, we can reach this performance in general.

5.2 General case

We now estimate the gain of our algorithm for a rectangular 2m × 2n matrix of rank r ≤ min{2m, 2n}. We compute the communication volume, in the worst case, for only one phase (no recursion, only the steps previously shown), on 4 processors (one per region). The cost function obtained with 4 processors is already quite complex:

C(2m, 2n, r, 4) = mn + 2qm + 2pn + q^2 + pm + dn − pq − dq.   (14)

In order to give a more precise idea of the gain of our method, we compare this result to the volume of communications obtained with the row method. For the non-invertible case, Formula 11 has to be modified as follows:

L(2n, r, P) = \sum_{k=1}^{r} (P − 1)(2n − k) = r \left(2n − \frac{r+1}{2}\right) (P − 1),   (15)

where r = q + p + c + d + z is the rank of the matrix. Next, Table 3 shows the gain obtained with the previously introduced matrices: the total effective communicated volumes of both methods (row and TURBO) are compared. These matrices are quite sparse; unfortunately, the first version of our algorithm is implemented only for dense matrices. Still, we can see that our method is able to avoid some communications as soon as the matrices are not too special. In the table, the first three matrices have special rank conformations, as seen in Table 1 (d = z = 0 for instance), and are very unbalanced (a very small number of columns compared to the number of rows): in that case a row method can be much more efficient, since it can communicate only the smallest dimension. However, in all the other cases we achieve very good performance: for the less rectangular matrices, we obtain a gain ρ very close to the aimed one (Equation 13 gives 1 − 4/(3√4) = 1/3 ≈ 33%). Now the problem is that the recursive setting is rather delicate. Indeed, to limit the structure overhead, the recursive cutting threshold must be rather high. The induced parallelism is thus not so extensible and the algorithm is interesting on a relatively small number of processors (4, 8, 16, ...). Nevertheless, it is possible to use this cutting only for the first stages (the greediest in communications) and then to switch to the row algorithm, for instance.

Matrix         2m × 2n      r       ρ = (L − C)/L
ch5-5.b2       600x200      176     -57.97%
mk9.b2         1260x378     343     -67.36%
ch6-6.b2       2400x450     415     -123.66%
ch4-4.b2       96x72        57      10.40%
ch5-5.b3       600x600      424     32.80%
mk9.b3         945x1260     875     11.80%
robot24 m5     404x302      262     9.08%
rkat7 m5       694x738      611     34.02%
f855 m9        2456x2511    2331    34.68%
cyclic8 m11    4562x5761    3903    21.02%

Table 3. Communication volume gain
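For completeness, the dense worst-case formulas (14) and (15) can be evaluated directly; the sketch below does so for one of the matrices above, with hypothetical helper names. Note that Table 3 reports measured effective volumes on these rather sparse matrices, so these dense worst-case estimates need not reproduce its ρ column.

```python
def turbo_volume(m, n, q, p, d):
    # Formula (14): one-phase worst-case volume on 4 processors
    return m * n + 2 * q * m + 2 * p * n + q * q + p * m + d * n - p * q - d * q

def row_volume(two_n, r, P):
    # Formula (15): row-cyclic volume for a matrix of rank r
    return r * (two_n - (r + 1) / 2) * (P - 1)

# rkat7 m5: 694 x 738, ranks q, p, c, d, z = 95, 239, 130, 108, 39 (Table 1)
m, n = 694 // 2, 738 // 2
C = turbo_volume(m, n, q=95, p=239, d=108)
L = row_volume(738, r=95 + 239 + 130 + 108 + 39, P=4)
print(C, L, (L - C) / L)   # dense worst-case estimate of the gain
```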

Moreover, this algorithm can be easily adapted to take advantage of sparse matrices, with the only restriction that, in this case, it is not possible to completely apply reordering heuristics to the whole matrix without adding communications (it remains feasible to apply them locally, in each block, but only with limited effects [4]).

6 Conclusions

To conclude, we have developed a new block TU elimination algorithm. Its theoretical sequential and parallel arithmetic complexities are similar to those of the most efficient current elimination algorithms for this problem. Besides, it is particularly well suited to singular matrices and makes it possible to compute the rank exactly. Furthermore, it allows a more flexible management of the scheduling (adaptive grain) and avoids a third of the communications when used with only one level of recursion on 4 processors. In addition, while the increase in locality reduces the number of communications, it also makes it possible to increase the speed through a greater benefit from cache effects.

It remains to implement the parallel finite field subroutines efficiently in order to test the effectiveness in terms of speed-up. For instance, recent sequential experiments [14] show that classical Gaussian elimination can be performed at a speed of about 40 million field operations per second (MFop/s) on a Pentium III 735 MHz, whereas fast matrix multiplication can achieve a speed close to 600 MFop/s on the same machine. Considering that Gaussian elimination requires (2/3) n^3 operations, this means that our algorithm requires at most seven times that number of operations; the compared speeds show that even a sequential speed-up can be achieved.

Lastly, it also remains to study an effective method to reorder sparse matrices in parallel. Indeed, even though it is designed for dense matrices, our algorithm can also be used on sparse matrices. But, in order to attain high speeds, we need to gather the non-zero elements. Figure 8 shows the result of an iteration of our algorithm on a sparse matrix (Figure 7) arising in the computation of Gröbner bases.

Fig. 7. Matrix rkat7 mat5, 694 × 738 of rank 611

We see that a rather significant fill-in occurs during the last phases of this iteration: the final U shape of this matrix has 105516 non-zero elements. By way of comparison, the final LU shape of this matrix, computed with a row algorithm, has 64622 non-zero elements. Moreover, using a reordering technique, one can obtain a triangular form containing far fewer non-zero elements (39477, for instance, for this matrix [4, section 5.4.5]). Therefore, designing an efficient block reordering technique seems to be an important open question.

Fig. 8. Matrix rkat7 mat5 after one phase of our algorithm

References

[1] B. Buchberger, Gröbner bases: An algorithmic method in polynomial ideal theory, in: N. K. Bose (Ed.), Recent Trends in Multidimensional Systems Theory, Mathematics and its Applications, D. Reidel Publishing Company, Dordrecht, The Netherlands, 1985, Ch. 6, pp. 184–232.
[2] J.-C. Faugère, Parallelization of Gröbner basis, in: H. Hong (Ed.), First International Symposium on Parallel Symbolic Computation, PASCO '94, Hagenberg/Linz, Austria, Vol. 5 of Lecture Notes Series in Computing, 1994, pp. 124–132.
[3] W. D. Wallis, A. P. Street, J. S. Wallis, Combinatorics: Room Squares, Sum-Free Sets, Hadamard Matrices, Vol. 292 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1972.
[4] J.-G. Dumas, Algorithmes parallèles efficaces pour le calcul formel : algèbre linéaire creuse et extensions algébriques, Ph.D. thesis, Institut National Polytechnique de Grenoble, France, ftp://ftp.imag.fr/pub/Mediatheque.IMAG/theses/2000/Dumas.Jean-Guillaume (Dec. 2000).
[5] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company, Inc., 1994.
[6] J. J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software 16 (1) (1990) 1–17, www.acm.org/pubs/toc/Abstracts/0098-3500/79170.html.
[7] O. H. Ibarra, S. Moran, L. E. Rosier, A note on the parallel complexity of computing the rank of order n matrices, Information Processing Letters 11 (4–5) (1980) 162.
[8] O. H. Ibarra, S. Moran, R. Hui, A generalization of the fast LUP matrix decomposition algorithm and applications, Journal of Algorithms 3 (1) (1982) 45–56.
[9] A. Storjohann, Algorithms for matrix canonical forms, Ph.D. thesis, Department of Computer Science, Swiss Federal Institute of Technology (ETH) Zurich (Dec. 2000).
[10] D. Bini, V. Pan, Polynomial and Matrix Computations, Volume 1: Fundamental Algorithms, Birkhäuser, Boston, 1994.
[11] J.-G. Dumas, J.-L. Roch, A fast parallel block algorithm for exact triangularization of rectangular matrices, in: SPAA'01: Proceedings of the Thirteenth ACM Symposium on Parallel Algorithms and Architectures, Crete, Greece, 2001, pp. 324–325.
[12] A. Björner, V. Welker, Complexes of directed graphs, SIAM Journal on Discrete Mathematics 12 (4) (1999) 413–424. http://epubs.siam.org/sam-bin/dbq/article/33872
[13] V. Reiner, J. Roberts, Minimal resolutions and the homology of matching and chessboard complexes, Journal of Algebraic Combinatorics 11 (2) (2000) 135–154.
[14] J.-G. Dumas, T. Gautier, C. Pernet, Finite field linear algebra subroutines, in: T. Mora (Ed.), Proceedings of the 2002 International Symposium on Symbolic and Algebraic Computation, Lille, France, ACM Press, New York, 2002.
[15] J.-C. Faugère, A new efficient algorithm for computing Gröbner bases (F4), Tech. rep., Laboratoire d'Informatique de Paris 6, http://www-calfor.lip6.fr/~jcf (Jan. 1999).
[16] A. Björner, L. Lovász, S. T. Vrećica, R. T. Živaljević, Chessboard complexes and matching complexes, Journal of the London Mathematical Society 49 (1) (1994) 25–39.
[17] J.-G. Dumas, B. D. Saunders, G. Villard, Integer Smith form via the Valence: experience with large sparse matrices from Homology, in: C. Traverso (Ed.), Proceedings of the 2000 International Symposium on Symbolic and Algebraic Computation, St Andrews, Scotland, ACM Press, New York, 2000, pp. 95–105.
[18] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik 13 (1969) 354–356.
[19] D. Coppersmith, S. Winograd, Matrix multiplication via arithmetic progressions, Journal of Symbolic Computation 9 (3) (1990) 251–280.
[20] A. V. Aho, J. E. Hopcroft, J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974.
[21] K. Mulmuley, A fast parallel algorithm to compute the rank of a matrix, Combinatorica 7 (1) (1987) 101–104.
[22] J. JáJá, Introduction to Parallel Algorithms, Addison-Wesley, New York, 1992.
[23] G. G. H. Cavalheiro, M. Doreille, F. Galilée, J.-L. Roch, Athapascan-1: On-line building data flow graph in a parallel language, in: PACT'98: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Paris, France, 1998.
[24] R. L. Graham, Bounds on certain multiprocessing timing anomalies, SIAM Journal of Applied Mathematics 17 (2) (1969) 416–429.