
1: \section{Triangular system solving with matrix right/left hand side}\label{sec:trsm}

2: %%

3: We now discuss the implementation of solvers for

4: triangular systems with matrix right hand side (or equivalently left

5: hand side).

6: %This is also the simultaneous resolution of $n$ triangular systems.

7: %

8: Solving such systems plays a central role in many linear algebra

9: problems: for example, it is the second main operation, after matrix multiplication,

10: in block Gaussian elimination, as will be recalled in section \ref{sec:triang}. This operation is commonly named

11: \trsm in the BLAS convention. In the following, we will consider

12: without loss of generality the resolution of an upper triangular system with

13: matrix right hand side, i.e. the operation $B \leftarrow U^{-1}B$, where $U$ is

14: $m\times m$ upper triangular and $B$ is $m\times n$.

15:

16:

17: Following the approach of the BLAS numerical routine,

18: our implementation is based on  a block recursive algorithm

19: to reduce the computation to matrix multiplications.

20:

21: As with matrix multiplication, the design of our

22: implementation focuses on delaying the modular reductions as

23: much as possible. As will be shown in section \ref{ssec:trsmdel}, delaying the

24: reductions over the whole resolution makes the coefficients grow quickly.

25: Therefore we also present in section \ref{ssec:trsmdelupdate} another way of

26: delaying these modular reductions.

27: We lastly present how to combine these two techniques within a multi-cascade

28: algorithm.

29:

30:

31: % To be moved to the introduction

32:

33: %% Let us denote by $R(m,k,n)$ the arithmetical cost of a $m \times k$ by $k \times n$ rectangular

34: %% matrix multiplication.

35: %% Now let us suppose that $k \leq m \leq n$, then

36: %% $R(k,m,n)$, $R(m,k,n)$ and $R(m,n,k)$ are all bounded by $\lCeil

37: %% \frac{m n }{k^2} \rCeil \MM(k)$ (see

38: %% e.g. \cite[(2.5)]{Huang:1997:FRM} for more details).

39:

40: %\newpage

41: \subsection{The block recursive algorithm}\label{ssec:rec-trsm}

42:

43: %% The classical idea is to use the divide and conquer approach.

44: %% Here, we consider the upper left triangular case without loss of

45: %% generality, since any combination of

46: %% upper/lower and left/right triangular cases are similar: if $U$ is

47: %% upper triangular, $L$ is lower triangular and $B$ is rectangular,

48: %% we call \ltrsm\ the resolution of $U X = B$, \lltrsm\ that of

49: %% $L X = B$, \urtrsm\ that of $XU=B$ and \lrtrsm\ that of $XL=B$.

50:

51: Algorithm \ref{alg:trsm:rec} recalls the block recursive algorithm.

52:

53: \begin{algorithm}

54: \dontprintsemicolon

55: \caption{\trsm($A,B$)}\label{alg:trsm:rec}

56: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$.}

57: \KwResult{$X \in \Zp^{m \times n}$ such that $AX=B$.}

58: \Begin{

59: \eIf{$m=1$}{

60:  $ X:= A_{1,1}^{-1} \times B$\;

61: }{

62:  \tcc{splitting matrices into two blocks of sizes $\left\lfloor \frac{m}{2}

63:   \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$

64: \[

65: \begin{array}{cccc}

66: A & X & & B \\

67: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&

68: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&

69: = &

70: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }

71: \end{array}

72: \]}

73:

74: $X_2:=$\trsm($A_3,B_2$)\;

75: $B_1:= B_1 - A_2X_2$\;

76: $X_1:=$\trsm($A_1,B_1$)\;

77: }

78: }

79: \end{algorithm}
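The recursion above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation discussed in this paper: matrices are plain lists of rows, every update is reduced modulo $p$ immediately, and the names \texttt{trsm\_rec} and \texttt{matmul\_sub\_mod} are hypothetical. It assumes $m \geq 1$, a prime $p$ and an invertible diagonal.

```python
def matmul_sub_mod(B1, A2, X2, p):
    """Return (B1 - A2 * X2) mod p for matrices stored as lists of rows."""
    inner, n = len(X2), len(X2[0])
    return [[(B1[i][c] - sum(A2[i][k] * X2[k][c] for k in range(inner))) % p
             for c in range(n)]
            for i in range(len(B1))]


def trsm_rec(A, B, p):
    """Solve A X = B over Z_p, with A upper triangular and invertible (m >= 1)."""
    m = len(A)
    if m == 1:
        inv = pow(A[0][0], -1, p)      # modular inverse, needs Python >= 3.8
        return [[(inv * v) % p for v in B[0]]]
    h = m // 2                         # floor(m/2) rows on top, ceil(m/2) below
    A1 = [row[:h] for row in A[:h]]    # top-left triangular block
    A2 = [row[h:] for row in A[:h]]    # top-right rectangular block
    A3 = [row[h:] for row in A[h:]]    # bottom-right triangular block
    X2 = trsm_rec(A3, B[h:], p)
    B1 = matmul_sub_mod(B[:h], A2, X2, p)
    X1 = trsm_rec(A1, B1, p)
    return X1 + X2
```

For example, \texttt{trsm\_rec([[3, 1, 4], [0, 2, 5], [0, 0, 6]], [[1, 2], [3, 4], [5, 6]], 7)} returns a matrix $X$ with $AX \equiv B \pmod 7$.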

80:

81:

82: \begin{lem}\label{lem:trsm}

83: Algorithm \trsm\ is correct and the leading term of its arithmetic

84: complexity over $\Zp$ is

85: $$\TRSM(m,n) =

86: %\left\{ \begin{array}{ccc}

87:     \frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil  \MM(m)

88: %& &    \text{if}~m\leq n \\

89: %\frac{1}{2^{\omega-1}-2} \lCeil\frac{m}{n}\rCeil^2 \MM(n)& &

90: %    \text{if}~m \geq n

91: %\end{array}\right.

92: $$

93: This complexity is

94: %$\min(mn^2,nm^2)$

95: $m^2n$

96: using classic matrix  multiplication.

97: \end{lem}

98:

99: \begin{proof}

100: Extending the previous notation \MM(n), we denote by \MM(m,k,n) the cost of

101: multiplying an $m\times k$ matrix by a $k\times n$ matrix.

102: The cost function $\TRSM(m,n)$ satisfies the following equation:

103: $$\TRSM(m,n)= 2\TRSM(\frac{m}{2},n)+\MM(\frac{m}{2},\frac{m}{2},n).$$

104: Let $t=\log_2(m)$. Although the algorithm works for any $n$, we restrict the

105: complexity analysis to the case where $m \leq n$ for the sake of simplicity.

106: We then have:

107: \begin{eqnarray*}

108: \TRSM(m,n)&=& 2\TRSM(\frac{m}{2},n)+

109: \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)

110: \\&=& 2^t \TRSM(1,n) + \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)

111: \frac{ 1 -

112:   \left(\frac{2}{2^{\omega-1}}\right)^t}{1-\frac{2}{2^{\omega-1}}}.

113: \end{eqnarray*}

114: As $\TRSM(1,n)=2n$ and $\left(2^{\omega-1}\right)^t = m^{\omega-1}$,

115: we obtain the expected complexity

116: $\TRSM(m,n)=\frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil \MM(m) + \GO(m^2+mn).$

117: \end{proof}
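As a quick sanity check of the leading term, consider classical matrix multiplication. We assume here $\omega=3$, $\MM(m)=2m^3$ and, for simplicity, that $m$ divides $n$ (these values are assumptions made only for this illustration). Lemma \ref{lem:trsm} then gives
$$
\TRSM(m,n) = \frac{1}{2^{2}-2}\,\frac{n}{m}\,2m^3 = m^2n,
$$
which agrees with the classical cost stated in the lemma.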

118: %% When $m \geq n$, the trick is to consider two TRSM with the same

119: %% triangular matrix, but of right hand side of size $n/2$.

120: %% $R(\frac{m}{2},\frac{m}{2},n)=2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})$.

121: %% Therefore the inequality $m \geq n$

122: %% is preserved all along the algorithm and the cost is thus

123: %% %

124: %% $\TRSM(m,n)= 4\TRSM(\frac{m}{2},\frac{n}{2})+2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})=

125: %% 4\TRSM(\frac{m}{2},\frac{n}{2})+\frac{1}{2^{\omega-1}} \lCeil \frac{n}{m}

126: %% \rCeil^2 \MM(n)$.

127: %% Thus we have $\TRSM(m,n)=4^t T(1;1) + \frac{1}{2^{\omega-1}}\lCeil\frac{m}{n}\rCeil^2 \MM(m)

128: %% \frac{ 1 -

129: %%   \left(\frac{4}{2^{\omega}}\right)^t}{1-\frac{4}{2^{\omega}}}$.

130: %% This yields the leading term $\frac{2}{2^\omega-4}\lCeil\frac{m}{n}\rCeil^2 \MM(m)$.

131: %

132: %

133: % By counting each operation at one recursive step we have:

134: % \begin{eqnarray}\nonumber

135: % C(m,n)= \sum_{i=1}^{\log m} 2^{i-1} R(\frac{m}{2^i},\frac{m}{2^i},n)

136: % \end{eqnarray}

137: % Now, since $m \leq n$, we get $\forall i \ R(\frac{m}{2^i},\frac{m}{2^i},n)=C_{\omega}\left(\frac{m}{2^i}\right)^{\omega -1}n$ and therefore:

138: % \begin{equation}\nonumber

139: % C(m,n) = \frac{C_{\omega} nm^{\omega -1}}{2}

140: %                         \sum_{i=1}^{\log m} \left(

141: %                         \frac{1}{2^i}\right)^{\omega -2}

142: % \end{equation}

143: % which gives

144: % %\[ C(m,n) \leq \frac{C_{\omega} }{2(2^{\omega-2}-1)}nm^{\omega-1}\]

145: % %Thus, this gives the bound $O(nm^{\omega - 1})$.\\

146: % the $O(nm^{\omega - 1})$ bound of the lemma.

147: %\end{proof}

148: % Without loss of generality for linear algebra applications,

149: % we here consider only the case where the row

150: % dimension, $m$, of the the triangular system is less than or equal to the column dimension, $n$.

151:

152: \subsection{Delaying reductions globally}\label{ssec:trsmdel}

153:

154: As for matrix multiplication, the delayed computation

155: relies on the fact that ring operations over the

156: finite field can be replaced by ring operations over \Z using the ring

157: homomorphisms described in section \ref{ssec:ffperf}.

158: However, triangular system resolutions involve, in the general case, field

159: operations: the divisions by the diagonal elements of the triangular matrix.

160: Therefore this technique is only valid with unit diagonal matrices.

161:

162: In the general case, the triangular matrix is made unit diagonal by the

163: following factorization: $A=DU$, where $D$ is diagonal and $U$ is unit diagonal

164: upper triangular. Then the system $U X = D^{-1}B$ only involves ring operations

165: and can be solved over \Z.

166: This normalization leads to an additional cost of $O(mn)$ arithmetic

167: operations (see \cite{jgd:2004:ffpack} for more details).
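The normalization and the delayed resolution can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, $p$ is prime, and Python integers have arbitrary precision, so the sketch never overflows and does not model the bit capacity issue addressed by the bounds of this section.

```python
def normalize_unit_diagonal(A, B, p):
    """Return (U, D^{-1} B) mod p, where A = D U and U has a unit diagonal."""
    U, Bn = [], []
    for i, (row_a, row_b) in enumerate(zip(A, B)):
        inv = pow(row_a[i], -1, p)     # inverse of the diagonal entry mod p
        U.append([(inv * a) % p for a in row_a])
        Bn.append([(inv * b) % p for b in row_b])
    return U, Bn


def solve_unit_upper_delayed(U, B, p):
    """Back substitution over Z (ring operations only), reduced mod p at the end."""
    m, n = len(U), len(B[0])
    X = [row[:] for row in B]
    for i in range(m - 2, -1, -1):     # the last row needs no update
        for j in range(i + 1, m):
            for c in range(n):
                X[i][c] -= U[i][j] * X[j][c]   # no modular reduction here
    return [[v % p for v in row] for row in X]
```

Calling \texttt{normalize\_unit\_diagonal} first and then \texttt{solve\_unit\_upper\_delayed} on its output yields $X$ with $AX \equiv B \pmod p$, using a single reduction pass on the result.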

168:

169:

170: Integer computation with fixed-size arithmetic (e.g. floating point

171: arithmetic) is exact as long as no intermediate result of the computation

172: exceeds the bit capacity of the representation.

173: We therefore now give bounds on the values computed by the algorithm over \Z.

174:

175: %% One can get a first idea of the coefficient growth by noting

176: %% that the $k$th coefficient $x_k$ of the solution vector of the system $Ax=b$ is a

177: %% linear combination of the $n-k$ following coefficients: $x_i, i\in [k+1\dots

178: %% n]$. Consequently, the size of the largest coefficient grows linearly

179: %% with the dimension of the system.

180: %% We give in theorem \ref{th:trsmbound} a more precise bound on the

181: %% value of the computed coefficients.

182: %% We also give a class of systems for which the bound is reached, which

183: %% proves its optimality.

184:

185: %

186: \begin{thm}  \label{th:trsmbound}

187: Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal upper triangular matrix and $b \in \mathbb{Z}^n$,

188: with $m \leq T_{i,j} \leq M$ and $m \leq b_i \leq M$ and  $m\leq 0\leq M$.

189: Let $x = ( x_i )_{i \in [1 \dots n]} \in \mathbb{Z}^n$ be the solution of the

190: system $Tx=b$.

191: Then $\forall \ k \in [0\dots n-1]$~:

192: \[ \left \{

193: \begin{array}{ll}

194:  -u_k \leq x_{n-k} \leq v_k &  \text{for $k$ even,}\\

195:  -v_k \leq x_{n-k} \leq u_k &  \text{for $k$ odd}

196: \end{array}

197: \right.

198: \]

199: with

200: \[

201: \left \{

202: \begin{array}{l}

203: u_k =\frac{M-m}{2}(M+1)^k  - \frac{M+m}{2}(M-1)^k,\\%[2mm]

204: v_k = \frac{M-m}{2}(M+1)^k  + \frac{M+m}{2}(M-1)^k. \\

205: \end{array}

206: \right.

207: \]

208: \end{thm}

209: %

210: \begin{proof}

211:

212: First note the following relations:

213: $$

214: \forall k \left\{

215: \begin{array}{lcl}

216: u_k &\leq& v_k \\

217: -mu_k &\leq &Mv_k\\

218: -mv_k &\leq &Mu_k\\

219: \end{array}

220: \right.

221: $$

222: The third one comes from

223: $$

224: Mu_k+mv_k=\frac{M^2-m^2}{2}((M+1)^k-(M-1)^k) \geq 0.

225: $$

226: The proof is now an induction on $k$, following the system resolution order.

227: The initial case $k=0$ corresponds to the first step:

228: $x_n=b_n$, leading to

229: $$ -u_0 = m \leq x_n \leq M = v_0.$$

230: Suppose now that the inequalities hold for $k\in [0\dots l]$

231: and prove them for  $k=l+1$.

232: If  $l$ is odd, $l+1$ is even.

233: {\small

234: \begin{eqnarray*}

235: x_{n-l-1}&=& b_{n-l-1} - \sum_{j=n-l}^n{T_{n-l-1,j}x_j}\\

236: &\leq& M + \sum_{i=0}^{\frac{l-1}{2}}{\max(Mu_{2i},-mv_{2i}) + \max(Mv_{2i+1},-mu_{2i+1})}\\

237: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i} + v_{2i+1}}\right)  \\

238: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{M-m}{2}(M+2)(M+1)^{2i} +

239: \frac{M+m}{2}(M-2)(M-1)^{2i}} \right)  \\

240: &\leq& M\left(1 + \frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} +

241: \frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\

242: &\leq&\frac{M-m}{2}(M+1)^{l+1}  + \frac{M+m}{2}(M-1)^{l+1} = v_{l+1}.

243: \end{eqnarray*}

244: Similarly,

245: \begin{eqnarray*}

246: x_{n-l-1}  &\geq& m - \sum_{i=0}^{\frac{l-1}{2}}{\max(Mv_{2i},-mu_{2i}) +

247: \max(Mu_{2i+1},-mv_{2i+1})}\\

248: &\geq& m -  M\sum_{i=0}^{\frac{l-1}{2}}{v_{2i} + u_{2i+1}}\\

249: &\geq& m -M\sum_{i=0}^{\frac{l-1}{2}}{\frac{M-m}{2}(M+2)(M+1)^{2i} -

250: \frac{M+m}{2}(M-2)(M-1)^{2i} }  \\

251: &\geq& m - M\left(\frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} -

252: \frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\

253: &\geq& -\frac{M-m}{2}(M+1)^{l+1}  + \frac{M+m}{2}(M-1)^{l+1} = -u_{l+1}.

254: \end{eqnarray*}

255: }

256: For $l$ even, a similar proof leads to

257: $$

258: -v_{l+1} \leq x_{n-l-1} \leq u_{l+1}.

259: $$

260: \end{proof}

261: %

262: \begin{cor}\label{cor:trsmoptimal}

263:  Using the notation of theorem \ref{th:trsmbound},

264: $$

265:  |x| \leq \frac{M-m}{2}(M+1)^{n-1}  + \frac{M+m}{2}(M-1)^{n-1}.

266: $$

267: Moreover this bound is optimal.

268: \end{cor}

269: %

270: \begin{proof}

271: The sequence $(v_k)$ is increasing and never smaller than $(u_k)$.

272: Thus $\forall \ k \in [0\dots {n-1}] \ |x_{n-k}|\leq \max(u_k,v_k) = v_k \leq v_{n-1}$.

273:

274: Now the vector $x = ( x_i )_{i \in [1\dots n]} \in \mathbb{Z}^n$ such that

275: $ \forall \ k \in [0\dots n-1] \ |x_{n-k}| = v_k$ satisfies the system $Tx=b$ with

276: $$

277: T = \

278: \left[

279: \begin{array}{ccccc}

280: \ddots & \ddots & \ddots & \ddots & \ddots \\

281:        &   1    &  M   &    m   & M  \\

282:        &        &    1   &   M  & m   \\

283:        &        &        &    1   & M \\

284:        &        &        &        & 1   \\

285: \end{array}

286: \right],

287: b = \left[\begin{array}{c}\vdots\\m\\M\\m\\M \end{array} \right]

288: $$

289: Therefore the bound is reached.

290: \end{proof}
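The growth predicted by theorem \ref{th:trsmbound} can be observed numerically on the alternating system used in the proof above. The sketch below is illustrative and its function names are hypothetical. It checks equality $|x_{n-k}| = v_k$ only in the symmetric case $m=-M$; for other ranges it only checks the inequality $|x_{n-k}| \leq v_k$.

```python
def v_bound(k, m, M):
    """v_k from the theorem; the numerator is always even, so // is exact."""
    return ((M - m) * (M + 1) ** k + (M + m) * (M - 1) ** k) // 2


def alternating_system(n, m, M):
    """T and b following the alternating M, m pattern shown in the proof."""
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = 1
        for j in range(i + 1, n):
            T[i][j] = M if (j - i) % 2 == 1 else m
    # b_{n-k} is M for even k and m for odd k (rows are 0-indexed here)
    b = [M if (n - 1 - i) % 2 == 0 else m for i in range(n)]
    return T, b


def back_substitute(T, b):
    """Solve T x = b over Z for a unit diagonal upper triangular T."""
    n = len(b)
    x = list(b)
    for i in range(n - 2, -1, -1):
        x[i] -= sum(T[i][j] * x[j] for j in range(i + 1, n))
    return x
```

With $M=-m=2$ (balanced representation of $\Z_5$) and $n=8$, the computed solution satisfies $|x_{n-k}| = 2\cdot 3^k$ for every $k$.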

291: The following corollaries apply this result to the positive and balanced modular

292: representations.

293:

294: \begin{cor}[Positive modular representation]\label{cor:trsmpositif}

295: For $1 \leq i,j \leq n$, if  $T_{i,j},b_i \in [0\dots p-1]$, then

296: $$

297: |x| \leq \frac{p-1}{2}(p^{n-1}  + (p-1)^{n-1}).

298: $$

299: \end{cor}

300: %

301: \begin{cor}[Balanced modular representation]\label{cor:trsmcentre}

302: For $1 \leq i,j \leq n$, if  $T_{i,j},b_i \in [-\frac{p-1}{2}\dots

303: \frac{p-1}{2}]$, then

304: $$

305: |x| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^{n-1}.

306: $$

307: \end{cor}

308:

309: \begin{rem}

310: The balanced modular representation improves the bound by a factor of $2^{n-1}$.

311: \end{rem}

312:

313: As a consequence, one can solve a unit diagonal triangular system of dimension

314: $n$ using arithmetic operations with integers stored on $\gamma$ bits if

315: \begin{equation}\label{eq:trsmboundpos}

316: \frac{p-1}{2}(p^{n-1}  + (p-1)^{n-1})< 2^{\gamma}

317: \end{equation}

318: for a positive representation and

319: \begin{equation}\label{eq:trsmboundcen}

320: \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{\gamma}

321: \end{equation}

322: for a balanced representation.

323:

324: For instance, using the  \dbl floating point representation ($53$ bits of

325: mantissa)

326: the maximal dimension of the system is $34$ (resp. $52$) for a positive

327: (resp. balanced) representation of $\Z_3$.

328: For larger fields, this maximal dimension quickly becomes very small: with

329: $p=1001$, $n\leq 5$ (resp. $n\leq 6$) for a positive (resp. balanced)

330: representation.
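The maximal dimension can be computed directly from the two inequalities (\ref{eq:trsmboundpos}) and (\ref{eq:trsmboundcen}) as they are displayed above. The following sketch uses exact integer arithmetic; the function name is hypothetical and $p$ is assumed to be an odd prime.

```python
def max_delayed_dimension(p, gamma, balanced=False):
    """Largest n satisfying the displayed inequality, for an odd prime p."""
    limit = 2 ** (gamma + 1)           # both sides are multiplied by 2

    def fits(n):
        if balanced:
            # (p-1)/2 * ((p+1)/2)^n < 2^gamma
            return (p - 1) * ((p + 1) // 2) ** n < limit
        # (p-1)/2 * (p^(n-1) + (p-1)^(n-1)) < 2^gamma
        return (p - 1) * (p ** (n - 1) + (p - 1) ** (n - 1)) < limit

    n = 0
    while fits(n + 1):
        n += 1
    return n
```

With $\gamma=53$, this sketch gives $34$ and $52$ for the positive and balanced representations of $\Z_3$, and $5$ for the positive representation with $p=1001$.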

331:

332: In the following, we will denote by $t_\text{del}(p,\gamma)$ the maximum

333: dimension for the resolution with delayed modular reductions.

334: This dimension is small, and this approach can therefore only be used

335: as a terminal case of the recursive block algorithm.

336: This first cascade algorithm is characterized by the threshold

337: $t_\text{del}$.

338: For efficiency, we used in our implementation the BLAS routine \trsm to perform

339: the delayed computation over \Z.

340: Despite the small dimension of the blocks, we will see in section

341: \ref{ssec:trsmexp} that this approach can slightly improve the efficiency of the

342: computation when the finite field is small.

343:

344: \subsection{Delaying reductions in the update phase only} \label{ssec:trsmdelupdate}

345: %

346: The block recursive algorithm consists of several matrix multiplications of

347: different dimensions. In most cases, each matrix multiplication is done over \Z

348: with a modular reduction of the result only. But some of these result matrices

349: are later accumulated into other matrix multiplications.

350: These intermediate modular reductions can therefore be delayed even further by

351: accumulating the results over \Z for as long as possible.

352:

353: This technique can be applied within the former cascade algorithm, to produce a

354: double cascade structure. The key idea is to split the matrices at two levels, as

355: shown in figure \ref{fig:trsm:recblasdelayed}:

356: %

357: \begin{figure}[htbp]\begin{center}

358: \includegraphics[width=0.8\textwidth]{trsm_cascade.eps}

359: \caption{Splitting for the double cascade \trsm algorithm}

360: \label{fig:trsm:recblasdelayed}

361: \end{center}\end{figure}

362: %

363: a fine grain

364: splitting with the dimension $t_\text{del}$ of the previous section, and a

365: coarse grain splitting with the dimension $t_\text{update}$ such that

366: all recursive calls of dimension lower than $t_\text{update}$ can let the

367:  matrix multiplication updates accumulate without modular reductions.

368: Choosing $t_\text{update} = k_\text{Winograd}$ (from corollary \ref{cor:winokmax})

369: will ensure this property.

370: To adjust together the dimensions of the two block decompositions, we set

371: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right

372: \rfloor t_\text{del}$.

373: %

374: \begin{algorithm}

375: \dontprintsemicolon

376: \caption{\texttt{trsm-rec-BLAS-delayed}~:}

377: \label{alg:trsm:recblasdelayed}

378: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$}

379: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}

380: \Begin{

381: Compute $t_\text{del}$ from equation (\ref{eq:trsmboundpos} or \ref{eq:trsmboundcen}) \;

382: Compute $t_\text{Winograd}$ from corollary (\ref{cor:winokmax}) \;

383: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right

384: \rfloor t_\text{del}$\;

385: \ForEach{block column of $A$ of dimension $m\times t_\text{split}$ of the form

386: $\begin{bmatrix}V_i\\U_i\\0\end{bmatrix}$}{

387: $X_i = \texttt{trsm-partial-delayed} (U_i,B_i)$ \;

388: $X_i = X_i \mod p$\;

389: $B_{1\dots i-1} = B_{1\dots i-1} - V_i X_i$\;

390: $B_{1\dots i-1} = B_{1\dots i-1} \mod p$\;

391: }

392: \Return{$X$}

393: }

394: \end{algorithm}

395: %

396: \begin{algorithm}

397: \dontprintsemicolon

398: \caption{\texttt{trsm-partial-delayed}}

399: \label{alg:trsmdelaye}

400: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$, $m$ must be lower

401:   than  $t_\text{update}$}

402: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}

403: \Begin{

404: \eIf{ $m\leq t_\text{del}$}{

405: $B=B \mod p$\;

406: $X = \texttt{dtrsm}(A,B)$ \tcc*{the BLAS routine}\;

407: $X=X \mod p$\;

408: }{

409: \tcc{ (splitting of the matrix into blocks of dimension $\left\lfloor \frac{m}{2}

410:  \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$) }\;

411: $

412: \begin{array}{cccc}

413: A & X & & B \\

414: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&

415: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&

416: = &

417: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }

418: \end{array}

419: $\;

420: $X_2:=\texttt{trsm-partial-delayed} (A_3,B_2)$ \;

421: $B_1:= B_1 - A_2X_2$  \tcc*{without modular reduction} \;

422: $X_1:=\texttt{trsm-partial-delayed} (A_1,B_1)$ \;

423: }

424: \Return $X$

425: }

426: \end{algorithm}

427: %

428:

429: Algorithm \ref{alg:trsm:recblasdelayed} loops over the block columns of dimension

430: $t_\text{split}$. For each of them, the triangular system is solved using algorithm

431: \ref{alg:trsmdelaye} and the update is performed by a matrix multiplication over

432: \Z followed by a modular reduction.

433: Algorithm \ref{alg:trsmdelaye} is simply the cascade algorithm of the previous

434: section: the block recursive algorithm \ref{alg:trsm:rec} with the fully delayed

435: algorithm as a terminal case.

436: The matrix multiplication updates are performed over \Z without any reduction of

437: the result, since the threshold $t_\text{update}$ allows them to be accumulated.
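The two levels of the double cascade can be sketched as follows. This is a simplified illustration, not the implementation benchmarked in this paper: the function names are hypothetical, the triangular matrix is assumed to be unit diagonal, the thresholds \texttt{t\_del} and \texttt{t\_split} are supplied by the caller, and plain Python integers stand in for the floating point BLAS calls, so no overflow can occur here.

```python
def solve_unit_upper_over_Z(U, B):
    """Back substitution over Z; stands in for the BLAS dtrsm call."""
    m, n = len(U), len(B[0])
    X = [row[:] for row in B]
    for i in range(m - 2, -1, -1):
        for j in range(i + 1, m):
            for c in range(n):
                X[i][c] -= U[i][j] * X[j][c]
    return X


def trsm_partial_delayed(U, B, p, t_del):
    """Recursive solve; updates accumulate over Z without modular reduction."""
    m, n = len(U), len(B[0])
    if m <= t_del:
        B = [[v % p for v in row] for row in B]
        X = solve_unit_upper_over_Z(U, B)
        return [[v % p for v in row] for row in X]
    h = m // 2
    U1 = [row[:h] for row in U[:h]]
    U2 = [row[h:] for row in U[:h]]
    U3 = [row[h:] for row in U[h:]]
    X2 = trsm_partial_delayed(U3, B[h:], p, t_del)
    B1 = [[B[i][c] - sum(U2[i][k] * X2[k][c] for k in range(m - h))
           for c in range(n)] for i in range(h)]       # no reduction here
    X1 = trsm_partial_delayed(U1, B1, p, t_del)
    return X1 + X2


def trsm_rec_blas_delayed(U, B, p, t_del, t_split):
    """Outer loop over diagonal blocks of size t_split, from last to first."""
    m, n = len(U), len(B[0])
    B = [row[:] for row in B]
    X = [None] * m
    hi = m
    while hi > 0:
        lo = max(0, hi - t_split)
        Ui = [row[lo:hi] for row in U[lo:hi]]
        Xi = trsm_partial_delayed(Ui, B[lo:hi], p, t_del)
        X[lo:hi] = Xi
        for r in range(lo):                            # update the rows above
            for c in range(n):
                acc = sum(U[r][k] * Xi[k - lo][c] for k in range(lo, hi))
                B[r][c] = (B[r][c] - acc) % p          # reduce once per block
        hi = lo
    return X
```

In this sketch the only modular reductions are those at the terminal blocks of size at most \texttt{t\_del} and one per coarse block after its update, which mirrors the structure of algorithms \ref{alg:trsm:recblasdelayed} and \ref{alg:trsmdelaye}.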

438:

439: %So the only modular reductions are

440: %performed after the call to \trsm.

441:

442:

443: \subsection{Experiments}

444: \label{ssec:trsmexp}

445: %\subsubsection{Comparison of the variants}

446:

447: We now compare three implementations of the \trsm routine over a word size finite

448: field:

449: \begin{description}

450: \item[Pure recursive (\texttt{Pure-Rec}):] simply algorithm \ref{alg:trsm:rec}.

451: \item[Recursive-BLAS (\texttt{Rec-BLAS}):] the cascade algorithm formed by

452:   the recursive algorithm and the BLAS routine \dtrsm as a terminal case. It

453:   differs from algorithm \ref{alg:trsmdelaye} in that the matrix

454:   multiplication $B_1:= B_1 - A_2X_2$  is always followed by a modular reduction.

455: \item[Recursive-BLAS-Delayed (\texttt{Rec-BLAS-Delayed}):] algorithm \ref{alg:trsm:recblasdelayed}.

456: \end{description}

457:

458: We compare these three variants over finite fields with different cardinalities,

459: so as to make the parameters $t_\text{del}$ and $t_\text{update}$ vary as in

460: the following table:

461:

462: \begin{center}

463: \begin{tabular}{|c||c|c|c|}

464: \hline

465: $p$ & $\lceil \log_2 p\rceil$ & $t_\text{del}$ & $t_\text{update}$\\

466: \hline

467: 5   &   3        &   23            &  2\,147\,483\,642\\

468: 1\,048\,583 & 20     &   2             & 8190 \\

469: 8\,388\,617 & 23     &   2             & 126 \\

470: \hline

471: \end{tabular}

472: \end{center}

473:

474: \begin{figure}[htbp]\begin{center}

475: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_5en_goto.eps}

476: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_1048583en_goto.eps}

477: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_8388617en_goto.eps}

478: \caption{Comparison of the \trsm variants for $p=5,1\,048\,583,8\,388\,617$,

479:  on a Pentium 4, 3.2\,GHz, 1\,GB of RAM}

480: \label{fig:trsm:compvar}

481: \end{center}\end{figure}

482:

483: In the experiments of figure \ref{fig:trsm:compvar}, the matrix $B$ is

484: square ($m=n$).

485: %

486: One can first notice the gain provided by the use of the first cascade with the

487: delayed \dtrsm routine by comparing the curves \texttt{rec-BLAS} and

488: \texttt{pure-rec} for $p=5$. This advantage shrinks when the characteristic gets larger,

489:  since $t_\text{del}=2$ for $p=1\,048\,583$ or $p=8\,388\,617$.

490:

491: Introducing the coarse grain splitting, which delays the reductions in

492: the update phase, improves the computation speed by up to 500 Mfops.

493: This gain is similar for $p=5$ and  $p=1\,048\,583$ since in both

494: cases  $n<t_\text{update}$ and there is therefore no modular reduction between the

495: matrix multiplications.

496:

497: Lastly, for $p=8\,388\,617$, the speed drops since more reductions are required.

498:  The variants \texttt{pure-rec} and \texttt{rec-BLAS} are penalized by their

499: dichotomic splitting, creating too many modular reductions after each matrix

500: multiplication. Now \texttt{rec-BLAS-delayed} has the best efficiency since the double

501: cascade structure minimizes the number of reductions.

502:

503:

504:

505: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

506:

507: %% Matrix multiplication speed over finite fields was improved

508: %% %in~\cite{jgd:2002:fflas,Pernet:2001:Winograd}

509: %% by the use of the

510: %% numerical BLAS library:

511: %% matrices are converted to floating point representations

512: %% (where the linear algebra routines are fast) and converted back to a finite

513: %% field representation afterwards.

514: %% %The computations remained exact as

515: %% %long as no overflow occurred.

516: %% An implementation of \trsm\ can use the same

517: %% techniques. Indeed, as soon as no overflow occurs one can replace the

518: %% recursive call to \trsm\ by the numerical BLAS {\it dtrsm}

519: %% routine. But one can remark that approximate divisions can occur.

520: %% So we need to ensure both that only exact divisions are performed and that no overflow appears.

521: %% Not only one has to be careful for the result to remain within

522: %% acceptable bounds, but, unlike matrix multiplication where data grows

523: %% linearly, data involved in linear system grows exponentially as shown

524: %% in the following.\\

525: %% %

526: %% The next two subsections first show how to deal with

527: %% divisions, and then give an optimal theoretical bound on the

528: %% coefficient growth and therefore an optimal threshold for

529: %% the switch to the numerical call.

530:

531: %% \subsubsection{Dealing with divisions}

532:

533:

534: %% \subsection{A theoretical threshold}

535: %% We want to use the BLAS trsm routine to solve triangular systems over the

536: %% integers (stored as {\tt double} for {\tt dtrsm} or {\tt float} for {\tt

537: %%   strsm}).

538: %% The restriction is then the coefficient growth in the solution.

539: %% Indeed, the $k^{th}$ value in the solution vector is a linear combination of the

540: %% $(n-k)$  already computed next values.

541: %% This implies a linear growth in the coefficient size of the solution, with

542: %% respect to the system dimension.

543: %% Now this resolution can only be performed if every element of the solution can

544: %% be stored in the mantissa of the floating point representation

545: %% (e.g. $53$ bits for {\tt double }).

546: %% Therefore overflow control consists in finding the largest block dimension

547: %% $\beta$, such that the result of the call to BLAS trsm routine will remain

548: %% exact.

549:

550: %% We now propose a bound for the values of the solutions of such a system; this

551: %% bound is optimal (in the sense that there exists a worst

552: %% case matching the bound when $n=2^i \beta$).

553: %% This enables the implementation of a cascading algorithm, starting recursively

554: %% and taking advantage of the BLAS performances as soon as possible.

555: %% %

556: %% %Let us introduce the two following series:

557: %% %\[

558: %% %\left \{

559: %% %\begin{array}{l}

560: %% %u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]\\[2mm]

561: %% %v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]\\

562: %% %\end{array}

563: %% %\right. ~~~ for~ an~ integer~ p>2

564: %% %\]

565: %% %

566: %% \begin{thm}  \label{THEO:TRSMBOUND}

567: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal

568: %% upper triangular matrix, and $b \in \mathbb{Z}^n$,

569: %% with $0 \leq T \leq p-1$ and $0 \leq b \leq p-1$.

570: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of

571: %% $T.X=b$ over the integers.

572: %% Then, $\forall \ k \in [0..n-1]$:

573: %% \[ \left \{

574: %% \begin{array}{ll}

575: %%  (p-2)^k-p^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k + (p-2)^k & \mbox{if $k$ is even}\\

576: %%  -p^k-(p-2)^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k - (p-2)^k & \mbox{if $k$ is odd}

577: %% \end{array}

578: %% \right.

579: %% \]

580: %% \end{thm}

581: %% %

582: %% \begin{proof}

583: %% The idea is to use an induction on $k$ with the relation

584: %% $ x_k = b_k - \sum_{i=k+1}^{n}{T_{k,i}x_i}$.

585: %% A lower and an upper bound for $x_{n-k}$ are computed, depending

586: %% whether $k$ is even or odd:

587: %% Let us  define the following induction hypothesis $IH_l$:

588: %% \[

589: %% \forall \ k \in [0..l]

590: %% \left \{

591: %% \begin{array}{ll}

592: %%  -u_k \leq x_{n-k} \leq v_k & \mbox{if $k$ is even}\\

593: %%  -v_k \leq x_{n-k} \leq u_k & \mbox{if $k$ is odd}\\

594:

595: %% \end{array}

596: %% \right.

597: %% \]

598: %% % Let us define the induction hypothesis $IH_l$ to be that the equations

599: %% % (\ref{eq:bound}) hold for $k \in [0..l-1]$ .

600: %% %

601: %% When $l=0$,  $x_n=b_n$ which implies that

602: %% $ -u_0 = 0 \leq x_n \leq p-1 = v_0$. Thus $IH_0$ is proven.

603: %% %

604: %% Let us suppose that $  IH_l$ is true, and

605: %%   prove $IH_{l+1}$. There are two cases: either $l$ is odd or not !

606: %% If $l$ is odd, $l+1$ is even. Now,

607: %% by induction,

608: %% \begin{small}

609: %% \begin{eqnarray*}

610: %% x_{n-l-1}&\leq& (p-1) \left( 1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i}+v_{2i+1}}\right) \\

611: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}-(p-2)^{2i} + p^{2i+1}+(p-2)^{2i+1}\right] } \\

612: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}(p+1) + (p-2)^{2i}(p-3) \right] } \\

613: %% &\leq & p-1 + \frac{(p-1)^2}{2}

614: %%     \left[(p+1)\frac{p^{l+1}-1}{p^2-1} + (p-3)\frac{(p-2)^{l+1}-1}{(p-2)^2-1} \right] \\

615: %% &\leq & \frac{p-1}{2} \left[ p^{l+1} + (p-2)^{l+1}\right] =  v_{l+1}\\

616: %% \end{eqnarray*}

617: %% \end{small}

618: %% Similarly,

619: %% \begin{small}

620: %% \begin{eqnarray*}

621: %% x_{n-l-1}  &\geq& -(p-1) \sum_{i=0}^{\frac{l-1}{2}}{v_{2i}+u_{2i+1}}\\

622: %%  &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[

623: %%           p^{2i} + (p-2)^{2i} + p^{2i+1} - (p-2)^{2i+1}\right]} \\

624: %%  &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[

625: %%           p^{2i}(p+1) - (p-2)^{2i}(p-3) \right]} \\

626: %%  &\geq & -\frac{p-1}{2} \left[  p^{l+1} - (p-2)^{l+1} \right] =  u_{l+1}\\

627: %% \end{eqnarray*}

628: %% \end{small}

629: %% Finally, If $l$ is  even, a similar proof leads to $-v_{l+1}\leq x_{n-l+1} \leq u_{l+1} $.

630: %% \end{proof}

631: %% %

632: %% \begin{cor}\label{cor:TRSMBOUND}

633: %% $

634: %%  |X| \leq \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right]

635: %% $.\\

636: %% Moreover, this bound is optimal.

637: %% \end{cor}

638: %% %

639: %% \begin{proof} We denote by

640: %% $u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]$

641: %% and $v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]$ the bounds of the

642: %% theorem \ref{THEO:TRSMBOUND}. Now $ \forall \ k \in [0..{n-1}] \ u_k \leq v_k \leq v_{n-1}$.

643: %% Therefore the theorem \ref{THEO:TRSMBOUND} gives $\forall \  k \in [1..n] \ x_k \leq v_{n-1} \leq

644: %%   \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right] $

645: %% \[

646: %% \text{Let}~T = \

647: %% \left[

648: %% \begin{array}{ccccc}

649: %% \ddots & \ddots & \ddots & \ddots & \ddots \\

650: %%        &   1    &  p-1   &    0   & p-1  \\

651: %%        &        &    1   &   p-1  & 0   \\

652: %%        &        &        &    1   & p-1 \\

653: %%        &        &        &        & 1   \\

654: %% \end{array}

655: %% \right],

656: %% b = \left[\begin{array}{c}\vdots\\0\\p-1\\0\\p-1 \end{array} \right]

657: %% \]

658: %%  Then the solution $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ of

659: %%  the system $T.X = b$ satisfies $ \forall \ k \in [0..n-1] \ |x_{n-k}| = v_k$

660: %% \end{proof}

661: %% One can derive the same kind of bound for the centered representation,

662: %% but with an $2^n$ gain.

663: %% \begin{thm}

664: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal

665: %% upper triangular matrix, and $b \in \mathbb{Z}^n$,

666: %% with $\left| T \right| \leq \frac{p-1}{2}$ and $\left| b \right|\leq  \frac{p-1}{2}$.

667: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of

668: %% $T.X=b$ over the integers. Then

669: %% $

670: %%  |X| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n

671: %% $.\\

672: %% Moreover, this bound is optimal.

673: %% \end{thm}

674: %% \begin{proof} The proof is simpler than that of theorem

675: %%   \ref{THEO:TRSMBOUND}, since the inequations are symmetric.

676: %% Therefore $u_n=v_n$ and the induction yields

677: %% $$u_n=\frac{p-1}{2}\left(1+\sum_{i=0}^{n-1}u_i\right)=\frac{p-1}{2}\left(1+\frac{p-1}{2}\frac{

678: %%   \left(\frac{p+1}{2}\right)^n-1}{\frac{p+1}{2}-1}\right) = \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n.$$

679: %% \end{proof}

680:

681: %% Thus, for a given $p$, the dimension $n$ of the system must satisfy

682: %% \begin{equation}

683: %% \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{m}

684: %% \end{equation}

685: %% where $m$ is the size of the mantissa

686: %% so that the resolution over the integers using the BLAS trsm routine

687: %% is exact. For instance, with a 53 bits mantissa,

688: %% this gives quite small matrices,

%% namely at most $92 \times 92$ for $p=2$, at most $4\times 4$ for $p
%% \leq 3089$, and at most $p=416107$ for $2\times 2$ matrices.

692: %% Nevertheless, this technique is speed-worthy in most cases as shown in

693: %% section \ref{ssec:trsmexp}.

694: %Indeed, this test can easily be performed in the

695: %recursive {\tt trsm} routine to determine whether the dimension of the

696: %system is small enough to make use of the BLAS trsm routine.

697:

698: %\input{delayed}

699:

700: %% \subsection{``Trsm'' implementations behavior}\label{ssec:trsmexp}

701: %% As shown in section \ref{ssec:rec-trsm} the block recursive algorithm {\trsm} is based on matrix multiplications.

702: %% This allows us to use our fast matrix multiplication routine.

703: %% %  of the

704: %% % FFLAS package \cite{jgd:2002:fflas}. This is an exact wrapping of the

705: %% % ATLAS

706: %% % library\footnote{\scriptsize\texttt{http://math-atlas.sourceforge.net}\cite{Whaley:2001:AEO}}

707: %% %  used as a kernel to implement

708: %% % the {\trsm} variants.

709: %% The following table

710: %% %derives from experimental results of

711: %% %\cite{jgd:2004:ffpack} and

712: %% expresses which of the two preceding

713: %% variants is better:

714:

715: %% \newcommand{\blastrsm}{{{\tt BLASTrsm}}}

716: %% \newcommand{\deltrsm}{{{\tt DelayTrsm}}}

717: %% {\blastrsm} is the hybric numeric/finite field implementation of section \ref{ssec:trsm-blas} and

718: %% {\deltrsm} is the delayed division implementation of section

719: %% \ref{ssec:recdelay}.

720: %% {\tt Zpz-double} is a field representation

721: %% from \cite{jgd:2004:dotprod} where the  elements are stored as

722: %% floating points to avoid one of the conversions. {\tt Zpz-int}

723: %% is a field representation from \cite{jgd:2005:givaro} where the

724: %% elements are stored as small integers.

725:

726: %% \begin{table}[htbp]\begin{center}

727: %% \small

728: %% \begin{tabular}{|c||r|r|r|r|r|r|}

729: %% \hline

730: %% $n$                      & {\em 400}        & {\em 1000}        & {\em 2000}       & {\em 5000} \\

731: %% \hline

732: %% {\tt Zpz-double(5)}      & \blastrsm        & \blastrsm        & \blastrsm        & \blastrsm \\

733: %% {\tt Zpz-double(32749)}  & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \blastrsm        & \blastrsm  \\

734: %% {\tt Zpz-int(5)}         & \deltrsm$_{100}$ & \deltrsm$_{100}$ & \blastrsm        & \blastrsm  \\

735: %% {\tt Zpz-int(32749)}     & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \deltrsm$_{50}$  \\

736: %% %%$n$          & {\em 400}  & {\em 700}  & {\em 1000} & {\em 2000} &

737: %% %%{\em 5000} \\

738: %% %% \hline

739: %% %% {\tt Mod<double>(5)} & \blastrsm & \blastrsm & \blastrsm & \blastrsm & \blastrsm \\

740: %% %% {\tt Mod<double>(32749)} & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \blastrsm & \blastrsm \\

741: %% %% {\tt G-Zpz(5)}  & \deltrsm$_{100}$ & \deltrsm$_{150}$ & \deltrsm$_{100}$ & \blastrsm & \blastrsm  \\

742: %% %% {\tt G-Zpz(32749)}& \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$& \deltrsm$_{50}$ & \deltrsm$_{50}$ \\

743: %% \hline

744: %% \end{tabular}

745: %% \caption{Best variant for \trsm\ on a P4,

746: %%   2.4GHz}\label{tab:trsmmodular}

747: %% \end{center}

748: %% \vspace{-1em}

749: %% \end{table}

750:

751: %% To summarize, one would rather use {\tt ZpZ-double} representation and  ``blas'' {\trsm} variant in most cases.

752: %% However, when the base field is already specified ``delayed{\large$_t$}''  could provide slightly better performances.

753: %% This requires a search for optimal thresholds which again could be done through an Automated Empirical Optimizations of Software \cite{Whaley:2001:AEO}.

754:

755:

756: %\subsection{Performances and comparison with numerical routines}

757:

We now compare this implementation with the equivalent numerical BLAS routine, \dtrsm.

759: %In the previous section we showed that {\trsm} optimized variant based on numerical solving allows us to achieve the best performances.

760: %In this section we compare these performances with pure numerical solving and with matrix multiplication.

761: %In order to achieve the best performances  we use as much as possible fast matrix multiplication of section \ref{ssec:winograd}.

762: %For this purpose we use an experimental switching threshold to classic multiplication since table \ref{tab:winolevel} reflects only theoretical behavior.

As for matrix multiplication in section \ref{ssec:fgemm-perf}, we compare the routines using
two different BLAS implementations (i.e. ATLAS and GOTO) on
two different architectures. However, we do not report the ATLAS \dtrsm
timings on the Xeon architecture, because ATLAS \dtrsm was surprisingly
inefficient in our tests.
In the following, \ftrsm denotes the \trsm routine over a $16$-bit prime field (i.e. $\Z_{65521}$)
using the \texttt{ZpZ-double} implementation.

770:

771:

772:

773:

774: %\begin{figure}[hbtp]

775: %\includegraphics[width=8cm,angle=-90]{timing-trsm-p4}

776: %\caption{Timing comparison for matrix multiplication (exact and numeric) on a P4, 3.4GHz}

777: %\end{figure}

778:

779:

780:

781: \begin{table}[htbp]\begin{center}

\begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}

783: \cline{3-11}

784: \multicolumn{2}{c|}{} & $n$     %& {\em 500}

785: & {\em 1000}  & {\em 2000}  & {\em 3000}  & {\em 5000} & {\em 7000}  & {\em 8000} & {\em 9000} & {\em 10000} \\

786: \cline{3-11}

787: \multicolumn{11}{c}{}\\[-0.1cm]

788: \hline

789: &ATLAS & ftrsm & $0.37$s    &  $1.93$s    & $5.73$s    &   $23.63$s &  $62.50$s   & $91.67$s  & $121.84$s  & $166.74$s \\

790: %\cline{3-11}

791: %& & dtrsm & $$s    &  $$s    & $$s   &   $$s &  $$s   & $$s  & $$s  & $$s \\

792: %\cline{3-11} \\[-.3cm]

793: %\cline{3-11}

794: %& \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{fgemm}{dgemm}$}  & \bf   & \bf  & \bf   &  \bf  &  \bf  & \bf  & \bf  & \bf  \\

795: \hline

796: \multicolumn{11}{c}{}\\[-0.2cm]

797: \hline

798:

799: & & ftrsm  %& $0.059$s

800: &  $0.25$s    & $1.66$s    &  $5.08$s &

801: $21.47$s   & $55.95$s  & $80.77$s  & $111.57$s & $150.81$s \\

802: \cline{3-11}

803: & & dtrsm % & $0.023$s

804: & $0.17$s    &  $1.35$s    & $4.50$s   &

805: $20.64$s &   $56.19$s   & $83.85$s  & $119.18$s & $163.33$s \\

806: \cline{3-11} \\[-.3cm]

807: \cline{3-11}

808:  & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf

809:    $\frac{ftrsm}{dtrsm}$}  %& {\bf 2.57}

810: & \bf 1.47 & \bf 1.23  &  \bf 1.13 &  \bf 1.04 & \bf 1.00 & \bf 0.96 & \bf 0.94 & \bf 0.92 \\

811: \hline

812:

813: \end{tabular}

\caption{Timings of the triangular solver with matrix right hand side on a Xeon,
  3.6GHz}\label{tab:trsm-p4}

816: %\end{center}

817: %\end{table}

818:

819: %\begin{figure}[hbtp]

820: %\includegraphics[width=8cm,angle=-90]{timing-trsm-itanium2}

821: %\caption{Timing comparison for triangular system solving with matrix hand side (exact and numeric) on Itanium2-1.3GHz}

822: %\end{figure}

823:

824: %\begin{table}[htbp]\begin{center}

825: \begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}

826: \multicolumn{11}{c}{}\\

827: \cline{3-11}

828: \multicolumn{2}{c|}{} & $n$          & {\em 1000}  & {\em 2000}  & {\em 3000}  & {\em 5000} & {\em 7000}  & {\em 8000} & {\em 9000} & {\em 10000}\\

829: \cline{3-11}

830: \multicolumn{11}{c}{}\\[-0.2cm]

831: \hline

832: & & ftrsm & $0.34$s & $2.28$s & $7.11$s & $30.26$s & $77.43$s & $112.01$s & $158.00$s & $214.31$s  \\

833: \cline{3-11}

834: & & dtrsm & $0.26$s & $1.95$s & $6.37$s & $28.60$s & $76.44$s & $113.78$s & $161.19$s & $219.31$s  \\

835: \cline{3-11} \\[-.3cm]

836: \cline{3-11}

837: & \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$}  & \bf 1.31 & \bf 1.17 & \bf 1.12 & \bf 1.06 & \bf 1.01 & \bf 0.98 & \bf 0.98 & \bf 0.98 \\

838: \hline

839: \multicolumn{11}{c}{}\\[-0.2cm]

840: \hline

841:

842: & & ftrsm  & $0.30$s & $2.00$s & $6.23$s & $26.67$s & $68.22$s & $104.32$s & $137.96$s & $192.37$s  \\

843: \cline{3-11}

844: & & dtrsm  & $0.21$s & $1.61$s & $5.36$s & $24.59$s & $67.35$s & $100.42$s & $142.43$s & $195.79$s  \\

845: \cline{3-11} \\[-.3cm]

846: \cline{3-11}

847:  & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$}  & \bf 1.43 & \bf 1.24 & \bf 1.16 & \bf 1.08 & \bf 1.01 & \bf 1.04 & \bf 0.97 & \bf 0.98 \\

848: \hline

849:

850: \end{tabular}

\caption{Timings of the triangular solver with matrix right hand side on an Itanium2, 1.3GHz}\label{tab:trsm-ia64}

852: \end{center}

853: \end{table}

854:

Tables \ref{tab:trsm-p4} and \ref{tab:trsm-ia64} show that our
implementation of exact {\trsm} solving comes close to numerical
performance: the overhead is at most $47\%$ for $n=1\,000$ and at most
$8\%$ for $n \geq 5\,000$.

858: %In particular, ``ftrsm'' performances tend to catch up with BLAS ones as soon as the dimensions of matrices increase.

Moreover, on our Xeon architecture with GOTO BLAS, we even
outperform numerical solving for matrices
of dimension greater than $7\,000$.

862:

863:

864: \begin{figure}[hbtp]

865: \begin{center}

866: \includegraphics[width=0.55\textwidth,angle=-90]{graph-BEST-trsm}

867: \end{center}

\caption{Comparing triangular system solving with matrix
  multiplication on a Xeon,
  3.6GHz}\label{fig:trsm-ratio}

871: \end{figure}

872:

The good performance of our implementation is mostly due to
the efficient reduction to fast matrix multiplication and to the double
cascade structure.
Figure~\ref{fig:trsm-ratio} shows the ratio of the computation time of
our \trsm routine to that of the matrix multiplication routine.

878: %One can see from this figure that our experimental ratio converges to the theoretical one.

879: %In particular, the theoretical ratio is slightly more than $\frac{1}{2}$ since fast matrix multiplication algorithm is used.

According to lemma \ref{lem:trsm}, this ratio is $1/2$ with $\omega=3$
and $2/3$ with $\omega = \log_2 7$.
In practice, our implementation performs only a few recursive levels of
Winograd's algorithm, and the ratio appears to lie between $1/2$ and $2/3$ as
soon as the dimension is large enough, which shows the efficiency of the reduction to
matrix multiplication.
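
The two theoretical values of the ratio can be recovered by a short computation.
The following is only a sketch, under assumptions that are not restated in this
section: the block recursive algorithm splits $U$ into four
$\frac{m}{2}\times\frac{m}{2}$ blocks, a square matrix multiplication of
dimension $k$ costs $C_\omega k^\omega$, $\frac{m}{2}$ divides $n$, and the cost
of \trsm has the form $T(m,n) = c\, m^{\omega-1} n$. The names $T$, $c$ and
$C_\omega$ are introduced here for illustration only.
One recursion step performs two half-size solves and one
$\frac{m}{2}\times\frac{m}{2}$ by $\frac{m}{2}\times n$ product, the latter
being computed as $\frac{2n}{m}$ square products:
\[
T(m,n) = 2\,T\left(\frac{m}{2},n\right)
       + \frac{2n}{m}\, C_\omega \left(\frac{m}{2}\right)^{\omega}
       = 2\,T\left(\frac{m}{2},n\right)
       + C_\omega \left(\frac{m}{2}\right)^{\omega-1} n .
\]
Substituting $T(m,n) = c\, m^{\omega-1} n$ and dividing by
$\left(\frac{m}{2}\right)^{\omega-1} n$ gives
$c\, 2^{\omega-1} = 2c + C_\omega$, hence
\[
\frac{c}{C_\omega} = \frac{1}{2^{\omega-1}-2},
\]
which equals $\frac{1}{4-2} = \frac{1}{2}$ for $\omega = 3$ and
$\frac{1}{7/2-2} = \frac{2}{3}$ for $\omega = \log_2 7$.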

886:

887:

888: %

889: %

890: %

891: %\subsubsection{Recursive with delayed modulus}

892: %\subsubsubsection{Threshold}

893: %\subsubsubsection{Dot product}

894: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

895: %%% Local Variables:

896: %%% mode: latex

897: %%% TeX-master: "../dlaff.tex"

898: %%% End:

899: