1: \section{Triangular system solving with matrix right/left hand side}\label{sec:trsm}
2: %%
3: We now discuss the implementation of solvers for
4: triangular systems with matrix right hand side (or equivalently left
5: hand side).
6: %This is also the simultaneous resolution of $n$ triangular systems.
7: %
The resolution of such systems plays a central role in many linear algebra
problems: for instance, it is the second main operation in block Gaussian
elimination, after matrix multiplication, as will be recalled in section
\ref{sec:triang}. This operation is commonly named
11: \trsm in the BLAS convention. In the following, we will consider
12: without loss of generality the resolution of an upper triangular system with
13: matrix right hand side, i.e. the operation $B \leftarrow U^{-1}B$, where $U$ is
14: $m\times m$ upper triangular and $B$ is $m\times n$.
15:
16:
17: Following the approach of the BLAS numerical routine,
18: our implementation is based on a block recursive algorithm
19: to reduce the computation to matrix multiplications.
20:
21: Now similarly to our approach with matrix multiplication, the design of our
22: implementation also focuses on delaying the modular reductions as
23: much as possible. As will be shown in section \ref{ssec:trsmdel}, delaying the
24: whole resolution leads to a quick growth in the size of coefficients.
25: Therefore we also present in section \ref{ssec:trsmdelupdate} another way of
26: delaying these modular reductions.
27: We lastly present how to combine these two techniques within a multi-cascade
28: algorithm.
29:
30:
% To be moved to the introduction
32:
33: %% Let us denote by $R(m,k,n)$ the arithmetical cost of a $m \times k$ by $k \times n$ rectangular
34: %% matrix multiplication.
35: %% Now let us suppose that $k \leq m \leq n$, then
36: %% $R(k,m,n)$, $R(m,k,n)$ and $R(m,n,k)$ are all bounded by $\lCeil
37: %% \frac{m n }{k^2} \rCeil \MM(k)$ (see
38: %% e.g. \cite[(2.5)]{Huang:1997:FRM} for more details).
39:
40: %\newpage
41: \subsection{The block recursive algorithm}\label{ssec:rec-trsm}
42:
43: %% The classical idea is to use the divide and conquer approach.
44: %% Here, we consider the upper left triangular case without loss of
45: %% generality, since any combination of
46: %% upper/lower and left/right triangular cases are similar: if $U$ is
47: %% upper triangular, $L$ is lower triangular and $B$ is rectangular,
48: %% we call \ltrsm\ the resolution of $U X = B$, \lltrsm\ that of
49: %% $L X = B$, \urtrsm\ that of $XU=B$ and \lrtrsm\ that of $XL=B$.
50:
Algorithm \ref{alg:trsm:rec} recalls the block recursive algorithm.
52:
53: \begin{algorithm}
54: \dontprintsemicolon
55: \caption{\trsm($A,B$)}\label{alg:trsm:rec}
56: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$.}
57: \KwResult{$X \in \Zp^{m \times n}$ such that $AX=B$.}
58: \Begin{
59: \eIf{$m=1$}{
60: $ X:= A_{1,1}^{-1} \times B$\;
61: }{
62: \tcc{splitting matrices into two blocks of sizes $\left\lfloor \frac{m}{2}
63: \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$
64: \[
65: \begin{array}{cccc}
66: A & X & & B \\
67: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&
68: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&
69: = &
70: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }
71: \end{array}
72: \]}
73:
74: $X_2:=$\trsm($A_3,B_2$)\;
75: $B_1:= B_1 - A_2X_2$\;
76: $X_1:=$\trsm($A_1,B_1$)\;
77: }
78: }
79: \end{algorithm}
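The recursion of algorithm \ref{alg:trsm:rec} can be sketched in a few lines of Python. This is only an illustration on plain lists of integers, not the implementation discussed in this paper: the names \texttt{mat\_mul\_mod} and \texttt{trsm\_rec} are hypothetical, $p$ is assumed prime, and the diagonal of $A$ is assumed invertible modulo $p$.

```python
def mat_mul_mod(A, B, p):
    """Product of A (r x k) by B (k x c), reduced modulo p."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)]
            for row in A]


def trsm_rec(A, B, p):
    """Solve A X = B over Z/pZ for an invertible upper triangular A."""
    m = len(A)
    if m == 1:
        # Terminal case: one field inversion, then a scaling of the row of B.
        inv = pow(A[0][0], -1, p)  # modular inverse, Python >= 3.8
        return [[(inv * b) % p for b in B[0]]]
    h = m // 2  # blocks of sizes floor(m/2) and ceil(m/2)
    A1 = [row[:h] for row in A[:h]]
    A2 = [row[h:] for row in A[:h]]
    A3 = [row[h:] for row in A[h:]]
    X2 = trsm_rec(A3, B[h:], p)            # X2 := trsm(A3, B2)
    A2X2 = mat_mul_mod(A2, X2, p)
    B1 = [[(b - c) % p for b, c in zip(rb, rc)]
          for rb, rc in zip(B[:h], A2X2)]  # B1 := B1 - A2 X2
    X1 = trsm_rec(A1, B1, p)               # X1 := trsm(A1, B1)
    return X1 + X2                         # stack X1 on top of X2
```

Here every update is reduced modulo $p$ immediately; the following subsections are precisely about relaxing this.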
80:
81:
82: \begin{lem}\label{lem:trsm}
83: Algorithm \trsm\ is correct and the leading term of its arithmetic
84: complexity over $\Zp$ is
85: $$\TRSM(m,n) =
86: %\left\{ \begin{array}{ccc}
87: \frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil \MM(m)
88: %& & \text{if}~m\leq n \\
89: %\frac{1}{2^{\omega-1}-2} \lCeil\frac{m}{n}\rCeil^2 \MM(n)& &
90: % \text{if}~m \geq n
91: %\end{array}\right.
92: $$
93: This complexity is
94: %$\min(mn^2,nm^2)$
95: $m^2n$
96: using classic matrix multiplication.
97: \end{lem}
98:
99: \begin{proof}
Extending the previous notation \MM(n), we denote by \MM(m,k,n) the cost of
multiplying an $m\times k$ matrix by a $k\times n$ matrix.
102: The cost function $\TRSM(m,n)$ satisfies the following equation:
103: $$\TRSM(m,n)= 2\TRSM(\frac{m}{2},n)+\MM(\frac{m}{2},\frac{m}{2},n).$$
104: Let $t=\log_2(m)$. Although the algorithm works for any $n$, we restrict the
105: complexity analysis to the case where $m \leq n$ for the sake of simplicity.
106: We then have:
107: \begin{eqnarray*}
108: \TRSM(m,n)&=& 2\TRSM(\frac{m}{2},n)+
109: \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)
110: \\&=& 2^t \TRSM(1,n) + \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)
111: \frac{ 1 -
112: \left(\frac{2}{2^{\omega-1}}\right)^t}{1-\frac{2}{2^{\omega-1}}}.
113: \end{eqnarray*}
114: As $\TRSM(1,n)=2n$ and $\left(2^{\omega-1}\right)^t = m^{\omega-1}$,
115: we obtain the expected complexity
116: $\TRSM(m,n)=\frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil \MM(m) + \GO(m^2+mn).$
117: \end{proof}
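As a sanity check, the recurrence of the proof can be unrolled explicitly for classic matrix multiplication. The assumptions here are ours and only serve to obtain a closed form: $\omega=3$, $\MM(m)=2m^3$, $m=2^t$ and $m$ divides $n$, so that $\lCeil\frac{n}{m}\rCeil=\frac{n}{m}$.
\begin{eqnarray*}
\TRSM(m,n)&=& 2^t\,\TRSM(1,n) + \frac{1}{4}\,\frac{n}{m}\,2m^3\,
\frac{1-\left(\frac{1}{2}\right)^t}{1-\frac{1}{2}}\\
&=& 2mn + m^2n\left(1-\frac{1}{m}\right) \;=\; m^2n + mn.
\end{eqnarray*}
The leading term is $m^2n$, in agreement with lemma \ref{lem:trsm}, and the remainder $mn$ is absorbed by the $\GO(m^2+mn)$ term.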
118: %% When $m \geq n$, the trick is to consider two TRSM with the same
119: %% triangular matrix, but of right hand side of size $n/2$.
120: %% $R(\frac{m}{2},\frac{m}{2},n)=2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})$.
121: %% Therefore the inequality $m \geq n$
122: %% is preserved all along the algorithm and the cost is thus
123: %% %
124: %% $\TRSM(m,n)= 4\TRSM(\frac{m}{2},\frac{n}{2})+2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})=
125: %% 4\TRSM(\frac{m}{2},\frac{n}{2})+\frac{1}{2^{\omega-1}} \lCeil \frac{n}{m}
126: %% \rCeil^2 \MM(n)$.
127: %% Thus we have $\TRSM(m,n)=4^t T(1;1) + \frac{1}{2^{\omega-1}}\lCeil\frac{m}{n}\rCeil^2 \MM(m)
128: %% \frac{ 1 -
129: %% \left(\frac{4}{2^{\omega}}\right)^t}{1-\frac{4}{2^{\omega}}}$.
130: %% This yields the leading term $\frac{2}{2^\omega-4}\lCeil\frac{m}{n}\rCeil^2 \MM(m)$.
131: %
132: %
133: % By counting each operation at one recursive step we have:
134: % \begin{eqnarray}\nonumber
135: % C(m,n)= \sum_{i=1}^{\log m} 2^{i-1} R(\frac{m}{2^i},\frac{m}{2^i},n)
136: % \end{eqnarray}
137: % Now, since $m \leq n$, we get $\forall i \ R(\frac{m}{2^i},\frac{m}{2^i},n)=C_{\omega}\left(\frac{m}{2^i}\right)^{\omega -1}n$ and therefore:
138: % \begin{equation}\nonumber
139: % C(m,n) = \frac{C_{\omega} nm^{\omega -1}}{2}
140: % \sum_{i=1}^{\log m} \left(
141: % \frac{1}{2^i}\right)^{\omega -2}
142: % \end{equation}
143: % which gives
144: % %\[ C(m,n) \leq \frac{C_{\omega} }{2(2^{\omega-2}-1)}nm^{\omega-1}\]
145: % %Thus, this gives the bound $O(nm^{\omega - 1})$.\\
146: % the $O(nm^{\omega - 1})$ bound of the lemma.
147: %\end{proof}
148: % Without loss of generality for linear algebra applications,
149: % we here consider only the case where the row
150: % dimension, $m$, of the the triangular system is less than or equal to the column dimension, $n$.
151:
152: \subsection{Delaying reductions globally}\label{ssec:trsmdel}
153:
154: As for matrix multiplication, the delayed computation
155: relies on the fact that ring operations over the
156: finite field can be replaced by ring operations over \Z using the ring
157: homomorphisms described in section \ref{ssec:ffperf}.
158: However, triangular system resolutions involve, in the general case, field
159: operations: the divisions by the diagonal elements of the triangular matrix.
160: Therefore this technique is only valid with unit diagonal matrices.
161:
162: In the general case, the triangular matrix is made unit diagonal by the
163: following factorization: $A=DU$, where $D$ is diagonal and $U$ is unit diagonal
164: upper triangular. Then the system $U X = D^{-1}B$ only involves ring operations
165: and can be solved over \Z.
166: This normalization leads to an additional cost of $O(mn)$ arithmetic
167: operations (see \cite{jgd:2004:ffpack} for more details).
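The normalization step can be illustrated as follows. This is a minimal sketch under our own assumptions ($p$ prime, matrices stored as lists of rows of integers, hypothetical function name \texttt{normalize\_unit\_diagonal}); it scales the rows of both $A$ and $B$ explicitly, which is not necessarily how the implementation of \cite{jgd:2004:ffpack} proceeds.

```python
def normalize_unit_diagonal(A, B, p):
    """Return (U, C) with U = D^{-1} A and C = D^{-1} B modulo p,
    where D is the diagonal of the upper triangular matrix A.
    U is unit diagonal, and A X = B is equivalent to U X = C."""
    U, C = [], []
    for i, (row_a, row_b) in enumerate(zip(A, B)):
        inv = pow(row_a[i], -1, p)  # inverse of the i-th pivot (Python >= 3.8)
        U.append([(inv * a) % p for a in row_a])
        C.append([(inv * b) % p for b in row_b])
    return U, C
```

All the field inversions are concentrated in this step; the system $UX=C$ can then be solved with ring operations only.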
168:
169:
Now integer computation with a fixed-size arithmetic (e.g. floating point
arithmetic) is exact as long as no intermediate result of the computation
exceeds the bit capacity of the representation.
173: Therefore we now propose bounds on the values computed by the algorithm over \Z.
174:
%% One can get a first idea of the coefficient growth by noting
%% that the $k$-th coefficient $x_k$ of the solution vector of the system $Ax=b$ is a
%% linear combination of the $n-k$ following coefficients: $x_i, i\in [k+1\dots
%% n]$. Consequently, the size of the largest coefficient grows linearly
%% with the dimension of the system.
%% Theorem \ref{th:trsmbound} gives a more precise bound on the
%% value of the computed coefficients.
%% We also give a class of systems for which the bound is reached, which
%% proves its optimality.
184:
185: %
186: \begin{thm} \label{th:trsmbound}
187: Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal upper triangular matrix and $b \in \mathbb{Z}^n$,
188: with $m \leq T_{i,j} \leq M$ and $m \leq b_i \leq M$ and $m\leq 0\leq M$.
189: Let $x = ( x_i )_{i \in [1 \dots n]} \in \mathbb{Z}^n$ be the solution of the
190: system $Tx=b$.
191: Then $\forall \ k \in [0\dots n-1]$~:
192: \[ \left \{
193: \begin{array}{ll}
194: -u_k \leq x_{n-k} \leq v_k & \text{for $k$ even,}\\
195: -v_k \leq x_{n-k} \leq u_k & \text{for $k$ odd}
196: \end{array}
197: \right.
198: \]
199: with
200: \[
201: \left \{
202: \begin{array}{l}
203: u_k =\frac{M-m}{2}(M+1)^k - \frac{M+m}{2}(M-1)^k,\\%[2mm]
204: v_k = \frac{M-m}{2}(M+1)^k + \frac{M+m}{2}(M-1)^k. \\
205: \end{array}
206: \right.
207: \]
208: \end{thm}
209: %
210: \begin{proof}
211:
212: First note the following relations:
213: $$
214: \forall k \left\{
215: \begin{array}{lcl}
216: u_k &\leq& v_k \\
217: -mu_k &\leq &Mv_k\\
218: -mv_k &\leq &Mu_k\\
219: \end{array}
220: \right.
221: $$
222: The third one comes from
223: $$
224: Mu_k+mv_k=\frac{M^2-m^2}{2}((M+1)^k-(M-1)^k) \geq 0.
225: $$
The proof is now an induction on $k$, following the system resolution order.
The initial case $k=0$ corresponds to the first step:
228: $x_n=b_n$, leading to
229: $$ -u_0 = m \leq x_n \leq M = v_0.$$
230: Suppose now that the inequalities hold for $k\in [0\dots l]$
231: and prove them for $k=l+1$.
232: If $l$ is odd, $l+1$ is even.
233: {\small
234: \begin{eqnarray*}
235: x_{n-l-1}&=& b_{n-l-1} - \sum_{j=n-l}^n{T_{n-l-1,j}x_j}\\
236: &\leq& M + \sum_{i=0}^{\frac{l-1}{2}}{\max(Mu_{2i},-mv_{2i}) + \max(Mv_{2i+1},-mu_{2i+1})}\\
237: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i} + v_{2i+1}}\right) \\
238: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{M-m}{2}(M+2)(M+1)^{2i} +
239: \frac{M+m}{2}(M-2)(M-1)^{2i}} \right) \\
240: &\leq& M\left(1 + \frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} +
241: \frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\
242: &\leq&\frac{M-m}{2}(M+1)^{l+1} + \frac{M+m}{2}(M-1)^{l+1} = v_{l+1}.
243: \end{eqnarray*}
244: Similarly,
245: \begin{eqnarray*}
246: x_{n-l-1} &\geq& m - \sum_{i=0}^{\frac{l-1}{2}}{\max(Mv_{2i},-mu_{2i}) +
247: \max(Mu_{2i+1},-mv_{2i+1})}\\
248: &\geq& m - M\sum_{i=0}^{\frac{l-1}{2}}{v_{2i} + u_{2i+1}}\\
&\geq& m -M\sum_{i=0}^{\frac{l-1}{2}}{\left(\frac{M-m}{2}(M+2)(M+1)^{2i} -
\frac{M+m}{2}(M-2)(M-1)^{2i}\right) } \\
&\geq& m - M\left(\frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} -
\frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\
&\geq&-\frac{M-m}{2}(M+1)^{l+1} + \frac{M+m}{2}(M-1)^{l+1} = -u_{l+1}.
254: \end{eqnarray*}
255: }
256: For $l$ even, a similar proof leads to
257: $$
258: -v_{l+1} \leq x_{n-l-1} \leq u_{l+1}.
259: $$
260: \end{proof}
261: %
262: \begin{cor}\label{cor:trsmoptimal}
263: Using the notation of theorem \ref{th:trsmbound},
264: $$
265: |x| \leq \frac{M-m}{2}(M+1)^{n-1} + \frac{M+m}{2}(M-1)^{n-1}.
266: $$
267: Moreover this bound is optimal.
268: \end{cor}
269: %
270: \begin{proof}
271: The sequence $(v_k)$ is increasing and always greater than $(u_k)$.
Thus $\forall \ k \in [0\dots {n-1}] \ |x_{n-k}|\leq \max(u_k,v_k) = v_k \leq v_{n-1}$.
273:
274: Now the vector $x = ( x_i )_{i \in [1\dots n]} \in \mathbb{Z}^n$ such that
275: $ \forall \ k \in [0\dots n-1] \ |x_{n-k}| = v_k$ satisfies the system $Tx=b$ with
276: $$
277: T = \
278: \left[
279: \begin{array}{ccccc}
280: \ddots & \ddots & \ddots & \ddots & \ddots \\
281: & 1 & M & m & M \\
282: & & 1 & M & m \\
283: & & & 1 & M \\
284: & & & & 1 \\
285: \end{array}
286: \right],
287: b = \left[\begin{array}{c}\vdots\\m\\M\\m\\M \end{array} \right]
288: $$
289: Therefore the bound is reached.
290: \end{proof}
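The extremal system can be checked numerically. The sketch below is restricted, by our own choice, to a symmetric range $m=-M$ (for which $u_k=v_k=M(M+1)^k$); it builds the alternating system of the proof, solves it by back substitution over \Z, and compares $|x_{n-k}|$ with $v_k$. The function names are hypothetical.

```python
def bounds(m, M, k):
    """Return (u_k, v_k) of the theorem; both are integers for integer m, M."""
    a = (M - m) * (M + 1) ** k
    b = (M + m) * (M - 1) ** k
    return (a - b) // 2, (a + b) // 2


def alternating_system(M, n):
    """Unit upper triangular T and vector b whose entries alternate M, -M,
    laid out as in the proof (here with m = -M)."""
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = 1
        for j in range(i + 1, n):
            T[i][j] = M if (j - i) % 2 == 1 else -M
    b = [M if (n - 1 - i) % 2 == 0 else -M for i in range(n)]
    return T, b


def back_substitute(T, b):
    """Solve T x = b over the integers for a unit upper triangular T."""
    n = len(b)
    x = [0] * n
    for i in range(n - 1, -1, -1):
        x[i] = b[i] - sum(T[i][j] * x[j] for j in range(i + 1, n))
    return x


M, n = 2, 6
T, b = alternating_system(M, n)
x = back_substitute(T, b)
# x[n-1-k] is x_{n-k} in the 1-based notation of the theorem.
reached = all(abs(x[n - 1 - k]) == bounds(-M, M, k)[1] for k in range(n))
```

With Python integers the computation is exact whatever the dimension, so the check does not depend on any machine word size.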
291: The following corollaries apply this result to the positive and balanced modular
292: representations.
293:
294: \begin{cor}[Positive modular representation]\label{cor:trsmpositif}
295: For $1 \leq i,j \leq n$, if $T_{i,j},b_i \in [0\dots p-1]$, then
296: $$
|x| \leq \frac{p-1}{2}(p^{n-1} + (p-2)^{n-1}).
298: $$
299: \end{cor}
300: %
301: \begin{cor}[Balanced modular representation]\label{cor:trsmcentre}
302: For $1 \leq i,j \leq n$, if $T_{i,j},b_i \in [-\frac{p-1}{2}\dots
303: \frac{p-1}{2}]$, then
304: $$
305: |x| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^{n-1}.
306: $$
307: \end{cor}
308:
309: \begin{rem}
310: The balanced modular representation improves the bound by a factor of $2^{n-1}$.
311: \end{rem}
312:
313: As a consequence, one can solve a unit diagonal triangular system of dimension
314: $n$ using arithmetic operations with integers stored on $\gamma$ bits if
315: \begin{equation}\label{eq:trsmboundpos}
\frac{p-1}{2}(p^{n-1} + (p-2)^{n-1})< 2^{\gamma}
317: \end{equation}
318: for a positive representation and
319: \begin{equation}\label{eq:trsmboundcen}
320: \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{\gamma}
321: \end{equation}
322: for a balanced representation.
323:
324: For instance, using the \dbl floating point representation ($53$ bits of
325: mantissa)
326: the maximal dimension of the system is $34$ (resp. $52$) for a positive
327: (resp. balanced) representation of $\Z_3$.
For larger fields, this maximal dimension quickly becomes very small: with
329: $p=1001$, $n\leq 5$ (resp. $n\leq 6$) for a positive (resp. balanced)
330: representation.
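Such maximal dimensions can be computed by a direct search. The following sketch makes a few assumptions of its own: $p$ is an odd prime, the function name \texttt{t\_del} is hypothetical, the positive case substitutes $m=0$, $M=p-1$ (hence $M-1=p-2$) into the bound of corollary \ref{cor:trsmoptimal}, and the balanced case uses the exponent $n$ exactly as written in equation (\ref{eq:trsmboundcen}).

```python
def t_del(p, gamma=53, balanced=False):
    """Largest dimension n satisfying the overflow condition on gamma bits
    (returns 0 if even n = 1 fails). Assumes p is an odd prime."""
    n = 0
    while True:
        k = n + 1  # candidate dimension
        if balanced:
            lhs = (p - 1) // 2 * ((p + 1) // 2) ** k
        else:
            lhs = (p - 1) * (p ** (k - 1) + (p - 2) ** (k - 1)) // 2
        if lhs >= 2 ** gamma:
            return n
        n = k
```

For $p=3$ and $\gamma=53$ this search gives $34$ in the positive case and $52$ in the balanced case, and for $p=1001$ it gives $5$ in the positive case.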
331:
332: In the following, we will denote by $t_\text{del}(p,\gamma)$ the maximum
333: dimension for the resolution with delayed modular reductions.
334: This dimension is small, and this approach can therefore only be used
335: as a terminal case of the recursive block algorithm.
336: This first cascade algorithm is characterized by the threshold
337: $t_\text{del}$.
338: For efficiency, we used in our implementation the BLAS routine \trsm to perform
339: the delayed computation over \Z.
340: Despite the small dimension of the blocks, we will see in section
341: \ref{ssec:trsmexp} that this approach can slightly improve the efficiency of the
342: computation when the finite field is small.
343:
344: \subsection{Delaying reductions in the update phase only} \label{ssec:trsmdelupdate}
345: %
The block recursive algorithm consists of several matrix multiplications of
different dimensions. In most cases, the matrix multiplications are done over \Z
with a modular reduction on the result only. But part of these result matrices
will be accumulated with other matrix products in later computations.
Therefore these intermediate modular reductions can be delayed even further by
accumulating these results over \Z for as long as possible.
352:
This technique can be applied within the former cascade algorithm, to produce a
double cascade structure. The key idea is to split the matrices at two levels, as
shown in figure \ref{fig:trsm:recblasdelayed}:
356: %
357: \begin{figure}[htbp]\begin{center}
358: \includegraphics[width=0.8\textwidth]{trsm_cascade.eps}
359: \caption{Splitting for the double cascade \trsm algorithm}
360: \label{fig:trsm:recblasdelayed}
361: \end{center}\end{figure}
362: %
363: a fine grain
364: splitting with the dimension $t_\text{del}$ of the previous section, and a
365: coarse grain splitting with the dimension $t_\text{update}$ such that
366: all recursive calls of dimension lower than $t_\text{update}$ can let the
367: matrix multiplication updates accumulate without modular reductions.
Choosing $t_\text{update} = k_\text{Winograd}$ (from corollary \ref{cor:winokmax})
ensures this property.
To make the dimensions of the two block decompositions compatible, we set
371: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right
372: \rfloor t_\text{del}$.
373: %
374: \begin{algorithm}
375: \dontprintsemicolon
376: \caption{\texttt{trsm-rec-BLAS-delayed}~:}
377: \label{alg:trsm:recblasdelayed}
378: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$}
379: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}
380: \Begin{
381: Compute $t_\text{del}$ from equation (\ref{eq:trsmboundpos} or \ref{eq:trsmboundcen}) \;
Compute $t_\text{Winograd}$ from corollary (\ref{cor:winokmax}) \;
383: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right
384: \rfloor t_\text{del}$\;
385: \ForEach{block column of $A$ of dimension $m\times t_\text{split}$ of the form
386: $\begin{bmatrix}V_i\\U_i\\0\end{bmatrix}$}{
387: $X_i = \texttt{trsm-partial-delayed} (U_i,B_i)$ \;
388: $X_i = X_i \mod p$\;
389: $B_{1\dots i-1} = B_{1\dots i-1} - V_i X_i$\;
390: $B_{1\dots i-1} = B_{1\dots i-1} \mod p$\;
391: }
392: \Return{$X$}
393: }
394: \end{algorithm}
395: %
396: \begin{algorithm}
397: \dontprintsemicolon
398: \caption{\texttt{trsm-partial-delayed}}
399: \label{alg:trsmdelaye}
400: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$, $m$ must be lower
401: than $t_\text{update}$}
402: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}
403: \Begin{
\eIf{ $m\leq t_\text{del}$}{
405: $B=B \mod p$\;
406: $X = \texttt{dtrsm}(A,B)$ \tcc*{the BLAS routine}\;
407: $X=X \mod p$\;
408: }{
409: \tcc{ (splitting of the matrix into blocks of dimension $\left\lfloor \frac{m}{2}
410: \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$) }\;
411: $
412: \begin{array}{cccc}
413: A & X & & B \\
414: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&
415: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&
416: = &
417: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }
418: \end{array}
419: $\;
420: $X_2:=\texttt{trsm-partial-delayed} (A_3,B_2)$ \;
421: $B_1:= B_1 - A_2X_2$ \tcc*{without modular reduction} \;
422: $X_1:=\texttt{trsm-partial-delayed} (A_1,B_1)$ \;
423: }
424: \Return $X$
425: }
426: \end{algorithm}
427: %
428:
429: Algorithm \ref{alg:trsm:recblasdelayed} is a loop on every block of column dimension
430: $t_\text{update}$. For each of them, the triangular system is solved using algorithm
431: \ref{alg:trsmdelaye} and the update is performed by a matrix multiplication over
432: \Z followed by a modular reduction.
433: Algorithm \ref{alg:trsmdelaye} is simply the cascade algorithm of the previous
434: section: the block recursive algorithm \ref{alg:trsm:rec} with the fully delayed
435: algorithm as a terminal case.
The matrix multiplication updates are performed over \Z without any reduction of
the result, since the threshold $t_\text{update}$ guarantees that they can be
accumulated without overflow.
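The control flow of the double cascade can be sketched as follows. This is an illustration only, with hypothetical function names: Python integers are unbounded, so the sketch shows \emph{where} the reductions take place but does not model overflow; the back substitution \texttt{solve\_unit\_upper} merely stands in for the BLAS routine, $A$ is assumed unit diagonal, and the thresholds are passed as parameters (with $t_\text{del}\geq 1$) instead of being computed.

```python
def matmul(A, B):
    """Integer product of A (r x k) by B (k x c), without modular reduction."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def reduce_mod(M, p):
    return [[x % p for x in row] for row in M]


def solve_unit_upper(A, B):
    """Back substitution over Z for a unit upper triangular A."""
    m = len(A)
    X = [None] * m
    for i in range(m - 1, -1, -1):
        X[i] = [b - sum(A[i][j] * X[j][c] for j in range(i + 1, m))
                for c, b in enumerate(B[i])]
    return X


def partial_delayed(A, B, p, t_del):
    """Recursive solve; the update B1 - A2 X2 is left unreduced."""
    m = len(A)
    if m <= t_del:  # terminal case: fully delayed resolution
        return reduce_mod(solve_unit_upper(A, reduce_mod(B, p)), p)
    h = m // 2
    A1 = [r[:h] for r in A[:h]]
    A2 = [r[h:] for r in A[:h]]
    A3 = [r[h:] for r in A[h:]]
    X2 = partial_delayed(A3, B[h:], p, t_del)
    P = matmul(A2, X2)
    B1 = [[b - q for b, q in zip(rb, rq)] for rb, rq in zip(B[:h], P)]
    X1 = partial_delayed(A1, B1, p, t_del)
    return X1 + X2


def trsm_rec_blas_delayed(A, B, p, t_del, t_split):
    """Outer loop over diagonal blocks of dimension at most t_split,
    from the bottom right corner upwards."""
    m = len(A)
    B = [row[:] for row in B]  # work on a copy of the right hand side
    X = [None] * m
    hi = m
    while hi > 0:
        lo = max(0, hi - t_split)
        U = [r[lo:hi] for r in A[lo:hi]]  # diagonal block U_i
        V = [r[lo:hi] for r in A[:lo]]    # block V_i above it
        Xi = partial_delayed(U, B[lo:hi], p, t_del)
        X[lo:hi] = Xi
        if lo > 0:  # update of the remaining rows, then reduction
            P = matmul(V, Xi)
            for r in range(lo):
                B[r] = [(b - q) % p for b, q in zip(B[r], P[r])]
        hi = lo
    return X
```

In this sketch the only reductions are those of the terminal case and the one following each coarse grain update, which mirrors the structure of algorithms \ref{alg:trsm:recblasdelayed} and \ref{alg:trsmdelaye}.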
438:
439: %So the only modular reductions are
440: %performed after the call to \trsm.
441:
442:
443: \subsection{Experiments}
444: \label{ssec:trsmexp}
445: %\subsubsection{Comparison of the variants}
446:
447: We now compare three implementations of the \trsm routine over a word size finite
448: field:
449: \begin{description}
450: \item Pure recursive (\texttt{Pure-Rec}): Simply algorithm \ref{alg:trsm:rec},
451: \item Recursive-BLAS (\texttt{Rec-BLAS}): The cascade algorithm formed by
452: the recursive algorithm and the BLAS routine \dtrsm as a terminal case. It
453: differs from algorithm \ref{alg:trsmdelaye} by the fact that the matrix
454: multiplication $B_1:= B_1 - A_2X_2$ is always followed by a modular reduction.
\item Recursive-BLAS-Delayed (\texttt{Rec-BLAS-Delayed}): algorithm \ref{alg:trsm:recblasdelayed}.
456: \end{description}
457:
458: We compare these three variants over finite fields with different cardinalities,
459: so as to make the parameters $t_\text{del}$ and $t_\text{update}$ vary as in
460: the following table:
461:
462: \begin{center}
463: \begin{tabular}{|c||c|c|c|}
464: \hline
465: $p$ & $\lceil \log_2 p\rceil$ & $t_\text{del}$ & $t_\text{update}$\\
466: \hline
467: 5 & 3 & 23 & 2\,147\,483\,642\\
468: 1\,048\,583 & 20 & 2 & 8190 \\
469: 8\,388\,617 & 23 & 2 & 126 \\
470: \hline
471: \end{tabular}
472: \end{center}
473:
474: \begin{figure}[htbp]\begin{center}
475: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_5en_goto.eps}
476: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_1048583en_goto.eps}
477: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_8388617en_goto.eps}
\caption{Comparison of the \trsm variants for $p=5,1\,048\,583,8\,388\,617$,
on a Pentium 4 (3.2\,GHz, 1\,GB)}
480: \label{fig:trsm:compvar}
481: \end{center}\end{figure}
482:
483: In the experiments of figure \ref{fig:trsm:compvar}, the matrix $B$ is
484: square ($m=n$).
485: %
486: One can first notice the gain provided by the use of the first cascade with the
487: delayed \dtrsm routine by comparing the curves \texttt{rec-BLAS} and
\texttt{pure-rec} for $p=5$. This advantage shrinks when the characteristic gets larger,
since $t_\text{del}=2$ for $p=1\,048\,583$ or $p=8\,388\,617$.
490:
Now the introduction of the coarse grain splitting, which delays the reductions in
the update phase, improves the computation speed by up to 500 Mfops.
493: This gain is similar for $p=5$ and $p=1\,048\,583$ since in both
494: cases $n<t_\text{update}$ and there is therefore no modular reduction between the
495: matrix multiplications.
496:
Lastly, for $p=8\,388\,617$, the speed drops since more reductions are required.
498: The variants \texttt{pure-rec} and \texttt{rec-BLAS} are penalized by their
499: dichotomic splitting, creating too many modular reductions after each matrix
500: multiplication. Now \texttt{rec-BLAS-delayed} has the best efficiency since the double
501: cascade structure minimizes the number of reductions.
502:
503:
504:
505: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
506:
507: %% Matrix multiplication speed over finite fields was improved
508: %% %in~\cite{jgd:2002:fflas,Pernet:2001:Winograd}
509: %% by the use of the
510: %% numerical BLAS library:
511: %% matrices are converted to floating point representations
512: %% (where the linear algebra routines are fast) and converted back to a finite
513: %% field representation afterwards.
514: %% %The computations remained exact as
515: %% %long as no overflow occurred.
516: %% An implementation of \trsm\ can use the same
517: %% techniques. Indeed, as soon as no overflow occurs one can replace the
518: %% recursive call to \trsm\ by the numerical BLAS {\it dtrsm}
519: %% routine. But one can remark that approximate divisions can occur.
520: %% So we need to ensure both that only exact divisions are performed and that no overflow appears.
521: %% Not only one has to be careful for the result to remain within
522: %% acceptable bounds, but, unlike matrix multiplication where data grows
523: %% linearly, data involved in linear system grows exponentially as shown
524: %% in the following.\\
525: %% %
526: %% The next two subsections first show how to deal with
527: %% divisions, and then give an optimal theoretical bound on the
528: %% coefficient growth and therefore an optimal threshold for
529: %% the switch to the numerical call.
530:
531: %% \subsubsection{Dealing with divisions}
532:
533:
534: %% \subsection{A theoretical threshold}
535: %% We want to use the BLAS trsm routine to solve triangular systems over the
536: %% integers (stored as {\tt double} for {\tt dtrsm} or {\tt float} for {\tt
537: %% strsm}).
538: %% The restriction is then the coefficient growth in the solution.
539: %% Indeed, the $k^{th}$ value in the solution vector is a linear combination of the
540: %% $(n-k)$ already computed next values.
541: %% This implies a linear growth in the coefficient size of the solution, with
542: %% respect to the system dimension.
543: %% Now this resolution can only be performed if every element of the solution can
544: %% be stored in the mantissa of the floating point representation
545: %% (e.g. $53$ bits for {\tt double }).
546: %% Therefore overflow control consists in finding the largest block dimension
547: %% $\beta$, such that the result of the call to BLAS trsm routine will remain
548: %% exact.
549:
550: %% We now propose a bound for the values of the solutions of such a system; this
551: %% bound is optimal (in the sense that there exists a worst
552: %% case matching the bound when $n=2^i \beta$).
553: %% This enables the implementation of a cascading algorithm, starting recursively
554: %% and taking advantage of the BLAS performances as soon as possible.
555: %% %
556: %% %Let us introduce the two following series:
557: %% %\[
558: %% %\left \{
559: %% %\begin{array}{l}
560: %% %u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]\\[2mm]
561: %% %v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]\\
562: %% %\end{array}
563: %% %\right. ~~~ for~ an~ integer~ p>2
564: %% %\]
565: %% %
566: %% \begin{thm} \label{THEO:TRSMBOUND}
567: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal
568: %% upper triangular matrix, and $b \in \mathbb{Z}^n$,
569: %% with $0 \leq T \leq p-1$ and $0 \leq b \leq p-1$.
570: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of
571: %% $T.X=b$ over the integers.
572: %% Then, $\forall \ k \in [0..n-1]$:
573: %% \[ \left \{
574: %% \begin{array}{ll}
575: %% (p-2)^k-p^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k + (p-2)^k & \mbox{if $k$ is even}\\
576: %% -p^k-(p-2)^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k - (p-2)^k & \mbox{if $k$ is odd}
577: %% \end{array}
578: %% \right.
579: %% \]
580: %% \end{thm}
581: %% %
582: %% \begin{proof}
583: %% The idea is to use an induction on $k$ with the relation
584: %% $ x_k = b_k - \sum_{i=k+1}^{n}{T_{k,i}x_i}$.
585: %% A lower and an upper bound for $x_{n-k}$ are computed, depending
586: %% whether $k$ is even or odd:
587: %% Let us define the following induction hypothesis $IH_l$:
588: %% \[
589: %% \forall \ k \in [0..l]
590: %% \left \{
591: %% \begin{array}{ll}
592: %% -u_k \leq x_{n-k} \leq v_k & \mbox{if $k$ is even}\\
593: %% -v_k \leq x_{n-k} \leq u_k & \mbox{if $k$ is odd}\\
594:
595: %% \end{array}
596: %% \right.
597: %% \]
598: %% % Let us define the induction hypothesis $IH_l$ to be that the equations
599: %% % (\ref{eq:bound}) hold for $k \in [0..l-1]$ .
600: %% %
601: %% When $l=0$, $x_n=b_n$ which implies that
602: %% $ -u_0 = 0 \leq x_n \leq p-1 = v_0$. Thus $IH_0$ is proven.
603: %% %
604: %% Let us suppose that $ IH_l$ is true, and
605: %% prove $IH_{l+1}$. There are two cases: either $l$ is odd or not !
606: %% If $l$ is odd, $l+1$ is even. Now,
607: %% by induction,
608: %% \begin{small}
609: %% \begin{eqnarray*}
610: %% x_{n-l-1}&\leq& (p-1) \left( 1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i}+v_{2i+1}}\right) \\
611: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}-(p-2)^{2i} + p^{2i+1}+(p-2)^{2i+1}\right] } \\
612: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}(p+1) + (p-2)^{2i}(p-3) \right] } \\
613: %% &\leq & p-1 + \frac{(p-1)^2}{2}
614: %% \left[(p+1)\frac{p^{l+1}-1}{p^2-1} + (p-3)\frac{(p-2)^{l+1}-1}{(p-2)^2-1} \right] \\
615: %% &\leq & \frac{p-1}{2} \left[ p^{l+1} + (p-2)^{l+1}\right] = v_{l+1}\\
616: %% \end{eqnarray*}
617: %% \end{small}
618: %% Similarly,
619: %% \begin{small}
620: %% \begin{eqnarray*}
621: %% x_{n-l-1} &\geq& -(p-1) \sum_{i=0}^{\frac{l-1}{2}}{v_{2i}+u_{2i+1}}\\
622: %% &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[
623: %% p^{2i} + (p-2)^{2i} + p^{2i+1} - (p-2)^{2i+1}\right]} \\
624: %% &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[
625: %% p^{2i}(p+1) - (p-2)^{2i}(p-3) \right]} \\
626: %% &\geq & -\frac{p-1}{2} \left[ p^{l+1} - (p-2)^{l+1} \right] = u_{l+1}\\
627: %% \end{eqnarray*}
628: %% \end{small}
629: %% Finally, If $l$ is even, a similar proof leads to $-v_{l+1}\leq x_{n-l+1} \leq u_{l+1} $.
630: %% \end{proof}
631: %% %
632: %% \begin{cor}\label{cor:TRSMBOUND}
633: %% $
634: %% |X| \leq \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right]
635: %% $.\\
636: %% Moreover, this bound is optimal.
637: %% \end{cor}
638: %% %
639: %% \begin{proof} We denote by
640: %% $u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]$
641: %% and $v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]$ the bounds of the
642: %% theorem \ref{THEO:TRSMBOUND}. Now $ \forall \ k \in [0..{n-1}] \ u_k \leq v_k \leq v_{n-1}$.
643: %% Therefore the theorem \ref{THEO:TRSMBOUND} gives $\forall \ k \in [1..n] \ x_k \leq v_{n-1} \leq
644: %% \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right] $
645: %% \[
646: %% \text{Let}~T = \
647: %% \left[
648: %% \begin{array}{ccccc}
649: %% \ddots & \ddots & \ddots & \ddots & \ddots \\
650: %% & 1 & p-1 & 0 & p-1 \\
651: %% & & 1 & p-1 & 0 \\
652: %% & & & 1 & p-1 \\
653: %% & & & & 1 \\
654: %% \end{array}
655: %% \right],
656: %% b = \left[\begin{array}{c}\vdots\\0\\p-1\\0\\p-1 \end{array} \right]
657: %% \]
658: %% Then the solution $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ of
659: %% the system $T.X = b$ satisfies $ \forall \ k \in [0..n-1] \ |x_{n-k}| = v_k$
660: %% \end{proof}
661: %% One can derive the same kind of bound for the centered representation,
662: %% but with an $2^n$ gain.
663: %% \begin{thm}
664: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal
665: %% upper triangular matrix, and $b \in \mathbb{Z}^n$,
666: %% with $\left| T \right| \leq \frac{p-1}{2}$ and $\left| b \right|\leq \frac{p-1}{2}$.
667: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of
668: %% $T.X=b$ over the integers. Then
669: %% $
670: %% |X| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n
671: %% $.\\
672: %% Moreover, this bound is optimal.
673: %% \end{thm}
674: %% \begin{proof} The proof is simpler than that of theorem
675: %% \ref{THEO:TRSMBOUND}, since the inequations are symmetric.
676: %% Therefore $u_n=v_n$ and the induction yields
677: %% $$u_n=\frac{p-1}{2}\left(1+\sum_{i=0}^{n-1}u_i\right)=\frac{p-1}{2}\left(1+\frac{p-1}{2}\frac{
678: %% \left(\frac{p+1}{2}\right)^n-1}{\frac{p+1}{2}-1}\right) = \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n.$$
679: %% \end{proof}
680:
681: %% Thus, for a given $p$, the dimension $n$ of the system must satisfy
682: %% \begin{equation}
683: %% \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{m}
684: %% \end{equation}
685: %% where $m$ is the size of the mantissa
686: %% so that the resolution over the integers using the BLAS trsm routine
687: %% is exact. For instance, with a 53 bits mantissa,
688: %% this gives quite small matrices,
%% namely at most $92 \times 92$ for $p=2$, at most $4\times 4$ for $p \leq 3089$,
%% and at most $p=416107$ for $2\times 2$ matrices.
692: %% Nevertheless, this technique is speed-worthy in most cases as shown in
693: %% section \ref{ssec:trsmexp}.
694: %Indeed, this test can easily be performed in the
695: %recursive {\tt trsm} routine to determine whether the dimension of the
696: %system is small enough to make use of the BLAS trsm routine.
697:
698: %\input{delayed}
699:
700: %% \subsection{``Trsm'' implementations behavior}\label{ssec:trsmexp}
701: %% As shown in section \ref{ssec:rec-trsm} the block recursive algorithm {\trsm} is based on matrix multiplications.
702: %% This allows us to use our fast matrix multiplication routine.
703: %% % of the
704: %% % FFLAS package \cite{jgd:2002:fflas}. This is an exact wrapping of the
705: %% % ATLAS
706: %% % library\footnote{\scriptsize\texttt{http://math-atlas.sourceforge.net}\cite{Whaley:2001:AEO}}
707: %% % used as a kernel to implement
708: %% % the {\trsm} variants.
709: %% The following table
710: %% %derives from experimental results of
711: %% %\cite{jgd:2004:ffpack} and
712: %% expresses which of the two preceding
713: %% variants is better:
714:
715: %% \newcommand{\blastrsm}{{{\tt BLASTrsm}}}
716: %% \newcommand{\deltrsm}{{{\tt DelayTrsm}}}
%% {\blastrsm} is the hybrid numeric/finite field implementation of section \ref{ssec:trsm-blas} and
718: %% {\deltrsm} is the delayed division implementation of section
719: %% \ref{ssec:recdelay}.
720: %% {\tt Zpz-double} is a field representation
721: %% from \cite{jgd:2004:dotprod} where the elements are stored as
722: %% floating points to avoid one of the conversions. {\tt Zpz-int}
723: %% is a field representation from \cite{jgd:2005:givaro} where the
724: %% elements are stored as small integers.
725:
726: %% \begin{table}[htbp]\begin{center}
727: %% \small
728: %% \begin{tabular}{|c||r|r|r|r|r|r|}
729: %% \hline
730: %% $n$ & {\em 400} & {\em 1000} & {\em 2000} & {\em 5000} \\
731: %% \hline
732: %% {\tt Zpz-double(5)} & \blastrsm & \blastrsm & \blastrsm & \blastrsm \\
733: %% {\tt Zpz-double(32749)} & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \blastrsm & \blastrsm \\
734: %% {\tt Zpz-int(5)} & \deltrsm$_{100}$ & \deltrsm$_{100}$ & \blastrsm & \blastrsm \\
735: %% {\tt Zpz-int(32749)} & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$ \\
736: %% %%$n$ & {\em 400} & {\em 700} & {\em 1000} & {\em 2000} &
737: %% %%{\em 5000} \\
738: %% %% \hline
739: %% %% {\tt Mod<double>(5)} & \blastrsm & \blastrsm & \blastrsm & \blastrsm & \blastrsm \\
740: %% %% {\tt Mod<double>(32749)} & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \blastrsm & \blastrsm \\
741: %% %% {\tt G-Zpz(5)} & \deltrsm$_{100}$ & \deltrsm$_{150}$ & \deltrsm$_{100}$ & \blastrsm & \blastrsm \\
742: %% %% {\tt G-Zpz(32749)}& \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$& \deltrsm$_{50}$ & \deltrsm$_{50}$ \\
743: %% \hline
744: %% \end{tabular}
745: %% \caption{Best variant for \trsm\ on a P4,
746: %% 2.4GHz}\label{tab:trsmmodular}
747: %% \end{center}
748: %% \vspace{-1em}
749: %% \end{table}
750:
751: %% To summarize, one would rather use {\tt ZpZ-double} representation and ``blas'' {\trsm} variant in most cases.
752: %% However, when the base field is already specified ``delayed{\large$_t$}'' could provide slightly better performances.
753: %% This requires a search for optimal thresholds which again could be done through an Automated Empirical Optimizations of Software \cite{Whaley:2001:AEO}.
754:
755:
756: %\subsection{Performances and comparison with numerical routines}
757:
We now compare this implementation with \dtrsm, the equivalent routine of the original BLAS.
759: %In the previous section we showed that {\trsm} optimized variant based on numerical solving allows us to achieve the best performances.
760: %In this section we compare these performances with pure numerical solving and with matrix multiplication.
761: %In order to achieve the best performances we use as much as possible fast matrix multiplication of section \ref{ssec:winograd}.
762: %For this purpose we use an experimental switching threshold to classic multiplication since table \ref{tab:winolevel} reflects only theoretical behavior.
As for matrix multiplication in section \ref{ssec:fgemm-perf}, we compare the routines using
two different BLAS implementations (ATLAS and GOTO) on
two different architectures. However, we do not report the \dtrsm
timings with ATLAS on the Xeon architecture, because ATLAS \dtrsm
performed surprisingly poorly during our tests.
In the following, \ftrsm denotes the \trsm routine over a $16$-bit prime field (namely $\Z_{65521}$)
using the \texttt{ZpZ-double} implementation.
770:
771:
772:
773:
774: %\begin{figure}[hbtp]
775: %\includegraphics[width=8cm,angle=-90]{timing-trsm-p4}
776: %\caption{Timing comparison for matrix multiplication (exact and numeric) on a P4, 3.4GHz}
777: %\end{figure}
778:
779:
780:
781: \begin{table}[htbp]\begin{center}
\begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}
783: \cline{3-11}
784: \multicolumn{2}{c|}{} & $n$ %& {\em 500}
785: & {\em 1000} & {\em 2000} & {\em 3000} & {\em 5000} & {\em 7000} & {\em 8000} & {\em 9000} & {\em 10000} \\
786: \cline{3-11}
787: \multicolumn{11}{c}{}\\[-0.1cm]
788: \hline
789: &ATLAS & ftrsm & $0.37$s & $1.93$s & $5.73$s & $23.63$s & $62.50$s & $91.67$s & $121.84$s & $166.74$s \\
790: %\cline{3-11}
791: %& & dtrsm & $$s & $$s & $$s & $$s & $$s & $$s & $$s & $$s \\
792: %\cline{3-11} \\[-.3cm]
793: %\cline{3-11}
794: %& \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{fgemm}{dgemm}$} & \bf & \bf & \bf & \bf & \bf & \bf & \bf & \bf \\
795: \hline
796: \multicolumn{11}{c}{}\\[-0.2cm]
797: \hline
798:
799: & & ftrsm %& $0.059$s
800: & $0.25$s & $1.66$s & $5.08$s &
801: $21.47$s & $55.95$s & $80.77$s & $111.57$s & $150.81$s \\
802: \cline{3-11}
803: & & dtrsm % & $0.023$s
804: & $0.17$s & $1.35$s & $4.50$s &
805: $20.64$s & $56.19$s & $83.85$s & $119.18$s & $163.33$s \\
806: \cline{3-11} \\[-.3cm]
807: \cline{3-11}
808: & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf
809: $\frac{ftrsm}{dtrsm}$} %& {\bf 2.57}
810: & \bf 1.47 & \bf 1.23 & \bf 1.13 & \bf 1.04 & \bf 1.00 & \bf 0.96 & \bf 0.94 & \bf 0.92 \\
811: \hline
812:
813: \end{tabular}
\caption{Timings of the triangular solver with matrix right hand side on a Xeon,
3.6GHz}\label{tab:trsm-p4}
816: %\end{center}
817: %\end{table}
818:
819: %\begin{figure}[hbtp]
820: %\includegraphics[width=8cm,angle=-90]{timing-trsm-itanium2}
821: %\caption{Timing comparison for triangular system solving with matrix hand side (exact and numeric) on Itanium2-1.3GHz}
822: %\end{figure}
823:
824: %\begin{table}[htbp]\begin{center}
825: \begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}
826: \multicolumn{11}{c}{}\\
827: \cline{3-11}
828: \multicolumn{2}{c|}{} & $n$ & {\em 1000} & {\em 2000} & {\em 3000} & {\em 5000} & {\em 7000} & {\em 8000} & {\em 9000} & {\em 10000}\\
829: \cline{3-11}
830: \multicolumn{11}{c}{}\\[-0.2cm]
831: \hline
832: & & ftrsm & $0.34$s & $2.28$s & $7.11$s & $30.26$s & $77.43$s & $112.01$s & $158.00$s & $214.31$s \\
833: \cline{3-11}
834: & & dtrsm & $0.26$s & $1.95$s & $6.37$s & $28.60$s & $76.44$s & $113.78$s & $161.19$s & $219.31$s \\
835: \cline{3-11} \\[-.3cm]
836: \cline{3-11}
837: & \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$} & \bf 1.31 & \bf 1.17 & \bf 1.12 & \bf 1.06 & \bf 1.01 & \bf 0.98 & \bf 0.98 & \bf 0.98 \\
838: \hline
839: \multicolumn{11}{c}{}\\[-0.2cm]
840: \hline
841:
842: & & ftrsm & $0.30$s & $2.00$s & $6.23$s & $26.67$s & $68.22$s & $104.32$s & $137.96$s & $192.37$s \\
843: \cline{3-11}
844: & & dtrsm & $0.21$s & $1.61$s & $5.36$s & $24.59$s & $67.35$s & $100.42$s & $142.43$s & $195.79$s \\
845: \cline{3-11} \\[-.3cm]
846: \cline{3-11}
847: & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$} & \bf 1.43 & \bf 1.24 & \bf 1.16 & \bf 1.08 & \bf 1.01 & \bf 1.04 & \bf 0.97 & \bf 0.98 \\
848: \hline
849:
850: \end{tabular}
\caption{Timings of the triangular solver with matrix right hand side on Itanium2, 1.3GHz}\label{tab:trsm-ia64}
852: \end{center}
853: \end{table}
854:
Tables \ref{tab:trsm-p4} and \ref{tab:trsm-ia64} show that our
implementation of exact {\trsm} solving is close to the numerical
one: \ftrsm is at most $47\%$ slower than \dtrsm for $n=1\,000$, and
the overhead is at most $8\%$ for $n=5\,000$ and beyond.
%In particular, ``ftrsm'' performances tend to catch up with BLAS ones as soon as the dimensions of matrices increase.
Moreover, on our Xeon architecture, with GOTO BLAS, we even
outperform numerical solving for matrices
of dimension greater than $7\,000$.
862:
863:
864: \begin{figure}[hbtp]
865: \begin{center}
866: \includegraphics[width=0.55\textwidth,angle=-90]{graph-BEST-trsm}
867: \end{center}
868: \caption{Comparing triangular system solving with matrix
869: multiplication on a Xeon,
870: 3.6GHz} \label{fig:trsm-ratio}
871: \end{figure}
872:
The good performance of our implementation is mostly due to
the efficient reduction to fast matrix multiplication and to the double
cascade structure.
Figure~\ref{fig:trsm-ratio} shows the ratio of the computation time of
our \trsm to that of our matrix multiplication routine.
%One can see from this figure that our experimental ratio converges to the theoretical one.
%In particular, the theoretical ratio is slightly more than $\frac{1}{2}$ since fast matrix multiplication algorithm is used.
According to lemma \ref{lem:trsm}, this ratio is $1/2$ with $\omega=3$
and $2/3$ with $\omega = \log_2 7$.
In practice, our implementation only performs a few recursive calls of
Winograd's algorithm, and the ratio appears to lie between $1/2$ and $2/3$ as
soon as the dimension is large enough, which shows the efficiency of the reduction to
matrix multiplication.
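
As a rough sketch of where these two constants come from, assume (our
simplification, not a restatement of lemma \ref{lem:trsm}) that an
$m\times m$ matrix multiplication costs $\mathrm{MM}(m) = C_\omega m^\omega$
with $\omega > 2$, that lower order terms are neglected, and that the
recursion splits $U$ into four $\frac{m}{2}\times\frac{m}{2}$ blocks.
Writing $T(m)$ for the cost of the square case $n=m$, one level of the
recursion then amounts to four half-size square solves and one
$\frac{m}{2}\times\frac{m}{2}$ by $\frac{m}{2}\times m$ update, i.e. two
square products:
$$T(m) = 4\,T\left(\frac{m}{2}\right) + 2\,\mathrm{MM}\left(\frac{m}{2}\right).$$
Looking for a solution of the form $T(m) = c\,\mathrm{MM}(m)$ gives
$c\,2^\omega = 4c + 2$, hence
$$c = \frac{1}{2^{\omega-1}-2},$$
which evaluates to $\frac{1}{4-2} = \frac{1}{2}$ for $\omega = 3$ and to
$\frac{1}{7/2-2} = \frac{2}{3}$ for $\omega = \log_2 7$.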
886:
887:
888: %
889: %
890: %
891: %\subsubsection{Recursive with delayed modulus}
892: %\subsubsubsection{Threshold}
893: %\subsubsubsection{Dot product}
894: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
895: %%% Local Variables:
896: %%% mode: latex
897: %%% TeX-master: "../dlaff.tex"
898: %%% End:
899: