1: \section{Triangular system solving with matrix right/left hand side}\label{sec:trsm}
2: %%
3: We now discuss the implementation of solvers for 
4: triangular systems with matrix right hand side (or equivalently left 
5: hand side). 
6: %This is also the simultaneous resolution of $n$ triangular systems.
7: %
8: The resolution of such systems plays a central role in many linear algebra
9: problems, e.g. it is the second main operation 
10: in block Gaussian elimination after matrix multiplication as will be recalled in section \ref{sec:triang}. This operation is commonly named 
11: \trsm in the BLAS convention. In the following, we will consider
12: without loss of generality the resolution of an upper triangular system with
13: matrix right hand side, i.e. the operation $B \leftarrow U^{-1}B$, where $U$ is
14: $m\times m$ upper triangular and $B$ is $m\times n$.
15: 
16: 
17: Following the approach of the BLAS numerical routine,
18: our implementation is based on  a block recursive algorithm 
19: to reduce the computation to matrix multiplications. 
20: 
21: Now similarly to our approach with matrix multiplication, the design of our
22: implementation also focuses on delaying the modular reductions as
23: much as possible. As will be shown in section \ref{ssec:trsmdel}, delaying the
24: whole resolution leads to a quick growth in the size of coefficients.
25: Therefore we also present in section \ref{ssec:trsmdelupdate} another way of
26: delaying these modular reductions.
27: We lastly present how to combine these two techniques within a multi-cascade
28: algorithm.
29: 
30: 
% To be moved to the introduction
32: 
33: %% Let us denote by $R(m,k,n)$ the arithmetical cost of a $m \times k$ by $k \times n$ rectangular
34: %% matrix multiplication.
35: %% Now let us suppose that $k \leq m \leq n$, then
36: %% $R(k,m,n)$, $R(m,k,n)$ and $R(m,n,k)$ are all bounded by $\lCeil
37: %% \frac{m n }{k^2} \rCeil \MM(k)$ (see
38: %% e.g. \cite[(2.5)]{Huang:1997:FRM} for more details).
39: 
40: %\newpage
41: \subsection{The block recursive algorithm}\label{ssec:rec-trsm}
42: 
43: %% The classical idea is to use the divide and conquer approach.
44: %% Here, we consider the upper left triangular case without loss of
45: %% generality, since any combination of
46: %% upper/lower and left/right triangular cases are similar: if $U$ is
47: %% upper triangular, $L$ is lower triangular and $B$ is rectangular, 
48: %% we call \ltrsm\ the resolution of $U X = B$, \lltrsm\ that of 
49: %% $L X = B$, \urtrsm\ that of $XU=B$ and \lrtrsm\ that of $XL=B$.
50: 
51: Algorithm \texttt{trsm} recalls the block recursive algorithm.
52: 
53: \begin{algorithm}
54: \dontprintsemicolon
55: \caption{\trsm($A,B$)}\label{alg:trsm:rec}
56: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$.}
57: \KwResult{$X \in \Zp^{m \times n}$ such that $AX=B$.}
58: \Begin{
59: \eIf{$m=1$}{
60:  $ X:= A_{1,1}^{-1} \times B$\;
61: }{
62:  \tcc{splitting matrices into two blocks of sizes $\left\lfloor \frac{m}{2}
63:   \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$ 
64: \[ 
65: \begin{array}{cccc}
66: A & X & & B \\
67: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&
68: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&
69: = &
70: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }
71: \end{array}
72: \]}
73: 
74: $X_2:=$\trsm($A_3,B_2$)\;
75: $B_1:= B_1 - A_2X_2$\;
76: $X_1:=$\trsm($A_1,B_1$)\;
77: }
78: }
79: \end{algorithm}
80: 
81: 
82: \begin{lem}\label{lem:trsm}
83: Algorithm \trsm\ is correct and the leading term of its arithmetic
84: complexity over $\Zp$ is 
85: $$\TRSM(m,n) = 
86: %\left\{ \begin{array}{ccc}
87:     \frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil  \MM(m) 
88: %& &    \text{if}~m\leq n \\
89: %\frac{1}{2^{\omega-1}-2} \lCeil\frac{m}{n}\rCeil^2 \MM(n)& &
90: %    \text{if}~m \geq n
91: %\end{array}\right.
92: $$
93: This complexity is 
94: %$\min(mn^2,nm^2)$ 
95: $m^2n$
96: using classic matrix  multiplication.
97: \end{lem}
98: 
99: \begin{proof}
Extending the previous notation \MM(n), we denote by \MM(m,k,n) the cost of
multiplying an $m\times k$ matrix by a $k\times n$ matrix.
102: The cost function $\TRSM(m,n)$ satisfies the following equation:
103: $$\TRSM(m,n)= 2\TRSM(\frac{m}{2},n)+\MM(\frac{m}{2},\frac{m}{2},n).$$
104: Let $t=\log_2(m)$. Although the algorithm works for any $n$, we restrict the
105: complexity analysis to the case where $m \leq n$ for the sake of simplicity.
106: We then have:
107: \begin{eqnarray*}
108: \TRSM(m,n)&=& 2\TRSM(\frac{m}{2},n)+
109: \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)
110: \\&=& 2^t \TRSM(1,n) + \frac{1}{2^{\omega-1}}\lCeil\frac{n}{m}\rCeil \MM(m)
111: \frac{ 1 -
112:   \left(\frac{2}{2^{\omega-1}}\right)^t}{1-\frac{2}{2^{\omega-1}}}.
113: \end{eqnarray*}
114: As $\TRSM(1,n)=2n$ and $\left(2^{\omega-1}\right)^t = m^{\omega-1}$,
115: we obtain the expected complexity 
116: $\TRSM(m,n)=\frac{1}{2^{\omega-1}-2}\lCeil\frac{n}{m}\rCeil \MM(m) + \GO(m^2+mn).$
117: \end{proof}
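
As a sanity check on the constant of lemma \ref{lem:trsm}, the leading term can
be instantiated for two common exponents. This is only an illustration: it
assumes the usual convention $\MM(m)=2m^3$ for classic multiplication
(consistent with the $m^2n$ stated in the lemma), and that $m$ divides $n$ so
that the ceiling disappears.
\begin{eqnarray*}
\omega=3: && \frac{1}{2^{2}-2}\,\frac{n}{m}\,\MM(m) =
\frac{1}{2}\,\frac{n}{m}\,2m^3 = m^2n,\\
\omega=\log_2 7: && \frac{1}{2^{\omega-1}-2}\,\frac{n}{m}\,\MM(m) =
\frac{1}{\frac{7}{2}-2}\,\frac{n}{m}\,\MM(m) = \frac{2}{3}\,\frac{n}{m}\,\MM(m).
\end{eqnarray*}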
118: %% When $m \geq n$, the trick is to consider two TRSM with the same
119: %% triangular matrix, but of right hand side of size $n/2$. 
120: %% $R(\frac{m}{2},\frac{m}{2},n)=2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})$.
121: %% Therefore the inequality $m \geq n$ 
122: %% is preserved all along the algorithm and the cost is thus
123: %% %
124: %% $\TRSM(m,n)= 4\TRSM(\frac{m}{2},\frac{n}{2})+2R(\frac{m}{2},\frac{m}{2},\frac{n}{2})=
125: %% 4\TRSM(\frac{m}{2},\frac{n}{2})+\frac{1}{2^{\omega-1}} \lCeil \frac{n}{m}
126: %% \rCeil^2 \MM(n)$.
127: %% Thus we have $\TRSM(m,n)=4^t T(1;1) + \frac{1}{2^{\omega-1}}\lCeil\frac{m}{n}\rCeil^2 \MM(m)
128: %% \frac{ 1 -
129: %%   \left(\frac{4}{2^{\omega}}\right)^t}{1-\frac{4}{2^{\omega}}}$.
130: %% This yields the leading term $\frac{2}{2^\omega-4}\lCeil\frac{m}{n}\rCeil^2 \MM(m)$.
131: %
132: %
133: % By counting each operation at one recursive step we have:
134: % \begin{eqnarray}\nonumber
135: % C(m,n)= \sum_{i=1}^{\log m} 2^{i-1} R(\frac{m}{2^i},\frac{m}{2^i},n)
136: % \end{eqnarray}
137: % Now, since $m \leq n$, we get $\forall i \ R(\frac{m}{2^i},\frac{m}{2^i},n)=C_{\omega}\left(\frac{m}{2^i}\right)^{\omega -1}n$ and therefore:
138: % \begin{equation}\nonumber
139: % C(m,n) = \frac{C_{\omega} nm^{\omega -1}}{2}
140: %                         \sum_{i=1}^{\log m} \left(
141: %                         \frac{1}{2^i}\right)^{\omega -2}
142: % \end{equation}
143: % which gives 
144: % %\[ C(m,n) \leq \frac{C_{\omega} }{2(2^{\omega-2}-1)}nm^{\omega-1}\]
145: % %Thus, this gives the bound $O(nm^{\omega - 1})$.\\
146: % the $O(nm^{\omega - 1})$ bound of the lemma.
147: %\end{proof} 
148: % Without loss of generality for linear algebra applications, 
149: % we here consider only the case where the row
150: % dimension, $m$, of the the triangular system is less than or equal to the column dimension, $n$.
151: 
152: \subsection{Delaying reductions globally}\label{ssec:trsmdel}
153: 
154: As for matrix multiplication, the delayed computation 
155: relies on the fact that ring operations over the
156: finite field can be replaced by ring operations over \Z using the ring
157: homomorphisms described in section \ref{ssec:ffperf}.
158: However, triangular system resolutions involve, in the general case, field
159: operations: the divisions by the diagonal elements of the triangular matrix.
160: Therefore this technique is only valid with unit diagonal matrices.
161: 
162: In the general case, the triangular matrix is made unit diagonal by the
163: following factorization: $A=DU$, where $D$ is diagonal and $U$ is unit diagonal
164: upper triangular. Then the system $U X = D^{-1}B$ only involves ring operations
165: and can be solved over \Z.
166: This normalization leads to an additional cost of $O(mn)$ arithmetic
167: operations (see \cite{jgd:2004:ffpack} for more details).
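
For illustration, here is a small worked instance of this normalization; the
prime $p=7$ and the matrices below are chosen for the example and do not come
from the implementation. Take
$$
A=\left[\begin{array}{cc} 2 & 3\\ & 4\end{array}\right],\quad
B=\left[\begin{array}{c}1\\2\end{array}\right]
\quad\text{over } \Z_7.
$$
Since $2^{-1}=4$ and $4^{-1}=2$ modulo $7$, the factorization $A=DU$ and the
scaled right hand side are
$$
D=\left[\begin{array}{cc} 2 & \\ & 4\end{array}\right],\quad
U=\left[\begin{array}{cc} 1 & 5\\ & 1\end{array}\right],\quad
D^{-1}B=\left[\begin{array}{c}4\\4\end{array}\right].
$$
The unit diagonal system $UX=D^{-1}B$ is then solved over \Z with ring
operations only: $x_2=4$ and $x_1=4-5\times 4=-16$. A single final reduction
gives $X=(5,4)^T$ modulo $7$, and indeed $AX=(22,16)^T\equiv(1,2)^T \bmod 7$.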
168: 
169: 
Now, integer computation with fixed-size arithmetic (e.g. floating point
arithmetic) is exact as long as all intermediate results of the computation
172: do not exceed the bit capacity of the representation.
173: Therefore we now propose bounds on the values computed by the algorithm over \Z.
174: 
%% A first idea of the coefficient growth can be obtained by noting
%% that the $k$th coefficient $x_k$ of the solution vector of the system $Ax=b$ is a
%% linear combination of the $n-k$ following coefficients: $x_i, i\in [k+1\dots
%% n]$. Consequently, the size of the largest coefficient grows linearly
%% with the dimension of the system.
%% We give in theorem \ref{th:trsmbound} a more precise bound on the
%% values of the computed coefficients.
%% We also give a class of systems for which the bound is reached, which
%% proves its optimality.
184: 
185: %
186: \begin{thm}  \label{th:trsmbound}
187: Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal upper triangular matrix and $b \in \mathbb{Z}^n$, 
188: with $m \leq T_{i,j} \leq M$ and $m \leq b_i \leq M$ and  $m\leq 0\leq M$.
189: Let $x = ( x_i )_{i \in [1 \dots n]} \in \mathbb{Z}^n$ be the solution of the
190: system $Tx=b$.
191: Then $\forall \ k \in [0\dots n-1]$~:
192: \[ \left \{
193: \begin{array}{ll}
194:  -u_k \leq x_{n-k} \leq v_k &  \text{for $k$ even,}\\ 
195:  -v_k \leq x_{n-k} \leq u_k &  \text{for $k$ odd}
196: \end{array}
197: \right. 
198: \]
199: with 
200: \[
201: \left \{
202: \begin{array}{l}
203: u_k =\frac{M-m}{2}(M+1)^k  - \frac{M+m}{2}(M-1)^k,\\%[2mm]
204: v_k = \frac{M-m}{2}(M+1)^k  + \frac{M+m}{2}(M-1)^k. \\
205: \end{array}
206: \right. 
207: \]
208: \end{thm}
209: %
210: \begin{proof}
211: 
212: First note the following relations:
213: $$
214: \forall k \left\{
215: \begin{array}{lcl}
216: u_k &\leq& v_k \\
217: -mu_k &\leq &Mv_k\\
218: -mv_k &\leq &Mu_k\\
219: \end{array}
220: \right.
221: $$
222: The third one comes from
223: $$
224: Mu_k+mv_k=\frac{M^2-m^2}{2}((M+1)^k-(M-1)^k) \geq 0.
225: $$
226: The proof is now an induction on $k$, following the system resolution order.
The initial case $k=0$ corresponds to the first step:
228: $x_n=b_n$, leading to
229: $$ -u_0 = m \leq x_n \leq M = v_0.$$ 
230: Suppose now that the inequalities hold for $k\in [0\dots l]$
231: and prove them for  $k=l+1$.
232: If  $l$ is odd, $l+1$ is even.
233: {\small
234: \begin{eqnarray*}
235: x_{n-l-1}&=& b_{n-l-1} - \sum_{j=n-l}^n{T_{n-l-1,j}x_j}\\
236: &\leq& M + \sum_{i=0}^{\frac{l-1}{2}}{\max(Mu_{2i},-mv_{2i}) + \max(Mv_{2i+1},-mu_{2i+1})}\\
237: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i} + v_{2i+1}}\right)  \\ 
238: &\leq& M\left(1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{M-m}{2}(M+2)(M+1)^{2i} +
239: \frac{M+m}{2}(M-2)(M-1)^{2i}} \right)  \\ 
240: &\leq& M\left(1 + \frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} +
241: \frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\  
242: &\leq&\frac{M-m}{2}(M+1)^{l+1}  + \frac{M+m}{2}(M-1)^{l+1} = v_{l+1}.
243: \end{eqnarray*}
244: Similarly,
245: \begin{eqnarray*}
246: x_{n-l-1}  &\geq& m - \sum_{i=0}^{\frac{l-1}{2}}{\max(Mv_{2i},-mu_{2i}) +
247: \max(Mu_{2i+1},-mv_{2i+1})}\\ 
248: &\geq& m -  M\sum_{i=0}^{\frac{l-1}{2}}{v_{2i} + u_{2i+1}}\\ 
249: &\geq& m -M\sum_{i=0}^{\frac{l-1}{2}}{\frac{M-m}{2}(M+2)(M+1)^{2i} -
250: \frac{M+m}{2}(M-2)(M-1)^{2i} }  \\ 
251: &\geq& m - M\left(\frac{M-m}{2}(M+2)\frac{(M+1)^{l+1}-1}{(M+1)^2-1} -
252: \frac{M+m}{2}(M-2)\frac{(M-1)^{l+1}-1}{(M-1)^2-1} \right)\\  
&\geq& -\frac{M-m}{2}(M+1)^{l+1}  + \frac{M+m}{2}(M-1)^{l+1} = -u_{l+1}.
254: \end{eqnarray*}
255: }
256: For $l$ even, a similar proof leads to
257: $$
258: -v_{l+1} \leq x_{n-l-1} \leq u_{l+1}.
259: $$
260: \end{proof}
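
To make the bounds of theorem \ref{th:trsmbound} concrete, consider the
positive representation of $\Z_3$, i.e. $m=0$ and $M=2$ (values chosen here
for illustration). Then $u_k=3^k-1$ and $v_k=3^k+1$, so that
$$
0\leq x_n\leq 2,\qquad -4\leq x_{n-1}\leq 2,\qquad -8\leq x_{n-2}\leq 10.
$$
These ranges can be checked directly on the back substitution:
$x_{n-1}=b_{n-1}-T_{n-1,n}x_n$ lies between $0-2\times 2=-4$ and $2-0=2$, and
$x_{n-2}=b_{n-2}-T_{n-2,n-1}x_{n-1}-T_{n-2,n}x_n$ lies between
$0-2\times 2-2\times 2=-8$ and $2-2\times(-4)-0=10$.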
261: %
262: \begin{cor}\label{cor:trsmoptimal}
263:  Using the notation of theorem \ref{th:trsmbound}, 
264: $$ 
265:  |x| \leq \frac{M-m}{2}(M+1)^{n-1}  + \frac{M+m}{2}(M-1)^{n-1}.
266: $$
267: Moreover this bound is optimal.
268: \end{cor}
269: %
270: \begin{proof} 
The sequence $(v_k)$ is increasing and $u_k \leq v_k$ for all $k$.
Thus $\forall \ k \in [0\dots {n-1}] \ |x_{n-k}|\leq \max(u_k,v_k) = v_k \leq v_{n-1}$.
273: 
274: Now the vector $x = ( x_i )_{i \in [1\dots n]} \in \mathbb{Z}^n$ such that 
275: $ \forall \ k \in [0\dots n-1] \ |x_{n-k}| = v_k$ satisfies the system $Tx=b$ with
276: $$
277: T = \
278: \left[
279: \begin{array}{ccccc}
280: \ddots & \ddots & \ddots & \ddots & \ddots \\
281:        &   1    &  M   &    m   & M  \\
282:        &        &    1   &   M  & m   \\
283:        &        &        &    1   & M \\
284:        &        &        &        & 1   \\
285: \end{array}
286: \right], 
287: b = \left[\begin{array}{c}\vdots\\m\\M\\m\\M \end{array} \right]
288: $$
289: Therefore the bound is reached.
290: \end{proof}
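
As an illustration of the extremal system, take $n=3$ with the balanced
representation of $\Z_5$, i.e. $m=-2$ and $M=2$ (an instance chosen for this
example). The pattern above gives
$$
T=\left[\begin{array}{rrr}1&2&-2\\ &1&2\\ & &1\end{array}\right],\quad
b=\left[\begin{array}{r}2\\-2\\2\end{array}\right],
$$
and back substitution over \Z yields $x_3=2$, $x_2=-2-2\times 2=-6$ and
$x_1=2-2\times(-6)-(-2)\times 2=18$. Since $M+m=0$ here, $v_k=2\cdot 3^k$, so
$|x_3|=v_0$, $|x_2|=v_1$ and $|x_1|=v_2=18$, which is exactly the bound of the
corollary for $n=3$.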
291: The following corollaries apply this result to the positive and balanced modular
292: representations.
293: 
294: \begin{cor}[Positive modular representation]\label{cor:trsmpositif}
295: For $1 \leq i,j \leq n$, if  $T_{i,j},b_i \in [0\dots p-1]$, then 
296: $$
|x| \leq \frac{p-1}{2}(p^{n-1}  + (p-2)^{n-1}).
298: $$ 
299: \end{cor}
300: %
301: \begin{cor}[Balanced modular representation]\label{cor:trsmcentre}
302: For $1 \leq i,j \leq n$, if  $T_{i,j},b_i \in [-\frac{p-1}{2}\dots
303: \frac{p-1}{2}]$, then
304: $$
305: |x| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^{n-1}.
306: $$ 
307: \end{cor}
308: 
309: \begin{rem}
310: The balanced modular representation improves the bound by a factor of $2^{n-1}$.
311: \end{rem}
312: 
313: As a consequence, one can solve a unit diagonal triangular system of dimension
314: $n$ using arithmetic operations with integers stored on $\gamma$ bits if
315: \begin{equation}\label{eq:trsmboundpos}
\frac{p-1}{2}(p^{n-1}  + (p-2)^{n-1})< 2^{\gamma}
317: \end{equation}
318: for a positive representation and 
319: \begin{equation}\label{eq:trsmboundcen}
320: \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{\gamma}
321: \end{equation}
322: for a balanced representation.
323: 
324: For instance, using the  \dbl floating point representation ($53$ bits of
325: mantissa)
326: the maximal dimension of the system is $34$ (resp. $52$) for a positive
327: (resp. balanced) representation of $\Z_3$. 
For larger fields, this maximal dimension quickly becomes very small: with
329: $p=1001$, $n\leq 5$ (resp. $n\leq 6$) for a positive (resp. balanced)
330: representation. 
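
The positive case of these figures can be recovered by hand from theorem
\ref{th:trsmbound} with $m=0$, $M=p-1$ and $\gamma=53$ (a rough check, using
rounded values). The condition $v_{n-1}<2^{53}$ reads
$$
\frac{p-1}{2}\left(p^{n-1}+(p-2)^{n-1}\right)<2^{53}\approx 9.01\times 10^{15}.
$$
For $p=3$ it becomes $3^{n-1}+1<2^{53}$; since $3^{33}\approx 5.56\times
10^{15}$ and $3^{34}\approx 1.67\times 10^{16}$, the largest admissible
dimension is $n=34$. For $p=1001$, one gets
$500\,(1001^{4}+999^{4})\approx 1.0\times 10^{15}$ for $n=5$, which fits, and
$500\,(1001^{5}+999^{5})\approx 1.0\times 10^{18}$ for $n=6$, which does not,
hence $n\leq 5$.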
331: 
332: In the following, we will denote by $t_\text{del}(p,\gamma)$ the maximum
333: dimension for the resolution with delayed modular reductions.
334: This dimension is small, and this approach can therefore only be used 
335: as a terminal case of the recursive block algorithm. 
336: This first cascade algorithm is characterized by the threshold 
337: $t_\text{del}$.
338: For efficiency, we used in our implementation the BLAS routine \trsm to perform
339: the delayed computation over \Z.
340: Despite the small dimension of the blocks, we will see in section
341: \ref{ssec:trsmexp} that this approach can slightly improve the efficiency of the
342: computation when the finite field is small. 
343: 
344: \subsection{Delaying reductions in the update phase only} \label{ssec:trsmdelupdate}
345: %
The block recursive algorithm consists of several matrix multiplications of
different dimensions. In most cases, the matrix multiplications are done over \Z
with a modular reduction on the result only. But part of these result matrices
will be accumulated with other matrix multiplications in later computations.
Therefore these intermediate modular reductions could be delayed even further by
accumulating these results over \Z as much as possible.
352: 
353: This technique can be applied within the former cascade algorithm, to produce a
354: double cascade structure. The key idea is to split the matrices at two levels as
shown in figure \ref{fig:trsm:recblasdelayed}: 
356: %
357: \begin{figure}[htbp]\begin{center}
358: \includegraphics[width=0.8\textwidth]{trsm_cascade.eps}
359: \caption{Splitting for the double cascade \trsm algorithm}
360: \label{fig:trsm:recblasdelayed}
361: \end{center}\end{figure}
362: %
363: a fine grain
364: splitting with the dimension $t_\text{del}$ of the previous section, and a
365: coarse grain splitting with the dimension $t_\text{update}$ such that
366: all recursive calls of dimension lower than $t_\text{update}$ can let the 
367:  matrix multiplication updates accumulate without modular reductions. 
Choosing $t_\text{update} = k_\text{Winograd}$ (from corollary \ref{cor:winokmax}) 
369: will ensure this property.
370: To adjust together the dimensions of the two block decompositions, we set
371: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right
372: \rfloor t_\text{del}$.
373: %
374: \begin{algorithm}
375: \dontprintsemicolon
376: \caption{\texttt{trsm-rec-BLAS-delayed}~:} 
377: \label{alg:trsm:recblasdelayed}
378: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$}
379: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}
380: \Begin{
381: Compute $t_\text{del}$ from equation (\ref{eq:trsmboundpos} or \ref{eq:trsmboundcen}) \;
Compute $t_\text{Winograd}$ from corollary (\ref{cor:winokmax}) \;
383: $t_\text{split} = \left \lfloor t_\text{Winograd} / t_\text{del} \right
384: \rfloor t_\text{del}$\;
385: \ForEach{block column of $A$ of dimension $m\times t_\text{split}$ of the form
386: $\begin{bmatrix}V_i\\U_i\\0\end{bmatrix}$}{
387: $X_i = \texttt{trsm-partial-delayed} (U_i,B_i)$ \;
388: $X_i = X_i \mod p$\;
389: $B_{1\dots i-1} = B_{1\dots i-1} - V_i X_i$\;
390: $B_{1\dots i-1} = B_{1\dots i-1} \mod p$\;
391: }
392: \Return{$X$}
393: }
394: \end{algorithm}
395: %
396: \begin{algorithm}
397: \dontprintsemicolon
398: \caption{\texttt{trsm-partial-delayed}}
399: \label{alg:trsmdelaye}
400: \KwData{ $A \in \Zp^{m \times m}$, $B \in \Zp^{m \times n}$, $m$ must be lower
401:   than  $t_\text{update}$}
402: \KwResult{$X \in \Zp^{m \times n}$ s.t. $AX=B$}
403: \Begin{
\eIf{ $m\leq t_\text{del}$}{
405: $B=B \mod p$\;
406: $X = \texttt{dtrsm}(A,B)$ \tcc*{the BLAS routine}\;
407: $X=X \mod p$\;
408: }{
409: \tcc{ (splitting of the matrix into blocks of dimension $\left\lfloor \frac{m}{2}
410:  \right\rfloor$ and $\lCeil \frac{m}{2} \rCeil$) }\;
411: $
412: \begin{array}{cccc}
413: A & X & & B \\
414: \overbrace {\left[ \begin{array}{cc} A_1 & A_2 \\ & A_3 \end{array} \right] }&
415: \overbrace{\left[ \begin{array}{ccc} & X_1 & \\ & X_2 & \end{array} \right] }&
416: = &
417: \overbrace{\left[ \begin{array}{ccc} & B_1 & \\ & B_2 & \end{array} \right] }
418: \end{array}
419: $\;
420: $X_2:=\texttt{trsm-partial-delayed} (A_3,B_2)$ \;
421: $B_1:= B_1 - A_2X_2$  \tcc*{without modular reduction} \;
422: $X_1:=\texttt{trsm-partial-delayed} (A_1,B_1)$ \;
423: }
424: \Return $X$
425: }
426: \end{algorithm}
427: %
428: 
Algorithm \ref{alg:trsm:recblasdelayed} is a loop over the block columns of dimension
$t_\text{split}$. For each of them, the triangular system is solved using algorithm
431: \ref{alg:trsmdelaye} and the update is performed  by a matrix multiplication over
432: \Z followed by a modular reduction.
433: Algorithm \ref{alg:trsmdelaye} is simply the cascade algorithm of the previous
434: section: the block recursive algorithm \ref{alg:trsm:rec} with the fully delayed
435: algorithm as a terminal case.
The matrix multiplication updates are performed over \Z without any reduction of
the result, since the threshold $t_\text{update}$ allows them to be accumulated.
438: 
439: %So the only modular reductions are
440: %performed after the call to \trsm.
441: 
442: 
443: \subsection{Experiments}
444: \label{ssec:trsmexp}
445: %\subsubsection{Comparison of the variants}
446: 
447: We now compare three implementations of the \trsm routine over a word size finite
448: field: 
449: \begin{description}
450: \item Pure recursive (\texttt{Pure-Rec}): Simply algorithm \ref{alg:trsm:rec},
451: \item Recursive-BLAS  (\texttt{Rec-BLAS}): The cascade algorithm formed by
452:   the recursive algorithm and the BLAS routine \dtrsm as a terminal case. It
453:   differs from algorithm \ref{alg:trsmdelaye} by the fact that the matrix
454:   multiplication $B_1:= B_1 - A_2X_2$  is always followed by a modular reduction.
\item Recursive-BLAS-Delayed (\texttt{Rec-BLAS-Delayed}): algorithm \ref{alg:trsm:recblasdelayed}.
456: \end{description}
457: 
458: We compare these three variants over finite fields with different cardinalities,
459: so as to make the parameters $t_\text{del}$ and $t_\text{update}$ vary as in
460: the following table:
461: 
462: \begin{center}
463: \begin{tabular}{|c||c|c|c|}
464: \hline
465: $p$ & $\lceil \log_2 p\rceil$ & $t_\text{del}$ & $t_\text{update}$\\
466: \hline
467: 5   &   3        &   23            &  2\,147\,483\,642\\
468: 1\,048\,583 & 20     &   2             & 8190 \\
469: 8\,388\,617 & 23     &   2             & 126 \\
470: \hline
471: \end{tabular}
472: \end{center}
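
The $t_\text{del}$ column is consistent with the positive representation bound
$v_{n-1}<2^{53}$ of theorem \ref{th:trsmbound} (with $m=0$, $M=p-1$); the
following check uses rounded values and is given only as an illustration.
For $p=5$,
$$
2\,(5^{22}+3^{22})\approx 4.8\times 10^{15}<2^{53}\approx 9.0\times 10^{15}
\quad\text{whereas}\quad
2\,(5^{23}+3^{23})\approx 2.4\times 10^{16},
$$
hence $t_\text{del}=23$. For the two larger primes, $n=2$ gives
$\frac{p-1}{2}(p+(p-2))=(p-1)^2$, about $1.1\times 10^{12}$ and
$7.0\times 10^{13}$ respectively, while $n=3$ gives roughly $p^3>10^{18}$,
hence $t_\text{del}=2$.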
473: 
474: \begin{figure}[htbp]\begin{center}
475: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_5en_goto.eps}
476: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_1048583en_goto.eps}
477: \includegraphics[width=0.692\textwidth]{trsm_speed_dim_8388617en_goto.eps}
\caption{Comparison of the \trsm variants for $p=5,1\,048\,583,8\,388\,617$,
 on a Pentium 4 (3.2\,GHz, 1\,GB)}
480: \label{fig:trsm:compvar}
481: \end{center}\end{figure}
482: 
483: In the experiments of figure \ref{fig:trsm:compvar}, the matrix $B$ is 
484: square ($m=n$).
485: %
486: One can first notice the gain provided by the use of the first cascade with the
487: delayed \dtrsm routine by comparing the curves \texttt{rec-BLAS} and
488: \texttt{pure-rec} for $p=5$. This advantage shrinks when the characteristic gets larger,
 since $t_\text{del}=2$ for $p=1\,048\,583$ or $p=8\,388\,617$. 
490: 
Now the introduction of the coarse grain splitting, which delays the reductions in
the update phase, improves the computation speed by up to 500 Mfops.
493: This gain is similar for $p=5$ and  $p=1\,048\,583$ since in both
494: cases  $n<t_\text{update}$ and there is therefore no modular reduction between the
495: matrix multiplications.
496: 
497: Lastly for $p=8\,388\,617$, the speed drops down since more reductions are required.
498:  The variants \texttt{pure-rec} and \texttt{rec-BLAS} are penalized by their 
499: dichotomic splitting, creating too many modular reductions after each matrix
500: multiplication. Now \texttt{rec-BLAS-delayed} has the best efficiency since the double
501: cascade structure minimizes the number of reductions.
502: 
503: 
504: 
505: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
506: 
507: %% Matrix multiplication speed over finite fields was improved
508: %% %in~\cite{jgd:2002:fflas,Pernet:2001:Winograd} 
509: %% by the use of the
510: %% numerical BLAS library: 
511: %% matrices are converted to floating point representations
512: %% (where the linear algebra routines are fast) and converted back to a finite
513: %% field representation afterwards. 
514: %% %The computations remained exact as
515: %% %long as no overflow occurred. 
516: %% An implementation of \trsm\ can use the same
517: %% techniques. Indeed, as soon as no overflow occurs one can replace the
518: %% recursive call to \trsm\ by the numerical BLAS {\it dtrsm}
519: %% routine. But one can remark that approximate divisions can occur.
520: %% So we need to ensure both that only exact divisions are performed and that no overflow appears.
521: %% Not only one has to be careful for the result to remain within
522: %% acceptable bounds, but, unlike matrix multiplication where data grows
523: %% linearly, data involved in linear system grows exponentially as shown
524: %% in the following.\\
525: %% %
526: %% The next two subsections first show how to deal with
527: %% divisions, and then give an optimal theoretical bound on the
528: %% coefficient growth and therefore an optimal threshold for
529: %% the switch to the numerical call.
530: 
531: %% \subsubsection{Dealing with divisions} 
532: 
533: 
534: %% \subsection{A theoretical threshold}
535: %% We want to use the BLAS trsm routine to solve triangular systems over the
536: %% integers (stored as {\tt double} for {\tt dtrsm} or {\tt float} for {\tt
537: %%   strsm}).  
538: %% The restriction is then the coefficient growth in the solution. 
539: %% Indeed, the $k^{th}$ value in the solution vector is a linear combination of the
540: %% $(n-k)$  already computed next values.  
541: %% This implies a linear growth in the coefficient size of the solution, with
542: %% respect to the system dimension. 
543: %% Now this resolution can only be performed if every element of the solution can
544: %% be stored in the mantissa of the floating point representation  
545: %% (e.g. $53$ bits for {\tt double }).
546: %% Therefore overflow control consists in finding the largest block dimension
547: %% $\beta$, such that the result of the call to BLAS trsm routine will remain
548: %% exact. 
549: 
550: %% We now propose a bound for the values of the solutions of such a system; this
551: %% bound is optimal (in the sense that there exists a worst  
552: %% case matching the bound when $n=2^i \beta$). 
553: %% This enables the implementation of a cascading algorithm, starting recursively
554: %% and taking advantage of the BLAS performances as soon as possible.  
555: %% %
556: %% %Let us introduce the two following series:
557: %% %\[
558: %% %\left \{
559: %% %\begin{array}{l}
560: %% %u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]\\[2mm]
561: %% %v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]\\
562: %% %\end{array}
563: %% %\right. ~~~ for~ an~ integer~ p>2
564: %% %\]
565: %% %
566: %% \begin{thm}  \label{THEO:TRSMBOUND}
567: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal
568: %% upper triangular matrix, and $b \in \mathbb{Z}^n$, 
569: %% with $0 \leq T \leq p-1$ and $0 \leq b \leq p-1$.
570: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of 
571: %% $T.X=b$ over the integers.
572: %% Then, $\forall \ k \in [0..n-1]$:
573: %% \[ \left \{
574: %% \begin{array}{ll}
575: %%  (p-2)^k-p^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k + (p-2)^k & \mbox{if $k$ is even}\\
576: %%  -p^k-(p-2)^k \leq 2\frac{x_{n-k}}{p-1} \leq p^k - (p-2)^k & \mbox{if $k$ is odd}
577: %% \end{array}
578: %% \right. 
579: %% \]
580: %% \end{thm}
581: %% %
582: %% \begin{proof}
583: %% The idea is to use an induction on $k$ with the relation 
584: %% $ x_k = b_k - \sum_{i=k+1}^{n}{T_{k,i}x_i}$. 
585: %% A lower and an upper bound for $x_{n-k}$ are computed, depending
586: %% whether $k$ is even or odd:
587: %% Let us  define the following induction hypothesis $IH_l$:
588: %% \[
589: %% \forall \ k \in [0..l]
590: %% \left \{
591: %% \begin{array}{ll}
592: %%  -u_k \leq x_{n-k} \leq v_k & \mbox{if $k$ is even}\\
593: %%  -v_k \leq x_{n-k} \leq u_k & \mbox{if $k$ is odd}\\
594: 
595: %% \end{array}
596: %% \right. 
597: %% \]
598: %% % Let us define the induction hypothesis $IH_l$ to be that the equations
599: %% % (\ref{eq:bound}) hold for $k \in [0..l-1]$ .
600: %% %
601: %% When $l=0$,  $x_n=b_n$ which implies that 
602: %% $ -u_0 = 0 \leq x_n \leq p-1 = v_0$. Thus $IH_0$ is proven.
603: %% %
604: %% Let us suppose that $  IH_l$ is true, and
605: %%   prove $IH_{l+1}$. There are two cases: either $l$ is odd or not !
606: %% If $l$ is odd, $l+1$ is even. Now, 
607: %% by induction, 
608: %% \begin{small}
609: %% \begin{eqnarray*}
610: %% x_{n-l-1}&\leq& (p-1) \left( 1 + \sum_{i=0}^{\frac{l-1}{2}}{u_{2i}+v_{2i+1}}\right) \\
611: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}-(p-2)^{2i} + p^{2i+1}+(p-2)^{2i+1}\right] } \\ 
612: %% &\leq & p-1 + \sum_{i=0}^{\frac{l-1}{2}}{\frac{(p-1)^2}{2} \left[p^{2i}(p+1) + (p-2)^{2i}(p-3) \right] } \\
613: %% &\leq & p-1 + \frac{(p-1)^2}{2} 
614: %%     \left[(p+1)\frac{p^{l+1}-1}{p^2-1} + (p-3)\frac{(p-2)^{l+1}-1}{(p-2)^2-1} \right] \\
615: %% &\leq & \frac{p-1}{2} \left[ p^{l+1} + (p-2)^{l+1}\right] =  v_{l+1}\\
616: %% \end{eqnarray*}
617: %% \end{small}
618: %% Similarly, 
619: %% \begin{small}
620: %% \begin{eqnarray*}
621: %% x_{n-l-1}  &\geq& -(p-1) \sum_{i=0}^{\frac{l-1}{2}}{v_{2i}+u_{2i+1}}\\
622: %%  &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[
623: %%           p^{2i} + (p-2)^{2i} + p^{2i+1} - (p-2)^{2i+1}\right]} \\
624: %%  &\geq & -\frac{(p-1)^2}{2} \sum_{i=0}^{\frac{l-1}{2}}{ \left[
625: %%           p^{2i}(p+1) - (p-2)^{2i}(p-3) \right]} \\
626: %%  &\geq & -\frac{p-1}{2} \left[  p^{l+1} - (p-2)^{l+1} \right] =  u_{l+1}\\
627: %% \end{eqnarray*}
628: %% \end{small}
629: %% Finally, If $l$ is  even, a similar proof leads to $-v_{l+1}\leq x_{n-l+1} \leq u_{l+1} $.
630: %% \end{proof}
631: %% %
632: %% \begin{cor}\label{cor:TRSMBOUND}
633: %% $ 
634: %%  |X| \leq \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right] 
635: %% $.\\
636: %% Moreover, this bound is optimal.
637: %% \end{cor}
638: %% %
639: %% \begin{proof} We denote by 
640: %% $u_n = \frac{p-1}{2}\left[ p^n - (p-2)^n\right]$
641: %% and $v_n = \frac{p-1}{2}\left[ p^n + (p-2)^n\right]$ the bounds of the
642: %% theorem \ref{THEO:TRSMBOUND}. Now $ \forall \ k \in [0..{n-1}] \ u_k \leq v_k \leq v_{n-1}$.
643: %% Therefore the theorem \ref{THEO:TRSMBOUND} gives $\forall \  k \in [1..n] \ x_k \leq v_{n-1} \leq
644: %%   \frac{p-1}{2}\left[ p^{n-1} + (p-2)^{n-1}\right] $
645: %% \[
646: %% \text{Let}~T = \
647: %% \left[
648: %% \begin{array}{ccccc}
649: %% \ddots & \ddots & \ddots & \ddots & \ddots \\
650: %%        &   1    &  p-1   &    0   & p-1  \\
651: %%        &        &    1   &   p-1  & 0   \\
652: %%        &        &        &    1   & p-1 \\
653: %%        &        &        &        & 1   \\
654: %% \end{array}
655: %% \right], 
656: %% b = \left[\begin{array}{c}\vdots\\0\\p-1\\0\\p-1 \end{array} \right]
657: %% \]
658: %%  Then the solution $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ of
659: %%  the system $T.X = b$ satisfies $ \forall \ k \in [0..n-1] \ |x_{n-k}| = v_k$
660: %% \end{proof}
661: %% One can derive the same kind of bound for the centered representation,
662: %% but with an $2^n$ gain.
663: %% \begin{thm}
664: %% Let $T \in \mathbb{Z}^{ n\times n}$ be a unit diagonal
665: %% upper triangular matrix, and $b \in \mathbb{Z}^n$, 
666: %% with $\left| T \right| \leq \frac{p-1}{2}$ and $\left| b \right|\leq  \frac{p-1}{2}$.
667: %% Let $X = ( x_i )_{i \in [1..n]} \in \mathbb{Z}^n$ be the solution of 
668: %% $T.X=b$ over the integers. Then
669: %% $ 
670: %%  |X| \leq \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n
671: %% $.\\
672: %% Moreover, this bound is optimal.
673: %% \end{thm}
674: %% \begin{proof} The proof is simpler than that of theorem
675: %%   \ref{THEO:TRSMBOUND}, since the inequations are symmetric. 
676: %% Therefore $u_n=v_n$ and the induction yields
677: %% $$u_n=\frac{p-1}{2}\left(1+\sum_{i=0}^{n-1}u_i\right)=\frac{p-1}{2}\left(1+\frac{p-1}{2}\frac{
678: %%   \left(\frac{p+1}{2}\right)^n-1}{\frac{p+1}{2}-1}\right) = \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n.$$
679: %% \end{proof}
680: 
681: %% Thus, for a given $p$, the dimension $n$ of the system must satisfy
682: %% \begin{equation}
683: %% \frac{p-1}{2}\left(\frac{p+1}{2}\right)^n< 2^{m}
684: %% \end{equation}
685: %% where $m$ is the size of the mantissa
686: %% so that the resolution over the integers using the BLAS trsm routine
687: %% is exact. For instance, with a 53 bits mantissa, 
688: %% this gives quite small matrices,
689: %% namely at most $92 \times 92$ for $p=2$, at most $4\times 4$ for $p
690: %% \
691: %% leq 3089$, and at most $p=416107$ for $2\times 2$ matrices. 
692: %% Nevertheless, this technique is speed-worthy in most cases as shown in
693: %% section \ref{ssec:trsmexp}. 
694: %Indeed, this test can easily be performed in the
695: %recursive {\tt trsm} routine to determine whether the dimension of the
696: %system is small enough to make use of the BLAS trsm routine. 
697:  
698: %\input{delayed}
699: 
700: %% \subsection{``Trsm'' implementations behavior}\label{ssec:trsmexp}
701: %% As shown in section \ref{ssec:rec-trsm} the block recursive algorithm {\trsm} is based on matrix multiplications.
702: %% This allows us to use our fast matrix multiplication routine.
703: %% %  of the
704: %% % FFLAS package \cite{jgd:2002:fflas}. This is an exact wrapping of the
705: %% % ATLAS
706: %% % library\footnote{\scriptsize\texttt{http://math-atlas.sourceforge.net}\cite{Whaley:2001:AEO}}
707: %% %  used as a kernel to implement
708: %% % the {\trsm} variants.
709: %% The following table 
710: %% %derives from experimental results of
711: %% %\cite{jgd:2004:ffpack} and 
712: %% expresses which of the two preceding
713: %% variants is better: 
714: 
715: %% \newcommand{\blastrsm}{{{\tt BLASTrsm}}}
716: %% \newcommand{\deltrsm}{{{\tt DelayTrsm}}}
717: %% {\blastrsm} is the hybric numeric/finite field implementation of section \ref{ssec:trsm-blas} and 
718: %% {\deltrsm} is the delayed division implementation of section
719: %% \ref{ssec:recdelay}.
720: %% {\tt Zpz-double} is a field representation
721: %% from \cite{jgd:2004:dotprod} where the  elements are stored as
722: %% floating points to avoid one of the conversions. {\tt Zpz-int}
723: %% is a field representation from \cite{jgd:2005:givaro} where the
724: %% elements are stored as small integers.
725: 
726: %% \begin{table}[htbp]\begin{center}
727: %% \small
728: %% \begin{tabular}{|c||r|r|r|r|r|r|}
729: %% \hline
730: %% $n$                      & {\em 400}        & {\em 1000}        & {\em 2000}       & {\em 5000} \\
731: %% \hline
732: %% {\tt Zpz-double(5)}      & \blastrsm        & \blastrsm        & \blastrsm        & \blastrsm \\
733: %% {\tt Zpz-double(32749)}  & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \blastrsm        & \blastrsm  \\
734: %% {\tt Zpz-int(5)}         & \deltrsm$_{100}$ & \deltrsm$_{100}$ & \blastrsm        & \blastrsm  \\
735: %% {\tt Zpz-int(32749)}     & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \deltrsm$_{50}$  & \deltrsm$_{50}$  \\
736: %% %%$n$          & {\em 400}  & {\em 700}  & {\em 1000} & {\em 2000} &
737: %% %%{\em 5000} \\
738: %% %% \hline
739: %% %% {\tt Mod<double>(5)} & \blastrsm & \blastrsm & \blastrsm & \blastrsm & \blastrsm \\
740: %% %% {\tt Mod<double>(32749)} & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$ & \blastrsm & \blastrsm \\
741: %% %% {\tt G-Zpz(5)}  & \deltrsm$_{100}$ & \deltrsm$_{150}$ & \deltrsm$_{100}$ & \blastrsm & \blastrsm  \\
742: %% %% {\tt G-Zpz(32749)}& \deltrsm$_{50}$ & \deltrsm$_{50}$ & \deltrsm$_{50}$& \deltrsm$_{50}$ & \deltrsm$_{50}$ \\
743: %% \hline
744: %% \end{tabular}
745: %% \caption{Best variant for \trsm\ on a P4,
746: %%   2.4GHz}\label{tab:trsmmodular}
747: %% \end{center}
748: %% \vspace{-1em}
749: %% \end{table}
750: 
751: %% To summarize, one would rather use {\tt ZpZ-double} representation and  ``blas'' {\trsm} variant in most cases.
752: %% However, when the base field is already specified ``delayed{\large$_t$}''  could provide slightly better performances.
753: %% This requires a search for optimal thresholds which again could be done through an Automated Empirical Optimizations of Software \cite{Whaley:2001:AEO}.
754: 
755: 
756: %\subsection{Performances and comparison with numerical routines}
757: 
We now compare this implementation with its numerical counterpart, the original BLAS routine \dtrsm.
759: %In the previous section we showed that {\trsm} optimized variant based on numerical solving allows us to achieve the best performances.
760: %In this section we compare these performances with pure numerical solving and with matrix multiplication.
761: %In order to achieve the best performances  we use as much as possible fast matrix multiplication of section \ref{ssec:winograd}.
762: %For this purpose we use an experimental switching threshold to classic multiplication since table \ref{tab:winolevel} reflects only theoretical behavior.
As for matrix multiplication in section \ref{ssec:fgemm-perf}, we compare the routines using
two different BLAS implementations (ATLAS and GOTO) on
two different architectures. However, on the Xeon architecture we only report
the \ftrsm timings with ATLAS and omit those of ATLAS \dtrsm, which proved
surprisingly inefficient in our tests.
In the following, \ftrsm denotes the \trsm routine over a $16$-bit prime field (namely $\Z_{65521}$) 
using the \texttt{ZpZ-double} implementation.
770: 
771: 
772: 
773: 
774: %\begin{figure}[hbtp]
775: %\includegraphics[width=8cm,angle=-90]{timing-trsm-p4}
776: %\caption{Timing comparison for matrix multiplication (exact and numeric) on a P4, 3.4GHz}
777: %\end{figure}
778: 
779: 
780: 
781: \begin{table}[htbp]\begin{center}
\begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}
783: \cline{3-11}
784: \multicolumn{2}{c|}{} & $n$     %& {\em 500}      
785: & {\em 1000}  & {\em 2000}  & {\em 3000}  & {\em 5000} & {\em 7000}  & {\em 8000} & {\em 9000} & {\em 10000} \\
786: \cline{3-11}
787: \multicolumn{11}{c}{}\\[-0.1cm]
788: \hline
789: &ATLAS & ftrsm & $0.37$s    &  $1.93$s    & $5.73$s    &   $23.63$s &  $62.50$s   & $91.67$s  & $121.84$s  & $166.74$s \\ 
790: %\cline{3-11}
791: %& & dtrsm & $$s    &  $$s    & $$s   &   $$s &  $$s   & $$s  & $$s  & $$s \\ 
792: %\cline{3-11} \\[-.3cm]
793: %\cline{3-11}
794: %& \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{fgemm}{dgemm}$}  & \bf   & \bf  & \bf   &  \bf  &  \bf  & \bf  & \bf  & \bf  \\   
795: \hline
796: \multicolumn{11}{c}{}\\[-0.2cm]
797: \hline
798: 
799: & & ftrsm  %& $0.059$s    
800: &  $0.25$s    & $1.66$s    &  $5.08$s &
801: $21.47$s   & $55.95$s  & $80.77$s  & $111.57$s & $150.81$s \\ 
802: \cline{3-11}
803: & & dtrsm % & $0.023$s    
804: & $0.17$s    &  $1.35$s    & $4.50$s   &
805: $20.64$s &   $56.19$s   & $83.85$s  & $119.18$s & $163.33$s \\ 
806: \cline{3-11} \\[-.3cm]
807: \cline{3-11}
808:  & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf
809:    $\frac{ftrsm}{dtrsm}$}  %& {\bf 2.57} 
810: & \bf 1.47 & \bf 1.23  &  \bf 1.13 &  \bf 1.04 & \bf 1.00 & \bf 0.96 & \bf 0.94 & \bf 0.92 \\   
811: \hline
812: 
813: \end{tabular}
\caption{Timings of the triangular solver with matrix right hand side on a Xeon,
  3.6GHz}\label{tab:trsm-p4}
816: %\end{center}
817: %\end{table}
818: 
819: %\begin{figure}[hbtp]
820: %\includegraphics[width=8cm,angle=-90]{timing-trsm-itanium2}
821: %\caption{Timing comparison for triangular system solving with matrix hand side (exact and numeric) on Itanium2-1.3GHz}
822: %\end{figure}
823: 
824: %\begin{table}[htbp]\begin{center}
825: \begin{tabular}{|cc|c||r|r|r|r|r|r|r|r|}
826: \multicolumn{11}{c}{}\\
827: \cline{3-11}
828: \multicolumn{2}{c|}{} & $n$          & {\em 1000}  & {\em 2000}  & {\em 3000}  & {\em 5000} & {\em 7000}  & {\em 8000} & {\em 9000} & {\em 10000}\\
829: \cline{3-11}
830: \multicolumn{11}{c}{}\\[-0.2cm]
831: \hline
832: & & ftrsm & $0.34$s & $2.28$s & $7.11$s & $30.26$s & $77.43$s & $112.01$s & $158.00$s & $214.31$s  \\
833: \cline{3-11}
834: & & dtrsm & $0.26$s & $1.95$s & $6.37$s & $28.60$s & $76.44$s & $113.78$s & $161.19$s & $219.31$s  \\
835: \cline{3-11} \\[-.3cm]
836: \cline{3-11}
837: & \begin{rotate}{90}\scriptsize ATLAS \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$}  & \bf 1.31 & \bf 1.17 & \bf 1.12 & \bf 1.06 & \bf 1.01 & \bf 0.98 & \bf 0.98 & \bf 0.98 \\
838: \hline
839: \multicolumn{11}{c}{}\\[-0.2cm]
840: \hline
841: 
842: & & ftrsm  & $0.30$s & $2.00$s & $6.23$s & $26.67$s & $68.22$s & $104.32$s & $137.96$s & $192.37$s  \\
843: \cline{3-11}
844: & & dtrsm  & $0.21$s & $1.61$s & $5.36$s & $24.59$s & $67.35$s & $100.42$s & $142.43$s & $195.79$s  \\
845: \cline{3-11} \\[-.3cm]
846: \cline{3-11}
847:  & \begin{rotate}{90}\scriptsize GOTO \end{rotate} & {\bf $\frac{ftrsm}{dtrsm}$}  & \bf 1.43 & \bf 1.24 & \bf 1.16 & \bf 1.08 & \bf 1.01 & \bf 1.04 & \bf 0.97 & \bf 0.98 \\
848: \hline
849: 
850: \end{tabular}
\caption{Timings of the triangular solver with matrix right hand side on an Itanium2, 1.3GHz}\label{tab:trsm-ia64}
852: \end{center}
853: \end{table}
854: 
Tables \ref{tab:trsm-p4} and \ref{tab:trsm-ia64} show that our
implementation of exact {\trsm} solving comes close to numerical
performance: the overhead is at most $47\%$ for $n=1\,000$ and at most
$8\%$ for $n \geq 5\,000$.
858: %In particular, ``ftrsm'' performances tend to catch up with BLAS ones as soon as the dimensions of matrices increase.
Moreover, on our Xeon architecture with GOTO BLAS, we even
outperform numerical solving for matrices
of dimension greater than $7\,000$.
862: 
863: 
864: \begin{figure}[hbtp]
865: \begin{center}
866: \includegraphics[width=0.55\textwidth,angle=-90]{graph-BEST-trsm}
867: \end{center}
868: \caption{Comparing  triangular system solving with matrix
869:   multiplication on a Xeon,
870:   3.6GHz} \label{fig:trsm-ratio}
871: \end{figure}
872: 
The good performance of our implementation is mostly due to 
the efficient reduction to fast matrix multiplication and to the double
cascade structure.
Figure~\ref{fig:trsm-ratio} shows the ratio of the computation time of
our \trsm to that of the matrix multiplication routine.
878: %One can see from this figure that our experimental ratio converges to the theoretical one.
879: %In particular, the theoretical ratio is slightly more than $\frac{1}{2}$ since fast matrix multiplication algorithm is used.
According to lemma \ref{lem:trsm}, this ratio is $1/2$ with $\omega=3$ 
and $2/3$ with $\omega = \log_2 7$.
In practice, our implementation only performs a few recursive calls of 
Winograd's algorithm, so the effective exponent lies between $\log_2 7$ and $3$.
Accordingly, the observed ratio lies between $1/2$ and $2/3$ as 
soon as the dimension is large enough, which shows the efficiency of the reduction to 
matrix multiplication.
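
As a sanity check on these two values, here is a sketch of where they come from in the square case $m=n$.
It is a simplified reading of the block recursive scheme, not a restatement of lemma \ref{lem:trsm}:
we assume that $U$ and $B$ are split in halves, that multiplying two $n\times n$ matrices costs
$c\,n^\omega$ with $\omega>2$, and we write $T(n)$ for the cost of the $n\times n$ solve
(both $c$ and $T$ are notations introduced only for this sketch).
One level of recursion then performs four half-size solves and two half-size products, so that
\[
T(n) = 4\,T\!\left(\frac{n}{2}\right) + 2c\left(\frac{n}{2}\right)^{\omega}.
\]
Unrolling the recurrence and neglecting the $O(n^2)$ contribution of the leaves gives a geometric
sum of ratio $4/2^\omega<1$:
\[
T(n) \approx 2c\left(\frac{n}{2}\right)^{\omega}\sum_{i\geq 0}\left(\frac{4}{2^{\omega}}\right)^{i}
= \frac{2c\,n^{\omega}}{2^{\omega}-4}
= \frac{c\,n^{\omega}}{2\left(2^{\omega-2}-1\right)}.
\]
Dividing by $c\,n^\omega$, the ratio is $\frac{1}{2(2-1)}=\frac{1}{2}$ for $\omega=3$, and
$\frac{1}{2(7/4-1)}=\frac{2}{3}$ for $\omega=\log_2 7$, since then $2^{\omega-2}=7/4$.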
886: 
887: 
888: %
889: %
890: %
891: %\subsubsection{Recursive with delayed modulus}
892: %\subsubsubsection{Threshold}
893: %\subsubsubsection{Dot product}
894: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
895: %%% Local Variables:
896: %%% mode: latex
897: %%% TeX-master: "../dlaff.tex"
898: %%% End:
899: