the triangular and the rectangular parts separately is penalized by the extra
function call and matrix setup. Subroutine setup costs are minimized by the use
of our superscalar GEMM routine for problems that fit into the level-1 cache,
that is, problems solved by the tr* subsystem solvers. In the algorithm, this
is controlled by the parameter blks, the same parameter used in the recursive
algorithms to switch from the recursive blocked algorithms to the subsystem
solvers. In the implementation, fixup code for handling any $2 \times 2$ diagonal blocks
of the quasi-triangular matrices is implemented using level-1 BLAS routines.
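As a loose illustration (in Python, not the paper's Fortran 90 code, and with purely hypothetical names), the role of blks as a single recursion-cutoff parameter can be sketched as follows:

```python
# Illustrative sketch (hypothetical names): a cutoff parameter like
# blks decides when a recursive blocked algorithm stops splitting and
# hands the problem to a kernel ("tr*" subsystem) solver whose working
# set fits in the level-1 cache.

def solve_recursive(n, blks=64):
    """Return the sizes of the subproblems passed to the kernel solver."""
    if n <= blks:
        return [n]          # base case: call the subsystem solver
    half = n // 2           # split the problem, recurse on both halves
    return solve_recursive(half, blks) + solve_recursive(n - half, blks)

leaf_sizes = solve_recursive(1000, blks=64)
assert all(s <= 64 for s in leaf_sizes)   # every kernel call is cache-sized
assert sum(leaf_sizes) == 1000            # the whole problem is covered
```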
For the symmetric case, $\tilde{C} \leftarrow C - AXA^T$, $X = X^T$, and $C = C^T$, the SLICOT
MB01RD subroutine is used [SLICOT 2001]. It reduces the complexity from
$2M^3$ to $(3/2)M^3$, by calculating the upper triangular part of the symmetric
matrix $\mathrm{triu}(\tilde{C}) = V + \mathrm{triu}(B) + \mathrm{stril}(B)^T + \mathrm{diag}(V) + \mathrm{diag}(\mathrm{triu}(B))$, where
$B = -A T A^T$, $T = \mathrm{triu}(X) - \frac{1}{2}\mathrm{diag}(X)$, and $V = \mathrm{triu}(C) - \frac{1}{2}\mathrm{diag}(C)$, and triu, stril, and
diag denote the upper triangular, the strictly lower triangular, and the diagonal
parts, respectively. The MB01RD routine is implemented using level-3 BLAS
routines and requires no workspace.
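The halved-diagonal splitting behind this formula can be checked numerically. The plain-Python sketch below (illustrative only, not the MB01RD implementation) verifies that with $T = \mathrm{triu}(X) - \frac{1}{2}\mathrm{diag}(X)$ one has $X = T + T^T$, so that $C - AXA^T = S + S^T$ with $S = V - A T A^T$, and only one triangle of the result ever needs to be formed:

```python
# Numerical check (plain Python, illustrative) of the halved-diagonal
# splitting used by MB01RD: with T = triu(X) - (1/2)diag(X) we have
# X = T + T^T, so C - A X A^T = S + S^T with S = V - A T A^T.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def triu_half_diag(A):
    """Upper triangle of A with the diagonal halved (zero below)."""
    n = len(A)
    return [[A[i][j] if j > i else (A[i][j] / 2.0 if j == i else 0.0)
             for j in range(n)] for i in range(n)]

random.seed(1)
n = 5
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
X0 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
C0 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
X = [[X0[i][j] + X0[j][i] for j in range(n)] for i in range(n)]  # symmetric
C = [[C0[i][j] + C0[j][i] for j in range(n)] for i in range(n)]  # symmetric

T = triu_half_diag(X)                    # X = T + T^T
V = triu_half_diag(C)                    # C = V + V^T
B = matmul(matmul(A, T), transpose(A))   # the only M^3-cost products needed
S = [[V[i][j] - B[i][j] for j in range(n)] for i in range(n)]

# Reference: the full-cost computation C - A X A^T.
AXAT = matmul(matmul(A, X), transpose(A))
assert all(abs(S[i][j] + S[j][i] - (C[i][j] - AXAT[i][j])) < 1e-9
           for i in range(n) for j in range(n))
```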
However, for some problems, the complexity can be lowered even more, at
the cost of extra workspace. In rtrglydt, Algorithm 2, the temporary result
$W = A_{12}X_{22}$ in the operation $C_{12} = \mathrm{axb}(-A_{12}X_{22}A_{22}^T)$ can be reused in the
operation $C_{11} = \mathrm{axb}(-A_{12}X_{22}A_{12}^T)$, and analogously for the $E$ matrix. Although this
triples the workspace needed, it lowers the overall complexity of the algorithm
from $(25/6)N^3$ to $(23/6)N^3$.
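The reuse amounts to factoring the common product out of the two GEMM-based updates. A small Python sketch (illustrative, with hypothetical block sizes, not the paper's Fortran) of the saving:

```python
# Illustrative sketch of the workspace reuse in rtrglydt: the product
# W = A12 * X22 is formed once (extra workspace) and reused in both the
# C12 and the C11 updates instead of being recomputed in each axb call.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

random.seed(2)
n1, n2 = 3, 4                      # hypothetical block sizes
A12 = [[random.uniform(-1, 1) for _ in range(n2)] for _ in range(n1)]
A22 = [[random.uniform(-1, 1) for _ in range(n2)] for _ in range(n2)]
X22 = [[random.uniform(-1, 1) for _ in range(n2)] for _ in range(n2)]

W = matmul(A12, X22)               # computed once, kept in workspace

upd_C12 = matmul(W, transpose(A22))   # used in C12 = axb(-A12 X22 A22^T)
upd_C11 = matmul(W, transpose(A12))   # reused in C11 = axb(-A12 X22 A12^T)

# Identical results to recomputing A12 * X22 separately for each update.
assert upd_C12 == matmul(matmul(A12, X22), transpose(A22))
assert upd_C11 == matmul(matmul(A12, X22), transpose(A12))
```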
5. PERFORMANCE RESULTS OF RECURSIVE BLOCKED ALGORITHMS
The recursive blocked algorithms for solving two-sided triangular matrix equations
have been implemented in Fortran 90, using the facilities for recursive
calls of subprograms, dynamic memory allocation, and threads. In this section,
sample performance results of these implementations executing on IBM
Power, MIPS, and Intel Pentium processor-based systems are presented. We
have selected a few processors representing different vendors and different
architecture characteristics and memory hierarchies. The details of the processors,
memory subsystems, and operating systems are all described in Part I
[Jonsson and Kågström 2002]. The only difference from Part I is the Power3
machine, which is running at 375 MHz, giving each processor a theoretical peak
rate of 1500 Mflops/s. The performance results (measured in Mflops/s) are computed
using the flop counts presented in Table I of Section 3.6. The results are
displayed in graphs and tables, where the performances of different variants
of the recursive blocked implementations are visualized together with existing
routines in the state-of-the-art SLICOT library [SLICOT 2001]. Moreover, the
relative performances (measured as speedup) between different algorithms, including
sequential as well as parallel implementations, are presented in tables.