\begin{abstract}
  Modern graphics hardware is designed for highly parallel numerical
  tasks and promises significant cost and performance benefits for
  many scientific applications.  One such application is lattice
  quantum chromodynamics (lattice QCD), where the main computational
  challenge is to efficiently solve the discretized Dirac equation in
  the presence of an $SU(3)$ gauge field.  Using \nvidia's CUDA
  platform, we have implemented a Wilson-Dirac sparse matrix-vector
  product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops
  for double, single and half precision, respectively, on \nvidia's
  GeForce GTX 280 GPU.  We have developed a new mixed precision
  approach for Krylov solvers using {\it reliable updates} that
  allows for full double precision accuracy while using only single or
  half precision arithmetic for the bulk of the computation.  The
  resulting BiCGstab and CG solvers run in excess of 100 Gflops and,
  in terms of iterations until convergence, perform better than the
  usual defect-correction approach for mixed precision.
\end{abstract}