\begin{abstract}
  The predicted reduced resiliency of next-generation high performance
  computers means that it will become necessary to take into account
  the effects of randomly occurring faults on numerical methods.
  Further, in the event of a hard fault occurring, a decision has to
  be made as to what remedial action should be taken in order to
  resume the execution of the algorithm. The action that is chosen can
  have a dramatic effect on the performance and characteristics of the
  scheme. Ideally, the resulting algorithm should be subjected to the
  same kind of mathematical analysis that was applied to the original,
  deterministic variant.

  The purpose of this work is to provide an analysis of the behaviour
  of the multigrid algorithm in the presence of faults. Multigrid is
  arguably the method of choice for the solution of large-scale linear
  algebra problems arising from discretization of partial differential
  equations, and it is of considerable importance to anticipate its
  behaviour on an exascale machine. The analysis of resilience of
  algorithms is in its infancy and the current work is perhaps the
  first to provide a mathematical model for faults and analyse the
  behaviour of a state-of-the-art algorithm under the model. It is
  shown that the Two Grid Method fails to be resilient to faults.
  Attention is then turned to identifying the minimal necessary
  remedial action required to restore the rate of convergence to that
  enjoyed by the ideal fault-free method.
\end{abstract}