\begin{abstract}
The predicted reduction in resilience of next-generation high performance
computers means that the effects of randomly occurring faults on
numerical methods will have to be taken into account. Further, when a
hard fault occurs, a decision has to be made as to what remedial
action should be taken in order to resume the execution of the
algorithm. The action chosen can have a dramatic effect on the
performance and characteristics of the scheme. Ideally, the resulting
algorithm should be subjected to the same kind of mathematical
analysis that was applied to the original, deterministic variant.

The purpose of this work is to provide an analysis of the behaviour
of the multigrid algorithm in the presence of faults. Multigrid is
arguably the method of choice for the solution of large-scale linear
algebra problems arising from the discretization of partial differential
equations, and it is of considerable importance to anticipate its
behaviour on an exascale machine. The analysis of the resilience of
algorithms is in its infancy, and the current work is perhaps the
first to provide a mathematical model for faults and to analyse the
behaviour of a state-of-the-art algorithm under that model. It is
shown that the Two Grid Method fails to be resilient to faults.
Attention is then turned to identifying the minimal
remedial action required to restore the rate of convergence to that
enjoyed by the ideal fault-free method.
\end{abstract}