Blue Waters Symposium 2015 has ended
Tuesday, May 12 • 11:10am - 11:30am
[Computer Science] Jon Calhoun: Effect and Propagation of Silent Data Corruption in HPC Applications

Modern HPC systems are complex due to the sheer number of components that comprise them. With this complexity comes the reality of failures. One particularly damaging and little-understood type of failure is silent data corruption (SDC). SDC occurs when program state changes without intervention of the application or the system. An understanding of how applications handle state perturbations and how corrupted values propagate through HPC applications is key to mitigating the effects of SDC. In this talk, we present our results from fault injection experiments on an Algebraic Multigrid (AMG) linear solver. We explore the sparse matrix-vector multiply, a fundamental component of AMG and other HPC applications. In addition, we explore the effects of SDC on other applications and HPC computation kernels. Finally, we discuss algorithm-level fault tolerance for SDC detection.
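
The abstract does not describe the injection tooling or the detector used in the talk. As a minimal sketch only, assuming a single bit flip in one output entry of a CSR sparse matrix-vector product and a simple column-sum checksum (one common algorithm-based fault tolerance idea, not necessarily the speaker's), such an experiment could look roughly like the following. The names `flip_bit`, `spmv_csr`, `column_sums`, and `checksum_ok`, the 3x3 test matrix, and the tolerance are all illustrative assumptions.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a float's IEEE 754 double encoding."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def spmv_csr(row_ptr, col_idx, vals, x, fault=None):
    """Compute y = A x for A stored in CSR form.

    fault is an optional (row, bit) pair: after row `row` is accumulated,
    bit `bit` of that output entry is flipped to emulate SDC.
    """
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        if fault is not None and fault[0] == i:
            acc = flip_bit(acc, fault[1])
        y.append(acc)
    return y


def column_sums(n_cols, col_idx, vals):
    """Return c = 1^T A, the vector of column sums of A."""
    sums = [0.0] * n_cols
    for c, v in zip(col_idx, vals):
        sums[c] += v
    return sums


def checksum_ok(col_sums, x, y, rel_tol=1e-10):
    """Check the identity sum(y) == (1^T A) x up to a relative tolerance."""
    expected = sum(c * xi for c, xi in zip(col_sums, x))
    observed = sum(y)
    if not (math.isfinite(expected) and math.isfinite(observed)):
        return False
    scale = max(1.0, abs(expected), abs(observed))
    return abs(expected - observed) <= rel_tol * scale


# Hypothetical test problem: 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
ROW_PTR = [0, 2, 5, 7]
COL_IDX = [0, 1, 0, 1, 2, 1, 2]
VALS = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
X = [1.0, 2.0, 3.0]
COL_SUMS = column_sums(3, COL_IDX, VALS)

clean = spmv_csr(ROW_PTR, COL_IDX, VALS, X)
high_bit = spmv_csr(ROW_PTR, COL_IDX, VALS, X, fault=(2, 62))  # exponent bit
low_bit = spmv_csr(ROW_PTR, COL_IDX, VALS, X, fault=(2, 0))    # last mantissa bit

print(checksum_ok(COL_SUMS, X, clean))
print(checksum_ok(COL_SUMS, X, high_bit))
print(checksum_ok(COL_SUMS, X, low_bit))
```

In this sketch, a flip in a high exponent bit changes the output enough to violate the checksum, while a flip in the lowest mantissa bit stays within the tolerance and passes unnoticed. The tolerance therefore sets a floor on which corruptions a checksum of this kind can detect.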

Tuesday May 12, 2015 11:10am - 11:30am PDT
Heritage II