Fr. 134.00

Fault-Tolerance Techniques for High-Performance Computing

English · Hardback

Shipping usually within 2 to 3 weeks (title will be printed to order)

Description

This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.

List of contents

Part I: General Overview.- Fault-Tolerance Techniques for High-Performance Computing.- Part II: Technical Contributions.- Errors and Faults.- Fault-Tolerant MPI.- Using Replication for Resilience on Exascale Systems.- Energy-Aware Checkpointing Strategies.

Product details

Assisted by Thomas Hérault (Editor), Yves Robert (Editor)
Publisher Springer, Berlin
 
Languages English
Product format Hardback
Released 01.01.2015
 
EAN 9783319209425
ISBN 978-3-319-20942-5
No. of pages 320
Dimensions 156 mm x 24 mm x 244 mm
Weight 650 g
Illustrations IX, 320 p. 113 illus.
Series Computer Communications and Networks
Subject Natural sciences, medicine, IT, technology > IT, data processing > IT
