High availability for parallel computers

Fault tolerance has become an important issue for parallel applications in recent years. Users of parallel systems expect reliability along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant arc...

Bibliographic Details
Main Authors: Dolores Rexachs del Rosario, Emilio Luque Fadón
Format: Article
Language: English
Published: Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata 2010-10-01
Series: Journal of Computer Science and Technology
Subjects: fault tolerance, availability, RADIC, transient faults, performability
Online Access: https://journal.info.unlp.edu.ar/JCST/article/view/697
id doaj-e5ed1cd7ebdb45a8b326118b84ede969
record_format Article
spelling doaj-e5ed1cd7ebdb45a8b326118b84ede969; 2021-05-05T13:54:27Z; eng; Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata; Journal of Computer Science and Technology; 1666-6046; 1666-6038; 2010-10-01; 1003110116392; High availability for parallel computers; Dolores Rexachs del Rosario; Emilio Luque Fadón; Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Barcelona 08193, Spain (both authors); https://journal.info.unlp.edu.ar/JCST/article/view/697; fault tolerance; availability; radic; transient faults; performability
collection DOAJ
language English
format Article
sources DOAJ
author Dolores Rexachs del Rosario
Emilio Luque Fadón
title High availability for parallel computers
publisher Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata
series Journal of Computer Science and Technology
issn 1666-6046
1666-6038
publishDate 2010-10-01
description Fault tolerance has become an important issue for parallel applications in recent years. Users of parallel systems expect reliability along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant architecture with different protection levels that offers high availability with transparency, decentralization, flexibility, and scalability for message-passing systems. Transient faults may cause an application running on a computer system to be removed from execution; however, their biggest risk is undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.
topic fault tolerance
availability
radic
transient faults
performability
url https://journal.info.unlp.edu.ar/JCST/article/view/697