Fault Detection and Tolerance in Cluster of Workstations using Message Passing Interface

Syed Misbahuddin

doi:10.33317/ssurj.72

Keywords:

Availability, Cluster of Workstations, MPI Applications.

Abstract

A Cluster of Workstations (COW) is a network-based multicomputer system intended to replace supercomputers. A COW operates according to Divisible Load Theory (DLT), under which a job is divided into n subtasks that are delegated to n workstations in the cluster. The job is complete only when every subtask is complete, so satisfactory job completion requires all workstations to be functional. A single faulty node can therefore stall the entire job unless fault avoidance and correction measures are taken. This paper presents a fault detection and fault tolerance algorithm that uses the Message Passing Interface (MPI) to identify faulty workstations and transfer the subtasks they were performing to a normally working workstation. A workstation that receives a transferred subtask continues its original subtask and runs the transferred one alongside it on a time-sharing basis.
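The scheme outlined in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm: it uses plain Python in place of real MPI calls, the `alive` flag stands in for whatever MPI-based probe the paper uses to detect a faulty node, and the names `Workstation`, `divide_job`, `detect_faulty`, `reassign`, and `run_job` are hypothetical. Choosing the least-loaded healthy node as the recipient is also an assumption; the abstract only says the subtask moves to a normally working workstation.

```python
from dataclasses import dataclass, field


@dataclass
class Workstation:
    rank: int                      # node identifier, analogous to an MPI rank
    alive: bool = True             # assumption: stands in for an MPI-based health probe
    subtasks: list = field(default_factory=list)


def divide_job(job_size, n):
    """Split a job of job_size items into n contiguous (start, stop) subtasks."""
    base, extra = divmod(job_size, n)
    chunks, start = [], 0
    for i in range(n):
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def detect_faulty(nodes):
    """Return the nodes that fail the (simulated) health check."""
    return [node for node in nodes if not node.alive]


def reassign(nodes):
    """Move every subtask held by a faulty node to the least-loaded healthy node."""
    healthy = [node for node in nodes if node.alive]
    if not healthy:
        raise RuntimeError("no healthy workstation left to take over subtasks")
    for faulty in detect_faulty(nodes):
        while faulty.subtasks:
            target = min(healthy, key=lambda node: (len(node.subtasks), node.rank))
            target.subtasks.append(faulty.subtasks.pop(0))


def run_job(nodes, data):
    """Each healthy node processes its own and any transferred subtasks in turn."""
    total = 0
    for node in nodes:
        if node.alive:
            for start, stop in node.subtasks:
                total += sum(data[start:stop])
    return total


if __name__ == "__main__":
    data = list(range(100))
    nodes = [Workstation(rank=r) for r in range(4)]
    for node, chunk in zip(nodes, divide_job(len(data), len(nodes))):
        node.subtasks.append(chunk)

    nodes[2].alive = False         # simulate a workstation failure
    reassign(nodes)
    print(run_job(nodes, data) == sum(data))
```

In this sketch the recipient processes its subtasks one after another, which only approximates the time-sharing described in the abstract; a real implementation would interleave the two subtasks on the surviving workstation.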
