
Fault tolerance in parallel distributed systems



Summary
This project investigated fault tolerance in distributed and parallel systems and implemented
fault tolerance, using the MPI standard, in a number of example programs that exhibit common
characteristics of parallel programs.

Table of contents
Summary
Acknowledgements
Table of contents
1 Background
Introduction
1.1 Parallel computing
1.1.1 What is Parallel Computing
1.1.2 Shared Memory Computers
1.1.3 Distributed Memory Systems
1.1.4 Programming Paradigms
1.1.5 Selected Environment
1.2 Distributed Systems
1.2.1 What is a distributed system?
1.2.2 Transparency
1.2.3 Openness
1.2.4 Scalability
1.3 Grid computing
1.3.1 What is grid computing
1.4 Fault Tolerance
1.4.1 What is a fault?
1.4.2 Failure Models
1.4.3 Fault tolerance
1.5 Previous work in this area
Summary
2 Objectives and Planning
Introduction
2.1 Aim
2.2 Objectives
2.2.1 Objectives
2.2.2 Minimum requirements
2.2.3 Possible extensions
2.3 Methodology
2.3.1 Development
2.3.2 Testing Strategy
2.3.3 Evaluation
2.4 Planning
2.4.1 Initial Plan
2.4.2 Characteristics analysis
2.4.3 Example programs to develop
2.4.4 Extended plan
2.4.5 Problems with the extended plan
2.4.6 Final progress
2.4.7 Justification of alterations
Summary
3 Matrix multiplication example programs
Introduction
3.1 Matrix multiplication algorithm
3.2 Master slave centralised work farm model
3.2.1 Other applications
3.2.2 Master slave model
3.2.3 Work farm distribution model
3.2.4 Key features of the original parallel program
3.2.5 Implementation of the original program
3.2.6 Fault tolerance design and implementation
3.2.7 Evaluation
3.2.8 Summary
3.3 Semi-distributed block decomposition model
3.3.1 Other applications
3.3.2 Semi distributed model
3.3.3 Block decomposition model
3.3.4 Key features of the original parallel program
3.3.5 Implementation of the original program
3.3.6 Code evolution
3.3.7 Fault tolerance design
3.3.8 Implementation discussion
3.3.9 Evaluation
3.3.10 Summary
3.4 Distributed block decomposition model
3.4.1 Distributed model
3.4.2 Block decomposition model
3.4.3 Key features of the basic parallel program
3.4.4 Implementing the original program
3.4.5 Fault tolerance approach
3.4.6 Implementation discussion
3.4.7 Evaluation
3.4.8 Summary
Summary
4 Iterative example program
Introduction
4.1 Iterative programs
4.2 Finite difference
4.3 Key features of the original parallel program
4.4 Fault tolerance design
4.4.1 Non-blocking communication
4.4.2 Synchronisation
4.4.3 Check pointing
4.5 Implementation discussion
4.5.1 Check pointing
4.5.2 Restoration of check point
4.6 Evaluation
4.6.1 Fault tolerance
4.6.2 Fault tolerance tests
4.6.3 Code size, development time and difficulty
4.6.4 Theoretical performance
4.6.5 Performance tests
Summary
5 Conclusion and possible extensions
5.1 Overall conclusion
5.2 Possible extensions
Bibliography
Appendix A – Reflection
Appendix B – Glossary
Terms and MPI Functions
