Answer: 1. The phrase "fault tolerance" means many things to many people. Typical definitions range from user processes periodically dumping vital state to disk, to checkpoint/restart of running processes, to elaborate recreate-process-state-from-incremental-pieces schemes, to ... (you get the idea).
In the scope of Open MPI, we typically define "fault tolerance" to mean the ability to recover from one or more component failures in a well-defined manner with either a transparent or an application-directed mechanism. Component failures may manifest as a corrupted transmission over a faulty network interface, or as the failure of one or more serial or parallel processes due to a processor or node failure. Open MPI strives to provide the application with a consistent system view while still providing a production-quality, high-performance implementation.
Yes, that's pretty much as all-inclusive as possible -- intentionally so! Remember that in addition to being a production-quality MPI implementation, Open MPI is also a vehicle for research. So while some forms of "fault tolerance" are more widely accepted and used, others are certainly of valid academic interest.
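To make the simplest definition in this answer concrete (a user process periodically dumping vital state to disk), here is a minimal application-level sketch in plain C. It is not Open MPI code and uses no MPI calls. The `app_state` struct, the function names, and the `max_steps` parameter (which stands in for a failure partway through a run) are all hypothetical, and the write-then-rename step assumes POSIX `rename()` semantics.

```c
#include <stdio.h>

/* Hypothetical application state; the fields are illustrative only. */
typedef struct {
    int iteration;
    double accumulator;
} app_state;

/* Write the state to a temporary file, then rename it over the checkpoint.
   Assumes POSIX rename(), which replaces an existing file, so a crash
   during the write does not leave a truncated checkpoint behind. */
static int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[512];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;
    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Returns 0 and fills *s if a complete checkpoint exists, -1 otherwise. */
static int load_checkpoint(const char *path, app_state *s)
{
    app_state loaded;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t n = fread(&loaded, sizeof loaded, 1, f);
    fclose(f);
    if (n != 1)
        return -1;
    *s = loaded;
    return 0;
}

/* Resume from the checkpoint if one exists, then run toward `total`
   iterations, saving every `interval` iterations. `max_steps` caps the
   work done in this call and stands in for a failure partway through. */
static app_state run(const char *path, int total, int interval, int max_steps)
{
    app_state s = {0, 0.0};
    int done = 0;
    load_checkpoint(path, &s);  /* on failure, s keeps its initial value */
    while (s.iteration < total && done < max_steps) {
        s.accumulator += s.iteration;  /* placeholder for real work */
        s.iteration++;
        done++;
        if (s.iteration % interval == 0)
            save_checkpoint(path, &s);
    }
    return s;
}
```

For example, calling `run("state.ckpt", 10, 2, 5)` stops after five iterations with the last checkpoint taken at iteration 4. A second call to `run("state.ckpt", 10, 2, 100)` resumes from that checkpoint and redoes only the work lost since it was written.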
Answer: 2. Open MPI plans on supporting the following fault tolerance techniques:
* Coordinated and uncoordinated process checkpoint and restart, similar to those implemented in LAM/MPI and MPICH-V, respectively.
* Message logging techniques, similar to those implemented in MPICH-V.
* Data reliability and network fault tolerance, similar to those implemented in LA-MPI.
* User-directed and communicator-driven fault tolerance, similar to that implemented in FT-MPI.
The Open MPI team will not limit its fault tolerance techniques to those mentioned above, but intends to extend beyond them in the future.
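As a rough illustration of the data reliability idea in the list above, the following toy sketch in plain C detects a corrupted transmission with a checksum and retransmits. It is not how LA-MPI or Open MPI implements this. The `toy_link` type and the `link_transmit` and `reliable_send` functions are hypothetical, the "network" is simulated in memory, and Fletcher-16 is chosen here only because it is short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fletcher-16 checksum over a byte buffer. */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (uint16_t)((a + data[i]) % 255);
        b = (uint16_t)((b + a) % 255);
    }
    return (uint16_t)((b << 8) | a);
}

/* Hypothetical faulty link: corrupts the first `faults_remaining`
   transmissions by flipping one bit of the first byte. */
typedef struct {
    int faults_remaining;
} toy_link;

static void link_transmit(toy_link *l, const uint8_t *src, uint8_t *dst,
                          size_t len)
{
    memcpy(dst, src, len);
    if (l->faults_remaining > 0 && len > 0) {
        dst[0] ^= 0x01;  /* simulated corruption on the wire */
        l->faults_remaining--;
    }
}

/* Transmit `msg` into `out`, verifying the checksum after each attempt
   and retransmitting on a mismatch. Returns the number of attempts
   used, or -1 if every attempt up to `max_attempts` was corrupted. */
static int reliable_send(toy_link *l, const uint8_t *msg, uint8_t *out,
                         size_t len, int max_attempts)
{
    uint16_t expected = fletcher16(msg, len);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        link_transmit(l, msg, out, len);
        if (fletcher16(out, len) == expected)
            return attempt;
    }
    return -1;
}
```

With a link that corrupts its first two transmissions, `reliable_send` succeeds on the third attempt. If the link keeps corrupting data past `max_attempts`, it returns -1 so the caller can escalate, for instance by failing over to another network interface.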