6.2.2 Robustness


Distributed computation takes place in a rather complex technical system. Beyond the stability of the hardware and the operating system itself, distributed computation adds one more source of system failures: the network connections. As the complexity of the system grows, so does the probability of system failures. In addition, the longer the overall simulation takes, the more likely a failure is to occur within that period.

To get a feeling for the stability that can be achieved, let us briefly sketch a case study of distributed computation on a cluster of workstations. Assume that the probability of a hardware or operating system failure is $P_{OS}$, the probability of a failure due to disk storage shortage is $P_{Disk}$, and the probability of a network failure is $P_{Net}$. The probability of successful completion of a simulation on a single machine is then

\begin{displaymath}
P_{Succ} = \left(1-P_{OS}\right) \cdot \left(1-P_{Disk}\right) \cdot \left(1-P_{Net}\right) \mbox{.}
\end{displaymath}

The probability of failure for an ensemble of $N$ simulations computing in parallel is

\begin{displaymath}
P_{Fail,Total} = 1 - \prod_{i=1}^{N} P_{Succ,i} = 1 - \left(P_{Succ}\right)^N \mbox{,}
\end{displaymath}

assuming that all parts of the distributed simulation are exposed to equal risk and that an individual failure invalidates the entire simulation. A reasonable assumption is one failure per month each for $P_{OS}$ and $P_{Disk}$, and one failure per week for $P_{Net}$, which is equivalent to $P_{OS} = P_{Disk} = 1.93 \cdot 10^{-3}$ and $P_{Net} = 5.95 \cdot 10^{-3}$ failures per hour.
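As a quick check on these figures, one failure per week corresponds to one failure in $7 \cdot 24$ hours, and inserting the quoted hourly values into the single-machine formula gives the hourly success probability. This is only a worked sketch using the numbers stated above:

\begin{displaymath}
P_{Net} = \frac{1}{7 \cdot 24} \approx 5.95 \cdot 10^{-3} \mbox{,} \qquad
P_{Succ} = \left(1 - 1.93 \cdot 10^{-3}\right)^2 \cdot \left(1 - 5.95 \cdot 10^{-3}\right) \approx 0.9902 \mbox{.}
\end{displaymath}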

Table 6.2 shows the resulting failure probabilities for two experiments lasting one hour and one day, respectively, under the assumption that the (sub-)processes are optimally balanced across a cluster of 20 workstations.
Table 6.2: ... if the experiment on a cluster of 20 workstations takes one day. (Table body not recoverable.)

Table 6.2 makes clear that parallel and distributed computation on a local area network results in a fairly unstable system unless special measures are taken to improve stability. This is particularly true for large-scale simulation experiments such as optimizations, which can take a week or even longer.
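The estimate can be reproduced with a few lines of Python. The following is a minimal sketch rather than the calculation behind Table 6.2: the function names are hypothetical, and it assumes that failures in successive hours are independent, so that for an experiment lasting $T$ hours the exponent $N$ in the formula for $P_{Fail,Total}$ becomes $N \cdot T$.

```python
def p_success_per_hour(p_os, p_disk, p_net):
    """Probability that a single machine runs for one hour without any failure."""
    return (1.0 - p_os) * (1.0 - p_disk) * (1.0 - p_net)


def p_fail_total(p_succ, n_machines, hours=1.0):
    """Probability that at least one of n_machines fails within `hours`.

    Assumption (not stated in the text): successive hours fail independently,
    so the exponent N of the ensemble formula becomes N * hours.
    """
    return 1.0 - p_succ ** (n_machines * hours)


# Per-hour failure probabilities as quoted in the text.
P_OS = P_DISK = 1.93e-3
P_NET = 5.95e-3

p_succ = p_success_per_hour(P_OS, P_DISK, P_NET)
for label, hours in (("one hour", 1), ("one day", 24)):
    print(f"{label}: P_fail = {p_fail_total(p_succ, 20, hours):.2f}")
```

Under these assumptions the sketch yields a failure probability of roughly 18% for the one-hour experiment and about 99% for the one-day experiment on 20 workstations, which illustrates how quickly the risk grows with run time.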


Rudi Strasser
1999-05-27