5.5.3 Fault Tolerance

Handling failures during the evaluation of a simulation-flow-model is critical for the smooth operation of the TCAD experiments that are based on it. There are three main sources of errors that cause failures. First, failures of the computer system, in either the hardware or the operating system, can cause malfunctions during the evaluation of a tool. Second, users sometimes run out of disk space, which can happen for various reasons. Third, the simulation tools themselves sometimes fail to deliver a simulation result. Therefore, provisions have been made to enable users to cope with such situations.

If the evaluation of a tool fails, the simulation-flow-model retries the evaluation of that tool until it succeeds. Each time an evaluation fails, the value of the auxiliary symbol aux.repeatlevel is increased. This convention has proven extremely useful in the case of system or network failures during optimization experiments. Without fault tolerance, optimizations stall as a consequence of single-point failures (see Section 6.2.2 for a detailed discussion).
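
The retry convention described above can be illustrated with a minimal Python sketch. It is not the actual implementation: the function evaluate_with_retry, the dictionary-based symbol table, the increment of aux.repeatlevel by exactly one per failure, and the optional max_retries bound are all assumptions made for illustration. Only the symbol name aux.repeatlevel and the retry-until-success behavior come from the text.

```python
from typing import Any, Callable, Dict, Optional


def evaluate_with_retry(
    tool: Callable[[Dict[str, Any]], Any],
    symbols: Dict[str, Any],
    max_retries: Optional[int] = None,
) -> Any:
    """Re-evaluate `tool` until it succeeds.

    Every failed attempt increases symbols["aux.repeatlevel"] by one
    (the step size is an assumption). With max_retries=None the loop
    retries indefinitely, as described in the text; the optional bound
    is a hypothetical safety net, not part of the original design.
    """
    symbols.setdefault("aux.repeatlevel", 0)
    while True:
        try:
            return tool(symbols)
        except Exception:
            symbols["aux.repeatlevel"] += 1
            if max_retries is not None and symbols["aux.repeatlevel"] > max_retries:
                raise


def make_flaky_tool(failures: int) -> Callable[[Dict[str, Any]], float]:
    """Build a hypothetical tool that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def tool(symbols: Dict[str, Any]) -> float:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated tool failure")
        return 42.0

    return tool


# Usage: the tool fails twice, so aux.repeatlevel ends up at 2.
symbols: Dict[str, Any] = {}
result = evaluate_with_retry(make_flaky_tool(2), symbols)
```

Because the tool receives the symbol table, a tool wrapper could in principle inspect aux.repeatlevel and react to repeated failures, for example by switching to more conservative settings; whether and how the tools do this is not stated here.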

Rudi Strasser
1999-05-27