Next: 4.5.3 SIESTA Optimizer Module Up: 4.5 Integration into a Previous: 4.5.1 Parallel Execution


4.5.2 Fault Recovery

An optimization run can take from a few hours up to several days or weeks. During this period, several failures can occur in the computation environment.

1. Problems in the local area network (LAN). Modern network structures supply the connected hosts with high bandwidth. They are very complex and consist of much equipment, such as routers and switches. The large amount of high-technology equipment makes these networks difficult to install and to maintain. A disconnected host only affects the simulations running on that machine and therefore does not automatically kill a running optimization.

2. Required maintenance tasks, such as system software updates or hardware replacement, may make it necessary to temporarily shut down a host, thereby killing the optimizer task.

3. Overloading of a host or exceeding its virtual memory limits are critical operating conditions in which the kernel may stop a task in order to keep the system alive.

4. Power outages or hardware failures are unlikely but can also occur.

Hence, if a simulator crashes for some reason, the framework recognizes the failure and the simulation is queued again. An unreachable host, e.g., one detached from the network, is detected by the framework through unanswered queries for its process load values. This host is temporarily removed from the list of available hosts until a stable network connection can be established again.
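This recovery behavior can be sketched as follows. The sketch is illustrative only: names such as `Host`, `query_load`, and `requeue_failed` are assumptions, not the framework's actual API, and the network state is simulated by a flag instead of a real load query.

```python
import queue

MAX_MISSED = 3  # disable a host after this many unanswered load queries (assumed policy)

class Host:
    """Minimal stand-in for a compute host tracked by the framework."""
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # simulated network state
        self.enabled = True
        self.missed = 0

    def query_load(self):
        """Return a load value, or None if the host does not answer."""
        return 0.5 if self.reachable else None

def check_hosts(hosts):
    """Disable hosts whose load queries go unanswered repeatedly;
    re-enable them as soon as they respond again."""
    for h in hosts:
        if h.query_load() is None:
            h.missed += 1
            if h.missed >= MAX_MISSED:
                h.enabled = False
        else:
            h.missed = 0
            h.enabled = True

def requeue_failed(job_queue, finished):
    """Put simulations whose simulator crashed back into the job queue."""
    for job, ok in finished:
        if not ok:
            job_queue.put(job)
```

A real implementation would query the load over the network with a timeout; the point is only that detection (repeated missed queries) and recovery (re-queueing) are separate, simple mechanisms.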

When the optimizer itself fails, all calculated data are lost. A database of all control and response variables could solve this problem. However, this requires a large database, especially for least-squares problems, where the whole residual vector has to be stored for each evaluation with a given set of input parameters. In addition, a tolerance $\epsilon$ has to be defined for comparing newly requested evaluations with the values stored in the database.
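The $\epsilon$-comparison can be sketched as a small evaluation cache. The function names and the max-norm comparison are illustrative assumptions; the thesis does not specify how such a database would be organized.

```python
def lookup(db, x, eps):
    """Return the stored response for a parameter vector that matches x
    within eps (componentwise), or None if no stored point is close enough."""
    for params, response in db:
        if len(params) == len(x) and all(abs(p - q) <= eps for p, q in zip(params, x)):
            return response
    return None

def evaluate(db, x, model, eps=1e-9):
    """Evaluate the model only if no sufficiently close point is stored;
    otherwise reuse the stored response and avoid a costly simulation."""
    cached = lookup(db, x, eps)
    if cached is not None:
        return cached
    y = model(x)
    db.append((tuple(x), y))
    return y
```

For a least-squares problem the stored `response` would be the whole residual vector, which is what makes the database large.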

Saving all evaluation data causes a large overhead, since only the Hessian and the gradient at the current evaluation point are necessary to continue the optimization.

One possibility to solve this problem is to store all matrices, vectors, and scalars changed during the runtime of the program. This can be done by functions which store the current values in a file and, in case of a restart, load this file and continue the execution. The store and restore functions must cover all variables used in the program, which makes them difficult to manage during the development phase. Special care must be taken with open files, because they have to be reopened before they can be accessed.

Another strategy is to use support from external libraries or from the operating system. Some operating systems, such as IRIX, supply functions which provide a facility called Checkpointing4.6. On other platforms the external library chkpt [67] provides equivalent functionality4.7.

The time at which a checkpoint is taken is important for the efficiency of the whole system. Checkpoints can be set periodically by a timer, triggered by a signal, or taken by explicitly calling the checkpoint function at a specific point in the program.
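The three triggering modes can be sketched in one small class. This is a schematic illustration, not the IRIX or chkpt interface: in real use the `requested` flag would be set from a signal handler, and `write` would invoke the system's checkpoint routine.

```python
import time

class Checkpointer:
    """Sketch of checkpoint triggering: periodic (timer), asynchronous
    (a flag set by a signal handler), and explicit (direct call)."""
    def __init__(self, period):
        self.period = period            # seconds between periodic checkpoints
        self.last = time.monotonic()
        self.requested = False          # set asynchronously, e.g. by a signal

    def maybe_checkpoint(self, write):
        """Take a checkpoint if the timer expired or one was requested.
        Calling this at a chosen point in the program is the explicit mode."""
        now = time.monotonic()
        if self.requested or now - self.last >= self.period:
            write()                     # store the program image
            self.last = now
            self.requested = False
            return True
        return False
```

The explicit mode matters most here, because (as the next paragraphs show) the correctness of a restart depends on *where* in the program the image is taken, not just how often.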

Figure 4.9: Triggering of checkpoints.
\includegraphics[width=0.9\linewidth]{graphics/checkpoint.eps}

A checkpoint triggered in the collection loop of the parallelized finite-difference gradient calculation would block the system: after re-execution, the optimizer would be in a state where it waits for results that were requested during the previous execution. This would cause a deadlock, and the optimization process would not start.

Figure 4.9 shows a time diagram indicating where the checkpoints have to be placed. Starting with a gradient calculation (1), n model evaluations are requested by the optimizer. Just before this, the image of the program is stored by calling the checkpoint routine. The evaluation of the models and the collection of the results (3-4) can take rather long, depending on n and on the number of available hosts. If the optimization process is killed during this time, it can be resumed at (1), but the intermediate results needed for the gradient calculation (2-3) are lost. A checkpoint is also triggered before each single step (4-5, 6-7, and 8-9). A failure during these steps only requires the recomputation of one model evaluation.
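The checkpoint placement of Figure 4.9 can be sketched as an optimizer main loop; the function names (`request_gradient_evals`, `collect`, `single_step`) are placeholders for the framework's actual routines:

```python
def optimize(checkpoint, request_gradient_evals, collect, single_step, max_iter=100):
    """Sketch of explicit checkpoint placement following Figure 4.9:
    the program image is stored just before the n gradient evaluations
    are requested (1), and again before each single step, so a crash
    during the long parallel phase loses at most one phase of work and
    never leaves the optimizer waiting for results of a previous run."""
    for _ in range(max_iter):
        checkpoint()                   # before (1): safe restart point
        jobs = request_gradient_evals()
        gradient = collect(jobs)       # (2)-(3): parallel phase; results lost on crash
        checkpoint()                   # before the single step (4-5, 6-7, ...)
        if single_step(gradient):      # returns True on convergence
            break
```

Placing the checkpoint *before* `request_gradient_evals` rather than inside `collect` is exactly what avoids the deadlock described above: a restarted image re-requests the evaluations instead of waiting for answers that will never arrive.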



Footnotes

...Checkpointing4.6
On IRIX 6.2 more details can be found in the manpages checkpoint(1), ckpt_setup(3), ckpt_create(3), ckpt_restart(3) and ckpt_remove(3).
... functionality4.7
Currently it supports NetBSD and Digital Unix, and in the near future also Solaris. It provides the same functions as the IRIX original, but since it relies on the standard dynamic loader, only statically linked programs are supported.


R. Plasun