Job Farming Facility

SIESTA uses a sophisticated job farming facility (QUEUE-MANAGER) [92] that utilizes the available computing power of a heterogeneous cluster of workstations. The algorithm that distributes the jobs takes care not to overload a host by enforcing an adjustable load limit: hosts are temporarily disabled whenever an additionally started job would exceed that limit.

To choose among all available hosts, the hosts are ranked according to their relative speed (derived from the CPU speed), their current load (i.e., the number of currently executing jobs), and their number of CPUs. The host that is assumed to finish a job fastest -- the one with the smallest rank -- is chosen for the next job in the queue. This allows hosts with different CPU speeds to be combined. Since a started job is not immediately reflected in the system load, this measure cannot be used directly to compute the rank, or the workload of the cluster would oscillate. Therefore a modified load, the so-called guess load, is used instead. The guess load is an estimate of the actual load, computed by superimposing the system load ($ l_{sys}$), the number of jobs already running ($ N$), and the number of recently finished jobs ($ M$) [92]

$\displaystyle l_{guess} = l_{base} + l_{sys} + \underbrace{\sum_{i=1}^{N} e^{-\frac{t-t_i^{start}}{\tau_{start}}}}_{\rm started~jobs} - \underbrace{\sum_{i=1}^{M} \left[\left(1-e^{-\frac{t_i^{stop}-t_i^{start}}{\tau_{start}}}\right)\cdot e^{-\frac{t-t_i^{stop}}{\tau_{stop}}}\right]}_{\rm stopped~jobs}$ (5.4)

The parameter base load $ l_{base}$ is used to reserve computing resources for jobs that are running outside SIESTA. The time constants $ \tau_{start}$ and $ \tau_{stop}$ account for the time that actually elapses between the start or the termination of a job and its reflection in the system load. These parameters are architecture dependent and must be determined experimentally; an empirically determined value for LINUX systems is $ 5\,s$. Too high a value for $ \tau_{start}$ results in jobs being queued too late -- the host idles unnecessarily long or the operating load even oscillates -- whereas too small a value results in a permanently overqueued host. Conversely, too high a value for $ \tau_{stop}$ overqueues the host, while too low a value results in excessive idle times. The rank $ R_i$ of a host is then computed based on the guess load

$\displaystyle R_i = \frac{\max\left[\frac{l_{guess} + 1}{n_i^{cpu}},\, 1\right]}{s_i}$ (5.5)

where $ n_i^{cpu}$ is the number of CPUs and $ s_i$ is the relative speed of a host.
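
To make the ranking concrete, the following minimal C++ sketch shows how Eqs. (5.4) and (5.5) might be evaluated. The data structures and function names are hypothetical, and the stopped-jobs term follows the reconstruction of Eq. (5.4) above rather than the actual QUEUE-MANAGER sources.

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical bookkeeping for one job: start time and stop time
// (negative while the job is still running).
struct Job { double tStart, tStop; };

// Hypothetical host record holding the quantities of Eqs. (5.4)/(5.5).
struct Host {
    double baseLoad;        // l_base: reserved for jobs outside SIESTA
    double sysLoad;         // l_sys: load reported by the operating system
    int    nCpus;           // n_i^cpu
    double speed;           // s_i: relative speed of the host
    std::vector<Job> jobs;  // recently started and stopped jobs
};

// Guess load, Eq. (5.4): recently started jobs are added with a weight
// that decays as they become visible in l_sys; recently stopped jobs
// are subtracted while they are still visible in l_sys.
double guessLoad(const Host& h, double t, double tauStart, double tauStop)
{
    double l = h.baseLoad + h.sysLoad;
    for (const Job& j : h.jobs) {
        if (j.tStop < 0.0)               // running job, started at tStart
            l += std::exp(-(t - j.tStart) / tauStart);
        else                             // recently stopped job
            l -= (1.0 - std::exp(-(j.tStop - j.tStart) / tauStart))
                 * std::exp(-(t - j.tStop) / tauStop);
    }
    return l;
}

// Rank, Eq. (5.5): the host with the smallest rank receives the next job.
double rank(const Host& h, double t, double tauStart, double tauStop)
{
    double perCpu = (guessLoad(h, t, tauStart, tauStop) + 1.0) / h.nCpus;
    return std::max(perCpu, 1.0) / h.speed;
}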

SIESTA is realized by linking the GUI components against the LISP system. Automatically generated language bindings [93,94,95] allow the seamless use of graphics components (written in the language C) from within the LISP language. Although this concept seems very practical and combines the rapid prototyping advantages usually attributed to the LISP language with a graphical components library (MOTIF), it also has major drawbacks. On the one hand, these drawbacks lie in the hard-to-learn syntax of the LISP language itself; on the other hand, a much bigger disadvantage of this integral concept is the lack of clean interfaces between the modules. It is thus very complicated, if not impossible, to exchange one module for another. Therefore the SIESTA framework has been redesigned and some parts of it have been reimplemented: the QUEUE-MANAGER module, which takes care of distributing the jobs over the workstation cluster, and the GUI module.

The functional deficiencies that finally led to a redesign of SIESTA are:

  1. The GUI cannot be run independently from the rest of the application. Although SIESTA supports a text mode, once the application is started this mode cannot be changed. There is also no way to stop the GUI on one display and start it on another. This is very annoying if an optimization runs for several days and has to be monitored from different workstations.
  2. The QUEUE-MANAGER relies on the Network File System (NFS) protocol to copy data between hosts. NFS is a stateless protocol, which means that the time at which the data become valid on the remote host is not transparent to the client. This led to workarounds where a certain time span (usually a few seconds) has to elapse after every file operation. Although this helps in many cases, there is no general guarantee that the data are consistent at a given time. The probability of an optimization failure due to NFS problems rises with the duration of the optimization task.
  3. The QUEUE-MANAGER uses a central file system, where all operations are performed against the user's mounted home directory (via NFS). This may result in many unnecessary copy and read operations if data are passed between two hosts and the user's home directory is located on a third host.
  4. The QUEUE-MANAGER relies on the Remote Shell (RSH) command for which a standardized server part is only available on UNIX machines.
  5. The QUEUE-MANAGER does not take into account jobs that run at different priorities (so-called nice levels). Usually a lower priority (i.e., a higher nice level) is assigned to jobs that run rather long (hours to days) in order to achieve a short response time for comparably short jobs (seconds to minutes). This results in SIESTA underestimating the available computing power on hosts that run low priority jobs outside SIESTA. If the cluster cannot be reserved exclusively for SIESTA, the only workaround is to temporarily increase the maximum load on the given host.
  6. There is no way to group jobs such that they are executed on the same machine. In many experiments an artificial device serves as the basic model for the optimization task. Such a model is created by a relatively short running job. It would thus make sense to run the model generator and the actual simulation on the same host. Such a grouping could contribute to keeping temporary input/output data from simulations as local as possible.
  7. The existing QUEUE-MANAGER does not allow for several experiments to be run at the same time. Instead, for each experiment a new instance of SIESTA has to be started. This may result in situations where the cluster may be blocked by one instance and all other instances are waiting for computing resources to become available.

To overcome deficiency (1), a QUEUE-MANAGER GUI that is decoupled from the application via the network protocol CORBA [49] was implemented. Information flow is bidirectional: the GUI receives data about the status of the queue and sends data about user requests such as the addition or removal of a host from the cluster. The GUI was implemented in the programming language JAVA [34]. JAVA is platform independent and allows for easy binary distribution of applications, which relieves the user from the need to compile the source code. Another advantage of JAVA is the capability to run the whole GUI in a web browser. The runtime overhead introduced by JAVA / CORBA compared to the MOTIF approach can be justified for this application, since the software contains no time critical parts. Fig. 5.6 depicts a screenshot of the new GUI.

Figure 5.6: Swing QUEUE-MANAGER GUI.
\begin{figure}\centering\psfig{file=pics/siesta-new-gui, width=0.85\linewidth} \end{figure}

The QUEUE-MANAGER was reimplemented to cope with drawbacks (2) to (7). Again CORBA is used to handle the communication between the hosts participating in the cluster and with the GUI. To handle data transfer (2) between hosts, the QUEUE-MANAGER provides a platform independent copy mechanism: the client side invokes a CORBA method, which opens a socket on the server side and attaches the requested file to the socket in a separate execution thread. The socket number is then returned to the client, which initiates a connection to the socket and retrieves the data from the server (a sketch of this mechanism is given below). Unnecessary file copy operations (3) are avoided by using symbolic names instead of concrete paths to physical file locations. The location of a file corresponding to a symbolic name is stored in a central location and queried directly by the host that needs the file. The RSH command (4) is replaced by a CORBA method invocation. To account for different UNIX nice levels (5), a modified strategy to choose a host is used. The nice level determines the amount of computing time a process gets assigned. To estimate the CPU time available to a newly started job, the number of virtual ticks $ T_i$ that are scheduled among $ N$ running jobs with nice levels $ n_j$ on machine $ i$ (with $ n^{cpu}_i$ CPUs) is defined as (Linux kernel sources)

$\displaystyle T_i=n^{cpu}_i\cdot\sum_{j=1}^{N}{\left(21-n_j\right)} \equiv 21\cdot n^{cpu}_i\cdot N-n^{cpu}_i\cdot\sum_{j=1}^{N}{n_j}$ (5.6)

Therefore the percentage $ P_i$ of CPU time an individual process gets on machine $ i$ is:

$\displaystyle P_i(N, n)=\left\{ \begin{array}{ll} 100 & \mbox{if $N\le n^{cpu}_i$}\\ 100\cdot n^{cpu}_i\cdot\frac{21-n}{T_i} & \mbox{otherwise} \end{array} \right.$ (5.7)

Based on this percentage, a modified rank $ R^m_i$ that estimates the expected amount of CPU time a newly started job with nice level $ n$ will get on host $ i$ is defined as

$\displaystyle R^m_i = s_i\cdot P_i(N\!\!+\!\!1, n)$ (5.8)
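
The following minimal C++ sketch evaluates Eqs. (5.6) to (5.8); the function names are hypothetical. Note that for $ R^m_i$ -- in contrast to $ R_i$ -- a larger value presumably indicates a better host, since it estimates the CPU share the new job would receive.

#include <vector>

// Virtual ticks T_i, Eq. (5.6): each of the N jobs with nice level n_j
// contributes (21 - n_j) ticks, scaled by the number of CPUs.
double virtualTicks(int nCpus, const std::vector<int>& niceLevels)
{
    int sum = 0;
    for (int n : niceLevels) sum += 21 - n;
    return static_cast<double>(nCpus) * sum;
}

// Percentage of CPU time a job with nice level n gets, Eq. (5.7):
// 100% as long as there are at least as many CPUs as jobs, otherwise
// the job's share of the virtual ticks.
double cpuPercentage(int nCpus, const std::vector<int>& niceLevels, int n)
{
    if (static_cast<int>(niceLevels.size()) <= nCpus)
        return 100.0;
    return 100.0 * nCpus * (21 - n) / virtualTicks(nCpus, niceLevels);
}

// Modified rank R^m_i, Eq. (5.8): the job list is extended by the job
// about to be queued (the N+1 in the equation) before evaluating P_i.
double modifiedRank(double speed, int nCpus,
                    std::vector<int> niceLevels, int n)
{
    niceLevels.push_back(n);   // the newly started job
    return speed * cpuPercentage(nCpus, niceLevels, n);
}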

The newly implemented QUEUE-MANAGER supports the notion of groups, where an arbitrary number of jobs can be grouped together to be executed on the same machine (6). Jobs of several experiments are now queued into one QUEUE-MANAGER instance (7); the experiment that is watched in the GUI can be selected and changed at runtime.
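
Returning to the copy mechanism used to resolve deficiency (2): the following is a minimal POSIX sketch of the server side, with error handling omitted. The function serveFile is hypothetical; it stands for the code behind the CORBA method that the client invokes, and it returns the port number that is sent back to the client via CORBA.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <fstream>
#include <string>
#include <thread>

// Open a listening socket on a port chosen by the operating system,
// stream the requested file to the first client that connects (in a
// separate execution thread), and return the port number so that it
// can be passed back to the client as the result of the CORBA call.
unsigned short serveFile(const std::string& path)
{
    int listenFd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                    // 0: let the OS pick a free port
    bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listenFd, 1);

    socklen_t len = sizeof(addr);         // query the chosen port number
    getsockname(listenFd, reinterpret_cast<sockaddr*>(&addr), &len);
    unsigned short port = ntohs(addr.sin_port);

    std::thread([listenFd, path] {        // the transfer runs detached
        int conn = accept(listenFd, nullptr, nullptr);
        std::ifstream in(path, std::ios::binary);
        char buf[4096];
        while (in.read(buf, sizeof(buf)) || in.gcount() > 0)
            write(conn, buf, static_cast<size_t>(in.gcount()));
        close(conn);
        close(listenFd);
    }).detach();

    return port;                          // returned to the client via CORBA
}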

To make the new QUEUE-MANAGER work with the existing SIESTA program, the
CORBA client of the QUEUE-MANAGER is linked against the text version of the SIESTA LISP system (see Fig. 5.7).

Figure 5.7: Coupling of SIESTA and the CORBA client stubs.
\begin{figure}\psfig{file=pics/siesta-corba, width=\linewidth} \end{figure}

The LISP parts of SIESTA were changed such that the jobs are queued into the new QUEUE-MANAGER module. No extra code is necessary for the CORBA client stubs; they are created automatically from a so-called interface definition language (IDL) by the IDL compiler. IDL allows the definition of structures as known from the C, C++, and JAVA languages, as well as of methods that use such structures as arguments and return values. Fig. 5.8, Fig. 5.9, and Fig. 5.10 depict parts of the IDL definition as it is used in the communication between GUI and QUEUE-MANAGER.
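
For illustration, a hypothetical C++ client built from the generated stubs could look as follows. The generated header name and the handling of the object reference depend on the ORB in use, and it is assumed that HostStruct is declared at global IDL scope as shown in Fig. 5.8; the host name and parameter values are made up.

#include "QMan.hh"  // generated by the IDL compiler (name is ORB specific)

int main(int argc, char** argv)
{
    // Standard CORBA C++ mapping: initialize the ORB and narrow a
    // stringified object reference (passed on the command line here)
    // to the ClientRequest interface of Fig. 5.9.
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);
    CORBA::Object_var obj = orb->string_to_object(argv[1]);
    QMan::ClientRequest_var req = QMan::ClientRequest::_narrow(obj);

    // Fill the HostStruct of Fig. 5.8 and ask the QUEUE-MANAGER to add
    // the host to its table.
    HostStruct ho;
    ho.name      = CORBA::string_dup("alpha01");
    ho.nrCpus    = 2;
    ho.limit     = 2.0;
    ho.weight    = 1.0;
    ho.actLoad   = 0.0;
    ho.guessLoad = 0.0;
    ho.rank      = 0.0;
    ho.os        = CORBA::string_dup("Linux");
    ho.status    = CORBA::string_dup("enabled");
    req->addHost(ho);

    orb->destroy();
    return 0;
}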

To automate the binding of the C++ CORBA code to the LISP system, the Tool Abstraction Concept (TAC) of the VMAKE [87] CASE tool was used. TAC is able to semi-automatically create language bindings among the languages FORTRAN, C, and LISP. These bindings are created based on instructions that are ignored by the compiler of the target language but used by the TAC module of VMAKE.

Figure 5.8: IDL definition of the struct HostStruct which is used in the communication between the QUEUE-MANAGER and the GUI.


struct HostStruct
{
  string name;                 // name of host
  short  nrCpus;               // nr of cpus
  double limit, weight;        // load limit, weight of host
  double actLoad, guessLoad,
         rank;                 // status of host
  string os;                   // type of operating system
  string status;
};


Figure 5.9: IDL definition of the interface ClientRequest with methods addHost and changeHost which are used in the communication between the QUEUE-MANAGER and the GUI. The GUI component uses these methods to notify the QUEUE-MANAGER that the user has requested a change (changeHost) in the host table, that the user wants to add (addHost) a host to the table, or that the user would like to remove a host from the table (removeHost).


module QMan
{
  interface ClientRequest
  {
    void addHost(in HostStruct ho);
    void changeHost(in HostStruct ho);
    void removeHost(in string name);
  };
};


Figure 5.10: IDL definition of the interface ClientUpdate. The methods addHost, updateHost, and removeHost are used by the QUEUE-MANAGER to indicate to the GUI that a change in the QUEUE-MANAGER's host table has occurred. Such a change occurs, e.g., if a user who is connected via another GUI instance requests a change.


  interface ClientUpdate
  {
    void addHost(in HostStruct ho);
    void updateHost(in HostStruct ho);
    void removeHost(in string name);
  };

