5.3.5 Scheduler Kernels

This section discusses the design and implementation of the different scheduler kernels. In particular, a serial scheduler is discussed, as well as a task parallel scheduler, in which components can be executed by different MPI processes where possible, and a data parallel scheduler, in which all components execute the same instructions, but each on a specific subset of the data.

Serial Mode

The SM kernel is used for serial task graph execution on a shared-memory machine. Although the task graph is processed in a serial manner, the individual plugins can indeed utilize shared-memory parallelization approaches, such as OpenMP. The task graph implementation is based on the Boost Graph Library [48][155], which not only provides the data structure but also graph algorithms, such as topological sort [156].
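The task graph can be modeled, for instance, as a directed Boost Graph adjacency list with one vertex per component. The following is a minimal sketch under that assumption; the type names ComponentNode, TaskGraph, and Vertex are illustrative and do not correspond to the actual ViennaX implementation.

#include <boost/graph/adjacency_list.hpp>
#include <functional>

// Illustrative sketch: a directed BGL graph bundling each component's
// execute interface with its vertex (names are not the ViennaX types).
struct ComponentNode {
  std::function<bool()> execute;  // the component's execute interface
};

typedef boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS,
                              ComponentNode> TaskGraph;
typedef boost::graph_traits<TaskGraph>::vertex_descriptor Vertex;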

The serial scheduler is based on the list scheduling technique [157]. Informally, this technique uses a prioritized sequence of tasks, which is then processed consecutively. Figure 5.19 depicts the major steps of the execution flow.




Figure 5.19: Flow diagram of the SM scheduler kernel.


As previously discussed in the context of the framework’s configuration mechanism, the plugins are loaded according to the input configuration file (Section 5.3.4) by the factory mechanism (Section 5.3.2). Each plugin is configured based on the parameters listed in the input file. With these parameters the input and output dependencies are defined.

A task graph meshing algorithm connects the various plugins based on their dependencies. The meshing procedure plugs the sink sockets of the plugins into valid source sockets of other plugins (Section 5.3.2). Validity is ensured by comparing the socket IDs, incorporating the socket key as well as the data associated with the socket. The generated task graph is used to build the prioritized sequence via Boost Graph’s topological sort algorithm, as depicted in the following.

std::list<Vertex> plist;
// topological_sort emits the vertices in reverse topological order; the
// front inserter therefore yields the execution sequence front to back.
boost::topological_sort(graph, std::front_inserter(plist));

The graph object is a directed graph data structure, which is used to hold the entire task graph. A directed graph is a graph consisting solely of directed edges, i.e., edges pointing from a source to a target vertex. In addition, the graph must not contain cycles, also known as loops. If these conditions are satisfied, the topological sort algorithm can be utilized to generate a linear sequence of vertices - representing the ViennaX components - which upon execution guarantees that the input dependencies of each component are met. More concretely, the prioritized sequence of tasks is processed consecutively by traversing the result container plist and executing the individual plugins via the plugin’s execute interface method (Section 5.3.2). Note that the linear solver plugin introduced in Section 5.3.3 can be utilized with this scheduler.
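The traversal and execution step can be sketched as follows, assuming the TaskGraph and Vertex types introduced above; run_serial is an illustrative function name, not part of the ViennaX interface.

#include <boost/graph/topological_sort.hpp>
#include <list>

// Minimal sketch: process the prioritized sequence front to back; the
// topological order guarantees that the input dependencies of each
// component have been produced before it is executed.
void run_serial(TaskGraph& graph) {
  std::list<Vertex> plist;
  boost::topological_sort(graph, std::front_inserter(plist));

  for (Vertex v : plist)
    graph[v].execute();  // invoke the component via its execute interface
}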

Despite the acyclic requirement, the SM scheduler supports loops by identifying the loop entry vertex and the loop exit vertex. These particular vertices are identified by a loop-detection algorithm based on Boost Graph’s implementation of Tiernan’s approach for detecting cycles in a directed graph [158]. The loop is then broken up to satisfy the acyclic condition, followed by executing the topological sort algorithm to determine the linear execution sequence. ViennaX manually triggers a re-execution of the sub-graph representing the loop part via the previously identified loop vertices.

To implement a loop, the components representing the previously mentioned loop entry and loop exit vertices need to provide a sink-source socket connection representing the loop. To this end, the loop exit component implements a source socket whereas the loop entry component provides a corresponding sink socket, thus closing the loop via a backward connection. As this connection is merely required for establishing a dependency in the task graph, the associated data is irrelevant and can thus be chosen arbitrarily. The loop exit component is responsible for triggering a loop continuation or a loop exit by returning false or true, respectively, at the end of the execution method. The framework evaluates the boolean return value and acts accordingly.
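The return-value convention can be illustrated by the following hypothetical loop exit component; the structure and member names are purely illustrative and do not reflect the actual ViennaX plugin interface.

// Hypothetical sketch of a loop exit component: the boolean return value
// of execute signals whether the loop sub-graph is re-executed.
struct LoopExitComponent {
  int iteration = 0;
  int max_iterations = 10;

  bool execute() {
    // ... perform the actual work of this component ...
    ++iteration;
    // false: continue the loop (the scheduler re-executes the sub-graph);
    // true: exit the loop and continue with the remaining task graph.
    return iteration >= max_iterations;
  }
};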

Distributed Task Parallel Mode

The DTPM scheduler kernel enables applications focusing on a task parallel approach. In general, the scheduler follows a static scheduling approach based on load balancing guided by optional plugin weights. Similar workload distribution approaches are available, focusing on dynamic scheduling implementations based on, for instance, work-stealing [159][160]. The execution of the individual plugins is distributed among the available MPI processes. Therefore, a considerable speedup of the task execution can be achieved if the task graph offers parallel paths. Figure 5.20a depicts the flow diagram of the scheduler.

The DTPM scheduler has two peculiarities: First, the global task graph is partitioned and ultimately the individual subgraphs are processed by different MPI processes. Second, as the plugins sharing a data connection might be executed on different MPI processes, an extension to the socket data communication layer incorporating the distributed memory environment is required.

The distribution of the workload is based on the METIS graph partitioning library [161], which is used to automatically improve the efficiency of the parallel execution of the task graph. A weighting approach is implemented, enabling an advanced user or a developer to assign a weight to the plugin implementation via a corresponding method, indicating the computational load of the respective component. This load value is used by METIS, aiming to equalize the computational effort over the generated partitions and thus to improve the parallel execution efficiency. In general, the larger the assigned value, the larger the associated computational load of the respective component. Considering four components (each with a load of one) and two processes, each process will be assigned two components, as each process is responsible for a computational load of two. However, if one component has a load of three instead, one process will be responsible for the three remaining components, whereas the other will handle the single heavy component; in this case both processes are responsible for a computational load of three.
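The weighting idea can be sketched as follows; the weight method name and the component structure are hypothetical and merely illustrate the concept of reporting a relative computational load to the scheduler.

#include <cstddef>

// Hypothetical sketch: a component reports its relative computational load,
// which the scheduler forwards to METIS as a vertex weight.
struct HeavySolver {
  std::size_t weight() const { return 3; }  // three times the default load of one
  bool execute() { /* assemble and solve */ return true; }
};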

With respect to the implementation of the METIS-based graph partitioning, the framework internally converts the Boost Graph-based graph data structure to a compressed sparse row format [120], which is required by the METIS API. The computational weights are transferred to the METIS backend and the corresponding partitioning algorithm is executed. The algorithm minimizes the so-called edge-cut, i.e., the number of edges which straddle partitions, while simultaneously balancing the vertex weights across the partitions. Upon completion, each vertex is assigned a partition number which is used by ViennaX to distribute the workload among the MPI processes accordingly.
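The following minimal sketch shows such a partitioning call, assuming the METIS 5.x C API and a small four-component chain graph; the concrete conversion code in ViennaX differs, and the vertex weights correspond to the example above.

#include <metis.h>
#include <vector>

// Minimal sketch: partition a four-vertex chain graph, given in compressed
// sparse row (CSR) form, into two parts while balancing the vertex weights
// and minimizing the edge-cut.
idx_t partition_example(std::vector<idx_t>& part) {
  idx_t nvtxs = 4, ncon = 1, nparts = 2, objval = 0;
  std::vector<idx_t> xadj   = {0, 1, 3, 5, 6};     // CSR row pointers
  std::vector<idx_t> adjncy = {1, 0, 2, 1, 3, 2};  // CSR adjacency (chain 0-1-2-3)
  std::vector<idx_t> vwgt   = {1, 1, 3, 1};        // computational weights
  part.resize(nvtxs);                              // resulting partition per vertex

  METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(), vwgt.data(),
                      nullptr, nullptr, &nparts, nullptr, nullptr, nullptr,
                      &objval, part.data());
  return objval;  // number of edges cut by the partitioning
}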

The second peculiarity of the DTPM scheduler, being the incorporation of a distributed memory environment into the socket data communication layer, is based primarily on the non-blocking point-to-point communication capabilities of the MPI layer. The graph partitioning step yields, aside from the MPI process assignments of the plugins, a lookup table for the socket communication. Each MPI process holds its own socket database and utilizes the communication lookup table to determine the corresponding transmission sources and sinks. This mechanism is utilized after a plugin has been executed on an MPI process: its source sockets requiring outbound inter-process communication are traversed and the transmission is initiated.
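Such an outbound transmission step can be sketched with Boost MPI as follows; the OutboundSocket structure, the send_sockets function, and the use of a plain double vector as socket payload are assumptions for illustration, not the actual ViennaX socket layer.

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <vector>

// Hypothetical sketch: after a plugin has finished on this rank, every source
// socket whose sink resides on another rank is transmitted with a non-blocking
// send, so the process can immediately continue with the next plugin.
struct OutboundSocket {
  int target_rank;           // taken from the communication lookup table
  int tag;                   // identifies the socket on the receiving side
  std::vector<double> data;  // the data associated with the socket
};

std::vector<boost::mpi::request>
send_sockets(const boost::mpi::communicator& world,
             const std::vector<OutboundSocket>& sockets) {
  std::vector<boost::mpi::request> requests;
  for (const OutboundSocket& s : sockets)
    requests.push_back(world.isend(s.target_rank, s.tag, s.data));
  return requests;  // completed later, e.g., via boost::mpi::wait_all
}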

In general, the non-blocking point-to-point methods are utilized to increase execution performance. This is crucial, as an MPI process should, in the optimal case, not wait for an outgoing transmission to finish before it executes another plugin. Such an approach is typically referred to as overlapping communication with computation. However, using a pure MPI approach and therefore non-blocking communication methods, such an overlap is rarely achieved. In fact, specialized hardware and software is required to achieve a reasonable overlap; for instance, Cray’s XE6 with Gemini interconnects is capable of delivering such an overlap [100]. A possible future extension would be a hybrid approach, utilizing MPI and threads to implement a truly asynchronous approach, thus achieving a considerably improved overlap of communication and computation. The DTPM scheduler currently does not support loops, as these are especially challenging due to the distributed nature of the components, which requires additional communication to orchestrate a looped execution.

The linear solver plugin introduced in Section 5.3.3 can be utilized with this scheduler, as each plugin is executed by one process. Therefore, one MPI process accesses the available computational resources via the parallel accelerator layer.

Distributed Data Parallel Mode

The DDPM scheduler kernel enables simulations based on the data parallel approach. Figure 5.20b depicts the flow diagram of the scheduler implementation. Contrary to the DTPM scheduler, the graph is not partitioned, as all plugins are processed by all MPI processes in the same sequence. The root process prepares the task graph and generates a prioritized list of plugins. This list is distributed to all MPI processes, each of which processes the graph in its entirety. As with the DTPM scheduler, each MPI process holds its own socket database responsible for storing the data associated with the sockets on the local process.
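Distributing the prioritized list can be sketched, for instance, with a Boost MPI broadcast; the representation of the execution order as a vector of plugin indices and the function name are assumptions for illustration.

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <vector>

// Minimal sketch: the root rank has prepared the task graph and the execution
// order; all other ranks receive the identical order and then process the
// entire graph themselves.
void distribute_order(const boost::mpi::communicator& world,
                      std::vector<int>& order) {
  boost::mpi::broadcast(world, order, 0);  // root rank 0 provides the order
}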


(a) DTPM (b) DDPM


Figure 5.20: (a) Flow diagram of the DTPM scheduler kernel. The root MPI process is responsible for preparing and distributing the workload evenly between the compute units. All available compute units process their distinct parts of the graph. The fact that this scheduler assigns parts of the graph to the compute units is indicated by the respective Subgraph nodes. (b) Flow diagram of the DDPM scheduler kernel. Similar to the DTPM scheduler, the root MPI process prepares the entire task graph. However, the entire workload is distributed to all MPI processes, as each process executes the entire task set represented in the task graph.


A peculiarity of the DDPM scheduler kernel is the fact that each plugin has access to an MPI communicator object via the comm method, providing access to the entire MPI environment. The following code snippet depicts an exemplary utilization in a plugin’s implementation to evaluate the rank of the current MPI process.

if (comm().rank() == 0) {
  // Root code
}

A Boost MPI communicator object offers an implicit conversion to a raw MPI communicator, ensuring interoperability with code and libraries based directly on the MPI C API.
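For instance, the communicator returned by comm can be passed directly to an MPI C routine, as in the following sketch; broadcast_flag is an illustrative function name and comm is assumed to return a boost::mpi::communicator.

#include <boost/mpi/communicator.hpp>
#include <mpi.h>

// Minimal sketch: the Boost MPI communicator converts implicitly to MPI_Comm
// and can therefore be handed to the raw MPI C API or MPI-based libraries.
void broadcast_flag(const boost::mpi::communicator& comm, int& flag) {
  MPI_Bcast(&flag, 1, MPI_INT, /*root=*/0, comm);  // implicit conversion to MPI_Comm
}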

Figure 5.21 shows the execution behavior of the scheduler. Each plugin is processed by all MPI processes and has access to an MPI communicator. Inter-plugin communication is provided by the socket data layer, whereas inter-process communication is supported by the MPI library.

Note that the utilization of the linear solver plugin introduced in Section 5.3.3 is not reasonable here, as in this case each process would redundantly perform the entire computation, thus wasting the available compute resources. For the scheduler at hand, an MPI-powered linear solver implementation is the proper choice, as provided by, for instance, the PETSc library.




Figure 5.21: Exemplary execution behavior of the DDPM scheduler based on two plugins and four MPI processes. The bars in the right part of the figure indicate the computational load. Each plugin is executed by every MPI process. Additionally, each plugin has access to an MPI communicator object, enabling not only classical data parallel execution modes but also plugin inter-process communication. Inter-plugin communication is realized by the socket mechanism (SCK).