5.3 External Solvers

Since the remarkably increased performance of today's average computers encourages even more costly simulations (for example, optimizations and mixed-mode device/circuit simulations), a further speed-up of the simulators is highly desirable. Thus, the development of highly optimized mathematical code must not be neglected. With regard to such new developments, an interface is provided to employ external solver modules.

At the moment, two external modules can be employed: first, the Parallel Sparse Direct Linear Solver PARDISO [182,117,181], which provides a multi-threaded direct solver as well as an LU-CGS iterative solver implementation; second, the Algebraic Multigrid Methods for Systems package SAMG [37,36], which provides not only multi-level algorithms, but also almost the same iterative solvers as the in-house module. Both external packages are written in FORTRAN. The only overhead of the interface to these modules is the conversion of the matrix storage formats (see Appendix E), which is regarded as negligible because the conversion requires only linear effort (see Section 5.5.2 for quantitative information).
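To illustrate why such a format conversion scales only linearly with the number of non-zero entries, the following C++ sketch converts a coordinate-format matrix to compressed sparse row (CSR) storage. The concrete formats used by the in-house module and the external packages are those described in Appendix E; the structures and the function below are purely illustrative and not taken from MINIMOS-NT.

#include <cstddef>
#include <vector>

struct Coo { std::vector<int> row, col; std::vector<double> val; };
struct Csr { std::vector<int> row_ptr, col_idx; std::vector<double> val; };

// Convert an n x n sparse matrix from coordinate to CSR format.
// Every loop runs over at most nnz (or n) entries, hence the linear effort.
Csr coo_to_csr(const Coo& A, int n)
{
    const std::size_t nnz = A.val.size();
    Csr B;
    B.row_ptr.assign(n + 1, 0);
    B.col_idx.resize(nnz);
    B.val.resize(nnz);

    for (std::size_t k = 0; k < nnz; ++k)      // count entries per row
        ++B.row_ptr[A.row[k] + 1];
    for (int i = 0; i < n; ++i)                // prefix sum yields row pointers
        B.row_ptr[i + 1] += B.row_ptr[i];

    std::vector<int> next(B.row_ptr.begin(), B.row_ptr.end() - 1);
    for (std::size_t k = 0; k < nnz; ++k) {    // scatter entries into place
        const int pos = next[A.row[k]]++;
        B.col_idx[pos] = A.col[k];
        B.val[pos]     = A.val[k];
    }
    return B;
}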

Since PARDISO provides a multi-threaded solver, a discussion of several parallelization issues is given in the next section.


5.3.1 Parallelization Issues

Parallelization techniques become important when a simulation is expected to take a ``long'' time. Thus, the main objective is the reduction of the simulation time, especially if time-to-market figures are also considered in the TCAD context. An additional benefit of parallelization is the efficient employment of all available computational resources.

Two kinds of long simulations can be distinguished: in the first kind, much time is spent on particular tasks such as solving large linear equation systems; the second kind consists of a large number of simulation runs with similar setups. In the latter case a single simulator run might be short, but thousands of them can take a prohibitively long time. Obviously, a combination of both makes parallelization even more necessary. Typical examples of these two kinds are three-dimensional device, process, and mixed-mode device/circuit simulations on the one hand, and expensive optimizations and multi-parameter steppings on the other.

Basically, the main objective of parallelization is the distribution of the simulation load over several CPUs and/or workstations in order to reduce the execution time. In the case of the first kind of simulations, the program execution itself is parallelized. Libraries such as OPENMP [145] or MPI [159] can be used to split the simulator into several threads or processes, provided the code is suited for parallelization. Due to the parallel execution, the total computational effort increases, but the real simulation time can be reduced.

For the second kind of simulations, another approach can be identified: the repeated execution of the simulator with similar setups can be distributed over several CPUs and/or a network. Since the single runs are independent of each other, the simulation time can be significantly reduced. Still, the computational effort increases if simulator-internal speed-ups or synergies are not fully used. From the engineer's perspective, a convenient administration of the various simulation setups and results must be provided; otherwise the overall results are likely to be inconsistent and thus counter-productive or even useless. Normally, the input/output system imposes the most prominent restriction on this kind of load distribution.


5.3.1.1 Distribution of Repeated Executions

The load distribution of repeated executions is extremely important during optimizations. For example, the calculation of gradients offers the possibility of simultaneously starting independent runs of the same simulator with slightly different setups. Another example in this field is the use of genetic optimizers [234].

The optimization framework SIESTA [232] includes an internal host management in order to employ various hosts for such calculations. Due to this distribution, SIESTA is able to significantly reduce the optimization time.

Besides optimizations, another field of application is fixed sets of similar simulations yielding combined results. A simple example is the calculation of an IV curve. A single-threaded simulator such as MINIMOS-NT calculates the operating points sequentially, using simulator-internal speed-ups such as good initial guesses for the next step. In addition, the pre- and post-processing has to be done only once. However, if several CPUs are available, each branch of the IV curve can be calculated independently, multiplying the pre-processing effort, but reducing the overall real simulation time. See Figure 5.3 for a comparison of the two approaches.

Figure 5.3: Concept of the single-process (left) and multi-process (right) stepping algorithm. Whereas the former calculates all 55 steps sequentially, the latter uses five processes to calculate eleven steps each.
\includegraphics[width=0.48\linewidth]{figures/iv_single.eps} \includegraphics[width=0.48\linewidth]{figures/iv_multi.eps}
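A minimal sketch of the partitioning idea behind the multi-process variant of Figure 5.3 (the actual distribution is handled by the tools discussed below, such as SEILIB, not by this illustrative code):

#include <utility>
#include <vector>

// Split n_steps operating points into contiguous blocks, one per process.
// With 55 steps and 5 processes this yields 5 blocks of 11 steps each,
// as depicted in Figure 5.3 (right).
std::vector<std::pair<int, int>> split_steps(int n_steps, int n_processes)
{
    std::vector<std::pair<int, int>> blocks;   // [first, last) step per process
    int begin = 0;
    for (int p = 0; p < n_processes; ++p) {
        int count = n_steps / n_processes + (p < n_steps % n_processes ? 1 : 0);
        blocks.emplace_back(begin, begin + count);
        begin += count;
    }
    return blocks;
}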

As stated above, the administration of such a parallelization approach might be difficult, cumbersome, and error-prone. Since the simulator itself is not parallelized, additional functionality is required at a higher application level. As already done for optimizations, so-called meta or parameterized input-decks are systematically used to create the actual simulator setups.

Here, a counterpart to the SIESTA system is required to relieve the user of the administration of these setups and their results as well as of fault-tolerant host management, simulator calling, and consistency checking. All of these tasks are performed by the new library SEILIB, which is discussed in Appendix C.3. As an application example, the test of the MINIMOS-NT simulator is presented in Appendix C.4. This test takes more than one day on a single-CPU Linux machine, but can be run in approximately one hour on an IBM cluster with 32 CPUs. As an additional benefit, this library can also be used to combine several programs into complete simulation flows and to perform optimizations.


5.3.1.2 Parallelization of the Program Execution

Whereas the load distribution of repeated executions does not require modifications of the simulators, this section gives a short introduction to the parallelization of the code itself. This task frequently raises severe design and implementation problems, which might be hard to resolve for already existing single-threaded codes. However, particular tasks which are sufficiently modularized can be targets for simulator-internal parallelization.

In the context of semiconductor device and process simulation, a significant part of the computation time is spent on solving linear equation systems. As can be seen in Table 5.2, this share ranges from about 40% for small two-dimensional device simulations up to 90% for large mixed-mode simulations. However, well-known bottlenecks limit the achievable performance gain. Direct algorithms are a promising parallelization target, as shown in several publications, for example on the OPENMP-based PARDISO [181] or the MPI-based HSL codes [56].

The implementation of PARDISO is based on OPENMP [145], which has become a standard for shared-memory parallelization. In comparison to single-threaded program development, additional design and programming complexity has to be taken into account: algorithms are defined not only in terms of what work they are supposed to perform, but also of how this work can be distributed over more than one processor. OPENMP is therefore a programming model that supports the implementation of parallel algorithms. Based on the standard programming languages FORTRAN and C/C++, OPENMP provides a set of compiler directives and a small function library. In summary, the main objective of OPENMP is to provide a standard and portable Application Programming Interface (API) for implementing shared memory parallel programs [34].
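As an illustration of this programming model (a minimal sketch, not taken from PARDISO or MINIMOS-NT), the following C++ fragment parallelizes a sparse matrix-vector product in CSR format, one of the basic kernels of iterative solvers, with a single OPENMP directive:

#include <omp.h>
#include <vector>

// y = A*x for a sparse matrix A stored in CSR format.
void csr_matvec(int n,
                const std::vector<int>& row_ptr,   // size n+1
                const std::vector<int>& col_idx,   // column indices of non-zeros
                const std::vector<double>& val,    // non-zero values
                const std::vector<double>& x,
                std::vector<double>& y)
{
    // Each row is independent, so the loop can be distributed over the
    // available threads by one compiler directive.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}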

Consequently, OPENMP is primarily designed for shared memory multiprocessors, where all available processors can directly access the complete memory of the machine. The alternative is distributed memory, where memory is physically associated with each processor. Since each processor can access only its local memory, the programmer is responsible for mapping the program data accordingly. Messages have to be passed to access memory associated with other processors. For this purpose, so-called Message Passing Interfaces (MPI [159]) and Parallel Virtual Machines (PVM) are employed. There are also high-level language approaches for the automatic generation of low-level messages, for example High Performance Fortran (HPF [119]). Distributed memory systems are particularly employed if the number of processors exceeds the limit for shared memory systems. A schematic comparison of the shared and distributed memory architectures is depicted in Figure 5.4 [34].
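For comparison, a minimal MPI sketch (again purely illustrative, not taken from any of the cited packages) shows the explicit communication that distributed memory requires: each process computes a partial dot product of its locally stored vector segments, and the partial results are combined by message passing.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Rank and size identify this process within the communicator.
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process owns only its local part of the distributed vectors.
    const int local_n = 1000;
    std::vector<double> x(local_n, 1.0), y(local_n, 2.0);

    double local_sum = 0.0;
    for (int i = 0; i < local_n; ++i)
        local_sum += x[i] * y[i];

    // The global result requires explicit communication between the processes.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}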

Figure 5.4: Comparison of a canonical shared memory (left) and a message-passing non-shared memory (right) architecture [34].
\includegraphics[width=13cm]{figures/solver-memarch.eps}


5.3.2 PARDISO

PARDISO is a high-performance and memory-efficient package for solving large symmetric and unsymmetric linear equation systems on shared memory multi-processors [39]. The implementation is based on OPENMP [145], and a combination of left- and right-looking Level-3 BLAS supernode techniques is used [182]. To improve the performance of sequential and parallel factorizations, Level-3 BLAS updates are used.

In terms of the provided features, PARDISO is well suited as an alternative to the internal solver modules. Basically, it solves real- and complex-valued linear equation systems $\mathbf{A}\mathbf{x} = \mathbf{b}$, where multiple right-hand-side vectors $\mathbf{b}$ may be specified. PARDISO actually supports an even wider range of sparse matrix types, not all of which are required by MINIMOS-NT: real- and complex-valued, symmetric, structurally symmetric and unsymmetric, positive definite, indefinite, and Hermitian sparse linear equation systems.

PARDISO can also reuse structures already existing in memory (cf. Newton adjustment). Among the four existing phases, the required ones can be specified [39]:

  1. Phase 1: Fill-reduction analysis and symbolic factorization.
  2. Phase 2: Numerical factorization.
  3. Phase 3: Forward and backward solve including iterative refinements.
  4. Phase < 0: Termination and deallocation phase.

For the first call, phases one to three are performed, whereas for subsequent steps only phases two and three have to be repeated. On exit, the termination phase is invoked. Note that due to the already established Newton adjustment levels (see Section 4.12), the solver interface always has the complete information on whether the symbolic phase has to be repeated or not.
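The following C++ sketch outlines how such a phase selection could look in a wrapper around the library. The function call_pardiso() is a hypothetical stand-in for the actual PARDISO routine, whose argument list depends on the release in use; the combined phase codes follow the convention of the PARDISO documentation [39].

// Hypothetical stand-in for the actual PARDISO routine; a real wrapper would
// forward the phase code together with the matrix, right-hand side, and
// parameter arrays to the library.
void call_pardiso(int phase) { (void)phase; }

// Reuse level as derived from the Newton adjustment (see Section 4.12).
enum class Reuse { None, SymbolicPattern };

void solve_step(bool first_call, Reuse reuse)
{
    if (first_call || reuse == Reuse::None)
        call_pardiso(13);   // phases 1-3: analysis, factorization, solve
    else
        call_pardiso(23);   // phases 2-3: the symbolic analysis is reused
}

void release_solver()
{
    call_pardiso(-1);       // termination and deallocation phase
}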

The user of the module has the choice between two different solvers: either the direct LU-factorization or the combined direct/iterative method LU-CGS. See Appendix A.12 on how to select PARDISO in the input-deck.


5.3.3 Algebraic Multigrid Methods for Systems

The efficiency of solving large linear equation systems arising from discretized partial differential equations can be increased by applying hierarchical algorithms [213]. Over the last decades, the multigrid principle has turned out to be a promising approach, and therefore the interest in algebraic methods has been increasing. Such methods automatically derive the hierarchy based on algebraic information contained in the discretization matrix. Geometric multigrid methods, on the other hand, use a hierarchy of grids which are defined by coarsening based on grid information.

Two reasons for employing algebraic multigrid methods can be given:

  1. The increased geometrical complexity of the applications, which restricts the use of geometric multigrid approaches.
  2. The demand for solver modules which can be plugged into an existing environment without modifying the already existing interface. Thus, algebraic multigrid approaches can easily be used as an alternative to one-level solver modules without modifying the complete code.

Algebraic multigrid or AMG was the first hierarchical, matrix-based multigrid method. The ideas of geometric multigrid, that is, the coarsening and smoothing of the grids, have been extended to specific classes of linear equation systems. The automatically derived hierarchy consists of linear equation systems whose dimension decreases from level to level. Matrix entries are used to derive the operators that transfer information between two levels. Due to this automatic procedure, the robustness, adaptability, and flexibility are remarkably increased.
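In standard two-grid notation (a generic sketch of the principle, not the specific SAMG algorithm), the coarse system and the coarse-grid correction read

$\mathbf{A}_c = \mathbf{R}\,\mathbf{A}\,\mathbf{P}, \qquad \mathbf{x} \leftarrow \mathbf{x} + \mathbf{P}\,\mathbf{A}_c^{-1}\,\mathbf{R}\,(\mathbf{b} - \mathbf{A}\,\mathbf{x}),$

where the prolongation $\mathbf{P}$ and the restriction $\mathbf{R}$ (often $\mathbf{R} = \mathbf{P}^{T}$) are constructed from the entries of $\mathbf{A}$. Combined with smoothing on every level and applied recursively, this two-grid correction yields the multi-level cycle.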

The SAMG package (Algebraic Multigrid Methods for Systems) [37] has been fully coupled to the internal solver module. Like PARDISO, it can be used as an alternative to the in-house solver module. See Appendix A.12 on how to select these solvers.

As SAMG is a package of several solver systems, the user can choose between four configurations of preconditioners and solvers (accelerators): AMG-BICGSTAB, AMG-GMRES(M), ILU-BICGSTAB, and ILU-GMRES(M). The SAMG interface is implemented in a separate function. Since SAMG can handle real-valued equation systems only, the approach of splitting a complex-valued system into a real-valued system of double dimension, as discussed in Section 2.6.2, is used. As with the internal and PARDISO solvers, SAMG also provides specific features for solving systems with multiple right-hand-side vectors and for reusing already allocated structures during repeated calls.
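In one common form of this splitting (the variant actually used is the one given in Section 2.6.2), a complex-valued system $\mathbf{A}\,\mathbf{x} = \mathbf{b}$ with $\mathbf{A} = \mathbf{A}_r + \mathrm{i}\,\mathbf{A}_i$, $\mathbf{x} = \mathbf{x}_r + \mathrm{i}\,\mathbf{x}_i$, and $\mathbf{b} = \mathbf{b}_r + \mathrm{i}\,\mathbf{b}_i$ becomes the real-valued system of double dimension

$\begin{pmatrix} \mathbf{A}_r & -\mathbf{A}_i \\ \mathbf{A}_i & \mathbf{A}_r \end{pmatrix} \begin{pmatrix} \mathbf{x}_r \\ \mathbf{x}_i \end{pmatrix} = \begin{pmatrix} \mathbf{b}_r \\ \mathbf{b}_i \end{pmatrix}.$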


