5.1 Third Party Modules and Libraries

Since solving linear equation systems is a very common computational task, this section gives a short overview of the available third-party solutions.

The Basic Linear Algebra Subprograms (BLAS) provide high-quality building-block routines for performing basic vector and matrix operations: Level 1 BLAS for vector-vector operations, Level 2 BLAS for matrix-vector operations, and Level 3 BLAS for matrix-matrix operations [18]. Since BLAS libraries are efficient, portable, and widely available for many different platforms, many mathematical developments are based on them.
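To make the three levels concrete, the following minimal sketch issues one representative call per level through the CBLAS interface (the header name and build environment are assumptions, e.g. the reference CBLAS; error handling and realistic problem sizes are omitted):

#include <cblas.h>

int main() {
    const int n = 3;
    double x[] = {1.0, 2.0, 3.0};
    double y[] = {4.0, 5.0, 6.0};
    double A[] = {1.0, 0.0, 0.0,
                  0.0, 1.0, 0.0,
                  0.0, 0.0, 1.0};   // 3x3 identity, row-major
    double B[9] = {0.0};
    double C[9] = {0.0};

    // Level 1 (vector-vector): y := 2*x + y
    cblas_daxpy(n, 2.0, x, 1, y, 1);

    // Level 2 (matrix-vector): y := A*x + y
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 1.0, y, 1);

    // Level 3 (matrix-matrix): C := A*B
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    return 0;
}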

The original objective of the Linear Algebra PACKage (LAPACK), which is built on top of the BLAS library, was to provide a version of the widely used EISPACK [202] and LINPACK [53] libraries that runs efficiently on shared-memory vector and parallel processors. The bottleneck of those older packages is their memory access patterns, which disregard the multi-layered memory hierarchies of modern machines; too much time is thus spent on moving data rather than on useful floating-point operations. LAPACK addresses this problem by making use of block matrix operations, such as matrix multiplication, in the innermost loops. Since these block operations can be optimized for various architectures to account for the memory hierarchy, they provide a way to achieve high efficiency on diverse modern machines [13]. The Scalable LAPACK (SCALAPACK) library provides a subset of LAPACK routines for distributed-memory parallel computers [21]. Whereas SCALAPACK is applicable to dense systems, the PSBLAS package [64] provides parallel linear algebra computations on sparse matrices; it addresses the parallel implementation of iterative solvers and is designed for distributed-memory computers.
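From the user's perspective, the blocking is hidden behind simple driver routines. The following sketch solves a small dense system with LAPACK's dgesv driver through the LAPACKE C interface (the interface choice is an assumption; internally the driver relies on a blocked, BLAS-based LU factorization):

#include <lapacke.h>
#include <cstdio>

int main() {
    // Solve A*x = b, where A is a 2x2 dense matrix in row-major order.
    double A[] = {4.0, 1.0,
                  1.0, 3.0};
    double b[] = {1.0, 2.0};   // right-hand side, overwritten with x
    lapack_int ipiv[2];        // pivot indices of the LU factorization

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    A, 2, ipiv, b, 1);
    if (info == 0)
        std::printf("x = (%g, %g)\n", b[0], b[1]);
    return 0;
}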

5.1.1 Commercial Libraries

The NAG libraries of the Numerical Algorithms Group contain a wide range of robust numerical and statistical routines, for example for linear algebra, eigenvalue analysis, and differential equations [155]. The libraries are offered in FORTRAN 77, Fortran 90, and C, as well as in an SMP Library for shared-memory and a Parallel Library for distributed-memory parallel computing.

The IMSL (International Mathematical and Statistical Library) libraries of Visual Numerics provide accurate and reliable FORTRAN algorithms with full coverage of mathematics and statistics. According to [226], the libraries have been a cornerstone of high-performance and deep computing as well as of predictive analytics applications in science, technical, and business environments for well over three decades.

The HSL Software Library, formerly known as the Harwell Subroutine Library and commercially distributed by Hyprotech UK Ltd, offers a collection of portable, fully documented FORTRAN packages. Besides the commercial distribution, there are special licensing arrangements for academic users. The library is particularly suited to sparse matrix computations and large-scale optimization. HSL routines can be found in advanced software applications, for example in chemical engineering and finite element modeling.


5.1.2 Libraries of Hardware Vendors

Intel offers the Math Kernel Library (MKL), which provides highly optimized, thread-safe mathematical routines for high-performance computing (HPC), science, engineering, and financial applications and is designed to extract maximum performance from Intel processors [107]. The key features of the MKL are its optimization for recent Intel platforms, automatic run-time detection of the CPU actually used, scaling in multi-processor environments, thread-safety, and royalty-free distribution of the run-time library; a non-commercial development license for Linux is also free. The functionality embraces not only linear algebra (BLAS and LAPACK) but also discrete Fourier transforms, vector mathematics, and a vector statistical library with random number generators.

The AMD Core Math Library (ACML) for Linux and Windows offers BLAS, LAPACK, and FFT routines, which allow a wide range of software developers to obtain excellent performance on AMD platforms. The highly optimized library provides numerical functions for mathematical, engineering, scientific, and financial applications. ACML is available as both a 32-bit and a 64-bit library; the latter fully exploits the large memory space and improved performance offered by AMD's 64-bit architectures [3].

Applications running on IBM pSeries computers can take advantage of the optimized ESSL (Engineering and Scientific Subroutine Library) routines, which both reduce programming effort and improve performance. The library provides optimized versions of the dense matrix kernels contained in the BLAS and LAPACK libraries. Another approach to improving application performance is the IBM Mathematical Acceleration SubSystem (MASS) library, which provides high-performance versions of a subset of intrinsic functions while sacrificing a small amount of accuracy [2].

5.1.3 Recent Developments

An overview of parallel frontal solvers for large sparse linear systems can be found in [190]; these codes were implemented in the HSL package. As a variant of Gaussian elimination, the frontal method was originally developed [109] to solve large linear equation systems. Because the computational work is limited to a relatively small frontal matrix, large problems could be solved even on computers with small memories: the assembly and elimination procedures were combined, and the additional data was written to secondary storage. Today, the motivation for frontal solvers arises from the fact that the frontal matrices can be stored in full format [191], which allows high-level BLAS routines to be employed. Furthermore, multiple-front methods were developed, which process subproblems simultaneously and thus allow the factorization to be parallelized.

A new parallel direct solver for large sparse linear systems is presented in [56]; it is also incorporated in the HSL package as the HSL_MP48 algorithm. This algorithm is designed to solve highly unsymmetric systems by employing several processors, typically up to 16. It is based on the serial code MA48, which implements a sparse variant of Gaussian elimination with sparse data structures and threshold pivoting. Since MA48 stores the matrix and its factorization in main memory, HSL_MP48 partitions the problem in order to reduce the memory demand per processor as well as to speed up the factorization. The parallel approach is based on partitioning the system matrix into a small number of loosely connected submatrices (singly bordered block diagonal form [56]). A modified version of MA48 is then used to factorize the submatrices, which can be done in parallel; the interface problem can be solved by any sparse solver. Three advantages over a general parallelization of a sparse solver are given: all data can be preassigned to the processors in advance, the factorizations of the submatrices can be done in parallel, and the interprocessor communication required for the interface problem is limited and structured. For the majority of test cases arising from chemical process engineering problems, HSL_MP48 outperforms MA48 when two processors are employed; for some problems, speed-ups in excess of four are achieved with eight processors. Since the code does not require a shared file system and is based on MPI [159], it can be applied on both shared-memory and distributed-memory machines (see Section 5.3.1).
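For illustration, a matrix in singly bordered block diagonal form with N diagonal blocks can be sketched as follows (a generic depiction; the block sizes and the border width depend on the partitioning):

\begin{displaymath}
A =
\begin{pmatrix}
A_{11} &        &        & C_{1}  \\
       & \ddots &        & \vdots \\
       &        & A_{NN} & C_{N}
\end{pmatrix}
\end{displaymath}

Each processor can factorize one rectangular block (A_{ii} C_{i}) independently; the border columns C_{i} couple the submatrices and give rise to the interface problem mentioned above.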

The comparison in [93] covers different strategies, mainly regarding the numerical factorization and the target architecture. It concludes that improvements have been achieved in all phases of solving sparse linear equation systems and that these improvements are the main reasons for the significant speed-ups observed.

The following solvers are compared [93]:

Two additional modules are PARDISO [182] and SAMG [37]. The former is also provided with DESSIS [111] and the MKL [107]. The latter is industrially employed for computational fluid dynamics in the car industry, oil reservoir simulations, electromagnetics, groundwater flow, semiconductor physics, and structural mechanics [213]. Both packages are discussed in more detail later in this chapter; see Sections 5.3.2 and 5.3.3, respectively.

5.1.4 Solver Frameworks

The Boeing Math Group commercially provides the Intelligent Iterative Solver Service (IISS), which is motivated by the performance advantages of iterative solvers for large sparse linear equation systems. Which iterative solver is optimal varies from problem to problem, and these methods are known to show irregular performance across applications. Furthermore, it can be difficult for users to choose the parameters required for good performance. For these reasons, the Boeing Math Group provides a collection of state-of-the-art iterative solvers [24].

This collection contains BCSLIB-EXT [23], a preconditioned iterative solver that has also been used for semiconductor device simulation. Additionally available are Hypre [63], scalable software for solving large sparse linear equation systems on massively parallel computers, and AZTEC [222], a massively parallel iterative solver library with advanced partitioning techniques and dense matrix algorithms [24].

Trilinos is a project aiming to provide parallel solver algorithms and libraries for the solution of large-scale, complex multi-physics engineering and scientific applications. Trilinos is developed as an object-oriented software framework and is organized into packages [180]. Each package focuses on important, state-of-the-art algorithms in its problem regime, is developed by a small expert group, and is self-contained as well as configurable, buildable, and documented on its own [61]. For example, Trilinos supports the object-oriented version of AZTEC, SuperLU, Hypre, HSL, and many more. In addition, three specific package collections are provided: one for linear algebra classes, one for abstract solver classes, and one for basic tools such as BLAS and MPI interfaces.


5.1.5 Special-Purpose Libraries

The Automatically Tuned Linear Algebra Software (ATLAS) provides portable linear algebra software, including a complete C and FORTRAN 77 BLAS API and a very small subset of the LAPACK API. For all supported operations, ATLAS achieves performance comparable with that of machine-specific hand-tuned libraries [224,238]. Thus, ATLAS can be used as an alternative to vendor-tuned BLAS implementations.

Blitz++ is a C++ class library for scientific computing whose performance is comparable with that of FORTRAN 77/90 [157]. The library incorporates dense arrays and vectors, random number generators, and small vectors and matrices; it is available under an open source license. The basic idea is to use the advanced features of the C++ programming language while avoiding their usual performance drawback. The speed-up is not achieved by better compiler optimization and related measures, but by using templates, which allow optimizations such as loop fusion, unrolling, tiling, and algorithm specialization to be performed automatically at compile time. Thus, instead of relying on compiler optimizations, the Blitz++ library parses and analyzes array expressions at compile time and performs the respective loop transformations in order to speed up the code. In addition, Blitz++ extends the conventional dense array model by new and useful features, such as flexible storage formats, tensor notation, and index placeholders [157].
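The following minimal sketch illustrates the expression-template idea with the Blitz++ array class (array sizes and values are arbitrary):

#include <blitz/array.h>

int main() {
    blitz::Array<double, 1> a(1000), b(1000), c(1000), d(1000);
    b = 1.0;
    c = 2.0;
    d = 3.0;

    // The right-hand side is captured as an expression template and
    // evaluated in a single fused loop generated at compile time; no
    // temporary arrays for b + c or (b + c) * d are allocated.
    a = (b + c) * d;

    return 0;
}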

Finally, there is Boost, a set of free, portable C++ source libraries [27] which fully comply with the C++ standard and work well with the C++ standard library. The libraries rely heavily on C++ templates, and parts of Boost are expected to be included in the next C++ standard. One template class library concerns linear algebra: uBLAS provides BLAS functionality for dense, packed, and sparse matrices.

The design and implementation are based on operator overloading and efficient code generation via templates. Classes are provided for dense, unit, and sparse vectors as well as for dense, identity, triangular, banded, symmetric, hermitian, and sparse matrices. Ranges, slices, and adaptor classes can be defined to obtain views into vectors and matrices. The library provides reductions such as various norms; addition and subtraction of vectors and matrices as well as multiplication with a scalar; inner and outer products of vectors; matrix-vector and matrix-matrix products; and triangular solvers. The implementation largely conforms to the STL, including the iterator interface [26]. Known limitations are that the implementation assumes a linear memory address model and that dense matrices have been the focus of tuning.
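A brief usage sketch of uBLAS, reduced to its essentials (matrix contents are arbitrary):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/io.hpp>
#include <iostream>

int main() {
    using namespace boost::numeric::ublas;

    matrix<double> A(3, 3);
    vector<double> x(3);
    for (unsigned i = 0; i < A.size1(); ++i) {
        x(i) = 1.0;
        for (unsigned j = 0; j < A.size2(); ++j)
            A(i, j) = 3.0 * i + j;
    }

    // Matrix-vector product, generated via operator overloading and
    // expression templates.
    vector<double> y = prod(A, x);
    std::cout << y << std::endl;
    return 0;
}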

5.1.6 Discussion

As can be seen, a large variety of solver systems and packages is available, and many of them are still being developed and improved. Since they are highly optimized, particularly for specific platforms, the two-strategy approach of the in-house solver module presented here appears promising: on the one hand, own implementations are provided; on the other hand, the application can benefit from external codes that are directly coupled to the solver module.

However, the integration of external modules is deliberately restricted and does not embrace the full spectrum of available systems presented above. So far, only two external modules with specific properties have been made available. The choice of the external modules depends on the following considerations:

  1. Performance should be the main priority for providing different solvers. As evaluated, both integrated external solver systems show a higher performance for specific kinds of simulations than the in-house solver systems. The choice was therefore motivated by performance advantages rather than by offering alternative solvers in the hope of getting non-converging simulations to run. Whereas some solver systems are known to be advantageous for ill-conditioned problems (see Section 5.5.1), a trial-and-error convergence strategy frequently hides problems in the underlying simulation setup. Such shortcomings can lead to questionable overall simulation results, even though one particular linear solver may be able to deliver a result with the desired accuracy at each step of the Newton approximation.

  2. Many users of the simulators regard the linear solver modules as black boxes which are supposed to deliver the ``correct'' solution vector. There are plenty of settings for both the assembly and the solver module, but in most cases the default settings are used. Therefore, several solvers are integrated into automatic solver hierarchies (see Section 5.4), which lose efficiency as the number of solvers increases. Furthermore, to avoid contradicting the first item regarding the convergence problem, appropriate feedback has to be given: it can be useful to obtain a questionable solution together with respective feedback in order to specifically improve the simulation setup.

  3. Typical simulators require a specific set of capabilities which should be available. Independently of the particular implementation and target architecture, the key demands on the solver module can be summarized as follows: solution of sparse real-valued non-symmetric and complex-valued non-hermitian linear equation systems, speed-up features like Newton adjustment and multiple right-hand-side vectors, a shared library concept, as well as a control concept based on parameters. If complex-valued linear equation systems are not directly supported, the interface can transform them into real-valued systems of twice the dimension (see Section 2.6.2 and the sketch after this list). Since the linear equation systems are sparse, solver systems designed for sparse matrices are preferred, because the memory consumption of compressed storage formats is far lower than that of full matrices (see Section 4.6). However, due to linear memory addressing, algorithms working on full or band matrices can be optimized far better.

  4. Since the simulators are provided for several platforms, for example Linux and AIX, portability of the libraries is a critical issue. Platform-dependent libraries normally utilize all capabilities of the respective hardware. However, if the functionality and/or API used in the interface is not available or not efficient on alternative platforms, the module should not be chosen.

  5. As already stated in the introduction of this chapter, economic and legal considerations must not be neglected. First of all, the various license agreements have to be taken into account. Besides the technical evaluation, the decision whether to employ a module bound to commercial run-time or development licenses must take the market, the financial return, the costs, and the economic situation of the vendor into account.
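As referenced in the third item, a complex-valued system Ax = b can be rewritten as an equivalent real-valued system of twice the dimension. One common variant (the exact form used by the interface is the one given in Section 2.6.2) splits A = A_r + iA_i, x = x_r + ix_i, and b = b_r + ib_i:

\begin{displaymath}
\begin{pmatrix} A_r & -A_i \\ A_i & \phantom{-}A_r \end{pmatrix}
\begin{pmatrix} x_r \\ x_i \end{pmatrix}
=
\begin{pmatrix} b_r \\ b_i \end{pmatrix}
\end{displaymath}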

