next up previous contents
Next: 5.6 Concluding Remarks Up: 5. The Solver Module Previous: 5.4 Solver Hierarchy

Subsections



5.5 Practical Evaluation of the Solvers

In the course of this work the performance of these solvers has been evaluated. Rather than using a set of single matrices, the approach is based on complete simulations with consistent settings, as typically encountered during daily work.

In mathematical publications, different solver systems are normally compared by applying them for single linear equation systems. For example, the matrix market [139,25] provides such a data repository for use in comparative studies of linear algebra algorithms. The repository does not only provide matrices, but also matrix generation software in a wide variety of scientific and engineering disciplines. For each matrix and matrix set details regarding the matrix properties and a visualization of the matrix structure is provided. The matrices can be downloaded in several different text file formats. A matrix generator is either a static software for download, a Java applet, or a form-based web service to process requests for generating matrices. In 2000, 482 individual matrices and 25 matrix generators were available. The database includes the Harwell-Boeing Sparse Matrix Collection (Release I) [55], the SPARSKIT collection [178], and the Non-Hermitian Eigenvalue Problem collection [14].

Such an approach is particularly useful for the evaluation of new mathematical algorithms. The conclusions drawn from the results are used to improve the respective implementations. However, from the perspective of the implementation of a general-purpose simulator, the objective is quite different. Here, a set of solvers are available and the one which fits best for the respective problem shall be chosen. In contrast to the solver development, the research on this topic is restricted to practical evaluations regarding the respective simulators.


5.5.1 Test Examples and Processing

The 16 examples summarized in Table 5.1 were taken from current scientific projects at the institute. They consist of field effect, bipolar, and silicon-on-insulator transistors. Structures such as FINFETs are analyzed by means of three-dimensional simulations.

All examples were simulated on:


Table 5.2: This table provides general quantitative information about the simulations using the in-house ILU-BICGSTAB on the IBM cluster. After the index of the simulation, the dimensions of the inner and complete equation system is given. The next column shows the share of the inner in respect to the complete system. In order to analyze how the simulation time is spent, the remaining columns contain the shares of the initialization, pre-processing, assembling, solving (includes all steps shown in Table 5.3), quantity update, and post-processing, respectively.
# Simulation Type Dimension Entries ([ $ {\,}^0\!/_{00}$]) DC it Remark
1 MOSFET 2D device 2,704 23,662 (3.24) 17 test structure with $ L_{\mathrm{G}} = 1\,{\mathrm{{\mu}m}}$
2 Flash cell 2D device 5,967 47,956 (1.35) 24 tunneling effects
3 Pin-diode 2D device 6,335 56,127 (1.40) 13 optical generation model
4 Bjt transistor 2D device 6,389 33,869 (0.83) 23 ET transport model
5 SA-LIGBT 2D device 16,774 158,223 (0.56) 27 two V $ _{\mathrm{D}}$ steps: 0$ \,$V, 5$ \,$V
6 SiGe HBT 2D device 19,313 210,915 (0.57) 32 self-heating
7 Colpitts oscillator circuit (1) 3,928 35,002 (2.27) 41 transient (400 steps)
8 Amplifier circuit (1) 6,391 35,291 (0.86) 30 ET transport model
9 Ring oscillator circuit (10) 25,246 226,931 (0.36) 29 transient (100 steps)
10 2-input nand gates circuit (8) 146,614 1,347,138 (0.06) 23 transient (50 steps)
11 3-input nand gates circuit (12) 219,920 2,020,706 (0.04) 23 transient (50 steps)
12 MagFET 3D device 85,308 921,860 (0.13) 36 magnetic field
13 FinFET 3D device 81,037 807,150 (0.12) 13 thin SOI finger
14 SOI 3D device 87,777 997,296 (0.13) 23 two V $ _{\mathrm{D}}$ steps: 0$ \,$V, 0.1$ \,$V
15 HBT 3D device 175,983 1,833,138 (0.06) 35 self-heating
16 LD-MOSFET 3D device 167,197 1,925,553 (0.07) 25 power device; two V $ _{\mathrm{D}}$ steps: 0$ \,$V, 1.0$ \,$V



Table 5.1: Six two-dimensional, five mixed-mode device/circuit (the number of devices is given in parenthesis), and five three-dimensional simulations were used for evaluating the solver performance. The dimension of the linear equation system, the number of non-zero entries, and the typical number of DC Newton iterations are given.
# Inner Complete Ratio [%] CPU [s] Init [%] Pre [%] Asm [%] Lin [%] Upd [%] Post [%]
1 2,704 2,920 92.60 1.34 14.93 6.72 31.34 34.33 3.73 2.99
2 5,967 13,915 42.88 7.02 3.56 5.98 46.15 40.17 1.71 1.14
3 6,335 6,451 98.20 5.06 16.01 7.51 17.19 55.53 0.59 2.17
4 6,389 6,754 94.60 4.84 9.09 5.99 28.10 51.24 1.45 1.65
5 16,774 17,705 94.74 39.64 12.66 4.69 37.24 42.73 0.33 1.54
6 19,313 19,836 97.36 28.29 39.46 3.45 10.35 44.46 0.24 1.24
7 3,928 5,475 71.74 32.60 1.53 6.20 17.61 64.39 1.20 7.15
8 6,391 8,056 79.33 63.06 0.78 4.60 30.13 59.94 0.76 3.00
9 25,246 37,007 68.22 226.97 0.81 4.88 13.69 76.16 2.07 1.74
10 146,614 202,168 72.52 3,641.37 0.06 2.88 8.34 87.13 0.33 1.10
11 219,920 303,250 72.52 7,671.19 0.04 2.40 6.78 89.60 0.26 0.80
12 85,308 90,203 94.57 593.55 0.39 2.52 7.24 89.03 0.10 0.48
13 81,037 98,870 81.96 99.13 4.68 4.07 9.69 79.72 0.22 1.12
14 87,777 93,572 93.81 190.19 1.45 5.41 23.19 67.32 0.19 1.90
15 175,983 226,687 77.63 988.29 0.65 2.54 6.97 88.42 0.14 1.04
16 167,197 179,111 93.35 495.99 0.74 3.61 9.98 83.93 0.15 1.20


The extensible benchmark is processed by a single program, which has been implemented as a SEILIB application (see Section C.3). The real (wall-clock) and user time is measured by the GNU time command, and the fastest of three consecutive runs is taken.


5.5.2 General Quantitative Comparisons

Table 5.2 contains information on how the simulation time is actually spent. The simulator was started on the IBM cluster and the in-house ILU-BICGSTAB was selected. The first columns show the number of the example, the dimensions of the inner and complete system, and the ratio $ dim_\mathrm{inner}/dim_\mathrm{complete}$. A lower ratio indicates that more equations are pre-eliminated and thus more time is spent for pre-elimination. The CPU column shows the absolute user time required for the respective example. The remaining columns show the shares of the simulation time spent in selected modules or for selected tasks: for initialization, pre-processing, assembling the linear equation systems (including the calculation of the model contributions), the share of the linear modules including all steps shown in Table 5.3, the update of the quantities, and eventually the postprocessing.

The most important data in the context of that chapter is how much time is spent for the linear system after the assembly has been completed (the Lin column). For the smaller two-dimensional examples that share is below 50%, that means more time is spent for calculating and assembling the model contributions as well as for pre- and post-processing. However, for all other simulations including the larger two-dimensional ones, that share rises significantly even up to more than 90%. Thus, for increasing the efficiency of advanced device simulations and optimization, it is essential to reduce the relative effort spent on preparing and solving the linear equation systems.


Table 5.4: This table shows the same information as Table 5.3, but without the Newton adjustment.
# Time [s] MCSR [%] Compile [%] Pre [%] Sort [%] Scale [%] ILU [%] BICGSTAB [%] Back [%]
1 0.49 0.65 10.93 6.11 2.25 17.02 32.93 29.52 0.38
2 2.74 0.67 6.73 5.07 1.20 8.98 35.63 41.16 0.51
3 2.69 0.27 4.20 2.26 1.05 8.82 42.71 40.54 0.12
4 2.40 0.45 6.69 4.21 1.64 10.70 37.67 38.38 0.19
5 27.67 0.16 4.58 3.08 0.99 8.53 34.37 48.16 0.12
6 13.11 1.25 5.01 3.02 1.10 6.93 36.06 46.51 0.11
7 35.17 0.26 4.75 3.87 0.73 5.97 42.32 41.77 0.31
8 45.53 0.19 5.65 4.31 1.10 8.04 40.00 40.48 0.20
9 224.52 0.26 3.28 2.66 0.66 4.04 30.85 58.04 0.19
10 3139.28 0.13 1.83 1.79 0.40 2.51 29.26 63.93 0.14
11 6855.87 0.12 1.38 1.48 0.31 1.97 27.53 67.11 0.11
12 504.36 3.99 2.32 1.14 0.50 2.39 16.67 72.95 0.04
13 94.43 9.82 4.95 2.75 0.88 3.84 23.16 54.47 0.12
14 146.11 1.08 3.48 2.48 1.01 5.17 26.68 59.97 0.13
15 921.04 37.04 3.51 1.69 0.58 2.71 11.33 43.08 0.07
16 447.12 1.47 2.45 1.82 0.68 3.12 28.21 62.16 0.09



Table 5.3: This table provides general quantitative information about the solving process which includes the MCSR conversion, compiling of the complete linear system, pre-elimination, sorting, scaling, preconditioning (ILU), and solving with the BICGSTAB. The simulations were started on the IBM cluster.
# Time [s] MCSR [%] Compile [%] Pre [%] Sort [%] Scale [%] ILU [%] BICGSTAB [%] Back [%]
1 4.35 0.77 1.88 2.07 0.50 13.25 50.99 30.06 0.15
2 2.86 3.43 5.63 5.36 1.19 8.76 34.20 40.87 0.52
3 2.77 1.98 3.76 2.69 1.02 8.62 41.52 40.27 0.11
4 2.51 2.37 5.67 4.65 1.69 10.60 36.52 38.27 0.18
5 16.22 2.61 3.98 3.41 1.10 8.31 34.40 46.04 0.14
6 17.00 1.32 2.41 1.82 0.66 4.64 45.33 43.74 0.07
7 21.44 1.74 4.17 4.95 0.97 5.92 40.25 41.70 0.24
8 38.72 2.51 5.35 5.70 1.42 8.67 39.08 37.01 0.22
9 180.63 2.19 2.97 3.50 0.67 3.99 29.93 56.55 0.20
10 3388.38 1.45 1.64 2.23 0.41 2.41 27.52 64.22 0.14
11 7618.11 1.42 1.27 1.86 0.34 1.99 25.31 67.71 0.11
12 634.29 21.19 1.21 1.11 0.44 1.96 13.37 60.68 0.04
13 92.97 33.06 2.03 2.05 0.55 2.79 17.92 41.49 0.11
14 129.94 7.13 2.97 2.86 0.94 4.46 25.17 56.32 0.13
15 2499.04 79.80 0.58 0.70 0.18 0.89 3.83 14.00 0.02
16 462.25 9.60 1.79 1.64 0.62 2.83 24.81 58.62 0.08


In addition, the effort of the conversion to alternative sparse matrix formats was evaluated. Since the time required for the conversion is too short, no significant information can be given by comparison of complete run times. However, in order to give quantitative information on the negligibility of the matrix conversions, the input and output system of the assembly module was used.

The assembly module has been equipped with a comprehensive input and output system which enables the user to have the assembled complete and inner linear equation systems read from or written to files. These features can be used for debugging and quality assessment purposes. Furthermore, different sets of parameters or alternative solvers can be efficiently tested. Whereas the simulator is normally used to calculate the entries of the linear equation system, a test program is provided which reads the files and starts the solver.

In order to quantify the effort of the matrix conversions, the first linear equation system assembled for the three-input nand gate (dimension 219,920, 1,925,553 non-zeros) was written to a file. Afterwards, it was read by a modified version of this test program. The modification regards only the activation of one type of matrix conversion. The conversion time is measured by the gettimeofday function, the minimum time of three consecutive runs is taken. One conversion to CSR variant 1 takes 38.212$ \,$ms, 266 conversions take 9,184.06$ \,$ms. For the CSR variant 2, the times are 38.955$ \,$ms and 8,798.3$ \,$ms, respectively. Since the complete simulation takes 8,468.19$ \,$s, the share of all conversions is 0.108% or 0.104%, respectively. These very small shares allow to conclude that the effort of the conversion can be neglected.

The complex-valued variant of that test program was also modified to evaluate another interesting issue. As outlined in Section 2.6.2, one can split a complex-valued equation system into a double-sized real-valued one in order to employ a real-valued assembly and solver system also for complex-valued linear equation systems. Regarding the memory consumption, this approach has disadvantages, which shall be quantified in the following discussion. Three examples were taken and the small-signal systems for 1$ \,$GHz were written to files. These files were read by the test program - once for the original version employing the complex-valued solver system and once for the modified version. Thus, this evaluation approach emulates also the different assembling process of the respective matrices. The results can summarized as follows:

These results motivates the conclusion that with the current implementation it is advantageous to solve the complex-valued equation systems with the complex-valued solver implementations. The main reason for that is the memory consumption, which is always doubled, whereas different results for the simulation time are obtained.


5.5.3 Simulation Results of the GMRES(m) Evaluation

Several test runs of the in-house GMRES(M) solver were performed and the cumulated number of mathematical operations as a function of $ m$ were recorded.

Figure 5.5: User time, number of operations (left), and memory consumption ratios (right) depending on the GMRES(M) restart factor $ m$ for the two-dimensional (above), three-dimensional (center), and mixed-mode simulations (below).
\includegraphics[width=0.48\linewidth]{figures/gmres2_2D_op.eps} \includegraphics[width=0.48\linewidth]{figures/gmres2_2D_mem.eps} \includegraphics[width=0.48\linewidth]{figures/gmres2_3D_op.eps} \includegraphics[width=0.48\linewidth]{figures/gmres2_3D_mem.eps} \includegraphics[width=0.48\linewidth]{figures/gmres2_MM_op.eps} \includegraphics[width=0.48\linewidth]{figures/gmres2_MM_mem.eps}

Since values of $ m$ smaller than 16 cause errors for some three-dimensional and mixed-mode simulations, the commonly used $ m$ has to be set to a value higher than 16. In Figure 5.5, all values are scaled to those of $ m = 16$. The following ratios are shown: minimum, maximum, and average of the number of mathematical operations, the user time, and the memory consumption. By analyzing the solid black line with symbols for the average user time, a default value shall be selected which can be commonly set for all kinds of simulations. The first part of that curve is decreasing between 16 and 70 down to a level of approximate 55%. Afterwards, it slightly increases again. One reason for this might be the higher memory consumption. The curve for the mixed-mode device/circuit simulations has a similar shape, however the increasing part starts a little bit earlier. Finally, the curve for the two-dimensional simulations shows a contraproductive effect of an increased $ m$ between 70 and 75.

So one can conclude that a value between 50 and 70 seems to be advantageous for all kinds of simulations if enough memory is available. For that reason, the default value was set to $ m = 65$. As this value can be easily controlled by the user, different requirements depending on the simulation and the memory resources can be met. In addition it is to note that the simulator itself can automatically adjust this value if the respective information is available.

5.5.4 Simulation Results of the Performance Evaluation

The objective of the performance evaluation is to compare the simulation times required by MINIMOS-NT depending on the solver systems selected. Note that the solver hierarchy (see Section 5.4) was deactivated. All three in-house solvers as well as the two PARDISO and four SAMG solvers are evaluated, although not all examples could be solved by all solver systems.

Figure 5.6: Solving times (average, minimum, maximum) on the Intel computer. All times are scaled to the in-house ILU-BICGSTAB in the center.
\includegraphics[width=0.48\linewidth]{figures/intel7s.eps}

In Figure 5.6 a comparison of different solvers on the Intel computer is given. Due to the large simulation time differences, all times are scaled to the in-house ILU-BICGSTAB in the center of the graph. Interesting results are the superiority of the advanced implementations of LU factorization and iterative solvers for circuits and three-dimensional devices, respectively. The in-house GMRES(M) solver has advantages for circuits also, whereas the direct solver on the left hand side can in fact only be used for quality assessment of two-dimensional simulations. Whereas the iterative SAMG solvers show single significant performance advantages up to 35%, the multi-level algorithms require still some effort to be generally applicable.

Figure 5.7: In the left figure, the relative PARDISO real/wall clock times (average, minimum, maximum) versus the number of processors/threads on the IBM cluster. The user time in the right figure increases due to the parallelization overhead.
\includegraphics[width=0.48\linewidth]{figures/ibm6_t1.eps} \includegraphics[width=0.48\linewidth]{figures/ibm6_t2.eps}

To show the relative impact of multi-threading, the PARDISO-LU solving ratios (referring to the single-threaded version) against the number of processors/threads are shown in Figure 5.7. For the three-dimensional examples, also the PARDISO-LU-CGS ratios are given. The real (wall clock) time required for solving the example, the cumulated user (CPU) times are shown in Figure 5.7, which increase due to the parallelization overhead. Whereas for two-dimensional device and circuit simulations too many processors can be even contra-productive, the marginal additional utility for three-dimensional simulations is drastically diminishing. Thus, for the average simulation four processors should be sufficient. Especially under scarce conditions, for example during optimizations, assigning two tasks per node of eight processors appears to minimize the real time effort.

The iterative methods still show a significant performance advantage over the direct solvers. However, the 1983 quotation ``In 3D sparse direct solution methods can safely be labeled a disaster'' [16] describes the experiences (in regard to both time and memory consumption) with the in-house LU solver, but does not embrace the recent developments. Especially for mixed-mode device/circuit simulations the advanced direct methods show a significant performance advantage, even up to the highest dimensions.


5.5.5 Evaluation of Solvers for Higher-Order Transport Models

The same concept as discussed in the last section is used to evaluate solvers for higher-order transport models. However, instead of the devices presented above, double-gate MOSFETs with different gate lengths were used, see Table 5.5.5.


Table 5.5: Three two-dimensional device structures with the given gate lengths were simulated. The dimensions of the linear equation systems are given depending on the transport model used. For the drift-diffusion simulation and BICGSTAB, the CPU time as well as the share of the simulation time spent for compiling, pre-eliminating, sorting, scaling, and solving are specified.
# Simulation SM ET DD DD CPU [s] DD Linear [%]
1 MOSFET with $ {\ensuremath{L_\mathrm{g}}} = 25\,$nm 4642 3522 2402 65.67 56.94
2 MOSFET with $ {\ensuremath{L_\mathrm{g}}} = 35\,$nm 3480 2640 1800 41.56 50.57
3 MOSFET with $ {\ensuremath{L_\mathrm{g}}} = 50\,$nm 2733 2073 1413 29.54 46.94


The transport models as derived in Section 2.1.3 were used and compared on the Intel computer: the six moments transport model (SM), the energy-transport transport model (ET), and the drift-diffusion transport model (DD). Three different simulation tasks were performed:

  1. Extraction of an IV-curve: For this simulation example, $ V_\mathrm{DS}$ is stepped from 0.0$ \,$V to 1.0$ \,$V by 0.025$ \,$V with $ \ensuremath{V_\mathrm{GS}}= 1.0\,$V.
  2. Quasi-static extraction of $ f_\textrm {T}$: In order to apply the formula given in (3.19) two $ V_\mathrm{GS}$ steps of $ 1.0\,$V $ \pm 5\,$mV are necessary while $ {\ensuremath{V_\mathrm{DS}}} = 1.0\,$V.
  3. Small-signal extraction of $ f_\textrm {T}$: A small-signal simulation based on a conditional frequency stepping was performed in order to extract $ f_\textrm {T}$.

As concluded in the evaluation of the last section, several solvers are better employable for more ill-conditioned problems. In contrast to the drift-diffusion transport models, simulations based on the energy-transport and six moments models tend to be more ill-conditioned. The underlying simulation setup is more sensitive to the mesh. As adaptive meshes are required, ongoing research outside the scope of this thesis is performed to improve this situation. However, to some extend, solver systems better equipped for ill-conditioned problems can significantly reduce the solving effort. The objective of this evaluation is to confirm the respective conclusions of the last section and to give a quantitative information on this.

The results for the two in-house iterative solvers BICGSTAB and GMRES(M) as well as those of the PARDISO solver are presented in Figure 5.8. Due to the more ill-conditioned problem and the small dimensions of the linear equation system, the direct solver significantly outperforms the two in-house solvers for the higher-order transport models. For the same reasons also the GMRES(M) solver shows advantages over the BICGSTAB. However, for the steady-state drift-diffusion simulations, the advantages of alternative solvers shrinks, whereas for small-signal simulations PARDISO is still up to 50% faster than the in-house ILU-BICGSTAB.

Figure 5.8: Scaled results for the six moments transport models (upper figure). The lower figures show scaled results for the energy-transport (left) and the drift-diffusion (right) transport models.
\includegraphics[width=0.48\linewidth]{figures/hotm_sm2.eps}

\includegraphics[width=0.48\linewidth]{figures/hotm_hd2.eps} \includegraphics[width=0.48\linewidth]{figures/hotm_dd2.eps}


next up previous contents
Next: 5.6 Concluding Remarks Up: 5. The Solver Module Previous: 5.4 Solver Hierarchy

S. Wagner: Small-Signal Device and Circuit Simulation