5.5 Practical Evaluation of the Solvers

In the course of this work the performance of these solvers has been evaluated. Rather than using a set of single matrices, the approach is based on complete simulations with consistent settings, as typically encountered during daily work.

In mathematical publications, different solver systems are normally compared by applying them for single linear equation systems. For example, the matrix market [139,25] provides such a data repository for use in comparative studies of linear algebra algorithms. The repository does not only provide matrices, but also matrix generation software in a wide variety of scientific and engineering disciplines. For each matrix and matrix set details regarding the matrix properties and a visualization of the matrix structure is provided. The matrices can be downloaded in several different text file formats. A matrix generator is either a static software for download, a Java applet, or a form-based web service to process requests for generating matrices. In 2000, 482 individual matrices and 25 matrix generators were available. The database includes the Harwell-Boeing Sparse Matrix Collection (Release I) [55], the SPARSKIT collection [178], and the Non-Hermitian Eigenvalue Problem collection [14].

Such an approach is particularly useful for the evaluation of new mathematical algorithms. The conclusions drawn from the results are used to improve the respective implementations. However, from the perspective of the implementation of a general-purpose simulator, the objective is quite different. Here, a set of solvers are available and the one which fits best for the respective problem shall be chosen. In contrast to the solver development, the research on this topic is restricted to practical evaluations regarding the respective simulators.

5.5.1 Test Examples and Processing

The 16 examples summarized in Table 5.1 were taken from current scientific projects at the institute. They consist of field effect, bipolar, and silicon-on-insulator transistors. Structures such as FINFETs are analyzed by means of three-dimensional simulations.

Table 5.2: This table provides general quantitative information about the simulations using the in-house ILU-BICGSTAB on the IBM cluster. After the index of the simulation, the dimensions of the inner and complete equation system is given. The next column shows the share of the inner in respect to the complete system. In order to analyze how the simulation time is spent, the remaining columns contain the shares of the initialization, pre-processing, assembling, solving (includes all steps shown in Table 5.3), quantity update, and post-processing, respectively.

#	Simulation	Type	Dimension	Entries ([ ${\,}^0\!/_{00}$ ])	DC it	Remark
1	MOSFET	2D device	2,704	23,662 (3.24)	17	test structure with $L_{\mathrm{G}} = 1\,{\mathrm{{\mu}m}}$
2	Flash cell	2D device	5,967	47,956 (1.35)	24	tunneling effects
3	Pin-diode	2D device	6,335	56,127 (1.40)	13	optical generation model
4	Bjt transistor	2D device	6,389	33,869 (0.83)	23	ET transport model
5	SA-LIGBT	2D device	16,774	158,223 (0.56)	27	two V $_{\mathrm{D}}$ steps: 0 $\,$ V, 5 $\,$ V
6	SiGe HBT	2D device	19,313	210,915 (0.57)	32	self-heating
7	Colpitts oscillator	circuit (1)	3,928	35,002 (2.27)	41	transient (400 steps)
8	Amplifier	circuit (1)	6,391	35,291 (0.86)	30	ET transport model
9	Ring oscillator	circuit (10)	25,246	226,931 (0.36)	29	transient (100 steps)
10	2-input nand gates	circuit (8)	146,614	1,347,138 (0.06)	23	transient (50 steps)
11	3-input nand gates	circuit (12)	219,920	2,020,706 (0.04)	23	transient (50 steps)
12	MagFET	3D device	85,308	921,860 (0.13)	36	magnetic field
13	FinFET	3D device	81,037	807,150 (0.12)	13	thin SOI finger
14	SOI	3D device	87,777	997,296 (0.13)	23	two V $_{\mathrm{D}}$ steps: 0 $\,$ V, 0.1 $\,$ V
15	HBT	3D device	175,983	1,833,138 (0.06)	35	self-heating
16	LD-MOSFET	3D device	167,197	1,925,553 (0.07)	25	power device; two V $_{\mathrm{D}}$ steps: 0 $\,$ V, 1.0 $\,$ V

**Table 5.1:** Six two-dimensional, five mixed-mode device/circuit (the number of devices is given in parenthesis), and five three-dimensional simulations were used for evaluating the solver performance. The dimension of the linear equation system, the number of non-zero entries, and the typical number of DC *Newton* iterations are given.
#	Inner	Complete	Ratio [%]	CPU [s]	Init [%]	Pre [%]	Asm [%]	Lin [%]	Upd [%]	Post [%]
1	2,704	2,920	92.60	1.34	14.93	6.72	31.34	34.33	3.73	2.99
2	5,967	13,915	42.88	7.02	3.56	5.98	46.15	40.17	1.71	1.14
3	6,335	6,451	98.20	5.06	16.01	7.51	17.19	55.53	0.59	2.17
4	6,389	6,754	94.60	4.84	9.09	5.99	28.10	51.24	1.45	1.65
5	16,774	17,705	94.74	39.64	12.66	4.69	37.24	42.73	0.33	1.54
6	19,313	19,836	97.36	28.29	39.46	3.45	10.35	44.46	0.24	1.24
7	3,928	5,475	71.74	32.60	1.53	6.20	17.61	64.39	1.20	7.15
8	6,391	8,056	79.33	63.06	0.78	4.60	30.13	59.94	0.76	3.00
9	25,246	37,007	68.22	226.97	0.81	4.88	13.69	76.16	2.07	1.74
10	146,614	202,168	72.52	3,641.37	0.06	2.88	8.34	87.13	0.33	1.10
11	219,920	303,250	72.52	7,671.19	0.04	2.40	6.78	89.60	0.26	0.80
12	85,308	90,203	94.57	593.55	0.39	2.52	7.24	89.03	0.10	0.48
13	81,037	98,870	81.96	99.13	4.68	4.07	9.69	79.72	0.22	1.12
14	87,777	93,572	93.81	190.19	1.45	5.41	23.19	67.32	0.19	1.90
15	175,983	226,687	77.63	988.29	0.65	2.54	6.97	88.42	0.14	1.04
16	167,197	179,111	93.35	495.99	0.74	3.61	9.98	83.93	0.15	1.20

The extensible benchmark is processed by a single program, which has been implemented as a SEILIB application (see Section C.3). The real (wall-clock) and user time is measured by the GNU time command, and the fastest of three consecutive runs is taken.

5.5.2 General Quantitative Comparisons

Table 5.2 contains information on how the simulation time is actually spent. The simulator was started on the IBM cluster and the in-house ILU-BICGSTAB was selected. The first columns show the number of the example, the dimensions of the inner and complete system, and the ratio $dim_\mathrm{inner}/dim_\mathrm{complete}$ . A lower ratio indicates that more equations are pre-eliminated and thus more time is spent for pre-elimination. The CPU column shows the absolute user time required for the respective example. The remaining columns show the shares of the simulation time spent in selected modules or for selected tasks: for initialization, pre-processing, assembling the linear equation systems (including the calculation of the model contributions), the share of the linear modules including all steps shown in Table 5.3, the update of the quantities, and eventually the postprocessing.

The most important data in the context of that chapter is how much time is spent for the linear system after the assembly has been completed (the Lin column). For the smaller two-dimensional examples that share is below 50%, that means more time is spent for calculating and assembling the model contributions as well as for pre- and post-processing. However, for all other simulations including the larger two-dimensional ones, that share rises significantly even up to more than 90%. Thus, for increasing the efficiency of advanced device simulations and optimization, it is essential to reduce the relative effort spent on preparing and solving the linear equation systems.

Table 5.4: This table shows the same information as Table 5.3, but without the Newton adjustment.

#	Time [s]	MCSR [%]	Compile [%]	Pre [%]	Sort [%]	Scale [%]	ILU [%]	BICGSTAB [%]	Back [%]
1	0.49	0.65	10.93	6.11	2.25	17.02	32.93	29.52	0.38
2	2.74	0.67	6.73	5.07	1.20	8.98	35.63	41.16	0.51
3	2.69	0.27	4.20	2.26	1.05	8.82	42.71	40.54	0.12
4	2.40	0.45	6.69	4.21	1.64	10.70	37.67	38.38	0.19
5	27.67	0.16	4.58	3.08	0.99	8.53	34.37	48.16	0.12
6	13.11	1.25	5.01	3.02	1.10	6.93	36.06	46.51	0.11
7	35.17	0.26	4.75	3.87	0.73	5.97	42.32	41.77	0.31
8	45.53	0.19	5.65	4.31	1.10	8.04	40.00	40.48	0.20
9	224.52	0.26	3.28	2.66	0.66	4.04	30.85	58.04	0.19
10	3139.28	0.13	1.83	1.79	0.40	2.51	29.26	63.93	0.14
11	6855.87	0.12	1.38	1.48	0.31	1.97	27.53	67.11	0.11
12	504.36	3.99	2.32	1.14	0.50	2.39	16.67	72.95	0.04
13	94.43	9.82	4.95	2.75	0.88	3.84	23.16	54.47	0.12
14	146.11	1.08	3.48	2.48	1.01	5.17	26.68	59.97	0.13
15	921.04	37.04	3.51	1.69	0.58	2.71	11.33	43.08	0.07
16	447.12	1.47	2.45	1.82	0.68	3.12	28.21	62.16	0.09

**Table 5.3:** This table provides general quantitative information about the solving process which includes the MCSR conversion, compiling of the complete linear system, pre-elimination, sorting, scaling, preconditioning (ILU), and solving with the BICGSTAB. The simulations were started on the IBM cluster.
#	Time [s]	MCSR [%]	Compile [%]	Pre [%]	Sort [%]	Scale [%]	ILU [%]	BICGSTAB [%]	Back [%]
1	4.35	0.77	1.88	2.07	0.50	13.25	50.99	30.06	0.15
2	2.86	3.43	5.63	5.36	1.19	8.76	34.20	40.87	0.52
3	2.77	1.98	3.76	2.69	1.02	8.62	41.52	40.27	0.11
4	2.51	2.37	5.67	4.65	1.69	10.60	36.52	38.27	0.18
5	16.22	2.61	3.98	3.41	1.10	8.31	34.40	46.04	0.14
6	17.00	1.32	2.41	1.82	0.66	4.64	45.33	43.74	0.07
7	21.44	1.74	4.17	4.95	0.97	5.92	40.25	41.70	0.24
8	38.72	2.51	5.35	5.70	1.42	8.67	39.08	37.01	0.22
9	180.63	2.19	2.97	3.50	0.67	3.99	29.93	56.55	0.20
10	3388.38	1.45	1.64	2.23	0.41	2.41	27.52	64.22	0.14
11	7618.11	1.42	1.27	1.86	0.34	1.99	25.31	67.71	0.11
12	634.29	21.19	1.21	1.11	0.44	1.96	13.37	60.68	0.04
13	92.97	33.06	2.03	2.05	0.55	2.79	17.92	41.49	0.11
14	129.94	7.13	2.97	2.86	0.94	4.46	25.17	56.32	0.13
15	2499.04	79.80	0.58	0.70	0.18	0.89	3.83	14.00	0.02
16	462.25	9.60	1.79	1.64	0.62	2.83	24.81	58.62	0.08

In addition, the effort of the conversion to alternative sparse matrix formats was evaluated. Since the time required for the conversion is too short, no significant information can be given by comparison of complete run times. However, in order to give quantitative information on the negligibility of the matrix conversions, the input and output system of the assembly module was used.

The assembly module has been equipped with a comprehensive input and output system which enables the user to have the assembled complete and inner linear equation systems read from or written to files. These features can be used for debugging and quality assessment purposes. Furthermore, different sets of parameters or alternative solvers can be efficiently tested. Whereas the simulator is normally used to calculate the entries of the linear equation system, a test program is provided which reads the files and starts the solver.

In order to quantify the effort of the matrix conversions, the first linear equation system assembled for the three-input nand gate (dimension 219,920, 1,925,553 non-zeros) was written to a file. Afterwards, it was read by a modified version of this test program. The modification regards only the activation of one type of matrix conversion. The conversion time is measured by the gettimeofday function, the minimum time of three consecutive runs is taken. One conversion to CSR variant 1 takes 38.212 $\,$ ms, 266 conversions take 9,184.06 $\,$ ms. For the CSR variant 2, the times are 38.955 $\,$ ms and 8,798.3 $\,$ ms, respectively. Since the complete simulation takes 8,468.19 $\,$ s, the share of all conversions is 0.108% or 0.104%, respectively. These very small shares allow to conclude that the effort of the conversion can be neglected.

The complex-valued variant of that test program was also modified to evaluate another interesting issue. As outlined in Section 2.6.2, one can split a complex-valued equation system into a double-sized real-valued one in order to employ a real-valued assembly and solver system also for complex-valued linear equation systems. Regarding the memory consumption, this approach has disadvantages, which shall be quantified in the following discussion. Three examples were taken and the small-signal systems for 1 $\,$ GHz were written to files. These files were read by the test program - once for the original version employing the complex-valued solver system and once for the modified version. Thus, this evaluation approach emulates also the different assembling process of the respective matrices. The results can summarized as follows:

5.5.3 Simulation Results of the GMRES(m) Evaluation

Several test runs of the in-house GMRES(M) solver were performed and the cumulated number of mathematical operations as a function of

were recorded.

**Figure 5.5:** User time, number of operations (left), and memory consumption ratios (right) depending on the GMRES(M) restart factor for the two-dimensional (above), three-dimensional (center), and mixed-mode simulations (below).
$\includegraphics[width=0.48\linewidth]{figures/gmres2_2D_op.eps}$ $\includegraphics[width=0.48\linewidth]{figures/gmres2_2D_mem.eps}$ $\includegraphics[width=0.48\linewidth]{figures/gmres2_3D_op.eps}$ $\includegraphics[width=0.48\linewidth]{figures/gmres2_3D_mem.eps}$ $\includegraphics[width=0.48\linewidth]{figures/gmres2_MM_op.eps}$ $\includegraphics[width=0.48\linewidth]{figures/gmres2_MM_mem.eps}$

Since values of

smaller than 16 cause errors for some three-dimensional and mixed-mode simulations, the commonly used

has to be set to a value higher than 16. In Figure 5.5, all values are scaled to those of

. The following ratios are shown: minimum, maximum, and average of the number of mathematical operations, the user time, and the memory consumption. By analyzing the solid black line with symbols for the average user time, a default value shall be selected which can be commonly set for all kinds of simulations. The first part of that curve is decreasing between 16 and 70 down to a level of approximate 55%. Afterwards, it slightly increases again. One reason for this might be the higher memory consumption. The curve for the mixed-mode device/circuit simulations has a similar shape, however the increasing part starts a little bit earlier. Finally, the curve for the two-dimensional simulations shows a contraproductive effect of an increased

between 70 and 75.

So one can conclude that a value between 50 and 70 seems to be advantageous for all kinds of simulations if enough memory is available. For that reason, the default value was set to

. As this value can be easily controlled by the user, different requirements depending on the simulation and the memory resources can be met. In addition it is to note that the simulator itself can automatically adjust this value if the respective information is available.

5.5.4 Simulation Results of the Performance Evaluation

The objective of the performance evaluation is to compare the simulation times required by MINIMOS-NT depending on the solver systems selected. Note that the solver hierarchy (see Section 5.4) was deactivated. All three in-house solvers as well as the two PARDISO and four SAMG solvers are evaluated, although not all examples could be solved by all solver systems.

**Figure 5.6:** Solving times (average, minimum, maximum) on the *Intel* computer. All times are scaled to the in-house ILU-BICGSTAB in the center.
$\includegraphics[width=0.48\linewidth]{figures/intel7s.eps}$

In Figure 5.6 a comparison of different solvers on the Intel computer is given. Due to the large simulation time differences, all times are scaled to the in-house ILU-BICGSTAB in the center of the graph. Interesting results are the superiority of the advanced implementations of LU factorization and iterative solvers for circuits and three-dimensional devices, respectively. The in-house GMRES(M) solver has advantages for circuits also, whereas the direct solver on the left hand side can in fact only be used for quality assessment of two-dimensional simulations. Whereas the iterative SAMG solvers show single significant performance advantages up to 35%, the multi-level algorithms require still some effort to be generally applicable.

**Figure 5.7:** In the left figure, the relative PARDISO real/wall clock times (average, minimum, maximum) versus the number of processors/threads on the IBM cluster. The user time in the right figure increases due to the parallelization overhead.
$\includegraphics[width=0.48\linewidth]{figures/ibm6_t1.eps}$ $\includegraphics[width=0.48\linewidth]{figures/ibm6_t2.eps}$

To show the relative impact of multi-threading, the PARDISO-LU solving ratios (referring to the single-threaded version) against the number of processors/threads are shown in Figure 5.7. For the three-dimensional examples, also the PARDISO-LU-CGS ratios are given. The real (wall clock) time required for solving the example, the cumulated user (CPU) times are shown in Figure 5.7, which increase due to the parallelization overhead. Whereas for two-dimensional device and circuit simulations too many processors can be even contra-productive, the marginal additional utility for three-dimensional simulations is drastically diminishing. Thus, for the average simulation four processors should be sufficient. Especially under scarce conditions, for example during optimizations, assigning two tasks per node of eight processors appears to minimize the real time effort.

The iterative methods still show a significant performance advantage over the direct solvers. However, the 1983 quotation ``In 3D sparse direct solution methods can safely be labeled a disaster'' [16] describes the experiences (in regard to both time and memory consumption) with the in-house LU solver, but does not embrace the recent developments. Especially for mixed-mode device/circuit simulations the advanced direct methods show a significant performance advantage, even up to the highest dimensions.

5.5.5 Evaluation of Solvers for Higher-Order Transport Models

The same concept as discussed in the last section is used to evaluate solvers for higher-order transport models. However, instead of the devices presented above, double-gate MOSFETs with different gate lengths were used, see Table 5.5.5.

Table 5.5: Three two-dimensional device structures with the given gate lengths were simulated. The dimensions of the linear equation systems are given depending on the transport model used. For the drift-diffusion simulation and BICGSTAB, the CPU time as well as the share of the simulation time spent for compiling, pre-eliminating, sorting, scaling, and solving are specified.

#	Simulation	SM	ET	DD	DD CPU [s]	DD Linear [%]
1	MOSFET with ${\ensuremath{L_\mathrm{g}}} = 25\,$ nm	4642	3522	2402	65.67	56.94
2	MOSFET with ${\ensuremath{L_\mathrm{g}}} = 35\,$ nm	3480	2640	1800	41.56	50.57
3	MOSFET with ${\ensuremath{L_\mathrm{g}}} = 50\,$ nm	2733	2073	1413	29.54	46.94

The transport models as derived in Section 2.1.3 were used and compared on the Intel computer: the six moments transport model (SM), the energy-transport transport model (ET), and the drift-diffusion transport model (DD). Three different simulation tasks were performed:

As concluded in the evaluation of the last section, several solvers are better employable for more ill-conditioned problems. In contrast to the drift-diffusion transport models, simulations based on the energy-transport and six moments models tend to be more ill-conditioned. The underlying simulation setup is more sensitive to the mesh. As adaptive meshes are required, ongoing research outside the scope of this thesis is performed to improve this situation. However, to some extend, solver systems better equipped for ill-conditioned problems can significantly reduce the solving effort. The objective of this evaluation is to confirm the respective conclusions of the last section and to give a quantitative information on this.

The results for the two in-house iterative solvers BICGSTAB and GMRES(M) as well as those of the PARDISO solver are presented in Figure 5.8. Due to the more ill-conditioned problem and the small dimensions of the linear equation system, the direct solver significantly outperforms the two in-house solvers for the higher-order transport models. For the same reasons also the GMRES(M) solver shows advantages over the BICGSTAB. However, for the steady-state drift-diffusion simulations, the advantages of alternative solvers shrinks, whereas for small-signal simulations PARDISO is still up to 50% faster than the in-house ILU-BICGSTAB.