Reference documentation for deal.II version 8.4.1

Table of contents  

This program was contributed by Katharina Kormann and Martin Kronbichler.
The algorithm for the matrixvector product is based on the article A generic interface for parallel cellbased finite element operator application by Martin Kronbichler and Katharina Kormann, Computers and Fluids 63:135–147, 2012, and the paper "Parallel finite element operator application: Graph partitioning and coloring" by Katharina Kormann and Martin Kronbichler in: Proceedings of the 7th IEEE International Conference on eScience, 2011.
This program demonstrates how to use the cellbased implementation of finite element operators with the MatrixFree class, first introduced in step37, to solve nonlinear partial differential equations. Moreover, we demonstrate how the MatrixFree class handles constraints and how it can be parallelized over distributed nodes. Finally, we will use an explicit timestepping method to solve the problem and introduce GaussLobatto finite elements that are very convenient in this case since they have a diagonal, and thus trivially invertible, mass matrix. Moreover, this type of elements clusters the nodes towards the element boundaries which is why they have good properties for highorder discretization methods. Indeed, the condition number of standard FE_Q elements with equidistant nodes grows exponentially with the degree, which destroys any benefit for orders of about five and higher.
As an example, we choose to solve the sineGordon soliton equation
\begin{eqnarray*} u_{tt} &=& \Delta u \sin(u) \quad\mbox{for}\quad (x,t) \in \Omega \times (t_0,t_f],\\ {\mathbf n} \cdot \nabla u &=& 0 \quad\mbox{for}\quad (x,t) \in \partial\Omega \times (t_0,t_f],\\ u(x,t_0) &=& u_0(x). \end{eqnarray*}
that was already introduced in step25. As a simple explicit time integration method, we choose leap frog scheme using the secondorder formulation of the equation. Then, the scheme reads in weak form
\begin{eqnarray*} (v,u^{n+1}) = (v,2 u^nu^{n1}  (\Delta t)^2 \sin(u^n))  (\nabla v, (\Delta t)^2 \nabla u^n), \end{eqnarray*}
where v denotes a test function and the index n stands for the time step number.
For the spatial discretization, we choose FE_Q elements with basis functions defined to interpolate the support points of the GaussLobatto quadrature rule. Moreover, when we compute the integrals over the basis functions to form the mass matrix and the operator on the right hand side of the equation above, we use the GaussLobatto quadrature rule with the same support points as the node points of the finite element to evaluate the integrals. Since the finite element is Lagrangian, this will yield a diagonal mass matrix on the left hand side of the equation, making the solution of the linear system in each time step trivial.
Using this quadrature rule, for a pth order finite element, we use a (2p1)th order accurate formula to evaluate the integrals. Since the product of two pth order basis functions when computing a mass matrix gives a function with polynomial degree 2p in each direction, the integrals are not exactly evaluated. However, considering the fact that the interpolation order of finite elements of degree p is p+1, the overall convergence properties are not disturbed by the quadrature error, in particular not when we use high orders.
Apart from the fact, that we avoid solving linear systems with this type of elements when using explicit timestepping, they come with two other advantages. When we are using the sumfactorization approach to evaluate the finite element operator (cf. step37), we have to evaluate the function at the quadrature points. In the case of GaussLobatto elements, where quadrature points and node points of the finite element coincide, this operation is trivial since the value of the function at the quadrature points is given by its onedimensional coefficients. In this way, the complexity of a finite element operator evaluation is further reduced compared to equidistant elements.
The third advantage is the fact that these elements are better conditioned as equidistant Lagrange polynomials for increasing order so that we can use higher order elements for an accurate solution of the equation. Lagrange elements FE_Q with equidistant points should not be used for polynomial degrees four and higher.
To sum up the discussion, by using the right finite element and quadrature rule combination, we end up with a scheme where we in each time step need to compute the right hand side vector corresponding to the formulation above and then multiply it by the inverse of the diagonal mass matrix. In practice, of course, we extract the diagonal elements and invert them only once at the beginning of the program.
The usual way to handle constraints in deal.II
is to use the ConstraintMatrix class that builds a sparse matrix storing information about which degrees of freedom (DoF) are constrained and how they are constrained. This format uses an unnecessarily large amount of memory since there are not so many different types of constraints: for example, in the case of hanging nodes when using linear finite element on every cell, constraints most have the form \(x_k = \frac 12 x_i + \frac 12 x_j\) where the coefficients \(\frac 12\) are always the same and only \(i,j,k\) are different. While storing this redundant information is not a problem in general because it is only needed once during matrix and right hand side assembly, it becomes a problem when we want to use the matrixfree approach since there this information has to be accessed every time we apply the operator. Thus, instead of a ConstraintMatrix, we use a variable that we call constraint_pool
that collects the weights of the different constraints. Then, we only have to store an identifier of each constraint in the mesh instead of all the weights. Moreover, we do not want to apply the constraints in a pre and postprocessing step but want to take care of constraints as we evaluate the finite element operator. Therefore, we embed the constraint information into the variable indices_local_to_global
that is used to extract the cell information from the global vector. If a DoF is constrained, the indices_local_to_global
variable contains the global indices of the DoFs that it is constrained to. Then, we have another variable constraint_indicator
at hand that holds, for each cell, the local indices of DoFs that are constrained as well as the identifier of the type of constraint. Actually, you will not see these data structures in the example program since the class FEEvaluation
takes care of the constraints without user interaction.
The MatrixFree class comes with the option to be parallelized on three levels: MPI parallelization on clusters of distributed nodes, thread parallelization scheduled by the Threading Building Blocks library, and finally with a vectorization by clustering of two (or more) cells into a SIMD data type for the operator application. As we have already discussed in step37, you will get best performance by using an instruction set specific to your system, e.g. with the cmake variable DCMAKE_CXX_FLAGS="march=native"
. Shared memory (thread) parallelization was also exploited in step37. Here, we demonstrate MPI parallelization.
To facilitate parallelism with distributed memory (MPI), we use a special vector type parallel::distributed::Vector that holds the processorlocal part of the solution as well as information on and data fields for the ghost DoFs, i.e. DoFs that are owned by a remote processor but needed on cells that are treated by the present processor. Moreover, it holds the MPIsend information for DoFs that are owned locally but needed by other processors. This is similar to the PETScWrappers::MPI::Vector and TrilinosWrappers::MPI::Vector data types we have used in step40 and step32 before, but since we do not need any other parallel functionality of these libraries, we use the parallel::distributed::Vector class of deal.II instead of linking in another large library.
Note that this program is designed to be run with a distributed triangulation (parallel::distributed::Triangulation), which requires deal.II to be configured with p4est as described in the deal.II ReadMe file. However, a nondistributed triangulation is also supported, in which case the computation will be run in serial.
In our example, we choose the initial value to be
\begin{eqnarray*} u(x,t) = \prod_{i=1}^{d} 4 \arctan \left( \frac{m}{\sqrt{1m^2}}\frac{\sin\left(\sqrt{1m^2} t +c_2\right)}{\cosh(mx_i+c_1)}\right) \end{eqnarray*}
and solve the equation over the time interval [10,10]. The constants are chosen to be \(c_1=c_1=0\) and m=0.5. As mentioned in step25, in one dimension u as a function of t is the exact solution of the sineGordon equation. For higher dimension, this is however not the case.
The necessary files from the deal.II library.
This includes the data structures for the efficient implementation of matrixfree methods.
We start by defining two global variables to collect all parameters subject to changes at one place: One for the dimension and one for the finite element degree. The dimension is used in the main function as a template argument for the actual classes (like in all other deal.II programs), whereas the degree of the finite element is more crucial, as it is passed as a template argument to the implementation of the SineGordon operator. Therefore, it needs to be a compiletime constant.
The SineGordonOperation
class implements the cellbased operation that is needed in each time step. This nonlinear operation can be implemented straightforwardly based on the MatrixFree
class, in the same way as a linear operation would be treated by this implementation of the finite element operator application. We apply two template arguments to the class, one for the dimension and one for the degree of the finite element. This is a difference to other functions in deal.II where only the dimension is a template argument. This is necessary to provide the inner loops in FEEvaluation
with information about loop lengths etc., which is essential for efficiency. On the other hand, it makes it more challenging to implement the degree as a runtime parameter.
This is the constructor of the SineGordonOperation class. It receives a reference to the MatrixFree holding the problem information and the time step size as input parameters. The initialization routine sets up the mass matrix. Since we use GaussLobatto elements, the mass matrix is a diagonal matrix and can be stored as a vector. The computation of the mass matrix diagonal is simple to achieve with the data structures provided by FEEvaluation: Just loop over all (macro) cells and integrate over the function that is constant one on all quadrature points by using the integrate
function with true
argument at the slot for values. Finally, we invert the diagonal entries since we have to multiply by the inverse mass matrix in each time step.
This operator implements the core operation of the program, the integration over a range of cells for the nonlinear operator of the SineGordon problem. The implementation is based on the FEEvaluation class as in step37. Due to the special structure in GaussLobatto elements, certain operations become simpler, in particular the evaluation of shape function values on quadrature points which is simply the injection of the values of cell degrees of freedom. The MatrixFree class detects possible structure of the finite element at quadrature points when initializing, which is then used by FEEvaluation for selecting the most appropriate numerical kernel.
The nonlinear function that we have to evaluate for the time stepping routine includes the value of the function at the present time current
as well as the value at the previous time step old
. Both values are passed to the operator in the collection of source vectors src
, which is simply a std::vector
of pointers to the actual solution vectors. This construct of collecting several source vectors into one is necessary as the cell loop in MatrixFree
takes exactly one source and one destination vector, even if we happen to use many vectors like the two in this case. Note that the cell loop accepts any valid class for input and output, which does not only include vectors but general data types. However, only in case it encounters a parallel::distributed::Vector<Number> or a std::vector
collecting these vectors, it calls functions that exchange data at the beginning and the end of the loop. In the loop over the cells, we first have to read in the values in the vectors related to the local values. Then, we evaluate the value and the gradient of the current solution vector and the values of the old vector at the quadrature points. Then, we combine the terms in the scheme in the loop over the quadrature points. Finally, we integrate the result against the test function and accumulate the result to the global solution vector dst
.
This function performs the time stepping routine based on the celllocal strategy. First the destination vector is set to zero, then the cellloop is called, and finally the solution is multiplied by the inverse mass matrix. The structure of the cell loop is implemented in the cell finite element operator class. On each cell it applies the routine defined as the local_apply()
method of the class SineGordonOperation
, i.e., this
. One could also provide a function with the same signature that is not part of a class.
We define a timedependent function that is used as initial value. Different solutions can be obtained by varying the starting time. This function has already been explained in step25.
This is the main class that builds on the class in step25. However, we replaced the SparseMatrix<double> class by the MatrixFree class to store the geometry data. Also, we use a distributed triangulation in this example.
This is the constructor of the SineGordonProblem class. The time interval and time step size are defined here. Moreover, we use the degree of the finite element that we defined at the top of the program to initialize a FE_Q finite element based on GaussLobatto support points. These points are convenient because in conjunction with a QGaussLobatto quadrature rule of the same order they give a diagonal mass matrix without compromising accuracy too much (note that the integration is inexact, though), see also the discussion in the introduction.
As in step25 this functions sets up a cube grid in dim
dimensions of extent \([15,15]\). We refine the mesh more in the center of the domain since the solution is concentrated there. We first refine all cells whose center is within a radius of 11, and then refine once more for a radius 6. This simple ad hoc refinement could be done better by adapting the mesh to the solution using error estimators during the time stepping as done in other example programs, and using parallel::distributed::SolutionTransfer to transfer the solution to the new mesh.
We generate hanging node constraints for ensuring continuity of the solution. As in step40, we need to equip the constraint matrix with the IndexSet of locally relevant degrees of freedom to avoid it to consume too much memory for big problems. Next, the MatrixFree
for the problem is set up. Note that we specify the MPI communicator which we are going to use, and that we also want to use sharedmemory parallelization (hence one would use multithreading for intranode parallelism and not MPI; note that we here choose the standard option — if we wanted to disable shared memory parallelization, we would choose none
). Finally, three solution vectors are initialized. MatrixFree stores the layout that is to be used by distributed vectors, so we just ask it to initialize the vectors.
This function prints the norm of the solution and writes the solution vector to a file. The norm is standard (except for the fact that we need to be sure to only count norms on locally owned cells), and the second is similar to what we did in step40. Note that we can use the same vector for output as we used for computation: The vectors in the matrixfree framework always provide full information on all locally owned cells (this is what is needed in the local evaluations, too), including ghost vector entries on these cells. This is the only data that is needed in the integrate_difference function as well as in DataOut. We only need to make sure that we tell the vector to update its ghost values before we read them. This is a feature present only in the parallel::distributed::Vector class. Distributed vectors with PETSc and Trilinos, on the other hand, need to be copied to special vectors including ghost values (see the relevant section in step40). If we wanted to access all degrees of freedom on ghost cells, too (e.g. when computing error estimators that use the jump of solution over cell boundaries), we would need more information and create a vector initialized with locally relevant dofs just as in step40. Observe also that we need to distribute constraints for output  they are not filled during computations (rather, they are distributed on the fly in the matrixfree method read_dof_values).
This function is called by the main function and calls the subroutines of the class.
The first step is to set up the grid and the cell operator. Then, the time step is computed from the CFL number given in the constructor and the finest mesh size. The finest mesh size is computed as the diameter of the last cell in the triangulation, which is the last cell on the finest level of the mesh. This is only possible for Cartesian meshes, otherwise, one needs to loop over all cells. Note that we need to query all the processors for their finest cell since the not all processors might hold a region where the mesh is at the finest level. Then, we readjust the time step a little to hit the final time exactly.
Next the initial value is set. Since we have a twostep time stepping method, we also need a value of the solution at timetime_step. For accurate results, one would need to compute this from the time derivative of the solution at initial time, but here we ignore this difficulty and just set it to the initial value function at that artificial time.
We create an output of the initial value. Then we also need to collect the two starting solutions in a std::vector
of pointers field and to set up an instance of the SineGordonOperation class
based on the finite element degree specified at the top of this file.
Now loop over the time steps. In each iteration, we shift the solution vectors by one and call the apply
function of the SineGordonOperator
. Then, we write the solution to a file. We clock the wall times for the computational time needed as wall as the time needed to create the output and report the numbers when the time stepping is finished.
Note how this shift is implemented: We simply call the swap method on the two vectors which swaps only some pointers without the need to copy data around. Obviously, this is a more efficient way to update the vectors during time stepping. Let us see what happens in more detail: First, we exchange old_solution
with old_old_solution
, which means that old_old_solution
gets old_solution
, which is what we expect. Similarly, old_solution
gets the content from solution
in the next step. Afterward, solution
holds old_old_solution
, but that will be overwritten during this step.
main
functionAs in step40, we initialize MPI at the start of the program. Since we will in general mix MPI parallelization with threads, we also set the third argument in MPI_InitFinalize that controls the number of threads to an invalid number, which means that the TBB library chooses the number of threads automatically, typically to the number of available cores in the system. As an alternative, you can also set this number manually if you want to set a specific number of threads (e.g. when MPIonly is required).
In order to demonstrate the gain in using the MatrixFree class instead of the standard deal.II
assembly routines for evaluating the information from old time steps, we study a simple serial run of the code on a nonadaptive mesh. Since much time is spent on evaluating the sine function, we do not only show the numbers of the full sineGordon equation but also for the wave equation (the sineterm skipped from the sineGordon equation). We use both second and fourth order elements. The results are summarized in the following table.
wave equation  sineGordon  

MF  SpMV  dealii  MF  dealii  
2D, \(\mathcal{Q}_2\)  0.0106  0.00971  0.109  0.0243  0.124 
2D, \(\mathcal{Q}_4\)  0.0328  0.0706  0.528  0.0714  0.502 
3D, \(\mathcal{Q}_2\)  0.0151  0.0320  0.331  0.0376  0.364 
3D, \(\mathcal{Q}_4\)  0.0918  0.844  6.83  0.194  6.95 
It is apparent that the matrixfree code outperforms the standard assembly routines in deal.II by far. In 3D and for fourth order elements, one operator application is also almost ten times as fast as a sparse matrixvector product.
To demonstrate how the example scales for a parallel run and to demonstrate that hanging node constraints can be handled in an efficient way, we run the example in 3D with \(\mathcal{Q}_4\) elements. First, we run it on a notebook with 2 cores (Sandy Bridge CPU) at 2.7 GHz.
It takes 0.04 seconds for one time step on a notebook with more than a million degrees of freedom (note that we would need many processors to reach such numbers when solving linear systems). If we run the same 3D code on a cluster with 2 nodes and each node runs 8 threads, we get the following times:
We observe a considerable speedup over the notebook (16 cores versus 2 cores; nonetheless, one notebook core is considerably faster than one core of the cluster because of a newer processor architecture). If we run the same program on 4 nodes with 8 threads on each node, we get:
By comparing the times for two nodes and four nodes, we observe the nice scaling behavior of the implementation. Of course, the code can also be run in MPImode only by disabling the multithreading flag in the code. If we use the same 32 cores as for the hybrid parallelization above, we observe the following runtime:
We observe slower speed for computations, but faster output (which makes sense, as output is only parallelized by MPI and not threads), whereas the computations are faster if we use hybrid parallelism in the given case.
There are several things in this program that could be improved to make it even more efficient (besides improved boundary conditions and physical stuff as discussed in step25):
Faster evaluation of sine terms: As becomes obvious from the comparison of the plain wave equation and the sineGordon equation above, the evaluation of the sine terms dominates the total time for the finite element operator application. There are a few reasons for this: Firstly, the deal.II sine computation of a VectorizedArray field is not vectorized (as opposed to the rest of the operator application). This could be cured by handing the sine computation to a library with vectorized sine computations like Intel's math kernel library (MKL). By using the function vdSin
in MKL, the program uses half the computing time in 2D and 40 percent less time in 3D. On the other hand, the sine computation is structurally much more complicated than the simple arithmetic operations like additions and multiplications in the rest of the local operation.
Higher order time stepping: While the implementation allows for arbitrary order in the spatial part (by adjusting the degree of the finite element), the time stepping scheme is a standard secondorder leapfrog scheme. Since solutions in wave propagation problems are usually very smooth, the error is likely dominated by the time stepping part. Of course, this could be cured by using smaller time steps (at a fixed spatial resolution), but it would be more efficient to use higher order time stepping as well. While it would be straightforward to do so for a firstorder system (use some Runge–Kutta scheme of higher order, probably combined with adaptive time step selection like the Dormand–Prince method), it is more challenging for the secondorder formulation. At least in the finite difference community, people usually use the PDE to find spatial correction terms that improve the temporal error.