Matrix-Free Codes and How They Affect Optizelle

Details on How to Make Optimization Codes Work in Practice

Posted by Joseph Young on Thu, Jan 15, 2015
In Software under Optizelle, Scientific Computing

Introduction

It’s been a little while since I added a post, so I wanted to give a glimpse into some of the technical problems that I’m currently working on and how they relate to general scientific codes.

One of the things that I advertise about Optizelle is that it's matrix-free, and that's certainly true. At the same time, what does that really mean? Basically, a matrix-free code is one that doesn't require its linear operators to be matrices. Note that I didn't say that a matrix-free code doesn't use matrices. Certainly, if a user wants to use one, they should be allowed. However, they should not be required.
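To make this concrete, here's a tiny sketch, entirely outside of Optizelle, of what a matrix-free operator looks like: the 1-D Laplacian applied as a plain function on a vector, never stored as a matrix, and handed straight to a Krylov solver.

```python
# A minimal, self-contained illustration of a matrix-free linear operator:
# the 1-D Laplacian (second differences) applied to a vector without ever
# forming the matrix.  A Krylov method like CG only needs this function.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 100

def laplacian_matvec(v):
    # Computes A v for A = tridiag(-1, 2, -1); no matrix is ever stored
    Av = 2.0 * v
    Av[:-1] -= v[1:]
    Av[1:] -= v[:-1]
    return Av

A = LinearOperator((n, n), matvec=laplacian_matvec)
b = np.ones(n)
x, info = cg(A, b)
print("residual norm:", np.linalg.norm(laplacian_matvec(x) - b))
```

If we happened to have the matrix on hand, we could just as easily use it inside `laplacian_matvec`; the point is that nothing in the solver forces us to.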

Of course, this leads to the question: Should we be using matrices anyway? Most of the time, the answer is yes. In order to understand why, let’s look at Optizelle.

Brief Primer on Linear System Solves in Optimization

In unconstrained optimization, we use either truncated-CG or truncated-MINRES to solve the Newton system:

$$ \nabla^2 f(x) \delta x = - \nabla f(x) $$

where $\nabla f(x)$ and $\nabla^2 f(x)$ denote the gradient and Hessian of $f$, respectively. Although we don't need an exact solution to this system, in order to obtain high-order convergence we need to solve it well enough, where the criteria for "well enough" come from the theory of inexact Newton methods, which we can find in places like Numerical Optimization's section 7.1. In general, these criteria are difficult, if not impossible, to verify, so we basically start conservatively and then increase the amount of precision we require from each solve until we're happy with our convergence rate.
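As a rough illustration, and not Optizelle's actual algorithm, here's a small inexact Newton loop where each Newton system is solved by CG only up to a relative tolerance, the so-called forcing term, which we tighten as the gradient shrinks. The objective, the forcing-term schedule, and the names below are all made up for the example, and the trust-region safeguards of truncated-CG are omitted.

```python
# A hedged sketch of inexact Newton: each Newton system is solved only
# approximately by CG, with the relative tolerance (forcing term) eta
# tightened as the gradient gets small.  Everything here is illustrative.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def grad(x):
    return x + x**3              # gradient of 0.5*||x||^2 + 0.25*sum(x^4)

def hess_vec(x, v):
    return v + 3.0 * x**2 * v    # matrix-free Hessian-vector product

n = 50
x = np.full(n, 2.0)
for k in range(10):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Forcing term: loose solves far from the solution, tight ones near it
    eta = min(0.5, np.sqrt(np.linalg.norm(g)))
    H = LinearOperator((n, n), matvec=lambda v: hess_vec(x, v))
    dx, _ = cg(H, -g, rtol=eta)
    x = x + dx
    print("iteration", k, "gradient norm", np.linalg.norm(grad(x)))
```

Notice that the Hessian only ever appears through `hess_vec`, so the same loop runs whether or not a matrix exists behind it.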

Next, how quickly we meet the criteria for "well enough" depends on how well clustered the eigenvalues of $\nabla^2 f(x)$ are. In order to see this, we note that Krylov methods essentially look for a polynomial that is small on the spectrum of the linear operator. For a symmetric system $Ax=b$, we write the norm of the $n$-th residual as

$$ \begin{align*} \| r_n \| =& \|P_n(A)b\| \\ =& \|VP_n(\Lambda)V^{-1}b\| \\ \leq& \|V\| \|P_n(\Lambda)\| \|V^{-1}b\| \\ =& \|P_n(\Lambda)\| \|b\| \\ \end{align*} $$

where we used the spectral decomposition $A=V\Lambda V^{-1}$ and the fact that, since $A$ is symmetric, $V$ may be taken to be orthogonal, so $\|V\|=1$ and $\|V^{-1}b\|=\|b\|$. Hence, our residual becomes small when our polynomial is small on the spectrum, and this occurs more readily when the eigenvalues are well clustered. In order to better cluster the eigenvalues, we apply a preconditioner, $P$, to both sides of the equation. This gives us

$$ P \nabla^2 f(x) \delta x = - P \nabla f(x) $$

Here, we could choose any sort of preconditioner $P$. For example, we can see that when $P=\nabla^2 f(x)^{-1}$, we cluster all of the eigenvalues of our operator at one and our equation reduces to

$$ \delta x = - \nabla^2 f(x)^{-1} \nabla f(x). $$

In other words, with a perfect preconditioner, $\nabla^2 f(x)^{-1}$, our truncated-Krylov solver could solve the Newton equation in a single iteration. Though, to be sure, such an inverse may not always exist.
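We can see the effect of clustering numerically. The following toy experiment, unrelated to Optizelle itself, runs CG on a diagonal operator whose eigenvalues are spread over four orders of magnitude, and then again with its exact inverse as a preconditioner, which clusters every eigenvalue at one and finishes in a single iteration.

```python
# A small numerical illustration of the clustering argument: CG on a
# diagonal operator with spread-out eigenvalues versus the same system
# preconditioned by its exact inverse.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 200
rng = np.random.default_rng(0)
eigs = np.logspace(0, 4, n)                       # eigenvalues spread over [1, 1e4]
b = rng.standard_normal(n)

A = LinearOperator((n, n), matvec=lambda v: eigs * v)
Minv = LinearOperator((n, n), matvec=lambda v: v / eigs)   # the perfect preconditioner

def cg_iterations(M=None):
    its = 0
    def count(xk):
        nonlocal its
        its += 1
    cg(A, b, M=M, rtol=1e-8, callback=count)
    return its

print("CG iterations, unpreconditioned:      ", cg_iterations())
print("CG iterations, perfect preconditioner:", cg_iterations(Minv))
```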

In a composite-step SQP method, we face a similar issue. There, we use GMRES to solve the augmented system

$$ \begin{bmatrix} I & g^\prime(x)^*\\ g^\prime(x) & 0 \end{bmatrix} \begin{bmatrix} \delta x\\ \delta y \end{bmatrix} = \begin{bmatrix} a\\ b \end{bmatrix} $$

where the vectors $a$ and $b$ depend on the particular augmented system being solved, $g^\prime(x)$ denotes the total (Fréchet) derivative of $g$ at $x$, and $g^\prime(x)^*$ denotes its adjoint. Similar to the case in unconstrained optimization, we don’t require an exact solution, but we still need to solve the system well enough and how quickly we accomplish that depends on how well the eigenvalues of

$$ \begin{bmatrix} I & g^\prime(x)^*\\ g^\prime(x) & 0 \end{bmatrix} $$

are clustered. Fortunately, we can construct a Schur preconditioner of the form

$$ \begin{bmatrix} I & 0\\ 0 & (g^\prime(x)g^\prime(x)^*)^{-1} \end{bmatrix} $$

which gives us exactly three clusters of eigenvalues and allows us to solve the augmented system in three iterations of GMRES. That reduces the problem to approximating the inverse $(g^\prime(x)g^\prime(x)^*)^{-1}$.
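If we're willing to form $g^\prime(x)$ as an explicit matrix, we can check this claim numerically. The toy script below, which is purely illustrative, builds a random Jacobian, assembles the augmented operator and the Schur preconditioner, and counts GMRES iterations.

```python
# A toy check of the three-cluster claim: a random, full-rank g'(x), the
# augmented operator, and the block-diagonal Schur preconditioner above.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(1)
n, m = 200, 40                              # sizes of x and of the constraints
gp = rng.standard_normal((m, n))            # stand-in for g'(x)

K = np.block([[np.eye(n), gp.T],
              [gp, np.zeros((m, m))]])      # the augmented operator
S = np.linalg.inv(gp @ gp.T)                # (g'(x) g'(x)^*)^{-1}, formed exactly here
P = LinearOperator((n + m, n + m),
                   matvec=lambda v: np.concatenate([v[:n], S @ v[n:]]))

rhs = rng.standard_normal(n + m)
its = 0
def count(res):
    global its
    its += 1
sol, info = gmres(K, rhs, M=P, rtol=1e-10, callback=count, callback_type='pr_norm')
print("GMRES iterations with the Schur preconditioner:", its)   # roughly three
```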

Unfortunately, there’s no fully generic way to approximate an inverse short of a factorization, incomplete or otherwise. Really, the problem with factorizations is that it becomes increasingly difficult to scale these algorithms to thousands of processors. Even good sparse, direct solvers can have trouble with fill-in and too much fill-in can lead to memory problems. Certainly, for certain classes of PDE constrained optimization problems, we can use highly effective preconditioners like algebraic multigrid. However, at the end of the day, many of these preconditioners still end up using an incomplete LU factorization at the lowest level, so matrices and factorizations are still involved. In other words, matrix factorizations still represent our best, most reliable way to construct a generic preconditioner. At the same time, for many problems, this is not good enough and our optimization solver must be able to handle both kinds of preconditioners.
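As one concrete instance of the factorization-as-preconditioner idea, the sketch below uses an incomplete LU factorization of a 2-D Laplacian as a GMRES preconditioner. The matrix and the drop tolerance are made up for the example; for a real problem the operator would come from the model.

```python
# Incomplete LU as a generic preconditioner: factor a sparse matrix
# approximately, then hand the resulting solve to GMRES as M.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres, spilu

n = 50
I = sp.identity(n, format='csc')
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
A = (sp.kron(I, T) + sp.kron(T, I)).tocsc()     # 2-D Laplacian, (n*n) x (n*n)

ilu = spilu(A, drop_tol=1e-4)                   # the incomplete factorization
M = LinearOperator(A.shape, matvec=ilu.solve)   # applied as a preconditioner

b = np.ones(A.shape[0])
x, info = gmres(A, b, M=M, rtol=1e-10)
print("converged:", info == 0, " residual norm:", np.linalg.norm(A @ x - b))
```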

At this point, we contend that we need matrices most, but not all, of the time. This leads to the question: When we do use matrices, should we specify how these matrices are stored and factored? Here, I strongly contend: No. We can see this by considering different kinds of matrix representations. For example, if we required a user to define all linear operators using a dense, column-major format like Fortran's, we'd be very fast on certain classes of small problems, but we wouldn't scale. If we required a user to use a distributed, MPI-based sparse matrix format for their matrices, we'd scale well on certain classes of PDE-constrained optimization problems, but this would be extremely painful to use when solving something like the Rosenbrock function in MATLAB. Basically, we need the right tool for the right job, and mandating a matrix format means that we restrict ourselves to a certain kind of problem class. Funnily enough, this is the exact same argument that we make when promoting the use of general, user-defined vector spaces.

How Matrix-Free Elements Appear in Optizelle

As we noted above, Optizelle does currently have the ability to use both matrices and general operators, which it accomplishes by allowing the user to specify general operators that accept one vector and return another. If we want to use a matrix to compute the action of the operator, we are free to do so. At the same time, this has to be done carefully in order to generate a fast code. For example, in the composite-step SQP method, we only need to factorize the operator $g^\prime(x)g^\prime(x)^*$ once per iteration. As long as we store this factorization, we can reuse it in all of our augmented system solves. Technically, this is a little trickier and we actually want to cache two factorizations. When we solve for our equality multiplier step, we actually solve the augmented system at a new trial point, which requires a new factorization. If we reject the step, we revert to our old point, which requires our previous factorization.
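To give a flavor of what that user-written caching looks like, here's a hedged sketch, not the actual Optizelle interface, of a preconditioner that factors $g^\prime(x)g^\prime(x)^*$ only when the point changes and keeps the previous factorization around in case a trial point gets rejected.

```python
# A sketch of the caching pattern described above; the class, its names,
# and the dense Jacobian are all illustrative, not Optizelle's interface.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class SchurPreconditioner:
    def __init__(self, gprime):
        self.gprime = gprime            # x -> dense Jacobian g'(x)
        self.cache = {}                 # point -> Cholesky factorization

    def _factorization(self, x):
        key = x.tobytes()
        if key not in self.cache:
            G = self.gprime(x)
            self.cache[key] = cho_factor(G @ G.T)   # factor g'(x) g'(x)^*
            if len(self.cache) > 2:                 # keep the current and previous points
                self.cache.pop(next(iter(self.cache)))
        return self.cache[key]

    def solve(self, x, r):
        # Applies (g'(x) g'(x)^*)^{-1} r, reusing the cached factorization
        return cho_solve(self._factorization(x), r)

# Illustrative use: repeated solves at the same point factor only once
rng = np.random.default_rng(2)
gp = rng.standard_normal((5, 20))
P = SchurPreconditioner(lambda x: gp)   # constant g'(x), just for the demo
x = np.zeros(20)
r = rng.standard_normal(5)
print(P.solve(x, r))
print(P.solve(x, r))                    # this call hits the cache
```

The trouble, as discussed next, is that this cache lives inside the user's function where the solver can't see it.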

In any case, at the moment, we require the user to write this caching, which isn't ideal for a few reasons. First, it means we expect our user to know what to cache where, which isn't quite fair as these algorithms are moderately complicated. Second, if the machine crashes, we don't have the ability to leverage our restart scheme to restore the factorization. We have to recompute it. Third, factorizations can and will fail, and having the factorization cached inside of a function makes it difficult to debug. One of the nice, auxiliary benefits of restart files is that they make it easier to debug algorithmic difficulties since we have direct access to all the data therein. As such, we're faced with a dilemma: How do we allow for general operators and still cache matrix factorizations effectively?

To that point, I’m not fully decided, but I have some ideas. Essentially, we need a new algebra for operators that includes the operations

  • Generate

  • Apply

  • Adjoint apply

  • Factorize

  • Solve

We also need two new data types for the

  • Representation

  • Factorization

First, the operation “Generate” takes in the current data point and creates some sort of representation of the operator. Then, the operations “Apply” and “Adjoint apply” apply this representation or its adjoint to a vector, respectively. Next, “Factorize” generates some sort of factorization from the representation. For an actual matrix, this would be a genuine factorization. For something like algebraic multigrid, this would be all of the associated information necessary to create the preconditioner. Finally, “Solve” uses the factorization in order to solve a system.
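To make this more concrete, here's a speculative sketch of what such an algebra could look like; none of these names exist in Optizelle today, and the dense-matrix instance at the bottom is only one possible choice of representation and factorization.

```python
# A speculative sketch of the proposed operator algebra.  The abstract
# methods mirror the five operations above; the representation and
# factorization types are whatever the concrete instance chooses.
from abc import ABC, abstractmethod
import numpy as np
from scipy.linalg import lu_factor, lu_solve

class OperatorAlgebra(ABC):
    @abstractmethod
    def generate(self, x):
        """Build a representation of the operator at the point x."""

    @abstractmethod
    def apply(self, rep, dx):
        """Apply the representation to a vector."""

    @abstractmethod
    def adjoint_apply(self, rep, dy):
        """Apply the adjoint of the representation to a vector."""

    @abstractmethod
    def factorize(self, rep):
        """Produce a factorization from the representation."""

    @abstractmethod
    def solve(self, fact, b):
        """Use the factorization to solve a linear system."""

class DenseMatrixAlgebra(OperatorAlgebra):
    """Representation: a dense square matrix.  Factorization: its LU."""
    def __init__(self, build):
        self.build = build              # x -> dense matrix, user supplied

    def generate(self, x):
        return self.build(x)

    def apply(self, rep, dx):
        return rep @ dx

    def adjoint_apply(self, rep, dy):
        return rep.T @ dy

    def factorize(self, rep):
        return lu_factor(rep)

    def solve(self, fact, b):
        return lu_solve(fact, b)

# Illustrative use on a small made-up operator
alg = DenseMatrixAlgebra(lambda x: np.eye(4) + np.outer(x, x))
rep = alg.generate(np.ones(4))
fact = alg.factorize(rep)
print(alg.solve(fact, alg.apply(rep, np.ones(4))))   # recovers the vector of ones
```

For algebraic multigrid, the representation might be the fine-grid matrix and the factorization the whole multigrid hierarchy; the interface doesn't care, which is the point.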

Now, the benefit of using a system like this is that we would then gain the ability to cache the operators and factorizations within the optimization state. This would also allow us to utilize the restart framework, which gives us repeatability as well as the ability to debug.

At the same time, there’s not a great way to inject this algebra into our current setup. Right now, the user provides Optizelle information about the real type as well as all of the vector spaces X, Y, and Z. While the Hessian only involves the vector space X, the operator $g^\prime(x)$ maps between the vector spaces X and Y, and the operator $h^\prime(x)$ maps between the vector spaces X and Z. Therefore, it’s awkward to add these operations and data types to the existing vector spaces. Alternatively, we could just pass in these new operations separately, but this increases the number of template arguments for a constrained optimization problem from four to seven: the real type, three vector spaces, and three operator algebras. In addition, the number of data types that our restart object needs to manage increases from six to twelve: reals, naturals, parameters, three different kinds of vectors, three different kinds of representations, and three different kinds of factorizations. On top of this, the user needs to provide code that describes how to read and write these representations and factorizations since their form is problem specific; they could be serial, parallel, dense, sparse, etc. Certainly, our default vector spaces can take care of many of these details, but going down this route increases the complexity of problem setup considerably.

At the end of the day, will these changes be required? Probably. In general, we spend far more time running and debugging problems than setting them up, so we should favor designs that make our codes more efficient and easier to debug. At the same time, we want to get the design correct, and this issue requires more thought.

Summary

In short

  • A matrix-free code is one that doesn’t require its linear operators to be matrices

  • Most of the time, we want to use a matrix anyway because factorizations are a great tool for constructing a general preconditioner

  • Once we use factorizations, we need to cache them carefully in order to compute efficiently

  • If we cache factorizations, we want to do so implicitly, so that we can more easily debug issues and utilize our restart framework

  • Caching factorizations implicitly requires our users to provide additional operations and data types

  • Requiring additional operations and data types complicates our interfaces for both solves and restarts

  • In the end, these changes are probably necessary, but we need to carefully consider whether or not we have the right interfaces

Certainly, if anyone else has a good idea of how to tackle this problem, please drop me a line.