**Thallo -- Scheduling for High-Performance Large-scale Non-linear Least-Squares Solvers**
Michael Mara, Felix Heide, Michael Zollhöfer, Matthias Nießner, Pat Hanrahan
![](./images/representative_image_thallo_siggraph_small.png)
# Overview
Large-scale optimization problems at the core of many graphics, vision, and imaging applications are often implemented by hand in a tedious and error-prone process in order to achieve high performance (in particular on GPUs), despite recent developments in libraries and DSLs. At the same time, these hand-crafted solver implementations reveal that the key to high performance is a problem-specific schedule that enables efficient usage of the underlying hardware.
Thallo is a domain-specific language (DSL) that generates tailored high-performance GPU solvers and schedules from a concise, high-level energy description, without the hassle of manually constructing and maintaining such solvers.
It is a direct successor to the [Opt DSL](http://optlang.org/), extended with compiler transforms and a scheduling language inspired by [Halide](https://halide-lang.org/) that enable it to tackle a much wider class of problems (from BundleFusion to blendshape fitting) with state-of-the-art performance.
# Resources
- The original paper has been accepted to Transactions on Graphics and will be (remotely) presented in August 2021 at [SIGGRAPH](https://s2021.siggraph.org).
- A pre-alpha version of the code has been released on [GitHub](https://github.com/thallolang/thallo); an alpha version with build instructions and a Docker container are slated for a September 2021 release.
- Since this project is a direct descendant of Opt, [its website](http://optlang.org/) is a good resource as well.
For more information, contact:
mmara at cs dot stanford dot edu
# Illustrative Example: 3D Blendshape Fitting
For illustrative purposes, let's take a detailed look at a simple example:
## Example Overview
Non-rigid model-based registration is a fundamental building block for a large variety of tracking and
reconstruction approaches. One very prominent example is fitting an affine parametric 3D mesh model
$P$ to 2D observations, e.g., as done in the Face2Face [Thies et al. 2016] approach. In the following,
we assume that the embedding of a mesh is represented by a vector $\mathbf{m}\in \mathbb{R}^{3N}$ that stacks its $N$ vertices.
In the case of face tracking, the affine parametric model is then defined by a low-dimensional expression subspace
of $M$ blendshapes spanned by a matrix $B\in\mathbb{R}^{3N\times M}$ that models the facial expressions relative
to a neutral face template $\mathbf{a}\in \mathbb{R}^{3N}$; $\mathbf{m}=\mathcal{P}(\mathbf{c})=\mathbf{a} + \mathbf{B} \mathbf{c}$.
Each of the $M$ blendshapes represents a different facial expression as per-vertex offsets to the neutral face template.
New faces $\mathbf{m}\in \mathbb{R}^{3N}$ are created by interpolation of these $M$ displacement vectors based on the $M$ blendshape weights
$\mathbf{c}\in \mathbb{R}^{M}$. In the easiest scenario, fitting the model to 3D data means finding the best
coefficients $c_*$, such that $\mathcal{P}(\mathbf{c}_*)$ best matches a set of target correspondences $\mathbf{t}$.
Typically, the number of vertices is much larger than the number of blendshapes ($N \gg M$), hence the resulting Jacobian
matrix has many more rows than columns and is fully dense. We regularize the model by adding a term
penalizing weights with large magnitude. This optimization is run in alternation with a correspondence search to build a model-based non-rigid ICP.
![An example of parametric model fitting using optimization. The mesh is constrained to be a linear combination of basis meshes, called blendshapes, and the optimization seeks to minimize vertex distances to target correspondences, shown in green.
](./images/face_fitting.png)
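The blendshape model $\mathcal{P}(\mathbf{c}) = \mathbf{a} + \mathbf{B}\mathbf{c}$ described above can be sketched in a few lines of NumPy (toy dimensions and random data, purely for illustration — this is not Thallo code):

```python
import numpy as np

# Toy dimensions (assumed): N vertices, M blendshapes.
N, M = 1000, 25

rng = np.random.default_rng(0)
a = rng.standard_normal(3 * N)       # neutral face template, stacked xyz
B = rng.standard_normal((3 * N, M))  # blendshape basis: per-vertex offsets per expression
c = rng.standard_normal(M)           # blendshape weights (the unknowns)

# P(c) = a + B c : interpolate the M displacement vectors by the weights c.
m = a + B @ c
assert m.shape == (3 * N,)
```

Setting $\mathbf{c} = \mathbf{0}$ recovers the neutral template, and each weight $c_j$ scales the $j$-th expression's per-vertex offsets.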
## Mathematical Formulation
\begin{equation}
\min_{\mathbf{c}} \sum_{i=1}^N \big|\big| \mathbf{t}_{i}-\Pi\left(\mathcal{P}(\mathbf{c})_i\right) \big|\big|_{2}^{2} + \sum_{j=1}^M \omega_r c_j^2 \\
\mathcal{P}(\mathbf{c}) = \mathbf{m} = \mathbf{a} + \mathbf{B} \mathbf{c}
\end{equation}
where $\mathbf{m}$ are the vertex positions of the parametric mesh, calculated from the neutral template $\mathbf{a}$ with offsets obtained by multiplying the blendshape matrix $\mathbf{B}$ by the vector of blendshape weights $\mathbf{c}$, which is what we optimize for. $\mathbf{t}$ are the target (2D) correspondences, and $\Pi$ is the camera transformation and projection function, which transforms 3D points into 2D coordinates. $\omega_r$ is the regularization weight.
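If we drop the projection $\Pi$ and fit directly to 3D correspondences (an assumed simplification that makes the problem linear), the regularized objective has a closed-form Gauss-Newton solution via the normal equations $(\mathbf{B}^\top\mathbf{B} + \omega_r \mathbf{I})\,\mathbf{c}_* = \mathbf{B}^\top(\mathbf{t} - \mathbf{a})$. A minimal NumPy sketch with synthetic data:

```python
import numpy as np

# Toy problem: fit blendshape weights to noiseless 3D targets (Pi dropped).
N, M = 500, 10
rng = np.random.default_rng(1)
a = rng.standard_normal(3 * N)        # neutral template
B = rng.standard_normal((3 * N, M))   # blendshape basis
w_reg = 0.1                           # regularization weight omega_r

c_true = rng.standard_normal(M)
t = a + B @ c_true                    # targets generated from known weights

# Normal equations of the regularized linear least-squares problem.
JtJ = B.T @ B + w_reg * np.eye(M)
rhs = B.T @ (t - a)
c_star = np.linalg.solve(JtJ, rhs)
```

With noiseless data and small $\omega_r$, `c_star` recovers `c_true` almost exactly; the nonlinear $\Pi$ in the real problem is what forces an iterative Gauss-Newton-style solver instead.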
## Code
The code is expressed per-element. This requires a straightforward translation from the slightly higher-level linear algebra formulation above. The matrix multiply becomes a dot product (sum of products) along the blendshape weight dimension $M$. `BlendshapeWeights` is $\mathbf{c}$, `BlendshapeBasis` is $\mathbf{B}$, `AverageMesh` is $\mathbf{a}$, `Target` is $\mathbf{t}$, and `w_regSqrt` is $\sqrt{\omega_r}$. $\Pi$ is implemented by the line using `cameraParams` and `cameraXform`. The intermediate expression `Mesh` corresponds to $\mathcal{P}(\mathbf{c}) = \mathbf{m}$. We also add some extra code to allow users to disable the fitting term for some vertices by setting the target position's $x$ coordinate to a large negative number. In the final line we use a single scheduling construct to direct the compiler to split the core $\mathbf{J}^T\mathbf{J}p$ computation into two separate (matrix-free) kernels, materializing $\mathbf{J}p$ in the first step. Both of these kernels are (by default) parallelized over the residual terms.
~~~~lua
local N,M = Dims("N", "M")
Inputs {
    BlendshapeWeights = Unknown(opt_float, {M}, 0),
    AverageMesh = Array(opt_float3, {N}, 1),
    BlendshapeBasis = Array(opt_float3, {N,M}, 2),
    Target = Array(opt_float3, {N}, 4),
    w_regSqrt = Param(float, 5),
    cameraParams = Param(float9, 6),
    cameraXform = Param(float6, 7)
}
local m,n = M(),N()
local Mesh = AverageMesh(n) + Sum({m}, BlendshapeBasis(n,m)*BlendshapeWeights(m))
local Pos2D = project_point(rigid_trans(PoseToMatrix(cameraXform), Mesh), cameraParams)
local e_fit = Target(n) - Pos2D
local valid = greatereq(Target(n,0), -999999.9)
r = Residuals {
    reg = w_regSqrt*BlendshapeWeights(m),
    fit = Select(valid, e_fit, 0)
}
r.fit.Jp:set_materialize(true)
~~~~
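The effect of the `set_materialize` schedule can be sketched numerically (a NumPy analogy, not Thallo's generated code — `J`, `p`, and the dimensions here are illustrative assumptions): instead of fusing the whole $\mathbf{J}^T\mathbf{J}p$ product into one kernel, the scheduled version materializes the intermediate $\mathbf{J}p$ in a first pass and reduces $\mathbf{J}^T(\mathbf{J}p)$ in a second.

```python
import numpy as np

N, M = 2000, 30
rng = np.random.default_rng(2)
J = rng.standard_normal((3 * N, M))  # dense fit Jacobian: one row block per residual
p = rng.standard_normal(M)           # descent direction from the linear solver

# Fused evaluation: JtJp computed in one pass, nothing stored.
fused = J.T @ (J @ p)

# Two-kernel schedule: pass 1 materializes Jp over the residual terms,
# pass 2 reduces Jt(Jp) back into the M unknowns.
Jp = J @ p
two_pass = J.T @ Jp

assert np.allclose(fused, two_pass)  # identical result, different memory/compute trade-off
```

Both variants compute the same vector; the scheduling choice trades recomputation against memory traffic for the materialized intermediate, which is exactly the kind of problem-specific decision Thallo's scheduling language exposes.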
This code uses the following helper functions from `lib.t`:
- `PoseToMatrix`
- `rigid_trans`
- `project_point`
## References
- [Face2Face: Real-time Face Capture and Reenactment of RGB Videos](http://www.graphics.stanford.edu/~niessner/papers/2016/1facetoface/thies2016face.pdf)
- [Deformable Model Fitting by Regularized Landmark Mean-Shift](https://ci2cv.net/media/papers/2011_IJCV_Saragih.pdf)