template mesh, created prior to the capture work using a separate photogrammetry pipeline, is then projected onto the unstructured mesh and associated with the optical flow. The template mesh is tracked across the performance and any issues are fixed semi-automatically in the DI4DTrack software by a tracking artist. The position and orientation of the head are then stabilized using a few key vertices of the tracking mesh. Finally, a point cache of the facial performance is exported for the fixed-topology template mesh. The point cache file contains the positions of each vertex in the mesh for each frame of animation in the shot.

Additional automated deformations are later applied to the point cache to fix the remaining issues. These deformations were not applied to the point caches in the training set. For example, the eyelids are deformed to meet the eyeball exactly and to slide slightly with motion of the eyes. Also, opposite vertices of the lips are smoothly brought together to improve lip contacts when needed. After animating the eye directions, the results are compressed for runtime use in Remedy's Northlight engine using 416 facial joints. Pose space deformation is used to augment the facial animation with detailed wrinkle normal map blending.

2 Previous Work

While facial performance capture systems come in many varieties, all share a fundamental goal: the non-rigid tracking of the shape of the actor's head throughout a performance, given inputs in the form of video sequences, RGB-D sequences, or other measurements. Once solved for, the moving geometry is often further retargeted onto an existing animation rig for further processing. In this work, we concentrate on the first problem: given a video sequence as input, our system outputs a time-varying mesh sequence that tracks the performance.

There are numerous existing methods for time-varying facial 3D reconstruction (tracking). Markers drawn at specific locations on the actor's face enable multi-view stereo techniques to find the markers' trajectories in space, and knowledge of their positions on the face geometry allows estimating a deformation for the entire head [Williams 1990]. Markerless techniques, on the other hand, attempt to track the entire face simultaneously, often with the help of a parameterized template head model that may include animation priors, e.g., [Zhang et al. 20xx; Weise et al. 20xx], or by performing multi-view stereo or photometric reconstructions of the individual input frames [Furukawa and Ponce 20xx; Beeler et al. 20xx; Fyffe et al. 20xx].

In contrast to the direct geometric computer vision approaches described above, machine learning techniques have been successfully applied to facial performance capture as well. A typical approach is to use radial basis functions (RBF) to infer blendshape weights from motion-capture data based on a set of training examples [Deng et al. 20xx]. More recently, restricted Boltzmann machines (RBM) and multilayer perceptrons (MLP) have also been used for similar tasks [Zeiler et al. 20xx; Costigan et al. 20xx]. Furthermore, neural networks have been used to infer facial animation data from audio [Hong et al. 20xx]. To our knowledge, none of the existing machine learning methods are, however, able to perform facial performance capture from video data in an end-to-end fashion. As a consequence, they have to rely on other techniques to extract time-varying feature vectors from the video.
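To make the typical RBF-based approach concrete, the sketch below shows one way to infer blendshape weights from per-frame marker positions with a Gaussian kernel, written in Python/NumPy. The kernel choice, array shapes, and function names are illustrative assumptions only and do not reproduce the formulation of the cited works.

# Illustrative sketch (assumption, not the cited works' implementation) of
# RBF-based inference of blendshape weights from per-frame marker positions.
import numpy as np

def fit_rbf(train_markers, train_weights, sigma=1.0):
    """Fit RBF coefficients mapping stacked marker positions to blendshape weights.

    train_markers: (N, D) array, one flattened marker configuration per training pose.
    train_weights: (N, B) array, blendshape weights for the same poses.
    """
    # Pairwise distances between training poses and the Gaussian kernel matrix.
    dists = np.linalg.norm(train_markers[:, None, :] - train_markers[None, :, :], axis=-1)
    phi = np.exp(-(dists / sigma) ** 2)
    # Solve phi @ coeffs = train_weights in the least-squares sense.
    coeffs, *_ = np.linalg.lstsq(phi, train_weights, rcond=None)
    return coeffs

def infer_weights(markers, train_markers, coeffs, sigma=1.0):
    """Infer blendshape weights for one new frame of marker data (shape (D,))."""
    dists = np.linalg.norm(train_markers - markers[None, :], axis=-1)
    phi = np.exp(-(dists / sigma) ** 2)
    return phi @ coeffs  # (B,) blendshape weights for this frame

Note that such a method still depends on an external system to deliver the per-frame marker or feature vectors; only the mapping from features to animation parameters is learned.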
We base our work on deep convolutional neural networks [Simard et al. 20xx], which have received significant attention in recent years and have proven particularly well suited for large-scale image recognition tasks [Krizhevsky et al. 20xx; Simonyan and Zisserman 20xx]. Modern convolutional neural networks employ various techniques to reduce training time and improve generalization over novel input data. These include piecewise-linear activation functions [Krizhevsky et al. 20xx], data augmentation [Simard et al. 20xx], dropout regularization [Hinton et al. 20xx; Srivastava et al. 20xx], and GPU acceleration [Krizhevsky et al. 20xx]. Furthermore, it has been shown that state-of-the-art performance can be achieved with a very simple network architecture that consists mainly of small 3×3 convolutional layers [Simonyan and Zisserman 20xx] that employ strided output to reduce spatial resolution throughout the network [Springenberg et al. 20xx].

In a way, our method is a "meta-algorithm" in the sense that it relies on an existing technique for generating the training examples; however, after our network has learned to mimic the host algorithm, it produces results at a fraction of the cost. While we base our system on a specific commercial solution, the same general idea can be built on top of any facial motion capture technique that takes video inputs.

3 Network Architecture

Our input footage is divided into a number of shots, with each shot typically consisting of 100–2000 frames at 30 FPS. Data for each input frame consists of a 1200×1600 pixel image from each of the nine cameras. As explained above, the output is the per-frame vertex positions for each of the ~5000 facial mesh vertices, i.e., ~15000 scalars (= Nout) in total. As the input for the network, we take the 1200×1600 video frame from the central camera, crop it with a fixed rectangle so that the face remains in the picture, and scale the remaining portion to 240×320 resolution. Furthermore, we convert the image to grayscale, resulting in a total of 76800 scalars to be fed to the network. The resolution may seem low, but numerous tests confirmed that increasing it did not improve the results.

During the course of the project, we experimented with two neural network architectures.
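For concreteness, the sketch below illustrates the input preprocessing just described (fixed crop, grayscale conversion, downscaling to 240×320) together with one plausible all-convolutional regressor in the spirit of the networks discussed in Section 2: small 3×3 convolutions with strided output, followed by a fully connected layer that produces the Nout vertex coordinates. It is written in PyTorch purely for illustration; the crop rectangle, layer count, and channel widths are assumptions and not the architecture evaluated in this work.

# Minimal sketch (assumptions noted below) of the preprocessing and a possible
# all-convolutional regressor; not the exact pipeline or architecture used here.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_OUT = 15000  # ~5000 vertices x 3 coordinates (Nout)

def preprocess(frame_rgb, crop=(0, 0, 1200, 1600)):
    """Central-camera frame -> (1, 240, 320) grayscale network input (76800 scalars).

    frame_rgb: float tensor of shape (3, 1200, 1600) with values in [0, 1].
    crop:      (top, left, height, width) of the fixed face rectangle (hypothetical values).
    """
    top, left, h, w = crop
    patch = frame_rgb[:, top:top + h, left:left + w]
    gray = patch.mean(dim=0, keepdim=True)                      # grayscale, (1, h, w)
    gray = F.interpolate(gray[None], size=(240, 320),
                         mode='bilinear', align_corners=False)  # (1, 1, 240, 320)
    return gray[0]

class ConvRegressor(nn.Module):
    """Strided 3x3 convolutions followed by a fully connected layer emitting N_OUT scalars."""

    def __init__(self, n_out=N_OUT):
        super().__init__()
        chans = [1, 32, 64, 96, 128, 160]  # hypothetical channel widths
        self.convs = nn.ModuleList([
            nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1)
            for cin, cout in zip(chans[:-1], chans[1:])
        ])
        self.fc = nn.Linear(chans[-1] * 8 * 10, n_out)  # five stride-2 layers: 240x320 -> 8x10

    def forward(self, x):  # x: (batch, 1, 240, 320)
        for conv in self.convs:
            x = F.relu(conv(x))
        return self.fc(x.flatten(start_dim=1))  # (batch, n_out) vertex coordinates

# Example usage (random data standing in for a captured frame):
#   frame = torch.rand(3, 1200, 1600)
#   vertices = ConvRegressor()(preprocess(frame)[None])  # (1, 15000)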