Research on expression extraction and face animation generation based on visual features

Abstract

Face modelling is central to modern visual effects in films and computer games. Building on dynamic modelling of a 3D face model, this paper proposes a modelling method based on feature extraction. For each captured face image, the method first locates the face region, then extracts the feature points of that region, deforms a standard 3D face model according to the extracted feature points, and finally obtains a real-time three-dimensional face animation system. The experimental results show that the proposed method completes 3D face modelling of the corresponding expression accurately and in real time.

Keywords

Visual features, expression extraction, moving least squares algorithm, face animation generation

1. Introduction

The human face carries and conveys emotion, identity and mental state, and it has rich social and cultural significance. Beyond its natural physiological characteristics, the face also carries a wealth of social information, which is inseparable from its role in social, cultural and artistic life. In interaction between people in particular, the face plays a unique and important role in transmitting information. Together with voice and gestures, it expresses the emotions, content and hints that social interaction requires [1].

Face animation is an important research topic in computer animation, human-computer interaction and computer vision. It brings great convenience and enjoyment to communication, culture and entertainment. In film and television, virtual teaching, web conferencing and video games alike, face animation technology has become an integral part [2]. This paper starts from video-based recognition of face shape features and proposes an expressive face animation generation technique that builds on the extracted features. The technique consists of two major processing steps. Firstly, a key frame generation technique based on expression mapping and image fusion produces a complete key frame library, containing the various deformations and expressions of the face, from a small number of pre-recorded key frame images. Secondly, a fast incremental intermediate frame generation algorithm dynamically selects key frames according to the known key frame library and the input parameters, and inserts intermediate frames between key frames to generate smooth, natural and expressive face animation. Experiments show that the method generates face animation with rich expressions and is easy to combine with driving methods such as voice. The method can also generate stylized facial expression animation, such as cartoons.

2. Face shape feature extraction based on condensation algorithm

The Condensation algorithm is an important technique for video-based object recognition; it is also an algorithmic framework that can be easily extended. Its main idea is to assume a first-order Markov property between successive video frames, that is, the position, shape and other information of the target in a given frame depend only on the previous frame. The target information already identified in the previous frame is therefore used to predict the target information of the current frame, and the predicted value is then corrected according to properties of the current frame image to obtain the final recognition result.

In an observation-true value model, let $z$ be the observed value and $x$ the true value. The quantity of interest is the conditional probability of the true value $x$ given the observed value $z$ . According to Bayes’ theorem:

$\displaystyle{p}\left({x|z}\right)=kp\left({z|x}\right)p\left(x\right)$ (1)

Here $k$ is a normalising constant that does not depend on $x$ . When ${p}\left({x|z}\right)$ is complex, it is difficult to evaluate in a conventional way [3]. We can then use sampling instead: first draw a series of samples ${s}^{\left(1\right)},\ldots,{s}^{\left(N\right)}$ from the probability distribution $p\left(x\right)$ , and then resample them by selecting an index $n$ from $1,2,\ldots,N$ according to the following probability distribution $\pi_{n}$ :

$\displaystyle\pi_{n}=\frac{p_{z}\left({s^{\left(n\right)}}\right)}{\sum\nolimits_{j=1}^{N}{p_{z}\left({s^{\left(j\right)}}\right)}},\quad p_{z}\left(x\right)=p\left({z|x}\right)$ (2)

The new sample set constructed in this way approximately follows the ${p}\left({x|z}\right)$ distribution, with the approximation improving as $N$ grows, and expectations under ${p}\left({x|z}\right)$ can be estimated directly from a weighted average of the samples. Sampling from the distribution thus avoids the difficulty of modelling the unknown probability distribution explicitly.
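The resampling rule in Eq. (2) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes a scalar state, a uniform prior, and a Gaussian form for $p(z|x)$, and the names `likelihood` and `resample` are hypothetical.

```python
import math
import random


def likelihood(z, x, sigma=1.0):
    """Assumed observation density p(z|x): an unnormalised Gaussian centred on x."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def resample(samples, z, rng):
    """Compute pi_n as in Eq. (2) and draw a new sample set with those probabilities."""
    scores = [likelihood(z, s) for s in samples]
    total = sum(scores)
    weights = [p / total for p in scores]  # pi_n, normalised to sum to 1
    picked = rng.choices(samples, weights=weights, k=len(samples))
    return picked, weights


rng = random.Random(0)
prior_samples = [rng.uniform(-5.0, 5.0) for _ in range(2000)]  # draws from p(x)
posterior_samples, weights = resample(prior_samples, z=2.0, rng=rng)
# The plain mean of the resampled set approximates the expectation of x given z.
estimate = sum(posterior_samples) / len(posterior_samples)
```

Because the resampled set already reflects the weights, a plain average over it plays the role of the weighted average over the original samples.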

The probability distribution of the face shape feature is represented by a sample set. At time $t$ , the sample set is $\left\{{s_{t}^{\left(n\right)},\pi_{t}^{\left(n\right)},n=1,\ldots,N}\right\}$ , where $s_{t}^{\left(n\right)}$ is a face shape feature and $\pi_{t}^{\left(n\right)}$ is its weight, or probability. The Condensation process computes the sample set at time $t$ from the sample set at time $t-1$ .

First, $N$ candidate samples, denoted $s_{t}^{{}^{\prime}\left(n\right)},n=1,\ldots,N$ , are drawn from the samples $s_{{t-1}}^{\left(n\right)}$ of time $t-1$ with probabilities proportional to their weights $\pi_{{t-1}}^{\left(n\right)}$ . The motion model of the face (obtained from training) and a random diffusion model are then applied to predict the sample set for the contour in the current frame:

$\displaystyle s_{t}^{\left(n\right)}=As_{t}^{{}^{\prime}\left(n\right)}+Bw_{t}^{\left(n\right)}$ (3)

This process is called movement and diffusion. The two matrices $A$ and $B$ are obtained by training and represent the regular motion parameters and the random motion parameters of face motion and deformation, while ${w}_{t}^{\left(n\right)}$ is a vector of independent standard normal random variables that characterizes the randomness of facial motion [4].

A weight is then calculated for each sample in the new sample set. The weight measures the difference between the sample and the current frame image (the observation): the larger the difference, the further the sample is from the true shape of the current face, and the lower its weight.

$\displaystyle\pi_{t}^{\left(n\right)}=p\left({z_{t}|{x_{t}=s_{t}^{\left(n\right)}}}\right)$ (4)

The current frame image can be observed in various ways. In this paper, edge detection is adopted: the Euclidean distance between the face shape of each sample and the nearest edge curve is measured, and samples with a larger distance receive a lower weight. Finally, the parameters of the estimated model are updated and the weighted average of the samples is output:

$\displaystyle\varepsilon\left[{{f}\left({x_{t}}\right)}\right]=\sum\limits_{n=1}^{N}{\pi_{t}^{\left(n\right)}{f}\left({s_{t}^{\left(n\right)}}\right)}$ (5)
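One full Condensation iteration (select, predict with Eq. (3), weight with Eq. (4), estimate with Eq. (5)) might be sketched as below. The sketch makes strong simplifying assumptions that are not in the paper: the state is a scalar instead of a face shape vector, $A$ and $B$ become the scalars `a` and `b`, and the edge-distance observation is replaced by a Gaussian likelihood around a noisy measurement. The function name `condensation_step` is hypothetical.

```python
import math
import random


def condensation_step(samples, weights, z, a, b, sigma_obs, rng):
    """One Condensation iteration for a scalar state (illustrative assumption)."""
    n = len(samples)
    # 1. Select: resample in proportion to the weights from time t-1.
    selected = rng.choices(samples, weights=weights, k=n)
    # 2. Predict: deterministic motion plus random diffusion, as in Eq. (3).
    predicted = [a * s + b * rng.gauss(0.0, 1.0) for s in selected]
    # 3. Measure: weight each sample by an assumed Gaussian p(z_t | x_t), Eq. (4).
    raw = [math.exp(-0.5 * ((z - s) / sigma_obs) ** 2) for s in predicted]
    total = sum(raw)
    new_weights = [r / total for r in raw]
    # 4. Estimate: weighted mean of the samples, Eq. (5) with f(x) = x.
    estimate = sum(w * s for w, s in zip(new_weights, predicted))
    return predicted, new_weights, estimate


rng = random.Random(1)
n = 500
samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
weights = [1.0 / n] * n
true_x = 0.0
for _ in range(20):
    true_x += 0.5                      # hidden state moves steadily
    z = true_x + rng.gauss(0.0, 0.2)   # noisy observation of the state
    samples, weights, estimate = condensation_step(
        samples, weights, z, a=1.0, b=0.6, sigma_obs=0.2, rng=rng)
```

In this toy setting the diffusion term `b` has to be large enough for some predicted samples to land near each new observation; otherwise all weights collapse towards zero.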

3. Face modelling based on coordinate system

The personalized face model is based on the Candide-3 neutral face model. Firstly, the three-dimensional face is obtained by binding the face feature points extracted from the video image to the corresponding feature points of the Candide-3 model. The initial displacement of the model feature points is then refined by the RBF interpolation algorithm, and finally the texture information is added to obtain the final realistic 3D face model. Each vertex in the Candide-3 neutral model has three coordinate values $x$ , $y$ , and $z$ , so the coordinates of vertex $i$ are ${P}_{i}=\left({{x}_{i},y_{i},z_{i}}\right)$ . The Candide-3 model can be expressed by the formula:

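$\displaystyle{g}\left({\sigma,\alpha}\right)=sR\left({\bar{g}+S\sigma+A\alpha}\right)+t$ (6)

where $\bar{g}$ is the standard shape of the model, $S$ and $A$ are the shape units and animation units, $\sigma$ and $\alpha$ are the shape and animation parameters, and $R$ , $s$ and $t$ are the rotation matrix, scale and translation vector.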
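As an illustration of Eq. (6), the sketch below evaluates the model vertex by vertex. It assumes the usual Candide-3 reading, in which the summed shape is rotated by $R$ , scaled by $s$ , and then translated by adding $t$ . The two-vertex mesh, the unit definitions, the restriction of $R$ to a rotation about the $z$ axis, and the name `candide_vertices` are all hypothetical choices made for brevity.

```python
import math


def candide_vertices(g_bar, shape_units, sigma, anim_units, alpha,
                     scale, angle_z, translation):
    """Evaluate g = s R (g_bar + S sigma + A alpha) + t vertex by vertex.

    shape_units[k][i] and anim_units[k][i] give the 3D displacement of
    vertex i under unit k. R is limited to a rotation about the z axis.
    """
    cos_a, sin_a = math.cos(angle_z), math.sin(angle_z)
    result = []
    for i, vertex in enumerate(g_bar):
        p = list(vertex)
        # Add the weighted shape-unit and animation-unit displacements.
        for units, params in ((shape_units, sigma), (anim_units, alpha)):
            for unit, coeff in zip(units, params):
                for d in range(3):
                    p[d] += coeff * unit[i][d]
        rotated = (cos_a * p[0] - sin_a * p[1],
                   sin_a * p[0] + cos_a * p[1],
                   p[2])
        result.append(tuple(scale * r + t for r, t in zip(rotated, translation)))
    return result


# Hypothetical two-vertex "mesh" with one shape unit and one animation unit.
g_bar = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shape_units = [[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]   # moves vertex 0 along x
anim_units = [[(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]]   # moves vertex 1 along -y
g = candide_vertices(g_bar, shape_units, [0.5], anim_units, [0.5],
                     scale=2.0, angle_z=math.pi / 2,
                     translation=(10.0, 0.0, 0.0))
```

Keeping the shape parameters and animation parameters separate lets the same person-specific shape be reused while only the animation parameters change from frame to frame.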

Figure 1. Characteristic point calibration structure.

In order to fine-tune the face mesh model, this paper uses the convolution method. Firstly, a finite set of feature points is selected and their displacements are calculated. An appropriate scattered data interpolation method is then chosen, and the displacements of the other points are calculated by solving for a suitable spatial interpolation function, which completes the elastic deformation of the entire character avatar mesh model. Binding textures makes the model look more realistic.
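The scattered data interpolation step can be sketched with radial basis functions. The example below is an assumption-laden illustration, not the paper's code: it works in 2D, uses a Gaussian kernel with a hypothetical `width` parameter, and solves the small linear system with a hand-written Gaussian elimination so that it needs only the standard library.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rbf_fit(points, displacements, width=1.0):
    """Return a function mapping any 2D point to an interpolated 2D displacement."""
    def phi(p, q):
        return math.exp(-(math.dist(p, q) / width) ** 2)  # Gaussian kernel (assumed)

    gram = [[phi(p, q) for q in points] for p in points]
    # One coefficient vector per displacement component (x and y).
    coeffs = [solve(gram, [d[k] for d in displacements]) for k in range(2)]

    def displace(p):
        basis = [phi(p, q) for q in points]
        return tuple(sum(c * b for c, b in zip(coeffs[k], basis)) for k in range(2))

    return displace


# Four hypothetical feature points with known displacements.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
displacements = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.0), (0.0, 0.0)]
displace = rbf_fit(points, displacements)
mid = displace((0.5, 0.5))  # displacement of a vertex that is not a feature point
```

The fitted function reproduces the known displacements at the feature points and varies smoothly in between, which is the property the mesh deformation relies on.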

This paper uses the 8-point method. The 8 points are: two points that determine the left eye distance, two points that determine the right eye distance, the nose tip point, the left mouth corner point, the right mouth corner point, and the lip centre point. Accurate feature point positioning can be used to correct the face angle and posture, thereby improving the accuracy of face recognition, as shown in Fig. 1. A good feature point calibration algorithm not only obtains the feature point positions efficiently and accurately, but is also robust to changes in expression, posture, rotation, occlusion and illumination. This paper uses convolutional neural networks for feature point calibration.

4. Local weight and image frame representation

4.1 Local weight sharing strategy

The local weight sharing strategy divides the input feature map into equal-sized regions. The convolution kernel weights are shared within each region, which is equivalent to extracting a local texture feature. The schematic diagram of the local weight sharing strategy is shown in Fig. 2.

Figure 2. Local weight sharing strategy.

Suppose ${I}\left({{h,w,m}}\right)$ represents the parameters of the feature maps in the layer preceding the convolutional layer, where the size of each feature map is ${h}\times{w}$ and the number of feature maps is $m$ . ${C}\left({{s,n,p,q}}\right)$ represents the convolutional layer parameters, where $s$ is the convolution kernel size and $n$ is the number of convolutional layer feature maps. The convolutional layer is divided into ${p}\times q$ equal-sized regions using the local weight sharing strategy, and the sigmoid activation function is used. The output value of a convolutional layer node is then calculated as:

$\displaystyle{y}_{{i},{j}}^{\left({t}\right)}=\textit{sigmoid}\left({\sum\limits_{{r=0}}^{{m-1}}{\sum\limits_{k=0}^{s-1}{\sum\limits_{{l=0}}^{s-1}{x_{i+k,j+l}^{\left(r\right)}\cdot{w}_{k,l}^{\left({r,u,v,t}\right)}}}}+b^{\left({{u},v,t}\right)}}\right)$ (7)

where $\left({u,v}\right)$ indexes the region that contains the output node $\left({i,j}\right)$ and $t$ indexes the output feature map.
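A convolution with locally shared weights in the spirit of Eq. (7) can be sketched as follows for a single output feature map. The indexing convention is an assumption: the output map is split into $p\times q$ regions, each with its own kernel and bias, and the region of an output node is found by integer division. The name `local_shared_conv` and the toy input are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def local_shared_conv(x, kernels, biases, s, p, q):
    """Valid convolution of m input maps with one kernel set per output region.

    x[r][i][j]               : input map r, of size h x w
    kernels[u][v][r][k][ell] : s x s kernel for region (u, v) and input map r
    biases[u][v]             : bias for region (u, v)
    """
    m, h, w = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = h - s + 1, w - s + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        u = i * p // out_h              # region row that contains output row i
        for j in range(out_w):
            v = j * q // out_w          # region column that contains output column j
            acc = biases[u][v]
            for r in range(m):
                for k in range(s):
                    for ell in range(s):
                        acc += x[r][i + k][j + ell] * kernels[u][v][r][k][ell]
            out[i][j] = sigmoid(acc)
    return out


# Toy example: one 5 x 5 input map of ones, 2 x 2 kernels, 2 x 2 regions.
m, h, w, s, p, q = 1, 5, 5, 2, 2, 2
x = [[[1.0] * w for _ in range(h)] for _ in range(m)]
# Region (u, v) uses a constant kernel with value u + v, so regions differ visibly.
kernels = [[[[[float(u + v)] * s for _ in range(s)] for _ in range(m)]
            for v in range(q)] for u in range(p)]
biases = [[0.0] * q for _ in range(p)]
y = local_shared_conv(x, kernels, biases, s, p, q)
```

Compared with a fully shared kernel, this uses $p\times q$ times as many weights per feature map, but each kernel can specialise to the texture of its own face region.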

4.2 Key frame generation algorithm for image fusion and expression mapping

Expressions are conveyed through facial deformation and texture changes. An influential expression classification standard divides expressions into neutral, sad, happy, surprised, angry, and fearful. In practice, we divide expressions into neutral, sad, happy, surprised, and angry. Expression recognition can therefore be seen as the process of classifying face images according to these labels. We use a sample-based generation method: images of the basic face shape with well-defined expressions are pre-recorded, together with images of the corresponding expressions after the face is deformed [5]. The face pose is expressed as $g=\left({r,s,t_{x},t_{y}}\right)$ . According to different scholars’ research, Chinese has about 48–50 phonemes, but many of them look similar when pronounced. This paper merges similar mouth shapes to obtain 20 basic mouth shapes. For convenience of calculation, each mouth shape is still named after a corresponding phoneme; see Table 1.

Table 1
Phoneme-visual position correspondence table

Visual position	Phoneme	Visual position	Phoneme
silence	silence	an	an
b	b, p, m	ai	ai
f	f	ao	ao
d	d, t, n, j, q, x, i, y	o	o, ong
z	z, c, s	ou	ou
l	l	er	er
g	g, k, h, e	u	u, w, v
ei	ei, en, eng	ing	ing, in
zh	zh, ch, sh, r	iu	iu
a	a, ang	ui	ui, un

In summary, we reduce the three major factors of face deformation (expression, posture and mouth shape) to two types: expression and mouth shape. For each person, the key frame library must include a face image for every combination of expression and mouth shape. A single person therefore needs at least 20 $\times$ 5 $=$ 100 key frames, and the number increases if typical posture changes and blinking (closed eyes) are also considered. In our experience, the total number of key frames per speaker exceeds 300.

In order to obtain face images with vivid expressions, we collect a small number of samples and then use the key frame image synthesis algorithm to build the key frame library. This approach has two outstanding advantages:

Firstly, feature-based image synthesis renders the expression and mouth features vividly, so the synthesized face images are realistic and credible. Secondly, the approach is not limited by the original data. It can handle cases that are difficult to cover with original images, especially the synthesis of avatars or new characters that do not exist in the original image library.

We regard expression and pose as two separate features of the human face, and we establish an expression-pose matrix for each character to be processed. Each element of the matrix holds the key frame image, together with its auxiliary data, for the corresponding combination of expression and pose. Our task is to synthesize the unknown elements of this matrix from a small number of known elements.

5. Face animation generation

5.1 Voice and video data

The speech training and test corpora all come from the Chinese Academy of Sciences’ speech database. Currently, Mandarin speech data from one female speaker are used. The speech data are divided into five categories: natural, angry, happy, sad and surprised. Each category includes the same sentences. These corpora are used as training and test data for phoneme and emotion recognition. In the end, each sentence is segmented into phonemes, and each phoneme and its emotion tag are identified. We recorded the video data ourselves, with a female model as the recording subject. The recordings include isolated phonemes, various expressions without speech, combinations of poses and expressions, and videos of whole sentences [6].

5.2 Automatic selection of keyframes based on phonemes and emotional tags

We use phonemes and emotion tags as input parameters to guide the assembly of face animations. Each phoneme is described by a triplet $P_{i}=\left\{{p_{i},e_{i},t_{i}}\right\}$ , where $p_{i}$ is the phoneme tag, $e_{i}$ is the emotion tag of the utterance, and $t_{i}$ is the time of the phoneme within the whole sentence. We map each phoneme to its mouth shape, and then combine the mouth shape with the emotion tag to map directly to the corresponding key frame in the key frame library.
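The key frame lookup described above can be sketched as a pair of table lookups. The phoneme groups are transcribed from Table 1 and the five emotion labels from Section 4.2; the key format `(mouth shape, emotion)`, the use of seconds for $t_{i}$ , the example sentence, and the name `select_keyframes` are assumptions made for illustration.

```python
# Table 1 as a mapping from each mouth shape to the phonemes it covers.
MOUTH_SHAPE_PHONEMES = {
    "silence": ["silence"], "b": ["b", "p", "m"], "f": ["f"],
    "d": ["d", "t", "n", "j", "q", "x", "i", "y"], "z": ["z", "c", "s"],
    "l": ["l"], "g": ["g", "k", "h", "e"], "ei": ["ei", "en", "eng"],
    "zh": ["zh", "ch", "sh", "r"], "a": ["a", "ang"], "an": ["an"],
    "ai": ["ai"], "ao": ["ao"], "o": ["o", "ong"], "ou": ["ou"],
    "er": ["er"], "u": ["u", "w", "v"], "ing": ["ing", "in"],
    "iu": ["iu"], "ui": ["ui", "un"],
}
EMOTIONS = ("neutral", "sad", "happy", "surprised", "angry")

# Inverted table: phoneme -> mouth shape.
PHONEME_TO_MOUTH_SHAPE = {
    phoneme: shape
    for shape, phonemes in MOUTH_SHAPE_PHONEMES.items()
    for phoneme in phonemes
}


def select_keyframes(triplets):
    """Map (phoneme, emotion, time) triplets to (time, (mouth shape, emotion)) keys."""
    keys = []
    for phoneme, emotion, time_s in triplets:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion tag: {emotion}")
        keys.append((time_s, (PHONEME_TO_MOUTH_SHAPE[phoneme], emotion)))
    return keys


# Hypothetical input: the syllables "ni hao" spoken happily, times in seconds.
sentence = [("n", "happy", 0.00), ("i", "happy", 0.12),
            ("h", "happy", 0.30), ("ao", "happy", 0.41)]
keys = select_keyframes(sentence)
library_size = len(MOUTH_SHAPE_PHONEMES) * len(EMOTIONS)  # 20 x 5 combinations
```

Each returned key identifies one cell of the key frame library, so the lookup cost per phoneme is constant regardless of the library size.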

5.3 Inserting frames and generating animations

As shown in Fig. 3, once the key frame data are available, we can insert intermediate frames to generate a cartoon animation. The number of intermediate frames to insert, and their parameters, are determined by the time stamps of the key frames and the frame rate of the video. We continue to use $P_{i}$ to denote the key frames before and after the inserted frames.
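The relationship between key frame time stamps, frame rate and the number of inserted frames can be sketched as follows. Linear cross-dissolve blending is used here purely as a stand-in, since the text does not specify the interpolation of the intermediate frame generation algorithm; the names `intermediate_frames` and `blend` and the grey-level image representation are hypothetical.

```python
def intermediate_frames(t_prev, t_next, fps):
    """Return (time, weight) pairs for the frames strictly between two key frames.

    The weight is 0 at the earlier key frame and 1 at the later one.
    """
    steps = max(round((t_next - t_prev) * fps), 1)
    return [(t_prev + (t_next - t_prev) * k / steps, k / steps)
            for k in range(1, steps)]


def blend(frame_a, frame_b, weight):
    """Linear cross-dissolve of two equally sized grey-level images (assumed)."""
    return [[(1 - weight) * a + weight * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


# Two key frames 0.2 s apart at 25 frames per second leave room for 4 frames.
frames = intermediate_frames(0.0, 0.2, 25)
halfway = blend([[0.0, 10.0]], [[10.0, 0.0]], 0.5)
```

Key frames that are closer together than one frame period yield no intermediate frames, so the key frames themselves are shown back to back.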

Figure 3. Cartoon style keyframe generation.

6. Conclusion

Each year, researchers in face recognition and reconstruction, face animation generation, computer vision, human-computer interaction and computer animation publish a large number of excellent papers on these issues. This paper proposes a method for generating face animation key frames based on feature point data. Starting from a small number of known face key frame images and feature point data, the algorithm performs expression mapping and image fusion and generates the remaining key frame images to obtain a complete key frame matrix. Illumination changes are compensated automatically during generation, and local consistency corrections are made. The paper also applies the expressive face animation generation algorithm to speech-driven face animation and obtains satisfactory results. In addition, the algorithm can synthesize sample-based key frame images of cartoon faces, without being limited by style, and then generate cartoon face animation. We believe that more intelligent and lifelike face animation will appear in the near future.

References

[1] Jiang, Zhao, Sahli and Zhang, Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features, Multimedia Tools and Applications 73(1) (2014), 397–415.

[2] Luo and Wang, Synthesizing performance-driven facial animation, Acta Automatica Sinica 40(10) (2014), 2245–2252.

[3] Gunanto S.G., Hariadi and Yuniarno E.M., Improved 3D face feature-point nearest neighbor clustering using orthogonal face map, Advanced Science Letters 22(8) (2016), 1882–1886.

[4] Yanchao, Yanming, Jiguang and Hu, Real time 3D facial movement tracking using a monocular camera, Sensors 16(8) (2016), 1157–1159.

[5] Sheu J.S., Hsieh T.S. and Shou H.N., Automatic generation of facial expression using triangular geometric deformation, Journal of Applied Research and Technology 12(6) (2014), 1115–1130.

[6] Patil R.A., Sahula and Mandal A.S., Features classification using geometrical deformation feature vector of support vector machine and active appearance algorithm for automatic facial expression recognition, Machine Vision and Applications 25(3) (2014), 747–761.