Abstract
As the basis of intelligent breeding management and animal husbandry insurance, the identification of individual cattle is important in animal husbandry management. Because cattle faces are non-rigid and cattle rarely cooperate, data acquisition is difficult; this study therefore proposes a method for cattle face image acquisition and processing that adapts to the harsh environment of cattle barns. The non-rigid cattle face is approximated as a rigid body, and a cattle face image data set is established on this basis. The three-dimensional (3D) reconstruction of the cattle face is based on multiple views. First, the scale-invariant feature transform (SIFT) algorithm is used to extract image feature points, the fast library for approximate nearest neighbors (FLANN) algorithm is used to match them, and the matches are filtered via random sample consensus (RANSAC). Second, the structure from motion (SFM) method is used for sparse point cloud reconstruction, and a dense point cloud is then generated using a multi-view stereo (MVS) algorithm. Finally, the Poisson surface reconstruction method is used for surface reconstruction. The results indicate that this method can effectively realize the 3D reconstruction of cattle faces; the reconstructed models have distinct color, clear texture, and complete shape features.
Introduction
The cattle face has rich skin texture information and distinctive facial features, and facial images are considered one of the main biometric characteristics for identifying individual cattle. Cattle facial features are universal, distinctive, and persistent. Sets of low-level facial features (such as pixel intensities) can be used to identify cattle. 3D reconstruction of the cattle face provides an effective means of representing and processing cattle face information for 3D recognition. Traditional 3D reconstruction methods such as laser scanning are expensive, cannot capture the texture of the target, and are easily affected by the environment [1]. In contrast, 3D reconstruction based on image feature points can recover the 3D information of a scene from images taken at different angles alone. This approach is less expensive, is applicable to most scenarios, and can produce accurate and realistic models, which gives it high research value [2].
In recent years, researchers at home and abroad have done a lot of research on 3D reconstruction, and achieved remarkable results. In 2016, Csurka and Gabriella proposed a geometric reconstruction structure based on multi-view point-line algebraic relations [3]. The principle is based on the linear mapping between the two-dimensional image and the three-dimensional spatial perspective, so as to further improve the traditional algorithm based on perspective projection [4]. Due to its fast processing speed, this structure has been widely used in the initial creation of non-textured 3D graphics and has been fully recognized by the industry. Zhang J et al. proposed that polygon layout is a convenient intermediate model for performing other visual tasks [5]. The calculation of the motion strategy, the matching requirements for reliable self-positioning of the robot, and the physical limitations of the sensors (range, incidence) are considered to effectively establish the polygon layout of the indoor environment using the mobile robot equipped with distance sensors. Li Li et al. performed linear quadratic shape restoration and globally recovered quadric surfaces from closed contours [6]. It is proved that two images of a quadric are not enough to reconstruct a quadric. Three images with different views are the only necessary and sufficient way to recover the quadric surface. It is also proved that the contours in the same quadric image correspond to each other through invariants [7].
In 2017, Fitzgibbon A et al. described a fully automatic method for recovering the 3D scene structure and the camera for each frame from image sequences acquired by an unknown camera undergoing unknown motion; the method can be applied to a wide range of scenes and motions [8].
In 2019, Han Z et al. proposed MAP-VAE, which learns global and local geometry by jointly using global and local self-supervision [9]. Unsupervised feature learning is very important for understanding large-scale point clouds, and effective local self-supervision is achieved by introducing multi-angle analysis of point clouds. Xie H et al. proposed Pix2Vox, a framework for single-view and multi-view 3D reconstruction that generates a coarse 3D volume from each input image using an encoder-decoder [10]. A context-aware fusion module then adaptively selects high-quality reconstructions for each part (such as table legs) from the different coarse volumes to obtain a fused 3D volume. Finally, a refiner further refines the fused volume to produce the final output. Qi et al. proposed VoteNet [11], an end-to-end 3D object detection network that combines deep point set networks with Hough voting. It aims at a 3D detection pipeline for point cloud data that is as generic as possible, and it addresses the problem that the centroid of a 3D object can be far from any surface point and is therefore difficult to regress accurately in one step. Gabeur et al. proposed a nonparametric method that represents the 3D shape of a human with depth maps; the visible and hidden depth maps are estimated and combined to reconstruct the 3D human shape, which addresses 3D human shape estimation from a single RGB image [12]. Kato H et al. trained a discriminator to learn prior knowledge of plausible views, so that a shape reconstructed from only a few observed viewpoints looks reasonable from any angle. The discriminator is trained to distinguish reconstructions rendered from observed viewpoints from those rendered from unobserved viewpoints [13], and the reconstructor is trained to correct unobserved views by fooling the discriminator. Yu Z et al. proposed a two-stage method based on associative embedding for piecewise planar 3D reconstruction from a single image, building on the success of this technique in instance segmentation [14]. The method can detect an arbitrary number of planes and removes the limitation that only a fixed number of planes can be detected in a certain learned order.
At present, 3D reconstruction of animals is still at an early stage, both domestically and internationally, and research in this field is not yet mature. To protect the animals, hardware that may affect animal health, such as 3D scanners [15], cannot be used extensively. Accurate static 3D models of animals are therefore difficult to obtain. Human subjects can cooperate closely during an experiment [16], whereas animals cooperate very little. Consequently, few reconstruction methods are currently applicable to all animals.
In 2000, Bregler et al. demonstrated for the first time the use of Non-Rigid Structure from Motion (NRSFM) to estimate the shape deformation pattern of giraffe’s neck from video sequences [17], but the reconstruction effect was not good.
In 2013, Cashman and Fitzgibbon proposed to learn the deformable animal model from several images [18], but the 3D deformable model did not have surface texture.
In 2013, Vicente and Agapito proposed a closed-surface reconstruction method to recover the complete shape of deformable 3D objects [19]. A template was obtained from a reference image and then deformed, but the final shape was still rough.
Inspired by the Skinned Multi-Person Linear (SMPL) model of the human body, Silvia Zuffi et al. modeled the articulated structure of animals in 2017, captured the variation in animal shape and pose, and expressed the model in parametric form [20]. Shape and pose parameters are abstracted from the animal model, and a realistic 3D model, called the SMAL model, is generated by varying these parameters [21].
In 2021, Fu et al. simplified the slender body into an elastic rod, and applied the skeleton optimization method to insert continuous body shapes between the end constraints imposed by the tracking markers, thus realizing 3D reconstruction of animals [22].
To avoid affecting the health of the cattle, this paper uses an image-based 3D reconstruction method to reconstruct the cattle face and visualize the result. To address the non-rigidity of the cattle face, the lack of cooperation from cattle, and the resulting difficulty of image acquisition, an approximately rigid multi-view image acquisition method for cattle faces is designed. The work is divided into five steps: image acquisition and data set establishment, camera calibration, sparse point cloud reconstruction, dense point cloud reconstruction, and surface reconstruction. First, the method of cattle face image acquisition and data set establishment is introduced in detail. Then, the Structure from Motion (SFM) method is used for camera calibration and sparse point cloud reconstruction, after which the dense point cloud is reconstructed. From these point clouds, the surface mesh of the cattle face is reconstructed with the Poisson surface reconstruction method. Finally, the image texture is projected onto the surface mesh to complete the surface reconstruction of the cattle face.
Image acquisition and data set establishment
The cattle face moves continuously, and image data of a non-rigid object are difficult to obtain. To address this, an approximately rigid image acquisition method for cattle faces is designed. Multiple cameras are used to collect images of the cattle face from different angles. During collection, the cattle face and the cameras move relative to each other, so images acquired at different moments cannot be used to build a data set for accurate 3D reconstruction. At a single moment, however, the cattle face captured from different angles is stationary relative to the cameras and can be regarded as approximately rigid. The 3D reconstruction of the cattle face can then be carried out with a rigid 3D reconstruction method, which reduces its difficulty.
Four mobile phones of the same model are used to photograph the cattle face, which reduces the difficulty of camera calibration in the subsequent process. The four cameras are placed at intervals to capture the cattle face from four different angles. If the cameras are too far apart, the overlap between images decreases and more feature points are lost. Because cattle rarely cooperate with shooting and become less cooperative when approached closely, the designated cattle cannot be photographed directly at close range, so the images are taken from a distance. Forage is placed in the shooting area to guide the cattle to stay there. In practice, it is difficult to capture still images of the cattle face from different angles at exactly the same time, so the data are collected by recording video. The recorded videos are aligned so that their first frames correspond to the same moment, which ensures that the frames with the same index in the four videos show the cattle face at the same time. Four cattle face images taken from different angles at the same time are thus obtained to establish the data set required for 3D reconstruction, as shown in Fig. 1.
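The alignment of the four recordings can be sketched as follows. This is a minimal illustration only, assuming that the start time of each video on a shared clock and a common frame rate are known; the function name `aligned_frame_indices` and all of its parameters are hypothetical and are not part of the acquisition procedure described above.

```python
def aligned_frame_indices(start_times, fps, t):
    """Return, for each video, the index of the frame captured at time t.

    start_times: start of each recording in seconds on a shared clock
                 (assumed to be known, e.g. from a common visual cue).
    fps:         frame rate shared by all recordings (identical phones assumed).
    t:           moment on the same clock at which all views are sampled.
    """
    if t < max(start_times):
        raise ValueError("t precedes the start of at least one recording")
    return [round((t - start) * fps) for start in start_times]


# Hypothetical start offsets of the four phones, in seconds.
starts = [0.0, 0.5, 0.2, 1.0]
print(aligned_frame_indices(starts, fps=30, t=1.0))
```

Trimming each video to the index returned for the latest start time makes the first frames coincide, after which equal frame indices correspond to the same moment.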

Cattle face image dataset.
When feature points are extracted from an image, factors such as illumination and resolution interfere with the extraction. The scale-invariant feature transform (SIFT) algorithm adapts well to such changes when detecting and describing local image features. Its features also remain stable under the scaling, rotation, or affine transformation that occurs between images. SIFT reduces the extraction failures caused by noise or occlusion. It applies the difference-of-Gaussian operator to the image so that feature extraction is independent of the size of the feature, and it uses the direction of the local gradient to determine the orientation of each feature. SIFT can still detect the features of the target when the background is cluttered and the image is partially occluded, and it is widely used in computer vision.
The SIFT algorithm proposed by David G. Lowe is used for feature point extraction. SIFT is a local feature extraction scheme that is scale invariant and still achieves relatively good detection results when the image rotation, image brightness, or shooting angle changes. Each key point obtained by SIFT carries three pieces of information: position, scale, and orientation. The position is the coordinate of the key point, the scale reflects the difference between a close view and a distant view, and the orientation is the dominant direction, which differs when the object is photographed from different angles. The flow chart of the SIFT algorithm is shown in Fig. 2. The process is as follows: (1) Scale-space extremum detection: search the scale space for a set of potential key points (local extrema of the difference-of-Gaussian images) to ensure scale invariance; (2) Key point localization: eliminate unstable points (such as edge points) from the potential key points to obtain the final set of key points, and refine their positions to sub-pixel accuracy; (3) Orientation assignment: determine the dominant direction of the local region around each key point from the peak of the gradient orientation histogram to achieve rotation invariance. Together, the scale, position, and orientation make SIFT key points invariant to similarity transformations; (4) Feature descriptor generation: accumulate gradient orientation histograms over the local region centered on each key point to obtain the SIFT feature descriptor. Feature points are extracted from the four cattle face images taken from different angles, shown in Fig. 1. The results are listed in Table 1, and Fig. 3 shows the extracted feature points.
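Step (3), orientation assignment, can be illustrated with a small sketch. It is a deliberate simplification and not the full SIFT procedure: it omits the Gaussian weighting and the peak interpolation of the original algorithm, and the function name `dominant_orientation` and the 36-bin default are assumptions made for this example.

```python
import math


def dominant_orientation(patch, bins=36):
    """Return the center (in degrees) of the peak bin of a gradient
    orientation histogram computed over a grayscale patch.

    patch: list of rows of intensities around a key point.
    Angles are measured in image coordinates (x to the right, y downward).
    """
    bin_width = 360.0 / bins
    hist = [0.0] * bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the local gradient.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % bins] += magnitude
    peak = max(range(bins), key=hist.__getitem__)
    return (peak + 0.5) * bin_width


# Intensity increases equally along x and y, so the gradient points at 45 degrees.
diagonal = [[x + y for x in range(5)] for y in range(5)]
print(dominant_orientation(diagonal))
```

Rotating the descriptor window by this angle before the descriptor is built is what makes the resulting feature independent of image rotation.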

SIFT algorithm feature extraction flowchart.

(a) Feature point extraction of cattle face image a and (b) Feature point extraction of cattle face image b.
Feature point extraction result of cattle face image
The extracted feature points are matched using the FLANN algorithm, and the matching result is shown in Fig. 4. FLANN uses a k-dimensional tree (KD-Tree) as its overall structure and an approximate nearest neighbor (ANN) search to find results. It converts the search for points in the descriptor space into a search in a KD-Tree, which reduces the amount of computation and speeds up the search. The implementation steps are as follows: (1) Construct a KD-Tree from the feature descriptors of the given image; (2) Using the descriptor dimensions as the index, search the tree for the nearest neighbors of each feature point; (3) Compute the distances from the feature point to its nearest neighbor and to its second-nearest neighbor, and compare the two. If the ratio of the nearest distance to the second-nearest distance is below a threshold given in advance, the match is considered valid; otherwise, it is rejected.
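The acceptance rule in step (3) can be sketched in a few lines. Note the substitution: for brevity this example finds the two nearest neighbors by exhaustive search rather than with the approximate KD-Tree search that FLANN performs, so it illustrates only the nearest/second-nearest comparison. The function name `ratio_test_matches` and the threshold value 0.7 are assumptions, not values taken from the experiments.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.7):
    """Match descriptors of image A to image B with a nearest/second-nearest test.

    desc_a, desc_b: lists of equal-length numeric descriptor vectors.
    Returns a list of (index_in_a, index_in_b) pairs that pass the test.
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Exhaustive search stands in for the approximate KD-Tree lookup.
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) < 2:
            continue
        (nearest, j_nearest), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best candidate is clearly better
        # than the runner-up.
        if nearest < ratio * second:
            matches.append((i, j_nearest))
    return matches


a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
b = [(0.1, 0.0), (10.0, 10.2), (4.0, 5.0), (6.0, 5.0)]
print(ratio_test_matches(a, b))
```

The third descriptor in `a` is equally close to two candidates in `b`, so it is rejected as ambiguous, which is the purpose of the test.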

Feature point matching map of the cattle face image.
The RANSAC algorithm is used to filter the matching results and remove mismatches. RANSAC is a robust estimation algorithm [25] that iteratively estimates model parameters from samples containing outliers. It is non-deterministic and returns a correct result only with a certain probability, which increases with the number of iterations. The algorithm can be roughly divided into four steps: random sampling, parameter calculation, parameter evaluation, and selection of the optimal solution. That is, a few points are selected at random, the model parameters are calculated from them, and the remaining points are substituted into the model for verification. If the deviations of enough points fall within a given range, the model estimated from the selected sample is accepted; otherwise, the procedure is repeated. The results are listed in Table 2, and Fig. 4 shows the result of feature point matching.
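The four steps can be sketched on a toy problem. As a stand-in for the geometric model used when filtering image matches, this example assumes the simplest possible model, a pure 2D translation between matched points, so that one match is enough to hypothesize the model. The function name `ransac_translation`, the threshold, and the iteration count are illustrative assumptions.

```python
import random


def ransac_translation(pairs, threshold=2.0, iterations=100, seed=0):
    """Estimate a 2D translation from point matches that contain outliers.

    pairs: list of ((x1, y1), (x2, y2)) matched points.
    Returns the refined translation and the list of inlier matches.
    """
    if not pairs:
        raise ValueError("no matches to filter")
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        # 1. Random sampling: one match defines a candidate translation.
        (x1, y1), (x2, y2) = rng.choice(pairs)
        # 2. Parameter calculation.
        tx, ty = x2 - x1, y2 - y1
        # 3. Parameter evaluation: collect matches that agree with the model.
        inliers = [
            (p, q) for p, q in pairs
            if abs(q[0] - p[0] - tx) <= threshold
            and abs(q[1] - p[1] - ty) <= threshold
        ]
        # 4. Keep the model supported by the most matches.
        if len(inliers) > len(best):
            best = inliers
    tx = sum(q[0] - p[0] for p, q in best) / len(best)
    ty = sum(q[1] - p[1] for p, q in best) / len(best)
    return (tx, ty), best


sources = [(0, 0), (10, 4), (3, 8), (7, 7), (12, 1), (5, 5), (9, 2), (1, 6)]
good = [((x, y), (x + 5, y - 3)) for x, y in sources]
bad = [((0, 0), (50, 40)), ((1, 1), (-20, 7))]
model, inliers = ransac_translation(good + bad)
print(model, len(inliers))
```

In a real two-view setting the sampled model would be an epipolar or projective relation estimated from several matches at once; the sampling and consensus logic stays the same.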
Feature point matching result of the cattle face image
The pinhole model [26] is a camera model widely used in computer vision and is also called the linear model. According to the imaging principle of the pinhole model, the matrix $K$ is the internal parameter matrix of the camera, see formula (1):

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

The parameters of $K$ depend only on the internal structure of the camera. The internal parameters are $f_x$, $f_y$, $u_0$ and $v_0$. $f_x$ and $f_y$ are the focal lengths of the camera in pixels, where $f_x = f/dx$ and $f_y = f/dy$, with $f$ the physical focal length and $dx$, $dy$ the physical width and height of one pixel; $u_0$ and $v_0$ are the horizontal and vertical pixel offsets between the image origin and the image center point.
A real lens exhibits some degree of distortion and does not produce ideal perspective imaging. Radial distortion is caused by imperfections in the shape of the lens. Correction terms must therefore be introduced to improve the imaging model of the camera, which leads to various nonlinear imaging models. In general, radial distortion alone describes the nonlinear distortion sufficiently well. Introducing too many nonlinear parameters does not improve the accuracy of the model and may even make the solution of the model equations unstable. Therefore, a nonlinear model with only radial lens distortion is adopted, and the description formula is as follows:
Among them,
Under normal circumstances, the initial values of $f_x$ and $f_y$ are both set to $1.2\max\{w, h\}$, where $w$ and $h$ are the width and height of the image in pixels; $u_0$ and $v_0$ describe the optical center of the camera and are generally initialized to half of the image width and height, respectively. The initial values of the distortion coefficients $k_1$ and $k_2$ are set to zero. Once the internal parameters are obtained, they can be used to solve for the external parameters of the camera. The internal parameters of the camera are listed in Table 3, and the camera calibration results are shown in Fig. 5.
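The initialization described above is easy to make concrete. The sketch below also projects a point with a two-coefficient radial model; the scaling factor $1 + k_1 r^2 + k_2 r^4$ is the commonly used form and is an assumption here, since the exact expression adopted in this work is not reproduced in the text. The function names and the dictionary layout are hypothetical.

```python
def initial_intrinsics(width, height):
    """Initial camera parameters before optimization (image size in pixels)."""
    focal = 1.2 * max(width, height)
    return {
        "fx": focal, "fy": focal,            # initial focal lengths in pixels
        "u0": width / 2, "v0": height / 2,   # optical center at the image center
        "k1": 0.0, "k2": 0.0,                # no radial distortion initially
    }


def project(point, cam):
    """Project a 3D point given in camera coordinates to pixel coordinates.

    Uses the assumed radial model: scale = 1 + k1*r^2 + k2*r^4.
    """
    X, Y, Z = point
    x, y = X / Z, Y / Z                      # normalized image coordinates
    r2 = x * x + y * y
    scale = 1.0 + cam["k1"] * r2 + cam["k2"] * r2 * r2
    return (cam["fx"] * scale * x + cam["u0"],
            cam["fy"] * scale * y + cam["v0"])


cam = initial_intrinsics(1920, 1080)
print(cam)
print(project((0.0, 0.0, 2.0), cam))  # a point on the optical axis
```

With both coefficients at zero the model reduces to the linear pinhole projection, which is why zero is a natural starting value.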

Camera pose diagram.
Camera internal parameter calibration
The overall algorithm of 3D reconstruction of bovine face can be summarized as follows: feature point extraction and matching, camera calibration and sparse point cloud reconstruction, dense point cloud reconstruction and surface reconstruction.
Sparse point cloud reconstruction
Structure from Motion (SFM) [28] is used to reconstruct the sparse 3D point cloud of the cattle face. SFM reconstructs a sparse point cloud of a 3D model from two-dimensional images of the object taken from different angles. According to the way images are added during reconstruction, SFM can be divided into augmented SFM (also known as incremental SFM), global SFM and hybrid SFM. The advantage of augmented SFM is that it is more robust to wrong feature matches and has higher overall accuracy. Its disadvantages are a longer reconstruction time and the accumulation of drift as cameras are registered one by one. Global SFM, by contrast, improves reconstruction efficiency and avoids the accumulation of drift during registration, but it is less robust to incorrect matches, which are difficult to correct. Hybrid SFM is a compromise between the two in efficiency and accuracy. Because the reconstruction object is the cattle face and the data set is small, the efficiency advantage of global SFM is not obvious and drift accumulation is not significant, so augmented SFM is used to reconstruct the sparse point cloud. The flowchart is shown in Fig. 6.

Schematic diagram of the augmented SFM process.
The augmented SFM reconstruction proceeds as follows. First, the initial image pair is selected: the pair with the most matching points is taken as the initial pair, from which the initial camera motion and cattle facial structure are computed. The matched point pairs between the two images are triangulated to obtain the initial 3D point cloud of the cattle’s face, in which each reconstructed point corresponds to a part of the face. Then, new images are added iteratively. Every time a new image is added, its camera motion and the corresponding cattle face structure are obtained with the method of the previous step. Finally, the parameters are optimized using bundle adjustment (BA) [29]. When the same scene is photographed from multiple angles, a series of image feature points is obtained, and bundle adjustment refines the initial estimates of the cameras and the structure so that the spatial coordinates of the corresponding points can be calculated more accurately. There is a reprojection error between the projections of the 3D points obtained by SFM triangulation and their actual observed image positions. Bundle adjustment minimizes this reprojection error with a cost function, and the process continues until the complete image set is reconstructed, yielding the optimal 3D point cloud of the cattle face. With $m$ images, each containing $n$ feature points, the reprojection error is calculated as follows:

$$\min_{a_j,\, b_i} \sum_{i=1}^{n} \sum_{j=1}^{m} v_{ij}\, d\big(Q(a_j, b_i),\, x_{ij}\big)^2$$

Here, $x_{ij}$ is the observed image position of the $i$-th 3D point on the $j$-th image, and $v_{ij}$ is the mapping coefficient of the $i$-th 3D point on the $j$-th image: $v_{ij} = 1$ if the point is mapped onto the image and $v_{ij} = 0$ otherwise. $a_j$ and $b_i$ are the parameter vectors of camera $j$ and 3D point $i$, respectively. The function $Q(a_j, b_i)$ gives the projected coordinates of the point $b_i$ under the camera $a_j$, which is the predicted value. The function $d$ is the Euclidean distance between the observed image coordinates and the predicted image coordinates.
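A small numerical sketch may help to read the cost that bundle adjustment minimizes. It only evaluates the sum of squared reprojection distances and does not optimize it. The camera model is a strong simplification introduced for this example (a pinhole camera that is translated but not rotated, with parameters `(f, u0, v0, tx, ty, tz)`), and observations are stored only for visible point and image pairs, which plays the role of the 0/1 mapping coefficient.

```python
import math


def project(camera, point):
    """Predicted image position of a 3D point under a simplified camera.

    camera: (f, u0, v0, tx, ty, tz), an unrotated pinhole camera (assumption).
    point:  (X, Y, Z) in world coordinates.
    """
    f, u0, v0, tx, ty, tz = camera
    X, Y, Z = point[0] - tx, point[1] - ty, point[2] - tz
    return (f * X / Z + u0, f * Y / Z + v0)


def reprojection_cost(cameras, points, observations):
    """Sum of squared distances between predicted and observed positions.

    observations: {(i, j): (u, v)} for every point i visible in image j;
    pairs that are absent contribute nothing to the sum.
    """
    cost = 0.0
    for (i, j), observed in observations.items():
        predicted = project(cameras[j], points[i])
        cost += math.dist(predicted, observed) ** 2
    return cost


cameras = [(1000.0, 320.0, 240.0, 0.0, 0.0, 0.0),
           (1000.0, 320.0, 240.0, 1.0, 0.0, 0.0)]
points = [(0.0, 0.0, 5.0), (1.0, 1.0, 4.0)]
# Perfect observations give zero cost.
obs = {(i, j): project(cameras[j], points[i])
       for i in range(len(points)) for j in range(len(cameras))}
print(reprojection_cost(cameras, points, obs))
# Shifting one observation by (3, 4) pixels adds 3^2 + 4^2 to the cost.
u, v = obs[(0, 1)]
obs[(0, 1)] = (u + 3.0, v + 4.0)
print(reprojection_cost(cameras, points, obs))
```

A bundle adjustment solver would adjust the entries of `cameras` and `points` jointly to drive this value down, typically with a nonlinear least-squares method.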
Multi-view stereo (MVS) is the key technology in dense reconstruction. Its algorithms can be roughly divided into four categories: deformable polygonal mesh methods, which require an initial model to start from; voxel-based methods, which require a bounding box that contains the scene and whose accuracy is determined by the size of the voxel grid; patch-based methods, which reconstruct the surface as a set of small patches; and depth-map methods, which fuse multiple depth maps to obtain a single globally consistent model. MVS algorithms can also be classified according to the type of data processed: single objects, large scenes, complex environments, and so on. In this paper, a dense reconstruction method based on a patch model (patch-based MVS, PMVS) is used.
The PMVS method does not require structural information in advance and is very effective for general scene reconstruction. It has three main steps: (1) Feature matching: feature points are extracted with Harris corner detection and the difference-of-Gaussian (DoG) operator and matched across multiple views to obtain a set of sparse matched points. Given these initial matches, the following two steps are repeated (generally three times); (2) Diffusion (expansion): the initial matches are spread to neighboring regions of the sparse point cloud, using a method similar to that in the literature [13], to obtain a dense set of patches; (3) Filtering: global visibility constraints are used to eliminate incorrect patches that lie inside or outside the surface of the object or scene.
Surface reconstruction
The Poisson surface reconstruction algorithm combines the advantages of global and local fitting methods and transforms the surface reconstruction problem into the solution of a Poisson equation [31]. By solving for the surface that best fits the dense point cloud, a continuous and smooth 3D surface is obtained. The Poisson reconstruction algorithm mainly involves four stages: discretizing the global problem, creating a vector field, solving the Poisson problem, and extracting the isosurface. The process diagram is shown in Fig. 7. The Poisson reconstruction model uses the intrinsic relationship between the indicator function of the model and the oriented point set sampled on its surface to construct a Poisson equation, and then solves this equation to obtain the reconstructed model. Since the indicator function is constant almost everywhere, its gradient is zero except near the surface of the model. The oriented samples can therefore be regarded as samples of the gradient of the indicator function, and the reconstruction is obtained by extracting an appropriate isosurface. The calculation of the indicator function is thus transformed into finding the scalar function whose gradient best approximates the vector field defined by the samples.

Schematic diagram of Poisson surface reconstruction.
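In symbols, the relation between the indicator function and the oriented samples can be written as follows. The notation is an assumption made here, following common usage in the Poisson reconstruction literature: $\chi$ denotes the indicator function and $\vec{V}$ the vector field defined by the oriented samples.

```latex
\tilde{\chi} = \arg\min_{\chi} \left\| \nabla \chi - \vec{V} \right\|^{2}
\quad\Longrightarrow\quad
\Delta \chi \equiv \nabla \cdot \nabla \chi = \nabla \cdot \vec{V}
```

Applying the divergence operator to both fields turns the least-squares fit into a standard Poisson equation, which is solved before the isosurface is extracted.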
In the implementation of this method, the input 3D point set is stored in an octree, which is defined according to the locations of the sample points. The octree is then subdivided so that each sample point falls in a leaf node at depth D. The level of surface detail after reconstruction is therefore closely related to the depth D. Figure 8 shows the Poisson reconstruction of the dense point cloud at different octree depths. When the depth of the octree is 5, only the outline of the cattle’s face is restored; when the depth is 7, the mesh model basically restores the shape of the cattle’s face but lacks details; when the depth is 10 or 12, both the shape and the details of the cattle’s face are restored effectively. After extensive experiments, considering both efficiency and mesh detail, the Poisson reconstruction with an octree depth of 10 was chosen as the 3D reconstruction model of the cattle face.
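The link between octree depth and attainable detail can be illustrated numerically. This is a minimal sketch, assuming the point cloud is enclosed in a cube of side `extent` (a hypothetical parameter) that is halved along each axis at every level of the octree; the function name is likewise an assumption.

```python
def octree_resolution(depth, extent=1.0):
    """Return (cells per axis, leaf cell size) for an octree of the given depth."""
    cells = 2 ** depth          # each level halves the cell size along every axis
    return cells, extent / cells


for depth in (5, 7, 10, 12):
    cells, size = octree_resolution(depth)
    print(f"depth {depth:2d}: {cells:4d} cells per axis, "
          f"leaf size {size:.6f} of the bounding cube")
```

Each additional level doubles the resolution along every axis, so the number of potential leaf cells grows by a factor of eight per level. This is consistent with the trade-off between mesh detail and efficiency noted for depths 10 and 12.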

Poisson reconstructed surface with different octree depths.
The experiments were run on the Windows 10 operating system with an AMD Ryzen 5 2500U processor with Radeon Vega Mobile Gfx at 2.00 GHz. In the experiment, a real cattle face was photographed from different angles at the same time, and these photographs were used as the input images.
Figure 9(a) shows the reconstruction result of the sparse point cloud of the cattle face. These points were composed of SIFT feature points after screening. The sparse point cloud only gave a part of the 3D lattice of the cattle’s face and reconstructed part of the contour. Figure 9(b) is the result of the dense point cloud reconstruction of the cattle’s face. The dense point cloud reconstructed the 3D surface of the cattle’s face basically, but the 3D surface was very rough and had obvious holes. Figure 10(a) and 10(b) are the Poisson surface reconstruction results of the cattle’s face. The reconstruction points were dense and the surface was smooth. It could be seen from the 3D reconstruction result that the 3D reconstruction of the cattle’s face was realized, which provided the feasibility for the subsequent 3D recognition of the cattle’s face.

(a) Sparse point cloud on cattle face and (b) dense point cloud on cattle face.

(a) Front view of cattle facial surface reconstruction and (b) side view of cattle facial surface reconstruction.
It can be seen from the figures that the reconstructed 3D model basically restores the shape of the real cattle face and gives a good visual result. The 3D model has complete shape features, distinct color features, and clear texture features. Through camera calibration and incremental SFM, the sparse point cloud of the cattle face is reconstructed, in which the approximate contour of the face can be seen. The dense point cloud is then reconstructed using the clustering-based and patch-based method and serves as the input of the subsequent surface reconstruction. Finally, through surface reconstruction and texture mapping, a 3D model that is consistent with the actual cattle face is obtained, which makes subsequent 3D recognition of cattle faces feasible.
In this paper, an approximately rigid method of cattle face image acquisition is designed for the complex environment in which cattle face data are collected. Because the cattle face videos contain a large amount of information, an improved feature-fusion K-means clustering method is used to extract key frames from the videos. For feature extraction and matching on the cattle face data set, a FLANN matching algorithm that fuses SIFT and ORB features is used. Compared with SIFT and ORB features alone, the number of matching points increased by 50.31% and 23.74%, respectively. The AC-RANSAC algorithm is then introduced to screen out accurate matching pairs; the mismatch rate is 12%, which effectively improves the accuracy of feature point matching. After that, the 3D reconstruction of the cattle face was studied. Through surface reconstruction and texture mapping, a 3D model consistent with the actual cattle face and with a good visual effect was obtained. After testing, the real-time performance of the code could meet the needs of practical applications, which makes subsequent 3D recognition of cattle faces feasible.
Deviations were observed between the reconstructed and real cattle faces, for two reasons. First, the resolution of the original images is low. Considering cost and the impact on the health of the cattle, the pictures were taken from a distance, which lowered the image resolution and affected accuracy. Second, the image set for 3D reconstruction consists of cattle face images taken simultaneously from different angles, so treating the cattle face as approximately rigid is an idealization. Direct non-rigid 3D reconstruction of the cattle face can therefore be considered in subsequent studies.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was funded by the Natural Science Foundation of Inner Mongolia Autonomous Region under Grant 2020MS06015 and Grant 2021MS06014, and in part by the National Natural Science Foundation of China under Grant 61966026. The authors thank the funding organizations for their financial support.
