Abstract
This paper presents an augmented reality (AR) framework for constructing architectural scenes, which helps designers analyze and evaluate a design scheme at the early stage of architectural design. First, we used Revit to construct the 3D architectural model. We then extracted scene feature points with the AKAZE algorithm and applied a visual vocabulary tree to identify the architectural scene. Finally, monocular SLAM was used to register the virtual 3D building model to the real outdoor scene, and an AR-based architectural scene construction system was implemented. The experimental results show that the proposed framework offers realistic immersion and is easy to disseminate, and that it can help designers evaluate an architectural design scheme more reasonably.
Introduction
In Computer Aided Architectural Design (CAAD), many designers use AutoCAD to complete their design schemes. AutoCAD is 2D CAD software, which limits the spatial imagination and creativity of designers and seriously hinders the advancement of architectural design [1]. Augmented reality (AR) is a computer application and human-computer interaction technology developed on the basis of virtual reality (VR), and it can effectively address these problems [2].
AR technology has been extensively applied in architecture, engineering and construction (AEC). In 2012, Suganya collected video frames and processed them with the structure-from-motion (SFM) algorithm. SFM is an offline algorithm for 3D reconstruction from unordered image collections. First, the focal length is extracted from the pictures. Second, SIFT or another feature extraction algorithm is used to extract image features, and adjacent image pairs are matched. Third, the fundamental matrix between the two images is estimated, and then the intrinsic and extrinsic camera parameters and the 3D coordinates of the sparse feature points are estimated. Finally, a 3D model of the scene is obtained [3]. Schall proposed an AR system that generated views of the frame structure inside a building, helping utility field workers to perform outdoor tasks such as maintaining, planning and surveying underground infrastructure [4]. In 2010, Chen et al. designed a mobile AR system that reconstructed the ruins of Dashuifa in Beijing’s Old Summer Palace [5]. The system matched real-time frames against stored video key frames to identify the reconstructed ruin model. The overlay position of the virtual model was then determined by coordinate transformation, and finally the real scene and the virtual model were fused after rendering. Canciani et al. collected 3D data from the site of the Aurelian Wall and used AR to overlay virtual models on the real scene, so that the missing structures and colors of the wall were restored and the ancient ruin reappeared [6]. Zhong et al. put forward a visual simulation of the construction schedule for a core rock-fill dam [7]. This method mainly solved two problems. The first was that the virtual scene and the real scene were not in the same space. The second was that the terrain model occupied too many resources in traditional visual simulation. They solved the first problem with 3D registration of a virtual camera, and the second by overlaying virtual objects on the 3D scene information obtained from video monitoring.
In an AR system, the first step in recognizing an outdoor scene is feature extraction from the target image. The main feature extraction algorithms are SIFT, SURF, ORB and AKAZE [8]. The SIFT [9, 10] feature is a local image feature with good invariance to translation, rotation, scaling, brightness change, occlusion and noise. It also remains reasonably stable under viewpoint change and affine transformation. However, the SIFT algorithm has a high time complexity, which makes it unsuitable for scenes with strict real-time requirements. SURF [11] is not only more stable but also much faster than SIFT. Its computation time is about one-third of that of SIFT, and its robustness is not inferior. The ORB algorithm is more than 10 times faster than SURF but slightly less robust to rotation, blur and scaling.
In 2012, P. F. Alcantarilla proposed the KAZE algorithm [12], which used additive operator splitting (AOS) to solve the nonlinear diffusion equation, but the implementation was difficult. The accelerated version of KAZE (AKAZE) [13] used the fast explicit diffusion (FED) [14, 15] numerical framework to solve the nonlinear diffusion filter equation. Compared with AOS, FED greatly improves the accuracy and reduces the complexity of implementation.
In recent years, the bag-of-words model has been the mainstream method in image retrieval. It quantizes a large number of features into visual words, but the quantization consumes too much time. Nister proposed a method that uses hierarchical k-means clustering to generate a vocabulary tree of visual words [16]. The vocabulary tree is an efficient data structure for retrieving images based on a visual vocabulary, and it effectively solves the problem that quantization is too slow when words are searched non-hierarchically. The generated visual words are generally weighted by TF-IDF (term frequency-inverse document frequency). Philbin [17] mapped each SIFT feature of an image to several visual words at similar distances, so that the resulting visual histogram is more discriminative. Wengert [18] embedded color features into the inverted index of visual words and used color SIFT features to build the visual words, which effectively compensates for the lack of color information in the traditional vocabulary tree. However, this method needs to extract the color information of each SIFT feature, and its computational complexity is high.
Simultaneous Localization and Mapping (SLAM) technology is widely used in autonomous driving, mobile robots, UAVs, AR and other fields. Its advantages on mobile phones include inexpensive built-in sensors (camera, gyroscope, Wi-Fi, magnetometer), low power consumption and the availability of many wireless access points. These sensors can be used for indoor and outdoor positioning over wireless links, and Wi-Fi remains usable under occlusion and when there is no line of sight. SLAM is generally divided into two stages, inter-frame estimation and back-end optimization [19], a division first proposed and implemented in PTAM [20]. PTAM separates the front-end from the back-end so that feature point tracking and mapping run in parallel. The front-end tracking needs to respond to image data in real time, and mapping optimization is performed on the back-end. Mur-Artal et al. proposed the feature-based ORB-SLAM system, which runs in real time in indoor or outdoor, large or small scenes, and is highly robust [21].
To sum up, research on the application of AR to the display of architectural landscapes has achieved remarkable results. However, there has been no research on how to combine BIM (Building Information Modeling) with AR to help architects with architectural design. Through 3D digital simulation, BIM integrates the engineering information, construction progress, resource information, construction process and other simulations of all stages of the life cycle of a construction project into a single, complete model, so that a shared and convenient cooperation platform can be provided for all participants in the project.
In our work, we completed scene recognition based on the main ideas of references [8, 13, 16] and completed the 3D registration following the ideas of references [20, 21]. This paper studies how to combine BIM with AR for architectural landscape display. Its goal is to design an AR framework for outdoor architectural landscapes that provides 3D visual simulation for architects and engineering designers at the architectural design stage.
Research on building scene construction based on AR
Architectural scene construction based on AR consists of four parts, as shown in Fig. 1. In the first part, 3D building data is obtained from the Revit file of the building, a 3D solid model of the outdoor building is generated, and texture is mapped onto the model. In the second part, we use AKAZE and the vocabulary tree to recognize the outdoor building scene. In the third part, based on SLAM and mobile GPS, we complete the registration initialization, calculate the registration location and estimate the camera pose. In the fourth part, the virtual building model is rendered and fused with the real scene.
Four components of building scene construction based on AR.
There are many modelling software packages on the market, such as 3ds Max and SolidWorks. 3ds Max is commonly used in advertising, film and television, industrial design, 3D animation and other fields. It includes basic and advanced modelling tools. The basic tools can be used to model simple geometric objects such as spheres, cuboids and cones. The advanced tools can model irregular geometric objects found in real scenes, such as landscapes, flowers, plants and animals. 3ds Max can be easily extended through its rich library of plug-ins, and it has a very powerful material editing function. Combined with the rendering engine of Unreal Engine, it can render pictures and animations directly in real time.
SolidWorks plays a major role in industrial and mechanical design and has recently become a popular engineering modelling package. It is simple and easy to use. It can automatically produce engineering drawings, including views, dimensions and annotations, from 3D models, and it allows users to perform feature recognition and geometric simplification of entities, model error repair, redundant topology removal and so on.
Both 3ds Max and SolidWorks are among the most widely used 3D modelling packages. However, they cannot describe the geometric information, professional attributes and state information of building components, so they cannot be used in AEC. Revit is designed for BIM to help architects build and maintain buildings with better quality and energy efficiency. Revit generates and manages the relevant graphical and non-graphical engineering data, including the constructed 2D and 3D models, which is stored in relational databases.
In this paper, Revit is adopted for modelling, and the specific steps are:
1. Create the elevation. According to the elevation information, the elevation is created with the elevation tool after the annotation form of the elevation is selected.
2. Create the axis network. The axis network is made based on the existing design information. In addition, the axis network recognition tool can be used to automatically generate an axis network that is consistent with the CAD drawing.
3. Draw the primitives. The primitives are drawn in the different layers of the plan in a certain order. The drawing order needs to take the relative positions of the primitives into account, and drawing mainly involves defining the type and instance properties of each primitive.
Following the principles of architectural design and the modelling standards, the corresponding building information is then added. Finally, Revit builds a 3D model for the AR system. The 3D building model based on Revit is shown in Fig. 2.
3D building model based on Revit.
Traditional feature extraction algorithms rely on a linear Gaussian pyramid for multi-scale decomposition to eliminate noise and extract image feature points. Gaussian decomposition sacrifices local accuracy: it blurs boundaries and loses details, which seriously affects the robustness of the extracted features. CNN-based image processing has become a hot topic in recent years, but it has two disadvantages: a CNN is relatively complex to implement, and training takes a long time. Because this system has strict real-time requirements, a CNN is not used. The AKAZE algorithm takes less time to extract image features and its results are close to or better than those of SIFT and SURF in many aspects, so AKAZE is used.
In this work, we use AKAZE, which offers scale invariance, rotation invariance, illumination invariance and speed, to extract the features of the outdoor scene. Feature extraction with AKAZE has three steps. First, nonlinear diffusion filtering is used to build the nonlinear scale space. Then feature points are located in the nonlinear scale space using the determinant of the Hessian matrix. Finally, the main direction of each feature point is determined and the point is described with the M-LDB descriptor.
Feature extraction based on AKAZE algorithm
2.2.1.1. Nonlinear diffusion filtering
To avoid the boundary blurring and detail loss caused by Gaussian decomposition, AKAZE constructs the nonlinear scale space with a nonlinear diffusion filter, which preserves edges and thereby ensures the accuracy of the algorithm.
Nonlinear diffusion filtering describes the change of image brightness across different scales by means of a flow function. The nonlinear partial differential equation is expressed as Eq. (1).
In the above formula,
Where
The FED algorithm is used to solve Eq. (1) because it combines the advantages of explicit and semi-implicit. The core idea of FED is that: for N explicit diffusion processes, the m-step cycle is carried out by changing the step size
In the Eq. (3),
In the above formula,
2.2.1.2. Constructing a nonlinear scale space
In the scale space constructed by AKAZE, the resolution of each image layer is equal to that of the original image. The calculation formula for scale parameters of each layer is shown in Eq. (5).
In order to use the diffusion equation, AKAZE uses the conduction function to convert the discrete scale parameters into time units. The value of the contrast factor
The FED algorithm then uses the obtained evolution times, contrast parameter and time steps to construct the nonlinear scale space.
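The relation between scale levels, evolution times and FED step sizes can be sketched numerically. The snippet below is only an illustration under the conventions commonly used for KAZE/AKAZE and FED, which are assumptions here because the corresponding equations are not reproduced in this text: scale levels sigma_i = sigma_0 * 2^(o + s/S), evolution times t_i = sigma_i^2 / 2, and FED step sizes tau_j = tau_max / (2 cos^2(pi (2j + 1) / (4n + 2))). The base scale 1.6, the 4 octaves, the 4 sublevels and all function names are hypothetical choices.

```python
import math

def scale_levels(sigma0=1.6, octaves=4, sublevels=4):
    """Return sigma_i = sigma0 * 2**(o + s/S) for every octave o and sublevel s."""
    return [sigma0 * 2.0 ** (o + s / sublevels)
            for o in range(octaves) for s in range(sublevels)]

def evolution_times(sigmas):
    """Map each scale sigma_i (in pixels) to a diffusion time t_i = sigma_i**2 / 2."""
    return [0.5 * s * s for s in sigmas]

def fed_step_sizes(n, tau_max=0.25):
    """Varying step sizes of one FED cycle of length n.

    tau_max is the stability limit of the plain explicit scheme
    (0.25 for 2D diffusion on a unit grid).
    """
    return [tau_max / (2.0 * math.cos(math.pi * (2 * j + 1) / (4 * n + 2)) ** 2)
            for j in range(n)]

def fed_cycle_time(n, tau_max=0.25):
    """Total diffusion time covered by one FED cycle: tau_max * (n**2 + n) / 3."""
    return tau_max * (n * n + n) / 3.0

sigmas = scale_levels()
times = evolution_times(sigmas)
steps = fed_step_sizes(5)
```

Some of the individual steps in `steps` exceed `tau_max`, yet the cycle as a whole remains stable; this is what allows FED to cover the evolution time between two scale levels in far fewer iterations than a fixed-step explicit scheme.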
2.2.1.3. Feature point detection and description
AKAZE algorithm proposes an improved local difference binary descriptor, namely M-LDB descriptor. M-LDB divides the image into n
The determinant of the Hessian matrix is computed for each pixel in the scale space, and non-maximum suppression is applied to the responses. The AKAZE algorithm looks for local maxima of the Hessian response at the various scales to obtain stable interest points. The current scale of
The LDB descriptor needs to determine the principal direction from the gray values in the neighbourhood of the central pixel. The integral image is recalculated, which increases the computational complexity and time consumption. The M-LDB descriptor does not need to calculate the average value of all the pixels in the grid. But sampling in scale
The implementation of AKAZE algorithm is shown in Algorithm 1.
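As a complement to the detection step described above, the following minimal sketch shows how a determinant-of-Hessian response and non-maximum suppression locate a feature point. It is not Algorithm 1 and not the AKAZE implementation: it works on a single synthetic scale level, uses plain central finite differences, and omits the nonlinear scale space, sub-pixel refinement and the M-LDB descriptor. The blob image, the threshold and all function names are hypothetical.

```python
import math

def gaussian_blob(size=21, sigma=3.0):
    """Synthetic image: one Gaussian blob centred in a size x size grid."""
    c = size // 2
    return [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def hessian_determinant(img, sigma=1.0):
    """Scale-normalised det(Hessian) from central differences (border pixels stay 0)."""
    h, w = len(img), len(img[0])
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lxx = img[y][x + 1] - 2.0 * img[y][x] + img[y][x - 1]
            lyy = img[y + 1][x] - 2.0 * img[y][x] + img[y - 1][x]
            lxy = (img[y + 1][x + 1] - img[y + 1][x - 1]
                   - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
            resp[y][x] = sigma ** 2 * (lxx * lyy - lxy ** 2)
    return resp

def local_maxima(resp, threshold):
    """Non-maximum suppression: keep pixels above the threshold that beat all 8 neighbours."""
    h, w = len(resp), len(resp[0])
    points = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = resp[y][x]
            if v <= threshold:
                continue
            neighbours = [resp[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            if all(v > n for n in neighbours):
                points.append((x, y))
    return points

keypoints = local_maxima(hessian_determinant(gaussian_blob()), threshold=1e-4)
```

On this synthetic image the only detection is the blob centre at (10, 10). In a full detector the same test would also be run against the neighbouring scale levels.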
Feature number statistics of test set.
The accuracy and running time of the SIFT, SURF, ORB and AKAZE algorithms are tested in the experiment. The test set is the Mikolajczyk image library used by the author of the KAZE algorithm in his paper. It contains eight sets of images: bark, bikes, boat, graf, leuven, trees, ubc and wall. Each set contains six images with progressively stronger transformations, so the images differ markedly from one another. The test set can therefore, to a certain extent, effectively measure the performance and robustness of each algorithm under viewpoint change, rotation and scaling, Gaussian blur, illumination change, and compression and reconstruction.
First of all, SIFT, SURF, AKAZE, and ORB algorithms are used to extract feature points of the images of test set, and then the feature descriptors are calculated. The number of features in each image extracted by each algorithm is counted separately. The number of feature points extracted by each algorithm for each group of images is shown in Fig. 3.
In Fig. 3, the X-axis represents the six images of each group and the Y-axis represents the number of feature points extracted. Under blur, illumination change, viewpoint change, rotation, scaling and compression loss, SIFT extracts the most feature points, and SURF comes second. AKAZE extracts fewer feature points than ORB in the bark and leuven image sets, but in the other image sets it is close to SURF. ORB extracts feature points stably under all conditions, but it extracts the fewest in most image sets.
The average time taken by the different algorithms to extract and describe the features of the test set images is shown in Fig. 4.
Average time statistics for different algorithms.
SIFT takes the most time, about two orders of magnitude more than ORB, which takes the least. SURF takes about one-third of the time of SIFT. AKAZE takes several times as long as ORB.
Thus, under blur, illumination change, viewpoint change, rotation, scaling and compression loss, SIFT has the best effect but takes the most time. ORB takes the least time but has the worst effect. AKAZE takes relatively little time and has the same or better effect than SIFT and SURF in many aspects. Therefore, this paper selects AKAZE as one of the core algorithms of outdoor building scene recognition and tracking registration, and uses it to extract and describe the image feature points.
The traditional method is to extract the image features of the training set offline and store them. During real-time recognition, a frame of the real scene is first acquired, its features are extracted online, and they are then matched one by one against the image features of the training set. The image with the highest matching degree is returned as the recognition result. Because the features extracted from the real-time frame must be matched against all the features of the training set, this takes a long time. In this paper, the AKAZE algorithm is used for feature extraction on both the training set and the real-time frames. Image retrieval based on the vocabulary tree [16] replaces traditional image feature matching, and on this basis a technical framework for recognizing a large number of outdoor building scenes is constructed.
The vocabulary tree is an efficient data structure for retrieving images by visual words. In the bag-of-words model, too much time is spent quantizing a large number of features into visual words. Generating the visual vocabulary tree with hierarchical K-means clustering effectively solves this problem.
2.2.3.1. Construct a visual vocabulary tree
The eigenvector of all images in the training set constitute the eigenvector set
Hierarchical clustering is then applied to F. In this paper, the K-Means clustering algorithm is adopted and the branching factor K is set. First, the original feature set is clustered by K-Means at the first layer, and K clusters are obtained. Each feature vector is assigned to the cluster nearest to it. Following the same rule, K-Means clustering is applied recursively to each cluster until the depth of the tree reaches the pre-specified number of layers L. When the number of feature vectors in a new cluster is less than K, that cluster is not clustered any further. At this point, the total number of nodes in the whole vocabulary tree is given by Eq. (8).
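The construction described above can be sketched as follows. This is a minimal illustration rather than the system's implementation: it clusters 2D toy points instead of AKAZE descriptors, uses plain Lloyd K-means with a fixed number of iterations, and the names `kmeans`, `build_tree`, `count_nodes`, `leaf_sizes` and `max_nodes` are hypothetical. The node bound counts the root, which is an assumption because Eq. (8) is not reproduced in this text.

```python
import random

def kmeans(points, k, iters=10, rng=None):
    """Plain Lloyd K-means on tuples; returns the list of non-empty clusters."""
    rng = rng or random.Random(0)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[d.index(min(d))].append(p)
        # Recompute each centre as the mean of its cluster (keep it if the cluster is empty)
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centres)]
    return [cl for cl in clusters if cl]

def build_tree(points, k, depth, rng=None):
    """Cluster recursively until `depth` layers exist or a cluster has fewer than k points."""
    rng = rng or random.Random(0)
    centre = tuple(sum(col) / len(points) for col in zip(*points))
    node = {"centre": centre, "size": len(points), "children": []}
    if depth > 0 and len(points) >= k:
        clusters = kmeans(points, k, rng=rng)
        if len(clusters) > 1:  # stop if K-means could not split the data
            node["children"] = [build_tree(cl, k, depth - 1, rng) for cl in clusters]
    return node

def count_nodes(node):
    """Number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(c) for c in node["children"])

def leaf_sizes(node):
    """Number of feature vectors stored under each leaf (visual word)."""
    if not node["children"]:
        return [node["size"]]
    return [s for c in node["children"] for s in leaf_sizes(c)]

def max_nodes(k, depth):
    """Upper bound for a full tree, root included: sum of K**i for i = 0..L."""
    return (k ** (depth + 1) - 1) // (k - 1)

rng = random.Random(42)
features = [(rng.random(), rng.random()) for _ in range(200)]  # toy 2D "descriptors"
tree = build_tree(features, k=3, depth=3)
```

Quantizing a new feature then only requires K distance computations per layer, that is K·L in total, instead of one comparison per visual word.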
2.2.3.2. Scene identification of buildings
In the visual vocabulary tree, term frequency-inverse document frequency is used to distinguish the importance of each visual word.
Term Frequency (TF) is the Frequency of a given Term
Where
Another important parameter – Inverse Document Frequency (IDF) represents the importance of a given term
Where
The query image
Equation (11) converts the similarity measurement between images into a cumulative sum over the non-zero elements in the corresponding dimensions of the feature vectors, which speeds up the calculation. In the vocabulary tree, the similarity between two images is compared level by level from top to bottom.
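The TF-IDF weighting and the sparse similarity computation can be illustrated with a small sketch. It assumes the weighting and scoring commonly used with vocabulary trees, which may differ in detail from the equations of this paper: TF is the relative frequency of a visual word in an image, IDF is log(N / N_i), the vectors are L1-normalised, and the L1 distance is accumulated only over the words that are non-zero in both vectors. The image names and visual word ids are hypothetical.

```python
import math
from collections import Counter

def idf_weights(database):
    """IDF per visual word: log(N / N_i) for N images, N_i of which contain word i."""
    n_images = len(database)
    doc_freq = Counter(w for words in database.values() for w in set(words))
    return {w: math.log(n_images / df) for w, df in doc_freq.items()}

def tfidf_vector(words, idf):
    """L1-normalised sparse TF-IDF vector {word: weight} for one image."""
    counts = Counter(w for w in words if w in idf)
    total = sum(counts.values())
    vec = {w: (c / total) * idf[w] for w, c in counts.items() if idf[w] > 0}
    norm = sum(vec.values())
    return {w: v / norm for w, v in vec.items()} if norm else {}

def l1_distance(q, d):
    """L1 distance of two L1-normalised vectors, summed only over shared non-zero words."""
    shared = q.keys() & d.keys()
    return 2.0 + sum(abs(q[w] - d[w]) - q[w] - d[w] for w in shared)

def rank(query_words, database):
    """Return the database image names ordered from most to least similar."""
    idf = idf_weights(database)
    q = tfidf_vector(query_words, idf)
    scores = {name: l1_distance(q, tfidf_vector(words, idf))
              for name, words in database.items()}
    return sorted(scores, key=scores.get)

# Each image is represented by the ids of the visual words (leaf nodes) its features fall into
database = {
    "south_gate": [1, 1, 2, 3],
    "library": [4, 4, 5, 3],
    "dormitory": [6, 7, 7, 3],
}
ranking = rank([1, 2, 2, 3], database)
```

Word 3 occurs in every image, so its IDF is zero and it does not contribute to the score. Images that share no informative word with the query keep the maximum distance of 2 without any per-dimension work, which is the source of the speed-up.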
AKAZE feature comparison. a. Feature extraction at small scale; b. Feature extraction at large scale.
The AKAZE algorithm is used to extract features from two images, at different scales, of the doctoral building of Xi’an University of Architecture and Technology, as shown in Fig. 5. Figure 5a shows feature extraction at a small scale, and Fig. 5b shows feature extraction at a large scale.
We use AKAZE and the visual vocabulary tree to identify the south gate of Xi’an University of Architecture and Technology. The training set contains 300 images and the test set contains 30 images, taken from different angles, at different scales and under different illumination. The matching result set is successfully returned: the average recognition time is about 201.7 ms, and 28 of the 30 test images are correctly identified, a success rate of about 93.3%.
During monocular SLAM initialization, usually only the coordinates of 2D-2D matching point pairs and the camera calibration parameters are available. As shown in Fig. 6, the left picture is the target image, and the right picture is the frame acquired in real time.
Feature matching.
The external parameters of the camera need to be calculated according to the two 2D images in the Fig. 6. There is the feature point
The problem is transformed into a problem of solving
The plane of the epipolar constraint.
The constraint satisfies the condition shown in Eq. (14).
Any set of matching point pairs satisfies the Eq. (14) and
After solving for the essential matrix
Although the rotation and translation vectors are obtained, the depth information of the scene cannot be obtained because only 2D coordinate information is available. The obtained parameter information has scale uncertainty. This is an inevitable problem in the process of monocular SLAM initialization.
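The scale uncertainty can be made concrete with a small numerical check. The sketch below assumes the usual form of the epipolar constraint, x2^T E x1 = 0 with E = [t]x R, for normalised image coordinates; the rotation, translation and 3D point are hypothetical values chosen for illustration. It shows that scaling the scene and the baseline by the same factor leaves the image points, and therefore the constraint, unchanged.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def rotation_y(angle):
    """Rotation about the camera y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def normalised(p):
    """Project a 3D point in camera coordinates to normalised image coordinates (x, y, 1)."""
    return [p[0] / p[2], p[1] / p[2], 1.0]

def epipolar_residual(E, x1, x2):
    """x2^T E x1, which is zero for a correct match."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

R = rotation_y(math.radians(10.0))
t = [0.5, 0.0, 0.1]
E = matmul(skew(t), R)  # essential matrix E = [t]x R

P1 = [0.3, -0.2, 4.0]                              # point in the first camera frame
P2 = [a + b for a, b in zip(matvec(R, P1), t)]     # same point in the second frame
r_true = epipolar_residual(E, normalised(P1), normalised(P2))

# Scale the whole scene (point depth and baseline) by 3: the image points do not change
s = 3.0
P1s = [s * c for c in P1]
P2s = [a + s * b for a, b in zip(matvec(R, P1s), t)]
```

Because `normalised(P1s)` and `normalised(P2s)` coincide with the unscaled observations, the two images alone cannot distinguish a nearby small scene from a distant large one, so an external distance measurement is needed to fix the scale.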
In outdoor architectural scene construction, the accuracy required of tracking registration is not high, so the mobile phone GPS is used to obtain the approximate distance between the current position and the real building to compensate for the missing depth information. The 3D coordinates of the spatial feature points are then obtained and the initialization of tracking registration is completed. Finally, PnP pose estimation can be carried out based on the acquired 3D coordinates and the 2D image coordinates.
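One possible way to turn two GPS fixes into a metric scale is sketched below. It is an assumption-laden illustration, not the procedure of the system: the distance is computed with the haversine formula on a spherical Earth rather than with a map SDK, the up-to-scale points are rescaled so that their mean depth equals that distance, and the coordinates and function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude fixes in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def rescale_points(points, gps_distance_m):
    """Scale up-to-scale 3D points so that their mean depth equals the GPS distance."""
    mean_depth = sum(p[2] for p in points) / len(points)
    s = gps_distance_m / mean_depth
    return [(s * x, s * y, s * z) for x, y, z in points]

# Hypothetical fixes for the phone and the building
phone = (34.2400, 108.9600)
building = (34.2405, 108.9600)
distance = haversine_m(*phone, *building)

# Hypothetical triangulated points with arbitrary scale (unit baseline)
up_to_scale = [(0.1, 0.0, 1.9), (-0.2, 0.1, 2.1), (0.0, -0.1, 2.0)]
metric_points = rescale_points(up_to_scale, distance)
```

The rescaled points can then serve as the 3D side of the 2D-3D correspondences used for PnP pose estimation.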
The experimental steps are as follows.
To calibrate the camera, Zhang Zhengyou’s chessboard calibration method is used, so 14 black-and-white rectangular chessboard images with different angles and poses are prepared. One of the calibration images is shown in Fig. 8.
A calibration image.
Corner information is extracted from the prepared chessboard images and refined to sub-pixel accuracy. The findChessboardCorners() function extracts the corner information, and the drawChessboardCorners() function draws the corner points. The result is shown in Fig. 9. In OpenCV, the calibrateCamera() function is used to calibrate the camera, and on this basis the camera calibration of the system is completed. Through the calibration function we obtain the camera intrinsic matrix, the distortion parameters, and the extrinsic parameters (rotation vector and translation vector) corresponding to each chessboard image. We then use the calibrated parameters and the projectPoints() function to recalculate the projection coordinates of the 3D corner points on the projection plane, and compare them with the previously detected coordinates to assess and correct the calculation errors. The calibration results are shown in Fig. 10.
The result of drawing the image corner points.
The calibration results.
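The re-projection check at the end of the calibration can be illustrated without any library. The sketch below assumes an ideal pinhole camera without lens distortion, which is a simplification compared with a calibration that also estimates distortion parameters; the board geometry, pose and intrinsic values are hypothetical.

```python
import math

def project(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection: camera coordinates X = R P + t, then u = fx * x / z + cx."""
    x, y, z = (sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
               for i in range(3))
    return (fx * x / z + cx, fy * y / z + cy)

def rms_reprojection_error(object_points, image_points, rotation, translation, intrinsics):
    """Root-mean-square pixel distance between detected and re-projected corners."""
    sq = 0.0
    for P, (u, v) in zip(object_points, image_points):
        pu, pv = project(P, rotation, translation, *intrinsics)
        sq += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(sq / len(object_points))

# Hypothetical 3x3 grid of chessboard corners with 25 mm squares, lying in the plane z = 0
board = [(x * 25.0, y * 25.0, 0.0) for y in range(3) for x in range(3)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (-25.0, -25.0, 500.0)          # board centred 500 mm in front of the camera
K = (800.0, 800.0, 320.0, 240.0)   # fx, fy, cx, cy in pixels

detected = [project(P, identity, t, *K) for P in board]   # ideal detections
shifted = [(u + 0.5, v) for u, v in detected]             # detections with a 0.5 px bias

err_exact = rms_reprojection_error(board, detected, identity, t, K)
err_shifted = rms_reprojection_error(board, shifted, identity, t, K)
```

A small RMS error over all calibration images indicates that the estimated intrinsic and extrinsic parameters explain the detected corners well; images with a large error are candidates for re-detection or removal.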



The system uses a 360 N4S mobile phone as the mobile hardware device. The phone has a Helio X20 processor, 4 GB of RAM and 128 GB of storage, and its sensors include a gyroscope and an electronic compass. The resolution of the front camera is 8 megapixels, and that of the rear camera is 16 megapixels. The server is a Sugon I840-G25. The CPU of the server is a Xeon E7-4809 v2, and the memory is 16
The mobile terminal collects real-time outdoor images and transmits them to the server over the wireless network. On the server, AKAZE extracts the features of each outdoor image, and the vocabulary tree is searched for similar images. The matching image is returned to the mobile terminal for tracking registration. Once the matching image is received, its features are matched with those of the real-time frame and the matches are refined to estimate the camera pose. The 3D building model is rendered and fused with the real scene by Unity3D. Finally, the 3D virtual building model is displayed in the real outdoor scene.
The real scene is located near the doctoral dormitory of Xi’an University of Architecture and Technology, and a pre-established 3D building model is placed on the open space in front of the dormitory, as shown in Fig. 11.
Enhancement display of model front view.
The model is zoomed in and out by pinching it with two fingers on the screen. Pressing the model and sliding changes its position. Holding the model for two seconds and then sliding changes its angle, as shown in Figs 12–14.
Enhanced display of model displacement view.
Enhanced display of model zoom out view.
Enhanced display of model side view.
This paper proposes an AR system. First, the AKAZE features of the training set images are extracted and the feature vectors are clustered into a vocabulary tree by K-means. The images are then vectorized and an inverted index file is generated. The collected scene images are vectorized in the same way as the training set images, and their similarity to the training set vectors is computed to return the recognition results. Then, on the mobile terminal, we collect real-time frames and match them with the target image using AKAZE features, which are extracted and then refined. The camera pose is estimated from the obtained matching points and the mobile position information. The model registration coordinates are calculated, and the feature points are tracked by optical flow. Finally, Revit is used to build the 3D building model, and the model is fused with the real scene by Unity3D, so that an augmented 3D building is displayed in the AR system.
To test the system, we first turn on the camera and transmit the acquired real-time frame wirelessly to the server. The image is identified on the server and the matching target image is returned to the mobile side. The features of the real-time frame and the target image are then extracted and matched. The distance information is calculated with the Baidu map SDK and the initialization of monocular SLAM is completed. The coordinates of the 2D and 3D feature points are obtained and the optical flow tracking record is initialized. For the PnP computation, we extract the features of the latest real-time frame and match them with the target image to obtain the matching point pairs. The average time of each process of the system is shown in Table 1.
Average time of each sub-process of the system
Under normal circumstances, when starting the application, we need to hold the current posture for a certain time to complete the system initialization. Afterwards, when we move slowly, the system can basically maintain real-time tracking, display the outdoor scene and offer good interaction. However, when we move quickly or there is an obvious occlusion, tracking is easily lost. Errors also accumulate during movement, and when they reach a certain level the tracking registration drifts.
In this paper, mobile AR is applied to architecture, realizing scene recognition, tracking registration and the fusion of virtual and real scenes. The experiment shows that the system can display a 3D architectural model in a real scene and give users a sense of immersion, thereby supporting architecture-aided design.
In our work, the monocular SLAM initialization algorithm is used to estimate the camera pose, which has the following problems:
SLAM performs poorly in environments with poor or complicated illumination. When the camera moves over a large range, the feature points being tracked are easily lost. Dynamic objects in the scene are not handled well. The amount of computation is large and the system responds slowly.
Therefore, the next step of this work is to study how to solve the tracking loss that occurs during fast movement or occlusion, so as to improve the stability of the system.
Acknowledgments
This work is supported by Innovation Leading Project of Xi’an Science and Technology Bureau under Grant NO. 201805033YD 11CG17 (1), and the Innovation Leading Project of Xi’an Science and Technology Bureau under Grant NO. 201805033YD11 CG 17 (2).
