Abstract
Real estate represents a major share of economic activity and wealth in all economies. Due to the lack of widely acknowledged standards, however, structuring, providing and managing building documentation that covers the entire life cycle remains challenging. Based on the empirical analysis of 8965 digital documents from 14 properties of 8 different owners, the article presents a model that will unify existing approaches and lead to the development of a document classification standard. This provides the basis for software systems to process relevant data and create timely information over the entire life cycle of a building. Further, it is shown that automated information extraction through artificial intelligence will become instrumental for enhanced and innovative business models and products in real estate such as automated data validation and data evaluation, documentation review, benchmarking and other analytical applications.
Introduction
Every company, regardless of its industry and goals, needs meaningful information to help management achieve its goals, trigger improvements and make good decisions (Bhatt, 2001: 70). It is vital to obtain the right information and to obtain it in the right place, in the right format and at the right level of detail. Generally, there is a lack of clear standards for methods, tools and structures. In fact, hardly any company lacks data; most instead suffer from too much of it being collected and presented (Bawden and Robinson, 2009: 187). Relevant information, on the other hand, which is reliable and up to date, remains scarce (Krcmar, 2015: 15). The availability of up-to-date and complete documentation for transactions, such as real estate investments, can help to reduce safety margins because risks can be estimated better. Although IT-supported solutions (e.g. CAFM, BIM or IWM systems) enable the technical provision of information over the whole life cycle of buildings, they do not guarantee consistent use, structured storage and sufficient quality of the information available (RICS, 2017: 28–29). In real estate, financial, transactional and physical data have to be compiled in real time, organized despite gaps and structured in an appropriate manner. A large portion of resources is devoted to data collection and analysis, particularly for the management of risk (RICS, 2017: 17; Winson-Geideman and Krause, 2016: 6–7). The broad range of acquired and reprocessed data and of individual client specifications exceeds the capacity of regular information pathways. As the quantity of information is constantly growing, there is a risk of overlooking the essentials (Bawden and Robinson, 2009: 182–184; Krcmar, 2015: 54). For this reason, a structured and needs-based provision of information is required.
Since real estate is an essential part of economic activity and wealth in all economies, there is a particular need for data about the structure, condition, equipment, operation and contractual relationships over the life cycle of buildings. The data are usually contained in numerous unstructured documents (e.g. service provider contracts, maintenance protocols, planning documents). However, ‘[t]o date, there is no universal, standardised system or protocol in place that would easily facilitate access, storage, update and transfer of building-related data and information in a standardised format along the value chain’ (RICS, 2017: 29). Due to the large number of relevant documents and the lack of widely acknowledged standards (e.g. methods, tools and structures), the capture, provision and updating of building documentation that covers the entire life cycle remain challenging. A consistent, unambiguous and overlap-free standard for the classification of documents will make it possible to migrate documents and data when different parties, structures and software systems are involved. Given such a standard, automated document classification, information extraction (IE) and predictive analytics based on artificial intelligence (AI) are within reach. Today's real estate management (REM) will change fundamentally into digital REM, particularly when it comes to due diligence (DD) processes. This article presents an approach for a seamless provision and labelling of building-related documents and expands on further use cases for AI in digital REM.
Aim of research, data and methodology
The central goal of digital REM is a seamless provision and explicit labelling of documents across the individual life cycle phases of buildings. In practice, approaches already exist for structuring within individual phases and for the transfer of data between phases. The basic ideas behind this article are the lasting and unique classification of documents as well as the possibility of automatically classifying relevant documents (unstructured documents) in order to form the prerequisites for software systems that may support the process throughout the life cycle of buildings. Based on this, relevant information can be systematically captured, structured and maintained.
To this end, empirical data on 14 properties from 8 different asset managers are investigated. The building documentation of the properties comprises 8965 digital documents in 2895 folders. These documents were provided by the PropTech firm Architrave GmbH, Berlin, subject to a strict non-disclosure agreement. Table 1 summarizes the features of the 14 properties. The properties are located in 10 German municipalities of different sizes and include all major use types. The building sizes range from 2500 m² (office) to 112,500 m² (warehouse).
Assets underlying empirical investigation of building documentation (own illustration).
I: industry; W: warehouse; O: office; R: retail; FS: food service; H: housing; HT: hotel; L: logistics.
The digital building documentation of the properties includes official documents such as clearance certificates, land-use plans, building permissions and urban development contracts; commercial documents such as rental contracts, invoices and receipts; organizational documents such as internal company guidelines, instruction protocols and certificates of qualification as well as technical documents such as maintenance protocols, expert opinions, facilities descriptions and instructions for use.
Of the 8965 documents, the largest shares relate to lease agreements (14.43%), floor plans (7.09%), defect tracking and warranty documents (6.85%) as well as building permits and building applications (5.76%).
First, this study evaluates five prevailing standards (as in Table 2) for the categorization of documents in REM and DD processes. These standards, although primarily from Germany, are typical for global institutional real estate investment (Bodenbender, 2019).
Prevailing building documentation standards (own illustration).
DD: due diligence; RICS: Royal Institution of Chartered Surveyors; GEFMA: German Facility Management Association.
Based on the empirical analysis, a framework is presented to show how data can be standardized and made compatible with each other.
The results of this work are intended to support and optimize the availability, completeness, uniformity, transparency, timeliness and applicability of documents and items of information for digital REM. The potential for automated document classification, IE and predictive analytics is outlined in the ‘Automated document classification in REM/DD’ and ‘Automated IE, predictive analytics and use cases in REM’ sections of this article.
Current information structure in REM/DD
The majority of building-related data are collected in day-to-day operations and are paper-based, such as maintenance protocols, contracts or energy consumption records. The origination and handling of building documentation usually involve facility management (FM), asset management (AM) and property management. Different functions (e.g. DD, fund management, finance, appraisal, maintenance) each call for different pieces of information (RICS, 2017: 17). Due to the glut of documentation and information that accumulates in REM, a database is needed to make documents and information verifiable and traceable (see Figure 1).

Status quo – document management in REM (own illustration). REM: real estate management.
Documents must comply with the respective process requirements and be displayable in different structures. This means that the documents not only have to pass through the various life cycle phases but also will be filed under different structures in each phase (see Figure 2). It is, therefore, critical to decide how the identified documents and information are to be captured, structured and maintained to make these last across all life cycle phases and keep them accessible for different stakeholders (e.g. owners, users, service providers) and processes.

Life cycle of a building and media breaks in documentation (own illustration).
For real estate transactions involving entire portfolios, for example, the interested parties usually have just a few weeks to perform a DD of the individual properties. Before a contract is concluded, documentation and information on inventory and operation have to be laboriously prepared and made available to the potential buyer for viewing in a data room. The building documentation contained in a data room easily reaches 1000 documents for a single transaction; in our sample, it comprises as many as 1933 individual documents. The number of documents relates primarily to the use type rather than to building size. The smallest building in the sample (2500 m² office, built in 2011) features 1535 documents, more than the 112,500 m² warehouse (1112 documents, built before 1960). The minimum number of documents for a property is 230, for the oldest building in the sample (5000 m² office, built in 1905–1906).
The process of filling a data room requires that the documentation on the construction phase as well as the operational phase be migrated in matching structures for the DD process. Gaps in the documentation (of the asset manager or the seller) will possibly remain hidden as a result of these shortcomings and can only be identified with great effort prior to the closing of the contract. If documents are not available, they have to be laboriously procured from different institutions (e.g. authorities, service providers); otherwise, larger safety margins will be applied. During the operation and use phase of a building, data are structured using criteria other than those for DD. The expense of sorting documents all over again occurs not only between construction, transaction and operational phases but also for other processes such as certifications or appraisals of buildings.
Deriving document classes for the classification of documents in REM
It becomes clear that building-related documents are not only required and utilized within one single life cycle phase but must be transferable into further phases. Given the multitude of existing solutions for the categorization and structuring of documents, there is a need for a more holistic approach. The analysis of 8965 digital documents and 2895 folders performed in this research shows that the number of folders used to structure documents increases as the number of documents grows. However, the number of empty folders grows as well, and does so disproportionately. A larger number of documents does not necessarily correspond to more folders and more structure for storage. This results from thematic overlaps due to an increasing number of categories. The categories cannot be unambiguously delineated from each other once a given level of detail is reached. To avoid redundancies in the repository, the user must decide on one category; the other overlapping folders then remain empty. Further, the method of manual storage can differ by user or even by standard, depending on the respective application (e.g. transaction, certification, operation). Documents appear to be missing if they have been attributed to a category other than the expected one. Therefore, documents need to be attributed to different categories of different standards in order to export and transfer contents adequately. As a consequence, the structuring effort occurs not only once but separately for every application. Compatibility of categories between standards is useful if it makes possible the automated transfer of documents from one structure to another.
To test the extent to which standards are compatible with each other, the categories of the five prevailing building documentation standards (‘Aim of research, data and methodology’ section) were set in relation to each other (Table 2).
It was found that only a limited proportion of the categories are directly transferable between the standards (see Figure 3). The percentage of 1:1 connections differs considerably depending on which standard is taken as the starting point, since the standards feature different numbers of categories.

Analysis of connections between standards (own illustration).
For example, the document classes for the operation phase can be assigned directly to the categories of the German Society of Property Researchers (gif) data room index in only 43% of cases, and in only 32% of cases to the categories of the Drooms data room index (percentage of linkable categories). An automatic transfer (by n:1 or 1:1 connections) from the document classes for the operation phase to the categories of the gif data room index is possible in only 30% of cases, to the categories of the Drooms data room index in only 20% of cases. Therefore, if documents pertaining to one building need to be prepared for a transaction, the majority of documents need to be sorted manually between different standards. Once a transaction is completed and documents need to be transferred to the operation and use phase, the same problem occurs again. Categories of the standards cannot be unequivocally linked 1:1 to each other.
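The difference between linkable and automatically transferable categories can be illustrated with a minimal sketch. It assumes one possible reading of the analysis: a source category is linkable if it has at least one counterpart in the target standard, and automatically transferable only if it has exactly one (a 1:1 or n:1 connection). The category names and the mapping are hypothetical and are not taken from the analysed standards.

```python
# Hypothetical mapping: source document class -> candidate target categories.
# An empty set means that no counterpart exists in the target standard.
mapping = {
    "maintenance protocol": {"technical documentation"},
    "inspection report": {"technical documentation"},  # n:1 with the line above
    "rental contract": {"lease agreements"},           # 1:1
    "energy certificate": {"energy", "certificates"},  # 1:n, needs a manual decision
    "operating instructions": set(),                   # not linkable
}


def connection_shares(mapping):
    """Return (share of linkable, share of automatically transferable) categories."""
    total = len(mapping)
    linkable = sum(1 for targets in mapping.values() if targets)
    # Exactly one candidate target (1:1 or n:1) allows an automatic transfer.
    automatic = sum(1 for targets in mapping.values() if len(targets) == 1)
    return linkable / total, automatic / total


linkable_share, automatic_share = connection_shares(mapping)
print(f"linkable: {linkable_share:.0%}, automatic: {automatic_share:.0%}")
```

In this toy mapping, four of five categories are linkable but only three can be transferred without a manual decision, which mirrors the gap between the two percentages reported above.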
From a practical standpoint, data transfer is expedited by using translation tables that show the old and new standards and clearly identify from which field, which category and which file the contents should be transferred to another standard. However, this assumes that the standards can be linked 1:1. The decision on a category under conditions of doubt or in the face of several possible categories is not yet supported by existing software systems. So, a choice has to be made between multiple filing and non-redundant filing both of which face obvious disadvantages.
German Facility Management Association (GEFMA) 922 (GEFMA, 2016) assigns phases, processes and objects to 1000 typical documents in the life cycle of a building. By means of a rule model, it is possible to assign documents to a particular document class in accordance with GEFMA 198 (GEFMA, 2013). Corresponding test runs produced an accuracy of 79.67%. The accuracy results almost exclusively from knowledge of the process by which the document came into being. Processes could again be correctly assigned to the appropriate category of the GEFMA 198 standard with an accuracy of 70.27%. Because of the different levels of detail between the standards, this approach is limited, however. Since other standards relate to objects, cost groups, subject areas or requirements within a process, a rule based on processes will not work with every standard. In a next step, therefore, the groups, criteria, documents and so on from the five standards were extracted and integrated into one data pool. Redundancies were eliminated and overlapping subjects separated from each other. To further segregate the documents that referred to the same subject, the meta-information described above (phase, process and object) was combined with the document type as in Figure 4.

Mapping of phase, process, object and type to each document (own illustration).
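As a minimal sketch of this idea, a document class can be modelled as a combination of the four attributes, with a process-based rule as one possible way to derive a category in another standard. The attribute values, the rule table and the category names are hypothetical and only serve as an illustration; they are not the authors' document classes or the GEFMA rule model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentClass:
    phase: str     # life cycle phase in which the document arises
    process: str   # process by which the document came into being
    obj: str       # object or reference the document relates to
    doc_type: str  # type of document


# Two documents on the same subject remain separable through their meta-information.
permit = DocumentClass("construction", "approval", "fire protection", "permit")
protocol = DocumentClass("operation", "maintenance", "fire protection", "protocol")

# Hypothetical process-based rules pointing to categories of a target standard.
PROCESS_RULES = {
    "approval": "official documents",
    "maintenance": "technical documents",
}


def assign_category(document_class):
    """Return the target category for the class, or None if no rule applies."""
    return PROCESS_RULES.get(document_class.process)


print(assign_category(permit), "|", assign_category(protocol))
```

Because the dataclass is frozen, its instances are hashable and can serve directly as keys in translation tables.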
The resulting so-called document classes were then connected with the previously analysed standards (see Figure 5). In doing so, it was important that every connection corresponded to a 1:1 connection. This way, redundancies were avoided and overlaps excluded.

Method to make standards compatible with each other (own illustration).
As a result of this unique allocation, an export into different structures is possible without any manual effort (see Figure 6). Thus, automated migration by means of a translation between the standards, which is based on the smallest common denominator and enriched with meta-information, is possible with high accuracy. A further advantage is that users only have to check the correct allocation to document classes, but not the categorization in any other standard.

Benefit of the document classes (own illustration).
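The export mechanism can be sketched as follows, assuming that every document carries exactly one document class and that every document class has exactly one linked category per standard. The standards, categories and file names are hypothetical placeholders.

```python
# Hypothetical 1:1 links from each document class to one category per standard.
LINKS = {
    "rental contract": {"standard_a": "A.3 Leases", "standard_b": "B-07 Contracts"},
    "maintenance protocol": {"standard_a": "A.5 Technical", "standard_b": "B-12 Operation"},
}

# Each document is stored once, labelled only with its document class.
documents = [
    ("lease_unit_1.pdf", "rental contract"),
    ("hvac_2018.pdf", "maintenance protocol"),
    ("lease_unit_2.pdf", "rental contract"),
]


def export(documents, standard):
    """Group file names by the category of the requested standard."""
    structure = {}
    for name, doc_class in documents:
        category = LINKS[doc_class][standard]
        structure.setdefault(category, []).append(name)
    return structure


print(export(documents, "standard_a"))
print(export(documents, "standard_b"))
```

The same stored documents yield two different folder structures, so only the allocation to document classes has to be checked by the user.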
An automated transfer of already existent data is not the only desirable outcome of this procedure. Moreover, the documents can be clearly classified in any import or digitization process throughout the building life cycle and restructured at any time for specific applications. To classify unstructured documents (e.g. contracts, protocols) in this way, the content of the documents, the frequency of specific words and the relevant information must be determined. Using the document classes, a manual classification by means of meta-information such as document type, phase and process and object/reference can be performed already (see Figure 7).

Benefit of the defined properties for document classification (own illustration).
The complete document classes are made available by the authors upon request and will be published in due course.
Automated document classification in REM/DD
Over and above this, an automated classification of documents is possible. The non-overlapping demarcation of document classes, free of redundancies, makes it possible to train an algorithm that evaluates new documents and assigns each of them to the appropriate document class. Automated document classification can form the basis for filing documents in data rooms quickly and comprehensively. Previously trained algorithms already help to automatically recognize, classify and name documents as well as to sort them into an individual structure in data rooms (Bodenbender, 2019). The algorithms used in classification are mostly based on AI, in particular machine learning (ML) and natural language processing (NLP) techniques (Russell and Norvig, 2010). ML can considerably increase the efficiency of classification procedures. NLP enables the recognition and processing of texts in written language. This means that, for example, rental agreements are identified as rental agreements and maintenance protocols as maintenance protocols. From a large number of labelled example documents, an algorithm learns patterns, word combinations and text components. The algorithm can use these to classify subsequent documents independently and to increase hit rates (Figure 8).

Classification with supervised learning (own illustration).
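The principle of supervised classification can be sketched with a small multinomial naive Bayes classifier written in plain Python. This is an illustrative stand-in, not the algorithm used in the tests reported in this article, and the training texts are invented examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split a text into lower-case word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocabulary.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


training_texts = [
    "the landlord lets the premises to the tenant for a monthly rent",
    "rent is payable monthly and the tenant provides a deposit",
    "maintenance of the ventilation system was carried out and filters replaced",
    "inspection and maintenance of the heating system completed without defects",
]
training_labels = [
    "rental contract", "rental contract",
    "maintenance protocol", "maintenance protocol",
]

classifier = NaiveBayesClassifier()
classifier.fit(training_texts, training_labels)
print(classifier.predict("the tenant pays rent to the landlord"))
```

With only four training texts, the sketch merely shows the mechanics of learning word patterns per class; realistic accuracy requires far more labelled documents per document class.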
The first tests performed on the building documentation of the 14 properties in our sample, with just a small number of document classes (maximum 30), produced an accuracy of up to 90%. For a larger number of document classes, the data set for training the algorithm must be increased. Besides fast import and accurate categorization, an automated classification provides various additional benefits. For example, data rooms normally point out missing subjects in the form of empty file folders (e.g. rental contracts, maintenance protocols, building permits). However, a system can only determine whether all documents on a particular subject (e.g. all maintenance protocols for all technical facilities) are present if data from supporting systems (e.g. a CAFM system) can also be accessed. As part of a technical DD prior to the acquisition of a building, on-site inspections and manual examinations of the documentation have to be carried out. An automated classification would already be able to discover documents that have been manually attributed to a particular document type but do not qualify as such (e.g. blank pages or a different document altogether). In the course of this research, the document classes as well as manual and automated classifications of 8965 documents and 2895 folders from various data rooms were validated. The document classes were checked for completeness by means of standardized document types (GEFMA 922, 2016; n = 1000 documents). The initial test run with real documents showed that not all of the digital documents were directly suitable for automated classification. For example, different documents should always be scanned into separate files, be processed by optical character recognition and be attributed to the correct language. These requirements, however, can be met rather easily in practice and pave the way for automated document classification in REM on a larger scale.
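Two of the checks mentioned above, missing subjects and files that do not qualify as the document type they were filed under, can be sketched as follows. The set of required document classes, the word-count threshold and the sample files are assumptions made for illustration only.

```python
# Hypothetical set of document classes expected in a data room.
REQUIRED_CLASSES = {"rental contract", "maintenance protocol", "building permit"}


def review(documents, required=REQUIRED_CLASSES, min_words=5):
    """Return (missing classes, suspicious files).

    documents: list of (file name, assigned class, extracted text) tuples.
    A file is flagged as suspicious if its text is nearly empty, e.g. a blank page.
    """
    present = {doc_class for _, doc_class, _ in documents}
    missing = sorted(required - present)
    suspicious = [name for name, _, text in documents if len(text.split()) < min_words]
    return missing, suspicious


sample = [
    ("lease_unit_1.pdf", "rental contract",
     "The landlord lets the premises to the tenant for a monthly rent."),
    ("scan_0042.pdf", "maintenance protocol", ""),
]

missing, suspicious = review(sample)
print("missing:", missing)
print("suspicious:", suspicious)
```

Note that this simple check still counts a class as present when its only file is suspicious; a stricter variant could exclude flagged files before testing for completeness.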
Automated IE, predictive analytics and use cases in REM
A new research focus on digital REM is the extraction of information from documents and the recognition of patterns for the derivation of trends (predictive analytics) from building data.
The aim of IE is to recognize and extract specific information types such as entities or relations from texts in machine-readable documents (Jurafsky and James, 2017). The segmentation rules must be defined in as much detail as possible to enable a comprehensive and precise extraction from one segment of a document without having to refer to the rest of the document. Because it turns unstructured content into structured data, IE is a central technology for information processing. As long as a relevant portion of building documentation consists of unstructured documents (e.g. contracts, protocols), IE is key to making comprehensive use of building data. Automated IE can even identify gaps or point out redundancies in the documentation.
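A minimal sketch of segment-based extraction is given below. It uses regular expressions as a simple stand-in for the extraction techniques discussed here; the sample lease clause, the field names and the patterns are hypothetical.

```python
import re

# Hypothetical segment of a lease agreement after text recognition.
segment = (
    "The monthly rent amounts to EUR 12,500.00. "
    "The lease commences on 01.03.2015 and ends on 28.02.2025."
)

# One pattern per field; each pattern only needs the segment itself.
PATTERNS = {
    "monthly_rent_eur": r"EUR\s*([\d,]+\.\d{2})",
    "start_date": r"commences on (\d{2}\.\d{2}\.\d{4})",
    "end_date": r"ends on (\d{2}\.\d{2}\.\d{4})",
    "notice_period_months": r"notice period of (\d+) months",
}


def extract(segment, patterns=PATTERNS):
    """Return a structured record; fields that are not found are set to None."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, segment)
        record[field] = match.group(1) if match else None
    return record


record = extract(segment)
print(record)
print("gaps:", [field for field, value in record.items() if value is None])
```

Fields that remain empty, such as the notice period in this example, indicate the kind of documentation gap that automated IE can point out.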
Still, IE is just the first step. The creation of value results from data analytics and specific applications. In digital data rooms, documents are collected and their availability, timeliness and consistency are ensured. Such data rooms can serve as a central document source and meet growing data protection requirements. Traditional data rooms are usually installed only for a limited period of time. To take greater advantage of the available information, current developments are moving towards inventory data spaces. Permanent data rooms open up new value-adding opportunities by combining further applications and AI even for a larger number of users (Figure 9).

Procedure of IE with segmentation, data warehouse and analytics (own illustration). IE: information extraction.
A comprehensive use of building data and predictive analytics, including data from Internet of Things (IoT) and sensor technologies in building operation, has the potential to alter the real estate game. Use cases for enhanced and innovative business models and products based on permanent digital data rooms include automated data validation and data evaluation, documentation review, benchmarking and other analytical applications. Since, for instance, common start-up processes in FM can take several months, large savings can be realized particularly in this area. Similarly, standard processes in the course of building appraisals, such as the collection and combination of raw data, can be substituted by automated IE and supporting software systems in order to increase the efficiency of the overall process. Even data-based market or building forecasts (e.g. predictive maintenance) could be performed quickly and efficiently as automated IE and predictive analytics become more advanced.
Summary, conclusions and recommendations
Today, the capture, provision and updating of building data are important but costly, and so are the information losses that regularly occur during the life cycle of buildings. Digital REM of tomorrow will greatly benefit from high-quality information and key performance indicators that may be accessed in real time from permanent digital data rooms.
This article examined existing approaches in theory, research and practice for structuring objects of building documentation. The five prevailing building documentation standards were mapped to each other and then related to the building documentation data. Likewise, 8965 documents from 14 properties were matched with the standards. Overlaps, redundancies and missing unambiguous 1:1 and n:1 connections lead to the conclusion that manual migration of documents between the individual processes and phases in the life cycle of buildings is very costly. The properties from eight different owners all had different features in terms of use type, age, size, layout and location, which were unrelated to the quality of the building documentation. Building age should affect the number of documents associated with a building. However, we find that substantial document losses occur throughout the life cycle of buildings, probably associated with former changes of ownership; this may change in the future with digital building documentation, which is increasingly in demand and can easily be migrated. Larger buildings do not necessarily feature a larger number of building documents; rather, this depends on the use type of the building. Newer buildings usually have better sorted building documentation. However, owners digitize the building documentation even of older buildings (if available) in digital data rooms to facilitate transactions, which is usually done by service companies that have specialized in this field.
The effort of migrating documents can be reduced through unambiguous document classes – as presented in this article – that serve as interfaces between the standards and make unambiguous mapping possible. The classes may reduce time and effort in the migration of documents and data. ML may further accelerate this process. Through these structures, classes and automation, possible errors in the processes of providing and preparing documents may be reduced and redundancies in the storage of data avoided.
For the different stakeholders (e.g. owners, users, service providers) who are reliant on building documentation, there are clear benefits from using the document classes: storing documents without redundancies, changing the categorization of documents with little effort and using various standards for different processes at the same time. At this stage, only building-specific documents were considered, but individual standards may be supplemented at any time by a one-time assignment (mapping) to the document classes. In the same fashion, international standards, which rarely exist in more than draft versions, can be considered as they become published. A direct compatibility of many diverse standards can thus be achieved. In further research and proceedings, the document classes will be optimized for automated document classification by means of ML.
Automated IE opens a promising way for enhanced and innovative business models and products in real estate such as automated data validation and data evaluation, documentation review, benchmarking and other analytical applications. In the medium term, smart data from buildings and facilities (IoT) will form a new source of information. As an example, by linking sensor data with statistical data from the building documentation, an information infrastructure can be created to forecast and optimize predictive maintenance measures and cycles (Deloitte University Press, 2016: 8). Going forward, these data may form a basis for stochastic building life cycle simulation tools, which are already the subject of further research.
It can be assumed that manual tasks for the capture, provision and updating of building data will increasingly be substituted by AI. Professional services in all fields of real estate, as in other industries, will likely be affected by this fundamental change. Managers and service providers should be prepared for digital REM and adjust their business models to capitalize on the opportunities that result from new technologies. Besides, they will have to acknowledge that quality data justify an adequate price for improved transparency and reduced risk. As data and information become more and more accessible, the quality of data, the quality of services and the quality of buildings will increasingly make the difference for competitive advantages in real estate. Skilled people combined with a sense of responsibility and sound decisions should remain the basis for this, now and in the future.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
