Abstract
RFID is used in automotive industry logistics to automate processes and optimise material flow. However, the data generated by RFID installations during operation offer potential for further analyses that could yield additional benefits from the technology. Therefore, in this paper, RFID data are used to create a digital twin of the RFID-enabled material flow (DTRMF) in real-time and to implement various big data analyses. The architecture of the DTRMF must meet various qualitative requirements. Since the big data and digital twin architectures available in the literature either do not optimally fulfil all these requirements or are not described in enough detail to support real applications, this paper presents a new digital twin architecture for RFID-enabled material flow. This architecture consists of the data ingestion layer, the data processing and analyses layer, the data storage layer, the visualisation layer, and the optional semantic layer. In addition, suitable technologies for the implementation of the architecture are described, and the feasibility of the architecture is demonstrated and verified by means of a case study.
Introduction
Radio frequency identification (RFID) is one of the automatic identification (Auto-ID)
procedures (Finkenzeller, 2015) and is one of the
central technologies for the Internet of Things (IoT) (Uckelmann et al., 2011), because it allows objects to be identified automatically.
RFID is a versatile technology and is therefore used in various areas, such as production
and logistics (Franke & Dangelmaier, 2006;
Kirch et al., 2017), fashion and apparel retail
(Esposito et al., 2015; Richter, 2013) or healthcare (Tu et al., 2019). The use of RFID for
product tracking and tracing has been steadily increasing in the supply chains of these
diverse sectors due to the numerous benefits that this technology offers (Cilloni et al., 2019; Uckelmann & Romagnoli, 2016). In automotive industry logistics,
RFID is used at various points to automate individual processes and, thus, to optimise
material flow (Richter, 2013). RFID gates are used at goods receipt, for example, as the
load carriers containing the materials for the vehicles might be equipped with RFID tags
(see, for example, Knapp & Romagnoli, 2021).
When a forklift truck drives through the RFID gate with these load carriers, the RFID tags
are read, and the goods are automatically identified and processed. During operations, the
individual RFID installations constantly generate data. In practice, however, the data
generated by an RFID installation are mostly used for the automation and optimisation of one
or more processes, without being linked to other data or used for additional analyses to
optimise the material flow. Linking RFID reads with other data, however,
could be a way to mitigate (or even solve) the commonly known problem of low cost efficiency
of RFID (Abugabah et al., 2020; Costa et al., 2017; Moretti et al., 2019). Big data analytics offers considerable potential here;
for example, it can be used to reduce inventory costs (Amerland, 2015; Klein et al.,
2013). To use data from individual RFID installations for new insights,
the so-called digital twin concept is suitable, as the provision of data from different
sources can be streamlined with a digital twin (Lietaert
et al., 2021). A digital twin is the virtual representation of a real object or a
process that can be used in different areas, such as supply chain and logistics, production,
or health management (Abideen et al., 2021; Niaki & Shafaghat, 2021; Tao et al., 2019). Digital twins bring several benefits to the
industry, such as higher reliability, greater robustness, more predictable production
processes, and the ability to proactively manage supply chain risks (Greis et al., 2021; Jamwal et al.,
2021; Lietaert et al., 2021; Tao et al., 2019). The digital twin concept can be
used in the automotive industry for real-time material flow optimisation, as load carriers
equipped with RFID tags and intralogistics processes of an automotive manufacturer can be
represented by a digital twin. We note that the term real-time is described in DIN 44300
(Deutsche Industrienorm) as a type of processing in which the processing results are
available within a defined (and typically very short) period of time (DIN, 1988; Scholz, 2005). In
this paper, real-time is understood as soft real-time, since the digital twin is currently
not a production-critical system, but rather an addition to the existing logistics backend
system. Soft real-time means that occasional exceeding of the defined time span can be
tolerated (Scholz, 2005). The goals of the digital twin are to create transparency of the
material flow, optimise processes, save costs, and simplify and optimise planning. However,
it is a challenge to meet the various qualitative requirements of the digital twin, such as
the ability to perform big data analyses, a short development time, and easy maintenance.
Therefore, a suitable architecture is needed to take these diverse qualitative requirements
into account. The literature review provided in Section 5 shows that existing big
data and digital twin architectures are not optimally suited for the real-time digital twin
of the RFID-enabled material flow (DTRMF), because they are too complex to implement
and operate, are not described in concrete terms, or do not optimally support
the execution of batch jobs for big data analyses. Therefore, the aim of this paper is to
develop a suitable architecture for the DTRMF in real-time and to answer the following
research question
Methodology and structure of the work
The methodology used in this paper is based on the Design Science Research Methodology according to Hevner (Hevner et al., 2004; Hevner et al., 2010). Figure 1 shows the three cycles of this methodology, namely the (i) relevance cycle, (ii) design cycle, and (iii) rigour cycle, which have been used to develop the DTRMF. The structure of the present work is shown in Fig. 2. In Section 3 the use of RFID in automotive industry logistics is explained. Section 4 then defines the DTRMF, describes the target image and data of the DTRMF as well as the target groups, and determines the requirements of the DTRMF based on expert discussions (relevance cycle). Section 5 describes the state of the art of digital twin and big data architectures, which were identified through a systematic literature review and thus form the knowledge base (rigour cycle). Based on this knowledge base, the architecture of the DTRMF is developed and described in Section 6 (design cycle). Subsequently, the appropriate technologies for the implementation of the architecture are selected in Section 7, where a database comparison is performed. The selected technologies expand the knowledge base (rigour cycle). Afterwards, Section 8 reports a case study for the implementation of the DTRMF in the body shop of an automotive manufacturer, whose evaluation and achieved benefits are reported in Section 9 and also contribute to the knowledge base (design and rigour cycle). Lastly, Section 10 draws conclusions and suggests possible future research lines.

Fig. 1. Design science research cycles according to Hevner et al. (2010).

Fig. 2. Structure of the remainder of the paper.
RFID technology can be used at various points in the logistics of the automotive industry to optimise material flows (Richter, 2013). The advantage of RFID compared with barcodes is that line-of-sight contact is not required and manual barcode scanning is no longer necessary (Finkenzeller, 2015). In addition, many RFID tags can be read automatically in bulk, thus saving time (Fleisch & Mattern, 2005). In the following, the use of RFID in logistics is explained using the example of the body shop of an automotive manufacturer. The material flow is shown in the simplified scheme reported in Fig. 3. The supplier or the press shop delivers the components for the vehicle in load carriers. These load carriers are usually stored for a given time in a warehouse and then transported to the body shop. A distinction is often made between universal load carriers and special load carriers. Universal load carriers are standardised, for example in VDA 4500 (VDA, 2018) and VDA 4520 (VDA, 2011). Universal load carriers may be used by various companies in so-called pool systems and are managed by pool operators. Special load carriers are made for components that are not suitable for transport in universal load carriers, and they only circulate between individual suppliers and the car manufacturer. In the examined use case, special load carriers are equipped with permanent multi-use RFID tags, and universal load carriers carry a single-use RFID label. A Unique Item Identifier (UII), also known as Electronic Product Code (EPC) when using the EPC Tag Data Standard (GS1, 2019), is written on the passive RFID tags, and all other information about the UII is stored in the logistics backend system.

Fig. 3. Overview of the simplified material flow of the body shop.
In the examined scenario, RFID tags are currently read in two different applications: automatic goods receipt and plausibility check. The aim of the automatic goods receipt with an RFID gate is that the barcodes of the load carriers no longer have to be scanned individually, which saves time during goods receipt. For this purpose, RFID antennas are mounted on the sides of the goods receiving gate, which automatically read the RFID tags of the load carriers when a forklift truck drives through. The read UIIs are sent from the reader to the logistics backend system together with further information such as gate ID and timestamp. In the logistics backend system, this information is compared with the delivery notified by the supplier. If the information matches, the goods are booked, and this is displayed to the employee on the forklift terminal. If one or more RFID tags cannot be read, e.g. due to a hardware defect or interference, the emergency process requires these RFID tags to be scanned manually with a handheld device. After passing through the gate, the load carriers are first placed on a transfer area, and from there they are normally transported to the various warehouses of the body shop. As soon as the material is needed for production, the logistics backend system generates a transport order, and the load carriers are driven from the warehouse to the production supply area.
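The comparison performed at goods receipt can be illustrated with a minimal Python sketch. It is not the logic of the actual logistics backend system: the class and field names (`GateRead`, `ReceiptResult`, `check_goods_receipt`, `add_handheld_scans`) are assumptions made for illustration, and the rule "book only if the read UIIs equal the notified UIIs" is one plausible reading of "if the information matches".

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class GateRead:
    """One read event sent by an RFID gate reader (illustrative fields)."""
    gate_id: str
    timestamp: datetime
    uiis: set[str]


@dataclass
class ReceiptResult:
    booked: bool
    missing: set[str] = field(default_factory=set)     # notified, but not read
    unexpected: set[str] = field(default_factory=set)  # read, but not notified


def check_goods_receipt(read: GateRead, notified: set[str]) -> ReceiptResult:
    """Compare the UIIs read at the gate with the notified delivery."""
    missing = notified - read.uiis
    unexpected = read.uiis - notified
    booked = not missing and not unexpected
    return ReceiptResult(booked, missing, unexpected)


def add_handheld_scans(read: GateRead, scanned: set[str]) -> GateRead:
    """Emergency process: merge manually scanned UIIs into the gate read."""
    return GateRead(read.gate_id, read.timestamp, read.uiis | scanned)
```

In this sketch, a delivery with an unread tag is not booked until the missing UII has been added by a handheld scan, which mirrors the emergency process described above.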
In the body shop, there are different check points where it must be ensured that the right load carriers, and thus the right materials, are supplied. To ensure this, an RFID reader is present at these stations, to automatically read the RFID tag of the load carrier. The UII of the load carrier is transmitted to the logistics backend system, which then performs a plausibility check. If this plausibility check is successful, the open transport order is automatically acknowledged and the robot begins to remove parts from the load carrier. When the material has been removed from all load carriers, the empty load carriers are transported to a transfer area. From there, they are transported directly or via a dedicated warehouse to the suppliers or the press shop. There are other places in the material flow where RFID could be used to track the movements of the load carriers and thus increase transparency, for example, at the incoming and outgoing goods of the supplier / press shop, the warehouse, and the warehouse dedicated to empty load carriers. RFID gates are suitable for this because they read the RFID tags automatically (Fleisch & Mattern, 2005), and thus no additional manual work is required.
There are many different definitions of the term digital twin in the literature (Al-Sehrawy & Kumar, 2021). In general, a digital twin is a virtual image of products, systems, or processes (Jamwal et al., 2021; Klostermeier et al., 2020). Grieves (2014) describes the concept of the digital twin as consisting of the following components: “a) physical products in Real Space, b) virtual products in Virtual Space, and c) the connections of data and information that ties the virtual and real products together” (Grieves, 2014). Since there are different definitions, characteristics and possible applications of digital twins (Niaki & Shafaghat, 2021), the DTRMF in real-time is described below as it is understood in this paper. This digital twin maps the special load carriers with multi-use RFID tags, as well as the various logistics processes that the load carriers undergo. In this DTRMF, the data exchange only goes in one direction, i.e. from the real space of the product / process (material flow) to the virtual space (digital twin). Information from the DTRMF is made available to production supply and logistics planning, whose staff can intervene to optimise the material flow, if necessary.
The goals of the DTRMF are to create transparency of material flow, optimise processes, save time, reduce costs, and simplify and optimise planning. In practice, these goals are to be achieved by means of various big data analyses, such as automatically evaluating process and supply times for a timely detection of anomalies, for example in the case of late material delivery. Another big data analysis would be the RFID-enabled automated inventory of the special load carriers, where the actual stock can be compared with the purchased stock. This analysis could make it easier to estimate how many special load carriers must be purchased for reliable operations, thus saving costs and warehouse space. Lastly, the trend in the warehouse occupancy rate can be evaluated with respect to given targets, to produce comparisons and use the historical data for sound future planning.
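Two of these analyses can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not specify: supply times are given in minutes per transport order, a supply counts as late when it exceeds a fixed multiple of the median supply time, and the actual stock is approximated by the set of distinct UIIs read. The function names `late_supplies` and `inventory_gap` are hypothetical.

```python
from statistics import median


def late_supplies(durations_min: dict[str, float], factor: float = 1.5) -> list[str]:
    """Return the transport orders whose supply time exceeds factor x median."""
    if not durations_min:
        return []
    limit = factor * median(durations_min.values())
    return sorted(order for order, d in durations_min.items() if d > limit)


def inventory_gap(purchased_stock: int, read_uiis: set[str]) -> int:
    """Purchased stock minus the load carriers actually seen by RFID readers."""
    return purchased_stock - len(read_uiis)
```

A median-based threshold is used here only because it is robust against the very outliers it is meant to detect; any other anomaly rule could take its place.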
For these big data analyses and the creation of the digital twin, data from various sources are needed, as can be seen in Fig. 4. The DTRMF will be developed based on data generated automatically by RFID readers, by handhelds (by scanning barcodes or RFID tags), or by manual input via the forklift terminals or via notebooks. These data are then sent to the logistics backend system via various interfaces. The logistics backend system contains additional information, for example, on transport orders, materials stock, and notified deliveries from suppliers. Data from the logistics backend system are stored in a database and will be linked in the DTRMF with data from other sources, such as those from the load carrier management system. In addition, the DTRMF could also store historical data for big data analyses. The material flow data accumulated over time have a timestamp and form a time series. Finally, other relatively static data are available with basic information about load carriers, warehouses, production supply areas, goods receipts, et cetera.

Fig. 4. Data sources for the digital twin.
To achieve the goals of the digital twin, certain qualitative requirements must be met. The
most important qualitative requirements for the DTRMF were determined in expert
discussions with future users of the DTRMF from production supply and logistics
planning at a German OEM and were defined as follows in accordance with Goll (2014) and ISO/IEC (2011):
This section describes the state of the art of digital twin and big data architectures. For this purpose, a systematic literature review was conducted. It started from a Google Scholar search (freely available, as opposed to commercial offerings such as Scopus) with the search strings “digital twin” AND architecture AND logistics, and “digital twin” AND architecture. Only references published after 2017 were considered, and other relevant papers were added to the list according to their technical relevance. Table 1 shows the results of the literature review, the properties of the architectures, and the assessment regarding the fulfilment of the requirements R1 to R4 for the DTRMF. The requirements R5 to R10 are not considered in this evaluation because they should be fulfilled through the choice of appropriate technologies or environments.
Table 1. Overview of the properties of the architectures and the fulfilment of the requirements
Legend:
= described in detail / requirement completely fulfilled;
= described superficially or partially / requirement partially fulfilled;
= not described / requirement not fulfilled.
The lambda architecture was developed by Marz, and the approach was introduced in 2011 in the blog post “How to beat the CAP theorem” (Lin, 2017; Marz, 2011). It describes an architecture (at that time still called batch/real-time architecture) “that beats the CAP theorem by preventing the complexity it normally causes” (Lin, 2017; Marz, 2011). The CAP theorem (Consistency, Availability, Partition Tolerance) states that in massively distributed computer systems at most two of the three properties consistency, availability, and partition tolerance can be fulfilled simultaneously (Meier, 2018). The resulting lambda architecture consists of three layers: speed layer, serving layer, and batch layer, and it has been implemented in later studies, e.g., in Sanla and Numnonda (2019). Since the lambda architecture requires two systems to be implemented for the same tasks (i.e., batch and speed layer), it has the disadvantage of requiring double effort for programming and maintenance (Kreps, 2014), and therefore it does not fulfil the two requirements R2 (development time) and R3 (maintainability) for the digital twin.
The kappa architecture, which has been implemented, for example, in Sanla and Numnonda (2019), solves the problem of double effort (Kreps, 2014), because it contains no batch layer but only a messaging system, a stream processing system (speed layer), and a serving database (serving layer). Therefore, the kappa architecture fulfils the requirements R2 (development time) and R3 (maintainability), but the entire processing logic must be realised as stream processing. This can be impractical for analyses that process data reaching far into the past (Berle, 2017), and therefore the kappa architecture does not fulfil requirement R1 (big data analyses).
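The difference in effort between the two architectures can be illustrated with a minimal Python sketch of the kappa idea, in which a single processing function serves both live events and historical recomputation by replaying the retained event log. In a lambda architecture, the same counting logic would have to be written and maintained a second time for the batch layer. The event shape and the names `apply_event` and `replay` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable

Event = tuple[str, str]  # (uii, location); hypothetical event shape


def apply_event(state: Counter, event: Event) -> None:
    """The single processing logic: count reads per location."""
    _uii, location = event
    state[location] += 1


def replay(log: Iterable[Event]) -> Counter:
    """Recompute the state by re-reading the retained event log."""
    state: Counter = Counter()
    for event in log:
        apply_event(state, event)
    return state
```

The sketch also hints at the limitation noted above: every historical analysis requires a pass over the whole log through the stream logic, which becomes increasingly costly the further back the data reach.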
In Haße et al. (2019) the lambda architecture for real-time IoT analytics in logistics is described, which consists of the layers: data visualisation, data processing, and semantic layer, and the optional data acquisition layer. Because this architecture is based on the lambda architecture, it does not fulfil requirements R2 (development time) and R3 (maintainability). The semantic layer is not included in the lambda architecture and serves to overcome the “difficulties due to a lack of semantic interoperability between architectures, standards and ontologies” (Haße et al., 2019) in digital twins, and therefore the requirement R4 (data interoperability) is fulfilled.
In Korth et al. (2018) the architecture for a simulation-ready digital twin for real-time management of logistics systems is presented, which consists of the six components event controller, simulation, logbook, model, reporting, and persistence. This architecture is not described in enough detail to be adopted; for example, it is unclear whether the model includes stream processing, batch processing, or both, and it is not mentioned which technologies can be used to implement this architecture. Therefore, the fulfilment of the requirements R2 (development time), R3 (maintainability) and R4 (data interoperability) cannot be assessed.
The cloud-fog-edge-based digital twin control framework is presented in Pan et al. (2021). The framework distinguishes between a physical and a virtual layer. The physical layer contains the entity & IoT, control strategy, computing, and local optimiser dimension, and the virtual layer includes the model centre, data centre, and synchronisation control centre. In Pan et al. (2021), “a case is simulated as an example to prove the feasibility and effectiveness of the synchronization mechanism” (Pan et al., 2021) using MATLAB; however, the entire framework is not implemented and no concrete technologies for implementation are mentioned. Furthermore, the framework is very complex, and it is therefore assumed that it does not meet requirements R2 (development time) and R3 (maintainability). Requirement R4 (data interoperability) is met because the model centre contains an ontology model.
The digital twin architecture for enabling digital services described in Merkle et al. (2019) consists of the three layers hardware including connectivity, twin level, and service level. The architecture has not been implemented, and no concrete technologies for an implementation are named. As a result, concrete information for the realisation of this architecture is missing, and the fulfilment of the requirements R2 (development time) and R3 (maintainability) cannot be assessed. This architecture does not contain a separate semantic layer, but “one general purpose of the service level is the generic access to the digital twin data” (Merkle et al., 2019), and therefore the requirement R4 (data interoperability) is partially met.
The digital twin reference architecture model in industry 4.0 described in Aheleroff et al. (2021) comprises the three axes digital twin layers, iterative / incremental approach, and level of integration. The digital twin layers are the physical layer, communication layer, digital layer, cyber layer, and application layer. In Aheleroff et al. (2021), the concept of Digital Twin as a Service (DTaaS) aims to solve the problem of individualisation, which is not possible with other reference architectures. The architecture is implemented in an example use case, but details of the implementation are missing, such as information on data storage. Therefore, the fulfilment of the requirements R2 (development time) and R3 (maintainability) cannot be assessed. Requirement R4 (data interoperability) is partially fulfilled, as the architecture contains no explicit semantic layer, but it is stated that “adopting scalable and autonomous capabilities in Industry 4.0, along with the digital representation of unique physical assets, create an opportunity to address some challenges concerning semantics” (Aheleroff et al., 2021).
In Talkhestani et al. (2019) the intelligent digital twin architecture is presented, because “there lacks, however, a clear, encompassing architecture covering necessary components of a Digital Twin to realize various use cases in an intelligent automation system” (Talkhestani et al., 2019). This architecture contains the components data-acquisition interface, synchronisation interface, operation data, relations, DT version management, model, ID, organisational / technical specification, feedback interface, intelligent algorithm, DT model comprehension, service, and co-simulation interface. This architecture was only partially realised, and a concrete technology for implementation was not proposed for every component, for example not for the intelligent algorithm. Due to a lack of information on the technical implementation, requirements R2 (development time) and R3 (maintainability) cannot be assessed. The requirement R4 (data interoperability) is fulfilled because the architecture contains the components model and relations.
In Guerreiro et al. (2019) the architecture for a digital twin for intra-logistics process planning is presented. The architecture consists of five layers: (i) data collection and ingestion, (ii) data storage, (iii) data processing engines, (iv) querying-analytics-visualisation, and (v) others (e.g. APIs, reverse proxy coordination, deployment). The implementation of this is not described in detail; for example, it is unclear which and how many technologies or systems are needed to realise the querying-analytics-visualisation layer. Furthermore, it is unclear whether the data processing is batch, stream, or batch and stream processing, and how many systems are required as a minimum. Therefore, the fulfilment of requirements R2 (development time) and R3 (maintainability) cannot be assessed. Furthermore, this architecture does not contain a semantic layer for data interoperability and therefore requirement R4 is not fulfilled.
In Redelinghuys et al. (2019) the six-layer digital twin architecture is presented. The architecture consists of (1) physical devices, (2) local controllers, (3) local data repositories, (4) IoT gateway, (5) cloud-based information repositories, and (6) emulation and simulation. The architecture was verified using a case study. However, concrete technologies are not proposed for all layers; for example, for layer 5 it is only mentioned that the Google Cloud Platform was used, but not which database or service was used. Furthermore, nothing is specifically described for big data analyses, but it is assumed that these can be carried out in layer 6, where simulations are also located. Because a “custom C# program was developed as the IoT Gateway” (Redelinghuys et al., 2019) for layer 4 and the “complex installation of some OPC UA drivers required for the development of the IoT Gateway” (Redelinghuys et al., 2019) is mentioned as a limiting factor, it is assumed that the requirement R2 (development time) is not met. As it is not specifically described how layer 5 (cloud-based information repositories) is implemented, requirement R3 (maintainability) cannot be assessed. The paper does not describe a semantic layer for data interoperability and, therefore, the optional requirement R4 is not fulfilled.
The digital twin architecture based on industrial internet of things technologies is described by Souza et al. (2019) and contains three components: physical twin, IIoT gateway, and internal server (digital twin). The architecture is verified using an experimental application, but a specific technology is not mentioned for every component, for example not for the internal server. This architecture contains no component for big data analyses or data processing and no semantic data layer, and therefore requirements R1 (big data analyses) and R4 (data interoperability) are not fulfilled. Requirements R2 (development time) and R3 (maintainability) cannot be assessed due to a lack of technical information on the implementation.
The generic digital twin architecture proposed by Steindl et al. (2020) contains six layers: (1) asset, (2) integration, (3) communication, (4) information, (5) functional, and (6) business. The generic digital twin architecture has been implemented as a prototype, but specific technologies for implementation are not mentioned for every layer. In this architecture, big data analyses could be implemented as services in the functional layer. Requirements R2 (development time) and R3 (maintainability) cannot be assessed due to the lack of technical implementation details. Requirement R4 (data interoperability) is fulfilled by the shared knowledge base included in this architecture.
The reference architecture described in ISO 23247-2 (ISO, 2021) “provides guidance for implementing digital twins in manufacturing” (ISO, 2021). The standard “does not prescribe specific data formats and communication protocols” (ISO, 2021), and no concrete technologies are mentioned for implementation. Furthermore, no example implementation of the architecture is provided. Since no details are given on the technical implementation, the fulfilment of requirements R2 (development time) and R3 (maintainability) cannot be assessed. Requirement R4 (data interoperability) is fulfilled because the architecture contains a component for “interfaces to other digital twins in conjunction with the interoperability” (ISO, 2021).
This literature review shows that no existing architecture is both described in sufficient detail to be applied to the DTRMF and able to fulfil all of its requirements. Thus, the following section details a proposed architecture that aims to fulfil all DTRMF requirements.
As previously mentioned, a new architecture is developed in this section. The architectures proposed by Haße et al. (2019) and Guerreiro et al. (2019) were used as a basis and adapted, as they fulfil several of the requirements of Section 4 and pursue goals similar to those of the DTRMF, having been developed specifically for the logistics sector. However, unlike in Haße et al. (2019), no lambda architecture is used; instead, only one layer and one system process the data, with the aim of keeping both development time and maintenance needs low. Furthermore, and in contrast to Guerreiro et al. (2019), the kind of processing is described, and every layer communicates only with the layer directly above or below it to reduce complexity. The only exception is the semantic layer, which communicates directly with the database. The architecture of the DTRMF and the data flow between the individual layers can be seen in Fig. 5. To keep the architecture general, no concrete technologies are specified here.

Fig. 5. Layers and data flow of the DTRMF architecture.
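The principle that each layer only talks to its neighbour, with the semantic layer as the one exception, can be sketched in Python as follows. This is an illustration under assumptions rather than the implementation of the DTRMF: the layer order (ingestion, processing, storage, visualisation) follows the order in which the layers are listed in the abstract, the raw read format `uii;gate_id;timestamp` and the `GATE_TO_AREA` master data are invented for the example, and the database is replaced by an in-memory list.

```python
from collections import Counter

# Hypothetical master data: which gate belongs to which area.
GATE_TO_AREA = {"G1": "goods receipt", "G2": "body shop"}


def ingestion_layer(raw_line: str) -> dict:
    """Parse one raw read 'uii;gate_id;timestamp' (assumed format)."""
    uii, gate_id, timestamp = raw_line.split(";")
    return {"uii": uii, "gate_id": gate_id, "timestamp": timestamp}


def processing_layer(record: dict) -> dict:
    """Enrich a record; the single processing system of the sketch."""
    area = GATE_TO_AREA.get(record["gate_id"], "unknown")
    return {**record, "area": area}


class StorageLayer:
    """In-memory stand-in for the database."""

    def __init__(self) -> None:
        self._rows: list[dict] = []

    def write(self, row: dict) -> None:
        self._rows.append(row)

    def read_all(self) -> list[dict]:
        return list(self._rows)


def visualisation_layer(storage: StorageLayer) -> dict:
    """Reads only from storage: number of reads per area."""
    return dict(Counter(row["area"] for row in storage.read_all()))


def semantic_layer(storage: StorageLayer) -> list[dict]:
    """Optional layer with direct database access; maps rows to a shared vocabulary."""
    return [{"identifier": r["uii"], "location": r["area"]} for r in storage.read_all()]


def run(raw_lines: list[str], storage: StorageLayer) -> None:
    """Data flows strictly ingestion -> processing -> storage."""
    for line in raw_lines:
        storage.write(processing_layer(ingestion_layer(line)))
```

Because the visualisation never calls the ingestion or processing functions directly, each layer can be replaced without touching the others, which is the maintainability argument made for the architecture.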
The
As soon as the data are available, they are processed in the
The
The
The
Environment of the digital twin
To develop and deploy the DTRMF in the optimal environment, the advantages and disadvantages of a cloud environment and on-premises servers are compared (cf. Table 2). One advantage of the cloud over on-premises servers is greater simplicity and better scalability, because with an on-premises server, for example, hardware would first have to be installed for a storage expansion (Apostu et al., 2013; Morefield Communications, 2022). Better scalability also leads to higher flexibility (Morefield Communications, 2022). Another advantage is that the purchase costs of the hardware are eliminated, and often only the resources actually used must be paid for, as is the case with Azure Databricks, for example (Apostu et al., 2013; Databricks, n.d.; Morefield Communications, 2022). Furthermore, creating backups is easier and there is no need for in-house IT staff to maintain the hardware and infrastructure. Cloud-based offerings also enable access to data and services from anywhere (Apostu et al., 2013; Morefield Communications, 2022). However, accessing the cloud via the internet can also be a disadvantage, because it requires a stable and fast internet connection, and a failure of the internet leads to a failure of the application running in the cloud from the user’s point of view (Morefield Communications, 2022). Moreover, fast scalability can also be costly if costs are not adequately monitored. Another disadvantage concerns security and data protection, as data are managed by a cloud provider and using a cloud can expose an organisation to more attacks (Apostu et al., 2013; Morefield Communications, 2022). With on-premises servers, on the other hand, security and data protection are easier to ensure as the servers are only accessible within a network, and therefore no internet connection is required for access. This can lead to cost savings if a slower and therefore cheaper internet connection is used. Another advantage can be the sovereignty over the hardware and data, whereby customised changes can be made, and thus flexibility is achieved. A disadvantage of on-premises servers, however, is that the servers themselves have to be maintained, and thus additional staff is required and costs are incurred. In addition, the required hardware has to be ordered and purchased, and therefore scalability is also worse than with the cloud (Morefield Communications, 2022).
Table 2. Advantages and disadvantages of cloud and on-premises
For the DTRMF, the advantages of the cloud environment clearly outweigh the disadvantages, and the cloud meets the requirements of easy maintenance (R3) and scalability (R8) of the digital twin. In addition, a high-availability environment is not required as the system is not production-critical (requirement R9). For these reasons, the cloud is considered the optimal environment for the DTRMF.
The DTRMF in this paper is based on the database of the logistics backend system, which is an Oracle database. The Qlik Replicate software should be used for the data connection because it supports many different source and target endpoints (QlikTech, n.d.–a, n.d.–b), “minimizes impact on [...] source production operations” (QlikTech, 2020), and is already used in the company, so that requirement R10 is fulfilled.
Data storage layer
To find a suitable database, the four databases PostgreSQL, TimescaleDB, InfluxDB, and MongoDB were compared (see Table 3). The comparison showed that TimescaleDB meets the requirements best: it has Time Series DBMS (Database Management System) as its primary database model (DB-Engines, n.d.–d), is open source (DB-Engines, n.d.–d), is horizontally scalable through the chunking method (Freedman & Nordström, 2019), uses the query language SQL (Structured Query Language), which means that no specific query language has to be learned (TimescaleDocs, n.d.–a), and provides the features continuous aggregates and real-time aggregation (Klemm & Freedman, 2020) for real-time applications. Furthermore, benchmarks for time-series data report that TimescaleDB achieves “20x higher inserts, 2000x faster deletes, 1.2x-14,000x faster queries” (Kiefer, 2017) than PostgreSQL and “higher insert performance, up to 53x faster queries” (Kiefer et al., 2020) than MongoDB, and that “for workloads with high cardinality, TimescaleDB has 3.5x the insert performance as InfluxDB” (Freedman & Sewrathan, 2020). The other databases do not optimally fulfil the requirements for the DTRMF: PostgreSQL is not a Time Series DBMS (DB-Engines, n.d.–c); InfluxDB does not support horizontal scalability in the open source edition (Influxdata, n.d.–b) and does not support ACID (Atomicity, Consistency, Isolation, Durability) transactions (GeeksforGeeks, 2020); and MongoDB has Time Series DBMS only as a secondary database model (DB-Engines, n.d.–b) and requires the specific MongoDB query language (Jayaram, 2020).
Comparison of databases
1 In addition to the Open Source Edition, there are also the InfluxDB Cloud and InfluxDB Enterprise editions (Influxdata, n.d.–b).
2 The high availability feature, which belongs to replication in this paper, is only available in the InfluxDB Cloud and InfluxDB Enterprise editions (Influxdata, n.d.–b).
3 TimescaleDB, unlike PostgreSQL for example, relies on chunking instead of sharding. The difference is that chunks are created automatically, whereas shards are typically created manually. Furthermore, chunking can also be used for scale-up and offers advantages such as elasticity and partitioning flexibility (Freedman & Nordström, 2019).
4 Horizontal scalability (clustering) is only available in the Cloud and Enterprise editions, but not in the Open Source edition (Influxdata, n.d.–b).
5 On the MongoDB website it is described that “single-document updates have always been atomic” (MongoDB, n.d.–d). However, if related data are stored in multiple documents and these are modified, MongoDB has so-called multi-document ACID transactions, which have the ACID properties and thus ensure data integrity (MongoDB, n.d.–d).
6 Since in MongoDB the write and read operations are performed via the primary replica set, MongoDB is consistent. It is also possible to read from secondary replicas, in which case the data are eventually consistent (MongoDB, n.d.–c).
7 InfluxQL is a SQL-like query language (DB-Engines, n.d.–a; Influxdata, n.d.–a).
8 “Read-only SQL queries via the MongoDB Connector for BI” (DB-Engines, n.d.–b).
The Python programming language and the analysis platform Databricks, which is based on Apache Spark, are used for processing (Etaati, 2019). Apache Spark is also listed by Guerreiro et al. (2019) and Haße et al. (2019) as a technology for data analysis and for the data processing engine. In addition, Databricks supports both stream and batch processing and is scalable because “it is possible to resize automatically and autoscale the size of the cluster” (Etaati, 2019). Moreover, Databricks is already being used in the company. For these reasons, Databricks is ideally suited for data processing for the DTRMF.
Visualisation layer
To keep both development time and complexity low, Databricks dashboards are initially used for visualisation. However, due to the loose coupling, this technology could also be replaced at a later date.
Semantic layer
The semantic layer could be realised by using the Resource Description Framework (RDF) or Web Ontology Language (OWL), for example, as described by Steindl et al. (2020) and Haße et al. (2019). The GS1 Core Business Vocabulary (CBV) 2.0 standard could also be considered in the implementation of a semantic layer (GS1, 2022). Since the semantic layer is only an optional layer, it will not be considered further in this paper. However, it would be conceivable to automatically generate the semantic layer from the database of the digital twin.
A case study for the implementation of the DTRMF in the body shop of an automotive manufacturer
For the application and verification of the new architecture and the proposed technologies described in Sections 6 and 7, these are used to implement the DTRMF in the body shop of an automotive manufacturer. In this case study the analysis of the expiration date for adhesive parts after a production break is implemented by using the digital twin. The details of the case study are described in Table 4.
Case Study: Analysis of the expiration date for adhesive parts after a production break
The big data platform eXtollo from Mercedes-Benz Group AG (Mercedes-Benz Group AG, n.d.), which is based on Microsoft Azure, is used as the environment for the digital twin. The platform is suitable for use cases in the field of analytics and artificial intelligence and provides Azure Databricks, for example. One advantage of this platform is that the data are stored in encrypted form (Mercedes-Benz Group AG, n.d.). Therefore, this platform is used for the development of the DTRMF.
Data ingestion layer
First, the data required for the case study of the DTRMF must be determined, and the systems, tables, and columns where these data are stored must be identified. For the case study described here, only data from the logistics backend system are required, which are stored in an Oracle database (SAP AmSupply). Table 5 shows the four required tables, their table names and data types. Master data rarely change, whereas transaction data often change (SAP, n.d.). Since the real-time data connection via Qlik Replicate did not yet exist for organisational reasons at the time of writing (October 2022), CSV (comma-separated values) exports of these tables were created manually from the logistics backend system and uploaded manually to Databricks.
Required tables from the logistics backend system
The TimescaleDB database runs in a Docker container on an Azure VM, and pgAdmin is used as the administration interface for the database. Docker Compose is used to run and configure the two containers, TimescaleDB and pgAdmin. To store the data in the database permanently, a persistent disk is attached to the virtual machine. The Docker Compose file (see Listing 1) specifies the volume where the data should be stored.
- "5432:5432"
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=password
- POSTGRES_DB=DigitalTwinDB
- timescaledb-docker
- PGADMIN_DEFAULT_EMAIL=email
- PGADMIN_DEFAULT_PASSWORD=password
Listing 1 Docker compose YAML file
Before the data can be stored in the database, a data model must be developed, and then the tables must be created. Figure 6 shows the entity relationship diagram of the DTRMF for the case study described here using the Unified Modelling Language (UML). This data model consists of the entities material, storage quant and handling unit.

Entity relationship diagram of the digital twin
The processing is programmed and executed in Python in Azure Databricks. Since no big data analyses are required for the implementation of this case study, only pre-processing is described here. In pre-processing, the data are prepared so that they can be used later in big data analyses and visualisation. The pre-processing can be seen in the flow chart in Fig. 7. The flow chart shows the general pre-processing procedure for the tables of the digital twin, but not all steps have to be performed for all tables. Before the pre-processing begins, a current timestamp is generated and stored in a variable, which is needed for versioning of the data.
The first step is to read the required tables from the manually uploaded CSV files and store them in a dataframe. When reading the tables, columns that are not needed are filtered out directly to reduce the amount of data.
In the second step, the format and the data type of the columns that contain quantity data are changed, because the data come from a German-language system in which the dot is not the decimal separator but only the thousands separator, and the comma is the decimal separator. Therefore, all dots are removed by means of a regex, then all commas are replaced by a dot, and finally the data type string is converted into the data type double. For example, such a column contains the value “6.628,000” before and “6628.000” after the pre-processing.
In the third step, the data type of all other columns that do not contain text is converted into the correct format. For example, columns with a date are cast into the data type date, timestamps are converted into the data type timestamp, and numbers are converted into the data type int.
In the fourth step, if necessary, new columns are added which contain calculated values or transformed data. For example, the spaces are removed from the material number and the result is added as a new column, which later simplifies the input for users when filtering by material number. An example of a material number with blanks is “A 223 682 79 02”; the same number without blanks is “A2236827902”.
If there are several source tables, and thus dataframes, these are linked together in the fifth step. Here it is important to ensure that the correct JOIN type, e.g. “LEFT JOIN”, is used.
As the columns carry the technical names of the tables in the logistics backend, they are then renamed to more readable English names. For example, the MATNR column is renamed to material_number.
For the versioning and history of the digital twin, the timestamp generated at the beginning is now added to each row.
Finally, the dataframes are saved to the TimescaleDB database of the digital twin. Here, the ‘append’ mode is used so that the existing data are not overwritten.
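The pre-processing steps described above can be illustrated with a minimal, standard-library Python sketch. It is not the Databricks (Spark) code used in the case study: plain dictionaries stand in for dataframes, the join and the database write are omitted, and the column names MENGE and VFDAT, the semicolon delimiter, and the DD.MM.YYYY date format are assumptions made only for illustration. Only the MATNR to material_number renaming and the example values are taken from the text.

```python
import csv
import io
import re
from datetime import datetime, timezone

# Hypothetical mapping of technical to readable column names.
# Only MATNR -> material_number comes from the text; the rest is assumed.
RENAME = {"MATNR": "material_number", "MENGE": "quantity", "VFDAT": "expiration_date"}


def parse_german_number(text: str) -> float:
    """Convert a German-formatted number such as '6.628,000' to a float."""
    without_thousands = re.sub(r"\.", "", text)  # dots only group thousands
    return float(without_thousands.replace(",", "."))  # comma is the decimal separator


def preprocess(csv_text: str, snapshot_ts: datetime) -> list[dict]:
    """Apply the described steps to one CSV export and return the cleaned rows."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        # Step 1: keep only the columns that are needed.
        row = {col: raw[col] for col in RENAME}
        # Step 2: convert the German quantity string to a floating-point number.
        row["MENGE"] = parse_german_number(row["MENGE"])
        # Step 3: cast the remaining non-text columns (here: one date column).
        row["VFDAT"] = datetime.strptime(row["VFDAT"], "%d.%m.%Y").date()
        # Step 4: add a derived column without blanks to simplify filtering.
        row["material_number_compact"] = row["MATNR"].replace(" ", "")
        # Renaming: replace technical column names with readable ones.
        row = {RENAME.get(col, col): value for col, value in row.items()}
        # Versioning: stamp every row with the timestamp taken before the run.
        row["snapshot_ts"] = snapshot_ts
        rows.append(row)
    return rows


sample = "MATNR;MENGE;VFDAT;UNUSED\nA 223 682 79 02;6.628,000;31.12.2022;x\n"
cleaned = preprocess(sample, datetime.now(timezone.utc))
print(cleaned[0]["quantity"], cleaned[0]["material_number_compact"])
```

In a Spark-based implementation the same logic would be expressed with dataframe column operations, and the cleaned rows would be appended to the database rather than returned as a list.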

Data pre-processing.
To execute the pre-processing, a job is created that runs the code every minute to keep the DTRMF updated in real-time.
To make the data from the DTRMF database available to users, they must be visualised. For this purpose, Databricks Dashboards are used for this case study.
For the visualisation of the case study, an SQL query is sent to the TimescaleDB database of the digital twin. The query links the tables storage_quants, material and handling_units and, based on the timestamp, selects only the most current data. When considering the expiration date, the remaining time of each material must be taken into account, because once it is exceeded, the material may no longer be used by logistics. The remaining time is stored in the system for each material number. For this reason, a new column is first added that contains the expiration date without the remaining time, which is obtained by subtracting the remaining time from the expiration date.
Subsequently, it is calculated whether the end of the production break precedes (or is
equal to) the expiration date without remaining time, i.e., whether the material can still
be used after the end of the production break or not. If it can no longer be used, an
“X” is displayed in a new column. To be able to show how many days
after the production break the material can be stored and used, the expiration date
without remaining time is subtracted from the date of the end of the production break.
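The expiration check can be sketched in a few lines of Python. The sketch assumes that the remaining time is stored as a whole number of days and that the "expiration date without remaining time" is the expiration date minus that number of days. The function name and the example dates are hypothetical. The text subtracts the expiration date without remaining time from the end of the production break; the sketch flips the sign so that positive values mean days of use left after the break, which is an assumption about the intended display.

```python
from datetime import date, timedelta


def usability_after_break(expiration: date, remaining_days: int, break_end: date):
    """Return (usable_until, expired_flag, days_after_break) for one storage quant."""
    # Assumption: the remaining time is a number of days that is subtracted
    # from the expiration date.
    usable_until = expiration - timedelta(days=remaining_days)
    # Usable if the break ends on or before that date; otherwise flag with "X".
    expired_flag = "" if break_end <= usable_until else "X"
    # Assumed sign convention: positive = days of use left after the break.
    days_after_break = (usable_until - break_end).days
    return usable_until, expired_flag, days_after_break


print(usability_after_break(date(2023, 1, 20), 10, date(2023, 1, 8)))
```

With these made-up values, the material may be used until 10 January 2023, so it is not flagged and remains usable for two days after the break ends.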
To also visualise the last transport orders to the stations (target storage location), an SQL query is executed on the database table transport_order.
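Selecting only the most current data by timestamp, as done in the dashboard queries, can be sketched as follows. An in-memory SQLite database stands in for TimescaleDB so that the example is self-contained. The table names storage_quants and material come from the text, whereas all column names and values are hypothetical and the handling_units table is left out for brevity.

```python
import sqlite3

# SQLite stands in for TimescaleDB; column names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material (material_number TEXT, remaining_days INTEGER, snapshot_ts TEXT);
CREATE TABLE storage_quants (quant_id INTEGER, material_number TEXT,
                             expiration_date TEXT, snapshot_ts TEXT);
INSERT INTO material VALUES ('A2236827902', 10, '2022-10-01T10:00:00'),
                            ('A2236827902', 10, '2022-10-01T10:01:00');
INSERT INTO storage_quants VALUES (1, 'A2236827902', '2023-01-20', '2022-10-01T10:00:00'),
                                  (1, 'A2236827902', '2023-01-25', '2022-10-01T10:01:00');
""")

# Each pre-processing run appends a full snapshot, so the newest snapshot of
# each table is the set of rows carrying the maximum timestamp.
LATEST_QUERY = """
SELECT q.quant_id, q.material_number, q.expiration_date, m.remaining_days
FROM storage_quants AS q
LEFT JOIN material AS m
       ON m.material_number = q.material_number
      AND m.snapshot_ts = (SELECT MAX(snapshot_ts) FROM material)
WHERE q.snapshot_ts = (SELECT MAX(snapshot_ts) FROM storage_quants)
"""

latest_rows = conn.execute(LATEST_QUERY).fetchall()
print(latest_rows)
```

Because every run appends a complete snapshot instead of overwriting rows, older versions stay available for historical analyses, while the dashboard only reads the newest one.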
Finally, the columns are renamed so that they are understandable for the users. The date of the end of the production break can be entered via a text widget. Furthermore, one filter controls whether all storage quants or materials are displayed or only those that will have expired by the end of the production break, and a second filter allows a specific storage location to be entered. The resulting dashboard can be seen in Fig. 8.

Dashboard of the case study.
This section describes how the qualitative requirements for the DTRMF were implemented and
fulfilled.
All qualitative requirements for the DTRMF have been met, with the exception of the real-time data connection, which is not yet available due to organisational processes. Therefore, the architecture and the technologies chosen have proven to be well suited.
In addition, to demonstrate the advantages of applying the DTRMF, we compared the current state of the logistics backend system with its expected future state enabled by the DTRMF. The comparison is reported in Table 6. It shows that the DTRMF provides considerable advantages: production supply and quality can be ensured more easily, time is saved in the analysis of expired material, and greater transparency is provided in all areas of the RFID-enabled material flow.
Comparison of logistics backend system and DTRMF
In this paper, we investigated the potential of RFID data in automotive industry logistics to automate processes and optimise material flow by means of a DTRMF, that is, a real-time digital twin of the RFID-enabled material flow. We started by listing the qualitative requirements that must be met by a big data and digital twin architecture in automotive industry logistics. We then performed an extensive literature review and found that no architecture available in the literature both optimally fulfils the requirements we listed and is detailed enough to support a real application. Therefore, we proposed a new architecture that consists of different layers, namely the data ingestion layer, data processing and analyses layer, data storage layer, visualisation layer, and the optional semantic layer. The suitability of this architecture was demonstrated by the implementation of a practical case study as an answer
to
The next step is to realise and demonstrate the real-time data connection and to implement further case studies with big data analyses to further verify the architecture and the suitability of the proposed technologies. Once the real-time data connection exists, a detailed field test and user evaluation should be conducted to verify if all requirements for the DTRMF have been met or, conversely, if another iteration should be conducted in the design cycle. Furthermore, a semantic layer for the DTRMF could be investigated and implemented to show how data interoperability can be realised and make the data accessible to other applications. Additionally, a versioning method could be developed to update master data only in the database when changes were made to reduce the amount of data that must be stored. Furthermore, a method to compress and delete older transaction data could be developed to save storage space and improve the performance of the database by reducing the amount of data. In addition, an economic evaluation of the DTRMF in real-time could be carried out to better understand the expected costs and revenues of its full implementation.
Acknowledgments
We thank all those who made the success of this paper possible. In particular, we thank Mercedes-Benz AG for funding the Ph.D. position of one of the authors.
