Research on the railway multi-source homonymous geographical entity matching algorithm based on dynamic time warping

Abstract

This study aims to explore an efficient technique for matching multisource homonymous geographical entities in railways to address the identification issues of homonymous geographical entities. Focusing on railway line vector spatial data, this research investigates the matching problem of multisource homonymous geographical entities. Building on statistical feature matching of attribute data, a curve similarity calculation method based on the DTW algorithm is designed to achieve better local elastic matching, overcoming the limitations of the Fréchet algorithm. The empirical study utilizes railway line layer data from two data sources within Beijing’s jurisdiction, fusing 6237 segment lines from source 2 with 105 long lines from source 1. The structural comparison between the two data sources is conducted through statistical methods, applying cosine similarity and the maximum similarity value of TF-IDF for text similarity calculation. Finally, Python is used to implement the DTW algorithm for curve similarity. The experimental results show an average DTW distance of 3.92, a standard deviation of 4.63, and a mode of 0.005. Similarity measurement results indicate that 95.53% of records are within the predetermined threshold, demonstrating the effectiveness and applicability of the method. The findings significantly enhance the accuracy of railway data matching, promoting the informatization of the railway industry, and hold substantial significance for improving railway operational efficiency and system performance.

Keywords

Railway line matching, geographical entities, dynamic time warping, vector data, Python

1. Introduction

The operation of railway networks serves as a critical indicator for data analytics and economic development, and has been a key component of geographical economic studies. Following the release of the “Digital Transportation Development Outline” by the Ministry of Transport in 2019 and the “Digital Railway Plan” by the China National Railway Group Co., Ltd. in 2023, the move towards comprehensive digitalization and data sharing of railway services has become a central force in driving the construction of modern infrastructure. At present, extensive spatial geographic data has been developed across all facets of railway operations, including marketing, production, and safety processes. This data encompasses a wide range of elements such as locomotives, rolling stock, stations, tracks, bridges, tunnels, communication and signal equipment, as well as power and water supply facilities [1].

The application of GIS technology in high-speed rail projects encompasses three-dimensional design [2], track cable-stayed bridges [3], deep excavation with high-speed rail coordinated early warning systems [4], management of rolling stock depot facilities and operations [5], and studies on the integrated application platform for design outcomes [6, 7]. This illustrates the pivotal role of GIS throughout multiple phases of high-speed rail development, enhancing both project efficiency and safety. In the realm of GIS system development and management, the focus includes spatial hierarchical representation techniques for railway GIS [8] and the integrated construction and maintenance management of the Beijing-Zhangjiakou High-Speed Railway [9]. These studies underscore GIS technology’s crucial impact on elevating railway project management efficiency, bolstering infrastructure oversight, and refining operation and maintenance practices. Analytical and monitoring approaches cover the planning and design of railway routes [10] and the creation of high-speed rail noise maps [11]. These investigations highlight the broad application of remote sensing and GIS technologies in railway construction, operational management, and environmental impact assessments, thereby improving the planning, safety, and environmental monitoring capabilities of railway systems.

However, due to the differences in application requirements and the relative independence of different systems, the issue of semantic inconsistency in multi-source vector data has become increasingly prominent in the railway data integration process. As pointed out in the “14th Five-Year Plan for Digital Transportation Development,” the existing information still lacks in depth and breadth of integration. Research on the fusion of map routes and related data mainly focuses on using different data processing methods to integrate map information from various sources. Representative studies include map matching techniques [12] and those emphasizing the use of spatial indices and optimized data structures [13]. In terms of curve similarity measurement, methods such as Euclidean distance [14], Dynamic Time Warping [15], Pearson correlation coefficient [16], Spearman’s rank correlation coefficient [17], curve fitting [18] and Fréchet distance [19] can also be applied to the comparison of geometric features of routes. In addition, methods based on deep learning, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [20], can be used to extract features from curve data and perform similarity comparisons. Among these, discrete Fréchet distance and Dynamic Time Warping are two commonly used methods that can effectively evaluate and quantify the similarity between two routes. These studies not only improve the speed and efficiency of data processing but also enhance the accuracy of map data fusion.

This paper investigates the matching of multisource homonymous geographical entities, taking railway line vector spatial data as the research subject. First, the concept of multisource heterogeneous spatial data for railways is defined, and a methodology for matching multisource homonymous geographical entities in the railway context is developed. Next, attribute features are matched using Term Frequency-Inverse Document Frequency (TF-IDF), and a method for geometric feature matching based on Dynamic Time Warping (DTW) is proposed. An empirical investigation is then conducted using railway line maps from two distinct sources. By effectively integrating railway line data from diverse origins, the approach aims to minimize redundant data storage and address the semantic heterogeneity inherent in multisource railway vector data, thereby improving data utilization and efficiency.
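The attribute-matching step can be illustrated with a small sketch. It is not the authors' implementation: it assumes whitespace tokenization of route names and a smoothed IDF weight, and the function names `tfidf_vectors`, `cosine`, and `best_name_match` are hypothetical. It shows how a name from one source could be assigned to the candidate name with the maximum TF-IDF cosine similarity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (smoothed TF-IDF, an assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_name_match(query, candidates):
    """Return the candidate name with the highest cosine similarity and its score."""
    candidates = list(candidates)
    vecs = tfidf_vectors([query] + candidates)
    scores = [cosine(vecs[0], vec) for vec in vecs[1:]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

For example, `best_name_match("Beijing-Tianjin Intercity Railway", ["Beijing-Guangzhou High-Speed Railway", "Beijing-Tianjin Intercity", "Huangliang"])` selects "Beijing-Tianjin Intercity", because it shares two of the three query tokens.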

2. Materials and methods

2.1 General data

To validate the effectiveness of the algorithm, data fusion is performed on railway and subway information from two sources. Source 1 is from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, and Source 2 is from OpenStreetMap (OSM). By applying geographic analysis and clipping techniques, the recorded railway lines in Beijing are extracted, as shown in Fig. 1.

The first source comprises a total of 105 entries, with relatively longer route lengths. The second source consists of 6,237 records, which are more detailed, and the recorded lengths of the routes are comparatively shorter.

2.2 Data structure

The table structures storing the two layers of data are analyzed below, with an explanation of each field. In the dataset for Source 1, the “Shape” field is consistently “Polyline” across all records. FID serves as an identifier to distinguish each row of data. Shape indicates the geometric type of the data. OBJECTID is another identifier that uniquely identifies each object in the database. RN is a reference number. NAME refers to the name of the route or object. TYPE is a category field, indicating the classification or type of the object. Shape_Length represents the length of the geometric shape. See Tables 1 and 2 for details.

Table 1
Dataset descriptors for source 1

Field name	Description	Data type
FID	Serial number	Object ID
Shape	Geometry	Geometry
OBJECTID	Identifier	Long integer
RN	Code	Text
NAME	Route name	Text
TYPE	Category	Text
Shape_Length	Length	Double

Table 2

Example data for source 1

FID	OBJECTID	RN	NAME	TYPE	Shape_Length
1	3887	3002	Beijing-Guangzhou High-Speed Railway	High-Speed Rail	2.737329
2	3956	3010	Beijing-Tianjin Intercity	High-Speed Rail	1.136578
…	…	…	…	…	…
105	3979	343	Huangliang	Electric	0.003064

Figure 1.

Beijing rail map (source 1: left, source 2: right).

In the dataset for Source 2, the “Shape” field is consistently “Polyline” for all records. Similarly, the “code” field is uniformly “6101,” and the “fclass” for all entries is designated as “rail”. ObjectID serves as a unique identifier. Shape denotes the geometric type of the data. osm_id is the identifier used by OpenStreetMap to recognize specific geographical features or objects. code is used for categorization or for identifying specific features. fclass represents the characteristic of the route. name refers to the name of the route or object. layer indicates the object’s level within the Geographic Information System. bridge indicates whether the structure is a bridge, and tunnel indicates whether it is a tunnel. Shape_Length represents the length of the line object. See Tables 3 and 4 for details.

Table 3

Dataset descriptors for source 2

Field name	Description	Data type
ObjectID	Identifier	Object ID
Shape	Geometry	Geometry
osm_id	OSM identifier	Text
Code	Type code	Short integer
Fclass	Category	Text
Name	Route name	Text
Layer	Layer	Double
Bridge	Bridge	Text
Tunnel	Tunnel	Text
Shape_length	Length	Double

Table 4

Example data for source 2

ObjectID	osm_id	Name	Layer	Bridge	Tunnel	Shape_Length
1	9853311	Beijing-Harbin Line	1	T	F	0.001533
5	24834552	Beijing-Kowloon Line	0	F	F	0.007579
…	…	…	…	…	…	…
6237	3979	Beijing-Tianjin Intercity Railway	0	T	F	0.000518

2.3 Methods

Transport vector data digitizes transportation information into vector formats, detailing the geometric attributes of transportation infrastructures such as roads, which include their width, length, position, and orientation. The classification of transport vector data varies, with a common approach being to categorize it based on the type of content it contains, typically including road vector data, vehicle vector data, and transportation facility vector data. Moreover, based on the source of data and the domain of application, transport vector data can be divided into categories like aviation traffic vector data, railway traffic vector data, and waterway traffic vector data.

Railway vector data encompasses a comprehensive set of information on infrastructure elements such as tracks, stations, bridges, tunnels, culverts, and embankments. This dataset constitutes approximately 50% of all spatial data in the railway sector, making it an essential information resource for railway spatial databases. Due to the variety in equipment and methodologies used for processing and recording the original data across different subsystems, the same traffic vector might be documented in different ways, leading to the creation of multi-source homonymous geographical entities. These entities are defined as geographical entities that possess the same name or identifier in different data sources, as shown in Fig. 2.

Figure 2.

Railway geographic vector classification.

Through the employment of this hierarchical approach, railway authorities can effectively navigate the complexities inherent in managing spatial data, thereby enhancing operational efficiencies and elevating safety standards within the railway transportation ecosystem.

(1) Identification and Comparison of Point Entities:

Given two points $P_{1}(x_{1},y_{1})$ and $P_{2}(x_{2},y_{2})$, the distance between the two points is calculated as follows:

$\displaystyle d=\sqrt{(x_{2}-x_{1})^{2}+(y_{2}-y_{1})^{2}}$ (1)

(2) Identification and Comparison of Line Entities: Line entities can be represented as a collection of points. For two lines L1 and L2, their point sets are compared for overlap or proximity. The minimum distance between line segments or a comparison of their direction and length can be calculated to determine if they represent the same entity.

For every pair of points ( $x_{ai},y_{ai}$ ) on curve A and ( $x_{bj},y_{bj}$ ) on curve B, compute their Euclidean distance:

$\displaystyle d_{i,j}=\sqrt{(x_{ai}-x_{bj})^{2}+(y_{ai}-y_{bj})^{2}}$ (2)

Construct a distance matrix D, where D[i][j] $=d_{i,j}$. Use dynamic programming to find the path from D[0][0] to D[n-1][m-1] (where n and m are the numbers of points on curves A and B, respectively) with the smallest maximum distance value along that path.
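The min-max path search described here corresponds to the discrete Fréchet distance. The following is a minimal sketch of that dynamic program, assuming each curve is a list of (x, y) tuples and that a path may step to the next point on A, on B, or on both; the function name `discrete_frechet` is illustrative.

```python
import math


def discrete_frechet(curve_a, curve_b):
    """Smallest achievable maximum point distance over all monotone paths."""
    n, m = len(curve_a), len(curve_b)
    if n == 0 or m == 0:
        raise ValueError("both curves need at least one point")
    # best[i][j]: smallest maximum distance on any path from (0, 0) to (i, j).
    best = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(curve_a[i], curve_b[j])
            if i == 0 and j == 0:
                best[i][j] = d
            elif i == 0:
                best[i][j] = max(best[0][j - 1], d)
            elif j == 0:
                best[i][j] = max(best[i - 1][0], d)
            else:
                reach = min(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1])
                best[i][j] = max(reach, d)
    return best[n - 1][m - 1]
```

Because the result is a single maximum, one poorly aligned vertex determines the whole score, which is the sensitivity to local perturbations that motivates the use of DTW later in this section.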

(3) Identification and Comparison of Area Entities:

Area entities can be represented as closed point sets or polygons. Suppose there are two areas A1 and A2, defined by point sets {( $x_{1i},y_{1i}$ )} and {( $x_{2i},y_{2i}$ )}. The area $S$ of a polygon with vertices ( $x_{1},y_{1}$ ), …, ( $x_{n},y_{n}$ ) is calculated as follows:

$\displaystyle S=\frac{1}{2}\left|\sum\limits_{i=1}^{n-1}(x_{i}y_{i+1}-x_{i+1}y_{i})+(x_{n}y_{1}-x_{1}y_{n})\right|$ (3)

If the areas of two regions or their boundaries are very close, they may represent the same entity.
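As an illustration of the area comparison, the sketch below evaluates Eq. (3) (the shoelace formula) and compares two polygons by relative area difference. The tolerance `rel_tol` and the function names are assumptions made for the example; the paper does not specify a threshold.

```python
def polygon_area(vertices):
    """Area of a simple polygon given as a list of (x, y) vertices (Eq. 3)."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x_i, y_i = vertices[i]
        # The modulo wraps the last vertex back to the first one.
        x_next, y_next = vertices[(i + 1) % n]
        twice_area += x_i * y_next - x_next * y_i
    return abs(twice_area) / 2.0


def areas_are_close(poly_1, poly_2, rel_tol=0.05):
    """True if the two areas differ by at most rel_tol of the larger area."""
    area_1, area_2 = polygon_area(poly_1), polygon_area(poly_2)
    larger = max(area_1, area_2)
    return larger == 0.0 or abs(area_1 - area_2) / larger <= rel_tol
```

Similar areas alone do not establish that two polygons are the same entity, so in practice a check like this would be combined with a position or boundary comparison.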

Railway route curve similarity calculation based on Dynamic Time Warping measures the similarity or dissimilarity between two railway routes. Curve similarity measures of this kind are commonly applied to curve data such as time series, trajectories, graphs, and sound signals. Here the measurement is used to compare the characteristics and geometric shapes of different railway routes in order to determine their similarities or differences. The evaluation of railway route curve similarity encompasses considerations of geometric shapes, route features, load characteristics, safety, and reliability, as summarized in Table 5.

Table 5

Dynamic time warping-based railway route curve similarity calculation

Aspect	Content
Geometric shape	Compare the similarity of geometric features such as curves, slopes, and curvature of railway routes.
Route features	Consider the similarity of features such as tracks, signals, and intersections along railway routes.
Load characteristics	Analyse the operational characteristics of different trains and cargo on railway routes to determine their suitability for specific transportation requirements.
Safety and reliability	Assess the safety and reliability of railway routes to ensure the safe and efficient transportation of passengers and cargo.

Common methods for curve similarity calculation include Euclidean distance, correlation coefficients (such as Pearson or Spearman rank correlation coefficients), curve fitting, dynamic time warping (DTW), Fréchet distance, edit distance, and deep learning-based approaches. Among these methods, Fréchet distance is widely used due to its advantages:

Geometric-based: Fréchet distance considers the geometric properties between curves, rather than just the distances between points.

Invariance to translation: Fréchet distance is unchanged when the same translation is applied to both curves.

Applicability to multi-dimensional curves: Fréchet distance is not only applicable to one-dimensional curves (e.g., time series) but can also be extended to multi-dimensional curves.

However, Fréchet distance has some drawbacks, especially concerning local similarity evaluation:

High computational complexity: Algorithms for computing Fréchet distance typically have high computational complexity, especially for long curves or high-dimensional data.

Sensitivity to noise: Fréchet distance is sensitive to noise and local perturbations.

Lack of flexibility in handling time variations: Fréchet distance often assumes consistent motion speeds between two curves.

Due to the extensive nature of railway lines, there is an issue with the inconsistency of curve segmentation. As illustrated in Fig. 3, the segmentation standards differ across various data sources, thus necessitating a consideration of local similarity. Consequently, this paper employs the principles of Dynamic Time Warping (DTW) algorithm for the computation of similarity in railway line routes.

Figure 3.

Divergent segmentations of a railway line.

Compared with other methods, the advantages of DTW in measuring local similarity lie primarily in its capability for elastic matching in time or spatial sequences, its strong ability to handle nonlinear relationships, and its superior fault tolerance and robustness. The foundational principle of DTW applied to railway line comparison is as follows:

Consider two sequences to be matched, P and Q, where each sequence may contain different numbers of elements.

Let $P=p_{1},p_{2},\ldots,p_{m}$, where each $p_{i}$ is an element of sequence P, and $Q=q_{1},q_{2},\ldots,q_{n}$, where each $q_{j}$ is an element of sequence Q. Construct an $m\times n$ matrix D, where each element $d(i,j)$ represents the distance between the ith element of P and the jth element of Q. The distance can be measured as follows:

$\displaystyle d(i,j)=||p_{i}-q_{j}||$ (4)

In constructing the cost matrix C, each element $C(i,j)$ represents the minimal cumulative cost of a matching path that ends by matching the ith element of P with the jth element of Q. $C(i,j)$ is defined by the following recursive formula, which takes the minimum of three adjacent elements:

$\displaystyle C(i,j)=d(i,j)+\min\{C(i-1,j-1),C(i-1,j),C(i,j-1)\}$ (5)

Here $C(i-1,j-1)$, $C(i-1,j)$ and $C(i,j-1)$ represent the cumulative costs of the three possible preceding operations: matching the ith element of P with the jth element of Q, deletion, and insertion, respectively. The recursion applies for all $i>1$ and $j>1$. The starting boundary conditions are:

$\displaystyle C(1,1)=d(1,1),\quad C(i,1)=d(i,1)+C(i-1,1),\quad C(1,j)=d(1,j)+C(1,j-1)$ (6)

After filling in the cost matrix $C$ using the formula, we find the matching cost of the last element of both sequences, $C(m,n)$ , and trace back to find the optimal path from $C(1,1)$ , which is the optimal matching path between the two sequences. This path minimizes the cumulative matching cost, and hence, DTW can effectively find the optimal matching path between two sequences with the minimum cumulative cost, even when there is noise or missing elements in the sequences.
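The recursion in Eqs. (4) to (6) and the trace-back step can be sketched as follows. This is an illustrative pure-Python version, assuming each sequence is a list of (x, y) tuples and using 0-based indices, so that `cost[0][0]` corresponds to $C(1,1)$ and `cost[m-1][n-1]` to $C(m,n)$. The function name `dtw_with_path` is hypothetical.

```python
import math


def dtw_with_path(p, q):
    """Return the cumulative DTW cost C(m, n) and the optimal warping path."""
    m, n = len(p), len(q)
    if m == 0 or n == 0:
        raise ValueError("both sequences need at least one element")
    cost = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = math.dist(p[i], q[j])                 # Eq. (4)
            if i == 0 and j == 0:
                cost[i][j] = d                        # C(1, 1)
            elif j == 0:
                cost[i][j] = d + cost[i - 1][0]       # first column, Eq. (6)
            elif i == 0:
                cost[i][j] = d + cost[0][j - 1]       # first row, Eq. (6)
            else:
                cost[i][j] = d + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])  # Eq. (5)
    # Trace back from the last cell to the first along the cheapest predecessors.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                          (cost[i - 1][j], i - 1, j),
                          (cost[i][j - 1], i, j - 1))
        path.append((i, j))
    path.reverse()
    return cost[m - 1][n - 1], path
```

When one sequence contains a repeated or extra vertex, the path pairs a single element of the other sequence with several elements, which is the elastic matching behaviour described above.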

3. Results

The 6,237 railway lines from Source 2 are compared with Source 1. For numerical attributes, the statistical analysis covers the mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. For categorical attributes, it covers the number of unique values and the frequency of the most common value. These analyses characterize the distribution and prevalent patterns of each dataset.

3.1 Statistical analysis

Exploratory analysis is conducted on the fields, with separate statistical summaries for numerical and textual data. Numerical statistics are shown in Tables 6 and 7, and non-numerical data statistics are shown in Tables 8 and 9.

Table 6
Source 1 numerical statistics

Statistical metric	FID	OBJECTID	Shape_length
Mean	52	4144.32381	0.28438923
Standard deviation	30.45488	211.091827	0.5647934
Minimum	0	3887	0.000517
25th percentile	26	4002	0.029829
Median	52	4079	0.123507
75th percentile	78	4126	0.23742
Maximum	104	4608	3.602816

Table 7

Source 2 numerical statistics

Statistical metric	osm_id	Code	Layer	Shape_length
Mean	311000000	6101.5	-0.0013	0.00794767
Standard deviation	224000000	0.9139	0.9191	0.020798768
Minimum	9853311	6101	-6	0.000047
25th percentile	122000000	6101	0	0.000753
Median	251000000	6101	0	0.002692
75th percentile	456000000	6102	0	0.007273
Maximum	760000000	6108	4	0.330322

Table 8

Source 1 non-numerical data statistics

Statistical metric	Shape^*	RN	TYPE
Number of unique values	1	20	3
Frequency of most common value	105	43	64

Table 9

Source 2 non-numerical data statistics

Statistical metric	Shape^*	Fclass	Bridge	Tunnel
Number of unique values	1	8	2	2
Most common value	Polyline	Rail	F	F
Frequency of most common value	6237	4673	4981	5593
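Summaries of the kind reported in Tables 6 to 9 can be produced with the Python standard library, as in the sketch below. It assumes the sample standard deviation and linearly interpolated percentiles; the paper does not state which conventions were used, and the function names are illustrative.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Mean, sample standard deviation, extremes, and quartiles of a numeric field."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "median": median,
        "75%": q3,
        "max": max(values),
    }


def categorical_summary(values):
    """Number of unique values plus the most common value and its frequency."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return {"unique": len(counts), "top": top_value, "freq": top_count}
```

Under these assumptions, `numeric_summary(list(range(105)))` reproduces the FID column of Table 6 (mean 52, standard deviation about 30.45488, quartiles 26, 52, and 78), which suggests the tables were computed with the same conventions.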

3.2 Implementation of DTW in Python

The Dynamic Time Warping curve similarity calculation is implemented in Python. The implementation relies on the pandas, geopandas, shapely, numpy, openpyxl, and scipy libraries to process the geographic data and calculate DTW distances. The main code snippet is as follows:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import numpy as np
import openpyxl
from scipy.spatial.distance import euclidean

# Define a function to calculate the simplified DTW distance between two sequences.
def simplified_dtw_distance(seq1, seq2):
    n, m = len(seq1), len(seq2)
    dtw_matrix = np.zeros((n + 1, m + 1))
    # Initialize the first row and column of the matrix with infinity.
    for i in range(1, n + 1):
        dtw_matrix[i, 0] = np.inf
    for j in range(1, m + 1):
        dtw_matrix[0, j] = np.inf
    # Calculate DTW distance.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = euclidean(np.array(seq1[i - 1]), np.array(seq2[j - 1]))
            dtw_matrix[i, j] = dist + min(dtw_matrix[i - 1, j],
                                          dtw_matrix[i, j - 1],
                                          dtw_matrix[i - 1, j - 1])
    return dtw_matrix[n, m]

# Read CSV files into pandas DataFrames
# Convert DataFrames to GeoDataFrames
# Iterate through each OSM line, find the most matching ZKY line
# Example iteration and comparison (You’ll need to adjust based on your actual data structure and requirements)
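The snippet leaves the iteration step as comments. A minimal sketch of that step is given below, assuming each line has already been reduced to a list of (x, y) coordinate tuples and that both sources are held in dictionaries keyed by record ID. The names `dtw_distance` and `match_lines` and the dictionary layout are hypothetical and are not taken from the original code. The DTW function is restated in pure Python so that the example is self-contained.

```python
import math


def dtw_distance(seq1, seq2):
    """Cumulative DTW cost between two coordinate sequences (two-row version)."""
    n, m = len(seq1), len(seq2)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            dist = math.dist(seq1[i - 1], seq2[j - 1])
            curr[j] = dist + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def match_lines(source2_lines, source1_lines):
    """For each Source 2 line, find the Source 1 line with the smallest DTW distance.

    Both arguments map a record ID to a list of (x, y) tuples.
    Returns {source2_id: (best_source1_id, dtw_distance)}.
    """
    matches = {}
    for s2_id, s2_coords in source2_lines.items():
        scored = [(dtw_distance(s2_coords, s1_coords), s1_id)
                  for s1_id, s1_coords in source1_lines.items()]
        best_dist, best_id = min(scored, key=lambda pair: pair[0])
        matches[s2_id] = (best_id, best_dist)
    return matches
```

This brute-force search evaluates every pair of lines, which amounts to 6,237 × 105 DTW computations for the data described in Section 2.1. A spatial pre-filter could reduce that number, but none is described in the paper.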

3.3 Similarity calculation results

The results of the DTW similarity calculation are shown in Table 10. The dataset contains 6,237 observations, with an average DTW distance between the two railway maps of 3.918. The standard deviation of the DTW distance is 4.634, which exceeds the mean and points to a wide range of similarities and dissimilarities among the compared railway lines. A statistical summary of the results is given in Table 11.

Table 10
DTW similarity calculation results

Source2 ID	Source2 name	Source1 ID	Source1 name	DTW similarity
1	Jingha Line	3	Jingguang	1.564107
2	Jingjiu Line	4	Jingguang	5.973228
…	…	…	…	…
1449	Airport Express	1	Jingjin Intercity	32.86039
1468	Beijing Subway Line 15	2	Jinghu	33.44418
…	…	…	…	…
6236	Jinghu Line	2	Jinghu	4.215232
6237	Jinghu Line	3	Jingguang	1.13555

Table 11

Statistical summary of DTW distance calculations

Metric	Value	Explanation
Count	6237	Total number of observations in the dataset.
Mean	3.918027	Represents the average DTW distance between the two sets of railway maps.
Std	4.633556	Indicates the variability around the mean.
Min	0.004667	Minimum value.
Max	72.52057	Maximum value.
25%	1.524001	Indicates that 25% of data points are below this value.
Median (50%)	2.428675	Indicates that 50% of data points are below this value.
75%	4.269025	Indicates that 75% of data points are below this value.
Mode	0.004667	The most common value.
Skewness	4.24569	Indicates the asymmetry of the data distribution. A positive value indicates right skew.
Kurtosis	30.19052	Indicates the thickness of the tail of the data distribution. High value indicates more extreme values.

The DTW distances range from a minimum of 0.005 to a maximum of 72.521, so some railway lines are nearly identical across the two sources while others differ greatly. The quartiles (25th percentile 1.524, median 2.429, 75th percentile 4.269) show that half of the records have a DTW distance below 2.429 and three quarters fall below 4.269. The mode is 0.005, which coincides with the minimum. The skewness of 4.246 indicates a strong positive skew: the bulk of the DTW distances are concentrated at the lower end of the scale, with a small number of extremely high values indicating significant dissimilarities. The kurtosis of 30.191 indicates a heavy-tailed distribution, that is, the presence of outliers that differ markedly from the rest. Taken together, the wide range, the large standard deviation, and the high kurtosis suggest that while many railway lines share moderate to high levels of similarity, there are notable exceptions that are significantly dissimilar. This distribution underscores the complexity of comparing railway lines based on their geographic alignment and highlights the utility of DTW in capturing these variations.

The correlation of DTW_Distance with the other variables is examined with a heatmap, as shown in Fig. 4. Here the ZKY prefix denotes Source 1 and the OSM prefix denotes Source 2. The correlation between OSM1_ORIG_FID and ZKY1_ORIG_FID is very low and negative, indicating no significant linear relationship between them. Similarly, the correlation between OSM1_ORIG_FID and DTW_Distance is low and negative. The correlation between ZKY1_ORIG_FID and DTW_Distance is positive but still relatively low (0.28), indicating a weak positive linear relationship. The heatmap therefore shows no strong linear correlation between the identifiers and DTW_Distance. This outcome is expected, because the identifiers are categorical variables whereas DTW_Distance is a continuous variable. The weak correlation between ZKY1_ORIG_FID and DTW_Distance may warrant further exploration to determine whether it reflects a pattern or merely random variation in the data. In the TYPE field of the Source 1 table, the only values are blank and “electric,” while in the fclass field of the Source 2 table, the only values are “rail” and “subway.” The type fields of the two tables thus differ significantly in content and format, making them unsuitable as keys for merging the two tables.
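For reference, each cell of such a correlation heatmap is a Pearson correlation coefficient between two columns. The sketch below computes one coefficient in pure Python. The function name `pearson_r` is illustrative, and the paper does not state which tool produced Fig. 4.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 0 means only that there is no linear relationship. Because record identifiers are arbitrary labels, a coefficient such as 0.28 between an identifier and DTW_Distance should be interpreted with caution.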

Figure 4.

Heatmap of correlations.

The histogram in Fig. 5 displays the distribution of DTW_Distance (the result of the similarity calculation). Most DTW_Distance values are concentrated in the lower range, especially between 0 and 10, which is consistent with the mean (about 3.92) and the median (about 2.43). The distribution is right-skewed (positively skewed): the histogram extends to the right, so in addition to the large number of closely matching line pairs there are a few pairs with much larger DTW_Distance values.

Figure 5.

Histogram of DTW similarity.

Figure 6.

Scatter plot of DTW similarity.

In the scatter plot of DTW_Distance (Fig. 6), each point represents a data record, with the horizontal axis showing the record’s index and the vertical axis showing the corresponding DTW_Distance value. The plot shows how the data points are dispersed along the vertical axis: most are concentrated in the lower value area, consistent with the histogram. The higher DTW_Distance values appear as clearly separated points, which are potential outliers, and the scatter plot shows intuitively where they lie relative to the overall dataset.

The empirical rule (68-95-99.7 rule) is applied to all data, with the thresholds set at $\pm$2 standard deviations from the mean:

$\displaystyle\text{Upper threshold}=\text{Mean}+2\times\text{Standard deviation}=13.18726$

$\displaystyle\text{Lower threshold}=\text{Mean}-2\times\text{Standard deviation}=-5.35044$ (7)

In Excel, the formula =COUNTIFS(E2:E6231, ">=-5.35044", E2:E6231, "<=13.18726") was used to count the records within this range, giving 5958 records. Therefore, about 95.5267% of the 6,237 records fall within the threshold range. Having more than 95% of records within the threshold indicates that the comparison method is statistically reasonable and effective for separating normal values from outliers in the dataset. The remaining 4.4733% of records are outliers, which are mainly attributed to differences in the collection times and coverage of the two data sources.
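The same threshold count can be expressed in Python. The sketch below assumes the DTW distances are available as a plain list of floats and uses the sample standard deviation; the function name `share_within_two_sigma` is illustrative.

```python
import statistics


def share_within_two_sigma(distances):
    """Return (lower, upper, share), where share is the fraction of values
    lying within mean +/- 2 standard deviations."""
    mean = statistics.fmean(distances)
    std = statistics.stdev(distances)
    lower, upper = mean - 2 * std, mean + 2 * std
    inside = sum(lower <= d <= upper for d in distances)
    return lower, upper, inside / len(distances)
```

With the counts reported above, the share is 5958 / 6237 ≈ 0.955267. Because DTW distances cannot be negative, the lower threshold never excludes a record, so the count is in effect determined by the upper threshold alone.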

4. Discussion

This paper introduces a railway multisource homonymous geographical entity matching method based on Dynamic Time Warping (DTW), crucial for the integration and management of railway spatial data, enhancing railway informatization, safety, and operational efficiency. Using Python, the study meticulously processes data and calculates DTW distances, optimizing the accuracy and efficiency of railway data integration. Empirical results show a high matching success rate, with approximately 95.5267% of records falling within the acceptable threshold, underlining the method’s robustness. Analysis of DTW distances has revealed effective quantification of similarities between different railway lines, with some line pairs displaying minimal distances, indicating high geometric and path configuration consistency. Statistical analysis has pointed out that most DTW distances tend to cluster at lower values, suggesting a general similarity among different data sources’ railway lines. Yet, there are outliers with higher DTW distances, signaling data discrepancies or geographical feature differences. The algorithm has successfully identified line pairs with extremely high similarity, with DTW distances close to zero, confirming the algorithm’s effectiveness in integrating railway vector data and identifying multisource homonymous entities. The DTW algorithm’s ability to pinpoint local similarities between globally dissimilar lines is especially valuable for railway data processing, ensuring segment-specific precision. Overall, this research enhances the precision and efficiency of railway vector data matching and provides new insights into the consistencies and differences of railway line records, significantly contributing to the railway industry’s informatization and intelligent development.

Data sharing agreement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article.

Funding

This article was supported by China National Railway Group Co., Ltd. under the Science and Technology Research and Development Program (Project No. L2022X001); and the China Academy of Railway Sciences Corporation Limited under the Scientific Research Project (Project No. 2022YJ355).

Acknowledgments

My sincere thanks go to all the individuals and organizations that contributed to the success of this study.

References

1. Zhu. Organization and application of high-speed railway spatial data based on ArcGIS. Southwest Jiaotong University. 2012.

2. Guo. Research and application of 3D design of Chongqing suburban railway Yongchuan line based on BIM+GIS electronic sand table. Railway Standard Design. 2024; 1-7.

3. Huang, Luo, Wang. Application research of BIM+GIS technology in high-speed railway ballastless track cable-stayed bridges. Western Transportation Science and Technology. 2023; (03): 107-108+111.

4. Feng. Research on deep foundation pit and high-speed railway collaborative early warning system based on BIM+GIS+IoT. Railway Construction Technology. 2022; (10): 55-59.

5. Zhao. BIM+GIS-based O&M management platform for high-speed train depot equipment and facilities. Railway Technical Innovation. 2022; (01): 7-13.

6. Han. Research on key technologies of digital twin of high-speed train depot based on BIM+GIS technology. Railway Standard Design. 2022; 66(09): 160-165.

7. Research on comprehensive application platform of West Ten High-Speed Railway design results based on GIS+BIM. Railway Standard Design. 2022; 66(01): 13-16+25.

8. Wang, Liu, et al. Technology of railway GIS spatial hierarchical expression based on vector tiles. Railway Construction. 2022; 62(10): 156-160.

9. Sun, Feng, Wei, et al. Application of BIM+GIS technology in integrated management of Beijing-Zhangjiakou high-speed railway construction and maintenance. China Railway. 2022; (07): 96-101.

10. Railway route planning and design using remote sensing and geographic information system. Manager World of Transportation. 2023; (29): 158-160.

11. Liu. High-speed railway noise map drawing technology based on geographic information system. China Railway Science. 2022; 43(01): 182-188.

12. Quddus, Ochieng, Noland. Current map-matching algorithms for transport applications: State-of-the art and future research directions. Transportation Research Part C: Emerging Technologies. 2007; 15(5): 312-328.

13. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. 2006.

14. et al. Stress tensor similarity index based on Euclidean distance for numerical back analysis of in situ stress fields. Computers and Geotechnics. 2023; 159: 105457.

15. Lee. Normalization and possibility of classification analysis using the optimal warping paths of dynamic time warping in gait analysis. Journal of Exercise Rehabilitation. 2023; 19(1): 85.

16. Millot, Blache, Dinu, et al. Center of mass velocity comparison using a whole body magnetic inertial measurement unit system and force platforms in well trained sprinters in straight-line and curve sprinting. Gait & Posture. 2023; 99: 90-97.

17. Latorre-Carmona, Huertas, Pedersen, et al. Proposal of a new fidelity measure between computed image quality and observers quality scores accounting for scores variability. Journal of Visual Communication and Image Representation. 2023; 90: 103704.

18. Arlinghaus. Practical handbook of curve fitting. CRC Press. 2023.

19. Buchin, Fan, Löffler, et al. Fréchet distance for uncertain curves. ACM Transactions on Algorithms. 2023; 19(3): 1-47.

20. Goodfellow, Bengio, Courville