Abstract
This study aims to explore an efficient technique for matching multisource homonymous geographical entities in railways to address the identification issues of homonymous geographical entities. Focusing on railway line vector spatial data, this research investigates the matching problem of multisource homonymous geographical entities. Building on statistical feature matching of attribute data, a curve similarity calculation method based on the DTW algorithm is designed to achieve better local elastic matching, overcoming the limitations of the Fréchet algorithm. The empirical study utilizes railway line layer data from two data sources within Beijing’s jurisdiction, fusing 6237 segment lines from source 2 with 105 long lines from source 1. The structural comparison between the two data sources is conducted through statistical methods, applying cosine similarity and the maximum similarity value of TF-IDF for text similarity calculation. Finally, Python is used to implement the DTW algorithm for curve similarity. The experimental results show an average DTW distance of 3.92, a standard deviation of 4.63, and a mode of 0.005. Similarity measurement results indicate that 95.53% of records are within the predetermined threshold, demonstrating the effectiveness and applicability of the method. The findings significantly enhance the accuracy of railway data matching, promoting the informatization of the railway industry, and hold substantial significance for improving railway operational efficiency and system performance.
Introduction
The operation of railway networks is an important indicator in data analytics and economic development, and has long been a key component of geographical economic studies. Following the release of the “Digital Transportation Development Outline” by the Ministry of Transport in 2019 and the “Digital Railway Plan” by the China National Railway Group Co., Ltd. in 2023, comprehensive digitalization and data sharing of railway services have become a central force driving the construction of modern infrastructure. At present, extensive spatial geographic data has been developed across all facets of railway operations, including marketing, production, and safety processes. This data encompasses a wide range of elements such as locomotives, rolling stock, stations, tracks, bridges, tunnels, communication and signal equipment, as well as power and water supply facilities [1].
The application of GIS technology in high-speed rail projects encompasses three-dimensional design [2], track cable-stayed bridges [3], deep excavation with high-speed rail coordinated early warning systems [4], management of rolling stock depot facilities and operations [5], and studies on the integrated application platform for design outcomes [6, 7]. This illustrates the pivotal role of GIS throughout multiple phases of high-speed rail development, enhancing both project efficiency and safety. In the realm of GIS system development and management, the focus includes spatial hierarchical representation techniques for railway GIS [8], the Beijing-Zhangjiakou High-Speed Railway’s integrated construction and maintenance management [9]. These studies underscore GIS technology’s crucial impact on elevating railway project management efficiency, bolstering infrastructure oversight, and refining operation and maintenance practices. Analytical and monitoring approaches cover the planning and design of railway routes [10] and creation of high-speed rail noise maps [11]. These investigations highlight the broad application of remote sensing and GIS technologies in railway construction, operational management, and environmental impact assessments, thereby improving the planning, safety, and environmental monitoring capabilities of railway systems.
However, due to the differences in application requirements and the relative independence of different systems, the issue of semantic inconsistency in multi-source vector data has become increasingly prominent in the railway data integration process. As pointed out in the “14th Five-Year Plan for Digital Transportation Development,” the existing information still lacks in depth and breadth of integration. Research on the fusion of map routes and related data mainly focuses on using different data processing methods to integrate map information from various sources. Representative studies include map matching techniques [12] and those emphasizing the use of spatial indices and optimized data structures [13]. In terms of curve similarity measurement, methods such as Euclidean distance [14], Dynamic Time Warping [15], Pearson correlation coefficient [16], Spearman’s rank correlation coefficient [17], curve fitting [18] and Fréchet distance [19] can also be applied to the comparison of geometric features of routes. In addition, methods based on deep learning, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [20], can be used to extract features from curve data and perform similarity comparisons. Among these, discrete Fréchet distance and Dynamic Time Warping are two commonly used methods that can effectively evaluate and quantify the similarity between two routes. These studies not only improve the speed and efficiency of data processing but also enhance the accuracy of map data fusion.
This paper therefore investigates the matching of multisource homonymous geographical entities, taking railway line vector spatial data as the subject of research. First, the concept of multisource heterogeneous spatial data for railways is defined, and a methodology for matching multisource homonymous geographical entities in the railway context is developed. Attribute features are then matched using Term Frequency-Inverse Document Frequency (TF-IDF), and a method for geometric feature matching based on Dynamic Time Warping (DTW) is proposed. An empirical investigation is conducted using railway line maps from two distinct sources. By effectively integrating railway line data from diverse origins, the approach aims to minimize redundant data storage and to address the semantic heterogeneity inherent in multisource railway vector data, thereby improving the utilization and efficiency of the data.
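The attribute-matching step can be illustrated with a minimal sketch of TF-IDF weighting followed by cosine similarity, keeping the candidate with the maximum score. The character-bigram tokenizer, the smoothed IDF formula, the function names, and the sample route names are all assumptions made for illustration; the study's own tokenization and weighting are not specified here.

```python
import math
from collections import Counter


def char_bigrams(name):
    """Hypothetical tokenizer: overlapping character bigrams of a lower-cased name."""
    text = name.lower().replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def tfidf_vectors(names):
    """Return one sparse TF-IDF vector (a dict) per name, using a smoothed IDF (assumed)."""
    docs = [Counter(char_bigrams(n)) for n in names]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct token once, so this is document frequency.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            tok: (cnt / total) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, cnt in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query, candidates):
    """Return the candidate name with the maximum cosine similarity to the query."""
    vecs = tfidf_vectors([query] + candidates)
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Illustrative names only; they are not taken from the study's datasets.
name, score = best_match("Jingguang Railway",
                         ["Jinghu Line", "Jingguang Line", "Daqin Railway"])
print(name, round(score, 3))
```

Character n-grams are used here only because they tolerate small spelling differences between sources; a word-level tokenizer would fit the same structure.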
Materials and methods
General data
To validate the effectiveness of the algorithm, data fusion is performed on railway and subway information from two sources. Source 1 is from the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, and Source 2 is from OpenStreetMap (OSM). By applying geographic analysis and clipping techniques, the recorded railway lines in Beijing are extracted, as shown in Fig. 1.
The first source comprises a total of 105 entries, with relatively longer route lengths. The second source consists of 6,237 records, which are more detailed, and the recorded lengths of the routes are comparatively shorter.
Data structure
The table structures storing the two layers of data are analyzed, and the meaning of each field is explained below. In the dataset for Source 1, the “Shape” field is “Polyline” for all records. FID is an identifier that distinguishes each row of data. Shape indicates the geometric type of the data. OBJECTID is a second unique identifier for each object in the database. RN is a reference number. NAME is the name of the route or object. TYPE is a category field indicating the classification of the object. Shape_Length is the length of the geometric shape. See Tables 1 and 2 for details.
Dataset descriptors for source 1
Example data for source 1
Beijing rail map (source 1: left, source 2: right).
In the dataset for Source 2, the “Shape” field is “Polyline” for all records. Similarly, the “code” field is uniformly “6101,” and the “fclass” for all entries is designated as “rail”. ObjectID serves as a unique identifier. Shape denotes the geometric type of the data. osm_id is the identifier used by OpenStreetMap to recognize specific geographical features or objects. code is used for categorizing or identifying specific features. fclass represents the class of the route. name is the name of the route or object. layer indicates the object’s level within the Geographic Information System. bridge indicates whether the structure is a bridge, and tunnel indicates whether it is a tunnel. Shape_Length is the length of the line object. See Tables 3 and 4 for details.
Dataset descriptors for source 2
Example data for source 2
Transport vector data digitizes transportation information into vector formats, detailing the geometric attributes of transportation infrastructures such as roads, which include their width, length, position, and orientation. The classification of transport vector data varies, with a common approach being to categorize it based on the type of content it contains, typically including road vector data, vehicle vector data, and transportation facility vector data. Moreover, based on the source of data and the domain of application, transport vector data can be divided into categories like aviation traffic vector data, railway traffic vector data, and waterway traffic vector data.
Railway vector data encompasses a comprehensive set of information on infrastructure elements such as tracks, stations, bridges, tunnels, culverts, and embankments. This dataset constitutes approximately 50% of all spatial data in the railway sector, making it an essential information resource for railway spatial databases. Due to the variety in equipment and methodologies used for processing and recording the original data across different subsystems, the same traffic vector might be documented in different ways, leading to the creation of multi-source homonymous geographical entities. These entities are defined as geographical entities that possess the same name or identifier in different data sources, as shown in Fig. 2.
Railway geographic vector classification.
Through the employment of this hierarchical approach, railway authorities can effectively navigate the complexities inherent in managing spatial data, thereby enhancing operational efficiencies and elevating safety standards within the railway transportation ecosystem.
(1) Identification and Comparison of Point Entities:
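Given two points $P_1 (x_1, y_1)$ and $P_2 (x_2, y_2)$, the Euclidean distance between them is $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. If this distance is below a chosen threshold, the two points may represent the same entity.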
(2) Identification and Comparison of Line Entities: Line entities can be represented as a collection of points. For two lines L1 and L2, their point sets are compared for overlap or proximity. The minimum distance between line segments or a comparison of their direction and length can be calculated to determine if they represent the same entity.
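For every pair of points $(x_i, y_i)$ in L1 and $(x_j, y_j)$ in L2, compute the Euclidean distance $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$.
Construct a distance matrix D, where D[i][j] $= d_{ij}$ is the distance between the i-th point of L1 and the j-th point of L2.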
(3) Identification and Comparison of Area Entities:
Area entities can be represented as closed point sets or polygons. Suppose there are two areas A1 and A2, defined by point sets {(
If the areas of two regions or their boundaries are very close, they may represent the same entity.
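The point and line comparisons above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the tolerance parameter `tol`, and the use of vertex-to-vertex distances (rather than true segment-to-segment distances) are assumptions, not the procedure used in the study.

```python
import math


def point_distance(p1, p2):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])


def same_point(p1, p2, tol):
    """Treat two points as the same entity if they lie within a tolerance (assumed rule)."""
    return point_distance(p1, p2) <= tol


def distance_matrix(line1, line2):
    """D[i][j] is the distance between vertex i of line1 and vertex j of line2."""
    return [[point_distance(p, q) for q in line2] for p in line1]


def min_vertex_distance(line1, line2):
    """Smallest vertex-to-vertex distance between two polylines."""
    return min(min(row) for row in distance_matrix(line1, line2))


line_a = [(0.0, 0.0), (1.0, 0.0)]
line_b = [(0.0, 3.0), (4.0, 0.0)]
print(distance_matrix(line_a, line_b))
print(min_vertex_distance(line_a, line_b))  # 3.0
```

The tolerance would need to be chosen in the units of the coordinate system in use, which this sketch leaves open.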
Railway route curve similarity calculation based on Dynamic Time Warping measures the similarity or dissimilarity between two railway routes. Approaches of this kind are commonly applied to curve data such as time series, trajectories, graphs, and sound signals. Here the measurement is used to compare the characteristics and geometric shapes of different railway routes. The evaluation of railway route curve similarity encompasses considerations of geometric shapes, route features, load characteristics, safety, and reliability. The Dynamic Time Warping-based railway route curve similarity calculation is shown in Table 5.
Dynamic time warping-based railway route curve similarity calculation
Common methods for curve similarity calculation include Euclidean distance, correlation coefficients (such as Pearson or Spearman rank correlation coefficients), curve fitting, dynamic time warping (DTW), Fréchet distance, edit distance, and deep learning-based approaches. Among these methods, Fréchet distance is widely used due to its advantages:
Geometric-based: Fréchet distance considers the geometric properties of the curves as a whole, rather than only the distances between individual points.
Behavior under translation and scaling: Fréchet distance is unchanged when both curves are translated together, and it scales proportionally when both curves are scaled by the same factor.
Applicability to multi-dimensional curves: Fréchet distance is not only applicable to one-dimensional curves (e.g., time series) but can also be extended to multi-dimensional curves.
However, Fréchet distance has some drawbacks, especially concerning local similarity evaluation:
High computational complexity: Algorithms for computing Fréchet distance typically have high computational complexity, especially for long curves or high-dimensional data.
Sensitivity to noise: Fréchet distance is sensitive to noise and local perturbations.
Lack of flexibility in handling time variations: Fréchet distance often assumes consistent motion speeds between two curves.
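The sensitivity to local perturbations can be seen in a small sketch of the discrete Fréchet distance, computed here with the usual dynamic-programming table. The function name and the sample coordinates are illustrative assumptions. Because the measure is a maximum over matched pairs, a single displaced vertex determines the whole result.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as lists of points."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Best of the three admissible predecessors, capped below by d.
                ca[i][j] = max(min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1]), d)
    return ca[n - 1][m - 1]


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
# Same line with one vertex displaced by 5 units (a hypothetical digitizing error).
perturbed = [(0.0, 0.0), (1.0, 5.0), (2.0, 0.0), (3.0, 0.0)]

print(discrete_frechet(straight, straight))   # 0.0
print(discrete_frechet(straight, perturbed))  # 5.0
```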
Due to the extensive nature of railway lines, there is an issue with the inconsistency of curve segmentation. As illustrated in Fig. 3, the segmentation standards differ across various data sources, thus necessitating a consideration of local similarity. Consequently, this paper employs the principles of Dynamic Time Warping (DTW) algorithm for the computation of similarity in railway line routes.
Divergent segmentations of a railway line.
The advantages of DTW in measuring local similarity, compared with other methods, lie primarily in its capability for elastic matching of temporal or spatial sequences, its ability to handle nonlinear relationships, and its fault tolerance and robustness. The foundational principle of DTW applied to railway line comparison is as follows:
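Consider two sequences to be matched, $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_m)$, where each sequence may contain a different number of elements.
In constructing the cost matrix C, each element $C(i, j)$ is the minimum cumulative cost of aligning the first $i$ elements of $P$ with the first $j$ elements of $Q$: $C(i, j) = d(p_i, q_j) + \min\{C(i-1, j),\ C(i, j-1),\ C(i-1, j-1)\}$.
Moreover, $d(p_i, q_j)$ denotes the distance between $p_i$ and $q_j$, for example the Euclidean distance, and the boundary conditions are $C(0, 0) = 0$ and $C(i, 0) = C(0, j) = \infty$ for $i, j > 0$.
After filling in the cost matrix $C$, the DTW distance between $P$ and $Q$ is given by $C(n, m)$.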
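The cost-matrix construction can be sketched as follows, using the textbook DTW recurrence with Euclidean point distances and a backtracking step that recovers the warping path. The function name `dtw`, the unconstrained step pattern, and the sample coordinates are assumptions for illustration and may differ from the configuration used in the study.

```python
import math


def dtw(p, q):
    """Return (DTW distance, warping path) for two point sequences.

    The path is a list of (i, j) index pairs, from (0, 0) to (len(p)-1, len(q)-1).
    """
    n, m = len(p), len(q)
    # C has an extra row and column of infinities so the recurrence needs no special cases.
    C = [[math.inf] * (m + 1) for _ in range(n + 1)]
    C[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(p[i - 1], q[j - 1])
            C[i][j] = d + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])

    # Backtrack from the last cell, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (C[i - 1][j - 1], i - 1, j - 1),
            (C[i - 1][j], i - 1, j),
            (C[i][j - 1], i, j - 1),
        )
    path.reverse()
    return C[n][m], path


# The same straight line digitized twice; the second copy repeats one vertex.
line_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
line_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
distance, path = dtw(line_a, line_b)
print(distance)  # 0.0
print(path)      # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

In this example the two sequences have different lengths, yet the elastic alignment maps the repeated vertex onto a single vertex of the other line and the distance is zero. When the vertices of the two lines do not coincide, the distance is generally nonzero even for identical shapes, so resampling choices matter.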
The 6,237 railway lines from source 2 are compared with those from source 1. For numerical attributes, statistical analyses are conducted covering the mean, standard deviation, minimum value, 25th percentile, median, 75th percentile, and maximum value. For categorical attributes, the count of unique values and the frequency of the most prevalent values are examined. These analyses aim to characterize the distributions and prevalent patterns within the dataset.
Statistical analysis
Exploratory analysis is conducted on the fields, with separate statistical summaries for numerical and textual data. Numerical statistics are shown in Tables 6 and 7, non-numerical data statistics are shown in Tables 8 and 9.
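A minimal sketch of such summaries, using only the Python standard library, is given below. The function names and the sample values are illustrative assumptions; the quartiles use linear interpolation between data points, which may differ slightly from the tool used to produce the tables.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Mean, sample standard deviation, extremes, and quartiles of a numeric field."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


def categorical_summary(values):
    """Number of unique values and the most frequent value of a text field."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]
    return {"count": len(values), "unique": len(counts), "top": top, "freq": freq}


# Illustrative values only, not taken from the study's tables.
print(numeric_summary([1.0, 2.0, 3.0, 4.0, 5.0]))
print(categorical_summary(["rail", "rail", "subway"]))
```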
Source 1 numerical statistics
Source 2 numerical statistics
Source 1 non-numerical data statistics
Source 2 non-numerical data statistics
The Dynamic Time Warping curve similarity is implemented in Python: the geographic data are processed and the DTW distances are then calculated. The libraries required for this task include pandas, geopandas, shapely, numpy, openpyxl, and scipy. The main code snippet for these steps is as follows:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
import numpy as np
import openpyxl
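from scipy.spatial.distance import euclidean

# Define a function to calculate the simplified DTW distance between two sequences.
# Note: the original listing breaks off after "n, m"; the body below is an assumed
# standard DTW recurrence, not necessarily the authors' exact code.
def simplified_dtw_distance(seq1, seq2):
    n, m = len(seq1), len(seq2)
    # Cumulative cost matrix with an extra border row and column of infinities.
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclidean(seq1[i - 1], seq2[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return dtw[n, m]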
Similarity calculation results
The results of the DTW similarity calculation are as shown in Table 10. The dataset contains 6,237 observations, with the average DTW distance between the two railway maps being 3.918. This average suggests a moderate level of dynamic time warping distance between the two sets of railway lines, indicating how similar or dissimilar they are on average. The standard deviation of the DTW distance is 4.634, highlighting a significant degree of variability around the mean, which points to a wide range of similarities and dissimilarities among the compared railway lines. Statistical analysis of the results yielded the following Table 11.
DTW similarity calculation results
Statistical summary of DTW distance calculations
The DTW distances range from a minimum of 0.005 to a maximum of 72.521, indicating that while some railway lines are very similar, others are vastly different. The quartiles (25th percentile at 1.524, median at 2.429, and 75th percentile at 4.269) further describe the distribution: half of the records have a DTW distance below 2.429, suggesting a moderate level of similarity for the majority of the compared railway lines. The most frequently occurring DTW distance, the mode of the dataset, is 0.005, which coincides with the minimum. The skewness is 4.246, a positive skew: the bulk of the DTW distances are concentrated at the lower end of the scale, with a small number of extremely high values indicating significant dissimilarities. The kurtosis of 30.191 indicates a heavy-tailed distribution, consistent with the presence of outliers. Taken together, the wide range, the large standard deviation, and the high kurtosis suggest that while many railway lines share moderate to high levels of similarity, there are notable exceptions that are significantly dissimilar. This distribution underscores the complexity of comparing railway lines based on their geographic alignment and highlights the utility of DTW in capturing these variations.
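For reference, skewness and kurtosis can be computed from central moments as in the sketch below. The function name is hypothetical, and the formulas are the plain population (biased) versions with kurtosis reported as excess over the normal value of 3. Statistical packages often apply bias corrections, so the values reported above may have been obtained with a slightly different estimator.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# A symmetric sample has zero skewness; a sample with one large value is right-skewed.
print(shape_statistics([1.0, 2.0, 3.0, 4.0, 5.0]))
print(shape_statistics([1.0, 1.0, 1.0, 1.0, 10.0]))
```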
Combining the statistics above with a heatmap of the correlations between DTW_Distance and the other variables (Fig. 4) yields the following observations. The correlation between OSM1_ORIG_FID and ZKY1_ORIG_FID is very low and negative, indicating no significant linear relationship between them. The correlation between OSM1_ORIG_FID and DTW_Distance is likewise low and negative. The correlation between ZKY1_ORIG_FID and DTW_Distance is positive but weak (0.28). The heatmap therefore shows no strong linear correlation between the identifiers and DTW_Distance (a measure of similarity). This outcome is expected if these identifiers are categorical variables and DTW_Distance is a continuous variable. The weak correlation between ZKY1_ORIG_FID and DTW_Distance may warrant further exploration to determine whether it reflects a pattern or merely random variation in the data. In the TYPE field of the “Z” table, the only values are blank and “electric,” while in the fclass field of the “O” table, the only values are “rail” and “subway.” The type fields of the two tables thus differ significantly in content and format, making them unsuitable as keys for merging the two tables.
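Each cell of such a heatmap is a Pearson correlation coefficient between two columns. A minimal sketch of that coefficient is shown below; the function name and the sample data are illustrative assumptions and do not reproduce the study's values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / math.sqrt(sxx * syy)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```

Because the coefficient measures linear association only, a low value between an identifier column and DTW_Distance does not by itself rule out other kinds of structure.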
Heatmap of correlations.
The histogram displays the distribution of DTW_Distance (the result of the similarity calculations), as shown in Fig. 5. The vast majority of values are concentrated in the lower range, especially between 0 and 10, consistent with the mean (about 3.92) and the median (about 2.43). The distribution is right-skewed (positively skewed): most railway line pairs are relatively similar, but the histogram extends to the right, showing that a few pairs have considerably larger DTW_Distance values.
Histogram of DTW similarity.
Scatter plot of DTW similarity.
In the DTW_Distance scatter plot, each point represents a data record, with the horizontal axis being the record’s index and the vertical axis being the corresponding DTW_Distance value, as shown in Fig. 6. Distribution of data points: most data points are concentrated in the lower value area, consistent with observations from histograms and box plots. Visualization of outliers: the higher DTW_Distance values form clearly separated points in the chart, which may correspond to potential outliers previously identified in box plots. The scatter plot shows intuitively how these values are distributed relative to the overall dataset.
Applying the empirical rule (68-95-99.7 rule), with the threshold being
In Excel, the formula
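A threshold check based on the empirical rule can be sketched in Python as follows. The threshold form `mean + k * std`, the default `k = 2.0`, and the function name are assumptions made for illustration; the exact threshold used in the study is not reproduced here.

```python
import statistics


def share_within_threshold(distances, k=2.0):
    """Return (threshold, share of records at or below it).

    The threshold mean + k * std is an assumed form; k is a hypothetical parameter.
    """
    mean = statistics.fmean(distances)
    std = statistics.stdev(distances)
    threshold = mean + k * std
    within = sum(1 for d in distances if d <= threshold)
    return threshold, within / len(distances)


# Illustrative data: 19 small distances and one large outlier.
sample = [1.0] * 19 + [21.0]
threshold, share = share_within_threshold(sample)
print(round(threshold, 3), share)
```

For strongly right-skewed data such as DTW distances, the share falling under such a threshold need not match the percentages the empirical rule gives for a normal distribution.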
This paper introduces a railway multisource homonymous geographical entity matching method based on Dynamic Time Warping (DTW), crucial for the integration and management of railway spatial data, enhancing railway informatization, safety, and operational efficiency. Using Python, the study meticulously processes data and calculates DTW distances, optimizing the accuracy and efficiency of railway data integration. Empirical results show a high matching success rate, with approximately 95.5267% of records falling within the acceptable threshold, underlining the method’s robustness. Analysis of DTW distances has revealed effective quantification of similarities between different railway lines, with some line pairs displaying minimal distances, indicating high geometric and path configuration consistency. Statistical analysis has pointed out that most DTW distances tend to cluster at lower values, suggesting a general similarity among different data sources’ railway lines. Yet, there are outliers with higher DTW distances, signaling data discrepancies or geographical feature differences. The algorithm has successfully identified line pairs with extremely high similarity, with DTW distances close to zero, confirming the algorithm’s effectiveness in integrating railway vector data and identifying multisource homonymous entities. The DTW algorithm’s ability to pinpoint local similarities between globally dissimilar lines is especially valuable for railway data processing, ensuring segment-specific precision. Overall, this research enhances the precision and efficiency of railway vector data matching and provides new insights into the consistencies and differences of railway line records, significantly contributing to the railway industry’s informatization and intelligent development.
Data sharing agreement
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article.
Funding
This article was supported by China National Railway Group Co., Ltd. under the Science and Technology Research and Development Program (Project No. L2022X001); and the China Academy of Railway Sciences Corporation Limited under the Scientific Research Project (Project No. 2022YJ355).
Acknowledgments
My sincere thanks go to all the individuals and organizations that contributed to the success of this study.
