Abstract
Transportation agencies are increasingly integrating third-party traffic data into their core business function areas such as system performance monitoring, project programming, traffic incident management, and safety analysis. However, linking private-sector data with agency asset inventory data has been a major challenge because the networks typically have different referencing systems, segmentation schemes, and representations of travel directions. This paper presents an effective conflation algorithm that associates spatial features between large-scale road networks. Instead of breaking lines into smaller pieces, which is a common technique in transportation applications, we use an intersection-based approach that leverages the inherent topological similarities between networks. The underlying uncertainty and imprecision in network geometries and road names are addressed through application of a fuzzy logic inference technique. We then implement an effective mechanism to handle differences in representations of divided roadways and travel directions in the two networks. The algorithm was tested on Kentucky statewide roadway networks and achieved a matching accuracy of over 99%. This approach has been successfully applied by the Kentucky Transportation Cabinet in its project identification and prioritization process.
Keywords
Recent technological advancements have greatly increased the availability of transportation data, including probe vehicle speeds, travel time, origin–destination matrices, pedestrian and bike activities, and incident alerts. These data, which provide insights into a wide range of travel characteristics, are processed by third-party data providers and made available to transportation agencies as a cost-effective alternative which meet their data needs. Many state departments of transportation (DOT) and metropolitan planning organizations (MPO) have used such data for various business cases, such as real-time traffic management, travel model improvement, safety analysis, and performance management and reporting ( 1 ).
Although private-sector data capture a wide range of information, they lack many critical attributes required for transportation applications (e.g., traffic volume), which must be obtained through agency-maintained databases such as the Highway Performance Monitoring System (HPMS) ( 2 ). With private-sector data and agency-maintained data often being tied to geospatial networks with different referencing systems and segmentation schemes, a direct join is impossible. The challenge is compounded by the need to match directions that are handled differently in these two networks, as integrating transportation data (e.g., speeds, volumes, crashes) by directions is a prerequisite for many transportation applications.
There are other issues associated with conflation that state agencies are struggling with. First, different state agencies and vendors use different networks and the same agency may have different network versions. For example, at the Pennsylvania DOT, interstate segments are always half a mile long and do not consider ramps, while non-interstate roadways break at major landmarks such as intersections and bridges with approximately half a mile length ( 3 ). The Ohio DOT has two Linear Referencing Systems (LRS) (i.e., county-divided and statewide) for their 121,000 centerline mile roadways ( 3 ). At the vendor side, there are INRIX XD, HERE NAVSTREETS, TomTom Multinet, among others, with different segmentations and attributes. Meanwhile, both state and vendor networks are periodically updated, which means the conflated network will have to be updated as well. In addition, developing a conflation process requires extensive Geographic Information System (GIS) and programming knowledge, which can be particularly challenging for agencies with limited IT resources and assets. Many agencies have to outsource the conflation task to consultants, universities, or vendors ( 1 ).
A limited number of transportation-focused publications have proposed automated conflation processes, and none explicitly discuss how to match directions ( 2 , 4–7). For instance, Lomax et al. ( 4 ) was among the first to match the HPMS and Traffic Message Channel (TMC) based private-sector network. To align segmentations of both networks, they first used HPMS link end points to break the TMC network and then obtained candidate TMCs intersecting with or falling within the buffer created around each HPMS line feature. In a recent study, Kaushik et al. ( 7 ) tried to conflate the HPMS with the TomTom Multinet network by breaking down long HPMS links into shorter ones and then applying a scoring process to determine the matching HPMS link for each Multinet link. In general, existing methods have been designed for arterial networks, such as the National Highway Network. Their performance significantly deteriorates when applied to locations with high network density.
In other fields, such as cartography and photogrammetry, several studies have been conducted to address the conflation of heterogeneous geospatial data ( 8 ). With respect to the main geometric features used in the conflation process, these studies have developed node-based methods ( 9 – 13 ), line-based methods ( 14 – 18 ), and polygon or area-based methods ( 19 , 20 ). Various techniques have been applied to identify matching features, such as growing buffer ( 14 ), probabilistic relaxation ( 15 ), network flow optimization ( 18 ), and a delimited stroke-oriented approach ( 21 ). Yet those studies mainly aimed to find the correspondence of geospatial features in undirected networks. As such, adapting them for directed roadway networks is difficult.
In this paper, we present an effective algorithm for conflating large-scale directed networks from a state transportation agency and private sector. Instead of splitting line features into smaller pieces, as commonly done in transportation studies, we use an intersection-based approach that takes advantage of inherent topological similarities between networks while retaining the geometry of individual links. Additionally, such an approach lets us effectively incorporate link direction information into the conflation process. Our algorithm encompasses three essential steps: (1) preprocessing that sets up the networks for conflation; (2) matching intersections based on geometrical and non-spatial attributes and a fuzzy logic inference technique; and (3) matching links and directions. A tailored mechanism handles differences in representations of divided and undivided highways as well as travel directions. The algorithm only uses common attributes found in both highway inventory and private-sector networks, and can thus be quickly adopted by transportation agencies.
The rest of the paper is organized as follows. In the next section we introduce and compare the two networks used to develop the conflation algorithm. One network is from a public agency (Kentucky Transportation Cabinet) and the other is from a private vendor (HERE Technologies). From here, we describe development of the algorithm, followed by the presentation, evaluation, and discussion of conflation results. The last section concludes the paper.
Description of Networks and Problems
We used two highway networks covering the entire state of Kentucky to develop and demonstrate the conflation algorithm. The first is the All Roads Network of Linear-Referenced Data from the Kentucky Highway Information System. It is maintained by the Kentucky Transportation Cabinet. The All Roads Network served as the target network and is referred to as Network A throughout the rest of this paper. The second is the HERE Technologies proprietary street network, which has attached probe speed data. This network, referred to as Network B, served as the matching network.
Network A covers all the public roads across Kentucky and has 169,816 directional miles. It contains attributes that support various of the Kentucky Transportation Cabinet’s core functions. The network is generated following LRS standards, which characterize the network elements, such as points and lines, as a distance from a specific point along a linear feature. Each line feature has an LRS ID consisting of 17 characters, which are unique to a road and a county in Kentucky, as well as beginning and ending milepoints that indicate location and length. The direction of a divided road section is either cardinal, along which milepoints increase in value, or non-cardinal along which milepoints decrease in value. For undivided roads, Network A uses one line to represent both directions with a cardinal LRS coding and an indication of two-way operation.
Network B, on the other hand, references each line feature with a unique Link ID. Directional coding of lines is contingent on the coordinates of two end nodes. The reference node is the end node with the lower latitude, or longitude if the latitudes are the same. The direction is coded as F if the line emanates from the reference node, and T if it is directed toward the reference node. One line is used to represent an undivided road with the direction denoted as Bi. However, speed data are attached to a unique Link ID and travel direction (T or F) combination.
Spatially, Networks A and B share very similar level of details for public roads, but disparities are apparent at some locations. Figure 1a illustrates the coverage difference as there are links not included in either network. Figure 1b highlights the different segmentation schemes, making it challenging to determine the associations for individual links. Figure 1c captures an example of geometric inconsistencies at an interchange, with ramp merging and diverging locations in two networks being far apart. Figure 1d shows how a segment may be coded differently in the two networks based on whether it is divided (using two lines) or undivided (represented by a single line). These create additional complications in matching travel directions.

Discrepancies between Networks A and B: (a) coverage; (b) segmentation scheme; (c) geometry; and (d) divided versus undivided.
An Intersection-Based Conflation Algorithm
We present an algorithm that overcomes the challenges described above and effectively associates links and directions between Networks A and B. We designed it to take advantage of an intersection being almost always represented as a node in both networks. The algorithm first identifies the matching nodes for each intersection and then associates the link(s) and directions for each leg of the intersection. Figure 2 illustrates the algorithm’s overall process.

Flowchart of the conflation algorithm.
Preprocessing
During preprocessing, each road network in the shapefile is converted into a directed graph with a node set and link set consisting of directed links with ordered pairs of nodes. During the conversion, the number of links connected to each node (i.e., node degree) is recorded to identify nodes where at least three links intersect. These nodes are labeled intersections. The direction of each link is based on the milepoints of two end nodes if it is in Network A or latitude/longitude coordinates if it is in Network B. Next, based on ordered node pairs and associated node degrees, the links bound by two intersections are chained into a link sequence.
Figure 3 provides a conceptual example. Two Network A links

Conceptual example of preprocessing step.
Intersection Matching
Next, the B intersection that best matches each A intersection is identified. The basic presumption is that each network most likely has a node representing an intersection of the road network. For example, in Figure 3, nodes
Select Candidate Intersections from Network B
A common method of identifying candidate intersections is to select intersections within a certain distance of the subject intersection. But this often captures irrelevant intersections when a roadway network is dense. For example, as shown in Figure 4a, among three intersections (in Network B) within 400 ft radius of the A intersection, node 2 from the opposing direction of the highway is closer to node A than node 1, which is the correct match for node A. Therefore, using spatial proximity as the sole matching criterion is insufficient.

Obtain candidate intersections: (a) basing on distance captures irrelevant intersections; and (b) basing on nearby link sequences leads to better matches.
We designed a more sophisticated approach based on the observation that line features are spatially proximate in the two networks. Specifically, we first find all A link sequences connected to the A intersection. In the example shown in Figure 4a, these are the two link sequences on the northbound mainline (partially displayed) and one off-ramp link sequence, which is shaded red. Along each link sequence, we then create at least nine equally spaced shape points and use them to find the nearest B link sequences that are within a radius of 100 ft. We chose a threshold of 100 ft, after testing several values, to balance computational efficiency and flexibility in handling geometric discrepancies between the two networks. Accordingly, the two mainline and one off-ramp link sequences in Network B are identified as in Figure 4b as the best matches for those three link sequences in Network A. The end nodes of these identified B link sequences (i.e., nodes 1 to 4 in Figure 4b) are labeled as the candidate intersections in Network B to the subject A intersection. Each intersection has two attributes—including coordinate and street name from the corresponding line sequence—which are used in next step.
Define Intersection Matching Criteria
Intersection proximity and road name similarity are used to evaluate candidate intersections and determine the best match. Proximity measures the distance between a candidate B intersection and the subject A intersection. Since the networks are projected into a Kentucky State Plane coordinate system, the Euclidean distance is calculated as follows.
where
Because road names are sometimes coded differently (e.g., 1st Street versus First Street), a string similarity indicator based on the Ratcliff/Obershelp pattern matching algorithm ( 22 ) is used to measure road name similarity (Equation 2).
where
When determining
With the matching criteria defined, two difficulties must be overcome to combine them into a holistic measure for ranking candidate intersections: 1) geometries and road names are inconsistent between the two networks, resulting in uncertainty and imprecision in the obtained measures; and 2) two measures have different scales with distance ranging from 0 to hundreds of feet while name similarity is measured on a scale of 0 to 1—thus combining them using a representative linear function is challenging.
Determine the Best Match
To deal with above difficulties, a fuzzy logic inference technique is employed to generate an overall matching score using the intersection distance and road name similarity as inputs. Fuzzy logic is effective in dealing with uncertain and imprecise conditions ( 23 ). Input variables with different ranges and scales can also be effectively combined during the inference process. Its utility has been demonstrated in previous transportation engineering studies, including for detecting traffic incidents ( 24 ), developing congestion measures ( 25 ), mapping GPS for vehicle navigation ( 26 ), analyzing bottleneck severity ( 27 ), and crash location mapping ( 28 ).
To run the fuzzy logic inference system, input (i.e., intersection distance and road name similarity) and output (i.e., intersection matching score) variables must be converted into fuzzy variables, a process called fuzzification. This is achieved by defining a membership function that maps an input value to a number in the interval
Determining an appropriate number of classes and representative membership functions can be subjective and sometimes arbitrary as well as highly dependent on specific applications. Empirical assessment is often conducted to account for the characteristics of a particular application ( 28 ). We fine-tuned the fuzzy logic inference system’s parameters by randomly selecting 500 intersections from Network A and manually matching them with their counterparts in Network B. The locations of those intersections are shown in Figure 5.

Locations of randomly selected intersections.
Our evaluation found that three classes are sufficient for both input and output variables in this study as they achieve a good balance between efficiency and accuracy. Given the size of the networks being processed, for membership functions we selected triangular and trapezoidal functions for their simplicity. Imprialou et al. ( 28 ) showed that function forms have limited impact on the results. The boundaries of membership functions are determined through trial and error until the performance is satisfactory for sampled locations. The final membership functions implemented in the study are shown in Figure 6. Using Figure 6a as an example, an intersection distance of 75 ft would not be considered as small since it exceeds 40 ft, which is the upper bound of the Small class. The distance would be considered fairly medium with a membership degree of 0.5 and slightly large with a membership degree of 0.11.

Implemented membership functions for determining matching intersection: (a) intersection distance; (b) road name similarity; and (c) intersection matching score.
After fuzzification, the linguistic representations of intersection distance and road name similarity are combined through a fuzzy inferencing engine, which produces a fuzzy set for the output measure (i.e., intersection matching score). The inferencing engine functions on top of a predefined set of fuzzy rules associating the input classes to the output classes. Table 1 lists all possible inferencing rules based on the previously defined membership functions.
Fuzzy Inference Rules for Intersection Matching
The last step of the fuzzy inference system is defuzzifying the output classes and associated membership degrees into a crisp value, which indicates the overall quality of the matching. There are several defuzzification methods available, including centroid of the area, mean of maximum, max-membership principal, weighted average, and bisector of the area. Their suitability varies by application. Here, the centroid of the area method is chosen as it is the technique most commonly used in transportation studies and performs well in preliminary evaluation.
By using the fuzzy logic inference technique, each candidate intersection receives a matching score. The one with the highest matching score is chosen as the matching intersection. If there is a tie in the highest score among multiple candidate intersections, the one with the shortest distance to the subject A intersection is chosen.
Link and Direction Matching
Once the matching intersections have been identified, the next step is to determine the correspondence of link sequences and then their comprising links and associated travel directions. Discrepancies in localized topologies between the two networks can lead to different sequence matching scenarios (Figure 7).

Link sequence matching scenarios: (a) one-to-one matching; (b) one-to-many matching; (c) many-to-one matching between undivided roads in both networks; (d) two directions of divided A road matched to undivided B road; and (e) undivided A road matched to two directions of divided B road.
Scenario (a) represents the majority of matching cases where we can directly find the corresponding A and B link sequences between two matched intersections. Scenarios (b) and (c) represent cases where there is no direct correspondence of existing link sequences in two networks as a result of an intersecting roadway being present in one network but not the other. To deal with this, the sequences split by the intersecting roadway are tied into a new link sequence, which is then matched to the one from the other network. Because scenarios (a), (b), and (c) involve undivided roadways only, the association of the opposite travel direction can be obtained using the same link IDs but reversing their direction codes.
Scenarios (d) and (e) result from different coding of divided and undivided highways in two networks. In scenario (d), two A link sequences in separate directions are both matched to the same B link sequence. Since the B link sequence is two-way, the corresponding B links with correct direction can be found for the respective A link sequence by reversing the beginning and end nodes.
Among all scenarios, Scenario (e) presents the most difficult complication, where the roadway is coded as undivided in Network A but divided in Network B. Therefore, two link sequences represent separate directions in Network B. Under this condition, the intersection matching process only pinpoints the end nodes of the B link sequence that are closer to the A link sequence. As a result, only one directional A link sequence is matched with the closer B link sequence while the other direction of the A link sequence has no matching counterpart as the more distant B link sequence is not included.
To address the above issue, a special procedure is applied to locate the B links within a certain buffer of the subject A link sequence and eliminate the B links that have already been matched so that only the links in the missing direction remain in the candidate list. Road name similarity and the angle between candidate B link and subject A link sequence are used to exclude irrelevant links that have been included because of their closeness. The threshold for road name similarity is set at 0.9 and the angle less than 30°, following the suggestion in Walter and Fritsch ( 14 ). The sequence of remaining links is then determined based on the obtained milepoints of their end nodes.
Based on the matched link sequences, the correspondences of links and associated directions are determined. Since A is the target network and B is the matching network, the B link length is adjusted proportionately to match the total length of A links. The implementation procedure is as follows.
Implementation Results and Discussions
Algorithm Execution
We programmed the conflation algorithm using Python open-sourced tools including geopandas, shapely, numpy, and difflib. Geopandas and shapely were employed to perform spatial operations such as retrieving intersection coordinates and establishing link sequences. Numpy was applied to calculate distance based on coordinates of endpoints, while difflib was used to determine string matching score of road names. Network and computational details are presented in Table 2. Network B contains more roads, including private roads, and tends to have finer segmentations, as reflected by directional mileage and number of links. The conflation algorithm produced over 1 million matches of directional links from two networks. Table 3 presents the format of conflation results.
Computational Details of the Conflation Algorithm
The Conflation Output Table
Conflation Results Evaluation
To evaluate the algorithm’s effectiveness, we selected four areas with distinctive geographical characteristics to perform manual verification (Figure 8). The first was a 2.5 mi × 2.5 mi area in downtown Louisville with a very dense roadway network and complex interchanges. The second one was a 2.5 mi × 2.5 mi area in a Lexington suburb that features many curvilinear local roads and cul-de-sacs. The last two were located in the western and eastern rural regions of Kentucky; both had 5 mi × 5 mile areas and are characterized by sparse rural roads.

Four selected areas for quantitative evaluation: (a) Louisville downtown; (b) Lexington suburban; (c) Western rural; (d) Eastern rural.
To evaluate the conflation algorithm’s performance, we used three metrics: accuracy, precision, and recall. All of these are commonly used in machine learning studies. Accuracy measures the overall ability of the algorithm to find correct matches if correspondences exist and to not force matches if links are only in Network A but not Network B (Equation 3). Precision is the percentage of correct matches out of all matches generated by the conflation algorithm (Equation 4). Recall is the percentage of correct matches out of all the matches that should be generated by the conflation algorithm (Equation 5). In each equation, a true positive is recorded when the conflation algorithm identifies the correct link and direction correspondences. A true negative is the product of the conflation algorithm correctly finding no match for a link that is only present in Network A. A false positive is the result of the conflation algorithm locating an incorrect link and/or direction correspondence. False negatives arise when the conflation algorithm fails to find a match when a correspondence exists. Given variations in segmentations, the equations use link length instead of link counts to calculate performance metrics.
Table 4 summarizes the matching results for each area, including total directional mileage of roadways and the length of roadways. The small number of false positives and false negatives in all the areas demonstrate the conflation algorithm is effective. The combined accuracy level for all four areas is 99.29%, higher than accuracy levels reported in existing literature. Because the Louisville area has the most complicated network structure, the accuracy level is a bit lower than in other areas, but still reaches 99%. The accuracy level for the other three areas ranges from 99.34% to 99.45%. Precision levels are also highly satisfactory, averaging 99.29%. The recall measures range from 99.96% to 100%, indicating the algorithm is extremely successful in correctly matching corresponding roadway objects.
Performance of the Conflation Algorithm in Four Areas
Spot Checks
Along with quantitative evaluations, we undertook manual spot checks to identify locations whose layout challenges our algorithm. Locations across the state were randomly inspected, with particular attention paid to complicated junction areas where geometric inconsistencies tend to occur.
The following examples showcase locations whose complexity may pose challenges for existing methods but where our algorithm works well. The first case is an interchange area nicknamed “spaghetti junction” where I-64, I-65, and I-71 meet in Louisville (Figure 9). Algorithms that only use geometrical attributes (e.g., distance, angle between lines) would not perform well as a result of the intertwining lines. Our algorithm considers additional information (e.g., topology, road name) and thus overcomes geometrical inconsistencies between the two networks, resulting in a highly accurate conflation. For example, at Location 1, the intersections of A and B are matched correctly despite appearing distant from one another. At Location 2, Intersection 4441546 is correctly matched between the two networks, even though the A intersection is much closer to B Intersection 4388289. With correctly matched intersection pairs, the correspondence of links between two intersections is successfully determined. At Location 3, the correct B link is found, even though the corresponding A link nearly overlaps with an irrelevant B link.

Conflation at a complicated interchange area.
Figure 10 shows an example of conflated road links at a diverging diamond interchange. Even though many diverging/merging points and links from two networks are not aligning well, our algorithm is able to determine the association of these points based on interchange structure and road names and accordingly find matching links.

Conflation at a diverging diamond interchange.
The conflation algorithm also works well at locations with large topological discrepancies. Figure 11a shows that the algorithm correctly matches links in a complex roundabout as well as on its approaches. Figure 11b displays a successful matching case where Network A roads are divided but Network B roads are undivided. Figure 11c provides an example of segments in Network A being undivided while those in B are divided. As the A link sequence is closer to the westbound B link sequence, existing algorithms using geometric and non-spatial characteristics would only match westbound links. Our algorithm successfully matches links in both directions using the procedure designed for link matching scenario (e) in Figure 7.

Conflation at locations with topological discrepancies: (a) a complex roundabout; (b) when a road is divided in network A but undivided in network B; and (c) when a road is undivided in network A but divided in network B.
At a small number of locations, complications arise as a result of topological inconsistencies between two networks that are too difficult for the algorithm to handle. Figure 12a shows a mismatch caused by a ramp splitting into two lines in Network A while remaining as one line at the end of the ramp in Network B. Figure 12b presents another challenge caused by the extra roadway having the same road name and small angle with a mainline link. Although rare, in some cases both networks may not reflect recent roadway alignment changes. Any automated conflation algorithms have difficulty handling these cases, thus requiring manual validation.

Locations where the algorithm gives mismatches: (a) a ramp splitting into two lines in network A; and (b) extra link with same road name in network A.
Conclusions
This paper described an automated process for conflating large-scale directed networks. Conflation processes used in existing transportation applications rely on breaking down links and are unable to effectively match directions. Our algorithm takes advantage of the topological similarity between networks and establishes correspondences between topological elements with a high level of confidence, including intersections and link sequences. Leveraging a diverse set of attributes related to topological, geometrical, and non-spatial characteristics yields considerable gains in effectiveness. A fuzzy logic inference system is incorporated into the conflation process to account for the underlying uncertainty and imprecision in the networks. Based on the intersection matching results, we can accurately determine correspondences of links and associated directions.
Given network complexity, our algorithm achieves satisfactory performance. We analyzed four areas with distinctive geographical characteristics whose road networks encompass over 780 directional miles, finding the algorithm achieves over 99.2% accuracy, 99.2% precision, and 99.9% recall performance. Manual verifications show that the algorithm finds correct matches at complex locations that methods currently or previously in use would struggle with. Conflation errors are mostly restricted to locations where two networks have clear topological inconsistencies (e.g., Figure 11).
The Kentucky Transportation Cabinet is using the conflated network in their statewide network screening and project prioritization process and is in the process of applying it in other applications. We also adapted the conflation algorithm to work with other types of networks for San Francisco County in California ( 29 ). The scripts involved in that work were made public at https://github.com/sfcta. The conflation methodology put forward here will help agencies overcome major challenges associated with integrating third-party data into agency databases.
Footnotes
Acknowledgements
The authors would like to thank Daniel Hulker for data support and technical advice and Chris Van Dyke for draft review.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: Zhang; data collection: Zhang, Chen; analysis and interpretation of results: Zhang, Chen; draft manuscript preparation: Zhang. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is made possible by the funding from the Kentucky Transportation Cabinet.
Data Accessibility Statement
The data used to support the findings of this study have not been made available because of the restriction of data use agreement.
