Abstract
In an era of data-driven smart cities, the possibility of using crowdsourced big data to support evidence-based planning and decision-making remains a challenge. Along with the increased availability and potential utility of crowdsourced data, there is a clear need to assess the validity of these data in order to determine their appropriate use for planning and management. Moreover, with growth and rapid urbanization in many cities, there are increasing challenges associated with urban mobility. The goal of this research is to develop an understanding of the geographical representativeness of crowdsourced data in the context of urban mobility through investigation of bicycling in Australian cities. In order to leverage both the geographic distribution and high volume of crowdsourced data for validity assessment, we present a two-stage statistical approach. First, we evaluate flow data through correlation between spatial interaction matrices in the presence of spatial autocorrelation. The second stage evaluates the quantity of information available within the interaction matrices. The approach is demonstrated with crowdsourced bicycling commuting routes recorded by the RiderLog app from 2010 to 2014 that are then correlated with census bicycling journey to work data. Data are from four of Australia’s state capital cities: Adelaide, Brisbane, Melbourne and Perth. These methods assess the representativeness of individual bicycle routes that address the full pattern of flows within multiorigin multidestination systems and incorporate spatial autocorrelation. Results indicate that these crowdsourced data are geographically representative of regional travel where there are higher data volumes, generally in central business districts and occasionally in outlying areas. This research provides insights into both methods for statistical comparison of flow data and the use of crowdsourced bicycling routes for urban planning and management.
Introduction
The recent emergence of activity tracking smart phone applications (apps) is resulting in a vast increase in human movement data that are changing the landscape of active transportation research (Pettit et al., 2016). Evolving from work based on household travel surveys as well as global positioning system (GPS) recording of bicyclist location and behavior, recent studies utilize smart device captured crowdsourced bicycling route data. Among the advantages of data generated through apps are increased sampling of cyclists, cost savings on GPS hardware, the capture of bicycling route data at a finer scale than traditional data (e.g., household travel surveys) and high data volumes. Such route-level data may help planners and decision makers improve bicycling facilities and safety, raise modal share and help realize the personal fitness and environmental benefits of active transportation. These types of smart device captured data are developing as an important foundation for evidence-based city planning, decision-making and more efficient provision of urban services (Batty, 2013). High volumes of bicycling route data may be used to indicate areas requiring improvements for rider safety (Blanc et al., 2016) as well as bicycling infrastructure planning and provision.
Crowdsourced bicycling data, however, have a number of potential problems. Even with the large volumes of data collected from apps, samples tend to be small compared to general cycling populations (Romanillos et al., 2016). Data are collected from a self-selected sample of individual users, which may result in geographic bias (e.g., Hecht and Stephens, 2014) where bicycling patterns from app data may not be representative of general bicycling patterns in a geographic region. There may also be demographic sampling bias (Blanc et al., 2016) where rider characteristics (for example, age, ethnicity and/or gender) are not representative of the general population of cyclists. Romanillos et al. (2016) identify problems of bias inherent in GPS bicycling data collected from self-selected survey participants, such as over- or under-representation of an age, income or gender grouping. Among their key observations, “there is no reason to believe that either BSP [bike share program] or big app data provide representative samples of a cities’ population of cyclists or potential cyclists” (Romanillos et al., 2016: 127). Conrow et al. (2018) highlight that “groups such as commuters, students, children, and average recreational riders could be missed completely” (p. 21) and that “this is problematic because relying on biased information could lead to increased inequities in transportation planning and policy” (p. 21). The result is that “the use of crowdsourced data for city planning is still hindered by doubts about their accuracy, objectivity and representativeness” (Leao et al., 2019: 486). Conrow et al. (2018) summarize the state of recent scholarship indicating “the analytical potential for understanding bicycling behavior with crowdsourced data remains unknown” (p. 23). They go on to suggest comparison between crowdsourced and conventional data in order to assess similarities and differences.
There is a clear need to assess the validity of crowdsourced bicycling route data in order to determine their appropriate use for planning and management.
The goal of this investigation is to develop a method to assess the extent to which crowdsourced bicycling route data generated with the RiderLog smartphone app are representative of general bicycling patterns in the geographic regions where they are collected. RiderLog is a free Australian smartphone app that collects crowdsourced bicycling journeys. The app records bicyclists’ GPS traces as well as some rider demographic information (e.g., age and gender) and route information (e.g., distance, duration, average speed, top speed). The app is used by people monitoring their physical activity and bicycling performance as well as people who support the goals of the app developer, the Bicycle Network, a not-for-profit organization whose goal is to improve bicycling facilities, bicycling safety and overall bicycle usage (Pettit et al., 2016). In order to estimate the geographic representativeness of crowdsourced bicycling data, we compare RiderLog data with a conventional bicycling data set, i.e., census bicycling journey to work (JTW) data. JTW data record place of residence and place of work aggregated by statistical zones, and main modes of transport used for commuting. JTW contains information on the whole population, but without details of travel routes.
Bicycling as a spatial interaction system
Work on movement data, often referred to as spatial interaction data, has a long history and holds substantial contemporary relevance. Guo (2009) notes that people and information are constantly moving from place to place and these spatial interactions drive both socioeconomic and physical processes. Examples of what Holmes (1978) referred to as spatial interaction systems include travel between origins and destinations as well as messages, mail, phone calls, business transactions or other information flows. The contemporary relevance is highlighted by Batty (2013) who calls for a change in focus for urban research from locations to interactions where place is generated by an aggregation of flows.
There are two primary spatial interaction data structures, origin-destination (OD) matrices and dyadic matrices (Berry, 1968; Holmes, 1978). In OD matrices, rows represent origins, columns represent destinations (or vice versa) and cells of the matrix indicate flow magnitudes. In contrast, a row in a dyadic matrix represents an OD pair along with a measure of interaction (Marble et al., 1997; Yan and Thill, 2009). Dyadic matrices are the most frequently used format for recording spatial interaction systems including the crowdsourced bicycling data analyzed herein.
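The relationship between the two data structures can be illustrated with a minimal sketch. The zone labels and flow counts below are hypothetical and serve only to show that an OD matrix and a dyadic matrix carry the same information in different layouts; the function names are illustrative assumptions, not part of any data set described here.

```python
# Hypothetical zone labels and flow counts, for illustration only.
zones = ["A", "B", "C"]

# OD matrix: rows are origins, columns are destinations,
# and each cell holds the flow magnitude for that pair.
od_matrix = [
    [0, 5, 2],
    [3, 0, 0],
    [1, 4, 0],
]


def od_to_dyads(zones, matrix):
    """Return one (origin, destination, flow) record per non-zero cell."""
    dyads = []
    for i, origin in enumerate(zones):
        for j, destination in enumerate(zones):
            flow = matrix[i][j]
            if flow > 0:
                dyads.append((origin, destination, flow))
    return dyads


def dyads_to_od(zones, dyads):
    """Rebuild the square OD matrix from dyadic records."""
    index = {zone: k for k, zone in enumerate(zones)}
    matrix = [[0] * len(zones) for _ in zones]
    for origin, destination, flow in dyads:
        matrix[index[origin]][index[destination]] += flow
    return matrix


dyads = od_to_dyads(zones, od_matrix)
for record in dyads:
    print(record)
```

The dyadic form stores only observed OD pairs, which may be one reason it suits sparse flow systems in which most zone pairs have no recorded trips.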
The goal of a validation process is to determine if a set of observations are representative of an overall system and if the degree of representativeness is high enough to be considered reliable. For bicycling data, “if the sample is valid, it means that it can be used for city planning and management of bicycling infrastructure as an accurate representation of the entire population” (Leao et al., 2019: 487). If crowdsourced data are representative, then they are a low-cost and high spatial and temporal resolution data source for bicycle transportation planning. If the data are not valid, they are not a reasonable representation of the real world and this limits their utility to support research, planning and policy development. In this case the direction of bias, details of over- and under-representation of certain regions or groups, should be made clear to users. This information should be carefully taken into consideration in any analysis of the data (Leao et al., 2017).
Early approaches to research on movement in geographic space indicate an emphasis on aggregate behavior patterns rather than disaggregated and individual space/time trajectories (Haggett et al., 1977). Early work such as Garrison (1959) and Nystuen and Dacey (1961) was strongly influenced by Christaller’s central place theory, which, along with data availability, resulted in a focus on intercity and regional scales rather than intra-city and more localized scales. Analysis of flows across geographical space quickly turned toward spatial interaction modeling, defined as “the mathematical modelling of movements over space” (Fotheringham et al., 2007: 214). Haggett et al. (1977) note the eighteenth century use of Newtonian analogies to develop location models of spatial flow. They also note the change from gravity models as mathematical representations of spatial interaction to the reconfiguration of spatial flow as statistical mechanics, as exemplified by Wilson (1967, 1975), where an issue was identifying the flow matrix (overall OD data set) that could contain the largest number of data records of moving individuals. Yan and Thill (2009) observe that exploratory analysis of spatial interaction data has not attracted much attention. The focus on modeling rather than analysis continues to the present day.
Costanzo and Gale (1984) highlight the search for objective and statistical methods for flow data comparison. They argue there is a paucity of methods for the statistical comparison of flow data and neglect of the topic in geographical research. They develop a similarity surface based on correlations between wind flow data at different locations. Their analysis is limited to consideration of single flows from individual locations and focuses on direction to the exclusion of magnitude of flow. This contrasts with opportunities presented by today’s large data sets that may include multiple flows from each source location (origin) as well as multiple flows to each destination. Costanzo and Gale (1984) also note the weakness of statistical approaches focusing on issues associated with tests of statistical significance. They argue directional information is more difficult to compare than other mapped data and note the problematic nature of assuming independent observations with geographic data and acknowledge but do not address spatial autocorrelation. To overcome the identified weaknesses of their statistical test, they suggest coupling statistical significance tests with mapping and direct visual comparison (Costanzo and Gale, 1984, 1985).
Hanson et al. (1992) also explore the utility of correlation for spatial interaction data. They too note the lack of research in the field. The methods presented by Hanson et al. (1992) to assess correlation between vector data sets aim to accommodate translation, scaling, rotation and reflection. Considering place-to-place flows across geographical space and retaining high correlations in the face of these transformations would appear to provide misleading results. Hanson et al. (1992) incorporate magnitude, but still only address one vector per location and acknowledge but do not computationally account for spatial autocorrelation.
Murray et al. (2012) investigate approaches for exploratory analysis and statistical testing of pattern significance for movement data. They note the use of cartography-based methods for exploring spatial interaction data and highlight the need for statistical methods of pattern comparison. They suggest extending Hanson et al. (1992) to include correlation that captures local heterogeneity of data rather than simply global values. They propose a goodness of fit measure that allows significance testing based on differences between observed and expected standardized vectors within a radial grid.
There are several recent papers relevant to validating bicycling and other crowdsourced data. Romanillos et al. (2016) highlight the potential of using journey data as validation data sets for other bicycling data. Hecht and Stephens (2014) investigate correlations between the urban percentage population within counties (USA) and properties of volunteered geographic information sourced from Flickr, Foursquare and Twitter. They compensate for spatial autocorrelation using the method of Clifford et al. (1989) to calculate correlation coefficients for spatial processes.
Blanc et al. (2016) assess the representativeness of crowdsourced bicycling data sets using chi-square and z-tests to contrast data collected through mobile apps with travel survey data sets for a number of North American cities. They highlight the potential benefits of the large sample size and high resolution data generated from smartphone apps. Their analysis focused on rider characteristics such as gender, age and income rather than route locations. Self-selection of those who contribute to crowdsourced data is identified as a potential source of bias. Their analysis was done at the city level and their statistical analysis did not incorporate spatial effects.
Leao and Pettit (2017) use an agent-based modeling (ABM) approach to RiderLog app data validation. They model shortest paths and compare those paths with the actual routes of 16 cyclists. They found 50% of app generated routes are nearly identical to ABM shortest paths. They conclude their simulation was able to correctly or partially model over two-thirds of the RiderLog bicycling routes and that more factors could be included to improve modeling accuracy.
Leao et al. (2017) highlight the potential of RiderLog bicycling data as an alternative or supplement to Australian Census JTW data. Their aim was to identify the extent to which crowdsourced bicycling data are representative of an entire bicycling population. They use an agent-based model to transform JTW data to bicycling routes that are comparable with app generated bicycling routes. Findings indicated a very strong correlation in distance ridden between RiderLog and agent-based bicycling tracks generated from JTW data. However, this approach has only been tested on a small sample of data in eastern Sydney.
Leao et al. (2019) develop and test five criteria for validating crowdsourced bicycling data: geographic coverage, OD match, demographic match, distance–duration distributions and route match. They convert JTW data to RiderLog format using an agent-based model. The study is limited to one destination area and their correlation calculations do not incorporate spatial effects. It is questionable whether the measure of geographic coverage and measures of OD match are scalable from the multiple origin single destination system analyzed to a multiple origin multiple destination system that more closely captures an urban system.
Conrow et al. (2018) also worked in greater Sydney using local indicators of spatial association to assess similarities and differences between crowdsourced Strava data and manual bicycling counts. They found that socioeconomic status and presence of bicycling infrastructure influenced ridership patterns among data sets. Despite geographic differences in overall ridership proportions, similarities were found between the data sets they compared (Strava and Super Tuesday manual count). Both data sets had elevated ridership near the central city, which was expected considering prior research indicating that conventional and crowdsourced data align in urban downtowns.
Research focusing on the quantitative comparison of flow data is underdeveloped. The history of work with spatial interaction data has been oriented toward modeling rather than analysis. Among the works directly addressing the comparison of flow data sets, researchers consistently confront a paucity of research methods and limitations of statistical approaches. As outlined above, much of the previous work does not account for multiple flows, spatial autocorrelation (Costanzo and Gale, 1984; Hanson et al., 1992), or pattern comparison and local heterogeneity (Hanson et al., 1992; Murray et al., 2012). Similarly, validation of crowdsourced movement data is an emerging research area. While the studies highlighted above compared crowdsourced bicycling data to authoritative data sets, they had notable limitations. Due to both methodological and geographical limitations, none of the bicycling studies highlighted herein addressed whether a crowdsourced sample of bicycling routes is representative of the geographic distribution of the bicycling routes used by a general population. The goal of this research is to develop a method to assess the geographic representativeness in a crowdsourced human movement data set. In so doing, we find spatial interaction analysis to be a potentially fruitful new direction. As highlighted by Yan and Thill (2009), today’s large data sets and increases in computing power offer opportunities for data-driven rather than mathematical modeling research as well as opportunities to better understand spatial interaction systems. Data volume disrupts the traditional notion of sample and population when comparing crowdsourced data with census data. Although census data are definitive for a population, the opportunity for a greater number of comparable observations with a crowdsourced data set suggests that crowdsourced data have the potential to be a better representation of the population’s mobility. 
In order to explore this opportunity, we take a two-stage approach involving initial statistical evaluation of flow data through spatial correlation of interaction matrices measuring a crowdsourced flow data sample against an available population. Then, following Costanzo and Gale (1984, 1985), we augment the correlation analysis with consideration of data volume and appropriate statistical tests for validation.
Data and data processing
The crowdsourced bicycling data used in this investigation are processed and cleaned RiderLog data from June 2010 through May 2014 for three state capital cities in Australia (Adelaide, South Australia; Brisbane, Queensland; and Perth, Western Australia) and from June 2010 through December 2012 for Melbourne, Victoria. These cities offered the largest number of observations of processed and cleaned raw data from the set of all potential study areas. Leao et al. (2017) present detailed data cleaning methods and a full data description. Data cleaning steps and analysis methods are presented graphically in Figure 1.

Figure 1. Data processing and analysis methods. ABS: Australian Bureau of Statistics; JTW: journey to work; SA2: Statistical Area Level 2.
The population data with which RiderLog data are contrasted are bicycle JTW data from the 2011 Australian Census at the Statistical Area Level 2 (SA2) census geography. In urban areas, the SA2 census geography is analogous to a district or small suburb. Based on population, SA2s are roughly comparable to US census tracts and UK census middle layer super output areas. While there are no spatially explicit and complete data sets of the routes used by the bicycling population in Australia, the JTW data set provides commuting flows across regions through a spatial interaction matrix. The Australian census is the official count of population and population characteristics, providing the most complete view of the nation (Australian Bureau of Statistics (ABS), 2016). These census data represent the general population and are, therefore, suitable for statistical comparison and validation of a data sample.
JTW data are limited, however, in that they only contain origins and destinations for trips rather than the full routes needed to examine features of the infrastructure and neighborhood contexts that influence bicycling activity. In addition, no details on trip distance, time or specific itinerary are available in the JTW data. This limits the utility of these data for planning and management of bicycling infrastructure (Leao et al., 2019). If the spatial distribution of crowdsourced RiderLog bicycling routes is a statistically valid representation of ABS JTW bicycling data, RiderLog data may be used to address bicycling use and infrastructure at a finer scale than is possible with JTW data.
ABS bicycle JTW data are dyadic spatial interaction matrices where each row represents an OD pair. The files contain three data attributes: SA2 origin number, SA2 destination number and the number of employed persons who traveled by bicycle to work between the OD pair. Data were formatted with an R script which completes missing SA2 names, removes observations with zero values and links the correct SA2 names with the SA2 number for both origins and destinations.
In order to compare RiderLog and JTW data, it is necessary to convert the data to comparable matrices. GIS software was used to convert RiderLog data to a dyadic matrix by first selecting and exporting the commuter trips from the overall RiderLog data set, linking origin and destination vertices of the routes with SA2 centroids, then converting these origin and destination points to line data. Dissolving by location sums the number of commuters for each SA2-based dyad. RiderLog route lines and the RiderLog dyadic matrix lines for Melbourne are presented in Figure 2(a) and (b). Converting JTW dyadic matrices from tabular to geographic data also required a series of spatial processing steps. First, latitude and longitude (based on SA2 centroids) were added to the tabular data for each origin and destination record. These origin and destination points were then converted to lines. The resulting ABS 2011 Bicycle JTW dyadic matrix map for Melbourne is presented in Figure 2(c).
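The conversion of individual routes to a dyadic matrix can be sketched as follows. This is an illustrative approximation, not the GIS workflow used in the study: it assumes planar coordinates and links each route end point to the nearest centroid by straight-line distance, whereas the actual linking rule may differ. All zone names, coordinates and trips are hypothetical.

```python
from collections import Counter
import math

# Hypothetical SA2 centroids in planar coordinates (not real ABS data).
centroids = {
    "SA2_1": (0.0, 0.0),
    "SA2_2": (10.0, 0.0),
    "SA2_3": (0.0, 10.0),
}

# Hypothetical commuter trips: (start point, end point) of each route.
trips = [
    ((1.0, 1.0), (9.0, 1.0)),
    ((0.5, -0.5), (9.5, 0.5)),
    ((1.0, 9.0), (8.0, 0.0)),
    ((9.0, 0.0), (0.0, 1.0)),
]


def nearest_zone(point, centroids):
    """Assumed linking rule: pick the zone whose centroid is closest."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


def trips_to_dyads(trips, centroids):
    """Count trips per (origin zone, destination zone) pair.

    This plays the role of the 'dissolve by location' step: trips that
    share the same origin and destination zones are summed into one dyad.
    """
    counts = Counter()
    for start, end in trips:
        origin = nearest_zone(start, centroids)
        destination = nearest_zone(end, centroids)
        counts[(origin, destination)] += 1
    return dict(counts)


dyads = trips_to_dyads(trips, centroids)
for (origin, destination), riders in sorted(dyads.items()):
    print(origin, destination, riders)
```

Once both data sets are expressed as counts per SA2 pair, they share a common structure and can be compared record by record.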

Figure 2. (a) RiderLog commuter routes, (b) RiderLog commuter dyads and (c) ABS Journey to work data in Melbourne, Victoria. JTW: journey to work; SA2: Statistical Area Level 2; SOS: Section of State.
Melbourne, Australia’s second largest city (population 4,485,211; ABS, 2017), has a number of factors that facilitate cycling including “favorable topography, climate, and road network as well as more supportive public policies … integrated cycling infrastructure as well as more extensive cycling programs, advocacy, and promotional events” (Pucher et al., 2011: 332). Pucher et al. (2011) indicate high (relative to Sydney) use of cycling routes into the central business district (CBD) as well as high and increasing commuter cycling. They found more cycling in the city center and close-in suburbs than in the outer suburbs. Pucher et al. (2011) suggest a number of probable reasons for increased bicycling mode share in inner areas including higher density, mixed land uses, shorter trip distances and proximity to jobs, extensive bicycling facilities, travel time and speed benefits where there is automobile traffic congestion as well as limited and expensive parking. High volumes of commuter bicycling in inner areas are easily visible in Figure 2.
Adelaide has a population of 1,295,714 (ABS, 2017). The spatial distribution of RiderLog and ABS JTW data in Adelaide indicate greater bicycle commuting activity closer to the CBD and relatively lower bicycle commuting activity in outlying areas. RiderLog commuter routes, commuter dyads and ABS JTW data for Adelaide are presented in Supplemental material 1.
Greater Brisbane, with a population of 2,270,800 (ABS, 2017), is the third largest city in Australia. Corcoran et al. (2014) present an assessment of bicycle JTW trends highlighting proactive local government efforts to encourage active transport and increase bicycling participation. Improved bicycling infrastructure is indicated as contributing toward increased bicycling mode share. In the mid-1980s, with no off-road cycling infrastructure, cycling mode share was less than 1% in almost all suburbs and a maximum of 2% in two suburbs north and east of the CBD. By 2006 construction of 75 km of off-road bicycling infrastructure coincided with much higher bicycling mode share (up to 10% in two suburbs south of the CBD) and much wider distribution of bicycling mode share (over 2% across much of the city) (Transport and Main Roads, 2011). RiderLog commuter routes, commuter dyads and ABS JTW data for Brisbane are presented in Supplemental material 1.
Perth has a population of 1,943,858 (ABS, 2017). The distribution of RiderLog commuter routes in greater Perth is largely congruent with the observations of Perkins and Blake (2016). Commuting cyclists in the Perth area heavily utilize transport corridors north, east and to a lesser extent (with the RiderLog data) south of the CBD. Also similar to Perkins and Blake (2016), RiderLog data show heavy use of the roads and bicycling infrastructure surrounding and crossing the Swan River. Furthermore, the RiderLog commuter dyads are congruent with the observations of Perkins and Blake (2016), i.e., the greatest numbers of commuting cyclists in the Perth area are going to and from the CBD. RiderLog commuter routes, commuter dyads and ABS JTW data for Perth are presented in Supplemental material 1.
Statistical methods
Nystuen and Dacey (1961) have established a cornerstone of analytical methods for spatial interaction analysis, noting flow representation may be anchored cartographically with point locations of linked objects (nodes). Developing a systematic statistical approach first requires further data disaggregation in order to facilitate direct quantitative comparison that incorporates both location and magnitude of flow. Flow analysis may be based on each origin, or, separately, each destination within a system. We follow this systematic approach for enabling quantitative comparison by disaggregating both RiderLog and ABS JTW vectors by SA2 destination. Figure 3 presents an example of the disaggregated RiderLog dyads and JTW data where the destination is the Perth City SA2. Converting flow data to vector data at an identical comparable scale enables statistical analysis including comparing survey and population data sets.

Figure 3. RiderLog dyads and ABS 2011 bicycle JTW for the Perth City SA2 destination. SA2: Statistical Area Level 2; SOS: Section of State.
Correlation between RiderLog and JTW data based on both number of riders and location was calculated for the set of all origins associated with each destination. For example, the basis of the calculation for the Perth City SA2 is the set of origins shown in Figure 3. The correlation calculation includes the RiderLog routes where the destination is Perth City, the number of riders on those routes, the JTW data where the destination is Perth City and the number of riders on those routes. Statistical calculations employed Dutilleul et al.’s (1993) method for adjusting p-values for the calculation of correlation in the presence of spatial autocorrelation. Calculations were performed in R with the SpatialPack package (Osorio et al., 2016; Supplemental material 2).
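The per-destination calculation described above can be sketched in simplified form. The sketch below computes only an unadjusted Pearson correlation coefficient for each destination; it does not implement the Dutilleul et al. (1993) adjustment for spatial autocorrelation and reports no p-values. It also assumes that an origin missing from one data set contributes a flow of zero, which is an illustrative choice rather than a documented step of the study. All flow values and zone names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Unadjusted Pearson correlation; None when it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def correlation_by_destination(sample, population):
    """Correlate sample and population flows across the origins of each destination.

    Both inputs map (origin, destination) to a rider count.
    Assumption: an origin absent from one data set has a flow of zero.
    """
    destinations = {d for _, d in sample} | {d for _, d in population}
    results = {}
    for dest in sorted(destinations):
        origins = sorted(
            {o for o, d in sample if d == dest}
            | {o for o, d in population if d == dest}
        )
        x = [sample.get((o, dest), 0) for o in origins]
        y = [population.get((o, dest), 0) for o in origins]
        results[dest] = pearson_r(x, y)
    return results


# Hypothetical dyadic flows (not real RiderLog or JTW values).
riderlog = {("A", "CBD"): 4, ("B", "CBD"): 2, ("C", "CBD"): 1, ("A", "X"): 1}
jtw = {("A", "CBD"): 40, ("B", "CBD"): 20, ("C", "CBD"): 10, ("B", "X"): 3}

correlations = correlation_by_destination(riderlog, jtw)
for dest, r in correlations.items():
    print(dest, r)
```

In this toy example the flows into the hypothetical destination "CBD" are proportional across the two data sets, while the flows into "X" have opposite high and low values and are therefore negatively correlated.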
The advantage of the spatial correlation procedure is that it incorporates spatial distribution and effectively eliminates much of the influence of high observation outliers, where numerous rides were conducted by one person. However, spatial correlation does not fully take advantage of the data volume offered by RiderLog. Following Costanzo and Gale (1984, 1985), who suggest augmenting identified limitations of statistical tests with mapping and visual comparison, this research augments spatial correlation of flow data with additional factors in order to more fully evaluate the representativeness of RiderLog data. These additional factors are developed by (1) designating some SA2s as overweight through quantity comparison of RiderLog and JTW route numbers, (2) developing a second grouping by separating observations based on the Central Limit Theorem (CLT) (Rogerson, 2014) and (3) establishing a third grouping labeled “information deficit” for those observations that are not part of the first two groups.
Data observations included in the first group, “overweight,” are determined by selecting destination SA2s with more RiderLog observations than JTW observations. The rationale is that where the size of the sample (RiderLog data) is greater than the size of the population (JTW), the sample/population relationship is disrupted and effectively reversed. These observations may be deemed to be representative, indicating a different pattern of spatial representativeness. The next step is to identify the destination SA2s where there are more JTW observations than RiderLog observations and to select from those the records with 30 or more RiderLog observations (n ≥ 30) based on the CLT. For these selected observations, the CLT allows the data to meet as closely as possible the required assumptions for the calculation of Pearson’s correlation coefficients. The remaining records, i.e., those where there are more JTW observations than RiderLog observations and fewer than 30 observations for each destination SA2, were designated as “information deficit.” Information deficit is thus based both on the small number of observations (n) and on the fact that within this category the number of RiderLog routes is less than the number of JTW observations. The number of observations (n) is the number of observations in each correlation calculation and the number of origins associated with each SA2 destination. For each study area, similarity and dissimilarity between these groups are assessed using analysis of variance and t tests.
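The three-way grouping can be expressed as a simple decision rule. The sketch below is illustrative only: the function and parameter names are hypothetical, n is taken to be the number of observations entering the correlation calculation for a destination, and destinations with equal RiderLog and JTW counts are assumed not to be overweight, a case the grouping rules do not explicitly address.

```python
CLT_THRESHOLD = 30  # n >= 30, following the CLT-based rule in the text


def classify_destination(riderlog_obs, jtw_obs, n):
    """Assign a destination SA2 to one of the three information groups.

    riderlog_obs: number of RiderLog observations for the destination
    jtw_obs: number of JTW observations for the destination
    n: number of observations in the correlation calculation
    Assumption: ties (riderlog_obs == jtw_obs) are not overweight.
    """
    if riderlog_obs > jtw_obs:
        return "overweight"
    if n >= CLT_THRESHOLD:
        return "CLT"
    return "information deficit"


# Hypothetical destinations: (RiderLog observations, JTW observations, n).
examples = {
    "dest_1": (50, 20, 10),
    "dest_2": (40, 100, 35),
    "dest_3": (5, 100, 12),
}

groups = {name: classify_destination(*values) for name, values in examples.items()}
for name, group in groups.items():
    print(name, group)
```

Checking the overweight condition first mirrors the order in which the groups are defined, so a destination with more RiderLog than JTW observations is placed in the overweight group regardless of n.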
Results and discussion
Correlation results for the four capital city study areas are presented in Figure 4. The darkest shading indicates higher correlation between RiderLog and JTW data. The statistical significance of the correlations is presented for those SA2s where p < 0.05 (red) and p < 0.1 (blue). Negative correlations arise when corresponding RiderLog and JTW data records have opposite high and low values. Results confirm urban bias in the RiderLog data as few SA2s outside major urban areas in any of the four study areas contain many recorded routes. This is somewhat expected as the RiderLog app was developed with commuters in mind rather than those bicyclists undertaking longer distance recreational rides.

Figure 4. RiderLog dyad and ABS 2011 Bicycle JTW Pearson’s r correlations and significance (a to d) and geographical validation of RiderLog routes (e to h).
Results for Adelaide (Figure 4(a)) show the highest correlations in the CBD and the Toorak Gardens SA2 just southeast of the CBD. Statistically significant correlations are primarily found in a discontinuous line of SA2s running west to east just north of the CBD. The highest correlation in the Brisbane (Figure 4(b)) area is in the Strathpine SA2 northwest of the CBD. There are relatively high correlations in Geebung and Chermside north of the CBD and in South Brisbane and the West End, just across the river to the southwest of the CBD. Statistically significant correlations are mostly found close to the Brisbane River between the CBD and the coast. Results for Melbourne (Figure 4(c)) show higher correlations in the middle ring and outlying suburbs and statistically significant correlations concentrated in the inner city area. Higher correlations in Melbourne tend to overlap higher population SA2s. The results for Perth (Figure 4(d)) show SA2s along the Swan River have the highest correlations. Statistically significant correlations are found along the Swan River and both north and south of the river near the Indian Ocean.
Those observations with high correlations may be viewed as representative of general bicycling patterns. The correlation threshold used to establish representativeness may vary for different communities and different applications. Strong (>0.6), relatively strong (>0.4 to 0.6) and even moderate (>0.2 to 0.4) correlations, when considered with significance levels, allow appropriate assessments of confidence for use in divergent circumstances. Results for Adelaide include one SA2 with moderate and two with relatively strong correlations. Results for Brisbane indicate three SA2s with moderate, two with relatively strong and three with strong correlations. Results for Melbourne show 17 SA2s with moderate, 14 with relatively strong and 16 with strong correlations. Results for Perth show one SA2 with moderate, three with relatively strong and one with strong correlation.
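The correlation bands above can be expressed as a short classification rule. The following is a sketch only: the function name and the "below moderate" label for values of 0.2 or less are assumptions introduced for illustration, and significance levels would still need to be considered alongside the band.

```python
def correlation_band(r):
    """Map a correlation to the bands used in the text.

    Bands: strong (>0.6), relatively strong (>0.4 to 0.6),
    moderate (>0.2 to 0.4). "below moderate" is a hypothetical
    label for everything else, including negative correlations.
    """
    if r > 0.6:
        return "strong"
    if r > 0.4:
        return "relatively strong"
    if r > 0.2:
        return "moderate"
    return "below moderate"
```

Because each band is open at its lower bound, a value of exactly 0.6 falls in the relatively strong band and a value of exactly 0.4 in the moderate band.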
For the remaining destination SA2s, data volume may be leveraged to contribute to the overall assessment of the representativeness of RiderLog data. Scatterplots of the number of observations and correlation results for the overweight, CLT and information deficit data groups for each study area are presented in Figure 5. The scatterplots present a pattern where CLT results are associated with a higher number of observations (n) and generally higher correlations than the overweight and information deficit groups, which suggests the need for further evaluation.

Figure 5. Quantity of information analysis. CLT: Central Limit Theorem.
The similarities and differences between data groups in the four study areas are presented in Table 1. For Adelaide, the amount of information in the overweight category is not statistically different from the information available in the information deficit category. Neither category of information is, therefore, considered representative. For Brisbane and Perth, the information available within the overweight groups is statistically different from and higher than the information available in the CLT and information deficit groups. For both of these cities, the overweight and higher correlation observations may be viewed as representative. In Melbourne, the overweight group is statistically similar to the CLT group and, therefore, both the overweight and higher correlation CLT observations may be viewed as representative. The difference between the Melbourne study area and the Brisbane and Perth study areas is that there are more higher-correlation CLT observations in Melbourne; those CLT observations as a group track more closely with the overweight group than in Brisbane or Perth.
Table 1. Statistical similarity and dissimilarity summary.
CLT: Central Limit Theorem.
Where sufficiently representative, RiderLog routes enable bicycle infrastructure planning and management (Leao et al., 2017) at a finer scale than the regional travel patterns indicated by JTW data. Figure 4 presents the individual RiderLog routes determined to be representative and not representative through spatial correlation and additional assessment based on data volume using a correlation threshold of approximately 0.29 for Melbourne (Figure 4(g)), 0.27 for Adelaide (Figure 4(e)) and 0.20 for Brisbane (Figure 4(f)) and Perth (Figure 4(h)).
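The city-specific thresholds above can be applied to individual routes along the following lines. This sketch assumes each route inherits the correlation of its destination SA2 and that a route is flagged when that correlation meets or exceeds the city threshold; it omits the additional assessment based on data volume. The route fields, function name and the treatment of the boundary case are hypothetical.

```python
# Approximate thresholds as reported in the text.
CITY_THRESHOLDS = {
    "Melbourne": 0.29,
    "Adelaide": 0.27,
    "Brisbane": 0.20,
    "Perth": 0.20,
}


def flag_routes(routes, sa2_correlation, city):
    """Return copies of routes with a boolean 'representative' field.

    routes: list of dicts with a 'destination' key (SA2 name).
    sa2_correlation: dict mapping destination SA2 -> correlation.
    Routes whose destination has no correlation are flagged False.
    """
    threshold = CITY_THRESHOLDS[city]
    flagged = []
    for route in routes:
        r = sa2_correlation.get(route["destination"])
        representative = r is not None and r >= threshold
        flagged.append({**route, "representative": representative})
    return flagged
```

Keeping the thresholds in a single mapping makes it straightforward to rerun the flagging under a different threshold for a different community or application.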
Conclusions
The goal of this research was to develop a method to assess the geographic representativeness of crowdsourced human movement data. We undertook a two-stage statistical approach. The first stage was a novel statistical comparison of flow matrices through spatial correlation. This extended limited previous work in quantitative evaluation of flow matrices to include multiple origins and multiple destinations while accounting for spatial autocorrelation. The second stage leveraged the high data volumes associated with crowdsourced data to augment the correlation-based assessment through statistical evaluation of the quantity of information available within the interaction matrices. Broadly, the techniques extend methods available for quantitative comparison of flow data, especially with respect to crowdsourced bicycling route data, and advance discussions of the limitations of statistical approaches to evaluating flow data.
Results are presented for four state capital areas in Australia. Individual RiderLog routes are determined to be representative or not representative. Overall, RiderLog data are found to be valid representations of JTW data where there are higher data volumes, generally in central business districts and occasionally in more outlying areas. Although the spatial pattern of representativeness varies from city to city, areas of high correlation are generally more urban. These observations are congruent with Hecht and Stephens’ (2014) findings that volunteered geographic information tends to have greater presence in urban areas, and with Conrow et al. (2018), who found increased levels of bicycling in urban downtowns and central business districts.
The combination of spatially explicit data on route location linked with an assessment of the geographic representativeness of each route creates new opportunities for understanding urban systems. High levels of representativeness may increase planners’ and decision makers’ confidence to incorporate crowdsourced data in evidence-based urban planning and management (Leao et al., 2019). Where crowdsourced route data are representative of regional movement, they also contribute to the general goals of active transportation research. Planning and management may be based on these crowdsourced data with the knowledge that they represent both finer-scale data on human movement and broader-scale data about how people are moving through the built environment. These data can lead to a deeper understanding of bicycle behavior, mobility and urban systems as well as contribute to evidence-based planning that may help improve bicycling infrastructure, facilities and safety, raise modal share and help to bring about the personal fitness and environmental benefits of active transportation.
Within those areas where RiderLog data are not representative, planners may be made aware of the limitations of these crowdsourced data so they can account for bias and representativeness issues in the decision-making process. Visualizing routes determined to be representative (Figure 4) is one way to communicate this to planners and other end users of the data. Areas where data are not representative may be augmented with additional data sets and/or manual counts as a time- and cost-effective way of extending the spatial extent of representative crowdsourced data. There is also the possibility, noted by Blanc et al. (2016), to use these results in conjunction with app developers and promoters to reach out to under-sampled geographic areas and populations with targeted promotional campaigns to increase app usage.
A limitation of this approach is that it does not perform well when assessing correlation where there are few origin SA2s. OD pairs with only one route are typically dropped from results as there is insufficient information to calculate a meaningful correlation. The technique itself has an urban bias: performance improves when there are more SA2 origins for each destination-based correlation calculation. Time and logical consistency are also complicating factors. The Australian census represents a data snapshot from 9 August 2011. Although these data are considered to be the definitive record of the population until the release of the next census, a one-day snapshot correlates differently with different temporal slices of RiderLog data in what is essentially a modifiable temporal unit problem (Çöltekin et al., 2011). Temporal issues and appropriate adjustments are worthy of further investigation. The RiderLog data have a more robust temporal range and they may, therefore, be useful in understanding how JTW data can be imputed to reflect diurnal variations in ridership volumes.
This research may be expanded and refined in a number of areas. An obvious route for future research is to investigate other geographic areas. This analysis may be updated using both more temporally expansive RiderLog data sets and temporally corresponding updated census data. Extensions of this research may also provide insights into who app users are including cohorts within the data, for example, female versus male cyclists, which could also help with promotional campaigns. Data validation based on the geography of ride location and rider counts presented here may be extended to include demographic information such as rider gender. Rider gender is an attribute in both the RiderLog data and the ABS JTW data. This may help with the criticism observed by Fotheringham et al. (2007) that spatial modeling is inherently flawed as data and methods typically omit individual preferences. Well-attributed crowdsourced and big data hold the potential to overcome this issue. This is especially true for spatial interaction analysis. While spatial interaction modeling often buries individual preferences within equations representing aggregate systems, spatial interaction analysis highlights local variance, naturally leading to the question, “Why?” Additional data sources, such as Strava, may be used to further triangulate representativeness.
Another avenue for research is to identify areas over- and under-sampled by RiderLog. Routes in under-sampled areas may be weighted by population data or, as illustrated in Leao et al. (2017), ABM may be used to generate routes. With JTW data, the numbers of travelling cyclists are as correct as possible but the actual routes are unknown, so generating valid routes based on the data would improve its use in planning and policy decisions. Areas oversampled by RiderLog may be downward weighted for certain observations; for example, instances of an individual making the same journey repeatedly may be removed.
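As an illustration of the downward weighting suggested above, the following sketch keeps only the first recorded instance of each rider's journey between a given origin and destination. The field names (`rider_id`, `origin`, `destination`) and the rule of treating journeys with the same rider and OD pair as repeats are assumptions for illustration; the RiderLog schema may differ.

```python
def drop_repeat_journeys(routes):
    """Keep the first route per (rider, origin, destination).

    routes: list of dicts with hypothetical keys 'rider_id',
    'origin' and 'destination'. Input order is preserved and the
    input list is not modified.
    """
    seen = set()
    kept = []
    for route in routes:
        key = (route["rider_id"], route["origin"], route["destination"])
        if key in seen:
            continue  # same rider repeating the same journey
        seen.add(key)
        kept.append(route)
    return kept
```

A softer alternative would be to retain all routes and assign each a weight of one over the number of repeats, which preserves route geometry while limiting the influence of frequent riders.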
Haggett and Chorley (1967) describe the quantitative revolution of the 1950s and 1960s similarly to the way we see today’s era of big data, where an explosion in the size of data matrices requires new methods to “rise above this flood tide of information” (p. 38). The era of big data in the second and third decades of the twenty-first century is a second quantitative revolution analogous to the first quantitative revolution in geography, where data availability inspired new methods. The recent increased availability of crowdsourced, real-time and “big” data capturing human movement requires new methods for data wrangling, processing and developing data into information. This research extends over 50 years of inquiry on location-to-location movement by presenting a method for assessing the geographic representativeness and spatial bias inherent in crowdsourced flow data representing human movement. By describing systems in detail, we can develop a deeper overall understanding of those systems and, ideally, uncover broadly applicable patterns. The utility of this approach is posited by Hanson et al. (1992), who note, “vector correlation should find nearly as many uses within geography as scalar correlation has found” (p. 113).
Footnotes
Acknowledgements
The authors would like to acknowledge Bicycle Network for supplying the RiderLog data and the Australian National Data Service (ANDS) for funding the “High-Value Data Collection Project (2016–2017),” focused on cleaning and validating RiderLog crowdsourced data.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
