Abstract
We model the prehistoric dispersals of two rice varieties, japonica and proto-indica, across Asia using empirical evidence drawn from an archaeobotanical dataset of 400 sites from mainland East, Southeast and South Asia. The approach is based on regression modelling wherein goodness of fit is obtained from log–log quantile regressions of the archaeologically inferred age versus a least-cost distance from the origin(s) of dispersal. The Fast Marching method is used to estimate the least-cost distances based on simple geographical features. We explicitly test three hypotheses for the arrival of japonica rice to India where, it has been proposed, it hybridized with the indigenous proto-indica, subsequently spreading again throughout India. Model selection, based on information criteria, highlights the role of the Inner Asia Mountain Corridor in introducing japonica rice into northeast India, followed closely by a ‘mixed-route’ model, where japonica was also almost simultaneously introduced via Assam, Bangladesh and Myanmar. Finally, we estimate the impact of future archaeological work on model selection, further strengthening the importance of the Inner Asia Mountain Corridor.
Introduction
Rice (Oryza sativa) has played a central role in the agriculture of India, China and Southeast Asia since the Neolithic, and the possibility of more than one geographically distinct origin of rice-based agriculture, in India and China for example, has been debated for some time. There is strong support from archaeology and genetics that rice was domesticated in China, probably in the Yangtze basin, but is unlikely to be the only domestication process or the only initiation of rice cultivation (Fuller et al., 2010; Gross and Zhao, 2014). A structure that is quite historically deep in this species has been recognized for some time in the division of Asian rice into two subspecies: indica and japonica (Chang, 1976; Grist, 1965; Oka, 1988). More recently, genetic research has identified a group of indica-like rice lineages, referred to as aus rices, that appear at least as distinct as the taxonomically recognized subspecies indica and japonica (e.g. Choi et al., 2017; Civáň et al., 2015; Garris et al., 2005; Kwon et al., 2006; McNally et al., 2009; Schatz et al., 2014; Travis et al., 2015; Zhao et al., 2011), but the region of aus origins is poorly constrained by genetics or archaeology. Models of a single origin propose that rice differentiated into indica and japonica early on in the history of domestication (e.g. Chang, 1976). Recent proponents of a single origin hypothesize that differentiation took place due to introgression of local wild rice genetics into domesticated rice as ancestral japonica spread from China across Southeast Asia and South Asia (e.g. Huang et al., 2012; Molina et al., 2011; Vaughan et al., 2008). One piece of evidence often cited in support of this is the presence of shared alleles involved with domestication traits, such as the sh4 mutation involved in non-shattering (Li et al., 2006), and other evidence for selective sweeps in the rice genomes inferred to be connected to domestication (Huang et al., 2012). However, data from chloroplast genomes, which are maternally inherited and not prone to recombination or introgression, have long indicated that indica and japonica derive from distinct wild source populations (e.g. Takahashi et al., 2008; Tang et al., 2004), while evidence across many parts of the nuclear genome also suggests differentiation in advance of domestication (e.g. Choi et al., 2017; Civáň et al., 2015; He et al., 2011; Kovach et al., 2007; Londo et al., 2006; McNally et al., 2009; Ohtsubo et al., 2004; Vitte et al., 2004; Yang et al., 2012;). Such evidence for multiple derivations from the wild must account for shared domestication loci through hybridization after they were selected for during the domestication process, and thus, a history of dispersal of domesticated rice, presumably from China, that came into contact with populations that were ancestral to indica is posited (Choi et al., 2017; Fuller and Qin, 2009). The aim of this study is to constrain when and where domesticated japonica is most likely to have come into contact with the maternal ancestor of indica, which we will refer to as proto-indica.
The proto-indica hypothesis proposes that domesticated japonica rice was introduced into northern India where proto-indica ancestors were under pre-domestication cultivation, and hybridization, followed by back-crossing with the proto-indica parent, transferred several genetic traits of domestication that improved the Indian rice, thus creating what is presently referred to as domesticated indica (Castillo et al., 2016; Fuller, 2011a, 2011b; Fuller et al., 2010). This hypothesis has been developed out of the ‘combination model’ of Sang and Ge (2007) that proposed that two separately domesticated rice lineages later exchanged domestication genes. However, domestication-related genes that have been re-sequenced together with flanking DNA sequences appear to be derived from japonica or from perennial Chinese Oryza rufipogon that was ancestral to japonica. This is clearly the case with the rc mutation that changes the rice pericarp colour from red to white (or pale brown) and is widespread, but not universal in cultivated rice (Sweeney et al., 2007). Among universal domestication genes, we can include prog1 which gives the rice plant a taller, straighter and more erect growth habit (Tan et al., 2008). It also seems likely this was the case with sh4, involved additively in non-shattering but not the exclusive determining mutation (Ishikawa et al., 2010; Zhu et al., 2012), and perhaps qsh3, which appears to be an additional allele for reduction of shattering together with sh4 (Htun et al., 2014). Another key domestication trait with a shared mutation in indica and japonica is the closed panicle trait, encoded by OsLG1, which may have evolved before non-shattering as it increases harvest returns even in wild-type shattering rice (Ishii et al., 2013). More recently LABA1, a gene involved in reducing awn length and awn bristles, has been identified as probably originating in japonica rice and being transferred to produce awnless indica varieties (Hua et al., 2015). A mutation which increases grain size, GS3, was also transferred after selection in japonica to some indica varieties (Takano-Kai et al., 2009). What these genetic factors in domestication indicate is that prior to the hybridization event (or events), proto-indica would have been a much less productive crop. For this reason, the main dispersal of rice throughout India (after 1000 BC) and the rise in sedentary farming villages with rice in their agriculture mix throughout the Ganges plains (mostly 1600–1000 BC) has been postulated to be a consequence of hybridization and the development of fully domesticated indica rice (Fuller, 2011a; Fuller et al., 2010a). This would place the hybridization near the start of the second millennium BC. Interestingly, this is quite close to the estimated coalescence age of ca. 3900 BP for a bottleneck deriving indica rice from japonica in a coalescence model based on diversity in a single nuclear chromosome (Molina et al., 2011). Also in support of an early hybridization is ancient DNA (aDNA) from charred rice grains that indicate a mixture of indica and japonica chloroplast lineages in single rice populations from two sites in India over 2000 years old, while all ancient samples in Southeast Asia of similar age were purely japonica (Castillo et al., 2016).
In this paper, we report how well the proto-indica hypothesis fits the archaeological record of rice and assess alternative dispersal routes by which japonica could have first come into contact with proto-indica rices. This builds on previous spatial modelling of rice dispersal used to define the most probable centres of origin from which Asian rice dispersed (Silva et al., 2015). This used the Fast Marching method (Sethian, 1996; Silva and Steele, 2012, 2014) to estimate the least-cost distances to all archaeological rice occurrences from any of 121 possible points of origin in mainland Asia where rice cultivation is possible. Using quantile regression between the archaeologically inferred ages and the costs distances from the sites to the possible point of origin, a best-fitting centre of domestication, falling between the Lower and Middle Yangtze valley, was identified. Further modelling allowed for dual origins and arrived at a stronger fit for a model with two distinct centres of domestication of rice in the Middle and Lower Yangtze regions. This is therefore in agreement with the inferred dual origins of japonica rice posited by a number of archaeologists (e.g. Crawford, 2012; Deng et al., 2015; Fuller and Qin, 2009; Gross and Zhao, 2014). Nevertheless, the best-fitting regression also revealed clusters of early rice finds that fit poorly with this model, including a number of anomalously early finds of rice in northern India and Pakistan, which could be suggested to fit the hypothesis of a separate proto-indica dispersal of unimproved, cultivated but not domesticated, rice (Silva et al., 2015). This study explicitly models a separate proto-indica origin and explores the best-fitting route and timing for japonica originating in the Yangtze to reach the proto-indica region.
Materials and methods
Materials
Our empirical evidence comes from the Rice Archaeological Database (RAD). The first version of this database was used for a synthesis of rice dispersal by Fuller et al. (2010), a slightly expanded dataset (version 1.1) was used to model the dispersal of rice, land area under wet rice cultivation and associated methane emissions from 5000–1000 BP (Fuller et al., 2011). The present dataset (version 2) was used in a previous analysis of the origins of rice domestication (Silva et al., 2015). The database records sites and chronological phases within sites where rice has been reported, including whether rice was identified from plant macro-remains, phytoliths or impressions in ceramics. Ages are recorded as the start and end date of each phase, and a median age of the phase is then used for analysis. Dating is based on radiocarbon evidence: when available, direct AMS radiocarbon dates on rice are used (53 dates), otherwise associated bulk charcoal radiocarbon dates (158 dates) or, in some cases, based on a cultural association (material culture) that is cross-dated from other sites (251 dates). In addition to recording the presence of rice, evidence for domesticated or wild status is recorded as well as an inference of whether wet or dry cultivation ecologies of rice were employed. Version 2.0 includes 400 sites and a total of 470 phases (some sites have multiple phases). Its spatial distribution, as well as the database itself, is available from a previous publication (Silva et al., 2015).
Modelling framework
Our approach expands on previous efforts to model the geographical origins, and subsequent spread, of japonica rice (Silva et al., 2015). The methodology is based on the explicit modelling of dispersal hypotheses using the Fast Marching algorithm, which computes the cost-distance of an expanding front at each point of a discrete lattice or raster from the source(s) of diffusion (Sethian, 1996; Silva and Steele, 2012, 2014). Sites in the RAD database are then queried for their cost-distance, the distance from the source(s) of dispersal along the cost-surface that represents the hypothesis being modelled (see Connolly and Lake, 2006; Douglas, 1994; Silva et al., 2015; Silva and Steele, 2014 for more on this approach) and, together with the site’s dating, used for regression analysis.
As an archaeological dataset such as RAD includes sites and phases which do not correspond to the first arrival of the dispersing element being studied, regressing to the conditional mean (as in an ordinary least squares regression) is not the most useful or valid approach, particularly when used for inference (see debate in Steele (2010) and Silva and Steele (2014). Quantile regression provides a reassuringly robust alternative as it allows for regression to a low quantile and hence to a better identification of first arrival, without the need to sub-set the data or use an ad hoc weight function (Koenker, 2005). Throughout this paper, we have therefore used quantile regressions to the 10th percentile.
Domestication is a long process, taking 2000 to 3000 years to complete (e.g. Fuller et al., 2009, 2014). It is unlikely that during this period, rice cultivation would have spread that far beyond its centre(s) of innovation, and indeed, there is little to no evidence for this. Rather, one would expect its dispersal to start very slowly, but quickly pick up pace once rice nears full domestication. We therefore expect a non-linear relationship between the age of the archaeological site/phase and its distance from the supposed centre of innovation. Scattergrams show the data to be consistent with a log–log relationship between age and distance (see Figure 1 and discussion in Silva et al., 2015), which involves calculating the natural logarithm of the age and cost distance for each site/phase and conducting the quantile regression on them. When the resulting regressed line is converted back into untransformed space, it produces a logarithmic curve of best fit to the data, such as the blue line in Figure 1.

Scatterplot of the dataset for the model of dispersal of japonica rice from two centres of innovation in the Yangtze basin (adapted from Silva et al., 2015). The blue line corresponds to the best-fitting model, whereas the dots within the yellow box correspond to those Indian sites that are much earlier than predicted by the model.
We have implemented a simple scenario where the domain available for dispersal is restricted only by two ecological factors: regions with a total number of degree days in the year (e.g. Guedes and Butler, 2014; Guedes et al., 2015) lower than 2500 as well as desert regions, where rice cannot grow. In addition, coastal transport was included by buffering out 40 km off-coast, allowing for land separated by a maximum of 80 km to be accessible for dispersal. This is the same scenario and domain used in Silva et al. (2015), where the subset of the Rice Archaeological Dataset that falls within this domain was published as supporting information.
The model previously identified as that which best-fits RAD v2 data is one where rice domestication originated from two independent centres, one in the Middle and another in the Lower Yangtze valley (Fuller, 2011a; Silva et al., 2015). This approach did not explicitly model the spread of proto-indica in the Indo-Gangetic plain; however, the model can be used to predict which database entries would correspond to proto-indica rice: they would be the ones that are older than the expected arrival of japonica in the Indian subcontinent (highlighted by the yellow square in Figure 1).
Modelling the spread of proto-indica rice
To model the spread of proto-indica, we first constructed its domain by identifying the ecoregions where such early instances of rice are found. This was done using the present potential ecoregions within the Terrestrial Ecoregions of the World dataset mentioned above (Olson et al., 2011). The identified regions correspond to the Indo-Gangetic plain but also extend further to the west, where some instances of early archaeological rice are found, as well as to Bangladesh in the east, where there is no such evidence (Figure 2).

Domain of spread for proto-indica rice. The corresponding present potential ecoregions are shown as coloured regions (see legend), whereas sites with early instances of rice, potentially corresponding to the proto-indica subspecies, are represented as black dots. The yellow star marks the site of Lahuradewa.
The spread of proto-indica was modelled on similar lines to that described above but limited to this considerably smaller domain. Its geographical origin was taken as the site of Lahuradewa which has been dated to 7000–3000 BC (Tewari et al., 2006, 2008) and is, therefore, the oldest in this region.
Our starting point was the regression model for japonica rice previously developed and described above (Figure 1). The process then was as follows:
Identify the sites that fall in the Indo-Gangetic domain described above and that are older than the dates predicted by the japonica model.
Do a log–log quantile regression of these sites against a simple cost-distance model for the spread of proto-indica.
Redo the japonica log–log quantile regression without the sites identified in step 1 above.
Repeat steps 1 through 3 until the sum of the log-likelihood of the two regressions is maximized.
This ensures that the split of the RAD data into its proto-indica and japonica components, with associated dispersals, is done is such a way that best-fits both processes (i.e. by maximizing the total likelihood). Each model has a minimum of eight parameters: the three associated with each regression (slope, y-intercept and variance), the speed ratio between the dispersal processes out of the Middle Yangtze and Lower Yangtze and the speed ratio between the proto-indica spread and one of the others, for which the Lower Yangtze process was chosen. However, once japonica and proto-indica met, they hybridized into indica which could then have spread through the Indo-Gangetic plain faster since it was already a region of rice farmers. To account for this, we have added another parameter that controls the speed with which the hybridized variety spread through the Indo-Gangetic plain. A final parameter was added to the models that included an extra corridor (see below) to account for the fact that dispersal through this corridor could be faster (or slower) than in the rest of the domain. These parameters were allowed to fully vary so that, for instance, the speed values, and thus whether the processes were fast or slow, were entirely dictated by the statistical analysis of the data. The best-fitting values for these parameters were estimated using a Genetic Algorithm which was run for 300 generations, with a population of 50 and a mutation rate of 20% (Haupt and Haupt, 2004; Silva and Steele, 2014).
Comparing different dispersal routes
We have tested models based on three hypotheses for the arrival of non-shattering japonica rice in the Indian subcontinent and its subsequent hybridization with native proto-indica rice (Table 1). A ‘southern route’ hypothesis (H1) would have seen the non-shattering variety introduced to northeast India from southwest China via Myanmar, Assam and Bangladesh. This was implicit in our previous modelling efforts (Fuller et al., 2011; Silva et al., 2015).
Literature-based models tested in this paper, with associated references.
Another hypothesis was formulated based on the suggestion of trade routes connecting Central Asia to western China. If trade was the driving force for the introduction of wheat into western China via one of the routes suggested by Barton and An (2014), then it is feasible that rice could have been traded back westwards and eventually introduced to northwest India (Fuller, 2011a; Stevens et al., 2016). Although Barton and An suggested three possible trade routes that circumvent the Tibetan plateau, considering that there is no archaeological evidence for rice in these regions, our modelling and statistical approach would be unable to differentiate between the different routes. Given this, we opted to implement the simplest one in schematic form only, as the others would yield the same results. This corridor was manually drawn and added to the dispersal domain mentioned above, which is available for the Fast Marching algorithm. The ‘Inner Asia Mountain Corridor’ hypothesis (H2) therefore predicts japonica rice to arrive first in northwest India via a route that starts in the Yellow river valley, travels west via the well-known Hexi corridor, then just south of the Inner Asian Mountains and thence to India (see Figure 3).

The three hypotheses used for the spread of rice into India.
During the course of this research, it was found that without imposing a barrier to movement somewhere between southwest China and Assam, a mixed scenario where rice was introduced to India both via the Inner Asia Mountain Corridor and the southern route entry points, roughly at the same time, was often recovered. The presence of such a barrier is suggested by both geographical and ecological features. First, northwestern Yunnan is a region where three of Asia’s greatest rivers run parallel to each other and are separated by the Hengduan mountain range, with peaks over 6000 m high. The succession of high peaks and deep gorges creates isolated pockets which have certainly fostered biological and cultural diversity. The Three Parallel Rivers of Yunnan Protected Areas include about 1.7 ma hectares (less than 0.2% of China’s land area) but contain 20% of China’s higher plant species (McGinley, 2008). This and the broader Yunnan region support high ethnolinguistic diversity (Erard, 2009). Similarly, the hills of northeastern India that separate the Brahmaputra valley from the plains of Myanmar, that is, the ‘Indo-Burma hotspot’, host high biotic and ethnolinguistic diversity (Gorenflo et al., 2012). This region is recognized as a major frontier in both human and biotic diversity and is postulated to have been a major barrier to human dispersal since the Palaeolithic and to have supported lower and more dispersed population densities (e.g. Boivin et al., 2013). This borderland area constituted a major break in human genetic diversity (Metspalu et al., 2004) and food-related traditions: from Assam, westwards dairy products are commonly used (Simoons, 1970), while from Assam, eastwards sticky varieties of rice and millets (with wx mutations) are popular but largely unknown to the west (Fuller and Castillo, 2016; Fuller and Rowlands, 2009; Sakamoto, 1996). This constituted a major fracture zone in the genetic diversity of other major domesticates, including pigs (Larson et al., 2010) and water buffalo (Mishra et al., 2015; Zhang et al., 2011). The highly variegated topography and dense, highly biodiverse forests can be expected to support a high degree of local adaptations and make traversing the region in an east to west direction difficult. Thus, the main axis of rice dispersal has been from China southwards to Southeast Asia, as long recognized in the patterns of archaeology and historical linguistics (e.g. Bellwood, 2011; Fuller, 2011a; Higham, 2014; Sagart, 2011).
A second, complementary point is the presence of aus rice varieties focused in Bangladesh and Assam. Genetically, aus is as distinct from indica and japonica as they are from each other (Civáň et al., 2015; Garris et al., 2005; Schatz et al., 2014; Travis et al., 2015; Zhao et al., 2011). Unlike indica, with which aus varieties were traditionally classified, aus does not share key domestication-related traits with japonica. For example, white pericarp varieties of aus have a unique rc-s mutation, although many varieties have the ancestral red trait (Kovach et al., 2007; Sweeney et al., 2007). The diversity of aus varieties comes from Bangladesh and India (Garris et al., 2005; Zhao et al., 2011), but where more detailed geolocations are available, these are specially from northeastern India, the states of Assam and West Bengal, although extending westwards as rainfed, short growing season rice through the Himalayan foothills. The term ‘aus’ is traditionally used (in the Bengali language) to refer to the short season early monsoon rainfed rice, in contrast to irrigated winter boro rice (also genetically aus group), or the deep water aswina and rayada rices (also aus group) grown in the Ganges–Brahmaputra delta (Travis et al., 2015). There are also very few genotypic japonica varieties found in Assam or West Bengal (Travis et al., 2015). In contrast, less than 3% of rice landraces in Yunnan are attributed to aus/boro varieties whereas >75% of landraces are japonica (Zeng et al., 2007). Thus, it seems unlikely that early domesticated japonica would have dispersed through this region to give domestication-related mutations to indica, such as rc, while different mutations in aus were not transmitted westwards. Varieties of aus found further west in India are likely to be younger, as are those in mainland Southeast Asia. Historical records indicate that aus rices came to China via Vietnam in the early Second Millennium AD, known as the ‘champa’ rices (Barker, 2011). We therefore conclude that it is unlikely that rice cultivation would have dispersed through this region or at least not quickly and easily. To represent this in hypothesis H2, we have imposed a simple geometrical barrier that stops potential east-to-west movement along this region and which is represented in Figure 3.
However, we have also included the emergent ‘mixed-route’ hypothesis (H3) in the modelling to see whether it is supported at all by the archaeological data. This hypothesis, original to this paper, suggests two concurrent introductions of japonica rice into the Indian subcontinent: one into the northwest and another into the northeast (Figure 3), as discussed above.
As models H2 and H3 include a trade route – the corridor – our modelling efforts must acknowledge that dispersal along such a route could occur at a different rate from that in the rest of our domain. To implement this, we have included an extra parameter in these models that effectively controls the friction of these corridors: a low friction yielding a higher spread rate or, alternatively, a higher friction yielding a lower spread rate (see Silva and Steele, 2014 for more on this approach).
Given this, one needs to be careful with our model selection methodology, as models with more parameters will almost always fit the data better. The Akaike Information Criterion (AIC) has been widely applied for model selection and inference and has a solid foundation in information theory and likelihood statistics (Akaike, 1973; Burnham and Anderson, 2002). It penalizes less parsimonious models so that it prefers a best-fitting model that does not do so at the cost of adding extra parameters. The best model is the one with the lowest AIC value, and it is therefore useful to calculate a Δ value by subtracting this lowest value from the AIC values of all models, Δ = AIC − AICmin. Burnham and Anderson (2002: 70–71) have argued that models with Δ values higher than four have very little empirical support in favour of them (see also Edwards, 1992). However, this threshold is not theoretically driven and rather arbitrary so one should treat it as less of an automatic cut-off and more as a value judgement (Anderson, 2008: 90). For this, it helps to calculate wi, the Akaike weights or model probabilities, which provide a measure of support for the different models (Burnham and Anderson, 2002: 74–80). We have used the small sample corrected version for the AIC, often denominated AICc, throughout but, for simplicity, we have followed Burnham and Anderson (2002) in simply naming it AIC.
Estimating impact of future archaeological work on model selection
To estimate how the model selection results would be affected by future targeted archaeological work in the Myanmar/Assam region (see Discussion for the importance of this), we have simulated a number of extra site/phases and recalculated the corresponding AIC values for each model. The location of these spurious sites was randomly sampled from a domain falling between latitudes 15.5ºN–30.5ºN and longitudes 87ºE–101ºE. The simulated age was a combination of the expected age at that site taken from the model corresponding to the underlying assumption with random noise sampled from a Rayleigh distribution with a parameter of 0.05 (on this choice of distribution see Silva and Steele, 2014). The AIC values of each model, assuming the same parameter values, as well as their Δ values were then calculated. This was done 1000 times for each extra site/phase, and the distributions of the resulting Δ values were then plotted as boxplots.
Results
Table 2 shows the statistical results for the hypotheses considered. Included are all quantities used to calculate the AIC values, as well as the Δ values and model probabilities wi, which provide a measure of support for the model, given the data. We also present the previously found best-fitting model, which did not explicitly model the spread of proto-indica, for comparison (here model H0, corresponding to model L7 of Silva et al., 2015). The estimated values for the dispersal parameters – that is, the speed ratios on the different domains – are also given in Table 2.
Results for all considered models.
Estimated parameters are speed ratios with relation to the Lower Yangtze process. N stands for sample size, K for the total number of parameters of each model (i.e. the dispersal parameters plus the regression parameters), inf stands for an infinitesimally small value. The best-fitting model (H2) is highlighted in boldface.
Model H2, the Inner Asia Mountain Corridor hypothesis, is the preferred model with a 94.9% likelihood of it being the best model out of this set and given the present data. We therefore present also the scatterplot of the RAD data for this model (Figure 4).

Scatterplot of the best-fitting model H2 separated into its japonica (main plot) and proto-indica components (inset).
Discussion
The first result is that the explicit modelling of proto-indica spread (which was only done for models H1, H2 and H3) significantly increases the fit of the models to the archaeological dataset. This was to be expected by the addition of the extra parameters. However, the increase in log-likelihood is accompanied by an increase in the penalty given by the AIC to models with extra parameters, meaning that this addition is justified by the data.
The results also show that the addition of the Inner Asia Mountain Corridor significantly improves the model’s fit to the data, particularly model H2 where rice is introduced to the Indian subcontinent exclusively via a trade route that circumvents the Tibetan plateau (see Figure 5). This agrees with independent archaeological evidence that sees millets spread westwards along this corridor perhaps as early as 3000 BC (e.g. Boivin et al., 2012; Kohler-Schneider and Canepelle, 2009; Rassamakin, 1999) and certainly by 2500–2000 BC (Frachetti et al., 2010; Spengler, 2015; Stevens et al., 2016), that is, in the same time frame as that predicted for rice in model H2. The arrival of western livestock (sheep, cattle) into central China, 2500–2000 BC (Fuller et al., 2011; Yuan and Campbell, 2009), and wheat, ca. 2000 BC (Betts et al., 2014; Flad et al., 2010; Stevens et al., 2016; Zhao, 2015), add evidence for the role of the Inner Asia Mountain Corridor for domesticated species dispersal in this period.

Predicted arrival times of the two rice varieties across eastern and southeastern Asia, based on best-fitting model H2. The sources of dispersal are identified by yellow stars and the colours indicate arrival time in 500-year intervals.
This modelling approach, albeit entirely based on spatiotemporal data and dispersal hypotheses, predicts which archaeological rice entries should be of the proto-indica variety. Table 3 presents the list of sites that are predicted to have had rice of this variety before the arrival of the hybrid non-shattering variety (indica). Also included is the median for the earliest date associated with rice at the site (more details can be found in RAD), as well as the predicted arrival time for indica at the site. The discrepancy between the two dates shows how long proto-indica would have been in use at the site. These are predictions that are testable in the future with recourse to morphometric and/or archaeogenetic analysis of the macro-remains of these sites (e.g. Castillo et al., 2016). In fact, archaeogenetic analysis of rice from the Historic period in the Indian site of Balathal, shows the presence of both indica and japonica varieties. So, it would be possible to test for the presence of proto-indica rice using aDNA in the Chalcolithic level at Balathal. Such analyses can, subsequently, be fed back into the model to further improve it. It should be noted that at least one or two sites that the model predicts were part of the proto-indica spread have produced evidence for domesticated or possible japonica-type rices, although in this case the site occupation period, at one standard deviation, included the predicted arrival date. The first is Mahagara, where a small sample of spikelet bases are predominantly non-shattering, implying that hybridization between proto-indica and japonica had already occurred by the time this site was occupied (ca. 1650 BC), although predicted arrival date for domesticated indica, 1395 BC, nearly falls within the 1-sigma range of summed radiocarbon dates from the site (1700–1400 BC). The second site is Pirak (ca. 1750 BC), in western Pakistan, which produced grains that were argued to look like japonica on the basis of grain shape (Costantini, 1979), although this could perhaps be re-assessed using more recent methods. These occurrences are around 250 years earlier than the expected arrival of domesticated indica and may indicate that after hybridization this species spread more quickly than predicted by the present model. This makes sense as rather than the spread being by demic diffusion, it could have been by direct transfer along trade routes between communities that were already farmers. In this sense, the initial spread of domesticated indica through the territory of proto-indica can be expected to be much faster as it is a case of substituting an improved seed grain by those who already knew rice or similar crops.
List of sites whose rice are predicted to be of the proto-indica variety by the best-fitting model, with its RAD code, dates and predicted arrival date for indica rice.
The discrepancy between the dates shows how long proto-indica would have been in use at the site.
A key expectation of the proto-indica hypothesis is that fully domesticated rice with non-shattering spikelet bases would appear in India first in the northwest and only after indirect contacts with China had been established. Recent archaeobotanical data on rice from Haryana fits and strengthens this hypothesis (Bates et al., 2017). Rice spikelet bases from Mature Harappan Masudpur VII (2500–2100 BC) and Masudpur I (2300–2000 BC) are both predominantly wild type. While the significant minorities of non-shattering (11.2% and 20%) suggest that some selection for this domestication trait had taken place in India, thus confirming proto-indica cultivation (Bates et al., 2017), they do not suggest a fully indigenous domestication process. By contrast, the later site of Bahola (1900–1700 BC) has more or less fully domesticated types (>76% non-shattering) comparable to sites near the end of domestication processes in Chinese rice or near eastern cereals (see Fuller et al., 2014). This shift would either require local evolution at an order of magnitude faster than any previously documented domestication in the archaeobotanical record (see Fuller et al., 2014: Table S1, available online) or indicates that domesticated japonica or hybrid indica had arrived by this later period. In other words, these new data suggest a time for the arrival of japonica or japonica-derived hybrid indica. As these data were published after this article was initially submitted and after our modelling had been completed, we simply compare the predictions of the best-fitting model with the dates of these more recent studies, showing good agreement (Figure 6).

Comparison of the best-fitting model’s predictions for the arrival of non-shattering rice (blue-shaded normal distributions) with the latest dates from the Indian subcontinent. Red-shaded areas indicate the model’s predicted time range for proto-indica, whereas green-shaded areas indicate that for indica.
Future refinements will be welcomed, not only on the modelling but also on the data collection front. Myanmar (Burma) and Assam are clear gaps in our dataset of particular importance to the present question as they would form the entry point of Chinese rice into northeastern India in models H1 and H3 although, in the latter, the influence of this in the spread of rice in the Indian subcontinent would have been limited. Targeted archaeobotanical work in these regions would considerably help further understand the dynamics of rice dispersal, and the timing of its arrival, in this region.
From a purely statistical point of view, the Δ values obtained (Table 2) can be considered low: they are certainly lower than the ones obtained in our previous analysis of different literature-based hypotheses for the geographical origins of japonica using the same dataset (Silva et al., 2015). The Δ between models H2 and H3 (Δ = 5.866) is not that much greater than 4, suggesting that there is still some, albeit small, empirical support in favour of model H3. It is therefore valid to ask whether with an improved dataset this value could be reduced sufficiently for the two models to be statistically indistinguishable (i.e.Δ between −4 and 4) or whether the mixed-route model would in fact become the best-fitting model (i.e. Δ values lower than −4).
To predict how model selection would be affected by future archaeological work, we have simulated the presence of a number of early archaeological sites in this region. The methodology developed for this was described above, whereas the results are shown in Figure 7. Figure 7 shows the current Δ values as filled dots and their expected distributions for increasing number of extra site/phases. The boxplots represent the expected variation in the age determination, hence the (relatively) large spreads. Nevertheless, the trends are quite clear. First, assuming rice cultivation was introduced to this region from the east, as in models H1 and H3 (left figure), model H3 (in blue) will always outperform model H1 (in red), with an almost constant difference of about 14 in median Δ values. This gives further strength to the importance of the northern trade route in the narrative of rice dispersals into India, even as future empirical data might suggest a level of input from the southern route. Furthermore, one can see that one would need four or more early site/phases for model H3 (in blue) to outperform model H2 (viz. negative Δ value with respect to model H2). Model H3, however, would only become the best-fitting model when this Δ value would drop below −4, which occurs for seven or more early site/phases. That information criteria would not be affected by the inclusion of a small number of extra site/phases is to be expected from the statistical nature of the analysis; however, that with only seven more site/phases one would have to select a different model is a welcome surprise. On the right-hand side of Figure 7, we have the converse hypothesis: that rice was introduced to the Myanmar/Assam region from India, as in model H2. Under these circumstances, model H2 would be increasingly better than the others, and as expected, the Δ values have a steady increase.

Impact of future targeted archaeological work on model selection. The left figure assumes that rice arrived in the Assam/Myanmar region from the east (that is, as in models H1 and H3), whereas the right figure assumes that it arrives from the west (as in model H2). The boxplots show the spread in Δ values between the three models (vertical axis) for an increasing number of dated early archaeological site/phases (horizontal axis). Red boxplots for the Δ value of model H1 and blue for H3, both with respect to model H2. The grey-shaded area represents the region with Δ values between −4 and 4, that is, models that fall in this region have empirical support close to that of model H2.
Considering these two figures together, and under the assumption that their underlying hypotheses are equiprobable, one can say that with 4–10 extra dated early site/phases in the Myanmar/Assam region, one would be able to more securely differentiate between the Inner Asia Mountain Corridor and the mixed-route hypotheses. However, given our modelling assumptions and present data, an exclusively southern route is essentially out of the question.
Conclusion
Through a combination of explicit spatial modelling and simulation, we have demonstrated the high likelihood that dispersal of rice via traders in Central Asia introduced japonica rice into South Asia. Only slightly less likely is a combination of introduction via two routes including a Central Asia to Pakistan/northwestern India route as well as introduction to northeastern India directly from China/Myanmar. However, there is a very low probability that current archaeological evidence for rice fits with a single introduction of japonica into India via the northeast. We have also simulated the minimum amount of archaeobotanical sampling from the Neolithic (to Bronze Age) period in the regions of northeastern India and Myanmar that will be necessary to strengthen support for the combined introduction (model H3) or a single Central Asian introduction (model H2). As few as four Neolithic data points, and certainly less than 10, ought to resolve this. This means that targeted archaeobotanical sampling in these regions should be a current research priority and will greatly refine our understanding of the early evolution and geographical distribution of cultivated rice subspecies with relatively little archaeological effort.
In addition, our model explicitly predicts the spread of non-domestication cultivation of proto-indica in India and Pakistan. Therefore, sites listed in Table 3, or sites of the same cultural affiliations (age and region) are predicted to have used proto-indica. This can be tested through future archaeobotanical studies. For example, such sites are predicted to produce rice remains with morphologies consistent with wild rice, such as shattering and immature rice spikelet bases (sensu Deng et al., 2015; Fuller et al., 2009), grain shapes consistent with wild rice or indica (sensu Castillo et al., 2015) and aDNA signatures of the indica chloroplast haplotypes (Castillo et al., 2016). Only after the arrival of japonica, with maximum ages for different regions predicted in Figure 8, we expect aDNA evidence for the presence of japonica chloroplast haplotypes, and mixtures of japonica and indica haplotypes in archaeological material, as found in currently available aDNA in India, from after 300 BC (Castillo et al., 2016). In addition, we expect to find mixtures of indica and japonica grain morphology after this introduction (as reported for after 300 BC in Castillo et al., 2016) and the presence of non-shattering, domesticated spikelet bases. Presently, such spikelet bases have been observed from Mahagara (ca. 1650 BC; Fuller, 2011b), Gopalpur and Golbai Sassan in eastern India (ca. 1250 BC; Kingwell-Banham, 2015), Wari-Bateshwar in Bangladesh after 400 BC (Rahman et al., in press), Paithan in south India after 300 BC (Fuller, in press) and in northern Sri Lanka at Mantai after 100 BC (Kingwell-Banham, 2015) and Kantharodai after 150 BC (Murphy et al., 2018; Figure 8). In the case of the eastern Indian site, the occurrences of domesticated rice, and indeed the foundation of these sites based on available radiocarbon dates, are fairly close to the predicted arrival date of domesticated indica, at most ca. 250 years earlier (ca. 1500 BC for the earliest range of their radiocarbon dates).

Predicted arrival times of the non-shattering rice variety (japonica or the hybrid indica) across southern Asia based on best-fitting model H2. Included are also sites with known presence of non-shattering spikelet bases (see text).
As noted earlier, spikelet bases from Mahagara are earlier than predicted by the present model, also by close to 250 years, and therefore, even if this site is within the spread zone of proto-indica, this particular community appears to have been founded after domesticated rice had become established in the middle Ganges region. Also, grains from Pirak in Pakistan were inferred to be japonica (Costantini, 1979), which would contradict the model prediction that this site was part of the proto-indica spread. There are likely two factors to contribute to these contradictions. The first is that the ecological zones that we used to define the geographical constraints of proto-indica may be imperfect, as in addition to bioclimatic conditions, factors of cultural tradition may have limited the dispersal of rice. Second, once hybridization occurred, the spread of domesticated indica was very fast within the region that already had agriculture and/or familiarity of proto-indica, which makes the initial adoption and substitution of this rice variety different from the spread of the crop into new regions, which was more likely driven by demic diffusion-like processes, for example, with the spread of japonica through Southeast Asia or the later spread of indica through southern India and Sri Lanka. Nevertheless, systematic collection of more such data from sites nearer to the predicted time of japonica arrival and region, or regions, of that arrival in northwestern and/or northeastern India can further serve to test the predictions of our alternative northwestern (H2) or combined introduction (H3) models. Such research indicates that archaeobotanical evidence provides a fossil record of cultivated rice that can be quantitatively modelled to complement and improve inferences of rice’s evolutionary history derived only from modern genetic evidence.
Footnotes
Acknowledgements
All rasters and map outputs were generated in GRASS GIS (version 7), and the off-coast buffering was done using its r.buffer routine. All modelling and other output generation were done in R (R Core Team 2014); in particular, the quantile regressions were done using the quantreg package (Koenker, 2013) and parameter estimation was done using the genetic algorithm of package GA (Scrucca, 2013). The modified Fast Marching implementation of Silva and Steele (2012, 2014) was recoded in R, based on previously written MATLAB code, and is available to download at
. The authors would like to acknowledge the work by Chris J. Stevens on the database, as well as the many discussions around the proto-indica hypothesis.
Funding
This research is part of the Natural Environment Research Council (UK)–funded project ‘The impact of evolving rice systems from China to Southeast Asia’ (NE/G005540/1) awarded to Dorian Q Fuller and Andrew Bevan. Some work on the database by EKB and CM was carried out as part of the European Research Council project ‘Comparative Pathways to Agriculture’ (Advanced Grant no. 323842).
