Abstract
The epidemic of HIV-1 CRF01_AE in China was driven by multiple lineages of HIV-1 viruses introduced in the 1990s and increasing; it is important to investigate their epidemic status in China. In this study, we download all available CRF01_AE sequences (n = 2,931) from China and their associated epidemiological information in the Los Alamos HIV database for our analysis to explore their epidemic status in China. The results showed there were 11 distinct clusters of CRF01_AE strains in China, and 4 major clusters that accounted for 80.0% (1,793/2,241) of Chinese CRF01_AE strains in total had led a real epidemic. Clusters 1 and 2 were epidemic among heterosexuals and injecting drug users in southern and southwestern China, while Clusters 3 and 4 were predominant among homosexuals in eastern and central China and northeastern China, respectively. HIV-1 CRF01_AE strains detected in heterosexuals had the most complex characteristic, underscoring its important role in the occurrence of multiple CRF01_AE lineages. Furthermore, epidemic history reconstruction analysis using the birth-death susceptible-infected-removed package revealed that the four clusters had gone through varying epidemic stages. Clusters 2 and 3 were near the peak of the local epidemic, while Clusters 1 and 4 were just in the very early stage of their epidemic. The epidemic status of CRF01_AE clusters in the future is mainly determined by the effect of prevention and control. Our study provides new insights into the understanding of the epidemic dynamics of CRF01_AE in China.
Introduction
CRF01_AE,
The network of contacts that provided the opportunity for transmission is important for viral spread, and the transmission mode can affect viral evolution, for example, the bottleneck effect in patients contracting the virus by sexual transmission, 11,12 and because HIV belongs to RNA viruses, characterized by fast evolution, the evolutionary and ecological processes are on the same timescale, and the genomic data of the epidemic viruses can preserve the dynamic message of the host that they infected, which can be used to reconstructed the population history. 13 It is reported that a typical coalescent model is not able to distinguish prevalence from incidence, 14 and it is difficult to get more detailed information from the inference result. The birth-death model is an alternative epidemiological coalescent model widely used in fast evolving viruses such as HIV and HCV. 15 Stadler et al. 16 compared the inference results of the HIV-1 data set in the United Kingdom by the coalescent skyline plot and birth-death model and found that when the basic reproductive number R0 > 1, it coincided with the increase of the effective population size Ne inferred by coalescent skyline, but R0 < 1 was not reflected in coalescent skyline because there was no obvious decrease in Ne, indicating that the birth-death model may provide more information from which we can get an overall evaluation of public prevention. So to provide a different lens of the epidemic status of each lineage, the birth-death model is a better choice.
To clarify the exact epidemic status and reconstruct the population dynamics history of HIV-1 CRF01_AE lineages in China, we downloaded and used all of the CRF01_AE sequences from China and their associated epidemiology information available in the public database. After identification of the main CRF01_AE clusters in China, we tracked the population dynamics history of major clusters with the birth-death susceptible-infected-removed (BD-SIR) model in BEAST v2.10.
Materials and Methods
Sequence data set and quality control
To make use of the extensive amount of genomic data as much as possible, we downloaded all CRF01_AE sequences (one sequence per patient) from China and representative foreign strain (from 10 countries, including Central African Republic, Cameroon, Vietnam, Japan, Finland, France, India, America, Afghanistan, and the United Kingdom) from the Los Alamos HIV database (
The phylogenetic analysis and identification of CRF01_AE lineages in China
We first used all the full-length env, gag, and pol gene fragment sequences to construct the phylogenetic tree to determine the number of major lineages of CRF01_AE. Then, we selected three sequences from each major cluster as reference sequences (the strain with NFLG of sequences was preferred) to represent this lineage. For other shorter sequences (e.g., V3 loop of env gene sequences, p17 of gag gene sequences, and protease region of pol gene sequence), we adjusted the reference sequences to the length of these sequences to construct phylogenetic trees to determine their lineages. We made this strategy to prove the accuracy of the phylogenetic analysis and identification of CRF01_AE lineages in China. The general time reversible model plus a gamma distribution among site rate heterogeneity were chosen in a hierarchical likelihood ratio test on the basis of the standard Akaike information criterion by the Modeltest. 18 Phylogenetic trees were constructed using the neighbor-joining and maximum likelihood methods in the MEGA 6.0 program. 19 We also construct ML tree with RaxML software. 20 The reliability of the topology of the tree was evaluated by bootstrap analyses with 1,000 replicates. As for the patients with more than one fragment in the data sets, we combined the phylogenetic tree results based on different fragments to distinguish their lineages and counted them once in the final statistical table. The 2,931 sequences in our data set belonged to 2,241 patients. The HIV-1 sequence fragments from one patient with discordant phylogenetic results or ambiguous lineage information were not added to the specific lineages.
The population dynamics history analysis of the four main clusters
The BD-SIR model in BEAST v2.10 is a classic epidemiological model that incorporated into the birth-death process, 21,22 which is also a classical population dynamics model that accounts for changing population composition. 23 In the SIR model, the whole population is divided into three parts. S (susceptible) represents the host who is potential to be infected by pathogens and I (infected) denotes the host who has already been infected. R (removed) means the kind of host removed from the whole population because of death, being treated, or change of behavior. For the phylogenetic analysis, results of gag, pol, and env were generally consistent; we used the cluster information based on env to reconstruct the population history of each lineage for convenience. We chose the BD-SIR (serial) as the tree prior in BEAST2.10, then assigned the distributions of the parameters and the number of Markov Chain Monte Carlo (MCMC) to make sure that the effective sample size (ESS) value is large enough (>200) after we ran the generated XML file. Finally, we could draw trajectories of the number of susceptible, infected, and removed individuals from which we can get information on the viral outbreak in each lineage using RStudio.
Results
Identification of the number of CRF01_AE lineages in China
The ML trees were constructed by Mega 6 and RaxML software based on full gag, pol, and env gene sequences with sufficient genetic information, and the results are similar. ML trees constructed by Mega 6 and RaxML are shown in Figure 1 and Supplementary Fig. S1(Supplementary Data are available online at

HIV-1 CRF01_AE strains from China grouped into four major distinct clusters. The maximum likelihood tree analysis of all env
The geographic and risk population distribution of four major CRF01_AE lineages in China
There were 2,241 (one sequence per patient) HIV-1 sequences from China in our data set. The four major clusters of CRF01_AE strains made up 80.0% (1,793/2,241) of the HIV lineages, and Cluster 2, which we designated as CRF01_AE-v, had the highest proportion (28.1%, 629/2,241). 6 We extracted the transmission route of the strains in China from the database and original articles, and detailed information regarding the geographic sources and risk groups is shown in Supplementary Table S1. Although all four clusters were detected in more than 10 provinces in China, each cluster still retained distinct characteristics. The geographic distribution of four major CRF01_AE lineages in China is illustrated in Figure 2. Cluster 1 (n = 407), which was placed within the phylogenetic group, composed of CRF01_AE strains from Vietnam was circulating in Guangxi province, southern China, and Shanghai city, eastern China. Cluster 2 (n = 629), which was first found in Guangxi province, had the highest proportion (28.1%, 629/2,241). It was predominant in southern and southwestern China. Cluster 3 (n = 541) was dominant in Anhui province, central China, and major eastern cities. Cluster 4 (n = 216) was epidemic in Jilin and Liaoning province, northeastern China. According to the recorded risk group information, Clusters 1 and 2 were primarily transmitted by heterosexuals and IDUs, while Clusters 3 and 4 were prevalent in homosexuals. Only 38.6% and 28.3% of the sequences in Clusters 1 and 2, respectively, had risk group information available. Therefore, these results should be taken with caution. However, 62.3% and 53.7% of the sequences in Clusters 3 and 4, respectively, had risk group information available, making their risk group distributions more reliable. Most of these homosexuals were MSM. All of these results showed that the four major lineages of the CRF01_AE strains had displayed a specific geographic and risk group distribution after they were imported to China for 20 years.

Geographic distributions of the four major distinct phylogenetic clusters of CRF01_AE in China. The distribution of the four major CRF01_AE clusters in most of the provinces and cities across China is illustrated based on the data in Supplementary Table S1.
Epidemic status of the four major CRF01_AE lineages in China
We applied the BD-SIR method to explore the epidemiological dynamics of the four HIV-1 CRF01_AE clusters in China. We have used a previously reported evolutionary rate (6.59 × 10−3) as the prior 9 evolutionary rate estimated in our study is 8.1 × 10−3, which is similar to this studies. Bayesian estimates for the epidemiological parameters and time to the most recent common ancestors of the clusters are summarized in Supplementary Table S2. The estimated population change trajectories of the susceptible (S), infected (I), and removed (R) individuals in the four local epidemics are shown in Figure 3.

SIR trajectories, incidence, and prevalence of four HIV-1 CRF01_AE lineages in China. The overall SIR dynamics
The estimated mean reproduction ratio R0 of the four clusters (Supplementary Table S2) ranges from 3.06 (95% HPD: 3.03–3.08) in Cluster 3 to 5.05 (95% HPD: 5.01–5.09) in Cluster 2. Median value of the rate to become noninfectious ranges from 0.20 to 0.44, indicating that the average infectious period lasts for about 2–5 years. Cluster 2 turns out to be the most informative cluster, which was sampled from a huge population with the most susceptible individuals.
Figure 3 shows the posterior medians of the epidemic time series and suggested that the four local epidemics corresponding to four genetic clusters had gone through varying epidemic stages until the most recent sampling dates in our data set. The results suggested that Clusters 2 and 3 are just before the peak of the local epidemic, and we could expect the explosive outbreak as they reached the peak in the near future. These dynamics can also be seen in the plots of the average effective reproduction ratio R0 over time (Fig. 4). R of Clusters 2 and 3 falls close to one at year 2008 and below to one after year 2010, respectively. However, Clusters 1 and 4 were in the early stage of inherent epidemic for high R0 and a susceptible but low infected individual population. The basic reproduction ratios of Clusters 1 and 4 were always above one.

Reconstructed effective reproduction ratio over time. Black curved line represents median effective reproduction ratio for each cluster. Gray curved line shows the 95% PHD interval, while the gray straight line marks the value where the effective reproduction ratio equals 1.
Discussion
In this study, we found 11 distinct clusters of CRF01_AE strains in China, and four major clusters, which accounted for 80.0% (1,793/2,241) in total, had led a real epidemic. Clusters 1 and 2 were epidemic in southern and southwestern China and prevalent among heterosexuals and IDUs, while Clusters 3 and 4 were dominated among homosexuals in eastern and central China and northeastern China, respectively.
The pre-existing complexity of the epidemic (multiple sources of introductions from diverse localities) is the main reason for the continuous extensive geographical dispersal across the viral phylogeny. 25 Multiple lineages of CRF01_AE strains were introduced into China during the early to mid-1990s. 7 –9,24 Some of these strains expanded locally and then spread into other regions (founder effect), and some strains failed to initiate a local expansion, which may be the reason that four major distinct clusters, some small clusters, and many sporadic strains of CRF01_AE were found in China. The effect of differential sampling still could not be excluded. For example, there were only two CRF01_AE strains from northwestern China. Some areas such as Guangxi province, Beijing city, Guangdong province, Yunnan province, and Liaoning province were posed on more extensive epidemiologic investigation than other regions; however, these areas also belonged to the major locations most affected by HIV epidemic in China. For the sampling bias, although our data set cannot represent the distribution directly, it is also valuable for understanding the overall epidemic trend in China. In this study, we found the biggest number of lineages by using all of CRF01_AE sequences reported according to the previous studies' principle. Although the exact number of lineages in our study was not completely the same with previous reports due to the differences in the number of sequences and principle used in the phylogenetic analysis, all the studies supported that the epidemic of CRF01_AE in China was driven by multiple lineages of HIV-1 strains. Indeed, except four major lineages that took up 80% of the total number of strains, other lineages had not really caused an epidemic. The epidemic caused by four major lineages was also proved in previous studies with fewer sequences. Hence, we think that the phylogenetic signal in our data set is believable. It is likely that we will observe the continued evolution and diversification within and across the four major CRF01_AE lineages with an increase in the number of HIV samples.
The network of contacts that provided the opportunity for transmission is important for HIV spread, and the transmission mode can affect HIV evolution. 11,12 In our compiled database with record transmission mode information, HIV-1 CRF01_AE detected in heterosexuals had almost all the lineages (including four major clusters) found in China, while CRF01_AE in homosexuals almost belong to Clusters 3 and 4, and in IDUs mostly belong to Clusters 1 and 2. It implied the important role of heterosexual transmission in the occurrence of multiple CRF01_AE lineages in China. For the high prevalence of CRF01_AE and subtype B in MSM, the new recombinant virus CRF55_01B and CRF59_01B was recently reported to circulate among MSM in China. 26,27 It is also meaningful to explore the evolutionary characteristics of the CRF01_AE strains to understand the source of parental strains of these recombinant viruses.
From the result of the epidemic history reconstruction for each of the four major lineages, we could infer that they had gone through different epidemic stage by the end of the most recent sampling date. The SIR trajectories showed that the former two had gone through a larger part of their epidemic processes than the latter two. Clusters 1 and 4 were far before the peak of the local epidemics, for there were no obvious flows from susceptible individuals into infected individuals during the sampled interval. Cluster 1 was mingled and interplayed with the Vietnamese strains, which had been proven previously 6 and in this article; maybe they were still in a run-in period and did not cause a large outbreak. Cluster 4 was a lineage that occurred much later in China and it was still in its initial epidemic period in northeastern China. However, Clusters 2 and 3, which were two indigenous CRF01_AE lineages in China, showed a clear sign of outbreak. Our results indicated that Cluster 2, which was dominated by CRF01_AE-v designed by us, was sampled from a huge population with most susceptible individuals. This revealed the seriousness of the epidemic in the corresponding areas. The values of R0 of Clusters 1 and 4 remained above one persistently in the sample interval, indicating a potential outbreak, while those of Clusters 2 and 4 ran approximate to one and below one at year 2008 and 2010, respectively. A previous literature 14 studying the five clusters of HIV-1 type B in the United Kingdom found that one cluster was in its end of epidemic, while in our analysis of CRF01_AE in this article, the four clusters were before or near the peak of the local epidemic.
Our study is the most extensive report to clarify the epidemic status of multiple lineages of CRF01_AE strains in China using large sequence data sets (n = 2,241) covering 22 provinces and cities of China. Our study reported that there were almost 11 CRF01_AE clusters in China, and four major distinct clusters had resulted into an epidemic with their own geographic and risk population characteristics. We first documented the current epidemic population stage and speculate the future trend of four major clusters of CRF01_AE strains in China and make an alarm on their effective prevention and control. It is useful for epidemiologists to understand the epidemic status of CRF01_AE in China. The continuous monitoring of HIV-1 CRF01_AE variants is important for understanding the dissemination dynamics of HIV and helps to inform effective intervention programs.
Footnotes
Acknowledgment
This work was supported by the Key National Science and Technology Program in the 12th Five-Year Period [grant number 2012ZX10001-002].
Sequence Data
The GenBank accession numbers of sequences used in this study are in the Supplementary Table S3.
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
