Abstract
Abstract
Many CpG island detection methods have been proposed based on sliding window and clustering technology, but the accuracy of these methods is proportional to the time required. Therefore, an accurate and rapid method for identifying CpG islands remains an important challenge in the complete human genome. We propose a hybrid method CpGTLBO to detect the CpG islands in the human genome. The method uses the clustering approach and the teaching–learning-based optimization (TLBO) algorithm. The clustering approach is used to detect CpG island candidates, and it can effectively reduce the huge volume of unnecessary DNA fragments. TLBO was used to accurately predict CpG islands among promising CpG island candidates. A comparison based on six contig data sets and a whole human genome analysis showed that the identifying stability of CpGTLBO outperformed eight existing methods in terms of sensitivity (SN), specificity (SP), accuracy (ACC), performance coefficient (PC), and correlation coefficient (CC) and processing time. Results indicated that ClusterTLBO can effectively overcome the drawbacks and maintain the advantages in both the CpGcluster and TLBO.
1. Introduction
C
Gardiner-Garden and Frommer (GGF) noted that, “Stretches of DNA with a high G + C content, and a frequency of CpG dinucleotides close to the expected value, appear as CpG clusters within the CpG-depleted bulk DNA, and are now generally known as CpG islands” (Gardiner-Garden and Frommer, 1987). GGF defines a CpG island as fulfilling three conditions: (i) GC content (i.e., GC% or the percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine) exceeds 50%, (ii) observed-to-expected ratio (i.e., O/E ratio or the ratio of observed vs. expected number of CpGs) exceeds 0.6, and (iii) length exceeds 200 bp. Takai and Jones (2002) also define a CpG island as fulfilling three conditions: (i) O/E ratio exceeds 0.65, (ii) GC% exceeds 55%, and (iii) length is at least 500 bp.
Most CpG island detection methods are based on sliding window [including CpGplot (Olson, 2002), CpGProD (Ponger and Mouchiroud, 2002), CpGIS (Takai and Jones, 2002)] and particle swarm optimization (PSO) based (Chuang et al., 2011) methods. These methods scan DNA sequences to identify potential CpG islands that conform to the GGF or Takai and Jones definitions (Gardiner-Garden and Frommer, 1987). Sliding window is similar to brute force search in that it requires considerable lengths of time, and the window size is an important parameter to determine the quality of CpG island prediction results. The CpGcluster method directly uses statistical properties (p-value) to detect CpG clusters without considering CpG island definitions (Hackenberg et al., 2006), and it is likely to feature a relatively short length, high O/E ratio, and high GC% due to the strict p-value, leading to regions failing to comply with CpG island definitions (Su et al., 2010).
This study proposes a hybrid method CpGTLBO that uses the clustering approach and the teaching–learning-based optimization (TLBO) algorithm. The GGF definition was used to allow for comparisons with other methods. Results showed that CpGTLBO provides reduced search time, higher sensitivity, higher accuracy, and a higher correlation coefficient in the human genome.
2. Methods
2.1. TLBO algorithm
TLBO was introduced by Rao et al. (2011). TLBO is based on the influence of a teacher on learner output. The main terminate processes of TLBO can be divided into two phases, that is, “teacher phase” and “learner phase.” The candidate solutions of the teacher phase are randomly distributed over the search space, and the best solution is selected from among these candidates. In the learner phase, the solutions seek to pass their own information through mutual interaction between learners.
2.2. CpGTLBO
CpGTLBO is based on the CpGcluster algorithm, which observes that the distance between CpGs in a CpG island is significantly shorter than within the bulk DNA sequences (Hackenberg et al., 2006), and CpG islands feature a high degree of local clustering of CpGs. CpGTLBO uses the physical distance between neighboring CpGs to directly detect more intensive CpG clusters in the genome according to CpGcluster. All CpG clusters are calculated by their statistical p values, and each statistically significant CpG cluster is a CpG island candidate instead of to the sliding window method, thus effectively reducing the search range in the DNA sequence. All CpG island candidates are extended both forward and backward by 200 bp for the second step to detect the CpG islands and, thus, meet the GGF criterion. These CpG island candidates are then accurately predicted by the TLBO algorithm. Figure 1 shows the CpGTLBO flowchart, and the CpGTLBO procedure pseudo-code is shown in Supplementary Figure S1.

CpGTLBO flowchart.
2.2.1. Cluster the CpGs by distance-based algorithm
This step determines that the CpGs are set in a cluster and whether this cluster is an CpG island candidate. All steps are described in detail as follows:
1. All CpG positions are scanned from 5′ to 3′ in a DNA sequence, and the CpG positions are collected into a set C = {c1, c2, …, cn}, where n is the total number of CpGs. 2. Calculate the physical distance of all adjacent CpGs, which is computed by di = ci+1 – ci – 1, where i = 1 to n–1, and collect all physical distances into a set Cs. In sequence, the shortest distance between adjacent CpGs (i.e., CGCG) is 1. 3. Set a distance threshold (dt), which is used to determine whether the adjacent CpGs belong to a same cluster or not. 4. The clustering process begins from the first CpG of set C and extends downstream (5′ → 3′) to the last CpG. All CpGs are divided into clusters based on the distance threshold. When the adjacent distance between CpGs is smaller than dt, the two adjacent CpGs are regarded as belonging to the same cluster; otherwise, the position of the downstream C nucleotide of the adjacent CpGs is defined as the start position of a new CpG cluster.
2.2.2. Detect statistically significant CpG island candidates by p-value criterion
The clustering technology detects all CpG clusters, and the p-value of a CpG cluster is used to calculate the probability of a CpG cluster appearing in a random sequence. When the p-value of a cluster is smaller than 0.01 (Hackenberg et al., 2006), the cluster is retained; otherwise, the cluster is discarded. Each statistically significant CpG cluster is a CpG island candidate. The probability mentioned earlier is calculated by the cumulative density function at point nf of the CpG cluster, and it is taken as the p-value:
where N is the number of CpGs in the cluster, nf is the number of independent non-CpGs in the cluster, L is the number of nucleotides in the cluster, and p is the probability of successfully finding a CpG. Ns and Nis are, respectively, the number of CpGs and the number of independent dinucleotides in the DNA sequence.
2.2.3. Predict CpG islands by TLBO algorithm
If stage 1 obtains CpG island candidates, the CpG island candidates may be shorter than 200 bp. The lengths of these CpG island candidates must be extended to apply the TLBO algorithm for predicting the CpG island region. All CpG island candidates are extended forward and backward by 200 bp, after TLBO is implemented in each specific CpG island candidate to predict the CpG island. TLBO is able to find a better CpG island region in a specific CpG island candidate and confirms the CpG island to comply with the CpG island definition. The process of the TLBO algorithm involves five steps (Fig. 2), including initialization, fitness evaluation, teacher phase, learner phase, and termination criterion. The initialization step generates the population (learners) for searching the rational CpG island region. The fitness evaluation step analyzes the learner value according to the CpG island properties. The teacher phase step generates new learners by simulating while learners are learning from the teacher. The learner phase step simulates that learners increase their knowledge via interaction among themselves. If the learner gives a better function value, the learner is accepted. The termination criterion step determines whether TLBO iteration is capable of reaching the terminate criteria. The learner with the highest fitness value is predicted as the CpG island for the specific CpG island candidates. All steps are explained in detail as follows:

The diagram of TLBO algorithm predicts CpG islands. TLBO, teaching–learning-based optimization.
1. Initialization
In TLBO, the learner is defined as a possible CpG island region. Equation 4 is the learner encoding formula.
where Xi is the ith learner, xsi and xli are, respectively, the start position and length of a predicted CpG island in a CpG island candidate. At initialization, each learner Xi is randomly generated by the parameters xsi and xli.
2. Fitness evaluation
According to the GGF CpG island definition (Gardiner-Garden and Frommer, 1987), we use properties of length ≥200 bp, GC% ≥50%, and O/E ratio ≥0.6 to design a fitness function for evaluating the learners and to determine whether the learner range includes the CpG island or not. Equation 5 is the normalization function for calculating the predicted CpG island length. Equations 6 and 7 are, respectively, the GC% function and the O/E ratio function. Equation 8 is the designed fitness function that sums the values of three properties, with a higher fitness value indicating that the learner range has a better CpG island prediction result.
where CpGlength represents the number of nucleotides in the learner Xi. Xmin is the starting position of the CpG island candidate minus 200, and Xmax is the ending position of the CpG island candidate plus 200. #C and #G are, respectively, the numbers of C and G nucleotides in the learner Xi. #CpG is the number of CpGs in the learner Xi.
3. Teacher phase
In a population, learner improvement depends largely on the presence of a highly learned person (teacher) and the mean level of the population (M), which are defined as follows:
where g represents the number of generations, n is the number of learners, and i is the target learner. In addition, the amount of learning of each learner (Difference_Meani) is produced according to Equation 11; hence the amount of learning for different learners varies depending on the random value of the following equation:
where the range of random number ri is between 0 and 1. Tf is a teaching factor that decides which value of the mean needs to be changed. The Tf is randomly generated as either 1 or 2, which is again a heuristic step and is decided randomly with equal probability as in Equation 12. The learners are updated by Equation 13. The difference just mentioned modifies the existing learner according to Equation 13.
4. Learner phase
Learners increase their knowledge by two different means. In this step, a learner randomly interacts with other learners (i.e., randomly selecting another learner Xj, such that i ≠ j). Thus, fitness can be improved if the other learner has relatively more knowledge. The learner modification step is expressed as follows:
5. Confirm whether the stop criterion is met or not
This step repeats the subsections 1 to 4 until the maximum generation is reached. When all potential CpG island candidates are predicted by TLBO, all identified CpG islands are the CpG island detection results. In the clustering stage, the parameters p-value and distance threshold are, respectively, set to 0.01 and 65th. In the TLBO stage, the population size is 300, and the maximum generation is 100 (Gudise and Venayagamoorthy, 2003).
2.3. Example of CpGTLBO
An example is provided in the Supporting Information to illustrate how the CpGTLBO works.
3. Results
3.1. Availability of supporting data
All contig sequences, entire chromosomes, and verified CpG islands were obtained from GenBank Database in NCBI (www.ncbi.nlm.nih.gov), along with the entire human genome (NCBI.36). We selected six regular contig sequences in chromosomes 21 and 22. Chromosome 21 contains the NT_113952.1 (184,355 bp), NT_113953.1 (131,056 bp), NT_113954.1 (129,889 bp), NT_113955.2 (281,920 bp), and NT_113958.2 (209,483 bp), whereas NT_028395.3 (647,850 bp) belongs to chromosome 22. Known CpG islands published in the NCBI are used as the gold standard. These CpG islands in contig sequences were verified based on sequence analysis and bisulfite sequencing (BS-seq). CpG islands were extracted from these contigs and entire chromosomes with the following detection algorithm. The CpGTLBO program and datasets can be downloaded from the following link https://goo.gl/0jtp0R.
3.2. Comparison of the six contig sequences for CpG island detection methods
Contig sequences are often used to test the CpG island detection methods in terms of detecting the CpG islands (Chuang et al., 2011). Table 1 shows the detection results of six contig sequences by using nine methods taken from the relevant literature. CpGPlot showed excellent specificity (SP) results in all contig sequences. The performance measurement was introduced in Supplementary Material. CPSORL showed better sensitivity (SN), performance coefficient (PC), and correlation coefficient (CC) values than CpGPlot, CpGcluster, CpGProD, CpGIS, PSO, PSORL, and CPSO. However, CpGTLBO outperforms CPSORL for SN, accuracy (ACC), PC, and CC in the six contig sequences. Supplementary Table S1 shows the detailed results for CPSORL and CpGTLBO in the six contig sequences, including GC% (average), O/E ratio (average), and number of detection CpG islands. CpGTLBO had a higher GC% (average) and O/E ratio (average) than CPSORL, indicating that CpGTLBO tends to search shorter but CpG-richer CpG island regions. We used the p-value of the Wilcoxon Signed-Rank test for pairs of result groups to determine whether the difference between the two methods is significant or not. In the six contig sequence results, all p values of CpGTLBO compared with the other eight methods were p < 0.001, indicating that the CpGTLBO was, indeed, effective in identifying CpG islands in the DNA sequences.
The bold type indicates the best value in all methods.
3.3. Comparison of the whole human genome for various CpG island detection methods
A comparison of CpG island detection methods can facilitate the evaluation of three conditions for detected CpG islands in the whole human genome. In Table 2, the results of the CpG island detection methods are obtained from the published literature, including CpGplot (Olson, 2002), CpGProD (Ponger and Mouchiroud, 2002), CpGIS (Takai and Jones, 2002), and PSO based (Chuang et al., 2011). CpGcluster detected the minimum CpG island length (8 bp). In chromosome 21, the total length of the true CpG islands is 1,719,555 bp and the coverage of true CpG islands is 3.66%. CpGTLBO detected a total length of 1,733,292 bp, and the coverage was 3.69%. The CpG island length and coverage values of CpGTLBO approximate the real values. In chromosome 22, CpGTLBO, respectively, detected the total length of true CpG islands and detected CpG islands as 3,114,716 and 2,998,371 bp, with respective coverage of 6.27% and 6.17%. The results indicate that CpGTLBO outperformed the other methods for detecting CpG islands in chromosomes 21 and 22.
Table 3 shows the results of CpG island identification obtained by CpGcluster, CpGIS, CPSORL, and CpGTLBO in the whole human genome. The resulting average island length using CpGTLBO was longer than that using CpGcluster (443 bp vs. 273 bp) and shorter than CpGIS (443 bp vs. 1090 bp), but similar to CPSORL (443 bp vs. 572 bp). We examined the promoter and transcription start site (TSS) that overlaps with the CpG island region. A promoter region was defined as −1500 to +500 bp around the TSS. The TSS number for CpGTLBO was higher (below 9.65%) than that for CpGcluster, CpGIS, and CPSORL. The promoter region of CpGTLBO was higher (below 11.8%) than that of CpGcluster and CpGIS.
TSS, transcription start site.
Supplementary Figure S2 shows the length distribution of the CpG islands. The CpG island length result of CpGTLBO ranged from 200 to 749 bp. Supplementary Figure S3 shows the distributions of GC% and the O/E ratio in the detected CpG islands using CpGTLBO. Most GC% were between 0.5 and 0.7, and the O/E ratios were between 0.6 and 1.0, indicating that the results of CpG islands detected using CpGTLBO conform to the GGF criteria. Chuang et al. (2011) proved that CPSORL was superior to CpGcluster and CpGIS. Therefore, we compared the measurement performance of CPSORL and CpGTLBO in the whole human genome. In Supplementary Table S2, CpGTLBO shows a very strong improvement in SN, SP, ACC, PC, and CC, with SN showing the most significant difference. For the Wilcoxon Signed-Rank test, the respective p values of SN, SP, ACC, PC, and CC were 2.352E-5, 6.296E-5, 1.813E-5, 1.822E-5, and 1.818E-5, indicating the strong superiority of CpGTLBO over CPSORL for CpG island detection in the whole human genome.
4. Discussion
When using the sliding window approach, CpG island prediction quality is affected by the window size. Given a large window size, regions including several short and loose CpG islands can possibly be combined into a single larger region. This large window size provides a high SP value, but an increased false negative (FN) can reduce SN (Lai et al., 2008; Sujuan et al., 2008; Han and Zhao, 2009; Hackenberg et al., 2010; Mabrouk, 2013). The PSO-based methods apply the optimization algorithm in each window and use the GGF to fit the CpG islands. Although this method successfully enhances the CpG island detection, PSO must be implemented in many windows that do not include CpG islands, incurring a huge computational cost; thus, PSO-based methods significantly increase computational loading. In addition, since CpGcluster replaces the CpG island definition with strict p values, some detected CpG islands fail to meet the definition, resulting in omissions of regions, which do not comply with the CpG island criteria. Therefore, in this study, we propose CpGTLBO to combine the CpGcluster and TLBO method. The results indicated that CpGTLBO can efficiently detect the CpG islands with GGF criteria, and the examples of performance measurement and running times are illustrated in Figures 3 and 4, respectively.

Results of the three methods, showing the position of a true CpG island and the detection of CpG island positions of the TLBO, CpGcluster, and CpGTLBO methods. This figure shows the search range of the three methods in a true CpG island of NT_113955.2 contig; this island is located from 121,928 to 122,307 bp

Running times comparing the search efficiency among the five methods for the six contig sequences. Running times for PSO-based and CpGTLBO methods are used to assess search efficiency in the six contig sequences based on running time and the log10 value for the sequence's presently detected position on the x and y axes, respectively. This figure shows the running times of PSO-based methods and CpGTLBO in six contig sequences. All PSO-based methods have similar execution times and require more execution time than CpGTLBO. Although the time complexity of TLBO is greater than PSO-based methods, the first step of CpGTLBO can effectively reduce the number of unnecessary regions in DNA sequences; thus, the execution time for CpGTLBO is significantly less than that for PSO-based methods for the entire DNA sequence. Supplementary Tables S3 and S4, respectively, show the PSO-based and CpGTLBO methods with PSO and TLBO run times in six contigs and the entire human genome. The total length of the NT_113952.1 contig sequence is 184,355 bp. CpGTLBO detects the 21 CpG island candidates after the p-value assessment; thus, TLBO is only run 21 times, whereas the PSO-based methods need to run PSO 93 times. In the NT_113958.2 and NT_028395.3 contig sequences (long sequences), CpGTLBO requires relatively less time to be spent on pre-processing. Although the initial CpGTLBO scanning sequence may proceed more slowly than in other methods, sequence pre-treatment provides a strong time advantage for complete genome detection. PSO, particle swarm optimization.
Detection stability is an important performance criterion in optimization algorithms. CpG island detection results can be affected by different random seeds. Supplementary Figure S4 shows that PSO-based methods produced similarly different results for worst and best values for ACC, SN, SP, PC, and CC, indicating that some random seeds may result in failed searches, thus reducing detection accuracy. However, the CpGTLBO method produces high levels of detection accuracy, even with unfavorable seed values (see the error bars near the left boxes that indicate 10th percentiles). CpGTLBO was found to produce greater stability than the other PSO-based methods.
The GC% (average) and O/E ratio (average) results using CPSORL are similar to those of PSO-based methods for CpG island detection, but smaller than CpGcluster values. CpGcluster is usually used to identify short CpG islands that may have high O/E ratio values and high GC% levels, but low sensitivity. CpGTLBO is based on the CpGcluster approach, but we use the GGF CpG island definition. Thus, CpGTLBO can significantly reduce the O/E ratio and GC%, thus substantially improving sensitivity. Supplementary Table S2 shows the entire human genome analysis, in which CpGTLBO outperforms CPSORL in terms of detecting higher SN, SP, ACC, PC, and CC values. These results indicate that CpGTLBO provides accurate CpG island identification. CpGTLBO inherits the advantages of CpGcluster in terms of CpG island detection with TSSs and promoters (Table 3). Given the overlap with conserved elements or promoter regions, the CpGcluster is co-localized more specifically to TSSs and many of the small CpG islands detected by CpGcluster may be functional (Han and Zhao, 2009).
Until now, the identification of CpG islands can fetch the disease analysis. Hence, for identifying CpG islands, a CpGTLBO-based clustering and optimization has been proposed in this article. CpGTLBO has several advantages over the CpGcluster and PSO-based method alone: (i) CpGTLBO running time and search stability can be significantly improved by pre-treatment with CpGcluster technology; (ii) the lower sensitivity of CpGcluster can be improved by the use of the GGF criteria; and (iii) CpGTLBO effectively improves the CpGcluster and the PSO-based methods, and these improvements greatly enhance the accuracy for CpG island detection. The results indicate that the CpGTLBO outperforms the other methods.
Footnotes
Acknowledgments
This work was partly supported by the National Science Council in Taiwan (under Grant nos. MOST 105-2811-E-151-002, MOST 104-2221-E-214-035-MY2, MOST 103-2221-E-151-029-MY3, and MOST 104-2320-B-037-013-MY3).
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
