Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions

Abstract

Duplicate sequence records—that is, records having similar or identical sequences—are a challenge in search of biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue to group similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.

1. Introduction

Biological sequence databases are used to accumulate a wide variety of observations of genetic and protein sequences and to provide access to a massive number of sequence records submitted from individual laboratories (Baxevanis and Bateman, 2015). Their primary use is for sequence database search, where database users prepare query sequences such as uncharacterized proteins, perform sequence similarity search of a query sequence against deposited database records [typically via utilities such as BLAST (Altschul et al., 1990)], and judge the output, a ranked list of retrieved sequence records.

A key challenge to the quality and efficiency of database search is duplication, that is, the presence of multiple records that contain similar or even identical sequences (Bursteinas et al., 2016). Duplication has two immediate impacts on database search: it dramatically increases database search time and the top-ranked retrieved sequences can be highly similar. Together, these impacts lead to uninformative search results and make it difficult to find potentially interesting sequences that are only slightly less similar to a search query. A possible solution is to remove duplicate records. However, the notion of duplication is context-dependent; removal of records that might be regarded as duplicates for one task may be harmful to other tasks (Chen et al., 2017b).

Machine learning techniques are often used to solve computational problems in biology; in particular, clustering methods have been widely applied (Fu et al., 2012) to address duplication. These methods cluster a sequence database at a user-defined sequence identity threshold, creating a nonredundant database. Users search against representative records (one sequence per cluster) of nonredundant databases and can further expand search results by exploring records from the same clusters. Search against representatives reduces search time and provides more diverse search results; exploration of cluster members, in addition, should ensure that potentially interesting records will still be found.

To assess the search effectiveness of nonredundant databases, Precision and Recall have been used in existing studies (Suzek et al., 2015). BLAST all-by-all search results (where each sequence was compared with all sequences in a database via BLAST) from original databases are used as a gold standard. Precision measures the proportion of retrieved sequences labeled as relevant, based on this gold standard, out of the total number of retrieved sequences from nonredundant databases; Recall measures the proportion of relevant sequences retrieved by nonredundant databases out of the total number of relevant sequences in the gold standard set. Nevertheless, the use of Precision and Recall has limitations. Precision assumes that users will examine all the retrieved sequences; however, this is not realistic: biological database users pay more attention to top-ranked retrieved sequences (Cole et al., 2008). Likewise, Recall is not an effective information retrieval measurement, as evidenced by findings in related studies for over a decade (Zobel et al., 2009; Webber, 2010; Walters, 2016).

In this study, we focus on search effectiveness in nonredundant databases.^* The contributions are twofold. First, we propose metrics to better model user behavior and reassess search effectiveness, summarized in Section 5. The evaluation results demonstrate that Recall is not effective in information retrieval tasks. For example, reaching Recall of more than 90% is at the cost of users having to manually examine more nonrelevant sequence records for about 90% of queries than in original databases. Furthermore, the results illustrate that the Precision of expanded clusters (representatives and members) is distinctly lower than the Precision when only using representatives. Second, to address these problems, we propose a ranking model that returns a user-specified proportion of top similar cluster members, based on their sequence and annotation similarity to representatives, summarized in Section 6. We demonstrate that it improves the Precision of expanded clusters, facilitates more satisfactory user search by providing only top similar cluster members instead of all results, and does not increase practical searching time.

2. Duplication in Biological Sequence Databases

Biological sequence databases archive knowledge about sequences, structures, and biological function. They are used for data storage: newly sequenced records are submitted to databases and corresponding database record IDs are cited for research communications (Karsch-Mizrachi et al., 2017). They are also used for data search: uncharacterized sequences are searched against database records for homology prediction, coding region identification, and function classification (Gish and States, 1993).

Advances in high-throughput sequencing mean that, on the one hand, genomes are being sequenced ever more efficiently with fewer errors. On the other hand, this has led to a vast increase in the number of sequence records deposited in databases. For instance, GenBank (a primary nucleotide database coordinated by the National Center for Biotechnology Information) has more than 200 million nucleotide records consisting of more than 250 billion base pairs. Its average annual growth in terms of the number of base pairs was 35.52% (Benson et al., 2017). Likewise, UniProtKB/TrEMBL (a primary protein database coordinated by the UniProt Consortium) contains more than 120 million protein records, almost doubling in size in 2016 (The UniProt Consortium, 2018).

Such dramatic increases in numbers of sequence records lead to duplication. Duplicate records can be broadly categorized as of two kinds: entity duplicates, which are records belonging to the same entities (Chen et al., 2016b, 2017b; Yonchev et al., 2018), such as when the same gene records are submitted to the same database (Chen et al., 2017b); and near-duplicates or redundant records (Suzek et al., 2015; Mirdita et al., 2016; Chen et al., 2018), where records share some specified percentage X% similarity defined by users. For example, the Uniclust protein database defines redundant records at sequence similarity thresholds of 30%, 50%, and 90% for different purposes (Mirdita et al., 2016). Both kinds of duplicate records have impacts on different use cases. In biocuration (biological data curation), biocurators have to manually resolve conflicts and inconsistencies caused by entity duplicates (Magrane and UniProt Consortium, 2011); in database search, users have to manually explore repetitive or near identical search results caused by redundant records (Bursteinas et al., 2016). We focus on redundant records in this article.

Redundant records challenge database search in terms of both efficiency and effectiveness. A particular instance was the high level of redundancy in UniProtKB/TrEMBL. Records from different strains of the same bacterial species were deposited; for example, records from 1692 strains of Mycobacterium tuberculosis were overrepresented in close to 6 million records.^† Such redundancy dramatically lowers the search speed and brings highly repetitive search results. To address this, UniProt staff removed about 50 million redundant records using a combination of manual and automatic approaches (Bursteinas et al., 2016). Similar examples were also observed in sequencing repositories, where database managers encourage a joint effort to handle redundancies (Gabdank et al., 2018).

3. Sequence Clustering to Address Redundancies

Clustering is the main technique currently used to address redundancies in biological sequence databases. It is an unsupervised machine learning approach that groups records based on a similarity function. We have detailed sequence clustering methods and their related applications in previous works (Chen et al., 2016a, 2018). Briefly, sequence clustering methods are greedy algorithms that aim to reduce the number of sequence alignments. They have the following three primary steps (Fu et al., 2012):

1. Sequences are sorted by decreasing length. The longest sequence is by default the representative of the first cluster.

2. The remaining sequences are processed in order. Each is compared with the existing cluster representatives. If its sequence identity to the representative of some cluster is no less than the user-defined threshold, it is assigned to that cluster as a member. If it is not similar to any existing representative, it becomes a new cluster representative.

3. The generated representatives and corresponding members comprise the nonredundant (or clustered) database.
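The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the CD-HIT implementation: the function names (`toy_identity`, `greedy_cluster`) are hypothetical, and `toy_identity` is an assumed stand-in that compares sequences position by position instead of computing an alignment-based identity.

```python
from typing import Callable, Dict, List


def toy_identity(a: str, b: str) -> float:
    """Assumed stand-in for sequence identity: fraction of positions of the
    shorter sequence that match the longer one, compared position by position."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0.0
    matches = sum(x == y for x, y in zip(shorter, longer))
    return matches / len(shorter)


def greedy_cluster(
    seqs: List[str],
    threshold: float,
    identity: Callable[[str, str], float] = toy_identity,
) -> Dict[str, List[str]]:
    """Return {representative: [members]} following the three steps above."""
    clusters: Dict[str, List[str]] = {}
    # Step 1: sort by decreasing length; the longest sequence seeds the first cluster.
    for seq in sorted(seqs, key=len, reverse=True):
        # Step 2: assign to the first cluster whose representative is similar enough.
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # Not similar to any existing representative: start a new cluster.
            clusters[seq] = []
    # Step 3: representatives (keys) and members (values) form the clustered database.
    return clusters


db = ["MKTAYIAKQR", "MKTAYIAK", "GGGSSSPPP", "MKTAYIAKQ", "GGGSSS"]
clusters = greedy_cluster(db, threshold=0.5)
for rep, members in clusters.items():
    print(rep, members)
```

Because a sequence joins the first representative that passes the threshold, it is never compared with later representatives that might be a closer match; this is the source of the efficiency gain and also of the accuracy trade-off discussed later in the article.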

3.1. Sequence search on nonredundant databases

Figure 1 compares sequence search on the UniProtKB database, consisting of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL records, and UniRef50, a nonredundant database created by clustering at a threshold of 50% similarity from UniProt records. Sequence search on a common database compares query sequences against all the database records (assuming that no heuristics are applied). In some cases, it produces highly similar search results due to redundancies, as shown in Figure 1a.

FIG. 1.

Search of query sequences against original database versus nonredundant database using search results of UniProtKB/Swiss-Prot record A7FE15 on UniProtKB and UniRef50 (a clustered database) as an example. (a) The top retrieved results of original database may be highly similar. (b) The top retrieved results of the nonredundant version are more diverse. (c) The expanded set makes the search results more complete.

In contrast, sequence search on nonredundant databases consists of two steps. First, query sequences are only searched against cluster representatives, as shown in Figure 1b. The retrieved records are effectively a ranked list of cluster representatives in the nonredundant database. Second, search results are further expanded by exploring associated cluster members if users are interested in the retrieved representatives, as shown in Figure 1c. By searching against representatives only, the search takes less time and brings more diverse results. By expanding the clusters, users can explore the members that are similar to cluster representatives to confirm their findings.
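The two-step search can be illustrated with a minimal sketch. It assumes a clustered database stored as a mapping from representative to members, and it uses a hypothetical shared k-mer count (`shared_kmers`) in place of a real BLAST score; none of these names come from BLAST or UniRef.

```python
from typing import Dict, List, Set


def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Assumed toy similarity score: number of distinct k-mers shared by a and b."""
    def kmers(s: str) -> Set[str]:
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def search_representatives(
    query: str, clusters: Dict[str, List[str]], min_score: int = 1
) -> List[str]:
    """Step 1: score the query against representatives only, best score first."""
    scored = [(shared_kmers(query, rep), rep) for rep in clusters]
    hits = [(score, rep) for score, rep in scored if score >= min_score]
    return [rep for score, rep in sorted(hits, key=lambda pair: -pair[0])]


def expand(hits: List[str], clusters: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Step 2: add the members of every retrieved representative's cluster."""
    return {rep: [rep] + clusters[rep] for rep in hits}


clusters = {
    "MKTAYIAKQR": ["MKTAYIAKQ", "MKTAYIAK"],
    "GGGSSSPPP": ["GGGSSS"],
}
hits = search_representatives("MKTAYIAK", clusters)
expanded = expand(hits, clusters)
print(hits)
print(expanded)
```

Note that the members added in the second step are never scored against the query, so they carry no rank of their own within the expanded result.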

4. Existing Search Effectiveness Measures and Limitations

Previous studies used Precision and Recall to assess the search effectiveness of nonredundant databases (Suzek et al., 2015; Mirdita et al., 2016). The formulas are shown in Equation (1).

\begin{align*} \mathrm{Precision} = \frac{\vert E \cap G \vert}{\vert E \vert} \quad \quad \mathrm{Recall} = \frac{\vert E \cap G \vert}{\vert G \vert}. \tag{1} \end{align*}

Here, we explain how UniRef staff used Precision and Recall to evaluate search effectiveness as an example (Suzek et al., 2015). First, a BLAST all-by-all search was performed on the UniProtKB database. The BLAST results are commonly called query–target pairs or hits, where a query sequence is paired with multiple retrieved sequences in a database ordered by BLAST scores. The query–target pairs with BLAST e-value no more than a user-defined threshold, such as 10, were selected as gold standard relevant hits, identified as G in the equation. Another BLAST all-by-all search was performed on the UniRef50 database.^‡ The expanded query–target pairs (the representatives satisfying the e-value threshold together with their cluster members) were considered as retrieved results, identified as E in the equation. Precision measures the proportion of relevant results out of all retrieved results: of the |E| retrieved query–target pairs, the fraction identified as relevant in the gold standard. Recall measures the proportion of relevant results out of all relevant query–target pairs: of the |G| relevant query–target pairs, the fraction retrieved from the nonredundant database. The evaluation results on UniRef50 illustrate that Precision and Recall were consistently more than 0.83 and 0.95, respectively, for multiple thresholds.
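As a small worked example of Equation (1), the following sketch treats E and G as sets of (query, target) pairs. The pair identifiers are invented for illustration and do not come from any real BLAST run.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (query id, target id)


def precision_recall(expanded: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Equation (1): |E ∩ G| / |E| and |E ∩ G| / |G|."""
    overlap = len(expanded & gold)
    precision = overlap / len(expanded) if expanded else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Gold standard pairs from the original database (toy data).
G = {("q1", "a"), ("q1", "b"), ("q1", "c"), ("q2", "d")}
# Expanded pairs from the nonredundant database (toy data).
E = {("q1", "a"), ("q1", "b"), ("q1", "x"), ("q2", "d"), ("q2", "y")}

precision, recall = precision_recall(E, G)
print(precision, recall)  # 0.6 0.75
```

Three of the five retrieved pairs are in the gold standard (Precision 0.6), and three of the four gold standard pairs are retrieved (Recall 0.75).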

Precision and Recall are popular metrics for the measurement of classification tasks: in this case, Precision quantifies the accuracy of search results on nonredundant databases; Recall quantifies whether important search results can be retrieved via nonredundant databases. However, the adopted metrics and evaluations do not reflect user behavior in practice. Specifically, there are three limitations of concern.

First, users may not expand cluster members when searching nonredundant databases. Some use cases have a higher priority on search diversity, where only representatives are used as search results. For instance, Malde and Furmanek (2013) only used representatives of UniRef50 to find a large diversity of protein sequences for identification of protein structures via transitive alignments. Similarly, Gu et al. (2016) only used representatives of UniRef90 for sequence evolutionary conservation analysis. In contrast, expansion of cluster members is more important for other cases. For example, Remita et al. (2016) searched against UniRef for miRNAs regulating glutathione S-transferases and expanded the results from the associated UniRef clusters to obtain alignment information, Gene Ontology (GO) annotations, and expression details to ensure that they did not miss any other related data. Thus, the evaluation should consider both circumstances.

Second, the relevance of top-ranked results can be critical to users. Precision assumes that users will examine all the retrieved sequences and judge whether they are of interest. Nevertheless, this assumption is not realistic in practice. For large collections such as UniProt databases, a BLAST search could yield hundreds of hits. It is tedious for users to manually check all the hits, and it is not necessary because only top-ranked hits have high BLAST alignment scores (or, equivalently, lower e-values). For instance, Cole et al. (2008) created a protein sequence structure prediction website that searches user-submitted sequences against UniRef and selects the top retrieved representatives based on e-values. Thus, the Precision should be adapted to focus more on top-ranked results.

Third, Recall also has failings. It has been a long-term concern that Recall may not be effective for information retrieval measurement because it is unrelated to user satisfaction level in regard to any single query (Zobel et al., 2009; Webber, 2010; Walters, 2016). As we will quantitatively show later, a high Recall means that users have to manually examine more nonrelevant hits.

5. Reassessing Search Effectiveness

We propose updated evaluation metrics and reassess the search effectiveness of nonredundant databases, as summarized below.

5.1. Proposed evaluation metrics

Enhanced evaluation metrics are proposed for modeling user behaviors, as follows:

\begin{align*} P@K(F) = \frac{1}{K} \sum_{i=1}^{K} S(F_i), \tag{2} \end{align*}

\begin{align*} P@K_{equal}(E) = \sum_{i=1}^{K} \frac{1}{K \vert C_i \vert} \sum_{j=1}^{\vert C_i \vert} S(C_{i,j}), \tag{3} \end{align*}

\begin{align*} P@K_{weight}(E) = \sum_{i=1}^{K} \frac{\vert C_i \vert}{\sum_{m=1}^{K} \vert C_m \vert} \cdot \frac{1}{\vert C_i \vert} \sum_{j=1}^{\vert C_i \vert} S(C_{i,j}). \tag{4} \end{align*}

The notation is as follows. Given a query Q, let F be the list of fetched (retrieved) representatives from the nonredundant database, E its expanded set, and R the set of relevant sequences. Here, F is a ranked list, consisting of representatives ordered by BLAST scores, whereas E contains representatives and the associated cluster members, which do not have a particular order. R in this case stands for all the fetched sequences for Q from the original databases as the gold standard. Each sequence, either in F or in E, has a relevance score based on a scoring function S, that is, 0 if it is not in R, or 1 otherwise.

As previously mentioned, the Precision of representatives only (F) and expanded clusters (E) should be evaluated separately. For representatives only, the existing metric $P@K$, commonly used in the Information Retrieval community, is satisfactory. Its formula appears in Equation (2), measuring the Precision of the top K representatives. For the latter case, however, it is not straightforward. E in this context is a proxy for both the top K representatives and for their cluster members. Its size is more than K and the members do not have a rank; therefore, $P@K$ cannot be directly used. We propose two $P@K$-derived metrics for E, presented in Equations (3) and (4). In these formulas, $C_i$, $\vert C_i \vert$, and $C_{i,j}$ are, respectively, an expanded cluster, the expanded cluster size, and a sequence in the expanded cluster. The aim is to transform the score of a sequence relative to the cluster size; for example, the score of a sequence in a cluster of 10 records will be 1/10. The former formula treats every cluster equally, that is, each cluster has weight 1/K; the latter weighs clusters by size such that larger clusters have higher weights.
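The three metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes that at least K representatives were retrieved, that each expanded cluster is given as a list with the representative first, and, following the prose above, that each sequence in a cluster of size |C_i| contributes 1/|C_i| to that cluster's score. Under that reading the size-weighted metric reduces to the fraction of relevant sequences among all sequences in the top K expanded clusters. All identifiers and the toy data are hypothetical.

```python
from typing import List, Set


def p_at_k(ranked_reps: List[str], relevant: Set[str], k: int) -> float:
    """Equation (2): fraction of the top-k representatives that are relevant."""
    return sum(1 for seq in ranked_reps[:k] if seq in relevant) / k


def p_at_k_equal(clusters: List[List[str]], relevant: Set[str], k: int) -> float:
    """Equation (3): mean per-cluster precision; every cluster has weight 1/k."""
    per_cluster = [
        sum(1 for seq in cluster if seq in relevant) / len(cluster)
        for cluster in clusters[:k]
    ]
    return sum(per_cluster) / k


def p_at_k_weight(clusters: List[List[str]], relevant: Set[str], k: int) -> float:
    """Equation (4): per-cluster precision weighted by cluster size."""
    top = clusters[:k]
    total = sum(len(cluster) for cluster in top)
    hits = sum(1 for cluster in top for seq in cluster if seq in relevant)
    return hits / total


# Toy data: two expanded clusters, representative listed first in each.
expanded = [["r1", "m1", "m2", "m3"], ["r2", "m4"]]
ranked_reps = [cluster[0] for cluster in expanded]
R = {"r1", "m1", "m2", "r2"}  # gold standard hits from the original database

rep_precision = p_at_k(ranked_reps, R, k=2)
equal_precision = p_at_k_equal(expanded, R, k=2)
weight_precision = p_at_k_weight(expanded, R, k=2)
print(rep_precision, equal_precision, weight_precision)
```

In this toy case both representatives are relevant, so the representative-only value is 1.0, whereas the expanded clusters have per-cluster precision 3/4 and 1/2. The equal-weight metric is their mean (0.625), and the size-weighted metric is 4 relevant sequences out of 6 (about 0.667), because the larger cluster counts for more.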

5.2. Data set, tools, and experiments

We used the complete UniProtKB/Swiss-Prot Release 2016–15 as our experimental data set. It consists of 551,193 protein sequence records. CD-HIT (4.6.5) was used to construct the associated nonredundant UniProtKB/Swiss-Prot; NCBI BLAST (2.3.0+) was used to perform all-by-all searches. UniProtKB/Swiss-Prot is the largest expertly curated protein database; CD-HIT is one of the most popular sequence clustering tools. Both were selected as representative choices of database and clustering tool.

CD-HIT by default removes sequences of length no greater than 10 since such short sequences are generally not informative. We removed those records correspondingly in the complete UniProtKB/Swiss-Prot. The updated data set has 550,047 sequences. We used them as queries and performed BLAST searches on the updated UniProtKB/Swiss-Prot and its nonredundant version at 50% threshold generated by CD-HIT. The nonredundant database at 50% consists of 120,043 sequences. 547,476 of 550,047 query sequences have at least one retrieved sequence in both databases. The commands for running CD-HIT^§ and BLAST^** strictly follow user guidance. NCBI BLAST staff (personal communication via e-mail) advised on the maximum number of output sequences to ensure sensible results. Note also that this study focuses on general uses of the tools, while, for instance, UniRef and Uniclust may use different parameters to construct nonredundant databases for specific purposes.

For BLAST query–target pairs obtained by all-by-all searches, we removed two types of query–target pairs: those where the target is the query itself, and those where the same sequence is retrieved more than once for a query. BLAST performs local alignment, so a target sequence may be retrieved multiple times for the same query sequence if its different regions (subsequences) are similar to the query. Such repeated pairs would bias the statistical analysis.
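The filtering described above can be sketched as a single pass over the pairs. The sketch assumes the pairs are available as (query id, target id) tuples in BLAST output order, so that the first occurrence of a repeated target is kept; the function name and the toy identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (query id, target id)


def filter_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop self-hits and repeated targets, keeping the first occurrence of each pair."""
    seen: Set[Pair] = set()
    kept: List[Pair] = []
    for query, target in pairs:
        if query == target:
            continue  # the target is the query itself
        if (query, target) in seen:
            continue  # same target retrieved again via another local alignment
        seen.add((query, target))
        kept.append((query, target))
    return kept


raw = [("A", "A"), ("A", "B"), ("A", "C"), ("A", "B"), ("B", "A")]
filtered = filter_pairs(raw)
print(filtered)  # [('A', 'B'), ('A', 'C'), ('B', 'A')]
```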

We measured the Precision for both representatives [$P@K(F)$ in Equation (2)] and expanded clusters [$P@K_{equal}(E)$ and $P@K_{weight}(E)$ in Equations (3) and (4), respectively]. As a complement, we also measured Recall to assess whether E is (near) identical to R. Recall [Equation (1)] was used in previous studies (Suzek et al., 2015). We do not recommend the use of Recall; it is only used for completeness.

5.3. Search effectiveness results and discussion

Our search effectiveness results illustrate two main observations.

The Precision of the expanded set distinctly degrades at top-ranked hits. Table 1 summarizes different levels of $P@K$ on representatives and the expanded sets. We assessed both measures at depths 10, 20, 50, 100, and 200 to quantify the Precision of the top-ranked hits that are more likely examined by users. In general, top-ranked hits from representatives are valuable: $P@K$ for representatives only is more than 96% across different K. In contrast, the Precision of the expanded set, either $P@K_{equal}$ or $P@K_{weight}$, is always lower than that of representatives, with degradation of up to 7% at $K = 200$. We further compared Precision in detail on an individual query level, as illustrated in Figure 3. The Precision of representatives at the top K positions is higher than that of the expanded sets for at least 80% of the queries; the proportion increases as K grows. This shows that the prevalence of degradation is high. We suspect that it is because sequence clustering methods are greedy. As explained above, a sequence will be assigned to a cluster directly if the similarity between the sequence and the cluster representative is higher than a given threshold, without comparing with other cluster representatives. This improves the efficiency but decreases accuracy as a trade-off. In particular, it is problematic for low thresholds.

Table 1.

$P@K$ Measure Results

	$P@K$
	K = 10	K = 20	K = 50	K = 100	K = 200
Representatives	0.968	0.977	0.983	0.985	0.983
$P@K_{equal}$ original	0.938	0.951	0.958	0.980	0.952
Ranked sequence	0.938, 0.946	0.952, 0.960	0.958, 0.966	0.959, 0.967	0.952, 0.963
Ranked seq & annotation	0.938, 0.947	0.952, 0.960	0.959, 0.967	0.959, 0.968	0.953, 0.953
$P@K_{weight}$ original	0.924	0.935	0.938	0.929	0.917
Ranked sequence	0.926, 0.940	0.937, 0.952	0.940, 0.957	0.933, 0.953	0.922, 0.947
Ranked seq & annotation	0.926, 0.940	0.938, 0.952	0.941, 0.957	0.933, 0.954	0.923, 0.947

Representatives: $P@K$ for representatives [Equation (2)]; $P@K_{equal}$ and $P@K_{weight}$ are $P@K$ for expanded sets [Equations (3) and (4), respectively]; Original refers to expanding all cluster members and Ranked refers to our ranked model [Equation (5)]. Ranked sequence takes sequence identity only; Ranked seq & annotation takes sequence identity weighted 80% and annotation similarity weighted 20%. The results of the ranked model were measured at 20%, 30%, 50%, 70%, and 80%, the user-specified proportion to expand search results, summarized in the form of min, max.

Recall is overestimated, and in turn uninformative, because the expanded set contains even more query–target pairs than the original data set. Figure 2a compares the number of query–target pairs. The pairs retrieved among representatives include only about 15% of the pairs from the original data set. On the one hand, this indicates that nonredundant databases dramatically reduce search time and that users can browse the search results more efficiently. On the other hand, it shows that expansion of results is valuable, since potentially interesting records may be in the other 85%. However, the expanded set produces 40,095,619 more pairs than the original. Figure 2b further shows that the expanded set produces more pairs on over 89% of queries (492,129 of 547,476), and on average produces about 10 pairs per query (Fig. 2c). Having more pairs results in high Recall. Both median and mean Recall (Fig. 2d) are above 90%, but this comes at the cost of producing more than 40 million additional pairs. We further computed Jaccard similarity (the number of shared query–target pairs over the total number of pairs between E and R). Jaccard similarity is almost 20% lower than Recall, which clearly shows that the results of the expanded set are not similar to those of the original database. This observation is consistent with results reported in other domains, which show that Recall is not an appropriate metric for measuring search effectiveness.
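
The gap between Recall and Jaccard similarity can be illustrated with a minimal sketch on toy data. It assumes that Recall is the fraction of original query–target pairs recovered by the expanded set; the function names and identifiers below are hypothetical and are not taken from the study's code.

```python
def recall(expanded, original):
    """Fraction of the original query-target pairs found in the expanded set."""
    if not original:
        return 0.0
    return len(expanded & original) / len(original)


def jaccard(expanded, original):
    """Shared pairs divided by all distinct pairs occurring in either set."""
    union = expanded | original
    if not union:
        return 1.0
    return len(expanded & original) / len(union)


# Toy data (hypothetical identifiers): each pair is a (query, target) tuple.
original = {("q1", "t1"), ("q1", "t2"), ("q1", "t3"), ("q1", "t4")}
# The expanded set recovers every original pair but also adds four extra pairs.
expanded = original | {("q1", "t5"), ("q1", "t6"), ("q1", "t7"), ("q1", "t8")}

print(recall(expanded, original))   # 1.0
print(jaccard(expanded, original))  # 0.5
```

In this toy case Recall is perfect even though half of the returned pairs are absent from the original results, which is the effect that Jaccard similarity exposes.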

FIG. 2.

(a) Expansion brings more hits than original search. (b) After expansion, ≈90% of queries have more hits than search on the original database. (c) Those ≈90% of queries have a median of 34 more hits than original search. (d) Recall is high but at the cost of returning more hits than original search. Jaccard similarity is lower than Recall, showing that the results of the expanded set are not similar to those of the original database.

6. Improving the Search Effectiveness of Expanded Clusters

Following these observations, we propose a solution that ranks cluster members by their similarity to the cluster representative and returns only the top X%, a user-defined proportion, when users expand search results. To the best of our knowledge, existing databases such as UniRef select representatives based on whether a record is reviewed by biocurators, is from a model organism, or on other such record-external factors. They do not compare and rank the similarity between records. Also, they expand all the records in a cluster rather than choosing only a subset.

6.1. A ranking model

In our proposal, the notion of similarity between a record and its cluster representative is modeled based on sequence identity and annotation similarity. This similarity function is shown in Equation (5), where R and M refer to a representative and an associated cluster member record, respectively. $Sim_{seq}$ and $Sim_{annotation}$ stand for their sequence identity and annotation similarity, respectively. Annotations are based on record metadata, such as GO terms, literature references, and descriptions. Sequence identity is arguably the dominant feature, but existing studies for other tasks demonstrate that combining sequence identity and metadata similarity is valuable (Chen et al., 2016c).
Here, $\alpha$ and $\beta$ refer to their corresponding weights; for example, sequence identity accounts for 80% of the aggregated similarity and annotation similarity accounts for another 20% when $\alpha$ is 0.8 and $\beta$ is 0.2.
$$Sim(R, M) = \alpha \, Sim_{seq}(R, M) + \beta \, Sim_{annotation}(R, M). \tag{5}$$

The records in each cluster are thus ranked by this similarity function in descending order. The top-ranked X% records, with parameter X specified by a user, will be presented when the user expands search results. The ranked model can be adjusted by both database staff and database users. On the one hand, database staff can customize the ranking function, such as adjusting weights and selecting different types of annotations, when creating nonredundant databases. On the other hand, database users can select how many records to browse rather than seeing all records when expanding search results.
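
The ranking and truncation steps described above can be sketched as follows. This is a minimal illustration of Equation (5) under stated assumptions, not the implementation used in the study: the function names, the toy similarity values, and the choice to round the number of kept records up are all hypothetical.

```python
def aggregate_similarity(sim_seq, sim_annotation, alpha=0.8, beta=0.2):
    """Equation (5): weighted sum of sequence identity and annotation similarity."""
    return alpha * sim_seq + beta * sim_annotation


def top_members(members, percent, alpha=0.8, beta=0.2):
    """Return the ids of the top `percent`% of members, most similar first.

    `members` is a list of (member_id, sim_seq, sim_annotation) tuples with
    similarities in [0, 1], each measured against the cluster representative.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ranked = sorted(
        members,
        key=lambda m: aggregate_similarity(m[1], m[2], alpha, beta),
        reverse=True,
    )
    # Assumption: round up, so a non-empty cluster with percent > 0 keeps
    # at least one record. Integer arithmetic avoids floating-point surprises.
    keep = (len(ranked) * percent + 99) // 100
    return [member_id for member_id, _, _ in ranked[:keep]]


# Toy cluster (invented values): (id, sequence identity, annotation similarity).
members = [
    ("M1", 0.95, 0.40),
    ("M2", 0.90, 0.90),
    ("M3", 0.60, 1.00),
    ("M4", 0.99, 0.10),
    ("M5", 0.55, 0.20),
]

print(top_members(members, 40))                       # ['M2', 'M1']
print(top_members(members, 40, alpha=1.0, beta=0.0))  # ['M4', 'M1']
```

With sequence identity alone, M4 ranks first; once annotation similarity contributes 20% of the score, M2 overtakes it, which shows how the weights chosen by database staff change what users see on expansion.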

We used the sequence identity reported by CD-HIT as sequence similarity and molecular function (MF) GO term similarity as annotation similarity. MF GO terms are extracted from the UniProt-GOA data set (Courtot et al., 2015), and the similarity is calculated using the well-known LinAVG metric (Lin, 1998). We applied the ranking function with two sets of weights: in the first, $\alpha$ = 100% and $\beta$ = 0%, that is, records are ranked by sequence identity only; in the second, $\alpha$ = 80% and $\beta$ = 20%. We then measured the results at proportions of 20%, 30%, 50%, 70%, and 80% to reflect how much of each cluster users may want to expand.
The notation $RA(seq, annotation, proportion)$ used in Figure 4 gives the values of $\alpha$, $\beta$, and the returned proportion, respectively.
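
As a rough illustration of the annotation similarity, the sketch below computes Lin's similarity between two terms, twice the information content (IC) of their most informative common ancestor divided by the sum of their ICs, and averages it over all term pairs between two records. Treating LinAVG as this all-pairs average is an assumption, and the tiny ontology, the IC values, and the function names are invented for illustration only.

```python
from itertools import product

# Toy ontology (invented): each term maps to the set of itself plus its ancestors.
ANCESTORS = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "catalytic": {"catalytic", "root"},
    "kinase": {"kinase", "catalytic", "root"},
}

# Invented IC values; in practice IC is -log of a term's annotation frequency.
IC = {"root": 0.0, "binding": 1.0, "catalytic": 1.0, "kinase": 3.0}


def lin(term_a, term_b):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ANCESTORS[term_a] & ANCESTORS[term_b]
    mica_ic = max((IC[t] for t in common), default=0.0)
    denominator = IC[term_a] + IC[term_b]
    if denominator == 0:
        # Both terms carry no information; treat identical terms as fully similar.
        return 1.0 if term_a == term_b else 0.0
    return 2.0 * mica_ic / denominator


def lin_avg(terms_a, terms_b):
    """Average Lin similarity over all pairs of terms from the two records."""
    if not terms_a or not terms_b:
        return 0.0
    pairs = list(product(terms_a, terms_b))
    return sum(lin(a, b) for a, b in pairs) / len(pairs)


print(lin("kinase", "catalytic"))                # 0.5
print(lin_avg(["kinase"], ["kinase", "catalytic"]))  # 0.75
```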

6.2. Ranking model results and discussions

Table 1 compares detailed $P@K$ measures for the ranked model with those for the original unranked expanded set. The ranked model always has higher Precision across different ratios and values of K. Figure 3 shows that more than 85% of queries have higher Precision in representatives than in the expanded set. The ranked model decreases this dramatically, to about 35%, showing that it has the potential to maintain Precision over expanded search results. Results in Figure 4 further confirm the findings. Figure 4b illustrates that user-defined proportions can significantly reduce the number of expanded query–target pairs: even the highest proportion, 80%, has about 50 million fewer query–target pairs than the full expanded set, and its median and mean Precision are higher than those of the full expanded set (shown in Fig. 4a). This shows that in practice users can browse many fewer results, highlighting the plausibility of our solution and also demonstrating that metadata is effective in the context of sequence search. Another advantage of our solution is that it does not add to search time: CD-HIT reports the identities between representatives and members by default, and MF GO term similarities can also be precomputed.
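
As an illustration of how these identities could be collected ahead of search time, the sketch below reads cluster information in the style of a CD-HIT .clstr file. It assumes the common layout in which each cluster starts with a ">Cluster N" line, the representative line ends with "*", and each member line ends with "at NN.NN%"; the parser name and the sample records are hypothetical.

```python
import re

# Matches the record name and the trailing marker on a cluster entry line,
# e.g. "1\t290aa, >P22222... at 95.50%" or "0\t300aa, >P11111... *".
ENTRY = re.compile(r">(?P<name>\S+?)\.\.\.\s+(?P<tail>\*|at\s+.*)$")
IDENTITY = re.compile(r"([\d.]+)%")


def parse_clstr(lines):
    """Return {cluster_id: {"representative": name, "members": {name: identity}}}.

    Identities are converted from percentages to fractions in [0, 1].
    """
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"representative": None, "members": {}}
            continue
        match = ENTRY.search(line)
        if match is None or current is None:
            continue  # skip lines that do not follow the assumed layout
        if match["tail"] == "*":
            clusters[current]["representative"] = match["name"]
        else:
            identity = IDENTITY.search(match["tail"])
            if identity:
                clusters[current]["members"][match["name"]] = (
                    float(identity.group(1)) / 100.0
                )
    return clusters


# Invented sample in the assumed layout.
sample = """>Cluster 0
0\t300aa, >P11111... *
1\t290aa, >P22222... at 95.50%
>Cluster 1
0\t120aa, >P33333... *
""".splitlines()

clusters = parse_clstr(sample)
print(clusters[0]["representative"], clusters[0]["members"])
```

The resulting identities can be stored alongside precomputed annotation similarities, so that ranking at expansion time is a lookup rather than a new alignment.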

FIG. 3.

Proportion of queries having higher Precision in representatives than in the expanded set. We removed queries that have the same number of hits in both (that is, the retrieved representatives do not have any member records). The first row compares the unranked expanded set (a) with our proposed ranked model (b) using the metric $P@K_{equal}$; the second row compares the unranked expanded set (c) with our proposed ranked model (d) using $P@K_{weight}$.

FIG. 4.

Comparative results for the original (unranked) expanded set and our proposed ranked model. Subgraphs show (a) $P@K$ measures, (b) the number of retrieved hits, (c) Recall results, and (d) Jaccard results. Each of them shows the mean and median results of the metrics, where the median is represented with a dash. $RA(seq, annotation, proportion)$ refers to the ranked model summarized in Section 6, where seq and annotation refer to the weights of sequence identity and annotation similarity, effectively $\alpha$ and $\beta$ in Equation (5), and proportion refers to the proportion specified by users to expand search results.

A limitation of the approach is that it has lower Recall than the full expanded set (shown in Fig. 4c, d). However, it is our view that the number of expanded query–target pairs and Precision measures are more critical to user satisfaction. For instance, proportion at 20% produces around 200 million fewer query–target pairs and has 2% higher $P@K$ and mean Precision. Users may already find enough interesting results from the expanded 20% results.

7. Conclusion

We analyzed the search effectiveness of nonredundant databases generated by sequence clustering methods. We proposed innovative evaluation metrics to better model user behavior. The detailed assessment results illustrate that the Precision of representatives is high, but that expansion of search results can degrade Precision and reduce user satisfaction by producing large numbers of additional hits. We proposed a model that ranks cluster members in terms of sequence identity and annotation similarity. The comparative results show that it has the potential to bring more precise results and reduce user effort, thereby yielding efficient and accurate discovery of relevant answers.

Footnotes

Acknowledgments

We appreciate the advice of the NCBI BLAST team on BLAST-related commands and parameters. The work of Q.C. is supported by the Melbourne International Research Scholarship from the University of Melbourne. The project receives funding from the Australian Research Council through a Discovery Project grant, DP150101550.

Authors' Contributions

Q.C. and X.Z. conceived and designed the experiments. Q.C. performed the experiments. All the authors analyzed the data. Q.C., J.Z., and K.V. wrote the article. All the authors approved the final article.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

*

This is an updated version of our previous study (Chen et al., ).

†

‡

Note that UniRef50 contains records from sources other than UniProtKB; only overlapped sequences were selected for evaluation.

§

./cd-hit -i input_path -o output_path -c 0.5 -n 2, where -i and -o stand for the input and output paths, -c stands for the identity threshold, and -n specifies the word size recommended in the user guide.

**

./blastp -task blastp -query query_path -db database_path -max_target_seqs 100000, where blastp specifies protein sequence search, -query and -db specify the query and database paths, and -max_target_seqs is the maximum number of returned sequences for a query.

References

Altschul, S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

Baxevanis, A.D., and Bateman. 2015. The importance of biological databases in biological discovery. Curr. Protoc. Bioinf. Volume 50.

Benson, D.A., Cavanaugh, Clark, et al. 2017. GenBank. Nucleic Acids Res. 46, 41–47.

Bursteinas, Britto, Bely, et al. 2016. Minimizing proteome redundancy in the UniProt knowledgebase. Database, 2016, 39.

Chen, Wan, Lei, et al. 2016a. Evaluation of CD-HIT for constructing non-redundant databases, 703–706. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, Shenzhen, China.

Chen, Zhang, Wan, et al. 2017a. Sequence clustering methods and completeness of biological database search. In Proceedings of the Workshop on Advances in Bioinformatics and Artificial Intelligence, Melbourne, Australia.

Chen, Zobel, and Verspoor. 2016b. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database.

Chen, Zobel, and Verspoor. 2017b. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: A descriptive study. Database (Oxford), 2017.

Chen, Zobel, Zhang, et al. 2016c. Supervised learning for detection of duplicates in genomic sequence databases. PLoS One, 11, e0159644.

Chen, Wan, Zhang, et al. 2018. Comparative analysis of sequence clustering methods for deduplication of biological databases. J. Data Inf. Qual. 9, 17.

Cole, Barber, J.D., and Barton, G.J. 2008. The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 36(suppl 2), W197–W201.

Courtot, Shypitsyna, Speretta, et al. 2015. UniProt-GOA: A central resource for data integration and GO annotation, 227–228. In SWAT4LS. Cambridge, UK.

, Niu, Zhu, et al. 2012. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics, 28, 3150–3152.

Gabdank, Chan, E.T., Davidson, J.M., et al. 2018. Prevention of data duplication for high throughput sequencing repositories. Database (Oxford), 2018. DOI:10.1093/database/bay008

Gish, and States, D.J. 1993. Identification of protein coding regions by database similarity search. Nat. Genet. 3, 266.

, Li, Dong, et al. 2016. Structural basis of outer membrane protein insertion by the BAM complex. Nature, 531, 64–69.

Karsch-Mizrachi, Takagi, Cochrane, et al. 2017. The international nucleotide sequence database collaboration. Nucleic Acids Res. 46(D1), D48–D51.

Lin. 1998. An information-theoretic definition of similarity. ICML, 98, 296–304.

Magrane, and UniProt Consortium. 2011. UniProt knowledgebase: A hub of integrated protein data. Database (Oxford), 2011, bar009.

Malde, and Furmanek. 2013. Increasing sequence search sensitivity with transitive alignments. PLoS One, 8, e54422.

Mirdita, von den Driesch, Galiez, et al. 2016. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45(D1), 170–176.

Remita, M.A., Lord, Agharbaoui, et al. 2016. A novel comprehensive wheat miRNA database, including related bioinformatics software. Curr. Plant Biol. 7, 31–33.

Suzek, B.E., Wang, Huang, et al. 2015. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31, 926–932.

The UniProt Consortium. 2018. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 46, 2699.

Walters, W.H. 2016. Beyond use statistics: Recall, precision, and relevance in the assessment and management of academic libraries. J. Librarianship Inf. Sci. 48, 340–352.

Webber, W.E. 2010. Measurement in information retrieval evaluation [PhD thesis]. Melbourne, Australia.

Yonchev, Dimova, Stumpfe, et al. 2018. Redundancy in two major compound databases. Drug Discov. Today, 23, 1183–1186.

Zobel. 1998. How reliable are the results of large-scale information retrieval experiments? 307–314. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

Zobel, Moffat, and Park, L.A.F. 2009. Against recall: Is it persistence, cardinality, density, coverage, or totality? ACM SIGIR Forum, 43, 3–8. ACM.