Abstract
In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.
1. Introduction
In recent years, deep neural networks (DNNs) have provided considerable performance improvements for a wide variety of machine learning (ML) prediction problems. However, their adoption has been slower in applied domains such as the life and medical sciences, where practitioners often require that models make decisions for biologically relevant reasons. Without this property, models trained on experimentally collected data cannot be trusted since they may fit to systematic biases introduced during data collection. Therefore, there is demand for tools that evaluate the consistency between prior knowledge about how predictions should be made and the mechanisms actually employed by a given ML model in practice.
This article presents a set of analytical tools for validating model behavior extending the Sufficient Input Subsets (SIS) method (Carter et al., 2019). We focus on models that predict the functional classification of a protein sequence from a raw amino acid sequence. Such models can not only categorize existing protein sequences found by high-throughput screening but also have potential to guide the discovery of novel proteins that catalyze reactions, are important for biotechnology, or produce new therapeutics. Hidden Markov models (HMMs) provide state-of-the-art performance (Finn et al., 2011) for this task, but DNNs have the potential to provide a new modeling paradigm (Hou et al., 2017; Li et al., 2017a; Liu, 2017; Dalkiran et al., 2018; Seo et al., 2018; Bileschi et al., 2019).
In particular, HMMs do not capture interactions between nonlocal positions in protein sequences, which have been shown to be highly predictive of protein tertiary structure (Marks et al., 2011; Evans et al., 2018; Xu, 2018) and function (Bitbol et al., 2016; Gueudré et al., 2016; Riesselman et al., 2018). DNNs have been shown to achieve extremely high accuracy on held-out data (Bileschi et al., 2019); to accurately label remote homologs (Li et al., 2017a); and in some cases, their predictions have been experimentally validated (Liu, 2017). Unfortunately, DNNs are complex and poorly understood, leading to lack of trust among practitioners (Angermueller et al., 2016; McCloskey et al., 2018; Montavon et al., 2018). This motivates development of tools to interrogate these black-box functions since traditional metrics such as held-out test accuracy do not provide assurance that the models can be used as effective surrogates for laboratory experiments.
We present six novel extensions of SIS that allow us to inspect and critique the decision-making of DNNs that predict protein function from the sequence. Our tools expose that while DNNs may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. Our approach can be easily adopted in other scientific contexts where model interpretability is essential.
Our implementation of SIS is available on GitHub (https://github.com/google-research/google-research/tree/master/sufficient_input_subsets).
2. Related Work
2.1. Methods for interpreting ML predictions
Many methods have been developed to interpret the often complex behavior of ML models. Common approaches include restricting architectures to human-interpretable predictors (Lei et al., 2016; Sha and Wang, 2017). However, in many cases, specific architectures may be required to attain superior performance. Other model-agnostic interpretability approaches produce attributions that quantify the importance of each input feature. Such methods include LIME (Ribeiro et al., 2016), saliency maps (Baehrens et al., 2010; Simonyan et al., 2013), layer-wise relevance propagation (Bach et al., 2015), DeepLIFT (Shrikumar et al., 2017), integrated gradients (Sundararajan et al., 2017), and input signal-based techniques using gradients of f (Springenberg et al., 2014; Zeiler and Fergus, 2014; Kindermans et al., 2017b); however, gradient-based methods may be unreliable (Kindermans et al., 2017a,b).
Another recent approach is SIS (Carter et al., 2019), a local explanation framework that produces rationales (sufficient input subsets) for a model that maps inputs
using a function
. Each sufficient input subset consists of a minimal subset of features in x that suffices for f to produce the same decision. SIS presumes that the decision f (x) exceeds some threshold
2.2. Neural networks for proteins
Recent work has demonstrated that deep learning models have significant potential for accurate classification of protein sequences. For example, DeepFam (Seo et al., 2018) is a convolutional neural network (CNN) that classifies sequences from 2892 COG families, showing comparable performance with profile HMMs (pHMMs). DeepSF is a CNN that classifies sequences from 1195-fold classes from SCOP, reporting better fold recognition abilities than the state-of-the-art tool HHsearch, although the deep models did not perform as well on finer-grained superfamily and family levels (Andreeva et al., 2004; Söding, 2004; Hou et al., 2017). A deep CNN has also been trained to annotate sequences from SwissProt with hundreds of class labels, including multilabel examples (The UniProt Consortium, 2016; Szalkai and Grolmusz, 2018). Recurrent neural networks (RNNs) for classifying sequences from four functional classes have also been experimentally validated (Liu, 2017). Similarly, a model combining a CNN with an RNN was used for enzyme classification (Li et al., 2017b). Finally, Bileschi et al. (2019) demonstrate that a dilated CNN can annotate protein function from a sequence as accurately as BLASTp (Altschul et al., 1997) and pHMMs in a classification task on Pfam sequences. This work demonstrates the potential for DNNs to accurately predict the function of sequences that cannot be annotated using existing HMMs.
3. Methods
3.1. Models and data
We train three DNN architectures on a protein family classification task: a deep CNN (deep CNN); a simpler shallower CNN (shallow CNN); and an RNN. We adopt models and training procedures from the study by Bileschi et al. (2019) (details in the Model Details and Hyperparameters section in Supplementary Material and Supplementary Table S1). To reduce and evaluate the effect of noise, we retrain each of our architectures 10 times with different random parameter initializations and input sampling during training.
We use precut sequence domains from Pfam v32.0 (El-Gebali et al., 2018), as processed by Bileschi et al. (2019). Each sequence is represented by a raw amino acid sequence and belongs to 1 of 17,929 families. For parts of our evaluation, we apply SIS to a subsample of Pfam seed containing 13188 sequences from 5933 families (details in the Additional Engineering Details section in Supplementary Material).
We apply SIS to the function giving the scalar output probability of the predicted class. We use a threshold τ = 0.9 and mask amino acids by replacement with X, represented as a uniform distribution over the amino acids (see the SIS for Protein Sequence Classification section in Supplementary Material).
3.2. Methodological contributions
3.2.1. SIS logo
We visualize SIS collections found across all sequences belonging to a particular family to yield an immediate summary of all SIS associated with the family and, moreover, the level of variation among SIS within that family. Our visualization builds on the concept of sequence alignment. Multiple sequence alignments (MSAs) are available for all families in Pfam (El-Gebali et al., 2018). MSAs allow us to map each SIS within a family to the same reference alignment and compute a histogram across the set of SIS collections for the family, indicating the frequency with which each amino acid occurs at each aligned sequence position. An MSA can be represented as a sequence logo (Schuster-Böckler et al., 2004) (Fig. 1a), in which each possible amino acid at each position is represented by its frequency, which is transformed to relative entropy compared with the background frequency of amino acids. High-entropy positions represent conserved regions that are generally important for protein tertiary structure and function (Ashkenazy et al., 2016). To produce our SIS logo, we align the sequences to the reference family MSA, mask non-SIS amino acids with X, and visualize the resulting SIS logo using Skylign (with the “Convert to profile HMM—keep all columns” option) (Wheeler et al., 2014). The SIS logo allows us to confirm concordance with the family HMM logo, indicating that a model learns discriminative features that are conserved across the family (see Section 4.1).

3.2.2. Explaining misclassifications
We compare the SIS collection of a misclassification with SIS collections of sequences from both its predicted and true families. For each of the true and predicted families, we report the distribution of edit distances from the misclassified sequence's SIS to all members of that family and the closest SIS from any member of that family to the misclassified sequence's SIS. Distance is computed as edit (Levenshtein) distance between sequences in which the non-SIS amino acids of each sequence are replaced with X. Our approach provides a numerical explanation of an incorrect decision (see Section 4.2).
3.2.3. Model inconsistency
One potential concern for practitioners is the stability of models: Does the same architecture retrained on the same data (with different random initialization) make the same decisions? Are the decisions made for the same reasons? To address these questions, we compare two replicates of the same architecture and use each to make predictions on the SIS from the other model, as in the study by Carter et al. (2019). In this study, we also compare SIS from each of the models on the same examples to evaluate whether the models are consistent in rationalizing predictions on specific instances (see Section 4.3).
3.2.4. Feature compression
We compare models by assessing the succinctness of their rationales. For each protein sequence x and SIS S in its SIS collection, we compute the fraction of x comprising S as |S| / |x| (see Section 4.4).
3.2.5. SIS location
We examine whether there are differences in which sequence positions (locations) are used by different model architectures to classify sequences. We observe which positions are captured in each SIS and normalize by the sequence's length to relative sequence location (see Section 4.5).
3.2.6. SIS-3D
SIS can be understood as justifications for a model's decisions. However, viewing these subsets in the sequence space is of limited utility because the sequence position of an amino acid generally has little relationship to the functional role that it plays. In contrast, the location of SIS positions in the three-dimensional (3D) structure of the folded protein contains much more information about their involvement in protein function (Berezin et al., 2004; Marks et al., 2011). For example, amino acids that line the substrate binding pocket of a protein often play a central role in substrate recognition, while those found at a protein binding interface contribute significantly to protein interaction specificity. Therefore, interpretability is significantly improved by locating these amino acids within a 3D rendering of the folded protein structure (see Section 4.6).
4. Results
4.1. SIS logos
We compute SIS collections for many sequences from the Pfam v32.0 full dataset (El-Gebali et al., 2018) predicted to belong to the Kunitz domain family (PF00014) with high confidence (probability ≥ 0.9) by the deep CNN. Figure 1b shows the corresponding SIS logo. We observe a high degree of agreement between the prominent sequence positions in the SIS logo and the highly conserved positions in the MSA for this family (Fig. 1a).
Some regions of a protein domain may not play a major functional role, and the exact amino acids at these positions may have little effect on protein function. However, a single change to a key amino acid, for instance, at an active site, might fundamentally impair a protein's function. Traditionally, MSAs are used to identify such conserved sequence positions, which are likely to be important for protein function (Ashkenazy et al., 2016). Figure 1 suggests that SIS are highly enriched for evolutionarily conserved sequence positions. Note, however, not all conserved indices appear in the SIS logo as the model learns discriminative features to distinguish families. Some sequence positions may be highly conserved, but occur in similar contexts in many families.
4.2. Explaining misclassifications
Our approach also enables us to explain misclassifications. The deep CNN misclassifies sequence Q22DM3_TETTS/41-68, a member of family EF-hand_6, as belonging to family EF-hand_1, another helix-loop-helix structure family. We analyze this particular sequence because its predicted probability is nearly 1.0, and the model has few misclassifications from EF-hand_6 to EF-hand_1, and vice versa. We find that the SIS for Q22DM3_TETTS/41-68 is closer to SIS of sequences in the predicted family (two edits) than those from the true family (three edits) (Fig. 2b). Moreover, we find that the SIS for this misclassified sequence is closer on average to SIS from the predicted family than from the true family (Fig. 2a).

4.3. Model inconsistency
One concern for practitioners (e.g., biologists) when considering deep learning models is stability when retrained on the same data. We compare our 10 replicate models for each architecture. The small variance in accuracy we observe across the replicates of each architecture trained with different random initialization may suggest that the replicate models behave similarly (Table 1). This is satisfying given that each DNN training run performs nonconvex optimization from a different random initialization. However, while the models seem to be stable in terms of their outputs, they may still learn different decision boundaries from the training data and hence classify sequences for different reasons. We probe this question by evaluating whether SIS are consistent across the replicates.
Final Test Set Accuracies of Our Trained Models
Each value indicates mean over the 10 replicate models per architecture (see Section 3.1) ± 2σ.
CNN, convolutional neural network; RNN, recurrent neural network.
We choose two deep CNNs (selected at random from the replicates) and use each to make predictions on SIS from the other model (using all sequences in our Pfam seed subsample). Figure 3a shows the distributions of these predictions for the two models and suggests that they classify sequences for different reasons. Figure 3b shows an example of a protein sequence, F1QK57_DANRE/2132-2166 (chosen at random), for which the SIS identifies an SIS collection with one sufficient input subset for each of the models. The shaded characters indicate the SIS for this sequence for each model, such that each model classifies the sequence with confidence 0.9 using just the shaded amino acids. There is little overlap between the shaded subsets, suggesting that the two models (with identical architectures) are learning different decision boundaries.

4.4. Feature compression
We next compare our DNN models by evaluating how succinctly they rationalize their decisions. We compute SIS collections across our subsample of Pfam seed for all replicates of the three DNN architectures. Figure 4 shows the distribution of the fraction of the input sequence contained in the SIS for each of our three architectures. This result suggests that the RNN uses substantially fewer sequence positions than the CNN-based architectures to make classifications with the same confidence. The deep CNN requires a larger fraction of the sequence to classify proteins than does the shallow CNN. We find that the RNN can more succinctly rationalize its decisions than the CNNs. How verbosely a model should rationalize its decisions likely depends on its application.

Boxplot indicating the fraction of protein sequence positions in each SIS. Distributions are over the union of all SIS from all 10 replicates of each architecture.
4.5. SIS location
We apply our SIS logo methodology to examine whether different model architectures rely on amino acids at different locations to classify sequences. We visualize SIS logos using sequences from the Pfam v32.0 full dataset (El-Gebali et al., 2018) classified as lactate dehydrogenase (PF00056) by the deep CNN, shallow CNN, and RNN models (Fig. 5b–d, respectively). We compare these SIS logos with the family HMM logo (Fig. 5a). We observe that while there is significant intersection in the sequence positions comprising sufficient input subsets across these models, positions that are important for RNN classification are found almost exclusively at the ends of the sequence, whereas the deep CNN often also requires positions in the center of the domain.

To assess whether this property occurs more generally, we look at SIS collections from our Pfam seed subsample containing many families. We observe which positions are captured in each SIS and then normalize by sequence length to the relative position in the sequence. Aggregating these positional data (Fig. 6) reveals that the pattern observed for PF00056 is common across the entire dataset and reflects a difference in preference for these three different DNNs. To determine whether this is a stochastic feature of training or driven by the model architecture itself, we examine the 10 replicate models trained for each architecture with different random initialization. The positional biases of different replicates of each architecture are remarkably consistent (note error bars in Fig. 6), with all architectures biased to features at the ends of the sequence, but with this effect stronger in the shallow CNN than the deep CNN and stronger still in the RNN. This may reflect the difficulties that recurrent models can have in learning long-range dependencies. It is also likely that the hand-segmented nature of our training data means that the edges of sequences carry more information, imparted by the curators' segmentation decisions.

Density of SIS positions throughout protein sequences. For each of the three architectures, we plot the mean density at each position taken over 10 replicate models trained for each architecture. Error bars indicate minimum and maximum values over the replicates.
4.6. SIS-3D
The shaded amino acids in Figure 7a and b display sufficient input subsets for the human hemoglobin α and β domain sequences, which were correctly predicted to belong to the globin family (PF00042) by the deep CNN. We use PyMOL (Schrödinger, LLC, 2015) with coordinates from the solved protein structure 1a3n in the protein data bank (Tame and Vallone, 2000; Burley et al., 2017) to render this visualization. These subsets comprise roughly 10%–15% of the sequence and are scattered throughout the domain. However, distant positions in the sequence may become close when the protein folds into a 3D structure. Figure 7c shows the amino acids in the SIS for human hemoglobin α using all-atom sphere representation (blue) superimposed on a representation of the protein backbone (gray). The core function of hemoglobin is to transport oxygen, which binds to the iron atom contained in each associated heme group (red). The heme group binds to a pocket in the 3D structure of the hemoglobin α molecule. Figure 7c and d shows that the amino acids contained in the sufficient input subsets for both hemoglobin α and β are clustered around the heme binding sites in the 3D structure. The heme binding pocket is highly characteristic of members of the globin domain family, and the fact that the sufficient input subset overlaps with this functionally important feature of the globin tertiary structure suggests that the trained model has learned nontrivial information about the globin domain.

5. Discussion
We contribute a series of analytical tools that employ SIS to critique the behavior of DNNs for protein domain sequence classification. These allow us to understand the fidelity of the models to the underlying biology and to interrogate their stability and reliability. These techniques are general and can be applied more broadly to ML models in other domains.
We find evidence that our trained models make predictions for biologically relevant reasons and that misclassifications can be explained. However, we find that a sequence can be accurately classified using a small subset of its positions and the subsets chosen vary across different instantiations of the same model. Consequently, a model can classify a sequence accurately while, for example, ignoring all but its beginning and end, suggesting that the set of features that dictates family membership is redundant.
Depending on their specific use cases for such a model, researchers may use the above analyses to conclude that a model can or cannot be trusted. In certain contexts, the researcher may simply require that the model achieves high accuracy and makes predictions for sensible reasons. On the other hand, researchers sometimes use predictive models for higher-level tasks, for example, in silico screening to estimate the functional impact of mutations. In this study, the model is treated as a surrogate to laboratory experiments and is evaluated on both the original wild-type sequence and on sequence mutations. The experiment will be unreliable if the model's rationales are not scientifically sound, which may be the case if, for example, predictions are unstable across different initializations. Furthermore, the model may underestimate the impact of mutations in the middle of the sequence. Our techniques can be used to uncover such insights.
Footnotes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
