Abstract
Abstract
The computation of the global minimum energy conformation (GMEC) is an important and challenging topic in structure-based computational protein design. In this article, we propose a new protein design algorithm based on the AND/OR branch-and-bound (AOBB) search, a variant of the traditional branch-and-bound search algorithm, to solve this combinatorial optimization problem. By integrating with a powerful heuristic function, AOBB is able to fully exploit the graph structure of the underlying residue interaction network of a backbone template to significantly accelerate the design process. Tests on real protein data show that our new protein design algorithm is able to solve many problems that were previously unsolvable by the traditional exact search algorithms, and for the problems that can be solved with traditional provable algorithms, our new method can provide a large speedup by several orders of magnitude while still guaranteeing to find the global minimum energy conformation (GMEC) solution.
1. Introduction
I
The aim of SCPD is to find the global minimum energy conformation (GMEC), that is, the global optimal solution of an amino acid sequence that minimizes a defined energy function. In practice, the rigid body assumption, which anchors the backbone template, is usually applied to reduce computational complexity. In addition, possible side-chain assignments for a residue are further discretized into several known conformations, called the rotamer library. It has been proved that SCPD is NP-hard (Pierce and Winfree, 2002) even with the two aforementioned prerequisites. A number of heuristic methods have been proposed to approximate the GMEC (Street and Mayo, 1999; Kuhlman and Baker, 2000; Marvin and Hellinga, 2001). Unfortunately, these heuristic methods can be trapped into local minima and may lead to poor quality of the final solution. On the other hand, several exact and provable search algorithms that guarantee to find the GMEC solution have been proposed, such as dead-end elimination (DEE) (Desmet et al., 1992), A* search (Leach et al., 1998; Lippow and Tidor, 2007; Donald, 2011; Zhou and Zeng, 2015), tree decomposition (Xu and Berger, 2006), branch-and-bound (BnB) search (Hong and Lozano-Pérez, 2006; Traoré et al., 2013; Allouche et al., 2014), and BnB-based linear integer programming (Althaus et al., 2002; Kingsford et al., 2005).
In our protein design scheme, a set of DEE criteria (Goldstein, 1994; Gainza et al., 2012) are first applied to prune the infeasible rotamers that are provably not in the GMEC solution. After that, the AND/OR branch-and-bound (AOBB) search (Marinescu and Dechter, 2009) is used to traverse over the remaining conformational space to find the GMEC solution. Based on an advanced heuristic function, our new design algorithm can fully exploit the graph structure of the underlying residue interaction network and efficiently prune a large number of infeasible branches during the search. In addition, we propose an elegant extension of this AND/OR branch-and-bound algorithm to compute the top k solutions within a user-defined energy cutoff from the GMEC. Our tests on real protein data show that our new protein design algorithm can address many design problems that cannot be solved precisely, and for the problems that were solvable formerly, our new method can achieve a significant speedup by several orders of magnitude compared to the traditional exact search algorithms.
1.1. Related work
The A* algorithm (Leach et al., 1998; Keedy et al., 2013) was proposed to combine with DEE pruning to compute the GMEC solution. It uses a priority queue to store all the expanded states, which unfortunately may exceed the hardware memory limitation for large problems. AOBB, on the contrary, uses depth-first-search strategy that only requires linear space complexity with respect to the number of mutable residues.
The traditional BnB search algorithm (Hong and Lozano-Pérez, 2006) usually ignores the underlying topological information of the residue interaction network constructed based on the backbone template, while AOBB is designed to exploit this property.
Although the tree decomposition method (Xu and Berger, 2006) utilizes the residue interaction network, it lacks an efficient method to traverse the whole conformational space when the tree width is large and the table allocated by its dynamic programming routine may be too large to fit in memory. To fix this problem, AOBB adopts the mini-bucket heuristic to prune a large number of states to speed up the search process.
2. Methods
2.1. Overview
Under the assumptions of rigid backbone structures and discrete side-chain conformations, the structure-based computational protein design (SCPD) can be formulated as a combinatorial optimization problem that aims to find the best rotamer sequence r = (r1, …, rn) that minimizes the following objective function:
where n stands for the number of mutable residues, ET (r) represents the total energy of the system in which the rotamer assignment of the mutable residues is r, E0 represents the template energy (i.e., the sum of the backbone energy and the energy among nonmutable residues), E1(ri) represents the self energy of rotamer ri (i.e., the sum of intra-residue energy and the energy between ri and nonmutable residues), and E2(ri, rj) is the pairwise energy between rotamers ri and rj.
To find the global minimum energy conformation (GMEC), we often need to search over a huge conformational space. Our search scheme followed the popular protein design pipeline in the literature (Leach et al., 1998; Keedy et al., 2013): First, a combination of several dead-end elimination criteria (Goldstein, 1994; Gainza et al., 2012) was applied to prune the rotamers that are provably not in the optimal solution and thus reduce the magnitude of the search space. Then, a combinatorial optimization algorithm, namely the AND/OR branch-and-bound search, is used to traverse the remaining conformational space and guarantees finding the GMEC solution.
2.2. AND/OR branch-and-bound search
2.2.1. Branch-and-bound search
Suppose we try to find the global minimum value of the energy function E(r), in which r∈R and R is the conformational space of the rotamers. The BnB algorithm executes two steps recursively. The first step is called branching, in which we split the conformational space R into two or more smaller spaces, that is, R1, R2, …, Rm, where
The second step of BnB is called bounding. Suppose the current lowest energy conformation is
2.2.2. Residue interaction network
Traditional BnB algorithm can hardly exploit the underlying graph structure of the residue–residue interactions. In a real design problem, some mutable residues can be relatively distant and thus the pairwise energy terms in Equation (1) between these residues are usually negligible. Based on this observation, we can construct a residue interaction network, in which each node represents a residue, and two nodes are connected by an undirected edge if and only if the distance between them is less than a threshold. Figure 1a gives an example of such a residue interaction network.

An example of constructing an AND/OR search tree. (
Consider a residue interaction network that contains two connected components (i.e., two clusters of mutable residues at two distant positions). Suppose each residue has at most p rotamers and the size of each connected component is q. Then the traditional BnB search needs to visit O(p2q) nodes in the worst case. However, from the residue interaction network, we know that two connected components are independent, which means that altering the rotamers in one connected component does not affect the pairwise energy terms in the other connected component. So we can run the BnB search for each connected component independently and then put the resulting minimum energy conformations together to form the GMEC solution, which only needs to visit O(pq) nodes in the worst case.
The independence requirement of connected components in a residue interaction network is too strict in practice. In fact, we can partition the whole network into several independent connected components after choosing particular rotamers for some residues. For example, after fixing the rotamers for residues A and B in the example shown in Figure 1a, we can obtain two independent components CE and DF. Then we can use the aforementioned method to reduce the size of search space and then search it using branch-and-bound algorithm. This is the major motivation of AND/OR branch-and-bound (AOBB) search (Marinescu and Dechter, 2009).
2.2.3. AND/OR search space
A pseudo-tree (Freuder and Quinn, 1985) of a graph (network) is a rooted spanning tree on that graph in which every nontree edge in the graph is connected from a node to its offspring in the spanning tree. In other words, nontree edges are not allowed to connect two nodes that are located in different branches of the spanning tree. Figure 1b shows an example of a pseudo-tree constructed based on the residue interaction network in Figure 1a.
The pseudo-tree is a useful representation because for any node x in the tree, once all the side-chains of x and its ancestors are fixed, all the subtrees rooted at the children of node x are independent. In other words, altering the rotamers for the subtree rooted at a child of x does not affect the total energy of the another subtree. Thus, the size of the search space for all subtrees rooted at children of node x is proportional to the sum of the sizes of these subtrees rather than the product of their sizes as in the traditional BnB algorithm. Therefore, AOBB often has a much smaller search space compared to the traditional BnB search.
The structure of an AOBB search tree is determined by its pseudo-tree. To represent the dependency between nodes, an AOBB search tree contains two types of nodes. The first type of nodes is called OR nodes, which splits the space into several parts that cover the original space by assigning a particular rotamer to a residue. The second type of nodes is called AND nodes, which decomposes the space into several smaller spaces where the computations of total energy of residues in different branches are independent of each other. The root of an AOBB search tree is an OR node, and all the leaves are AND nodes. For each node in an AOBB search tree, its type is different from that of its parent. An example of an AOBB search tree is given in Figure 1c.
Unlike the traditional BnB search, in which a solution is represented by a single leaf node, in an AOBB search tree, a valid conformation is represented by a tree, called the solution tree. A solution tree shares the same root with the AOBB search tree. If an AND node is in the solution tree, all its OR children are also in the tree. If an OR node is in the solution tree, exactly one of its AND children is in the tree. The tree with bold lines in Figure 1c shows an example of a solution tree. To compute the best solution tree with the minimum energy when traversing the search space, we can maintain a node value v(x) to store the total energy involving the residues in the subtree rooted at x. In an AOBB search tree, v(x) can be computed as follows:
where child(x) stands for the set of children of node x and e(y) is the sum of the self energy of the rotamer represented by y and the pairwise energy between the rotamer represented by y and other rotamers represented by the ancestors of y. Then the v(·) value of the root of the whole search tree is equal to the energy of the GMEC solution. The corresponding best solution tree can be constructed using a similar method.
Algorithm 1 provides the pseudocode of the AOBB search algorithm. For simplicity, we leave out the code of constructing solution trees and only describe the procedure of computing v(·) values. For each OR node x, we use c(x) to store the pointer to the child with the best v(·) value if the subtree rooted at x has been fully explored (line 13), or the pointer to the child whose subtree is currently being visited (line 6), or a null pointer if x has not been visited (line 1).
The bounding step can also be performed in AOBB to prune unpromising branches. The heuristic function h(x) returns a lower bound of v(x), which is used to compute the heuristic value of an incomplete solution tree. When performing the bounding step for an AND node x, we examine all the OR ancestors' nodes of x. For any OR ancestor y, if the heuristic value for the current incomplete solution tree rooted at y (computed by T
2.2.4. Heuristic function
The choice of the heuristic function h(x), which is a lower bound of v(x), heavily affects the performance of the AOBB algorithm. A popular heuristic function used with AOBB is called mini-bucket heuristic (Kask and Dechter, 2001), which is computed by the mini-bucket elimination algorithm (Dechter and Rish, 2003). The computation of mini-bucket heuristic can be accelerated through precomputation, so that h(x) can be computed efficiently by looking up of precomputed tables. The bound given by the mini-bucket heuristic can be further tighten by max-product linear programming (Globerson and Jaakkola, 2008) and join graph linear programming (Ihler et al., 2012).
The mini-bucket elimination is an approximation of the bucket elimination algorithm (Dechter, 1998), which is an another exact algorithm for solving the combinatorial problem with an underlying graph structure, such as protein design, based on a pseudo-tree. More specifically, the bucket elimination algorithm maintains an energy table hx(·) for each tree node x, which stores the exact lower bound on the sum of energy involving the residues in the subtree rooted at x given the rotamer assignments of x's ancestors. For instance, hD(rA, rB, rC) in Figure 2a stores the exact lower bound of node D given the rotamer assignments of its ancestors rA, rB, and rC. These energy tables can be computed in a bottom-up manner. As an example, Figure 2a shows the energy tables of the bucket elimination on a pseudo-tree of a residue interaction network, and we can compute hC(rA,rB) = minrC(E(rB,rC)+hD(rA,rB,rC)+hE(rB,rC)), where E(rB,rC) represents the pairwise energy term between rotamers rB and rC. The h value of the tree root, hA(), in this example, is the total energy of the GMEC. The time complexity of bucket elimination is O(n*exp(w)) (Dechter, 1998), where n is the number of the nodes and w is the tree width (Robertson and Seymour, 1986) of the graph.

An example of mini-bucket elimination.
If the tree width of a graph is large, the energy tables may be high-dimensional and thus can be too large to compute. The mini-bucket elimination is proposed to address this problem. In particular, it splits a node with a large energy table into multiple nodes with smaller energy tables, called mini-buckets, along with the pairwise energy term represented by the new added edges to decrease the dimension of its original energy table. We use h′x(·) to represent the new energy table for each node x computed by the mini-bucket algorithm. Figure 2b gives an example, in which hD(rA,rB,rC) is split into two smaller tables h′D(rB,rC) = min rD (E(rD,rB)+E(rD,rC)) and h′D′(rA) = minrDE(rD,rA). Because now D and D′ can be assigned with different rotamers, the new energy tables computed by the bucket elimination on the new graph is a lower bound of the original problem. Therefore, we can use the sum of h′x(·) on all mini-buckets of a node as the heuristic function for AOBB.
The size of an AOBB search tree is determined by the depth of its pseudo-tree, and usually the mini-bucket heuristic prefers a small tree width to generate a tight lower bound. The min-fill heuristic (Kjerulff, 1990) is often used with the mini-bucket heuristic to generate such a high-quality pseudo-tree.
2.3. Finding sub-optimal conformations
In practice, we often require the design algorithm to output at most k best conformations within a given energy cutoff Δ (Donald, 2011). In the BnB framework, this can be done easily by running the BnB search k times and removing the optimal conformations found in the preceding rounds from the search space. The task is more complicated to tackle in the AOBB because a conformation is represented by a solution tree rather than a tree node. Our solution consists of two parts:
1. In bounding steps, do not prune nodes in which the heuristic function values of the corresponding solution trees do not exceed the critical value by 2. Keep track of the k best solution trees and their v(·) values rather than only a single solution.
For the second part, we need to extend the procedure of computing v(x), originally described in Equation (2). For each node x, we now store the k node values. Let v1(x) be the best node value, v2(x) be the second one, and so on. For each leaf node x, v1(x) = 0 and
The merge operation for AND nodes is challenging. For each AND node x, let its children be y1, y2, …, yt. Our task is to find k different sequences (a1,…,aj,…,ak), where aj = (aj1,aj2,…,ajt) and aji ∈{1,2,…,k}, so that
A simple example and the pseudocode of our merge algorithm for an AND node are shown in Figure 3. This algorithm uses a priority queue Q, which is a data structure that supports the operations of inserting a key/value pair (i.e., element) and extracting the element with the minimum value. We first define an index sequence b = (b1,…,bt), where entry bi represents the index of the chosen v(·) value in child yi. Initially, b = (1,1,…,1) is pushed to Q. In this problem, the value of an element is the sum of v(·) values of the AND nodes' children computed using the index sequence b as the key (line 4). The initial index sequence b = (1,1,…,1) corresponds to the first sequence a1 because we choose the best v value for each child and thus we can get the best v(·) value for their parent. Each time we extract the element with the minimum value from Q as the next best sequence (line 6). Then we push all the successors of the extracted sequence, computed by increasing only one index for each element in the sequence, into the priority queue (lines 9 to 14). We repeat these steps until all the ai values are generated. The time complexity of this process is O(kt log(kt)), where k is the number of required suboptimal solutions and t is the number of children of an AND node. The proof of the correctness about our merge algorithm is provided in appendix section 5.2.

The merge operation for AND nodes.
2.4. Complexity analysis
In this section, we analyze the complexity of the AOBB search algorithm and compare it with other algorithms in the literature. Because AOBB uses the depth-first-search strategy, its space complexity is O(n), where n is the total number of mutable residues. The time complexity of the AOBB algorithm is determined by the depth of its pseudo-tree, which is O(n*pd) in the worst case, where p is the number of rotamers per residue and d is the depth of the pseudo-tree, as the size of the AOBB search tree is O(pd), and for each tree node we need to compute T
Therefore, the worst-case complexity of AOBB is
When computing the suboptimal solutions, AOBB applies a more complicated algorithm (section 2.3) than traditional A* search (Donald, 2011). The time complexity for AOBB to compute the suboptimal solution is O(npdklog(kn)), where k is the number of required suboptimal solutions, p is the number of rotamers per mutable residue, d is the depth of the pseudo-tree of the residue interaction network, and n is the total number of mutable residues. Thus, for large k, computing the sub-optimal solutions in AOBB may require a longer time compared than in traditional A* search.
3. Results
We conducted two computational experiments to evaluate the performance of our new AOBB-based protein design algorithm. In the first experiment, we compared our new AOBB-based algorithm with the traditional A*-based algorithm in a core redesign problem. To make a fair comparison, in this test we did not not make any approximation in the energy matrix (i.e., the residue interaction network is fully connected) because the A*-based algorithm cannot benefit much from such approximation. In the second computational experiment, we performed the full protein design to examine the performance of our algorithm on a larger residue interaction network.
Our AOBB-based protein design algorithm was implemented based on the protein design package OSPREY (Keedy et al., 2013) and the UAI branch of the AOBB search framework daoopt (Otten and Dechter, 2012; Otten et al., 2012). For comparison, we used the DEE/A* solver provided by the OSPREY package. In addition, we included the sequential A* solver with the improved computation of heuristic functions (Zhou et al., 2014). We used an Intel Xeon E5-1620 3.6 GHz CPU in all evaluation tests.
3.1. Core redesign
Core redesign can replace the amino acids in the core of a wild-type protein to increase its thermostability (Korkegian et al., 2005). In this experiment, we tested all the 23 protein core redesign cases that failed to be solved in using the expanded rotamer library with the rigid DEE/A* in 4G memory from Gainza et al. (2012). In addition, we picked another five design problems from Gainza et al. (2012) that were solvable within the given memory using the traditional DEE/A* algorithm. To make a fair comparison between A* and AOBB search algorithm, we did not remove any edge from the fully connected residue interaction network during the AOBB search in this test.
Table 1 summarizes the comparison results between A*-based and our AOBB-based algorithms, in which OOM and OOT represents “out of memory” and “out of time,” respectively. The memory was limited to 4G, which was the same as that in (Gainza et al., 2012), and the running time was limited to 8 hours. The first 23 rows show all the cases in Gainza et al., (2012), which were formerly unsolvable by the original A* algorithm. The column labeled as “Space size” shows the size of the conformational space after DEE pruning. The columns labeled as “A* time” and “cA* time” show the running time of the A* solvers from OSPREY and Zhou et al., (2014), respectively. The running time was measured in milliseconds and did not include the initialization steps of each algorithm. The initialization time of AOBB was relatively stable for all cases and typically took 90s to compute the mini-bucket heuristic tables and an initial bound for AOBB search.
As shown in Table 1, the AOBB algorithm can successfully find the GMEC solutions for 21 out of the 23 problems from Gainza et al.'s, (2012) data, which were formerly unsolvable by the original A* algorithm with 4G memory. Also, we find that the number of states expanded in the AOBB search was much less than that in the traditional A* search. Accordingly, for those cases solvable by both algorithms, the AOBB search consumed less time than the traditional A* search. Probably this improvement was due to the fact that the mini-bucket heuristic with MPLP and JGLP is tighter than the heuristic function used in OSPREY.
3.2. Full protein design
In the second computational experiment, we ran the full protein design to evaluate the performance of our AOBB-based protein design algorithm. In the full protein design problem, all residues of a protein are mutable, which leads to a much larger conformational space. For each residue, we picked 1-4 the most similar amino acids, according to the BLOSUM62 matrix, as the mutation candidates. For each pair of residues A and B, we added an edge (A,B) to the residue interaction network if and only if for all rotamer assignments rA and rB, (max rA,rB E(rA,rB)-min rA,rB E(rA,rB)) > λ, where threshold parameter λ was used to trade the precision of the energy with the ease of the problem. We used λ = 0.04 for all the test cases.
Table 2 shows the test results of this computational experiment. The running time was measured in milliseconds. Here we did not list the results of the traditional A*-based algorithm because we found that A*-based algorithms were unable to find the GMEC solutions for all these test cases within 4G memory. The AOBB-based search algorithm can find the GMEC solutions for all the test cases. This demonstrates the power of the AOBB search algorithm with the state-of-the-art heuristic function, which can effectively address full protein design problems.
4. Conclusion and Future Work
In this article, we developed a new protein design algorithm based on the new branch-and-bound search technique (i.e., AOBB) to find the global minimum energy conformation, which speeds up the search process by several orders of magnitude compared to the traditional provable algorithms. The AOBB-based algorithm accelerates the search process by using an advanced heuristic function and can fully exploit the topology of the residue interaction network while it only has linear memory consumption. The algorithm can also output suboptimal solutions by employing an elegant modification of the original search algorithm.
Currently, our algorithm is only implemented on a single machine. It is possible to further accelerate the design process by parallelizing the AOBB search on a GPU processor or a CPU cluster on a supercomputer, which will enable us to deal with large protein design problems.
5. Appendices
5.1. Introduction to traditional branch-and-bound search
Suppose we try to find the global minimum value of the energy function E(r), in which
The second step of BnB is called bounding. Suppose the current lowest energy conformation is
The BnB algorithm generally performs the branching step recursively until the current conformational space contains only one single conformation. The space generated from the branching step can form a BnB search tree, in which the union of subspaces represented by the children of a node covers the whole conformational space of this node. In the protein design problem, we can split a conformational space by assigning a particular rotamer in a residue. For each node in the search tree, the bounding step is applied to prune some branches and thus shorten the search time. Figure 4 shows an example of the traditional branch-and-bound (BnB) search tree.

An example of a branch-and-bound search tree, which contains three mutable residues. The first residue has three allowed rotamers while the other two only have two allowed rotamers. The coding of a conformation is given by three integers, each of which is the index of the rotamer in the corresponding residue. Each tree node represents a conformational space. For each tree node, we split its conformational space by determining a particular rotamer in a residue. The shaded nodes are pruned in the bounding steps because the lower bound of the energy values in these nodes given by the heuristic function is greater than one of the optimal conformations in its siblings. A brute-force search for the full conformational space requires the computation of twelve conformations to guarantee the GMEC solution, while BnB only needs to compute five of them.
To traverse the BnB search tree, a queue Q is often used to store the nodes to be expanded. Initially, Q only contains the node representing the whole conformational space. Also, we maintain a global variable u to store the current lowest energy value. Initially, u can be initialized to the energy of any conformation. In practice, we often use a stochastic local search algorithm to generate a local optimal conformation so that it can be used to prune more nodes at the beginning of the search process. BnB can be executed by looping the following steps until Q becomes empty:
1. Extract the conformational space R from Q; 2. If R only contains a single conformation, update u using the energy of this conformation and then restart the loop; 3. Otherwise, split R into R1, R2, …, Rm by fixing a particular rotamer of a residue; 4. For each i ∈{1,2,…,m}, compute the lower bound of the energy for all nodes in space Ri. If it is smaller than u, push Ri to Q.
Usually Q is implemented by a LIFO queue (i.e., stack), so that the energy of current best conformation u can be updated as early as possible to prune more branches. In this case, the BnB algorithm runs in a depth-first-search mode. Another benefit of this mode is that memory used by the BnB algorithm is only proportional to the depth of the search tree, namely the number of mutable residues in the protein design problem. On the other hand, other search strategies, such as A* search, can store an exponential number of nodes in memory in the worst case. Algorithm 2 gives the pseudocode of an implementation of the BnB search algorithm.
Traditional branch-and-bound search
5.2. Correctness for finding suboptimal solutions
In this section, we provide the proof of Theorem 1 in the article, which states the correctness of our merge algorithm for AND nodes. We restate that theorem in Theorem B.1.
Let i be the first round that the element with the i-th value is not in the priority queue. Suppose ai = (ai1,ai2,…,ait). Because i≠1, there exists j ∈{1,2,…,t} such that aij > 1. Sequence s = (ai1,…,ai,j−1,aij − 1, ai,j+1,…,ait) must have not been expended. Otherwise, according to line 13, sequence ai will be pushed to the priority queue, which contradicts the assumption. On the other hand, because vj(y) is monotone with respect to j, the value of sequence s is smaller than the value of sequence ai. This means that sequence s should be expanded before sequence ai and thus must have already been expanded, which gives the contradiction. ■
Also, we provide the implementation of the merge operation for OR nodes in Algorithm 3. The correctness of this algorithm is self-evident, so we omit its proof here.
Merge operation for OR nodes
Footnotes
Acknowledgments
We thank Dr. Lars Otten and Prof. Alex Ihler for their support in providing their code of the daoopt AOBB solver. Funding: This work was supported in part by the National Basic Research Program of China Grant 2011CBA00300 and 2011CBA00301; the National Natural Science Foundation of China Grant 61033001, 61361136003, and 61472205; and China's Youth 1000-Talent Program.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
