Molecule Sequence Generation with Rebalanced Variational Autoencoder Loss

Abstract

Molecule generation is the procedure to generate initial novel molecule proposals for molecule design. Molecules are first projected into continuous vectors in chemical latent space, and then, these embedding vectors are decoded into molecules under the variational autoencoder (VAE) framework. The continuous latent space of VAE can be utilized to generate novel molecules with desired chemical properties and further optimize the desired chemical properties of molecules. However, there is a posterior collapse problem with the conventional recurrent neural network-based VAEs for the molecule sequence generation, which deteriorates the generation performance. We investigate the posterior collapse problem and find that the underestimated reconstruction loss is the main factor in the posterior collapse problem in molecule sequence generation. To support our conclusion, we present both analytical and experimental evidence. What is more, we propose an efficient and effective solution to fix the problem and prevent posterior collapse. As a result, our method achieves competitive reconstruction accuracy and validity score on the benchmark data sets.

1. INTRODUCTION

The key challenge of material and drug design is to discover novel molecules that have the desired physical or chemical properties. This process can be understood as an optimization problem, as described by Gómez-Bombarelli et al. (2018), in which the target is to find molecules with optimal scores for the desired properties. However, exhaustive exploration of the molecule space is infeasible since the total number of drug-like molecules is on the order of $10^{60}$, as estimated by Polishchuk et al. (2013). Besides, molecule synthesis methods such as those of Yan et al. (2020a, 2021) and molecule validation procedures are also time-consuming and expensive in practice, which further rules out brute-force exploration.

As deep learning methods have achieved success in multiple fields, as in Miao et al. (2018) and Yang et al. (2020), they have also been applied to molecule sequence generation. The majority of existing molecule generation methods rely heavily on the variational autoencoder (VAE) proposed by Diederik et al. (2014) and Rezende et al. (2014). The VAE combines a deep latent variable model with an accompanying variational learning technique. As illustrated in Figure 1, drug molecules can be represented in the simplified molecular-input line-entry system (SMILES) format proposed by Weininger (1988). SMILES is a line notation for describing the structure of chemical molecules. In Figure 1, the input SMILES sequence CO(C)C is first fed into the VAE encoder, which is composed of gated recurrent unit (GRU) layers by Cho et al. (2014), to generate the latent representation.

FIG. 1.

Overview of our VAE model implementation. The encoder and decoder are built based on the bidirectional GRU and unidirectional GRU, respectively. Both the input and output of our model are SMILES sequences. GRU, gated recurrent unit; SMILES, simplified molecular-input line-entry system; VAE, variational autoencoder.

Then, the VAE decoder takes the latent vector as the input to reconstruct the original molecule sequence CO(C)C. One of the desirable properties of the VAE is that its latent space is continuous and smooth. As a result, it allows both semantically meaningful sampling and smooth interpolation in the latent space. In the case of molecule generation, the latent representations of semantically similar molecules (with similar chemical structures and properties) are often clustered together in the latent space. Thanks to the continuous latent space, novel molecules can be generated by randomly sampling from the latent space, since the sampled latent vectors can be regarded as interpolations of existing molecule representations. What is more, the desired properties can be further optimized by exploring the latent space locally. The key idea behind the optimization process is to use the smoothness of the latent space to search for molecules that maximize a property score objective by slightly perturbing the initial latent vector.

However, previous VAE models suffer from the posterior collapse issue, where the decoder tends to ignore latent vectors, as described in Bowman et al. (2016) and Gómez-Bombarelli et al. (2018). This problem is more frequently observed in recurrent neural network (RNN)-based models, as in He et al. (2019). As a consequence, the generated molecules have low diversity and are hardly relevant to the latent vectors, as in Gómez-Bombarelli et al. (2018) and Kusner et al. (2017). This phenomenon has also been observed in natural language processing (NLP) tasks, such as text generation by Bowman et al. (2016). Previous NLP-related studies mainly propose training strategies to alleviate this problem, such as Kullback-Leibler (KL) cost annealing by Bowman et al. (2016) and optimizing the decoder multiple times before each encoder update in He et al. (2019).

However, simply extending these methods to molecule generation helps little, mostly because molecule sequences are strictly structured according to SMILES grammar rules and any mutation within a sequence can lead to an invalid sequence. Motivated by the success of attribute grammars in compiler design and parse trees in the NLP field, subsequent works such as Kusner et al. (2017) and Dai et al. (2018) propose to incorporate grammar rules to guarantee the validity of generated SMILES sequences. As an alternative, a molecule can also be represented by a graph to avoid the posterior collapse, as in Li et al. (2018) and Jin et al. (2018).

Building on developments in NLP text generation, the VAE model was first applied to molecule generation in the character variational autoencoder (CVAE) by Gómez-Bombarelli et al. (2018). They build a VAE encoder and decoder with GRU layers and represent molecules as SMILES sequences. However, their model often generates invalid SMILES sequences, which makes it impractical. To improve the prior validity, context-free grammars for SMILES are introduced in the grammar variational autoencoder (GVAE) by Kusner et al. (2017) to represent a molecule as a parse tree. However, the validity score is still unsatisfactory. Inspired by this method, the syntax-directed variational autoencoder (SD-VAE) by Dai et al. (2018) incorporates extra semantic rules to ensure that generated SMILES are valid, and it achieves the best performance among all SMILES-based methods. However, these models did not solve the posterior collapse problem, and a large performance gap remains.

We propose a novel strategy to alleviate the posterior collapse problem considering the essential drawbacks of the contemporary RNN-based VAE models in the molecule generation situation. To achieve this goal, we carefully analyze the posterior collapse problem of the vanilla VAE model for SMILES sequence generation. We point out that the underestimated reconstruction loss triggers the posterior collapse issue in the molecule sequence generation, as the direct consequence of the imbalance between reconstruction loss and KL loss during VAE training.

To overcome the problem, we propose a novel loss function that rebalances the trade-off between the reconstruction loss and the KL loss in VAE training. Because it neither modifies the VAE network structure nor adds extra computational cost, our proposed strategy is extremely simple yet effective in preventing the posterior collapse in molecule generation. We also provide a detailed analysis of our method* and empirically demonstrate its excellent reconstruction accuracy and competitive validity score on the ZINC 250K data set from Kusner et al. (2017) and on the GuacaMol data set from Brown et al. (2019).

This article is a major extension of our previous conference version (Yan et al., 2020b). In addition to the experimental verification of the statement that the underestimated reconstruction loss causes the posterior collapse of VAE models, we also provide theoretical analysis and proof in this work. Besides, to further improve the validity score of our method, we introduce a partial SMILES sequence checking toolkit, PartialSmiles^†, to verify the validity of the SMILES sequence during the molecule generation process. What is more, to better evaluate the proposed method, we include two extra evaluation metrics, novelty and uniqueness, in the experimental comparison with baseline methods. Last but not least, we conduct experiments on the additional large-scale data set GuacaMol, which consists of 1.6M molecules, to demonstrate the scalability and generalization of our proposed method.

2. METHODS

2.1. The variational autoencoder

The VAE is a specially regularized variant of the standard autoencoder (AE). It is appealing because it can learn complex distributions in an unsupervised manner and can later act as a generative model defined by a prior distribution $p (z)$ and a conditional distribution $p_{θ} (x | z)$. Since the true data likelihood is often intractable, the VAE instead maximizes the evidence lower bound objective (ELBO) $ℒ (x; θ, ϕ)$ over the space of all $p_{θ}$, and it is a valid lower bound of the true data log-likelihood $\log p (x)$: $\begin{aligned} \log p_{θ}(x) &= ℰ_{z \sim q_{ϕ}(z | x)} [\log p_{θ}(x)] && \text{($p_{θ}(x)$ does not depend on $z$)} \\ &= ℰ_{z} \left[\log \frac{p_{θ}(x | z)\, p_{θ}(z)}{p_{θ}(z | x)}\right] && \text{(Bayes' theorem)} \\ &= ℰ_{z} \left[\log \frac{p_{θ}(x | z)\, p_{θ}(z)}{p_{θ}(z | x)} \frac{q_{ϕ}(z | x)}{q_{ϕ}(z | x)}\right] && \text{(multiply by a constant)} \\ &= ℰ_{z} [\log p_{θ}(x | z)] - ℰ_{z} \left[\log \frac{q_{ϕ}(z | x)}{p_{θ}(z)}\right] + ℰ_{z} \left[\log \frac{q_{ϕ}(z | x)}{p_{θ}(z | x)}\right] \\ &= ℰ_{z} [\log p_{θ}(x | z)] - D_{KL}(q_{ϕ}(z | x) \| p_{θ}(z)) + D_{KL}(q_{ϕ}(z | x) \| p_{θ}(z | x)) \\ &\geq ℰ_{z} [\log p_{θ}(x | z)] - D_{KL}(q_{ϕ}(z | x) \| p_{θ}(z)) \\ &= ℒ(x; θ, ϕ), \end{aligned}$ (1)

where the VAE encoder $q_{ϕ} (z | x)$ is parameterized with $ϕ$ and learns to map the input x to a variational distribution represented by z, and the VAE decoder $p_{θ} (x | z)$, parameterized with $θ$, reconstructs the input x given the latent vector z. The inequality holds since $D_{KL} \geq 0$. In practice, $q_{ϕ} (z | x)$ is usually modeled as a Gaussian distribution, and it is optimized to approximate the true posterior $p_{θ} (z | x)$ to reduce the gap between the ELBO and the true data log-likelihood $\log p (x)$.

VAE training maximizes the ELBO, where (1) the negative reconstruction loss $ℰ_{z \sim q_{ϕ} (z | x)} [\log p_{θ} (x | z)]$ pushes the encoder to generate a meaningful latent vector z, so that the decoder can reconstruct the input x from z, and (2) the KL regularization loss $D_{KL} (q_{ϕ} (z | x) \| p_{θ} (z))$ measures the KL divergence between the approximate posterior $q_{ϕ} (z | x)$ and the prior $p_{θ} (z) = N (0, I)$, which training minimizes.
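As a rough illustration of the two ELBO terms, the sketch below assumes a diagonal Gaussian posterior and a unit Gaussian prior and uses the standard closed form of the KL divergence between them. The function names and the `recon_log_prob` argument (a decoder log-likelihood computed elsewhere) are illustrative choices, not part of the original implementation.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)


def elbo(recon_log_prob, mu, logvar):
    """Per-example ELBO: log p(x|z) estimate minus the KL regularizer."""
    return recon_log_prob - gaussian_kl_to_standard_normal(mu, logvar)


# Toy batch of two 56-dimensional posteriors (illustrative values only).
mu = np.stack([np.zeros(56), np.ones(56)])
logvar = np.zeros((2, 56))
recon_log_prob = np.array([-10.0, -10.0])
kl_values = gaussian_kl_to_standard_normal(mu, logvar)
elbo_values = elbo(recon_log_prob, mu, logvar)
```

A posterior equal to the prior contributes zero KL, so the first example's ELBO reduces to its reconstruction term alone.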

2.2. Posterior collapse problem in VAE

The posterior collapse phenomenon has also been reported in previous works on NLP text generation such as Bowman et al. (2016), Yang et al. (2017), and Kim et al. (2018). When posterior collapse happens, the model training falls into a local optimum of the ELBO, in which the decoder tends to ignore z and the variational posterior $q_{ϕ} (z | x)$ naively mimics the model prior $p (z)$. Note that the KL loss in the ELBO objective can be further decomposed as in Hoffman and Johnson (2016): $D_{KL}(q_{ϕ}(z | x) \| p(z)) = I_{q}(x, z) + D_{KL}(q_{ϕ}(z) \| p(z)),$ (2)

where $I_{q} (x, z)$ is the mutual information between x and z given $q_{ϕ} (z | x)$ : $\begin{matrix} I_{q} (x, z) = ℰ_{q_{ϕ} (z | x)} [log q_{ϕ} (z | x)] - ℰ_{q_{ϕ} (z)} [log q_{ϕ} (z)] . \end{matrix}$ (3)

When posterior collapse occurs, the KL loss decreases nearly to zero during VAE training, so $I_{q} (x, z)$ is also close to zero, since both terms on the right-hand side of Equation (2) are non-negative. This is especially evident when modeling discrete data with a strong autoregressive network such as long short-term memory (LSTM) by Hochreiter and Schmidhuber (1997) and GRU by Chung et al. (2014), which is exactly our case for molecule sequence generation. This is undesirable since the VAE model fails to learn meaningful latent representations for input molecule sequences.

For text generation tasks in NLP, the posterior collapse problem has been mainly attributed to the low quality of latent representations z at the early stage of model training, as pointed out by Bowman et al. (2016), He et al. (2019), and Fu et al. (2019). To be more specific, the decoder $p_{θ} (x | z)$ falls behind the encoder $q_{ϕ} (z | x)$ at the initial training stage, and $q_{ϕ} (z | x)$ generates low-quality latent representations so that it is very hard for $p_{θ} (x | z)$ to recover the input sequences. As a result, the model is forced to ignore z. Many solutions have been proposed to solve the problem, and they have demonstrated satisfactory improvement on various NLP data sets.

However, molecule SMILES generation is a quite different scenario, although it appears to be the same as NLP text generation. First of all, its vocabulary is far smaller than that of NLP text generation data sets. The vocabulary of NLP text data usually contains tens of thousands of tokens or more, while that of chemical molecule data contains fewer than 100. The smaller vocabulary makes the molecule reconstruction task much easier. Second, the molecule sequence is composed strictly following the SMILES grammar rules, and the reconstructed sequence must be exactly the same as the input to be matched successfully. Any token mutation can result in an invalid sequence. In contrast, no such rigid grammar rules apply to NLP text, and an exact match is not required.

We find that the existing solutions of He et al. (2019) and Fu et al. (2019) for NLP text generation perform poorly in chemical molecule generation. This motivates us to propose a solution tailored to molecule sequence generation.

2.3. Underestimated reconstruction loss

To investigate the cause of the posterior collapse in the VAE for molecule sequence generation, we conduct an extensive analysis of the posterior collapse process. We find that it is the underestimated reconstruction loss that causes posterior collapse during the VAE training process. Both theoretical analysis and experimental support are provided to verify our hypothesis.

The reconstruction loss term $ℰ_{q_{ϕ} (z | x)} [\log p_{θ} (x | z)]$ in Equation (1) measures the reconstruction ability of the decoder given the latent vector z. The decoder should only receive information from z and try to reconstruct the full sequence accurately from the given z. However, in practice, RNN models are usually trained with the teacher forcing method proposed by Williams and Zipser (1989), in which the RNN input at each step is the ground-truth token instead of the prediction from the previous time step.

We can rewrite the reconstruction loss term in Equation (1) as: $ℰ_{q_{ϕ}(z | x)} [\log p_{θ}(x | z)] = ℰ_{q_{ϕ}(z | x)} \left[\sum_{t = 1}^{T} \log p_{θ}(x_{t} | z, {\tilde{x}}_{0, \dots, t - 1})\right],$ (4)

where T is the maximum time step, ${\tilde{x}}_{0, \dots, t - 1}$ is the predicted sequence prefix before time step t, the current input of the RNN is the output ${\tilde{x}}_{t - 1}$ at the previous time step, and ${\tilde{x}}_{0}$ is the predefined start symbol.

With the teacher forcing training method, the actual reconstruction loss during VAE training becomes: $ℰ_{q_{ϕ}(z | x)} \left[\sum_{t = 1}^{T} \log p_{θ}(x_{t} | z, x_{0, \dots, t - 1})\right],$ (5)

where $x_{0, \dots, t - 1}$ is the ground-truth prefix before time step t, the ground-truth token at the previous time step $x_{t - 1}$ is the RNN input at each time step t, and $x_{0}$ is also the predefined start symbol.

We posit $\log p_{θ} (x_{t} | z, x_{0, \dots, t - 1}) = \log p_{θ} (x_{t} | z, x_{0, \dots, t - 1}, {\tilde{x}}_{0, \dots, t - 1})$ since, with teacher forcing, RNN training does not rely on the prediction as the input. Then, we can prove that the log-likelihood in Equation (5) is larger than that in Equation (4):

The ground-truth information $x_{0, \dots, t - 1}$ is incorporated additionally at each time step in Equation (5) when training the VAE, and it makes the decoder's prediction task easier; therefore, we can expect that the reconstruction ability of the decoder is largely overestimated compared with Equation (4). As a result, the reconstruction loss term is underestimated, which will potentially break the balance between reconstruction loss and KL loss in Equation (1). We calculate quantitatively how much the reconstruction loss is underestimated in Section 3.3.

2.4. Rebalanced VAE loss

Since the reconstruction loss is underestimated during training, which breaks its balance with the KL loss and eventually leads to posterior collapse, we propose to restore the balance by applying a weight $α$ to the reconstruction loss: $ℒ(x; θ, ϕ) = α\, ℰ_{q_{ϕ}(z | x)} [\log p_{θ}(x | z)] - D_{KL}(q_{ϕ}(z | x) \| p(z)), \quad α > 1,$ (7)

where $α$ can be estimated using Monte Carlo sampling in every training iteration. Specifically, we can sample a batch of data as input and run a VAE with/without teacher forcing, respectively. Since the reconstruction loss without teacher forcing can be regarded as the “true” reconstruction loss (the reconstruction loss it should be in VAE training), we approximate $α$ as the ratio of reconstruction loss without teacher forcing to that with teacher forcing. However, estimating $α$ in every training iteration is too expensive. In practice, we can set $α$ as a hyperparameter for simplicity and efficiency. We show how to decide the optimal value for $α$ in the experiment part.
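The Monte Carlo estimate described above can be sketched with a toy autoregressive decoder. Everything here is illustrative: `toy_step_logits` is a hypothetical bigram-style decoder that ignores the latent vector z, whereas a real decoder would condition on z, and free-running decoding is assumed to feed back the greedy (argmax) prediction.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def sequence_nll(step_logits, target, teacher_forcing, start_token=0):
    """Negative log-likelihood of `target` under an autoregressive decoder.

    step_logits(prev_token) returns logits over the vocabulary.
    """
    prev, nll = start_token, 0.0
    for token in target:
        log_probs = log_softmax(step_logits(prev))
        nll -= log_probs[token]
        # Teacher forcing feeds the ground truth; otherwise feed the prediction.
        prev = token if teacher_forcing else int(np.argmax(log_probs))
    return nll


def estimate_alpha(step_logits, batch):
    """Ratio of reconstruction loss without teacher forcing to that with it."""
    nll_tf = np.mean([sequence_nll(step_logits, x, True) for x in batch])
    nll_free = np.mean([sequence_nll(step_logits, x, False) for x in batch])
    return nll_free / nll_tf


VOCAB = 4


def toy_step_logits(prev_token):
    # Hypothetical decoder that prefers the token following prev_token.
    logits = np.zeros(VOCAB)
    logits[(prev_token + 1) % VOCAB] = 2.0
    return logits


batch = [[1, 3, 0, 1]]  # one toy token sequence
alpha = estimate_alpha(toy_step_logits, batch)
```

In this toy case a single early misprediction derails the free-running decoder, so its loss exceeds the teacher-forced loss and the estimated $α$ is greater than 1.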

Inspired by the $β$-VAE (Higgins et al., 2017) formulation, we can instead reduce KL loss weight $β$, which is equivalent to increasing reconstruction loss weight $α$. It is more natural and convenient to weight the KL loss since increasing $β$ from 0 to 1 is a smooth transition from the AE to VAE. So we can have a similarly rebalanced VAE loss formulation: $ℒ(x; θ, ϕ) = ℰ_{q_{ϕ}(z | x)} [\log p_{θ}(x | z)] - β\, D_{KL}(q_{ϕ}(z | x) \| p(z)), \quad 0 \leq β < 1.$ (8)
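The stated equivalence between the two weightings can be made explicit with a short worked step. The subscripted names $\mathcal{L}_{\alpha}$ and $\mathcal{L}_{\beta}$ are introduced here only to tell the two objectives apart; dividing Equation (7) by the positive constant $α$ does not change its maximizer:

```latex
\begin{aligned}
\frac{1}{\alpha}\,\mathcal{L}_{\alpha}(x; \theta, \phi)
  &= \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]
     - \frac{1}{\alpha}\, D_{KL}\left(q_{\phi}(z|x) \,\|\, p(z)\right) \\
  &= \mathcal{L}_{\beta}(x; \theta, \phi)
     \quad \text{with } \beta = \frac{1}{\alpha}, \qquad
     \alpha > 1 \;\Rightarrow\; 0 < \beta < 1 .
\end{aligned}
```

The limiting case $β = 0$ corresponds to dropping the KL term entirely, that is, a plain autoencoder objective.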

Note that in our case $β < 1$, while the $β$-VAE requires the KL weight $β > 1$. The $β$-VAE is proposed in Higgins et al. (2017) to learn disentangled representations of generative factors by enforcing a larger penalty on the KL loss, since they postulate that $β > 1$ could place a stronger constraint on the latent representation and drive the VAE to learn a more efficient latent representation of input x. In contrast, our motivation and goal are completely different: we reduce the KL weight to fix the imbalanced VAE loss because we find that the reconstruction loss is underestimated in the ELBO.

Beyond the above analysis, our method can also be explained from an intuitive perspective. In the previous methods CVAE, GVAE, and SD-VAE, the standard deviation $σ$ has to be reduced to a small value of 0.01 when sampling latent vectors z; otherwise, the model collapses and loses its reconstruction ability. However, the validity score is poor in these methods. Instead of reducing the sampling $σ$, we can anneal the KL loss weight $β$ to make the model gradually transform from an AE to a VAE as in Bowman et al. (2016), since the AE usually has a strong reconstruction ability. Different from Bowman et al. (2016), we restrict $β$ to be smaller than 1. By applying the optimal $β$, we can arrive at a trade-off between the reconstruction accuracy and validity score.
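A capped KL annealing schedule of this kind might look like the sketch below. The linear warm-up shape, the function name `kl_weight`, and the `warmup_steps` parameter are assumptions made for illustration; the text only states that $β$ is annealed and kept below 1.

```python
def kl_weight(step, warmup_steps, beta_max=0.1):
    """Linearly anneal the KL weight from 0 up to beta_max, then hold it.

    beta_max < 1 keeps the objective rebalanced toward reconstruction.
    """
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


# Example schedule over a hypothetical 100-step warm-up.
schedule = [kl_weight(s, warmup_steps=100) for s in (0, 50, 100, 1000)]
```

Setting `beta_max=1.0` would recover conventional KL annealing toward the full VAE objective.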

We acknowledge that previous methods such as Dai et al. (2018), He et al. (2019), and Fu et al. (2019) have empirically tried to reduce the KL loss weight to avoid the posterior collapse. The $β$-VAE ($β$ = 0.4) alleviates the problem and achieves competitive performance on density estimation for NLP text data sets in He et al. (2019), which shows that reducing $β$ is viable for NLP text tasks. It is also indicated in Kusner et al. (2017) and Dai et al. (2018) that setting $β = 1 / \text{LatentDimension}$ could lead to better results. However, none of these methods provided any analysis or explanation. We are the first to recognize that the underestimated reconstruction loss leads to the posterior collapse problem in VAE molecule generation, and further, we propose to reduce the KL weight to overcome the posterior collapse with detailed analysis and solid experimental support.

3. RESULTS

Our proposed solution to the VAE model posterior collapse is simple but extremely effective and efficient. We do not need to modify the network architecture and only adjust the training loss slightly, without introducing much extra computation. In this section, we will first train a vanilla VAE model and track the process of model collapse, as well as experimentally verify that the reconstruction loss is underestimated. Then, we will conduct extensive experiments to demonstrate the effectiveness of our proposed method.

3.1. Experimental settings

We build our VAE model based on GRU layers. The VAE encoder is composed of two layers of bidirectional GRU, which is good at capturing sequence representations as shown by Schuster and Paliwal (1997), and the hidden size of each GRU layer is 512. The decoder is made up of four layers of unidirectional GRU with the same hidden size of 512. Following previous works (Gómez-Bombarelli et al., 2018; Jin et al., 2018), we use a unit Gaussian prior and set the latent vector dimension to 56. The ELBO objective is optimized with the Adam optimizer by Kingma and Ba (2014), and the learning rate is 0.0001. The model is trained with teacher forcing and KL loss annealing. We train the model for 150 epochs and report the performance of the final model. Experiments are conducted on a machine with an Intel Core i7-5930K@3.50GHz CPU and a GTX 1080 Ti GPU.
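For reference, the settings listed above can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical, while the values are those stated in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VAEConfig:
    """Hyperparameters as stated in the experimental settings."""
    encoder_layers: int = 2        # bidirectional GRU layers
    decoder_layers: int = 4        # unidirectional GRU layers
    hidden_size: int = 512         # per GRU layer, encoder and decoder
    latent_dim: int = 56           # unit Gaussian prior
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    epochs: int = 150
    teacher_forcing: bool = True
    kl_annealing: bool = True


config = VAEConfig()
```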

We experiment on the ZINC 250K data set by Kusner et al. (2017), which is a subset of ZINC by Sterling and Irwin (2015). Molecule sequences are tokenized with the regular expression from Schwaller et al. (2018). We use the same training and testing split as previous works (Kusner et al., 2017; Jin et al., 2018) and hold out 10K molecules from the training set as validation data. We also experiment on the large-scale data set GuacaMol by Brown et al. (2019), which is derived from the ChEMBL 24 database by Mendez et al. (2019), to demonstrate the scalability and generalization of our method. The GuacaMol data set consists of 1.6M molecules, and we adopt the same data split provided by Brown et al. (2019). We use the same experimental settings in all our experiments unless explicitly stated.
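Regex-based SMILES tokenization can be sketched as follows. The pattern below is written in the style of the one commonly attributed to Schwaller et al. (2018) and should be treated as an assumption to be checked against the original; the function name `tokenize_smiles` is illustrative.

```python
import re

# Assumed pattern: bracket atoms first, then two-letter halogens, single
# atoms, bonds, branches, and ring-closure digits. Verify before real use.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


tokens_acid_chloride = tokenize_smiles("CC(=O)Cl")
tokens_alanine = tokenize_smiles("C[C@H](N)C(=O)O")
```

Treating `Cl`, `Br`, and bracket atoms as single tokens keeps the vocabulary small while avoiding splits that could never form a valid atom on their own.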

As for the model evaluation metrics, we report the reconstruction accuracy, validity, novelty, and uniqueness scores following previous work. Following Jin et al. (2018), we encode each molecule from the test data set and then decode the obtained latent vector to reconstruct the input molecule SMILES. The reconstructed SMILES must be exactly the same as the input to be counted as successful. The reconstruction accuracy is defined as the ratio of successfully reconstructed molecule sequences to the total number of attempted reconstructions. To calculate validity, 10K latent vectors are randomly sampled from the prior distribution as the input for the decoder.

The validity is the proportion of chemically valid SMILES among all sequences decoded from the random samples. We use the open-source tool RDKit by Landrum et al. (2006) to check the validity of SMILES. The novelty is the ratio of generated chemically valid molecules that are not present in the training data set to the total number of generated chemically valid molecules. It evaluates the model's ability to generate novel molecules. The uniqueness evaluates to what extent a model generates unique chemically valid molecules, and it is defined as the ratio of generated chemically valid molecules that are unique.
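The four metrics defined above can be sketched as plain ratio computations. The function names are illustrative, strings are compared exactly as given (canonicalization is not handled), and `toy_is_valid` is a hypothetical stand-in for a real chemistry check such as one built on RDKit.

```python
def reconstruction_accuracy(inputs, reconstructions):
    """Fraction of reconstructions that exactly match their input SMILES."""
    matches = sum(a == b for a, b in zip(inputs, reconstructions))
    return matches / len(inputs) if inputs else 0.0


def generation_metrics(generated, training_set, is_valid):
    """Validity over all samples; novelty and uniqueness over valid ones."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated) if generated else 0.0
    novelty = sum(s not in training_set for s in valid) / len(valid) if valid else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return {"validity": validity, "novelty": novelty, "uniqueness": uniqueness}


def toy_is_valid(smiles):
    # Hypothetical stand-in: only checks that parentheses are balanced.
    return smiles.count("(") == smiles.count(")")


generated = ["CCO", "CCO", "CC(", "CCN"]
metrics = generation_metrics(generated, training_set={"CCO"}, is_valid=toy_is_valid)
accuracy = reconstruction_accuracy(["CCO", "CCN"], ["CCO", "CCC"])
```

Because novelty and uniqueness are computed only over valid molecules, a model with very low validity can still score highly on both.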

3.2. VAE training dynamics

We track the training process of a vanilla VAE model, as well as that of our proposed method. We investigate training dynamics including the KL loss weight, KL loss, reconstruction loss, mutual information, reconstruction accuracy, and validity score. Mutual information $I_{q} (x, z)$ can be calculated using Monte Carlo sampling as proposed in Hoffman and Johnson (2016) and Dieng et al. (2019): $I_{q} = D_{KL} (q_{ϕ} (z | x) \| p (z)) - D_{KL} (q_{ϕ} (z) \| p (z)),$ (9)

which is actually the same as Equation (2). We approximate the aggregated posterior $q_{ϕ} (z) = ℰ_{p_{d} (x)} [q_{ϕ} (z | x)]$ using Monte Carlo sampling. $D_{KL} (q_{ϕ} (z) \| p (z))$ can also be estimated by Monte Carlo sampling, and we can obtain samples from $q_{ϕ} (z)$ by ancestral sampling: first sampling x from the data set distribution $p_{d} (x)$ and then sampling z from $q_{ϕ} (z | x)$. More details about the computation of $I_{q} (x, z)$ can be found in Hoffman and Johnson (2016).
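One possible Monte Carlo estimate of Equation (9) is sketched below under the assumption of diagonal Gaussian posteriors. The first KL term uses its closed form averaged over the batch, and the aggregated posterior is approximated by the mixture of the batch posteriors; the function names and this particular estimator are illustrative, not necessarily the exact procedure used in the experiments.

```python
import numpy as np


def log_normal_diag(z, mu, logvar):
    """log N(z; mu, diag(exp(logvar))), summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (z - mu) ** 2 / np.exp(logvar), axis=-1
    )


def estimate_mutual_information(mu, logvar, rng):
    """I_q ~= mean KL(q(z|x) || p(z)) - KL(q(z) || p(z)) for a batch."""
    n, d = mu.shape
    kl_post = np.mean(0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1))
    # Ancestral sampling: one z per data point, z_i ~ q(z | x_i).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal((n, d))
    # log q(z_i | x_j) for all pairs, then a log-mean-exp over j for log q(z_i).
    log_q_pairs = log_normal_diag(z[:, None, :], mu[None, :, :], logvar[None, :, :])
    row_max = log_q_pairs.max(axis=1, keepdims=True)
    log_q_agg = (row_max + np.log(np.exp(log_q_pairs - row_max).sum(axis=1, keepdims=True)))[:, 0] - np.log(n)
    log_p = log_normal_diag(z, np.zeros(d), np.zeros(d))
    kl_agg = np.mean(log_q_agg - log_p)
    return kl_post - kl_agg


rng = np.random.default_rng(0)
# Collapsed posterior: every q(z|x) equals the prior.
mi_collapsed = estimate_mutual_information(np.zeros((16, 2)), np.zeros((16, 2)), rng)
# Informative posterior: well-separated means on a 4 x 4 grid, small variance.
grid = np.array([[a, b] for a in (-3.0, -1.0, 1.0, 3.0) for b in (-3.0, -1.0, 1.0, 3.0)])
mi_informative = estimate_mutual_information(grid, np.full((16, 2), np.log(0.01)), rng)
```

When every posterior equals the prior, both KL terms vanish and the estimate is zero, which is the signature of posterior collapse discussed above.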

As a comparison, we also illustrate the training dynamics of our proposed method. We set KL weight $β$ = 0.1, which is explained and derived in the next section. We keep all the other experimental settings the same as the vanilla VAE to make a fair comparison.

Results of the two models' training are plotted in Figure 2. The vanilla VAE model performs well on the validation data at the early stage of the KL weight annealing. However, as the KL weight increases, KL loss drops quickly as expected since more penalty is added to the KL loss term, whereas the reconstruction loss starts to rise at the same time. The mutual information $I_{q} (x, z)$ decreases to 0.65 at the end, which means that the decoder does not absorb much information from the latent vectors when generating the output. This evidence indicates that the posterior collapse has happened. When looking at the model performance on validation data, we can notice that the reconstruction accuracy is close to 0 while the validity score is almost perfect. This indicates that too much pressure has been placed on the KL loss, which breaks the balance between the reconstruction loss and KL loss and results in the model posterior collapse.

FIG. 2.

Training dynamics of the vanilla VAE model and our method on validation data. We track (a) KL weight $β$, (b) KL loss $D_{KL} (q_{ϕ} (z | x) \| p (z))$, (c) reconstruction loss $- ℰ_{q_{ϕ} (z | x)} [\log p_{θ} (x | z)]$, (d) mutual information $I_{q} (x, z)$, (e) reconstruction accuracy, and (f) validity score during the full training process. The maximum KL weight $β$ of the vanilla VAE with KL loss annealing is 1. Our method reduces the maximum value of $β$ to 0.1. Both models are trained with KL weight annealing and teacher forcing. KL, Kullback-Leibler.

In contrast, our method reaches a lower reconstruction loss early and maintains it during model training. Although the KL loss of our method is larger than that of the vanilla VAE, the effective KL contribution to the training objective should still be in the normal range given our smaller KL weight $β$. Notably, our method maintains the mutual information at around 4.8, which means that output sequences are strongly related to latent vectors. As for the model performance, our method achieves 92.7% reconstruction accuracy and a 90.7% validity score, which demonstrates the advantage of our method.

3.3. Underestimated reconstruction loss

In Section 2.3, we argued analytically that introducing ground-truth information into the decoder results in an underestimated reconstruction loss. In this section, we experimentally verify that the reconstruction loss is indeed underestimated during training. We can estimate how much the reconstruction loss has been underestimated using Monte Carlo sampling. Specifically, we sample a batch of data and then run the model with and without teacher forcing, respectively. The underestimated ratio can be approximated by the ratio of the reconstruction loss with teacher forcing to that without teacher forcing.

We track the reconstruction loss on the validation data set when teacher forcing is applied and removed, respectively. Results are shown in Figure 3a. When teacher forcing is applied, the reconstruction loss quickly drops close to 1, whereas the loss is much larger (at least 7.5) without teacher forcing. This is not unexpected since prediction errors may accumulate during the decoding process without teacher forcing. Any wrongly predicted token that is fed as the RNN input at the next time step may make the subsequent predictions totally different from the ground-truth sequence.

FIG. 3.

(a) Reconstruction loss on validation data set. At each time step, model parameters are the same when calculating the reconstruction loss. (b) Underestimated ratio of reconstruction loss.

To quantitatively evaluate how much the reconstruction loss has been underestimated, we compute the ratio of the reconstruction loss with teacher forcing to that without teacher forcing at each time step. Results are shown in Figure 3b. They confirm our conclusion that the reconstruction loss is underestimated. To recover a rebalanced VAE loss, we could set the KL loss weight exactly to the underestimated ratio in each epoch. However, this requires computing the ratio repeatedly during training, which is time-consuming. For simplicity, we set $β = 0.1$ during training, and we find that it works very well in practice.
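As a rough sanity check using only the approximate values quoted above for Figure 3a (a teacher-forced loss close to 1 and a free-running loss of at least 7.5), the underestimated ratio is bounded by about

```latex
\text{ratio}
  = \frac{\text{reconstruction loss with teacher forcing}}
         {\text{reconstruction loss without teacher forcing}}
  \lesssim \frac{1}{7.5} \approx 0.13 ,
```

which is of the same order as the fixed value $β = 0.1$. These are coarse readings rather than exact measurements, so the comparison is indicative only.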

3.4. Model performance comparison

We summarize the molecule reconstruction accuracy, validity, novelty, and uniqueness scores on the ZINC 250K test data in Table 1. Our method outperforms all previous models in reconstruction accuracy by a large margin (16 percentage points higher than the second-best model). Meanwhile, our method achieves 90.7% validity, which is much better than previous SMILES-based methods. We can further boost our model performance by incorporating a SMILES validating parser, PartialSmiles, which can easily check the validity of the SMILES prefix when generating SMILES sequences token by token. The validity score can be boosted to 93.8% with PartialSmiles.
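Token-by-token generation with a prefix check could be organized roughly as below. This is a sketch under assumptions: falling back to the next most likely token when a candidate is rejected is one plausible policy rather than a documented one, `toy_check` is a hypothetical stand-in that only tracks parentheses, and a real checker such as PartialSmiles would take its place.

```python
def toy_check(smiles, partial):
    """Hypothetical checker: parentheses must never close before opening,
    and must all be closed when the string is complete (partial=False)."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return partial or depth == 0


def constrained_greedy_decode(step_scores, check, max_len, end_token="$"):
    """Greedy decoding that skips tokens whose prefix fails the check.

    step_scores(prefix) returns a dict mapping each token to a score.
    The result may be incomplete if max_len is reached first.
    """
    prefix = ""
    for _ in range(max_len):
        ranked = sorted(step_scores(prefix).items(), key=lambda kv: kv[1], reverse=True)
        for token, _score in ranked:
            if token == end_token:
                if check(prefix, partial=False):
                    return prefix
                continue  # cannot stop yet; try the next candidate
            if check(prefix + token, partial=True):
                prefix += token
                break
        else:
            return prefix  # no candidate was acceptable
    return prefix


def toy_scores(prefix):
    # Hypothetical decoder scores that initially prefer an invalid ")".
    if len(prefix) < 3:
        return {")": 3.0, "C": 2.0, "(": 1.0, "$": 0.5}
    return {"$": 3.0, "C": 1.0}


decoded = constrained_greedy_decode(toy_scores, toy_check, max_len=10)
```

In the toy run the top-scoring `)` is rejected at each of the first three steps, so the decoder falls back to `C` and stops once the end token is allowed.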

Table 1.
Reconstruction Accuracy and Validity Results on the ZINC 250K Data Set

Model	Reconstruction (%)	Validity (%)	Novelty (%)	Uniqueness (%)
SMILES-based
CVAE	44.6	0.7	98.0	2.1
GVAE	53.7	7.2	100.0	100.0
SD-VAE	76.2	43.5	100.0	100.0
Our method	92.7	90.7	100.0	100.0
Our method^a	92.7	93.8	100.0	100.0
Graph-based
GraphVAE	—	13.5	—	—
JT-VAE	76.7	100.0	99.9	99.1

Bold value in column 3 is the best result.

Baseline results are reported in Kusner et al. (2017), Dai et al. (2018), Simonovsky and Komodakis (2018), and Jin et al. (2018).

^a The SMILES validating parser PartialSmiles is applied during the generation.

The novelty and uniqueness scores of baseline methods are copied from Samanta et al. (2020).

CVAE, character variational autoencoder; GVAE, grammar variational autoencoder; JT-VAE, junction tree variational autoencoder; VAE, variational autoencoder; SD-VAE, syntax-directed variational autoencoder; SMILES, simplified molecular-input line-entry system.

Compared with other SMILES-based methods, our model is clearly superior in both reconstruction accuracy and prior validity, even though complex grammar or syntax rules are incorporated in Kusner et al. (2017) and Dai et al. (2018). Note that the junction tree variational autoencoder (JT-VAE) model assembles molecules by adding subgraphs step by step to make sure that the generated molecule graphs are always valid. However, these subgraphs are extracted from the training data set, which prevents the JT-VAE from generating molecules with unseen subgraphs.

While our method achieves competitive validity performance without any constraints and is able to generate novel molecules that are not from the same distribution as the training data. That is one important reason why our method achieves better reconstruction accuracy, whereas JT-VAE suffers from reconstructing testing molecules (Mohammadi et al., 2019). Besides, our method is much more efficient than JT-VAE. When generating 10,000 unique valid SMILES from prior random sampling, JT-VAE^‡ (faster version) takes about 1450 seconds while our method only needs 9 seconds.

As for the novelty and uniqueness, our method achieves 100.0% for both metrics, which are the same as other SMILES-based methods including GVAE and SD-VAE. Note that the novelty and uniqueness are evaluated only on the chemically valid molecules. This indicates that even if both GVAE and SD-VAE achieve the same novelty and uniqueness scores, our method can generate much more valid molecules than GVAE and SD-VAE due to the better validity score. JT-VAE achieves only 99.9% novelty score and 99.1% uniqueness score. This demonstrates that our model is a better chemically valid molecule generator.
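To make the relationship between the three sampling metrics concrete, the following is a minimal sketch of how they could be computed, assuming novelty is the fraction of valid samples absent from the training set and uniqueness is the fraction of distinct strings among valid samples. The function name `generation_metrics` and the pluggable `is_valid` callback are hypothetical; in practice the validity check would come from a cheminformatics toolkit and the SMILES would be canonicalized first.

```python
from typing import Callable, Iterable


def generation_metrics(
    samples: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict:
    """Validity over all samples; novelty and uniqueness over valid ones only."""
    samples = list(samples)
    known = set(training_set)
    valid = [s for s in samples if is_valid(s)]
    if not valid:
        # Covers both an empty sample list and a sample list with no valid entry.
        return {"validity": 0.0, "novelty": 0.0, "uniqueness": 0.0}
    return {
        # Fraction of all sampled strings that pass the validity check.
        "validity": len(valid) / len(samples),
        # Fraction of valid samples that do not appear in the training set.
        "novelty": sum(s not in known for s in valid) / len(valid),
        # Fraction of distinct strings among the valid samples.
        "uniqueness": len(set(valid)) / len(valid),
    }
```

Because the last two ratios are normalized by the number of valid samples, they say nothing about yield: out of 10,000 samples, a validity of 7.2% leaves about 720 valid molecules, whereas 90.7% leaves about 9,070.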

We also experiment on the large-scale GuacaMol data set to evaluate the scalability and generalization of our proposed method and report the results in Table 2. We use the same experimental settings as for the ZINC 250K data set. Our method achieves 92.6% reconstruction accuracy, 90.6% validity, 100.0% novelty, and 100.0% uniqueness on the GuacaMol data set, which is similar to its performance on ZINC 250K. By checking the validity of SMILES during generation, the validity score can be further boosted to 93.6%. This experiment demonstrates that our method scales and generalizes well to a large data set.

Table 2.

Reconstruction Accuracy and Validity Results on the GuacaMol Data Set

Model	Reconstruction (%)	Validity (%)	Novelty (%)	Uniqueness (%)
Our method	92.6	90.6	100.0	100.0
Our method^a	92.6	93.6	100.0	100.0

^a The SMILES validating parser PartialSmiles is applied during the generation.

3.5. Error analysis and visualization

Our model achieves 92.7% reconstruction accuracy, and all correctly reconstructed SMILES are valid on the ZINC 250K data set. Investigating the reconstruction results further, we find that our model predicts 97.3% of all tokens correctly when accuracy is measured at the level of the token instead of the sequence. In addition, most of the unmatched sequences (62%) are still valid, which further confirms the reconstruction ability of our model. We show some valid but unmatched examples in Figure 4. Even in these unmatched examples, only a small fraction of the predicted tokens differ from the ground truth.

FIG. 4.

Reconstruction error examples. Unmatched tokens between the input and reconstruction SMILES are highlighted. Note that “[O-]” is a single token.
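The difference between token-level and sequence-level accuracy can be illustrated with a short sketch. The tokenizer below is an assumption (a simple regular expression that keeps bracket atoms such as "[O-]" and two-letter halogens as single tokens), as is the position-wise comparison; the exact tokenization and matching used in our experiments may differ.

```python
import re

# Bracket atoms such as "[O-]", two-letter halogens, and two-digit ring labels
# are kept as single tokens; every other character is its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list:
    return TOKEN_PATTERN.findall(smiles)


def token_accuracy(targets, predictions) -> float:
    """Fraction of target token positions that are predicted correctly."""
    correct = total = 0
    for target, prediction in zip(targets, predictions):
        t, p = tokenize(target), tokenize(prediction)
        total += len(t)
        # Positions beyond the shorter sequence count as errors.
        correct += sum(a == b for a, b in zip(t, p))
    return correct / total if total else 0.0


def sequence_accuracy(targets, predictions) -> float:
    """Fraction of sequences reproduced exactly."""
    pairs = list(zip(targets, predictions))
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
```

A single wrong token makes the whole sequence count as unmatched, which is why the token-level figure is noticeably higher than the sequence-level one.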

As for the validity score, we also investigate the model outputs. We find that our model can generate complicated and diverse molecules with multiple rings. Among the invalid sequences, from both reconstruction and prior sampling, there are several typical errors: (1) unkekulized atoms, (2) valence errors, (3) unclosed rings, and (4) parentheses errors. We believe that the grammar-based methods of Kusner et al. (2017) and Dai et al. (2018) are complementary to our method and can be combined with it to reduce these errors.
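Two of these error types, unclosed rings and parentheses errors, are purely syntactic and can be detected without a chemistry toolkit. The sketch below is a simplified illustration under that assumption; the function name `structural_errors` is hypothetical, it is not the PartialSmiles interface, and it does not detect kekulization or valence errors, which require chemical knowledge.

```python
import re

# Bracket atoms are matched as one token so that digits inside them
# (isotopes, hydrogen counts, charges) are not mistaken for ring labels.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def structural_errors(smiles: str) -> list:
    """Report unbalanced parentheses and ring labels that are never closed."""
    depth = 0
    unbalanced = False
    open_rings = set()
    for token in TOKEN_PATTERN.findall(smiles):
        if token == "(":
            depth += 1
        elif token == ")":
            if depth == 0:
                unbalanced = True  # closing a branch that was never opened
            else:
                depth -= 1
        elif token.lstrip("%").isdecimal():
            # A ring label opens on first use and closes on the next use.
            open_rings ^= {int(token.lstrip("%"))}
    errors = []
    if unbalanced or depth > 0:
        errors.append("parentheses error")
    if open_rings:
        errors.append("unclosed ring")
    return errors
```

For instance, `structural_errors("c1ccccc")` reports an unclosed ring, while the benzene string `"c1ccccc1"` passes both checks.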

3.6. Bayesian optimization

One of the important tasks in drug molecule generation is to produce molecules with desired chemical properties. We follow Kusner et al. (2017) and Jin et al. (2018) for all experimental settings, and the optimization target score is: $y(m) = \mathrm{logP}(m) - \mathrm{SA}(m) - \mathrm{cycle}(m),$ (10)

where $\mathrm{logP}(m)$ is the octanol–water partition coefficient of molecule $m$, $\mathrm{SA}(m)$ is the synthetic accessibility score, and $\mathrm{cycle}(m)$ is the number of large rings with more than six atoms.
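As a small worked illustration of Equation (10), the sketch below combines the three terms, assuming the partition coefficient, the synthetic accessibility score, and the ring sizes have already been computed by a cheminformatics toolkit. The function name `target_score` and its arguments are hypothetical, and any normalization applied in the reference setups we follow is not reproduced here.

```python
def target_score(logp: float, sa: float, ring_sizes) -> float:
    """Equation (10): y(m) = logP(m) - SA(m) - cycle(m)."""
    # cycle(m): number of rings with more than six atoms, as defined in the text.
    cycle = sum(1 for size in ring_sizes if size > 6)
    return logp - sa - cycle
```

For example, a molecule with logP 3.0, SA 2.0, one six-membered ring, and one seven-membered ring scores 3.0 - 2.0 - 1 = 0.0.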

We first associate each molecule with a latent vector, which is the mean of the learned variational encoding distribution. The latent vector of each molecule is treated as its feature, and we train a sparse Gaussian process (SGP) to predict the target score $y(m)$ from the latent vector. After training the SGP, five iterations of batched Bayesian optimization (BO) are performed using the expected improvement heuristic.

We report the SGP prediction performance when trained on latent representations learned by different models. We train the SGP with 10-fold cross-validation and report the top-3 molecules found by BO. As shown in Table 3, the molecules found by our model are much better than those found by previous SMILES-based methods, and our method is even superior to the graph-based method JT-VAE. Figure 5 shows the top-3 molecules found by our model.
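The expected improvement heuristic has a closed form when the surrogate prediction at a latent point is Gaussian. The sketch below shows that formula for a maximization problem, together with a greedy top-k batch selection; the greedy selection and the names `expected_improvement` and `select_batch` are simplifying assumptions for illustration, not a description of the exact batching procedure in the setups we follow.

```python
import math


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement of a Gaussian prediction over `best` (maximization)."""
    if std <= 0.0:
        # Without predictive uncertainty, the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def select_batch(candidates, best: float, batch_size: int) -> list:
    """Return identifiers of the candidates with the largest expected improvement.

    `candidates` is a list of (identifier, predicted mean, predicted std) tuples.
    """
    ranked = sorted(
        candidates,
        key=lambda c: expected_improvement(c[1], c[2], best),
        reverse=True,
    )
    return [c[0] for c in ranked[:batch_size]]
```

The first term rewards candidates whose predicted score already exceeds the current best, and the second rewards candidates with high predictive uncertainty, which is what lets BO explore regions of the latent space that the SGP has not yet seen.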

FIG. 5.

Top-3 molecules and associated scores found by our model with Bayesian optimization.

Table 3.

Top-3 Molecule Scores Found by Bayesian Optimization

Model	First	Second	Third
SMILES-based
CVAE	1.98	1.42	1.19
GVAE	2.94	2.89	2.80
SD-VAE	4.04	3.50	2.96
Our method	5.32	5.28	5.23
Graph-based
JT-VAE	5.30	4.93	4.49

Baseline results are copied from Kusner et al. (2017), Dai et al. (2018), and Jin et al. (2018).

4. DISCUSSION

Our method is very efficient, and it works extremely well in molecule generation, in which SMILES sequences are highly structured and grammatically organized. Our experimental results indicate that grammar and syntax rules are necessary to generate more valid SMILES sequences, and they are complementary to our method. In addition, SMILES-based methods and graph-based methods may be combined to boost model performance further.

Although our primary focus is the VAE for molecule generation, our method can also benefit NLP tasks, as mentioned at the end of Section 2.4. Reducing the KL loss weight can help VAE models for NLP tasks avoid posterior collapse, as shown in He et al. (2019) and Fu et al. (2019).

The latent representation learned by our model can be applied to various downstream tasks, such as molecule property prediction (Xu et al., 2017; Ma et al., 2020, 2021, 2022). In the future, we may explore this application further.
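The idea of reducing the KL loss weight can be sketched for a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form. This is a minimal illustration only: the function names are hypothetical, the reconstruction term is passed in as a precomputed negative log-likelihood, and the value of `kl_weight` is a free parameter here rather than the setting used in our experiments.

```python
import math


def gaussian_kl(mu, log_var) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def rebalanced_loss(reconstruction_nll: float, mu, log_var, kl_weight: float) -> float:
    """Reconstruction loss plus a down-weighted KL term (kl_weight < 1)."""
    return reconstruction_nll + kl_weight * gaussian_kl(mu, log_var)
```

With a weight below 1, the penalty for moving the posterior away from the prior shrinks relative to the reconstruction term, so the encoder is less pressured to ignore the input.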

5. CONCLUSIONS

In this work, we investigate the posterior collapse problem in VAE for molecule sequence generation. Through extensive analysis, we conclude that the underestimated reconstruction loss results in the posterior collapse. The conclusion is supported by both theoretical analysis and experimental results. Based on our analysis, we propose a simple and effective solution to overcome the underestimated reconstruction loss problem by weighting the KL loss term. With the proposed rebalanced VAE loss, the VAE model can avoid the posterior collapse problem and achieve excellent performance in both reconstruction accuracy and validity score on two data sets. We also demonstrate the excellent generalization of our method on a large-scale data set.

AUTHORS' CONTRIBUTIONS

C.Y.: Conceptualization, methodology. J.Y.: Formal analysis. H.M.: Writing—review and editing. S.W.: Resources. J.H.: Supervision.

ACKNOWLEDGMENT

This study is a major extension of our previous conference version (Yan et al., ), which was published as part of the ACM-BCB conference proceedings.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was partially supported by the U.S. National Science Foundation IIS-1553687 and Cancer Prevention and Research Institute of Texas (CPRIT) award (RP190107).

References

Bowman S.R., Vilnis, Vinyals, et al. 2016. Generating sentences from a continuous space, 10–21. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics. Berlin, Germany.

Brown, Fiscato, Segler M.H., et al. 2019. GuacaMol: Benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108.

Cho, Van Merriënboer, Bahdanau, et al. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics, and Structure in Statistical Translation, 103–111. Doha, Qatar.

Chung, Gulcehre, Cho, et al. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Dai, Tian, Dai, et al. 2018. Syntax-directed variational autoencoder for structured data. In ICLR, 2018. Vancouver, BC, Canada.

Diederik P.K., and Welling 2014. Auto-encoding variational Bayes. In Proceedings of the ICLR, Banff, Canada.

Dieng A.B., Kim, Rush A.M., and Blei D.M. 2019. Avoiding latent variable collapse with generative skip models, 2397–2405. In The 22nd International Conference on Artificial Intelligence and Statistics. Naha, Okinawa, Japan.

Fu, Li, Liu, et al. 2019. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. In NAACL. Association for Computational Linguistics. Minneapolis, Minnesota, USA.

Gómez-Bombarelli, Wei J.N., Duvenaud, et al. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci. 4, 268–276.

He, Spokoyny, Neubig, et al. 2019. Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of the ICLR. Vancouver, BC, Canada.

Higgins, Matthey, Pal, et al. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the ICLR. Toulon, France.

Hochreiter, and Schmidhuber 1997. Long short-term memory. Neural Comput. 9, 1735–1780.

Hoffman M.D., and Johnson M.J. 2016. ELBO surgery: Yet another way to carve up the variational evidence lower bound. In NIPS Workshop on Advances in Approximate Bayesian Inference. Barcelona, Spain.

Jin, Barzilay, and Jaakkola 2018. Junction tree variational autoencoder for molecular graph generation, 2328–2337. In ICML. Stockholmsmässan, Stockholm, Sweden.

Kim, Wiseman, Miller, et al. 2018. Semi-amortized variational autoencoders, 2683–2692. In ICML.

Kingma D.P., and Ba 2014. Adam: A method for stochastic optimization. ICLR, 2015. San Diego, California, USA.

Kingma D.P., Salimans, Jozefowicz, et al. 2016. Improved variational inference with inverse autoregressive flow, 4743–4751. In Advances in Neural Information Processing Systems. Barcelona, Spain.

Kusner M.J., Paige, and Hernández-Lobato J.M. 2017. Grammar variational autoencoder, 1945–1954. In Proceedings of the 34th ICML, Volume 70. JMLR.org. Sydney, Australia.

Landrum 2006. RDKit: Open-Source Cheminformatics (Online).

Li, Vinyals, Dyer, et al. 2018. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324.

Ma, Bian, Rong, et al. 2022. Cross-dependent graph neural networks for molecular property prediction. Bioinformatics 38, 2003–2009.

Ma, Rong, Liu, et al. 2021. Gradient-norm based attentive loss for molecular property prediction, 497–502. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. New York, New York, USA.

Ma, Yan, Guo, et al. 2020. Improving molecular property prediction on limited data with deep multi-label learning, 2779–2784. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. New York, New York, USA.

Mendez, Gaulton, Bento A.P., et al. 2019. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res. 47(D1), D930–D940.

Miao, Zhen, Liu, et al. 2018. Direct shape regression networks for end-to-end face alignment, 5040–5049. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA.

Mohammadi, O'Dowd, Paulitz-Erdmann, et al. 2019. Penalized variational autoencoder for molecular design. ChemRxiv.

Paszke, Gross, Chintala, et al. 2017. Automatic Differentiation in PyTorch.

Polishchuk P.G., Madzhidov T.I., and Varnek 2013. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput. Aided Mol. Des. 27, 675–679.

Rezende D.J., Mohamed, and Wierstra 2014. Stochastic backpropagation and approximate inference in deep generative models, 1278–1286. In ICML. Beijing, China.

Samanta, De, Jana, et al. 2020. NeVAE: A deep generative model for molecular graphs. J. Mach. Learn. Res. 21, 1–33.

Schuster, and Paliwal K.K. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681.

Schwaller, Gaudin, Lanyi, et al. 2018. "Found in translation": Predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. 9, 6091–6098.

Simonovsky, and Komodakis 2018. GraphVAE: Towards generation of small graphs using variational autoencoders, 412–422. In International Conference on Artificial Neural Networks. Springer; New York, New York, USA.

Sterling, and Irwin J.J. 2015. ZINC 15–ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337.

Weininger 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36.

Williams R.J., and Zipser 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1, 270–280.

Xu, Wang, Zhu, et al. 2017. Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery, 285–294. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Association for Computing Machinery, New York, New York, USA.

Yan, Ding, Zhao, et al. 2020a. RetroXpert: Decompose retrosynthesis prediction like a chemist. Adv. Neural Inf. Process. Syst. 33, 11248–11258.

Yan, Wang, Yang, et al. 2020b. Re-balancing variational autoencoder loss for molecule sequence generation, 1–7. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Association for Computing Machinery. New York, New York, USA.

Yan, Zhao, Lu, et al. 2021. RetroComposer: Discovering novel reactions by composing templates for retrosynthesis prediction. arXiv preprint arXiv:2112.11225.

Yang, An, Wang, et al. 2020. Label-driven reconstruction for domain adaptation in semantic segmentation, 480–498. In European Conference on Computer Vision. Springer; New York, New York, USA.

Yang, Hu, Salakhutdinov, et al. 2017. Improved variational autoencoders for text modeling using dilated convolutions, 3881–3890. In Proceedings of the 34th ICML, Volume 70. JMLR.org. Sydney, Australia.