Biocatalysed Synthesis Planning using Data-driven Learning

Enabling greener chemistry through biocatalysis-aware retrosynthetic pathway prediction.

Since life first began on Earth some 3.5 billion years ago, nature has been hard at work optimising enzymes, highly efficient biological catalysts that accelerate chemical reactions in living organisms. Enzymes enable organisms to digest food (even when deep-fried), break down toxins, or replicate DNA. Importantly, they commonly do this under mild conditions: at body temperature and in water. Furthermore, just like other proteins in us and our food, enzymes consist of amino acids, rendering them essentially compostable.

In contrast, traditional chemical processes in industry often rely on toxic solvents and on high temperatures and pressures, and they generate large amounts of waste. For decades, scientists have been searching for and designing enzymes capable of replacing these processes to reduce waste and make the industry safer. However, finding a suitable enzyme to replace a given chemical reaction is a daunting task that requires deep knowledge of chemistry and biology, and often years of work.

With our latest work, which is part of the IBM RXN for Chemistry platform, we provide a machine learning model that supports scientists in this task by predicting which enzymes could be suitable replacements for a given reaction. This lowers the barrier to adopting more sustainable and safer processes by harnessing molecular machines optimised over 3.5 billion years of evolution.

Composition of the ECREACT Data Set
The composition of the ECREACT data set and the distributions of the substrates and products. (a) shows the composition of the ECREACT data set by enzyme commission number. (b) and (c) show TMAPs of the distribution of substrates and products in the data set, coloured by enzyme class.

The basis of every machine learning model is a suitable data set containing samples of the task to be learned. Unfortunately, no data set containing biocatalysed reactions has been readily available. Following a DIY approach, we sourced data from the public databases Rhea, BRENDA, PathBank, and MetaNetX to create the data set ECREACT, which contains entries such as:

CCCCSCC(N)C(=O)O.O|4.4.1.4>>CCCCS,4.4.1.4,brenda_reaction_smiles
O=CC(=O)[O-].[H+]|4.1.1.47>>O=CC(O)C(=O)[O-],4.1.1.47,rhea_reaction_smiles

Each sample consists of an extended reaction SMILES string, the enzyme commission (EC) number of the catalysing enzyme, and the data source. The most interesting part is the extended reaction SMILES string, which encodes the biocatalysed reaction: the substrates, then the EC number after a | separator, then the product after >>. If you are a chemist or remember long-ago chemistry lessons, you may notice that the SMILES-encoded reactions are unbalanced. And you are right. As we are only interested in the main product of a reaction, we removed common byproducts, coenzymes, and cofactors from the product side of the reaction, to make the task a little bit easier for our model down the line.
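To make the format concrete, here is a minimal Python sketch that splits one such entry into its parts. It assumes only the layout visible in the two samples above (reaction, EC number, and source separated by commas); the names EcreactEntry and parse_entry are made up for this illustration and are not part of any published tool.

```python
from dataclasses import dataclass


@dataclass
class EcreactEntry:
    """One ECREACT-style sample (hypothetical container for illustration)."""
    substrates: list[str]
    ec_number: str
    products: list[str]
    source: str


def parse_entry(line: str) -> EcreactEntry:
    # Assumed layout: <extended reaction SMILES>,<EC number>,<source>
    reaction, ec_column, source = line.strip().rsplit(",", 2)

    # Extended reaction SMILES: substrates|EC>>products
    left_side, product_side = reaction.split(">>")
    substrate_side, ec_in_reaction = left_side.split("|")

    # The EC number appears twice in the samples; check that both agree.
    if ec_in_reaction != ec_column:
        raise ValueError(f"EC mismatch: {ec_in_reaction} vs {ec_column}")

    return EcreactEntry(
        substrates=substrate_side.split("."),  # "." separates molecules in SMILES
        ec_number=ec_in_reaction,
        products=product_side.split("."),
        source=source,
    )


entry = parse_entry(
    "O=CC(=O)[O-].[H+]|4.1.1.47>>O=CC(O)C(=O)[O-],4.1.1.47,rhea_reaction_smiles"
)
print(entry.substrates, entry.ec_number, entry.products, entry.source)
```

Splitting on "." only separates the molecules on each side; it does not check that they are chemically valid, which would need a cheminformatics toolkit.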

The extended reaction SMILES string allows us to train a variant of the Molecular Transformer, first introduced by our team in 2019. Using the Molecular Transformer, we can frame synthesis planning as a natural language processing problem, in which we translate a query product into substrates and the enzyme commission (EC) number of the catalysing enzyme. Pretty cool, eh?
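As a rough illustration of this translation framing, the sketch below turns one extended reaction SMILES string into a source "sentence" (the product) and a target "sentence" (substrates plus EC number). The regular expression is the atom-level SMILES pattern commonly used with Molecular Transformer models. The EC tokens of the form [ec1:4] are an assumption made up for this example and may differ from the tokens the actual model uses.

```python
import re

# Atom-level SMILES tokenisation pattern commonly used for reaction models.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the pattern silently skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenise SMILES: {smiles!r}")
    return tokens


def tokenize_ec(ec_number: str) -> list[str]:
    # Hypothetical scheme: one token per EC level, e.g. "4.1.1.47" -> [ec1:4] ...
    return [
        f"[ec{level}:{value}]"
        for level, value in enumerate(ec_number.split("."), start=1)
    ]


def to_translation_pair(extended_reaction: str) -> tuple[str, str]:
    """Return (source, target) for the product -> substrates + EC direction."""
    left_side, product = extended_reaction.split(">>")
    substrates, ec_number = left_side.split("|")
    source = tokenize_smiles(product)
    target = tokenize_smiles(substrates) + ["|"] + tokenize_ec(ec_number)
    return " ".join(source), " ".join(target)


source, target = to_translation_pair("CCCCSCC(N)C(=O)O.O|4.4.1.4>>CCCCS")
print("source:", source)
print("target:", target)
```

Giving each EC level its own token is one way to let a model share what it learns across enzymes of the same class; whether that helps in practice is something to test, not a given.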

When starting to train the model, we encountered a common problem in machine learning: our newly created data set, containing 62,222 biocatalysed reactions, was too small. The model should not merely associate certain products with a list of substrates and enzyme commission numbers. Given a query it has never seen, it should be able to predict substrates that it did not encounter during training. For this, it needs not only to understand enzymatic reactions but also to learn how to write valid molecules as SMILES. We turned to a well-known data set containing one million traditional chemical reactions to teach the model how to do this, using a technique called multitask learning. An analogy to this approach would be the following: you want to learn to play the lute, but you cannot find many songs to practice on, so you learn to play the guitar simultaneously, as there are plenty of songs. While the two instruments are not identical, they are close enough for you to learn things like fretting, chords, and plucking by playing the many available guitar songs (although beginners traditionally spend the first four weeks on Wonderwall and Stairway to Heaven) and then apply the gained knowledge when playing the lute.
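One simple way to picture multitask training is to draw every training batch from both data sets at once, so the model practises guitar and lute in the same session. The sketch below does this in plain Python. The function name mixed_batches, the enzymatic_share parameter, and the 50/50 mix in the usage example are assumptions for illustration; the actual weighting used to train our model is not described here.

```python
import random


def mixed_batches(enzymatic, chemical, batch_size, enzymatic_share,
                  num_batches, seed=0):
    """Yield batches that mix a small and a large reaction data set.

    enzymatic_share is the fraction of each batch drawn from the small
    enzymatic set (a hypothetical knob, sampled with replacement).
    """
    rng = random.Random(seed)
    n_enzymatic = round(batch_size * enzymatic_share)
    n_chemical = batch_size - n_enzymatic

    for _ in range(num_batches):
        # Tag every example with its task so the two can be told apart later.
        batch = [("enzymatic", rng.choice(enzymatic)) for _ in range(n_enzymatic)]
        batch += [("chemical", rng.choice(chemical)) for _ in range(n_chemical)]
        rng.shuffle(batch)
        yield batch


# Toy stand-ins for the two data sets (not real training data sizes).
enzymatic_reactions = ["CCCCSCC(N)C(=O)O.O|4.4.1.4>>CCCCS"]
chemical_reactions = ["CCO.CC(=O)O>>CC(=O)OCC", "CBr.[OH-]>>CO"]

for batch in mixed_batches(enzymatic_reactions, chemical_reactions,
                           batch_size=4, enzymatic_share=0.5, num_batches=2):
    print(batch)
```

Because the enzymatic set is so much smaller, it is sampled with replacement and its examples repeat far more often than the chemical ones. The share per batch controls that trade-off.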

Seq2Seq Molecular Transformer for Biocatalysed Synthesis Planning
An overview of the data pipeline. Biocatalysed reactions are sourced from Rhea, BRENDA, PathBank, and MetaNetX, converted into extended reaction SMILES strings, tokenized, and then used to train the Molecular Transformer. The USPTO data set is used to add more information on SMILES grammars and general chemical reactions during multitask learning.

After about 48 hours of training the Molecular Transformer model on traditional and biocatalysed reactions using multitask learning, our model was ready to be evaluated. While we provide the results of this evaluation in our publication, which is linked below, we think that you would probably prefer to evaluate the model yourself rather than look at bar plots and confusion matrices. Thanks to the IBM RXN for Chemistry platform, this is possible right now, in your browser, for free. Let's have a quick look at how to get started.

How to use Biocatalysis Models in IBM RXN for Chemistry
A very quick tutorial on how to use the biocatalysis model on IBM RXN for Chemistry.