Predicting experimental steps for arbitrary chemical equations

Automatically predict the synthesis operations to execute arbitrary organic reactions in the lab!



Look at the following chemical equation:

Chemical equation

If someone provided you with all the necessary reactants and reagents, would you be able to execute this reaction in the lab?

If you are an experienced chemist, you have a good chance of guessing an adequate synthetic procedure. In more challenging cases, you would know where to look for similar reactions to dispel any doubt about the ideal experimental protocol. But what if you're not an experienced chemist? Or what if you would like to be more efficient and avoid hunting through published procedures until you find something satisfactory?

At the end of 2019, my team at IBM Research Europe embarked on the automatic determination of the synthesis steps for arbitrary organic reactions with the help of machine learning. Several scientists had attempted this before, but limited the scope to predicting individual variables such as solvents, catalysts, reaction temperatures, or durations. What would it take to predict the full sequence of experimental steps, including details such as the order of addition of the chemicals, the handling of highly reactive compounds (dropwise addition, quenching), or work-up information (liquid-liquid extractions, filtrations, etc.) when necessary? Our intuition was that if a computer sees enough reactions and associated procedures, it should be able to capture the underlying patterns between molecular structure and experimental procedure, and thus recommend adequate synthesis steps for reactions it hasn't seen yet.

To train such a machine learning model, we needed data, and ideally a lot of it! And here came the first challenge: although public datasets of chemical reactions are available (see for instance the pioneering work of Daniel Lowe), none contains experimental procedures in a machine-friendly format. The required knowledge is accessible in principle, but it is buried as unstructured text in the chemical literature. Luckily, patents and scientific publications contain a great amount of such unstructured synthesis data. We turned our attention to patents (in particular, the ones available in the Pistachio database), which contain experimental procedures in text form for several million reactions. For the chemical equation shown above, for instance, we found the following experimental procedure:

A mixture of 1-(4-isopropyl-phenyl)-5-oxo-pyrrolidine-3-carboxylic acid ethyl ester obtained in step 2 (0.7 g, 2.65 mmol) and ethanol were cooled to 10-15° C. Sodium borohydride (0.25 g, 6.6 mmol) was added portion wise over a period of 20 min and the reaction mixture was stirred for 3.5 hrs at 20-25° C. The organic volatiles were evaporated and the residue was taken into brine solution (15 ml). The aqueous layer was extracted with ethyl acetate, dried over Na2SO4 and evaporated to obtain 4-hydroxymethyl-1-(4-isopropyl-phenyl)-pyrrolidin-2-one as an off white solid (0.5 g, 81%).

Learning from this data requires the extraction of the relevant synthesis information. Last year, we presented a machine learning model doing exactly this: converting prose text into a concise sequence of operations (more information in the associated blog post). Applying that model to the procedure text reported above yields the following synthesis actions:

  1. ADD 1-(4-isopropyl-phenyl)-5-oxo-pyrrolidine-3-carboxylic acid ethyl ester (0.7 g, 2.65 mmol)
  2. ADD ethanol
  3. SETTEMPERATURE 10-15° C
  4. ADD Sodium borohydride (0.25 g, 6.6 mmol) over 20 min
  5. STIR for 3.5 hr at 20-25° C
  6. CONCENTRATE
  7. ADD brine (15 ml)
  8. COLLECTLAYER aqueous
  9. EXTRACT with ethyl acetate
  10. DRYSOLUTION with Na2SO4
  11. CONCENTRATE
  12. YIELD 4-hydroxymethyl-1-(4-isopropyl-phenyl)-pyrrolidin-2-one (0.5 g, 81%)
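Downstream, each of these operations is most useful as structured data rather than free text. As a hypothetical sketch of such a representation (not the paper's actual schema), a few of the actions above might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One synthesis operation, e.g. ADD, STIR, CONCENTRATE."""
    name: str
    properties: dict = field(default_factory=dict)

    def __str__(self) -> str:
        details = ", ".join(f"{k}={v}" for k, v in self.properties.items())
        return f"{self.name}({details})" if details else self.name

# A few of the extracted actions above, as structured data
actions = [
    Action("ADD", {"compound": "sodium borohydride", "quantity": "0.25 g", "duration": "20 min"}),
    Action("STIR", {"duration": "3.5 h", "temperature": "20-25 °C"}),
    Action("CONCENTRATE"),
]
```

A structure like this makes it straightforward to render actions back to text, count operation types, or map them to hardware instructions later on.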

At this point, we still need to format the input (chemical equation) and output (action sequence) for the machine-learning model. For the input, we rely on a text-based representation of reactions, the reaction SMILES notation. For the reaction above, this is:

CCO.CCOC(=O)C1CC(=O)N(c2ccc(C(C)C)cc2)C1.[BH4-]~[Na+]>>CC(C)c1ccc(N2CC(CO)CC2=O)cc1
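In this notation, `>>` separates the precursors from the products, `.` separates distinct molecules, and `~` keeps fragments of one species together (here sodium borohydride, `[BH4-]~[Na+]`). A minimal sketch of splitting such a string into its molecules, using plain string handling (reaction SMILES may also carry agents between two single `>` separators; this sketch assumes the `precursors>>products` form used here):

```python
def split_reaction_smiles(rxn_smiles: str):
    """Split a 'precursors>>products' reaction SMILES into molecule lists.

    '.' separates distinct molecules; '~'-linked fragments
    (e.g. [BH4-]~[Na+]) stay together as one species.
    """
    precursors, products = rxn_smiles.split(">>")
    return precursors.split("."), products.split(".")

rxn = ("CCO.CCOC(=O)C1CC(=O)N(c2ccc(C(C)C)cc2)C1.[BH4-]~[Na+]"
       ">>CC(C)c1ccc(N2CC(CO)CC2=O)cc1")
precursors, products = split_reaction_smiles(rxn)
# precursors → ['CCO', 'CCOC(=O)C1CC(=O)N(c2ccc(C(C)C)cc2)C1', '[BH4-]~[Na+]']
# products   → ['CC(C)c1ccc(N2CC(CO)CC2=O)cc1']
```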

For the output, we end up with a format similar to the one shown above; the main change is that compound names, temperatures, and durations are replaced with placeholders:

  1. ADD $2$
  2. ADD $1$
  3. SETTEMPERATURE #4#
  4. ADD $3$ over @1@
  5. STIR for @3@ at #4#
  6. CONCENTRATE
  7. ADD brine
  8. COLLECTLAYER aqueous
  9. EXTRACT with ethyl acetate
  10. DRYSOLUTION with Na2SO4
  11. CONCENTRATE
  12. YIELD $-1$

In this manner, we generated a dataset of roughly 700,000 reactions and associated action sequences. With it, we trained three models: a nearest-neighbor model based on a recently published reaction fingerprint, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures.
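For intuition, the nearest-neighbor baseline reduces to a similarity lookup: assuming each reaction has already been mapped to a fingerprint vector (the fingerprint itself is a learned model and is not reproduced here), the prediction is simply the action sequence of the most similar training reaction. The vectors and sequences below are toy values for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_neighbor_actions(query_fp, training_set):
    """Return the action sequence of the most similar training reaction."""
    _, best_actions = max(training_set,
                          key=lambda item: cosine_similarity(query_fp, item[0]))
    return best_actions

# Toy (fingerprint, action sequence) pairs standing in for the training data
training_set = [
    ([1.0, 0.0, 0.2], ["ADD $1$", "STIR for @2@", "YIELD $-1$"]),
    ([0.1, 0.9, 0.3], ["ADD $2$", "CONCENTRATE", "YIELD $-1$"]),
]
predicted = nearest_neighbor_actions([0.9, 0.1, 0.2], training_set)
# → ['ADD $1$', 'STIR for @2@', 'YIELD $-1$']
```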

The figure below shows how the trained models suggest synthesis actions for an arbitrary chemical equation: the chemical equation is converted into a tokenized reaction SMILES, for which the model predicts an action sequence. If necessary, we can then replace the placeholders with actual values.
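The tokenization step can be sketched with the atom-level regular expression widely used for SMILES in the reaction-prediction literature (a sketch; the exact tokenizer used in our pipeline may differ in details):

```python
import re

# Atom-level SMILES tokenization pattern: bracket atoms stay whole,
# two-letter elements (Br, Cl) are not split, everything else is one character.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a (reaction) SMILES string into atom-level tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

tokens = tokenize_smiles("CCO.[BH4-]~[Na+]>>CO")
# → ['C', 'C', 'O', '.', '[BH4-]', '~', '[Na+]', '>', '>', 'C', 'O']
```

The token list, joined with spaces, is what the sequence-to-sequence models consume as input.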

Using Smiles2Actions

Let's look at the model predictions for one of the reactions in the test set:

Reductive amination of compound $2$ with the amine $4$.

| Ground truth | Transformer model | BART model | Nearest-neighbor model |
| --- | --- | --- | --- |
| ADD $2$ | ADD $2$ | ADD $2$ | ADD $4$ |
| ADD $4$ | ADD $4$ | ADD $4$ | ADD $3$ |
| ADD $3$ | ADD $3$ | ADD $3$ | ADD $5$ at #4# |
| ADD $1$ | ADD $1$ | ADD $1$ | ADD $1$ at #4# |
| ADD $5$ | STIR for @2@ at #4# | STIR for @1@ at #4# | STIR for @1@ at #4# |
| STIR for @4@ at #4# | ADD $5$ | ADD $5$ | ADD $2$ |
| CONCENTRATE | STIR for @4@ at #4# | STIR for @4@ at #4# | STIR for @4@ |
| PURIFY | CONCENTRATE | CONCENTRATE | QUENCH with water |
| YIELD $-1$ | PURIFY | PURIFY | CONCENTRATE |
| | YIELD $-1$ | YIELD $-1$ | EXTRACT with ethyl acetate/THF |
| | | | WASH with brine |
| | | | DRYSOLUTION over Na2SO4 |
| | | | FILTER keep filtrate |
| | | | CONCENTRATE |
| | | | ADD THF |
| | | | PURIFY |
| | | | YIELD $-1$ |

Predicted actions. The placeholders refer to the compounds involved in the reaction (see chemical equation), to a temperature of 25 °C (#4#), and to durations of 10 min (@1@), 1 h (@2@), and 1 day (@4@).

The deep-learning models predict a sequence identical to the ground truth except for an additional STIR action before the addition of sodium cyanoborohydride ($5$). They even predict the same order of addition of the compounds! The nearest-neighbor model predicts a longer action sequence, including quenching and a more involved work-up. It is noteworthy that all models predict an identical duration for the main stirring step. Out of the three models, we selected the Transformer model for further use and called it Smiles2Actions.

We have been using the Smiles2Actions model quite extensively over the past year at IBM Research Europe: it is, in essence, the brain of IBM RoboRXN, a cloud-based automated system combining chemical synthesis and AI. This technology provides a chemical laboratory embedded in the cloud: users provide a target molecule through a web browser, and a previously published model guides them in compiling a retrosynthetic route. The Smiles2Actions model then suggests synthesis actions for each reaction step, and these actions are converted to robot instructions and sent to a robot for synthesis. More on this soon! In the meantime, go ahead and try out the AI models and the robot simulator at https://rxn.res.ibm.com!

Prediction of action sequences on IBM RXN

Alain Vaucher

Research Scientist, IBM Research Europe