Machine Learning with Physicochemical Relationships: Solubility Prediction in Organic Solvents and Water

Examining the physical chemistry behind dissolution led to solubility prediction models built on a small number of highly relevant descriptors.

The use of solvents is ubiquitous across the chemical and physical sciences. Small organic molecules are prevalent as drug candidates and intermediates, and solubility prediction tools are vital to their development. From understanding the ADMET (absorption, distribution, metabolism, excretion and toxicity) profile of a drug candidate to protein engineering, chemical process design, synthetic route prediction, extraction and crystallisation, solubility is a critical physical property. Aqueous solubility prediction has been the subject of intensive research because of its biological relevance and its importance in environmental and agrochemical predictions.

However, solubility prediction is still a major scientific challenge. This is due to the complex nature of dissolution, which involves lattice/sublimation energy, solvation energy, ionisation of solute, and solution phase interactions. Since it is difficult to predict solubility from first principles using physical chemistry, it has recently fallen to machine learning to link various inputs, called descriptors, to the solubility of the molecule.

We set out to predict solubility using our method of careful descriptor selection and analysis of the resulting models, which we call Causal Structure Prediction Relationship (CSPR). This resulted in highly interpretable models, which could subsequently be improved rationally. We also report five newly curated datasets in four solvents (water, ethanol, benzene and acetone), which are now freely available to the academic community. These models and datasets are the first of their kind for benzene and acetone, and the models across all four solvents compare favourably with previous benchmarks and perform well on an external dataset of real pharmaceutical solubility data. Highly accurate and interpretable models are key to predicting chemistry, and we hope this work will stimulate further research in the field.

Examining the physical chemistry of solubility was vital to choosing relevant descriptors. Experimental melting point was used to represent lattice interactions, whereas the interaction of solute and solvent was represented with descriptors derived from Density Functional Theory (DFT): zero-point and Gibbs energies, volume, orbital energies and charges based on Natural Population Analysis (NPA). Models were carefully analysed using Principal Component Analysis (PCA), correlation matrices, feature importance plots and descriptor selection. This analysis led to two successful improvements. First, a better solvation model was used for water predictions. Second, a consensus approach, which combines the predictions of four machine learning models with different protocols (Gaussian Process (GP), Support Vector Machine (SVM), ExtraTrees (ET) and Artificial Neural Network (ANN)), enhanced the predictions further.
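
To make the consensus idea concrete, here is a minimal sketch that assumes the consensus is a simple unweighted mean of each model's LogS prediction for a compound. The combination rule actually used in the paper may differ, and the function name `consensus_prediction` and all numbers below are hypothetical, chosen only for illustration.

```python
from statistics import mean


def consensus_prediction(predictions_by_model):
    """Average per-compound LogS predictions across models.

    Assumption: an unweighted mean; this is an illustrative choice,
    not necessarily the rule used in the paper.
    """
    lengths = {len(p) for p in predictions_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("every model must predict the same set of compounds")
    # zip(*...) groups the i-th prediction of every model together
    return [mean(per_compound)
            for per_compound in zip(*predictions_by_model.values())]


# Hypothetical LogS predictions for three compounds (not real data)
predictions = {
    "GP": [-2.0, -4.5, -1.0],
    "SVM": [-2.4, -4.1, -0.8],
    "ET": [-1.8, -4.3, -1.2],
    "ANN": [-2.2, -4.7, -1.0],
}

consensus = consensus_prediction(predictions)
print([round(value, 2) for value in consensus])
```

Averaging tends to cancel errors that the individual models do not share, which is one plausible reason a consensus can outperform its members.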

This led to highly interpretable models with accuracy close to the suspected experimental noise in the training data (LogS ± 0.7), with the final models placing 60-80% of predictions within this interval. Moreover, the models outperformed established solubility prediction protocols, such as COSMOtherm, and were validated on external solubility data from our industrial collaborators at AstraZeneca.
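
The "percentage of predictions within LogS ± 0.7" figure can be computed in a few lines. The sketch below assumes predicted and experimental values are paired lists of LogS; the function name `fraction_within` and the sample values are hypothetical and are not taken from the paper's datasets.

```python
def fraction_within(predicted, experimental, tolerance=0.7):
    """Return the fraction of predictions within +/- tolerance of experiment."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have equal length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    # Count compounds whose absolute error falls inside the tolerance band
    hits = sum(1 for p, e in zip(predicted, experimental)
               if abs(p - e) <= tolerance)
    return hits / len(predicted)


# Hypothetical LogS values for five compounds (not real data)
experimental = [-2.0, -4.5, -1.0, -3.2, -0.5]
predicted = [-2.3, -3.5, -1.1, -3.8, -1.5]

share = fraction_within(predicted, experimental)
print(f"{share:.0%} of predictions within LogS ± 0.7")
```

Reporting this share alongside a measure such as RMSE shows how often a model lands inside the expected experimental noise, rather than only how large its average error is.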

Figure 1 – (a) Gaussian Process model predictions of solubility in water compared to experimental values. (b) Same as (a) but for predictions in benzene. (c) One of the ways the models were analysed was through the importance weighting of descriptors for water (left, blue) and benzene (right, orange).

Samuel Boobier

PhD Researcher, University of Leeds