As I sit here, reflecting on the journey that led to the automated generation of over 67 million natural product-like molecules, I can't help but feel a sense of awe and excitement. This endeavour, driven by the power of artificial intelligence (AI) and deep generative models, has the potential to transform the way we explore and harness the wonders of nature for the betterment of society.
Nature is an extraordinary source of diverse and bioactive compounds that have the power to revolutionize fields such as medicine, agriculture, and food production. From ancient times, humans have been aware of the healing properties of plants and the unique chemistry they possess. In fact, many of our most effective antibiotics can be traced back to natural products. However, the process of discovering and harnessing these compounds has been slow and resource-intensive, often leading to limited success.
How can AI help us explore Nature?
That is where our work comes in. We embarked on a mission to leverage the power of AI and deep generative models to explore the vast chemical space of natural products in a high-throughput and cost-effective manner. By training a recurrent neural network on known natural products, we were able to generate over 67 million compounds, a staggering 165-fold expansion of the known natural product space.
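To put the two headline numbers side by side, here is a small back-of-the-envelope calculation. It assumes that "165-fold" means the number of generated molecules divided by the number of known ones, and it treats "over 67 million" as exactly 67 million, so the result is only a rough estimate of the size of the known set.

```python
# Figures quoted in the text; "over 67 million" is rounded down to 67 million here.
generated = 67_000_000
fold_expansion = 165

# Assumption: fold expansion = generated molecules / known molecules.
implied_known = generated / fold_expansion

print(f"Implied size of the known set: roughly {implied_known:,.0f} molecules")
```

Under those assumptions the known natural product space works out to a few hundred thousand molecules.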
The motivation behind this work stemmed from the realization that traditional methods of natural product discovery were reaching their limits. The laborious and expensive process of manually curating and characterizing natural product libraries was a significant barrier to progress. The scientific community needed a breakthrough, a way to explore the uncharted territories of natural product chemical space efficiently and comprehensively.
How can deep generative models help?
Inspiration struck when we saw the potential of deep generative models. These AI-driven architectures have the unique ability to transcend human-dependent design and significantly expand the chemical search space. Variational autoencoders, recurrent neural networks, and generative adversarial networks became our tools of choice. Among them, the SMILES-based recurrent neural network with long short-term memory (LSTM) units emerged as the most suitable for our purposes. It demonstrated an impressive capability to generate novel and diverse molecules, even with limited training data.
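To make "SMILES-based" a little more concrete, the sketch below shows one way a SMILES string can be split into tokens and turned into the integer sequence a recurrent network would read. The regular expression, the `tokenize` and `build_vocab` helpers, and the `<start>`/`<end>` markers are illustrative assumptions, not the tokenizer we actually used, and the pattern covers only common SMILES syntax.

```python
import re

# Hypothetical token pattern covering common SMILES syntax (not exhaustive).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [C@@H] or [NH3+]
    r"|Br|Cl"                # two-letter atoms, tried before single letters
    r"|[BCNOSPFI]"           # one-letter aliphatic atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#$:/\\\-+().~*]"    # bonds, branches, dot, wildcard
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Map every token seen in the training set to an integer id."""
    vocab = {"<start>": 0, "<end>": 1}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


training_set = ["CCO", "C[C@@H](N)C(=O)O", "Clc1ccccc1"]
vocab = build_vocab(training_set)

# One training sequence: start marker, token ids, end marker.
encoded = [vocab["<start>"]] + [vocab[t] for t in tokenize("CCO")] + [vocab["<end>"]]
print(tokenize("C[C@@H](N)C(=O)O"))
print(encoded)
```

A network trained on sequences like `encoded` learns to predict the next token given the ones before it, and new molecules are produced by sampling one token at a time until the end marker appears.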
Our approach was straightforward yet powerful. We trained the LSTM model on a collection of known natural products so that it could learn the molecular language of nature: how to assemble SMILES tokens into unique, natural product-like SMILES strings. We first generated a massive database of 100 million SMILES strings, then eliminated invalid and duplicate entries. Subsequent curation, standardization, and analysis with cheminformatics toolkits refined the database to a robust collection of 67 million validated, unique, and natural product-like molecules.
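The filtering step can be pictured with a toy example. The sketch below is a simplified stand-in, not our actual pipeline: `looks_valid` is a hypothetical syntax check (balanced branches and paired ring-closure digits), whereas a real workflow would parse each string with a cheminformatics toolkit and compare canonical forms, so that two different spellings of the same molecule count as duplicates.

```python
def looks_valid(smiles):
    """Crude syntax check: balanced branches and paired ring-closure digits.

    Assumption: this only approximates validity. A cheminformatics toolkit
    would also check valence, aromaticity, and atom symbols.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue              # digits inside [...] are not ring closures
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False      # branch closed before it was opened
        elif ch.isdigit():
            open_rings ^= {ch}    # first sighting opens a ring, second closes it
    return bool(smiles) and depth == 0 and not open_rings and not in_bracket


def curate(sampled):
    """Keep the first occurrence of every string that passes the check."""
    seen = set()
    kept = []
    for smiles in sampled:
        smiles = smiles.strip()
        if not looks_valid(smiles) or smiles in seen:
            continue
        seen.add(smiles)
        kept.append(smiles)
    return kept


# Made-up sample output: a duplicate, an unclosed ring, an unclosed branch, an empty string.
raw = ["CCO", "CCO", "C1CCCCC1", "C1CCCC", "CC(=O", "c1ccccc1O", ""]
library = curate(raw)
print(f"kept {len(library)} of {len(raw)}: {library}")
```

Exact string matching is the simplest possible duplicate test; it is used here only to keep the example free of external dependencies.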
What does this mean for society?
The impact of this innovation is multi-faceted and far-reaching. Firstly, the sheer expansion of the natural product library by 165-fold opens up uncharted territories of chemical space. The vast number of molecules generated provides a wealth of potential candidates for exploration, offering researchers a goldmine of potentially bioactive compounds waiting to be discovered.
Moreover, our approach is a game-changer in terms of cost and efficiency. The time and resources required for traditional natural product discovery are significantly reduced. Our entire training and sampling process took less than 24 hours, using readily available computational resources. In contrast, commercially available natural product libraries can cost tens of thousands of dollars, making them inaccessible to many researchers. Our innovation democratizes access to a wealth of natural product-like molecules and empowers scientists across the globe to embark on transformative research.