Unsupervised word embeddings capture latent knowledge from materials science literature
In our paper, we show how the application of an unsupervised machine learning model can capture information from the materials chemistry literature in a way that also uncovers latent knowledge previously unknown to the research community. Image Credit: Olga Kononova
Over the last 15 years there has been a surge in the use of machine learning to gain materials chemistry insights. These methods use existing data (largely computed with ab-initio methods) to train statistical models that can make useful predictions about whether chemical compounds will be stable, and the properties they are likely to exhibit. However, a large majority of the knowledge the scientific community has generated to date is recorded as “unstructured” text, and has therefore been largely inaccessible to machine-learning and statistical analysis. In recent years however, the Natural Language Processing (NLP) research community has made great progress on methods to computationally parse and learn from unstructured text. In our paper, we show how the application of an unsupervised NLP model can capture information from the materials chemistry literature in a way that also uncovers latent knowledge previously unknown to the research community. Moreover, this can be done without expert pre-labeling of data or injecting the model with any domain-specific knowledge.
As a first step to utilizing information expressed in text, we must translate words into vector representations (called word embeddings) that are understandable by machine learning models. The key insight behind Word2Vec, the technique we use to train our word embeddings, is that words that appear in similar contexts usually have similar meanings. It uses relative positions of words in the text, training to predict words from their surrounding context, to build a word embedding for every word in our vocabulary.
In our paper, we train this model on a database of over 3 million abstracts from materials research articles, and then investigate the degree to which they encode knowledge of materials chemsitry. We show that these embeddings express a surprising amount of domain knowledge, such as the underlying structure of the periodic table, as well as the atomic weights and electronegativities of the elements. In fact, we observe that certain directions in the embedding vector space correspond to insights about materials and their properties. Simple vector operations on the embeddings can be used to predict the magnetic properties, crystal structures, and symmetries of materials, or even more complex, composite properties, such as whether materials might be suitable for use as cathode materials in Li-ion batteries or as absorbers in photovoltaic solar cells.
Figure 1: Chemistry is captured by word embeddings. a. Two-dimensional t-SNE projection of the word embeddings of 100 chemical element names (e.g. "hydrogen'', "helium'') labelled with the corresponding element symbols and grouped according to their classification. b. The periodic table colored according to the classification shown in a. c. Predicted versus actual (density functional theory) values of formation energies of approximately 10,000 ABC2D6 elpasolite compounds using word embeddings of elements learned from text as features. The data points in the plot use 5-fold cross-validation. d. Error distribution for the 10% test set of elpasolite formation energies (MAE=56 meV/atom).
Our most exciting result is the discovery that these unsupervised embeddings go beyond just capturing the “known knowledge” in the materials literature, and can in fact reveal previously unknown information about materials and their properties. We show that our model can suggest new candidate materials for desired applications, such as for thermoelectric devices, with a strong success rate. We test this both by comparing these predictions with the results of ab initio density functional theory calculations and experimental data sets. We also validate the predictions by “going back in time”: re-training our model using only abstracts published before 2009 and comparing the model’s predictions with the subsequent 10 years of discoveries. We find that our model would have predicted some of the best thermoelectric materials uncovered in the past decade, such as CuGaTe2, years in advance of their actual discovery.
Figure 2: Validation of the predictions. a. Results of thermoelectric materials prediction using word embeddings obtained from various historical data sets. Each gray line uses only abstracts published before that year to make predictions (for example, predictions for 2001 are performed using abstracts from 2000 and earlier). The lines plot the cumulative percentage of predicted materials studied as thermoelectrics in the years following their predictions; earlier predictions can be analyzed over longer test periods. The results are averaged in red and compared to baseline percentages from either all materials (blue) or non-zero DFT band gap 15 (green) materials. b. Evolution of prediction ranks of top 5 predictions from the year 2009 data set as more data is collected. The marker indicates the year of first published report of one of the initial top 5 predictions as a thermoelectric.
We are excited to see what the implications for these findings will be going forward and we hope that our work can serve as a the starting point for a new breed of studies that use NLP to gain insights into materials chemistry. Our own ongoing work aims to gain even deeper insights from the full text of research papers using more advanced NLP methods and building new ways to construct word embeddings that encode even more knowledge about materials and their properties.
1. Faber, F. A., Lindmaa, A., Von Lilienfeld, O. A. & Armiento, R. Machine Learning Energies of 2 Million Elpasolite (ABC2D6) Crystals. Physical Review Letters 117, 2–7 (2016). 1508.05315.