In this month’s issue of Nature Reviews Drug Discovery, Monya Baker wrote a ‘News & Analysis’ piece about open access chemistry databases: though there are a number of free chemical databases (including PubChem, which I blogged about last spring), “the chemical data [in these open access databases] still pale in comparison to what already exists in other databases and the published literature.”
One problem is that it takes a great deal of time to collect data for these large databases: “PubChem’s director, Stephen Bryant, says he lacks the staff and mandate to collect data from published literature and patents.” So it’s not surprising that the Chemical Abstracts Service (CAS) database contains more chemical information than PubChem: whereas PubChem has about eight million unique structures, CAS contains nearly 30 million organic and inorganic substances.
In addition, there are some concerns about quality control in these open access databases: “[t]he screening data [in PubChem] are less rigorous than those in peer-reviewed articles, and contain many false positives. Deposited data aren’t curated, and so mistakes in structures, units and other characteristics can and do occur.” I can’t imagine how frustrating it would be to synthesize a molecule that was listed as a ‘hit’ in one of those databases, only to find that it was inactive because someone had mixed up the stereochemistry (or omitted a double bond)…
What are your experiences with these databases? Have you used them in your own work? If so, were they useful? What would you do to make them better? Do you think that the problems with these open access databases are the sort of ‘growing pains’ that happen for any new technology/database, or is there something special/unique about developing open access chemistry databases?
Joshua Finkelstein (Associate Editor, Nature)