Whether you're a medicinal chemist or an informatician, QSAR datasets can be very helpful in understanding complex biological phenomena. These datasets typically consist of a hundred or fewer compounds associated with a specific parameter such as intestinal absorption, volume of distribution, blood-brain barrier penetration, or activity at one or more biological targets. Most of them are published as part of a paper appearing in a peer-reviewed journal.
Unlike chemistry databases, which typically pair a search engine and a user interface with a dataset of thousands or millions of compounds, the QSAR dataset is much more focused and raw. You need to supply your own data viewer, report generator, and query tool.
The Internet hosts a bewildering assortment of QSAR datasets tucked into various nooks and crannies. The problem is finding them. One useful resource is cheminformatics.org, which hosts a page linking to forty-four datasets.
Recently, Shaillay Kumar Dogra, Scientific Editor of QSARWorld, wrote in to let me know about the site's offering of forty-eight free QSAR datasets. Each dataset is linked to the primary literature and is available in four formats, including SD File. In contrast to many datasets, those at QSARWorld are manually curated. QSARWorld is also actively seeking new datasets to convert into machine-readable form; if you find one, write to them to have it added to the collection.
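Since a raw dataset leaves the viewing and querying up to you, here is a minimal sketch of what a homegrown query tool for an SD File might look like. It assumes the usual SD File layout: each record ends with a `$$$$` line, and each data item is a `> <NAME>` header followed by its value and a blank line. The function names (`parse_sdf`, `filter_by`), the `logBB` field name, and the sample records are all hypothetical; the values are placeholders made up for illustration, not taken from any QSARWorld dataset. Molecule structures are ignored entirely, so treat this as a starting point rather than a full parser.

```python
def _parse_record(lines):
    """Turn the lines of one SD File record into a name plus data fields."""
    record = {"name": lines[0].strip() if lines else "", "fields": {}}
    i = 0
    while i < len(lines):
        line = lines[i]
        start, end = line.find("<"), line.rfind(">")
        # A data header looks like "> <NAME>"; the value follows on the
        # next lines, up to the first blank line.
        if line.startswith(">") and start != -1 and end > start:
            name = line[start + 1:end]
            values = []
            i += 1
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            record["fields"][name] = "\n".join(values)
            continue
        i += 1
    return record


def parse_sdf(text):
    """Split SD File text on '$$$$' delimiter lines, one dict per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("$$$$"):
            records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # tolerate a missing final delimiter
        records.append(_parse_record(current))
    return records


def filter_by(records, field, predicate):
    """Return records whose numeric `field` value satisfies `predicate`."""
    hits = []
    for rec in records:
        raw = rec["fields"].get(field)
        if raw is None:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue  # skip non-numeric entries rather than guess
        if predicate(value):
            hits.append(rec)
    return hits


# Placeholder records with empty structures and invented values.
SAMPLE = """compound_A
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logBB>
0.52

$$$$
compound_B
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <logBB>
-1.10

$$$$
"""

if __name__ == "__main__":
    for rec in filter_by(parse_sdf(SAMPLE), "logBB", lambda v: v > 0):
        print(rec["name"], rec["fields"]["logBB"])
```

For anything beyond a quick look, a cheminformatics toolkit that actually understands the structures is the better choice; the point here is only that the data fields in an SD File are simple enough to query with a few dozen lines.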
Systematic efforts to collect, curate, and distribute raw data from the primary literature are long overdue. QSARWorld offers an intriguing model for doing so. Although QSARWorld doesn't yet appear to have addressed some non-scientific issues, such as intellectual property rights, the site's machine-readable raw data gives plenty of food for thought to anyone working with QSAR.
What's your favorite dataset resource?
Image Credit: B.G. Lewandowski