It's a simple and increasingly common question: how can I enable users to search my website by chemical structure? Text search is relatively easy to implement, but exact structure search and substructure search are considerably more complex. This article offers a high-level overview for non-experts of what's needed to create a structure-searchable chemical database.
- Web Application Framework. This is the high-level software on which your current site is built or will be built. Popular options include PHP, Ruby on Rails, ASP.NET, and Django.
- A Database. There are many varieties to choose from. SQL flavors include MySQL and PostgreSQL. NoSQL flavors include CouchDB.
- Machine-Readable Structure Representations. A number of formats have been developed over the years. One of the most widely used is molfile, because nearly every software package supports it.
- Structure Canonicalizer. This software is used for fast exact structure search. A canonicalizer converts a chemical structure into a unique string of text that can be stored and searched using standard database technologies. InChI should be the default choice.
- Fingerprint Generator. This software converts chemical structures into fixed-length binary fingerprints, which are used to pre-screen structures during substructure search. The reason is simple: substructure search is computationally expensive, and good fingerprints can eliminate many unnecessary substructure comparisons. Several open source packages are suitable, including MX, CDK, Open Babel, and RDKit. A few closed-source offerings are also available.
- Substructure Matcher. Performs substructure matching through atom-by-atom search (ABAS). This software is used after the initial fingerprint screen. Although a fingerprint screen never generates a false negative, it can and does generate false positives. A substructure matcher ensures that no false positives are presented to your users. Packages that can generate fingerprints can usually perform substructure matches as well.
- Chemical Structure Editor. Allows chemists to draw structures to be searched. A few free structure editors are available. One ergonomic and fast-loading commercial product is ChemWriter, which is sold by my company.
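To make the fingerprint idea concrete, here is a toy sketch in Python. It assumes each structure has already been reduced to a set of feature strings (short bond labels in this example). The feature strings, the 64-bit length, and the function names are illustrative assumptions, not the scheme used by any of the packages listed above, which derive their features from the molecular graph.

```python
import zlib

FP_BITS = 64  # illustrative length only; production fingerprints are usually longer


def toy_fingerprint(features, n_bits=FP_BITS):
    """Hash each feature string onto one bit of a fixed-length bit vector."""
    fp = 0
    for feature in features:
        bit = zlib.crc32(feature.encode("utf-8")) % n_bits
        fp |= 1 << bit
    return fp


def may_contain(target_fp, query_fp):
    """True if every bit set in the query is also set in the target."""
    return (target_fp & query_fp) == query_fp


# Hypothetical feature sets: the query's features are a subset of the target's.
query_fp = toy_fingerprint({"C-C", "C-O"})
target_fp = toy_fingerprint({"C-C", "C-O", "C-N", "C=O"})

# A true substructure always passes the screen (no false negatives).
print(may_contain(target_fp, query_fp))
```

Because different features can hash to the same bit, an unrelated structure may also pass this test. That is the source of the false positives the Substructure Matcher has to remove.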
Your server receives the POST request and extracts the "molfile" field contained in the form data. This is a substructure search, so the first thing your server will do is generate a fingerprint of the query structure. Then, your application will compare the query fingerprint bit by bit with the fingerprints stored in your database: a stored structure remains a candidate only if every bit set in the query fingerprint is also set in its own fingerprint.
Fingerprint matching produces a list of candidate structures. Each candidate is retained only if it passes the atom-by-atom search test performed by your Substructure Matcher.
After all candidates have been tested with the Substructure Matcher, you'll have a list of hits. Using your Web Application Framework, the server will prepare a view of the results and render it for Felix.
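The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the record layout, the function name `substructure_search`, and the `is_substructure` callable are all assumptions. In a real application the callable would be supplied by your Substructure Matcher; here a plain string-containment test stands in for it so the example runs on its own.

```python
def substructure_search(query_fp, query_mol, records, is_substructure):
    """Return ids of records that pass the fingerprint screen, then the matcher.

    records: iterable of (record_id, fingerprint, molecule) tuples.
    is_substructure: callable(query_mol, molecule) -> bool, provided by
        whatever toolkit performs the atom-by-atom search (hypothetical here).
    """
    hits = []
    for record_id, fp, mol in records:
        # Stage 1: cheap bitwise screen; skip structures that cannot match.
        if (fp & query_fp) != query_fp:
            continue
        # Stage 2: expensive atom-by-atom check; removes false positives.
        if is_substructure(query_mol, mol):
            hits.append(record_id)
    return hits


# Stand-in data: integers as fingerprints, strings as "molecules".
records = [
    (1, 0b0111, "CCO"),  # passes the screen and the matcher
    (2, 0b0011, "CCN"),  # passes the screen, rejected by the matcher
    (3, 0b0100, "NN"),   # rejected by the screen, matcher never runs
]


def fake_matcher(query, mol):
    # Placeholder only: string containment is NOT real substructure matching.
    return query in mol


hits = substructure_search(0b0011, "CO", records, fake_matcher)
print(hits)  # [1]
```

Record 2 plays the role of a fingerprint false positive: it survives the screen but is removed by the second stage, so it never reaches the results view.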
Your server receives a POST request and extracts the "molfile" field from the form data. Using the Structure Canonicalizer, your application generates a unique string for Donepezil. It then searches the database for an exact match to that string. Your Web Application Framework then prepares a view and renders it for Felix.
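Because the canonical string is ordinary text, the lookup itself is a standard indexed query. The sketch below uses Python's built-in `sqlite3` module as a stand-in for whichever database you choose. The table layout, the function names, and the placeholder keys (`KEY-A`, `KEY-B`) are assumptions for illustration; in practice the key column would hold the strings produced by your Structure Canonicalizer.

```python
import sqlite3


def build_index(conn, rows):
    """Create a table whose canonical_key column is unique (and so indexed)."""
    conn.execute(
        "CREATE TABLE structures ("
        "id INTEGER PRIMARY KEY, name TEXT, canonical_key TEXT UNIQUE)"
    )
    conn.executemany(
        "INSERT INTO structures (name, canonical_key) VALUES (?, ?)", rows
    )


def find_exact(conn, canonical_key):
    """Return (id, name) for the structure with this key, or None."""
    return conn.execute(
        "SELECT id, name FROM structures WHERE canonical_key = ?",
        (canonical_key,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
# Placeholder keys; real ones would come from the canonicalizer.
build_index(conn, [("compound A", "KEY-A"), ("compound B", "KEY-B")])

print(find_exact(conn, "KEY-B"))  # (2, 'compound B')
print(find_exact(conn, "KEY-Z"))  # None
```

The uniqueness constraint also guards against storing the same structure twice under different drawings, since both drawings would canonicalize to the same key.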
Enabling exact structure search and substructure search takes a number of components working together. Although the process is not extremely complicated, assembling the right software packages and integrating them is no trivial task.