It's a simple and increasingly common question: how can I enable users to search my website by chemical structure? Text search can be done relatively easily, but chemical exact structure and substructure search are considerably more complex. This article offers a high-level overview for non-experts of what's needed to create a structure-searchable chemical database.
- Web Application Framework. This is the high-level software on which your current site is built or will be built. Popular choices include PHP, Ruby on Rails, ASP.NET, and Django.
- A Database. There are many varieties to choose from. SQL flavors include MySQL and PostgreSQL. NoSQL flavors include CouchDB.
- Machine-Readable Structure Representations. A number of formats have been developed over the years. Because nearly every chemistry software package supports it, one of the most widely used is the molfile.
- Structure Canonicalizer. This software is used for fast exact structure search. A canonicalizer converts a chemical structure into a unique string of text that can be stored and searched using standard database technologies. InChI should be the default choice.
- Fingerprint Generator. This software converts chemical structures into fixed-length binary fingerprints, which are used to pre-screen structures during substructure search. The reason is simple: substructure search is computationally expensive, and good fingerprints can eliminate many unnecessary substructure comparisons. A number of open source packages are suitable, including MX, CDK, Open Babel, and RDKit, as are a few closed-source offerings.
- Substructure Matcher. Performs substructure matching through atom-by-atom search (ABAS). This software is used after an initial fingerprint screen. Although fingerprints never generate a false negative, they can and do generate false positives. A substructure matcher ensures no false positives get presented to your users. Packages that can generate fingerprints can usually perform substructure matches as well.
- Chemical Structure Editor. Allows chemists to draw structures to be searched. A few free structure editors are available. One ergonomic and fast-loading commercial product is ChemWriter, which is sold by my company.
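To make the Substructure Matcher less abstract, here is a minimal sketch of atom-by-atom search written as a plain backtracking routine. It is a toy, not how any of the packages above implement matching: the `Mol` class and `is_substructure` function are hypothetical names invented for this illustration, molecules are reduced to element symbols and unordered bonds, and bond orders, hydrogens, aromaticity, and stereochemistry are all ignored.

```python
from typing import Dict, List, Set, Tuple


class Mol:
    """Toy molecule: element symbols plus an adjacency map (hypothetical type)."""

    def __init__(self, atoms: List[str], bonds: List[Tuple[int, int]]):
        self.atoms = atoms
        self.adj: Dict[int, Set[int]] = {i: set() for i in range(len(atoms))}
        for a, b in bonds:
            self.adj[a].add(b)
            self.adj[b].add(a)


def is_substructure(query: Mol, target: Mol) -> bool:
    """Return True if every query atom and bond can be mapped onto the target."""
    mapping: Dict[int, int] = {}  # query atom index -> target atom index

    def compatible(q: int, t: int) -> bool:
        # Elements must agree.
        if query.atoms[q] != target.atoms[t]:
            return False
        # Every already-mapped neighbour of q must be bonded to t in the target.
        return all(mapping[n] in target.adj[t] for n in query.adj[q] if n in mapping)

    def extend(q: int) -> bool:
        if q == len(query.atoms):
            return True  # all query atoms placed
        used = set(mapping.values())
        for t in range(len(target.atoms)):
            if t not in used and compatible(q, t):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]  # backtrack and try the next target atom
        return False

    return extend(0)


# Heavy-atom skeletons only (an assumption of this toy model).
ethanol = Mol(["C", "C", "O"], [(0, 1), (1, 2)])
acetic_acid = Mol(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
c_o = Mol(["C", "O"], [(0, 1)])
o_c_o = Mol(["O", "C", "O"], [(0, 1), (1, 2)])

print(is_substructure(c_o, ethanol))        # C-O occurs in ethanol
print(is_substructure(o_c_o, ethanol))      # ethanol has only one oxygen
print(is_substructure(o_c_o, acetic_acid))  # O-C-O occurs in acetic acid
```

The worst-case cost of this kind of search grows quickly with molecule size, which is why it is worth screening with fingerprints before calling it.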
Your server receives the POST request and extracts the "molfile" field contained in the form data. This is a substructure search, so the first thing your server will do is generate a fingerprint of the query structure. Your application then compares the query fingerprint against the fingerprints stored in your database: a stored structure remains a candidate only if every bit set in the query fingerprint is also set in its own fingerprint.
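As a rough illustration of what "generating a fingerprint" means, the sketch below hashes a set of structural features into a fixed-length bit string stored as a Python integer. Everything here is an assumption made for the example: the `fingerprint` function, the 64-bit length, and the hand-written feature strings such as `"C-O"`. Real toolkits derive their features (paths, rings, functional groups) from the structure itself and use their own hashing schemes.

```python
import zlib
from typing import Iterable

FP_BITS = 64  # assumed fingerprint length for this toy example


def fingerprint(features: Iterable[str], n_bits: int = FP_BITS) -> int:
    """Fold each feature into one bit of a fixed-length bit string."""
    fp = 0
    for feature in features:
        # crc32 is used only because it is deterministic across runs.
        bit = zlib.crc32(feature.encode("utf-8")) % n_bits
        fp |= 1 << bit
    return fp


# Hand-written features standing in for what a toolkit would extract.
ethanol_fp = fingerprint(["C", "O", "C-C", "C-O"])
query_fp = fingerprint(["C", "O", "C-O"])  # query fragment C-O

# Every feature of the query also occurs in ethanol, so every query bit
# is also set in ethanol's fingerprint.
print((query_fp & ethanol_fp) == query_fp)
```

Because different features can hash to the same bit, a structure can pass this test without containing the query. That is the source of false positives; a structure that does contain the query can never fail it.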
Fingerprint matching produces a list of candidate structures. A candidate is retained only if it also passes the atom-by-atom search performed by your Substructure Matcher.
Once every candidate has been tested with the Substructure Matcher, you have your list of hits. Using your Web Application Framework, the server will prepare a view of the results and render it for Felix.
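The screen-then-verify flow described above can be sketched in a few lines. This is an outline under stated assumptions rather than a reference implementation: `substructure_search` is a hypothetical function, fingerprints are plain integers, and the matcher is passed in as a callable. In the demo, `fake_matcher` and the `structure-N` strings are stand-ins for a real atom-by-atom matcher and real molfiles.

```python
from typing import Any, Callable, Iterable, List, Tuple

Record = Tuple[str, int, Any]  # (record id, stored fingerprint, structure)


def substructure_search(
    query: Any,
    query_fp: int,
    records: Iterable[Record],
    is_substructure: Callable[[Any, Any], bool],
) -> List[str]:
    """Screen by fingerprint, then confirm survivors with the matcher."""
    hits: List[str] = []
    for rec_id, fp, structure in records:
        # Cheap screen: every query bit must also be set in the stored fingerprint.
        if (query_fp & fp) != query_fp:
            continue
        # Expensive check, run only on candidates that passed the screen.
        if is_substructure(query, structure):
            hits.append(rec_id)
    return hits


# --- Demo with stand-in data (all values below are made up) ---
matcher_calls: List[str] = []
TRUE_MATCHES = {"structure-1"}


def fake_matcher(query: Any, structure: Any) -> bool:
    """Stand-in for a real atom-by-atom matcher; records each call."""
    matcher_calls.append(structure)
    return structure in TRUE_MATCHES


records = [
    ("mol-1", 0b1101, "structure-1"),  # passes screen, real match
    ("mol-2", 0b0100, "structure-2"),  # rejected by the screen
    ("mol-3", 0b0111, "structure-3"),  # passes screen, false positive
]

hits = substructure_search("query", 0b0101, records, fake_matcher)
print(hits)           # ['mol-1']
print(matcher_calls)  # ['structure-1', 'structure-3']
```

The matcher runs on two of the three records instead of all three. On a real database the screen would normally run as close to the stored fingerprints as possible, so that only the surviving candidates are loaded for matching.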
Your server receives a POST request and extracts the "molfile" field from the form data. Using the Structure Canonicalizer, a unique string is generated for Donepezil. Your application searches the database for an exact match to that string. Your Web Application Framework then prepares a view and renders it for Felix.
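Once structures are reduced to canonical strings, exact search becomes an ordinary indexed text lookup. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout and the `find_exact` helper are assumptions made for this example, and the InChI strings are written out by hand for three small molecules; in a real application they would come from your Structure Canonicalizer.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint gives SQLite an index on the canonical string.
conn.execute(
    "CREATE TABLE compounds ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " inchi TEXT NOT NULL UNIQUE)"
)
conn.executemany(
    "INSERT INTO compounds (name, inchi) VALUES (?, ?)",
    [
        ("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
        ("methane", "InChI=1S/CH4/h1H4"),
        ("water", "InChI=1S/H2O/h1H2"),
    ],
)
conn.commit()


def find_exact(db: sqlite3.Connection, inchi: str) -> Optional[str]:
    """Return the name of the compound with this canonical string, if any."""
    row = db.execute(
        "SELECT name FROM compounds WHERE inchi = ?", (inchi,)
    ).fetchone()
    return row[0] if row else None


print(find_exact(conn, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))    # ethanol
print(find_exact(conn, "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"))   # None (benzene not stored)
```

The same pattern carries over to MySQL or PostgreSQL: store the canonical string in an indexed column and compare with plain equality.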
Enabling exact structure search and substructure search takes a number of components working together. Although the task is not extremely complicated, assembling the right software packages and integrating them is far from trivial. If you'd like to learn more about enabling structure search on your website, please feel free to drop me a line.