Solutions to IT problems

Solutions I found when learning new IT stuff

Creating a Framework for Chemical Structure Search – Part 1

leave a comment »

Series Overview

This is Part 1 – Simplistic Introduction to Cheminformatics of the “Creating a Framework for Chemical Structure Search“-Series.



Anyone reading my previous blog entry Spring 3 and Hibernate 4 for beginners probably realized that I’m working on something that has to do with chemistry and I don’t mean the thing you have going on with your pretty co-worker next door!

I’m going to shamelessly use this blog to “promote” my Open Source Project Molecule Database Framework. This framework allows you to rapidly create chemical structure search enabled database applications.

At least the first few parts of a long series will focus on the involved chemistry aspects and not on programming and code examples. The arguments made here obviously affect the design of the end product. Basic chemistry knowledge is required for this post.

What is a Chemical Structure?

A Chemical Structure is a model of what atoms compounds are build up of and how they are connected with each other. But enough with words. Here are 2 examples:

Water (aka H2O):

Chemical Structure of Water

Caffeine (aka C8H10N4O2 or the stuff that doesn’t make you yawn at your boss):

Chemical Strucutre of Caffeine

Searching for Chemical Structures

After seeing above images you hopefully asked yourself the question: How can these “images” be searched?. There are several answer: it depends on how you want to search. There are fundamentally 3 types of searches:

  • Full Structure Search
  • Sub Structure Search
  • Similarity Search

Full Structure Search

This is the easiest type. The chemical structure is converted to a chemically meaningful String and then all you do is exact string matching. There are several such formats the 2 most common being


SMILES alone is not useful as depending on how the structure is “drawn” and on the implementation the output for the same structure can be different. To partially solve that problem you need to use so called canonical SMILES which always returns the same string. However the actual output will still depend on the used algorithm and portability to another one is not guaranteed. Advantage of SMILES is the human readability. One way to write the SMILES for caffeine is:



In contrast to SMILES the so called standard InChI will always generate the same String for the same chemical structure. InChI was initially developed by NIST and now is supported by the InChi Trust. InChI is a pretty new format and only in the last few years has become more widespread. Downside compared to SMILES is that it is not very human readable.

The generated InChI can be converted to a hash, the so called InChIKey. An InChIkey has no chemical meaning and can not be converted back to an InChI (it’s a hash…). You probably get the point. The InChikey is an ideal candidate for performing full structure search.

The InChI for caffeine is:


(note: the InChI=1S is part of the string and required.)

The InChikey for caffeine is:


Source of SMILES and InChI: chemspider

Sub Structure Search

A Sub Structure Search is for finding any structure that contains the query structure. As images say more than 1000 words below again caffeine (left) and an example of one of many sub-structures that match caffeine (right):

Chemical Strucutre of CaffeineImidazole

A substructure search for the chemical structure on the right, called Imidazole, will find caffeine. But it would also find a lot of other structures. A random example being:

4-Amino-1H-imidazole-5-carboxylic acid

I hope above examples made it clear what is meant by a substructure search (Note: This assumes above structures are in the database you are searching in).

The question you need to ask yourself is: How can this search be done? To point out the obvious: This can’t be done by comparing Strings. The solution comes from a specific math field called Graph Theory. A graph has nodes (the atoms) and edges (bonds). Here an example taken from Wikipedia:

graph example

Nodes and edges can have attributes, eg. atom type and bond type (single, double…). The specific problem from graph theory we are interested in is called subgraph isomorphism. If you are interested in the details, please read the linked articles.

The problem with subgraph isomorphism is that it is so called NP-complete.

Although any given solution to such a problem can be verified quickly, there is no known efficient way to locate a solution in the first place; indeed, the most notable characteristic of NP-complete problems is that no fast solution to them is known.

Or said in other words: subgraph isomorphism (substructure search) can be computationally very expensive. I will address the solution to this issue in-depth in the next part. Simplified you can do pre-filter structures that will certainly not match and then only need to to the subgraph isomorphism on a subset of the database.

Similarity Search

This will be addressed in the next part. The solution uses the same principle as above mentioned pre-filter.


Written by kienerj

April 5, 2013 at 13:25

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: