Solutions to IT problems

Solutions I found when learning new IT stuff

Creating a Framework for Chemical Structure Search – Part 2


Series Overview

This is Part 2, “Substructure Search Performance”, of the “Creating a Framework for Chemical Structure Search” series.

Introduction

In this part I will cover pre-filtering as the solution to slow chemical substructure searches and briefly touch on similarity searching.

Substructure Searching and Performance

History and User Expectations

As explained in Part 1 of this series, chemical substructure search can be computationally very expensive. 20 to 30 years ago this was a very big issue because of the limited computing power available. Relying on subgraph isomorphism alone was not feasible then and still is not, because today's researchers want to search databases containing millions of compounds and expect the results as quickly as possible, meaning within seconds.

Fingerprints

The solution is to filter out any records that cannot match the query structure before running the actual substructure search. This filtering is done with so-called fingerprints. A fingerprint is just a set of bits. If a bit is set, the given chemical structure has the feature associated with that bit (note: this is simplified and not strictly true, because in a hashed fingerprint several features can map to the same bit). Below is an example of a hashed fingerprint:

chemical hashed fingerprints

Image Source and further explanations

The important property of a fingerprint is that any bit set for the query structure is also set for every structure that contains the query. This is checked with a logical AND of the query fingerprint and the fingerprint of every target structure in the database. If queryFingerprint AND targetFingerprint == queryFingerprint, the target may contain the query as a substructure. All other molecules can be filtered out. So there are false positives but no false negatives.
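The bit test can be sketched in a few lines of Python. This is only a minimal illustration, assuming fingerprints are stored as plain integers used as bit sets; the function name and the 8-bit example values are made up for this sketch and do not come from any particular cheminformatics toolkit.

```python
def passes_screen(query_fp: int, target_fp: int) -> bool:
    """Return True if every bit set in query_fp is also set in target_fp."""
    # The AND keeps only the bits both fingerprints share. If that result
    # equals the query fingerprint, no query bit is missing from the target.
    return (query_fp & target_fp) == query_fp


# Hypothetical 8-bit fingerprints, invented for illustration only.
query_fp = 0b00101000
candidate_fp = 0b10101100  # has both query bits: stays a candidate
rejected_fp = 0b10000100   # lacks both query bits: cannot contain the query

candidates = [fp for fp in (candidate_fp, rejected_fp)
              if passes_screen(query_fp, fp)]
```

Note that the comparison is against the query fingerprint. A target normally has more bits set than the query, so comparing against the target fingerprint would reject almost every real hit.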

Comparing fingerprints is extremely fast on modern CPUs, so the time added by the fingerprint comparison is minimal. In return, depending on your database content and query structure, filtering by fingerprint can eliminate 90% of the records in your database before the expensive substructure match runs.
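To show how the cheap screen and the expensive match fit together, here is a toy two-stage search. It is an analogy, not chemistry: plain strings stand in for molecules, Python's `in` operator stands in for the subgraph isomorphism test, and `toy_fingerprint` hashes short character fragments to bits the way a hashed fingerprint hashes structural fragments. All names and the 64-bit size are assumptions made for this sketch.

```python
import zlib

FP_BITS = 64  # toy size chosen for this sketch


def toy_fingerprint(text: str, max_len: int = 3) -> int:
    """Hash every fragment of 1..max_len characters to one bit."""
    fp = 0
    for size in range(1, max_len + 1):
        for start in range(len(text) - size + 1):
            fragment = text[start:start + size]
            fp |= 1 << (zlib.crc32(fragment.encode()) % FP_BITS)
    return fp


def search(query: str, database: dict[str, int]) -> tuple[list[str], int]:
    """Return the true hits and the number of records that needed verifying."""
    query_fp = toy_fingerprint(query)
    # Stage 1: cheap bit test, drops records that cannot contain the query.
    candidates = [record for record, fp in database.items()
                  if (query_fp & fp) == query_fp]
    # Stage 2: the "expensive" exact test (here just a substring check)
    # runs only on the surviving candidates and removes false positives.
    hits = [record for record in candidates if query in record]
    return hits, len(candidates)


records = ["ethanol", "methanol", "propanol", "benzene", "toluene"]
database = {record: toy_fingerprint(record) for record in records}
hits, verified = search("anol", database)
```

Every fragment of the query also occurs in any record that contains the query, so a real hit can never fail the screen. Hash collisions can only let extra candidates through, which is why the second stage is still needed.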

Similarity Searching

Similarity searching compares fingerprints to each other using a similarity measure (there are several), and the result is a score, often given as a percentage, of how similar the two chemical structures are. The most commonly used measure is Tanimoto similarity. Of course, the result depends on the fingerprint used and not only on the measure.
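For bit fingerprints, the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. A minimal sketch, again assuming fingerprints are plain integers; the function name and example values are made up, and returning 0.0 for two empty fingerprints is an arbitrary choice for this sketch.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two bit fingerprints, between 0.0 and 1.0."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one
    if total == 0:
        return 0.0  # both fingerprints empty: convention chosen here
    return common / total


# Hypothetical 8-bit fingerprints: 3 shared bits, 5 bits set overall.
similarity = tanimoto(0b10101100, 0b00101110)  # 3 / 5 = 0.6
```

Multiplying the score by 100 gives the percentage mentioned above.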

For a lot more detail on fingerprints and similarity searching, I highly recommend reading the Daylight theory on fingerprints.

What’s next?

The first 2 parts were a very brief and simplistic introduction to cheminformatics, and in particular to chemical structure searching. The next part will discuss the current landscape, especially the available solutions for chemical structure searching, and explain why I created a free, open-source framework for building database applications with chemical structure search.

Written by kienerj

April 7, 2013 at 14:49
