Solutions to IT problems

Solutions I found when learning new IT stuff

Posts Tagged ‘chemical-structure-search

Creating a Framework for Chemical Structure Search – Part 9

leave a comment »

Series Overview

This is Part 9 – Putting it all together of the “Creating a Framework for Chemical Structure Search“-Series.

Previous posts:

Introduction

In this final post I’m going to show you a basic Spring MVC 3 Web Application I made based on MoleculeDatabaseFramework.

Functionality

This Web Application MDFSimpleWebApp lets you

  • import ChemicalCompounds from an SD-File
  • do a chemical substructure search for compounds
  • view the search hits in a paged, tabular fashion
  • view individual search hits
  • download all search hits as SD-File

This is of course only a subset of all the features offered by MoleculeDatabaseFramework but it gives you a general idea how the framework works in terms of writing code and performance.

Entity

MDFSimpleWebApp contains 1 ChemicalCompound implementation called SimnpleCompound. It is the most basic possible implementation of ChemicalCompound with no additional properties.

There is also the entity SimpleLot which extends Containable. However it is not yet currently used within the application.

Repository and Service

MoleculeDatabaseFramework requires that you create a repository interface, a repository implementation (for chemical structure searching), a service interface and a service for each of your entities. Hence I created a SimpleCompoundRepository, SimpleCompoundRepositoryImpl, SimpleCompoundService and SimpleCompoundServiceImpl. These classes offer no custom search methods. They just implement all the methods required by the framework. See the Repository- and Service Packages.

SimpleCompoundController

This is the controller for SimpleCompound. The controller takes web requests and passes them on the Service Layer, in this case this is SimpleCompoundService, an implementation of ChemicalCompoundService. The controller exposes certain methods from the service like importing of SD-Files, chemical substructure searching or image rendering of chemical structures.

Rendering Images of chemical structures

For displaying chemical compounds I choose the option to dynamically generate images of all chemical structures in the compound. This functionality is also provided by MoleculeDatabaseFramework. Hence the according controller method is very simple:

@RequestMapping(value = "/{compoundId}/render", method = RequestMethod.GET)
public void renderCompound(@PathVariable Long compoundId,
		final HttpServletResponse response,
		@RequestParam(defaultValue = "500") int width,
		@RequestParam(defaultValue = "150") int height) throws IOException {
	try (ServletOutputStream out = response.getOutputStream()) {
		IAtomContainer mol = compoundService.getCdkMolecule(compoundId);
		MoleculeRenderer renderer = new MoleculeRenderer(width, height);
		renderer.renderMolecule(mol, out);
	}
}

and in a web page you just need to add the according image tag.

JSP with JSTL:

<img src="<c:url value="/compound/${compound.getId()}/render?width=500&height=300"/>" />

Or generated in JavaScript:

var html = <img alt="' + smiles + '" src="/MDFSimpleWebApp/compound/'+ compoundId + '/render" />
// insert image into existing html element

As example here an image of the web page for viewing a compound:

rendering example

Importing SD-File

For uploading a file using Spring MVC 3 I followed this tutorial. I had to create the very simple class FileUploadForm and the controller method is rather simple too:

@RequestMapping(value = "/import", method = RequestMethod.POST)
public String importCompounds(Model model, FileUploadForm fileUploadForm,
		BindingResult result)
		throws IOException {

	if (result.hasErrors()) {
		model.addAttribute("hasError", true);
		model.addAttribute("bindingResult", result);
		model.addAttribute(fileUploadForm);
		return "importCompounds";
	}
	Reader reader = new InputStreamReader(fileUploadForm.getFileData().getInputStream(), "US-ASCII");
	EntityImportResult importResult = compoundService.importSDF(reader, true);

	model.addAttribute("hasError", false);
	model.addAttribute("imported", importResult.getImportedEntities().size());
	model.addAttribute("present", importResult.getEntitiesAlreadyInDatabase().size());
	model.addAttribute(new FileUploadForm());

	return "importCompounds";
}

Chemical Structure Search

Search Form

The Chemical Structure Search is made up of a page that contains a tool for drawing chemical structures and submitting the search and the actual page for displaying search results. For drawing chemical structures MDFSimpleWebApp initially used the JChemPaint Applet but I recently changed it to JSME, a JavaScript based drawing tool. See below the search form with JSME:

Chemical Structure Search Form

Search Result Page

The search results page relies heavily on AJAX using JQuery and the JQuery plugin datatables. The search hits are displayed in paged fashion using datatables server-side processing and hence only 1 page of results is fetched from the database. The results table contains an image of the chemical structure, the compounds name and its CAS number. Clicking on the image will show a JavaScript alert containing the SMILES String of the given chemical structure.

Search Results Page

For each new page a AJAX request is sent to the server and the according page is returned. Note that the initial load of the page can take a bit longer. This is due to the fact that the total amount of hits is determined (eg. no SQL LIMIT-Clause). This count is cached so that all page requests are as fast. However due to how OFFSET and LIMIT work, the higher the page number, the longer the search takes. So if you have a high number of hits (eg. several thousands) the last page will load slower than the first one. If you want to display search hits 10’000 to 10’004 the database will search up to hit number 10’004 and then return the last 5 hits. However in general you should improve your search if you get so many hits.

After the page is returned from the server, the data must be converted to JSON and in a format expected by datatables. To achieve that I create the helper class JQueryDatatablesPage that contains all the properties that datatables requires and the according getters and setters. JQueryDatatablesPage is then converted to JSON using Jackson 2 ObjectMapper.

@RequestMapping(value = "/search", method = RequestMethod.GET, produces = "application/json")
public @ResponseBody
String search(
		@RequestParam int iDisplayStart,
		@RequestParam int iDisplayLength,
		@RequestParam int sEcho, // for datatables draw count
		@RequestParam String structure) throws IOException {

	int pageNumber = (iDisplayStart + 1) / iDisplayLength;
	PageRequest pageable = new PageRequest(pageNumber, iDisplayLength);
	Page<SimpleCompound> page = compoundService.findByChemicalStructure(structure, StructureSearchType.SUBSTRUCTURE, pageable);
	int iTotalRecords = (int) compoundService.count(null);
	int iTotalDisplayRecords = (int) page.getTotalElements();
	JQueryDatatablesPage<SimpleCompound> dtPage = new JQueryDatatablesPage<>(
			page.getContent(), iTotalRecords, iTotalDisplayRecords,
			Integer.toString(sEcho));

	String result = toJson(dtPage);
	return result;

}

private String toJson(JQueryDatatablesPage<?> dt) throws IOException {
	ObjectMapper mapper = new ObjectMapper();
	mapper.registerModule(new Hibernate4Module());
	return mapper.writeValueAsString(dt);
}

Jackson 2 can deal with circular references if your entities are annotated with

@JsonIdentityInfo(generator=ObjectIdGenerators.IntSequenceGenerator.class, property="@id")

You also need to register the Hibernate4Module to deal with Lazy Collections!

Donwload of Search Hits

You can download search result hits as SD-File by clicking on the Download Hits-Link on the search results page. The browser will display a dialog were you want to save the file. This uses the exportSDF()-method of SimpleCompoundService.

@RequestMapping(value = "/downloadHits", method = RequestMethod.GET)
public void downloadHits(@RequestParam String structure, HttpServletResponse response) throws IOException {

	List<Long> ids = compoundService.findByChemicalStructure(structure, StructureSearchType.SUBSTRUCTURE);
	HashSet<String> properties = new HashSet<>();
	properties.add("compoundName");
	response.setContentType("chemical/x-mdl-sdfile");
	String disposition = "attachment; fileName=searchHits-" + structure + ".sdf";
	response.setHeader("Content-Disposition", disposition);
	ServletOutputStream output = response.getOutputStream();
	OutputStreamWriter writer = new OutputStreamWriter (output);
	compoundService.exportSDF(ids, writer, properties);
}

Final Words

See below a demo video of a Chemical Substructure Search in MDFSimpleWebApp with a database of 65’000 compounds. The demo runs on a dual-core mobile i5 running Windows 7 32-bit with 4 GB of RAM installed or said otherwise: The hardware is pretty mediocre.

MDFSimpleWebApp is hosted on bitbucket. If you want to try out this application you can go to the download section on bitbucket and download a fully working standalone version for Windows 64-bit including PostgreSQL, the Bingo Cartridge for Chemical Structure Searching, tomcat as servlet container and this web application. Note: This file is 105 MB due to PostgreSQL and tomcat being included.

Written by kienerj

June 6, 2013 at 12:41

Creating a Framework for Chemical Structure Search – Part 6

leave a comment »

Series Overview

This is Part 6 – Data Access Layer of the “Creating a Framework for Chemical Structure Search“-Series.

Previous posts:

Follow-ups:

Introduction

In the previous article I introduced the entity model of MoleculeDatabaseFramework. This article will explain the Data Access Layer which uses Spring-Data-JPA with Hibernate and how the Chemical Structure Search methods of the Bingo PostgreSQL Cartridge are exposed to Hibernate and QueryDSL.

How Spring-Data JPA works

Basic functionality

I quote from Spring-Data website:

Spring Data JPA aims to significantly improve the implementation of data access layers by reducing the effort to the amount that’s actually needed. As a developer you write your repository interfaces, including custom finder methods, and Spring will provide the implementation automatically.

You create a new interface that extends from generic interfaces provided by Spring-Data and represents the repository for an entity. There are different kinds of repository interfaces but the repositories in MoleculeDatabaseFramework all extend JpaRepository. JpaRepository provides CRUD-methods and some retrieval methods for your entity.

Repositories in MoleculeDatabaseFramework also extend QueryDslPredicateExecutor. This adds findOne(predicate) and findAll(predicate) methods. Predicates are basically type-safe WHERE-Clauses.

Custom query methods

Besides the provided methods you can add your custom search methods by following the findBy-method conventions of Spring Data JPA or by annotating a method with @Query were the value of the annotation is either a JPQL Query or native SQL.

Custom Queries providing your own method implementation

In case you have a very complex query that can’t be automatically created by Spring-Data, you can create them yourself.

1. Create Custom Query Interface

To achieve this you need to first create an interface containing the desired query method(s) and annotate it with @NoRepositoryBean:

@NoRepositoryBean
public interface ChemicalStructureSearchRepository<T> {

    Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType,
            Pageable pageable, Predicate predicate,
            String searchOptions,
            PathBuilder<T> pathBuilder);


    Page<T> findBySimilarStructure(String structureData,
            SimilarityType similarityType,
            Double lowerBound, Double upperBound,
            Pageable pageable, Predicate predicate,
            PathBuilder<T> pathBuilder);
}

This is the Source Code of ChemicalStructureSearchRepository minus JavaDoc comments.

2. Create a repository extending Custom Query interface

As an example below the Source Code for ChemicalCompoundRepository which extends ChemicalStructureSearchRepository:

@Repository
@Transactional(propagation = Propagation.MANDATORY)
public interface ChemicalCompoundRepository<T extends ChemicalCompound>
        extends ChemicalStructureSearchRepository<T>, JpaRepository<T, Long>,
        QueryDslPredicateExecutor<T> {
    
    List<T> findByCompositionsPkChemicalStructureId(Long structureId);
    
    T findByCas(String cas);

    @Query("select c from Containable c where c.chemicalCompound = ?1")
    List<Containable> getContainablesByCompound(ChemicalCompound compound);
}

3. Create an implementation of your repository

The convention is that the implementation is named after the repository with “Impl” appended, in this case ChemicalCompoundRepositoryImpl. This implementation must only implement your custom methods in this case defined in ChemicalStructureSearchRepository.

public class ChemicalCompoundRepositoryImpl<T extends ChemicalCompound>
        implements ChemicalStructureSearchRepository<T> {

	//...fields and constructors snipped...

    @Cacheable(STRUCTURE_QUERY_CACHE)
    @Override
    public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType, Pageable pageable,
            Predicate predicate, String searchOptions,
            PathBuilder<T> compoundPathBuilder) {
			
			//...implementation snipped...
    }


    @Cacheable(STRUCTURE_QUERY_CACHE)
    @Override
    public Page<T> findBySimilarStructure(String structureData,
            SimilarityType similarityType, Double lowerBound, Double upperBound,
            Pageable pageable, Predicate predicate,
            PathBuilder<T> compoundPathBuilder) {
			
			//...implementation snipped...
    }
}

Below an UML Class Diagram that shows the relationships of ChemicalCompoundRepository:

ChemicalCompoundRepository UML

Spring-Data automatically detects the repository implementation and combines all provided and all your custom search methods into one object which you use by calling them from ChemicalCompoundRepository.


Page<T> page = getRepository().findByChemicalStructure(structureData, searchType,
                pageable, predicate, searchOptions, pathBuilder);

Using the Repositories

MoleculeDatabaseFramework provides generic repositories for all entities in the entity model.

Source Code for all Repositories

To make use of a chemical structure search enabled repository you need to extend it using your specific entity implementation and optionally add your custom find methods:

@Repository
public interface RegistrationCompoundRepository extends ChemicalCompoundRepository<RegistrationCompound> {

    List<RegistrationCompound> findByRegNumberStartingWith(String regNumber);

}

That’s it!

You can find further information on how to implement entities and repositories in the MoleculeDatabaseFramework Tutorial as this article is meant to show the inner workings of the framework and not how to use it.

Exposing Bingo PostgreSQL Cartridge Methods

This is done by using a custom dialect extending Hibernates PostgreSQL82Dialect:

public class BingoPostgreSQLDialect extends PostgreSQL82Dialect {

    public BingoPostgreSQLDialect() {
         registerFunction("issubstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.sub"));
         registerFunction("isexactstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.exact"));
         registerFunction("matchessmarts", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.smarts"));
         registerFunction("matchesformula", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.gross"));
         registerFunction("issimilarstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3, ?4, ?5)::bingo.sim"));
         registerFunction("hasmassbetween", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1 > ?2::bingo.mass AND ?1 < ?3::bingo.mass"));         
    }
}

And as a usage example a source code snippet from ChemicalCompoundRepositoryImpl:

public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType, Pageable pageable,
            Predicate predicate, String searchOptions,
            PathBuilder<T> compoundPathBuilder) {
			
	//...snipped...
			
	BooleanExpression matchesStructureQuery; // this is a Predicate!

	switch (searchType) {
		case EXACT:
			matchesStructureQuery = BooleanTemplate.create(
					"isExactStructure({0},{1},{2}) = true",
					structure.structureData,
					ConstantImpl.create(structureData),
					ConstantImpl.create(searchOptions));
			break;
		case SUBSTRUCTURE:
			matchesStructureQuery = BooleanTemplate.create(
					"isSubstructure({0},{1},{2}) = true",
					structure.structureData,
					ConstantImpl.create(structureData),
					ConstantImpl.create(searchOptions));
			break;
		//...snipped other cases
	}

	baseQuery = baseQuery.from(compoundPathBuilder)
			.innerJoin(compound.compositions, composition)
			.innerJoin(composition.pk.chemicalStructure, structure)
			.where(matchesStructureQuery.and(predicate));
	//...snipped...
}

Full Source Code for ChemicalCompoundRepositoryImpl

The next Part will focus on the Service Layer. The Service Layer controls transactions and security.

Written by kienerj

May 2, 2013 at 07:51

Creating a Framework for Chemical Structure Search – Part 4

leave a comment »

Series Overview

This is Part 4 – Component Selection of the “Creating a Framework for Chemical Structure Search“-Series.

Previous posts:

Follow-ups:

Introduction

Finally I will start with the actual creation of the framework. In this part I will introduce the main components (existing 3rd party frameworks and libraries) I use and briefly explain my choices. At this point I think it is fair to mention that my work was basically integrating different existing software components into my desired end-product while taking into account real-world problems and offering a solution for them. There are no new magic algorithms in chemical structure searching, modeling or drug discovery to be found here!

My first try

In my previous effort at creating a framework for chemical structure search, I thought being platform independent, especially regarding the used relational database management system (RDBMS), is an important aspect. Therefore I relied on doing the chemical structure search in the application and not the database. However it is exactly that part that lead to huge performance and efficiency problems. I had to do some stuff that just felt wrong and “hacky” to get usable performance.

Encountered issues with Application-based Substructure Search

Object Creation Performance

The first issue was, that for every structure search, all the structures (molfiles) passing the fingerprint screen had to be loaded from the database and converted to an IAtomContainer Object from the Chemistry Development Kit. It was the creation of these objects that was very CPU intensive. This was due to the fact that you had to detect aromaticity and similar things for every AtomContainer object. I found the solution for this in OrChem, a free cartridge for Oracle based on the CDK. The creators seemed to have the exact same issue and came up with their custom format. That format stored everything required like aromaticity and so forth in a CDK-specific way so the creation of IAtomContainers was not an issue anymore.

Substructure Search Performance

The second issue was the mediocre performance of the substructure search itself. The solution was a complex approach using multi-threading and queues. The first thread screened all structures using the pre-generated fingerprints. Fingerprints were stored in the database but loaded into memory on application start. If a structure passed the screen it’s database id was put into a queue. A second thread reads form that queue, loaded the molfile from database and generated the IAtomContainer and put them into a second queue. Then there were multiple threads (configurable amount) that took the AtomContainers from the queue and did the actual test for subgraph isomorphism. Again, if a structure passed this phase too, it’s database id was put into the output queue and the AtomContainer discarded. This last step was required because AtomContainers are memory hogs and you had to control somehow how many there were in memory at any time.

CPU load now easily reached 100% for seconds during substructure searches. I then realized that the database alone could easily use 20% or more of that probably due to loading all the structures form it. So I added the option to hold the custom format from OrChem in memory ( not big of an issue actually in terms of memory consumption) to reduce load on database and hence use those CPU cycles for substructure search. I guess you have long figured out how convoluted this all was. But it actually worked amazingly well! Because the hits were put into a queue it was easily possible to display the first say 5 hits on a web page while the search continued in the background. So you could give the impression of a very fast search!

Why start from scratch again?

So why change it? Tons of reasons. All of this was done with plain JDBC and various kinds of data transfer objects. Tight-Coupling and maintainability was a serious issue. On the application side of things it was impossible to sort the results because hits are returned somewhat randomly and hence real paging was not possible either. The second thing was how could you search for a substructure and a numeric property at the same time? Well the solution for that was, that one of the substructure search methods had a Set-argument. The Set should contain the database ids of the structures the search should be performed over. Hence do an SQL query for the numeric property first and feed the ids into the substructure search. That worked but again, not very straight forward. Adding and using such custom properties to the database was rather messy too, it lacked proper transaction support and so forth. All in all it was nothing to be proud of and certainly not usable in a real production environment. I did however learn a lot about the Java 5 concurrency package.

Component for Substructure Search

I decided that being dependent on a specific RDBMS is a minor issue compared to above outlined problems. I already knew about the open-source Bingo Cartridge and to my luck the company behind it was developing a version for PostgreSQL. So my choice of this component was easy. Use PostgreSQL with Bingo, both are free and open-source.

Application-side Chemistry toolkit

Especially for Input-output the framework required a Chemistry Toolkit and I again chose the Chemistry Development Kit CDK.

ORM

While it would be preferable to be independent of the ORM, I wasn’t able to achieve that but I admit I did not but much effort in it. MoleculeDatabaseFramework uses JPA 2.0 and hibernate as it’s JPA provider. The part that is hibernate specific is the custom SQL dialect I created for accessing the Structure Search functions of Bingo in JPQL and hence also QueryDSL. There is no specific reason I chose hibernate except I already knew it and it was able to do what I required. So I did not investigate any other JPA providers.

Application Framework – Dependency-Injection

Well I guess this is obvious. I chose Spring. I’ve heard and read a lot about Spring. I’ve always wanted to learn it and this was my chance. I also did not want the framework to depend an a full-blown Java EE Application server.

Data Access Layer – CRUD and Querying

I initial started the project with plain Spring and JPA (Hibernate). But shortly after I in my “research” I read about Spring Data JPA and it’s integration with QueryDSL. I quote from Spring-Data website:

Spring Data JPA aims to significantly improve the implementation of data access layers by reducing the effort to the amount that’s actually needed. As a developer you write your repository interfaces, including custom finder methods, and Spring will provide the implementation automatically.

To illustrate this here an example snippet showing an example implementation of my framework:

@Repository
public interface RegistrationCompoundRepository extends ChemicalCompoundRepository {

    List findByRegNumberStartingWith(String regNumber);

}

RegistrationCompound has a property called regNumber. Above interface method is automatically implemented by Spring Data and will return a result List of the RegistrationCompounds that match the passed in argument. That’s all you need to write. No SQL and not even a method implementation. Just create the interface and then follow the findBy method conventions of Spring Data.

A Spring Data repository can also make use of QueryDSL.

Querydsl is a framework which enables the construction of type-safe SQL-like queries for multiple backends including JPA, JDO and SQL in Java.

Example:

List result = query.from(customer)
    .where(customer.lastName.like("A%"), customer.active.eq(true))
    .orderBy(customer.lastName.asc(), customer.firstName.desc())
    .list(customer);

If you use QueryDSL in your Spring Data Repository using QueryDslPredicateExecutor

@Repository
@Transactional(propagation = Propagation.MANDATORY)
public interface ChemicalCompoundRepository
        extends ChemicalStructureSearchRepository, JpaRepository<T, Long>,
        QueryDslPredicateExecutor {
    //...
}

the repository will have additional methods that take a QueryDSL Predicate as an input. A Predicate is basically the WHERE-Clause of the query, like from above example customer.lastName.like("A%"). Some methods take additional parameter like a Pageable. This can be used for paging, the Pageable includes the paging (limit, offset) and sorting information.

This all means it is trivial to extend the repository my framework provides and add your own custom search methods to it. With using predicates you can create complex queries which at the same time search by chemical substructure, return the result sorted and paged and all this with a 1-line method declaration.

public Page findByChemicalStructure(String structureData,
            StructureSearchType searchType,
            Pageable pageable, Predicate predicate);

So I hope this got you interested!

Creating a Framework for Chemical Structure Search – Part 3

leave a comment »

Series Overview

This is Part 3 – Current Cheminformatics Landscape of the “Creating a Framework for Chemical Structure Search“-Series.

Previous posts:

Follow-ups:

Introduction

In this part I will explain why I believe that a free, open-source framework for creating chemical structure search enabled database applications is needed. This is my personal opinion, nothing less or more.

Cheminformatics Landscape

There are several companies that offer a range of standard products. Not all companies offer all products. Common products are:

  • Chemical Drawing Program
  • Client for local or remote chemical databases
  • Database Cartridge (for chemical structure search in relational database
  • Chemistry Toolkit (API for one or multiple programming languages)

A database cartridge is an “add-on” to a relational database management system (RDBMS) that enables chemical structure searching directly within the RDBMS. A cartridge usually also supports the conversion between different chemical formats. In the database client the chemical structure search usually takes place client-side and no cartridge is needed.

Out of above components the companies create and offer some typical applications (often called “solutions”) like

  • Chemical Registration
  • Chemical Inventory
  • Electronic Lab Notebook (ELN)
  • Search / Analysis Tool
  • others

Those applications are either web applications or client-server applications. They usually require a commercial RDBMS like Oracle or SQL Server.

Most suppliers create their own proprietary format for handling chemical structures. While they offer the option to convert them to standard formats like SMILES or molfiles, the available functionality with those is usually limited. Certain functionality can not be converted at all.

Besides the commercial suppliers there are multiple free, open-source chemistry tool-kits for different programming languages. Also there is a professionally developed open-source database cartridge for Oracle, SQL Server and PostgreSQL.

What is the problem?

Commercial Solutions

The commercial solutions have a high risk for vendor lock-in. Once you decide for one product it will be much easier to integrate it with a product from the same vendor and licensing a whole bundle of products from one vendor is a lot cheaper. After accumulating data over several years migrating the whole application suite becomes very cumbersome and expensive.

Another issue is the need for a specific commercial RDBMS which are also known to be fairly expensive. There is no guarantee that the free version or the cheapest version is supported by the database cartridge. Often this isn’t even clearly documented and especially free versions have some serious limitations.

The support offered for commercial products is in my opinion not worth the money. Usually support is outsourced to “cheap countries” and the positions are filled with entry-level staffers and proficiency in English and their accent can be an issue that makes dealing with support much harder and more time consuming than it needs to be. The ones that stay longer and get more proficient in the applications usually move out of support rather quickly. There are good people there but you normally just don’t get into contact with them directly. Anyway the issues support can solve are almost always those a competent application admin can solve himself. And in case you find real bugs there is no guarantee they will be fixed soon if at all. And if you want new features you either have to be very patient with no guarantee you will ever get it or finance it yourself and the supplier will happily implement it for you.

Another issue is scientific reproducibility. The code is proprietary and hence you can not reproduce experiments that use such proprietary solutions unless you own a license too. Also you can’t guarantee that the code doesn’t have some bug of any kind making the results invalid.

Open-Source Products

The current issue is that these are more like individual components but not whole applications. To create as example a fully functional, chemical structure searchable Inventory System, that would require a significant development (programming) effort and hence the according expertise, be it in-house or outsourced. So while the commercial solutions are expensive, creating your own using open-source components can be even more expensive. This only makes sense if you are a large company with very special needs.

Conclusion

If you are a small to mid-level organizations your basically forced to buy a pricey product and RDBMS which you can barley afford. These systems are in general used by research departments which usually get less budget than say “production department” meaning while it is fine to spend millions in CRM and SRM system, this is not true for research systems.

Proposed Solution

The proposed solution would be a free, open-source ecosystem of cheminformatics applications. This is however a very long way to go. In this series I’m introducing you to the first, basic step: A free, open-source framework for creating chemical structure search enabled database applications.

Written by kienerj

April 15, 2013 at 12:41

Creating a Framework for Chemical Structure Search – Part 2

leave a comment »

Series Overview

This is Part 2 – Substructure Search Performance of the “Creating a Framework for Chemical Structure Search“-Series.

Previous posts:

Follow-ups:

Introduction

In this part I will cover the solution for per-filtering when performing a chemical substructure search and go briefly into the topic of similarity searching.

Substructure Searching and Performance

History and User Expectations

As explained in Part 1 of this series chemical substructure search can be computationally very expensive and 20 to 30 years ago this was an extremely big issue due to the limited computing power available. Hence relying only on sub graph isomorphism was just not feasible and still is not. Because a researcher of today wants to search in databases containing millions of compounds and he wants the results to be displayed as quickly as possible, meaning seconds.

Fingerprints

The solution is to filter out any records that can not match the query structure before the actual substructure search. This filtering is done by the use of so called fingerprints. A fingerprint is just a set of bits. If a bit is set, it means that the given chemical structure has the feature associated with that bit (Note: simplified, not really true). Below an example of a hashed fingerprint:

chemical hashed fingerprints

Image Source and further explanations

The important part about a fingerprint is that any bit set in the query structure will also be set in a structure containing it. This is checked by a logical AND of the query structure and every target structure in the database. If queryFingerprint AND targetFingerprint == targetFingerprint then the target might be a potential sub structure. All other molecule can be filtered out. So there are false-positives but no false-negatives.

Comparing fingerprints is extremely fast in modern CPUs. Hence the time added for the fingerprint comparison is minimal. In contrast depending on you database content and query structure filtering by fingerprint can eliminate 90% of the records in your database.

Similarity Searching

Similarity searching is comparing fingerprints to each other using a certain algorithm (there are different ones) and the result is a percentage how similar the 2 chemical structures are. The most common used measure (algorithm) is tanimoto similarity. Of course the results also depends on the used fingerprint and not only the algorithm.

For a lot more details on fingerprints and similarity searching I highly recommend to read daylight theory about fingerprints.

What’s next?

After the first 2 parts that were a very brief and simplistic introductions into cheminformatics and particular into chemical structure searching, the next part will discuss the current landscape especially available solutions for chemical structure searching and explain why I created a free, open-source framework for creating chemical structure search enabled database applications.

Written by kienerj

April 7, 2013 at 14:49

Creating a Framework for Chemical Structure Search – Part 1

leave a comment »

Series Overview

This is Part 1 – Simplistic Introduction to Cheminformatics of the “Creating a Framework for Chemical Structure Search“-Series.

Follow-ups:

Introduction

Anyone reading my previous blog entry Spring 3 and Hibernate 4 for beginners probably realized that I’m working on something that has to do with chemistry and I don’t mean the thing you have going on with your pretty co-worker next door!

I’m going to shamelessly use this blog to “promote” my Open Source Project Molecule Database Framework. This framework allows you to rapidly create chemical structure search enabled database applications.

At least the first few parts of a long series will focus on the involved chemistry aspects and not on programming and code examples. The arguments made here obviously affect the design of the end product. Basic chemistry knowledge is required for this post.

What is a Chemical Structure?

A Chemical Structure is a model of what atoms compounds are build up of and how they are connected with each other. But enough with words. Here are 2 examples:

Water (aka H2O):

Chemical Structure of Water

Caffeine (aka C8H10N4O2 or the stuff that doesn’t make you yawn at your boss):

Chemical Strucutre of Caffeine

Searching for Chemical Structures

After seeing above images you hopefully asked yourself the question: How can these “images” be searched?. There are several answer: it depends on how you want to search. There are fundamentally 3 types of searches:

  • Full Structure Search
  • Sub Structure Search
  • Similarity Search

Full Structure Search

This is the easiest type. The chemical structure is converted to a chemically meaningful String and then all you do is exact string matching. There are several such formats the 2 most common being

SMILES

SMILES alone is not useful as depending on how the structure is “drawn” and on the implementation the output for the same structure can be different. To partially solve that problem you need to use so called canonical SMILES which always returns the same string. However the actual output will still depend on the used algorithm and portability to another one is not guaranteed. Advantage of SMILES is the human readability. One way to write the SMILES for caffeine is:

Cn1cnc2c1c(=O)n(c(=O)n2C)C

InChI

In contrast to SMILES the so called standard InChI will always generate the same String for the same chemical structure. InChI was initially developed by NIST and now is supported by the InChi Trust. InChI is a pretty new format and only in the last few years has become more widespread. Downside compared to SMILES is that it is not very human readable.

The generated InChI can be converted to a hash, the so called InChIKey. An InChIkey has no chemical meaning and can not be converted back to an InChI (it’s a hash…). You probably get the point. The InChikey is an ideal candidate for performing full structure search.

The InChI for caffeine is:

InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

(note: the InChI=1S is part of the string and required.)

The InChikey for caffeine is:

RYYVLZVUVIJVGH-UHFFFAOYSA-N

Source of SMILES and InChI: chemspider

Sub Structure Search

A Sub Structure Search is for finding any structure that contains the query structure. As images say more than 1000 words below again caffeine (left) and an example of one of many sub-structures that match caffeine (right):

Chemical Strucutre of CaffeineImidazole

A substructure search for the chemical structure on the right, called Imidazole, will find caffeine. But it would also find a lot of other structures. A random example being:

4-Amino-1H-imidazole-5-carboxylic acid

I hope above examples made it clear what is meant by a substructure search (Note: This assumes above structures are in the database you are searching in).

The question you need to ask yourself is: How can this search be done? To point out the obvious: This can’t be done by comparing Strings. The solution comes from a specific math field called Graph Theory. A graph has nodes (the atoms) and edges (bonds). Here an example taken from Wikipedia:

graph example

Nodes and edges can have attributes, eg. atom type and bond type (single, double…). The specific problem from graph theory we are interested in is called subgraph isomorphism. If you are interested in the details, please read the linked articles.

The problem with subgraph isomorphism is that it is so called NP-complete.

Although any given solution to such a problem can be verified quickly, there is no known efficient way to locate a solution in the first place; indeed, the most notable characteristic of NP-complete problems is that no fast solution to them is known.

Or said in other words: subgraph isomorphism (substructure search) can be computationally very expensive. I will address the solution to this issue in-depth in the next part. Simplified you can do pre-filter structures that will certainly not match and then only need to to the subgraph isomorphism on a subset of the database.

Similarity Search

This will be addressed in the next part. The solution uses the same principle as above mentioned pre-filter.

Written by kienerj

April 5, 2013 at 13:25