Solutions to IT problems

Solutions I found when learning new IT stuff

Posts Tagged ‘JPA’

Creating a Framework for Chemical Structure Search – Part 5

Series Overview

This is Part 5 – Entity Model of the “Creating a Framework for Chemical Structure Search” series.

Introduction

In this part I will introduce you to the chosen design for the model (entity classes) and I will explain the reasoning behind it. The model is fairly simple but it still took me rather long to finalize it. The issue is that I needed to consider what different applications using my framework might require and at the same time keep it as simple as possible.

Entity Model

I’m just going to show you a simple UML class diagram created with yuml.me – An Online UML Diagram Generator and then introduce each entity.

Class Diagram of Model

BaseEntity

This is a base class that holds metadata like creation date. This is a @MappedSuperclass which the other model classes extend.

Source Code for BaseEntity

UPDATE: Due to a new feature, BaseEntity now extends MetaDataEntity and declares an additional abstract method, public Long getId();. All entities except ChemicalCompoundComposition extend BaseEntity. ChemicalCompoundComposition extends MetaDataEntity instead because it has no id property: sadly, adding a generated id to an @Embeddable is non-trivial, if it is possible at all, with JPA and Hibernate.

ChemicalStructure

Entity for holding the chemical structure data (SMILES or molfile) and the structure key (InChIKey). A ChemicalStructure is unique and immutable and is managed by the framework. Users operate on ChemicalCompounds, not on ChemicalStructures directly. Unique means that when a new ChemicalCompound is saved, the framework checks whether the ChemicalStructures in it already exist and, if so, re-uses them. Immutable means that when a ChemicalCompound is updated and one of its ChemicalStructures has changed, the framework checks whether the updated ChemicalStructure already exists and either uses the existing one or creates a new ChemicalStructure. The old one remains unchanged!
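The find-or-create behaviour can be sketched in plain Java. This is only an illustration and not the framework's actual code: StructureSketch and StructureStoreSketch are hypothetical names, and an in-memory map stands in for the database table keyed by the structure key.

```java
import java.util.HashMap;
import java.util.Map;

/** Immutable stand-in for the ChemicalStructure entity (hypothetical, no JPA). */
final class StructureSketch {
    final String structureKey;   // e.g. an InChIKey
    final String structureData;  // SMILES or molfile

    StructureSketch(String structureKey, String structureData) {
        this.structureKey = structureKey;
        this.structureData = structureData;
    }
}

/** In-memory stand-in for the structure table, keyed by structure key. */
final class StructureStoreSketch {
    private final Map<String, StructureSketch> byKey = new HashMap<>();

    /** Returns the stored structure for this key and creates one only if none exists. */
    StructureSketch findOrCreate(String structureKey, String structureData) {
        return byKey.computeIfAbsent(structureKey,
                key -> new StructureSketch(key, structureData));
    }

    int size() {
        return byKey.size();
    }
}
```

Saving a second compound with the same structure key hands back the same instance, and an existing structure is never modified, which mirrors the "unique and immutable" rule described above.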

Source Code for ChemicalStructure

ChemicalCompoundComposition

Links together ChemicalStructure and ChemicalCompound and defines the relative occurrence of the ChemicalStructure within the ChemicalCompound.

Source Code for ChemicalCompoundComposition

ChemicalCompound

Abstract model of a ChemicalCompound. A ChemicalCompound consists of ChemicalCompoundCompositions. The class contains some basic fields like compoundName and cas. A ChemicalCompound can also be associated with a Set of Containables. Developers using MoleculeDatabaseFramework must create concrete implementations of this class. An application can have multiple implementations of ChemicalCompound, and each implementation is stored and searched separately (table-per-concrete-class inheritance). Note that, for better usability, the CAS number column is nullable and not unique.

A ChemicalCompound is a “virtual entity” or “descriptive entity”. It is like a specific car model that describes all properties of that car but is not a concrete object that physically exists.

Source Code for ChemicalCompound

Containable

A Containable is like a set of ChemicalCompounds that were produced in the same way. In a Chemical Registration System this would be a Batch and in an Inventory System a Lot. The important part is that ChemicalCompound and Containable are generic classes and must form a pair:


@Entity
@Table(name="registration_compound")
@Data
@EqualsAndHashCode(callSuper=false, of = {"regNumber"})
public class RegistrationCompound extends ChemicalCompound<Batch> {
    // snipped
}

@Entity
@Table(name="batch", uniqueConstraints=
        @UniqueConstraint(columnNames = {"chemical_compound_id", "batch_Number"}))
@Data
@EqualsAndHashCode(callSuper=true, of = {"batchNumber"})
public class Batch extends Containable<RegistrationCompound> {
    // snipped
}

Source Code for Containable

ChemicalCompoundContainer

A ChemicalCompoundContainer holds exactly 1 Containable of any type. An application should only have 1 implementation of this entity. This represents a concrete physically available object containing a ChemicalCompound linked by a Containable. ChemicalCompoundContainer has a barcode field which is unique and not nullable. The barcode hence uniquely identifies a physically available sample of a ChemicalCompound.

Role and User

Role and User are only relevant if you plan on using MoleculeDatabaseFramework with Spring-Security. ChemicalCompound and Containable hold a reference to their Read-Role. This is used to filter ChemicalCompoundContainers in the database based on the current User's privileges. Example:

Your application has 2 ChemicalCompound implementations, DefaultCompound and SecretCompound. The current User has the Role to read DefaultCompounds but not the Role to read (view) SecretCompounds. So if this User searches for ChemicalCompoundContainers, only ChemicalCompoundContainers that contain a DefaultCompound must be returned by the search. To achieve that, the query's WHERE clause is extended automatically with a filter based on the Role. The main advantage of filtering in the database rather than in the application is that you get pageable results, which would not be easily possible (if at all) with application-side filtering. Performance is probably a lot better too.
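Why paging and application-side filtering do not mix can be shown with a small plain-Java sketch. It is an illustration only: RoleFilterPagingSketch, Row and the role names are made up, a list stands in for the table, and a single role string stands in for the User's privileges.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RoleFilterPagingSketch {

    /** Minimal stand-in for a container row: a barcode plus the read role of its compound. */
    static final class Row {
        final String barcode;
        final String readRole;

        Row(String barcode, String readRole) {
            this.barcode = barcode;
            this.readRole = readRole;
        }
    }

    /** Database-style: apply the role filter first, then cut out the requested page. */
    static List<Row> filterThenPage(List<Row> rows, String role, int page, int size) {
        return rows.stream()
                .filter(row -> row.readRole.equals(role))
                .skip((long) page * size)
                .limit(size)
                .collect(Collectors.toList());
    }

    /** Application-style: fetch a page first, then drop rows the user may not read. */
    static List<Row> pageThenFilter(List<Row> rows, String role, int page, int size) {
        return rows.stream()
                .skip((long) page * size)
                .limit(size)
                .filter(row -> row.readRole.equals(role))
                .collect(Collectors.toList());
    }
}
```

With alternating readable and unreadable rows and a page size of 2, filterThenPage returns a full page, while pageThenFilter returns a page with only one row, although more readable rows exist further down the table.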

Source Code for Role
Source Code for User

I will go further into Spring-Security integration in a later article. If you are interested in learning more about it, I can refer you to MoleculeDatabaseFramework's Spring-Security Wiki Page.

Written by kienerj

April 30, 2013 at 07:22

Posted in Chemistry, Java, Programming

Creating a Framework for Chemical Structure Search – Part 4

Series Overview

This is Part 4 – Component Selection of the “Creating a Framework for Chemical Structure Search” series.

Introduction

Finally I will start with the actual creation of the framework. In this part I will introduce the main components (existing 3rd party frameworks and libraries) I use and briefly explain my choices. At this point I think it is fair to mention that my work was basically integrating different existing software components into my desired end-product while taking into account real-world problems and offering a solution for them. There are no new magic algorithms in chemical structure searching, modeling or drug discovery to be found here!

My first try

In my previous effort at creating a framework for chemical structure search, I thought that being platform independent, especially regarding the relational database management system (RDBMS) used, was an important aspect. Therefore I relied on doing the chemical structure search in the application and not in the database. However, it is exactly that part that led to huge performance and efficiency problems. I had to do some stuff that just felt wrong and “hacky” to get usable performance.

Encountered issues with Application-based Substructure Search

Object Creation Performance

The first issue was that for every structure search, all the structures (molfiles) passing the fingerprint screen had to be loaded from the database and converted to an IAtomContainer object from the Chemistry Development Kit (CDK). Creating these objects was very CPU intensive because aromaticity and similar properties had to be detected for every IAtomContainer. I found the solution for this in OrChem, a free cartridge for Oracle based on the CDK. Its creators seemed to have had the exact same issue and came up with a custom format. That format stores everything required, such as aromaticity, in a CDK-specific way, so creating IAtomContainers was no longer an issue.

Substructure Search Performance

The second issue was the mediocre performance of the substructure search itself. The solution was a complex approach using multi-threading and queues. The first thread screened all structures using the pre-generated fingerprints. Fingerprints were stored in the database but loaded into memory on application start. If a structure passed the screen, its database id was put into a queue. A second thread read from that queue, loaded the molfile from the database, generated the IAtomContainer and put it into a second queue. Then there were multiple threads (a configurable number) that took the IAtomContainers from the queue and did the actual test for subgraph isomorphism. Again, if a structure passed this phase too, its database id was put into the output queue and the IAtomContainer was discarded. This last step was required because IAtomContainers are memory hogs and you had to control somehow how many of them were in memory at any time.

CPU load now easily reached 100% for seconds during substructure searches. I then realized that the database alone could easily use 20% or more of that, probably due to loading all the structures from it. So I added the option to hold the custom format from OrChem in memory (not much of an issue in terms of memory consumption) to reduce the load on the database and use those CPU cycles for the substructure search instead. I guess you have long figured out how convoluted this all was. But it actually worked amazingly well! Because the hits were put into a queue, it was easily possible to display the first, say, 5 hits on a web page while the search continued in the background. So you could give the impression of a very fast search!
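To give an idea of the shape of such a pipeline, here is a simplified sketch using java.util.concurrent. It is not the original code: ScreeningPipelineSketch and its parameters are hypothetical, the two predicates stand in for the fingerprint screen and the subgraph isomorphism test, and the separate loader thread is folded into the matcher threads. A bounded queue limits how many screened candidates wait in memory, and one end marker per matcher thread shuts the workers down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.LongPredicate;

final class ScreeningPipelineSketch {

    /** End marker; assumes no real database id has this value. */
    private static final long POISON = Long.MIN_VALUE;

    static List<Long> search(List<Long> candidateIds,
                             LongPredicate passesFingerprintScreen,
                             LongPredicate isSubstructureMatch,
                             int matcherThreads) {
        // Bounded, so the screener blocks when the matchers fall behind.
        BlockingQueue<Long> screened = new LinkedBlockingQueue<>(100);
        BlockingQueue<Long> hits = new LinkedBlockingQueue<>();

        Thread screener = new Thread(() -> {
            try {
                for (long id : candidateIds) {
                    if (passesFingerprintScreen.test(id)) {
                        screened.put(id);
                    }
                }
                // One end marker per matcher thread.
                for (int i = 0; i < matcherThreads; i++) {
                    screened.put(POISON);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> matchers = new ArrayList<>();
        for (int i = 0; i < matcherThreads; i++) {
            Thread matcher = new Thread(() -> {
                try {
                    while (true) {
                        long id = screened.take();
                        if (id == POISON) {
                            return;
                        }
                        if (isSubstructureMatch.test(id)) {
                            hits.put(id);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            matchers.add(matcher);
            matcher.start();
        }
        screener.start();

        try {
            screener.join();
            for (Thread matcher : matchers) {
                matcher.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search was interrupted", e);
        }
        // Hit order depends on thread scheduling.
        return new ArrayList<>(hits);
    }
}
```

This sketch waits for all threads before returning. A caller that wants to show the first few hits early would instead read from the hits queue while the search is still running.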

Why start from scratch again?

So why change it? Tons of reasons. All of this was done with plain JDBC and various kinds of data transfer objects. Tight coupling and maintainability were serious issues. On the application side it was impossible to sort the results because hits are returned in a somewhat random order, and hence real paging was not possible either. The second problem was: how could you search for a substructure and a numeric property at the same time? The solution was that one of the substructure search methods had a Set argument containing the database ids of the structures the search should be performed over. Hence you would run an SQL query for the numeric property first and feed the ids into the substructure search. That worked but, again, it was not very straightforward. Adding such custom properties to the database and using them was rather messy too, it lacked proper transaction support and so forth. All in all it was nothing to be proud of and certainly not usable in a real production environment. I did however learn a lot about the Java 5 concurrency package.

Component for Substructure Search

I decided that being dependent on a specific RDBMS is a minor issue compared to the problems outlined above. I already knew about the open-source Bingo Cartridge and, luckily, the company behind it was developing a version for PostgreSQL. So my choice of this component was easy: use PostgreSQL with Bingo, both of which are free and open source.

Application-side Chemistry toolkit

The framework needed a chemistry toolkit, especially for input/output, and I again chose the Chemistry Development Kit (CDK).

ORM

While it would be preferable to be independent of the ORM, I wasn’t able to achieve that, although I admit I did not put much effort into it. MoleculeDatabaseFramework uses JPA 2.0 with Hibernate as its JPA provider. The part that is Hibernate-specific is the custom SQL dialect I created for accessing the structure search functions of Bingo in JPQL and hence also in QueryDSL. There is no specific reason I chose Hibernate except that I already knew it and it was able to do what I required. So I did not investigate any other JPA providers.

Application Framework – Dependency-Injection

Well, I guess this is obvious. I chose Spring. I’ve heard and read a lot about Spring. I’ve always wanted to learn it and this was my chance. I also did not want the framework to depend on a full-blown Java EE application server.

Data Access Layer – CRUD and Querying

I initially started the project with plain Spring and JPA (Hibernate). Shortly after, during my “research”, I read about Spring Data JPA and its integration with QueryDSL. I quote from the Spring Data website:

Spring Data JPA aims to significantly improve the implementation of data access layers by reducing the effort to the amount that’s actually needed. As a developer you write your repository interfaces, including custom finder methods, and Spring will provide the implementation automatically.

To illustrate this, here is an example snippet from an application built on my framework:

@Repository
public interface RegistrationCompoundRepository
        extends ChemicalCompoundRepository<RegistrationCompound> {

    List<RegistrationCompound> findByRegNumberStartingWith(String regNumber);

}

RegistrationCompound has a property called regNumber. The interface method above is automatically implemented by Spring Data and returns a List of the RegistrationCompounds whose regNumber starts with the passed-in argument. That’s all you need to write. No SQL and not even a method implementation. Just create the interface and follow the findBy method conventions of Spring Data.

A Spring Data repository can also make use of QueryDSL.

Querydsl is a framework which enables the construction of type-safe SQL-like queries for multiple backends including JPA, JDO and SQL in Java.

Example:

List<Customer> result = query.from(customer)
    .where(customer.lastName.like("A%"), customer.active.eq(true))
    .orderBy(customer.lastName.asc(), customer.firstName.desc())
    .list(customer);

If you use QueryDSL in your Spring Data Repository using QueryDslPredicateExecutor

@Repository
@Transactional(propagation = Propagation.MANDATORY)
public interface ChemicalCompoundRepository<T extends ChemicalCompound>
        extends ChemicalStructureSearchRepository<T>, JpaRepository<T, Long>,
        QueryDslPredicateExecutor<T> {
    //...
}

the repository will have additional methods that take a QueryDSL Predicate as input. A Predicate is basically the WHERE clause of the query, like customer.lastName.like("A%") in the example above. Some methods take additional parameters such as a Pageable, which carries the paging (limit, offset) and sorting information.
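How a page request maps to limit and offset is simple arithmetic. The class below is a hypothetical plain-Java stand-in (PageWindowSketch is not a Spring Data type), assuming zero-based page numbers as Spring Data's PageRequest uses them.

```java
/** Hypothetical stand-in for the paging part of a Pageable: zero-based page index plus page size. */
final class PageWindowSketch {
    final int page;  // zero-based page index
    final int size;  // rows per page

    PageWindowSketch(int page, int size) {
        if (page < 0 || size < 1) {
            throw new IllegalArgumentException("page must be >= 0 and size >= 1");
        }
        this.page = page;
        this.size = size;
    }

    /** Number of rows to skip, i.e. the OFFSET of the query. */
    long offset() {
        return (long) page * size;
    }

    /** Maximum number of rows to return, i.e. the LIMIT of the query. */
    int limit() {
        return size;
    }

    /** Number of pages needed to show totalRows rows, rounding up. */
    static int totalPages(long totalRows, int size) {
        return (int) ((totalRows + size - 1) / size);
    }
}
```

For example, page 2 with a size of 25 skips the first 50 rows, and 101 matching rows need 5 pages.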

This all means it is trivial to extend the repository my framework provides and add your own custom search methods to it. Using predicates, you can create complex queries that search by chemical substructure and return the result sorted and paged, all with a one-line method declaration.

public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType,
            Pageable pageable, Predicate predicate);

So I hope this got you interested!

Spring 3 with JPA 2.0 (Hibernate) for beginners

Introduction

This is a follow-up post to Spring 3 and Hibernate 4 for beginners.

After some consideration I decided to change my application to use JPA instead of “native Hibernate”. The configuration here assumes a standalone application, meaning one that runs outside of an application container like Tomcat. However, changing to JNDI and JTA is “only” a matter of configuration.

Changes

Entity Classes

If you used JPA annotations for your entity classes you don’t have to change anything. If you want to be fully independent of the JPA Implementation you will need to remove all Hibernate specific annotations which can be problematic in certain cases.

Abstract DAO

This remains almost the same. You need to replace the sessionFactory with an entityManager and that’s it.

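import java.io.Serializable;
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.stereotype.Repository;

import com.google.common.base.Preconditions;

@Repository
public abstract class AbstractJpaDAO<T extends Serializable> {

    private Class<T> clazz;

    @PersistenceContext
    private EntityManager entityManager;

    public AbstractJpaDAO(final Class<T> clazzToSet) {
        this.clazz = clazzToSet;
    }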

    public T getById(final Long id) {
        Preconditions.checkArgument(id != null);
        return getEntityManager().find(clazz, id);
    }

    public List<T> getAll() {
        // Typed query, so the result list needs no unchecked conversion
        return getEntityManager().createQuery("from " + clazz.getName(), clazz)
                .getResultList();
    }

    public void create(final T entity) {
        Preconditions.checkNotNull(entity);
        getEntityManager().persist(entity);
    }

    public T update(final T entity) {
        Preconditions.checkNotNull(entity);
        return getEntityManager().merge(entity);
    }

    public void delete(final T entity) {
        Preconditions.checkNotNull(entity);
        getEntityManager().remove(entity);
    }

    public void deleteById(final Long entityId) {
        final T entity = getById(entityId);
        Preconditions.checkState(entity != null);
        delete(entity);
    }

    /**
     * @return the entityManager
     */
    public EntityManager getEntityManager() {
        return entityManager;
    }

    /**
     * @param entityManager the entityManager to set
     */
    public void setEntityManager(EntityManager entityManager) {
        this.entityManager = entityManager;
    }
}

Configuration

First, you do not need to create a persistence unit! You can, but since Spring 3.1 this is no longer required. If you are using NetBeans and see the warning “The Project does not contain a persistence unit”, you can safely ignore it.

Below is the new Spring application context file. The important changes are replacing the session factory with an entity manager factory and changing the transaction manager to the JPA transaction manager. The datasource and connection pool remain exactly the same.

I’ve also added Service classes to the configuration which I created after writing the previous post.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:aop="http://www.springframework.org/schema/aop"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:tx="http://www.springframework.org/schema/tx"

       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
          http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-3.1.xsd
          http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.1.xsd
          http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-3.1.xsd">      

    <context:component-scan base-package="org.bitbucket.kienerj.moleculedatabaseframework" />
    <context:annotation-config />    
    
    <bean id="entityManagerFactory" autowire="autodetect"
          class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <!-- This scans the packages for entity classes, hence no persistence unit is needed -->
        <property name="packagesToScan" value="org.bitbucket.kienerj.moleculedatabaseframework.entity" />
        <property name="jpaVendorAdapter">
            <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter">
                <property name="showSql" value="false" />
                <!-- if this is true it can override hibernate.hbm2ddl.auto settings -->
                <property name="generateDdl" value="false" />
                <property name="databasePlatform" value="org.hibernate.dialect.PostgreSQLDialect" />
            </bean>
        </property>
        <!-- put any ORM specific stuff here -->
        <property name="jpaProperties">
            <props>
                <!-- for test config only --> 
                <prop key="hibernate.hbm2ddl.auto">create-drop</prop>
            </props>
        </property>
    </bean>
       
    
    <!-- Spring bean configuration. Tell Spring to bounce off BoneCP -->
    <bean id="dataSource"
          class="org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy">
        <property name="targetDataSource">
            <ref local="mainDataSource" />
        </property>
    </bean>
            
    <!-- BoneCP configuration -->
    <bean id="mainDataSource" class="com.jolbox.bonecp.BoneCPDataSource" destroy-method="close">
        <property name="driverClass" value="org.postgresql.Driver" />
        <property name="jdbcUrl" value="jdbc:postgresql:MolDB" />
        <property name="username" value="postgres"/>
        <property name="password" value="123456"/>
        <property name="idleConnectionTestPeriod" value="60"/>
        <property name="idleMaxAge" value="240"/>      
        <property name="maxConnectionsPerPartition" value="60"/>
        <property name="minConnectionsPerPartition" value="20"/>
        <property name="partitionCount" value="3"/>
        <property name="acquireIncrement" value="10"/>                              
        <property name="statementsCacheSize" value="50"/>
        <property name="releaseHelperThreads" value="3"/>
    </bean>
    
    <bean id="txManager" class="org.springframework.orm.jpa.JpaTransactionManager">
        <property name="entityManagerFactory" ref="entityManagerFactory" />
    </bean>
    <tx:annotation-driven transaction-manager="txManager" />   
    
    
    <bean id="chemicalCompoundService" 
          class="org.bitbucket.kienerj.moleculedatabaseframework.service.ChemicalCompoundServiceImpl"/>
    
    <bean id="chemicalStructureService" autowire="autodetect"
          class="org.bitbucket.kienerj.moleculedatabaseframework.service.ChemicalStructureServiceImpl"/>
    
    <bean id="abstractJpaDAO" abstract="true"
          class="org.bitbucket.kienerj.moleculedatabaseframework.dao.AbstractJpaDAO"/>
    
    <bean id="chemicalStructureDAO" parent="abstractJpaDAO" autowire="autodetect"
          class="org.bitbucket.kienerj.moleculedatabaseframework.dao.ChemicalStructureDAO"/>
    <bean id="chemicalCompoundDAO" parent="abstractJpaDAO" autowire="autodetect"
          class="org.bitbucket.kienerj.moleculedatabaseframework.dao.ChemicalCompoundDAO"/>
</beans>

Querying

Since the last post I have added some methods to the DAOs and created a service layer, so I had to change my queries too. The most notable change is that queries returning a single result must call getSingleResult() instead of uniqueResult(). More importantly, getSingleResult() throws a NoResultException if no result was found. In my case that is an expected and common scenario, so I simulate the behaviour of uniqueResult() by catching the exception and returning null.

public ChemicalStructure getByStructureKey(String structureKey) {

	try {
		ChemicalStructure structure = getEntityManager()
				.createQuery(
					"FROM ChemicalStructure structure "
				  + "WHERE structure.structureKey = :structureKey", ChemicalStructure.class)
				.setParameter("structureKey", structureKey)
				.getSingleResult();
		return structure;
	} catch (NoResultException nre) {
		return null;
	}        
}
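An alternative that avoids using an exception for an expected case is to fetch a list and inspect its size. The helper below is a plain-Java sketch under that assumption: SingleResultSketch and singleOrNull are made-up names, and with JPA you would pass in the list returned by getResultList() on the same query.

```java
import java.util.List;

final class SingleResultSketch {

    /**
     * Returns the only element of the list, or null if the list is empty.
     * Throws if there is more than one element, since that means the key is not unique.
     */
    static <T> T singleOrNull(List<T> results) {
        if (results.isEmpty()) {
            return null;
        }
        if (results.size() > 1) {
            throw new IllegalStateException(
                    "expected at most one result but got " + results.size());
        }
        return results.get(0);
    }
}
```

Whether this is nicer than the try/catch version above is mostly a matter of taste. Both return null when nothing matches.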

Keep in mind that you need to change both the application config and the test config! I wasted a lot of time because I changed the application config and then ran tests, which failed… Beginner’s mistake. 🙂

Hope this was helpful.

Written by kienerj

October 16, 2012 at 13:53

Posted in Database, Java, Programming
