Solutions to IT problems

Solutions I found when learning new IT stuff

Posts Tagged ‘Spring-Data’

Creating a Framework for Chemical Structure Search – Part 6


Series Overview

This is Part 6 – Data Access Layer of the “Creating a Framework for Chemical Structure Search” series.


Introduction

In the previous article I introduced the entity model of MoleculeDatabaseFramework. This article will explain the Data Access Layer which uses Spring-Data-JPA with Hibernate and how the Chemical Structure Search methods of the Bingo PostgreSQL Cartridge are exposed to Hibernate and QueryDSL.

How Spring-Data JPA works

Basic functionality

I quote from the Spring Data website:

Spring Data JPA aims to significantly improve the implementation of data access layers by reducing the effort to the amount that’s actually needed. As a developer you write your repository interfaces, including custom finder methods, and Spring will provide the implementation automatically.

You create a new interface that extends from generic interfaces provided by Spring-Data and represents the repository for an entity. There are different kinds of repository interfaces but the repositories in MoleculeDatabaseFramework all extend JpaRepository. JpaRepository provides CRUD-methods and some retrieval methods for your entity.

Repositories in MoleculeDatabaseFramework also extend QueryDslPredicateExecutor. This adds findOne(predicate) and findAll(predicate) methods. Predicates are basically type-safe WHERE-Clauses.
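To make the idea of a predicate as a composable WHERE clause more concrete, here is a minimal plain-Java sketch. It uses java.util.function.Predicate as a stand-in for a QueryDSL predicate and filters an in-memory list, whereas a real QueryDSL predicate is translated into a query and executed by the database. The Compound class, its fields and the findAll helper are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class PredicateSketch {

    /** Hypothetical entity; not part of MoleculeDatabaseFramework. */
    static final class Compound {
        final String cas;
        final double molWeight;

        Compound(String cas, double molWeight) {
            this.cas = cas;
            this.molWeight = molWeight;
        }
    }

    /** In-memory stand-in for findAll(predicate). */
    static List<Compound> findAll(List<Compound> table, Predicate<Compound> where) {
        List<Compound> hits = new ArrayList<>();
        for (Compound c : table) {
            if (where.test(c)) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Compound> table = Arrays.asList(
                new Compound("50-00-0", 30.03),   // formaldehyde
                new Compound("64-17-5", 46.07),   // ethanol
                new Compound("71-43-2", 78.11));  // benzene

        Predicate<Compound> heavierThan40 = c -> c.molWeight > 40.0;
        Predicate<Compound> casStartsWith6 = c -> c.cas.startsWith("6");

        // and() combines two conditions like AND in a WHERE clause
        for (Compound c : findAll(table, heavierThan40.and(casStartsWith6))) {
            System.out.println(c.cas);
        }
    }
}
```

The point of the sketch is only that conditions are ordinary objects that can be built up step by step and handed to a single find method.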

Custom query methods

Besides the provided methods, you can add your own search methods by following the findBy-method conventions of Spring Data JPA or by annotating a method with @Query, where the value of the annotation is either a JPQL query or native SQL.
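The findBy conventions boil down to reading the method name as a small query language. The following is a rough sketch of that idea in plain Java, not Spring Data's actual parser: it only understands plain equality and the StartingWith keyword, and the class and method names (DerivedQuerySketch, toJpql) are made up for this example.

```java
final class DerivedQuerySketch {

    /** Turns e.g. "findByRegNumberStartingWith" into a JPQL-like query string. */
    static String toJpql(String methodName, String entityName) {
        String prefix = "findBy";
        if (!methodName.startsWith(prefix) || methodName.length() == prefix.length()) {
            throw new IllegalArgumentException("Not a findBy method: " + methodName);
        }
        String rest = methodName.substring(prefix.length());

        String operator = "= ?1"; // no keyword means equality
        String keyword = "StartingWith";
        if (rest.endsWith(keyword) && rest.length() > keyword.length()) {
            rest = rest.substring(0, rest.length() - keyword.length());
            operator = "like ?1"; // the argument would be bound with '%' appended
        }

        // "RegNumber" -> "regNumber"
        String property = Character.toLowerCase(rest.charAt(0)) + rest.substring(1);
        return "select e from " + entityName + " e where e." + property + " " + operator;
    }

    public static void main(String[] args) {
        System.out.println(toJpql("findByCas", "ChemicalCompound"));
        System.out.println(toJpql("findByRegNumberStartingWith", "RegistrationCompound"));
    }
}
```

The real implementation supports many more keywords and nested property paths; see the Spring Data JPA reference for the full list.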

Custom Queries providing your own method implementation

If you have a very complex query that can’t be created automatically by Spring Data, you can implement it yourself.

1. Create Custom Query Interface

To achieve this you need to first create an interface containing the desired query method(s) and annotate it with @NoRepositoryBean:

@NoRepositoryBean
public interface ChemicalStructureSearchRepository<T> {

    Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType,
            Pageable pageable, Predicate predicate,
            String searchOptions,
            PathBuilder<T> pathBuilder);


    Page<T> findBySimilarStructure(String structureData,
            SimilarityType similarityType,
            Double lowerBound, Double upperBound,
            Pageable pageable, Predicate predicate,
            PathBuilder<T> pathBuilder);
}

This is the Source Code of ChemicalStructureSearchRepository minus JavaDoc comments.

2. Create a repository extending Custom Query interface

As an example, below is the source code of ChemicalCompoundRepository, which extends ChemicalStructureSearchRepository:

@Repository
@Transactional(propagation = Propagation.MANDATORY)
public interface ChemicalCompoundRepository<T extends ChemicalCompound>
        extends ChemicalStructureSearchRepository<T>, JpaRepository<T, Long>,
        QueryDslPredicateExecutor<T> {
    
    List<T> findByCompositionsPkChemicalStructureId(Long structureId);
    
    T findByCas(String cas);

    @Query("select c from Containable c where c.chemicalCompound = ?1")
    List<Containable> getContainablesByCompound(ChemicalCompound compound);
}

3. Create an implementation of your repository

The convention is that the implementation is named after the repository with “Impl” appended, in this case ChemicalCompoundRepositoryImpl. This class only has to implement your custom methods, in this case the ones defined in ChemicalStructureSearchRepository.

public class ChemicalCompoundRepositoryImpl<T extends ChemicalCompound>
        implements ChemicalStructureSearchRepository<T> {

	//...fields and constructors snipped...

    @Cacheable(STRUCTURE_QUERY_CACHE)
    @Override
    public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType, Pageable pageable,
            Predicate predicate, String searchOptions,
            PathBuilder<T> compoundPathBuilder) {
			
			//...implementation snipped...
    }


    @Cacheable(STRUCTURE_QUERY_CACHE)
    @Override
    public Page<T> findBySimilarStructure(String structureData,
            SimilarityType similarityType, Double lowerBound, Double upperBound,
            Pageable pageable, Predicate predicate,
            PathBuilder<T> compoundPathBuilder) {
			
			//...implementation snipped...
    }
}

Below is a UML class diagram that shows the relationships of ChemicalCompoundRepository:

[Figure: ChemicalCompoundRepository UML class diagram]

Spring Data automatically detects the repository implementation and combines the provided methods and your custom search methods into one object, which you use through ChemicalCompoundRepository:


Page<T> page = getRepository().findByChemicalStructure(structureData, searchType,
                pageable, predicate, searchOptions, pathBuilder);
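One way to picture how provided and custom methods end up behind a single repository object is a JDK dynamic proxy that routes each call to the object that actually implements it. The sketch below illustrates only that idea and makes no claim about how Spring Data is implemented internally. All type names in it (CrudOps, StructureSearchOps, CompoundRepository and so on) are hypothetical stand-ins.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

final class RepositoryCompositionSketch {

    /** Stand-in for the methods that come for free with the framework. */
    interface CrudOps {
        String findOne(long id);
    }

    /** Stand-in for a custom query interface. */
    interface StructureSearchOps {
        String findByChemicalStructure(String structureData);
    }

    /** The single interface the application code works with. */
    interface CompoundRepository extends CrudOps, StructureSearchOps {
    }

    static final class GenericCrud implements CrudOps {
        public String findOne(long id) {
            return "generic:" + id;
        }
    }

    /** Plays the role of the "Impl" class: it implements only the custom methods. */
    static final class CompoundRepositoryImpl implements StructureSearchOps {
        public String findByChemicalStructure(String structureData) {
            return "custom:" + structureData;
        }
    }

    static CompoundRepository create() {
        final Object generic = new GenericCrud();
        final Object custom = new CompoundRepositoryImpl();
        InvocationHandler handler = (proxy, method, args) -> {
            // Route the call to the custom implementation if it declares the method,
            // otherwise fall back to the generic one.
            Object target = method.getDeclaringClass().isInstance(custom) ? custom : generic;
            return method.invoke(target, args);
        };
        return (CompoundRepository) Proxy.newProxyInstance(
                CompoundRepository.class.getClassLoader(),
                new Class<?>[] {CompoundRepository.class},
                handler);
    }

    public static void main(String[] args) {
        CompoundRepository repository = create();
        System.out.println(repository.findOne(42L));
        System.out.println(repository.findByChemicalStructure("c1ccccc1"));
    }
}
```

The caller only ever sees CompoundRepository and does not need to know which object serves which method.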

Using the Repositories

MoleculeDatabaseFramework provides generic repositories for all entities in the entity model.

Source Code for all Repositories

To make use of a chemical structure search enabled repository you need to extend it using your specific entity implementation and optionally add your custom find methods:

@Repository
public interface RegistrationCompoundRepository extends ChemicalCompoundRepository<RegistrationCompound> {

    List<RegistrationCompound> findByRegNumberStartingWith(String regNumber);

}

That’s it!

You can find further information on how to implement entities and repositories in the MoleculeDatabaseFramework Tutorial as this article is meant to show the inner workings of the framework and not how to use it.

Exposing Bingo PostgreSQL Cartridge Methods

This is done with a custom dialect that extends Hibernate’s PostgreSQL82Dialect:

public class BingoPostgreSQLDialect extends PostgreSQL82Dialect {

    public BingoPostgreSQLDialect() {
         registerFunction("issubstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.sub"));
         registerFunction("isexactstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.exact"));
         registerFunction("matchessmarts", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.smarts"));
         registerFunction("matchesformula", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3)::bingo.gross"));
         registerFunction("issimilarstructure", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1  @ (?2, ?3, ?4, ?5)::bingo.sim"));
         registerFunction("hasmassbetween", new SQLFunctionTemplate(
                 StandardBasicTypes.BOOLEAN, "?1 > ?2::bingo.mass AND ?1 < ?3::bingo.mass"));         
    }
}
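Each registered template describes how a function call in JPQL is rewritten into Bingo’s SQL syntax: the placeholders ?1, ?2 and so on are replaced by the rendered arguments. The following sketch mimics that substitution with plain string replacement so the resulting SQL fragment becomes visible. It is not Hibernate’s SQLFunctionTemplate, and the column name s.structure_data is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class FunctionTemplateSketch {

    /** Replaces ?1, ?2, ... in the template with the given rendered arguments. */
    static String render(String template, List<String> arguments) {
        String sql = template;
        // Substitute the highest index first so that "?1" never clobbers "?10".
        for (int i = arguments.size(); i >= 1; i--) {
            sql = sql.replace("?" + i, arguments.get(i - 1));
        }
        return sql;
    }

    public static void main(String[] args) {
        String template = "?1  @ (?2, ?3)::bingo.sub";
        // Hypothetical column name, query structure and empty search options
        List<String> arguments = Arrays.asList("s.structure_data", "'c1ccccc1'", "''");
        System.out.println(render(template, arguments));
    }
}
```

In the real dialect the second and third arguments would normally be bind parameters rather than literals.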

As a usage example, here is a source code snippet from ChemicalCompoundRepositoryImpl:

public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType, Pageable pageable,
            Predicate predicate, String searchOptions,
            PathBuilder<T> compoundPathBuilder) {
			
	//...snipped...
			
	BooleanExpression matchesStructureQuery; // this is a Predicate!

	switch (searchType) {
		case EXACT:
			matchesStructureQuery = BooleanTemplate.create(
					"isExactStructure({0},{1},{2}) = true",
					structure.structureData,
					ConstantImpl.create(structureData),
					ConstantImpl.create(searchOptions));
			break;
		case SUBSTRUCTURE:
			matchesStructureQuery = BooleanTemplate.create(
					"isSubstructure({0},{1},{2}) = true",
					structure.structureData,
					ConstantImpl.create(structureData),
					ConstantImpl.create(searchOptions));
			break;
		//...snipped other cases
	}

	baseQuery = baseQuery.from(compoundPathBuilder)
			.innerJoin(compound.compositions, composition)
			.innerJoin(composition.pk.chemicalStructure, structure)
			.where(matchesStructureQuery.and(predicate));
	//...snipped...
}

Full Source Code for ChemicalCompoundRepositoryImpl

The next Part will focus on the Service Layer. The Service Layer controls transactions and security.


Written by kienerj

May 2, 2013 at 07:51

Creating a Framework for Chemical Structure Search – Part 4


Series Overview

This is Part 4 – Component Selection of the “Creating a Framework for Chemical Structure Search” series.


Introduction

Finally I will start with the actual creation of the framework. In this part I will introduce the main components (existing 3rd party frameworks and libraries) I use and briefly explain my choices. At this point I think it is fair to mention that my work was basically integrating different existing software components into my desired end-product while taking into account real-world problems and offering a solution for them. There are no new magic algorithms in chemical structure searching, modeling or drug discovery to be found here!

My first try

In my previous effort at creating a framework for chemical structure search, I thought that being platform independent, especially regarding the relational database management system (RDBMS), was an important aspect. Therefore I did the chemical structure search in the application and not in the database. However, it was exactly that part that led to huge performance and efficiency problems. I had to do some things that just felt wrong and “hacky” to get usable performance.

Encountered issues with Application-based Substructure Search

Object Creation Performance

The first issue was that for every structure search, all the structures (molfiles) passing the fingerprint screen had to be loaded from the database and converted to an IAtomContainer object from the Chemistry Development Kit (CDK). Creating these objects was very CPU intensive, because aromaticity and similar properties had to be detected for every IAtomContainer. I found the solution in OrChem, a free cartridge for Oracle based on the CDK. Its creators seem to have had the exact same issue and came up with a custom format. That format stores everything required, such as aromaticity, in a CDK-specific way, so creating IAtomContainers was no longer an issue.

Substructure Search Performance

The second issue was the mediocre performance of the substructure search itself. The solution was a complex approach using multi-threading and queues. The first thread screened all structures using the pre-generated fingerprints. Fingerprints were stored in the database but loaded into memory on application start. If a structure passed the screen, its database id was put into a queue. A second thread read from that queue, loaded the molfile from the database, generated the IAtomContainer and put it into a second queue. Then multiple threads (a configurable number) took the IAtomContainers from that queue and did the actual test for subgraph isomorphism. If a structure passed this phase too, its database id was put into the output queue and the IAtomContainer was discarded. This last step was required because IAtomContainers are memory hogs and I had to control how many of them were in memory at any time.

CPU load now easily reached 100% for seconds during substructure searches. I then realized that the database alone could easily use 20% or more of that, probably due to loading all the structures from it. So I added the option to hold the custom format from OrChem in memory (not a big issue in terms of memory consumption) to reduce the load on the database and use those CPU cycles for the substructure search instead. I guess you have long figured out how convoluted this all was. But it actually worked amazingly well! Because the hits were put into a queue, it was easy to display the first, say, 5 hits on a web page while the search continued in the background. So you could give the impression of a very fast search!
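The staged design described above can be sketched with standard java.util.concurrent queues. This is a simplified reconstruction for illustration, not the original code. Structures are plain strings, the fingerprint screen is replaced by a cheap character check, String.contains stands in for the subgraph isomorphism test, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class SearchPipelineSketch {

    /** A loaded structure; stands in for the expensive IAtomContainer. */
    static final class Candidate {
        final long id;
        final String data;
        Candidate(long id, String data) { this.id = id; this.data = data; }
    }

    // End-of-stream markers ("poison pills") for the two queues
    private static final Long END_OF_IDS = -1L;
    private static final Candidate END_OF_CANDIDATES = new Candidate(-1L, "");

    /** Cheap necessary condition, in the spirit of a fingerprint screen. */
    static boolean passesScreen(String data, String query) {
        for (char c : query.toCharArray()) {
            if (data.indexOf(c) < 0) return false;
        }
        return true;
    }

    /** Returns the sorted ids of all structures that contain the query string. */
    static List<Long> search(Map<Long, String> database, String query, int matcherThreads) {
        BlockingQueue<Long> screened = new LinkedBlockingQueue<>();
        // Bounded queue: limits how many loaded candidates wait in memory
        BlockingQueue<Candidate> loaded = new LinkedBlockingQueue<>(4);
        List<Long> hits = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();

        // Stage 1: screen all structures
        threads.add(new Thread(() -> {
            for (Map.Entry<Long, String> e : database.entrySet()) {
                if (passesScreen(e.getValue(), query)) screened.add(e.getKey());
            }
            screened.add(END_OF_IDS);
        }));

        // Stage 2: load the candidates that passed the screen
        threads.add(new Thread(() -> {
            try {
                for (Long id = screened.take(); !id.equals(END_OF_IDS); id = screened.take()) {
                    loaded.put(new Candidate(id, database.get(id)));
                }
                for (int i = 0; i < matcherThreads; i++) loaded.put(END_OF_CANDIDATES);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }));

        // Stage 3: several threads run the expensive match and keep only the id
        for (int i = 0; i < matcherThreads; i++) {
            threads.add(new Thread(() -> {
                try {
                    for (Candidate c = loaded.take(); c != END_OF_CANDIDATES; c = loaded.take()) {
                        if (c.data.contains(query)) hits.add(c.id);
                    }
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", ex);
        }
        List<Long> sorted = new ArrayList<>(hits);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        Map<Long, String> database = new LinkedHashMap<>();
        database.put(1L, "c1ccccc1");
        database.put(2L, "CCO");
        database.put(3L, "c1ccccc1O");
        System.out.println(search(database, "c1ccccc1", 2));
    }
}
```

The sketch waits for all threads and sorts the ids only to get a reproducible result. In the design described above, hits would instead be consumed from an output queue while the search is still running, which is exactly why they arrive in no particular order.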

Why start from scratch again?

So why change it? Tons of reasons. All of this was done with plain JDBC and various kinds of data transfer objects. Tight coupling and poor maintainability were serious issues. On the application side it was impossible to sort the results, because hits were returned in a somewhat random order, and hence real paging was not possible either. The second problem was how to search for a substructure and a numeric property at the same time. The solution was that one of the substructure search methods had a Set argument, which contained the database ids of the structures the search should be performed over. So you ran an SQL query for the numeric property first and fed the ids into the substructure search. That worked but, again, was not very straightforward. Adding such custom properties to the database and using them was rather messy too, it lacked proper transaction support, and so forth. All in all it was nothing to be proud of and certainly not usable in a real production environment. I did however learn a lot about the Java 5 concurrency package.

Component for Substructure Search

I decided that being dependent on a specific RDBMS is a minor issue compared to the problems outlined above. I already knew about the open-source Bingo Cartridge, and luckily for me the company behind it was developing a version for PostgreSQL. So the choice of this component was easy: use PostgreSQL with Bingo, both of which are free and open-source.

Application-side Chemistry toolkit

Especially for input/output, the framework required a chemistry toolkit, and I again chose the Chemistry Development Kit (CDK).

ORM

While it would be preferable to be independent of the ORM, I wasn’t able to achieve that, though I admit I did not put much effort into it. MoleculeDatabaseFramework uses JPA 2.0 with Hibernate as its JPA provider. The part that is Hibernate-specific is the custom SQL dialect I created for accessing the structure search functions of Bingo in JPQL and hence also in QueryDSL. There is no specific reason I chose Hibernate except that I already knew it and it was able to do what I required. So I did not investigate any other JPA providers.

Application Framework – Dependency-Injection

Well, I guess this is obvious. I chose Spring. I’ve heard and read a lot about Spring. I’ve always wanted to learn it and this was my chance. I also did not want the framework to depend on a full-blown Java EE application server.

Data Access Layer – CRUD and Querying

I initially started the project with plain Spring and JPA (Hibernate). But shortly afterwards, during my “research”, I read about Spring Data JPA and its integration with QueryDSL. I quote from the Spring Data website:

Spring Data JPA aims to significantly improve the implementation of data access layers by reducing the effort to the amount that’s actually needed. As a developer you write your repository interfaces, including custom finder methods, and Spring will provide the implementation automatically.

To illustrate this, here is a snippet showing an example repository built on my framework:

@Repository
public interface RegistrationCompoundRepository extends ChemicalCompoundRepository<RegistrationCompound> {

    List<RegistrationCompound> findByRegNumberStartingWith(String regNumber);

}

RegistrationCompound has a property called regNumber. The interface method above is automatically implemented by Spring Data and will return a List of the RegistrationCompounds whose regNumber starts with the passed-in argument. That’s all you need to write. No SQL and not even a method implementation. Just create the interface and then follow the findBy method conventions of Spring Data.

A Spring Data repository can also make use of QueryDSL.

Querydsl is a framework which enables the construction of type-safe SQL-like queries for multiple backends including JPA, JDO and SQL in Java.

Example:

List<Customer> result = query.from(customer)
    .where(customer.lastName.like("A%"), customer.active.eq(true))
    .orderBy(customer.lastName.asc(), customer.firstName.desc())
    .list(customer);

If you use QueryDSL in your Spring Data Repository using QueryDslPredicateExecutor

@Repository
@Transactional(propagation = Propagation.MANDATORY)
public interface ChemicalCompoundRepository<T extends ChemicalCompound>
        extends ChemicalStructureSearchRepository<T>, JpaRepository<T, Long>,
        QueryDslPredicateExecutor<T> {
    //...
}

the repository will have additional methods that take a QueryDSL Predicate as an input. A Predicate is basically the WHERE clause of the query, like customer.lastName.like("A%") in the example above. Some methods take additional parameters such as a Pageable. This is used for paging: the Pageable carries the paging (limit, offset) and sorting information.
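What a Pageable boils down to can be shown on an in-memory list: a zero-based page index and a page size translate into an offset and a limit that are applied after sorting. The sketch below is plain Java and does not use Spring’s Pageable or Page types. The PagingSketch class and its page method are hypothetical helpers for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class PagingSketch {

    /** Returns the zero-based page {@code page} with at most {@code size} rows, after sorting. */
    static <T> List<T> page(List<T> rows, Comparator<? super T> sort, int page, int size) {
        List<T> sorted = new ArrayList<>(rows);
        sorted.sort(sort);                                   // ORDER BY
        int offset = Math.min(page * size, sorted.size());   // OFFSET
        int end = Math.min(offset + size, sorted.size());    // OFFSET + LIMIT
        return new ArrayList<>(sorted.subList(offset, end));
    }

    public static void main(String[] args) {
        List<String> regNumbers = Arrays.asList("R-004", "R-001", "R-003", "R-002", "R-005");
        Comparator<String> ascending = Comparator.naturalOrder();
        System.out.println(page(regNumbers, ascending, 0, 2));
        System.out.println(page(regNumbers, ascending, 2, 2));
    }
}
```

With a database behind it, the same three pieces of information end up in the generated query instead of being applied in memory, which is what makes sorted and paged structure searches possible.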

All this means it is trivial to extend the repository my framework provides and add your own custom search methods to it. By using predicates you can create complex queries that search by chemical substructure and return the result sorted and paged, all with a one-line method declaration.

public Page<T> findByChemicalStructure(String structureData,
            StructureSearchType searchType,
            Pageable pageable, Predicate predicate);

So I hope this got you interested!

Spring Data JPA: My first steps


Introduction

In my previous article I described my first experience with Spring using Hibernate as the JPA provider. After that article I heard about Spring Data and was immediately convinced that it is ideal for my hobby project. In contrast to my previous article, this will not be a tutorial but a short review of my experience with Spring Data. For an introduction to Spring Data please see the Spring Data Reference Guide.

The Good

It works. It is pretty great and I’m sure it can greatly reduce development time for certain types of applications.

CRUD in < 5 minutes

In Spring Data JPA you extend provided repository interfaces. Depending on which interface you extend it comes with different methods, like for CRUD. You don’t have to specify or implement any CRUD methods. You just extend a provided interface and Spring Data JPA will automatically generate implementations at runtime!

Custom Queries

If you need more specific query methods that are not offered by the repository interfaces, you can create your own. There are several ways to do so. You can do it by “implicit path”, meaning you create methods starting with “findBy” and then follow the entity’s properties along whichever path you like. Simple example:

public interface PersonRepository extends Repository<User, Long> {

List<User> findByLastnameStartingWith(String lastname);
}

This assumes the User entity has a property named lastname. The example above is very simple, but you can also create query methods that traverse relationship mappings.

public interface PersonRepository extends Repository<User, Long> {

List<User> findByRolesRoleName(String roleName);
}

This assumes entity User has a OneToMany mapping to Role entity and Role has a property roleName.

Another option is to specify a JPQL query yourself by annotating it with @Query.

public interface UserRepository extends JpaRepository<User, Long> {

  @Query("select u from User u where u.emailAddress = ?1")
  User findByEmailAddress(String emailAddress);
}

@Query also supports native queries!

There are even more possibilities. See the Spring Data JPA Reference for further information.

QueryDSL Integration

Spring Data JPA can be easily integrated with QueryDSL.

Querydsl is a framework which enables the construction of type-safe SQL-like queries for multiple back-ends including JPA, JDO and SQL in Java.

This is also very useful.

The Bad

I’m still a novice at both Spring and JPA (Hibernate). So not everything I mention in this section is directly related to Spring Data; some of it can also happen when using only Spring or Hibernate.

Simplistic Examples

All the examples and tutorials I found are based on a simplistic domain model and target a fixed, well-defined application that is not meant to be extended. I’m creating a “framework” for applications that need chemical structure search, and hence it must be extensible by whoever uses it. What I’m saying is that Spring Data works great for very basic usage scenarios, but if you have anything half-complex in terms of inheritance and relationships between entities, be prepared for issues or for having to adjust your design to suit your JPA provider, Spring and Spring Data. It’s safe to say I spent at least 10 times more time searching for solutions to my issues than actually coding.

StackTrace hell

In case of an exception you are confronted with a huge stack trace. While that might not be a bad thing (better than none at all), the actual exception thrown often does not help you fix the problem at all. This almost drove me crazy.

As an example, I have a OneToMany relationship between 2 abstract entity classes. In my tests I created 2 different implementations to check whether my framework can deal with that in the way I expect. I had entities C1 and C2 on the “One-side” and A and B on the “Many-side”. The relationship only allows C1 to contain As and C2 to contain Bs. In my test, when loading a B entity, I would get an exception that C2 does not contain the property C1.regNumber. Of course that was true, only C1 has that property. The real question was why Hibernate was trying to load that property for a C2 entity. It made no sense to me at all. I went on an insane debugging spree through Spring and Hibernate. I saw that Spring Data had it all right and requested a C2 entity. Somewhere deep in Hibernate it suddenly changed to a C1. After several hours it dawned on me. I had forgotten to add targetEntity = AbstractC to the “Many-side” of the relation; it was only set on the “One-side”. It appears that Hibernate then simply decided that the targetEntity is C1 instead of throwing an exception saying that the targetEntity property is missing.

Version hell

Due to the issue above I tried newer versions of Spring and Hibernate and ran into other issues (“Could not load application context”). The context was identical. It turned out that some configuration which was not correct had worked in the earlier version but not in the newer one. And if you have a big XML file and exactly one simple property of one bean causes this, it is not at all easy to detect. So be prepared for everything to fall over even when doing a minor upgrade.

Configuration hell

This is simple to explain. Use the exact same configuration for your connection pooling and data source in tests and in production. Don’t omit the connection pool in the test configuration. Your tests can pass without it and fail with one enabled, especially in my case, where the version of the connection pool I used was incompatible with Spring 3. You again waste a ton of time figuring that out. I also suggest using the same database if you have a fixed target. If your application is portable, I would still suggest running at least some integration tests on a different database, preferably a “real” one, meaning not one of the in-memory kinds.

Conclusion

Suitable for beginners?

Spring Data JPA is great, but it involves multiple technologies and you need to know each of them, especially their quirks: Spring (i.e. dependency injection), JPA and the JPA provider you use, Spring Data itself, in my case also QueryDSL, and the connection pool and its configuration. I was a beginner at all of these technologies and at times it was just too much. So if you are in the same boat as me, you need to be really, really patient and willing to learn. You need to be an expert at using Google and at asking questions in forums (Stack Overflow) in a way that readers can easily understand. You must be persistent, have a clear goal and the will to reach it. Otherwise it will be difficult.

However, if you are already experienced in Spring and the JPA provider of your choice, I can only recommend that you try it out.

Web applications with Spring Data?

Possible. But I have played with Grails, and Grails offers most of the querying features mentioned above and more, namely the actual web part. Spring Data offers nothing in that regard, and for an average web application you’re better off just using Grails (or another web framework). That leaves the question of what you can actually use Spring Data for. I can’t really answer that. I had a unique use case, creating a framework for a special type of database search (chemical structure search), and despite my issues it is very suitable for that. Another option I see is replacing the data access layer of an existing legacy application to make it more maintainable. Maybe also SOA, e.g. a data access layer that has a lot of different clients: web, rich and “automated” ones. Surely there are other possibilities, but the conclusion is that Spring Data is for the rapid creation of a maintainable, easy-to-use and extensible data access layer and not for creating a full application.

Written by kienerj

February 20, 2013 at 14:11

Posted in Database, Java, Programming
