2017/08/23

NoSQL enters jBPM ... as an experiment ... so far

Quite frequently there are questions around jBPM about whether there is any way to use NoSQL as the data store for a persistent setup. From the very beginning, persistence in KIE projects (Drools and jBPM) was designed to be pluggable. In versions prior to 7, though, the integration was rather tight, which meant that dependencies on JPA were still needed. With version 7 the persistence layer was refactored (thanks to Mariano De Maio, who did the majority of the work), which enabled much cleaner integration with persistence stores other than the default one.

That opened the door for more research on how to utilise NoSQL data stores to benefit the overall projects. With that in mind, we started to think about which options are valuable, and the initial set is as follows:

  • completely replace the JPA based persistence layer with another data store (e.g. NoSQL)
  • enhance the persistence layer with an additional data store chosen for its particular capabilities

Replacement of default persistence layer with NoSQL - MapDB

When it comes to the first approach, it's rather self explanatory - it replaces the entire persistence layer, thus freeing it from any JPA based mechanism. This actually follows Mariano's work on providing a persistence mechanism based on MapDB. You can find that work here; it provides a rather complete replacement of JPA and covers:
  • Drools use cases - persistence of KieSession
  • jBPM use cases - persistence of 
    • KieSession, 
    • WorkItem, 
    • ProcessInstance, 
    • Task
  • jBPM runtime manager use cases - mainly around PerProcessInstance and PerCase strategies
  • jBPM services use cases - additional implementations of RuntimeDataService and DeploymentService that take advantage of the MapDB store - it does not persist all audit log data, so some RuntimeDataService methods (such as those related to node instances or variables) won't work
  • KIE Server use cases - an alternative implementation of the jBPM KIE Server extension that uses MapDB as the backend store instead of an RDBMS - though it has limited capabilities: only operations on process instances and tasks are supported, and there is no async execution (jBPM executor)
The good thing about MapDB is that it's a transactional store, so it fits nicely with the jBPM infrastructure.

It didn't, however, prove (with basic load tests*) to be faster than the RDBMS based store. Quite the opposite: it was 2-3 times slower on a single box. But that does not mean there is no value in it.

Personally I think the biggest value of this experiment was to illustrate that a complete replacement of the persistence layer is possible (up to KIE Server). That said, quite significant work is required to do so, and there might be some edge cases that limit or change the available features.

Nevertheless, it's an option in case some environments can't use an RDBMS for whatever reason.

* basic load tests consist of two types of requests - 1) just start a process with a human task, 2) start a process with a human task and complete it.


Enhance persistence layer with additional data store

An alternative approach (which in my opinion brings much more value for less work) is to enhance the persistence layer with an additional data store. This means that the default data store, the one used by the internal services, is still JPA and thus requires an RDBMS, though certain use cases can be offloaded to another data store that might be much better suited to them.

Some of the use cases we are exploring are:
  • aggregation of data from various execution servers (different dbs)
  • aggregation of business data and process data
  • analytics e.g. BAM, stream processing, etc
  • advanced search capabilities like full text search
  • replication across data centres for searchability 
  • routing across data centres that run individual process engines
  • and more... in case you have any ideas feel free to comment

This was sort of possible already in jBPM by utilising event listeners (ProcessEventListener, TaskLifeCycleEventListener), though it was slightly too fine grained and required a bit of plumbing code to deal with how the engine behaves - mainly around transactions.

To ease this work, jBPM provides a few hooks that allow easier integration and let developers focus only on the actual integration code with the external data store, instead of having to know all the details of the process engine.

The two main hook points are:
  • PersistenceEventManager - responsible for receiving information from the engine whenever instances (ProcessInstance, Task) change in any way - created, updated or deleted. Its other responsibility is to collect all those events and at some point push them to the event emitter implementation for actual delivery to the external data store.
  • EventEmitter - the interface that must be implemented to activate the PersistenceEventManager - if no emitter is found, the PersistenceEventManager acts in a no-op way. The event emitter has two main responsibilities:
    • it provides the EventCollection implementation that decides how to deal with events that are added (new instance), updated (updated instance) or removed (deleted instance) - different implementations of the EventCollection can decide on individual events, e.g. if a single instance is added and removed in the same scope (transaction), the collection can decide to drop it and deal only with instances that are still active.
    • it integrates with the external data store - it encapsulates the client API of the external system
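
To make this split of responsibilities a bit more concrete, here is a minimal, self-contained sketch. Every type in it (InstanceView, EventCollection, EventEmitter, ActiveOnlyCollection, InMemoryEmitter) is a simplified stand-in invented for illustration that only mirrors the description above; these are assumptions, not the actual jBPM interfaces, whose signatures may well differ. The collection implements the example mentioned above: an instance that is added and removed in the same scope is dropped.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EmitterSketch {

    // Hypothetical stand-in for a serialisable view of an engine instance.
    public static final class InstanceView {
        final String id;
        final String state;

        public InstanceView(String id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    // Hypothetical stand-in: decides how added, updated and removed instances are kept.
    public interface EventCollection {
        void add(InstanceView view);
        void update(InstanceView view);
        void remove(InstanceView view);
        List<InstanceView> getEvents();
    }

    // Hypothetical stand-in for the emitter: supplies the collection and talks to the store.
    public interface EventEmitter {
        EventCollection newCollection();
        void deliver(List<InstanceView> events);
    }

    // Drops an instance that was both added and removed in the same scope.
    public static class ActiveOnlyCollection implements EventCollection {
        private final Map<String, InstanceView> events = new LinkedHashMap<>();
        private final Set<String> addedInScope = new HashSet<>();

        public void add(InstanceView view) {
            addedInScope.add(view.id);
            events.put(view.id, view);
        }

        public void update(InstanceView view) {
            events.put(view.id, view);
        }

        public void remove(InstanceView view) {
            if (addedInScope.remove(view.id)) {
                events.remove(view.id);    // never visible outside this scope
            } else {
                events.put(view.id, view); // removal of an older instance is still reported
            }
        }

        public List<InstanceView> getEvents() {
            return new ArrayList<>(events.values());
        }
    }

    // Encapsulates the "client API" of the external store; here it is just an in-memory list.
    public static class InMemoryEmitter implements EventEmitter {
        public final List<String> store = new ArrayList<>();

        public EventCollection newCollection() {
            return new ActiveOnlyCollection();
        }

        public void deliver(List<InstanceView> events) {
            for (InstanceView view : events) {
                store.add(view.id + ":" + view.state);
            }
        }
    }

    public static List<String> demo() {
        InMemoryEmitter emitter = new InMemoryEmitter();
        EventCollection collection = emitter.newCollection();
        collection.add(new InstanceView("pi-1", "active"));
        collection.add(new InstanceView("pi-2", "active"));
        collection.remove(new InstanceView("pi-2", "aborted")); // added and removed: dropped
        collection.update(new InstanceView("pi-1", "waiting"));
        emitter.deliver(collection.getEvents());
        return emitter.store;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch only pi-1 reaches the store, in its last state, because pi-2 never lived beyond the scope in which it was created.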


Implementations that come out of the box


PersistenceEventManager

There is a default PersistenceEventManager provided that integrates with transactions. That means there is no need (in most cases) to implement a new PersistenceEventManager. The default implementation collects events from a single transaction and delivers them to the emitter at:
  • beforeCompletion of the transaction - the manager invokes the deliver method of the emitter. This is mainly to give it the complete list of events in case the emitter wants to send them in a transactional way, for example via JMS transactional delivery
  • afterCompletion of the transaction - the manager hands over the same list of events as on beforeCompletion; this is meant for emitters that can't send events in a transactional way, e.g. a REST/HTTP call. The manager will invoke:
    • the apply method of the emitter in case the transaction was successfully committed
    • the drop method of the emitter in case the transaction was rolled back
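
The order of these callbacks can be sketched as follows. The beforeCompletion/afterCompletion names echo the JTA Synchronization callbacks, but the Manager, Emitter and RecordingEmitter types below are hypothetical stand-ins (with a plain boolean instead of a JTA status code and plain strings instead of instance views), so treat this as an illustration of the flow under those assumptions rather than the real jBPM classes.

```java
import java.util.ArrayList;
import java.util.List;

public class TxManagerSketch {

    // Hypothetical emitter contract with the three methods named in the text.
    public interface Emitter {
        void deliver(List<String> events);
        void apply(List<String> events);
        void drop(List<String> events);
    }

    // Hypothetical manager that collects the events of a single transaction.
    public static class Manager {
        private final Emitter emitter;
        private final List<String> events = new ArrayList<>();

        public Manager(Emitter emitter) {
            this.emitter = emitter;
        }

        public void onEvent(String event) {
            events.add(event);
        }

        // Transaction still active: a transactional emitter (e.g. JMS) can send here.
        public void beforeCompletion() {
            emitter.deliver(new ArrayList<>(events));
        }

        // Outcome known: the same events are applied on commit or dropped on rollback.
        public void afterCompletion(boolean committed) {
            List<String> snapshot = new ArrayList<>(events);
            events.clear();
            if (committed) {
                emitter.apply(snapshot);
            } else {
                emitter.drop(snapshot);
            }
        }
    }

    // Records which method was called with which events, for demonstration only.
    public static class RecordingEmitter implements Emitter {
        public final List<String> calls = new ArrayList<>();

        public void deliver(List<String> events) { calls.add("deliver" + events); }
        public void apply(List<String> events)   { calls.add("apply" + events); }
        public void drop(List<String> events)    { calls.add("drop" + events); }
    }

    public static List<String> run(boolean committed) {
        RecordingEmitter emitter = new RecordingEmitter();
        Manager manager = new Manager(emitter);
        manager.onEvent("task-1 created");
        manager.onEvent("task-1 updated");
        manager.beforeCompletion();
        manager.afterCompletion(committed);
        return emitter.calls;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // deliver, then apply
        System.out.println(run(false)); // deliver, then drop
    }
}
```

An emitter that cannot take part in the transaction would leave deliver empty and do its actual sending in apply, so nothing leaves the engine for a transaction that ends up rolled back.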


EventCollection

There is also a default EventCollection implementation, BaseEventCollection, that collects all events (instances regardless of their event type - create, update, delete) but eliminates duplicates, meaning it holds only the last state of each instance.
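
The "last state only" behaviour can be illustrated with a small sketch. This is not the BaseEventCollection source; LastStateCollection and View are hypothetical names, and keying the events by an instance id in a map is just one assumed way to eliminate duplicates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LastStateCollectionSketch {

    // Hypothetical stand-in for an InstanceView: an id plus the state captured for it.
    public static final class View {
        final String id;
        final String state;

        public View(String id, String state) {
            this.id = id;
            this.state = state;
        }

        @Override
        public String toString() {
            return id + ":" + state;
        }
    }

    // Keeps one entry per instance id; a later event replaces an earlier one.
    public static class LastStateCollection {
        private final Map<String, View> byId = new LinkedHashMap<>();

        // The event type is ignored on purpose: only the latest view per id survives.
        public void add(View view)    { byId.put(view.id, view); }
        public void update(View view) { byId.put(view.id, view); }
        public void remove(View view) { byId.put(view.id, view); }

        public List<View> getEvents() {
            return new ArrayList<>(byId.values());
        }
    }

    public static List<String> demo() {
        LastStateCollection events = new LastStateCollection();
        events.add(new View("pi-1", "active"));
        events.add(new View("task-7", "ready"));
        events.update(new View("task-7", "reserved"));
        events.remove(new View("pi-1", "completed"));

        List<String> result = new ArrayList<>();
        for (View view : events.getEvents()) {
            result.add(view.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Four events go in, but only two views come out: one per instance, each carrying its most recent state.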

Events

Now let's take a look at what an event is. This is maybe a slightly overused term, but it fits well in this scenario: an event is fired when things happen in the engine. These events mainly represent the instances that the process engine is managing:
  • ProcessInstance
  • Task
Currently only these two types are managed, but the hooks within the engine allow more to be plugged in, for example async jobs.

As soon as an instance changes (is created, updated or deleted), it is wrapped in an InstanceView type and delivered to the PersistenceEventManager via the dedicated method representing the type of the event - create, update or remove.

Each InstanceView has a dedicated implementation that provides access to the details of an individual instance, and every implementation always provides a link to the actual source of the view. Why is there a need for the *View types? Mainly to simplify their consumption - the InstanceView type is designed to be serialisable, for example to JSON or XML, without too much hassle.

Out of the box there are two implementations of the InstanceView, one for each of the instance types listed above (ProcessInstance and Task).
An InstanceView may decide when the data should be copied from its source, though at the latest the copy will be invoked by the PersistenceEventManager before it calls the deliver method. So it's important that an InstanceView implementation that copies data earlier marks itself as copied, to avoid a double copy.
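
A minimal sketch of that "copy once" guard could look like this. ProcessState, ProcessView and the copyFromSource method name are assumptions made for illustration (as is the copyCount field, which exists only so the demo can show what happened); the real view types carry far more data.

```java
public class CopyOnceViewSketch {

    // Hypothetical engine-side instance that the view wraps.
    public static class ProcessState {
        public String status = "active";
    }

    // Hypothetical view: keeps a link to its source and copies the data at most once.
    public static class ProcessView {
        private final ProcessState source;
        private String status;
        private boolean copied;
        public int copyCount; // demo-only counter of real copies

        public ProcessView(ProcessState source) {
            this.source = source;
        }

        public ProcessState getSource() {
            return source;
        }

        public String getStatus() {
            return status;
        }

        // Safe to call several times: later calls are no-ops once the flag is set.
        public void copyFromSource() {
            if (copied) {
                return;
            }
            status = source.status;
            copied = true;
            copyCount++;
        }
    }

    public static ProcessView demo() {
        ProcessView view = new ProcessView(new ProcessState());
        view.copyFromSource(); // the view decides to copy early
        view.copyFromSource(); // the manager's call before deliver does nothing more
        return view;
    }

    public static void main(String[] args) {
        ProcessView view = demo();
        System.out.println(view.getStatus() + ", copies made: " + view.copyCount);
    }
}
```

The flag is what makes the manager's final call harmless: whichever side triggers the copy first wins, and the data is read from the source only once.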



That concludes the introduction to how jBPM approaches support for NoSQL. The following article will show some implementations of the second approach, which empowers jBPM with additional capabilities.

I'd like to encourage everyone to share their opinion on how NoSQL would provide value for jBPM, or which use cases you see as a good fit for NoSQL and that jBPM should therefore support better.