Showing posts with label jbpm_no_sql. Show all posts

2017/08/23

Elasticsearch empowers jBPM

As a follow-up to the article that introduced experimental NoSQL support for jBPM, this article illustrates one potential integration that enhances search capabilities and, potentially, routing support for larger environments.

Elasticsearch will be used as an additional data store where both process instances and tasks end up being indexed. Please keep in mind that at this point it is a rather basic integration, though it has already proven to be extremely valuable. Before jumping into details, let's look at the use cases this integration brings:
  • ability to collect process instances and tasks from different sources - e.g. different execution servers connected to different dbs
  • ability to search for process instances and tasks using full text search - indexed values etc
  • ability to search for process instances and tasks by their variables, multiple variables (both name and value) in single request
  • ability to retrieve variables with search results in single request
  • and all other things that Elasticsearch provides :)

Implementation


The actual implementation that integrates Elasticsearch with jBPM, based on the PersistenceEventManager hooks, is simple - it consists of a single class that implements the EventEmitter interface - ElasticSearchEventEmitter.

It utilises the Elasticsearch REST API - to be precise, its _bulk REST endpoint - and pushes all events in a single HTTP call. The call carries both types of instances:
  • ProcessInstanceView
  • TaskInstanceView
All views are serialised as JSON documents. This integration uses:
  • http://localhost:9200 as the location where Elasticsearch server is
  • jbpm as the name of the index
  • processes as the type for ProcessInstanceView documents
  • tasks as the type for TaskInstanceView documents
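
To make the shape of that single call more concrete, here is a minimal sketch of how a _bulk request body could be assembled for the defaults listed above. The BulkBodySketch class, its method names and the sample document fields are hypothetical and not taken from ElasticSearchEventEmitter; the only thing borrowed from Elasticsearch is the newline-delimited layout of an action line followed by a source line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the real emitter: builds a newline-delimited _bulk body.
class BulkBodySketch {

    // Action line that tells Elasticsearch where the following document goes.
    static String indexAction(String index, String type, String id) {
        return "{\"index\":{\"_index\":\"" + index + "\",\"_type\":\"" + type
                + "\",\"_id\":\"" + id + "\"}}";
    }

    // docsById maps a document id to its already serialised, single-line JSON source.
    static String bulkBody(String index, String type, Map<String, String> docsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : docsById.entrySet()) {
            body.append(indexAction(index, type, doc.getKey())).append('\n');
            body.append(doc.getValue()).append('\n'); // every line, including the last, ends with a newline
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> processes = new LinkedHashMap<>();
        processes.put("1", "{\"processId\":\"orders\",\"state\":1}"); // made-up fields
        System.out.print(bulkBody("jbpm", "processes", processes));
    }
}
```

Since each action line names its own type, process and task documents can share one request body, which is what makes the single HTTP call possible.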

The location of the Elasticsearch server and the name of the index are configurable via system properties:
  • org.jbpm.integration.elasticsearch.url
  • org.jbpm.integration.elasticsearch.index
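
As a rough illustration, the two properties could be resolved with fallbacks to the defaults mentioned earlier. The ElasticsearchSettingsSketch class is hypothetical; only the property names and default values come from this article.

```java
// Hypothetical holder for the two settings; not part of the actual project.
class ElasticsearchSettingsSketch {

    static final String URL_PROPERTY = "org.jbpm.integration.elasticsearch.url";
    static final String INDEX_PROPERTY = "org.jbpm.integration.elasticsearch.index";

    // Falls back to the local server when the property is not set.
    static String serverUrl() {
        return System.getProperty(URL_PROPERTY, "http://localhost:9200");
    }

    // Falls back to the jbpm index when the property is not set.
    static String indexName() {
        return System.getProperty(INDEX_PROPERTY, "jbpm");
    }
}
```

On the command line this corresponds to passing, for example, -Dorg.jbpm.integration.elasticsearch.index=my-index to the JVM that runs the execution server (my-index being a made-up name).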

There is one more file in the project: the ServiceLoader services file, which provides information about the emitter implementation so that it can be discovered at runtime.

ElasticSearchEventEmitter delivers the actual events asynchronously so that it does not hold back the thread that executed the process, which keeps the performance impact on the process engine minimal. Moreover, thanks to the default PersistenceEventManager implementation, this emitter only sends events once the transaction has completed successfully, meaning that if a process instance is rolled back, its information won't be in Elasticsearch.
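
For readers unfamiliar with the mechanism, a services file is a plain text file under META-INF/services, named after the fully qualified name of the interface and containing the fully qualified name of the implementation class. The sketch below shows the lookup side with java.util.ServiceLoader; the nested Emitter interface is a stand-in and the whole class is hypothetical, it is not how jBPM itself is written.

```java
import java.util.Iterator;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical discovery sketch using the standard ServiceLoader mechanism.
class EmitterDiscoverySketch {

    // Stand-in for the emitter contract; the real interface lives in jBPM.
    interface Emitter { }

    // Returns the first implementation listed in a services file, if any is on the classpath.
    static Optional<Emitter> findEmitter() {
        Iterator<Emitter> found = ServiceLoader.load(Emitter.class).iterator();
        return found.hasNext() ? Optional.of(found.next()) : Optional.empty();
    }
}
```

With no services file on the classpath the lookup simply comes back empty, which is the situation in which the engine has nothing to deliver events to.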
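
The asynchronous hand-off can be pictured with a small sketch like the one below. It assumes an emitter with apply and drop methods that receive already serialised documents; the AsyncEmitterSketch class and its in-memory sent list (used instead of a real HTTP call so the example runs without a server) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical emitter that hands delivery off to a background thread.
class AsyncEmitterSketch {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Stands in for the remote index so the sketch runs without a server.
    final List<String> sent = new CopyOnWriteArrayList<>();

    // Called after a successful commit: queue the work and return immediately.
    void apply(Collection<String> jsonDocuments) {
        List<String> snapshot = new ArrayList<>(jsonDocuments); // copy before leaving the caller's thread
        executor.execute(() -> sent.addAll(snapshot)); // a real emitter would do the HTTP call here
    }

    // Called after a rollback: nothing is sent.
    void drop(Collection<String> jsonDocuments) {
    }

    // Waits briefly for queued deliveries to finish before shutting down.
    void close() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The single-threaded executor is a deliberate simplification: it keeps deliveries in the order in which transactions completed.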

Installation


For those who would like to try this out, first of all install Elasticsearch on your box (or wherever you prefer, as you can point to any server via the system property).
Next, build the elasticsearch-jbpm project locally (it's not yet included in the regular jBPM builds) and drop it into the KIE Server web app (inside WEB-INF/lib) and that's it!

Now when you execute any processes you will have their data in Elasticsearch as well, so you can query them in very advanced ways.


In action

Let's now look at a short screencast that shows this in action. The demo queries a still rather small set of data (around 12 000 process instances and 12 000 task instances). What it shows is:
  • speed of execution
  • queries that neither JPA nor jBPM advanced queries allow without additional setup
  • data retrieved directly from the query


In details:

  • First, the search for all active process instances is done in the workbench - this uses data sets / advanced queries - and it is slightly slower because it also collects execution errors, which affects performance and is under investigation
  • Then the same query is run over the KIE Server REST API - that uses JPA underneath
  • Then the same query is run over Elasticsearch
  • Finally, it shows somewhat more advanced queries by multiple variables, people assignments etc

The screencast illustrates the benefits, but on a small scale; more will be seen where there are several independent execution servers, so that you can search across them.


The main difference is that Elasticsearch directly returns process instance variables. The same holds for user tasks, though there it provides much more information - both task inputs and outputs plus people assignments - e.g. potential owners, business admins and excluded owners.
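
A search by several variables at once could be expressed as an Elasticsearch bool query with one match clause per variable. The sketch below only builds the request body; the VariableQuerySketch class is hypothetical, and the assumption that variables are indexed under a variables.<name> field is mine, as the exact document layout is not described here.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical query builder; the "variables.<name>" field layout is an assumption.
class VariableQuerySketch {

    // Builds a bool query that requires every given variable to match.
    // Names and values are assumed not to need JSON escaping.
    static String byVariables(Map<String, String> variables) {
        StringJoiner must = new StringJoiner(",");
        for (Map.Entry<String, String> variable : variables.entrySet()) {
            must.add("{\"match\":{\"variables." + variable.getKey() + "\":\""
                    + variable.getValue() + "\"}}");
        }
        return "{\"query\":{\"bool\":{\"must\":[" + must + "]}}}";
    }
}
```

Such a body would be sent to the _search endpoint of the index; because the variables are part of the indexed document, they come back with the hits in the same response.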

Expect more integration with other NoSQL data stores to come... so stay tuned.

NoSQL enters jBPM ... as an experiment ... so far

Quite frequently there are questions around jBPM about whether there is any way to use NoSQL as the data store for a persistable setup. From the very beginning, persistence in KIE projects (Drools and jBPM) was designed to be pluggable. In versions prior to 7, though, the integration was rather tight, which meant that dependencies on JPA were still needed. With version 7 the persistence layer was refactored (thanks to Mariano De Maio, who did the majority of the work), which enabled much cleaner integration with persistence stores other than the default one.

That opened the door for more research on how to utilise NoSQL data stores to benefit the overall projects. With that in mind, we started to think about which options are valuable, and the initial set is as follows:

  • complete replacement of the JPA based persistence layer with another data store (e.g. NoSQL)
  • enhancement of the persistence layer with an additional data store chosen for its capabilities

Replacement of default persistence layer with NoSQL - MapDB

When it comes to the first approach, it's rather self-explanatory - it replaces the entire persistence layer, thus freeing it from any JPA based mechanism. This follows Mariano's work on providing a persistence mechanism based on MapDB. You can find that work here; it provides a rather complete replacement of JPA and covers:
  • drools use cases - persistence of KieSession
  • jBPM use cases - persistence of 
    • KieSession, 
    • WorkItem, 
    • ProcessInstance, 
    • Task
  • jBPM runtime manager use cases - mainly around PerProcessInstance and PerCase strategies
  • jBPM services use cases - additional implementation of RuntimeDataService and DeploymentService to take advantage of MapDB store - does not persist all audit log data so some of the methods from RuntimeDataService (like node instances or variables related) won't work
  • KIE Server use cases - an alternative implementation of jBPM KIE Server extension that uses MapDB as backend store instead of RDBMS - though it does have limited capabilities - only operations on process instances and tasks are supported, no async execution (jBPM executor)
The good thing with MapDB is that it's a transactional store so it fits nicely with jBPM infrastructure. 

It didn't prove (with basic load tests*) to be faster than the RDBMS based store, though. Quite the opposite: it was 2-3 times slower on a single box. But that does not mean there is no value in it.

Personally I think the biggest value of this experiment was to illustrate that a complete replacement of the persistence layer is possible (up to KIE Server). That said, it requires quite significant work, and there might be some edge cases that limit or change the available features.

Nevertheless it's an option in case some environments can't use RDBMS for whatever reason.

* basic load tests consist of two types of requests - 1) just start a process with a human task, 2) start a process with a human task and complete it.


Enhance persistence layer with additional data store

The alternative approach (which in my opinion brings much more value for less work) is to enhance the persistence layer with an additional data store. This means that the default data store, the one used by the internal services, is still JPA and thus requires an RDBMS, but certain use cases can be offloaded to another data store that is much better suited for them.

Some of the use cases we are exploring are:
  • aggregation of data from various execution servers (different dbs)
  • aggregation of business data and process data
  • analytics e.g. BAM, stream processing, etc
  • advanced search capabilities like full text search
  • replication across data centres for searchability 
  • routing across data centres that run individual process engines
  • and more... in case you have any ideas feel free to comment

This was sort of possible already in jBPM by utilising event listeners (ProcessEventListener, TaskLifeCycleEventListener), though it was slightly too fine grained and required a bit of plumbing code to deal with how the engine behaves - mainly around transactions.

So to ease this work, jBPM provides a few hooks that allow easier integration and let developers focus only on the actual integration code with the external data store instead of having to know all the details of the process engine.

So the main two hook points are:
  • PersistenceEventManager - responsible for receiving information from the engine when instances (ProcessInstance, Task) are in any way changed - created, updated or deleted. Its other responsibility is to collect all those events and at some point push them to the event emitter implementation for actual delivery to the external data store.
  • EventEmitter - this is the interface that must be implemented to activate the PersistenceEventManager - if no emitter is found, PersistenceEventManager acts in a no-op way. An event emitter has two main responsibilities:
    • provides an EventCollection implementation that decides how to deal with events that are added (new instance), updated (updated instance) or removed (deleted instance) - different implementations of the EventCollection can decide on individual events, e.g. in case a single instance is added and removed in the same scope (transaction), the collection can decide to drop it and deal only with still active instances.
    • integrates with the external data store - encapsulates the client API of the external system
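
The add-then-remove case described above can be sketched with a small collection that tracks instances by id. The CancellingCollectionSketch class and its methods are hypothetical and much simpler than the real EventCollection contract; they only illustrate the decision logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical collection: an instance added and removed in one transaction is never reported.
class CancellingCollectionSketch {

    private final Set<String> added = new LinkedHashSet<>();
    private final Set<String> updated = new LinkedHashSet<>();
    private final Set<String> removed = new LinkedHashSet<>();

    void add(String id) {
        added.add(id);
    }

    void update(String id) {
        if (!added.contains(id)) {
            updated.add(id); // an instance added in this scope is already going to be indexed
        }
    }

    void remove(String id) {
        updated.remove(id);
        if (!added.remove(id)) {
            removed.add(id); // the instance predates this transaction, so the delete must be sent
        }
    }

    // Ids of instances that are still active and need to be written to the external store.
    List<String> toIndex() {
        List<String> result = new ArrayList<>(added);
        result.addAll(updated);
        return result;
    }

    // Ids of instances that must be deleted from the external store.
    List<String> toDelete() {
        return new ArrayList<>(removed);
    }
}
```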


Implementations that come out of the box


PersistenceEventManager

There is a default PersistenceEventManager provided that integrates with transactions. That means there is no need (in most cases) to implement a new PersistenceEventManager. The default implementation collects events from a single transaction and delivers them to the emitter at:
  • beforeCompletion of the transaction, manager will invoke deliver method of the emitter - this is mainly to give a complete list of events in case emitter wants to send these events in transactional way - for example JMS transactional delivery
  • afterCompletion of the transaction, this will again deliver same list of events as on beforeCompletion and is more for emitters that can't send events in transactional way e.g. REST/HTTP call. Manager will invoke:
    • apply method of the emitter in case transaction was successfully committed
    • drop method of the emitter in case transaction was rolled back
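
The sequence of calls can be summarised in a small sketch. It is a simplified model rather than the default manager itself: the TransactionHooksSketch class, its nested types and the boolean committed flag are hypothetical, and events are plain strings instead of instance views.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of how a manager could drive an emitter around a transaction.
class TransactionHooksSketch {

    interface Emitter {
        void deliver(List<String> events); // still inside the transaction
        void apply(List<String> events);   // after a successful commit
        void drop(List<String> events);    // after a rollback
    }

    static class Manager {
        private final Emitter emitter;
        private final List<String> events = new ArrayList<>();

        Manager(Emitter emitter) {
            this.emitter = emitter;
        }

        // Collects an event raised while the transaction is in progress.
        void record(String event) {
            events.add(event);
        }

        // Gives transactional emitters (e.g. JMS) the complete list before the commit.
        void beforeCompletion() {
            emitter.deliver(new ArrayList<>(events));
        }

        // Hands the same list over again once the outcome is known, then starts afresh.
        void afterCompletion(boolean committed) {
            List<String> snapshot = new ArrayList<>(events);
            events.clear();
            if (committed) {
                emitter.apply(snapshot);
            } else {
                emitter.drop(snapshot);
            }
        }
    }
}
```

An emitter that talks to a REST endpoint would typically leave deliver empty and do its work in apply, whereas a transactional emitter would do the opposite.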


EventCollection

There is also a default EventCollection implementation, BaseEventCollection, that collects all events (instances regardless of their event type - create, update, delete) but eliminates duplicates, meaning it holds only the last state of each instance.
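
The "last state wins" behaviour can be illustrated with a map keyed by instance id. This is only a sketch of the idea, not the code of BaseEventCollection; the LastStateCollectionSketch class is hypothetical and uses plain strings for instance states.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical collection that keeps one entry per instance, whatever the event type.
class LastStateCollectionSketch {

    // Keyed by instance id, so a later event for the same instance overwrites the earlier one.
    private final Map<String, String> latest = new LinkedHashMap<>();

    void add(String id, String state) {
        latest.put(id, state);
    }

    void update(String id, String state) {
        latest.put(id, state);
    }

    void remove(String id, String state) {
        latest.put(id, state);
    }

    // One entry per instance, in the order in which instances were first seen.
    List<String> getEvents() {
        return new ArrayList<>(latest.values());
    }
}
```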

Events

Now let's take a look at what an event is - this is maybe a bit of an overused term but it does fit well in this scenario - it is fired when things happen in the engine - these events mainly represent instances that the process engine is managing:
  • ProcessInstance
  • Task
Currently only these two types are managed, but the hooks within the engine allow more to be plugged in, for example async jobs.

As soon as an instance is changed (created, updated, deleted), that instance is wrapped with an InstanceView type and delivered to PersistenceEventManager - over its dedicated method representing the type of the event - create, update, remove.

InstanceView will have a dedicated implementation to provide access to individual instance details, though every implementation will always provide the link to the actual source of this view. Why is there a need for the *View types? Mainly to simplify consumption of them - the InstanceView type is designed to be serialisable - for example to JSON or XML without too much hassle.

Out of the box there are two implementations of the InstanceView:
  • ProcessInstanceView
  • TaskInstanceView
InstanceView might decide when the data should be copied from the source, though at the latest the copy will be invoked by the PersistenceEventManager before calling the deliver method - so it's important that an InstanceView implementation that copies data earlier marks itself as copied to avoid a double copy.
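
The copy-once rule can be sketched as a view that guards its copy with a flag. The CopyOnceViewSketch class is hypothetical: it uses a plain map as the source and a copies counter purely for demonstration, whereas real views wrap engine instances.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical view that snapshots its source at most once.
class CopyOnceViewSketch {

    private final Map<String, Object> source;  // live instance data, may keep changing
    private Map<String, Object> variables;     // serialisable snapshot
    private boolean copied;
    int copies;                                // counts real copies, for demonstration only

    CopyOnceViewSketch(Map<String, Object> source) {
        this.source = source;
    }

    // Safe to call several times: only the first call copies the data.
    void copyFromSource() {
        if (copied) {
            return;
        }
        variables = new HashMap<>(source);
        copied = true;
        copies++;
    }

    Map<String, Object> getVariables() {
        return variables;
    }
}
```

If the view copies early, for instance when the event is raised, the later call made by the manager becomes a no-op and the snapshot keeps the state from the moment of the first copy.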



That concludes the introduction to how jBPM approaches support for NoSQL. The following article will show some implementations of the second approach that empower jBPM with additional capabilities.

I'd like to encourage everyone to share their opinion on how NoSQL would provide value for jBPM, or which use cases you see as a good fit for NoSQL and that jBPM should therefore support better.