2017/12/13

Be lazy with your data

jBPM comes with really handy feature called pluggable variable persistence strategy (or shorter marshalling strategies). This is the mechanism responsible for persisting process instance and task variables to their data store - whatever kind of data store you use.

An important thing to keep in mind when using marshalling strategies is the fact that each time variable is read it will read that from the data store which might be:

  • file system
  • data base
  • REST service
  • document management system e.g. ECM
  • and more
depending on the accessibility of this service and its performance this might not be a big deal but if that data is loaded many times it might become an issue. Especially if the data is of fair size like documents. 
Imagine process instance operates on bunch of documents (word, pdf, etc). Each of this document is of 2MB size which will give 20 MB for 10 documents. That means every time user interact with this process instance all documents will be loaded from external system. And this is not a problem (or it is actually how it should behave) when these documents are needed for the work user intend to perform. 

But what if (s)he is not going to use all documents or even not a single one? Or if it's not a user but an timer is firing of to send reminder to users?

Should all documents be loaded then? Obviously it does not make much sense though the problem with this decision is how would process instance know if given data will be used or not? And because of that it simply loads all variables whenever process instance (or task) is loaded.
This is default mechanism for process instance and task variables that are stored as part of them as it does not make much sense to separate them since their life cycle is bound to each other - always loaded and stored together. So this does not bring any overhead or additional communication effort.

Though situation looks slightly different for variables that are actually stored externally - like physical documents stored in ECM system. In this case, documents (including their content - 2MB each) will be loaded regardless if the data will be used or not. Moreover, in high volume systems this might put unnecessary load on ECM to constantly load and store documents which actually didn't change.

jBPM 7.6 comes with support for lazy load of variables to resolve these issues. Though it won't magically apply to all your variables that are stored externally, but will provide all the support from the engine side to take advantage of lazy loading. Since the mechanism for loading variables will differ based on the back end data store it's up to user to apply this principle which is rather simple as follows:
  • your variable must implement org.kie.internal.utils.LazyLoaded
  • your marshaler strategy needs to provide kind of service responsible for loading content of the variable - user by load method of the variable
  • your marshalling strategy should not load content by default but set that service on the variable so can be used to lazy load when needed
  • to avoid unnecessary operations to store variable, implement sort of tracking system on your data to identify if variable has changed and store it only if it did, your marshaler should check this tracking mechanism in marshal method and should reset it after loading variable in unmarshal method

jBPM 7.6 provides this mechanism as part of its support for documents (jbpm-document module). The DocumentImpl implements both LazyLoaded and tracking mechanism to resolve both issues (too often load and too often store). This extremely improves overall performance as it reduces number of reads and writes to document service at the same time provides content on demand and only when needed.

I'd like to encourage everyone who stores process or task variables externally to make it lazy loaded and tracked to improve the performance of your system and reduce load on your back end data store.