Evolution of the Liferay Scheduler API

Over the last few months, we at Liferay Support have encountered multiple customers who were having trouble with MEMORY_CLUSTERED scheduled jobs disappearing in their clustered environments.

In trying to figure out why their clustered jobs were disappearing, we learned that the problem lay in a misunderstanding of the (undocumented) API that the customer was calling during their component deactivation. Along the way, I realized that while I had an intuition about how scheduling was supposed to work in a cluster, I’d never done a deep dive into the scheduler engine code to say for certain that any of my intuitions were correct, something that should be pretty important when working with an undocumented API.

As a disclaimer, this post is the result of an investigation to close that knowledge gap: a description of how I believe things currently work, and how that way of working came to be, based on reading through commit histories and JIRA tickets.

Understanding SchedulerEngine

What is SchedulerEngineProxyBean and why is its implementation so weird?

If you look at Liferay code, you’ll see a few classes with names ending in ProxyBean. When you open them up, you’ll see that they implement some interface, but every method of that interface does nothing except throw an UnsupportedOperationException. Looking at those, you might end up really confused and find them weird, but there’s a reason they’re written that way.

However, before that, a little bit of backstory.

Before OSGi, Liferay introduced a Util class for every interface in order to allow instances of classes not managed by Spring (portlets, for example) or classes that lived outside of the Liferay Spring context (third-party web applications, for example) to call instances of Liferay’s Spring-managed classes.

Before Liferay addressed the inherent performance problem of long advice chains (see Shuyang’s blog from 2011 for additional details), Util classes also functioned to instantiate advice-like wrappers around existing implementations. For example, scheduler was initially implemented so that every call to the API was converted into a message bus message, with the real scheduler engine implementation happening in a message bus worker thread.

This design of putting wrapper-like logic into a Util class resulted in a particular side-effect. Since all of the message bus routing logic lived in the Util class, this meant that if you had a reference to the Spring-managed instance of SchedulerEngine, it would invoke the method directly, since your method calls would only be routed to the message bus if you used Util instead. In LEP-7304, the message bus routing was converted into a wrapper class, but the fundamental difference between a Util class and a direct reference was maintained.

Eventually, Liferay did solve that problem of long advice chains, and LPS-14031 rewrote the API layer of scheduler so that message bus routing was implemented using an advice. This is when SchedulerEngineProxyBean was introduced.

You can think of this implementation pattern as the following:

  • Start with a dummy implementation (SchedulerEngineProxyBean), usually with a name that ends with ProxyBean
  • Use a factory so that Spring knows to instantiate a proxy that wraps the dummy implementation
  • Use the invocation handler provided for the proxy (MessagingProxyInvocationHandler) to convert the method call into a message bus message
  • With a listener that listens to messages on the message bus destination, deserialize the message bus message and invoke the method on the real implementation
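
As a rough illustration of the pattern, here is a minimal, self-contained sketch using hypothetical stand-in names (GreetingEngine instead of SchedulerEngine, and a toy invocation handler instead of MessagingProxyInvocationHandler):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    public class ProxyBeanSketch {

        // Stand-in for an API interface like SchedulerEngine
        public interface GreetingEngine {

            public String greet(String name);

        }

        // Stand-in for a *ProxyBean: every method only throws
        public static class GreetingEngineProxyBean implements GreetingEngine {

            public String greet(String name) {
                throw new UnsupportedOperationException();
            }

        }

        // Stand-in for MessagingProxyInvocationHandler: in Liferay, this is where
        // the method call would be packaged into a message bus Message and sent to
        // a destination, where a listener invokes the real implementation
        public static class MessagingInvocationHandler implements InvocationHandler {

            public MessagingInvocationHandler(GreetingEngine proxyBean) {

                // The wrapped dummy implementation is only a placeholder; its
                // methods are never called, because this handler intercepts
                // every invocation

                _proxyBean = proxyBean;
            }

            public Object invoke(Object proxy, Method method, Object[] args) {
                System.out.println(
                    "Routing " + method.getName() + " to the message bus");

                return "Hello, " + args[0];
            }

            private final GreetingEngine _proxyBean;

        }

        public static void main(String[] args) {
            GreetingEngine greetingEngine = (GreetingEngine)Proxy.newProxyInstance(
                GreetingEngine.class.getClassLoader(),
                new Class<?>[] {GreetingEngine.class},
                new MessagingInvocationHandler(new GreetingEngineProxyBean()));

            System.out.println(greetingEngine.greet("scheduler"));
        }

    }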

Through this proxy bean pattern, Liferay provides the API in core, and then it provides the implementation in a separate Marketplace plugin. It also provides consistency between calling the Util and calling the Spring-managed instance. You can see similar approaches in things like workflow.

With the move to OSGi, Util classes still exist, but now they exist for classes that are not managed by OSGi to call instances of OSGi-managed classes.

What kind of API did Liferay provide with SchedulerEngine?

When scheduler was first introduced in LEP-6187, it had a very simple API, which basically served as a wrapper around specific Quartz Scheduler methods. At the time, they were really just the same methods that you would call in the Quartz Quick Start Guide.

Later, with LPS-7391 Liferay thought it would be interesting to create a portlet that would allow you to manage scheduled jobs. So with the subtask described in LPS-7395, Liferay added an API to allow you to modify existing jobs.

Liferay also assumed that people wouldn’t want to just modify scheduled jobs, but maybe they’d want to write scheduled jobs on the fly. Therefore, Liferay also added an API method to allow you to add a Groovy script that would run at a scheduled time.

In order to make informed decisions about why a scheduled job needed to be updated, with LPS-7397, Liferay also provided an API to retrieve metadata about existing scheduled jobs.

Beyond that, Liferay thought that you might also want to know about the timings of a scheduled job, so as part of LPS-7397, Liferay also provided helper methods for those.

Another natural question to ask is about the past timings of a scheduled job.

In addition, Liferay also provided a way to interact with that metadata, by allowing you to pause and resume scheduled jobs, check whether the job had been paused, and find out whether any errors had been raised during past executions of the job.

Unfortunately, as Liferay prepared for its 6.0 release, Liferay’s UI/UX team got pulled into designing newer features that would be added to that release, and the user interface for managing an older feature like scheduler never got designed. Ultimately, the whole idea of a built-in portlet to manage scheduled jobs ended up in the product backlog, never to be heard from again.

Why did Liferay introduce a SchedulerEngineHelper?

Later, with LPS-23998, Liferay needed a way to derive a cron expression, based on the same concepts as Liferay’s Calendar portlet, in order to improve the way we were implementing scheduled staging. This logic was added to SchedulerEngine, because it looked like a common concern for anything that wanted to create scheduled tasks.

With LPS-25385, Liferay decided to broadcast an audit event any time that a scheduled job was fired, and so we added a method to the SchedulerEngine interface that would be called whenever a scheduled job fired.

After reflecting on this a little more, we realized that from the standpoint of separation of concerns, saying that every implementation of a SchedulerEngine needed to also provide getCronText to derive a cron expression as well as auditScheduledJob to broadcast an audit event didn’t make any sense. Additionally, addScriptingJob wasn’t really a special function of a scheduler, either.

Therefore, with LPS-29425, we introduced a new layer, SchedulerEngineHelper, that would provide the implementation of broadcasting an audit event, while also serving as an adapter that could simplify how developers interacted with the SchedulerEngine, without requiring the SchedulerEngine to provide any implementation relating to those interactions, allowing SchedulerEngine itself to remain relatively stable. We also moved over any other methods that allowed you to retrieve specific metadata about a scheduled job; since getScheduledJob contained all the metadata, the individual accessors could simply delegate to it and then retrieve the individual fields.

Understanding SchedulerEngineHelper

What changed with SchedulerEngineProxyBean in 7.x?

Let’s keep in mind what we got from SchedulerEngineProxyBean in earlier releases. Through this proxy bean pattern, Liferay provides the API in core, and then it provides the implementation in a separate Marketplace plugin.

When you think about it like that, this is the same benefit you get with OSGi. Therefore, with Liferay 7.x, Liferay decided to reimplement scheduler as OSGi-managed components, rather than Spring-managed components injected into OSGi.

As a result, we effectively lost the benefit of Spring advices. However, rather than reconsider our decision, and rather than move all the logic back to the Util again, Liferay opted to divide all the concerns related to scheduler across a long implementation chain:

  • QuartzSchedulerEngine is provided as an OSGi component, disabled by default
  • ModulePortalProfile uses the concept of portal profile gatekeepers (an undocumented Liferay API that makes it easier to conditionally enable components that are disabled by default) to enable QuartzSchedulerEngine (mentioned above) and SchedulerEngineHelperImpl (mentioned later)
  • QuartzSchedulerProxyMessageListener waits for a reference to QuartzSchedulerEngine and registers itself to the liferay/scheduler_engine message bus destination
  • SchedulerEngineProxyBeanConfigurator instantiates a SchedulerEngineProxyBean, registers it as something that routes messages to the liferay/scheduler_engine message bus destination (similar to the old Util class in earlier Liferay releases), and registers it as an OSGi service with scheduler.engine.proxy.bean=true
  • A SchedulerEngineConfigurator (which will be enabled using a portal profile gatekeeper) waits for a reference to scheduler.engine.proxy.bean=true, and on activation registers a scheduler with scheduler.engine.proxy=true
    • SingleSchedulerEngineConfigurator (only enabled for 7.0.x CE) registers an unclustered scheduler
    • ClusterSchedulerEngineConfigurator (only enabled for 7.0.x DXP, and all releases after that when clustering was re-added to CE) registers the original unclustered scheduler if cluster.link.enabled=false, or a clustered wrapper around the proxy bean if cluster.link.enabled=true
  • SchedulerEngineHelperImpl waits for a reference to scheduler.engine.proxy=true

Developers who want to work with the scheduler obtain a reference to the SchedulerEngineHelper service (rather than to the SchedulerEngine service), and make calls against the provided API. Of course, that API isn’t well-documented right now, which causes a lot of confusion when using scheduler, so we’ll move on to documenting it next.
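
As a starting point, obtaining that reference in a Declarative Services component looks something like the following sketch (the component class name here is hypothetical):

    import com.liferay.portal.kernel.scheduler.SchedulerEngineHelper;

    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Reference;

    // Hypothetical component that wants to manage scheduled jobs
    @Component(immediate = true, service = ExampleSchedulerClient.class)
    public class ExampleSchedulerClient {

        // Ask OSGi for the SchedulerEngineHelper service, rather than for
        // SchedulerEngine itself
        @Reference
        private SchedulerEngineHelper _schedulerEngineHelper;

    }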

What new methods were added to SchedulerEngineHelper with 7.x?

When SchedulerEngineHelperImpl is activated, it will create a ServiceTracker that uses an internal class (SchedulerEventMessageListenerServiceTrackerCustomizer) to track when new instances of SchedulerEventMessageListener are registered and when existing instances are unregistered.

In other words, with 7.x, scheduled jobs are managed by creating components that provide the SchedulerEventMessageListener service.

When a new component providing the SchedulerEventMessageListener service is registered to OSGi, addingService will deactivate a thread local (used by SchedulerClusterInvokeAcceptor), populate a map that will effectively ask all callee nodes to ignore the call, broadcast the method call to the cluster (where the callee nodes will ignore it), and then attempt to add a trigger for the job on the local node.

When an existing component providing the SchedulerEventMessageListener service is unregistered from OSGi, removedService will deactivate a thread local (used by SchedulerClusterInvokeAcceptor), populate a map that will effectively ask all callee nodes to ignore the call, broadcast the method call to the cluster (where the callee nodes will ignore it), and then attempt to remove a trigger for the job from the local node.

With LPS-59681, Liferay added API methods to SchedulerEngineHelper to help register your scheduled job, which is probably a MessageListener if you’re coming from earlier releases:

  • register: Adapts a regular MessageListener as a SchedulerEventMessageListener, registers the adapted wrapper as an OSGi component, and remembers it so that it can be manually unregistered by calling unregister
  • unregister: Unregisters the SchedulerEventMessageListener component corresponding to the MessageListener, assuming you created one by calling register

You can create regular MessageListener classes and register them from OSGi, as is done in an existing blade sample (BladeSchedulerEntryMessageListener). Alternatively, you can create a component that provides SchedulerEventMessageListener directly, but that comes with some caveats, which will be described later in the code samples.

Understanding ClusterSchedulerEngine

Now that we have a basic understanding of SchedulerEngine, the next thing we want to know is how scheduler works in a clustered environment.

As noted earlier, ClusterSchedulerEngineConfigurator instantiates a ClusterSchedulerEngine, providing it with an existing SchedulerEngine that will perform all the actual work, as well as any other OSGi-managed classes (because the lifecycle of ClusterSchedulerEngine is managed by Liferay, not OSGi).

The real magic of ClusterSchedulerEngine doesn’t exist in the class itself, but in ClusterableProxyFactory. We essentially instantiate a dynamic proxy around ClusterSchedulerEngine, and our invocation handler checks for a Clusterable annotation on every method declared by its target object. The elements set against that annotation will then be used to determine how that method call should work in a clustered environment.

If you’ve never seen an annotation before, you may want to read up on Annotation Basics from Oracle’s Java tutorials for additional background.

Clusterable

The elements of Clusterable currently have Javadocs, but it’s worth going over them again here.

  • onMaster: Whether to invoke this method on the master node rather than the local node (if the local node is not a master node). If set to true, Liferay will attempt to route the method call to the master node. If the current node is not the master node, parameters will be serialized on the invoking node and sent to the master node, and the return value will be serialized on the master node and sent back to the invoking node.
  • acceptor: If onMaster is not set (or is explicitly set to false), Liferay will load the class implementing ClusterInvokeAcceptor that you specify with this element. Once you specify this value, Liferay will attempt to call an internal _invoke method within ClusterableInvokerUtil on every node. Within this invocation, each node will call the accept method of the ClusterInvokeAcceptor, and if it returns true, the node will proceed to invoke the original method that we had proxied.
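
To make the two elements more concrete, here is a toy example (hypothetical class, method, and acceptor names; this is not the actual SchedulerEngine interface, only an illustration of how the annotation is declared):

    import com.liferay.portal.kernel.cluster.ClusterInvokeAcceptor;
    import com.liferay.portal.kernel.cluster.Clusterable;

    import java.io.Serializable;

    import java.util.Map;

    public class ClusterableExample {

        // Route the call to the master node; the return value is sent back to
        // the invoking node
        @Clusterable(onMaster = true)
        public String readJobMetadata(String jobName) {
            return "metadata for " + jobName;
        }

        // Broadcast the call to every node; each node's acceptor decides whether
        // the method actually runs on that node
        @Clusterable(acceptor = ExampleClusterInvokeAcceptor.class)
        public void updateJobState(String jobName) {
            System.out.println("Updating local state for " + jobName);
        }

        public static class ExampleClusterInvokeAcceptor
            implements ClusterInvokeAcceptor {

            @Override
            public boolean accept(Map<String, Serializable> context) {

                // A real acceptor (such as SchedulerClusterInvokeAcceptor)
                // inspects the context to decide whether this node should
                // proceed with the invocation

                return true;
            }

        }

    }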

The acceptor element can be confusing the first time you encounter it without reading the implementation for ClusterableInvokerUtil.invokeOnCluster. To summarize, within invokeOnCluster, Liferay calls ClusterableInvokerUtil.createMethodHandler to create a serializable method with additional environment information (which, if you’re familiar with functional programming, you can think of as equivalent to creating a closure) that will be invoked on all nodes on a cluster.

With that, we can understand that once the Clusterable annotation is set and the onMaster element is set to false, the caller node will unconditionally attempt to invoke the method on all nodes in the cluster. The ClusterInvokeAcceptor will be called on each callee node to determine whether the method call proceeds on that node.

In other words, the ClusterInvokeAcceptor is used to determine whether a callee node is ready to invoke the method. It does not prevent the caller node from broadcasting the invocation to the cluster.

ClusterSchedulerEngine

So now that we have a background on Clusterable, we can use that background to understand what ClusterSchedulerEngine does in a cluster.

First, let’s look at the methods where the annotation has the element onMaster set to true. It turns out that the only methods that are annotated in this way are the methods that retrieve metadata about scheduled jobs that we’d talked about when we went over LPS-7397.

Therefore, we can think of this as saying that whenever you use the API to retrieve metadata about scheduled jobs, all nodes will ask the master node for that information. This means that no matter what node you are on (whether you’re doing it from a Groovy script or you’re writing your own scheduled job management logic), these API calls will provide what the master node believes is the state of those scheduled jobs.

Next, let’s look at the methods where the annotation has the element acceptor set. It turns out that these are all the methods that modify the state of a job, which were introduced in LEP-6187 (the initial creation of scheduler), LPS-7395 (the API we’d added for a scheduled job management portlet that never came to be), and LPS-7397 (more of the API we’d added for that scheduled job management portlet that never came to be).

All of them have the same value set (SchedulerClusterInvokeAcceptor.class), which is to say that every one of these methods uses the same rules to decide whether to execute the method. This means that if you attempt to modify the state of a scheduled job on any node, all of the other nodes will be informed of the method call.

An important thing to note is that each node decides whether to carry out the invocation via SchedulerClusterInvokeAcceptor. In other words, the acceptor does not decide whether the call is relayed to the rest of the cluster; any method invocation that does not have the onMaster element set to true will always be broadcast to every node in the cluster.

What’s interesting is that when you look at the actual ClusterSchedulerEngine implementation of each of these methods, they have special logic to check whether they are running on the master node. In other words, for scheduler in particular, setting an acceptor behaves much like setting onMaster; the only difference is whether all of the other nodes in the cluster are notified of the call.

This leaves the methods that have no annotation at all, which include the initialization and the destruction of the scheduler engine implementation.

To summarize everything so far: there are exactly two methods in ClusterSchedulerEngine (start and shutdown) that are not broadcast to the cluster. With the built-in Liferay implementation based on Quartz, Quartz itself has some cluster-management capabilities built into it. As a result, while it’s possible that non-default implementations of a scheduler might want clustering logic for these methods, Liferay itself won’t provide it out of the box, because its default implementation does not need it.

Importantly, for all other methods that are part of the SchedulerEngine interface, calling the API is equivalent to asking the master node to execute the method, while all other nodes update metadata assuming that the master node has completed that execution, unless a given node is not yet ready to recognize the invocation, as determined by SchedulerClusterInvokeAcceptor.

Disappearing Scheduled Jobs

So with all of this knowledge, we can now revisit the disappearing MEMORY_CLUSTERED scheduled jobs problem mentioned earlier.

To understand what happened, first we need some background information on MEMORY_CLUSTERED, and how it relates to everything we’ve learned about ClusterSchedulerEngine, which provides the functionality.

MEMORY_CLUSTERED jobs were added in LPS-15343. Conceptually, a MEMORY_CLUSTERED job is a job that is not persisted in any way (in the default Quartz implementation, it uses a RAMJobStore), but it retains the desirable property of a persisted job, namely that only one node executes the scheduled job. This is achieved through ClusterSchedulerEngine, which limits scheduled jobs to one node by only notifying the scheduler engine of one node about the job.

In the initial implementation, the first node to successfully acquire an entry in the Lock_ table in Liferay would run MEMORY_CLUSTERED scheduled jobs. With LPS-51058, this was converted into running jobs on the coordinator node elected by JGroups, which effectively resulted in MEMORY_CLUSTERED jobs always running only on the node designated as the JGroups coordinator. Then, in order to resolve LPS-66858, Liferay added a solution where each node in a cluster maintains metadata on the timing of scheduled jobs, so that it remembers the proper start times for MEMORY_CLUSTERED jobs if it is chosen as the new JGroups coordinator.

The root cause of the problem lay in assumptions about what would happen with the following block of code:
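
(The sketch below uses hypothetical names rather than the customer’s exact code; the key detail is the unschedule call inside the component’s deactivate method.)

    @Deactivate
    protected void deactivate() throws SchedulerException {

        // Stop listening for scheduler events
        _schedulerEngineHelper.unregister(_messageListener);

        // The problematic call: this does not simply clean up local references,
        // it asks the scheduler engine to stop running the job
        _schedulerEngineHelper.unschedule(
            _schedulerEntry, StorageType.MEMORY_CLUSTERED);
    }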

It’s easy to understand why the code was written in this way. Intuitively, the idea is that if you manually deactivate your component (usually by stopping a bundle, for example), you’ll want to make sure that you clean up any references to it as well.

However, in practice, unschedule isn’t a clean-up method. It specifically asks the scheduler engine implementation to stop running the scheduled job. In order to force a MEMORY_CLUSTERED job to stop running a scheduled job, the master node must stop running the job, and all nodes need to make sure that they also remember that the scheduled job shouldn’t be running any more.

While this works fine in a single-node environment, or with a manual deactivation, the trouble occurs when something the developer doesn’t anticipate deactivates the component. In this case, that something was a server shutdown in a clustered environment.

To understand this more clearly, imagine a situation where you’ve shut down a node. As part of the shutdown process, Liferay unregisters the ModuleServiceLifecycle service (once the server shutdown process starts, the portal is no longer initialized), and this component loses its reference to ModuleServiceLifecycle. Because our scheduled job component has flagged the ModuleServiceLifecycle as required (the default), OSGi proceeds to deactivate our scheduled job component and invokes our component’s deactivate method.

From here, the job will be unscheduled. The natural follow-up question is, "When will the scheduled job start to run again?"

Since unschedule is annotated with Clusterable, the call to unschedule will be broadcast to all active nodes of the cluster, and it will be executed on any node where SchedulerClusterInvokeAcceptor returns true when invoked. Naturally, all non-coordinator nodes in the cluster proceed to update their metadata and forget about the scheduled job, as a side-effect of LPS-66858, and the coordinator node also stops running the job, so any time a new node starts up and tries to retrieve metadata on scheduled jobs, that job will be missing.

Since all nodes of the cluster have forgotten about the scheduled job (again, because unschedule actually unschedules the job), including any existing master node, then as a side-effect of LPS-66858, Liferay will never recover the lost scheduled job unless a newly starting node is immediately designated as the JGroups coordinator. In practice, that only happens when the entire cluster is brought down and a single new node is started, and the situation will recur on the next shutdown.

Liferay Scheduler Code Samples

We’ll assume that our scheduled job is configurable, following the tutorial on Making Your Applications Configurable in the developer guide, and that our configuration class, ExampleSchedulerConfiguration, has a getter method interval that returns the number of seconds between each scheduled job firing.
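
For reference, a sketch of that configuration interface (following the naming used in this post; the actual sample class may differ slightly) could look like this:

    import aQute.bnd.annotation.metatype.Meta;

    @Meta.OCD(id = "com.example.ExampleSchedulerConfiguration")
    public interface ExampleSchedulerConfiguration {

        // Number of seconds between each firing of the scheduled job
        @Meta.AD(deflt = "60", required = false)
        public int interval();

    }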

From there, I’ve created two sample scheduled job classes, ExampleMessageListener and ExampleSchedulerEventMessageListener, so that you can see the slight differences in the two implementations.

The two sample classes are described below.

Common Boilerplate

Next, we’ll think about all of the boilerplate that’s associated with creating scheduled jobs. This boilerplate is marked in the two sample scheduled job classes created for this post.

Since we generate triggers using TriggerFactory, we’ll ask OSGi to provide us with a reference to it.
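
In code, that is a standard Declarative Services field injection:

    // com.liferay.portal.kernel.scheduler.TriggerFactory, injected by OSGi
    @Reference
    private TriggerFactory _triggerFactory;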

Note that Liferay won’t provide a reference to a TriggerFactory if scheduler is completely disabled, which means that adding a reference to a TriggerFactory also ensures that our component will not activate if scheduler is disabled.

From here, since every implementation of a scheduled job needs to create a SchedulerEntry, we’ll add a convenience method that returns one, based on a trigger generated from a configuration (in this case, ExampleSchedulerConfiguration).
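
A sketch of that convenience method (the createTrigger and SchedulerEntryImpl signatures shown here may vary slightly between versions) might look like this:

    // Build a SchedulerEntry whose trigger fires every configuration.interval() seconds
    protected SchedulerEntry createSchedulerEntry(
        ExampleSchedulerConfiguration configuration) {

        String jobName = getClass().getName();
        String groupName = getClass().getName();

        // No fixed start or end date, just an interval in seconds
        Trigger trigger = _triggerFactory.createTrigger(
            jobName, groupName, null, null, configuration.interval(),
            TimeUnit.SECOND);

        return new SchedulerEntryImpl(jobName, trigger);
    }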

Because our scheduled job doesn’t have a fixed start time, it’s possible for the scheduled job not to fire at all if we attempt to schedule it before Liferay begins scheduling jobs. Even though missed jobs will still fire as long as the delay is within the misfireThreshold, that value is only five seconds by default, and Liferay doesn’t change the default for memory clustered jobs.

To avoid that problem, a common approach is to wait until the portal itself has started up. This next part is not necessary if your scheduled job has a predictable start time (cron expression, or you have a non-null start time that you’ve configured in some other way).
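
Waiting for portal startup can be expressed as another required reference, using a common Liferay pattern:

    // Do not activate this component until the portal itself has finished initializing
    @Reference(target = ModuleServiceLifecycle.PORTAL_INITIALIZED)
    private ModuleServiceLifecycle _moduleServiceLifecycle;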

Then, we need code that will do the work of our scheduled job. In this example, we’ll just print a message using System.out.
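
A minimal sketch of that method (named doReceive so that it can also serve as the BaseMessageListener override in the first variant below):

    // The actual work of the scheduled job; a real job would do something useful here
    protected void doReceive(Message message) throws Exception {
        System.out.println("Scheduled job ran at " + new java.util.Date());
    }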

At this point, the implementations will diverge, because there are two ways you can implement a scheduled job in 7.x. The approaches do not differ very much in terms of lines of code, but we’ll outline both approaches in the coming sections in case one is easier to understand than the other.

Using BaseMessageListener

You can create a scheduled job by extending BaseMessageListener, as is done in a lot of modern Liferay code and in the ExampleMessageListener code sample created for this post. You can also do this by extending the deprecated BaseSchedulerEntryMessageListener, as is done in the Liferay blade samples.

In both cases, the convention is to use your current class as the provided service, so that nothing accidentally finds it, because this message listener doesn’t have any real meaning unless it’s wrapped (more on that later).

Next, when our component activates, we’ll want to use SchedulerEngineHelper to adapt our MessageListener as a SchedulerEventMessageListener, register the adapted wrapper as an OSGi component, and remember it so that it can be manually unregistered later. We do this by using the register method documented earlier.
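
A sketch of that activation logic, reusing the hypothetical configuration class and the createSchedulerEntry convenience method from the boilerplate above:

    @Activate
    @Modified
    protected void activate(Map<String, Object> properties) {
        ExampleSchedulerConfiguration configuration =
            ConfigurableUtil.createConfigurable(
                ExampleSchedulerConfiguration.class, properties);

        // Adapt this MessageListener, register the adapted wrapper as an OSGi
        // component, and remember it so it can be unregistered later
        _schedulerEngineHelper.register(
            this, createSchedulerEntry(configuration),
            DestinationNames.SCHEDULER_DISPATCH);
    }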

We will also want to call this when the configuration is modified, because the register method will take care of updating the adapted wrapper through its service reference. Once the adapted wrapper’s service reference has its configuration modified, this triggers a modifiedService on the service tracker, which also takes care of updating the scheduled job. For this reason, in the Liferay blade samples, you don’t see any special logic in @Modified annotated methods.

Finally, we’ll also want to make sure that the adapted wrapper is properly removed if our component gets deactivated (the bundle stops, or one of its service dependencies disappears). This can be done by calling unregister.
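
The corresponding cleanup is a single call:

    @Deactivate
    protected void deactivate() {

        // Remove the adapted wrapper; note that this does not broadcast an
        // unschedule call to the rest of the cluster
        _schedulerEngineHelper.unregister(this);
    }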

Using SchedulerEventMessageListener

You can also create a scheduled job by creating a component that provides the SchedulerEventMessageListener service and implements that interface, as is done in the ExampleSchedulerEventMessageListener code sample created for this post.

Next, we need to actually implement the interface, which requires that we return a SchedulerEntry. To achieve this, we make sure that we notice both when the component activates and when the configuration changes, and that we return a SchedulerEntry reflecting the updated configuration. Liferay and SchedulerEventMessageListenerWrapper will automatically handle the rest.

As long as all you want is a scheduled job, your implementation is complete as soon as you implement a receive method, which could just delegate work to the doReceive from the boilerplate code we had before (or you could name that method receive from the beginning).
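
Putting those pieces together, a minimal sketch (hypothetical names, reusing the boilerplate shown earlier) looks roughly like this:

    @Component(
        configurationPid = "com.example.ExampleSchedulerConfiguration",
        immediate = true, service = SchedulerEventMessageListener.class
    )
    public class ExampleSchedulerEventMessageListener
        implements SchedulerEventMessageListener {

        @Override
        public SchedulerEntry getSchedulerEntry() {
            return _schedulerEntry;
        }

        @Override
        public void receive(Message message) throws MessageListenerException {
            try {

                // Delegate to the doReceive method from the common boilerplate
                doReceive(message);
            }
            catch (Exception exception) {
                throw new MessageListenerException(exception);
            }
        }

        @Activate
        @Modified
        protected void activate(Map<String, Object> properties) {
            ExampleSchedulerConfiguration configuration =
                ConfigurableUtil.createConfigurable(
                    ExampleSchedulerConfiguration.class, properties);

            // Rebuild the SchedulerEntry whenever the configuration changes
            _schedulerEntry = createSchedulerEntry(configuration);
        }

        // The doReceive and createSchedulerEntry methods, the TriggerFactory
        // reference, and the ModuleServiceLifecycle reference are the same
        // boilerplate shown earlier

        private volatile SchedulerEntry _schedulerEntry;

    }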

However, we have to keep in mind that when switching from Spring to OSGi, Liferay still hasn’t quite broken free from using wrappers to implement functionality. This means that if we choose the SchedulerEventMessageListener route, we have to take care of any functionality implemented in the wrapper that we want to make use of in our implementation.

For example, if you’ve decided to enable auditing scheduled events (as noted earlier, this was a feature added with LPS-25385, disabled by default), SchedulerEventMessageListenerWrapper is what normally calls the API to broadcast the audit event, and so you’ll need to call it from your scheduled job as well.
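
A sketch of what that might look like inside receive, assuming the same auditSchedulerJobs helper method that the wrapper normally calls:

    @Override
    public void receive(Message message) throws MessageListenerException {
        try {

            // SchedulerEventMessageListenerWrapper would normally broadcast this
            // audit event on our behalf
            _schedulerEngineHelper.auditSchedulerJobs(message, TriggerState.NORMAL);
        }
        catch (Exception exception) {

            // Auditing is best effort here; do not let it break the job itself
            exception.printStackTrace();
        }

        // ...then do the actual work of the scheduled job
    }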

Since there is no way for anyone to know how many implementation details will be added to the wrapper class, implementing SchedulerEventMessageListener directly isn’t common in Liferay code examples, because you would lose all of the additional functionality provided by the wrappers.
