Cluster Aware Upgrade Processes

Make your Upgrade Process implementations cluster-aware to avoid running them on all nodes in the cluster.

I was recently helping a client with an upgrade process that had run into a little problem...

A few of the model hint changes had not been applied to some of the columns in their Service Builder services and, as we all know, Service Builder will not ensure the columns are changed in a production environment (well, actually in any environment where the schema.module.build.auto.upgrade property is not set to true).
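
As a reminder, that's a portal property, so on a local or development environment it is typically enabled with a line like this in portal-ext.properties:

schema.module.build.auto.upgrade=true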

For these environments, Liferay recommends using an upgrade process to handle making the changes in those environments and even provides very helpful documentation to build one of these here: https://help.liferay.com/hc/en-us/articles/360031165751-Creating-Upgrade-Processes-for-Modules

The team had followed the instructions, step by step, and everything was working fine locally. The next step was to send it to the single-node development environment, and there it also worked swimmingly.

The next step was to promote it to the two-node UAT environment, and here it crashed and burned hard.

Both of the UAT nodes started to process the upgrade at the same time. Each node found that the current version in the Release table was older than the latest registered version, found the upgrade step to get to the next version, and both tried to execute the commands to alter the columns, but they couldn't both do it.

Long story short, the upgrade failed, but the Release and ServiceComponent tables thought it had succeeded, and neither node could start up the services because of version mismatch errors... We finally had to take the update out and restore the database to get back to a known working state.

If the team had been able to do a staggered deployment, only sending the upgrade process to a single node in the cluster and then later on sending it to the second node, all would have been fine. Only one node would have processed the upgrade, the nodes wouldn't step on each other, and when deployed to the second node the Release table would already have been at the later version, so there would have been no reason to try and run the upgrade...

Sometimes (as in their case) a staggered deployment isn't possible, which raises the question: is there anything that can be done to deal with this scenario?

I think there is, and I'm here to present that solution today: Cluster-Aware Upgrade Processes.

The challenge is how to run something, such as an upgrade process, on only a single node in the cluster and avoid having multiple nodes try and step on each other.

With a few small changes to the UpgradeStepRegistrator and injection of some Cluster Executor classes, you can define upgrade processes that will only register steps on the cluster leader, and non-leaders will not get the steps and not run the upgrade processes.

Let's start from Liferay's example for registering the upgrade steps; we'll modify it as we go:

package com.liferay.mycustommodule.upgrade;

import com.liferay.mycustommodule.upgrade.v2_0_0.UpgradeBar;
import com.liferay.portal.kernel.upgrade.DummyUpgradeStep;
import com.liferay.portal.upgrade.registry.UpgradeStepRegistrator;

import org.osgi.service.component.annotations.Component;

@Component(immediate = true, service = UpgradeStepRegistrator.class)
public class MyCustomModuleUpgrade implements UpgradeStepRegistrator {

    @Override
    public void register(Registry registry) {
        registry.register(
            "com.liferay.mycustommodule", "0.0.0", "2.0.0",
            new DummyUpgradeStep());

        registry.register(
            "com.liferay.mycustommodule", "1.0.0", "1.1.0",
            new com.liferay.mycustommodule.upgrade.v1_1_0.UpgradeFoo());

        registry.register(
            "com.liferay.mycustommodule", "1.1.0", "2.0.0",
            new com.liferay.mycustommodule.upgrade.v2_0_0.UpgradeFoo(),
            new UpgradeBar());
    }
}

UpgradeFoo is the class that actually alters the table column sizes, and this is not something that we want every node in the cluster to try. In the example above, we can see that UpgradeFoo is used in both the 1.0.0 -> 1.1.0 step as well as the 1.1.0 -> 2.0.0 step, so we want to protect those steps from running on all nodes.
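
For context (this isn't part of Liferay's example, and the model impl class, column name and size here are just illustrative), an upgrade step like UpgradeFoo typically extends Liferay's UpgradeProcess and does its work in doUpgrade(), something like:

package com.liferay.mycustommodule.upgrade.v1_1_0;

import com.liferay.mycustommodule.model.impl.FooModelImpl;
import com.liferay.portal.kernel.upgrade.UpgradeProcess;

public class UpgradeFoo extends UpgradeProcess {

  @Override
  protected void doUpgrade() throws Exception {
    // widen the (hypothetical) description column; AlterColumnType is an
    // inner class inherited from UpgradeProcess
    alter(
      FooModelImpl.class,
      new AlterColumnType("description", "VARCHAR(4000) null"));
  }
}

Using alter() with AlterColumnType, rather than hand-writing the SQL, lets Liferay generate the database-specific DDL for you.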

We'll start by adding a couple of imports:

import com.liferay.portal.kernel.cluster.ClusterExecutor;
import com.liferay.portal.kernel.cluster.ClusterMasterExecutor;

These imports will provide the services we'll need to determine if we're on the cluster leader or not. Our MyCustomModuleUpgrade class will also get some new dependencies:

@Reference
private volatile ClusterMasterExecutor _clusterMasterExecutor;
@Reference
private volatile ClusterExecutor _clusterExecutor;

I apologize for having to use names here that some might find offensive. I've tried like the dickens to convince Liferay that there are suitable alternative names that could be used to avoid the problematic ones, but there is a great hesitancy (one that I also understand) to change working code that many community members and customers rely on, since doing so could impose a significant, detrimental impact on those environments.

Since we're @Referencing these guys in, we know that our upgrade step registrar is not going to be able to run until those services have been made available by Liferay and are ready to inject into our instance.

Now, we can use these in the register() method to only register if the node is the cluster leader:

boolean clusterLeader = false;

// determine whether this node is the cluster leader
if (_clusterExecutor.isEnabled()) {
  clusterLeader = _clusterMasterExecutor.isMaster();
} else {
  // not in a cluster, so this is the leader
  clusterLeader = true;
}

// we will only register this step on the cluster leader
if (clusterLeader) {
  ...

We use the two Liferay services to determine if the node is the leader and, if it is, we're then okay to register the upgrade steps.

Our final class ends up looking like:

package com.liferay.mycustommodule.upgrade;

import com.liferay.mycustommodule.upgrade.v2_0_0.UpgradeBar;
import com.liferay.portal.kernel.cluster.ClusterExecutor;
import com.liferay.portal.kernel.cluster.ClusterMasterExecutor;
import com.liferay.portal.kernel.upgrade.DummyUpgradeStep;
import com.liferay.portal.upgrade.registry.UpgradeStepRegistrator;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(immediate = true, service = UpgradeStepRegistrator.class)
public class MyCustomModuleUpgrade implements UpgradeStepRegistrator {

  @Override
  public void register(Registry registry) {
    registry.register(
      "com.liferay.mycustommodule", "0.0.0", "2.0.0",
      new DummyUpgradeStep());

    boolean clusterLeader = false;

    // determine whether this node is the cluster leader
    if (_clusterExecutor.isEnabled()) {
      clusterLeader = _clusterMasterExecutor.isMaster();
    } else {
      // not in a cluster, so this is the leader
      clusterLeader = true;
    }
		
    // we will only register this step on the cluster leader
    if (clusterLeader) {
      registry.register(
        "com.liferay.mycustommodule", "1.0.0", "1.1.0",
        new com.liferay.mycustommodule.upgrade.v1_1_0.UpgradeFoo());
			
      registry.register(
        "com.liferay.mycustommodule", "1.1.0", "2.0.0",
        new com.liferay.mycustommodule.upgrade.v2_0_0.UpgradeFoo(),
        new UpgradeBar());
    }
  }
  
  @Reference
  private volatile ClusterMasterExecutor _clusterMasterExecutor;
  @Reference
  private volatile ClusterExecutor _clusterExecutor;
}

And the Minions?

So this code will work to run the upgrade steps on the cluster leader, but what about the other nodes?

As is, the nodes would do nothing. They would start up, they would not have any registered upgrade steps to execute, so they'd be ready to start serving traffic.

This is important to keep in mind in certain scenarios...

Imagine if you have updated your Service Builder service to use a new column, and you followed Liferay's guidance to extend the UpgradeProcess to create the new column, plus you wrote some code to pre-populate the columns of existing rows so your data model will be consistent...

Let's say your upgrade process fails on the cluster leader; the other nodes aren't going to know that it failed. The other nodes might expect that the column was already added, that the data was already populated, and the code might not be able to handle the case where either or both of those assumptions don't actually hold.

You can actually guard against this in some cases. For example, you can add an @Reference for a Release instance like:

@Reference(
  target = "(&(release.bundle.symbolic.name=com.example.mycustom.service)(release.schema.version>=2.0.0))"
)
private Release _release;

This will actually prevent the @Component from starting unless version 2.0.0 or greater is available. Note, however, that the cluster leader is not going to send out a notification that it has finished the upgrade so that these @References can be resolved; it will take a node restart for them to resolve.

A great place to add this (as long as your upgrade step registrar, and therefore your upgrade steps, are not dependent upon it) is your XxxLocalServiceImpl class. As in our case above, if the UpgradeBar class is adding the column and populating it, not allowing XxxLocalServiceImpl to start would prevent any other component dependent upon the service from starting. But if your upgrade step registrar and/or the upgrade steps are dependent upon the service, you'd have a deadlock: the service would be blocked because the upgrade steps didn't run, and the upgrade steps wouldn't run because they depend upon the service, which hasn't started yet.
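
To make that concrete, here is a rough sketch of what that guard could look like in a FooLocalServiceImpl (this assumes a 7.2+ Service Builder module where the *LocalServiceImpl is published as an AopService component; the packages, bundle symbolic name and model class here are hypothetical):

package com.example.mycustom.service.impl;

import com.example.mycustom.service.base.FooLocalServiceBaseImpl;

import com.liferay.portal.aop.AopService;
import com.liferay.portal.kernel.model.Release;

import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(
  property = "model.class.name=com.example.mycustom.model.Foo",
  service = AopService.class
)
public class FooLocalServiceImpl extends FooLocalServiceBaseImpl {

  // the service (and everything that depends upon it) will not activate until
  // the module's Release record reports schema version 2.0.0 or later
  @Reference(
    target = "(&(release.bundle.symbolic.name=com.example.mycustom.service)(release.schema.version>=2.0.0))"
  )
  private Release _release;
}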

So this is by no means an ideal solution, but it can be a preventative measure to avoid messy situations where your upgrade is not ready and your code would fail badly as a result. In this case, not starting is going to be better than starting, failing, and having to clean up data later on...

Conclusion

So with this relatively minor change to our upgrade step registration, we receive the following benefits:

  • We don't have to do anything special for cluster startups.
  • Sensitive upgrade steps can run on a single node.
  • We prevent parallel execution of upgrade steps on all nodes in the cluster.

If you build a lot of Upgrade Steps (like I often do), keep this trick in mind to avoid cluster upgrade step issues.

Comments

> Note, however, that the cluster leader is not going to send out notifications that it has finished the upgrade so these @References can be resolved, it will take a node restart for them to resolve.

Does this mean after the cluster leader has completed the upgrade, we should restart the other nodes?

If you are going to use the Release reference to verify that the version is available, then yes, you'd have to restart the cluster to get the other nodes to pick up the new release.

Personally, I'd be selective about using this technique, if only because it would force the cluster restart...

There are absolutely some use cases where you really wouldn't want to proceed unless the version matched, but I don't think that all use cases will automatically fall into that class.