How to diagnose and recover a Liferay cluster

This blog post is also available in Spanish.

 

Liferay is designed to scale horizontally by supporting the addition of new nodes. These nodes share the same information, but are also able to handle different requests simultaneously. You can find much more information in Liferay DXP Clustering.

Liferay cluster synchronization and communication rely on JGroups, so much of the information we're going to talk about can also be found in the JGroups documentation. Scheduling capabilities, on the other hand, depend on Quartz, and knowing how it works helps when dealing with Liferay scheduled jobs.

Understanding how both interact with Liferay helps to make sense of some of the errors that commonly appear when running a cluster. Usually when these errors appear the only solution seems to be restarting the nodes, but sometimes other approaches can be followed. We'll talk about them throughout this blog post.

 

 

1. Introduction

Basically, you can enable clustering just by configuring the following property:

cluster.link.enabled=true

 

With that default configuration, once you start the server you'll see that Liferay builds a couple of channels:

  • transport-0: This channel handles the communication needed to send Liferay's cache invalidation messages. Up to 10 transport channels can be configured.
  • control: This channel's purpose is to handle communication from a cluster hierarchy point of view, ensuring job synchronization so that Liferay can keep track of which node is executing each type of job.
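
For reference, the channels above can be tuned through portal properties. Here's a minimal sketch; cluster.link.enabled is the property shown above, while the multicast values are the stock portal.properties defaults at the time of writing and should be treated as assumptions to verify against your Liferay version:

    cluster.link.enabled=true

    # Multicast groups used, by default, by the control and transport-0 channels.
    multicast.group.address["cluster-link-control"]=239.255.0.1
    multicast.group.port["cluster-link-control"]=23301
    multicast.group.address["cluster-link-udp"]=239.255.0.2
    multicast.group.port["cluster-link-udp"]=23302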

So, basically, there are two main concerns:

  • Liferay content synchronization.
  • Liferay scheduled jobs management.

 

 

2. Cache Types

Liferay's cache configuration relies on Ehcache. It’s an independent framework used by Liferay’s data access and template engine components (more information on Introduction to Cache Configuration). It manages two pools:

  • Multi-VM: Cache is replicated among cluster nodes.
  • Single-VM: Cache isn't replicated among cluster nodes; it's handled independently per VM. It can be used for objects and references that don't need to be replicated among nodes.

When dealing with cache replication, popular belief is that Liferay replicates cache values. That's not the default behavior, although it can be obtained by configuring properties like replicatePutsViaCopy or replicateUpdatesViaCopy. You can find more information about this in Liferay's Clustering API.

So, in the end, by default, synchronization is done by invalidation, not by value. This means that when the value of a key is changed in one node, the remaining nodes are notified and forced to reconcile their information against the common source of truth, i.e., the database.

The synchronization is accomplished using the transport channels, so a healthy channel that includes all cluster nodes is required to ensure correct content synchronization.

As a high-level picture, you can imagine every node having its own cache space managed by Ehcache and, in the case of MultiVM, Liferay performing background synchronization tasks to ensure that no cache value is obsolete.

SingleVM

To check that there's no cache synchronization at all when dealing with SingleVM, the following steps can be followed. To keep it simple they involve just a couple of nodes, but they can be used to check any number of combinations:

  1. Let's take one of the nodes and, to differentiate it, name it Node1. Then execute the following Groovy script:

    import com.liferay.portal.kernel.cache.SingleVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private SingleVMPool getSingleVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(SingleVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    SingleVMPool singleVMPool = getSingleVMPool();
    PortalCache portalCache = singleVMPool.getPortalCache("CACHE-TEST");
    
    portalCache.put("testKey", "testValue1");
    out.println("=== Value for testKey: " + portalCache.get("testKey"));

    This code simply puts the value testValue1 under the key testKey in the CACHE-TEST cache, and then prints it.

    From a low-level perspective, no synchronization code is involved.

     

  2. Execute the following Groovy script on a different cluster node (named Node2 to differentiate it), whose (lack of) synchronization with Node1 is what we want to test:

    import com.liferay.portal.kernel.cache.SingleVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private SingleVMPool getSingleVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(SingleVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    SingleVMPool singleVMPool = getSingleVMPool();
    PortalCache portalCache = singleVMPool.getPortalCache("CACHE-TEST");
    
    out.println("=== Value for testKey: " + portalCache.get("testKey"));
    portalCache.put("testKey", "testValue2");
    out.println("=== Value for testKey: " + portalCache.get("testKey"));
    

    The script checks whether there's any value cached in Node2, in the CACHE-TEST cache, for the key testKey. There shouldn't be (hence the null return) because nothing has been put there before.

    Afterwards Node2 puts a new value for testKey. Again, no synchronization code is involved.

     

  3. Execute the following script, this time again on Node1:

    import com.liferay.portal.kernel.cache.SingleVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private SingleVMPool getSingleVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(SingleVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    SingleVMPool singleVMPool = getSingleVMPool();
    PortalCache portalCache = singleVMPool.getPortalCache("CACHE-TEST");
    
    out.println("=== Value for testKey: " + portalCache.get("testKey"));

    This checks that the value for testKey remains the same (testValue1) because no synchronization with Node2 was involved.

 

MultiVM

To check whether the current cache synchronization is working as expected (and also to understand how the default behavior works), the following steps can be followed. To keep it simple they involve just a couple of nodes, but they can be used to check any number of combinations:

  1. Let's take one of the nodes and, to differentiate it, name it Node1. Then execute the following Groovy script:

    import com.liferay.portal.kernel.cache.MultiVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private MultiVMPool getMultiVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(MultiVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    MultiVMPool multiVMPool = getMultiVMPool();
    PortalCache portalCache = multiVMPool.getPortalCache("CACHE-TEST");
    
    portalCache.put("testKey", "testValue1");
    out.println("=== Value for testKey: " + portalCache.get("testKey"));

    This code simply puts the value testValue1 under the key testKey in the CACHE-TEST cache, and then prints it.

    From a low-level perspective, this code adds the value testValue1 for the key testKey to the CACHE-TEST cache in Node1's cache space. Meanwhile, an invalidation message is sent to the remaining nodes so that they remove any value associated with testKey in the CACHE-TEST cache in their own cache space.

     

  2. Execute the following Groovy script on a different cluster node (named Node2 to differentiate it), whose connectivity with Node1 is what we want to test:

    import com.liferay.portal.kernel.cache.MultiVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private MultiVMPool getMultiVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(MultiVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    MultiVMPool multiVMPool = getMultiVMPool();
    PortalCache portalCache = multiVMPool.getPortalCache("CACHE-TEST");
    
    out.println("=== Value for testKey: " + portalCache.get("testKey"));
    portalCache.put("testKey", "testValue2");
    out.println("=== Value for testKey: " + portalCache.get("testKey"));

    The script checks whether there's any value cached in Node2, in the CACHE-TEST cache, for the key testKey. There shouldn't be (hence the null return) because it was invalidated when the value was put in the first step (on Node1).

    Afterwards, Node2 puts a new value for testKey. This repeats the synchronization process of the previous step, meaning that Node2's cache space should now have testValue2 as the value for testKey and all remaining nodes, including Node1, should have removed their previous value for testKey.

     

  3. Execute the following script, this time again on Node1:

    import com.liferay.portal.kernel.cache.MultiVMPool;
    import com.liferay.registry.RegistryUtil;
    import com.liferay.registry.Registry;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.registry.ServiceReference;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;
    
    private MultiVMPool getMultiVMPool() {
        Registry registry = RegistryUtil.getRegistry();
    
        ServiceReference serviceReference = registry.getServiceReference(MultiVMPool.class);
    
        return registry.getService(serviceReference);
    }
    
    out.println("=== Execution in: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: "+ ClusterMasterExecutorUtil.isMaster());
    
    MultiVMPool multiVMPool = getMultiVMPool();
    PortalCache portalCache = multiVMPool.getPortalCache("CACHE-TEST");
    
    out.println("=== Value for testKey: " + portalCache.get("testKey"));

    This checks that an invalidation did in fact arrive at Node1, causing the removal of the value for the key testKey.

 

The transport channel(s) are the ones involved in the cache replication process, so any unexpected problem regarding cache synchronization should not be attributed to the control channel.

 

3. Scheduled Jobs

There are three types of scheduled jobs, defined in StorageType:

  • MEMORY: Executed on each node. Their statuses live only in memory, meaning that after a restart the information about their previous execution time or their next fire time gets reset.
  • MEMORY_CLUSTERED: Only executed on the master node. Their statuses only live in memory, but this information is shared across all cluster nodes. We'll talk about this later.
  • PERSISTED: Intended to keep their statuses available after every server restart. It's the only type that persists its information in the database, in the tables defined by persisted.scheduler.org.quartz.jobStore.tablePrefix.

To manage fire times and decide when a job should be triggered, Liferay relies on Quartz. This library acts as a scheduler that notifies Liferay when a job must be executed. Here's a brief introduction to how Liferay integrates with Quartz and how each type is defined by different characteristics.
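
To see which jobs of each type a node currently knows about, a minimal sketch like the following can be run from the script console. It assumes that SchedulerEngineHelperUtil.getScheduledJobs(StorageType) is available in your Liferay version:

    import com.liferay.portal.kernel.scheduler.SchedulerEngineHelperUtil;
    import com.liferay.portal.kernel.scheduler.StorageType;

    // Print the jobs currently known to this node, grouped by storage type.
    [StorageType.MEMORY, StorageType.MEMORY_CLUSTERED, StorageType.PERSISTED].each { storageType ->
        def jobs = SchedulerEngineHelperUtil.getScheduledJobs(storageType);

        out.println("=== " + storageType + ": " + jobs.size() + " job(s)");

        jobs.each { job ->
            out.println("      " + job.getGroupName() + " / " + job.getJobName());
        }
    }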

 

MEMORY

These jobs are meant to be executed on every node. Each time a MEMORY job is deployed, Liferay uses Quartz to schedule it. Quartz is not responsible for the execution; it only manages the scheduling.

When Quartz detects that a job needs to be executed, it notifies Liferay. This causes a message to be added to Liferay's message bus with the job execution information and the scheduler destination as destination name. The listener configured for the job, which reads from that scheduler destination, executes the job logic.

The information about the job schedule lives only in memory, so the status of previous misfires or successful executions is lost when the server is stopped. There is also no synchronization between nodes.

 

MEMORY_CLUSTERED

Only one node is in charge of executing this kind of job. Which one? The master node. But since the master node can change over time, there needs to be synchronization between the different nodes.

These are the most commonly used jobs. They are useful for housekeeping tasks over a common resource like the database. Since many tasks don't depend on the node from which they are executed, there is no need to overload several nodes with the same repetitive process.

To guarantee this coordination, every slave node keeps in memory a snapshot of the jobs that the master has scheduled in Quartz. If the master node stops, a slave assumes its role and transfers the saved jobs from memory to Quartz.

Coordination is handled by the following class: https://github.com/liferay/liferay-portal/blob/master/modules/apps/portal-scheduler/portal-scheduler-multiple/src/main/java/com/liferay/portal/scheduler/multiple/internal/ClusterSchedulerEngine.java

This is the flow each time a MEMORY_CLUSTERED job gets deployed:

  • If the node is a slave: the slave sends a message to the master node to check whether the master already has the job deployed and, only in that case, saves the job in memory. This ensures that all cluster nodes share the same configuration at any given moment.
  • If the node is the master: the job is scheduled on the master node using Quartz. After that, the master node notifies all slave nodes, asking them to save the job in memory.

The control channel is the one that handles all of this communication: every job synchronization message is sent through it.

 

PERSISTED

For every type (MEMORY, MEMORY_CLUSTERED and PERSISTED), job statuses and trigger dates are handled by Quartz; this is common to all three. But with PERSISTED jobs the coordination is also delegated to Quartz, and Quartz uses the database to keep this information, specifically Liferay's QUARTZ_* tables.
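
The table prefix is controlled by the portal property already mentioned above; a minimal sketch showing it with the prefix that the QUARTZ_* tables imply:

    persisted.scheduler.org.quartz.jobStore.tablePrefix=QUARTZ_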

So Quartz decides on which node a given job is executed. Even if all nodes are stopped, the information about when to fire the next job is kept in the database and is available to the next node that starts up.

Out of the box Liferay doesn't provide any PERSISTED job, but they can be useful depending on the job's importance, and when it must be executed as soon as a node starts even if all nodes were down at the time it was originally due.

 

As a brief summary, from a node role perspective, these are the responsibilities of each:

  • Master:
    • Always executes MEMORY jobs.
    • Always executes MEMORY_CLUSTERED jobs.
    • Executes PERSISTED jobs only when Quartz decides so (Quartz is blind to Liferay's master/slave division).
  • Slave:
    • Always executes MEMORY jobs.
    • Never executes MEMORY_CLUSTERED jobs, but saves them in memory.
    • Executes PERSISTED jobs when Quartz decides so.

 

 

4. Action begins: Cluster start

Before digging into the point where every node gets started, the following recommendation is worth remembering: when several nodes need to be started, do it sequentially, not in parallel, to avoid concurrency problems.

Regarding the startup you'll find several milestones where important information related to the cluster creation can be retrieved:

 

4.1 Channel creation

While starting a node, the creation of the JGroups channels is the first relevant cluster-related action to look for. By default, at least a couple of channels are built:

 

  • Control Channel:
    [JGroupsClusterChannel:112] Create a new JGroups channel {channelName: liferay-channel-control, localAddress: malvaro-ThinkPad-T450s-15069, properties: UDP(time_service_interval=500;thread_pool_max_threads=100;mcast_group_addr=239.255.0.1
  • Transport-0 Channel:
    [JGroupsClusterChannel:112] Create a new JGroups channel {channelName: liferay-channel-transport-0, localAddress: malvaro-ThinkPad-T450s-28377, properties: UDP(time_service_interval=500;thread_pool_max_threads=100;mcast_group_addr=239.255.0.2

 

Each channel has, by default, its own configuration.

When the node has started and both channels are created, the node can be considered part of a Liferay cluster. Initially, when only one node is up, there will be only one node in the cluster, but when a new one sharing the same cluster configuration gets started, it joins each channel and, from then on, is part of the same Liferay cluster.

Every node join/departure event can be seen by scanning the logs for a trace like the following:

[JGroupsReceiver:91] Accepted view [node2-23516|1] (2) [node2-23516, node1-18511]

In the comma-separated list you will see every node that is part of the same cluster. Order matters, since Liferay, by convention, considers the first node to be the master node.
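
The same membership and role information can be checked at runtime from the script console with a minimal sketch like the following (it relies on the same cluster APIs used by the scripts shown earlier):

    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;

    // Every node this node currently sees in the cluster, plus whether this node holds the master role.
    out.println("=== Cluster nodes: " + ClusterExecutorUtil.getClusterNodes());
    out.println("=== Local node: " + ClusterExecutorUtil.getLocalClusterNode());
    out.println("=== isMaster: " + ClusterMasterExecutorUtil.isMaster());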

 

4.2 Jobs schedule

Once the channels are created, and while nodes keep joining, every node will know its role (master/slave) inside Liferay's cluster.

The master node is in charge of scheduling all the deployed MEMORY_CLUSTERED jobs in Quartz. From the moment this is done, every job becomes eligible to be executed as soon as its next trigger fire time arrives.

Subsequently, each time a new node joins the cluster as a slave, the following happens:

  • Every MEMORY job will be scheduled in Quartz.
  • PERSISTED job management will be delegated to Quartz.
  • The slave will ask the master node for its MEMORY_CLUSTERED jobs in order to save a copy of them in memory only.

To confirm that the communication of this last point went well, you'll see a trace like the following:

[ClusterSchedulerEngine:607] Load 16 memory clustered jobs from master

 

At this point every node should be synchronized regarding memory clustered jobs. What else can happen afterwards? A new job can be deployed (make sure the development always gets deployed on every node).

If a new memory clustered job gets deployed, a new synchronization process among all nodes takes place. To force this, the master node uses the control channel, sending a message to each node telling it to add the new job. The aim is to ensure that at any given moment the master node is executing the deployed memory clustered jobs and the slave nodes have those jobs saved in memory, ready to schedule them the moment they acquire the master role.

At some point the master node may stop (either after a crash or a planned shutdown). JGroups will detect that a node has left the cluster and will notify Liferay's listener. That generates a new event in Liferay that ends up choosing the first node in the list as the new master (aka coordinator). After that:

  • The new master node schedules in Quartz all the memory clustered jobs previously saved in memory, and afterwards notifies all the remaining nodes to force a new synchronization.
  • The slave nodes attend to this notification, saving in memory all the jobs received from the master.

 

 

5. Cluster up: Detecting scheduled jobs problems

Once the cluster is up several problems related to the scheduled jobs can occur.

This section covers some of the most common error/warning messages that can be found in the Liferay logs and how to interpret them. A useful analysis script to check which jobs are currently scheduled is also provided.

It is assumed at this point that the cluster configuration is correct (it doesn't matter whether it uses TCP or UDP) and that it has been proven to build a healthy cluster at least once:

 

5.1 Log common traces analysis

5.1.1 Errors that need immediate attention

  • [ERROR] "Unable to load memory clustered jobs from master in XX seconds, you might need to increase value set to "clusterable.advice.call.master.timeout", will retry again": This error is obtained when the full synchronization between a slave and the master doesn't happen.

    Since this is a crucial step that affects all MEMORY_CLUSTERED jobs, there's a retry mechanism that keeps trying to perform the synchronization until it succeeds. The retry is performed every clusterable.advice.call.master.timeout seconds, so the message keeps being printed, spaced in time, while it keeps failing (see the properties sketch after this list for how to increase the timeout).

    Stopping the master node while the retry is still in progress will lead to all MEMORY_CLUSTERED jobs being lost, because while the synchronization keeps failing no jobs are saved in memory, so no jobs can be scheduled afterwards. The following sections explain how to recover from this situation.

 

  • [ERROR] "Unable to get a response from master for memory clustered job XXXXXX": Each time a new MEMORY_CLUSTERED job gets deployed on a slave node, the slave sends a message to the master node to perform the synchronization. If there's a communication error during this process, this trace is printed.

    Unfortunately there's no retry mechanism, so if no synchronization is triggered afterwards we could end up having different job configurations between the affected node and the master. The following sections explain how to force this synchronization.

 

  • [WARN] "Property scheduler.enabled is disabled in the master node. To ensure consistent behavior, this property must have the same value in all cluster nodes. If scheduler needs to be enabled, please stop all nodes and restart them in an ordered way.": Shown when the master node is started with the scheduler disabled, but a new slave joins having this property enabled.

    There is no way to ensure that all MEMORY_CLUSTERED jobs are always executed if some nodes don't have the property enabled; in other words, there is no supported way to force jobs to be executed on certain nodes only.

    To enable the property again, all nodes must be stopped and the property changed, because an inconsistent configuration may remain in memory.

 

  • [ERROR] "Unable to notify slave": This trace arises when the master node fails to notify a slave about a new job being scheduled. The following sections explain how to recover from this situation.
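
Related to the first and third messages above, both properties they mention can be set in portal-ext.properties. A minimal sketch, where the property names come from the messages themselves and the values are illustrative assumptions:

    # Give slaves more time (in seconds) to load the memory clustered jobs from the master.
    clusterable.advice.call.master.timeout=60

    # Must have the same value on every node of the cluster.
    scheduler.enabled=true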

 

5.1.2 Messages that must be analyzed

  • [INFO] "Memory clustered job XXXXXX is not yet deployed on master": This happens when a new MEMORY_CLUSTERED job is deployed on a slave node before it gets deployed on the master.

    It should not be a big deal; it most likely occurs when a new development is deployed for the first time simultaneously on every started node. If a slave node deploys it first the message appears, but it can be safely ignored if there's evidence that it was deployed afterwards on the master node.

    Always check the deployment process to ensure that custom developments get deployed (as needed) on every node.

 

  • [INFO] "Load XX memory clustered jobs from master": Indicates the number of MEMORY_CLUSTERED jobs that have been retrieved from the master node and saved in memory.

    A number lower than expected implies that some jobs have been lost on the master node. The following sections explain how to recover from this situation.

 

  • [INFO] "XX MEMORY_CLUSTERED jobs started running on this node": The current node has become master and the number of jobs it schedules gets printed. If the number is lower than expected, the following sections explain how to recover from this situation.

                           

  • [INFO] "MEMORY_CLUSTERED jobs stopped running on this node": The current node stops being master. If this happens without any node having been stopped, it can be a symptom of the cluster having been split (hence having one master per sub-cluster) and afterwards rejoined.

 

  • [INFO] "Accepted view MergeView": with the following format:

    [BaseReceiver:83] Accepted view MergeView::[node1-38143|3] [node1-38143, node2-42091], subgroups=[node2-42091|2] [node1-42091], [node2-38143|0] [node1-38143]

    This trace isn't a failure in itself, but it warns about a previous alarming situation. It means that the cluster was previously split into at least a couple of subgroups and has now joined back together. During the time it was split, differences between each node's content can appear (especially in MultiVM caches) since communication between the nodes was temporarily broken.

    Usually this is a consequence of network problems (packets being lost) or nodes being so overloaded that they cannot respond in time to the "are you alive" messages. From a JGroups perspective, this causes the node to be suspected of having left the cluster.

 

5.1.3 Messages that can safely be ignored

  • [INFO] "Receive notification from master, add memory clustered job XXXXXX": Informational only: a MEMORY_CLUSTERED job has been deployed on the master and notified to the current slave node.

 

  • [INFO] "Skip scheduling memory clustered job XXXXXX with a null trigger. It may have been unscheduled or already finished.": Informational only: an unscheduled job is skipped and not saved in memory.

 

  • [INFO] "Receive notification from master, reload memory clustered jobs": Each time a new master is chosen, all slave nodes receive a notification (reflected by this trace) to synchronize their jobs.

 

5.2 Groovy cluster debug scripts

Cluster health at any given moment can be checked not only with log analysis, but also with different Groovy scripts. Here you can find a couple of them:

  • SchedulerDebugInformation.groovy: Can be executed either on the master node or on every slave node.

    It should return the same number of jobs regardless of the node it runs on. If not, it means there is a job synchronization problem between nodes. The following sections explain how to recover from this situation.

    • If executed on master it will print all Quartz scheduled jobs.
    • If executed on a slave it will print all MEMORY_CLUSTERED jobs that are currently saved in memory.

 

  • ClusterNodesDebugInformation.groovy: Can be executed either on the master node or on every slave node.

    It prints information about the current cluster node the script is executed on, as well as all the nodes that compose the cluster from the current node's perspective. These nodes must be the same regardless of the node the script runs on; otherwise it means the cluster is split due to communication problems.

 

 

6. Jobs are missing. Now what?

Missing MEMORY_CLUSTERED jobs is not an easy situation to reproduce because it's usually the result of a two-step process:

  1. A slave fails to synchronize correctly with the jobs scheduled on the master node. Normally this is a silent failure, because jobs on slave nodes aren't scheduled; they only live, as a backup, in memory.
  2. The master node stops and the inconsistent slave becomes master. The memory clustered jobs it knows about are scheduled in Quartz, and the error now becomes visible in the form of jobs not being executed as expected.

 

To detect this situation two approaches can be followed:

  • Reviewing job executions from a functional point of view: the job may print informational traces, a job result such as a cleaned table may not be visible, etc.
  • Using some of the previously indicated debug scripts.

 

If, after the analysis, jobs are confirmed to be lost, some steps can be followed to fix it:

  • Restart the component using the Gogo console: If the Gogo console is accessible and the job is defined as a component, like, for example, CheckAssetEntryMessageListener.java, the easiest and fastest way is to restart the component on the master node. This causes the job to be registered again.

    This can be done with the following commands once the missing job has been identified (a verification sketch is included after this list):

    scr:disable com.liferay.asset.publisher.web.internal.messaging.CheckAssetEntryMessageListener
    scr:enable com.liferay.asset.publisher.web.internal.messaging.CheckAssetEntryMessageListener 

 

  • Execute the following Groovy script: As a last resort, when many jobs were lost or some of them were migrated from a former Liferay version (not following the component pattern), it can be useful to execute the following script on every node (it detects whether the node has the master role and only in that case executes all its logic).

    The script, which can be found here: SchedulerJobsManager.groovy, rebuilds the SchedulerEventMessageListener service tracker, rescheduling all jobs.
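
After applying either fix, you can verify on the master node that the job is scheduled again with a minimal sketch like the following. It assumes SchedulerEngineHelperUtil.getScheduledJobs(StorageType) is available, and the name fragment "CheckAssetEntry" is just an illustrative assumption to adapt to the job you recovered:

    import com.liferay.portal.kernel.scheduler.SchedulerEngineHelperUtil;
    import com.liferay.portal.kernel.scheduler.StorageType;

    // Hypothetical name fragment of the job we just re-enabled; adjust as needed.
    String jobNameFragment = "CheckAssetEntry";

    SchedulerEngineHelperUtil.getScheduledJobs(StorageType.MEMORY_CLUSTERED).each { job ->
        if (job.getJobName().contains(jobNameFragment)) {
            out.println("=== Found scheduled job: " + job.getGroupName() + " / " + job.getJobName());
        }
    }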

 

 

 

* I would like to apologize for the master-slave terminology, but in order to make the explanation as clear and familiar as possible I could not think of a better one. Suggestions are welcome.

 


A 7.4 MultiVMPool version (the service is looked up through the system bundle context instead of the registry):

    import com.liferay.portal.kernel.cache.MultiVMPool;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceReference;
    import com.liferay.portal.kernel.module.util.SystemBundleUtil;
    import com.liferay.portal.kernel.cache.PortalCache;
    import com.liferay.portal.kernel.util.PortalUtil;
    import com.liferay.portal.kernel.cluster.ClusterExecutorUtil;
    import com.liferay.portal.kernel.cluster.ClusterMasterExecutorUtil;

    private MultiVMPool getMultiVMPool() {
        // Registry registry = RegistryUtil.getRegistry();
        // ServiceReference serviceReference = registry.getServiceReference(MultiVMPool.class);

        BundleContext bundleContext = SystemBundleUtil.getBundleContext();

        ServiceReference serviceReference = bundleContext.getServiceReference(MultiVMPool.class.name);

        return bundleContext.getService(serviceReference);
    }

    try {
        // Get local cluster node information
        String localClusterNode = ClusterExecutorUtil.getLocalClusterNode();
        boolean isMaster = ClusterMasterExecutorUtil.isMaster();

        // Check if clustering information is available
        if (localClusterNode != null) {
            println("=== Execution in: " + localClusterNode);
        } else {
            println("=== Clustering information is not available.");
        }

        println("=== isMaster: " + isMaster);

        MultiVMPool multiVMPool = getMultiVMPool();

        if (multiVMPool != null) {
            now = new Date()
            println(now)

            PortalCache<String, String> portalCache1 = multiVMPool.getPortalCache("CACHE-TEST-1");

            println("=== Value for testKey before: " + portalCache1.get("testKey"));
            portalCache1.put("testKey", "testValue:" + now);
            println("=== Value for testKey after: " + portalCache1.get("testKey"));
        } else {
            println("MultiVMPool service is not available.");
        }
    } catch (Exception e) {
        println(e)
    }