RE: Job doesn't trigger. Msg: Liferay job node not deployed on master

Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Hello,

   Well, we are using 2 Liferay servers, prd1 and prd2. We deployed a job on both servers, I schedule it for a specific time in a DDL, and then I restart the job.
   It doesn't trigger, and when I start it I get the message: Liferay job node not deployed on master.
I tried to find a solution; some posts said that I need to install fixpack 32, but I'm already on a higher level than that one. I'm using version 7.1.
   Does someone have an idea about what I need to do? I saw that when I use a specific node's IP (that is prd2) it runs, but the general URL that serves both nodes has this problem.
   Thanks in advance.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hi Alex,

I'm not sure that I am totally following along -- I do have a few questions. One thing to note right off the bat though is that the job will NOT run on all the nodes in the cluster. One of the nodes (usually the master, from my experience) gets control of the lock mechanism for running scheduled tasks. All the other nodes may TRY to run the task, but because they can't get a hold of the lock, they won't. The good news is that when something happens to the master node (it's shut down, or crashes, basically whatever takes it out of the pool), then one of the other nodes (usually the first one to try to run the task) will get the lock and everything continues along without issue. But, again, the jobs will only ever run on one node of the cluster at a time.
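Conceptually (this is only an illustration of the idea, not Liferay's actual code), every node's scheduler fires, but each node races for a shared cluster-wide lock first, something like:

    // Conceptual sketch only -- not Liferay's implementation. The helper
    // methods are placeholders.
    public class ClusteredJobRunner {

        public void onTrigger(String jobName) {
            // Succeeds for at most one node; the owner keeps the lock
            // until it leaves the cluster, which is when failover happens.
            if (!tryAcquireClusterLock(jobName)) {
                return; // another node holds the lock, skip this run
            }

            runJob(jobName); // only ever executes on the lock owner
        }

        private boolean tryAcquireClusterLock(String jobName) {
            return true; // a real implementation would hit a shared store
        }

        private void runJob(String jobName) {
            // the actual job logic
        }
    }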

Now, the part I didn't quite catch was how you are trying to set up your trigger. Can you share some code with us and maybe try to explain once again?
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Hi Andrew,

   Thanks for your interest.
   I removed the log part just to show the necessary code.
   Values that I set using a DDL:
  this.intervalo = 1440
  this.logPath = a random log directory to write the status
  this.horarioExecucao = 15:30

package cronjob.canais.job;

import com.liferay.dynamic.data.lists.model.DDLRecord;
import com.liferay.dynamic.data.lists.model.DDLRecordSet;
import com.liferay.dynamic.data.lists.service.DDLRecordSetLocalServiceUtil;
import com.liferay.dynamic.data.mapping.storage.DDMFormFieldValue;
import com.liferay.dynamic.data.mapping.storage.DDMFormValues;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.log.Log;
import com.liferay.portal.kernel.log.LogFactoryUtil;
import com.liferay.portal.kernel.messaging.BaseSchedulerEntryMessageListener;
import com.liferay.portal.kernel.messaging.DestinationNames;
import com.liferay.portal.kernel.messaging.Message;
import com.liferay.portal.kernel.module.framework.ModuleServiceLifecycle;
import com.liferay.portal.kernel.scheduler.SchedulerEngineHelper;
import com.liferay.portal.kernel.scheduler.SchedulerException;
import com.liferay.portal.kernel.scheduler.StorageType;
import com.liferay.portal.kernel.scheduler.StorageTypeAware;
import com.liferay.portal.kernel.scheduler.TimeUnit;
import com.liferay.portal.kernel.scheduler.TriggerFactory;
import com.liferay.portal.kernel.scheduler.TriggerFactoryUtil;
import com.liferay.portal.kernel.util.GetterUtil;
import com.liferay.portal.kernel.util.LocaleUtil;
import com.liferay.portal.kernel.util.PropsUtil;

import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Map;

import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Modified;
import org.osgi.service.component.annotations.Reference;

/**
 * @author naruto.hokage
 */

@Component(immediate = true, service = CronjobCanaisJob.class)
public class CronjobCanaisJob extends BaseSchedulerEntryMessageListener {
    
    private static final Log _log = LogFactoryUtil.getLog(CronjobCanaisJob.class);

    private String logPath;
    private int intervalo;
    private Date horarioExecucao;
    private long ddLConf;
    
    @Reference(target = ModuleServiceLifecycle.PORTAL_INITIALIZED, unbind = "-")
    private volatile ModuleServiceLifecycle _moduleServiceLifecycle;

    @Reference(unbind = "-")
    private volatile SchedulerEngineHelper _schedulerEngineHelper;

    @Reference(unbind = "-")
    private volatile TriggerFactory _triggerFactory;
    
    @Activate
    @Modified
    protected void activate() {
        registerScheduler();
        _schedulerEngineHelper.register(this, schedulerEntryImpl, DestinationNames.SCHEDULER_DISPATCH);    
    }

    private void registerScheduler() {
        try {
            this.getHoraAndLogPath();
        } catch (Exception e) {
            e.printStackTrace();
        }

        if (horarioExecucao != null) {
            schedulerEntryImpl.setTrigger(
                TriggerFactoryUtil.createTrigger(
                    getEventListenerClass(), getEventListenerClass(),
                    horarioExecucao, intervalo, TimeUnit.DAY));
        } else {
            schedulerEntryImpl.setTrigger(
                TriggerFactoryUtil.createTrigger(
                    getEventListenerClass(), getEventListenerClass(), intervalo,
                    TimeUnit.MINUTE));
        }
    }

    @Deactivate
    protected void deactivate() {
        try {
            _schedulerEngineHelper.unschedule(schedulerEntryImpl, getStorageType());
        } catch (SchedulerException e) {
            _log.error("CronJobERROR : "+ e.getMessage());
        }
        _schedulerEngineHelper.unregister(this);
    }

    @Override
    protected void doReceive(Message message) throws Exception {
        try {
            //Calling my implementation over here.   
        } catch (Exception e) {
            e.printStackTrace();
        }        
        registerScheduler();
    }


    /**
     * getStorageType: Utility method to get the storage type from the scheduler entry wrapper.
     * @return StorageType The storage type to use.
     */
    protected StorageType getStorageType() {
        if (schedulerEntryImpl  instanceof StorageTypeAware) {
            return ((StorageTypeAware) schedulerEntryImpl).getStorageType();
        }

        return StorageType.MEMORY_CLUSTERED;
    }


    private void getHoraAndLogPath() throws PortalException {
        this.ddLConf = GetterUtil.getLong(PropsUtil.get("com.admin.cronjob-canais-pacotes.ddl.conf"));

        DDLRecordSet ddlRecordSet = DDLRecordSetLocalServiceUtil.getDDLRecordSet(this.ddLConf);
        List<DDLRecord> ddlRecords = ddlRecordSet.getRecords();

        for (DDLRecord ddlRecord : ddlRecords) {
            DDMFormValues ddmFormValues = ddlRecord.getDDMFormValues();
            Map<String, List<DDMFormFieldValue>> map = ddmFormValues.getDDMFormFieldValuesMap();

            this.intervalo = Integer.parseInt(getValueFromList(map.get("intervalo")));
            this.logPath = getValueFromList(map.get("logPath"));
            this.horarioExecucao = getHorario(getValueFromList(map.get("horarioExecucao")));
        }

        _log.info("Minutes : " + this.intervalo);
    }

    /**
     * Converts a time string (HH:mm) into a {@link Date} object.
     *
     * @param horario the time in HH:mm format
     * @return the next occurrence of that time, or null if parsing fails
     */
    private Date getHorario(String horario) {
        Date retorno = null;

        try {
            if (!horario.equals("") && horario.length() == 5 && horario.charAt(2) == ':') {
                String[] horaEminuto = horario.split(":");

                Calendar scheduledDate = Calendar.getInstance();
                scheduledDate.set(Calendar.HOUR_OF_DAY, Integer.parseInt(horaEminuto[0]));
                scheduledDate.set(Calendar.MINUTE, Integer.parseInt(horaEminuto[1]));
                scheduledDate.set(Calendar.SECOND, 0);
                scheduledDate.set(Calendar.MILLISECOND, 0);

                Calendar today = Calendar.getInstance();

                // If the time has already passed today, schedule for the next day
                // (add a day; setting Calendar.DATE to 1 would jump to the first
                // of the month instead)
                if (scheduledDate.before(today)) {
                    scheduledDate.add(Calendar.DATE, 1);
                }

                retorno = scheduledDate.getTime();
            }
        } catch (Exception e) {
            _log.info("Error getting the initial execution time.");
        }

        return retorno;
    }

    private String getValueFromList(List<DDMFormFieldValue> ddmFormFieldValues) {
        if (ddmFormFieldValues == null || ddmFormFieldValues.isEmpty()) {
            return null;
        }

        return ddmFormFieldValues.get(0).getValue().getString(LocaleUtil.getDefault());
    }
}
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hi Alex,

Ok -- phew! This is a lot of code just to set up a trigger. I'm going to give you a little unsolicited advice if you don't mind -- mostly because I am not sure I entirely agree with your solution :)

First off, this causes me some concern --
    @Override
    protected void doReceive(Message message) throws Exception {
        try {
            //Calling my implementation over here.   
        } catch (Exception e) {
            e.printStackTrace();
        }        
        registerScheduler();
    }

... every time your trigger fires, this message listener will be invoked, right? At the end of the invocation (doReceive) you are calling "registerScheduler" again -- but you definitely shouldn't have to do that. So I would start by removing that.

The next thing I would say is that you don't need this code --

    /**
     * getStorageType: Utility method to get the storage type from the scheduler entry wrapper.
     * @return StorageType The storage type to use.
     */
    protected StorageType getStorageType() {
        if (schedulerEntryImpl  instanceof StorageTypeAware) {
            return ((StorageTypeAware) schedulerEntryImpl).getStorageType();
        }

        return StorageType.MEMORY_CLUSTERED;
    }
.. because the default type for a scheduled task, I believe, is MEMORY_CLUSTERED. So you can probably get rid of that one too. Rather than creating a method to register the scheduler, I normally just put the logic to create the trigger and register the scheduler straight in the activate method.

Now, for your settings. I think what you are trying to do is provide two definitions. One for a Simple Trigger and one for a CRON possibly? Actually it looks like two Simple Triggers, but if you want one to run "daily" at a specific time, you're better off using a CRON expression. My guess is that you want this because in DEV you want to run the task more frequently than you do in production.
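For example, a daily 15:30 run could be expressed directly as a cron trigger instead of a start date plus interval. A minimal sketch, assuming the cron-expression overload of TriggerFactoryUtil.createTrigger(jobName, groupName, cronExpression) in 7.x:

    // Hedged sketch: a Quartz-style cron expression for "every day at
    // 15:30" instead of computing a start Date and a 1440 minute interval.
    schedulerEntryImpl.setTrigger(
        TriggerFactoryUtil.createTrigger(
            getEventListenerClass(), getEventListenerClass(),
            "0 30 15 * * ?"));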

First off, that is a great idea, and I love the fact that you have the foresight to make it a configuration that can be changed at runtime. The downside here is twofold.

1. Changes you make (the way you have it now) won't Deactivate/Activate the component automatically, so they won't be picked up -- if you have it set (in your DDL) to 5 minutes and then you change it to 10, it won't change unless you stop and start the module.

2. It doesn't follow the best practices for configuring a component. 

SOOOOOO! .. what I would suggest is this.

Start simple. Add two new settings to portal-ext.properties, and then in your Activate method, instead of reading a DDL setting, you can use the PropsUtil class from the kernel to get your setting (and GetterUtil to convert it to the correct type). Then test with that.
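A minimal sketch of that, assuming two hypothetical property keys (they are not standard Liferay properties, just names made up for this example):

    import com.liferay.portal.kernel.util.GetterUtil;
    import com.liferay.portal.kernel.util.PropsUtil;

    public class SchedulerSettings {

        // Add these to portal-ext.properties on each node, e.g.
        // cronjob.canais.intervalo=1440
        // cronjob.canais.horario=15:30

        public static int getIntervalo() {
            return GetterUtil.getInteger(
                PropsUtil.get("cronjob.canais.intervalo"), 1440);
        }

        public static String getHorarioExecucao() {
            return GetterUtil.getString(
                PropsUtil.get("cronjob.canais.horario"), "15:30");
        }
    }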

If that works and you want to have a runtime setting, then you should use the Configuration API to create a configuration in the Control Panel. Then you can reference it using the configurationPid attribute on your @Component annotation, which will be a much simpler (and more correct) way to handle the setting.
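To give you an idea of the shape of that, here is a rough sketch, assuming the usual @Meta/ConfigurableUtil pattern from 7.x (the interface name, pid, and attribute names are made up for illustration):

    import aQute.bnd.annotation.metatype.Meta;

    // Hypothetical configuration interface; saving it in the Control Panel
    // creates/updates the configuration with this pid.
    @Meta.OCD(id = "cronjob.canais.job.CronjobCanaisConfiguration")
    public interface CronjobCanaisConfiguration {

        @Meta.AD(deflt = "1440", required = false)
        public int intervalo();

        @Meta.AD(deflt = "15:30", required = false)
        public String horarioExecucao();
    }

The component then references the pid, so a save in the Control Panel re-invokes your @Modified method with the new values -- no restart needed:

    import java.util.Map;

    import com.liferay.portal.configuration.metatype.bnd.util.ConfigurableUtil;
    import com.liferay.portal.kernel.messaging.BaseMessageListener;
    import com.liferay.portal.kernel.messaging.Message;

    import org.osgi.service.component.annotations.Activate;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Modified;

    @Component(
        configurationPid = "cronjob.canais.job.CronjobCanaisConfiguration",
        immediate = true, service = CronjobCanaisJob.class
    )
    public class CronjobCanaisJob extends BaseMessageListener {

        private volatile CronjobCanaisConfiguration _configuration;

        @Activate
        @Modified
        protected void activate(Map<String, Object> properties) {
            _configuration = ConfigurableUtil.createConfigurable(
                CronjobCanaisConfiguration.class, properties);

            // Rebuild the trigger here from _configuration.intervalo()
            // and _configuration.horarioExecucao().
        }

        @Override
        protected void doReceive(Message message) throws Exception {
            // job logic
        }
    }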

-- BUT, bottom line, I would start with the properties, or even something hard coded to make sure that it works ok, and then build from there.

Let me know if any of this doesn't make sense, or if you have more questions.

PS> One last thing, I noticed you are extending the BaseSchedulerEntryMessageListener class, which is actually deprecated. You should instead switch to BaseMessageListener.
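A rough sketch of the non-deprecated shape, assuming SchedulerEntryImpl and the TriggerFactory interval overload available in 7.1 (the interval here is illustrative):

    import com.liferay.portal.kernel.messaging.BaseMessageListener;
    import com.liferay.portal.kernel.messaging.DestinationNames;
    import com.liferay.portal.kernel.messaging.Message;
    import com.liferay.portal.kernel.scheduler.SchedulerEngineHelper;
    import com.liferay.portal.kernel.scheduler.SchedulerEntryImpl;
    import com.liferay.portal.kernel.scheduler.TimeUnit;
    import com.liferay.portal.kernel.scheduler.Trigger;
    import com.liferay.portal.kernel.scheduler.TriggerFactory;

    import org.osgi.service.component.annotations.Activate;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Deactivate;
    import org.osgi.service.component.annotations.Reference;

    @Component(immediate = true, service = CronjobCanaisJob.class)
    public class CronjobCanaisJob extends BaseMessageListener {

        @Activate
        protected void activate() {
            String className = CronjobCanaisJob.class.getName();

            // Illustrative interval -- plug in your configured values.
            Trigger trigger = _triggerFactory.createTrigger(
                className, className, null, null, 1440, TimeUnit.MINUTE);

            _schedulerEngineHelper.register(
                this, new SchedulerEntryImpl(className, trigger),
                DestinationNames.SCHEDULER_DISPATCH);
        }

        @Deactivate
        protected void deactivate() {
            _schedulerEngineHelper.unregister(this);
        }

        @Override
        protected void doReceive(Message message) throws Exception {
            // Job logic goes here -- no re-registration needed.
        }

        @Reference
        private SchedulerEngineHelper _schedulerEngineHelper;

        @Reference
        private TriggerFactory _triggerFactory;
    }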
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Thanks Andrew, you have amazing tips that I didn't know before. I just used an old cronjob project as a reference.

If that works and you want to have a runtime setting, then you should use the Configuration API to create a configuration in the Control Panel. Then you can reference it using the configurationPid attribute on your @Component reference which will be a much simpler way (and more correct) way to handle the setting.
I need to understand the features more to know how to do these steps.

-- BUT, bottom line, I would start with the properties, or even something hard coded to make sure that it works ok, and then build from there.
About the idea of using the properties: it's good to start as simply as I can, but I can't restart the server whenever I want, because it's a production server.

All the code was working before. The problem started because a coworker deleted all the jobs to deploy them again.
About the problem: Memory clustered job is not yet deployed on master.
I think that some job configuration makes it not executable in a clustered environment. I just need to find exactly where that is happening to fix it, and after that rebuild the code with your tips.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hi Alex,

I understand -- certainly not the first time I have tried to provide advice that assumes an ideal scenario of full control over the environment and the time it takes to figure things out and build them right :)

Coming back to your actual problem, I did some more digging, and the ClusterSchedulerEngine class seems to be the one that reports the error you are experiencing --
...

try {
   SchedulerResponse schedulerResponse = future.get(
      _callMasterTimeout, TimeUnit.SECONDS);

   if ((schedulerResponse == null) ||
      (schedulerResponse.getTrigger() == null)) {

      if (_log.isInfoEnabled()) {
         _log.info(
            StringBundler.concat(
               "Memory clustered job ",
               getFullName(jobName, groupName),
               " is not yet deployed on master"));
      }
   }
   else {
      addMemoryClusteredJob(schedulerResponse);
   }
}
...

.. can you share a full stack trace so I can follow the rabbit down the hole? 

Also, at the very least, can you remove the call that sets up the scheduler in doReceive? Just to make sure that it's not part of the issue. I say that because normally when you register a task with a simple trigger, it runs immediately on deployment, so I am wondering if when you deploy, you register the task, but since the listener runs right away it tries to register the task again (a second time). Even if that is not the case, you shouldn't need that code all the same.
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Sure.

To replicate the case, I access:

1. Control Panel > Applications > Applications Management
2. I choose a specific job, and I stop/start it with the time changed in my DDL.
Note: the same code works fine on the development server and the test server (because they have just one node). It's just the production server, which has 2 nodes (prd1 and prd2), that is not working, maybe because some clustering configuration was lost.


On the main server it doesn't give me much information:
2019-04-04 07:45:50.206 INFO [http-nio-8080-exec-257][CronJob:93] Deactivate : Thu Apr 04 07:45:50 BRT 2019
2019-04-04 07:45:50.280 INFO [http-nio-8080-exec-257][BundleStartStopLogger:38] STOPPED com.admin.cronjob_1.0.0 [1168]
2019-04-04 07:45:58.061 INFO [http-nio-8080-exec-290][BundleStartStopLogger:35] STARTED com.admin.cronjob_1.0.0 [1168]
2019-04-04 07:45:58.072 INFO [http-nio-8080-exec-290][CronJob:63] Activate : Thu Apr 04 07:45:58 BRT 2019
2019-04-04 07:45:58.074 INFO [http-nio-8080-exec-290][CronJob:178] Minutes : 1440
2019-04-04 07:45:58.074 INFO [http-nio-8080-exec-290][CronJob:79] The job is going to be executed on: Thu Apr 04 07:50:00 BRT 2019
2019-04-04 07:45:58.090 INFO [http-nio-8080-exec-290][ClusterSchedulerEngine:358] Memory clustered job is not yet deployed on master
2019-04-04 07:45:58.095 INFO [http-nio-8080-exec-290][CronJob:68] CronJobINFO : Job Registered.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Ok I am wondering now if maybe you are trying to initialize the service on the node that isn't the one managing the jobs. For example, let's say your Prod2 is the one running the tasks, but you are in the Control Panel on Prod1.

Is there a way for you to open two browsers where one is on Prod1 and the other (incognito, or a different browser) is on Prod2? And then try the same steps on each to see if the error shows up in one log but not the other?
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Yep, when I tried to run it alone on the specific IP of prd2, it worked, but only when accessing prd2. I saw that my job is deployed on both servers, but I don't remember the order right now.

When I ran the job on prd2, it also gave me some information on prd2:
2019-04-04 07:55:05.025 INFO [default-6194][ClusterSchedulerEngine:745] Receive notification from master, add memory clustered job {groupName=com.admin.cronjob.job.CronJob, jobName=com.admin.cronjob.job.CronJob, storageType=MEMORY_CLUSTERED}.

Ok I am wondering now if maybe you are trying to initialize the service on the node that isn't the one managing the jobs. For example, let's say your Prod2 is the one running the tasks, but you are in the Control Panel on Prod1.
We deployed the same application on both servers, but both nodes should be synchronized, communicating with each other, right? As a beginner entering in the middle of the project, I don't know exactly how this communication between the servers works.
Hmm, I got your point. I just cannot see a way out of this hole hehehe'
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Well, it's not really a hole. The Quartz Scheduler is built so that it only runs on one node. So if you have a cluster of 15 nodes, even if you deploy the module to all 15 nodes (which you should), it will only ever run on one node. My point is that if you have that setup, 15 nodes all with your module, and you stop/start the module on a node that isn't the one with the Quartz lock, then perhaps that is why you see that message -- but if the message is reported on a node that is NOT controlling the lock, then maybe you can ignore it. So that is why I was trying to figure out if the error shows up on ALL your nodes, or on your Number of Nodes - 1 ... if it is the Number of Nodes - 1, then I would suspect that you are fine.

... does the job still run in the end? On any of the nodes?
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Andrew Jardine:
... does the job still run in the end? On any of the nodes?
You are right, it happens on just one node, the main node (prd1). On the second node (prd2), which I need to access via IP, the job runs.
Christoph Rabel, modified 6 Years ago. Liferay Legend Posts: 1555 Join Date: 9/24/09 Recent Posts
I didn't really follow the thread, but I had a super weird problem with the scheduler once too. After a few hours, it simply stopped working until I restarted the server. I really fiddled with the problem for a while. Then I deleted the osgi/state folder, restarted, and the problem was gone. I know it's a wild guess and kind of a "sacrifice a lamb and dance around the fire" solution, but deleting the state folder has resolved weird issues for me a couple of times now.
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Christoph Rabel:
... deleting the state folder has resolved weird issues for me a couple of times now.

That's also a great tip, Christoph.
I'll keep this option as a fallback if I take too long to find a solution. I want to run the scheduled job from my primary node (prd1) instead of accessing the second node (prd2) by IP every time.
Christoph Rabel, modified 6 Years ago. Liferay Legend Posts: 1555 Join Date: 9/24/09 Recent Posts
Not sure if it helps in any way, but maybe you can use the idea somehow:
I had a related problem where I needed to control from the outside when a job was run (usually scheduled, but sometimes it needed to be executed immediately).

I created a REST service that triggers the execution of the job. A cronjob starts the job every night, but it can also be triggered manually or by external services.
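A minimal sketch of that idea, assuming the OSGi JAX-RS whiteboard available in 7.1 (the endpoint path and class name are made up, and authentication -- which you will definitely want in production -- is not shown):

    import java.util.Collections;
    import java.util.Set;

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.Application;

    import org.osgi.service.component.annotations.Component;

    // Hypothetical endpoint that lets an external scheduler (or a human)
    // fire the job on demand.
    @Component(
        property = {
            "osgi.jaxrs.application.base=/job-trigger",
            "osgi.jaxrs.name=JobTrigger"
        },
        service = Application.class
    )
    public class JobTriggerApplication extends Application {

        @Override
        public Set<Object> getSingletons() {
            return Collections.singleton(this);
        }

        @GET
        @Path("/run")
        @Produces("text/plain")
        public String run() {
            runJob(); // placeholder: call the same logic as doReceive()

            return "Job triggered";
        }

        private void runJob() {
            // ... business logic ...
        }
    }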
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Christoph Rabel:
I created a REST service that triggers the execution of the job.
I also made it. It helps a lot, but nobody wants to wake up at midnight every day to start the job manually hehe'
Also, something strange happened. I made it for two jobs; one of them works fine, creating the requested content, but the other one just writes the log information and doesn't create new content. I decided to focus on the main problem: if I solve the job scheduler not running on the main node, the REST service triggering the job won't be necessary anymore.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hey Alex,

You keep referring to the job running on the main node -- I just want to make sure that you understand that you can't control which node the job will run on. It's basically a race. Whichever node in your cluster runs a task first will get the lock. And to be clear, it doesn't have to be YOUR task that runs first. Liferay has several scheduled tasks that run as well -- for example the JournalCheckInterval task that runs (by default) every 15 minutes to see if there is content that should be published/unpublished.

The closest you could get to making sure Node 1 and not Node 2 runs the task would be to restart your cluster, but only bring up one node for a time until you are sure that at least one task has been run and that Node 1 (your primary node) ran it. Then you could start Node 2.

But I would say that the design of this solution (from Liferay) is such that it expects the job to be able to run on any node -- hence the failover we talked about yesterday.

You keep referencing the Primary Node. Does it HAVE to run on the primary node? and if yes, why?
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Andrew Jardine:
You keep referring to the job running on the main node -- I just want to make sure that you understand that you can't control which node the job will run on.

Hmm, I learned the idea of load balancing some days ago, and I think that this is what you are talking about.
I referenced the primary node just because the job was deployed on both nodes and it doesn't work on one of them. Before the restart it was working fine.
The closest you could get to making sure Node 1 and not Node 2 runs the task would be to restart your cluster, but only bring up one node for a time until you are sure that at least one task has been run and that Node 1 (your primary node) ran it. Then you could start Node 2.
I think that I'm going to try this one first.
Christoph Rabel, modified 6 Years ago. Liferay Legend Posts: 1555 Join Date: 9/24/09 Recent Posts
Alex Camaroti
I referenced the primary node just because the job was deployed on both nodes, and it doenst work in one of them. before restart it was working fine.


Here lies your actual problem! Except for some really special scenarios, you should not care on which node the job is executed. It simply should not matter. If one machine is down, the other one should run the job.

Alex Camaroti:
The closest you could get to making sure Node 1 and not Node 2 runs the task would be to restart your cluster, but only bring up one node for a time ...
I think that I'm going to try this one first.
Please note that this is highly unreliable. One restart in the wrong order -> Problem.

May I suggest a few other options:
-) Create an extra service just for the job and deploy it only on one server.
-) Add a portal-ext property "enable_my_service". Set it to true on server one and false on server two, and just don't run the job on server two -- a sketch of this follows below.
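A minimal sketch of the second option, assuming the guard lives in the component's activate() method (the property key enable_my_service is whatever you choose to call it):

    import com.liferay.portal.kernel.util.GetterUtil;
    import com.liferay.portal.kernel.util.PropsUtil;

    // Hypothetical guard: the job is only registered on nodes whose
    // portal-ext.properties sets enable_my_service=true, so the other
    // node never even tries to schedule it.
    @Activate
    protected void activate() {
        boolean enabled = GetterUtil.getBoolean(
            PropsUtil.get("enable_my_service"));

        if (!enabled) {
            return; // this node never schedules the job
        }

        registerScheduler();
        _schedulerEngineHelper.register(
            this, schedulerEntryImpl, DestinationNames.SCHEDULER_DISPATCH);
    }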
Christoph Rabel, modified 6 Years ago. Liferay Legend Posts: 1555 Join Date: 9/24/09 Recent Posts
Alex Camaroti:
I made it for two jobs; one of them works fine, creating the requested content, but the other one just writes the log information and doesn't create new content.

I am not sure your main problem is actually the job scheduler. You have the service deployed on both nodes, but it behaves correctly only on one of them? I'd say there is something fishy going on.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hahaha -- hey, sometimes the best solution to the problem is to appease the Gods, right? :)
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
I'm gonna test everything at the end of the day.
Just fixing what I said before: the job works fine on the main node (prd1). The second node (prd2) is the real problem.
In the last test, I deleted the jobs from osgi/modules on prd1 and kept them on prd2.
Result: doesn't work.
So I did the opposite: deleted them from prd2 and kept them on prd1.
Result: worked fine using the specific server IP URL.

But talking about the cluster configuration, I have this in my portal-ext.properties (on both servers):
##
## Cluster Link
##

    #
    # Set the cluster node bootup response timeout in milliseconds.
    #
    cluster.link.node.bootup.response.timeout=10000

    #
    # Set this to true to enable the cluster link. This is required if you want
    # to cluster indexing and other features that depend on the cluster link.
    #
    cluster.link.enabled=true

    #
    # Set the JGroups properties for each channel, we support up to 10 transport
    # channels and 1 single required control channel. Use as few transport
    # channels as possible for best performance. By default, only one UDP
    # control channel and one UDP transport channel are enabled. Channels can be
    # configured by XML files that are located in the class path or by inline
    # properties.
    #
    cluster.link.channel.properties.control=tcp.xml
    cluster.link.channel.properties.transport.0=tcp.xml

    #
    # Set this property to autodetect the default outgoing IP address so that
    # JGroups can bind to it. The property must point to an address that is
    # accessible to the portal server, www.google.com, or your local gateway.
    #
    #cluster.link.autodetect.address=www.google.com:80

Is this config ok or do I need to implement something more?
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
Hi Alex, 

I don't think your test is valid. Let me try to explain once more --

1. Your servers are stopped.
2. Your code is NOT deployed on either Prod1 or Prod2
3. You start Prod1, and then you start Prod2
4. The first (Liferay) scheduled task fires on Prod1 -- Prod1 gets the Lock for scheduled tasks, and runs the task.
5. A second (Liferay) scheduled task fires on Prod1 -- Prod1 has the Lock, so the task runs. 
6. The first (Liferay) scheduled task fires on Prod2 -- Prod2 tries but FAILS to get the Lock, so the task can't run.

.. this continues this way where all scheduled tasks (Liferay) will run only on Prod1.

7. You deploy your code on both nodes.
8. Your scheduled task runs on Prod1 -- the task executes because Prod1 has the Lock.
9. Your scheduled task runs on Prod2 -- the task FAILS (as in it simply won't start in the first place) because Prod2 doesn't have the Lock.

... so even with your services deployed, they will still only run on the node with the lock. This has nothing to do with load balancing; there is no request routing here, as these threads spin up and run independent of a proxied request -- which is kind of the point of a scheduled task :)

10. You UNDEPLOY your scheduled task from Prod1.
11. Eventually, your scheduled task tries to run on Prod2 -- but again, Prod2 still doesn't have the lock so it can't run the task. So it just won't run on any node now.

.. remember that your task is probably not the only scheduled task in the system, so your task probably doesn't control WHEN and WHICH server gets the lock. If you set your task to run at 7pm but start your server earlier, then almost surely one of the Liferay tasks will fire first (like the JournalCheckInterval) and the server it runs on will obtain the lock.

12. You put your code back on BOTH servers -- so that your cluster deployments are properly synced now.

.. now HERE is the proper test to see if there is a problem.

13. Assuming that Prod1 has the Lock still, SHUT THE SERVER DOWN.

.. this will cause the server node to be removed from the cluster leaving just Prod2 in the pool. Now when the next scheduled task fires, Prod2 will be able to obtain the Lock and will now be the node running the tasks. 

14. Start Prod1 back up

... and at this point you will be in the inverse scenario. Prod1, when it tries to run tasks, will no longer have or be able to obtain the Lock so the scheduled tasks will never run on prod1 and instead run on prod2.

Now, you might say "well, that's no good! I want to balance the task execution amongst all my nodes!" -- and that's a fair point. But it's the limitation. The advantage, though, is the automatic failover to the other node when one goes down. The failover at least maintains business continuity. It does mean a couple of things, however --

1. Only ONE node in your cluster will ever run the scheduled tasks (all of them) until that node goes down
2. You cannot (without a lot of effort) designate the node that will run scheduled tasks
3. You should make sure that your scheduled task has what it needs to run on ANY node in your cluster because of #2 and to support the failover scenario.


Make sense?
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Andrew Jardine:
Make sense?

Totally. I'm gonna do this tomorrow.
About the cluster configuration that I posted yesterday: from your point of view, is that ok or do I need something more?
Thanks a million.
Andrew Jardine, modified 6 Years ago. Liferay Legend Posts: 2416 Join Date: 12/22/10 Recent Posts
I don't see any issue with your cluster config, assuming the tcp.xml is configured correctly. The easiest way (at least, the one that was taught to me and that I still use) to make sure your cluster (replication) is working is to bring up the same page on both servers in two different browsers. Add a new portlet to the first page and then simply refresh the second page. If your cluster is working, then the change (to the page) will be replicated and you can see the results correctly on both servers.

NOTE though, that since your scheduled tasks don't actually do anything with clustering, clustering is not really required for this. All you need is more than one Liferay server all pointing to the same database. 
Alex Camaroti, modified 6 Years ago. New Member Posts: 16 Join Date: 4/8/19 Recent Posts
Hello guys,

  I noticed that something good happened.
  After a few restarts of each production server (I don't know which one was restarted first), it came back to the original state, starting the jobs from the main URL.
  If I try to deploy the job scheduler again, deleting the existing one, the problem happens again, but for now I don't need to worry about it anymore.
  Thanks a million for all your help.
  God bless the Liferay heroes hehehe'