OpenFire Hazelcast can't handle fast restart of node, but ok if slow

Nathan_Neulinger · August 12, 2016, 3:00pm

Sharing this to hopefully get an explanation as well as to benefit other users if they see the same behavior.

Setup: Three Openfire 4.0.2 nodes with Hazelcast 2.2.0; the backend is a three-node Percona XtraDB MySQL cluster.

The front-end load balancer is currently an F5 LTM. Each server is configured with a static list of member nodes, using unicast.

If I do a fast restart of any one of the nodes, i.e. stop it and immediately start it again, it comes up but is unable to properly join the cluster. In fact, it ends up claiming that clustering is disabled.

If, on the other hand, I do the restart slowly (stop, wait a minute, then start; I suspect the actual cutoff is 30 seconds), all appears to be good.
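For reference, the slow restart that works for me can be sketched as below. This is only an illustration of the stop/wait/start sequence: the `stop` and `start` callables are placeholders for whatever commands your install uses to stop and start Openfire, and the 60-second default is simply the minute I waited, not a documented Hazelcast or Openfire timeout.

```python
import time
from typing import Callable


def slow_restart(
    stop: Callable[[], None],
    start: Callable[[], None],
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop a node, pause, then start it again.

    `stop` and `start` are hypothetical hooks supplied by the caller;
    nothing here is an Openfire or Hazelcast API. `wait_seconds` is an
    assumed delay based on the observed behavior, not a tuned value.
    """
    stop()                # shut the node down
    sleep(wait_seconds)   # give the remaining members time to notice it left
    start()               # bring the node back up
```

Passing `wait_seconds=0` reproduces the fast restart that fails for me; the `sleep` parameter exists only so the delay can be replaced when trying the function out.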

Is this just expected behavior, or is it something that is tuned incorrectly on my side?

Iqbal1 · October 6, 2016, 12:09am

Hey Nathan,

In my case, I got this error when the JMX console was open in a browser.

After I closed the JMX console tab and restarted Openfire, it worked. Do you have any idea why this happened?

Regards,

Iqbal

Nathan_Neulinger · October 6, 2016, 12:39am

Unfortunately, no. However, I can say that Hazelcast stability (and restartability) in my cluster has improved dramatically since updating from 4.0.3 to the latest git/nightly build from a few days ago. That build corrects some of the null-entry-in-cache issues, which appear to affect the cluster's ability to come online.

Matteo_Fiandesio · November 2, 2016, 12:06pm

Hey Nathan,

Could you please provide the commit hash for that nightly? I can’t find anything related to Hazelcast changes on the 4.0.3 tag.

Thanks

Nathan_Neulinger · November 2, 2016, 6:04pm

I’m currently running “c580edfcb0da5f64395f6482a639d0371406690d” with no hazelcast issues.