Recovery of nodes during network glitch

vadi · May 6, 2014, 1:30pm

We have a two-node Openfire/Hazelcast cluster (version 3.8.2) on Windows 2008. This setup seems to have trouble recovering from network glitches. When there is a glitch lasting a few seconds, the two nodes cannot see each other. After the glitch is resolved, users on different nodes cannot communicate with each other, even though both nodes re-establish connectivity. We see messages like “requested node not found” with a reference to a node ID.

If users log out and log back in, things are fine.

Where do you think the problem could be? Is it the plugin or Hazelcast itself? How do you suggest we troubleshoot this?

Tom_Evans1 · May 6, 2014, 8:51pm

I suspect that the existing user sessions are marked with a particular cluster node id, and that the node id changes after the network glitch.

How are the clients connected in this case? Are you using BOSH (via HTTP), or native TCP connections (via 5222)?

I will take a look at the cluster reconnect logic. Refer to OF-794 for more information and status updates.
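
To make the hypothesis above concrete, here is a minimal sketch of how a session tagged with an old node ID would stop being routable once the node rejoins under a new ID. This is a toy model, not Openfire or Hazelcast code: the class StaleNodeIdDemo, its maps, and all method names are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical toy model (not Openfire code): sessions are tagged with the
// ID of the cluster node that hosts them, and routing resolves that ID
// against the list of currently live nodes.
class StaleNodeIdDemo {
    // user JID -> ID of the node hosting the session
    private final Map<String, String> sessionOwner = new HashMap<>();
    // node ID -> network address of a live cluster member
    private final Map<String, String> liveNodes = new HashMap<>();

    void nodeJoined(String nodeId, String address) { liveNodes.put(nodeId, address); }
    void nodeLeft(String nodeId) { liveNodes.remove(nodeId); }
    void sessionOpened(String jid, String nodeId) { sessionOwner.put(jid, nodeId); }

    // Returns the address to deliver to, or empty if the session's node ID
    // no longer matches any live node ("requested node not found").
    Optional<String> route(String jid) {
        String nodeId = sessionOwner.get(jid);
        if (nodeId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(liveNodes.get(nodeId));
    }

    public static void main(String[] args) {
        StaleNodeIdDemo table = new StaleNodeIdDemo();
        table.nodeJoined("node-1", "10.0.0.1");
        table.sessionOpened("alice@example.org", "node-1");
        System.out.println("before glitch: " + table.route("alice@example.org").isPresent());

        // Glitch: the same machine drops out and rejoins under a new ID.
        table.nodeLeft("node-1");
        table.nodeJoined("node-2", "10.0.0.1");
        System.out.println("after glitch:  " + table.route("alice@example.org").isPresent());

        // Logging out and back in re-tags the session with the new ID.
        table.sessionOpened("alice@example.org", "node-2");
        System.out.println("after relogin: " + table.route("alice@example.org").isPresent());
    }
}
```

If the real behavior resembles this model, it would also explain why logging out and back in restores messaging: the new session is tagged with the current node ID.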

vadi · May 7, 2014, 12:17pm

We are using native TCP on port 5222. We tried to clean up the sessions in RoutingTableImpl: leftCluster() removes the user sessions belonging to the other nodes, and joinedCluster() syncs the cache back up. But it did not seem to help.
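
For readers following along, the cleanup-and-resync approach described above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual RoutingTableImpl code: the class ClusterRouteTable and the methods onLeftCluster and onJoinedCluster are hypothetical names, and the re-tagging of local sessions with a new node ID is one possible way to handle an ID change, not something confirmed in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the real RoutingTableImpl): drop remote routes
// when the cluster is lost, then re-tag local routes and merge the peers'
// routes when the cluster is joined again.
class ClusterRouteTable {
    private String localNodeId;
    // user JID -> ID of the node that owns the session
    private final Map<String, String> routes = new HashMap<>();

    ClusterRouteTable(String localNodeId) { this.localNodeId = localNodeId; }

    void addRoute(String jid, String ownerNodeId) { routes.put(jid, ownerNodeId); }
    String ownerOf(String jid) { return routes.get(jid); }

    // Called when this node leaves the cluster: forget every session that
    // belongs to another node. Returns how many routes were removed.
    int onLeftCluster() {
        int before = routes.size();
        routes.values().removeIf(owner -> !owner.equals(localNodeId));
        return before - routes.size();
    }

    // Called when this node (re)joins: local sessions adopt the new node ID,
    // and the routes reported by peers are merged in. Returns the local
    // routes that should be published to the rest of the cluster.
    Map<String, String> onJoinedCluster(String newLocalNodeId, Map<String, String> peerRoutes) {
        Map<String, String> toPublish = new HashMap<>();
        for (Map.Entry<String, String> entry : routes.entrySet()) {
            if (entry.getValue().equals(localNodeId)) {
                entry.setValue(newLocalNodeId);
                toPublish.put(entry.getKey(), newLocalNodeId);
            }
        }
        localNodeId = newLocalNodeId;
        routes.putAll(peerRoutes);
        return toPublish;
    }
}
```

The point of returning the local routes from the join step is that the other node also needs the refreshed entries; cleaning up only one side would leave the peer holding routes tagged with the old ID.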

Tom_Evans1 · May 7, 2014, 4:50pm

I have applied a small fix for the session cleanup logic in the Hazelcast plugin (now version 1.2.2). Can you give this a try and see if there is any improvement? You can find the latest plugin via your local admin console, or you can download from the plugins page (http://www.igniterealtime.org/projects/openfire/plugins.jsp).

Javier_Deferrari · August 19, 2015, 2:03pm

The same issue is happening to us. On a network glitch or after restarting one of the openfire instances, one of the nodes can’t see the complete list of user sessions. It looks like the cache of sessions is not being shared.

Node A and B.

Client SB is connected to B.

We restart A and connect client SA to A.

Node A has SA connected to it.
Node B has SB connected to it.

Looking at session-summary.jsp, A sees only its local session (SA), but B sees all sessions (SA as remote, SB as local).
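
One way to spot this asymmetry without comparing admin consoles by eye is to diff each node's view against the union of all local sessions. The sketch below assumes the session lists have already been collected from each node by some means; SessionViewCheck and missingPerNode are hypothetical names, not part of Openfire.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of Openfire): given the sessions each node
// hosts locally and the sessions each node can see, report what each node
// is missing from its view.
class SessionViewCheck {
    static Map<String, Set<String>> missingPerNode(
            Map<String, Set<String>> localByNode,
            Map<String, Set<String>> viewByNode) {
        // A healthy cluster view is the union of every node's local sessions.
        Set<String> expected = new TreeSet<>();
        for (Set<String> local : localByNode.values()) {
            expected.addAll(local);
        }
        Map<String, Set<String>> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : viewByNode.entrySet()) {
            Set<String> gap = new TreeSet<>(expected);
            gap.removeAll(entry.getValue());
            missing.put(entry.getKey(), gap);
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> local = new TreeMap<>();
        local.put("A", new TreeSet<>(Arrays.asList("SA")));
        local.put("B", new TreeSet<>(Arrays.asList("SB")));

        // The views from the scenario above: A sees only SA, B sees both.
        Map<String, Set<String>> view = new TreeMap<>();
        view.put("A", new TreeSet<>(Arrays.asList("SA")));
        view.put("B", new TreeSet<>(Arrays.asList("SA", "SB")));

        System.out.println(missingPerNode(local, view)); // prints {A=[SB], B=[]}
    }
}
```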

It doesn’t happen every time, but it is much more common than we would like, especially during network glitches. Reproducing it by restarting one of the servers is much more difficult.

We upgraded to 3.10.2 with the Hazelcast plugin version 2.0.0. We will upgrade to 2.1.1, but from the changelog it doesn’t look like anything related to this has changed.

We use the cluster to provide full availability. The load is low: we have only 5000 users and normally see about 1000 connected at the same time. This error means we can’t use Openfire if we want to provide full availability, since the cluster is doing more harm than good right now.