After one cluster-node goes down, clients cannot rejoin rooms

Anno_van_Vliet · July 27, 2020, 9:55am

We are testing clustering using Hazelcast 2.5.0 and Openfire 4.5.1, in particular the fail-over behaviour.

Currently it isn’t successful. We have a two-node cluster setup with Openfire 4.5.1. After shutting down one of the nodes (the junior), MUC behaviour is not restored. The client reconnects and private chat is possible, but the client cannot rejoin the room. Further investigation showed that no subdomains (conference, search, etc.) were available anymore: a disco#items query on the server returns zero items.

We see in the logs lots of errors where the server tries to resolve the conference domain as a remote connection.

When the second node is restored, all is working again.

The issue is definitely cache related, so I focused on the cache behaviour:

Initially a server has a “Disco Server Items” cache with 4 items in my case:

[broadcast.jchat.oftst]
[conference.jchat.oftst]
[httpfileupload.jchat.oftst]
[search.jchat.oftst]

Each refers to the node ID of the same cluster node. I would have expected 8 entries, four for each cluster node.

When I stop the other cluster node, with the client connected to that remote node, nothing changes. The client reconnects to the remaining node, and the cache is unchanged.

When I restart the node and it has rejoined the cluster, I see that the cache content has changed. All components now point to the restarted node, so the components are hosted on the restarted node.

When I stop the same cluster node again, the component cache is empty and the client no longer functions, because the components are missing.

So my questions are:

Should there not be a cache entry for each component on each node (8 instead of 4)?
Why are the Cache entries replaced when a new node enters the cluster?

guus · August 13, 2020, 1:05pm

Thanks for reporting this, Anno. I’ve raised https://issues.igniterealtime.org/browse/OF-2060 in our bugtracker to track this issue.

guus · August 13, 2020, 1:55pm

From looking at the code, there should be only one cache item per component (running anywhere in the cluster), so 4 instead of 8 seems “correct”. The cached item should contain a reference to all cluster nodes that the component is running on.
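To make the expected cache shape concrete, here is a minimal sketch, assuming a plain in-memory map rather than Openfire's real cache classes. The class name `ComponentDirectory` and the node IDs are hypothetical and only illustrate "one entry per component, each entry listing every node that runs it".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Openfire's actual API.
class ComponentDirectory {
    // Key: component domain. Value: IDs of all cluster nodes hosting that component.
    private final Map<String, Set<String>> nodesByComponent = new HashMap<>();

    // Record that the given node hosts the given component.
    void register(String domain, String nodeId) {
        nodesByComponent.computeIfAbsent(domain, k -> new HashSet<>()).add(nodeId);
    }

    // Read-only view of the nodes hosting a component (empty if unknown).
    Set<String> nodesFor(String domain) {
        return Collections.unmodifiableSet(
                nodesByComponent.getOrDefault(domain, Collections.emptySet()));
    }

    // Number of cache entries, i.e. distinct components.
    int size() {
        return nodesByComponent.size();
    }
}
```

With four components registered from two nodes, such a structure holds 4 entries, each referencing both nodes, rather than 8 entries.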

You indicated that the items in your cache refer to the node ID of the same cluster node. How did you verify this? In my (4.6.0-alpha based) cluster, I’m seeing only memory addresses in the admin console. Did you mistake these for node identifiers, or did you do a more thorough inspection?

Anno_van_Vliet · August 18, 2020, 9:48am

That’s correct, also in 4.5. But initially I looked into the cache for components, and these have a reference to the cluster node ID.

Anno_van_Vliet · August 18, 2020, 10:21am

But then this might be an indication of the error. I found logging which indicates that items are removed from the components list, when a cluster-node has disappeared.

Also when a second node is added to the cluster, I found a log line in the logs of the newly added node: Line 4294: 2020.08.18 10:08:35 org.jivesoftware.openfire.component.InternalComponentManager - The local cluster node joined the cluster. The component ‘conference.xxx’ is living on our cluster node, but not on any others. Invoking the ‘component registered’ (and if applicable, the ‘component info received’) event on the remote nodes.

It appears that the componentCache only keeps one node instance per component.

guus · August 24, 2020, 8:13am

The “Components” cache does at least have one issue: its content expires. As there is no way for code to be restored on demand, this should not happen. I’ve raised a new issue for this problem: https://issues.igniterealtime.org/browse/OF-2065

guus · August 24, 2020, 11:14am

My current hypothesis is that the issues that we’re experiencing are caused by the fix for https://issues.igniterealtime.org/browse/OF-974 which, ironically, tries to address a closely related issue.

It appears that, when joining a cluster, the joining cluster node indiscriminately adds all entries of all (clustered) caches to the clustered cache. This is unlikely to behave as expected when the clustered cache already has an entry under the same key, mapped to a different value that needs to be merged.

If the above is correct, then I’m thinking that OF-974 should be rolled back, in favor of a solution that allows for a per-cache strategy to merge cache content of pre- and post-cluster joins. @gdt - any thoughts?

guus · August 24, 2020, 11:22am

Alternatively, we improve the implementation that merges the cache content in the clustered cache.

Update: I’ve worked on a solution that’s based on such a merge. It’s pretty straight-forward to get an implementation that works with collection-based values (those can be merged). If a cached value is some kind of class for which internal state needs to be merged, things get problematic.
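As a rough illustration of the difference between a blind copy and a merge for collection-based values, consider the sketch below. It assumes the cache can be modelled as a `Map<String, Set<String>>` from component domain to node IDs. The class and method names are hypothetical and this is not the code from the pull request.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of two ways a joining node could publish its local cache.
class CacheJoinSketch {
    // Blind copy: any existing clustered value under the same key is replaced.
    static void overwriteJoin(Map<String, Set<String>> clustered,
                              Map<String, Set<String>> local) {
        clustered.putAll(local);
    }

    // Merge: for keys present on both sides, the two sets are unioned.
    static void mergeJoin(Map<String, Set<String>> clustered,
                          Map<String, Set<String>> local) {
        local.forEach((key, nodes) ->
                clustered.merge(key, new HashSet<>(nodes), (existing, incoming) -> {
                    Set<String> union = new HashSet<>(existing);
                    union.addAll(incoming);
                    return union;
                }));
    }
}
```

With the blind copy, an entry that pointed at the existing node ends up pointing only at the joining node, which matches the behaviour reported earlier in this thread. The union only works because sets have an obvious merge; a value with internal state would need its own merge logic.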

Before going further down this road, we should consider the inverse of the scenario that we’re working on. The changes in OF-974 place responsibility of merging caches when the local node joins a cluster with the implementation of CacheFactory. However, when a node leaves the cluster, that code can’t be responsible for the inverse operation, as it no longer knows what data is local to the node, and what data is not.

I believe that it’s generally best to have symmetry: code that’s responsible for adding things should also be responsible for removing things. With that, I believe that the best course of action would be to roll back OF-974, and let the manager of each cache be responsible for adding and removing cache entries.

gdt · August 26, 2020, 11:01am

I agree with the cache-customisable merge point, but I’m not entirely sure OF-974 should be rolled back. It is after all doing a sort of crude merge.

Regarding the per-cache merge strategy, I think there are probably two distinct cases that need to be considered:

Another node has left the cluster (i.e. remove all cache entries that are associated with that node), and
This node has left the cluster (i.e. remove all cache entries that align with /any/ other node). This can happen in the case of a ‘split-brain’, or if clustering is disabled, e.g. as part of a rolling update.
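The two cases above could be sketched roughly as follows, again assuming a cache modelled as a mutable `Map<String, Set<String>>` from component domain to node IDs. The names are hypothetical and this is not Openfire's implementation.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the two clean-up cases.
class CacheLeaveSketch {
    // Case 1: another node left. Drop that node everywhere, then drop empty entries.
    static void otherNodeLeft(Map<String, Set<String>> cache, String departedNodeId) {
        cache.values().forEach(nodes -> nodes.remove(departedNodeId));
        cache.values().removeIf(Set::isEmpty);
    }

    // Case 2: this node left. Keep only what the local node itself hosts.
    static void localNodeLeft(Map<String, Set<String>> cache, String localNodeId) {
        cache.values().forEach(nodes -> nodes.retainAll(Collections.singleton(localNodeId)));
        cache.values().removeIf(Set::isEmpty);
    }
}
```

Both methods rely on each entry recording every node that hosts the component. If an entry only records a single node, neither clean-up can tell whether another node still provides it.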

Greg

gdt · August 26, 2020, 11:11am

(Having looked at the PR I can see that it’s far more than a simple revert, so I think this is pretty much covered).

guus · August 26, 2020, 4:07pm

For posterity: I’ve discussed the solution in detail with @gdt in the pull request. From what I’ve learned up to now, reverting OF-974 seems to be preferable to trying to implement some kind of merge.

guus · August 31, 2020, 8:29am

I believe that this issue has now been resolved. The fix will be part of the 4.6.0 release. In the meantime, the nightly builds that are accessible at https://www.igniterealtime.org/downloads/nightly_openfire.jsp can be used to test.