Restarting a component gets a "conflict" error due to an orphan entry in the component cache

I would like to report a bug: in a clustering environment, restarting a component fails with a "conflict" error. The symptom was reported in https://discourse.igniterealtime.org/t/hazelcast-cluster-on-openfire-4-0-2-issues/61731/2

The root cause: an entry was added to the component cache using the DEFAULT_NODE_ID before Hazelcast was initialized with a real node ID. Once the real node ID was available, RoutingTableImpl.addComponentRoute() simply added a new entry to the cache without removing the old one (the one with DEFAULT_NODE_ID). As a result, the component cache always held two entries for the component: one with DEFAULT_NODE_ID and one with the real node ID. When the component disconnected, the entry with the real node ID was removed from the cache correctly, but the one with DEFAULT_NODE_ID remained. When the component reconnected, the Component Manager thought the component already existed and returned a "conflict" error.
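To make the bookkeeping failure concrete, here is a minimal, self-contained model of the scenario using plain JDK collections (the map stands in for Openfire's clustered componentsCache; the string node IDs are illustrative, not the actual NodeID type):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OrphanEntryDemo {
    public static void main(String[] args) {
        // component address -> node IDs that host the component
        Map<String, Set<String>> componentsCache = new HashMap<>();

        // 1. The component connects before Hazelcast is ready, so the
        //    route is registered under the placeholder DEFAULT_NODE_ID.
        componentsCache.computeIfAbsent("comp.mydomain.com", k -> new HashSet<>())
                       .add("DEFAULT_NODE_ID");

        // 2. Hazelcast finishes initializing and addComponentRoute() runs
        //    again, adding the real node ID without removing the old
        //    entry -- this is the bug.
        componentsCache.get("comp.mydomain.com").add("real-node-id");

        // 3. The component disconnects; only the real node ID is removed.
        componentsCache.get("comp.mydomain.com").remove("real-node-id");

        // The orphan placeholder survives, so on reconnect the Component
        // Manager believes the component still exists and answers with a
        // "conflict" error.
        System.out.println(componentsCache); // {comp.mydomain.com=[DEFAULT_NODE_ID]}
    }
}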

The suggested fix is to remove the component's cache entry with DEFAULT_NODE_ID before adding the new one. Two files are involved: XMPPServer.java and RoutingTableImpl.java:

% git diff ../src/java/org/jivesoftware/openfire/XMPPServer.java
diff --git a/src/java/org/jivesoftware/openfire/XMPPServer.java b/src/java/org/jivesoftware/openfire/XMPPServer.java
index 35a84ca..54ec69b 100644
--- a/src/java/org/jivesoftware/openfire/XMPPServer.java
+++ b/src/java/org/jivesoftware/openfire/XMPPServer.java
@@ -226,6 +226,16 @@ public class XMPPServer {
     }
 
     /**
+     * Returns the default node ID used by this server before clustering is
+     * initialized.
+     *
+     * @return The default node ID.
+     */
+    public NodeID getDefaultNodeID() {
+        return DEFAULT_NODE_ID;
+    }
+
+    /**
      * Returns true if the given address matches a component service JID.
      *
      * @param jid the JID to check.
% git diff ../src/java/org/jivesoftware/openfire/spi/RoutingTableImpl.java
diff --git a/src/java/org/jivesoftware/openfire/spi/RoutingTableImpl.java b/src/java/org/jivesoftware/openfire/spi/RoutingTableImpl.java
index 6d91f4d..b1f1aa3 100644
--- a/src/java/org/jivesoftware/openfire/spi/RoutingTableImpl.java
+++ b/src/java/org/jivesoftware/openfire/spi/RoutingTableImpl.java
@@ -139,6 +139,11 @@ public class RoutingTableImpl extends BasicModule implements RoutingTable, Clust
             if (nodes == null) {
                 nodes = new HashSet<>();
             }
+            // Remove an orphan entry (if any) in the routing table; mainly for clustering
+            if (nodes.remove(server.getDefaultNodeID())) {
+                Log.debug("Replacing DEFAULT_NODE_ID with \"{}\" for component {}",
+                    server.getNodeID(), route);
+            }
             nodes.add(server.getNodeID());
             componentsCache.put(address, nodes);
         } finally {
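
With the fix applied, the same sequence no longer leaves an orphan behind. A sketch of the corrected flow, reusing the simplified map model from above (again not the actual Openfire classes):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FixedRouteDemo {
    public static void main(String[] args) {
        Map<String, Set<String>> componentsCache = new HashMap<>();

        // Route registered under the placeholder before clustering is up.
        componentsCache.computeIfAbsent("comp.mydomain.com", k -> new HashSet<>())
                       .add("DEFAULT_NODE_ID");

        // Fixed addComponentRoute(): drop the placeholder (if present)
        // before recording the real node ID, mirroring the patch above.
        Set<String> nodes = componentsCache.get("comp.mydomain.com");
        if (nodes.remove("DEFAULT_NODE_ID")) {
            System.out.println("Replacing DEFAULT_NODE_ID with real-node-id");
        }
        nodes.add("real-node-id");

        // Disconnecting now empties the set completely, so a later
        // reconnect is accepted instead of being rejected with "conflict".
        nodes.remove("real-node-id");
        System.out.println(componentsCache); // {comp.mydomain.com=[]}
    }
}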

Hi Vincent,

Thanks for this; it looks OK by eye, and reading between the lines it sounds like you've got this working locally? Is there any chance you could raise a PR on GitHub for it?

Thanks,

Greg

Hi Greg,

Yes, I got it working locally. I intentionally put a Log.debug() in the code so I could see that the orphan entry was replaced correctly. I will open a PR from the correct branch this time. Thanks.

-Vincent

Hi,

I have the same issue on Openfire 4.1.5 with Hazelcast plugin 2.2.2.
We have two nodes, with four clients connected locally to one node and shown as remote sessions on the other.
Every day, one of the four clients ends up online on the local node but offline on the remote node.
Have you ever seen such behavior?

Thanks

Ciaccina

@Ciaccina When you say "clients", do you mean "external components" or users? A "client connection" usually refers to a user connection via port 5222 or HTTP-bind, while an "external component" connects via port 5275. In my case, I had external components repeatedly trying to reconnect to the nodes whenever the nodes were down; the problem was that a reconnection could be established before Hazelcast had completely initialized. BTW, do your "clients" (assuming they are components) share the same subdomain name (e.g. comp.mydomain.com)? If they do, there was another bug in the Component Manager: when one instance of the component went offline, it might remove the wrong instance from its component session table.

Hi @V_Lau, thanks for the response.

My clients connect on port 5222.

I'm using the Smack library for the connection between my clients and the Openfire server. I have two Openfire servers in a cluster with Hazelcast plugin 2.2.2.

When Openfire is up and the cluster is up too, I start a client; it connects to Openfire node 1 as a local session and shows as a remote session on node 2. After some hours the client is offline on the remote node and online on the local node. How can I prevent this?

Hi @V_Lau,

Could we have a compiled Openfire build that includes the fix for this issue?
If that is not possible, can you give me a guide to compiling Openfire 4.1.5 downloaded from GitHub?

Thanks to all.

Ciaccina,

Your problem is actually different from mine. My issue was related to "external component" sessions; yours involves client sessions. However, I suggest you enable (or add) logging for org.jivesoftware.openfire.session.LocalClientSession and org.jivesoftware.openfire.SessionManager.

BTW, you may want to check whether XEP-0198 (Stream Management) is enabled in 4.1.5. In 4.2.3 it is enabled, and Smack uses it by default. It created a lot of unexpected behavior when clients did not disconnect properly: the server treated those client sessions as "detached" and expected the clients to resume them shortly via XEP-0198. If the clients did not resume properly, the new sessions would be mistakenly timed out. I had to disable Stream Management in Openfire.
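If you do need to turn Stream Management off, here is a minimal sketch. The property name stream.management.active is my recollection of the switch Openfire uses for XEP-0198, so treat it as an assumption and verify it against your version (the property can also be set from the admin console's System Properties page):

import org.jivesoftware.util.JiveGlobals;

public class DisableStreamManagement {
    // Sketch only, e.g. run from a small custom plugin: turn off XEP-0198
    // so detached sessions are closed instead of lingering while the
    // server waits for a resume that never comes. The property name below
    // is an assumption; verify it for your Openfire version.
    public static void disable() {
        JiveGlobals.setProperty("stream.management.active", "false");
    }
}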

To build 4.1.5 from GitHub:

git clone https://github.com/igniterealtime/Openfire.git
cd {openfire}
git checkout tags/v4.1.5
cd build
ant clean openfire plugins

The entire build ends up in {openfire}/target/openfire.