Hazelcast Cluster on Openfire 4.0.2 issues

We have set up Openfire 4.0.2 on two Red Hat Linux servers in a cluster, using the Hazelcast plugin 2.2.0.

Apart from the cluster sometimes failing to start, we are facing a number of related issues even when the cluster does form correctly.

  • When any node of the cluster is stopped (using service openfire stop), the components registered on that node are not always removed from the server items for the cluster.
  • When any node of the cluster is killed (using kill -9), the components registered on that node are never removed from the server items for the cluster.
  • When a node (e.g. the primary node) is stopped, leaves the cluster, and is started again after the secondary node has become primary (cluster senior member), it doesn't always get the JoinedCluster() event, although the other (senior) node does get the JoinedCluster(node id) event for the newly started / joined node.
  • In the same restart scenario, even when the restarted node rejoins the cluster successfully and gets the JoinedCluster() event, clients that connect to the new node cannot route packets to the components registered on the existing node, whereas clients that connect to the existing node can route packets to all registered components.

The debug log on the newly started Openfire 4.0.2 node shows the following exchange for the last scenario:

2016.09.16 09:36:58 <iq type="get" id="750-14" from="admin@xmppdomain" to="component1.xmppdomain">

2016.09.16 09:36:58 org.jivesoftware.openfire.spi.RoutingTableImpl - Failed to route packet to JID: component1.xmppdomain packet:

2016.09.16 09:36:58 org.jivesoftware.openfire.IQRouter - IQ sent to unreachable address: <iq type="get" id="750-14" from="admin@xmppdomain" to="component1.xmppdomain">

In the error log we see an IllegalArgumentException:

2016.09.16 09:36:58 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node not found in cluster
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:316)
    at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:569)
    at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:46)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:434)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:248)
    at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:323)
    at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:115)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:78)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:69)
    at org.jivesoftware.openfire.component.InternalComponentManager.sendPacket(InternalComponentManager.java:288)
    at org.xmpp.component.AbstractComponent.send(AbstractComponent.java:925)
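To make the symptom easier to reason about, here is a toy model of what the trace suggests: a component route that still points at a node ID the cluster no longer knows. It is only a sketch. ToyClusterRouter and all of its methods are hypothetical names invented for illustration, not Openfire or Hazelcast code, and the real routing table is considerably more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical class, not part of Openfire or the Hazelcast plugin.
class ToyClusterRouter {
    // Node IDs of the members currently known to the cluster.
    private final Set<String> liveNodes = new HashSet<>();
    // Component address -> ID of the node the component registered on.
    private final Map<String, String> componentOwner = new HashMap<>();

    void nodeJoined(String nodeId) {
        liveNodes.add(nodeId);
    }

    // Models a node disappearing while its component routes are left behind.
    void nodeLeftWithoutCleanup(String nodeId) {
        liveNodes.remove(nodeId);
    }

    void registerComponent(String address, String nodeId) {
        componentOwner.put(address, nodeId);
    }

    // Returns the node ID a packet for this address would be forwarded to.
    String route(String address) {
        String nodeId = componentOwner.get(address);
        if (nodeId == null) {
            throw new IllegalStateException("No route for " + address);
        }
        if (!liveNodes.contains(nodeId)) {
            // Same message as in the error log above; the route is stale.
            throw new IllegalArgumentException("Requested node not found in cluster");
        }
        return nodeId;
    }
}

class ToyClusterRouterDemo {
    public static void main(String[] args) {
        ToyClusterRouter router = new ToyClusterRouter();
        router.nodeJoined("node-a");
        router.registerComponent("component1.xmppdomain", "node-a");
        System.out.println("Routed to " + router.route("component1.xmppdomain"));

        router.nodeLeftWithoutCleanup("node-a");
        try {
            router.route("component1.xmppdomain");
        } catch (IllegalArgumentException e) {
            System.out.println("Routing failed: " + e.getMessage());
        }
    }
}
```

In this model the failure goes away only when the stale entry in componentOwner is removed or re-registered against a live node, which matches the observation that the components are not removed from the server items.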

The same scenarios work flawlessly every time on Openfire 3.8.2 with the Coherence-based clustering plugin version 1.2.4.

Does anyone know the reason for this behaviour? Is it a configuration problem, or is something fundamentally missing from cache synchronisation in Hazelcast or Openfire 4.0.2?

Thanks

The spi.RoutingTableImpl issues continue with Openfire 4.2.0 and Hazelcast 2.2.4.

Entities on one node are not routable from the other node in the cluster.

org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed

The same happens in Openfire 4.2.2 with Hazelcast 2.3:

org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node not found in cluster

I encountered the same problem in Openfire 4.2.3 with Hazelcast 2.3. Make sure that your external component does not connect to Openfire until clustering is completely initialized.

When Openfire is started, the Hazelcast plugin is not loaded until all internal components have started. If an external component connects to any Openfire node before clustering is initialized, the routing table entry for that component will have an "empty" nodeID (namely DEFAULT_NODE_ID), which cannot be removed until Openfire is restarted. When the external component loses its connection, the ComponentManager is unable to remove the component session from the Hazelcast cache because of this "empty" nodeID entry. If the external component then tries to reconnect, it will get a "conflict" error.
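As a workaround on the component side, one option is to hold back the connection until the cluster reports ready. The sketch below shows only the waiting logic. ClusterReadinessGate and awaitReady are hypothetical names, and how clusterReady is actually determined (a fixed delay, a health check, something else) is an assumption left to the caller.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a caller-supplied readiness check before connecting.
class ClusterReadinessGate {

    // Returns true as soon as clusterReady reports true, or false if the
    // timeout elapses (or the thread is interrupted) before that happens.
    static boolean awaitReady(BooleanSupplier clusterReady, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (clusterReady.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // Sleep for the poll interval, but never past the deadline.
                Thread.sleep(Math.max(1, Math.min(pollMillis, remainingMillis)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}

class ClusterReadinessGateDemo {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Assumption for the demo: the cluster counts as "ready" after 200 ms.
        BooleanSupplier ready = () -> System.currentTimeMillis() - start >= 200;

        if (ClusterReadinessGate.awaitReady(ready, 5_000, 50)) {
            System.out.println("Cluster ready, connecting external component");
        } else {
            System.out.println("Cluster not ready in time, not connecting");
        }
    }
}
```

The external component would call awaitReady before opening its connection and skip connecting if it returns false, so that no route is created while the node still uses DEFAULT_NODE_ID.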

I am checking with the community if there is a fix.

Update: if you are interested in a fix, I posted it to