We have set up Openfire 4.0.2 on two Red Hat Linux servers in a cluster using the Hazelcast plugin 2.2.0.
Apart from the cluster sometimes failing to start, we are facing a number of related issues even when the cluster does form correctly.
- When any node of the cluster is stopped (using service openfire stop), the components registered on that node are not always removed from the server items for the cluster.
- When any node of the cluster is killed (using kill -9), the components registered on that node are never removed from the server items for the cluster.
- When a node (e.g. the primary node) is stopped, leaves the cluster, and is started again after the secondary node has become primary (cluster senior member), it doesn’t always get the JoinedCluster() event, although the other (senior) node does get the JoinedCluster(node id) event for the newly started node.
- In the same restart scenario, even when the restarted node rejoins the cluster successfully and gets the JoinedCluster() event, clients that connect to the restarted node cannot route packets to the components registered on the existing node, whereas clients that connect to the existing node can route packets to all registered components.
The debug log on the newly started Openfire 4.0.2 node shows the following exchange for the last scenario:
2016.09.16 09:36:58 <iq type="get" id="750-14" from="admin@xmppdomain" to="component1.xmppdomain">
2016.09.16 09:36:58 org.jivesoftware.openfire.spi.RoutingTableImpl - Failed to route packet to JID: component1.xmppdomain packet:
2016.09.16 09:36:58 org.jivesoftware.openfire.IQRouter - IQ sent to unreachable address: <iq type="get" id="750-14" from="admin@xmppdomain" to="component1.xmppdomain">
In the error log we see an IllegalArgumentException:
2016.09.16 09:36:58 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:316)
at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:46)
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:434)
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:248)
at org.jivesoftware.openfire.component.InternalComponentManager.sendPacket(InternalComponentManager.java:288)
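To make the suspected failure mode concrete, here is a small self-contained Java model of what the trace above seems to suggest: the restarted node holds a component route that points at a node ID which is missing from its own view of the cluster members. This is only a sketch of our guess. The class StaleRouteSketch and all of its methods are hypothetical and are not Openfire or Hazelcast APIs; only the exception message is taken from the log.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model only: none of these names exist in Openfire or Hazelcast.
public class StaleRouteSketch {
    // Node IDs this node currently believes are cluster members.
    private final Set<String> members = new HashSet<>();
    // Component domain -> ID of the node the component is registered on.
    private final Map<String, String> componentRoutes = new HashMap<>();

    public void nodeJoined(String nodeId) {
        members.add(nodeId);
    }

    // What we would expect on a stop or kill: drop the node and its component routes.
    public void nodeLeft(String nodeId) {
        members.remove(nodeId);
        componentRoutes.values().removeIf(nodeId::equals);
    }

    public void registerComponent(String domain, String nodeId) {
        componentRoutes.put(domain, nodeId);
    }

    public String route(String domain) {
        String nodeId = componentRoutes.get(domain);
        if (nodeId == null) {
            return "unreachable";
        }
        if (!members.contains(nodeId)) {
            // Same message as in the error log; the cause in Openfire is our assumption.
            throw new IllegalArgumentException("Requested node not found in cluster");
        }
        return "routed to " + nodeId;
    }

    public static void main(String[] args) {
        StaleRouteSketch restartedNode = new StaleRouteSketch();
        // The route is visible (e.g. via a shared cache), but the member view was never updated.
        restartedNode.registerComponent("component1.xmppdomain", "existingNode");
        try {
            restartedNode.route("component1.xmppdomain");
        } catch (IllegalArgumentException e) {
            System.out.println("Primary packet routing failed: " + e.getMessage());
        }
        // Once the member view includes the existing node, routing succeeds.
        restartedNode.nodeJoined("existingNode");
        System.out.println(restartedNode.route("component1.xmppdomain"));
    }
}
```

If this model is roughly right, the question becomes why the restarted node's member view and its routing entries disagree after the rejoin.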
The same scenarios work flawlessly every time on Openfire 3.8.2 with the Coherence-based clustering plugin version 1.2.4.
Does anyone know the reason for this behaviour? Is it a configuration issue, or is something fundamentally missing from cache synchronisation in Hazelcast or Openfire 4.0.2?