Openfire 4.1.0 alpha + Hazelcast 2.2 + HAProxy + Docker

Setup: a two-node Openfire cluster using Hazelcast, fronted by an HAProxy load balancer, with each component running in its own Docker container:

  • openfire1 in a Docker container

  • openfire2 in a Docker container

  • haproxy in a Docker container in front of the two-node cluster, using TCP mode (a rough haproxy.cfg sketch follows this list)

  • the Hazelcast plugin is used to cluster openfire1 and openfire2
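
For reference, the HAProxy piece is plain TCP-mode balancing of the XMPP client port. Below is a minimal haproxy.cfg sketch of roughly what I'm running; the backend names, addresses, and timeouts are specific to my Docker setup (and simplified here), not anything Openfire or Hazelcast requires:

# haproxy.cfg (sketch) - TCP-mode balancing of XMPP c2s connections
defaults
    mode tcp
    timeout connect 5s
    timeout client 5m
    timeout server 5m

frontend xmpp_c2s
    bind *:5222
    default_backend openfire_nodes

backend openfire_nodes
    balance leastconn
    server openfire1 openfire1:5222 check
    server openfire2 openfire2:5222 check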

My web client is able to connect to both Openfire instances and create client sessions fine, and I can view them in the Admin Console of both Openfire nodes. Clustering is enabled, and I can see that some client sessions are local while others are remote, depending on which Openfire node the client connects to.

Problem: For testing, I manually take down openfire1 in the cluster and notice that the client locally connected to that node immediately loses its connection and then attempts to reconnect. I would expect Hazelcast to handle notifying the other Openfire node that it must take over the client’s session. That does seem to happen, but only after a very long delay. During this delay, I cannot view Client Sessions on the remaining Openfire node using the Admin Console, nor can any new client sessions be created against the remaining node.

Question: Anyone know where this problem might be coming from … Hazelcast, Openfire, HAProxy, Docker? It looks as if, while the first node (openfire1) is lost, the second node (openfire2) gets stuck working out that it is now the sole remaining node and processing the transition. This is my theory because, when navigating to the surviving node’s Admin Console, I cannot view Server or Client Sessions, yet I can view the Plugins and Group Chat tabs. After a long time, more than a minute, openfire2 reconnects the client sessions and is able to accept new connections as well. Is there a configuration setting that reduces this long failover delay?

Additional info:

It takes approximately 1 minute 14 seconds for the remaining node to pick up all client sessions and allow them to reconnect.

In the log file, as soon as the openfire1 node is brought down, the following messages appear, and they repeat over and over until the cluster removes the downed node. It seems that only once the downed node is officially removed is the remaining node able to recover client sessions.

2016.06.09 20:11:14 INFO [hz.openfire.IO.thread-in-0]: com.hazelcast.nio.tcp.TcpIpConnection - [172.17.0.5]:5701 [openfire] [3.5.1] Connection [Address[172.17.0.8]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]

2016.06.09 20:11:14 WARN [hz.openfire.IO.thread-in-0]: com.hazelcast.nio.tcp.ReadHandler - [172.17.0.5]:5701 [openfire] [3.5.1] hz.openfire.IO.thread-in-0 Closing socket to endpoint Address[172.17.0.8]:5701, Cause:java.io.EOFException: Remote socket closed!

2016.06.09 20:11:16 INFO [cached3]: com.hazelcast.nio.tcp.SocketConnector - [172.17.0.5]:5701 [openfire] [3.5.1] Connecting to /172.17.0.8:5701, timeout: 0, bind-any: true

2016.06.09 20:11:20 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:25 WARN [cached2]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:28 INFO [hz.openfire.InspectInvocationsThread]: com.hazelcast.spi.OperationService - [172.17.0.5]:5701 [openfire] [3.5.1] Handled 0 invocation timeouts and 1 backupTimeouts

2016.06.09 20:11:30 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:35 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:40 WARN [cached4]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:44 INFO [hz.openfire.InspectInvocationsThread]: com.hazelcast.spi.OperationService - [172.17.0.5]:5701 [openfire] [3.5.1] Handled 0 invocation timeouts and 1 backupTimeouts

2016.06.09 20:11:45 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:50 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:55 WARN [cached1]: com.hazelcast.cluster.ClusterService - [172.17.0.5]:5701 [openfire] [3.5.1] This node does not have a connection to Member [172.17.0.8]:5701

2016.06.09 20:11:57 ERROR [socket_c2s-thread-2]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds

java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:executorService', op=com.hazelcast.executor.impl.operations.MemberCallableTaskOperation{serviceName='null', partitionId=-1, callId=0, invocationTime=1465503117330, waitTimeout=-1, callTimeout=30000}, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=64, callTimeout=30000, target=Address[172.17.0.8]:5701, backupsExpected=0, backupsCompleted=0} encountered a timeout
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:71)
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:374)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:596)
    at org.jivesoftware.openfire.plugin.session.RemoteSession.doSynchronousClusterTask(RemoteSession.java:193)
    at org.jivesoftware.openfire.plugin.session.RemoteClientSession.incrementConflictCount(RemoteClientSession.java:157)

After more investigation, I have discovered the following Hazelcast properties that affect the Openfire cluster:

  • hazelcast.operation.call.timeout.millis (int, default 60000) - Timeout, in milliseconds, to wait for a response when a remote call is sent.

  • hazelcast.socket.connect.timeout.seconds (int, default 0) - Socket connection timeout in seconds. Socket.connect() blocks until a connection is established, the connection is refused, or this timeout passes. The default of 0 means infinite.

  • hazelcast.connection.monitor.max.faults (int, default 3) - Maximum IO error count before disconnecting from a node.

The HeartbeatOperation will consume a total of 60 seconds if nothing else kills it. It is tried 3 times until it completely fails to get a pulse from a node, at which point the node is removed from the cluster and all other nodes are notified of the dead member’s removal. Since the socket connect timeout is 0 (infinite), when a node can’t be reached Hazelcast will keep “trying” for as long as possible before failing. I also notice that the default configuration for the Openfire Hazelcast plugin sets the operation call timeout to 30000 (30 seconds), and in the log file I can indeed see roughly 3 attempts to reach the failed node at about 30 seconds each before it is finally removed, which lines up with the ~1 minute 14 seconds I measured.
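
Based on this, lowering those values should shrink the failover window. Below is a minimal sketch of what I plan to try in the plugin’s Hazelcast XML configuration (the hazelcast-cache-config.xml that ships with the Hazelcast plugin, in my install); the specific numbers are experimental guesses on my part, not recommended values:

<!-- hazelcast-cache-config.xml (excerpt, sketch; values are guesses to shorten failure detection) -->
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- fail pending remote operations faster (the plugin appears to use 30000 per the log) -->
        <property name="hazelcast.operation.call.timeout.millis">10000</property>
        <!-- don't let Socket.connect() block indefinitely (default 0 = infinite) -->
        <property name="hazelcast.socket.connect.timeout.seconds">5</property>
        <!-- give up on a flaky connection after fewer IO faults (default 3) -->
        <property name="hazelcast.connection.monitor.max.faults">2</property>
    </properties>
</hazelcast>

The obvious trade-off is that detecting failure faster also makes false positives more likely on a busy Docker bridge network, so I would want to test these values under load before relying on them.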

The remaining question to me though is this:

Why is it that in a two-node cluster, when a single node goes down, the other node seems non-functional until it is notified that the dead node has been removed? It looks as though the remaining healthy node doesn’t accept new connections, and disallows sending messages and joining rooms, until it has fully “recovered” from the other node going down and being removed from the cluster.

It would be good if @Tom Evans could weigh in on this one to better understand the expected operation of healthy nodes in a cluster when one of the other nodes goes down.

I can tell you that my Hazelcast cluster does not have this happen. I am using 2 Windows 2008 servers for Openfire and a Windows 2012 server to host the Microsoft SQL database. I was testing my setup today with over 139k users online, and when I shut down a server to make sure the connections continue going to the other server, it does in fact function as it is supposed to.

I am also using a Barracuda Load Balancer in front of those 2 servers so when I take one server down, the load balancer sees this and starts sending all traffic to the other server/node until the first server comes back online. Then it sends the incoming connections to both servers. I can set the load balancer to use round robin or least connections. In either case it works as it should.