Timeouts and ForkJoinPool starvation when clustering with Hazelcast

Hello, we’re running:

  • Openfire 5.0.1
  • Hazelcast plugin 5.5.0.1
  • Two clustered server instances (8xCPU, 32GB RAM)

This setup is intended to replace our original single-server deployment, which handles ~60,000 connections at its busiest periods.

With this cluster, as soon as the load starts increasing rapidly (first thing in the morning, when clients start to connect), we see continuous timeout and threading errors such as:

2025.10.13 08:31:56.514 INFO  [hz.openfire.cached.thread-248]: org.jivesoftware.openfire.session.RemoteSessionTask - An exception was logged while executing RemoteSessionTask to close session: LocalClientSession{address=204fda1d-0867-4781-85a9-7ea7e5936cee@xmpp.displaynote.com/I/5F3601E1F57AC2A2BE5E976DD537229D130B7155, streamID=373st5uxpw, status=CLOSED, isEncrypted=true, isDetached=false, serverName='xmpp.displaynote.com', isInitialized=true, hasAuthToken=true, peer address='168.254.25.129', presence='<presence from="204fda1d-0867-4781-85a9-7ea7e5936cee@xmpp.displaynote.com/I/5F3601E1F57AC2A2BE5E976DD537229D130B7155"><c xmlns="http://jabber.org/protocol/caps" hash="sha-1" node="https://github.com/qxmpp-project/qxmpp" ver="K3ag6SarEcZHRQXYiCJ4QixxRkE="></c><receiver-status xmlns="https://www.displaynote.com/ns/commands" receiver-platform="android" status="idle" version-tag="2.39.11"></receiver-status></presence>'}
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:204) ~[?:?]
        at org.jivesoftware.openfire.session.RemoteSessionTask.run(RemoteSessionTask.java:162) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.session.ClientSessionTask.run(ClientSessionTask.java:71) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:603) ~[hazelcast-5.5.0.1.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        at com.hazelcast.executor.impl.DistributedExecutorService$Processor.run(DistributedExecutorService.java:286) ~[hazelcast-5.5.0.jar:5.5.0]
        at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:220) ~[hazelcast-5.5.0.jar:5.5.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) [?:?]
        at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76) [hazelcast-5.5.0.jar:5.5.0]
        at com.hazelcast.internal.util.executor.PoolExecutorThreadFactory$ManagedThread.executeRun(PoolExecutorThreadFactory.java:74) [hazelcast-5.5.0.jar:5.5.0]
        at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:111) [hazelcast-5.5.0.jar:5.5.0]

or (more timeouts):

2025.10.13 08:31:33.019 ERROR [socket_c2s-worker-303]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds
java.util.concurrent.TimeoutException: MemberCallableTaskOperation failed to complete within 59999955356 NANOSECONDS. Invocation{op=com.hazelcast.executor.impl.operations.MemberCallableTaskOperation{serviceName='hz:impl:executorService', identityHash=2128197160, partitionId=-1, replicaIndex=0, callId=11441889, invocationTime=1760344233019 (2025-10-13 08:30:33.019), waitTimeout=-1, callTimeout=30000, tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0, name=openfire::cluster::executor}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=30000, firstInvocationTimeMs=1760344233019, firstInvocationTime='2025-10-13 08:30:33.019', lastHeartbeatMillis=1760344288852, lastHeartbeatTime='2025-10-13 08:31:28.852', targetAddress=[xmpp-cluster-prod-02.displaynote.com]:5701, targetMember=Member [xmpp-cluster-prod-02.displaynote.com]:5701 - 9e1acebd-a193-4667-a2df-bc40964eb2fd, memberListVersion=2, pendingResponse={VOID}, backupsAcksExpected=-1, backupsAcksReceived=0, connection=Connection[id=1, /10.18.45.4:5701->/10.18.45.5:53051, qualifier=null, endpoint=[xmpp-cluster-prod-02.displaynote.com]:5701, remoteUuid=9e1acebd-a193-4667-a2df-bc40964eb2fd, alive=true, connectionType=MEMBER, planeIndex=0]}
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newTimeoutException(InvocationFuture.java:85) ~[?:?]
        at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:657) ~[?:?]
        at com.hazelcast.spi.impl.DelegatingCompletableFuture.get(DelegatingCompletableFuture.java:119) ~[?:?]
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:433) ~[?:?]
        at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:779) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.handler.IQBindHandler.handleIQ(IQBindHandler.java:126) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:125) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:403) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:106) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:392) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:90) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:334) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.net.StanzaHandler.processStanza(StanzaHandler.java:222) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:114) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:142) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:50) ~[xmppserver-5.0.1.DNRC1.jar:5.0.1.DNRC1]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289) ~[netty-handler-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.traffic.AbstractTrafficShapingHandler.channelRead(AbstractTrafficShapingHandler.java:506) ~[netty-handler-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1515) ~[netty-handler-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1378) ~[netty-handler-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1427) ~[netty-handler-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) ~[netty-codec-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) ~[netty-codec-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) ~[netty-common-4.1.118.Final.jar:4.1.118.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.118.Final.jar:4.1.118.Final]
        at java.lang.Thread.run(Thread.java:1583) [?:?]

or (ForkJoinPool thread starvation):

2025.10.13 08:31:37.598 WARN  [ForkJoinPool.commonPool-worker-263]: org.jivesoftware.openfire.nio.NettyConnection - Exception while invoking close listeners for NettyConnection{peer: /110.54.145.108:54898, state: CLOSED, session: LocalClientSession{address=xmpp.displaynote.com/2a710dde-d4c0-4365-b503-c8519504cb6f, streamID=67iyql8bau, status=CLOSED, isEncrypted=true, isDetached=false, serverName='xmpp.displaynote.com', isInitialized=false, hasAuthToken=true, peer address='110.54.145.108', presence='<presence type="unavailable"/>'}, Netty channel handler context name: NettyClientConnectionHandler#0}
java.util.concurrent.CompletionException: java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874) [?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) [?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) [?:?]
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1810) [?:?]
        at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796) [?:?]
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387) [?:?]
        at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312) [?:?]
        at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843) [?:?]
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808) [?:?]
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188) [?:?]
Caused by: java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
        at java.util.concurrent.ForkJoinPool.tryCompensate(ForkJoinPool.java:2000) ~[?:?]
        at java.util.concurrent.ForkJoinPool.compensatedBlock(ForkJoinPool.java:3737) ~[?:?]
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3723) ~[?:?]
        at com.hazelcast.spi.impl.AbstractInvocationFuture.manageParking(AbstractInvocationFuture.java:692) ~[?:?]
        at com.hazelcast.spi.impl.AbstractInvocationFuture.joinInternal(AbstractInvocationFuture.java:583) ~[?:?]
        at com.hazelcast.internal.locksupport.LockProxySupport.lock(LockProxySupport.java:67) ~[?:?]
        at com.hazelcast.internal.locksupport.LockProxySupport.lock(LockProxySupport.java:59) ~[?:?]
        at com.hazelcast.map.impl.proxy.MapProxyImpl.lock(MapProxyImpl.java:321) ~[?:?]
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache$ClusterLock.doLock(ClusteredCache.java:438) ~[?:?]
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache$ClusterLock.lock(ClusteredCache.java:402) ~[?:?]
        at org.jivesoftware.openfire.spi.RoutingTableImpl.removeClientRoute(RoutingTableImpl.java:989) ~[?:?]
        at org.jivesoftware.openfire.SessionManager.removeSession(SessionManager.java:1286) ~[?:?]
        at org.jivesoftware.openfire.SessionManager.removeSession(SessionManager.java:1262) ~[?:?]
        at org.jivesoftware.openfire.SessionManager$ClientSessionListener.lambda$onConnectionClosing$2(SessionManager.java:1397) ~[?:?]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
        ... 9 more

I already have this configured in my hazelcast-local-config.xml:

    <executor-service name="openfire::cluster::executor">
        <pool-size>500</pool-size>
        <queue-capacity>4000</queue-capacity>
        <statistics-enabled>true</statistics-enabled>
        <!-- <split-brain-protection-ref>splitbrainprotection-name</split-brain-protection-ref> -->
    </executor-service>

Is there anything we might be missing in the configuration that would reduce contention and let the cluster handle this load?

Happy to provide more details if needed. Thanks!

Hi Miguel!

I’m sorry you’re running into this issue!

What seems to be happening is that one cluster node is complaining that the other cluster node is not giving it an answer (or isn’t giving one fast enough). I think all of these stack traces are generated on the cluster node that’s waiting on the other one.

What would be of interest is finding out why the other node isn’t responding in a timely manner. A good way of determining that is to look at thread dumps taken on that node while these errors are happening. Those dumps are likely to show what the slow-to-respond cluster node is busy with.

Additionally, looking at the logs of that node could give more indications, but you’ve likely already tried that.

Try generating those thread dumps. There are a couple of ways to do that; see the sketch below.
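
The quickest options are the standard JDK tools: run `jstack <pid>` or `jcmd <pid> Thread.print` against the Openfire JVM, or send the process a SIGQUIT (`kill -3 <pid>`), which makes the JVM print a thread dump to its standard output. If you would rather capture a dump automatically the moment these timeouts start appearing, something along the lines of the sketch below could be used from within the JVM. This is only an illustration of the JDK’s ThreadMXBean API (the class name and how you trigger it are up to you), not something that ships with Openfire:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    /**
     * Minimal sketch: prints a thread dump of the current JVM to standard output.
     * Note that ThreadInfo.toString() abbreviates very deep stack traces, so for
     * full traces the command-line tools (jstack / jcmd Thread.print) are better.
     */
    public class ThreadDumpSketch {

        public static void dump() {
            ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
            // Include locked monitors and synchronizers: that is what shows which
            // threads are blocked waiting on cluster locks or remote invocations.
            ThreadInfo[] threads = threadMXBean.dumpAllThreads(true, true);
            StringBuilder dump = new StringBuilder();
            for (ThreadInfo thread : threads) {
                dump.append(thread);
            }
            System.out.println(dump);
        }

        public static void main(String[] args) {
            dump();
        }
    }

Whichever route you take, compare dumps from both nodes: the interesting part is what the threads on the node that is *not* logging these timeouts are blocked on.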

Hello Guus, thanks for answering so fast!

We are planning to run some controlled load testing on our setup. It will take us some time; I will provide further details when results are available.

Regards!
