All socket threads blocked in Openfire 4.9.2

Hello dear Openfire community,

During testing I was able to reproduce a bug where Openfire v4.9.2 is not able to accept any new connections because all socket threads are in a blocked state.

Test setup:

  • Establish 10,000 connections (40 users with 500 different resources each)
  • Ungracefully terminate all connections
  • Try to re-establish the connections at a rate of ~120/s

Relevant configurations:

Resource conflict: Kick
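
For reference, the client behaviour in steps 1 and 2 can be approximated with Smack roughly as follows. This is only a hedged sketch: the domain, host, credentials and resource are placeholders, and the real test drives 40 users with 500 resources each.

    import org.jivesoftware.smack.ConnectionConfiguration;
    import org.jivesoftware.smack.tcp.XMPPTCPConnection;
    import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
    import org.jxmpp.jid.parts.Resourcepart;

    public class TestClientSketch {
        public static void main(String[] args) throws Exception {
            XMPPTCPConnectionConfiguration config = XMPPTCPConnectionConfiguration.builder()
                    .setXmppDomain("example.org")                        // placeholder domain
                    .setHost("127.0.0.1")                                // placeholder host
                    .setUsernameAndPassword("user1", "secret")           // one of the 40 test users
                    .setResource(Resourcepart.from("resource-001"))      // fixed resource, re-used on reconnect
                    .setSecurityMode(ConnectionConfiguration.SecurityMode.disabled) // test-only
                    .build();

            XMPPTCPConnection connection = new XMPPTCPConnection(config);
            connection.connect().login();

            // "Ungraceful" termination: drop the TCP socket without sending a closing stream element.
            connection.instantShutdown();
        }
    }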

At some point during the third step of the test, Openfire stopped accepting any new connections. I captured a heap dump, analyzed the state of Openfire, and saw that every c2s socket thread showed the following stack trace:

Thread 'socket_c2s-thread-17' with ID = 102
	java.lang.Object.wait(Object.java)
	java.lang.Object.wait(Object.java:328)
	io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:254)
	io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
	io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
	io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:405)
	io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
	io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
	org.jivesoftware.openfire.nio.NettyConnection.close(NettyConnection.java:237)
	org.jivesoftware.openfire.Connection.close(Connection.java:177)
	org.jivesoftware.openfire.session.LocalSession$$Lambda$911.accept(Native method)
	java.util.Optional.ifPresent(Optional.java:183)
	org.jivesoftware.openfire.session.LocalSession.close(LocalSession.java:481)
	org.jivesoftware.openfire.handler.IQBindHandler.handleIQ(IQBindHandler.java:131)
	org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:127)
	org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:403)
	org.jivesoftware.openfire.IQRouter.route(IQRouter.java:106)
	org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74)
	org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:392)
	org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:90)
	org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:334)
	org.jivesoftware.openfire.net.StanzaHandler.processStanza(StanzaHandler.java:222)
	org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:114)
	org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:142)
	org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:50)
	io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.traffic.AbstractTrafficShapingHandler.channelRead(AbstractTrafficShapingHandler.java:506)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	java.lang.Thread.run(Thread.java:829)
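
To illustrate what the trace boils down to (a hedged sketch with hypothetical handler and field names, not Openfire's actual code): a handler running on a Netty event loop thread calls sync() on the close future of another channel, and while that future is pending, the calling event loop cannot process any further work.

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    class KickOldSessionHandler extends ChannelInboundHandlerAdapter {

        private final Channel oldSessionChannel; // channel of the session being kicked (hypothetical)

        KickOldSessionHandler(Channel oldSessionChannel) {
            this.oldSessionChannel = oldSessionChannel;
        }

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
            // Blocking call on an event loop thread: DefaultChannelPromise.sync() ->
            // await() -> Object.wait(), the same frames visible in the stack trace.
            // Until the other channel's close completes, this event loop is stuck.
            oldSessionChannel.close().sync();
            ctx.fireChannelRead(msg);
        }
    }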

I can provide further information from the stack trace and can also try to reproduce the issue once more, in case any additional information is required.

Remark: The test was created while trying to reproduce a production issue in which Openfire stops accepting new connections after a network interruption. During that production issue, we observed that the affected machines accumulated a lot of TCP connections in the CLOSE_WAIT state and we slowly lost user sessions. What might be relevant for the analysis is that the initial problem also occurred on Openfire version 4.7.2 (before the switch to the Netty-based NIO implementation), but I have not had the chance to run the same test on such a machine.

Are there any configuration options available that could help with such an issue?

Best regards,
Nico

Thanks for this analysis. Are you in a position to test a main branch / Openfire 5.0.0 build in this manner? I believe some changes have been made that may help with this.

Hi Nico, I’m sorry to hear about this. If you’re able to provide a full thread dump, that would be helpful.

Without seeing direct evidence of any of this in the data that you provided, some thoughts:

  • Are you using websockets a lot? Are you in a position to see whether you can reproduce the problem without them?
  • Can you reproduce the issue with the stream management feature disabled? To do so, set stream.management.active to false in the Openfire admin console.

Hello Guus & Daryl,

  • Here is the full stack trace:
    openfire_472_stacktrace.txt (117.7 KB)
  • Websockets are not used
  • Stream management is disabled
  • Next week I can also repeat the test with Openfire v5.0.0 and let you know the results.

Hello Guus & Daryl,

I was able to reproduce the same issue in the following test setup: Openfire 5.0.0 alpha release with Java 17 and the same configuration as above. This time I was also able to reproduce the issue with only 40 parallel connections. The server was unable to accept new connections for 600 seconds; after that, the connections to my clients in the CLOSE_WAIT state were terminated and the server was functional again without any manual intervention.

I have attached the new stack trace for you here:
openfire_500_stacktrace.txt (119.6 KB)

Is there anything else I could provide that would help you analyze the problem further?

Hi Nico, thanks for providing this information. That is very helpful.

You do not appear to be using the very latest Openfire code, as that has a minor change in the relevant code. It might be worth trying again against the latest nightly build (although I don't have high hopes that this would resolve the problem).

The ‘connection-close’ event for Openfire TCP client connections executes asynchronously. Openfire waits up to 10 minutes for the corresponding event listeners to finish processing the event that a connection was closed. This really is just a safety net, as these event listeners are expected to return almost immediately. For some reason, you are hitting that 10-minute limit. This is curious, and shouldn't be possible.
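
For illustration, the general pattern described here (fire the listeners asynchronously, then wait with a timeout as a safety net) looks roughly like the sketch below. Class and method names are hypothetical; this is not the actual Openfire implementation, which is linked underneath.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class CloseListenerDispatcher {

        private final ExecutorService executor = Executors.newCachedThreadPool();

        /** Fire all close listeners asynchronously and wait at most 10 minutes for them. */
        void dispatchAndAwait(List<Runnable> closeListeners) throws InterruptedException {
            CountDownLatch done = new CountDownLatch(closeListeners.size());
            for (Runnable listener : closeListeners) {
                executor.submit(() -> {
                    try {
                        listener.run();      // expected to return almost immediately
                    } finally {
                        done.countDown();    // count down even if the listener throws
                    }
                });
            }
            // Safety net only: give up after 10 minutes instead of blocking forever.
            if (!done.await(10, TimeUnit.MINUTES)) {
                System.err.println("Timed out waiting for close listeners to complete.");
            }
        }
    }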

The corresponding code can be found in Openfire/xmppserver/src/main/java/org/jivesoftware/openfire/nio/NettyConnection.java at main in the igniterealtime/Openfire repository on GitHub.

One of the reasons for this to happen could be a close listener that takes a very long time to return, but from what I can tell, your code doesn't seem to be busy invoking such listeners. Maybe they errored out?

Do you use custom close listeners?

Do any of the close listeners (custom or otherwise) generate errors in the logs?

The problem appears to occur when a client logs in again with the same resource-part that it was using before (Openfire tries to kick the old session, which is where the delay is introduced). As a workaround, you could consider not re-using the resource-part.
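
A hedged sketch of that workaround for a Smack-based client (placeholder domain and credentials): generate a fresh resource-part on every connection attempt, so the server never has to kick an old session.

    import java.util.UUID;
    import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
    import org.jxmpp.jid.parts.Resourcepart;
    import org.jxmpp.stringprep.XmppStringprepException;

    class UniqueResourceConfig {
        static XMPPTCPConnectionConfiguration build() throws XmppStringprepException {
            return XMPPTCPConnectionConfiguration.builder()
                    .setXmppDomain("example.org")                 // placeholder domain
                    .setUsernameAndPassword("user1", "secret")    // placeholder credentials
                    // A unique resource per attempt avoids the resource conflict / session kick entirely.
                    .setResource(Resourcepart.from("client-" + UUID.randomUUID()))
                    .build();
        }
    }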

I repeated the test with the Openfire nightly build from 2025-02-17 and also replaced our certificates and users with generic ones. I am therefore now also able to share the full heap dump and log trace with you (via e-mail).

Since no additional code or plugins run in this installation, only the standard Openfire close listeners are invoked.

Here is the list of configuration changes I applied to the server to make it compatible with the test (a sketch of applying them programmatically follows the list):

  • xmpp.client.version-query.enabled → false
  • xmpp.client.other-resource.presence → false
  • stream.management.active → false
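
For completeness, a hedged sketch of applying the same properties programmatically (for example from a server-side plugin) via Openfire's JiveGlobals utility. The property names are copied verbatim from the list above and should be verified against the Openfire version in use.

    import org.jivesoftware.util.JiveGlobals;

    class TestProperties {

        /** Apply the test-specific property changes listed above. */
        static void apply() {
            JiveGlobals.setProperty("xmpp.client.version-query.enabled", "false");
            JiveGlobals.setProperty("xmpp.client.other-resource.presence", "false");
            JiveGlobals.setProperty("stream.management.active", "false");
        }
    }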

Unfortunately, it is very hard to change the client behaviour in our production environment, so a workaround that changes the resource is not possible for us.

Again, thank you!

Thanks for the data, that should be helpful. Sadly, I’ve not found much yet.

I’ve been trying to reproduce the problem, using two similar but distinct approaches, but I’ve been unsuccessful.

This is the Smack-based code that I used to try and reproduce the problem:
TestClientReconnect.java (5.1 KB)
TestClientReconnectResourceConflict.java (6.3 KB)

Both tests run successfully. I’ve checked explicitly: there is no logging of this message (which is what I assume is a key characteristic of the problem):

org.jivesoftware.openfire.nio.NettyConnection - Timed out waiting for close listeners to complete.

Can you please review the code, and suggest changes to give it a better chance of reproducing the problem?

With your version of the code, the connection rate was too low to reproduce the issue. I moved each connection into its own thread, and now I am able to reproduce the same issue with the Smack client as well (see the attached code) on an M2 Pro with a 100 Mbit internet connection.

TestClientReconnectResourceConflictParallel.java (4.5 KB)
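
Roughly, the one-thread-per-connection approach described above looks like the outline below. This is a hedged sketch only, not the attached file; connectAndLogin() is a placeholder for the actual Smack connect/login/shutdown cycle.

    import java.util.ArrayList;
    import java.util.List;

    class ParallelReconnectSketch {

        public static void main(String[] args) throws InterruptedException {
            int connectionCount = 40;
            List<Thread> threads = new ArrayList<>();

            for (int i = 0; i < connectionCount; i++) {
                final int clientIndex = i;
                Thread t = new Thread(() -> connectAndLogin(clientIndex), "client-" + clientIndex);
                threads.add(t);
                t.start();               // all connections are attempted concurrently
            }
            for (Thread t : threads) {
                t.join();
            }
        }

        private static void connectAndLogin(int clientIndex) {
            // Placeholder for the per-client connect/login/instantShutdown logic.
        }
    }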

Sadly, I'm still not reproducing the problem. I'm running Openfire on the same machine as the tests. Even though it's a beefy machine, running both together may affect things, I suppose. I wonder if we can introduce something in Openfire's code base to make this problem more likely to occur - something like a sleep() or some such…
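
Purely as an illustration of that idea (hypothetical and test-only, not a proposed change to Openfire): a deliberately slow close listener body would widen the window in which the close timeout can be hit.

    class SlowCloseListenerProbe {

        /** A close listener body that deliberately sleeps, simulating a slow listener. */
        static Runnable slowCloseListener() {
            return () -> {
                try {
                    Thread.sleep(5_000); // artificial delay to make the race easier to hit
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
        }
    }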