All socket threads blocked in Openfire 4.9.2

Dear Openfire community,

I was able to reproduce a bug where Openfire v4.9.2 no longer accepts any new connections because all socket threads are in a blocked state.

Test setup:

  • Establish 10,000 connections (40 users with 500 different resources each)
  • Ungracefully terminate all connections
  • Try to re-establish the connections at a rate of ~120/s (a rough client-side sketch of these steps follows this list)
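
For illustration, here is a rough sketch of the test client logic. This is not the exact tool we used; the host, domain, credentials and the choice of Smack are just placeholders for the sketch:

    import org.jivesoftware.smack.ConnectionConfiguration.SecurityMode;
    import org.jivesoftware.smack.tcp.XMPPTCPConnection;
    import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
    import org.jxmpp.jid.parts.Resourcepart;

    import java.util.ArrayList;
    import java.util.List;

    public class ReconnectStormTest {

        // Hypothetical test parameters; accounts are assumed to exist already.
        static final String DOMAIN = "example.org";
        static final String HOST = "127.0.0.1";
        static final int USERS = 40;
        static final int RESOURCES_PER_USER = 500;
        static final long RECONNECT_DELAY_MS = 1000 / 120;  // roughly 120 connections/s

        public static void main(String[] args) throws Exception {
            List<XMPPTCPConnection> connections = new ArrayList<>();

            // Step 1: establish all sessions.
            for (int u = 0; u < USERS; u++) {
                for (int r = 0; r < RESOURCES_PER_USER; r++) {
                    connections.add(connect("user" + u, "password", "res" + r));
                }
            }

            // Step 2: ungracefully terminate every connection. instantShutdown()
            // closes the TCP socket without sending </stream:stream>, so the server
            // only notices the dead session via TCP errors or idle timeouts.
            for (XMPPTCPConnection c : connections) {
                c.instantShutdown();
            }

            // Step 3: reconnect everything, paced to roughly 120 connections per second.
            for (int u = 0; u < USERS; u++) {
                for (int r = 0; r < RESOURCES_PER_USER; r++) {
                    connect("user" + u, "password", "res" + r);
                    Thread.sleep(RECONNECT_DELAY_MS);
                }
            }
        }

        static XMPPTCPConnection connect(String user, String password, String resource) throws Exception {
            XMPPTCPConnectionConfiguration config = XMPPTCPConnectionConfiguration.builder()
                .setXmppDomain(DOMAIN)
                .setHost(HOST)
                .setPort(5222)
                .setSecurityMode(SecurityMode.disabled)  // plain TCP for the load test
                .build();
            XMPPTCPConnection connection = new XMPPTCPConnection(config);
            connection.connect();
            // Binding a full JID that still has a stale server-side session should
            // exercise the "Resource conflict: Kick" path (IQBindHandler -> LocalSession.close).
            connection.login(user, password, Resourcepart.from(resource));
            return connection;
        }
    }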

Relevant configurations:

Resource conflict: Kick

At some point during test step 3, Openfire stopped accepting new connections. I captured a heap dump, analyzed the state of Openfire, and saw that every c2s socket thread was blocked in the following stack trace (a small standalone sketch of the blocking pattern follows the trace):

Thread 'socket_c2s-thread-17' with ID = 102
	java.lang.Object.wait(Object.java)
	java.lang.Object.wait(Object.java:328)
	io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:254)
	io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
	io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
	io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:405)
	io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
	io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
	org.jivesoftware.openfire.nio.NettyConnection.close(NettyConnection.java:237)
	org.jivesoftware.openfire.Connection.close(Connection.java:177)
	org.jivesoftware.openfire.session.LocalSession$$Lambda$911.accept(Native method)
	java.util.Optional.ifPresent(Optional.java:183)
	org.jivesoftware.openfire.session.LocalSession.close(LocalSession.java:481)
	org.jivesoftware.openfire.handler.IQBindHandler.handleIQ(IQBindHandler.java:131)
	org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:127)
	org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:403)
	org.jivesoftware.openfire.IQRouter.route(IQRouter.java:106)
	org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:74)
	org.jivesoftware.openfire.net.StanzaHandler.processIQ(StanzaHandler.java:392)
	org.jivesoftware.openfire.net.ClientStanzaHandler.processIQ(ClientStanzaHandler.java:90)
	org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:334)
	org.jivesoftware.openfire.net.StanzaHandler.processStanza(StanzaHandler.java:222)
	org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:114)
	org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:142)
	org.jivesoftware.openfire.nio.NettyConnectionHandler.channelRead0(NettyConnectionHandler.java:50)
	io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.handler.traffic.AbstractTrafficShapingHandler.channelRead(AbstractTrafficShapingHandler.java:506)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	java.lang.Thread.run(Thread.java:829)
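
To illustrate how I read the trace (this is my interpretation, not Openfire code): each socket_c2s event-loop thread is parked in DefaultPromise.await(), waiting on a channel close future that apparently can only be completed by another event-loop thread that is itself blocked in the same way. A minimal standalone Netty sketch of that kind of cross-executor wait:

    import io.netty.util.concurrent.DefaultEventExecutor;
    import io.netty.util.concurrent.EventExecutor;
    import io.netty.util.concurrent.Promise;

    public class EventLoopCrossWaitDemo {
        public static void main(String[] args) {
            EventExecutor loopA = new DefaultEventExecutor();
            EventExecutor loopB = new DefaultEventExecutor();

            // Each promise can only be completed by "its" executor's thread.
            Promise<Void> completedByA = loopA.newPromise();
            Promise<Void> completedByB = loopB.newPromise();

            // Thread A blocks waiting for B's work, thread B blocks waiting for A's work.
            loopA.execute(() -> {
                completedByB.awaitUninterruptibly();  // parks loop A forever
                completedByA.trySuccess(null);        // never reached
            });
            loopB.execute(() -> {
                completedByA.awaitUninterruptibly();  // parks loop B forever
                completedByB.trySuccess(null);        // never reached
            });
            // Both executor threads are now parked in DefaultPromise.await(),
            // similar to the socket_c2s threads in the heap dump, and neither
            // executor can process any further tasks. The process never exits.
        }
    }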

I can provide further information from the stack trace and can also try to reproduce the issue once more, in case any additional information is required.

Remark: The test was created in the process of reproducing a production issue where Openfire does not accept new connections after a network interruption. During that production issue, we observed that the affected machines accumulated a lot of TCP connections in the CLOSE_WAIT state, and we slowly lost user sessions. What might be relevant for the analysis is that the initial problem also occurred on Openfire version 4.7.2 (before the switch to NIO), but I have not had the chance to run the same test on such a machine.

Are there any configuration options available that could help with such an issue?

Best regards,
Nico

Thanks for this analysis. Are you in a position to test a main branch / Openfire 5.0.0 build in this manner? I believe some changes have been made that should help with this.

Hi Nico, I’m sorry to hear about this. If you’re able to provide a full thread dump, that would be helpful.

Without seeing direct evidence of any of this in the data that you provided, some thoughts:

  • Are you using websockets a lot? Are you in a position to check whether you can reproduce the problem without them?
  • Can you reproduce the issue with the stream management feature disabled? To do so, set the stream.management.active property to false in the Openfire admin console; see the sketch below.
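
The admin console's System Properties page is the usual place to change that. If you prefer doing it from plugin code, something along these lines should be equivalent (a sketch only, using Openfire's JiveGlobals property store):

    import org.jivesoftware.util.JiveGlobals;

    public class DisableStreamManagement {
        public static void main(String[] args) {
            // Equivalent to setting the "stream.management.active" system property
            // to "false" on the admin console's System Properties page.
            // Note: JiveGlobals is only usable inside a running Openfire instance,
            // so in practice this would live in plugin code rather than a standalone main().
            JiveGlobals.setProperty("stream.management.active", "false");
        }
    }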

Hello Guus & Daryl,

  • Here is the full stack trace:
    openfire_472_stacktrace.txt (117.7 KB)
  • Websockets are not used
  • Stream management is disabled
  • Next week I can also repeat the test with Openfire v5.0.0 and let you know the results.