Bosh connection blocks all SmackReactor threads

We are hitting a problem (in jigasi) that blocks all SmackReactor threads. The affected connections cannot be detected, and most of the XMPP functionality stops working.
In the thread dump, SmackReactor has a queue of more than 100 scheduled actions that have been waiting for several hours without being executed (7 hours in the dump I have).

Smack DefaultReactor Thread #0  Waiting Thread ID: 13
  jdk.internal.misc.Unsafe.park(Unsafe.java)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
  org.igniterealtime.jbosh.BOSHClient.blockUntilSendable(BOSHClient.java:830)
  org.igniterealtime.jbosh.BOSHClient.send(BOSHClient.java:485)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.send(XMPPBOSHConnection.java:321)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.sendElement(XMPPBOSHConnection.java:252)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.sendStanzaInternal(XMPPBOSHConnection.java:247)
  org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:873)
  org.jivesoftware.smack.AbstractXMPPConnection.sendAsync(AbstractXMPPConnection.java:2004)
  org.jivesoftware.smack.AbstractXMPPConnection.sendIqRequestAsync(AbstractXMPPConnection.java:1940)
  org.jivesoftware.smackx.ping.PingManager.pingAsync(PingManager.java:198)
  org.jivesoftware.smackx.ping.PingManager.pingServerIfNecessary(PingManager.java:440)
  org.jivesoftware.smackx.ping.PingManager$$Lambda$240.run()
  org.jivesoftware.smack.ScheduledAction.run(ScheduledAction.java:84)
  org.jivesoftware.smack.SmackReactor$Reactor.handleScheduledActionsOrPerformSelect(SmackReactor.java:208)
  org.jivesoftware.smack.SmackReactor$Reactor.reactorLoop(SmackReactor.java:188)
  org.jivesoftware.smack.SmackReactor$Reactor.run(SmackReactor.java:173)

Smack DefaultReactor Thread #1  Waiting Thread ID: 14
  jdk.internal.misc.Unsafe.park(Unsafe.java)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
  org.igniterealtime.jbosh.BOSHClient.blockUntilSendable(BOSHClient.java:830)
  org.igniterealtime.jbosh.BOSHClient.send(BOSHClient.java:485)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.send(XMPPBOSHConnection.java:321)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.sendElement(XMPPBOSHConnection.java:252)
  org.jivesoftware.smack.bosh.XMPPBOSHConnection.sendStanzaInternal(XMPPBOSHConnection.java:247)
  org.jivesoftware.smack.AbstractXMPPConnection.sendStanza(AbstractXMPPConnection.java:873)
  org.jivesoftware.smack.AbstractXMPPConnection.sendAsync(AbstractXMPPConnection.java:2004)
  org.jivesoftware.smack.AbstractXMPPConnection.sendIqRequestAsync(AbstractXMPPConnection.java:1940)
  org.jivesoftware.smackx.ping.PingManager.pingAsync(PingManager.java:198)
  org.jivesoftware.smackx.ping.PingManager.pingServerIfNecessary(PingManager.java:440)
  org.jivesoftware.smackx.ping.PingManager$$Lambda$240.run()
  org.jivesoftware.smack.ScheduledAction.run(ScheduledAction.java:84)
  org.jivesoftware.smack.SmackReactor$Reactor.handleScheduledActionsOrPerformSelect(SmackReactor.java:208)
  org.jivesoftware.smack.SmackReactor$Reactor.reactorLoop(SmackReactor.java:188)
  org.jivesoftware.smack.SmackReactor$Reactor.run(SmackReactor.java:173)

There are several threads waiting for the same lock.

The BOSH connection is stalled in:

"RequestProcessor[446930054]: Receive thread 0" #73 daemon prio=5 os_prio=0 cpu=659.51ms elapsed=64122.82s tid=0x00005615d7f04800 nid=0x1f47 runnable  [0x00007ff76d1d4000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(java.base@11.0.15/Native Method)
	at java.net.SocketInputStream.socketRead(java.base@11.0.15/SocketInputStream.java:115)
	at java.net.SocketInputStream.read(java.base@11.0.15/SocketInputStream.java:168)
	at java.net.SocketInputStream.read(java.base@11.0.15/SocketInputStream.java:140)
	at sun.security.ssl.SSLSocketInputRecord.read(java.base@11.0.15/SSLSocketInputRecord.java:478)
	at sun.security.ssl.SSLSocketInputRecord.readHeader(java.base@11.0.15/SSLSocketInputRecord.java:472)
	at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(java.base@11.0.15/SSLSocketInputRecord.java:70)
	at sun.security.ssl.SSLSocketImpl.readApplicationRecord(java.base@11.0.15/SSLSocketImpl.java:1454)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@11.0.15/SSLSocketImpl.java:1065)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:161)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:82)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:276)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:294)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:257)
	at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:230)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:679)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:481)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at org.igniterealtime.jbosh.ApacheHTTPResponse.awaitResponse(ApacheHTTPResponse.java:235)
	- locked <0x0000000748750ce0> (a org.igniterealtime.jbosh.ApacheHTTPResponse)
	at org.igniterealtime.jbosh.ApacheHTTPResponse.getBody(ApacheHTTPResponse.java:192)
	at org.igniterealtime.jbosh.BOSHClient.processExchange(BOSHClient.java:1123)
	at org.igniterealtime.jbosh.BOSHClient.processMessages(BOSHClient.java:999)
	at org.igniterealtime.jbosh.BOSHClient.access$300(BOSHClient.java:100)
	at org.igniterealtime.jbosh.BOSHClient$RequestProcessor.run(BOSHClient.java:1728)
	at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)

And this never times out. We normally detect stalled connections with pings and trigger a reconnect, but that is impossible when both reactor threads are blocked.
SmackReactor is supposed to be used with non-blocking I/O, but in the case of BOSH the send is blocking.
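
The starvation described above can be reproduced in a minimal, self-contained sketch. It does not use Smack or jbosh: a two-thread `ExecutorService` stands in for the two reactor threads, and `put()` on a full `ArrayBlockingQueue` stands in for `BOSHClient.blockUntilSendable()`. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Smack or jbosh.
public class ReactorStarvationDemo {

    /**
     * Blocks both "reactor" threads on a full outgoing queue, then submits a
     * further action and reports whether it managed to run within waitMillis.
     */
    public static boolean pingRunsWhileSendersBlocked(long waitMillis) {
        // Stand-in for the two SmackReactor threads (assumption for the demo).
        ExecutorService reactor = Executors.newFixedThreadPool(2);
        // Stand-in for a full outgoing queue that nobody drains.
        BlockingQueue<String> outgoing = new ArrayBlockingQueue<>(1);
        outgoing.add("pending stanza");

        CountDownLatch sendersStarted = new CountDownLatch(2);
        CountDownLatch pingRan = new CountDownLatch(1);
        try {
            for (int i = 0; i < 2; i++) {
                reactor.execute(() -> {
                    sendersStarted.countDown();
                    try {
                        outgoing.put("ping"); // blocks: the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            sendersStarted.await();
            // Both threads are now occupied, so this action can only queue up.
            reactor.execute(pingRan::countDown);
            return pingRan.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            reactor.shutdownNow(); // interrupts the blocked senders
        }
    }
}
```

Calling `ReactorStarvationDemo.pingRunsWhileSendersBlocked(200)` returns `false`: the later action never gets a thread, which mirrors how the ping-based stall detection cannot run once every reactor thread is parked in a blocking send.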

@Flow wdyt? What is the way to go here? Any help is welcome :slight_smile: Thanks

edited-stack-2022-06-01-1308-7909.threads.zip (13.1 KB)

This appears to be a classic instance of a blocking operation in a reactor. It is likely caused by BOSHClient’s outgoing queue being full, which causes BOSHClient to block.

I believe the “proper” fix would be to make XMPPConnection.sendStanza() throw an exception if the connection’s outgoing queue is full, but that would require modifications in multiple places in Smack, and even in jbosh. A quick and dirty band-aid could be to run the sendStanza() invocation in AbstractXMPPConnection.sendAsync() in an extra thread (potentially by using Smack’s Async.go(Runnable) API). But that would increase the cost of sendAsync(), as a new thread would be created every time this method is invoked. Furthermore, Smack could potentially create an unbounded number of threads (which is bad).
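
The trade-off above (never block the caller, but do not create unbounded threads either) can be sketched with plain `java.util.concurrent`. This is only an illustration under assumptions, not Smack code and not the change that was eventually made: `BoundedAsyncSender` and its methods are hypothetical names. The blocking send is handed to a `ThreadPoolExecutor` with a fixed thread count and a bounded backlog, and saturation is reported to the caller instead of blocking.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Smack.
public class BoundedAsyncSender {
    private final ThreadPoolExecutor pool;

    public BoundedAsyncSender(int maxThreads, int maxPending) {
        // Fixed number of threads plus a bounded backlog. With AbortPolicy,
        // execute() throws RejectedExecutionException once both are exhausted.
        pool = new ThreadPoolExecutor(maxThreads, maxThreads, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(maxPending), new ThreadPoolExecutor.AbortPolicy());
    }

    /** Never blocks the caller; returns false if the sender is saturated. */
    public boolean trySendAsync(Runnable blockingSend) {
        try {
            pool.execute(blockingSend);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public void shutdownNow() {
        pool.shutdownNow();
    }

    /** Simulates a stalled connection and counts how many sends were accepted. */
    public static int demoAcceptedSends(int attempts) {
        BoundedAsyncSender sender = new BoundedAsyncSender(2, 3);
        CountDownLatch neverReleased = new CountDownLatch(1); // the "stalled" send
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            boolean ok = sender.trySendAsync(() -> {
                try {
                    neverReleased.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            if (ok) {
                accepted++;
            }
        }
        sender.shutdownNow(); // interrupts the stalled sends
        return accepted;
    }
}
```

With two threads and a backlog of three, `BoundedAsyncSender.demoAcceptedSends(10)` accepts five sends and rejects the rest, so a stalled connection costs a bounded amount of resources and the caller learns about it immediately.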

So after thinking a little bit about this, it really appears to be a problem that can only be properly fixed in Smack by changing its API. That is nothing for the current stable 4.4 branch or future releases of Smack’s 4.4 series. That said, the question is whether we could get some code into 4.4 that helps you work around the issue. I was thinking about something along the lines of

What do you think?

Ha, I just discovered that XMPPConnection.trySendStanza() was added in 4.4. Not that it helps us much, since it is probably mapped to sendStanza() in BOSHConnection.

Sounds good as a temp workaround, thank you.

Another candidate for 4.4.6 :stuck_out_tongue:

Yes, good thing that I did not release it right away. Usually a new bug affecting the stable series appears just after the release. That appears to be a fundamental law of software engineering.

I’ve updated the PR and plan to merge as soon as the CI is green, and then release 4.4.6 probably within the next 2-3 days.

Can you push another snapshot? I may try testing it under load in the next few days.

Done

https://bamboo.igniterealtime.org/browse/SMACK-NIGHTLYSTABLE-233

Looks good for now after 24 hours … running dial-in on meet.jit.si with that change.

2 posts were split to a new topic: State of smack-bosh

We hit another strange problem a few days ago. At some point, all instances in one of the regions experienced something that left the reactors in this state:

"Smack DefaultReactor Thread #0" #13 daemon prio=5 os_prio=0 cpu=66931.72ms elapsed=324527.06s tid=0x00007fd23cd16000 nid=0x1cb3 runnable  [0x00007fd21214f000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPoll.wait(java.base@11.0.15/Native Method)
	at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.15/EPollSelectorImpl.java:120)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.15/SelectorImpl.java:124)
	- locked <0x0000000740001b80> (a sun.nio.ch.Util$2)
	- locked <0x0000000740001a10> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(java.base@11.0.15/SelectorImpl.java:136)
	at org.jivesoftware.smack.SmackReactor$Reactor.handleScheduledActionsOrPerformSelect(SmackReactor.java:256)
	- locked <0x0000000740001a10> (a sun.nio.ch.EPollSelectorImpl)
	at org.jivesoftware.smack.SmackReactor$Reactor.reactorLoop(SmackReactor.java:188)
	at org.jivesoftware.smack.SmackReactor$Reactor.run(SmackReactor.java:173)

"Smack DefaultReactor Thread #1" #14 daemon prio=5 os_prio=0 cpu=67194.92ms elapsed=324527.06s tid=0x00007fd23cd18000 nid=0x1cb4 waiting for monitor entry  [0x00007fd21204e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.jivesoftware.smack.SmackReactor$Reactor.handleScheduledActionsOrPerformSelect(SmackReactor.java:214)
	- waiting to lock <0x0000000740001a10> (a sun.nio.ch.EPollSelectorImpl)
	at org.jivesoftware.smack.SmackReactor$Reactor.reactorLoop(SmackReactor.java:188)
	at org.jivesoftware.smack.SmackReactor$Reactor.run(SmackReactor.java:173)

And the queues keep growing. Have you seen something like that? Any idea?

Oh, now I see this is not running the latest snapshot with the bosh workaround … the same issue I think, will update it.

FWIW I’ve just merged

Note that this adds support for non-blocking sends to BOSH connections, amongst other connection types. But in the case of BOSH, the error handling and reporting is not (yet) ideal, basically because the outgoing elements are now written in an extra thread, and exceptions in this thread are not reported back to the user (JUL logging does not really count).

I don’t have immediate plans to improve that. Hence I still recommend using Smack’s new modular connection infrastructure and WebSockets, which has better error reporting (and further improvements). However, of course, the new code being new, it may also have unknown bugs.
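
For readers wondering what reporting writer-thread exceptions back to the caller could look like in general, here is a hedged sketch using a `CompletableFuture`. It is not how Smack’s BOSH connection works and not a proposed patch: `ReportingWriter`, its `Write` interface, and `sendAsync()` are hypothetical names for this example.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not part of Smack.
public class ReportingWriter {

    /** A write operation that may fail, e.g. with an IOException. */
    public interface Write {
        void run() throws Exception;
    }

    // A single extra thread performs the (possibly blocking) writes.
    private final ExecutorService writerThread = Executors.newSingleThreadExecutor();

    /** Runs the write on the writer thread and exposes its outcome as a future. */
    public CompletableFuture<Void> sendAsync(Write write) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        writerThread.execute(() -> {
            try {
                write.run();
                result.complete(null);
            } catch (Exception e) {
                // Hand the failure to the caller instead of only logging it.
                result.completeExceptionally(e);
            }
        });
        return result;
    }

    public void shutdown() {
        writerThread.shutdown();
    }

    /** Demo: a failing write surfaces its exception message to the caller. */
    public static String demoFailureMessage() {
        ReportingWriter writer = new ReportingWriter();
        try {
            writer.sendAsync(() -> {
                throw new IOException("connection stalled");
            }).get();
            return "no error";
        } catch (ExecutionException e) {
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            writer.shutdown();
        }
    }
}
```

The caller can attach a callback to the returned future or wait on it, so a failure in the writer thread is visible beyond the log output.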
