HTTP-bind sessions lose RIDs, causing session termination

I have a Strophe.js client connecting to an Openfire server via BOSH. This works as expected, but at some low rate (perhaps 5% of requests) the server seems to lose requests that appear successful to the client, which eventually causes the connection to be lost. I think this is a bug, but I can’t tell what’s triggering it.

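For context, the client setup is nothing unusual; here is a minimal sketch of it (the endpoint URL and JID are placeholders, not my real values):

    import { Strophe } from "strophe.js";

    // Placeholder BOSH endpoint; the real client uses Openfire's default
    // /http-bind/ path.
    const connection = new Strophe.Connection("https://example.org:7443/http-bind/");

    connection.connect("user@example.org", "password", (status: number) => {
        if (status === Strophe.Status.CONNECTED) {
            console.log("connected over BOSH");
        } else if (status === Strophe.Status.DISCONNECTED) {
            // The terminate response described below eventually lands here.
            console.log("disconnected");
        }
    });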

The specific case I have here involves a connection that stayed alive for about 20 minutes before failing.

When a problem occurs, the Openfire log contains a timeout:

2021.11.10 04:33:18 WARN [Jetty-QTP-BOSH-4254]: org.eclipse.jetty.server.HttpChannelState - java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms while invoking onTimeout listener org.jivesoftware.openfire.http.HttpSession$3@4af15a58
2021.11.10 04:33:18 WARN [Jetty-QTP-BOSH-4254]: org.eclipse.jetty.util.thread.QueuedThreadPool -
java.lang.IllegalStateException: ABORTED
at org.eclipse.jetty.server.HttpChannelState.sendError(HttpChannelState.java:896) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannelState.nextAction(HttpChannelState.java:501) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannelState.unhandle(HttpChannelState.java:420) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:555) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:340) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773) ~[jetty-util-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905) [jetty-util-9.4.35.v20201120.jar:9.4.35.v20201120]
at java.lang.Thread.run(Thread.java:831) [?:?]
2021.11.10 04:34:19 WARN [Jetty-QTP-BOSH-866]: org.eclipse.jetty.server.HttpChannelState - java.io.IOException: Unexpected RID error. while invoking onTimeout listener org.jivesoftware.openfire.http.HttpSession$3@5128f7ba
2021.11.10 04:35:20 WARN [Jetty-QTP-BOSH-866]: org.eclipse.jetty.server.HttpChannelState - java.io.IOException: Unexpected RID error. while invoking onTimeout listener org.jivesoftware.openfire.http.HttpSession$3@63f56b66
2021.11.10 04:35:20 WARN [Jetty-QTP-BOSH-866]: org.jivesoftware.openfire.http.HttpSession - Request 2556721695 > 2556721694, ending session 3grfbt8ct7 

To investigate, I tried raising the number of in-flight requests the server permits (xmpp.httpbind.client.requests.max) from 2 to 3, as suggested by an answer to a similar problem. That only seems to have increased the time before the session is closed: with the limit at 3, this log contains two Unexpected RID errors before the server closes the connection, whereas with the limit at 2 there was only one. This suggests that the actual problem is the immediately preceding TimeoutException.

On the client, the following requests were sent immediately prior to the connection loss:

  1. RID 2556721692 at 04:32:18, waited 60 seconds with no data and got ack 2556721692
  2. RID 2556721693 at 04:33:19, waited 60 seconds and got ack 2556721693
  3. RID 2556721694 at 04:34:20, got ack 2556721692 after 60 seconds (note the ack regressed to an already-acknowledged RID)
  4. RID 2556721695 at 04:35:20, got immediate terminate response <body xmlns="http://jabber.org/protocol/httpbind" type="terminate" condition="item-not-found"/>

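For anyone trying to reproduce this kind of trace: I captured the RIDs and acks with Strophe’s rawOutput/rawInput hooks on the connection object from the sketch above. A rough version (regex matching rather than a proper XML parse, and it assumes the server includes ack attributes in its responses, which mine does):

    // Log outgoing RIDs and incoming acks, to spot regressions like
    // RID 2556721694 being answered with ack 2556721692 above.
    connection.rawOutput = (data: string) => {
        const m = /rid=['"](\d+)['"]/.exec(data);
        if (m) console.log(new Date().toISOString(), "sent rid", m[1]);
    };
    connection.rawInput = (data: string) => {
        const m = /ack=['"](\d+)['"]/.exec(data);
        if (m) console.log(new Date().toISOString(), "got ack", m[1]);
    };
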
It looks like the original exception (whose cause I can’t determine) reset the session’s lastRequestID to the RID that was in flight immediately before the error, after which it was just a matter of time until the server stopped accepting new requests.

It’s particularly unclear to me where the 30-second timeout in the first exception comes from, since my requests use a 60-second timeout. Any ideas on what’s going wrong? Have I done something wrong with the client, or is the server misbehaving?

I got a little more information from the server by turning on debug logs, but it doesn’t seem much more informative than what I had already pieced together.

2021.11.12 08:10:36 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@1a5ae88e for 4030671804 in session 1tar1f498y
2021.11.12 08:11:37 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - timeout event org.eclipse.jetty.server.AsyncContextEvent@1e656f1d for 4030671805 in session 1tar1f498y
2021.11.12 08:11:37 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@1e656f1d for 4030671805 in session 1tar1f498y
2021.11.12 08:12:37 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - timeout event org.eclipse.jetty.server.AsyncContextEvent@5b67e2b5 for 4030671806 in session 1tar1f498y
2021.11.12 08:12:37 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@5b67e2b5 for 4030671806 in session 1tar1f498y
2021.11.12 08:13:38 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - timeout event org.eclipse.jetty.server.AsyncContextEvent@27a53bd8 for 4030671807 in session 1tar1f498y
2021.11.12 08:13:38 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@27a53bd8 for 4030671807 in session 1tar1f498y
2021.11.12 08:14:35 DEBUG [TaskEngine-pool-691]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Pinged occupant '1@nakamura.taricorp.net/1tar1f498y' of room 'sax@conference.nakamura.taricorp.net' due to exceeding idle time limit.
2021.11.12 08:14:35 DEBUG [Jetty-QTP-BOSH-8170]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@6a76b93a for 4030671808 in session 1tar1f498y
2021.11.12 08:15:36 DEBUG [Jetty-QTP-BOSH-8154]: org.jivesoftware.openfire.http.HttpSession - timeout event org.eclipse.jetty.server.AsyncContextEvent@38f310cb for 4030671809 in session 1tar1f498y
2021.11.12 08:15:36 WARN [Jetty-QTP-BOSH-8154]: org.eclipse.jetty.server.HttpChannelState - java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30009/30000 ms while invoking onTimeout listener org.jivesoftware.openfire.http.HttpSession$3@656d26d2
2021.11.12 08:15:36 WARN [Jetty-QTP-BOSH-8154]: org.eclipse.jetty.util.thread.QueuedThreadPool -
java.lang.IllegalStateException: ABORTED
at org.eclipse.jetty.server.HttpChannelState.sendError(HttpChannelState.java:896) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannelState.nextAction(HttpChannelState.java:501) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannelState.unhandle(HttpChannelState.java:420) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:555) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:340) ~[jetty-server-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773) ~[jetty-util-9.4.35.v20201120.jar:9.4.35.v20201120]
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905) [jetty-util-9.4.35.v20201120.jar:9.4.35.v20201120]
at java.lang.Thread.run(Thread.java:834) [?:?]
2021.11.12 08:16:37 DEBUG [Jetty-QTP-BOSH-8154]: org.jivesoftware.openfire.http.HttpSession - timeout event org.eclipse.jetty.server.AsyncContextEvent@789bfdf6 for 4030671810 in session 1tar1f498y
2021.11.12 08:16:37 WARN [Jetty-QTP-BOSH-8154]: org.eclipse.jetty.server.HttpChannelState - java.io.IOException: Unexpected RID error. while invoking onTimeout listener org.jivesoftware.openfire.http.HttpSession$3@1c68729f
2021.11.12 08:16:37 DEBUG [Jetty-QTP-BOSH-8154]: org.jivesoftware.openfire.http.HttpSession - complete event org.eclipse.jetty.server.AsyncContextEvent@789bfdf6 for 4030671810 in session 1tar1f498y
2021.11.12 08:16:38 WARN [Jetty-QTP-BOSH-8209]: org.jivesoftware.openfire.http.HttpSession - Request 4030671811 > 4030671810, ending session 1tar1f498y
2021.11.12 08:16:38 DEBUG [Jetty-QTP-BOSH-8209]: org.jivesoftware.openfire.http.HttpBindServlet - Closing session due to error: itemNotFound. Affected session: HttpSession{address=1@nakamura.taricorp.net/1tar1f498y, streamID=1tar1f498y, status=3 (authenticated), isSecure=false, isDetached=false, serverName='nakamura.taricorp.net', isInitialized=false, hasAuthToken=true, peer address='2001:44b8:3196:1:0:0:0:7f43', presence='
<presence type="unavailable"/>', hold='1', wait='60', maxRequests='2', maxPause='300', lastActivity='1636665397462', lastAcknowledged='4030671809', inactivityTimeout='30', openConnectionCount='0'} 

Seeing the timeout/completion pairs does seem useful, though: normally a timeout event fires and a completion event immediately follows, but that doesn’t happen for RID 4030671809. My best guess is that the next request is already holding a lock somewhere that the timeout handler also needs to acquire, so the timeout handler itself times out.

If this theory is correct, reducing the wait time for each request to less than 30 seconds (which appears to be the timeout applied to the timeout handler) might work around the problem, as sketched below, but I still believe there is an underlying concurrency bug.

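Concretely, that workaround would look like the following; I’m assuming Strophe’s fourth connect() argument (wait) is passed through as the BOSH wait attribute, which matches its documentation:

    // Ask the server to hold each request for at most 25 seconds,
    // staying under the 30-second limit that kills the timeout handler.
    connection.connect("user@example.org", "password",
        (status: number) => console.log("status:", status),
        25 /* wait (seconds) */,
        1  /* hold */);
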
The idea about something holding a lock doesn’t seem to hold water, though: the IllegalStateException occurs right after the timeout handler triggers, and it appears to be on the same thread handling the same request (Jetty-QTP-BOSH-8154).

Looking at the exception that I think is the real problem, it appears the connection was aborted, which is what causes Jetty to throw the IllegalStateException:

    // org.eclipse.jetty.server.HttpChannelState.sendError (Jetty 9.4.35):
    // the output state is already ABORTED rather than OPEN, producing the
    // IllegalStateException: ABORTED seen in the log.
    if (_outputState != OutputState.OPEN)
        throw new IllegalStateException(_outputState.toString());

The remaining question is where that Idle timeout expired message comes from, and which request it corresponds to.

I haven’t been able to sort this out, but I’ve decided to work around it by making my client prefer websockets. Clients that don’t support websockets remain affected, but I’m sufficiently stumped that they’ll simply have to live with poor reliability for now.
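
For what it’s worth, the transport preference itself is straightforward, since Strophe selects its transport from the service URL scheme. A sketch (the /ws/ path is Openfire’s default websocket endpoint, if I remember correctly):

    // Prefer the websocket transport where available, falling back to BOSH.
    const WS_URL = "wss://example.org:7443/ws/";
    const BOSH_URL = "https://example.org:7443/http-bind/";

    function makeConnection() {
        const url = typeof WebSocket !== "undefined" ? WS_URL : BOSH_URL;
        return new Strophe.Connection(url);
    }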