powered by Jive Software

Thread stuck in MultiUserChat.enter() forever

Hey,

We discovered a condition under which MultiUserChat.enter() blocks forever, resulting in the MUC for this address being unusable. This is the stack trace we observed (a few hours after the call to createOrJoin):

java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(java.base@11.0.16/Native Method)
 - waiting on <no object reference available>
 at java.lang.Object.wait(java.base@11.0.16/Object.java:328)
 at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:427)
 - waiting to re-lock in wait() <0x0000000717561e20> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)

In our case this probably happened because the PacketReader thread died with an exception and the reflected self presence was never processed. The thread running enter() does a wait(0) until the stanza listener is called. Using a finite timeout and failing if the reflected self presence still hasn’t been processed should solve the problem. If you agree I can work on a PR.

Boris

Hey Boris,

as always, thanks for your report. :slight_smile:

I am not sure if the root cause here could really be a crashed PacketReader thread. Your stacktrace shows that the control flow reaches presenceListener.wait(), which mean that the previous code line

reflectedSelfPresence = selfPresenceCollector.nextResultOrThrow(conf.getTimeout());

was executed without an exception. So the reflected self presence was already received by a collector. In the next step, the code waits for the stanza listener to process the reflected self presence. The collectors and listeners are both handled by the same method (AbstractXMPPConnection.invokeStanzaCollectorsAndNotifyRecvListeners(Stanza). So we know the collector successfully collected the reflected self presence, but for some reason the listener never ran. Which shouldn’t happen, and hence I decided to use an potentially infinite wait() here, assuming that it is never an infinite wait by construction.

But you showed that this is not the case. We could do something like

but I also really want to get a better understanding how this could happen. Because obviously, I am missing some part of the picture. I will try to get behind this, but let me know if you have any theories or ideas.

Hi Flow,

Thanks for looking into this and correcting my interpretation! I think I’m getting closer to a correct explanation, but I’m still not there. I found the following:

  1. I’m pretty sure the exception in the PacketReader initiated the chain of events. It was followed by a re-connect (the XMPP server runs on localhost so it happened very quickly) and the stack trace I originally posted was initiated by the authenticated() being called on the ConnectionListener.

  2. This thread is waiting to lock the same MultiUserChat instance (handling an “unavailble” presence):

"Smack Cached Executor" #87237 daemon prio=5 os_prio=0 cpu=648.79ms elapsed=63818.52s tid=0x00007fa1e400b800 nid=0x35a3c9 waiting for monitor entry  [0x00007fa215a00000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.jivesoftware.smackx.muc.MultiUserChat.userHasLeft(MultiUserChat.java:2197)
    - waiting to lock <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
    at org.jivesoftware.smackx.muc.MultiUserChat.access$900(MultiUserChat.java:113)
    at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:250)
    at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1623)
    at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$109/0x00000008402d3840.run(Unknown Source)
    at org.jivesoftware.smack.AsyncButOrdered$Handler.run(AsyncButOrdered.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

  1. XMPPTCPConnection.inOrderListeners has:
  • An entry in threadActiveMap with the stuck MultiUserChat.presenceListener as key
  • An entry in pendingRunnables for the same key, which has 1 item in its queue, which has a Presence with type=available and statusCode 110.

It looks like handling of the “unavailable” presence happened before handling of the reflected self presence, and now the two threads are deadlocked. But I have no idea how that might have happened (I got excited for a second when I saw in the dump that the keys for AsyncButOrdered are MultiUserChat$1, $2, $3 instead of MultiUserChat itself, but in this case only presenceListener ($3) is involved). Perhaps that’s actually the order in which they were received from the server after the re-connect?

Thanks again, I appreciate your help.

Thanks for sticking with me and going down the rabbit hole to the root cause. I think I have now a good idea how the deadlock happens. I quickly hacked together something which I believe should prevent the deadlock:

Note that this is untested as I am currently short on time.

I don’t understand it fully, but I agree it should prevent the deadlock.

Created SMACK-927

1 Like

Merged into the 4.4 branch. Please test the next snapshot build and report back your findings. Thanks.

Sorry I dropped the ball on this one. I just ran a quick test with smack-core-4.4.7-20221005.095442-1 and I’m getting timeouts like these:

Jicofo 2022-10-20 23:47:30.976 SEVERE: [28] [room=__jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si meeting_id=] ChatRoomImpl.lambda$leave$2#381: Failed to properly leave MUC: __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si(focus@auth.meet.jit.si/focus)
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (full): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si/focus, PresenceTypeFilter: type=unavailable, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
	at org.jivesoftware.smackx.muc.MultiUserChat.leave(MultiUserChat.java:798)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$leave$2(ChatRoomImpl.java:372)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Jicofo 2022-10-20 23:47:31.704 WARNING: [35] FocusManager.conferenceRequest#117: Exception while trying to start the conference
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:408)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.joinAs(ChatRoomImpl.java:293)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.join(ChatRoomImpl.java:224)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.joinTheRoom(JitsiMeetConferenceImpl.java:493)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.start(JitsiMeetConferenceImpl.java:330)
	at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.kt:114)
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:163)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)
Jicofo 2022-10-20 23:47:31.704 SEVERE: [35] HealthChecker.run#175: Health check failed in PT15.00206S:
java.lang.RuntimeException: Failed to create conference with room name __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si:No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:176)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)

I’m not sure what introduced this, we’re currently using 4.4.6 without problems. I’ll dig more into it and report back.