Thread stuck in MultiUserChat.enter() forever

Hey,

We discovered a condition under which MultiUserChat.enter() blocks forever, resulting in the MUC for this address being unusable. This is the stack trace we observed (a few hours after the call to createOrJoin):

java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(java.base@11.0.16/Native Method)
 - waiting on <no object reference available>
 at java.lang.Object.wait(java.base@11.0.16/Object.java:328)
 at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:427)
 - waiting to re-lock in wait() <0x0000000717561e20> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)

In our case this probably happened because the PacketReader thread died with an exception and the reflected self presence was never processed. The thread running enter() does a wait(0) until the stanza listener is called. Using a finite timeout and failing if the reflected self presence still hasn’t been processed should solve the problem. If you agree I can work on a PR.
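Something like the following guarded wait with a deadline is what I have in mind. This is only a sketch; the class and method names are hypothetical, not the actual MultiUserChat fields:

```java
// Sketch of a guarded wait with a deadline: fail instead of blocking forever.
// 'BoundedWait', 'signalDone' and 'awaitDone' are illustrative names only.
public class BoundedWait {
    private final Object monitor = new Object();
    private boolean done = false;

    /** Called by the stanza listener once the reflected self presence is processed. */
    public void signalDone() {
        synchronized (monitor) {
            done = true;
            monitor.notifyAll();
        }
    }

    /**
     * Waits up to timeoutMillis for signalDone(); returns false on timeout
     * instead of hanging the way wait(0) does. The loop guards against
     * spurious wakeups.
     */
    public boolean awaitDone(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (monitor) {
            long remaining = timeoutMillis;
            while (!done && remaining > 0) {
                monitor.wait(remaining);
                remaining = deadline - System.currentTimeMillis();
            }
            return done;
        }
    }
}
```

With wait(0), a lost notification means waiting forever; with a deadline, the caller gets the chance to throw something like NoResponseException instead.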

Boris

Hey Boris,

as always, thanks for your report. :slight_smile:

I am not sure if the root cause here could really be a crashed PacketReader thread. Your stack trace shows that the control flow reaches presenceListener.wait(), which means that the previous code line

reflectedSelfPresence = selfPresenceCollector.nextResultOrThrow(conf.getTimeout());

was executed without an exception. So the reflected self presence was already received by a collector. In the next step, the code waits for the stanza listener to process the reflected self presence. The collectors and listeners are both handled by the same method (AbstractXMPPConnection.invokeStanzaCollectorsAndNotifyRecvListeners(Stanza)). So we know the collector successfully collected the reflected self presence, but for some reason the listener never ran. Which shouldn’t happen, and hence I decided to use a potentially infinite wait() here, assuming that it is never an infinite wait by construction.

But you showed that this is not the case. We could do something like

but I also really want to get a better understanding how this could happen. Because obviously, I am missing some part of the picture. I will try to get behind this, but let me know if you have any theories or ideas.

Hi Flow,

Thanks for looking into this and correcting my interpretation! I think I’m getting closer to a correct explanation, but I’m still not there. I found the following:

  1. I’m pretty sure the exception in the PacketReader initiated the chain of events. It was followed by a re-connect (the XMPP server runs on localhost, so it happened very quickly), and the stack trace I originally posted was initiated by authenticated() being called on the ConnectionListener.

  2. This thread is waiting to lock the same MultiUserChat instance (handling an “unavailable” presence):

"Smack Cached Executor" #87237 daemon prio=5 os_prio=0 cpu=648.79ms elapsed=63818.52s tid=0x00007fa1e400b800 nid=0x35a3c9 waiting for monitor entry  [0x00007fa215a00000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.jivesoftware.smackx.muc.MultiUserChat.userHasLeft(MultiUserChat.java:2197)
    - waiting to lock <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
    at org.jivesoftware.smackx.muc.MultiUserChat.access$900(MultiUserChat.java:113)
    at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:250)
    at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1623)
    at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$109/0x00000008402d3840.run(Unknown Source)
    at org.jivesoftware.smack.AsyncButOrdered$Handler.run(AsyncButOrdered.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

  3. XMPPTCPConnection.inOrderListeners has:
  • An entry in threadActiveMap with the stuck MultiUserChat.presenceListener as key
  • An entry in pendingRunnables for the same key, whose queue contains a single item: a Presence with type=available and status code 110.

It looks like handling of the “unavailable” presence happened before handling of the reflected self presence, and now the two threads are deadlocked. But I have no idea how that might have happened (I got excited for a second when I saw in the dump that the keys for AsyncButOrdered are MultiUserChat$1, $2, $3 instead of MultiUserChat itself, but in this case only presenceListener ($3) is involved). Perhaps that’s actually the order in which they were received from the server after the re-connect?
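For what it’s worth, the shape of the deadlock can be reproduced in isolation with two monitors and a single ordered worker. Everything below is illustrative (none of these names are Smack’s), and a short timeout is used so the demo terminates instead of hanging:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DeadlockShape {
    /** Returns true if the waiter was signalled, false if it timed out. */
    static boolean run(long waitMillis) throws InterruptedException {
        final Object muc = new Object();       // plays the MultiUserChat monitor
        final Object listener = new Object();  // plays MultiUserChat$3 (presenceListener)
        final boolean[] signalled = { false };
        ExecutorService orderedWorker = Executors.newSingleThreadExecutor();
        try {
            synchronized (muc) {               // like createOrJoin(): MUC lock is held
                // Task 1: the "unavailable" presence; userHasLeft() needs the MUC lock.
                orderedWorker.execute(() -> {
                    synchronized (muc) { /* blocked until the waiter gives up */ }
                });
                // Task 2: the reflected self presence; it would notify the waiter,
                // but it is queued behind task 1 on the same single worker.
                orderedWorker.execute(() -> {
                    synchronized (listener) {
                        signalled[0] = true;
                        listener.notifyAll();
                    }
                });
                synchronized (listener) {
                    listener.wait(waitMillis); // the real code used wait(0): forever
                }
                return signalled[0];
            }
        } finally {
            orderedWorker.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(500) ? "signalled" : "timed out");
    }
}
```

The waiter holds the MUC lock while waiting, task 1 needs that lock, and task 2 sits behind task 1 in the same ordered queue; with wait(0) nothing ever breaks the cycle.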

Thanks again, I appreciate your help.

Thanks for sticking with me and going down the rabbit hole to the root cause. I think I now have a good idea of how the deadlock happens. I quickly hacked together something which I believe should prevent the deadlock:

Note that this is untested as I am currently short on time.

I don’t understand it fully, but I agree it should prevent the deadlock.

Created SMACK-927

Merged into the 4.4 branch. Please test the next snapshot build and report back your findings. Thanks.

Sorry I dropped the ball on this one. I just ran a quick test with smack-core-4.4.7-20221005.095442-1 and I’m getting timeouts like these:

Jicofo 2022-10-20 23:47:30.976 SEVERE: [28] [room=__jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si meeting_id=] ChatRoomImpl.lambda$leave$2#381: Failed to properly leave MUC: __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si(focus@auth.meet.jit.si/focus)
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (full): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si/focus, PresenceTypeFilter: type=unavailable, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
	at org.jivesoftware.smackx.muc.MultiUserChat.leave(MultiUserChat.java:798)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$leave$2(ChatRoomImpl.java:372)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Jicofo 2022-10-20 23:47:31.704 WARNING: [35] FocusManager.conferenceRequest#117: Exception while trying to start the conference
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:408)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.joinAs(ChatRoomImpl.java:293)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.join(ChatRoomImpl.java:224)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.joinTheRoom(JitsiMeetConferenceImpl.java:493)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.start(JitsiMeetConferenceImpl.java:330)
	at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.kt:114)
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:163)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)
Jicofo 2022-10-20 23:47:31.704 SEVERE: [35] HealthChecker.run#175: Health check failed in PT15.00206S:
java.lang.RuntimeException: Failed to create conference with room name __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si:No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:176)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)

I’m not sure what introduced this; we’re currently using 4.4.6 without problems. I’ll dig more into it and report back.

@Boris_Grozev I was not yet able to reproduce this. Smack currently has a few pending commits in its ‘4.4’ branch, and I wonder if there is a regression or if it could be released as 4.4.7. Let me know if I can help track down the root cause of the NoResponseExceptions you are seeing.

I’m sorry for the lack of response here.

I am able to reproduce the problem when I run jicofo with Smack 4.4.7-20221005.095442-1, but I have had no success isolating it in an independent test. I have a set of threads doing a join/leave in a MUC and another set of threads sending a ping and waiting for a response (I’m attaching the test in case it’s useful somehow).

I suspect it’s something specific to jicofo’s thread model. I’m attaching two thread dumps. The first was captured a few seconds after timeouts were logged. The second one was captured while a thread was waiting for a response which eventually timed out. Let me know if you notice anything useful.

As it stands we wouldn’t be able to update to Smack 4.4.7 if you release it, but I don’t know if the problem is in Jicofo or Smack. Let us know if you have any suggestions on how to debug further.

Note that I’ll be travelling and unable to respond for the next 2 weeks. I promise to take a look when I’m back.

Regards,
Boris
thread-dump-1.txt (42.9 KB)
thread-dump-2.txt (42.0 KB)

No worries. Just to confirm: Smack regresses from 4.4.6 to 4.4.7-SNAPSHOT for you. That is, you don’t see this issue with Smack 4.4.6. Is that correct?

@Flow yes, that is correct.

Alright, here is a first analysis. I think most of this is explained by the stack trace of the following thread from thread-dump-1.txt:

"Smack Cached Executor" #40 daemon prio=5 os_prio=31 cpu=26.85ms elapsed=174.68s tid=0x0000000127089000 nid=0xb803 in Object.wait()  [0x0000000175a59000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.18/Native Method)
	- waiting on <no object reference available>
	at org.jivesoftware.smack.StanzaCollector.nextResult(StanzaCollector.java:206)
	- waiting to re-lock in wait() <0x00000006d15cbbf0> (a org.jivesoftware.smack.StanzaCollector)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:270)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
	at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:606)
	at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:578)
	at org.jitsi.jicofo.xmpp.XmppProvider.discoverFeatures(XmppProvider.kt:238)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:217)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:216)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	- locked <0x00000006d1b83c28> (a kotlin.SynchronizedLazyImpl)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl.getFeatures(ChatRoomMemberImpl.kt:216)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.inviteChatMember(JitsiMeetConferenceImpl.java:697)
	- locked <0x00000006d21a74d0> (a java.lang.Object)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.onMemberJoined(JitsiMeetConferenceImpl.java:657)
	- locked <0x00000006d21a74d0> (a java.lang.Object)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl$ChatRoomListenerImpl.memberJoined(JitsiMeetConferenceImpl.java:1970)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$processOtherPresence$12(ChatRoomImpl.java:856)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl$$Lambda$366/0x0000000800585c40.invoke(Unknown Source)
	at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
	at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
	at org.jitsi.utils.event.BaseEventEmitter.wrap(EventEmitter.kt:49)
	at org.jitsi.utils.event.SyncEventEmitter.fireEvent(EventEmitter.kt:64)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processOtherPresence(ChatRoomImpl.java:855)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processPresence(ChatRoomImpl.java:909)
	at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:309)
	at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1619)
	at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$360/0x0000000800587440.run(Unknown Source)
	at org.jivesoftware.smack.AbstractXMPPConnection$10.run(AbstractXMPPConnection.java:2149)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.18/Thread.java:829)

This thread starts essentially via MultiUserChat.presenceListener. This is not an asynchronous connection listener, so it must return before the following incoming stanzas are processed [1]. However, further up in the call stack, JitsiMeetConferenceImpl.inviteChatMember performs chatRoomMember.getFeatures(), which triggers a service discovery lookup via the ServiceDiscoveryManager (SDM). This lookup uses a StanzaCollector to wait for the result stanza. Unfortunately, waiting for this is in vain: for the StanzaCollector to be notified about the result stanza, this presence listener first has to return.
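The cycle in miniature: a task on a single ordered pipeline blocks on a result that only the same pipeline can deliver. A self-contained sketch with illustrative names (the real code has no timeout on this path, hence the indefinite wait):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SelfWait {
    /** Demonstrates a listener waiting on a result the same pipeline must deliver. */
    static String demo() throws Exception {
        ExecutorService pipeline = Executors.newSingleThreadExecutor();
        try {
            Future<String> outer = pipeline.submit(() -> {
                // Inside the "presence listener": the "disco#info result" is
                // scheduled on the same single-lane pipeline, then we block
                // on it -- it can never run until this task returns.
                Future<String> result = pipeline.submit(() -> "disco#info result");
                try {
                    return result.get(500, TimeUnit.MILLISECONDS); // real code: no timeout
                } catch (TimeoutException e) {
                    return "timed out";
                }
            });
            return outer.get();
        } finally {
            pipeline.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // -> timed out
    }
}
```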

This is a classic deadlock and a general problem with non-asynchronous stanza listeners. However, I believe you want non-asynchronous listeners here, since the order of events appears relevant.

What you could probably do to fix this is to decouple the blocking operation (the service lookup here) from the presence listener processing, allowing the presence listener to return while the potentially blocking operation is still in progress. In fact, this was already done before in ChatRoomImpl, which contains code that decouples an operation and makes it asynchronous, for example TaskPools.getIoPool().execute(() -> … in ChatRoomImpl.leave().
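Roughly like this, with made-up names throughout; a plain ExecutorService stands in for TaskPools.getIoPool():

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DecoupledListener {
    private final ExecutorService ioPool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "io-pool");
        t.setDaemon(true);
        return t;
    });
    private final BlockingQueue<String> invited = new LinkedBlockingQueue<>();

    /** Called from the synchronous presence listener; returns immediately. */
    public void memberJoined(String member) {
        ioPool.execute(() -> {
            String features = discoverFeatures(member); // potentially blocking lookup
            invited.add(member + ":" + features);
        });
    }

    /** Stands in for the disco#info round trip that getFeatures() performs. */
    private String discoverFeatures(String member) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "audio,video";
    }

    public String awaitInvite() throws InterruptedException {
        return invited.poll(1, TimeUnit.SECONDS);
    }
}
```

The listener hands the blocking work off and returns, so the stanza pipeline keeps flowing.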

But keep in mind that such a construct no longer guarantees the ordering of events. For example, if you were to do so for presence processing, consider two incoming presences from user A that declare A first as ‘available’ and then as ‘away’. It could be possible that the ‘away’ presence is processed first, even though it was received later, and only then the ‘available’ presence is processed. This could result in the user showing as ‘available’ when they are in fact ‘away’.

One solution for this is to decouple the operation from the presence listener, but retain the order via a queue and a single consumer that processes elements (actions, events, …) from the queue. However, if the queue is bounded, then you only reduce the risk of such a deadlock, but do not prevent it by construction. Using an unbounded queue would prevent this deadlock by construction, but could potentially cause unbounded memory usage, leading to OOM situations.
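A minimal sketch of the unbounded-queue variant (illustrative names; compare the AsyncButOrdered frames in the stack traces above, which apply this kind of pattern per key):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class OrderedEventPump {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(); // unbounded
    private final Thread consumer;

    public OrderedEventPump() {
        consumer = new Thread(() -> {
            try {
                while (true) {
                    queue.take().run(); // single consumer => submission order preserved
                }
            } catch (InterruptedException e) {
                // consumer shut down
            }
        }, "ordered-event-pump");
        consumer.setDaemon(true);
        consumer.start();
    }

    /** Never blocks the caller; events run later, in the order submitted. */
    public void submit(Runnable event) {
        queue.add(event);
    }
}
```

The submitter (e.g. the stanza reader) can never block here, which removes the deadlock, at the cost of the unbounded memory growth mentioned above.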

So what is it for you, the red or the blue pill? :slight_smile:

Jokes aside, I have to give this some more thought. I believe the fix currently in the 4.4 branch is fundamentally not wrong, but I also will not rule out the possibility of reverting it and finding a different, maybe even better, solution.

[1]: This is a new condition which was added due to SMACK-927 in [core] Replace AbstractXMPPConnection.inOrderListeners · igniterealtime/Smack@92f253c · GitHub to fix the initial issue of this thread.