Thread stuck in MultiUserChat.enter() forever

Boris_Grozev · September 23, 2022, 9:46pm

Hey,

We discovered a condition under which MultiUserChat.enter() blocks forever, resulting in the MUC for this address being unusable. This is the stack trace we observed (a few hours after the call to createOrJoin):

java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(java.base@11.0.16/Native Method)
 - waiting on <no object reference available>
 at java.lang.Object.wait(java.base@11.0.16/Object.java:328)
 at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:427)
 - waiting to re-lock in wait() <0x0000000717561e20> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
 at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
 - locked <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)

In our case this probably happened because the PacketReader thread died with an exception and the reflected self presence was never processed. The thread running enter() does a wait(0) until the stanza listener is called. Using a finite timeout and failing if the reflected self presence still hasn’t been processed should solve the problem. If you agree I can work on a PR.

Boris

Flow · September 24, 2022, 9:59am

Hey Boris,

as always, thanks for your report.

I am not sure if the root cause here could really be a crashed PacketReader thread. Your stacktrace shows that the control flow reaches presenceListener.wait(), which mean that the previous code line

reflectedSelfPresence = selfPresenceCollector.nextResultOrThrow(conf.getTimeout());

was executed without an exception. So the reflected self presence was already received by a collector. In the next step, the code waits for the stanza listener to process the reflected self presence. The collectors and listeners are both handled by the same method (AbstractXMPPConnection.invokeStanzaCollectorsAndNotifyRecvListeners(Stanza). So we know the collector successfully collected the reflected self presence, but for some reason the listener never ran. Which shouldn’t happen, and hence I decided to use an potentially infinite wait() here, assuming that it is never an infinite wait by construction.

But you showed that this is not the case. We could do something like

github.com/igniterealtime/Smack

DRAFT [muc] do not wait infinite for the processing of the reflected self-p…

igniterealtime:master ← Flowdalic:muc-reflected-self-presence-finite-wait

opened 09:58AM - 24 Sep 22 UTC

Flowdalic

+17 -10

…resence In theory, using an infinite wait() here should be fine. Because if …the collector in the lines before receives the presence, then it should by guranteed by construction that the listener will eventually also process it. However, Boris reported a case where the wait() would hang forever. The actual root cause is unknown. This change would fix the infinite wait(), but could theoretically cause errornous NoResponseExceptions (e.g., if the system is under heavy load and threads compete for the CPU).

but I also really want to get a better understanding how this could happen. Because obviously, I am missing some part of the picture. I will try to get behind this, but let me know if you have any theories or ideas.

Boris_Grozev · September 27, 2022, 4:33pm

Hi Flow,

Thanks for looking into this and correcting my interpretation! I think I’m getting closer to a correct explanation, but I’m still not there. I found the following:

I’m pretty sure the exception in the PacketReader initiated the chain of events. It was followed by a re-connect (the XMPP server runs on localhost so it happened very quickly) and the stack trace I originally posted was initiated by the authenticated() being called on the ConnectionListener.
This thread is waiting to lock the same MultiUserChat instance (handling an “unavailble” presence):

"Smack Cached Executor" #87237 daemon prio=5 os_prio=0 cpu=648.79ms elapsed=63818.52s tid=0x00007fa1e400b800 nid=0x35a3c9 waiting for monitor entry  [0x00007fa215a00000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.jivesoftware.smackx.muc.MultiUserChat.userHasLeft(MultiUserChat.java:2197)
    - waiting to lock <0x000000071755f510> (a org.jivesoftware.smackx.muc.MultiUserChat)
    at org.jivesoftware.smackx.muc.MultiUserChat.access$900(MultiUserChat.java:113)
    at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:250)
    at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1623)
    at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$109/0x00000008402d3840.run(Unknown Source)
    at org.jivesoftware.smack.AsyncButOrdered$Handler.run(AsyncButOrdered.java:151)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

XMPPTCPConnection.inOrderListeners has:

An entry in threadActiveMap with the stuck MultiUserChat.presenceListener as key
An entry in pendingRunnables for the same key, which has 1 item in its queue, which has a Presence with type=available and statusCode 110.

It looks like handling of the “unavailable” presence happened before handling of the reflected self presence, and now the two threads are deadlocked. But I have no idea how that might have happened (I got excited for a second when I saw in the dump that the keys for AsyncButOrdered are MultiUserChat$1, $2, $3 instead of MultiUserChat itself, but in this case only presenceListener ($3) is involved). Perhaps that’s actually the order in which they were received from the server after the re-connect?

Thanks again, I appreciate your help.

Flow · September 28, 2022, 11:58am

Thanks for sticking with me and going down the rabbit hole to the root cause. I think I have now a good idea how the deadlock happens. I quickly hacked together something which I believe should prevent the deadlock:

Note that this is untested as I am currently short on time.

Boris_Grozev · September 30, 2022, 5:30pm

I don’t understand it fully, but I agree it should prevent the deadlock.

Flow · October 2, 2022, 9:34am

Created SMACK-927

Flow · October 5, 2022, 9:56am

Merged into the 4.4 branch. Please test the next snapshot build and report back your findings. Thanks.

Boris_Grozev · October 20, 2022, 11:51pm

Sorry I dropped the ball on this one. I just ran a quick test with smack-core-4.4.7-20221005.095442-1 and I’m getting timeouts like these:

Jicofo 2022-10-20 23:47:30.976 SEVERE: [28] [room=__jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si meeting_id=] ChatRoomImpl.lambda$leave$2#381: Failed to properly leave MUC: __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si(focus@auth.meet.jit.si/focus)
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (full): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si/focus, PresenceTypeFilter: type=unavailable, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-1b2ac32b8c6603e1@conference.meet.jit.si, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
	at org.jivesoftware.smackx.muc.MultiUserChat.leave(MultiUserChat.java:798)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$leave$2(ChatRoomImpl.java:372)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Jicofo 2022-10-20 23:47:31.704 WARNING: [35] FocusManager.conferenceRequest#117: Exception while trying to start the conference
org.jivesoftware.smack.SmackException$NoResponseException: No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:124)
	at org.jivesoftware.smack.SmackException$NoResponseException.newWith(SmackException.java:104)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:281)
	at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:408)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:546)
	at org.jivesoftware.smackx.muc.MultiUserChat.createOrJoin(MultiUserChat.java:521)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.joinAs(ChatRoomImpl.java:293)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.join(ChatRoomImpl.java:224)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.joinTheRoom(JitsiMeetConferenceImpl.java:493)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.start(JitsiMeetConferenceImpl.java:330)
	at org.jitsi.jicofo.FocusManager.conferenceRequest(FocusManager.kt:114)
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:163)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)
Jicofo 2022-10-20 23:47:31.704 SEVERE: [35] HealthChecker.run#175: Health check failed in PT15.00206S:
java.lang.RuntimeException: Failed to create conference with room name __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si:No response received within reply timeout. Timeout was 15000ms (~15s). StanzaCollector has been cancelled. Waited for response using: AndFilter: (StanzaTypeFilter: Presence, OrFilter: (AndFilter: (FromMatchesFilter (ignoreResourcepart): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si, MUCUserStatusCodeFilter: status=110), AndFilter: (FromMatchesFilter (full): __jicofo-health-check-d40c0c62e401ceb6@conference.meet.jit.si/focus, StanzaIdFilter: id=RALYD-14, PresenceTypeFilter: type=error))).
	at org.jitsi.jicofo.health.JicofoHealthChecker.check(JicofoHealthChecker.java:176)
	at org.jitsi.jicofo.health.JicofoHealthChecker.performCheck(JicofoHealthChecker.java:112)
	at org.jitsi.health.HealthChecker.run(HealthChecker.kt:144)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.run(RecurringRunnableExecutor.java:216)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor.runInThread(RecurringRunnableExecutor.java:292)
	at org.jitsi.utils.concurrent.RecurringRunnableExecutor$1.run(RecurringRunnableExecutor.java:328)

I’m not sure what introduced this, we’re currently using 4.4.6 without problems. I’ll dig more into it and report back.

system · January 28, 2023, 11:51pm

This topic was automatically closed 100 days after the last reply. New replies are no longer allowed.

Flow · February 12, 2023, 12:50pm

@Boris_Grozev I was not yet able to reproduce this. Smack currently has a few pending commits in its ‘4.4’ branch and I wonder if there is an regression or if it could be released as 4.4.7. Let me know if I can help tracking down the root cause of the NoResponseExceptions you are seeing.

Boris_Grozev · February 22, 2023, 9:23pm

I’m sorry for the lack of response here.

I am able to reproduce the problem when I run jicofo with Smack 4.4.7-20221005.095442-1, but I have no success isolating it in an independent test. I have a set of threads doing a join/leave in a MUC and another set of threads sending a ping and waiting for a response (I’m attaching the test in case it’s useful somehow).

I suspect it’s something specific to jicofo’s thread model. I’m attaching two thread dumps. The first was was captured a few seconds after timeouts were logged. The second one was captured while a thread was waiting for a response which eventually timed out. Let me know if you notice anything useful.

As it stands we wouldn’t be able to update to Smack 4.4.7 if you release it, but I don’t know if the problem is in Jicofo or Smack. Let us know if you have any suggestions on how to debug further.

Note that I’ll be travelling and unable to respond for the next 2 weeks. I promise to take a look when I’m back.

Regards,
Boris
thread-dump-1.txt (42.9 KB)
thread-dump-2.txt (42.0 KB)

Flow · February 23, 2023, 9:01am

No worries. Just to confirm: Smack regresses from 4.4.6 to 4.4.7-SNAPSHOT for you. That is, you don’t see this issue with Smack 4.4.6. Is that correct?

damencho · February 23, 2023, 1:54pm

@Flow yes, that is correct.

Flow · March 8, 2023, 12:07pm

Alright, here is a first analysis. I think most of this is explained with a single stacktrace of the following thread from thread-dump-1.txt:

"Smack Cached Executor" #40 daemon prio=5 os_prio=31 cpu=26.85ms elapsed=174.68s tid=0x0000000127089000 nid=0xb803 in Object.wait()  [0x0000000175a59000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.18/Native Method)
	- waiting on <no object reference available>
	at org.jivesoftware.smack.StanzaCollector.nextResult(StanzaCollector.java:206)
	- waiting to re-lock in wait() <0x00000006d15cbbf0> (a org.jivesoftware.smack.StanzaCollector)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:270)
	at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
	at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:606)
	at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:578)
	at org.jitsi.jicofo.xmpp.XmppProvider.discoverFeatures(XmppProvider.kt:238)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:217)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:216)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	- locked <0x00000006d1b83c28> (a kotlin.SynchronizedLazyImpl)
	at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl.getFeatures(ChatRoomMemberImpl.kt:216)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.inviteChatMember(JitsiMeetConferenceImpl.java:697)
	- locked <0x00000006d21a74d0> (a java.lang.Object)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.onMemberJoined(JitsiMeetConferenceImpl.java:657)
	- locked <0x00000006d21a74d0> (a java.lang.Object)
	at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl$ChatRoomListenerImpl.memberJoined(JitsiMeetConferenceImpl.java:1970)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$processOtherPresence$12(ChatRoomImpl.java:856)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl$$Lambda$366/0x0000000800585c40.invoke(Unknown Source)
	at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
	at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
	at org.jitsi.utils.event.BaseEventEmitter.wrap(EventEmitter.kt:49)
	at org.jitsi.utils.event.SyncEventEmitter.fireEvent(EventEmitter.kt:64)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processOtherPresence(ChatRoomImpl.java:855)
	at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processPresence(ChatRoomImpl.java:909)
	at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:309)
	at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1619)
	at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$360/0x0000000800587440.run(Unknown Source)
	at org.jivesoftware.smack.AbstractXMPPConnection$10.run(AbstractXMPPConnection.java:2149)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.18/Thread.java:829)

This thread starts essentially via MultiUserChat.presenceListener. This is a not an asynchronous connection listener, so it must return before following incoming stanzas are processed ¹. However, further up in the call stack JitsiMeetConfereceImpl.inviteChatMember performs chatRoomMember.getFeatures() which triggers a service discovery lookup via the ServiceDiscoveryManager (SDM). This lookup uses a StanzaCollector to wait for the result stanza. Unfortunately, waiting for this is in vain, because the StanzaCollector is never notified about the result stanza, because for the StanzaCollectors to be able to get notified, this presence listener has to return.

This is a classical deadlock and a general problem with non-asynchronous stanza listeners. However, I believe you want non-asynchronous listeners here since the order of events appears relevant.

What you could probably do to fix this, is to decouple the blocking operation, the service lookup here, from the presence listener processing, allowing the presence listener to return, while the potentially blocking operation is still in progress. In fact, this was already done before in ChatRoomImpl, which contains code that decouples an operation and makes it asynchronous. For example TaskPools.getIoPool().execute(() -> … in ChatRoomImpl.leave().

But keep in mind that such a construct does no longer guarantee the ordering of events. For example, if you where to do so for presence processing, then, consider two incoming presences from user A that declare A as ‘available’ and then following as ‘away’. It could be possible that the ‘away’ presence is processed first, even though it was received later, and only then the ‘available’ presence is processed. This could result in the user showing as ‘available’ when they are in fact ‘away’.

One solution for this is to decouple the operation from the presence listener, but retain the order via an queue and a single consumer that processes elements (actions, events, …) from the queue. However, if the queue is bounded, then you just reduce the risk of such a deadlock, but prevent it by construction. Using an unbounded queue would prevent this deadlock by construction, but could potentially cause unbounded memory usage, leading to OOM situations.

So what is it for you, the red or the blue pill?

Jokes aside, I have to give this some more thought. I believe the fix currently in the 4.4 branch is fundamentally not wrong, but I also will not rule out the possiblity of reverting it and finding a different, maybe even better, solution.

1: This is a new condition which was added due to SMACK-927 in [core] Replace AbstractXMPPConnection.inOrderListeners · igniterealtime/Smack@92f253c · GitHub to fix initial issue of this thread.

Boris_Grozev · April 19, 2023, 7:08pm

Hi Flow,

Thanks for the analysis, spot-on as usual. We moved the feature discovery to a separate IO thread[0] and it fixed that problem. But upon more more we found other similar patterns (we have a sync IQ listener which joins MUC while handling an IQ). I think we can solve them in a similar way, so there’s no need to hold 4.4.7 for us. But you may want to consider other applications using Smack in similar ways.

On a separate topic, Damian noticed a race condition that breaks the XEP-0115 caching and causes unnecessary disco#info. EntityCapsManager populates JID_TO_NODEVER_CACHE in an async stanza listener[1]. Sometimes the code that handles MUC presence runs before the cache has been updated, leading to ServiceDiscovertManager.discoverInfo being unable to resolve locally.

[0] Fix feature discovery blocking a smack thread. by bgrozev · Pull Request #1078 · jitsi/jicofo · GitHub
[1] Smack/EntityCapsManager.java at master · igniterealtime/Smack · GitHub

Flow · April 21, 2023, 10:30am

Thanks for the reply and the kind words.

Yes, that is an issue with the upcoming 4.4.7 release. While it will remain (mostly) API compatible the behavior change has a potential for deadlocks. But there is no win-win option here. At least, I don’t see one. So the best I can do I to add a big warning to the release notes.

Thanks, I’ll have a look at it soon.

Flow · November 24, 2023, 11:13am

Thanks, created SMACK-937.