Alright, here is a first analysis. I think most of this is explained with a single stack trace of the following thread from thread-dump-1.txt:
"Smack Cached Executor" #40 daemon prio=5 os_prio=31 cpu=26.85ms elapsed=174.68s tid=0x0000000127089000 nid=0xb803 in Object.wait() [0x0000000175a59000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.18/Native Method)
- waiting on <no object reference available>
at org.jivesoftware.smack.StanzaCollector.nextResult(StanzaCollector.java:206)
- waiting to re-lock in wait() <0x00000006d15cbbf0> (a org.jivesoftware.smack.StanzaCollector)
at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:270)
at org.jivesoftware.smack.StanzaCollector.nextResultOrThrow(StanzaCollector.java:228)
at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:606)
at org.jivesoftware.smackx.disco.ServiceDiscoveryManager.discoverInfo(ServiceDiscoveryManager.java:578)
at org.jitsi.jicofo.xmpp.XmppProvider.discoverFeatures(XmppProvider.kt:238)
at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:217)
at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl$features$2.invoke(ChatRoomMemberImpl.kt:216)
at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
- locked <0x00000006d1b83c28> (a kotlin.SynchronizedLazyImpl)
at org.jitsi.jicofo.xmpp.muc.ChatRoomMemberImpl.getFeatures(ChatRoomMemberImpl.kt:216)
at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.inviteChatMember(JitsiMeetConferenceImpl.java:697)
- locked <0x00000006d21a74d0> (a java.lang.Object)
at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl.onMemberJoined(JitsiMeetConferenceImpl.java:657)
- locked <0x00000006d21a74d0> (a java.lang.Object)
at org.jitsi.jicofo.conference.JitsiMeetConferenceImpl$ChatRoomListenerImpl.memberJoined(JitsiMeetConferenceImpl.java:1970)
at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.lambda$processOtherPresence$12(ChatRoomImpl.java:856)
at org.jitsi.impl.protocol.xmpp.ChatRoomImpl$$Lambda$366/0x0000000800585c40.invoke(Unknown Source)
at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
at org.jitsi.utils.event.SyncEventEmitter$fireEvent$1$1.invoke(EventEmitter.kt:64)
at org.jitsi.utils.event.BaseEventEmitter.wrap(EventEmitter.kt:49)
at org.jitsi.utils.event.SyncEventEmitter.fireEvent(EventEmitter.kt:64)
at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processOtherPresence(ChatRoomImpl.java:855)
at org.jitsi.impl.protocol.xmpp.ChatRoomImpl.processPresence(ChatRoomImpl.java:909)
at org.jivesoftware.smackx.muc.MultiUserChat$3.processStanza(MultiUserChat.java:309)
at org.jivesoftware.smack.AbstractXMPPConnection.lambda$invokeStanzaCollectorsAndNotifyRecvListeners$8(AbstractXMPPConnection.java:1619)
at org.jivesoftware.smack.AbstractXMPPConnection$$Lambda$360/0x0000000800587440.run(Unknown Source)
at org.jivesoftware.smack.AbstractXMPPConnection$10.run(AbstractXMPPConnection.java:2149)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.18/Thread.java:829)
```
This thread starts essentially via `MultiUserChat.presenceListener`. This is not an asynchronous stanza listener, so it must return before subsequent incoming stanzas are processed [1]. However, further up in the call stack, `JitsiMeetConferenceImpl.inviteChatMember` performs `chatRoomMember.getFeatures()`, which triggers a service discovery lookup via the `ServiceDiscoveryManager` (SDM). This lookup uses a `StanzaCollector` to wait for the result stanza. Unfortunately, that wait is in vain: the `StanzaCollector` is never notified about the result stanza, because for StanzaCollectors to be notified, this presence listener first has to return.
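To make this concrete, here is a minimal sketch of the pattern (hypothetical listener code, not the actual jicofo or Smack sources): a synchronous presence listener that performs a blocking disco#info lookup on the very connection it is listening on.

```java
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.filter.StanzaTypeFilter;
import org.jivesoftware.smackx.disco.ServiceDiscoveryManager;

final class BlockingInListenerSketch {
    static void install(XMPPConnection connection) {
        // Synchronous stanza listeners are invoked one at a time, in the order the
        // stanzas were received, and must return before further stanzas are dispatched.
        connection.addSyncStanzaListener(stanza -> {
            try {
                // discoverInfo() sends a disco#info IQ and then blocks on a StanzaCollector
                // until the result arrives. As described above, the collector can only be
                // notified once this listener has returned, so the call can only end in a
                // timeout.
                ServiceDiscoveryManager.getInstanceFor(connection)
                        .discoverInfo(stanza.getFrom());
            } catch (Exception e) {
                // In this scenario: typically a NoResponseException after the reply timeout.
            }
        }, StanzaTypeFilter.PRESENCE);
    }
}
```

This matches the thread dump above, where the "Smack Cached Executor" thread sits in TIMED_WAITING inside `StanzaCollector.nextResult()` until the reply timeout expires.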
This is a classical deadlock and a general problem with non-asynchronous stanza listeners. However, I believe you want non-asynchronous listeners here since the order of events appears relevant.
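For contrast, this is roughly what an asynchronous registration looks like (again just a sketch, not a recommendation for jicofo): it would not hold up Smack's in-order stanza processing, but Smack then makes no guarantee about the order in which the listener sees the presences, which is exactly the property that seems to matter here.

```java
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.filter.StanzaTypeFilter;

final class AsyncListenerSketch {
    static void install(XMPPConnection connection) {
        // Asynchronous stanza listeners are invoked concurrently and do not block the
        // dispatching of further stanzas, at the price of losing the ordering guarantee.
        connection.addAsyncStanzaListener(
                stanza -> System.out.println("presence from " + stanza.getFrom()),
                StanzaTypeFilter.PRESENCE);
    }
}
```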
What you could probably do to fix this is to decouple the blocking operation (the service discovery lookup here) from the presence listener processing, allowing the presence listener to return while the potentially blocking operation is still in progress. In fact, this has already been done in `ChatRoomImpl`, which contains code that decouples an operation and makes it asynchronous, for example `TaskPools.getIoPool().execute(() -> …` in `ChatRoomImpl.leave()`.
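Applied to this case, the idea could look roughly like the following sketch. The method and parameter names around the lambda are assumptions for illustration, not the actual jicofo code; only the `TaskPools.getIoPool().execute(() -> …)` pattern is taken from `ChatRoomImpl`.

```java
// Hypothetical shape of the surrounding MUC presence handling:
private void memberJoined(ChatRoomMember member)
{
    // Hand the potentially blocking part off to the IO pool so the Smack presence
    // listener can return immediately and stanza processing continues.
    TaskPools.getIoPool().execute(() ->
    {
        // getFeatures() may block for up to the Smack reply timeout while the
        // disco#info request is answered, but no longer on the stanza-dispatching thread.
        var features = member.getFeatures();
        inviteChatMember(member, features); // hypothetical overload taking the features
    });
}
```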
But keep in mind that such a construct no longer guarantees the ordering of events. For example, if you were to do this for presence processing, consider two incoming presences from user A that declare A first as "available" and then as "away". The "away" presence could be processed first, even though it was received later, and only then the "available" presence. This could result in the user showing as "available" when they are in fact "away".
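A toy, self-contained demonstration of that hazard (plain java.util.concurrent, no jicofo code): two tasks handed to a multi-threaded pool may complete in either order.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class OutOfOrderDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        // Submitted in the order the presences were received...
        pool.execute(() -> System.out.println("processing presence: available"));
        pool.execute(() -> System.out.println("processing presence: away"));
        // ...but depending on thread scheduling, "away" may be printed before "available",
        // leaving the member looking "available" although the latest presence was "away".
        pool.shutdown();
    }
}
```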
One solution for this is to decouple the operation from the presence listener but retain the order via a queue and a single consumer that processes elements (actions, events, …) from the queue. However, if the queue is bounded, then you only reduce the risk of such a deadlock but do not prevent it by construction. Using an unbounded queue would prevent this deadlock by construction, but could potentially cause unbounded memory usage, leading to OOM situations.
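As a sketch of that queue-plus-single-consumer variant (again not jicofo code): a single-threaded executor is effectively an unbounded FIFO queue drained by one worker, so it decouples the work from the listener thread while preserving submission order, at the cost of the unbounded memory usage mentioned above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class OrderedPresenceProcessor {
    // Backed by an unbounded queue and exactly one consumer thread.
    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    /** Called from the Smack listener thread; returns immediately. */
    public void presenceReceived(String occupant, String show) {
        queue.execute(() -> handlePresence(occupant, show));
    }

    private void handlePresence(String occupant, String show) {
        // Placeholder for the real, potentially blocking processing
        // (e.g. the service discovery lookup and the invite).
        System.out.println(occupant + " is now " + show);
    }
}
```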
So which is it for you, the red or the blue pill?
Jokes aside, I have to give this some more thought. I believe the fix currently in the 4.4 branch is fundamentally not wrong, but I also will not rule out the possibility of reverting it and finding a different, maybe even better, solution.
[1]: This is a new condition which was added due to SMACK-927 in igniterealtime/Smack@92f253c ("[core] Replace AbstractXMPPConnection.inOrderListeners") to fix the initial issue of this thread.