While developing a SINT (Smack Integration Test) extension against the current main branch of Smack (4.5.0-alpha), one of many test runs suddenly seemed to grind to a halt. After a couple of minutes, there was still no progress. I took a thread dump of the process, which gave me the following:
"main" #1 prio=5 os_prio=0 cpu=1008,36ms elapsed=400,35s tid=0x00007f19dc018800 nid=0xae0f in Object.wait() [0x00007f19e1f68000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.11/Native Method)
- waiting on <0x00000004525a9238> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
at java.lang.Object.wait(java.base@11.0.11/Object.java:328)
at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:423)
- waiting to re-lock in wait() <0x00000004525a9238> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:720)
- locked <0x00000004525a8e30> (a org.jivesoftware.smackx.muc.MultiUserChat)
at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:648)
at org.jivesoftware.smackx.muc.MultiUserChatRolesAffiliationsPrivilegesIntegrationTest.mucTestPersistentAffiliation(MultiUserChatRolesAffiliationsPrivilegesIntegrationTest.java:659)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.11/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.11/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.11/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.11/Method.java:566)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.lambda$runTests$0(SmackIntegrationTestFramework.java:405)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework$$Lambda$120/0x0000000840213440.execute(Unknown Source)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.runConcreteTest(SmackIntegrationTestFramework.java:480)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework$PreparedTest.run(SmackIntegrationTestFramework.java:681)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.runTests(SmackIntegrationTestFramework.java:468)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.run(SmackIntegrationTestFramework.java:229)
- locked <0x00000004154001f0> (a org.igniterealtime.smack.inttest.SmackIntegrationTestFramework)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.main(SmackIntegrationTestFramework.java:107)
It seems that the thread is waiting indefinitely on reflected self presence, as part of a user joining a MUC room.
Line 423 is the one that invokes wait() in the snippet below:
synchronized (presenceListener) {
    // Only continue after we have received *and* processed the reflected self-presence. Since presences are
    // handled in an extra listener, we may return from enter() without having processed all presences of the
    // participants, resulting in e.g. a too low participant counter after enter(). Hence we wait here until the
    // processing is done.
    while (!processedReflectedSelfPresence) {
        presenceListener.wait();
    }
}
Iām not ruling out that the server is to blame here. I cannot verify if it actually sent the self-presence. I did check that according to the server, the user had entered the room when this thread dump was made.
Should Smack guard against misbehaving servers, by adding some kind of time-out here?
Could this be caused by an issue in Smack? Is the setting and reading of the value in the processedReflectedSelfPresence field (which is defined as being volatile) thread safe?
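On the thread-safety question: a volatile read and write of the flag is safe in the sense of visibility, and the classic lost-wakeup hazard is avoided as long as the listener sets the flag and calls notify() while holding the same monitor the waiter synchronizes on. What the current loop lacks is any bound on the wait. Below is a minimal sketch (not Smack code; the names are hypothetical) of what a bounded variant of the wait loop could look like, answering the time-out question:

```java
// Sketch of a deadline-bounded replacement for the wait loop in enter().
// A real fix would likely throw SmackException.NoResponseException instead
// of returning false on timeout.
public class BoundedWaitSketch {
    private final Object monitor = new Object();
    private boolean processed; // only read/written under the monitor here

    // Called by the presence listener once the reflected self-presence has
    // been fully processed. Flag write and notify happen under the same lock,
    // so a wakeup can never be lost.
    public void markProcessed() {
        synchronized (monitor) {
            processed = true;
            monitor.notifyAll();
        }
    }

    // Waits at most timeoutMillis; returns false on timeout instead of
    // blocking forever on a misbehaving server. The deadline is recomputed
    // on each loop iteration to guard against spurious wakeups.
    public boolean waitForProcessed(long timeoutMillis) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (monitor) {
            while (!processed) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                monitor.wait(remaining);
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        BoundedWaitSketch sketch = new BoundedWaitSketch();
        // Misbehaving-server path: nobody ever delivers the self-presence.
        System.out.println(sketch.waitForProcessed(50));   // false
        // Happy path: a listener thread marks the flag shortly after.
        new Thread(() -> {
            try { Thread.sleep(20); } catch (InterruptedException ignored) { }
            sketch.markProcessed();
        }).start();
        System.out.println(sketch.waitForProcessed(5000)); // true
    }
}
```

A natural choice for the timeout value would be the connection's existing reply timeout, so that a stuck join fails the same way as any unanswered IQ.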
The code above already ensures that the self-presence was received, but not that it was processed (at least, that is the intention of the code).
With the information available at the moment, my guess would be that something blocks the execution of the presenceListener, so that it is not able to notify about the processed self-presence.
We ran into the same issue with Jitsi/Jicofo after updating to Smack 4.4.3, and I believe the root cause is a race condition in MultiUserChat. Hereās the race condition we believe is possible:
Thread A executes the MultiUserChat#presenceListener for a stanza containing the self-presence. It gets preempted here (note that it does not hold a lock on the MultiUserChat object).
Thread B (a user thread) executes leave(), which clears occupantsMap. It also removes presenceListener from the connection, but that does not affect Thread A, which is already running.
Thread A continues execution and places an entry in occupantsMap for the local occupant.
At this stage, until the MultiUserChat object gets garbage collected, it is impossible to join the same MUC with the same nickname. Getting the cached MultiUserChat object from MultiUserChatManager and calling createOrJoin(nickname) leads to the thread being stuck in the state Guus reported: enter() waits for the processedReflectedSelfPresence flag to be set, which never happens because the self-presence was mistakenly handled as an update to an existing occupant.
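The race described above can be modeled in isolation, without any Smack dependencies. This is a toy sketch (the class and method names are hypothetical stand-ins, not Smack code): leave() clears the occupants map and deregisters the listener, but it cannot stop a listener invocation that is already in flight, so a stale entry survives the leave.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race: a listener thread that was already processing the
// reflected self-presence when leave() ran re-inserts the local occupant
// into an already-cleared occupants map.
public class StaleOccupantSketch {
    private final Map<String, String> occupantsMap = new ConcurrentHashMap<>();

    // Models MultiUserChat#presenceListener handling a self-presence.
    // In the race, Thread A is preempted right before this put(), and
    // Thread B's leave() runs in between.
    public void presenceListenerHandlesSelfPresence(String occupantJid) {
        occupantsMap.put(occupantJid, "available");
    }

    // Models leave(): clears state and deregisters the listener, but cannot
    // cancel an invocation of the listener that is already running.
    public void leave() {
        occupantsMap.clear();
    }

    public int occupantCount() {
        return occupantsMap.size();
    }

    public static void main(String[] args) {
        StaleOccupantSketch muc = new StaleOccupantSketch();
        // Interleaving from the race: leave() completes first, then the
        // in-flight listener invocation finishes.
        muc.leave();
        muc.presenceListenerHandlesSelfPresence("room@muc.example/nick");
        // The MUC has been left, yet the occupants map is not empty. This is
        // the stale state that later makes the next reflected self-presence
        // look like a mere occupant update.
        System.out.println(muc.occupantCount()); // 1, despite having left
    }
}
```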
We believe weāre running into this problem in practice, because the current Jicofo code sends a presence update right before leaving the room, and the XMPP server runs on the local machine. These presence updates were seen shortly before we saw a thread hung in enter() for the same MUC.
We stopped this behavior, and we expect to see these occurrences much less often (it happens relatively often right now; we noticed machines leaking 1500 threads within a few days), but it wonāt fix the problem. We brainstormed some solutions:
1. Acquire the lock in presenceListener and ignore presence after leave(). I have no idea if this is safe.
2. Clear occupantsMap every time before joining. For example, here we know that we are not joined, so the map should be empty, and weāre holding the lock. This may leave other state touched by presenceListener at risk.
3. Provide a way for applications to disable caching of MultiUserChat objects. This doesnāt actually fix the problem, but provides a way for applications to work around it.
We finally managed to update to Smack 4.4.3 and stop using a fork, so if you have any suggestions for ways for us to work around the problem reliably without changes to Smack, they would be welcome. Weāre currently leaning towards implementing option 2 in our app with reflection.
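For what itās worth, the reflection variant of option 2 could look roughly like the sketch below. Since the real target is a private implementation detail of MultiUserChat (a field named occupantsMap in current Smack sources, which may be renamed at any time), the sketch demonstrates the mechanism against a hypothetical stand-in class rather than against Smack itself:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class ReflectionClearSketch {
    // Hypothetical stand-in for MultiUserChat with its private occupantsMap.
    public static class FakeMuc {
        private final Map<String, String> occupantsMap = new HashMap<>();

        public FakeMuc() {
            // Simulate the stale entry left behind by the race.
            occupantsMap.put("room@muc.example/nick", "available");
        }

        public int occupantCount() {
            return occupantsMap.size();
        }
    }

    // Clears a private Map field by name. Against real Smack classes this may
    // additionally require --add-opens on newer JDKs, and it silently breaks
    // if the field is ever renamed, hence "work-around", not "fix".
    public static void clearPrivateMap(Object target, String fieldName) throws Exception {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        ((Map<?, ?>) field.get(target)).clear();
    }

    public static void main(String[] args) throws Exception {
        FakeMuc muc = new FakeMuc();
        clearPrivateMap(muc, "occupantsMap");
        System.out.println(muc.occupantCount()); // 0 after the workaround
    }
}
```

Calling this right before createOrJoin(), while holding whatever lock the application uses to serialize joins, mirrors option 2 without a Smack patch.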
Can you distill a minimalistic unit test from your automation, not depending on any proprietary code? That would be of interest while diving into this, if only to verify any fixes that we might come up with.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'smack-core/src/main/java/org/jivesoftware/smack/provider/IQProvider.java'
'smack-core/src/main/java/org/jivesoftware/smack/provider/IqProvider.java'
I see there is some mocking around the connection; I may try it, if I manage to build the project, but I donāt promise that something will come out of it ā¦
Ugh, youāre on a Mac, I suppose? That issue is so annoying. Thereās no easy work-around, Iām afraid. Lots of technical debt and not enough resources to tackle it. Sorry for that. :-/
Whatās the premise of your reproduction? Create a chatroom, join, send presence, leave (lather rinse repeat)?
Any client first sends an IQ with the room name to Jicofo.
Jicofo is the first to join and create the room (there is a Prosody mode where only admins (Jicofo) are allowed to create rooms for that MUC component).
Then Jicofo responds to the client, which triggers the client to join.
When the last client leaves, Jicofo leaves the room.
The steps will be:
Two participants are in the room; when participant1 leaves, then participant2 also leaves (this is Jicofo in our case).
Make sure that before leaving, some self-presences (status == 110) have been received by participant2.
The next time participant2 tries to join the same room:
MultiUserChat muc = manager.getMultiUserChat(roomJid);
if (muc.getOccupants().size() > 0) {
    System.err.println("Houston, we have a problem!");
}
muc.createOrJoin(nickname);
After this, you will see the stuck thread as in the first post in this thread.
Of course, it is a synchronization and timing issue; in my case I was hitting it after 50-60 executions of the same test case, using the same room name.
Ok, that helps. Iām trying to dig up the details of your setup from memory. Are you still using an actual user-entity in jicofo (the āfocusā user) that, much like a regular user, joins the room to create it? Is that focus user still in the room when participant1 and participant2 leave?
To reproduce without any additional infrastructure, use a server on the same host to make the race easier to hit (I donāt think the server software makes a difference), and join a MUC. Then send an updated presence and immediately call leave(). Then try joining the same MUC with the same nickname.
Iām having a tough time reproducing the problem. So far, Iāve created this test in the Smack Integration Test Framework, which Iām running against a server running on the same computer.
Without logging on, I can easily increase the iteration count to 10,000 without running into some kind of deadlock.
Am I correctly reproducing the scenario here?
@SmackIntegrationTest
public void mucTest() throws Exception {
    EntityBareJid mucAddress = getRandomRoom("smack-inttest-message");
    MultiUserChat mucAsSeenByOne = mucManagerOne.getMultiUserChat(mucAddress);
    Resourcepart nickname = Resourcepart.from("one-" + randomString);

    // Join a new room, configure it to be persistent.
    mucAsSeenByOne.join(nickname);
    FillableForm configForm = mucAsSeenByOne.getConfigurationForm().getFillableForm();
    configForm.setAnswer("muc#roomconfig_persistentroom", true);
    mucAsSeenByOne.sendConfigurationForm(configForm);

    for (int i = 0; i < 2000; i++) {
        Presence presence = mucAsSeenByOne.getXmppConnection().getStanzaFactory()
            .buildPresenceStanza()
            .ofType(Presence.Type.available)
            .setMode(Presence.Mode.away)
            .to(mucAddress)
            .build();
        mucAsSeenByOne.getXmppConnection().sendStanza(presence);
        mucAsSeenByOne.leave();
        mucAsSeenByOne.createOrJoin(nickname);
    }

    // Clean-up after test.
    tryDestroy(mucAsSeenByOne);
}
This is a snippet from the data exchange, which includes two of the many iterations.
Oh, one difference: we leave the room in a different thread.
So can you modify your test that way, and do the join a little bit after that ā¦ maybe 1-2 seconds later?
Another difference is that we always look up the room in the cache (mucManagerOne.getMultiUserChat) before joining; not sure whether that makes any difference.
Thanks for your very detailed situation report. I finally had some time to look into it, and I found that what you described is plausible. I am not yet sure about the best solution. Assuming that you correctly identified the cause, I wondered if simply moving the if (mucUser.getStatus().contains(MUCUser.Status.PRESENCE_TO_SELF_110)) check to the first position would already help.
I have some untested code for this at
While I do hope to find time to look deeper into this, I am happy to hear your opinion about this approach.
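To make the reordering idea concrete, here is a toy model of it (not the actual patch; the names and structure are simplified stand-ins for the real listener). The point is that the status code 110 check runs unconditionally before any occupant bookkeeping, so a stale entry in the occupants map can no longer cause the reflected self-presence to be handled as a plain occupant update:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed reordering in the presence listener: the
// self-presence (status 110) check comes first, before the occupant map
// is consulted or updated.
public class ReorderedListenerSketch {
    private final Map<String, String> occupantsMap = new HashMap<>();
    public boolean processedReflectedSelfPresence;

    public void handlePresence(String from, Set<Integer> statusCodes) {
        // Proposed order: detect the reflected self-presence unconditionally,
        // whether or not "from" is already present in the (possibly stale) map.
        if (statusCodes.contains(110)) {
            processedReflectedSelfPresence = true;
        }
        // Occupant bookkeeping happens afterwards.
        occupantsMap.put(from, "available");
    }

    public static void main(String[] args) {
        ReorderedListenerSketch listener = new ReorderedListenerSketch();
        // Simulate the stale entry left behind by the race described earlier.
        listener.occupantsMap.put("room@muc.example/nick", "available");
        // The reflected self-presence still flips the flag, stale entry or not.
        listener.handlePresence("room@muc.example/nick", Set.of(110));
        System.out.println(listener.processedReflectedSelfPresence); // true
    }
}
```

Whether this is sufficient depends on what other per-occupant state the real listener touches before the 110 check; clearing the stale map on leave()/join would still be worthwhile as a belt-and-braces measure.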