While developing a SINT (Smack Integration Test) extension against the current main branch of Smack (4.5.0-alpha), one of many test runs suddenly seemed to grind to a halt. After a couple of minutes, there was still no progress. I took a thread dump of the process, which gave me the following:
"main" #1 prio=5 os_prio=0 cpu=1008,36ms elapsed=400,35s tid=0x00007f19dc018800 nid=0xae0f in Object.wait() [0x00007f19e1f68000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.11/Native Method)
- waiting on <0x00000004525a9238> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
at java.lang.Object.wait(java.base@11.0.11/Object.java:328)
at org.jivesoftware.smackx.muc.MultiUserChat.enter(MultiUserChat.java:423)
- waiting to re-lock in wait() <0x00000004525a9238> (a org.jivesoftware.smackx.muc.MultiUserChat$3)
at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:720)
- locked <0x00000004525a8e30> (a org.jivesoftware.smackx.muc.MultiUserChat)
at org.jivesoftware.smackx.muc.MultiUserChat.join(MultiUserChat.java:648)
at org.jivesoftware.smackx.muc.MultiUserChatRolesAffiliationsPrivilegesIntegrationTest.mucTestPersistentAffiliation(MultiUserChatRolesAffiliationsPrivilegesIntegrationTest.java:659)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.11/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.11/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.11/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.11/Method.java:566)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.lambda$runTests$0(SmackIntegrationTestFramework.java:405)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework$$Lambda$120/0x0000000840213440.execute(Unknown Source)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.runConcreteTest(SmackIntegrationTestFramework.java:480)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework$PreparedTest.run(SmackIntegrationTestFramework.java:681)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.runTests(SmackIntegrationTestFramework.java:468)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.run(SmackIntegrationTestFramework.java:229)
- locked <0x00000004154001f0> (a org.igniterealtime.smack.inttest.SmackIntegrationTestFramework)
at org.igniterealtime.smack.inttest.SmackIntegrationTestFramework.main(SmackIntegrationTestFramework.java:107)
It seems that the thread is waiting indefinitely on reflected self presence, as part of a user joining a MUC room.
Line 423 is the one that invokes wait() in the snippet below:
synchronized (presenceListener) {
    // Only continue after we have received *and* processed the reflected self-presence. Since presences are
    // handled in an extra listener, we may return from enter() without having processed all presences of the
    // participants, resulting in e.g. a too low participant counter after enter(). Hence we wait here until the
    // processing is done.
    while (!processedReflectedSelfPresence) {
        presenceListener.wait();
    }
}
Iām not ruling out that the server is to blame here. I cannot verify if it actually sent the self-presence. I did check that according to the server, the user had entered the room when this thread dump was made.
Should Smack guard against misbehaving servers, by adding some kind of time-out here?
Could this be caused by an issue in Smack? Is the setting and reading of the value in the processedReflectedSelfPresence field (which is defined as being volatile) thread safe?
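On the thread-safety question: a volatile read and write of the flag is safe in the sense of visibility, and the classic lost-wakeup hazard is avoided as long as the listener sets the flag and calls notify() while holding the same monitor the waiter synchronizes on. What the current loop lacks is any bound on the wait. Below is a minimal sketch (not Smack code; the names are hypothetical) of what a bounded variant of the wait loop could look like, answering the time-out question:

```java
// Sketch of a deadline-bounded replacement for the wait loop in enter().
// A real fix would likely throw SmackException.NoResponseException instead
// of returning false on timeout.
public class BoundedWaitSketch {
    private final Object monitor = new Object();
    private boolean processed; // only read/written under the monitor here

    // Called by the presence listener once the reflected self-presence has
    // been fully processed. Flag write and notify happen under the same lock,
    // so a wakeup can never be lost.
    public void markProcessed() {
        synchronized (monitor) {
            processed = true;
            monitor.notifyAll();
        }
    }

    // Waits at most timeoutMillis; returns false on timeout instead of
    // blocking forever on a misbehaving server. The deadline is recomputed
    // on each loop iteration to guard against spurious wakeups.
    public boolean waitForProcessed(long timeoutMillis) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (monitor) {
            while (!processed) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                monitor.wait(remaining);
            }
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        BoundedWaitSketch sketch = new BoundedWaitSketch();
        // Misbehaving-server path: nobody ever delivers the self-presence.
        System.out.println(sketch.waitForProcessed(50));   // false
        // Happy path: a listener thread marks the flag shortly after.
        new Thread(() -> {
            try { Thread.sleep(20); } catch (InterruptedException ignored) { }
            sketch.markProcessed();
        }).start();
        System.out.println(sketch.waitForProcessed(5000)); // true
    }
}
```

A natural choice for the timeout value would be the connection's existing reply timeout, so that a stuck join fails the same way as any unanswered IQ.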
The code above already ensures that the self-presence was received, but not that it was processed (at least, that is the intention of the code).
With the information available at the moment, my guess would be that something blocks the execution of the presenceListener, so that it is not able to notify about the processed self-presence.
We ran into the same issue with Jitsi/Jicofo after updating to Smack 4.4.3, and I believe the root cause is a race condition in MultiUserChat. Hereās the race condition we believe is possible:
Thread A executes the MultiUserChat#presenceListener for a stanza containing the self-presence. It gets preempted here (note that it does not hold a lock on the MultiUserChat object).
Thread B (a user thread) executes leave(), which clears occupantsMap. It also removes presenceListener from the connection, but that does not affect Thread A, which is already running.
Thread A continues execution and places an entry in occupantsMap for the local occupant.
At this stage, until the MultiUserChat object gets garbage collected, it is impossible to join the same MUC with the same nickname. Getting the cached MultiUserChat object from MultiUserChatManager and calling createOrJoin(nickname) leads to the thread being stuck in the state Guus reported: enter() waits for the processedReflectedSelfPresence flag to be set, which never happens because the self-presence was mistakenly handled as an update to an existing occupant.
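The race described above can be modeled in isolation, without any Smack dependencies. This is a toy sketch (the class and method names are hypothetical stand-ins, not Smack code): leave() clears the occupants map and deregisters the listener, but it cannot stop a listener invocation that is already in flight, so a stale entry survives the leave.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race: a listener thread that was already processing the
// reflected self-presence when leave() ran re-inserts the local occupant
// into an already-cleared occupants map.
public class StaleOccupantSketch {
    private final Map<String, String> occupantsMap = new ConcurrentHashMap<>();

    // Models MultiUserChat#presenceListener handling a self-presence.
    // In the race, Thread A is preempted right before this put(), and
    // Thread B's leave() runs in between.
    public void presenceListenerHandlesSelfPresence(String occupantJid) {
        occupantsMap.put(occupantJid, "available");
    }

    // Models leave(): clears state and deregisters the listener, but cannot
    // cancel an invocation of the listener that is already running.
    public void leave() {
        occupantsMap.clear();
    }

    public int occupantCount() {
        return occupantsMap.size();
    }

    public static void main(String[] args) {
        StaleOccupantSketch muc = new StaleOccupantSketch();
        // Interleaving from the race: leave() completes first, then the
        // in-flight listener invocation finishes.
        muc.leave();
        muc.presenceListenerHandlesSelfPresence("room@muc.example/nick");
        // The MUC has been left, yet the occupants map is not empty. This is
        // the stale state that later makes the next reflected self-presence
        // look like a mere occupant update.
        System.out.println(muc.occupantCount()); // 1, despite having left
    }
}
```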
We believe weāre running into this problem in practice, because the current Jicofo code sends a presence update right before leaving the room, and the XMPP server runs on the local machine. These presence updates were seen shortly before we saw a thread hung in enter() for the same MUC.
We stopped this behavior, and we expect to see these occurrences much less often (it happens relatively often right now; we noticed machines leaking 1500 threads within a few days), but it wonāt fix the problem. We brainstormed some solutions:
1. Acquire the lock in presenceListener and ignore presence after leave(). I have no idea if this is safe.
2. Clear occupantsMap every time before joining. For example, here we know that we are not joined, so the map should be empty, and weāre holding the lock. This may leave other state touched by presenceListener at risk.
3. Provide a way for applications to disable caching of MultiUserChat objects. This doesnāt actually fix the problem, but provides a way for applications to work around it.
We finally managed to update to Smack 4.4.3 and stop using a fork, so if you have any suggestions for ways for us to work around the problem reliably without changes to Smack, they would be welcome. Weāre currently leaning towards implementing option 2 in our app with reflection.
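For what itās worth, the reflection variant of option 2 could look roughly like the sketch below. Since the real target is a private implementation detail of MultiUserChat (a field named occupantsMap in current Smack sources, which may be renamed at any time), the sketch demonstrates the mechanism against a hypothetical stand-in class rather than against Smack itself:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class ReflectionClearSketch {
    // Hypothetical stand-in for MultiUserChat with its private occupantsMap.
    public static class FakeMuc {
        private final Map<String, String> occupantsMap = new HashMap<>();

        public FakeMuc() {
            // Simulate the stale entry left behind by the race.
            occupantsMap.put("room@muc.example/nick", "available");
        }

        public int occupantCount() {
            return occupantsMap.size();
        }
    }

    // Clears a private Map field by name. Against real Smack classes this may
    // additionally require --add-opens on newer JDKs, and it silently breaks
    // if the field is ever renamed, hence "work-around", not "fix".
    public static void clearPrivateMap(Object target, String fieldName) throws Exception {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        ((Map<?, ?>) field.get(target)).clear();
    }

    public static void main(String[] args) throws Exception {
        FakeMuc muc = new FakeMuc();
        clearPrivateMap(muc, "occupantsMap");
        System.out.println(muc.occupantCount()); // 0 after the workaround
    }
}
```

Calling this right before createOrJoin(), while holding whatever lock the application uses to serialize joins, mirrors option 2 without a Smack patch.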
Can you distill a minimalistic unit test from your automation, not depending on any proprietary code? That would be of interest while diving into this, if only to verify any fixes that we might come up with.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:
'smack-core/src/main/java/org/jivesoftware/smack/provider/IQProvider.java'
'smack-core/src/main/java/org/jivesoftware/smack/provider/IqProvider.java'
I see there is some mocking around the connection; I may try it, if I manage to build the project, but I donāt promise that something will come out of it ā¦
Ugh, youāre on a Mac, I suppose? That issue is so annoying. Thereās no easy work-around, Iām afraid. Lots of technical debt and not enough resources to tackle it. Sorry for that. :-/
Whatās the premise of your reproduction? Create a chatroom, join, send presence, leave (lather rinse repeat)?
Any client first sends an IQ with the room name to Jicofo.
Jicofo is the first to join and create the room (there is a Prosody mode where only admins (Jicofo) are allowed to create rooms for that MUC component).
Then Jicofo responds to the client, which triggers the client to join.
When the last client leaves, Jicofo leaves the room.
The steps will be:
Two participants are in the room; when participant1 leaves, then participant2 also leaves (this is Jicofo in our case).
Make sure that before leaving, some self-presences (status == 110) have been received by participant2.
The next time participant2 tries to join the same room:
MultiUserChat muc = manager.getMultiUserChat(roomJid);
if (muc.getOccupants().size() > 0) {
    System.err.println("Houston, we have a problem!");
}
muc.createOrJoin(nickname);
After this, you will see the stuck thread as in the first post in this thread.
Of course, it is a synchronization and timing issue; in my case I was hitting it after 50-60 executions of the same test case, using the same room name.
Ok, that helps. Iām trying to dig up the details of your setup from memory. Are you still using an actual user-entity in jicofo (the āfocusā user) that, much like a regular user, joins the room to create it? Is that focus user still in the room when participant1 and participant2 leave?
To reproduce without any additional infrastructure, use a server on the same host to make the race easier to hit (I donāt think the server software makes a difference), and join a MUC. Then send an updated presence and immediately call leave(). Then try joining the same MUC with the same nickname.
Iām having a tough time reproducing the problem. So far, Iāve created this test in the Smack Integration Test Framework, which Iām running against a server running on the same computer.
Without logging on, I can easily increase the iteration count to 10,000 without running into some kind of deadlock.
Am I correctly reproducing the scenario here?
@SmackIntegrationTest
public void mucTest() throws Exception {
    EntityBareJid mucAddress = getRandomRoom("smack-inttest-message");
    MultiUserChat mucAsSeenByOne = mucManagerOne.getMultiUserChat(mucAddress);
    Resourcepart nickname = Resourcepart.from("one-" + randomString);

    // Join a new room, configure it to be persistent.
    mucAsSeenByOne.join(nickname);
    FillableForm configForm = mucAsSeenByOne.getConfigurationForm().getFillableForm();
    configForm.setAnswer("muc#roomconfig_persistentroom", true);
    mucAsSeenByOne.sendConfigurationForm(configForm);

    for (int i = 0; i < 2000; i++) {
        Presence presence = mucAsSeenByOne.getXmppConnection().getStanzaFactory()
            .buildPresenceStanza()
            .ofType(Presence.Type.available)
            .setMode(Presence.Mode.away)
            .to(mucAddress)
            .build();
        mucAsSeenByOne.getXmppConnection().sendStanza(presence);
        mucAsSeenByOne.leave();
        mucAsSeenByOne.createOrJoin(nickname);
    }

    // Clean-up after test.
    tryDestroy(mucAsSeenByOne);
}
This is a snippet from the data exchange, which includes two of the many iterations.
Oh, one difference: we leave the room in a different thread.
So can you modify your test that way, and do the join a little bit after that ā¦ maybe 1-2 seconds later?
Another difference is that we always look up the room in the cache (mucManagerOne.getMultiUserChat) before joining; not sure whether that makes any difference.
Thanks for your very detailed situation report. I finally had some time to look into it, and I found that what you described is plausible. I am not yet sure about the best solution. Assuming that you correctly identified the cause, I wondered if simply moving the if (mucUser.getStatus().contains(MUCUser.Status.PRESENCE_TO_SELF_110)) check to the first position would already help.
I have some untested code for this at
While I do hope to find time to look deeper into this, I am happy to hear your opinion about this approach.
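To make the reordering idea concrete, here is a toy model of it (not the actual patch; the names and structure are simplified stand-ins for the real listener). The point is that the status code 110 check runs unconditionally before any occupant bookkeeping, so a stale entry in the occupants map can no longer cause the reflected self-presence to be handled as a plain occupant update:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed reordering in the presence listener: the
// self-presence (status 110) check comes first, before the occupant map
// is consulted or updated.
public class ReorderedListenerSketch {
    private final Map<String, String> occupantsMap = new HashMap<>();
    public boolean processedReflectedSelfPresence;

    public void handlePresence(String from, Set<Integer> statusCodes) {
        // Proposed order: detect the reflected self-presence unconditionally,
        // whether or not "from" is already present in the (possibly stale) map.
        if (statusCodes.contains(110)) {
            processedReflectedSelfPresence = true;
        }
        // Occupant bookkeeping happens afterwards.
        occupantsMap.put(from, "available");
    }

    public static void main(String[] args) {
        ReorderedListenerSketch listener = new ReorderedListenerSketch();
        // Simulate the stale entry left behind by the race described earlier.
        listener.occupantsMap.put("room@muc.example/nick", "available");
        // The reflected self-presence still flips the flag, stale entry or not.
        listener.handlePresence("room@muc.example/nick", Set.of(110));
        System.out.println(listener.processedReflectedSelfPresence); // true
    }
}
```

Whether this is sufficient depends on what other per-occupant state the real listener touches before the 110 check; clearing the stale map on leave()/join would still be worthwhile as a belt-and-braces measure.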