Smack 4.4.4: pingFailed not being called

Some thoughts that are long shots at best. I’m also on mobile, please excuse the brevity.

Have you found a way to create a heap dump (or maybe a thread dump) at a moment when pingFailed should have been called but was not? It might be worth checking whether a thread is stuck somewhere, delaying the execution (maybe indefinitely).
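(In case it helps: a minimal sketch of dumping all thread stacks from inside the application, using the standard ThreadMXBean. The class and method names are just placeholders; running jstack against the JVM’s PID gives you the same information.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumpHelper {
    // Print each live thread's state and (possibly truncated) stack trace,
    // e.g. to check whether a SmackReactor thread is blocked at the time
    // pingFailed should have fired.
    static void dumpAllThreads() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(true, true)) {
            System.err.print(threadInfo);
        }
    }
}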

Is there any chance that the ping success is actually correct, and that the other connection-lost detection is firing unnecessarily? When you expected the pingFailed callback to have been called, was the success handler invoked instead?

In Java projects, I have previously experienced confusing behaviour like this when an uncaught Throwable occurs in an executor pool, for example when Exceptions are caught/handled, but Errors are not (catch Exception vs catch Throwable). I have no indication that this is at play here, but the unexpected behaviour reminds me of the wild goose chases that I had before.

Thanks for the suggestions, Guus!

I just got a copy of George’s heap dump and I think it’s a deadlock. The dump was taken on January 11. The two SmackReactor threads are waiting on the selector: one is doing a select() on line 257 and the other is waiting on the lock on line 214. What is strange is that scheduledActions contains 4 entries that were supposed to be executed on January 3. A second dump from a day later still contains the exact same entries (and the threads are in the same state). This makes me think the thread is running select(0) and will block indefinitely.

I think this is possible if a thread executed schedule() after the first thread finished scheduledActions.peek() (line 215). Presumably any new call to schedule() would have woken up the reactor; however, to explain the 4 entries in scheduledActions, 4 threads would have had to win the race (together), which seems very unlikely.

My plan right now is to just improve the way we handle this case in our application. I do think there’s something going wrong with SmackReactor, but I don’t know if it’s the same as the one the OP wrote about (apologies for hijacking the thread if that’s the case). If you investigate further I’m happy to assist.

Boris


I’ve raised a new Jira ticket for this: [SMACK-923] Deadlock PingManager - Ignite Realtime Jira

Hi Boris, 🙂

If a thread is within the critical section starting in L214 but has already peeked at the scheduledActions, then the selector.wakeup() in schedule() would either wake up the thread from blocking in select(), or just make the select() non-blocking, in case the thread has not yet invoked it. In either case, the thread would cycle through the reactor again and handle the scheduledActions before potentially blocking again in select().
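(For anyone following along, this relies on the documented java.nio Selector contract: wakeup() either interrupts a select() that is currently blocking, or causes the very next select() to return immediately if none is in progress. A tiny, Smack-free illustration of the second case:)

import java.io.IOException;
import java.nio.channels.Selector;

public class SelectorWakeupDemo {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();

        // wakeup() before select(): the wakeup is remembered, so the
        // following select() returns immediately instead of blocking forever.
        selector.wakeup();
        int readyChannels = selector.select();
        System.out.println("select() returned " + readyChannels + " without blocking");

        selector.close();
    }
}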

Hence, I am not sure which race you are describing. So let’s try to fit the puzzle pieces together: could you elaborate once more?

And, of course, this is concurrent programming territory, and I certainly do not want to rule out bugs.

Good observation. The success callback may be invoked twice, but the error callback and the success callback should not both be invoked. I can see how this would be unexpected.

What do you think of

You are correct, I misunderstood how the selector works.

Ok then. 🙂 I wish I could say “good”, but if you have heap dumps with long-expired scheduled actions, then it still means there is a bug lurking somewhere.

4 posts were split to a new topic: Smack minimum ping interval

Indeed. I can somewhat reliably get it into this broken state (I don’t know if this is “good” or not).

I run 15 XMPPTCPConnections, each to a different server. All of them have pings enabled at a 30-second interval, all of them join a MUC (a separate MUC for each), and then there’s no traffic other than pings. I start a video recording session on XMPP connection “A” (this leads to a few IQs being exchanged), and then cut its network connection. For each XMPP connection I have a timer which logs if getLastStanzaReceived() is more than 30 seconds ago.
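(For context, the per-connection setup is roughly the following sketch; the credentials and logging are simplified placeholders and the MUC join is omitted, only PingManager and getLastStanzaReceived() matter here.)

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.jivesoftware.smack.AbstractXMPPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
import org.jivesoftware.smackx.ping.PingManager;

public class PingWatchdogSketch {
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    static AbstractXMPPConnection connectAndWatch(String domain, String user, String password)
            throws Exception {
        XMPPTCPConnectionConfiguration config = XMPPTCPConnectionConfiguration.builder()
                .setXmppDomain(domain)
                .setUsernameAndPassword(user, password)
                .build();
        AbstractXMPPConnection connection = new XMPPTCPConnection(config);
        connection.connect().login();

        // Pings every 30 seconds; pingFailed is the callback we expect to fire
        // a few minutes after the network connection is cut.
        PingManager pingManager = PingManager.getInstanceFor(connection);
        pingManager.setPingInterval(30);
        pingManager.registerPingFailedListener(
                () -> System.out.println(domain + ": pingFailed"));

        // Independent watchdog: log whenever nothing has been received for more than 30 seconds.
        TIMER.scheduleAtFixedRate(() -> {
            long silentMillis = System.currentTimeMillis() - connection.getLastStanzaReceived();
            if (silentMillis > 30_000) {
                System.out.println(domain + ": no stanza received for " + silentMillis + " ms");
            }
        }, 30, 30, TimeUnit.SECONDS);

        return connection;
    }
}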

The expected behavior is that within 2-3 minutes the pingFailed callback for connection “A” is called. Instead, my prints start appearing for “A”, and about a minute later they start appearing for all other connections as well, and no pingFailed callback is ever called. This continues indefinitely, and in the heap dump I see old items in the Reactor’s queue (20 minutes old right now).

This happens almost every time. I couldn’t reproduce it with 2 connections instead of 15, or if I skip the video recording session (I tried 3-4 times each).

I enabled all logs and I don’t see anything from SmackReactor, or any uncaught exceptions. My next step is to add more logs to SmackReactor. Any suggestions for what else I could look at?

I think I found the problem. If nextScheduledAction.getTimeToDueMillis() returns 0, the reactor thread does a select(0) instead of executing the action:
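(The excerpt I was pointing at, reconstructed roughly from the diff below; the surrounding guard is paraphrased, so treat it as a sketch rather than the exact source.)

long selectWait;
if (nextScheduledAction == null) {
    // No pending action: block in select() indefinitely.
    selectWait = 0;
} else {
    selectWait = nextScheduledAction.getTimeToDueMillis();
}

if (selectWait < 0) {
    // A scheduled action was just released and became ready to execute.
    return;
}

// If getTimeToDueMillis() returned exactly 0, neither branch above catches it,
// and we fall through to selector.select(0), which blocks indefinitely instead
// of executing the already-due action.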

This should fix it:

diff --git a/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java b/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
index 75e1af8e9..67e68f1fd 100644
--- a/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
+++ b/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
@@ -221,11 +221,10 @@ public class SmackReactor {
                     selectWait = 0;
                 } else {
                     selectWait = nextScheduledAction.getTimeToDueMillis();
-                }
-
-                if (selectWait < 0) {
-                    // A scheduled action was just released and became ready to execute.
-                    return;
+                    if (selectWait <= 0) {
+                        // A scheduled action was just released and became ready to execute.
+                        return;
+                    }
                 }
 
                 // Before we call select, we handle the pending the interest Ops. This will not block since no other


I think that <= 0 is definitely correct here (as opposed to the current < 0). But you would only get expired scheduled actions that never execute if you do not schedule further actions and if the select() in the reactor does not return for some other reason (e.g. because you do not use a NIO-based XMPP connection, which is not unlikely).

I am puzzled how this bug alone leads to multiple expired scheduled actions. But I haven’t had much sleep this week, and it should be fixed nevertheless. Maybe it really is the only root cause. Would you do me the honor of submitting a PR against the 4.4 branch? 🙂

Yes, this is exactly what happens in our case. I verified that if I schedule a dummy task while it’s stuck, it recovers.

When the reactor thread does a select(0) and there’s nothing to wake it up, all of the scheduled actions already in the queue will eventually expire. In our case we have a bunch of ping requests waiting to be sent, one from each of the XMPPConnections.

Certainly, I opened #512

Thank you for merging it. Are there any plans for a new release soon?

Yes, I hope in the next two weeks. I also hope that I am able to set up the nightly snapshot publishing to Maven Central as suggested by Ingo. But my spare time is sparse at the moment.

Smack snapshot artifacts should now be available on OSS Sonatype for you to test. Ideally you are able to test them and give me (hopefully positive) feedback, after which I would publish the latest state of the ‘4.4’ branch as Smack 4.4.5. Please let me know if you are not able to test them; then I’d aim for a Smack release in the next few days™.


I can test them, but I’m unable to find the artefacts (or to set up an environment to build Smack). Can you point me to them, please?

Boris

They are at Index of /repositories/snapshots/org/igniterealtime/smack, i.e. under https://oss.sonatype.org/content/repositories/snapshots/org/igniterealtime/smack/

repositories {
    maven { url "https://oss.sonatype.org/content/repositories/snapshots" }
}
smackVersion = '4.4.5-20220223.152334-4'
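(And then, assuming you pull in Smack via the usual coordinates, something like the following; the exact modules below are just examples:)

dependencies {
    implementation "org.igniterealtime.smack:smack-java8:$smackVersion"
    implementation "org.igniterealtime.smack:smack-tcp:$smackVersion"
    implementation "org.igniterealtime.smack:smack-extensions:$smackVersion"
}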

Thanks, I confirm I’m not able to reproduce the issue with this version!


Spoiler: I’ve released Smack 4.4.5 to Maven Central. Official release announcement to be done soon™.

