Smack 4.4.4: pingFailed not being called

Hi,

We just upgraded our product from Smack 4.3.0 to Smack 4.4.4. Everything looks good except when we drop the network connection to the OpenFire server (4.6.1). Our Spark clients detect the outage, but the pingFailed method is not being called. The code is pretty simple, once the connection is authenticated -

pingManager = PingManager.getInstanceFor(connection);
pingManager.setPingInterval(XMPPPingInterval); // Set to 15 secs for the test
pingManager.registerPingFailedListener(this);

Our class implements PingFailedListener with

@Override
public void pingFailed {
   // Code to log connection failure and try to reconnect
}

Thanks, Wray

Hi Wray,

With the implementation of a very basic demo client, running against a development build of Openfire (4.8.0-SNAPSHOT) running with the bin/openfire.sh -demoboot -debug startup arguments, I had no trouble getting the PingFailedListener to fire.

import org.jivesoftware.smack.AbstractXMPPConnection;
import org.jivesoftware.smack.ConnectionConfiguration;
import org.jivesoftware.smack.tcp.XMPPTCPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
import org.jivesoftware.smackx.ping.PingManager;

import java.time.Duration;

public class TestClient {
    public static void main(String[] args) throws Exception {
        XMPPTCPConnectionConfiguration configA = XMPPTCPConnectionConfiguration.builder()
            .setUsernameAndPassword("jane", "secret")
            .setXmppDomain("example.org")
            .setResource("test")
            .setHost("localhost")
            .setPort(5222)
            .setSecurityMode(ConnectionConfiguration.SecurityMode.disabled)
            .build();
        AbstractXMPPConnection connection = new XMPPTCPConnection(configA);

        connection.connect();
        connection.login();

        PingManager pingManager = PingManager.getInstanceFor(connection);
        pingManager.setPingInterval(5);
        pingManager.registerPingFailedListener(() -> System.out.println("Ping failed!"));

        Thread.sleep(Duration.ofMinutes(6).toMillis());

        connection.disconnect();
    }
}

After first observing a ping to be succesfull, I I placed a breakpoint in Openfireā€™s IQPingHandler, effectively pausing the processing of the client request indefinitely. The next ping that was fired caused the PingFailedListener to be invoked after about 2 minutes, as expected.

Looking at Smackā€™s PingManager implementation, this caught my eye in the ā€˜no response receivedā€™ handling of the ping request:

pingFuture.onError(new ExceptionCallback<Exception>() {
    @Override
    public void processException(Exception exception) {
        long lastStanzaReceived = connection.getLastStanzaReceived();
        if (lastStanzaReceived > 0) {
            long now = System.currentTimeMillis();
            // Delta since the last stanza was received
            int deltaInSeconds = (int)  ((now - lastStanzaReceived) / 1000);
            // If the delta is smaller then the ping interval, we have got an valid stanza in time
            // So not error notification needed
            if (deltaInSeconds < pingInterval) {
                maybeSchedulePingServerTask(deltaInSeconds);
                return;
            }
        }
    
        for (PingFailedListener l : pingFailedListeners) {
            l.pingFailed();
        }
    }
    });
}

It looks like the ping failure event listener does not fire, if between the time that the ping request was made, some other data was received by the server. The reasoning here probably is that such data can be considered ā€˜proof of lifeā€™.

If in your setup, you properly killed the network connection, I canā€™t see other data arrive at the client, but Iā€™m not sure exactly how you did that. If you only prevented the ping response from arriving, that might account for something.

Otherwise, Iā€™m not sure whatā€™s going on. Maybe another part of the code already picked up the network disconnect, and put Smack in some kind of ā€œIā€™m no longer connectedā€ state? Without more debugging, itā€™s hard to tell exactly.

Thank you sir. Since the problem is obviously on my end I will dig into our setup and see if something changed. Appreciate the help.

Quick follow up question. You set the pingInterval to 5 and from the java docs
Set the interval in seconds between a automated server ping is send. A negative value disables automatic server pings. All settings take effect immediately. If there is an active scheduled server ping it will be canceled and, if pingInterval is positive, a new one will be scheduled in pingInterval seconds.

If the ping fails after 3 attempts waiting the connections reply timeout for an answer, then the ping failed listeners will be invoked.

So shouldnā€™t the pingFailed listener be called after around 15 seconds? Perhaps the issue is I donā€™t understand when the pingFailed listeners are invovked?

We also noticed something similar in jibri after updating to Smack 4.4.4. Unfortunately I donā€™t know the details, and the person who does is on PTO right now. I think he still has a heapdump we can review. See here:

This bit of code in PingManager seems to control the timeout after which the PingFailed event listeners are triggered:

final long minimumTimeout = TimeUnit.MINUTES.toMillis(2);
final long connectionReplyTimeout = connection.getReplyTimeout();
final long timeout = connectionReplyTimeout > minimumTimeout ? connectionReplyTimeout : minimumTimeout;

From what Iā€™m seeing here, it is using the generic connection replyTimeout, or 2 minutes if the connection replyTimeout is smaller.

Thanks for sharing @Boris_Grozev. Am I right to conclude that Georgeā€™s code is intended to cover an edge case that occurs during the start-up phase of things (and wouldnā€™t be needed anymore after the PingManager got underway)?

No, I think the failure happened after start-up.

We have an app running two XMPPConnections with Smack 4.4.4 and Georgeā€™s code which triggers a re-connect if it detects that no stanza was received in the last 60 seconds. We see this code trigger multiple times (about 8 times in about 20 hours) while PingManager never called the failed callback. The XMPPConnection then re-connects successfully and we see the same thing at a later time. Most of the time it triggers for both XMPPConnections at roughly the same time (within a few ms).

Iā€™m trying to understand why this happens, because the re-connection opens a whole other can of worms in our case. In heap dumps I see that the two PingManagers schedule tasks on the SmackReactor as expected. Any clues would be appreciated.

While reviwing the code I noticed something, which is most likely unrelated, but it may cause weird behavior in some cases. The SmackFuture onSuccess and onError methods call both callbacks if the future has been completed. This means that if a race occurs and the future completes, the callbacks could be called twice in code like this:

Or in a test:

    @Test
    public void testSmackFuture()
        throws Exception
    {
        final CompletableFuture future = new CompletableFuture();
        future.complete();
        // This prints "Success"
        future.onSuccess((x) -> {
            System.err.println("Success");
        });
        // This prints "Success" again
        future.onError((x) -> {
            System.err.println("Error");
        });
    }

    static class CompletableFuture extends SmackFuture<Boolean, Exception>
    {
        public void complete()
        {
            result = true;
        }
    }

Some thoughts that are long shots at best. Iā€™m also on mobile, please excuse the brevity.

Have you found a way to create a heap dump (or maybe a thread dump) while ping-failed should have been called, but has not? It might be worth checking if a thread is stuck somewhere, delaying the execution (maybe indefinitely)?

Is there any chance that the ping-success is actually correct, and that the other connection-lost detection is firing unneeded? When you expect the ping-failed callback to have been called, was the success handler invoked instead?

In Java projects, I have previously experienced confusing behaviour like this when an uncaught Throwable occurs in an executor pool, for example when Exceptions are caught/handled, but Errors are not (catch Exception vs catch Throwable). I have no indication that this is at play here, but the unexpected behaviour reminds me of the wild goose chases that I had before.

Thanks for the suggestions, Guus!

I just got a copy of Georgeā€™s heap dump and I think itā€™s a deadlock. The dump was taken on January 11. The two SmackReactor threads are waiting on the selector: one is doing a select() on line 257 and the other is waiting on the lock on line 214. What is strange is that scheduledActions contains 4 entries that were supposed to be executed on January 3. A second dump from a day later still contains the exact same entries (and the threads are in the same state). This makes me think the thread is running select(0) and will block indefinitely.

I think this is possible if a thread executed schedule() after the first thread finished scheduledActions.peek() (line 215). Presumably any new call to schedule would have woken up the reactor. However, to explain the 4 entries in scheduledActions 4 threads would have to have won the race (together), which seems very unlikely.

My plan right now is to just improve the way we handle this case in our application. I do think thereā€™s something going wrong with SmackReactor, but I donā€™t know if itā€™s the same as the one the OP wrote about (apologies for hijacking the thread if thatā€™s the case). If you investigate further Iā€™m happy to assist.

Boris

1 Like

Iā€™ve raised a new Jira ticket for this: [SMACK-923] Deadlock PingManager - Ignite Realtime Jira

Hi Boris, :slight_smile:

If a thread is within the critical section starting in L214 but already peeked at the scheduledActions, then the selector.wakeup() in schedule() would either wakeup the thread from blocking in select(), or just make the select() non-blocking, in case the thread did not yet invoke it. In this case, the thread would cycle again in the reactor and handle the scheduledActions before potentially blocking again in select().

Hence, I am not sure which race you are describing. So letā€™s try to fit together the puzzles in our head: could you re-elaborate?

And, of course, this is concurrent programming territory, and I certainly do not want to rule out bugs.

Good observation. The success callback may be invoked twice, but not both, the error callback and the success callback. It see how this would be unexpected.

What do you think of

You are correct, I misunderstood how the selector works.

Ok than. :slight_smile: I wish I could say ā€œgoodā€, but if you have heap dumps with long-time expired scheduled actions, then it still means there is a bug lurking somewhere.

4 posts were split to a new topic: Smack minimum ping interval

Indeed. I can somewhat reliably get it into this broken state (I donā€™t know if this is ā€œgoodā€ or not).

I run 15 XMPPTCPConnections, each to a different server. All of them have pings enabled at 30 second interval, all of them join a MUC (a separate MUC for each) and then thereā€™s no traffic other than pings. I start a video recording session on XMPP connection ā€œAā€ (this leads to a few IQs being exchanged), and then cut its network connection. For each XMPP connection I have a timer which logs if getLastStanzaReceived() is more than 30 seconds ago.

The expected behavior is that in 2-3 minutes the pingFailed callback for connection ā€œAā€ should be called. Instead, my prints start appearing for ā€œAā€, and about a minute later they start appearing for all other connections as well and no pingFailed callback is ever called. This continues indefinitely and in the heap dump I see old items in the Reactorā€™s queue (20 minutes old right now).

This happens almost every time. I couldnā€™t reproduce if I use 2 connections instead of 15, or if I skip the video recording session (I tried 3-4 times each).

I enabled all logs and I donā€™t see anything from SmackReactor, or any uncaught exceptions. My next step is to add more logs to SmackReactor. Any suggestions for what else I could look at?

I think I found the problem. If nextScheduledAction.getTimeToDueMillis() returns 0, the reactor thread does a select(0) instead of executing the action:

This should fix it:

diff --git a/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java b/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
index 75e1af8e9..67e68f1fd 100644
--- a/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
+++ b/smack-core/src/main/java/org/jivesoftware/smack/SmackReactor.java
@@ -221,11 +221,10 @@ public class SmackReactor {
                     selectWait = 0;
                 } else {
                     selectWait = nextScheduledAction.getTimeToDueMillis();
-                }
-
-                if (selectWait < 0) {
-                    // A scheduled action was just released and became ready to execute.
-                    return;
+                    if (selectWait <= 0) {
+                        // A scheduled action was just released and became ready to execute.
+                        return;
+                    }
                 }
 
                 // Before we call select, we handle the pending the interest Ops. This will not block since no other

1 Like

I think that <= 0 is definitely correct here (as opposed ot the current < 0). But you would only get an expired scheduled actions that do not get scheduled if you do not schedule further actions and if the select() in the reactor does not return otherwise (e.g. because you do not use a NIO based XMPP connection, which is not unlikely).

I am puzzled how this bug alone leads to multiple expired scheduled actions. But I hadnā€™t had much sleep this week and it should be fixed nevertheless. Maybe it really is the only root cause. Would you do me the honors of submitting a PR against the 4.4 branch? :slight_smile:

Yes, this is exactly what happens in our case. I verified that if I schedule a dummy task while itā€™s stuck it recovers.

When the reactor threads does a select(0) and thereā€™s nothing to wake it up, all of the scheduled actions already in the queue will eventually expire. In our case we have a bunch of ping requests to be sent, from each of the XMPPConnections.

Certainly, I opened #512