XEP-0198 resume failure reconnect: resending of MUC messages

Georg_Lukas · November 27, 2018, 2:54pm

When a 0198-enabled session has stanzas queued, and the resume attempt fails, XMPPTCPConnection will re-send the unacked stanzas (via previouslyUnackedStanzas) right after the bind.

This has multiple issues:

some of the stanzas no longer make sense (IQ requests and responses in particular have probably timed out already)
outgoing messages will appear to have been sent just now, although they were in fact delayed for a considerable amount of time
outgoing MUC messages will fail (right after the new bind, we are not joined to any MUCs anymore)

It would be great to have an API that either:

informs the client via a Listener that can manipulate/remove the stanzas before they are sent, or
bounces all the queued stanzas back to the client, allowing it to re-queue them manually, e.g. with added <delay> or after joining the respective MUCs.
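To make the second option more concrete, here is a minimal sketch of what a client might do with the bounced stanzas. Everything in it is an assumption for illustration: `QueuedStanza` and `BounceTriage` are hypothetical stand-ins, not Smack classes, and dropping every IQ is just one possible policy. The only fixed detail is the shape of the XEP-0203 `<delay/>` element.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a bounced stanza; not a Smack class. */
final class QueuedStanza {
    enum Kind { IQ, CHAT, GROUPCHAT }

    final Kind kind;
    final String to;         // bare JID of the recipient or MUC room
    final Instant queuedAt;  // when the client originally tried to send it

    QueuedStanza(Kind kind, String to, Instant queuedAt) {
        this.kind = kind;
        this.to = to;
        this.queuedAt = queuedAt;
    }
}

/** Sorts bounced stanzas into the three cases listed in the post. */
final class BounceTriage {
    final List<QueuedStanza> dropped = new ArrayList<>();
    final List<QueuedStanza> heldForMucJoin = new ArrayList<>();
    final List<QueuedStanza> resendWithDelay = new ArrayList<>();

    static BounceTriage sort(List<QueuedStanza> bounced, Set<String> joinedRooms) {
        BounceTriage result = new BounceTriage();
        for (QueuedStanza s : bounced) {
            if (s.kind == QueuedStanza.Kind.IQ) {
                // Assumed policy: IQs have most likely timed out already.
                result.dropped.add(s);
            } else if (s.kind == QueuedStanza.Kind.GROUPCHAT
                    && !joinedRooms.contains(s.to)) {
                // After a fresh bind we are in no MUC; wait for the re-join.
                result.heldForMucJoin.add(s);
            } else {
                result.resendWithDelay.add(s);
            }
        }
        return result;
    }

    /** Renders an XEP-0203 delay element carrying the original send time. */
    static String delayElement(Instant stamp) {
        return "<delay xmlns='urn:xmpp:delay' stamp='" + stamp + "'/>";
    }
}
```

A client would call `BounceTriage.sort(bounced, joinedRooms)` once after the new bind, resend `resendWithDelay` with `delayElement(s.queuedAt)` attached, and flush `heldForMucJoin` as each room join completes.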

Flow · November 27, 2018, 8:40pm

It is certainly desirable to give the user more control over which stanzas get resent after a stream resumption. No doubt about that. I haven’t made up my mind yet about what the API we expose to the user should look like or how it should be designed. I think there may be a few little pitfalls. Looking forward to your concrete proposal, ideally as working code.

Two remarks about the issues you mentioned above:

On stale IQs: that is not really true for a “short” GSM ↔ WLAN switch, is it? And even if you have been offline a little longer, sending the IQ may still be sensible.

On timestamps: would this mean adding a <delay/> to stanzas that were delayed because of a broken and then resumed stream? It certainly can’t hurt to have that. But I see SM mostly as a tool for network connectivity switches, not for devices that are offline for an extended period of time, which is why I don’t see that much value in it. But again, it can’t hurt, and maybe I am missing something?

Georg_Lukas · November 28, 2018, 8:56am

We are talking about the “resume failed” use case, so it’s longer than 5 minutes in typical deployments.

One could indeed argue that, but there is no hard line. How long a transmission delay justifies adding a <delay>? 10 seconds? A minute? 5 minutes?

The logic in yaxim works as follows:

if there is a connection, send the message right away
if we know we are not currently connected+authenticated (even during a short switch 3G/WiFi), store the message with a timestamp for later delivery
send all pending messages after resume/reconnect
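The three rules above can be sketched roughly as follows. This is not yaxim's actual code: `OfflineAwareSender` and `PendingMessage` are hypothetical names, the `wire` list merely stands in for the real connection, and the "(stored ...)" suffix is a placeholder for attaching a proper <delay> element.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** A message held back while offline, with the time it was stored. */
final class PendingMessage {
    final String body;
    final Instant storedAt;

    PendingMessage(String body, Instant storedAt) {
        this.body = body;
        this.storedAt = storedAt;
    }
}

/** Hypothetical sender: direct delivery when online, timestamped queue otherwise. */
final class OfflineAwareSender {
    private final Deque<PendingMessage> pending = new ArrayDeque<>();
    private final List<String> wire = new ArrayList<>(); // stand-in for the socket
    private boolean connected;

    /** Called on connect/resume (true) and on known connection loss (false). */
    void setConnected(boolean nowConnected) {
        connected = nowConnected;
        if (connected) {
            flushPending();
        }
    }

    void send(String body, Instant now) {
        if (connected) {
            wire.add(body);                              // rule 1: send right away
        } else {
            pending.add(new PendingMessage(body, now));  // rule 2: store with timestamp
        }
    }

    /** Rule 3: deliver everything stored, oldest first, keeping the original time. */
    private void flushPending() {
        PendingMessage m;
        while ((m = pending.poll()) != null) {
            wire.add(m.body + " (stored " + m.storedAt + ")");
        }
    }

    List<String> wire() {
        return wire;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Note that this only covers the case where the client knows it is offline. Messages handed to `send` while `connected` is still true go straight to the connection and never receive a stored timestamp.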

So in fact, the issue described above only affects messages that I send while yaxim still considers itself “connected”, i.e. network traffic is being blackholed, and SM resumption then fails.

Regarding the API, I’ll be glad to implement one, but I’m not sure if you’ll like it. It would go like this:

mirror the (add|remove|removeAll)StanzaAcknowledgedListener() functions into (add|remove|removeAll)StanzaDroppedListener()
if at least one StanzaDroppedListener is registered, feed all of previouslyUnackedStanzas into it and drop the list (*)
send all of previouslyUnackedStanzas if no StanzaDroppedListener is available

(*) This would be compatible with the current behavior, but slightly surprising from an API design point of view. Maybe it would be better to have an explicit configuration option to resend queued stanzas after a failed resume?
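The three steps proposed above could look roughly like this. It is a standalone sketch, not a patch against XMPPTCPConnection: `UnackedStanzaDispatcher` is a hypothetical helper, the stanza type is left generic, and the `resend` callback stands in for whatever the connection does today to re-send a stanza.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.function.Consumer;

/** Proposed callback; the name comes from the post, the signature is assumed. */
interface StanzaDroppedListener<S> {
    void stanzaDropped(S stanza);
}

/** Hypothetical helper holding the listener set and the failed-resume logic. */
final class UnackedStanzaDispatcher<S> {
    private final Set<StanzaDroppedListener<S>> listeners = new CopyOnWriteArraySet<>();

    boolean addStanzaDroppedListener(StanzaDroppedListener<S> listener) {
        return listeners.add(listener);
    }

    boolean removeStanzaDroppedListener(StanzaDroppedListener<S> listener) {
        return listeners.remove(listener);
    }

    void removeAllStanzaDroppedListeners() {
        listeners.clear();
    }

    /**
     * To be invoked after a failed resume and the subsequent bind.
     * Empties the given list in both cases.
     */
    void onResumeFailed(List<S> previouslyUnackedStanzas, Consumer<S> resend) {
        List<S> snapshot = new ArrayList<>(previouslyUnackedStanzas);
        previouslyUnackedStanzas.clear();

        if (listeners.isEmpty()) {
            // No listener registered: keep the current behavior and resend.
            for (S stanza : snapshot) {
                resend.accept(stanza);
            }
            return;
        }
        // At least one listener: hand the stanzas over and send nothing.
        for (S stanza : snapshot) {
            for (StanzaDroppedListener<S> listener : listeners) {
                listener.stanzaDropped(stanza);
            }
        }
    }
}
```

With the explicit configuration option from the footnote, the `listeners.isEmpty()` check would presumably be replaced by a boolean setting, so that registering a listener no longer silently changes whether stanzas are resent.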