Group presence and group messages not getting delivered, losing customers

We need help with a pernicious problem with group messages. We believe we may have encountered it before, and even after upgrading to 4.5.1 and now to 4.5.3 it is still there. We are now on 4.5.3 with Session Management turned on and two servers in a Hazelcast cluster. Upgrading to 4.5.3 has helped a lot (Stream Management bug?), thanks to @guus, but this problem is deeper than Session Management; we believe it goes down to the RoutingTable.

After a few connects and disconnects, the user’s presence is no longer processed and the user is not logged into the groups they are part of. The routing table keeps a ghost session, and the user does not get messages from any of their groups. Changing resources helps for a while, but

For example:
user@example.com/iphone gets “stuck”, and no matter how many sessions are started or how much time passes, the user does not receive messages. If, from the same client and username, we change the resource to
user@example.com/iphoneABC, the user now gets group messages with no problem. If we change the resource back, group messages stop again. This does not happen at all for 1:1 messages; they are not affected and don’t get lost.

We have noticed in the log that sometimes our code gets called, and at other times it does not.
Here is the NON-working version with the /iphone resource:
18:46:54.609 [socket_c2s_ssl-thread-4] DEBUG org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user@example.com/iphone
18:46:54.808 [socket_c2s_ssl-thread-4] DEBUG org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user@example.com/iphone

When we then change the resource to /iphoneSV and log back in, we see this:
18:40:11.859 [socket_c2s_ssl-thread-3] DEBUG org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user@example.com/iphoneSV
18:40:11.861 [socket_c2s_ssl-thread-3] DEBUG org.jivesoftware.util.Log - Inserting New Session values - Node user
18:40:11.861 [socket_c2s_ssl-thread-3] DEBUG org.jivesoftware.util.Log - Inserting Session values - Statement com.mysql.cj.jdbc.ClientPreparedStatement: INSERT INTO userStatus (username, resource, online, lastIpAddress, lastLoginDate, serverName) VALUES ('user', 'iphoneSV', 1, '33.21.4.12', '001598121611362', 'nothing')
18:40:12.040 [socket_c2s_ssl-thread-2] DEBUG org.jivesoftware.openfire.spi.RoutingTableImpl - Adding client route user@example.com/iphoneSV

The two lines between the “RoutingTableImpl” entries are an indication that group messages will work. In the non-working case, however, that part of the code does not get called, almost as if there is a problem in RoutingTableImpl, or in calling our plugin, or somewhere else. It’s as if the RoutingTableImpl keeps a record for the user with the /iphone resource and never restarts the session, which means that group messages do not get sent.
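To make the suspicion above concrete, here is a minimal toy model of it. It is not Openfire's RoutingTableImpl; the class `ToyRoutingTable`, its methods, and the rule "the hook fires only when no entry exists for the full JID" are all assumptions made for illustration. It only shows how a route that is never removed could keep plugin code from running on the next login with the same resource, while a fresh resource still triggers it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Openfire's RoutingTableImpl.
class ToyRoutingTable {
    // Full JID (user@domain/resource) -> opaque session id.
    private final Map<String, String> routes = new HashMap<>();
    // JIDs for which the "session created" hook ran (stand-in for plugin code).
    final List<String> hookCalls = new ArrayList<>();

    // Adds or replaces a route. Assumed rule: the hook runs only
    // when there was no entry for this full JID before the call.
    void addClientRoute(String fullJid, String sessionId) {
        boolean isNew = !routes.containsKey(fullJid);
        routes.put(fullJid, sessionId);
        if (isNew) {
            hookCalls.add(fullJid);
        }
    }

    // Clean disconnect: the route is removed, so the next login counts as new.
    void removeClientRoute(String fullJid) {
        routes.remove(fullJid);
    }

    public static void main(String[] args) {
        ToyRoutingTable table = new ToyRoutingTable();
        table.addClientRoute("user@example.com/iphone", "s1");   // hook runs
        // The connection drops, but the route is never removed (the ghost).
        table.addClientRoute("user@example.com/iphone", "s2");   // hook does not run
        table.addClientRoute("user@example.com/iphoneSV", "s3"); // hook runs
        // prints [user@example.com/iphone, user@example.com/iphoneSV]
        System.out.println(table.hookCalls);
    }
}
```

Whether the real server behaves like this is exactly the open question; the model only shows that the observed log pattern is consistent with a stale entry.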

Any advice? We are considering shutting down session management completely.

Thank you again,
DT.

Curious. I do not have an immediate solution.

Are you able to consistently reproduce this problem?

Can you be sure that running in a cluster is a relevant factor (e.g. does it stop happening when you’re running on just one cluster node)?

Can you be sure that stream management is a relevant factor (e.g. does it stop happening when you disable SM)?

When you say “group messages”, are you referring to MUC functionality?

Note: I think that the duplicate line “Adding client route xyz@example.com” won’t be logged any more in 4.6.0 (OF-2012j / https://github.com/igniterealtime/Openfire/pull/1627).

I don’t quite recognize the log messages that contain “Inserting New Session values - Node user” and “Inserting Session values”. The latter uses a query that resembles one in our UserStatus plugin, but that plugin doesn’t log messages. Is this a custom plugin, or did you modify the plugin? I’m interested in finding out what path in the code causes that log line to be generated.

A lot of changes to Stream Management, Clustering and the Routing table have gone into the code lately. Are you able to reproduce this problem with the latest nightly build of Openfire?

Can you log the XMPP stanzas that are exchanged between server and client, in both scenarios? That might give us valuable insights.

The Routing Table is basically a collection of caches. The content of these caches can be inspected in the admin console. You might want to look at the caches named “Routing Users Cache” and “Routing User Sessions”. It would be interesting to see if the affected users are listed in those caches when the problem occurs.
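One way to do that comparison by hand is sketched below. It assumes you copy the full JIDs shown in the cache view and the full JIDs of the currently connected client sessions into two sets yourself; the class `GhostRouteCheck` and its method are hypothetical helpers and do not call any Openfire API. Entries that are cached but have no live session are candidate ghost routes.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares two manually collected snapshots.
class GhostRouteCheck {
    // Returns the JIDs present in the cache snapshot but absent from the live sessions.
    static Set<String> findGhosts(Set<String> cachedRoutes, Set<String> liveSessions) {
        Set<String> ghosts = new TreeSet<>(cachedRoutes); // sorted copy, inputs stay untouched
        ghosts.removeAll(liveSessions);
        return ghosts;
    }

    public static void main(String[] args) {
        Set<String> cached = Set.of("user@example.com/iphone", "user@example.com/iphoneSV");
        Set<String> live = Set.of("user@example.com/iphoneSV");
        // prints [user@example.com/iphone]
        System.out.println(findGhosts(cached, live));
    }
}
```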

Hunch: does this problem occur when Stream Management is used to ‘resume’ a session on a different cluster node than the one where the session was detached? I do not think we’ve ever tested that.