ClassCastException resuming an SM session with Hazelcast clustering

A ClassCastException is raised when resuming a stream management session if the user’s detached session is on another cluster node (I suppose).
The offending code is in the stream manager:

at org.jivesoftware.openfire.streammanagement.StreamManager.startResume(StreamManager.java:298) ~[xmppserver-4.6.0.jar:4.6.0] 

The problem is that the subsequent code assumes the returned session is a LocalClientSession. With Hazelcast clustering, however, it is actually a RemoteClientSession, which is also a ClientSession but does not inherit from LocalClientSession.

The rest of the method (after the exception-throwing code) uses some LocalClientSession-specific methods, namely reattach and getStreamManager, so simply using ClientSession instead doesn’t work.
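
To make the failure mode concrete, here’s a minimal, self-contained model of the class relationships involved (the names mirror Openfire’s, but this is only an illustration, not the actual Openfire code):

    // Stand-ins for Openfire's session types (illustration only).
    interface ClientSession { }

    class LocalClientSession implements ClientSession {
        void reattach() { System.out.println("reattached locally"); }
    }

    // What the Hazelcast plugin hands back: a ClientSession,
    // but not a LocalClientSession.
    class RemoteClientSession implements ClientSession { }

    public class CastDemo {
        public static void main(String[] args) {
            // In a cluster, the session lookup can yield a RemoteClientSession:
            ClientSession session = new RemoteClientSession();
            // This mirrors the assumption in StreamManager.startResume:
            LocalClientSession local = (LocalClientSession) session; // throws ClassCastException
            local.reattach();
        }
    }

Running this throws the same kind of ClassCastException as in the stack trace below.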

I’d love to fix that issue, but I don’t know what the right replacement code would be in the RemoteClientSession case. Any ideas?

The impact for clients is that they get disconnected (socket closed by peer) after they connect and attempt to resume.

Just for reference, this is the stacktrace I’m seeing:

2020.11.16 15:05:39 ERROR [socket_c2s-thread-4]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to error while processing message: <resume xmlns="urn:xmpp:sm:3" previd="MnZ4cTRrYmpvcgAydnhxNGtiam9y" h="34"/> 
java.lang.ClassCastException: class org.jivesoftware.openfire.plugin.session.RemoteClientSession cannot be cast to class org.jivesoftware.openfire.session.LocalClientSession (org.jivesoftware.openfire.plugin.session.RemoteClientSession is in unnamed module of loader org.jivesoftware.openfire.container.PluginClassLoader @44c715d8; org.jivesoftware.openfire.session.LocalClientSession is in unnamed module of loader org.jivesoftware.openfire.starter.JiveClassLoader @14bee915) 
at org.jivesoftware.openfire.streammanagement.StreamManager.startResume(StreamManager.java:298) ~[xmppserver-4.6.0.jar:4.6.0] 
at org.jivesoftware.openfire.streammanagement.StreamManager.process(StreamManager.java:157) ~[xmppserver-4.6.0.jar:4.6.0] 
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:206) ~[xmppserver-4.6.0.jar:4.6.0] 
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0.jar:4.6.0] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?] 
at java.lang.Thread.run(Unknown Source) [?:?] 

Hey, thanks for reporting this. I’ve noticed the same behavior.

I’m not sure if it’s doable to “fix” this behavior, as I don’t know if we can/should share all relevant state over the cluster. I’m not against trying this, but my initial goal is more pragmatic:

  • properly catch the exception, and handle it more gracefully (see the sketch after this list)
  • provide a ‘location’ attribute in the enable element, as defined by XEP-0198, which hints the client to reconnect to the same cluster node.
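
For the first bullet, I imagine a guard roughly along these lines; a minimal, self-contained sketch (not the actual change, and the types are stand-ins for Openfire’s):

    interface ClientSession { }

    class LocalClientSession implements ClientSession {
        void reattach() { System.out.println("reattached locally"); }
    }

    class RemoteClientSession implements ClientSession { }

    public class ResumeGuard {
        static void startResume(ClientSession session) {
            if (!(session instanceof LocalClientSession)) {
                // The detached session lives on another cluster node; answer
                // the <resume/> with a graceful failure instead of letting a
                // ClassCastException kill the connection.
                System.out.println("resumption failed gracefully");
                return;
            }
            ((LocalClientSession) session).reattach();
        }

        public static void main(String[] args) {
            startResume(new RemoteClientSession()); // graceful failure, no exception
            startResume(new LocalClientSession());  // normal local resumption
        }
    }

The real change would of course need to send an appropriate failure answer on the wire; the println is just a placeholder.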

I’ve captured the above in https://igniterealtime.atlassian.net/browse/OF-2153

Ah, the ‘location’ bit is already being set by Openfire. I’ve provided a fix for the exception handling in https://github.com/igniterealtime/Openfire/pull/1760
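
For reference, this is roughly what that hint looks like on the wire. A sketch that builds the XEP-0198 enabled element with dom4j (which Openfire uses); the host/port value is made up, and the id reuses the previd from the log above:

    import org.dom4j.DocumentHelper;
    import org.dom4j.Element;

    public class LocationHint {
        public static void main(String[] args) {
            final Element enabled = DocumentHelper.createElement(
                DocumentHelper.createQName("enabled",
                    DocumentHelper.createNamespace("", "urn:xmpp:sm:3")));
            enabled.addAttribute("id", "MnZ4cTRrYmpvcgAydnhxNGtiam9y");
            enabled.addAttribute("resume", "true");
            // Hints the client where to reconnect for resumption:
            enabled.addAttribute("location", "node1.example.org:5222");
            System.out.println(enabled.asXML());
            // <enabled xmlns="urn:xmpp:sm:3" id="..." resume="true"
            //          location="node1.example.org:5222"/>
        }
    }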

Wow, Guus, thanks for creating a fix for this so fast!

I’m relatively familiar with the resumption location in SM, and I realise that, as far as XMPP is concerned, it should be enough to make compliant clients understand what’s happening and connect to the other node.

I have two additional notes on that. First, in Openfire the resume location is a configurable property, so if it’s disabled, clients receive an “unexpected request” error with no indication of how to recover from it.

Second (and this, I admit, is my current problem), in many deployments clients might not have direct access to the Openfire instances, for example when there’s a load balancer in front of them.

One possible workaround, assuming that resuming a session from another node is not practical, is to instruct the node holding the session to terminate it. As far as I understand, that session would be in a detached state, and ending that detached state would write all pending message stanzas to offline storage. Granted, iq and presence stanzas would potentially be lost, and the session would not be resumed, but the client would be able to connect normally to the new node without losing any message stanzas. What do you think about that? Could it maybe be a configurable property?
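
To make that concrete, a purely hypothetical sketch of the idea (none of these names are real Openfire or Hazelcast-plugin APIs; the actual plugin has its own cluster-task machinery):

    import java.io.Serializable;

    // Hypothetical: the node that failed to resume sends this task to the
    // node owning the detached session.
    public class TerminateDetachedSessionTask implements Runnable, Serializable {
        private final String previd; // SM stream id from the failed <resume/>

        public TerminateDetachedSessionTask(String previd) {
            this.previd = previd;
        }

        @Override
        public void run() {
            // Executed on the owning node: look up the detached session by its
            // stream-management id and close it, which (per the proposal) would
            // flush pending message stanzas to offline storage. iq and presence
            // stanzas may be lost, and the session is not resumed.
            System.out.println("closing detached session for previd=" + previd);
        }

        public static void main(String[] args) {
            // In reality this would be submitted to the owning cluster node.
            new TerminateDetachedSessionTask("MnZ4cTRrYmpvcgAydnhxNGtiam9y").run();
        }
    }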

Anyway, thanks so much for your support so far!

There is, at the very least, merit in what you’re writing. Although I’m not sure load balancers should have a place in an Openfire environment (as opposed to depending on DNS SRV), it’s simply a fact that many deployments use them. As such, finding a solution that works in those scenarios is desirable.

I’m not sure if the SM XEP defines a standard way of returning an error that instructs the client to retry resumption on another server. If so, we should use that. If not, we might want to suggest such an addition.

What would be the trigger for instructing a node to kill a session that’s in a detached state? A (failed) attempt at resumption on another node? That would make the redirection described above impossible. Starting a new session with the same resource? Openfire has logic to handle such conflicts outside the context of SM; maybe that can be adapted.

I feel that the ‘proper’ way of doing this is to somehow modify Openfire to allow resumption on other cluster nodes. I’m not sure if that’s feasible though.

Something to keep in mind: in addition to load-balancing and fail-over, a common use-case for a load balancer is as a reverse proxy, so that a server doesn’t have to be exposed directly to the internet.
