The presence object is for sure not null, so apparently, joinRole is. My guess is that occupantsByFullJID doesn’t have that user. Not sure why. A possible fix is to check for null before setting the presence, but maybe that would only hide an underlying problem.
What do you guys think?
Server version: 4.6.0 Alpha (sorry, not an official release)
2020.10.21 14:39:59 ERROR [socket_c2s-thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Internal server error
java.lang.NullPointerException: null
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:674) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:485) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:180) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:364) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:834) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:402) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:243) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:384) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:285) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:203) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
[...]
I have worked on related code as part of the 4.6.0 release. It is worth retesting with the latest release.
Adding a null-check seems, like you say, hiding the problem. Let’s try to see if we can reproduce it with the latest release, and then investigate further.
Ok, took me a while to reproduce this with debug logs enabled, but it happened.
I don’t see anything particularly interesting in these logs (not sure which log statement you were referring to) but here they are (at the end of the comment).
I’m also now relatively sure this exception affects clients in a negative way. They try to join a room, but never get a response regarding it. So if the client software doesn’t have some sort of timeout, the room gets stuck in “joining” state (at least that’s what seems to happen with XMPPFramework).
I’m still running an old 4.6.0 Alpha version (sorry about that) so it’s possible that this is already fixed, but I’ll need time to verify that.
I have the same problem. In older versions I had it too. Now I can provide you with a log created with the master (at least all files in the stack trace are equal to the master).
Steps how I can reproduce it (it will probably be easier):
Create a cluster with 3 nodes and a load balancer (e.g. nginx).
create a chat with several users (the more the better)
join the chat with multiple clients (the more the better 10+)
restart one node
the clients that were on the restarted node must join another node in the chat with the same resource and nickname
It is not perfectly reproducible.
If the bug occurs, the user can no longer join the MUC, even after a long offline time.
During debugging I noticed that the two maps occupantsByFullJID and occupantsByNickname are not synchronous.One of them contained the session of a user as LocalSession, the other not or as RemoteSession.
Best Regards
chp
2020.10.28 12:22:37.207 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:615) - User 'user2@clustertest/blaa' attempts to join room 'testmuc@conference.clustertest' using nickname 'user2'.
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:631) - User 'user2@clustertest/blaa' role and affiliation in room 'testmuc@conference.clustertest are determined to be: participant, member
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:777) - Checking all preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest'.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:806) - All preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest' have been met. User can join the room.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:663) - Skip adding user 'user2@clustertest/blaa' as an occupant of room 'testmuc@conference.clustertest' using nickname 'user2', as it already is. Updating occupancy with its latest presence information.
2020.10.28 12:22:37.210 ERROR [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.NullPointerException: null
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:665) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.processRoomJoinRequest(LocalMUCUser.java:830) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:730) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:266) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
2020.10.28 13:51:27.281 ERROR [socket_c2s-thread-5] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.UnsupportedOperationException: Cannot process Presence packets of remote users:
<presence to="testmuc@conference.clustertest" id="b5CHq-5485" from="user7@clustertest/blaa">
<x xmlns="http://jabber.org/protocol/muc"></x>
<c xmlns="http://jabber.org/protocol/caps" hash="sha-1" node="http://www.igniterealtime.org/projects/smack" ver="NBoGUcJ1ljgBE8BFGCMFmGIhHq4="></c>
</presence>
at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:75) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:63) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
i have found the problem after a long investigation.
If the first cluster node comes online, it becomes a senior member.
If a second cluster node comes online, the senior member asks for the mucs and online occupants in this node.
The senior member then inserts the online occupants via OccupantAddedEvent in his own maps. However, he replaces the NodeID with his NodeID.
If now the member node goes offline the senior member gets a corresponding event. He removes all occupants from the mucs, which come from the leaving node.
The consequence is that the NodeId is always the one of the senior node and so the remote sessions are not removed. If the client connects on the senior node, it cannot join because a remote session already exists.
How would you rate this fixed? Is there anything else to consider?
The pull request is in “draft” state because there are some new problems with the fix.
If a node goes offline, it looks like all other nodes want to send an unavailable presence for the occupants of the node. This fails because the NodeId in the cluster is not known anymore.
Here I will look for how this happens and add a corresponding fix in the branch. Help is welcome.
Best regards
chp
2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
By my reckoning,
a) the IllegalArgumentException in the short term can /probably/ be ignored, because
b) it’s /probably/ just an ordering thing. Openfire has received a notification that node XX has left the cluster, so it does two things
sends a message to everyone in the chat (including yy) that person yy has left, and
removes person yy from the chat.
Probably the easiest correct way to fix this is to skip the routing if the remote node is no longer in the cluster - around ClusterPacketRouter.java line 43.
Unfortunately it does not solve the problem completely.After first observations less errors are logged.
I have gone further to the bottom of this. The problem is, as I suspected, that for every remote role that was online on the other node a broadcast cluster task is sent, which is executed on every remaining node.
I have found a solution to reduce the number of broadcast cluster tasks accordingly and fix the problem. https://github.com/igniterealtime/Openfire/pull/1749
I cannot confirm that so far, but even if it’s not a direct cause of this NPE, it would definitely explain some oddities I’ve experienced with clustering.
As nodes are joining and leaving the cluster, I also see intermittent errors about packets that cannot be routed. So I’d definitely be very interested in this fix . Thanks, @chp!
Still connected to this issue: I upgraded to 4.6.0 and was still able to reproduce the issue so unfortunately it’s not something related to an older version. I’ll try to get you debug logs from when it happened.
Thanks for the feedback. Let’s for now assume that they’re the same issue (as to not convolute this thread) - but lets make sure to circle back to your original reporting, in case it turns out to be a different issue after all.
Guys, I don’t know what’s the plan regarding merging the two pull requests upstream, but I plan on cherry-picking them both and running them in my test environment (probably early next week).
I’ll keep you posted - hopefully I’ll stop seeing exceptions trying to join a room.
I’ve been trying to reproduce this issue using a Docker setup in order to prove the fix. I’m using a simple Hazelcast setup with no load balancer, instead exposing ports from the stack for the 2 nodes.
I launch a stack of 2 nodes both running the 4.6.0 release, both pointing at the same DB. When they’re ready, one is the senior.
I connect a client to the junior node and join a MUC
I shut down the node
I connect a client to the senior node and log in as the same user joining the same MUC with the same resource and nickname. Everything works.
I’m trying to decipher the difference between our two configurations that exposes the issue. Could you share your load balancer config? Maybe adding this to my stack will show the issue?
Presumably, the restart you’re doing on the junior node is “clean” and so the senior immediately observes that it is the sole cluster member? When this occurs for me, the first user is immediately removed from the MUC (visible in Admin, or by another member in the MUC).
The senior member then inserts the online occupants via OccupantAddedEvent in his own maps. However, he replaces the NodeID with his NodeID.
Where is this happening? A trace through the code seems to show that the NodeID is created on the remote node (at the line changed in your PR), serialised, then read by the senior node and honoured throughout. What have I missed?
So far so good. It’s been running on out test system since Monday and nothing bad happened.
And I haven’t seen this exception happen in the last days. So from my perspective, the fix looks good.
Ofc, since the exception wasn’t happening all the time anyway, I’ll keep monitoring.
So, if you also think the PR is good, IMHO, it’s safe to merge.
Although it’s good to hear that the solution has been proven to work in your production environment, I am concerned how this fix purposely uses the ‘wrong’ node ID. I worry that this might not be a solution that we want to carry forward - if only because it will eventually confuse others, but also because such a fix suggests that something is wrong in a more basic layer (I’m wondering if this fix hides a larger problem).
@chp provided a very useful write-up of his fix above, but like @danc I’m having trouble understanding the exact details of this bit, particularly the last sentence.
From what I can tell, the pre-existing code does ‘the right thing’ (and not replace the NodeID with its own value - which is the opposite of what @chp describes).
It might very well be that I’m simply missing something obvious. The code and eventing is complex, and it’s easy to get things wrong. I am very much looking forward to additional details that @chp announced.