Null pointer exception joining a room

Hi,

I have the same problem; I had it in older versions too. Now I can provide you with a log created with the master branch (at least all files in the stack trace match the master).

Steps with which I can reproduce it (there is probably a simpler way):

  1. Create a cluster with 3 nodes and a load balancer (e.g. nginx).
  2. Create a chat with several users (the more the better).
  3. Join the chat with multiple clients (the more the better, 10+).
  4. Restart one node.
  5. The clients that were on the restarted node must rejoin the chat on another node with the same resource and nickname.

It is not perfectly reproducible.

If the bug occurs, the user can no longer join the MUC, even after a long offline time.

During debugging I noticed that the two maps occupantsByFullJID and occupantsByNickname are not in sync. One of them contained the user's session as a LocalSession, while the other either did not contain it at all or contained it as a RemoteSession.
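To make the inconsistency concrete, something like the following check exposes it (just a sketch, not Openfire code; the field types Map<JID, MUCRole> and Map<String, List<MUCRole>> and the getUserAddress()/getNickname() accessors are my assumptions about LocalMUCRoom):

import java.util.List;
import java.util.Map;

import org.jivesoftware.openfire.muc.MUCRole;
import org.xmpp.packet.JID;

// Hypothetical debugging helper: reports occupants that appear in one of the two room maps
// but not in the other. Assumes occupantsByFullJID maps full JID -> role and
// occupantsByNickname maps (lower-cased) nickname -> list of roles.
public class OccupantMapChecker {

    public static void check(Map<JID, MUCRole> occupantsByFullJID,
                             Map<String, List<MUCRole>> occupantsByNickname) {
        // Every role reachable by nickname should also be reachable by its full JID.
        occupantsByNickname.forEach((nickname, roles) -> {
            for (MUCRole role : roles) {
                if (!occupantsByFullJID.containsKey(role.getUserAddress())) {
                    System.out.println("Only in occupantsByNickname: " + nickname + " -> "
                        + role.getUserAddress() + " (" + role.getClass().getSimpleName() + ")");
                }
            }
        });
        // Every role keyed by full JID should also be present under its nickname.
        occupantsByFullJID.forEach((fullJid, role) -> {
            List<MUCRole> roles = occupantsByNickname.get(role.getNickname().toLowerCase());
            if (roles == null || !roles.contains(role)) {
                System.out.println("Only in occupantsByFullJID: " + fullJid
                    + " (" + role.getClass().getSimpleName() + ")");
            }
        });
    }
}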

Best Regards

chp

2020.10.28 12:22:37.207 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:615) - User 'user2@clustertest/blaa' attempts to join room 'testmuc@conference.clustertest' using nickname 'user2'.
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:631) - User 'user2@clustertest/blaa' role and affiliation in room 'testmuc@conference.clustertest are determined to be: participant, member
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:777) - Checking all preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest'.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:806) - All preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest' have been met. User can join the room.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:663) - Skip adding user 'user2@clustertest/blaa' as an occupant of  room 'testmuc@conference.clustertest' using nickname 'user2', as it already is. Updating occupancy with its latest presence information.
2020.10.28 12:22:37.210 ERROR [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.NullPointerException: null
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:665) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.processRoomJoinRequest(LocalMUCUser.java:830) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:730) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:266) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]

When I apply the pull request from https://github.com/igniterealtime/Openfire/pull/1521 the following error occurs and the client still cannot join the chat:

2020.10.28 13:51:27.281 ERROR [socket_c2s-thread-5] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.UnsupportedOperationException: Cannot process Presence packets of remote users: 
<presence to="testmuc@conference.clustertest" id="b5CHq-5485" from="user7@clustertest/blaa">
  <x xmlns="http://jabber.org/protocol/muc"></x>
  <c xmlns="http://jabber.org/protocol/caps" hash="sha-1" node="http://www.igniterealtime.org/projects/smack" ver="NBoGUcJ1ljgBE8BFGCMFmGIhHq4="></c>
</presence>
	at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:75) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:63) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]

Line https://github.com/igniterealtime/Openfire/blob/master/xmppserver/src/main/java/org/jivesoftware/openfire/muc/spi/MultiUserChatServiceImpl.java#L954 returns a RemoteSession


I have found the problem after a long investigation.

When the first cluster node comes online, it becomes the senior member.
When a second cluster node comes online, the senior member asks that node for its MUCs and online occupants.
The senior member then inserts those online occupants into its own maps via OccupantAddedEvent. However, it replaces their NodeID with its own NodeID.

If the member node now goes offline, the senior member receives a corresponding event and removes from the MUCs all occupants that came from the leaving node.
Because the stored NodeID is always that of the senior node, the remote sessions are not removed. If the client then connects to the senior node, it cannot join because a remote session already exists.
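In other words, the cleanup on leftCluster only matches occupants whose role carries the NodeID of the node that left, roughly like this (a simplified sketch, not the actual Openfire code; the map and the getNodeID() accessor reflect my reading of LocalMUCRoom):

import java.util.Map;

import org.jivesoftware.openfire.cluster.NodeID;
import org.jivesoftware.openfire.muc.MUCRole;
import org.xmpp.packet.JID;

// Simplified sketch of the leftCluster cleanup: only roles tagged with the NodeID of the node
// that left are removed. Occupants that were wrongly stored with the senior's NodeID never
// match, so their stale RemoteMUCRole stays in the room.
public class LeftClusterCleanupSketch {

    public static void removeOccupantsOf(NodeID leftNode, Map<JID, MUCRole> occupantsByFullJID) {
        occupantsByFullJID.values().removeIf(role -> leftNode.equals(role.getNodeID()));
    }
}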

I fixed the bug in https://github.com/igniterealtime/Openfire/pull/1748

How would you rate this fix? Is there anything else to consider?
The pull request is in “draft” state because there are some new problems with the fix.

If a node goes offline, it looks like all other nodes try to send an unavailable presence for the occupants of that node. This fails because the NodeID is no longer known in the cluster.

I will look into how this happens and add a corresponding fix to the branch. Help is welcome.

Best regards

chp

2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]

pinging @guus and @gdt

OK, looks like the patch is an improvement.

By my reckoning,
a) the IllegalArgumentException in the short term can /probably/ be ignored, because
b) it’s /probably/ just an ordering thing. Openfire has received a notification that node XX has left the cluster, so it does two things:

  1. sends a message to everyone in the chat (including yy) that person yy has left, and
  2. removes person yy from the chat.

Probably the easiest correct way to fix this is to skip the routing if the remote node is no longer in the cluster - around ClusterPacketRouter.java line 43.
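Something along these lines, i.e. bail out before submitting the cluster task when the target node is gone (only a sketch; I'm assuming ClusterManager.getNodesInfo() and NodeID.getInstance() are usable there, and the real routePacket signature may differ):

import org.jivesoftware.openfire.cluster.ClusterManager;
import org.jivesoftware.openfire.cluster.ClusterNodeInfo;
import org.jivesoftware.openfire.cluster.NodeID;
import org.xmpp.packet.Packet;

// Sketch of a guard for the Hazelcast plugin's packet router: if the target node has already
// left the cluster, drop the packet instead of submitting a cluster task that will fail with
// "Requested node ... not found in cluster".
public class ClusterPacketRouterGuardSketch {

    public static boolean routeIfNodeStillPresent(byte[] nodeID, Packet packet) {
        final NodeID target = NodeID.getInstance(nodeID);
        final boolean nodePresent = ClusterManager.getNodesInfo().stream()
            .map(ClusterNodeInfo::getNodeID)
            .anyMatch(target::equals);
        if (!nodePresent) {
            // The node already left; there is nothing to route to. Treat as "not routed".
            return false;
        }
        // ... existing behaviour: wrap the packet in a cluster task and send it to that node.
        return true;
    }
}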

I may be entirely wrong, though.

Thanks for your quick response.

Then the pull request can be merged.

I took a look at the ClusterPacketRouter from the Hazelcast plugin and made a modification https://github.com/igniterealtime/openfire-hazelcast-plugin/pull/55

Unfortunately it does not solve the problem completely. From first observations, fewer errors are logged.

I have dug deeper into this. The problem is, as I suspected, that for every remote role that was online on the other node, a broadcast cluster task is sent, which is then executed on every remaining node.
I have found a solution that reduces the number of broadcast cluster tasks accordingly and fixes the problem: https://github.com/igniterealtime/Openfire/pull/1749
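The idea, roughly: instead of broadcasting one cluster task per remote occupant of the leaving node, collect them and send a single task. A sketch of the idea only, not the code from the pull request; it assumes the generic ClusterTask<V> interface from org.jivesoftware.util.cache:

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.ArrayList;
import java.util.List;

import org.jivesoftware.util.cache.ClusterTask;

// Hypothetical task that ships all occupants leaving with a node in ONE cluster task,
// instead of one broadcast task per occupant.
public class OccupantsLeftClusterTask implements ClusterTask<Void> {

    private List<String> occupantFullJids = new ArrayList<>();

    public OccupantsLeftClusterTask() {
        // No-arg constructor required for Externalizable deserialization.
    }

    public OccupantsLeftClusterTask(List<String> occupantFullJids) {
        this.occupantFullJids = occupantFullJids;
    }

    @Override
    public void run() {
        for (String fullJid : occupantFullJids) {
            // On each remaining node: remove this occupant from the local room state
            // (e.g. the equivalent of mucRoom.leaveRoom for the stale role).
        }
    }

    @Override
    public Void getResult() {
        return null;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(new ArrayList<>(occupantFullJids));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        occupantFullJids = (List<String>) in.readObject();
    }
}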

Thanks for your thorough investigation @chp. For the issue that you’re describing, I’ve raised https://igniterealtime.atlassian.net/browse/OF-2131

As an aside: is the issue that @snikola describes the same issue as @chp’s, or are they two different issues?

I cannot confirm that so far, but even if it’s not a direct cause of this NPE, it would definitely explain some oddities I’ve experienced with clustering.

As nodes are joining and leaving the cluster, I also see intermittent errors about packets that cannot be routed. So I’d definitely be very interested in this fix :smiley:. Thanks, @chp!

Still connected to this issue: I upgraded to 4.6.0 and was still able to reproduce the issue, so unfortunately it’s not something related to an older version. I’ll try to get you debug logs from when it happened.

Thanks for the feedback. Let’s for now assume that they’re the same issue (so as not to convolute this thread) - but let’s make sure to circle back to your original report, in case it turns out to be a different issue after all.

You are welcome.

Guys, I don’t know what’s the plan regarding merging the two pull requests upstream, but I plan on cherry-picking them both and running them in my test environment (probably early next week).

I’ll keep you posted - hopefully I’ll stop seeing exceptions trying to join a room. :crossed_fingers:


Yes, please test and report your feedback!

I’ve been trying to reproduce this issue using a Docker setup in order to prove the fix. I’m using a simple Hazelcast setup with no load balancer, instead exposing ports from the stack for the 2 nodes.

  • I launch a stack of 2 nodes both running the 4.6.0 release, both pointing at the same DB. When they’re ready, one is the senior.
  • I connect a client to the junior node and join a MUC.
  • I shut down the junior node.
  • I connect a client to the senior node and log in as the same user, joining the same MUC with the same resource and nickname. Everything works.

I’m trying to decipher the difference between our two configurations that exposes the issue. Could you share your load balancer config? Maybe adding this to my stack will show the issue?

Presumably, the restart you’re doing on the junior node is “clean” and so the senior immediately observes that it is the sole cluster member? When this occurs for me, the first user is immediately removed from the MUC (visible in Admin, or by another member in the MUC).

The senior member then inserts those online occupants into its own maps via OccupantAddedEvent. However, it replaces their NodeID with its own NodeID.

Where is this happening? A trace through the code seems to show that the NodeID is created on the remote node (at the line changed in your PR), serialised, then read by the senior node and honoured throughout. What have I missed?

So far so good. It’s been running on our test system since Monday and nothing bad has happened.

And I haven’t seen this exception happen in the last few days, so from my perspective the fix looks good.
Of course, since the exception wasn’t happening all the time anyway, I’ll keep monitoring.

So, if you also think the PR is good, IMHO, it’s safe to merge.

@danc thanks for testing it and sorry for my late response.

nginx.conf (1.4 KB)
docker-compose.yml (3.5 KB)
(It is intended for debugging the individual nodes with IntelliJ)

I have a graceful shutdown configured for Hazelcast and I use docker-compose restart openfire_1 to restart the node.

Yes.

In the next few days I will try to answer your questions and go into more detail.

Although it’s good to hear that the solution has been proven to work in your production environment, I am concerned about how this fix purposely uses the ‘wrong’ node ID. I worry that this might not be a solution that we want to carry forward - if only because it will eventually confuse others, but also because such a fix suggests that something is wrong in a more basic layer (I’m wondering if this fix hides a larger problem).

@chp provided a very useful write-up of his fix above, but like @danc I’m having trouble understanding the exact details of this bit, particularly the last sentence.

From what I can tell, the pre-existing code does ‘the right thing’ (and does not replace the NodeID with its own value - which is the opposite of what @chp describes).

It might very well be that I’m simply missing something obvious. The code and eventing is complex, and it’s easy to get things wrong. I am very much looking forward to additional details that @chp announced. :slight_smile:

Sorry for the late reply.

Please correct me if I am wrong:

  1. There are 3 nodes; node 1 is the senior member.
  2. On each node there is a client online (A on node 1, B on node 2, C on node 3).
  3. Node 1 is restarted.
  4. Node 2 becomes the senior member.
  5. Node 1 joins the cluster.
  6. Node 1 makes a SeniorMemberServicesRequest to node 2.
  7. Node 2 creates the response and gets the ServiceInfo for each service.
  8. In the ServiceInfo constructor, all MUCRooms are fetched and a RoomInfo is created for each.
    In the RoomInfo constructor, an OccupantAddedEvent is created for each occupant.
  9. Before this fix, the NodeID of node 2 is used for each of these OccupantAddedEvents.

=> For client C this is the wrong NodeID, because it should be that of node 3.

  1. Node 3 is stopped.
  2. The leftCluster event in the MultiUserChatManager is called with the NodeID of node 3.
  3. All MUCRoles with the NodeID of node 3 are fetched from the MUCRoom, and mucRoom.leaveRoom is called for them.

=> In our case there are none, because the RemoteMUCRole for client C of node 3 was stored with the NodeID of node 2. Therefore the role is not removed from the MUC.

=> If client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true, while joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why it is null, I can’t tell “yet”. If the fix is applied, the stale MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick then returns false. But unfortunately, we found out yesterday that this error can still happen; it is just apparently not as likely anymore.
In our case, each device has a unique resource (in the JID). If a new user logs in, they get a new resource. Each user can have multiple devices, and each user has a fixed nickname in the MUCs that is the same on all devices. That is, a user can be online in a MUC several times with the same bare JID and the same nickname but different full JIDs. All devices automatically reconnect to the server after disconnections and automatically join the MUCs in which they have an affiliation > member.
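To make the failure path concrete: with a stale role still present under the nickname but no entry under the full JID, the join logic runs into exactly this null. A simplified sketch of the logic around those line numbers, not the actual LocalMUCRoom code:

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.jivesoftware.openfire.muc.MUCRole;
import org.xmpp.packet.JID;
import org.xmpp.packet.Presence;

// Simplified sketch of the join path (NOT the actual LocalMUCRoom code): the nickname map claims
// the user is already joined, but the full-JID map has no matching entry, so joinRole is null and
// the following dereference is the NullPointerException from the first log excerpt.
public class JoinPathSketch {

    public static void updateExistingOccupant(JID userAddress, String nickname, Presence presence,
                                              Map<JID, MUCRole> occupantsByFullJID,
                                              Map<String, List<MUCRole>> occupantsByNickname) {
        boolean alreadyJoinedWithThisNick = occupantsByNickname
            .getOrDefault(nickname.toLowerCase(), Collections.emptyList())
            .stream()
            .anyMatch(role -> role.getUserAddress().asBareJID().equals(userAddress.asBareJID()));

        if (alreadyJoinedWithThisNick) {
            MUCRole joinRole = occupantsByFullJID.get(userAddress);
            // When the two maps are out of sync, joinRole is null here -> NPE on the next line.
            joinRole.setPresence(presence);
        }
    }
}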

Best Regards

chp

I totally see what you are getting at - so much so that I fail to recall how your change would be ‘bad’. This is the black/blue dress all over again :pleading_face:

I’ve looked at this until my eyes hurt, but I can’t find my original objection to this. Let’s merge this, test it, and see if something breaks.

@guus thanks for merging :smiley:

Before Christmas I found what I hope is the solution to the problem mentioned above, but unfortunately I could only commit it now:

The problem from my post above: Null pointer exception joining a room - #21 by chp

=> If client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true, while joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why it is null, I can’t tell “yet”. If the fix is applied, the stale MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick then returns false. But unfortunately, we found out yesterday that this error can still happen; it is just apparently not as likely anymore.

The Bug:

When a cluster node starts, it can happen that a client establishes a connection and joins the MUC before the cluster connection is ready (e.g. because a plugin takes a very long time to start).

Once the cluster connection is ready, the senior member queries all MUCs, including their online users, from this cluster node. The cluster node in turn queries the MUCs, including the cluster's online users, from the senior node and enters them into its own maps. However, the cluster node receives its own online users back as RemoteMUCRoles and overwrites its own LocalMUCRoles. The consequence is that the user can still send messages but no longer receives them.
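What the fix boils down to, roughly (not the exact code of the pull request): when an OccupantAddedEvent comes in, an existing LocalMUCRole for the same full JID must not be replaced by a RemoteMUCRole:

import java.util.Map;

import org.jivesoftware.openfire.muc.MUCRole;
import org.jivesoftware.openfire.muc.spi.LocalMUCRole;
import org.xmpp.packet.JID;

// Rough idea of the guard (not the exact code of PR #1783): never let a RemoteMUCRole coming
// in from the senior node replace a LocalMUCRole that this node already holds for the same
// full JID, otherwise the local user keeps sending but stops receiving messages.
public class OccupantAddedGuardSketch {

    public static void addOccupant(JID fullJid, MUCRole incomingRole, Map<JID, MUCRole> occupantsByFullJID) {
        MUCRole existing = occupantsByFullJID.get(fullJid);
        if (existing instanceof LocalMUCRole) {
            // Keep the local role; the "remote" copy describes this very node's own occupant.
            return;
        }
        occupantsByFullJID.put(fullJid, incomingRole);
    }
}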

The pull request: OF-2180: Fixes that a local muc role is overridden by a remote muc role after … by cpetzka · Pull Request #1783 · igniterealtime/Openfire · GitHub

I think it would be better to make the value of the occupantsByFullJID map a list, to avoid such concurrency problems. Methods like getOccupantByFullJID would then return the corresponding “newer” role, roughly as sketched below.
What do you think?
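A sketch of what I mean (hypothetical code, not a patch):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

import org.jivesoftware.openfire.muc.MUCRole;
import org.xmpp.packet.JID;

// Sketch of the proposal: keep all roles per full JID in insertion order and let
// getOccupantByFullJID return the most recently added one, so a late-arriving stale
// RemoteMUCRole can no longer silently replace the current LocalMUCRole.
public class OccupantsByFullJidSketch {

    private final Map<JID, List<MUCRole>> occupantsByFullJID = new ConcurrentHashMap<>();

    public void addOccupant(MUCRole role) {
        occupantsByFullJID
            .computeIfAbsent(role.getUserAddress(), jid -> new CopyOnWriteArrayList<>())
            .add(role);
    }

    public MUCRole getOccupantByFullJID(JID fullJid) {
        List<MUCRole> roles = occupantsByFullJID.get(fullJid);
        if (roles == null || roles.isEmpty()) {
            return null;
        }
        // Return the "newest" role, i.e. the one added last.
        return roles.get(roles.size() - 1);
    }
}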