Null pointer exception joining a room

I sometimes encounter a null pointer when joining a MUC room. The stack trace is at the bottom.
The problem is in these two lines of code:

joinRole = occupantsByFullJID.get(user.getAddress());
joinRole.setPresence( presence ); // OF-1581: Use latest presence information.

The presence object is for sure not null, so apparently, joinRole is. My guess is that occupantsByFullJID doesn’t have that user. Not sure why. A possible fix is to check for null before setting the presence, but maybe that would only hide an underlying problem.

What do you guys think?

Server version: 4.6.0 Alpha (sorry, not an official release)

2020.10.21 14:39:59 ERROR [socket_c2s-thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Internal server error 
java.lang.NullPointerException: null 
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:674) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:485) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:180) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:364) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:834) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:402) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:243) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:384) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:285) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:203) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
[...]

I have worked on related code as part of the 4.6.0 release. It is worth retesting with the latest release.

Adding a null-check seems, like you say, hiding the problem. Let’s try to see if we can reproduce it with the latest release, and then investigate further.

There’s a debug satement in the code just before this NPE, could you enable debugging and then see what gets printed while reproducing the NPE?

Ok, took me a while to reproduce this with debug logs enabled, but it happened.
I don’t see anything particularly interesting in these logs (not sure which log statement you were referring to) but here they are (at the end of the comment).

I’m also now relatively sure this exception affects clients in a negative way. They try to join a room, but never get a response regarding it. So if the client software doesn’t have some sort of timeout, the room gets stuck in “joining” state (at least that’s what seems to happen with XMPPFramework).

I’m still running an old 4.6.0 Alpha version (sorry about that) so it’s possible that this is already fixed, but I’ll need time to verify that.


2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=140 cap=4096: 3C 69 71 20 74 79 70 65 3D 22 72 65 73 75 6C 74...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.codec.ProtocolCodecFilter - Processing a MESSAGE_RECEIVED for session 1075586 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=607 cap=4096: 3C 69 71 20 74 79 70 65 3D 22 72 65 73 75 6C 74...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=3583 cap=4096: 3C 69 71 20 74 79 70 65 3D 22 72 65 73 75 6C 74...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=96 cap=128: 3C 65 6E 61 62 6C 65 64 20 78 6D 6C 6E 73 3D 22...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.codec.ProtocolCodecFilter - Processing a MESSAGE_RECEIVED for session 1075586 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=326 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=2152 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=2138 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=444 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=326 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: class org.jivesoftware.openfire.streammanagement.StreamManager[10.0.2.56] - Requesting acknowledgement from peer, as we have 5 or more unacknowledged stanzas. 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=27 cap=32: 3C 72 20 78 6D 6C 6E 73 3D 27 75 72 6E 3A 78 6D...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=448 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=326 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=2144 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=452 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=210 cap=1024: 17 03 03 00 CD 2C A3 FB 7A E8 4A C5 BF DF 51 8E...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.codec.ProtocolCodecFilter - Processing a MESSAGE_RECEIVED for session 1075586 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=512 cap=512: 17 03 03 00 F0 2C A3 FB 7A E8 4A C5 C0 46 5C F5...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=282 cap=1024: 97 28 25 F2 9C CF EF D5 FE B1 CD C9 35 5A 70 8D...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=329 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: class org.jivesoftware.openfire.streammanagement.StreamManager[10.0.2.56] - Requesting acknowledgement from peer, as we have 5 or more unacknowledged stanzas. 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=27 cap=32: 3C 72 20 78 6D 6C 6E 73 3D 27 75 72 6E 3A 78 6D...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=447 cap=4096: 3C 6D 65 73 73 61 67 65 20 66 72 6F 6D 3D 22 64...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=329 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=193 cap=4096: 3C 6D 65 73 73 61 67 65 20 66 72 6F 6D 3D 22 33...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=326 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=2128 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: class org.jivesoftware.openfire.streammanagement.StreamManager[10.0.2.56] - Requesting acknowledgement from peer, as we have 5 or more unacknowledged stanzas. 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=27 cap=32: 3C 72 20 78 6D 6C 6E 73 3D 27 75 72 6E 3A 78 6D...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=450 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.codec.ProtocolCodecFilter - Processing a MESSAGE_RECEIVED for session 1075586 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory - Executing asynchronous MultiTask: org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=790 cap=1024: 17 03 03 00 B3 2C A3 FB 7A E8 4A C5 C4 02 FF 38...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=326 cap=4096: 3C 70 72 65 73 65 6E 63 65 20 74 6F 3D 22 2B 34...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=2145 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [socket_c2s-thread-4]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Writing Message : WriteRequest: HeapBuffer[pos=0 lim=460 cap=4096: 3C 6D 65 73 73 61 67 65 20 74 79 70 65 3D 22 67...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=244 cap=1024: 17 03 03 00 EF 2C A3 FB 7A E8 4A C5 C8 B1 90 65...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Message received : HeapBuffer[pos=0 lim=305 cap=512: 17 03 03 00 EF 2C A3 FB 7A E8 4A C5 C9 04 53 53...] 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslHandler - Session Server[1075586](SSL) Processing the received message 
2020.10.27 08:46:27 DEBUG [NioProcessor-1]: org.apache.mina.filter.ssl.SslFilter - Session Server[1075586](SSL): Processing the SSL Data 
2020.10.27 08:46:27 ERROR [socket_c2s-thread-4]: org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Internal server error 
java.lang.NullPointerException: null 
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:674) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:485) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:180) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:364) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:834) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:402) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:243) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:384) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:285) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:203) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?] 
at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?] 
at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?] 
at java.lang.Thread.run(Unknown Source) [?:?] 

Hi,

I have the same problem. In older versions I had it too. Now I can provide you with a log created with the master (at least all files in the stack trace are equal to the master).

Steps how I can reproduce it (it will probably be easier):

  1. Create a cluster with 3 nodes and a load balancer (e.g. nginx).
  2. create a chat with several users (the more the better)
  3. join the chat with multiple clients (the more the better 10+)
  4. restart one node
  5. the clients that were on the restarted node must join another node in the chat with the same resource and nickname

It is not perfectly reproducible.

If the bug occurs, the user can no longer join the MUC, even after a long offline time.

During debugging I noticed that the two maps occupantsByFullJID and occupantsByNickname are not synchronous.One of them contained the session of a user as LocalSession, the other not or as RemoteSession.

Best Regards

chp

2020.10.28 12:22:37.207 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:615) - User 'user2@clustertest/blaa' attempts to join room 'testmuc@conference.clustertest' using nickname 'user2'.
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:631) - User 'user2@clustertest/blaa' role and affiliation in room 'testmuc@conference.clustertest are determined to be: participant, member
2020.10.28 12:22:37.208 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:777) - Checking all preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest'.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:806) - All preconditions for user 'user2@clustertest/blaa' to join room 'testmuc@conference.clustertest' have been met. User can join the room.
2020.10.28 12:22:37.209 DEBUG [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.LocalMUCRoom:663) - Skip adding user 'user2@clustertest/blaa' as an occupant of  room 'testmuc@conference.clustertest' using nickname 'user2', as it already is. Updating occupancy with its latest presence information.
2020.10.28 12:22:37.210 ERROR [socket_c2s-thread-13] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.NullPointerException: null
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.joinRoom(LocalMUCRoom.java:665) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.processRoomJoinRequest(LocalMUCUser.java:830) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:730) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCUser.process(LocalMUCUser.java:266) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]

When I apply the pull request from https://github.com/igniterealtime/Openfire/pull/1521 the following error occurs and the client still cannot join the chat:

2020.10.28 13:51:27.281 ERROR [socket_c2s-thread-5] (org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl:429) - Internal server error
java.lang.UnsupportedOperationException: Cannot process Presence packets of remote users: 
<presence to="testmuc@conference.clustertest" id="b5CHq-5485" from="user7@clustertest/blaa">
  <x xmlns="http://jabber.org/protocol/muc"></x>
  <c xmlns="http://jabber.org/protocol/caps" hash="sha-1" node="http://www.igniterealtime.org/projects/smack" ver="NBoGUcJ1ljgBE8BFGCMFmGIhHq4="></c>
</presence>
	at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:75) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.RemoteMUCUser.process(RemoteMUCUser.java:63) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl.processPacket(MultiUserChatServiceImpl.java:423) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.component.InternalComponentManager$RoutableComponents.process(InternalComponentManager.java:859) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToComponent(RoutingTableImpl.java:451) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:266) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:398) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:102) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:291) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:209) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:183) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:1015) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:122) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:413) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:257) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.compression.CompressionFilter.messageReceived(CompressionFilter.java:169) [mina-filter-compression-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:650) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:49) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:1128) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.filterchain.IoFilterEvent.fire(IoFilterEvent.java:106) [mina-core-2.1.3.jar:?]
	at org.apache.mina.core.session.IoEvent.run(IoEvent.java:89) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTask(OrderedThreadPoolExecutor.java:766) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.runTasks(OrderedThreadPoolExecutor.java:758) [mina-core-2.1.3.jar:?]
	at org.apache.mina.filter.executor.OrderedThreadPoolExecutor$Worker.run(OrderedThreadPoolExecutor.java:697) [mina-core-2.1.3.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]

Line https://github.com/igniterealtime/Openfire/blob/master/xmppserver/src/main/java/org/jivesoftware/openfire/muc/spi/MultiUserChatServiceImpl.java#L954 returns a RemoteSession

1 Like

i have found the problem after a long investigation.

If the first cluster node comes online, it becomes a senior member.
If a second cluster node comes online, the senior member asks for the mucs and online occupants in this node.
The senior member then inserts the online occupants via OccupantAddedEvent in his own maps. However, he replaces the NodeID with his NodeID.

If now the member node goes offline the senior member gets a corresponding event. He removes all occupants from the mucs, which come from the leaving node.
The consequence is that the NodeId is always the one of the senior node and so the remote sessions are not removed. If the client connects on the senior node, it cannot join because a remote session already exists.

I fixed the bug in https://github.com/igniterealtime/Openfire/pull/1748

How would you rate this fixed? Is there anything else to consider?
The pull request is in “draft” state because there are some new problems with the fix.

If a node goes offline, it looks like all other nodes want to send an unavailable presence for the occupants of the node. This fails because the NodeId in the cluster is not known anymore.

Here I will look for how this happens and add a corresponding fix in the branch. Help is welcome.

Best regards

chp

2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
	at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.2-SNAPSHOT.jar!/:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
	at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
	at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
	at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2 Likes

pinging @guus and @gdt

OK, looks like the patch is an improvement.

By my reckoning,
a) the IllegalArgumentException in the short term can /probably/ be ignored, because
b) it’s /probably/ just an ordering thing. Openfire has received a notification that node XX has left the cluster, so it does two things

  1. sends a message to everyone in the chat (including yy) that person yy has left, and
  2. removes person yy from the chat.

Probably the easiest correct way to fix this is to skip the routing if the remote node is no longer in the cluster - around ClusterPacketRouter.java line 43.

I may be entirely wrong, though.

Thanks for your quick response.

Then the pull request can be merged.

I took a look at the ClusterPacketRouter from the Hazelcast plugin and made a modification https://github.com/igniterealtime/openfire-hazelcast-plugin/pull/55

Unfortunately it does not solve the problem completely.After first observations less errors are logged.

I have gone further to the bottom of this. The problem is, as I suspected, that for every remote role that was online on the other node a broadcast cluster task is sent, which is executed on every remaining node.
I have found a solution to reduce the number of broadcast cluster tasks accordingly and fix the problem. https://github.com/igniterealtime/Openfire/pull/1749

Thanks for your thorough investiation @chp. For the issue that you’re describing, I’ve raised https://igniterealtime.atlassian.net/browse/OF-2131

As an aside: is the issue that @snikola desribes the same issue as @chp, or are they two different issues?

I cannot confirm that so far, but even if it’s not a direct cause of this NPE, it would definitely explain some oddities I’ve experienced with clustering.

As nodes are joining and leaving the cluster, I also see intermittent errors about packets that cannot be routed. So I’d definitely be very interested in this fix :smiley:. Thanks, @chp!

Still connected to this issue: I upgraded to 4.6.0 and was still able to reproduce the issue so unfortunately it’s not something related to an older version. I’ll try to get you debug logs from when it happened.

Thanks for the feedback. Let’s for now assume that they’re the same issue (as to not convolute this thread) - but lets make sure to circle back to your original reporting, in case it turns out to be a different issue after all.

you are welcome

Guys, I don’t know what’s the plan regarding merging the two pull requests upstream, but I plan on cherry-picking them both and running them in my test environment (probably early next week).

I’ll keep you posted - hopefully I’ll stop seeing exceptions trying to join a room. :crossed_fingers:

1 Like

yes, please test and report your feedback!

I’ve been trying to reproduce this issue using a Docker setup in order to prove the fix. I’m using a simple Hazelcast setup with no load balancer, instead exposing ports from the stack for the 2 nodes.

  • I launch a stack of 2 nodes both running the 4.6.0 release, both pointing at the same DB. When they’re ready, one is the senior.
  • I connect a client to the junior node and join a MUC
  • I shut down the node
  • I connect a client to the senior node and log in as the same user joining the same MUC with the same resource and nickname. Everything works.

I’m trying to decipher the difference between our two configurations that exposes the issue. Could you share your load balancer config? Maybe adding this to my stack will show the issue?

Presumably, the restart you’re doing on the junior node is “clean” and so the senior immediately observes that it is the sole cluster member? When this occurs for me, the first user is immediately removed from the MUC (visible in Admin, or by another member in the MUC).

The senior member then inserts the online occupants via OccupantAddedEvent in his own maps. However, he replaces the NodeID with his NodeID.

Where is this happening? A trace through the code seems to show that the NodeID is created on the remote node (at the line changed in your PR), serialised, then read by the senior node and honoured throughout. What have I missed?

So far so good. It’s been running on out test system since Monday and nothing bad happened.

And I haven’t seen this exception happen in the last days. So from my perspective, the fix looks good.
Ofc, since the exception wasn’t happening all the time anyway, I’ll keep monitoring.

So, if you also think the PR is good, IMHO, it’s safe to merge.

@danc thanks for testing it and sorry for my late response.

nginx.conf (1,4 KB)
docker-compose.yml (3,5 KB)
(It is intended for debugging the individual nodes with IntelliJ)

I have a “Gracefull” shutdown configured for hazelcast and i use docker-compose restart openfire_1 to restart the node.

yes

In the next days I will try to answer your questions and go into more detail.

Although it’s good to hear that the solution has been proven to work in your production environment, I am concerned how this fix purposely uses the ‘wrong’ node ID. I worry that this might not be a solution that we want to carry forward - if only because it will eventually confuse others, but also because such a fix suggests that something is wrong in a more basic layer (I’m wondering if this fix hides a larger problem).

@chp provided a very useful write-up of his fix above, but like @danc I’m having trouble understanding the exact details of this bit, particularly the last sentence.

From what I can tell, the pre-existing code does ‘the right thing’ (and not replace the NodeID with its own value - which is the opposite of what @chp describes).

It might very well be that I’m simply missing something obvious. The code and eventing is complex, and it’s easy to get things wrong. I am very much looking forward to additional details that @chp announced. :slight_smile: