Null pointer exception joining a room

pinging @guus and @gdt

OK, looks like the patch is an improvement.

By my reckoning,
a) the IllegalArgumentException in the short term can /probably/ be ignored, because
b) it's /probably/ just an ordering thing. Openfire has received a notification that node XX has left the cluster, so it does two things:

  1. sends a message to everyone in the chat (including yy) that person yy has left, and
  2. removes person yy from the chat.

Probably the easiest correct way to fix this is to skip the routing if the remote node is no longer in the cluster - around ClusterPacketRouter.java line 43.
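Something along these lines is what I have in mind - just a sketch, where stillInCluster() is a hypothetical helper standing in for whatever cluster-membership check the Hazelcast plugin actually offers, and the method shape is only approximate:

```java
// Sketch only - method shape and helper names are approximations, not the real plugin code.
public boolean routePacket(byte[] nodeID, JID recipient, Packet packet) {
    if (!stillInCluster(nodeID)) {
        // The target node has already left the cluster: skip routing instead of
        // letting the lookup further down blow up with an IllegalArgumentException.
        Log.debug("Not routing packet to {}: target node is no longer a cluster member.", recipient);
        return false;
    }
    // ... existing logic that routes the packet to the remote node ...
    return true;
}
```

Whether to silently drop, log, or bounce the packet at that point is a separate question.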

I may be entirely wrong, though.

Thanks for your quick response.

Then the pull request can be merged.

I took a look at the ClusterPacketRouter from the Hazelcast plugin and made a modification https://github.com/igniterealtime/openfire-hazelcast-plugin/pull/55

Unfortunately, it does not completely solve the problem. From first observations, fewer errors are logged.

I have dug further to the bottom of this. The problem is, as I suspected, that for every remote role that was online on the other node, a broadcast cluster task is sent, which is then executed on every remaining node.
I have found a solution that reduces the number of broadcast cluster tasks accordingly and fixes the problem: https://github.com/igniterealtime/Openfire/pull/1749
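For illustration only (the actual change is in the PR linked above; the task and helper names below are invented for the sketch), the general idea is to replace the per-occupant broadcast with a single task per departed node that carries all affected occupants:

```java
// Hypothetical sketch: batch all occupants of the departed node into one broadcast
// cluster task, instead of broadcasting one task per occupant.
List<OccupantInfo> occupantsOnDepartedNode = collectOccupantsFor(departedNodeID); // invented helper
if (!occupantsOnDepartedNode.isEmpty()) {
    // CacheFactory.doClusterTask() broadcasts a task to the other cluster nodes.
    CacheFactory.doClusterTask(new OccupantsRemovedTask(departedNodeID, occupantsOnDepartedNode)); // invented task
}
```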

Thanks for your thorough investigation @chp. For the issue that you're describing, I've raised https://igniterealtime.atlassian.net/browse/OF-2131

As an aside: is the issue that @snikola describes the same issue as @chp's, or are they two different issues?

I cannot confirm that so far, but even if it's not a direct cause of this NPE, it would definitely explain some oddities I've experienced with clustering.

As nodes are joining and leaving the cluster, I also see intermittent errors about packets that cannot be routed. So I'd definitely be very interested in this fix :smiley:. Thanks, @chp!

Still connected to this issue: I upgraded to 4.6.0 and was still able to reproduce the issue, so unfortunately it's not something related to an older version. I'll try to get you debug logs from when it happened.

Thanks for the feedback. Let's for now assume that they're the same issue (so as not to convolute this thread) - but let's make sure to circle back to your original report, in case it turns out to be a different issue after all.

you are welcome

Guys, I don't know what the plan is regarding merging the two pull requests upstream, but I plan on cherry-picking them both and running them in my test environment (probably early next week).

I'll keep you posted - hopefully I'll stop seeing exceptions trying to join a room. :crossed_fingers:


yes, please test and report your feedback!

I've been trying to reproduce this issue using a Docker setup in order to prove the fix. I'm using a simple Hazelcast setup with no load balancer, instead exposing ports from the stack for the 2 nodes.

  • I launch a stack of 2 nodes both running the 4.6.0 release, both pointing at the same DB. When they're ready, one is the senior.
  • I connect a client to the junior node and join a MUC
  • I shut down the junior node
  • I connect a client to the senior node and log in as the same user joining the same MUC with the same resource and nickname. Everything works.

I'm trying to decipher the difference between our two configurations that exposes the issue. Could you share your load balancer config? Maybe adding this to my stack will show the issue?

Presumably, the restart you're doing on the junior node is "clean" and so the senior immediately observes that it is the sole cluster member? When this occurs for me, the first user is immediately removed from the MUC (visible in Admin, or by another member in the MUC).

The senior member then inserts the online occupants into its own maps via OccupantAddedEvent. However, it replaces the NodeID with its own NodeID.

Where is this happening? A trace through the code seems to show that the NodeID is created on the remote node (at the line changed in your PR), serialised, then read by the senior node and honoured throughout. What have I missed?

So far so good. It's been running on our test system since Monday and nothing bad has happened.

And I haven't seen this exception happen in the last few days. So from my perspective, the fix looks good.
Of course, since the exception wasn't happening all the time anyway, I'll keep monitoring.

So, if you also think the PR is good, IMHO it's safe to merge.

@danc thanks for testing it and sorry for my late response.

nginx.conf (1,4 KB)
docker-compose.yml (3,5 KB)
(It is intended for debugging the individual nodes with IntelliJ)

I have a "graceful" shutdown configured for Hazelcast, and I use docker-compose restart openfire_1 to restart the node.

yes

Over the next few days I will try to answer your questions and go into more detail.

Although it's good to hear that the solution has been proven to work in your production environment, I am concerned about how this fix purposely uses the 'wrong' node ID. I worry that this might not be a solution that we want to carry forward - if only because it will eventually confuse others, but also because such a fix suggests that something is wrong in a more basic layer (I'm wondering if this fix hides a larger problem).

@chp provided a very useful write-up of his fix above, but like @danc I'm having trouble understanding the exact details of this bit, particularly the last sentence.

From what I can tell, the pre-existing code does 'the right thing' (and does not replace the NodeID with its own value - which is the opposite of what @chp describes).

It might very well be that I'm simply missing something obvious. The code and eventing are complex, and it's easy to get things wrong. I am very much looking forward to the additional details that @chp announced. :slight_smile:

Sorry for the late reply.

Please correct me if I am wrong:

  1. there are 3 nodes, node 1 is the senior member.
  2. on each node there is a client online (A node 1, B node 2, C node 3)
  3. node 1 is restarted
  4. node 2 becomes the senior member
  5. node 1 joins the cluster
  6. node 1 makes a SeniorMemberServicesRequest to node 2
  7. node 2 creates the response and gets the ServiceInfo for each service
  8. in the ServiceInfo constructor, all MUCRooms are fetched and a RoomInfo is created for each.
    In the RoomInfo constructor, an OccupantAddedEvent is created for each occupant.
  9. before this fix, the NodeID of node 2 is used here.

=> for client C it is the wrong NodeID, because it should be the one of node 3.
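In code terms, step 9 boils down to roughly the following (paraphrased, not the literal source):

```java
// Paraphrased sketch of which NodeID ends up in the OccupantAddedEvent that
// node 2 builds for client C while answering the SeniorMemberServicesRequest.
NodeID nodeIDForOccupant(MUCRole occupantRole) {
    // Before the fix, the ID of the node building the response was effectively used,
    // so client C appeared to be connected to node 2:
    //   return XMPPServer.getInstance().getNodeID();
    // With the fix, the ID carried by the occupant's role is used instead,
    // so client C stays attributed to node 3:
    return occupantRole.getNodeID();
}
```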

  1. node 3 is stopped
  2. the leftCluster event in the MultiUserChatManager is called with the NodeID of node 3.
  3. all MUCRoles with the NodeID of node 3 are fetched from the MUCRoom, and mucRoom.leaveRoom is called for each of them.

=> in our case, none are found, because the RemoteMUCRole for client C of node 3 was given the NodeID of node 2. Therefore the role is not removed from the MUC.

=> if client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true and joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why this is null, I can't tell "yet". If the fix is applied, then the MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick returns false. But unfortunately, we found out yesterday that this error can still happen; it's just apparently not as likely anymore.
In our case, each device has a unique resource (in the JID). If a new user logs in, they get a new resource. Each user can have multiple devices. Each user has a fixed nickname in the MUCs that is the same on all devices. I.e. it can happen that a user is online in a MUC several times with the same bare JID and the same nickname, but different full JIDs. All devices automatically reconnect to the server after disconnections and automatically join the MUCs in which they have an affiliation > member.
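For anyone following along, the failing path described above looks roughly like this (heavily paraphrased around the ~619/~646 line references; not the literal source):

```java
// Paraphrased sketch of the join path: the nickname check says the user is already
// in the room, but the full-JID map has no matching entry because the stale
// RemoteMUCRole carries the wrong NodeID and was never cleaned up on leftCluster.
if (alreadyJoinedWithThisNick) {                                   // ~line 619: true
    MUCRole joinRole = occupantsByFullJID.get(user.getAddress());  // ~line 646: null
    // joinRole is then used without a null check, which produces the
    // NullPointerException reported at the start of this thread.
}
```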

Best Regards

chp

I totally see what you are getting at - so much even, that I fail to recall how your change would be 'bad'. This is the black/blue dress all over again :pleading_face:

I've looked at this until my eyes hurt, but I can't find my original objection to this. Let's merge this, test it, and see if something breaks.

@guus thanks for merging :smiley:

Before Christmas I found what I hope is the solution to the problem mentioned above, but unfortunately I could only commit it now:

The problem from my post above: Null pointer exception joining a room - #21 by chp

=> if client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true and joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why this is null, I can't tell "yet". If the fix is applied, then the MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick returns false. But unfortunately, we found out yesterday that this error can still happen; it's just apparently not as likely anymore.

The Bug:

If a cluster node starts, it can happen that a client establishes a connection and joins the MUC before the cluster connection is ready (e.g. because of a plugin that takes a very long time to start).

Once the cluster connection is ready, the senior member queries all the MUCs, including their online users, from this cluster node. The cluster node in turn queries the MUCs, including the online users of the cluster, from the senior node and enters them into its own maps. However, the cluster node receives its own online users back as a RemoteMucRole and overwrites its own LocalMucRole. The consequence is that the user can still send messages, but no longer receives them.

The pull request: OF-2180: Fixes that a local muc role is overridden by a remote muc role after … by cpetzka · Pull Request #1783 · igniterealtime/Openfire · GitHub
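In sketch form, the idea behind the fix is to not let the occupant list received back from the senior node clobber roles that are local to the receiving node (paraphrased; the exact change is in the PR above):

```java
// Paraphrased sketch: when processing an occupant received back from the senior
// node, never replace a role that actually lives on this node with a remote placeholder.
NodeID localNodeID = XMPPServer.getInstance().getNodeID();
if (localNodeID.equals(occupantAddedEvent.getNodeID())) {
    // This occupant is connected to this very node: keep the existing LocalMucRole.
    // Overwriting it with a RemoteMucRole is what makes the user able to send
    // but no longer receive messages, as described above.
    return;
}
// ... otherwise register the occupant as a RemoteMucRole, as before ...
```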

I think it would be better to make the value of the occupantsByFullJID map a list, to avoid such concurrency problems. Methods like getOccupantByFullJID should then return the corresponding "newer" role.
What do you think?
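Purely as an illustration of that idea (simplified signatures, not a drop-in patch):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

import org.jivesoftware.openfire.muc.MUCRole;
import org.xmpp.packet.JID;

public class OccupantsByFullJidSketch {
    // Keep every concurrent role per full JID instead of a single value.
    private final Map<JID, List<MUCRole>> occupantsByFullJID = new ConcurrentHashMap<>();

    public void addOccupant(JID fullJID, MUCRole role) {
        occupantsByFullJID.computeIfAbsent(fullJID, jid -> new CopyOnWriteArrayList<>()).add(role);
    }

    // Return the "newer" role for this full JID, i.e. the one registered last.
    public MUCRole getOccupantByFullJID(JID fullJID) {
        List<MUCRole> roles = occupantsByFullJID.get(fullJID);
        return (roles == null || roles.isEmpty()) ? null : roles.get(roles.size() - 1);
    }
}
```

Whether "registered last" is really the right definition of "newer" (as opposed to, say, comparing timestamps on the roles) would need some thought.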

While doing some trial and error to find edge cases, I noticed that it is possible to connect with two clients (same full JID) on two different cluster nodes and join the MUC before these two nodes have established a cluster connection. Once the cluster connection is established, both stay online, even if the same resource policy is set to kick. Can someone create a ticket for this?

Thanks again for all of your hard work, @chp!

The NullPointerException related problem that you provided a fix for corresponds with issue https://igniterealtime.atlassian.net/browse/OF-2180 that was already in our bugtracker.

I'm thinking that in the long run, we need to re-think the implementation of sharing MUC state over a cluster (to also tackle the other issue that you referred to, with having two identical full JIDs on nodes that become a cluster - a similar problem exists with two different users using the same nickname in the same room). In the short term, your patch likely is a good improvement. I'll pull it in for the upcoming 4.6.1 release.


You are welcome :slight_smile:

The NullPointerException related problem that you provided a fix for corresponds with issue OF-2180 that was already in our bugtracker.

Sorry, I meant the "bug" where you can be logged in with the same full JID on two cluster nodes if the two clients logged in before the cluster connection was established (independent of MUCs).

I'm thinking that in the long run, we need to re-think the implementation of sharing MUC state over a cluster (to also tackle the other issue that you referred to, with having two identical full JIDs on nodes that become a cluster - a similar problem exists with two different users using the same nickname in the same room). In the short term, your patch likely is a good improvement.

I fully agree