Null pointer exception joining a room

Sorry for the late reply.

Please correct me if I am wrong:

  1. there are 3 nodes; node 1 is the senior member.
  2. on each node one client is online (A on node 1, B on node 2, C on node 3)
  3. node 1 is restarted
  4. node 2 becomes the senior member
  5. node 1 joins the cluster
  6. node 1 makes a SeniorMemberServicesRequest to node 2
  7. node 2 creates the response and gets the ServiceInfo for each service
  8. in the ServiceInfo constructor, all MUCRooms are fetched and a RoomInfo is created for each.
    In the RoomInfo constructor, an OccupantAddedEvent is created for each occupant.
  9. before this fix, the NodeID of node 2 is used (see the sketch below).

=> for client C this is the wrong NodeID; it should be the NodeID of node 3.
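
To make steps 7-9 concrete, here is a simplified sketch of the difference (the names are illustrative stand-ins, not the real Openfire classes):

```java
import java.util.Arrays;

// Simplified stand-ins for the clustering classes involved in steps 7-9;
// names and shapes are illustrative, not the real Openfire API.
public class OccupantEventSketch {

    record Occupant(String fullJid, byte[] actualNodeId) {}

    record OccupantAddedEvent(String occupantFullJid, byte[] nodeId) {}

    // Before the fix: the event captures the NodeID of the node building
    // the response (node 2), which is wrong for occupants that live on
    // other nodes (client C on node 3).
    static OccupantAddedEvent buggyEvent(Occupant occupant, byte[] respondingNodeId) {
        return new OccupantAddedEvent(occupant.fullJid(), respondingNodeId);
    }

    // After the fix: the event carries the NodeID of the node the
    // occupant is actually connected to.
    static OccupantAddedEvent fixedEvent(Occupant occupant) {
        return new OccupantAddedEvent(occupant.fullJid(), occupant.actualNodeId());
    }

    public static void main(String[] args) {
        byte[] node2 = {2};
        byte[] node3 = {3};
        Occupant clientC = new Occupant("c@example.org/device1", node3);
        System.out.println(Arrays.equals(buggyEvent(clientC, node2).nodeId(), node3)); // false
        System.out.println(Arrays.equals(fixedEvent(clientC).nodeId(), node3));        // true
    }
}
```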

  1. node 3 is stopped
  2. the leftCluster event handler in the MultiUserChatManager is called with the NodeID of node 3.
  3. all MUCRoles with the NodeID of node 3 are fetched from the MUCRoom, and mucRoom.leaveRoom is called for each of them.

=> in our case none are found, because the RemoteMUCRole for client C of node 3 was stored with the NodeID of node 2. Therefore the role is not removed from the MUC.
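
Sketched with simplified types (not the actual MultiUserChatManager code), the cleanup looks roughly like this, which is why the mislabeled role survives:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the leftCluster cleanup; simplified types.
public class LeftClusterSketch {

    record StoredRole(String fullJid, byte[] nodeId) {}

    // Only roles whose *stored* NodeID matches the departed node are
    // removed. A role for client C that was mistakenly stored with the
    // NodeID of node 2 is never matched here, so it lingers in the room.
    static void leftCluster(List<StoredRole> occupants, byte[] departedNodeId) {
        occupants.removeIf(role -> Arrays.equals(role.nodeId(), departedNodeId));
    }

    public static void main(String[] args) {
        byte[] node2 = {2};
        byte[] node3 = {3};
        List<StoredRole> occupants = new ArrayList<>(List.of(
            new StoredRole("c@example.org/device1", node2))); // wrong NodeID stored
        leftCluster(occupants, node3); // node 3 leaves the cluster
        System.out.println(occupants.size()); // 1 -> stale role remains
    }
}
```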

=> if client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true, and joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why it is null, I can’t tell “yet”. With the fix applied, the MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick returns false. Unfortunately, we found out yesterday that this error can still happen; it just seems to be less likely now.
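
As far as I understand the join logic, the two checks interact roughly like this (simplified stand-ins, not the actual LocalMUCRoom code):

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the two checks around lines 619 and 646; the maps and
// types are simplified stand-ins, not the real LocalMUCRoom fields.
public class JoinSketch {

    final Map<String, String> occupantsByNickname = new HashMap<>(); // nickname -> full JID
    final Map<String, String> occupantsByFullJID  = new HashMap<>(); // full JID -> role

    String join(String nickname, String fullJid) {
        // ~line 619: a stale role still holds this nickname -> true
        if (occupantsByNickname.containsKey(nickname)) {
            // ~line 646: the stale role was stored under a different full
            // JID, so this lookup returns null - and dereferencing the
            // result later is the NullPointerException from the title.
            return occupantsByFullJID.get(fullJid);
        }
        return "fresh role"; // normal join path
    }

    public static void main(String[] args) {
        JoinSketch room = new JoinSketch();
        room.occupantsByNickname.put("nick", "c@example.org/oldDevice"); // stale entry
        System.out.println(room.join("nick", "c@example.org/newDevice")); // null
    }
}
```
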
In our case, each device has a unique resource (in the JID), and each new login gets a new resource. Each user can have multiple devices, and each user has a fixed nickname in the MUCs that is the same on all devices. That means a user can be online in a MUC several times with the same bare JID and the same nickname, but different full JIDs. All devices automatically reconnect to the server after disconnections and automatically join the MUCs in which they have an affiliation of at least member.

Best Regards

chp

I totally see what you are getting at - so much even, that I fail to recall how your change would be ‘bad’. This is the black/blue dress all over again :pleading_face:

I’ve looked at this until my eyes hurt, but I can’t find my original objection to this. Let’s merge this, test it, and see if something breaks.

@guus thanks for merging :smiley:

Before Christmas I found what I hope is the solution for the mentioned problem, but unfortunately I could only commit it now:

The problem from my post above: Null pointer exception joining a room - #21 by chp

=> if client C now joins the MUC on node 1, alreadyJoinedWithThisNick (approximately line 619) is true, and joinRole = occupantsByFullJID.get(user.getAddress()); (approximately line 646) is null. Why it is null, I can’t tell “yet”. With the fix applied, the MUCRole is removed and the user can join without problems, because alreadyJoinedWithThisNick returns false. Unfortunately, we found out yesterday that this error can still happen; it just seems to be less likely now.

The Bug:

If a cluster node starts, it can happen that a client establishes a connection and joins the MUC before the cluster connection is ready (e.g. because of a plugin that takes a very long time to start).

Once the cluster connection is ready, the senior member queries all the MUCs, including their online users, from this cluster node. The cluster node in turn queries the MUCs of the cluster, including their online users, from the senior node and enters them into its own maps. However, the cluster node receives its own online users back as RemoteMucRoles, which overwrite its own LocalMucRoles. The consequence is that the user can still send messages, but no longer receives them.

The pull request: OF-2180: Fixes that a local muc role is overridden by a remote muc role after … by cpetzka · Pull Request #1783 · igniterealtime/Openfire · GitHub
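
The idea behind the fix, sketched with simplified types (the real change is in the pull request above):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the idea behind the fix; MUCRoleLike and addRemoteOccupant
// are illustrative stand-ins, not the actual diff.
public class OverrideGuardSketch {

    interface MUCRoleLike { boolean isLocal(); }

    final ConcurrentMap<String, MUCRoleLike> occupantsByFullJID = new ConcurrentHashMap<>();

    // When occupant state arrives from the senior node, never let a
    // remote role replace a local role: the user is demonstrably
    // connected to *this* node, so the local role must win.
    void addRemoteOccupant(String fullJid, MUCRoleLike remoteRole) {
        occupantsByFullJID.merge(fullJid, remoteRole,
            (existing, incoming) -> existing.isLocal() ? existing : incoming);
    }
}
```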

I think it would be better to make the value of the occupantsByFullJID map a list, to avoid such concurrency problems. Methods like getOccupantByFullJID should then return the corresponding “newest” role.
What do you think?
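
Roughly what I mean, as an illustrative sketch only (TimedRole and the timestamp-based ordering are just assumptions for the example):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the suggestion: keep every role per full JID in a list and
// let the getter pick the newest one. Not a patch, just the shape.
public class MultiRoleSketch {

    record TimedRole(long joinedAtMillis, String description) {}

    final Map<String, List<TimedRole>> occupantsByFullJID = new ConcurrentHashMap<>();

    void addRole(String fullJid, TimedRole role) {
        occupantsByFullJID
            .computeIfAbsent(fullJid, jid -> new CopyOnWriteArrayList<>())
            .add(role);
    }

    // Concurrent additions no longer overwrite each other; callers that
    // need a single role deterministically get the most recent join.
    TimedRole getOccupantByFullJID(String fullJid) {
        return occupantsByFullJID.getOrDefault(fullJid, List.of()).stream()
            .max(Comparator.comparingLong(TimedRole::joinedAtMillis))
            .orElse(null);
    }
}
```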

When I did some trial and error to find edge cases, I noticed that it is possible to connect with two clients (same full JID) on two different cluster nodes and join the MUC before the nodes have established a cluster connection. Once the cluster connection is established, both stay online, even if the same-resource conflict policy is set to kick. Can someone create a ticket for this?

Thanks again for all of your hard work, @chp!

The NullPointerException related problem that you provided a fix for corresponds with issue https://igniterealtime.atlassian.net/browse/OF-2180 that was already in our bugtracker.

I’m thinking that in the long run, we need to re-think the implementation of sharing MUC state over a cluster (to also tackle the other issue that you referred to, with having two identical full JIDs on nodes that become a cluster - a similar problem exists with two different users using the same nickname in the same room). In the short term, your patch is likely a good improvement. I’ll pull it in for the upcoming 4.6.1 release.

Thanks again for all of your hard work, @chp!

You are welcome :slight_smile:

The NullPointerException related problem that you provided a fix for corresponds with issue [OF-2180] - Ignite Realtime Jira that was already in our bugtracker.

Sorry, I meant the “bug” that you can be logged in with the same full JID on two cluster nodes if the two clients logged in before the cluster connection was established (independent of MUCs).

I’m thinking that in the long run, we need to re-think the implementation of sharing MUC state over a cluster (to also tackle the other issue that you referred to, with having two identical full JIDs on nodes that become a cluster - a similar problem exists with two different users using the same nickname in the same room). In the short term, your patch is likely a good improvement.

I fully agree