MUC events in Hazelcast plugin are out of order

I have two small machines (single CPU with one core) running a clustered Openfire 4.2.3. After some help from Greg in a discussion Need helps on finding the caller of OccupantAddedEvent().run in MUC, I discovered some serious timing issue in Hazelcast Plugin (v2.3.0) when handling incoming MUC events from another node. I added some log org.jivesoftware.openfire.muc.cluster.* classes to illustrate the problem (see the log below.)

Apparently Hazelcast Plugin uses a thread-pool to deserialize the MUCRoomTask objects (e.g. RoomAvailableEvent, OccupantAddedEvent…) asynchronously and uses different thread-pool to execute its run(). In my single core/single CPU hardware, the thread scheduler is unpredictable that the deserialization (i.e. readExternal()) of OccupantAddedEvent may finish before the deserialization of RoomAvailableEvent. And it causes the room not found error. Even worse, due to some unprotected section of an unknown critical area in ClusterManager, the ClusterManager has both starting and started states set to true. In the MUCRoomTask.execute() method, when the room was not found, it schedules to retry the tasks later if ClusterManager is in starting state. So, it ends up the tasks are never executed. And the MUC room in the clustered nodes have very different result (e.g. missing an occupant, mismatch role/affiliation…)

Here is the trace log. The lines not starting with “Deserialized” are logged at the beginning of a derived class MUCRoomTask.run() method. The lines with “Room xxx not found. Queue Task:… ClusterMgr: starting=true, started=true” are printed out from MUCRoomTask.execute() method after catching IllegalArgumentException (i.e. room not found)

2018.11.16 19:04:29 INFO  [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent - @@@ Deserialized OccupantAddedEvent: user=focus@jitsi-bb.magnet.com/jitsi-w4, role=0, affiliation=0
2018.11.16 19:04:29 INFO  [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent - @@@ Add Occupant user=focus@jitsi-bb.magnet.com/jitsi-w4
2018.11.16 19:04:29 INFO  [hz.openfire.generic-operation.thread-1]: org.jivesoftware.openfire.muc.cluster.RoomAvailableEvent - @@@ Deserialized RoomAvailableEvent room=cnn
2018.11.16 19:04:29 INFO  [hz.openfire.cached.thread-3]: org.jivesoftware.openfire.muc.cluster.RoomAvailableEvent - @@@ RoomAvailableEvent add room=cnn
2018.11.16 19:04:29 INFO  [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.RoomUpdatedEvent - @@@ Deserialized RoomUpdatedEvent
2018.11.16 19:04:29 INFO  [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.cluster.RoomUpdatedEvent - @@@ Update Configuration
2018.11.16 19:04:29 INFO  [hz.openfire.generic-operation.thread-1]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:29 INFO  [hz.openfire.cached.thread-6]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus
2018.11.16 19:04:29 WARN  [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found.  Queue Task: OccupantAddedEvent, ClusterMgr: starting=true, started=true
2018.11.16 19:04:29 INFO  [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:29 INFO  [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus
2018.11.16 19:04:29 WARN  [hz.openfire.cached.thread-6]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found.  Queue Task: UpdatePresence, ClusterMgr: starting=true, started=true
2018.11.16 19:04:30 WARN  [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found.  Queue Task: BroadcastPresenceRequest, ClusterMgr: starting=true, started=true
2018.11.16 19:04:30 INFO  [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:30 INFO  [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus

Have someone run into this problem? And is there a fix for this problem?

Summoning @gdt :slight_smile:

I don’t think anyone has seen this problem before - and as such there is no fix, as yet - but thanks for the analysis, that’s often half the battle!

I think the key to fixing it is probably instead of calling

CacheFactory.doClusterTask(new RoomAvailableEvent(room));

calling

CacheFactory.doSynchronousClusterTask(new RoomAvailableEvent(room), false);

and calling .get() on the Future that it returns.That way you can be sure the other node has created the room before the OccupantAddedEvent is triggered.

Greg

@gdt Thank you for a possible solution. As you mentioned, it is only half of the battle. My real concern is the out of order execution of the MUC events in a remote node. What if the OccupantAddedEvent and OccupantRemovedEvent were executed in unpredictable order when both events arrived very close together?

I am surprised that it happens to my environment; maybe it is rare to use a single core/single CPU these days.

I can see how the RoomAvailable/OccupantAdded could be handled in the wrong order as they are being called on the remote node(s) in the same thread by the local node.

In practical terms, though, I’m not sure how an OccupantAdded would be handled after an OccupantRemoved - this would require

a) user joins room
b) Local node sends OccupantAdded
(I/O occurs from client)
c) user leaves room
d) Local node sends Occupant Removed
e) Remote node processes OccupantRemoved
f) Remote node processes OccupantAdded

Is this something you are genuinely seeing happen? If so, I’m surprised.

I guess if this is more than a theoretical problem, the fix would be to ensure that the remote OccupantAdded processing completes as part of the local joinRoom() processing, in a similar manner to the one highlighted above.

Greg

@gdt The OccupantAddedEvent/OccupantRemovedEvent situation did not happen yet; it is just a theoretical issue. But I saw RoomAvailableEvent, OccupantAddedEvent, RoomUpdatedEvent, UpdatePresence were in wrong order. In couple times, a member was a moderator/owner in a local node, but was a normal participant in a remote node. And sometimes the room had no participants and the room would not be destroyed.

The asynchronous event processing in the remote node just worries me. As you said, if events are put in queues by room names as they arrived and processed sequentially, it should solve the problem.

The thing is, I don’t see it can happen in practice. The RoomAvailable/OccupantAdded can get in the wrong order, but the local node sends the events a few CPU cycles apart which means the threading in the remote node can get out of order.

Between sending an OccupantAdded and a OccupantRemove there are many, many CPU cycles (and I/O packets) on the local node. For that reason I’d be extremely reluctant to slow down the processing of an user entering a room while it waits for all the remote nodes to process if it’s solving a problem that simply doesn’t occur.

Greg

@gdt I am more concerned with the correctness of the ordering, but I understand your performance concern. After all, you know those codes better than I do.

BTW, with your CacheFactor.doSynchronousClusterTask() approach, if there are many rooms creating from one node, will it create a bottleneck? Just FYI. In my experimental fix, the remote node creates multiple named queues (by room name) to enqueue subsequent events when the room is not available yet and executes the queued tasks when the room is available. The servers were bombarbed with 2000 tries, and it seemed working.

It shouldn’t do, no. The code isn’t single threaded, so whilst the creation of any single room may be slightly slower, this shouldn’t affect the overall performance greatly.