I have two small machines (single CPU with one core) running a clustered Openfire 4.2.3. After some help from Greg in a discussion Need helps on finding the caller of OccupantAddedEvent().run in MUC, I discovered some serious timing issue in Hazelcast Plugin (v2.3.0) when handling incoming MUC events from another node. I added some log org.jivesoftware.openfire.muc.cluster.* classes to illustrate the problem (see the log below.)
Apparently Hazelcast Plugin uses a thread-pool to deserialize the MUCRoomTask objects (e.g. RoomAvailableEvent, OccupantAddedEvent…) asynchronously and uses different thread-pool to execute its run(). In my single core/single CPU hardware, the thread scheduler is unpredictable that the deserialization (i.e. readExternal()) of OccupantAddedEvent may finish before the deserialization of RoomAvailableEvent. And it causes the room not found error. Even worse, due to some unprotected section of an unknown critical area in ClusterManager, the ClusterManager has both starting and started states set to true. In the MUCRoomTask.execute() method, when the room was not found, it schedules to retry the tasks later if ClusterManager is in starting state. So, it ends up the tasks are never executed. And the MUC room in the clustered nodes have very different result (e.g. missing an occupant, mismatch role/affiliation…)
Here is the trace log. The lines not starting with “Deserialized” are logged at the beginning of a derived class MUCRoomTask.run() method. The lines with “Room xxx not found. Queue Task:… ClusterMgr: starting=true, started=true” are printed out from MUCRoomTask.execute() method after catching IllegalArgumentException (i.e. room not found)
2018.11.16 19:04:29 INFO [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent - @@@ Deserialized OccupantAddedEvent: user=focus@jitsi-bb.magnet.com/jitsi-w4, role=0, affiliation=0
2018.11.16 19:04:29 INFO [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.OccupantAddedEvent - @@@ Add Occupant user=focus@jitsi-bb.magnet.com/jitsi-w4
2018.11.16 19:04:29 INFO [hz.openfire.generic-operation.thread-1]: org.jivesoftware.openfire.muc.cluster.RoomAvailableEvent - @@@ Deserialized RoomAvailableEvent room=cnn
2018.11.16 19:04:29 INFO [hz.openfire.cached.thread-3]: org.jivesoftware.openfire.muc.cluster.RoomAvailableEvent - @@@ RoomAvailableEvent add room=cnn
2018.11.16 19:04:29 INFO [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.RoomUpdatedEvent - @@@ Deserialized RoomUpdatedEvent
2018.11.16 19:04:29 INFO [hz.openfire.cached.thread-4]: org.jivesoftware.openfire.muc.cluster.RoomUpdatedEvent - @@@ Update Configuration
2018.11.16 19:04:29 INFO [hz.openfire.generic-operation.thread-1]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:29 INFO [hz.openfire.cached.thread-6]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus
2018.11.16 19:04:29 WARN [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found. Queue Task: OccupantAddedEvent, ClusterMgr: starting=true, started=true
2018.11.16 19:04:29 INFO [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:29 INFO [hz.openfire.cached.thread-1]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus
2018.11.16 19:04:29 WARN [hz.openfire.cached.thread-6]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found. Queue Task: UpdatePresence, ClusterMgr: starting=true, started=true
2018.11.16 19:04:30 WARN [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.cluster.MUCRoomTask - @@@ Room cnn not found. Queue Task: BroadcastPresenceRequest, ClusterMgr: starting=true, started=true
2018.11.16 19:04:30 INFO [hz.openfire.generic-operation.thread-0]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Deserialized UpdatePresence nickname=focus
2018.11.16 19:04:30 INFO [hz.openfire.cached.thread-5]: org.jivesoftware.openfire.muc.cluster.UpdatePresence - @@@ Update Presence nickname=focus
Have someone run into this problem? And is there a fix for this problem?