Pagination of chat room history

I was doing a PoC and came across the REST API plugin endpoint to retrieve the chat history of a room.
But the chat history is not paginated. Is there a paginated API for that? If not, what is the recommended solution? Please suggest…

Hi!

The GET /chatrooms/{roomName}/chathistory endpoint in the REST API plugin does not support pagination. It is also limited to the ‘old style’ chat history, which typically includes only the last few messages in a chat room (although it can be configured to retain more history).
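For reference, a call against that endpoint can be sketched like this. The base path and the shared-secret Authorization header follow the REST API plugin’s usual conventions, but verify them against your installation; the helper name is made up:

```javascript
// Sketch (not an official client): builds the request for the REST API
// plugin's chat history endpoint discussed above. Base path and auth
// header are assumptions; check them against your Openfire setup.
function chatHistoryRequest(baseUrl, roomName, sharedSecret) {
  return {
    method: "GET",
    url: `${baseUrl}/plugins/restapi/v1/chatrooms/${encodeURIComponent(roomName)}/chathistory`,
    headers: {
      Authorization: sharedSecret, // the plugin's shared-secret scheme
      Accept: "application/json",
    },
  };
}

// The whole (non-paginated) history comes back in a single response.
const req = chatHistoryRequest("http://localhost:9090", "myroom", "s3cret");
console.log(req.url);
// → http://localhost:9090/plugins/restapi/v1/chatrooms/myroom/chathistory
```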

What you probably want is a REST API that can query ‘all’ history. A message archive that contains that history is implemented by the Monitoring plugin (the feature is referred to as ‘Message Archive Management’).

Although that plugin has a limited private REST API (for public rooms), that’s not really designed for external use. I doubt that it’s paginated (other than that it returns data per day, perhaps).
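For what pagination looks like at the XMPP level (rather than REST): MAM queries can be paged with Result Set Management (XEP-0059). A sketch of building such a query as a raw stanza string; the namespaces come from XEP-0313 and XEP-0059, but whether the Monitoring plugin’s archive answers such queries for your rooms depends on its configuration:

```javascript
// Sketch: a MAM (XEP-0313) query paged via RSM (XEP-0059).
// Pass `after` (the <last> id from the previous page's RSM result)
// to fetch the next page; omit it for the first page.
function mamPageQuery({ queryId, max, after }) {
  const afterEl = after ? `<after>${after}</after>` : "";
  return (
    `<iq type="set" id="${queryId}">` +
      `<query xmlns="urn:xmpp:mam:2" queryid="${queryId}">` +
        `<set xmlns="http://jabber.org/protocol/rsm">` +
          `<max>${max}</max>` + afterEl +
        `</set>` +
      `</query>` +
    `</iq>`
  );
}

// First page of 20 results; no "after" yet.
console.log(mamPageQuery({ queryId: "page1", max: 20 }));
```

In a real client you would send this over your existing XMPP connection and read the RSM `<last>` element from the response to request the next page.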

It is certainly possible to enhance the functionality of this implementation, but that does require some work. If it comes to that (I appreciate that you’re still in a PoC phase), there are a couple of options:

  1. You can create pull requests with your changes against the source code in the igniterealtime/openfire-monitoring-plugin repository on GitHub.
  2. You can engage with one of our commercial Service Providers and commission this work.

I hope this helps!

Really appreciate the clarity you provided. We’ll be exploring the suggestions.

@guus Hey, I would appreciate it if you could give me some info on the number of rooms we can create, and any limitations on that. Our requirements need us to model DMs as rooms themselves, so I was wondering how many rooms we can hold.

This is not going to be the answer that you’re looking for, but: it depends.

The limitations are very specific to use case scenarios. I’ve seen systems with a six-digit number of rooms work, but I’ve also seen systems with a fraction of that crash and burn.

There are so many factors that come into play that it’s impossible to predict things:

  • the version of Openfire
  • the availability of a cluster (and if clustered, its composition)
  • the number of conference services
  • the number of rooms per conference service
  • the number of occupants per room
  • the overall number of occupants
  • the percentage of rooms that concurrently see ‘activity’

And that doesn’t even take into account the definition of ‘activity’, which, in my experience, differs wildly between implementations. These are just some of the relevant aspects:

  • the number of messages that are shared
  • the number of room joins and leaves
  • the number of (other) presence updates in a room
  • the number of historical messages to keep

This is not an exhaustive list.

Finally, the make/model of client software can also make a huge difference. Some clients (or libraries, or self-built solutions) do unexpected stuff that on the surface is benign, but can add up immensely on a busy server.

As you will understand, given all of these factors, it is next to impossible to say anything with any kind of confidence.

It is relatively easy to use Openfire as a development platform to create a complex system that offers rich functionality to a limited set of users. I’ve seen all too often that such systems grind to a halt when user numbers increase.

Assuming that you will be developing your own IM solution, the better approach is to be very thoughtful, at design and implementation time, about how certain choices will affect the performance of Openfire (and to have repeated performance tests in place).

Not to toot our own horn, but it really does pay to have an experienced developer or consultant on hand who can help you vet your design.

Hi,
We have successfully set up clustering as per the documentation and have begun testing.

In our setup, we are implementing DMs as chat rooms. During testing, we had user1 connected to the local node and user2 on the remote node. When we shut down the remote node, after a short delay, the session for user2 migrated to the local node, and the chat between user1 and user2 resumed successfully.

However, we encountered an issue in the following scenario:

  • user1 and user2 were connected to node 1
  • user3 and user4 were connected to node 2

When we brought down node 2, user1 and user2 (who were still on the healthy node 1) were no longer able to send messages in their room.

So, my question is:
Do rooms maintain any state on the nodes themselves? Could it be that the room state was hosted on node 2, and shutting it down disrupted the room’s functioning temporarily—thereby preventing even users on the healthy node from continuing the conversation?

On the surface, this looks to be a bug.

Server-side state for chat rooms (MUC rooms) is maintained in data structures that span the cluster, and should at all times be available to all cluster nodes. Even if one node leaves the cluster, other nodes should be able to either access MUC state or recreate it from local data.

If you are willing and able, I’m very interested in getting a (very) detailed write-up on how you reproduce this problem, if possible with XMPP data dumps and log file content.

I am going to write down how we set things up and how we simulated the scenario.
With reference to the Hazelcast docs:

  • We deployed two Windows-based VMs on Azure (VM1 and VM2).
  • The Hazelcast plugin was installed on both nodes.
  • We copied the same openfire.xml and hazelcast-config.xml to both instances.
  • On accessing the Openfire admin console, both nodes appear correctly under the clustering tab — one as local and the other as remote.
  • Both nodes share the same XMPP domain name, exposed via a load balancer and also the same external database.

We’re also sharing our Hazelcast configuration and server logs (attached below).

Our UI, which makes the XMPP connection, handles the disconnect scenario when a node goes down; when the connection is re-established, it resumes the messaging service.

What we simulated and observed:
We opened four XMPP connections from the UI and, after several retries, made sure that two of the user sessions (user1, user2) were on node1 and the other two (user3, user4) on node2 (visible in the admin console under Sessions). When we stopped the node2 Openfire server on one of the VMs, the UI sensed the lost connection for user3 and user4, blocked those users, and made sure no messages were sent. (In the meantime, we could see the session switch happening from remote to local when observed from the healthy node’s admin console.) But user1 and user2 did not catch any disconnect events, so by default they were still allowed to send messages; even though the messages were sent, they were lost and the receivers could not receive them. We also observed that, after stopping one node, there is a slight delay in the switching of those users from node2 to node1; it goes remote-online → remote-offline → local-online (healthy node admin console).

We are quite stuck guessing what the behavior of the cluster should be, and would like clarification regarding the expected behavior of an Openfire cluster using the Hazelcast plugin, specifically concerning node failover.
Our assumption is one of the following:

  • During node2’s downtime, node1 would handle things without causing any disconnection trouble; even user1 on node1 should be able to send a message to user3, whose node is down. The UI should not see any disruption from the switch and should not have to handle disconnect events.

OR

  • All users should momentarily get a connection-lost event until the switch happens and everything is handled (i.e. even node1 users should get connection lost), and the UI will handle it momentarily until the connection is back up.

hazlecast-config.txt (1.8 KB)
openfire-log.txt (84.3 KB)

Please review whether the setup is done correctly, or whether anything needs to be added. Please also clarify the expected behaviour and how things should work.

Thank you for the detailed write-up. What is not exactly clear to me is how chat rooms are utilized in this scenario, but I’m not sure if that’s important.

As you seem to have already suspected, and perhaps contrary to your initial expectations, the failover for user connections from an unhealthy to a healthy node is not without interruptions. Under the hood, the TCP connection between the client and the server gets disconnected. It is up to the client to establish a new connection to one of the healthy cluster nodes. Between disconnection and reconnection, the client is considered ‘offline’ (if only briefly).
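The client’s side of that failover is typically a reconnect loop. A minimal sketch of the timing logic only, with capped exponential backoff so clients reattach to a healthy node quickly without stampeding the server (the numbers are illustrative, not tuned; wiring this to a library’s disconnect event is left out):

```javascript
// Sketch: delays (in ms) a client could wait between reconnect
// attempts against the load balancer after its node goes down.
// Doubles each attempt, capped so retries keep coming.
function backoffDelaysMs(attempts, baseMs = 500, capMs = 8000) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, capMs));
  }
  return delays;
}

console.log(backoffDelaysMs(6)); // → [500, 1000, 2000, 4000, 8000, 8000]
```

In a client built on something like @xmpp/client, you would trigger this schedule from the library’s disconnect/error event and stop as soon as a reconnect succeeds.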

With this perspective (which I can understand is disappointing, if you’d expected a seamless failover), is any of the behavior that you observe still unexpected?

Chatrooms are used as DM rooms with two users in our case. So a user in the UI creates an XMPP connection; when they open a tab for the DM room, they join the room, and messages can be sent there.

Our initial assumption during failover was that users on the unhealthy node would naturally be unable to chat. However, a healthy-node user can join the room and send a message, but the message will not be received by the recipient until the switch happens. So we suspected that the state of the room might have been compromised along with the unhealthy node during this period.

That’s when we tried to verify with pure DMs, to see whether the problem is related to room state or not. In this case as well, users with active sessions on the healthy node couldn’t chat. So room state being tied to the node that went down was ruled out.

Now, what caused this to happen? The answer is unknown.

As this is unknown, we came to the realization that there are two options. Either none of the users, irrespective of node, should be affected and the switch should be seamless,
OR
all users should be restricted momentarily (possible for unhealthy-node users via a ping check).

If the handover of sessions to the healthy node takes some time, then all the other users on the healthy node should also get a connection-lost event on the UI ping. Otherwise, in the UI we can’t restrict healthy-node users from chatting. During this failover window, healthy-node users can chat with other healthy-node users and also with unhealthy-node users, which should be restricted.

For the time being, we could only manage this for users whose connections were lost when their node went down, using the UI ping method. The UI ping method doesn’t work for healthy-node users, as they are never disconnected.

As you’ve mentioned,

failover for user connections from an unhealthy to a healthy node is not without interruptions.

Seamless switching is not an option, then. There should be some mechanism to handle this gap while switching over to the healthy node, so that users do not lose messages or see other strange behaviours. Please also review our cluster setup method from the earlier post.

What do you mean by a ‘pure DM’?

Am I right to summarize your findings as follows?

When an Openfire cluster node crashes, users that are connected to other cluster nodes can not exchange messages in a chatroom, if that chatroom had (other) participants that are/were connected to the crashing server node.

Thanks for the follow-up!

  1. “pure DM”:
    By pure DM, I meant a direct XMPP chat from one user JID to another user JID — not a chat implemented using a MUC room with two users. In our setup, we primarily use chatrooms as DM rooms (with two participants), but for this test, we bypassed that and used standard one-to-one messaging (bare JID to bare JID).
  2. Regarding the findings summary:
    I think there may have been a slight misunderstanding. What we observed is that while one of the cluster nodes is down, users connected to healthy nodes also experience issues sending and receiving messages, both in chatrooms and direct messages. The issue is not about a chatroom having participants connected to the crashed node. Instead, it’s more general: any messaging (room-based or direct) is affected during the switching/failover window, even when both sender and receiver are connected to healthy nodes.
    That’s why we suspect there’s something during failover that affects messaging platform-wide, not just room state.
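For clarity, the distinction in item 1 can be seen at the stanza level. A minimal sketch; the JIDs are made up, and real code must XML-escape the body text:

```javascript
// Sketch: a "pure DM" is a message of type "chat" to a bare user JID,
// while a DM modeled as a MUC room is a message of type "groupchat"
// to the room JID (the room then fans it out to occupants).
function directMessage(to, body) {
  return `<message to="${to}" type="chat"><body>${body}</body></message>`;
}
function roomMessage(roomJid, body) {
  return `<message to="${roomJid}" type="groupchat"><body>${body}</body></message>`;
}

console.log(directMessage("user2@example.com", "hi"));
console.log(roomMessage("dm-room@conference.example.com", "hi"));
```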

If this is expected behavior, we’re now exploring ways to handle it more gracefully for all the users. One of the approaches we tried was a UI-level ping check to detect broken user connections. This worked only for users whose sessions were on the crashed node — allowing the UI to sense the disconnection and temporarily block them from sending messages until their session switched to a healthy node.

That’s certainly not expected behavior: when a cluster node crashes, users that are connected to other cluster nodes should be unaffected.

Your work-around is for clients to detect that they are on a broken cluster node, right? (This doesn’t really fix the problem for users that are already on ‘healthy’ nodes.) You may want to look into a feature called Stream Management. It allows your client to receive confirmation that the server has received data from it. This may be a more efficient way to implement that liveliness check.
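The bookkeeping that Stream Management (XEP-0198) enables on the client side can be sketched like this. This is only the accounting; the actual `<enable/>`, `<r/>` and `<a/>` exchange would be handled by the XMPP library, and the class name is made up:

```javascript
// Sketch of XEP-0198-style outbound accounting: count stanzas sent,
// and when the server acknowledges <a h='N'/> (it has handled N
// stanzas in total), drop the confirmed ones. Whatever remains after
// a failover was never received and can be resent.
class SmOutboundQueue {
  constructor() {
    this.sent = [];   // stanzas sent but not yet acknowledged
    this.handled = 0; // server's last confirmed total ("h")
  }
  send(stanza) {
    this.sent.push(stanza);
  }
  ack(h) {
    const confirmed = h - this.handled; // newly confirmed since last ack
    this.handled = h;
    this.sent.splice(0, confirmed);     // drop acknowledged stanzas
  }
  unacked() {
    return [...this.sent]; // candidates for resending after reconnect
  }
}
```

For example, after sending three messages and receiving an ack with h=2, `unacked()` would return only the third message.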

I will certainly take a look into stream management.

Since this kind of messaging disruption during node failure isn’t expected behavior, could you clarify what the ideal behavior should be during such failovers? I also just want to make sure there’s no misconfiguration in our clustering setup that might be causing the broader impact; I’ve shared some details earlier, but am happy to provide more if needed.

I would expect that during a cluster node failure, no users are affected other than the users that are connected to the node that is failing.

  • One more thing I wanted to ask: based on our repeated crash tests, we’ve observed that user session switching typically takes around 20 to 30 seconds on average, and in the UI we can’t hold users for that long, as important and rapid discussions may be going on. We believe this duration can be optimized, and we’re exploring ways to fine-tune the setup to reduce the failover time. Could you shed some light on potential improvements or configurations that could help speed up this transition?

  • On the client side, we’re currently using React with @xmpp/client. We’ve also seen libraries like Strophe.js being recommended for better handling of reconnections. Do you have any recommendations or best practices when it comes to choosing or configuring XMPP clients for improved failover handling?

I don’t know why it takes 20 to 30 seconds for the clients to fail over. As this requires a client to detect the server going down and to reconnect to a different server, much of this delay is probably addressable within the client implementation itself.

To reduce the impact of a switchover, it may be worth looking into message archiving, for example via the Message Archive Management protocol as provided by the Monitoring plugin. Clients can obtain from a message archive any messages that were sent to them while they were not online.
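A catch-up query after reconnecting can be filtered by time, so the client only asks for what it missed. A sketch of such a query as a raw stanza string; the `start` form field is defined by XEP-0313, but whether the archive serves it for your rooms depends on the Monitoring plugin’s configuration:

```javascript
// Sketch: a MAM (XEP-0313) query restricted to messages after a given
// timestamp, e.g. the time of the last message the client received
// before its connection dropped.
function mamCatchUpQuery(queryId, sinceIso) {
  return (
    `<iq type="set" id="${queryId}">` +
      `<query xmlns="urn:xmpp:mam:2">` +
        `<x xmlns="jabber:x:data" type="submit">` +
          `<field var="FORM_TYPE" type="hidden"><value>urn:xmpp:mam:2</value></field>` +
          `<field var="start"><value>${sinceIso}</value></field>` +
        `</x>` +
      `</query>` +
    `</iq>`
  );
}

console.log(mamCatchUpQuery("catchup1", "2024-01-01T00:00:00Z"));
```

The client would remember the timestamp (or archive id) of the last message it saw, and issue this query right after its post-failover reconnect.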

I don’t have any recommendations for one client library over another, as I’m simply not familiar enough with either your specific use case or any of those libraries to make a good suggestion.

As of now, we are trying to tweak properties in hazelcast-cache-config, such as adding these:

  <property name="hazelcast.max.no.heartbeat.seconds">5</property>
  <property name="hazelcast.heartbeat.interval.seconds">2</property>
  <property name="hazelcast.shutdownhook.enabled">true</property>

This seems to have an impact; we have yet to evaluate clients re-establishing connections. Hopefully we’ll be able to figure it out.