Project Definition "Pampero Redux"

Back in the day, Jive Software launched the Pampero project. Pampero addressed scaling issues in Openfire (but, at the time, not clustering). Since then, a lot of time has passed, a lot of work has been done, and a lot of experience has been gained.

The new insights that arose in the time since can (and should!) be utilized to further improve Openfire in a constructive way. For this reason (and because I love the ring of the name), we are reviving the Pampero project, bringing you Pampero Redux!

Pampero Redux will again focus on scalability issues in Openfire, including clustering this time!

Topics to be addressed

Project Pampero Redux envisions the development of Openfire on the following topics:

  • The time it takes to retrieve a roster of a user from persistent storage shall be reduced to an acceptable amount (not more than a few seconds at worst)
  • Individual Openfire cluster nodes can be stopped, updated, and started, and can crash, without affecting the integrity of the XMPP domain (other than losing the processing resources that are provided by the cluster node).
  • Openfire clustering will be based on open source implementations, preferably licensed similar to Openfire.
  • Openfire’s Achilles’ heel is fixed in a constructive manner.
  • Monitoring of the “health” of each cluster node will be improved and centralized.
  • Provide abuse control mechanisms.
  • Apache MINA will be replaced.
  • Replace PubSub support.
  • Offload Multi-User Chat.

This list of topics is not complete, and some of the topics overlap partially. Nonetheless, this is the list of spearheads for the Pampero Redux project. An elaboration of each of these topics follows in the remainder of this text.

Improving the persistent storage (of Roster objects)

In the existing implementation, Roster objects are stored in a relational database as individual contact list entries. For a moderate to high number of users, the number of rows in the database ranges from considerable to very high. After all, each user typically has multiple contacts. For example, given an XMPP domain that has 500,000 users, of which the average user has 20 contacts, the total number of persisted rows is 10,000,000.

Several administrators of medium-sized (roughly starting at 10,000 users) Openfire clusters have reported that retrieving the roster of a user that just logged in takes up to a few minutes. This is unacceptable, as it makes for a very poor user experience.

Improvements to the existing solution can be made by adding better database indices and restructuring the data that is persisted in the database. Alternatively, as the amount of related data is large, persisting the data in something other than a relational database might be better. In the example above, we would be storing 10,000,000 times the average byte size of a table row in the database. We should answer this question: is storing all of this data in a relational database appropriate?
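To put a rough number on the example above, here is a back-of-the-envelope sketch. The average row size of 200 bytes is purely an assumption for illustration; it has not been measured on a real Openfire database, and indices and storage overhead are ignored.

```java
// Back-of-the-envelope estimate of roster storage for the example domain.
// ASSUMPTION: AVG_ROW_BYTES is a hypothetical figure, not a measurement.
final class RosterStorageEstimate {
    static final long AVG_ROW_BYTES = 200;

    /** Number of roster rows: one row per contact list entry. */
    static long totalRows(long users, long avgContactsPerUser) {
        return users * avgContactsPerUser;
    }

    /** Raw table size in bytes, ignoring indices and storage overhead. */
    static long totalBytes(long rows, long avgRowBytes) {
        return rows * avgRowBytes;
    }

    public static void main(String[] args) {
        long rows = totalRows(500_000, 20);
        long bytes = totalBytes(rows, AVG_ROW_BYTES);
        // prints: 10000000 rows, about 2000 MB
        System.out.println(rows + " rows, about " + bytes / 1_000_000 + " MB");
    }
}
```

Under that assumption, the raw data volume is modest. That suggests the question of appropriateness is at least as much about how the rows are structured and indexed as about their sheer number.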

Pampero Redux will lead to a solution that will allow for near-immediate retrieval of Roster objects from the persistent storage. The solution or solutions that will be implemented can, but are not required to, apply to other objects that are persisted.

Food for thought related to this topic:

  • Introduce a NoSQL database to store data.
  • Introduce an optional read-only connection pool, that can be backed by a replicated version of the database.

High Availability & Rolling Updates

The impact of outages of domains that have a large user base is higher than that of outages of smaller domains. The math is simple: if more people are affected by the outage, the owner of the domain will lose more potential goodwill, revenue and face.

An existing Openfire cluster is likely to grind to a halt if one of the cluster nodes fails. Additionally, running different versions of Openfire on each cluster node is unsupported. This requires the domain to be completely shut down if an update needs to be executed.

Pampero Redux will lead to a solution that will allow for rolling updates: one by one, every cluster node will be taken offline, updated, and rejoined to the cluster.

Pampero Redux will prevent an unplanned outage of a single cluster node from causing a cluster-wide outage, as is currently the case. This problem is likely a side effect of the Achilles’ Heel problem.

Food for thought related to this topic:

  • XEP-0051: Connection Transfer
  • “cluster tasks” (tasks that depend on other than local resources) can be expected to fail, time out, or take forever to finish. All of these situations need to be handled gracefully.
  • There cannot be such a thing as a synchronous cluster task.
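The two points about cluster tasks can be sketched as follows. This is a minimal illustration, not Openfire code: the class and method names are hypothetical, and it assumes Java 9 or later (for `CompletableFuture.completeOnTimeout`). The caller never blocks on the remote work, and a failed or overdue task resolves to a caller-supplied fallback value.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of Openfire: runs a cluster task without blocking the caller.
final class ClusterTaskSketch {
    static <T> CompletableFuture<T> runAsync(Supplier<T> task, long timeoutMillis,
                                             T fallback, Executor pool) {
        return CompletableFuture.supplyAsync(task, pool)
                // A task that takes "forever" is answered with the fallback after the timeout.
                // Note: this does not interrupt the task itself; it only stops waiting for it.
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                // A task that fails (for example because the remote node is gone) is answered
                // with the fallback as well.
                .exceptionally(error -> fallback);
    }
}
```

A caller would attach follow-up work with `thenAccept` or similar, rather than waiting for the result. Whether a fallback value, a retry or an error is the right reaction depends on the task at hand.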

Remove the need for closed-source software

The implementation of the clustered version of Openfire requires users to buy an Oracle Coherence license. This introduces a number of restrictions:

  • It puts up a barrier for users (who are not able or willing to pay for a license). It takes a clustered Openfire out of reach of the general public.
  • It introduces complexity (risk, bugs, poor performance) in the code and project. Problems are harder to diagnose.
  • It puts up a barrier for potential developers.

Pampero Redux will replace Oracle Coherence with an alternative that is freely available.

Food for thought related to this topic:

  • JGroups
  • EHCache
  • JBoss Cache
  • Consider a product that offers commercial support

Fix Openfire’s Achilles’ Heel

Openfire suffers from a concurrency-related issue that is known to bring down not only individual nodes, but entire clusters. The problem is documented as Openfire’s Achilles’ heel.

Pampero Redux will fix the Achilles’ Heel problem in Openfire by “sandboxing” routing and other core components of Openfire. Each sandbox will use dedicated resources only (including threads, database connections and cluster services).
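For threads, such a sandbox could look roughly like the sketch below. It gives each component its own bounded thread pool, so that a component that gets stuck can exhaust only its own threads and queue. All names are hypothetical and this is not how Openfire is structured today. Database connections and cluster services would need the same treatment and are not covered here.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one dedicated, bounded thread pool per component.
final class ComponentSandboxes {
    private final Map<String, ThreadPoolExecutor> pools = new ConcurrentHashMap<>();

    /** Creates (or returns) the dedicated pool for one component. */
    ThreadPoolExecutor poolFor(String component, int threads, int queueSize) {
        return pools.computeIfAbsent(component, name ->
                new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                        new ArrayBlockingQueue<>(queueSize),
                        runnable -> {
                            Thread thread = new Thread(runnable, "sandbox-" + name);
                            thread.setDaemon(true);
                            return thread;
                        },
                        new ThreadPoolExecutor.AbortPolicy()));
    }

    /** Hands work to a component's own pool; returns false when that sandbox is saturated. */
    boolean trySubmit(String component, Runnable work) {
        ThreadPoolExecutor pool = pools.get(component);
        if (pool == null) {
            throw new IllegalArgumentException("No sandbox for component: " + component);
        }
        try {
            pool.execute(work);
            return true;
        } catch (RejectedExecutionException saturated) {
            return false; // the caller decides: drop, retry later, or report an error
        }
    }
}
```

The point of the bounded queue is that overload in one component shows up as a rejection in that component, instead of as starvation everywhere else.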

Food for thought related to this topic:

  • Brian Goetz’ Java Concurrency in Practice

Improve Health Monitoring

A sizable environment requires manageability features that outgrow the features offered by the admin console of Openfire. Monitoring features provided by Openfire should be able to integrate with existing monitoring solutions in place at the site of potential users. The most generic, standard and easily extensible framework that allows for monitoring in Java is JMX.

Pampero Redux will provide JMX-based monitoring of the overall health of cluster node members.

Food for thought related to this topic:

  • Use standard managed beans for thread pools / executor services
  • java-monitor.com
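As an illustration of the first point, a thread pool can be exposed through a standard MBean in a few lines. This is a sketch under assumptions: the class names and the `org.example.openfire` JMX domain are made up for this example, and which attributes are worth exposing is open for discussion.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.ThreadPoolExecutor;
import javax.management.JMException;
import javax.management.ObjectName;

// Hypothetical sketch: exposes basic thread pool statistics over JMX.
final class PoolMonitoringSketch {
    /** Standard MBean interface: public, and named after the implementing class plus "MBean". */
    public interface ThreadPoolStatsMBean {
        int getActiveCount();
        int getQueueSize();
        long getCompletedTaskCount();
    }

    public static final class ThreadPoolStats implements ThreadPoolStatsMBean {
        private final ThreadPoolExecutor pool;

        ThreadPoolStats(ThreadPoolExecutor pool) {
            this.pool = pool;
        }

        @Override public int getActiveCount() { return pool.getActiveCount(); }
        @Override public int getQueueSize() { return pool.getQueue().size(); }
        @Override public long getCompletedTaskCount() { return pool.getCompletedTaskCount(); }
    }

    /** Registers the pool with the platform MBean server under an assumed, made-up domain. */
    static ObjectName register(String poolName, ThreadPoolExecutor pool) throws JMException {
        ObjectName name = new ObjectName("org.example.openfire:type=ThreadPool,name=" + poolName);
        ManagementFactory.getPlatformMBeanServer().registerMBean(new ThreadPoolStats(pool), name);
        return name;
    }
}
```

Once registered, the attributes (ActiveCount, QueueSize, CompletedTaskCount) can be read with any JMX client, such as JConsole, or polled by an existing monitoring solution.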

Provide Abuse control

Large domains are likely to attract more malicious users. Both users that target other users and users that target the domain itself can be expected to be active on large domains. Not only do these users bring a high risk of loss of service but, as explained earlier, such outages are more costly than on smaller domains.

Currently, Openfire offers virtually no functionality to deal with malicious users.

Pampero Redux will provide abuse control functionality.

Food for thought related to this topic:

  • Connection Throttling
  • XEP-0158: CAPTCHA Forms
  • XEP-0268: Incident Reporting
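To make the connection throttling idea concrete, here is a minimal fixed-window sketch that limits the number of new connections per source address. It is only one of several possible strategies (a token bucket would be another) and all names are hypothetical. The current time is passed in explicitly to keep the logic easy to test. Entries are never evicted in this sketch, which a real implementation would have to address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: fixed-window limit on new connections per source address.
final class ConnectionThrottle {
    private final int maxPerWindow;
    private final long windowMillis;
    // Per address: { start of the current window, connections seen in that window }
    private final Map<String, long[]> windows = new ConcurrentHashMap<>();

    ConnectionThrottle(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true when a new connection from this address may proceed at the given time. */
    boolean allow(String address, long nowMillis) {
        boolean[] allowed = new boolean[1];
        // compute() runs atomically per key, so counting and deciding happen together.
        windows.compute(address, (key, state) -> {
            if (state == null || nowMillis - state[0] >= windowMillis) {
                state = new long[] { nowMillis, 0 }; // start a fresh window
            }
            state[1]++;
            allowed[0] = state[1] <= maxPerWindow;
            return state;
        });
        return allowed[0];
    }
}
```

For example, `new ConnectionThrottle(2, 1000)` accepts two connections per address per second and refuses further ones until the window has passed.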

Replace Apache MINA

Apache MINA, which powers the socket control of Openfire, is no longer maintained actively and has proven to be buggy. Reportedly, a similar framework (Netty) outperforms MINA easily.

Pampero Redux should replace Apache MINA with a more stable and better performing framework.

Food for thought related to this topic:

  • Netty

Rewrite PubSub support

The existing implementation of PubSub is hard to maintain, as its structure does not divide responsibilities over classes properly. On top of that, it has been a source of bugs and memory leaks. Finally, it does not support clustering. It should be replaced. The new implementation should

  • be spec-compliant
  • make use of the Openfire plugin infrastructure (making it an optional feature of Openfire)
  • be memory-efficient

Offload Multi-User Chat

MUC-functionality can be resource intensive. The earlier drafts of MUC had been written in such a way that the functionality could easily be packaged in a separate application. Most developers were not aware of this fact, which is why this feature was lost. The foundations of this setup are still available though.

MUC is typically hosted on a subdomain of the XMPP domain (e.g. conference.igniterealtime.org). This is a textbook example of functionality that can be moved to an external component relatively easily. Doing so would take the load related to this functionality off the resources that are available to Openfire itself.
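As a small illustration of what an external component involves: assuming the component connects using the Jabber Component Protocol (XEP-0114), it authenticates by sending the lowercase hexadecimal SHA-1 digest of the stream ID concatenated with a shared secret. The class below is a hypothetical sketch of just that step. Stream setup and stanza handling are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: the handshake value an external component sends (XEP-0114).
final class ComponentHandshake {
    /** Lowercase hex SHA-1 of the stream ID followed by the shared secret. */
    static String digest(String streamId, String sharedSecret) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] hash = sha1.digest((streamId + sharedSecret).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```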

Sounds great Guus, but I am thinking that this is going to require a lot of resources to accomplish. As a heads up, I have already put some thought into the pubsub module in the past. I had a few ideas to fix some existing design issues and make it more robust and scalable.

I was thinking of a component though instead of a plugin, as I think it would be much more scalable if it can be deployed on its own (if needed). In this regard it runs into the same issues, and has the same properties, as MUC.

As far as I know, it is spec compliant right now. The only problem is it is a very old version of the spec (1.7) and a fair amount has changed (now at 1.13). Often people report issues saying that stanzas are not compliant, but that is relative to the version (since it is still a draft), and the only version people see online are the latest ones.

Hey Robin,

I didn’t mean to make this document public - blasted SBS got me there. In any case, it’s very much work in progress. Please read it as such. In any case - thanks for your input.

Plugins and Components can go hand in hand. I’m not sure if having a subdomain serving pubsub rather than the xmpp domain itself is desirable though.