I’m in the middle of an implementation for 4000 users. We opted not to buy the Coherence license and went with “poor man’s clustering” instead.
Here’s our environment:
We’re using an external Oracle DB. Fault Tolerance for the DB is managed by a separate team, so that’s not a consideration for my design.
I have two Openfire servers. They are VMs on a VMware ESX HA/DRS cluster. Each one is configured with 4 GB RAM and two vCPUs. The OS is CentOS 5.4. I have anti-affinity rules set up in VMware so that the two VMs never run on the same physical host.
The servers are configured as identically as possible. They are both configured for the same IM domain and point to the same DB, but they are not clustered as far as Openfire is concerned. At any given time, one is active and the other is passive.
The two servers are front-ended by an F5 load balancer. The load balancer routes client traffic to whichever server is listening on port 5222. Failover is currently a manual activity. If the active server goes kaput, or I want to switch servers to perform maintenance, I shut down the Openfire service on the active server and bring it up on the standby. Clients are momentarily disconnected. The load balancer redirects clients to the alternate server within about five seconds of the service becoming available.
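For anyone who wants to see what "whichever server is listening on port 5222" amounts to, here is a rough Python sketch of that kind of check. This is not the F5 configuration, just an illustration of the idea: try a TCP connection to each server and use the first one that answers. The hostnames in the usage note, the function names and the two-second timeout are all made up for the example.

```python
import socket

XMPP_CLIENT_PORT = 5222  # client port mentioned in the post


def is_listening(host, port=XMPP_CLIENT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    The timeout is an assumed value, not one taken from the real setup.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


def pick_active(hosts, port=XMPP_CLIENT_PORT, probe=is_listening):
    """Return the first host that accepts connections, or None if none do.

    `probe` can be swapped out, which makes the selection logic easy to
    test without touching the network.
    """
    for host in hosts:
        if probe(host, port):
            return host
    return None
```

Calling something like `pick_active(["openfire1.example.com", "openfire2.example.com"])` (hypothetical hostnames) would return whichever of the two currently has the Openfire service up, or `None` during the short gap in a failover.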
It’s not true clustering, but it provides a level of availability that we are comfortable with, and it didn’t cost us an arm and a leg to implement. (The key being that we already had the ESX cluster and load balancer.)
Hope that helps.