A follow-up on this. Let me lay out the use case:
On the server side we have the usual request threads, i.e. a client does an Ajax-request that results in a ‘response’ message sent over XMPP.
We also have long-running threads on the server side that send XMPP messages based on events in other systems, i.e. many machine-based sources of events that result in XMPP messages being sent.
What happened during load is that we saturated the Smack queue, causing lock contention in the ArrayBlockingQueue, which blocked many request threads. The blocked request threads really hurt the runtime behaviour of the server; it basically ran into the ground and did not recover gracefully.
In our particular use case blocked threads are quite possibly the worst thing that can happen; we would much rather have gotten a straight Exception or a refusal to send the message if the outgoing queue was full. But then again, this is an unusual use case for a chat client.
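To illustrate the difference, here is a minimal sketch using a plain java.util.concurrent.ArrayBlockingQueue (the queue type involved above): put() blocks the caller when the queue is full, while offer() fails fast and lets the caller refuse the message instead.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class FailFastQueue {
    public static void main(String[] args) throws InterruptedException {
        // Tiny capacity so the queue is easy to saturate in this demo.
        ArrayBlockingQueue<String> outgoing = new ArrayBlockingQueue<>(2);

        outgoing.put("msg-1");
        outgoing.put("msg-2"); // queue is now full

        // put("msg-3") would block this thread indefinitely, which is
        // exactly what hurt our request threads under load.
        // offer() instead returns false immediately, so the caller can
        // throw an Exception or simply refuse to send.
        boolean accepted = outgoing.offer("msg-3");
        System.out.println("accepted: " + accepted); // prints "accepted: false"
    }
}
```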
This is what we did: we pool several XMPPConnection objects and use a round-robin scheme when the server wants to send a message. Basically, we let our server sign in several “sender” users in Openfire. This works very well.
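A minimal sketch of the round-robin selection, with strings standing in for the pooled XMPPConnection objects (class and method names here are ours, not Smack's):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SenderPool {
    // In production this would be a list of signed-in XMPPConnections.
    private final List<String> senders;
    private final AtomicInteger next = new AtomicInteger();

    public SenderPool(List<String> senders) {
        this.senders = senders;
    }

    // Round-robin: each send goes to the next connection in turn, so
    // bursts are spread over several Smack sender threads and queues.
    public String pick() {
        int i = Math.floorMod(next.getAndIncrement(), senders.size());
        return senders.get(i);
    }

    public static void main(String[] args) {
        SenderPool pool = new SenderPool(List.of("sender-1", "sender-2", "sender-3"));
        for (int n = 0; n < 5; n++) {
            System.out.println(pool.pick());
        }
        // prints sender-1, sender-2, sender-3, sender-1, sender-2
    }
}
```

Math.floorMod keeps the index non-negative even after the AtomicInteger eventually wraps around Integer.MAX_VALUE.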
We are collecting metrics on all sorts of things, for example the latency of “giving a message to Smack”, so we can monitor and add connections if we need to. This is actually not automated, since load testing showed that in our case nine “users” per server was more than enough for our typical load.
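The kind of measurement meant here can be sketched as follows; the timed Runnable is a placeholder for the actual Smack send call, and the class is a simplified stand-in for a real metrics library:

```java
import java.util.ArrayList;
import java.util.List;

public class SendLatencyMetric {
    private final List<Long> samplesMicros = new ArrayList<>();

    // Wraps "giving a message to Smack" and records how long it took.
    public void timedSend(Runnable send) {
        long start = System.nanoTime();
        send.run();
        samplesMicros.add((System.nanoTime() - start) / 1_000);
    }

    public int sampleCount() {
        return samplesMicros.size();
    }

    public long maxMicros() {
        return samplesMicros.stream().mapToLong(Long::longValue).max().orElse(0);
    }

    public static void main(String[] args) {
        SendLatencyMetric metric = new SendLatencyMetric();
        metric.timedSend(() -> {}); // placeholder for the real send
        System.out.println("samples: " + metric.sampleCount());
    }
}
```

In practice you would feed these samples into whatever metrics system you already run, and alert when the max latency climbs.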
We also implemented a circuit breaker. That mechanism monitors the error rate and latency of “giving a message to Smack”. If it starts taking too much time (which may indicate contention), the server simply stops sending messages for a brief time.
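A minimal sketch of such a latency-based breaker; the thresholds, names, and single-sample trigger are simplifying assumptions (the real mechanism also watches the error rate):

```java
public class SendCircuitBreaker {
    private final long latencyThresholdMicros; // "too slow" boundary
    private final long openMillis;             // how long to drop messages
    private long openedAtMillis = -1;          // -1 means closed

    public SendCircuitBreaker(long latencyThresholdMicros, long openMillis) {
        this.latencyThresholdMicros = latencyThresholdMicros;
        this.openMillis = openMillis;
    }

    // Called before each send; while open, messages are dropped so the
    // Smack queues get a chance to drain.
    public synchronized boolean allowSend(long nowMillis) {
        if (openedAtMillis >= 0 && nowMillis - openedAtMillis < openMillis) {
            return false; // open: drop the message
        }
        openedAtMillis = -1; // open window elapsed: close again
        return true;
    }

    // Called after each send with the observed "give to Smack" latency.
    public synchronized void record(long latencyMicros, long nowMillis) {
        if (latencyMicros > latencyThresholdMicros) {
            openedAtMillis = nowMillis; // likely contention: open the breaker
        }
    }

    public static void main(String[] args) {
        SendCircuitBreaker cb = new SendCircuitBreaker(5_000, 1_000);
        System.out.println(cb.allowSend(0));     // true: closed
        cb.record(50_000, 0);                    // a very slow send observed
        System.out.println(cb.allowSend(500));   // false: open, dropping
        System.out.println(cb.allowSend(2_000)); // true: window elapsed
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic and testable; production code would use System.currentTimeMillis() or a Clock.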
Using these mechanisms we have no problem getting an average throughput of several hundred messages per second from each server, mostly due to having several connections, so we get more Smack sender threads and more Smack queues handling bursts. When the circuit breaker “opens”, i.e. briefly throws messages away, Smack rather quickly drains its queues and is ready for business again. In this way we can survive huge message bursts. This is at the cost of losing messages, of course, but the alternative is crashing the server.
So I would say the lessons learned and things to think about in this type of situation are:
- Blocking on a full outgoing queue can drag the whole server down; prefer failing fast over blocking request threads.
- Pooling several connections spreads bursts over more sender threads and queues.
- Collect metrics on send latency so you can see contention coming and add connections if needed.
- A circuit breaker that briefly drops messages beats a server that crashes and does not recover.
A final word about Openfire: it is a beast, it can take huge amounts of messages. We have been able to handle bursts of 800-1000 messages a second on a modest machine while also serving between 1000-2000 BOSH connections. The only time Openfire has been in a less than graceful state is when the OS runs out of memory, causing the machine to run extremely slowly due to swapping. So avoid that at all costs.