Hi Timothee,
One of the bigger bottlenecks that I encountered at my previous job is the way Openfire uses threads.
Openfire uses a couple of thread groups to do the bulk of the work. By default, these groups are configured to be 17 threads in size.
Off the top of my head, groups like these exist:
- S2S connections;
- Clients through NIO;
- Clients through old-style sockets;
- External Components;
Most likely, your clients will use either NIO or old-style sockets. That leaves 17 threads for (almost) all of the work that's related to a client connection. If a client sends a request, one of these 17 threads picks it up and stays busy until the request has been fully processed (this includes, but is not limited to: parsing, authentication verification, privacy list checks, routing, pre-processing event listeners, processing and post-processing).
If anything locks up a thread for a while, you're opening the possibility of running into issues just like the ones you're describing: the more users you've got connected concurrently, the better the chance that a significant part of your 17 threads is locked up like that. If most or all of them are locked, you've got few or no threads left to do any actual processing. This leads to seemingly random problems, disconnections, timeouts and a lot of headache (I've been there). Running Openfire in its clustered setup can actually make this problem worse, as threads tend to wait for an answer from another cluster node.
There are a couple of things that you can do:
First, make the thread usage visible. The pools are implemented using an executor service, which isn't hard to gather statistics from (I think there's even a plugin flying around that does this by printing statistics to a file). Make sure that you monitor usage of your thread groups over time (all the time, preferably). As a quick and dirty solution, you can create Java thread dumps at the time you're noticing problems (e.g. with jstack, or by sending the JVM process a kill -3). Look at threads named 'client-#', which are the most likely candidates for giving you problems.
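For illustration, here's a minimal sketch of how such polling could look, assuming you can get hold of a reference to the underlying ThreadPoolExecutor (the class and method names here are mine, not Openfire's):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    /**
     * Periodically logs the vital signs of a thread pool. How you obtain a
     * reference to Openfire's client pool depends on your deployment; this
     * just shows what to look at once you have one.
     */
    public class ThreadPoolMonitor {

        public static void monitor(final String poolName, final ThreadPoolExecutor pool) {
            final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    // active vs. max tells you how close the pool is to saturation;
                    // a growing queue means work arrives faster than it is processed.
                    System.out.printf("%s: active=%d/%d, queued=%d, completed=%d%n",
                            poolName,
                            pool.getActiveCount(),
                            pool.getMaximumPoolSize(),
                            pool.getQueue().size(),
                            pool.getCompletedTaskCount());
                }
            }, 0, 10, TimeUnit.SECONDS);
        }
    }

If the active count sits at 17 out of 17 for any length of time, that's a strong hint you've found your problem.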
If you can successfully verify that it's threading issues that you're suffering from, you should be able to identify your current bottleneck by looking at these thread dumps. That should help you fix your immediate problems. What I've experienced, though, is that a new bottleneck will pop up right after you've fixed the previous one.
The first step towards a somewhat more long-term solution is to externalize all functionality that you can. I've found it easiest to do this by implementing most functionality as Components (components can usually be rewritten to external components). Most important is to release these 'client threads' (from the thread pools of 17) as soon as possible. I had rather good results using a queuing mechanism, where each component gets a queue and its own executor service. The client threads simply hand off the workload by placing the to-be-processed packet in the component queue, as in the sketch below.
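Here's a minimal sketch of that hand-off pattern; QueuingComponent, enqueue and process are placeholder names of mine, not part of the Openfire API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * The scarce client thread only enqueues the packet; a worker pool owned
     * by the component does the heavy lifting.
     */
    public class QueuingComponent {

        // A small pool dedicated to this component. newFixedThreadPool uses an
        // unbounded work queue; consider a bounded one in production so a slow
        // component cannot exhaust memory.
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        /** Called from one of the 17 client threads; returns almost immediately. */
        public void enqueue(final Object packet) {
            workers.execute(new Runnable() {
                public void run() {
                    process(packet); // the expensive work happens off the client thread
                }
            });
        }

        private void process(Object packet) {
            // ... actual packet handling goes here ...
        }

        public void shutdown() {
            workers.shutdown();
        }
    }

The key point is that enqueue() does nothing but put the work on the component's own queue, so the client thread is free again to pick up the next incoming packet.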
Good luck and happy hunting!