Huge performance regression v3.9.3 -> v4.3.2

davidepetilli · May 17, 2019, 9:59am

Hello,
we are upgrading our plugins from Openfire 3.9.3 to Openfire 4.3.2. We are experiencing a huge performance loss in our performance tests, and we are having a hard time understanding why.
In our investigations we found that http-bind threads enter a blocked state very often and that the number of DB operations has grown by far more than ten times compared to v3.9.3.
For example, a performance test run for 5 minutes on OF 3.9.3 gives these SELECT statistics:

Total # of operations: 5,252
Total time for all operations (ms): 4,446
Average time for each operation (ms): 0.85
Operations per second: 0.00

The same test, for the same amount of time on OF 4.3.2, gives:

Total # of operations: 670,267
Total time for all operations (ms): 429,048
Average time for each operation (ms): 0.64
Operations per second: 0.00

That’s a huge difference!
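
To put the two sets of figures side by side, here is a minimal sketch that derives the growth factor and the average SELECT rate from the counts quoted above. It assumes the 5-minute window applies to both runs; the class and method names are made up for illustration. With these inputs the growth factor comes out at roughly 128x, and the average rates at roughly 17.5 and 2,234 SELECTs per second.

```java
import java.util.Locale;

class SelectStats {
    /** Ratio between two operation counts. */
    static double ratio(long after, long before) {
        return (double) after / before;
    }

    /** Average throughput over a test window given in seconds. */
    static double opsPerSecond(long operations, long windowSeconds) {
        return (double) operations / windowSeconds;
    }

    public static void main(String[] args) {
        long ops393 = 5_252;    // SELECTs on OF 3.9.3, from the post
        long ops432 = 670_267;  // SELECTs on OF 4.3.2, from the post
        long window = 5 * 60;   // five-minute test run, in seconds

        System.out.printf(Locale.ROOT, "growth factor: %.1fx%n", ratio(ops432, ops393));
        System.out.printf(Locale.ROOT, "3.9.3: %.1f SELECT/s%n", opsPerSecond(ops393, window));
        System.out.printf(Locale.ROOT, "4.3.2: %.1f SELECT/s%n", opsPerSecond(ops432, window));
    }
}
```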

It looks like the DB is overwhelmed by requests and the http-bind threads get blocked. Our performance environment (4 Xeon 2.4GHz CPUs and 8GB RAM) doesn’t reach 250 users in a room, and the chatting experience degrades much earlier than that target. Consider that on the same environment (also same DB), with the same tests and OF 3.9.3, we easily managed 4000 users in a single room.
We have tested with both Oracle and DB2 databases (running on different dedicated machines).

Any help in understanding what the problem could be is appreciated.

P.S. We have taken a thread dump, but I’m not allowed to upload it here (maybe I have too few posts). Please let me know if you need it and I will send it by email.

guus · May 17, 2019, 2:28pm

Thanks for reporting this! That’s a pretty dramatic degradation. I’d be very interested in fixing it. Can you share that thread dump with me, please? You can send an email to guus.der.kinderen on the gmail service.

Are you able to determine exactly which queries are being performed on the database?

davidepetilli · May 17, 2019, 4:19pm

Hi Guus,
I sent you the thread dump at the address you shared.
From what we can see, all the SELECTs have a huge number of operations.

Thanks.
Davide

guus · May 17, 2019, 5:19pm

Hi Davide,

The thread dump was received, but it didn’t show much database interaction. There was one thread interacting with the database, and it was triggered by your proprietary code. I noticed that it’s trying to obtain a user through the User Manager.

My advice is to further profile the application. It’s of interest to know exactly what queries are being executed.

Also, have a look at the caches in the Openfire admin console. See if there are caches that are full and used a lot, but with low effectiveness. That might indicate that you need to increase the size of those caches.
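
As a rough illustration of that check, the sketch below treats effectiveness as a plain hit rate (hits divided by hits plus misses), which is an assumption about how the figure is derived. The 90% fill and 50% hit-rate thresholds are arbitrary, and the class is hypothetical rather than part of Openfire.

```java
class CacheStats {
    /** Hit rate as a percentage; 0 when the cache has not been read yet. */
    static double effectiveness(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    /** Flags a cache that is nearly full yet rarely answers from memory. */
    static boolean worthEnlarging(long size, long maxSize, long hits, long misses) {
        boolean nearlyFull = maxSize > 0 && size >= 0.9 * maxSize;  // 90% is an arbitrary threshold
        boolean poorHitRate = effectiveness(hits, misses) < 50.0;   // so is 50%
        return nearlyFull && poorHitRate;
    }
}
```

A cache that is flagged here but does not improve after being enlarged probably has a different problem, for example keys that are never looked up twice.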

davidepetilli · May 18, 2019, 5:58am

Hi Guus,
thank you for taking the time to analyse our dump.

We came to the same conclusions.
The DB interaction from the User Manager call that you saw in the dump appears to be triggered by a (badly written) interceptor that doesn’t filter out room names. This means that for every message, users are found in the cache, but rooms aren’t, so an operation on the DB is executed. We will refactor this code.
In any case, we tried to disable this interceptor and the problem persists.
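
A minimal sketch of the kind of filtering described above, written against plain strings rather than the Openfire API so that it runs on its own. The `RoomAwareLookup` class, the `loader` function and the `conference.example.org` domain are all hypothetical stand-ins; the real interceptor and the User Manager call are not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class RoomAwareLookup {
    private final String mucDomain;                 // hypothetical chat-room service domain
    private final Function<String, String> loader;  // stands in for the expensive DB-backed lookup
    private final Map<String, String> cache = new HashMap<>();
    int loaderCalls = 0;                            // how often the "DB" was actually hit

    RoomAwareLookup(String mucDomain, Function<String, String> loader) {
        this.mucDomain = mucDomain;
        this.loader = loader;
    }

    /** Strips the resource from an address such as node@domain/resource. */
    static String bareOf(String jid) {
        int slash = jid.indexOf('/');
        return slash >= 0 ? jid.substring(0, slash) : jid;
    }

    /** Returns the domain part of an address such as node@domain/resource. */
    static String domainOf(String jid) {
        String bare = bareOf(jid);
        int at = bare.indexOf('@');
        return at >= 0 ? bare.substring(at + 1) : bare;
    }

    /** Looks up a user, skipping addresses that belong to the chat-room service. */
    String findUser(String jid) {
        if (domainOf(jid).equalsIgnoreCase(mucDomain)) {
            return null;  // a room, not a user: no cache probe, no DB round trip
        }
        return cache.computeIfAbsent(bareOf(jid), key -> {
            loaderCalls++;
            return loader.apply(key);
        });
    }
}
```

With a check like this, a message addressed to a room never reaches the loader, and repeated messages from the same user cost a single lookup.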

What we cannot explain yet is why the same code behaves so differently between the two Openfire versions. Stepping through the interceptors in a debugger doesn’t show much difference in the flow between the two OF versions: one more packet is intercepted for every message sent. I don’t think this can explain the growth in DB operations; at most it would double them, but the actual increase is more than 10x.

We already checked the caches and they look ok. We also noticed that there are just three httpbind thread pools, while in OF 3.9.3 there were about 10 or more. Is there a reason why this has been reduced? We already tried to bump up the number of thread pools (using the relevant JiveGlobals property), but it didn’t change anything; we just see more blocked threads.

We have been profiling the application for four days, but we cannot find any obvious culprit for this strange behavior. In the next few weeks we’ll try to optimize our code, but my fear is that we won’t solve the problem at its root.

Any other suggestion would be appreciated.

Thanks.
Davide

guus · May 18, 2019, 8:36pm

I don’t think that the number of threads makes a difference, to be honest.

Increasing the cache sizes probably has no downside, so you could try that as a stab in the dark.

One thing to take into account is that the newer version of Openfire offers new features. Maybe your clients behave slightly differently because of that, which causes different usage patterns?

I’m unsure what to suggest, other than to profile the application, figure out where to optimise, and see where differences originate as compared to the old Openfire. Without being able to look into your application, it’s hard to come up with more concrete solutions, sorry.

speedy · May 19, 2019, 4:59am

3.9.3 to 4.3.2 is a pretty big jump. Although it wouldn’t be ideal (and probably a little time consuming), would you be willing to make a few incremental upgrades? It might help @guus and others by narrowing down when this started to happen …

wroot · May 19, 2019, 6:18am

davidepetilli · May 20, 2019, 6:02am

Hi,

@guus, we already increased the caches, with no visible gain in performance. We have some caches with poor effectiveness, but it looks like even increasing them doesn’t change anything (both effectiveness and performance stay poor).
For four days we have been investigating whether different usage patterns could lead to this performance loss, but we still can’t see anything (that’s why I’m writing here). The only difference we can see is the spike in DB operations and the consequent blocked httpbind threads.
We are planning our next sprint with a cycle of profiling and optimizations, hoping to increase performance. I will update you if we have interesting findings.

@speedy I already thought about making incremental upgrades, but I don’t know if we can fit it in our schedule (which is quite tight at the moment, and this perf issue makes it even tighter).

Davide

guus · May 20, 2019, 7:35am

I feel your pain. Have you conclusively established that the additional database queries are all related to UserManager? In your first post, you mentioned using chat rooms, which do not relate to UserManager directly (although there will be some interaction, to check for room membership and permissions and the like).

Performing a detailed analysis of exactly what queries are done more often as compared to Openfire 3.9.3 might give you clues.
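
One way to do such a comparison, sketched under the assumption that the SQL statements of each test run can be captured as plain text (for example from a JDBC or database-side statement log). The `QueryDiff` class is hypothetical: it replaces literals with `?` so that identical statements group together, counts them per run, and lists the statements whose count grew by at least a given factor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class QueryDiff {
    /** Replaces quoted strings and numbers with '?' so identical statements group together. */
    static String normalize(String sql) {
        return sql.replaceAll("'[^']*'", "?")
                  .replaceAll("\\b\\d+\\b", "?")
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    /** Counts how often each normalized statement occurs in one run. */
    static Map<String, Integer> count(List<String> statements) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sql : statements) {
            counts.merge(normalize(sql), 1, Integer::sum);
        }
        return counts;
    }

    /** Statements whose count grew by at least the given factor (new statements included). */
    static List<String> grown(Map<String, Integer> before, Map<String, Integer> after, double factor) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : after.entrySet()) {
            int old = before.getOrDefault(entry.getKey(), 0);
            if (old == 0 || entry.getValue() >= factor * old) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Feeding it the statements from the 3.9.3 run as `before` and the 4.3.2 run as `after` would narrow the search to the handful of queries that account for the increase.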

Also, just to rule out that someone before you applied optimizations (you never know…): are you sure that you were running an unmodified 3.9.3? Maybe someone before you applied some kind of optimization that needs to be applied again?

davidepetilli · May 20, 2019, 12:16pm

It used to be a modified version of Openfire in the past, but then they removed all the customizations. I already checked it with a checksum and I can confirm it is the same version you can download from the website.
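
For anyone who wants to repeat that kind of check, here is a minimal sketch that compares two files by digest. SHA-256 is an assumption (the post does not say which checksum was used), and the `Checksum` class and the paths passed to it are placeholders.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Checksum {
    /** Streams the input through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-256 is required on every Java platform
        }
    }

    /** True when both files hash to the same digest. */
    static boolean sameContent(Path installed, Path downloaded) {
        try (InputStream a = Files.newInputStream(installed);
             InputStream b = Files.newInputStream(downloaded)) {
            return sha256Hex(a).equals(sha256Hex(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```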

I don’t know if you use JProfiler; if you do, I could try to send you a snapshot of a profiling session…

guus · May 20, 2019, 7:09pm

I’d be happy to have a look, but investigations like those tend to get time consuming. At some point, my daytime job will take precedence. If you’re interested in having me look at this for a dedicated, extensive time, we can discuss options.

davidepetilli · May 21, 2019, 4:39am

I perfectly understand your position. Unfortunately it’s not up to me to discuss options, I am just a developer in a big company.