Openfire 4.7.4 Java Out Of Memory Issue

InteliCare · June 22, 2023, 4:29am

Hi All,

We are seeing a strange issue: Java out of memory errors at the same time of day, every few days.
We are using a Bitnami Azure deployment package that has Openfire 4.7.4 (which we realise needs to be upgraded to 4.7.5 due to the admin bypass issue).
It is also configured with mysql-mariadb 10.6.11 and Apache 2.4.54.
We have around 30 devices connected all the time sending telemetry messages to 3 listener services.

During normal operation, the memory usage reported on the Openfire server page varies from a couple of hundred megabytes to a couple of gigabytes, with CPU at ~1%. But at around 5:10-5:15 pm +8 (21:10-21:15 UTC) something happens: memory usage skyrockets and CPU usage hits 100%. Every 1, 2 or 3 days this causes Java to throw an out of memory error and die. Restarting the Openfire service doesn't work; it requires rebooting the entire VM to bring it back to life.
There also appears to be an increase in network traffic at this time, both in and out but mainly out of the server.

In the Openfire service logs, at about 5pm, there are a whole bunch of entries like this:

2023.06.21 21:36:02 WARN [socket_c2s-thread-8]: org.jivesoftware.openfire.nio.ConnectionHandler - Closing connection due to exception in session: (0x0000007A: nio socket, server, /[client ip address]:46992 => /[server ip]:5222)
java.io.IOException: Closing session that seems to be stalled. Preventing OOM
[exception information]

then

2023.06.20 09:23:44 WARN [Jetty-QTP-AdminConsole-1917]: org.eclipse.jetty.util.thread.QueuedThreadPool -
java.lang.OutOfMemoryError: Java heap space

and that's where the logs stop until the VM is restarted and everything comes back up.

We are going to turn on trace logging for half an hour from 5pm to see if that captures anything more in the logs.

Is there any known issue that would cause something like this at the same time every day or every few days?
Are there any obvious things we can try to debug this issue and work out what is going on?

Kind Regards,

InteliCare team

guus · June 22, 2023, 5:43am

It would be useful to be able to analyze a heap dump. Any Java process, like Openfire, can be configured to create a heap dump automatically when an OutOfMemoryError occurs. Can you enable that? Typically, this is done by adding a command line option like this one: -XX:+HeapDumpOnOutOfMemoryError
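
One way to confirm that the option actually took effect is to ask the JVM for its own settings. The following is a minimal sketch, assuming a HotSpot-based JVM (the `com.sun.management` API is HotSpot-specific); the class name `HeapDumpCheck` is made up for illustration. It only reports the options of the JVM it runs in, so it would need to be started with the same flags as Openfire to be meaningful.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of Openfire.
public class HeapDumpCheck {

    /** Returns the current value of a HotSpot VM option, e.g. "true" or a file path. */
    public static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the option is unknown to this JVM.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // "true" when heap dumps on OutOfMemoryError are enabled for this JVM.
        System.out.println("HeapDumpOnOutOfMemoryError = "
                + vmOption("HeapDumpOnOutOfMemoryError"));
        // Empty when no explicit dump location was configured.
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

For an Openfire process that is already running, `jcmd <pid> VM.flags` should list the same information without any extra code.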

InteliCare · June 22, 2023, 7:15am

Hi Guus,

Thanks for the reply. We will add that flag and the HeapDumpPath flag as well.

I forgot to add that we are using the -Xmx4g -Xms1g Java memory settings, so the dump may be of decent size. What would be the best way to make it available for analysis?

Kind Regards,

InteliCare team.

guus · June 22, 2023, 8:10am

That doesn’t really matter to me. Can you do FTP? Dropbox? SCP?

InteliCare · June 22, 2023, 8:33am

The easiest option may be to give you a temporary (24h) download link from Azure Blob Storage.

InteliCare · June 23, 2023, 2:21am

Hi Guus,

Just an update: it appears one of our services is creating a lot of sessions (about 250) within a minute or so, a couple of times a day, and leaving them detached, and Openfire appears unable to handle this very well. We are investigating the cause and updating the service so that it closes sessions correctly (sends presence unavailable) before disconnecting. We are also investigating why it is making so many calls at 3am and 5pm.

We have had 2 outages since the last post, but it hasn't produced a memory dump yet.

Kind Regards,

InteliCare

guus · June 23, 2023, 8:19am

That’s weird. Some kind of restart/reconnect feature that’s not doing quite what it’s supposed to do maybe? Sounds like you’re onto something…

InteliCare · June 26, 2023, 12:33am

Hi Guus,

Yeah, it was caused by one of our services checking in on the connected devices. For some reason the timer that triggers it runs in a weird overlapping way twice a day, causing 200+ connections and disconnections within a short period of time, all with a single user's login.
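
For anyone hitting the same kind of overlapping timer, one common guard is to skip a tick while the previous run is still in progress. This is a minimal sketch of that idea, not the fix we actually applied; the class name `NonOverlappingTask` and its methods are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper that refuses to start a run while another one is active.
public class NonOverlappingTask {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final Runnable work;

    public NonOverlappingTask(Runnable work) {
        this.work = work;
    }

    /** Runs the work unless a previous run is still active; returns whether it ran. */
    public boolean tryRun() {
        // Atomically claim the "running" slot; fail fast if it is already taken.
        if (!running.compareAndSet(false, true)) {
            return false; // previous check-in still running: skip this tick
        }
        try {
            work.run();
            return true;
        } finally {
            running.set(false); // always release, even if the work throws
        }
    }
}
```

A timer callback would call `tryRun()` instead of invoking the check-in directly, so a second trigger that arrives mid-run becomes a no-op rather than another burst of connections.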

We have resolved the detached sessions by sending presence unavailable before disconnecting, and that has resolved our issue with Openfire.
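
The shape of that change can be sketched roughly as follows. This is an illustration only, not our service's code: `XmppLink` is a hypothetical stand-in for whatever XMPP client library is in use, and `Session` is a made-up wrapper. The stanzas themselves (`<presence type="unavailable"/>` followed by the closing stream tag) are standard XMPP.

```java
// Hypothetical sketch: guarantee a clean XMPP sign-off before the socket closes.
public class CleanSessionSketch {

    /** Hypothetical minimal view of an XMPP connection; not a real library API. */
    public interface XmppLink {
        void send(String stanza);
        void closeSocket();
    }

    /** Wrapper for try-with-resources: announces unavailability, then closes. */
    public static final class Session implements AutoCloseable {
        private final XmppLink link;

        public Session(XmppLink link) {
            this.link = link;
        }

        public void send(String stanza) {
            link.send(stanza);
        }

        @Override
        public void close() {
            try {
                // Tell the server this session is going away, then end the stream.
                link.send("<presence type=\"unavailable\"/>");
                link.send("</stream:stream>");
            } finally {
                link.closeSocket(); // close the socket even if the sign-off fails
            }
        }
    }
}
```

Using the wrapper in a try-with-resources block means the sign-off is sent on every exit path, including exceptions, so the server is not left holding a detached session.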

Kind Regards,
InteliCare