Memory leak in Openfire 4.4.4

Hello,

I have further evidence of memory leaks in OF 4.4.4 (please see attachments), supporting another finding mentioned here by suf126a.

Loadtest:
OS: Solaris 10 sparc
JDK: 1.8.0_191
50 concurrent users
5 chatrooms

The memory leak appears specifically when sending attachments (file transfers). A large number of FileTransfer objects appear to be kept in memory. A large number of DomainPairs also seem to exist (I'll send screenshots in the near future).

StanzaHandler.processIQ() calls MetaFileTransferInterceptor.interceptPacket() for each IQ it processes (see profiler screenshot), which in turn calls:

FileTransfer transfer = createFileTransfer(from, to, childElement);
and
acceptIncomingFileTransferRequest(transfer)

These file transfers seem to be cached in the acceptIncomingFileTransferRequest(transfer) method:
cacheFileTransfer(ProxyConnectionManager.createDigest(streamID, from, to), transfer)
and the Cache itself seems to work fine (it auto-cleans after a while).

So, I wonder who keeps references to these FileTransfer objects.

FileTransfer references escape only via the retrieveFileTransfer() method, but that is called by DefaultFileTransferManager.registerProxyTransfer(), which caches them again.
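
If it would help, I could add a small probe (e.g. in a throwaway plugin) to watch the cache sizes on our system while the load test runs. A rough sketch of what I have in mind; it has to run inside Openfire, and I am assuming from memory that CacheFactory.getAllCaches() is available in this form:

import org.jivesoftware.util.cache.Cache;
import org.jivesoftware.util.cache.CacheFactory;

public class CacheSizeProbe {
    // Print every registered cache and its current number of entries, to see whether
    // a file-transfer related cache is the one that keeps growing under load.
    // Assumption: CacheFactory.getAllCaches() exists and returns all registered caches.
    public static void dumpCacheSizes() {
        for (Cache<?, ?> cache : CacheFactory.getAllCaches()) {
            System.out.println(cache.getName() + ": " + cache.size() + " entries");
        }
    }
}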

I hope further discussion of the topic will help find the problem.

openfirememoryleak.zip (309.3 KB)

Kind regards,

John


Interesting. Can you provide a memory dump that contains the instances that are leaking? I’d like to find out what keeps references to them open.

I'm afraid I won't be able to provide you with a heap dump (I tried a number of things, but it won't be possible to take it out of our system), so we'll have to continue with screenshots.

attachments(1).zip (257.8 KB)

If these won’t help, I can provide more later.

Thanks.

Please find 2 heapdumps attached.
heapdump-1580507303881-new.zip (14.3 MB) heapdump-1580506589266-new.zip (14.9 MB)

Hello Guus. What exactly is being cached, may I ask? I sent the same file 3 times between the same 2 users, and DefaultFileTransferManager.cacheFileTransfer() creates a different key each time; as a result, the same file is cached 3 times in the fileTransferMap Cache. If the purpose was to reuse the cached file transfer, then why use a different key each time? And what is the purpose of caching file transfers in the first place?

The Cache ‘auto-cleans’ itself after a while, of course, but when you bombard the server with attachments, the cache can fill up if the auto-clean doesn’t happen quickly enough.
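
My guess as to why the key changes: the digest includes the stream ID (createDigest(streamID, from, to)), and every transfer negotiates a fresh stream ID, so even the identical file between the same two users produces a new cache key. A minimal sketch of that, assuming the digest is a SHA-1 over stream ID + initiator + target as in XEP-0065 (the real implementation may differ):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DigestSketch {
    // Assumed shape of ProxyConnectionManager.createDigest(): SHA-1 over sid + initiator + target.
    static String createDigest(String streamId, String initiator, String target) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-1")
                .digest((streamId + initiator + target).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same two users, same file, but a fresh stream ID per transfer -> three different keys.
        System.out.println(createDigest("transfer1_1", "user001@example.org", "user002@example.org"));
        System.out.println(createDigest("transfer2_2", "user001@example.org", "user002@example.org"));
        System.out.println(createDigest("transfer3_3", "user001@example.org", "user002@example.org"));
    }
}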

Can you help me reproduce the problem on my end? The combination of file transfer and MUC rooms confuses me a little. What exactly does your code do?

I’m currently looking at the heap dumps that you provided. What makes you conclude that the FileTransfer objects are the cause of the memory leak?

Although both heapdumps show a fair amount of FileTransfer objects (both have slightly over 1,000 objects), the retained heap (see below) for these is under half a megabyte. That’s well under 0.5% of the total heap size.

From JProfiler’s help files:

Shallow vs. Retained Heap

Shallow heap is the memory consumed by one object. An object needs 32 or 64 bits (depending on the OS architecture) per reference, 4 bytes per Integer, 8 bytes per Long, etc. Depending on the heap dump format the size may be adjusted (e.g. aligned to 8, etc…) to model better the real consumption of the VM.

Retained set of X is the set of objects which would be removed by GC when X is garbage collected.

Retained heap of X is the sum of shallow sizes of all objects in the retained set of X, i.e. memory kept alive by X.

In both heap dumps, I’m seeing a more likely candidate for a memory leak: both dumps have exactly 51 NioSocketSession instances (the representation of a TCP connection). Their retained heap is significant: 61% of one heap, 72% of the other.

A significant number of these instances have a retained heap larger than one megabyte. In the arbitrary selection that I reviewed, that memory was in every case held by the writeRequestQueue property.

From MINA’s javadoc on writeRequestQueue’s getter:

(…) the queue that contains the message waiting for being written. As the reader might not be ready, it’s frequent that the messages aren’t written completely, or that some older messages are waiting to be written when a new message arrives. This queue is used to manage the backlog of messages.
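
To make the shallow/retained distinction concrete for this case: a session object is tiny by itself, but it retains everything that is reachable only through its write queue. A toy illustration (plain Java, not the actual MINA or Openfire classes):

import java.util.ArrayDeque;
import java.util.Queue;

public class RetainedHeapToy {
    // A stand-in for a connection: its shallow size is just a couple of references,
    // but it retains every buffer that is reachable only through its queue.
    static final class FakeSession {
        final Queue<byte[]> writeQueue = new ArrayDeque<>();
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        for (int i = 0; i < 1_000; i++) {
            session.writeQueue.add(new byte[1024]); // ~1 MB retained solely by this session
        }
        System.out.println("buffers queued: " + session.writeQueue.size());
    }
}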

I’ve used OQL to extract the buffered values of all of these messages, using this query:

SELECT r.originalMessage.buf.hb.toString() FROM org.apache.mina.core.write.DefaultWriteRequest r

This query returns data from all DefaultWriteRequest instances, the type that’s put in the queue. If this type is used elsewhere (which I doubt), the results might be skewed.

This is a dump of the results: dump.txt (28.3 MB)

They consist solely of stanzas related to file transfers, it seems. I’ve performed the following grep to count the lines that do not include a stanza ID matching something like id="transfer3_1145492":

$ grep -v id=\"transfer dump.txt | wc -l
7
$ wc -l dump.txt
28939 dump.txt

Of those 7 lines that do not match, most of them are empty.

How to interpret all this? The problem appears to be located in the queue that holds outbound stanzas. Why these queues are filling up needs further analysis, but a good part of that will involve looking at the code that generates the data.

From the data, it’s clear that some kind of test is being performed. I’m thinking that I’m seeing clients that loop over a bit of code that performs a file transfer. My first thought is that it might be the test code itself that’s causing the problem: it appears that the client code is not reading all data fast enough, causing the server-side write buffers to fill up.

Were all 51 socket connections still active at the time the dumps were created? What happens to the memory after they disconnect? What happens if you keep them connected for a couple of minutes, but don’t have them continuously push new file transfer requests any more? I wonder if things “catch up”.
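
If you can add a little instrumentation on your side, something like the following would show whether the per-connection write backlog keeps growing while the test runs. This is a sketch against the plain MINA 2.x API; how to obtain the IoService (acceptor) from within Openfire depends on its connection-manager internals, so that part is left as a parameter:

import org.apache.mina.core.service.IoService;
import org.apache.mina.core.session.IoSession;

public final class WriteBacklogLogger {
    // Log the outbound backlog per session; steadily growing numbers would match
    // what the heap dumps show for writeRequestQueue.
    public static void logBacklog(IoService service) {
        for (IoSession session : service.getManagedSessions().values()) {
            System.out.println(session.getRemoteAddress()
                    + " scheduledWriteMessages=" + session.getScheduledWriteMessages()
                    + " scheduledWriteBytes=" + session.getScheduledWriteBytes());
        }
    }
}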

Hi there, and thank you for looking into it. You are right.
The test tool we are using is a C program that uses the iksemel 1.4 XMPP library. I attach an executable built on Mac. I'll check if I can release the source code.

In order to execute it, you need to create 50 users (user001-user050), all with the same password (a in the example .sh: -p a), and 5 chat rooms (room001-room005).

It works in either of two modes:

  1. It splits the 50 users across the 5 chat rooms, and each user sends a short message (50 characters) to its chat room. This happens every second.

  2. Each user sends a small attachment to the other users. This also happens very fast, in a loop, so the load is quite high.

When we use the tool in mode 1, everything seems normal. When we use it in mode 2, the result is what you received in the past.

The way the iksemel library seems to work (and the same seems to be the case with the Smack Java library), it sends the attachment even if there is nobody there to receive it. When we start the tool and also start the chat client application that we use for verification (similar to Spark), we don’t see the dialog box that normally pops up when an attachment is received. This could be problematic in itself, because otherwise someone would have to be there continuously accepting the file transfers (clicking really fast to close the dialogs). In other words, it is not required that somebody receives/consumes the file transfers. When we use mode 1, we do see all the messages filling up the chat rooms in our chat client application.
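
For what it’s worth, the Java tool’s mode 2 boils down to roughly the following (a stripped-down sketch using Smack 4.x, not the actual tool; the domain, JIDs and file name are placeholders):

import java.io.File;

import org.jivesoftware.smack.AbstractXMPPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnection;
import org.jivesoftware.smack.tcp.XMPPTCPConnectionConfiguration;
import org.jivesoftware.smackx.filetransfer.FileTransferManager;
import org.jivesoftware.smackx.filetransfer.OutgoingFileTransfer;
import org.jxmpp.jid.impl.JidCreate;

public class Mode2Sketch {
    public static void main(String[] args) throws Exception {
        AbstractXMPPConnection connection = new XMPPTCPConnection(
                XMPPTCPConnectionConfiguration.builder()
                        .setXmppDomain("example.org")          // placeholder domain
                        .setUsernameAndPassword("user001", "a") // same password as in loadtest.sh
                        .build());
        connection.connect().login();

        FileTransferManager manager = FileTransferManager.getInstanceFor(connection);
        File attachment = new File("small-attachment.bin");    // placeholder file

        // Mode 2: keep pushing small attachments to another user, without anyone accepting them.
        while (true) {
            OutgoingFileTransfer transfer = manager.createOutgoingFileTransfer(
                    JidCreate.entityFullFrom("user002@example.org/loadtest"));
            transfer.sendFile(attachment, "load test attachment");
            Thread.sleep(100); // roughly matches the "very fast, in a loop" behaviour
        }
    }
}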

We have another 2 load test tools apart from the C one: one written in Java using the Smack library that works similarly to the C program, and another using JMeter, based on the guide “XMPP Load Testing - The Ultimate Guide”. However, only with the C load test did we have this issue (we’ll try again with the Java tool once the coronavirus situation goes away).

Because the problem appeared only in file transfer mode, we focused on something specific to that functionality. The tool has since been adapted to send fewer attachments, to be more realistic, and to send bigger attachments less frequently, since the Openfire buffers seem to overflow, as mentioned in some post; with that change, Openfire seemed to behave satisfactorily.

loadtest_i386 (4.8 MB) loadtest.sh (62 Bytes)

I'm facing a similar issue when trying to connect with clients written in C.