Great news! We have resolved this issue which was caused by a number of factors…
I found the problem by analysing a heap dump.
By default, Openfire caches each user's roster (the username2roster cache) for 6 hours after the last activity.
On average, each of our roster items is about 90 KB in size in the cache (this is the real size, not Openfire's estimate).
Because roughly 50,000 people can log in within a 6-hour window at peak time, that could consume many GB of RAM, even though fewer than 9,000 people are logged in at any one time.
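A quick back-of-envelope check of that worst case, using only the numbers from this post (~90 KB per cached roster, ~50,000 logins within one 6-hour cache lifetime):

```java
// Rough worst-case footprint of the username2roster cache, using the
// figures described above. Numbers are estimates from this post, not
// values measured from Openfire itself.
public class RosterCacheMath {
    public static void main(String[] args) {
        long bytesPerRoster = 90L * 1024;  // ~90 KB per cached roster
        long peakLogins = 50_000;          // logins within one cache lifetime
        double gb = (bytesPerRoster * peakLogins) / (1024.0 * 1024 * 1024);
        System.out.printf("Worst-case cache footprint: %.1f GB%n", gb); // ~4.3 GB
    }
}
```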
Unfortunately, the cache wasn't cleaned up correctly by Openfire when it reached its limit, because Openfire is incorrectly calculating the size of the cached entries in org.jivesoftware.openfire.roster.RosterItem#getCachedSize (and possibly in org.jivesoftware.openfire.roster.Roster as well, at an initial glance). This method is part of the Cacheable interface.
A lot of information has been added to the roster items since this code was written, so it now calculates about a third or less of the actual size of the roster objects. A new issue needs to be created for this, and I would be happy to prepare a patch that accurately calculates the size of the roster objects.
If this calculation were correct, the cache would have been cleaned up properly and the memory issues wouldn't have occurred.
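To illustrate the failure mode (this is a simplified sketch, not Openfire's actual code, and the field names and per-string cost model are hypothetical): a stale getCachedSize() keeps counting only the fields that existed when it was written, so every field added later is invisible to the cache's accounting.

```java
// Illustrative sketch of how a stale size estimate under-counts a cached
// object. "nickname" and "groups" stand in for fields added after the
// original estimate was written; the ~3x gap mirrors the "a third or
// less" under-count described above.
import java.util.List;

public class SizeEstimateDemo {
    // Simplified per-string cost model: fixed overhead plus 2 bytes/char.
    static int sizeOfString(String s) { return s == null ? 0 : 40 + 2 * s.length(); }

    static class Item {
        String jid = "user@example.org";
        String nickname = "Alice";                        // added later
        List<String> groups = List.of("Friends", "Work"); // added later

        // Stale estimate: only counts the original field.
        int staleCachedSize() { return sizeOfString(jid); }

        // Accurate estimate: counts every field.
        int fullCachedSize() {
            int size = sizeOfString(jid) + sizeOfString(nickname);
            for (String g : groups) size += sizeOfString(g);
            return size;
        }
    }

    public static void main(String[] args) {
        Item item = new Item();
        System.out.println("stale = " + item.staleCachedSize()); // 72
        System.out.println("full  = " + item.fullCachedSize());  // 224
    }
}
```

With numbers like these the cache believes it holds roughly a third of the memory it actually occupies, so it never evicts early enough.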
We worked around the issue in the meantime by adding the property cache.username2roster.maxLifetime and setting it to 419430400, which is 20 minutes (rather than its default of 6 hours). For anyone who may be having a similar issue now: this property can be added through the admin interface, no programming required.
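As a plain config fragment, the override looks like this (property name and value taken from the workaround above):

```
cache.username2roster.maxLifetime = 419430400
```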
I guess the other option would be to reduce cache.username2roster.size to about a quarter of the value you actually want, to account for the calculation issue explained above.
Let me know if you have any questions,