Memory leak with Client Session Info Cache and Hazelcast?

Alex46 · July 16, 2013, 5:01pm

Hello,

We are running an Openfire 3.8 cluster with the Hazelcast distributed cache. We are noticing that the Client Session Info Cache grows non-stop. I have added cluster_wide_map_size to restrict the growth but am still running into issues.

Does anyone know why this cache would grow continuously? It seems like data is not being evicted.

Does the latest version (3.8.2) resolve this issue?

Thanks for any help,

Jack11 · July 18, 2013, 12:35am

Bump.

I might not have an answer to offer, but I’m curious to know:

what box are you running it on?
how are you measuring the memory leak growth?
at what rate is the leak growing (MB/min)?
what level of XMPP activity (messages/min) is the box at?
how many clusters are there?

Alex46 · July 18, 2013, 1:11am

Hi, here is some additional info:

what box are you running it on?
We are running it on large Amazon EC2 instances (7.5 GB of memory and 4 ECU). We start our instances with 6 GB of max memory.

how are you measuring the memory leak growth?
I’m just looking at the cache summary. It’s exceeding the max size and continuously growing. It seems entries are being written without eviction. It grows to >100 MB per day.

at what rate is the leak growing (MB/min)?
Not MB/min, but over 1 day or so it gets to be >100 MB.

what level of XMPP activity (messages/min) is the box at?
We have a very low number of messages but a large number of connected clients (100k users with 10k concurrently connected). The message rate is less than 20/min.

how many clusters are there?
There are 4 machines in the cluster.
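
For comparison with the MB/min question, the daily figure can be converted with a small helper. This is only a sketch: it assumes exactly 100 MB of growth over exactly 24 hours (the post says ">100 MB" over "1 day or so"), and the class and method names are made up for illustration. Under those assumptions the rate works out to roughly 0.07 MB/min.

```java
// Hypothetical helper (not part of Openfire or Hazelcast): turns two
// cache-size readings taken some time apart into a growth rate.
final class CacheGrowthRate {
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;
    private static final double MILLIS_PER_MINUTE = 60_000.0;

    // Returns the growth between the two readings in MB per minute.
    static double mbPerMinute(long startBytes, long endBytes, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        double grownMb = (endBytes - startBytes) / BYTES_PER_MB;
        double elapsedMinutes = elapsedMillis / MILLIS_PER_MINUTE;
        return grownMb / elapsedMinutes;
    }

    public static void main(String[] args) {
        // Assumed readings: 0 bytes at the start, 100 MB one day later.
        long oneDayMillis = 24L * 60 * 60 * 1000;
        long hundredMbInBytes = 100L * 1024 * 1024;
        double rate = mbPerMinute(0L, hundredMbInBytes, oneDayMillis);
        System.out.printf("%.3f MB/min%n", rate);
    }
}
```

Taking two readings from the cache summary page a known time apart and feeding them in gives a number that can be tracked from day to day.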

What’s the appropriate cluster_wide_map_size value for the Client Session Info Cache? I have it currently set to 100000. Is that too high? I also had to tweak the eviction delay and eviction rate.

Thanks for any hints or help.

Jack11 · July 18, 2013, 8:11am

That seems pretty low in terms of activity and concurrency. Are the CPUs screaming and the number of interrupts trending up? Hope you have Munin / Cacti to track trends in resource usage.

With all due respect, is there a possibility that it may be just the cache at work… caching? I wonder how long you have kept a machine running in its cluster? I’d let one climb until it nearly maxes things out.

Also, have you gone through the settings to tighten the spigot on client connection durations? Curious to know how many seconds it takes to disconnect clients automatically.

Alex46 · July 18, 2013, 2:39pm

The CPUs do trend up to exceed 50% utilization, so I’m worried about that. I don’t think we have a lot of data, so 50% utilization seems high. I will check our monitoring servers to see if there are more details on resource usage.

My concern is that the Client Session Info Cache size does not seem to adhere to the max size in the cache summary page at all. It just exceeds that limit and keeps on growing.

I’m not sure about client connection durations, since we are using XMPP to send notifications to clients and need to maintain a connection while our clients are running. I will try a lower value to see if it makes a difference.

Jack11 · July 19, 2013, 8:10pm

How is it going now?

Is there a convenient way to simulate load on a secluded box and throw debug logs out at it to see where the leak could be happening?

Alex46 · July 20, 2013, 10:18pm

The Client Session Info Cache is overflowing again.

I’m seeing Hazelcast ConcurrentMapManager warnings in the logs:

RedoLog{name=c:Client Session Info Cache, redoType=REDO_MAP_OVER_CAPACITY, operation=CONCURRENT_MAP_PUT,

Yes, we can do some load testing on an isolated test cluster. Which debug logs do you think should be enabled?

Thanks,

Justin_Michael · January 22, 2014, 9:26pm

Have you tried setting the cache value to -1 and letting it be unlimited? It sounds to me like you believe the cache is not being effective because it is too small and is therefore impacting performance.

liamchou · April 20, 2016, 12:53am

In System Properties, set cache.ClientSessionInfoCache.size to -1.

Sourabh1 · April 20, 2016, 11:14am

Isn’t setting -1 equally bad? Doesn’t it risk starving your whole Openfire of memory as the cache keeps growing indefinitely?
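
To make that trade-off concrete, here is a minimal sketch of a size-capped cache next to an uncapped one. It is not how Openfire or Hazelcast implement eviction; it is a plain `LinkedHashMap` with least-recently-used eviction, the class name is hypothetical, it counts entries rather than whatever unit the real setting uses, and treating -1 as "no limit" is only meant to mirror the convention mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: an LRU map that evicts once it holds more
// than maxEntries entries, or never evicts when maxEntries is negative.
final class BoundedSessionCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedSessionCache(int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // A negative limit (for example -1) disables eviction entirely.
        return maxEntries >= 0 && size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedSessionCache<String, String> capped = new BoundedSessionCache<>(1000);
        BoundedSessionCache<String, String> unlimited = new BoundedSessionCache<>(-1);

        // Simulate 5000 distinct sessions being written to each cache.
        for (int i = 0; i < 5000; i++) {
            capped.put("session-" + i, "info");
            unlimited.put("session-" + i, "info");
        }

        System.out.println("capped size:    " + capped.size());
        System.out.println("unlimited size: " + unlimited.size());
    }
}
```

The capped map stays at its limit by dropping old entries, while the uncapped one holds everything ever written, so with -1 memory use is bounded only by how many entries are added and never removed.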