Costly Transaction in 3.8.0: GroupManager.getGroup

Hi

We are profiling Openfire 3.8.0 (release) in our production environment and I have stumbled across GroupManager.getGroup at line 294/312. It regularly produces transaction times of several seconds according to our profiler (AppDynamics).

Any idea what the reason for that is? PostgreSQL produces the results for the select in that transaction within a millisecond. It looks like a cache miss followed by a costly insert into the cache.

Ideas? Questions?

Walter

P.S. We are actually profiling against org.jivesoftware.openfire.spi.PacketRouterImpl.route, since we assume that this method touches every message that runs through Openfire. Any other good ideas about what to profile are welcome (no, we will not profile Kraken…).

Hi Walter,

Assuming you are using the default implementation (DefaultGroupProvider) in a non-clustered configuration (DefaultLocalCacheStrategy), you might be running up against a problem with the default cache eviction policy. We found that with installations that make heavy use of groups, the default eviction policy of 15 minutes for groups was unworkable because the unloading/reloading of groups was fairly expensive (group, group properties, group users, etc.).

We found that the following configuration was better for high-volume single-node (non-clustered) deployments that use groups for roster management:

cache.username2roster.maxLifetime=-1
cache.userCache.maxLifetime=-1
cache.group.maxLifetime=-1
cache.groupMeta.maxLifetime=-1

This effectively disables cache eviction for these critical objects. You can set these properties using the admin console. Please give it a whirl and let us know if this improves your profiling results.
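If you would rather set these from a plugin or other startup code than through the admin console, here is a minimal sketch using Openfire's JiveGlobals (the property names are the ones listed above; note that a cache that has already been created may still need a restart before the new lifetime takes effect):

    import org.jivesoftware.util.JiveGlobals;

    // Sketch: disable eviction for the group/roster-related caches by
    // setting their maxLifetime to -1 (never expire). This has the same
    // effect as setting the system properties in the admin console.
    public class CacheLifetimeTuning {
        public static void disableEviction() {
            JiveGlobals.setProperty("cache.username2roster.maxLifetime", "-1");
            JiveGlobals.setProperty("cache.userCache.maxLifetime", "-1");
            JiveGlobals.setProperty("cache.group.maxLifetime", "-1");
            JiveGlobals.setProperty("cache.groupMeta.maxLifetime", "-1");
        }
    }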

Note that if you are running in cluster mode, you will need to adjust the settings in the Hazelcast config file instead of using these system properties. Let me know if you need more information.

Cheers,

Tom

Further to this, if you disable the cache eviction timer (as recommended above), you may also want to tweak the cache size properties for the corresponding caches (with defaults):

cache.username2roster.size=262144
cache.userCache.size=524288
cache.group.size=1048576
cache.groupMeta.size=524288

For any sizeable installation, these values are likely to be too small, especially the group cache size. You can use the cache summary page in the admin console to find the information you will need to tune these properties for your deployment.
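For reference, the numbers shown on the cache summary page can also be pulled programmatically, which is handy if you want to log them over time. A minimal sketch against the standard org.jivesoftware.util.cache API (as far as I recall, the max sizes for these caches are configured in bytes):

    import org.jivesoftware.util.cache.Cache;
    import org.jivesoftware.util.cache.CacheFactory;

    // Sketch: dump current usage and hit/miss counts for every cache so you
    // can see which ones are running out of room and need a larger max size.
    public class CacheReport {
        public static void print() {
            for (Cache cache : CacheFactory.getAllCaches()) {
                System.out.println(cache.getName()
                    + " size=" + cache.getCacheSize()
                    + " max=" + cache.getMaxCacheSize()
                    + " hits=" + cache.getCacheHits()
                    + " misses=" + cache.getCacheMisses());
            }
        }
    }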

Thanks, Tom. The installation is a single instance with Kraken running on it. I will check the cache settings.

One additional thing about this implementation of getGroup. As far as I understand, this method always throws an exception if it does not find a group with the given name. Since getGroup is called very often, we see regular exceptions (actually due to Kraken). Maybe it would be better to avoid throwing an exception when a group is not found.

Interesting … well, I agree that the GroupNotFoundException is not ideal in many cases, but unfortunately it has been a long-standing fixture of the Groups API. At this point it would be problematic to change the contract for this key API method given the number of existing dependencies (internal and external).

However, there may be a workable alternative. The GroupManager also provides a search(String) method that may be preferable (and more efficient) when the given group name may not exist. In that case, the response for an unknown group name is an empty GroupCollection rather than the relatively expensive GNFE.
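For illustration, here is a minimal sketch of both lookup styles; the exact matching semantics of search() depend on the configured group provider, so the sketch filters the results for an exact name match:

    import java.util.Collection;
    import org.jivesoftware.openfire.group.Group;
    import org.jivesoftware.openfire.group.GroupManager;
    import org.jivesoftware.openfire.group.GroupNotFoundException;

    public class GroupLookup {
        // Exception-based lookup: every miss pays the cost of constructing
        // and throwing a GroupNotFoundException.
        static Group lookupWithException(String name) {
            try {
                return GroupManager.getInstance().getGroup(name);
            } catch (GroupNotFoundException e) {
                return null;
            }
        }

        // Search-based lookup: a miss simply yields an empty collection,
        // so no exception is thrown for unknown group names.
        static Group lookupWithSearch(String name) {
            Collection<Group> matches = GroupManager.getInstance().search(name);
            for (Group group : matches) {
                if (group.getName().equals(name)) {
                    return group;
                }
            }
            return null;
        }
    }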

I am not sure why Kraken is asking for so many unknown groups (haven’t tried Kraken yet), but perhaps this behavior could be isolated and refined to use the GroupManager search method rather than the getGroup method.

Don’t try Kraken. It’s a leaky plugin… We are behind GoJara, and that will be much better from a design point of view.