Load Testing - Shared Groups vs Rosters and Caching study

Hello

I am writing this short post to share what I observed while running some tests with Shared Groups and Rosters.

First of all, some general information:

I created two databases with 10K users, each with “large rosters” of 300 items per user. In the wftestR database, rosters were stored in the jiveroster and associated tables, while in the wftestG database, rosters were handled as Shared Groups using the jivegroup and associated tables.

I first started some Spark clients to evaluate the connection response time. It behaved much better with rosters than with shared groups (a few seconds vs. around one minute). Repeating the operation did not change anything for rosters but noticeably improved the shared groups case (down to a few seconds). I think this improvement comes from caching.

Someone in my company who had studied shared groups told me that initially all shared groups are loaded, which is what makes the response time so long. I have not had time to verify this, but to improve it I started writing a small plugin whose only purpose is to “load” all users, rosters and groups when the server starts. This improved the behavior significantly, but it also meant I had to study the caching.
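In case it is useful, here is roughly what the warm-up plugin does. This is a minimal sketch only: it assumes the Wildfire 3.x plugin API, and the class and method names (Plugin, UserManager, GroupManager, RosterManager.getRoster) are written from memory, so they may differ slightly in your version.

```java
package com.example.warmup; // hypothetical package name

import java.io.File;

import org.jivesoftware.wildfire.XMPPServer;
import org.jivesoftware.wildfire.container.Plugin;
import org.jivesoftware.wildfire.container.PluginManager;
import org.jivesoftware.wildfire.group.GroupManager;
import org.jivesoftware.wildfire.roster.RosterManager;
import org.jivesoftware.wildfire.user.User;
import org.jivesoftware.wildfire.user.UserManager;
import org.jivesoftware.wildfire.user.UserNotFoundException;

/**
 * Warm-up plugin: touches every group and every user's roster once at
 * startup so the corresponding caches are populated before clients connect.
 */
public class CacheWarmupPlugin implements Plugin {

    public void initializePlugin(PluginManager manager, File pluginDirectory) {
        RosterManager rosterManager = XMPPServer.getInstance().getRosterManager();

        // Loading all groups once populates the Group/MetaGroup caches.
        GroupManager.getInstance().getGroups();

        // Loading each user's roster populates the User and Roster caches
        // (for shared groups this is where the "conversion" cost is paid).
        for (User user : UserManager.getInstance().getUsers()) {
            try {
                rosterManager.getRoster(user.getUsername());
            }
            catch (UserNotFoundException e) {
                // Ignore users removed between the two lookups.
            }
        }
    }

    public void destroyPlugin() {
        // Nothing to clean up; the caches belong to the server.
    }
}
```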

In both cases (rosters or shared groups), what needed the largest cache was the rosters. I imagine there is a trade-off between shared groups and rosters: rosters stored in the database need little processing but a lot of memory, while shared groups need more CPU inside the WF server to be “converted” into rosters for the clients. I suppose the rosters generated from shared groups are then also stored in the roster cache.

I have not yet tuned the roster cache precisely, but for the same number of users I observed the following:

  • With shared groups, the User cache seems bigger than with rosters: 1.8M instead of 0.8M (not really critical).

  • The Group cache (shared groups only) needed around 16M.

  • The MetaGroup cache (shared groups only) needed around 0.9M.

  • The Roster cache is, as said, the largest. With rosters I was at around 100M, filled to about 94%, so I suppose I should go larger; with shared groups it was filled to about 85%. My first impression, though it may be wrong, is that the roster cache can be smaller with shared groups.

Does anyone have remarks or advice related to that?

Then I started some load testing to see how the server behaves in such a situation. Connecting around 6000 users went well with rosters: the load was quite high on the server and on the database, but the system worked fine. With shared groups everything behaved well too, but the connection delay and the time to get the rosters loaded the server much more heavily. Then, with around 6K users connected, it was as if the system was “saturated”: the load became really high, users could not connect anymore, and they got timeouts.

I used simple scenarios where users connect at a given rate, synchronize, then start to chat (light scenario) and change status (light scenario), and then disconnect.

I have not yet had time to study precisely where the load on the server comes from; I expect to have details in the coming days.

I don't know if these pieces of information are helpful, but large rosters can really increase the load on the server. If anyone has advice or suggestions to reduce the load and improve the response time, I am interested.

By the way, I heard about a caching project developed by Apache/Jakarta (JCS, if I am right). Have you heard about it? Do you think it could be interesting?

Rgds

Pascal

Hey Pascal,

Thanks for sharing this useful information with us. I think that what you are observing is what I imagined while implementing shared groups. Anyway, since you already have some nice stress tests it would be nice to profile and optimize the code where needed. Could you please send me your creation scripts so I can run those tests in my local environment?

Thanks,

– Gato

Some questions related to Caching:

In fact, I am trying to manage the trade-off between cache size and the memory remaining for other Java objects.

I may be wrong, but are objects duplicated in the cache?

If that is the case, and I have 2GB available with, let's say, 500MB dedicated to all caching, then only 1.5GB is left for the rest of the server. Am I right?

I have thought about WeakHashMap. Could it be useful for caching? Or what about using a memory-mapped file instead of a HashMap? That would reduce the Java heap usage and would probably still be much faster than going to the database.
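To illustrate the WeakHashMap question, here is a minimal sketch (plain JDK, nothing Wildfire-specific) of why a WeakHashMap alone may not behave like a cache: entries can vanish as soon as the key is no longer strongly referenced, whether or not memory is actually low.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheDemo {

    public static void main(String[] args) {
        Map<String, byte[]> cache = new WeakHashMap<String, byte[]>();

        // Use a key that is not interned, so the only strong reference is ours.
        String key = new String("roster-of-user42");
        cache.put(key, new byte[1024 * 1024]); // pretend this is a cached roster

        System.out.println("entries before GC: " + cache.size()); // 1

        // Drop the strong reference and suggest a collection: the entry may
        // now disappear even though we would have liked to keep it cached.
        key = null;
        System.gc();

        System.out.println("entries after GC:  " + cache.size()); // often 0
    }
}
```

Soft references (holding the values through SoftReference) are usually a better fit for a memory-sensitive cache, since the JVM only clears them under memory pressure, while a memory-mapped file (java.nio MappedByteBuffer) moves the data off the Java heap entirely but requires serializing the cached objects.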

What is your opinion about that?

I'm glad that I'm not the only one who'd like to see a cache in the file system. It's not only a memory issue but also a restart issue: a file system is persistent while memory is not.

LG

Hello, here is some news concerning the tests I performed:

Sorry for the “telegraphic” writing; if you want details, do not hesitate to contact me. This is fresh news, and I still face problems even if some things are getting clearer.

  • If I run load tests with users that have no shared groups or rosters, I can reach around 40000 clients without any error message. If I go further, I have problems: generally I get an EOFException on the server/CM side, while on the client side this shows up as a timeout. By monitoring CPU usage on both sides and running several tests with a small number of users, I concluded that the tsung client side was responsible for these problems, caused by the high load on the client machine.

Anyway, if I increase the shared group size, I always hit the same problem (EOFException), but with fewer users. With rosters of less than 150 items per user, I can almost reach 30000 users before problems start.

With rosters of around 300 items, the client generally starts generating errors at around 18000 users. However, if I keep increasing the number of items in the rosters, I generally overload the server. Tests are ongoing to estimate the threshold at which the server becomes the bottleneck. So far it is hard to say whether this is only the server or an interaction between client and server, because if I stop the clients abruptly (aborting tsung), a very high load suddenly appears on the server side (probably logging and things like that)… This is going to be investigated.

During these tests, caching was helpful to adjust the interarrival rate, reduce the load on the database and improve server responsiveness. With caches sized to several MB, I had better connection times and I could increase the interarrival rate.

I would like to thank Gato for his help and suggestions. Here is some additional information:

  • In my configuration I was not able to exceed an arrival rate of 130 users/s.

  • I played with the number of connections to the database, but it did not improve my results, because the problems generally appear once all my users are connected, chatting and exchanging presence messages; that is when the load becomes high, and the database is not involved in that phase.

So far I am fairly confident about the server/CM behavior with medium roster sizes, but what can happen with larger rosters is, for me, totally unpredictable. Processing time on the server side could become a bottleneck.

Concerning caching now: the cache improves the ability to connect users quickly. My main fear is a scenario where, after for example a network problem, 15K users suddenly try to reconnect simultaneously. I really think a cache that is not only in memory could be very helpful, especially for caches that can grow really large such as the roster cache. I had a look at JCS and, for me, it is great.

I also made a brief heuristic for the needed cache sizes; perhaps you could have a look and tell me what you think about it:

Let N be the total number of users in the database.

Let n be the number of users connected.

Let G be the number of groups (size of the jivegroup table).

Let ug be the average number of users per group (and let UG be the size of the jivegroupuser table).

The heuristic is a linear estimate of the usage of several caches. The evaluation domain was between 10K and 50K users, with group sizes between 0 and 400 items per user. The role of personal rosters has not yet been evaluated.

We can notice that we have the following formula:

UG = ug * G

User cache sizing

Total cache needed (size in Mb):

1Mb * (N/10000)

Real cache needed (size in Mb):

1Mb * (n/10000)

Meta Group cache sizing

Cache needed (size in Mb):

always less than 1.5Mb

Group cache sizing

Cache needed (size in Mb):

6Mb * (UG/10000) or 6Mb * ((ug * G) / 10000)

Roster cache sizing

Total cache needed (size in Mb):

40 * (CacheGroupSize) + ??(personal rosters)

Real cache needed (size in Mb):

40 * (CacheGroupSize) * (n/N) + ??(personal rosters)
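To make the heuristic concrete, here is a small sketch that plugs numbers into the formulas above. The input values are hypothetical and only illustrate the arithmetic; I read “CacheGroupSize” as the Group cache size computed by the group formula, and the personal-roster term (“??”) is left out since it has not been evaluated.

```java
/**
 * Plugs example numbers into the cache sizing heuristic above.
 * All input values are made up and only illustrate the arithmetic.
 */
public class CacheSizeEstimate {

    public static void main(String[] args) {
        double N  = 10000; // total users in the database
        double n  = 6000;  // users currently connected
        double G  = 100;   // number of groups (size of jivegroup)
        double ug = 100;   // average users per group

        double UG = ug * G; // size of jivegroupuser

        double userCacheTotal = 1.0 * (N / 10000);  // Mb
        double userCacheReal  = 1.0 * (n / 10000);  // Mb
        double metaGroupCache = 1.5;                // Mb, upper bound
        double groupCache     = 6.0 * (UG / 10000); // Mb

        // "CacheGroupSize" read as the Group cache size computed just above.
        double rosterCacheTotal = 40 * groupCache;           // Mb
        double rosterCacheReal  = 40 * groupCache * (n / N); // Mb

        System.out.printf("User cache:   %.1f Mb total, %.1f Mb in use%n",
                userCacheTotal, userCacheReal);
        System.out.printf("MetaGroup:    < %.1f Mb%n", metaGroupCache);
        System.out.printf("Group cache:  %.1f Mb%n", groupCache);
        System.out.printf("Roster cache: %.1f Mb total, %.1f Mb in use%n",
                rosterCacheTotal, rosterCacheReal);
    }
}
```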

Thanks for reading

If you have questions, remarks or need some explanations, feel free to contact me.

Rgds

Pascal

Hi Pascal,

I never ran a test with more than 1000 users, but I'm sure that performance will decrease a lot since the JID cache is limited to 1000 entries. It also contains a little bug (see JM-855), but that does not affect performance. I wonder if you would see better performance if you set the cache size to 50000 in src/java/org/xmpp/packet/JID.java.
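For reference, the kind of cache involved is just a bounded map keyed by the JID string. Here is a minimal sketch of such an LRU cache using only the JDK; it illustrates the idea and is not the actual code in JID.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Simple bounded LRU cache: once the configured maximum size is exceeded,
 * the least recently used entry is dropped. Raising the limit (e.g. 50000
 * instead of 1000) trades memory for fewer recomputations.
 */
public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxSize;

    public LruCache(int maxSize) {
        super(16, 0.75f, true); // access order, so "eldest" = least recently used
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }
}
```

Usage would simply be something like `Map<String, String> jidCache = new LruCache<String, String>(50000);`.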

LG

Have a look at ehcache if you are thinking about spooling the in-memory cache to disk. At the very least it will give you some ideas of what's involved and what type of features you need to support.
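For what it's worth, here is a minimal sketch of what disk overflow looks like with the ehcache 1.x API. The cache name, sizes and lifetimes are made up for the example, and in practice ehcache is usually configured through ehcache.xml rather than the programmatic constructor used here.

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class DiskOverflowCacheDemo {

    public static void main(String[] args) throws Exception {
        CacheManager manager = CacheManager.create();

        // Keep at most 10000 entries on the heap; anything beyond that
        // overflows to the disk store instead of being evicted outright.
        Cache rosterCache = new Cache(
                "rosterCache", // region name (made up for this example)
                10000,         // maxElementsInMemory
                true,          // overflowToDisk
                false,         // eternal
                3600,          // timeToLiveSeconds
                1800);         // timeToIdleSeconds
        manager.addCache(rosterCache);

        rosterCache.put(new Element("user42", "serialized roster for user42"));
        Element hit = rosterCache.get("user42");
        System.out.println(hit != null ? hit.getValue() : "cache miss");

        manager.shutdown();
    }
}
```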

Hello, I had heard about JCS but did not know about ehcache. Interesting. Anyway, I don't know what is planned.