Wildfire 2.6.2 startup times

Hello all,

This may wander partly into a DB/OS tuning discussion, so forgive me, but it does have a fair amount to do with Wildfire.

I have a large installation (~9500 registered users, with upwards of 3100 online at any one time). I've tuned the Java VM options properly to handle this load, on a Linux box running RHEL 3 with 8GB of RAM and plenty of disk. We just upgraded to Wildfire 2.6.2 from 2.5.1, and are running against PostgreSQL 7.4.8 and authenticating against our corporate LDAP server.

The problem we have is that on server startup, we have a TON of clients who have "auto-reconnect" turned on. Because of some previous memory issues, I have a DB connection pool limit of 550 connections, so when we bring the server back online we see thousands of connection requests, and obviously some get blocked until a free DB connection is available.

Our machine sits with ~550 postmaster processes chugging away, and it takes a long, long time for anyone to log in. Once you do log in, it usually takes upwards of 15-20 minutes for your roster to show up in the client, obviously because of database slowness.

Yesterday we increased the max shmem segment size in the Linux kernel from 32MB to 128MB, and that appeared to help briefly. From what I can see from monitoring the system, this isn't a Java or Jive problem per se, as the Java process isn't taking up much CPU load. Do I have any options other than increasing the max # of DB connections in the pool?
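For reference, the two PostgreSQL settings in play here can be checked from any psql session (SHOW just reports the current values; the settings themselves live in postgresql.conf):

-- shared_buffers is what actually lives in that shared memory segment
SHOW shared_buffers;

-- the backend limit the Wildfire connection pool runs up against
SHOW max_connections;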

Is PostgreSQL the right DB to handle this kind of Jive load? Does anyone out there have a similarly large installation? Any pointers would be appreciated at this point. My users are about ready to shoot me… Thanks.

-Guy

Hi,

for me it sounds really evil to open 550 database connections. I don't want to know how many CPU-seconds PostgreSQL needs just to open them, and each of them uses some memory, so PostgreSQL has less memory left for its cache.

Wifi 3.0 offers Connection Managers, which may take some load off Wildfire itself, but your database may still suffer, so I can't recommend getting 3.0 just to fix this.

There is some code available that makes Wildfire handle the connection pool a little faster (about 10x), but as far as I know Gato did not include it in Wifi 3.0, so you will see no benefit from 3.0 regarding the DB connections. I assume you don't want to test this code (it compiles fine with Wifi 2.6.2 anyhow) in your environment.

LG

Thanks for the response, LG…

So, if Wifi 3.0 won't fix this, and opening that many DB connections is the problem, are you suggesting I decrease the max DB connections in the hope that it will handle fewer connections faster?

-Guy

Guy,

It sounds like we need to implement a few optimizations to handle this case. I'm guessing that the right caching or pre-loading logic could help a ton. Do you have any insight into which database queries are taking the longest, or which ones are being executed the most often? That would provide some clues as to the best place to optimize.

For example, if roster loading turned out to be a database hotspot, we could load roster data in blocks instead of one user at a time.

Regards,

Matt

Hi,

I wonder how many connections you are using during normal operation. You are running everything on one server, so monitoring the memory usage will not be as easy as if PostgreSQL were running on another server, but top (sorted by memory) may be your friend for checking this manually, and "ps -someoptions" within cron may be better for monitoring everything automatically. If you have much more "free"/cached memory during normal operation than during startup, then I'd decrease the max DB connections value.

You may have already read http://revsys.com/writings/postgresql-performance.html or similar articles: "max_connections = … Use this feature to ensure that you do not launch so many backends that you begin swapping to disk and kill the performance of all the children. Depending on your application it may be better to deny the connection entirely rather than degrade the performance of all of the other children."
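If you prefer counting the open connections inside the database instead of with ps, the statistics views can do it (assuming the stats collector is enabled, which is the default):

-- one row per backend process, so this counts the open connections
SELECT count(*) FROM pg_stat_activity;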

LG

Matt,

Yes, I would agree with your assessment - it is most likely an issue with roster data - thankfully, we do have logging turned on for Postgres. Here are the two queries that seem to be executing the most during our startup throes:

statement: SELECT groupName FROM jiveRosterGroups WHERE rosterID=$1 ORDER BY rank

statement: SELECT jid, rosterID, sub, ask, recv, nick FROM jiveRoster WHERE username=$1
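For anyone who wants to check these on their own installation: running them under EXPLAIN ANALYZE, with real values substituted for the $1 placeholders, shows whether the indexes are actually being used (1234 and 'someuser' below are just stand-ins):

EXPLAIN ANALYZE SELECT groupName FROM jiveRosterGroups WHERE rosterID=1234 ORDER BY rank;

EXPLAIN ANALYZE SELECT jid, rosterID, sub, ask, recv, nick FROM jiveRoster WHERE username='someuser';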

I'm really surprised no one else has complained about this… does that mean we get the title of largest installation?

Thanks.

-Guy

Thanks LG,

In checking, it appears that most of our connections in the pool are idle during normal operation (garnered from ps -eaf | grep "postgres: wildfire" and top). I'm only seeing about 4-5 connections in use at any one time, which makes sense once we are at a stable state.
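The same busy-vs-idle split can be pulled from psql, too - this relies on stats_command_string being enabled in postgresql.conf, otherwise current_query is not populated:

-- connections that are actually executing a statement right now
SELECT count(*) FROM pg_stat_activity WHERE current_query <> '<IDLE>';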

We've been through a lot of DB tuning parameters, and it appears that we've done about all we can in that regard. Depending on Matt's answer to my previous post, we'll determine what to do… I might decrease the max connections for now (we are at a stable state, so I'm not going to restart the server again today) in preparation for the next restart. Thanks.

Also - Gato, if you're listening - that 10x DB connection pool improvement in a production release would be awesome!

Thanks.

-Guy

Hi Matt,

"we could load blocks of roster data into memory at a time" sounds a lot like a read-ahead feature, with the risk of reading too much or the wrong information and thus decreasing performance.

Anyhow, filling the Wifi database cache very fast would be great. Currently the cached objects (or references to them) are stored only in memory, so Wildfire cannot use this information after a restart. It would be nice if it stored them either in a file or in the database, so it could restore the cache very quickly after startup.

LG

Guy,

Ok, here's the idea that's taking shape in my head. We already know the last login time of users (I think). It's probably a valid assumption that the users with the most recent login times will be the ones most likely to log in first when the server starts up (based on auto-reconnect, etc.). Therefore, we could have a new property "cache.readAheadUsers" or something. When set to a value of, say, 500, the server would read the data of those 500 users into cache before it starts accepting requests.

It should be possible to make the database queries that load the read-ahead users quite efficient - for example, a single query that loads all roster data using an IN clause. So, instead of thousands of queries, the server would only need to make a couple when it's first starting up.
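Roughly, instead of one roster query per user, a sketch of the batched version (the username list would come from the lastLogin ordering, and we'd group the returned rows by username in memory):

SELECT jid, rosterID, sub, ask, recv, nick
FROM jiveRoster
WHERE username IN ('user1', 'user2', 'user3');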

Does this seem like the right approach? The nice thing is that individual implementations will be able to tune the readAhead value. Or, maybe we can even auto-tune the read-ahead size based on how many users connect to your server? It would probably only take about a day to implement this logic.

Regards,

Matt

Hi Matt,

as far as I know, Wildfire currently tracks only "lastActivity", and only when a user logs out normally. While "lastLogin" should work just fine here, there are a lot of other caches where one can't use "lastLogin" to determine the last state of the cache. I'd prefer a logic which is usable for every cache.

LG

Ahh, ok. So, lastLogin would be something we need to add at the same time. Seems like a useful bit of information to store anyway.

I'm not sure what you mean with reference to the cache. What caches are you thinking of?

Regards,

Matt

Hi Matt,

let me try to explain it with an example:

The jiveVCard cache has 1 MB; it contains username and value pairs.

So one could dump either the whole cache (1 MB) or just the "references" (~50k) - in this case the usernames.

It may be enough to get and dump the "references" every 30 minutes (make it configurable), so the overhead is not too big.
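A hypothetical table for such a dump could be as simple as the following - nothing like this exists in Wildfire today, and the names are made up:

CREATE TABLE jiveCacheContent (
  cacheName VARCHAR(255) NOT NULL,
  cacheKey  VARCHAR(1024) NOT NULL
);

-- every 30 minutes: replace the stored key set for a cache
DELETE FROM jiveCacheContent WHERE cacheName = 'vcardCache';
INSERT INTO jiveCacheContent (cacheName, cacheKey) VALUES ('vcardCache', 'someusername');

-- on startup: read the keys back and pre-load those objects
SELECT cacheKey FROM jiveCacheContent WHERE cacheName = 'vcardCache';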

LG

I'm ok with however you and LG want to do this - I'll leave it to you guys to debate the various methods and implementation details.

However, I'd definitely vote for making it configurable - we are probably willing to put up with a slower server startup time (before it can accept connections) in order for the "perceived" startup time seen by the users to be relatively quick.

Being able to adjust that cache value depending on our situation would be ideal. Thanks.

-Guy

Hey Matt,

Any chance of getting a JM # for this so that we can track it (our growth curve continues to climb, and we are probably going to need this pretty soon)… I know you guys are swamped with a bunch of stuff, but at least if this is in the system, we can track it. Thanks.

-Guy

Guy,

I filed JM-764 - please feel free to add comments. I've started doing some initial profiling, and it looks like a user login requires about 12 database queries. That's actually down from about 15: I had already optimized away a few of the queries, for example as described in JM-762.

I'd like to get some more insight into which database queries are slow (if any). There are basically three scenarios I can think of:

  1. The sheer number of database queries issued when thousands of users log in takes a long time to process, even if no single query is expensive.

  2. We're not blocking on the database at all, but on some other part of the Wildfire code. For example, do you use LDAP?

  3. There are a few database calls that are very expensive.

I'm adding a database profiling tool to Wildfire that will help us answer these questions. In the meantime, it would be great if you could gather some more detailed information from your database. What's the ordering of the most common queries (top 20)? What's the average length of time those queries run, and the total time?
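On the PostgreSQL side, log_min_duration_statement should get you most of that. As of 7.4 it can be set in postgresql.conf, or per session by a superuser (please double-check the details against your version):

-- log every statement that runs longer than 250 ms, together with its duration
SET log_min_duration_statement = 250;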

Thanks,

Matt

Thanks Matt,

I'll add the queries and runtimes into that bug when we get them - we had to turn on full logging, and we'll be shutting down the server tomorrow to do some other unrelated systems work, so I'll see what the #'s look like once everyone's clients try to auto-connect.

To address your questions: yes, we do use LDAP for authentication, but watching the log files, we see the LDAP connections stream by very, very quickly - there don't appear to be many delays in authentication. As a matter of fact, the behaviour from the client end supports this - logging in and authenticating is relatively quick; it's the wait for the system to return your roster information that takes the longest, leaving users to think they are in some sort of purgatory - logged in, but unable to see any of their roster entries.

In one of my previous posts, I listed the two queries we are seeing most often - I'll try to have more detailed #'s after our restart this weekend. Thanks.

-Guy

Ok, a small update here - I still have to get that query data loaded up for you guys, Matt. However, I've been very busy (with Ryan Graham's help) getting our server stabilized by adjusting the cache sizes to meet our ever-increasing installed base.

At Ryan's suggestion, I'm posting these values here in hopes that they might help some other large Wildfire installation dealing with these issues. I've looked and have not been able to find a general "tuning/scaling" guideline (hint hint, Jive Software), so I hope someone can benefit from our "growing pains".

Additional background - we now support an installation with over 10,000 registered users, where our average online user count is ~3000, 24x7. I first noticed problems when our stats plugin (written by Ryan's company) started "hanging" at the same time every day. We'd ratcheted down our # of DB connections (per earlier advice in this thread) to try to stabilize things, but it turns out we were most likely starving the stats plugin by doing this. Ryan suggested tuning the cache values (which, BTW, is not intuitive - I'd vote for a cache tuning interface in the admin console pronto!). He pointed me to this tutorial by Gato:

http://www.jivesoftware.org/community/thread.jspa?messageID=119016&#119016

I'm happy to report that with the values listed below, our installation has stabilized in its day-to-day operations (we still take ~30-45 minutes to come up to a "stable state" after a restart and all of the thousands of re-connect requests from our clients, so that still needs to be addressed), and the stats plugin has started to behave nicely.

Our values:

  1. DB Connections (max): 330

  2. cache.userCache.size: 10485760

  3. cache.username2roster.size: 10485760

  4. cache.vcardCache.size: 5242880
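These are ordinary Wildfire system properties, so they can also be inspected straight from the database - assuming the stock jiveProperty table, something like:

SELECT name, propValue FROM jiveProperty WHERE name LIKE 'cache.%';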

So, let me reiterate my vote for a tuning guide for Wildfire - our team has had to struggle and be in reactive mode for the last several months as our usage #'s increased. We'd have preferred to know ahead of time which values we would likely have to adjust to meet our growth curve. Thanks to Ryan, we've got some clue to that now. Still, an official guideline from you guys at Jive would be really, really good. Thanks!

-Guy

Hi,

I agree that tuning the cache is not easy, as the available counters are not very helpful.

I miss a counter showing how often objects were purged from the cache because it was full, and how often even newly added objects had to be purged.

I did talk to Gato and created JM-693 as a note that someone should take a look at the cache classes and improve the performance of the cache itself. Especially with a lot of objects (where's the #objects counter?), the cache may slow down.

Do you have any hints or a small cache-tuning howto?

LG

Hi Guy,

Thanks for the kind words.

I actually found it very interesting to tune a Wildfire installation the size of yours. While working with you, the two cache-related items that really jumped out at me were:

  1. Bumping the Roster cache from 0.5 MB to 5 MB increased its effectiveness from barely 20% to nearly 80% (that percentage is probably higher now with the size set to 10 MB).

  2. Being able to reduce the number of database connections by 40% (550 -> 330) simply by allocating a total of ~25 MB to the User, Roster and vCard caches.

I think for a lot of installations little (if anything) has to be done when installing Wildfire beyond maybe increasing the memory you give it; I certainly wouldn't start fiddling with the caches unless the effectiveness rate was really low after Wildfire had been up and running for a while. But being able to view the various cache data in the Admin Console was extremely helpful in tuning Wildfire in Guy's situation.

If anyone else has done some tweaking to their Wildfire installation, feel free to share it here or in another thread.

Cheers,

Ryan