Java process using massive CPU (Linux / Wildfire 2.4.0)

I have a Wildfire 2.4.0 server that has been running since 9 January, when I installed it. Although it's running OK and messages are being passed around, the thing is sucking up TONS of CPU. I have a Red Hat GNU/Linux server. The process is sending the load average up to about 4. It's consistently taking over 70% of my CPU, and sometimes more like 85% or so; this is from top:

SIZE   RSS   %CPU   %MEM   TIME    COMMAND
113M   92M   80.2   9.1    4289m   java

The memory doesn't bother me much, but look at the time (4289 CPU minutes!!) and of course the %CPU is 80%, which is way high. I never had these kinds of issues with the Messenger (2.1.5) server I was using prior to this.

I only have about 700 users and 170 active sessions currently, so it's definitely not something that should be taking all that CPU.

I can, of course, restart the server (although that's annoying since it will interrupt everyone) but are there any ideas about what might be causing this, and/or any suggestions on where to look for more info so that a proper bug report could be created? Or, does this sound like something that's been fixed in 2.4.1 or 2.4.2?

Hi,

you did not mention the Xmx value you are using; maybe you are nearly out of memory. You could use jvmstat / VisualGC (http://java.sun.com/performance/jvmstat/) to connect to the JVM (no Wildfire restart necessary) and check this; there you'll also see how much time the JVM spends on garbage collection.
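Something like this works from the command line as well (a rough sketch - it assumes the JDK 5 tools are on your PATH and <pid> is whatever jps reports for the Wildfire process):

# list running JVMs and note the Wildfire pid
jps -l
# attach the VisualGC GUI to that pid (visualgc is a separate download from the jvmstat page)
visualgc <pid>
# or sample heap and GC utilization every 5 seconds without a GUI
jstat -gcutil <pid> 5000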

I would also add -verbose:gc to my start script to get lightweight memory profiling written to STDERR. We run every Java server with this option and there is no negative impact.
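In a start script it could look like this (a sketch only - the script and jar names here are assumptions, adjust them to whatever launches your server):

# add -verbose:gc to the java invocation, e.g. in bin/wildfire.sh,
# so every collection is logged with its duration and the memory freed
java -verbose:gc -jar ../lib/startup.jar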

I hope that you are not using the embedded database, are you?

LG

Sorry, I don't know what an Xmx value is. I do have memory available on this system, if that's what you mean. I found the jstat command (I'm using JDK 1.5.0) and I can print lots of values with cryptic headers that I have no idea what they mean :-).

If I go to the Wildfire console the section on “Environment” says: 48.71 MB of 63.31 MB (76.9%) used

I would be pretty surprised if there were any memory pressure here: the server has 1G of RAM and as I said, it's not heavily loaded (700 users / 180 or so sessions)

I'm not using the embedded DB: I'm using a MySQL DB.

Thanks for the note! Anyone else have any ideas about things to look for, or should I just restart the server?

Hi,

here (http://www.jivesoftware.org/community/thread.jspa?threadID=17442) is my win32 start script. You will have something similar to start Wildfire, but you did not set -Xmx64m as a JVM option. 64 MB is the default Xmx value, and it means Wildfire (the process in the JVM) must not use more than 64 MB of heap; if it needs more, the JVM throws an OutOfMemoryError to the Java application. You have 16 MB free, which could be enough or too little - very hard to tell.
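You can check what the running process was actually started with, for example (a sketch):

# show the full command line of the running Wildfire java process;
# if no -Xmx option appears there, the 64 MB default applies
ps -ef | grep java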

I mentioned VisualGC because GUI tools are usually more intuitive and I really like it for tuning.

LG

Well, I'm not seeing any OutOfMemory errors in the logs (like the one in the thread you mention), and as far as I can tell I'm only using 75% or so of my available memory, so I just don't see how this could be the issue.

Plus, the server is constantly, persistently busy (chewing 70%+ of the CPU at all times) but it does run correctly as far as I can tell… which just doesn't sound like an OOM type situation to me.

To my mind this feels more like a resource problem where a socket was opened and some thread is trying to manage it constantly or who knows what… but something that keeps the CPU busy doing something useless. Is there any way to debug the threads in the JVM or capture their status, to try to figure out what they're doing and why they're doing it so often?

I'll add a larger Xmx value just for future expansion but I don't think this is related to the issue I'm seeing.

Hi,

get the java pid and execute “kill -3 $PID”. This will write a stack trace to STDERR / error.log. Depending on how many threads the process has (I assume 500-2000), it can take a minute.
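For example (a sketch - <pid> is the Wildfire java process, and where the dump lands depends on where your start script redirects STDERR):

# find the pid of the Wildfire java process
ps -ef | grep java
# send SIGQUIT - the JVM keeps running and dumps all thread stacks to STDERR
kill -3 <pid>
# then read the file STDERR is redirected to (error.log in my setup)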

Just like during a garbage collection, all threads are stopped while the dump is written, so you will likely want to do this early in the morning.

You should try VisualGC and look at how often a GC is triggered and how long it takes. If it is triggered every second and runs for 500 ms, then 50% of your CPU is used for GCs - and this is something that really can happen.
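You can also put a rough number on it without the GUI (a sketch - <pid> is again the Wildfire process):

# sample cumulative GC time (the GCT column, in seconds) twice, 60 seconds apart
jstat -gcutil <pid> 60000 2
# if GCT grows by about 30 s across that 60 s window, roughly half the time is going to GC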

LG

OK, I'll try the kill -3 later tonight.

However, I did install visualgc and there definitely does appear to be some kind of GC issue! The GC Time graph in the Graph window shows that the process is doing GC almost constantly. The yellow bar is always present, with just a few minor dips to zero for basically a single interval (500ms). The server has been running for 7 days, 20 hours, and in that time it's done almost 18 million collections and spent over 17 hours doing GC. The GC time appears to go up about 1 second every 4-5 seconds of “real time”. The “Last Cause” shows “Allocation Failure” almost all the time, with a few scattered “No GC” messages.

I'm not familiar enough with the GC algorithm to be able to tell what good/bad values are for Old Gen, Perm Gen, and the three young gen spaces. However, Old Gen has a max of 56M and is currently using 51M, while Perm Gen has a current capacity of 16M and a utilization of 15+M. These seem like they are a problem. I don't see how the Jive console can report 44.85M of Java Memory used when visualgc shows 66M used…?? Also, I'm not getting any OutOfMemory messages in my error log, which seems odd.

Anyway, seems like you're right that this is a memory issue. I'll restart tonight and double the memory using -Xmx128m and see if that works better. I'll let you know…

Thanks!

Hi,

you could also set Xms to 80 MB and PermSize to 16 MB (both are initial values), so there is no need for the JVM to increase these sizes for a long time.

-Xms80m -XX:PermSize=16m should be fine (currently used 64 MB + 30% free = 80 MB)
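Put together with the other options, the java line in the start script could then look like this (a sketch - the jar path is an assumption, keep whatever your script already uses):

# initial heap 80 MB, max heap 128 MB, initial perm gen 16 MB, GC logging on
java -Xms80m -Xmx128m -XX:PermSize=16m -verbose:gc -jar ../lib/startup.jar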

Don't be surprised that you are not getting OOMs - as you said, the JVM frees some MB every 5 seconds because of an Allocation Failure. As long as it can free more memory than the allocation needs, no OOM will occur.

LG

Well, that seems to have taken care of the problem. My memory is much more sane now.

I think someone should file a bug against the Admin console, though: the “Java Memory” display on the Server Settings admin page is very nice… but it's also very inaccurate! This makes it worse than useless, unfortunately. For example, right now the display shows 50M used, but in fact if I use visualgc to check this process I see it has over 60M in use. It seems like it might be displaying the new gen and old gen values, but forgetting to add in the perm gen number, or something like that.
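For comparison, the individual pools are easy to read with jstat (a sketch; <pid> is the Wildfire process) - the console figure roughly matches the heap generations alone:

# capacities in KB: NGC (new) + OGC (old) make up the heap,
# PGC is the perm gen that the console display appears to leave out
jstat -gccapacity <pid>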

Thanks!