Server hangs

jay_scott · June 9, 2010, 4:27pm

that’s a helpful subject, isn’t it? unfortunately, that’s about all i

have to offer.

OS= redhat 5.4

openfire 3.6.4

about a week ago the openfire server started hanging, and

hanging fairly often. today it’s hung twice in about 4 hours.

before that, the thing had run just fine for … months, anyway.

updated the jre to 1.6.0_20. i was getting

java.lang.OutOfMemoryError: Java heap space

errors fairly often. that has slowed (i think) but they still

occur. however, i was getting these before, too.

turned off server to server connections (firewall blocks 'em anyway)

this was the latest change. since the hangs are intermittent

can’t say whether this helped.

in /etc/sysctl.conf i entered

vm.overcommit_memory = 2

and rebooted. didn’t help.

oh, yeah. i need to ask this:

in the system settings, after i turned off server to server connections

i did not restart openfire. do i need to restart after such changes?

i’m constantly getting

java.lang.UnsupportedOperationException: VCard provider is read-only.

errors. but i read online (somewhere) that, annoying as they are, they

don’t cause real problems. i’d like to get rid of these errors all the

same. how?

thanks in advance. i’m relatively new at this, and haven’t had many

problems. so i don’t have much experience w/ this, beyond installing it.

j.

LG1 · June 9, 2010, 4:36pm

Hi,

do a “ps -ef | grep openfire” and verify the Xmx value of Openfire. I recommend that you create a GC log with “-XX:+PrintGCDetails -Xloggc:/tmp/gc.log”
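If the process line is long, the heap flags are easy to miss. A minimal sketch that picks them out of whatever command lines it is fed; the function name heap_flags is made up for this example, and it assumes the flags were passed on the java command line at all:

```shell
# Print the heap-size flags (-Xmx, -Xms) found on the command lines read from stdin.
heap_flags() {
    # One token per line, then keep only -Xms<size> / -Xmx<size> tokens.
    tr ' ' '\n' | grep -E '^-Xm[sx][0-9]+[kKmMgG]?$'
}

# Possible use against the live process list (piped, so ps does not cut lines short):
#   ps -ef | grep '[o]penfire' | heap_flags
```

If nothing is printed, the flags are not on the command line and the JVM is using its default heap size.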

You can try to decrease the thread stack size with “-Xss128k -Xoss128k -XX:ThreadStackSize=128”.

You could add “-Djava.net.preferIPv4Stack=true” unless you want to use IPv6.

You may need to set “-XX:MaxPermSize=128m”, but before doing this you should consult the gc.log file.

When the JVM does excessive garbage collection it looks like the server hangs, and you usually need a restart to “fix” this problem.

See also JVM Settings and Debugging

LG

jay_scott · June 14, 2010, 2:33pm

things have improved, anyway. it hasn’t hung in a while.

FWIW ps -ef | … did not show the Xmx value, i had to dig it

up by looking through startup and config files.

grrr! somehow there were some ashes of ipv6 left over, and

i also cleaned those up.

bottom line: i still can’t say for sure what the problem was.

but the problem, if not solved – can’t know that for

a long time – has at least improved.

thanks for the tip.

j.

jay_scott · June 14, 2010, 9:38pm

i spoke too soon. the server hung again just a few minutes ago.

from /etc/sysconfig/openfire:

OPENFIRE_OPTS="-Xmx1024m -XX:+PrintGCDetails -Xloggc:/tmp/gc.log -Xss128k -Xoss128k -XX:ThreadStackSize=128"

the tail end of gc.log looks like this. notice the last line IS ACTUALLY NOT COMPLETE. ie, that

short line (w/ no newline char) does not have the [Times…] fields on it at all. i hope that gives

someone an idea of where the server is hanging.

434753.310: [GC 434753.328: [DefNew: 149958K->11890K(159360K), 1.2445050 secs] 423465K->285397K(513136K), 1.2446270 secs] [Times: user=0.05 sys=0.01, real=1.24 secs]
439995.420: [GC 439995.498: [DefNew: 153586K->13090K(159360K), 2.3775710 secs] 427093K->290070K(513136K), 2.3892470 secs]
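To look at pause lengths and the time between collections across the whole log rather than by eye, a small awk filter can summarize each collection line. This is only a sketch, assuming the -XX:+PrintGCDetails format shown in the two lines above; the function name gc_pauses is made up here:

```shell
# Summarize GC log lines: timestamp, whole-collection pause, gap since the previous collection.
gc_pauses() {
    awk '
    /\[GC / {
        ts = $1; sub(/:$/, "", ts)             # leading timestamp, seconds since JVM start
        line = $0; sub(/ \[Times:.*/, "", line) # drop the [Times: ...] field if present
        pause = ""
        n = split(line, parts, " ")
        # The last "<number> secs]" on the line is the pause for the whole collection.
        for (i = 1; i < n; i++) if (parts[i + 1] == "secs]") pause = parts[i]
        gap = (prev == "") ? "n/a" : sprintf("%.0fs", ts - prev)
        printf "%s pause=%ss gap=%s\n", ts, pause, gap
        prev = ts
    }'
}

# Possible use:
#   gc_pauses < /tmp/gc.log
```

A line that was cut off before its [Times: ...] field is still summarized, as long as the pause figure itself was written.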

…

sigh. they just told me that the ldap server (an active directory server)

just got monkeyed with. could that have caused my problem? they tell

me more of this is about to happen. can i list more than one ldap (AD)

server? does openfire know how to fail over?

j.

jay_scott · June 14, 2010, 9:56pm

okay, i think i know what to put in for multiple ldap servers.

on the system properties page, the ldap.host value should

be something like this

dc1.fqdn.com dc2.fqdn.com

right? just a blank (or comma) list of fqdn’s?

sorry. this ought to be simple but the server has behaved

so pathologically over the last few weeks that i’d really

like to have my every move double-checked. so i’d

appreciate verification.

thanks in advance.

j.

LG1 · June 15, 2010, 8:42pm

Hi,

you may want to create a new thread for the new “just a blank (or comma) list of fqdn’s?” question.

Your GC log looks fine; 5000 seconds between the two garbage collections does not indicate a problem. One may consider decreasing the Xmx value so that garbage collection is faster. A delay of less than 500 ms would be fine; there are also JVM options to change the garbage collector type.

Did you specify “-server” on the command line? A “ps -ef | grep java | more” should list the complete line, without the “|” it may print out only 80 columns per line.

LG

jay_scott · June 15, 2010, 9:10pm

ps -ef | grep java
daemon 19334 1 0 Jun14 ? 00:00:57 /usr/java/jre1.6.0_20/bin/java -server -Xmx1024m -XX:+PrintGCDetails -Xloggc:/tmp/gc.log -Xss128k -Xoss128k -XX:ThreadStackSize=128 -DopenfireHome=/opt/openfire -Dopenfire.lib.dir=/opt/openfire/lib -classpath /opt/openfire/lib/startup.jar -jar /opt/openfire/lib/startup.jar

-server is there. FWIW.

it looks like my ldap.host value of

host1.com host2.com

is okay – at the very least it’s able to use the first value,

even if failover (not tested) won’t work.
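Since failover is untested, it may help to at least confirm which hosts the value breaks down into. A minimal sketch, assuming the ldap.host value is a comma- or whitespace-separated list as discussed above; the function name split_ldap_hosts is made up for this example and is not part of openfire:

```shell
# Split an ldap.host style value (comma and/or whitespace separated) into one host per line.
split_ldap_hosts() {
    printf '%s\n' "$1" | tr ', \t' '\n\n\n' | grep -v '^$'
}

# Possible use: list the hosts, then probe each one on the LDAP port
# (389 by default) with whatever tool is available on the box.
#   split_ldap_hosts "host1.com host2.com"
```

This only shows how the list would be read; it says nothing about whether the second host is actually tried when the first one misbehaves.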

having said that, it looks like the last set of “hang”

problems i was having was due to a pathological

active directory server (our ldap server). it’s possible

that even the first set was due to the rogue server.

i can’t say for sure because i don’t know when they

started testing that rogue server. when it went into

production lots of things fell from the sky, and that’s

when i found out they were monkeying w/ it.

thanks for your help, BTW. the thing seems stable

now – well, at least, today its poor behavior was

coincident w/ the rogue AD box. i’ve done nothing

to openfire today, and it seems openfire only behaved

badly while the rogue AD box was running.

so for the moment, i’m no longer accusing openfire.

j.