Server "freezes" on invalid presence types

Hey all,

I know that the "invalid presence type" problem has been fixed in release 2.2.0, but I'm curious whether that "warning" could be causing bigger problems in the server. We have 2.1.5 installed on a RHEL box (kernel version 2.4.21-27.0.2.ELsmp), with JVM version 1.5.0_02, Sun Microsystems Inc. Java HotSpot™ Server VM. The stack trace below shows a lot of the "invalid presence type" errors. I originally thought these were just "warnings" (as the comments in the fix seem to indicate), but after less than a month of running the server with these errors, we eventually get into a "semi-frozen" state.

The server is still running; however, some new connections are not fully accepted, and we are also unable to send messages to anyone (including admin broadcast messages in the admin console). In addition, trying to stop the server from the admin console or with the "messenger stop" command also fails; we have to kill -9 the JVM process itself. Other than the stack trace below, we are seeing nothing that would indicate what the problem is. Has anyone else seen this? Note: we average about 700-800 online users on a consistent basis. Thanks.


java.lang.IllegalArgumentException: No enum const class org.xmpp.packet.Presence$Type.invisible
    at java.lang.Enum.valueOf(Unknown Source)
    at org.xmpp.packet.Presence$Type.valueOf(Presence.java:308)
    at org.xmpp.packet.Presence.getType(Presence.java:93)
    at org.jivesoftware.messenger.PresenceRouter.handle(PresenceRouter.java:77)
    at org.jivesoftware.messenger.PresenceRouter.route(PresenceRouter.java:59)
    at org.jivesoftware.messenger.PacketRouter.route(PacketRouter.java:73)
    at org.jivesoftware.messenger.net.SocketReadThread.readStream(SocketReadThread.java:240)
    at org.jivesoftware.messenger.net.SocketReadThread.run(SocketReadThread.java:105)
2005.09.06 11:08:33 org.jivesoftware.messenger.PresenceRouter.handle(PresenceRouter.java:114)
Could not route packet
java.lang.IllegalArgumentException: No enum const class org.xmpp.packet.Presence$Type.invisible
    at java.lang.Enum.valueOf(Unknown Source)
    at org.xmpp.packet.Presence$Type.valueOf(Presence.java:308)
    at org.xmpp.packet.Presence.getType(Presence.java:93)
    at org.jivesoftware.messenger.PresenceRouter.handle(PresenceRouter.java:77)
    at org.jivesoftware.messenger.PresenceRouter.route(PresenceRouter.java:59)
    at org.jivesoftware.messenger.PacketRouter.route(PacketRouter.java:73)
    at org.jivesoftware.messenger.net.SocketReadThread.readStream(SocketReadThread.java:240)
    at org.jivesoftware.messenger.net.SocketReadThread.run(SocketReadThread.java:105)
2005.09.06 11:08:41 org.jivesoftware.messenger.PresenceRouter.handle(PresenceRouter.java:114)
Could not route packet
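
For anyone curious, here is a minimal standalone sketch (my own illustration, not the actual org.xmpp.packet.Presence code) of why a stray type like "invisible" produces that exception: Enum.valueOf() throws IllegalArgumentException for any string that is not a declared constant, so every such stanza logs a full stack trace. The class name and the lenient fallback are hypothetical, just to show the idea.

public class PresenceTypeSketch {

    // Subset of the presence types XMPP actually defines; "invisible" is not among them.
    enum Type { unavailable, subscribe, subscribed, unsubscribe, unsubscribed, probe, error }

    // Strict parse, which is what the stack trace above shows: unknown values throw.
    static Type parseStrict(String value) {
        return Type.valueOf(value); // "invisible" -> IllegalArgumentException
    }

    // Defensive alternative: map unknown types to error instead of throwing.
    static Type parseLenient(String value) {
        try {
            return Type.valueOf(value);
        } catch (IllegalArgumentException e) {
            return Type.error;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLenient("invisible")); // prints "error"
        System.out.println(parseStrict("invisible"));  // throws, like in the log above
    }
}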

Hey Guy,

Are you using JM 2.1.5 or 2.2.0? Either way, I don't think the semi-frozen state is related to the invalid presence type. The next time the server freezes, can you obtain a thread dump of the JVM? Run kill -3 on the Java process to obtain the thread dump and send it to me. This may be a db connection pool problem, but we will need the thread dump to confirm it.

Regards,

– Gato

Hey Gato,

We upgraded to 2.2.0 today in hopes that this would solve the problem. Everything is running ok now - it seems to happen most often after a couple of weeks… I'll keep an eye out for it and send you a thread dump if it happens again. Thanks.

-Guy

Ok, so, it happened again this morning. My co-admin was at the office before I was, and he did a kill -3 on the process ID but didn't see any thread dump (I checked stderr.log and stdout.log in /opt/jive_messenger/logs and didn't see it there either). Repeated attempts at kill -3 produced nothing, and he had to do a kill -9 to get the server stopped and restarted.

We are now running 2.2.0 with the 1.5.0_02 JVM. I upgraded to 2.2.0 with the .tar.gz package, not the RPM; I didn't think the upgrade to the _04 VM was that large a jump. Could this be the problem? Without a thread dump, I'm not sure how to proceed. Ideas? This is very frustrating, as Jive was working fine for a long time at version 2.1.5. We are trying to debug any environmental things that might have changed on the server, but so far we are coming up empty.

If this is a DB connection pool problem (note that our messenger DB shares the PostgreSQL engine with several other databases), is there anything else we can try configuration-wise? Extra VM args (we are already running with the -server arg, BTW)?

Thanks.

-Guy

Hey Guy,

Oops, I forgot to mention where the dumps are logged. You should find the thread dumps in /opt/jive_messenger/bin/nohup.out. Send me that file and I will get back to you with my analysis.

Regards,

– Gato

Hey Gato,

Arrgh… I actually looked there and didn't find anything, but now I realize why: I rolled my own jive-messengerd-like script (don't ask me why, stupid sysadmin move on my part), which redirects all output to /dev/null, so nohup.out has nothing from this morning's attempts at debugging.

I'm going to upgrade to the latest JVM tonight, per the suggested configuration for 2.2.0, and I'll make sure I fix the /etc/init.d script to use jive-messengerd, so that if I have to do a thread dump, it will be there. If it happens again, I should have a thread dump for you. Thanks.

-Guy

Post the thread dump here so Gato doesn't get all the fun.

Oh, most definitely, the thread dump will go here…

While I was configuring a test instance of the Jive server (our user base is mad enough at me for having our production server go down 3 times in the past 24 hours), I re-read some of the configuration options and noticed that the DB section of jive_messenger.xml has options that affect the DB connection pool (min/max connections). They are both currently set at 5; rough snippet after my question below.

Gato (and anyone else who wants to chime in): with nearly 800 online users at most hours of the day and night, should I be increasing these numbers? Could this be the cause of our mysterious hangs?
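
For context, this is roughly what the relevant section of our jive_messenger.xml looks like. I'm reproducing it from memory, so treat the element names and values as an approximation of our setup rather than an exact copy of the file:

<database>
  <defaultProvider>
    <driver>org.postgresql.Driver</driver>
    <serverURL>jdbc:postgresql://localhost:5432/messenger</serverURL>
    <username>jive</username>
    <password>*****</password>
    <!-- both pool settings currently at 5 -->
    <minConnections>5</minConnections>
    <maxConnections>5</maxConnections>
  </defaultProvider>
</database>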

Hey Guy,

I would say that 5 db connections for 800 users is quite low. That said, it depends on the kind of transactions those users generate. You should check the maximum number of connections your database can handle and the number of connections required by other applications accessing the same DBMS. Once you have that information, you can safely increase the maximum number of connections available to JM. The server will only create more connections as they are needed. A future version of JM may include a monitor of connection pool activity so you can easily fine-tune the pool to your needs.

Anyway, were the server freezes temporary or permanent? If they were temporary, you may simply have been running short of db connections. If the freeze was permanent, then we may have a connection leak somewhere. That's why I was asking for the thread dumps.
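
To make the distinction concrete, here is an illustrative example of the kind of leak I mean (the class, method, and table names are made up, not code from JM): if a connection is checked out of the pool and an exception skips the close() call, that connection never comes back, and once enough of them are lost every new request blocks on the pool and the server looks frozen even though the JVM is still running.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class LeakSketch {

    // Leaky version: close() is skipped whenever the query throws.
    static int countUsersLeaky(DataSource pool) throws SQLException {
        Connection con = pool.getConnection();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM jiveUser");
        rs.next();
        int count = rs.getInt(1);
        con.close(); // never reached if executeQuery() throws
        return count;
    }

    // Safe version: the connection always goes back to the pool.
    static int countUsersSafe(DataSource pool) throws SQLException {
        Connection con = pool.getConnection();
        try {
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM jiveUser");
            rs.next();
            return rs.getInt(1);
        } finally {
            con.close();
        }
    }
}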

Regards,

– Gato

Gato (and anyone else interested),

I have the thread dump for this, but I can't figure out how to attach it in this forum (I would paste it, but it is 3.7MB in size). Gato, I already emailed it to you, but I'm not sure it actually went out (stupid email servers at work).

Please let me know if you folks could take a look at this (and how I should get it to you here). It happened again after our server had been running for a month with an average online user count of about 1000. Min and max DB connections were set at 10 and 35, respectively.

After I captured the thread dump, I bumped those DB pool numbers up to 30 and 55 before restarting the server. Thanks.

-Guy

Hey Guy,

I sent you two replies to your email. Have you received them?

Regards,

– Gato

Grrr… stupid spam filters… let me look again… sorry…