Logins failing around 1000 concurrent users

I'm running Wildfire 3.2.2 and I'm consistently seeing user logins fail once I reach around 1000 concurrent users. We use Active Directory for authentication. Interestingly enough, logins to the admin console work fine, but the server seems to get into a state where no new users can log in, and if a user who is already logged in logs out, they can't log in again until I restart wildfired.

Now that I've reached this scale I'm having trouble sorting out what might be meaningful in the log files, because there are always logins failing from users fat-fingering their passwords, there are always TCP errors from clients going offline or losing their connections, etc.

For now I'm going to restart again, but tomorrow or the next day I'm surely going to be in this state again. What should I be looking at?

Which OS?

If you are on Linux and running Openfire as a non-root user, you are probably hitting a file handle limit. Edit /etc/security/limits.conf and up the “nofile” settings.
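For example, something like this in /etc/security/limits.conf (just a sketch; the "jive" account name is a placeholder for whatever user actually runs the server, and the values are only a starting point):

# raise the open-file limit for the account running Wildfire
jive soft nofile 16384
jive hard nofile 16384

Log that user out and back in afterwards, then confirm the new limit with ulimit -n.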

My guess is that you are on Windows, but it's worth a shot.

I'm running on RedHat Enterprise 3 (RHEL3); the kernel is 2.4.2x-something. I don't think file handles are an issue on modern (or semi-modern) Linux. If I log in to my host as the non-root user I run Wildfire as and do a ulimit -n, it returns 65535.

Also my fs.file-max in /etc/sysctl.conf is 1048576.
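For what it's worth, next time it wedges I can sanity-check actual handle usage against those limits (a diagnostic sketch; paths as on a stock 2.4 kernel, and <wildfire-pid> is a placeholder for the JVM's process id):

$ cat /proc/sys/fs/file-nr
(three fields: roughly allocated, free, and max handles system-wide)

$ ls /proc/<wildfire-pid>/fd | wc -l
(file descriptors currently held by the Wildfire JVM)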


Hi,

you could try producing some thread dumps (kill -3) when this happens. Does the JVM have enough free memory when this problem occurs? You can check this on the Server Settings page of Openfire.
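A rough sketch, assuming you started the server with nohup so stdout goes to nohup.out, and with <wildfire-pid> standing in for the JVM's process id:

$ kill -3 <wildfire-pid>
$ grep "Full thread dump" nohup.out

kill -3 sends SIGQUIT, which makes the Sun JVM print a full thread dump to stdout without terminating it. Taking two or three dumps a few seconds apart makes stuck or deadlocked threads easier to spot.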

LG

> I'm running on RedHat Enterprise 3 (RHEL3); the kernel is 2.4.2x-something. I don't think file handles are an issue on modern (or semi-modern) Linux. If I log in to my host as the non-root user I run Wildfire as and do a ulimit -n, it returns 65535.

Interesting. On my RHEL4.4 box:

$ ulimit -n

1024

Your symptoms just sounded so much like what I ran into.

I run an in-house customized Linux kernel that is tweaked quite a bit to support large-scale applications, so I hope it's not an OS or OS-config limitation, but you never know.

On the open chat yesterday the guys suggested I decrease xmpp.client.idle from the default of 30 minutes to something lower to address my problems. I waited until the server hung again and got a core dump, and now I'm running with xmpp.client.idle set to 5 minutes (300000 ms). Hopefully this will help.
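To see whether the lower idle timeout is actually reaping dead sessions, I'm counting established client connections roughly like this (assuming clients connect on the default port 5222):

$ netstat -tan | grep :5222 | grep ESTABLISHED | wc -l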

I have the core dump, but it is 38 MB. If there's somewhere I could upload it to…

Oops, forgot to reply to LG. Yes, the JVM has plenty of free memory when this problem is occurring.

Hi,

do you have a stacktrace / javacore file? It may be written into nohup.out.

A "core" file or a heap dump will likely not help, as it seems the threads are the problem.

There is an issue with HTTP-Bind (JM-1001) and one with Old-SSL (JM???); if you don't need these options, you may want to disable them just to make sure they don't cause trouble.
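Both can be switched off from the admin console; afterwards you can verify from a shell that the listeners are gone (assuming the default ports: 5223 for Old-SSL, and I believe 7070 for HTTP-Bind, so adjust if yours differ):

$ netstat -tln | grep -E ':5223|:7070'

If nothing comes back, neither service is accepting connections anymore.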

LG

Sorry I'm so slow; sometimes it takes me a couple of days to circle back to these things.

I do have HTTP binding turned on, but I don't really need it. I also have the old SSL method turned on, because I had a bunch of clients that couldn't, or thought they couldn't, connect with TLS, so the SSL users are about 18% of my connections.

For the really good events, it looks like I have a dump, the nohup.out output, and a crash log.

For now I think I'll flip off HTTP binding and upgrade to 3.2.3, and then see how that plays.