Clients cannot connect - must restart the service

Hi Everyone,

For the past few days, when people start coming into the office and signing in to the server, their clients sit there and never connect. There are various clients in use; Psi 0.9.3 is the most common, but some people are on 0.10 and some are using the latest version of Spark. The client doesn't appear to matter - nobody can log in. However, if I kill and restart the service, everyone gets right on!

When we first started using Jive Messenger we set up a cron job to kill and restart the service at 1 am because it would crash - this job is still active. I guess I could reschedule it to run around 7 am, but I would really like to know what the culprit is. When the issue began I was running Jive Messenger 2.1.2, but we are now on Wildfire 2.4.4 and the same issue occurs.

Clients all connect using SSL on port 5223, but I have tried switching to plain text on port 5222 and still cannot connect. If any more details are needed to help troubleshoot this, just let me know.

Thanks for the help!

Hey Joe,

My first comment is that there should be no need to restart the server every day. If you have to do that then something else is going on and we need to figure out what it is. Can you enable the debug log, and when this problem happens again post the errors that you see in the logs? I would also like you to try a telnet from a client machine to the server to confirm that ports 5222 and 5223 are listening. If telnet is able to connect then the ports are fine, in which case I'm going to need you to obtain a thread dump of the JVM. To get a thread dump of the JVM, execute kill -3 <pid> under Linux, or type Ctrl-Break in the console of the Java app on Windows; the dump will be printed to stdout (the console window).
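For example, from a client machine and then on the server itself, something like this (the hostname and process id are placeholders for your own):

telnet wildfire.example.com 5222
telnet wildfire.example.com 5223

ps aux | grep java    # find the Wildfire JVM's process id
kill -3 <pid>         # the thread dump is printed to the server's stdout/console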

Regards,

– Gato

Thanks so much for the quick reply, Gato.

I have enabled the debug log now and will try the telnet suggestion once the problem occurs. On most days I only have to restart the service when a rush of people is attempting to sign on, so I most likely won't have to do it again today. By "rush" I only mean around 10 people trying to sign on at once.

Unless the error occurs today, I'll be providing the info Monday morning.

Thanks again,

Joe

Hi,

May I ask whether you have more than 100 users and did not modify the Xmx value (max memory, default = 64 MB)? It's very easy to get OutOfMemory errors in that case; by looking at your old log files you should be able to tell us if this was happening.

LG

Hi LG,

We have around 30 users total. However, even with that many users we were seeing memory errors, so we bumped the maximum up to 256 MB. There are no errors pointing towards memory being an issue currently.

The server seems to be really stable; it has never crashed that I can remember. It just won't allow new clients to sign on during peak sign-on times. Since the debug log is enabled now, I'm hoping that the next time the problem occurs it will help reveal my issue to the experts.

Thanks for the suggestion, LG. Let me know if there is anything else I can check,

-Joe

Hey Joe,

Another thing to consider is increasing the maximum number of database connections that the pool may hold. It may be that all db connections were in use, so a bottleneck formed while sessions waited to get a db connection from the pool.

Anyway, if clients are still not able to log in during peak hours even after increasing the db pool size, then a thread dump will tell us the real reason for the problem. BTW, I have stress tests where more than 100 users log in at the same time, and the server was able to handle them properly. I just had to increase the db connection pool size as I mentioned.

Regards,

– Gato

Hi there Gato,

Sounds like a logical thing to try.

I looked around the admin console, the wildfire script, and a bit of the online documentation, but nothing stood out to me as a way to increase the db connection pool size. I know that to change the max memory allocation I had to use a custom parameter (-Xms256m), but I can't seem to find a documented parameter for the connection pool.

I apologize for my ignorance on the matter; if you could let me know how to change this, or give me a nudge in the right direction, I would appreciate it.

-Joe

Hi Gato,

Did you or Matt fix or file an issue for http://www.jivesoftware.org/community/thread.jspa?messageID=110219, or could this still be a cause of such problems? Actually the server should queue the requests if the pool has reached its limit, and not end up in a deadlock-like state.

LG

Ah - reading that thread made me realize I never mentioned what db I was running on!

We are using a Linux server running MySQL 5.0.15, not the embedded db. Could that issue still be relevant?

-Joe

Hey Joe,

You will have to edit the conf/wildfire.xml file to change the db pool size. In particular, you will have to modify the maxConnections element that is defined in the database element. Remember to restart the server after these changes. A future version of the server may add an option to the admin console for configuring the db pool so admins can do it more easily.
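For example, the relevant section of conf/wildfire.xml will look something like this, assuming the standard defaultProvider setup (the values below are only an example; yours will differ):

<database>
  <defaultProvider>
    <!-- ... connection settings ... -->
    <minConnections>5</minConnections>
    <maxConnections>25</maxConnections>
  </defaultProvider>
</database>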

Regards,

– Gato

Great!

I'll make that change this weekend and be back in touch with an update on Monday. Thanks again, Gato and LG.

-Joe

Hey LG,

I was not monitoring that thread. Since the pool has a max size of 25 connections (now) and there are 100 users trying to use the pool, it's logical to expect a bottleneck accessing the db pool, as shown in the thread dump. However, this situation should be resolved as soon as connections are returned to the pool. Do you know if the threads that were waiting finally got a connection from the pool? Or did they stay waiting for a connection forever?

Regards,

– Gato

Hi Gato,

I just tried this once again with the new code (v2.2.4, maxConnections = 25) with 50 concurrent clients. I also changed conTimeout to 10 minutes (default 3h).

If I run the load test with 20 users and then with 50 users, everything runs fine. Restarting Wildfire and running it with 50 users causes the problem. The stack trace and netstat information is the same as in the other thread. I waited one hour (== forever) and still got the same stack trace and netstat info.

Memory information: 30 MB (20 MB used, 10 MB free) of 64 MB allocated

After "forever" I restarted Wildfire.

LG

Hi Gato,

There seems to be one possible "deadlock" in RosterItemProvider.java, in getItems(), if n threads succeed in getting "con2" and then want to get "con". Generating some load I also hit these, but they are easy to fix.
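To make it concrete, here is a minimal standalone sketch (this is not the Wildfire code; the class is made up and a plain Semaphore stands in for the connection pool) of how threads that each hold one connection while waiting for a second one hang forever:

import java.util.concurrent.Semaphore;

// Toy example only: a "pool" of POOL_SIZE permits, where every worker
// checks out one connection (permit) and then asks for a second one.
public class PoolExhaustionSketch {

    private static final int POOL_SIZE = 2; // stands in for maxConnections
    private static final Semaphore pool = new Semaphore(POOL_SIZE);

    public static void main(String[] args) {
        for (int i = 0; i < POOL_SIZE; i++) {
            new Thread(() -> {
                pool.acquireUninterruptibly();         // first connection ("con2")
                try {
                    pool.acquireUninterruptibly();     // second connection ("con") - blocks forever
                    try {
                        // ... load roster items and groups ...
                    } finally {
                        pool.release();
                    }
                } finally {
                    pool.release();
                }
            }).start();
        }
        // Every permit ends up checked out and every thread is waiting for a
        // second one, so the program never terminates.
    }
}

Once every permit is checked out, no thread can ever release one, so the waiters never wake up; that is exactly the deadlock-like state described above.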

I inserted a printStackTrace() in getCon() to see where it blocks. After 150 kB of stack traces it was interesting to see that a transacted db connection is used, but this does not seem to be the reason for the locks:


java.lang.Exception
        at org.jivesoftware.database.ConnectionPool.getCon(ConnectionPool.java:406)
        at org.jivesoftware.database.ConnectionPool.getConnection(ConnectionPool.java:137)
        at org.jivesoftware.database.EmbeddedConnectionProvider.getConnection(EmbeddedConnectionProvider.java:54)
        at org.jivesoftware.database.DbConnectionManager.getConnection(DbConnectionManager.java:113)
        at org.jivesoftware.database.DbConnectionManager.getTransactionConnection(DbConnectionManager.java:134)
        at org.jivesoftware.wildfire.user.DefaultUserProvider.deleteUser(DefaultUserProvider.java:149)
        at org.jivesoftware.wildfire.user.UserManager.deleteUser(UserManager.java:137)
        at org.jivesoftware.wildfire.handler.IQRegisterHandler.handleIQ(IQRegisterHandler.java:221)
        at org.jivesoftware.wildfire.handler.IQHandler.process(IQHandler.java:48)
        at org.jivesoftware.wildfire.IQRouter.handle(IQRouter.java:256)
        at org.jivesoftware.wildfire.IQRouter.route(IQRouter.java:79)
        at org.jivesoftware.wildfire.PacketRouter.route(PacketRouter.java:65)
        at org.jivesoftware.wildfire.net.SocketReader.processIQ(SocketReader.java:395)
        at org.jivesoftware.wildfire.net.ClientSocketReader.processIQ(ClientSocketReader.java:50)
        at org.jivesoftware.wildfire.net.SocketReader.readStream(SocketReader.java:263)
        at org.jivesoftware.wildfire.net.SocketReader.run(SocketReader.java:119)
        at java.lang.Thread.run(Unknown Source)

LG

Fix for RosterItemProvider.java

public Iterator getItems(String username) {
    ...
    /* Commented out: this connection is now obtained a few lines further down,
       so that a thread never holds two pooled connections at the same time.
    con2 = DbConnectionManager.getConnection();
    gstmt = con2.prepareStatement(LOAD_ROSTER_ITEM_GROUPS);
    */
    // Load all the contacts in the roster
    con = DbConnectionManager.getConnection(); // this one may cause a deadlock
    pstmt = con.prepareStatement(LOAD_ROSTER);
    ...
    // Close the first connection and only then open the next one
    con.close();
    con = null;
    // Load the groups for the loaded contacts
    con2 = DbConnectionManager.getConnection();
    gstmt = con2.prepareStatement(LOAD_ROSTER_ITEM_GROUPS);
    ResultSet gs = null;
    ...
}

Possible problem in DefaultUserProvider.java in deleteUser():

con = DbConnectionManager.getTransactionConnection();
...
// Now a 2nd connection is requested from the pool while the transaction
// connection above is still checked out; if many users are deleted at once,
// this call will block.
try {
    VCardManager.getInstance().deleteVCard(username);
}
...

The only valid fix, IMHO, would be a cascaded delete using constraints. I think that all serious databases support this feature.
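For illustration only (the table, column and constraint names below are made up, not the actual Wildfire schema), a cascading constraint looks roughly like this:

-- hypothetical child table referencing the user table
ALTER TABLE someChildTable
    ADD CONSTRAINT fk_child_user
    FOREIGN KEY (username) REFERENCES someUserTable (username)
    ON DELETE CASCADE;

With such constraints in place, deleting the user row removes the dependent rows as well, so deleteUser() would not need a second db connection for the cleanup.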

Anyway, an invalid fix could be done in ConnectionPool.java: run() closes idle connections, but currently it does not call freeConnection() after doing so. My error: "free" (checkinTime) does not apply here; checkoutTime should be checked instead ... how long may a transaction take? Since nearly every db connection uses autocommit, such a check is ugly, because it cannot roll back the timed-out transactions before closing the connection.


Well, this morning I came in thinking I would have to restart the service, but everything was working. I hadn't even increased the maximum number of connections yet! Out of curiosity I went ahead and opened up the configuration file to see what maxConnections was set to. Here is an excerpt of my settings:

I would think that for 30 users, 40 connections would be an acceptable maximum. Should I change any of these settings?

-Joe

Hi Joe,

I think that this will also lead to a feature request. I already requested a purge cache button, and maybe server-db.jsp or index.jsp will soon display the number of connections in use.

Min/max = 5/10 connections is more than enough for 40 users. But if there are more ways to create a deadlock, then you can set maxConnections even to 100. The worst thing that happens is that your database slows down and consumes a lot of memory for every unnecessary connection.

Before changing the values you should try to get a thread dump while Wildfire locks up; maybe you have a completely different problem.

LG

Alrighty, I will leave these settings as they are for the time being and get the thread dump once the issue occurs. Although, now that I've reported it, the issue will probably NEVER occur again.

-Joe

A dirty hack which closes database connections with long-running transactions:

ConnectionPool.java, lines 275ff:

// go over every connection, check its health
for (int i = maxCon - 1; i >= 0; i--) {
    if (wrappers[i] == null) {
        continue;
    }
    try {
        long time = System.currentTimeMillis();
        /* TODO wrappers[i].lockTime */
        // roll back and recycle connections checked out for more than 60 seconds
        if (wrappers[i].checkedout && (time - wrappers[i].lockTime > 60 * 1000)) {
            synchronized (conCountLock) {
                cons[i].rollback();
                cons[i].close();
                wrappers[i] = createCon(i);
                wrappers[i].checkedout = false;
                wrappers[i].checkinTime = time;
                Log.error("Rollback connection " + i);
            }
            freeConnection();
        }
        synchronized (wrappers[i]) {
            // ... the original health check continues here ...

Hey guys,

Thanks for reporting this problem, and thanks LG for pointing out the problematic areas in the code. Today I checked in a fix for JM-566 that should solve the problem you were observing. The fix will be available in the next nightly build.

Keep in mind that the nightly build version includes a database modification, since privacy lists are now supported in Wildfire. That means that if you are going to use the nightly build you should not only use the new wildfire.jar file but also copy the resources\database\upgrade folders. Those folders contain the upgrade scripts that will be run automatically by the server to update the database structure. Since the db is going to be altered by Wildfire, I would recommend doing a backup before running Wildfire 2.5.0 (nightly build version).
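For example, since you are on MySQL, a quick backup before upgrading could be as simple as this (database name and user are placeholders for your own):

mysqldump -u wildfire_user -p wildfire_db > wildfire-backup-before-2.5.0.sql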

Regards,

– Gato

Hi Gato,

You did leave out the nice rollback hack. Maybe you could add properties like database.defaultProvider.checkTimedoutConnections and database.defaultProvider.checkTimedoutConnectionsLimit to handle this, so no one is forced to use it.

One other thing got my attention. This is the well-known code:

private synchronized ConnectionWrapper getCon() {
    ...
    // Path 1: an existing pooled connection is handed out; all four fields are set.
    wrapper.setConnection(cons[i]);
    wrapper.checkedout = true;
    wrapper.lockTime = System.currentTimeMillis();
    if ("true".equals(JiveGlobals.getXMLProperty(
            "database.defaultProvider.checkOpenConnections"))) {
        wrapper.exception = new Exception();
        wrapper.hasLoggedException = false;
    }
    return wrapper;
    ...
    // Path 2: a brand new connection is created; "checkedout" is not set here.
    ConnectionWrapper con = createCon(conCount); // line 410
    conCount++;
    return con;
    ...
}

So on one path "checkedout", "lockTime", "exception" and "hasLoggedException" are set; on the second path "checkedout" is not set, while the other values are set by createCon(). This leads to the problem that all methods which call getCon() must set "checkedout" themselves, which they currently do. I'd like to change them to:

public Connection getConnection() {
    ...
    ConnectionWrapper con = getCon();
    if (con != null) {
        /* commented out by LG - to be handled inside getCon() instead
        synchronized (con) {
            con.checkedout = true;
            con.lockTime = System.currentTimeMillis();
        }
        */
        return con;
    }
    ...
    con = getCon();
    if (con != null) {
        --waitingForCon;
        /* commented out by LG - to be handled inside getCon() instead
        synchronized (con) {
            con.checkedout = true;
            con.lockTime = System.currentTimeMillis();
        }
        */
        return con;
    }
    ...
}

Is it possible to change getCon() so that it sets all four values in every case, like this:

ConnectionWrapper wrapper = createCon(conCount); // line 410
conCount++;
wrapper.checkedout = true; // now set on this path as well
return wrapper;

LG