Clustered Openfire -- Issues That May Occur After Node Failure

wtf · February 16, 2010, 7:10pm

While load testing clustered Openfire I had a worrisome experience. Excessive load caused a node to run out of memory and become unresponsive, so that node’s Openfire process was forcefully killed. After that, a variety of problems appeared. When the node was brought back up, its log showed database unique-constraint errors (e.g. duplicate key…). Also, every node in the cluster threw a NullPointerException when Server->Statistics was accessed in the administration console.

This leads me to believe that if a cluster node dies suddenly, the persistent state (e.g. the database) may be left corrupt. Can anyone share information or experience with this issue?

Bea_Eagle · February 17, 2010, 5:34pm

Yes, I did see something similar with one defined user instance. I don’t have a lot of specifics except a backtrace. My workaround was to delete the user, and create a new instance with a different user name. Aside from that I did not find any other signs of corruption. Also, it was not necessarily due to a load test, but I am thinking (not 100% certain) that it was because the head node was rebooted or something. I suppose further tests can confirm.

2009.12.01 10:51:00 [org.jivesoftware.openfire.pubsub.PubSubPersistenceManager.createPublishedItem( PubSubPersistenceManager.java:1029)]
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: Duplicate entry 'userid@domain.com-urn:xmpp:avatar:metadata-3a697635e93' for key 'PRIMARY'
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:931)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2985)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
at com.mysql.jdbc.Connection.execSQL(Connection.java:3283)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1332)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1604)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1519)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1504)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.logicalcobwebs.proxool.ProxyStatement.invoke(ProxyStatement.java:100)
at org.logicalcobwebs.proxool.ProxyStatement.intercept(ProxyStatement.java:57)
at $java.sql.Wrapper$$EnhancerByProxool$$2bad8d59.executeUpdate()
at org.jivesoftware.openfire.pubsub.PubSubPersistenceManager.createPublishedItem(PubSubPersistenceManager.java:1024)
at org.jivesoftware.openfire.pubsub.PublishedItemTask.run(PublishedItemTask.java:72)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)

guus · February 17, 2010, 5:52pm

Hmm, that’s bad. We’ll have to look into that.

On a side note: the problems you’re experiencing are likely caused by what is described in this document: Openfires Achilles’ heel

wtf · February 18, 2010, 3:43pm

I’ve figured out the integrity constraint violation. Here’s an example of the errors I see:

2010.02.18 10:35:37 [org.jivesoftware.openfire.spi.PresenceManagerImpl.userUnavailable(PresenceManagerImpl.java:271)] Error storing offline presence of user: test14947
com.mysql.jdbc.exceptions.MySQLIntegrityConstraintViolationException: Duplicate entry 'test14947' for key 1

A simple look at Openfire’s source shows that table ofPresence allows one record per user. This record is added when the user becomes unavailable and is deleted when they become available again. Therefore, this error occurs because the user somehow became unavailable twice without becoming available in between. I’m not entirely sure how this can happen yet, but the good news is that the impact of this issue seems to be rather innocuous.
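To make the sequence concrete, here is a minimal in-memory sketch of the behaviour described above. It is not Openfire's code: the class OfflinePresenceSketch, the PresenceTable helper and the method bodies are hypothetical stand-ins, and a HashMap plays the role of a table with a primary key on the username. It only illustrates why a second "unavailable" event without an "available" event in between would trip the constraint.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: none of these names are taken from Openfire's source.
public class OfflinePresenceSketch {

    // Stand-in for a table whose primary key is the username (one row per user).
    static final class PresenceTable {
        private final Map<String, Long> rows = new HashMap<>();

        // Mimics an INSERT that fails when the key already exists.
        void insert(String username, long offlineSince) {
            if (rows.containsKey(username)) {
                throw new IllegalStateException(
                        "Duplicate entry '" + username + "' for key 1");
            }
            rows.put(username, offlineSince);
        }

        // Mimics a DELETE; removing a missing row is a no-op.
        void delete(String username) {
            rows.remove(username);
        }

        boolean hasRow(String username) {
            return rows.containsKey(username);
        }
    }

    // The user goes offline: store the single offline-presence row.
    static void userUnavailable(PresenceTable table, String username, long now) {
        table.insert(username, now);
    }

    // The user comes back online: remove the offline-presence row.
    static void userAvailable(PresenceTable table, String username) {
        table.delete(username);
    }

    public static void main(String[] args) {
        PresenceTable table = new PresenceTable();

        // Normal cycle: unavailable, available, unavailable again.
        userUnavailable(table, "test14947", 1L);
        userAvailable(table, "test14947");
        userUnavailable(table, "test14947", 2L);

        // Problem case: a second "unavailable" with no "available" in between.
        try {
            userUnavailable(table, "test14947", 3L);
        } catch (IllegalStateException e) {
            System.out.println("Error storing offline presence: " + e.getMessage());
        }
    }
}
```

In this model the failed insert leaves the existing row untouched, which matches the impression that the error is mostly harmless: the user still has exactly one offline-presence record.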

Still not sure about the malfunctioning of the admin console after cluster node failure…stay tuned – or assist if possible

akrherz · February 18, 2010, 5:24pm

Howdy,

This issue is found here: OF-270

If you’re able to figure it out and come up with a patch, please let us know!

daryl