powered by Jive Software

Routing loops causing 100% CPU, plugins crashing

Hi,

one of my users accidentally found an critical bug, either in Openfire or the Kraken IM Gateway Plugin. However, it does also happen with pyICQt. I talked with Daniel Henninger (Kraken developer and former Openfire developer), he says:

I’m 99% sure that’s an issue with Openfire’s internal routing.
Since neither Daniel nor me have time to go more into this, I’m posting this issue here in public.

Setup (not sure if important)

  • there are two local, and one external user. I’m calling them userA, userB and userC.
  • userA and B are both using the ICQ transport and have subscribed each other via ICQ and via Jabber.
  • userC is an external ICQ user in mutual subscription with userA. In my case he is using an Transport on another server.

Steps to reproduce

Open XML direct input for userA, and send the following packet:

<message to="something.icq.yourserver.org" id="Test001">
<body>Test</body>
</message>

(where ‘icq.yourserver.org’ is your transports component JID)

visible Results

  • userA and B cannot send ICQ messages, but userA can receive messages from userC.
  • normal Jabber messages between userA and B are delayed. After one minute or so normal messages work normal again.
  • after an restart of Kraken everyone can talk to everyone again
  • Openfire takes 100% CPU.

**Conclusion
**

When using Raptor with Flood-Detection enabled, I didn’t detect that 100% CPU Problem. Probably there is somewhere an packet loop which is causing that high CPU load. Raptor does detect that flood and can break the loop by block the sender temporary. Also it can log the user who was causing the problem. (Actually Raptor was designed for exactly that use case.)

Since pyICQt is also delicately for this “attack” it’s possible it’s a bug in Openfire. However, since pyICQt was originally written by the same author, maybe it’s just the same bug.

I tried to write a simple “do nothing” plugin to reproduce this issue without Kraken. I had no success, it seems to be more complicated. Plugin source code as follows:

package org.jivesoftware.openfire.plugin; import org.jivesoftware.openfire.container.Plugin;
import org.jivesoftware.openfire.container.PluginManager;
import org.jivesoftware.util.Log;
import org.xmpp.packet.*;
import org.xmpp.component.*;
import java.io.File;
import java.util.*; public class Crash implements Component, Plugin {
    private ComponentManager componentManager;     public void initializePlugin(PluginManager manager, File pluginDirectory) {            try {
            componentManager = ComponentManagerFactory.getComponentManager();
            componentManager.addComponent("crash", this);                    }         catch (Exception e) { Log.error(e); }
        Log.info("Crash-Plugin started.");
    }             public void destroyPlugin() {
        Log.info("Crash-Plugin stoping.");
        try {
            componentManager.removeComponent("crash");
            componentManager = null;
        }
        catch (Exception e) {
            Log.error(e);
        }
    }     public synchronized void initialize(JID jid, ComponentManager componentManager) { }
    public synchronized void start() { }
    public synchronized void shutdown() { }
    public void processPacket(Packet packet) { Log.info("Crash-Plugin: \n" + packet.toString());  }
    public String getName() { return "Crash"; }     public String getDescription() { return "Bug demonstration plugin"; }
}

Other information (not sure if important)

  • Openfire 3.6.3
  • Kraken last SVN (from 2009-04-21)
  • MySQL database
  • CentOS 5.3 Linux
  • using Psi Jabber Client, 0.13-dev-rev2

Hey Cool-Cat,

Very good analysis. Were you able to get a thread dump of the JVM when the CPU goes to 100%? I’m curious what is going on.

– Gato

I’m really busy at the moment, it might take a while until I find time to create an thread dump.

I tried to reproduce your procedure, but I can’t. I’m not sure what I’m missing, my test environment:

  • Openfire 3.6.3

  • MySQL DB

  • Debian 5.0.1

  • latest Kraken Gateway Plugin from svn

  • Psi 0.12.1
    Other Openfire plugins are only:

  • DB Access

  • my Multi-User Gaming Plugins
    My Openfire reply to your message is I think a correct remote-server-not-found error. After a few tests I installed the Raptor plugin but without any changes. I can’t neither reproduce a high CPU load nor a delayed messages or problems between user A and B or ICQ.

Ok, I tried again:

  • Kraken SVN (Revision 58, 05/19/2009), build against Openfire 3.6.4

  • Openfire 3.6.4 (local installation)

  • embedded Database

  • no plugins except Kraken

  • fresh Kraken installation, just ICQ Transport activated with default settings

  • restart Openfire after user setup (user setup as above)

    • Kraken does update database version 9 to version 10. (don’t know why)
  • send message

<message to="something.icq.*********.org" id="test1" ><body>Test</body></message>

The first time I tried to reproduce this bug was on an small VM-Server with only 500 Mhz or so. Now I used my local machine. Since I have an Intel Core 2 Quad Q9300 running @ 2.0 Ghz the CPU load is not 100%. However, it’s much higher than normal.

I waited some time and got these output:

[coolcat@q9300 ~]$ ps -T -p 3474 -o time,pid,tid,pri | grep -v 00:00:00 | sort -r
    TIME   PID   TID PRI
00:01:54  3474  3726  19
00:01:33  3474  3730  19
00:01:32  3474  3729  19
00:00:18  3474  3482  19
00:00:04  3474  3727  19
00:00:04  3474  3485  19
00:00:03  3474  3484  19
00:00:03  3474  3480  19
00:00:02  3474  3479  19
00:00:01  3474  3504  19
00:00:01  3474  3478  19
00:00:01  3474  3477  19
00:00:01  3474  3476  19
00:00:01  3474  3475  19
[coolcat@q9300 ~]$ printf "0x%x\n" 3726
0xe8e
[coolcat@q9300 ~]$ printf "0x%x\n" 3730
0xe92
[coolcat@q9300 ~]$ printf "0x%x\n" 3729
0xe91

requesting thread dump:

[root@q9300 bin]# kill -3 3474

thread.dump (55336 Bytes)

Openfire was running while I was writing the above post, now it looks like this:

[coolcat@q9300 ~]$ ps -T -p 3474 -o time,pid,tid,pri | grep -v 00:00:00 | sort -r
    TIME   PID   TID PRI
00:03:30  3474  3726  19
00:02:50  3474  3730  19
00:02:49  3474  3729  19
00:00:33  3474  3482  19
....

Sorry, I forgot to mention:

In my first try I was not using Openfire 3.6.3. Instead I was using the 3.6.3 custom build by Daryl Herzmann. It was a hot fix for JM-1531, this password issue a month ago. Maybe the patch is causing these problems? Are there any other differences between 3.6.3 and Daryl’s version?

Hi Coolcat,

as far as I understand the gateway plugin it is not a remote server. I did not take a look at Kraken, but I guess that Openfire should not access it as a remote server. But in your javacore I see “org.jivesoftware.openfire.session.LocalOutgoingServerSession.deliver(LocalOutg oingServerSession.java:582)” - so I wonder whether Openfire tries to send your packet to a remote server.

LG

I checked some of the thread dumps and there is nothing weird that I can see going on. However, when I checked the GC activity log I see lots and lots of Full GCs. The Perm space is 99% used so no wonder the JVM is trying to make some room for new objects. Have you tried increasing the memory size and in particular making the perm space bigger? A heap dump would be the next step to take so we can figure out which objects are kept in memory and why they are not going away (a memory leak may be?).

– Gato

Have you tried increasing the memory size and in particular making the perm space bigger?
No, it’s all at default settings, because the server is for testing purposes only and has only 4 registered users.

A heap dump would be the next step to take so we can figure out which
objects are kept in memory and why they are not going away (a memory
leak may be?).
Since the server keeps running for at least 30 minutes (I stoped after that) and memory usage does not increase I don’t think it’s a memory leak. However, as I already said, I’m really busy at the moment. It might take again a few days until I find time to work on this again. Propably it’s faster if you try to reproduce it and create this memory dump yourself.