Wildfire connection manager login stress test (over 10k users) problems

I have a test case in which a large number of users (about 20k to 50k) log in to the Wildfire server through the connection manager.

The environment is:

server1: Wildfire - Java5/Linux/MySQL 5.0.22 InnoDB

server2: my test code + Connection Manager (in the same VM) - Java5/Linux

LAN: 1G ethernet

My login program is very simple:

  1. session create

  2. auth client

  3. bind

  4. set session

  5. set presence status

I've created a thread pool that runs 50 login threads (the Java 5.0 thread pool) to prevent Wildfire from getting too many requests at the same time.
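A minimal sketch of that kind of throttling, for readers who want to reproduce the test. The `login` callback is a hypothetical stand-in for the five login steps listed above; only the `java.util.concurrent` calls are real API, and the code is written for a current JDK rather than Java 5.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

final class LoginLoadSketch {
    /**
     * Runs `users` logins with at most `threads` of them in flight at once.
     * `login` is a hypothetical stand-in for session create / auth / bind /
     * set session / set presence. Returns how many logins finished without throwing.
     */
    static int runLogins(int users, int threads, IntConsumer login) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger succeeded = new AtomicInteger();
        for (int i = 0; i < users; i++) {
            final int userId = i;
            pool.execute(() -> {
                try {
                    login.accept(userId);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    // A failed login is simply not counted in this sketch.
                }
            });
        }
        pool.shutdown(); // no new tasks; queued logins still run
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return succeeded.get();
    }
}
```

The fixed pool size caps concurrency on the client side only; it says nothing about how the server schedules the work it receives.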

My problem is:

The login speed gets slower and slower: from 200 users/sec to 100/sec, then 50/sec, and eventually 1 user/sec (by which point about 30k users had logged in).

The problem is unlikely to be in my code or the connection manager:

  1. I have checked all the log files for Wildfire and the connection manager and found no error packets.

  2. Every logged-in user can be found in Wildfire's admin console.

  3. After login, my send/receive message program runs without problems. The chat clients send message packets to each other, and the send/receive rate stays at 5,000 to 8,000 packets/second after running for 30 minutes.

  4. I've checked the client code and connection manager in JProfiler and didn't find any problems.

  5. The Wildfire and Connection Manager code was checked out from svn about two weeks ago.

At first I thought the database could be a bottleneck. I checked MySQL's log and found that every time a user logs in, Wildfire writes an update to jiveUserProp.

I commented out that code in Wildfire and ran the test again, but the problem remained.

So I think there may be something in Wildfire preventing me from logging in, perhaps a synchronization or deadlock problem? Can anyone who is familiar with Wildfire's code give me any tips? Multithreaded code is very difficult to debug, and it's hard for me to find the problem myself.

Thanks.

Tim

Hi Tim,

do you also write a GC log (options -XX:+PrintGCDetails -Xloggc:)? As the JVM uses more memory it may expand the heap and run long garbage collections. Depending on your available memory, that could explain part of the slowdown. Does your server2 have at least one idle CPU while you run this test?

Could you run 2 (or 5) connection managers that each log in 25 (or 10) users at a time, with one of the CMs running on another machine (server3), to make sure that Wildfire is the bottleneck and not a CM?

LG

Thanks for the reply, LG.

I distributed my test program across two Linux boxes, each with a connection manager, but the speed doesn't seem to improve.

Both login modules became slow as more users logged in (between 10k and 20k).

And no strange records could be found in Wildfire's GC log; most records look like these:

1031.218: [GC PSYoungGen: 321672K->10985K(333312K) 742546K->434516K(1032384K), 0.0375150 secs]

1032.624: [GC PSYoungGen: 328297K->12148K(329472K) 751828K->443122K(1028544K), 0.0456250 secs]

1034.073: [GC PSYoungGen: 329460K->7628K(333376K) 760434K->444910K(1032448K), 0.0387790 secs]

1035.497: [GC PSYoungGen: 324556K->15893K(332864K) 761838K->459544K(1031936K), 0.0547000 secs]

1036.963: [GC PSYoungGen: 332821K->8887K(331584K) 776472K->464159K(1030656K), 0.0533290 secs]

1038.372: [GC PSYoungGen: 323831K->5092K(320064K) 779103K->466344K(1019136K), 0.0392540 secs]
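To put numbers on log lines like these, here is a small sketch that sums the pauses. The regular expression is an assumption fitted to the lines exactly as quoted in this thread, not a general parser for every GC log format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcLogSketch {
    // Matches lines shaped like "<uptime>: [GC ... , <pause> secs]" as quoted above.
    private static final Pattern LINE =
            Pattern.compile("^(\\d+\\.\\d+): \\[GC .*, (\\d+\\.\\d+) secs\\]$");

    /** Returns {totalPauseSeconds, maxPauseSeconds, uptimeSpanSeconds}. */
    static double[] summarize(String[] lines) {
        double total = 0, max = 0;
        double first = Double.NaN, last = Double.NaN;
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that is not a minor-GC line
            }
            double uptime = Double.parseDouble(m.group(1));
            double pause = Double.parseDouble(m.group(2));
            if (Double.isNaN(first)) {
                first = uptime;
            }
            last = uptime;
            total += pause;
            max = Math.max(max, pause);
        }
        double span = Double.isNaN(first) ? 0 : last - first;
        return new double[] {total, max, span};
    }
}
```

Fed the six lines above, this gives roughly 0.27 seconds of pause across about 7 seconds of uptime, with no single pause above 55 ms, which is consistent with GC not being the bottleneck here.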

Both the Wildfire and connection manager servers have enough memory (4G total, Xmx=1024M for Java, which I think is enough).

CPU usage on the Wildfire server is 80%~99%.

This is the result I got in the morning.

In the afternoon, I ran Wildfire under JProfiler to see whether there was a bottleneck there, and got some interesting results.

I started Wildfire with JProfiler on my Windows machine and ran the test program from another Linux box. After it had run for a while, I checked JProfiler's CPU view for the top CPU consumer; here is the snapshot:

The root cause appears to be org.jivesoftware.wildfire.canFloodOfflineMessages():

  1. It calls SessionManager.getSessions()

  2. SessionManager.getSessions() calls copyUserSessions(allSessions);

  3. Copying the user sessions is time-consuming; as the number of sessions grows, it takes a long time.

So every canFloodOfflineMessages() call takes a long time when there are lots of users online.

I modified the method to simply return true for every canFloodOfflineMessages() call.

There is another top CPU method:
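To see why this makes logins slow down progressively, here is a toy model. It assumes (this is my simplification, not something taken from the Wildfire source) that each login triggers exactly one full copy of the session list.

```java
import java.util.ArrayList;
import java.util.List;

final class CopyCostSketch {
    /**
     * Counts how many session references are copied in total when each of
     * `logins` consecutive logins adds one session and then takes one
     * defensive copy of the whole list. Integers stand in for sessions.
     */
    static long copiedReferences(int logins) {
        List<Integer> allSessions = new ArrayList<>();
        long copied = 0;
        for (int i = 0; i < logins; i++) {
            allSessions.add(i);                                    // the new session
            List<Integer> snapshot = new ArrayList<>(allSessions); // copy made for the check
            copied += snapshot.size();
        }
        return copied;
    }
}
```

Under that assumption the total is n(n+1)/2 copied references for n logins, so the cost of each additional login grows linearly with the number of users already online. At 30,000 users that would be about 450 million references copied in total.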

In class org.xmpp.packet.JID, the init() method seems to have a problem.

In the method

  void init(String node, String domain, String resource),

this.node = Stringprep.nodeprep(node);

Stringprep.nodeprep is very slow. I don't know what it is doing there; does it just escape UTF-8 characters?

The node string is the Jabber username here. In my system all usernames are plain ASCII characters, so I don't need to run it at all.

I just changed it to:

this.node = node;
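A less drastic variant of the same idea is to skip the expensive call only when the input cannot be changed by it. The sketch below assumes that a node made solely of lowercase ASCII letters, digits, '.', '-' and '_' passes through nodeprep unchanged (no case folding, no normalization, no prohibited characters); everything else still goes through the full routine. The `fullNodeprep` parameter is a hypothetical hook for the real implementation, not an existing Wildfire API.

```java
import java.util.function.UnaryOperator;

final class NodeFastPathSketch {
    /** True if the node is non-empty and uses only a-z, 0-9, '.', '-' and '_'. */
    static boolean isPlainAscii(String node) {
        if (node.isEmpty()) {
            return false; // let the full routine decide what to do with empty input
        }
        for (int i = 0, n = node.length(); i < n; i++) {
            char c = node.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
                    || c == '.' || c == '-' || c == '_';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    /** Returns the node unchanged on the fast path, else delegates to the full routine. */
    static String prepare(String node, UnaryOperator<String> fullNodeprep) {
        return isPlainAscii(node) ? node : fullNodeprep.apply(node);
    }
}
```

Unlike dropping nodeprep entirely, an uppercase or non-ASCII username still gets normalized, so two spellings of the same account cannot end up as different JIDs.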

I rebuilt Wildfire and my test program ran several times faster. But the login speed was still decreasing, and I found another top CPU method in Wildfire:

This time about 5% of the CPU goes to network traffic, which is the result I expect, but another Stringprep method eats lots of CPU.

Most of the time is spent in a loop like this, where s.length() is StringBuilder.length():

for (int j = 0; j < s.length(); j++) {
    if (f <= s.charAt(j) && t >= s.charAt(j)) {
        return true;
    }
}

s doesn't change inside the loop, so I changed the code to

for (int j = 0, n = s.length(); j < n; j++)…

The speed improved after the above modification. The current speed is 50~200 concurrent user logins/sec, and I never again saw a result like “only 1 concurrent user/sec”. The result is acceptable but not ideal. Suppose there are 50k online users in a system: after a server restart, many users can't log in immediately. At 50 users/sec, we need about 17 minutes (50,000 / 50 = 1,000 seconds) to get all 50k users back into the system.

The speed still decreases over time, though, and I need some more time to fight with it.

Message was edited by: timyang

Hi Tim,

I'm quite sure that Gato will not like skipping the nodeprep part during authentication, as it is a must-have.

canFloodOfflineMessages() - is this correct code and I missed something, or is it a bug?

public boolean canFloodOfflineMessages() {
    for (ClientSession session : sessionManager.getSessions()) {
        if (session.isOfflineFloodStopped()) {
            return false;
        }
    }
    return true;
}

sessionManager.getSessions() gets all sessions, not only the ones that match the bare JID of the connected user. (I'm using … to display code.) So if one user disables it, no one will receive offline messages.
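A sketch of what a check scoped to one user could look like. All names here (the `Session` class, the bare-JID index, the method signatures) are hypothetical and only illustrate the idea; this is not how Wildfire's SessionManager is actually structured.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class PerUserFloodCheckSketch {
    /** Hypothetical minimal session: only the flag this check needs. */
    static final class Session {
        final boolean offlineFloodStopped;

        Session(boolean offlineFloodStopped) {
            this.offlineFloodStopped = offlineFloodStopped;
        }
    }

    // Hypothetical index: bare JID -> that user's sessions (one per resource).
    private final Map<String, List<Session>> byBareJid = new ConcurrentHashMap<>();

    void add(String bareJid, Session session) {
        byBareJid.computeIfAbsent(bareJid, k -> new CopyOnWriteArrayList<>()).add(session);
    }

    /** Looks only at the sessions of the given user, never at the global list. */
    boolean canFloodOfflineMessages(String bareJid) {
        List<Session> own = byBareJid.get(bareJid);
        if (own == null) {
            return true; // no sessions for this user, so nothing has stopped the flood
        }
        for (Session session : own) {
            if (session.offlineFloodStopped) {
                return false;
            }
        }
        return true;
    }
}
```

With a lookup like this, the cost of the check depends on how many resources one user has connected rather than on the total number of users online, and one user's setting cannot affect anybody else.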

Looking at the contains() methods, I really wonder why they use loops when Java offers regular expressions. I assume one could increase the speed a lot by changing both methods to use prebuilt patterns, with RFC 3454 as a reference.
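For comparison, here are both styles side by side: the range loop quoted earlier in the thread with the length hoisted, and a precompiled pattern. The one range used for the pattern is the ASCII control characters (U+0000 to U+001F plus U+007F) from RFC 3454 table C.2.1. The method names are made up for this sketch, and whether the regex version is actually faster is untested here and would need profiling.

```java
import java.util.regex.Pattern;

final class RangeCheckSketch {
    /** Loop form: true if any char of s lies in [from, to]. */
    static boolean containsInRange(CharSequence s, char from, char to) {
        for (int j = 0, n = s.length(); j < n; j++) {
            char c = s.charAt(j); // read each char once instead of twice
            if (from <= c && c <= to) {
                return true;
            }
        }
        return false;
    }

    // Built once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern ASCII_CONTROL =
            Pattern.compile("[\\u0000-\\u001F\\u007F]");

    /** Regex form: true if s contains any ASCII control character. */
    static boolean containsAsciiControl(CharSequence s) {
        return ASCII_CONTROL.matcher(s).find();
    }
}
```

One design point in favor of patterns is that several ranges can be merged into a single character class, so the input is scanned once rather than once per range.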

I'm a little disappointed that you did not complain about the database. During login a lot of database connections are established. Do you monitor the duration of getConnection()?

I modified the existing connection pool half a year ago to be 16x faster; with a lot of database connections it gets 50x faster than the current code. I'm not sure it can be used within Wildfire 3.1 as-is, since it was written so long ago. You can download it from http://it.ma.cx/ConnectionWrapper.zip if you want.
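One way to monitor that duration is a small timing wrapper around whatever call hands out a connection. This is a generic sketch: the class and its methods are hypothetical, and the wrapped call is passed in as a `Supplier` so the example runs without a database.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

final class AcquireTimerSketch {
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    /** Runs `acquire`, records how long it took, and returns its result. */
    <T> T time(Supplier<T> acquire) {
        long start = System.nanoTime();
        try {
            return acquire.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            calls.incrementAndGet();
            totalNanos.addAndGet(elapsed);
            maxNanos.accumulateAndGet(elapsed, Math::max);
        }
    }

    long calls() {
        return calls.get();
    }

    double meanMillis() {
        long n = calls.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1e6 / n;
    }

    double maxMillis() {
        return maxNanos.get() / 1e6;
    }
}
```

In a real test the supplier would wrap the pool's connection call and rethrow any SQLException as an unchecked exception. A mean or maximum that climbs as more users log in would point at the pool rather than at the session code.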

LG

Tim,

Thanks for the detailed analysis. We have created two Jira issues so far to deal with your findings:

JM-842

JM-843

Fixes have been checked in and they will be available in the next nightly build. I have yet to look into the node prep bottleneck(s) you have found but will do so soon.

Thanks,

Alex

Hi,

just a quick thought without looking at the code:

Does the connection manager do stringprep/nodeprep?

if no: Would this be an option to take load off Wildfire?

if yes: Why does Wildfire do the same again?

LG

Hi,

I checked out the latest code from Wildfire svn and found that ClientSession.canFloodOfflineMessages() has been refactored.

I ran the test again; the login speed was stable with no slowdown, which is great.

I also downloaded LG's connection pool patch, and the speed is 10%~20% faster in my environment. Thanks, LG, for your code.

Now that the speed is stable, I can test with more users. The average login speed on my server is about 500 users/second, and I have logged lots of users in to Wildfire through the connection manager.

After 200K users logged in:

Time elapsed: 424 sec

Connection Manager memory: 305.31 MB of 910.25 MB (33.5%) used

Wildfire memory: 1775.05 MB of 1925.31 MB (92.2%) used (Xmx=2G)

After 300K users logged in:

Time elapsed: 624 sec

Connection Manager memory: 329.93 MB of 910.25 MB (36.2%) used

Wildfire memory: 2605.55 MB of 2831.94 MB (92.0%) used (Xmx=3G)
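As a rough cross-check of these figures, the arithmetic can be written out as below. Treat the memory number as a ballpark only: the two measurements use different Xmx settings, so they presumably come from separate runs, and "used" heap includes garbage that has not been collected yet.

```java
final class LoginRunMath {
    /** Average login rate over a whole run. */
    static double usersPerSecond(long users, double seconds) {
        return users / seconds;
    }

    /** Extra heap in KB per additional user between two measurements given in MB. */
    static double marginalKbPerUser(double mbBefore, double mbAfter, long extraUsers) {
        return (mbAfter - mbBefore) * 1024.0 / extraUsers;
    }
}
```

With the numbers above, `usersPerSecond(200_000, 424)` and `usersPerSecond(300_000, 624)` come out at roughly 472 and 481 users per second, and `marginalKbPerUser(1775.05, 2605.55, 100_000)` at about 8.5 KB of Wildfire heap per additional user.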

After all users had logged in, I opened the web admin console, and everything was OK there. I could open the client sessions page and click into user session details, though the client sessions page looks a little ugly.

Because my server only has 4G of memory, I couldn't test more users. I think 300K users is enough for most people.

Thanks to Alex and LG for the support.

Tim

Hi Tim,

may I ask how many initial and maximum database connections you are using?

The code has not yet been reviewed by Gato (dombiak_gaston), so it's not in Wildfire and may have some bugs, though I'm not aware of any.

LG

200K 300K, is that a typo?

Thanks,

Alex

Yeah, that really startled me too. I wouldn't be surprised if it were a native binary. Do those figures also hold true without the CM (with and without blocking sockets)?

I tuned the database connection pool several times and found that min=15, max=50 works well for my system.

The status log shows that connections never exceed 50. My database is MySQL 5.0 with InnoDB.

Status during the peak:

1158887554555 HK used/avail/max/waitConn=38/42/50/0

1158887584773 HK used/avail/max/waitConn=39/45/50/0

1158887614798 HK used/avail/max/waitConn=39/46/50/0

1158887644803 HK used/avail/max/waitConn=39/46/50/0

1158887674805 HK used/avail/max/waitConn=39/46/50/0

1158887675265 HK used/avail/max/waitConn=39/46/50/0

1158887705306 HK used/avail/max/waitConn=42/46/50/0

1158887735470 HK used/avail/max/waitConn=42/49/50/0

1158887765472 HK used/avail/max/waitConn=42/49/50/0

1158887795475 HK used/avail/max/waitConn=42/49/50/0

I have a pool of 50 login threads running from another connection manager; the connection manager settings are:

xmpp.manager.connections = 5
xmpp.manager.incoming.thread = 10

My test program runs inside the CM (it just sends/receives packets) and does not make socket connections from outside. If we tune the CM with technologies like epoll to accept incoming TCP sockets, I think 300k is possible for 1 CM + 1 Wildfire server. So Wildfire + CM is as good as or better than ejabberd or djabberd.

Tim

To convince anyone who doubts the 300k result, here is the snapshot from the admin console:

Message was edited by: timyang


Well, good news: Java 6 supports epoll on Linux with NIO. NIO is not currently implemented in Wildfire or the CM, but it is on the road map.

Hi Tim,

interesting that so few connections are enough. I tested the pool with 1000 connections.

“guyma” configured PostgreSQL with 550 (later changed to 330) connections in http://www.jivesoftware.org/community/thread.jspa?messageID=123802 - so either this was a very bad idea, or it is really needed when one wants to query vCard and roster information and not only log in.

The “HouseKeeper” debug line should probably stay in the code as-is when Gato applies it to Wildfire; I hope it made it easy to tune the connection pool settings.

LG

Hi, I need some help. Can somebody tell me how to use the connection manager? I'm new to this. I already use Wildfire with the regular settings and have raised the Java memory to 1500MB, but it does not work: my server handles only 110 users and hangs after that.

thanks

Yohan

Hey Yohan,

Could you start a new thread with your problem? It would be easier to track your issue separately from this thread.

BTW, check your error log files and post any errors that you find. You may also want to get a thread dump of the Java Virtual Machine. Under Unix, execute kill -3 <pid>, where <pid> is the process ID of the JVM. Under Windows, if you are using the console, press Ctrl-Break.
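If sending a signal to the process is not convenient, similar information can be collected from inside the JVM on Java 6 or later. The sketch below uses only the standard `java.lang.management` API. The class name is made up, and the code has to run inside the JVM being inspected.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumpSketch {
    /** Returns the ids of deadlocked threads, or an empty array if there are none. */
    static long[] deadlockedThreadIds() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? new long[0] : ids;
    }

    /** Returns a textual dump of all live threads in this JVM. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: do not collect locked monitors or ownable synchronizers
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append(info.toString());
        }
        return out.toString();
    }
}
```

`ThreadInfo.toString()` shortens long stack traces, so for a full picture the signal-based dump is still the better tool.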

Regards,

– Gato

Hi,

did you adjust Wildfire's memory usage, or is it using the default settings (64 MB)? See “Java Memory” on http://server:9090/index.jsp .

If you need to adjust it and the existing documentation and threads don't help you, please create a new thread.

LG

Hi Tim,

I have deployed the Connection Manager on the same machine as the Wildfire server. I can start the CM and can see in Wildfire's “Active Connection Managers” that the CM is running with zero client connections. After connecting to the Wildfire server using the Smack API, “Wildfire Active Connection Managers” still shows zero client connections. In one of your posts I saw your admin console showing the number of active connections. I assume that you are connecting to the CM instead of Wildfire. How can I connect to the CM using the Smack API?

My CM is using port 5262 to connect to Wildfire.

Thanks,

Ajay Singh