Load tests on Wildfire server with connection managers

Hello,

I am currently performing load tests on a Wildfire server with several connection managers connected to it. We simulate clients with Tsung, using a cluster of 3 machines.

The connection managers are four dual-core machines with 2 GB of RAM each, and the Wildfire server is a dual dual-core machine with 4 GB.

  • With a single CM connected to the Wildfire server, I was able to reach 15,000 connected users with no real problem. I only had to adjust the Java memory settings and, mainly, the thread stack size in order to start more native threads.

  • With 4 CMs I cannot get more than 22,000 users. In fact I was able to go higher (more than 30,000), but while Tsung is running, the load on the server becomes really high and generally the system dies or the connections are closed.

On the server side, a Connection reset message is logged at debug level, but there is no error or warning.

java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    at org.jivesoftware.wildfire.net.ServerTrafficCounter$OutputStreamWrapper.write(ServerTrafficCounter.java:244)
    at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)
    at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
    at java.io.BufferedWriter.flush(BufferedWriter.java:236)
    at org.jivesoftware.util.XMLWriter.flush(XMLWriter.java:190)
    at org.jivesoftware.wildfire.net.XMLSocketWriter.flush(XMLSocketWriter.java:31)
    at org.jivesoftware.wildfire.net.SocketConnection.deliver(SocketConnection.java:568)
    at org.jivesoftware.wildfire.multiplex.ConnectionMultiplexerSession.deliver(ConnectionMultiplexerSession.java:324)
    at org.jivesoftware.wildfire.multiplex.ClientSessionConnection.deliver(ClientSessionConnection.java:65)
    at org.jivesoftware.wildfire.ClientSession.deliver(ClientSession.java:772)
    at org.jivesoftware.wildfire.ClientSession.process(ClientSession.java:766)
    at org.jivesoftware.wildfire.roster.Roster.broadcastPresence(Roster.java:586)
    at org.jivesoftware.wildfire.handler.PresenceUpdateHandler.broadcastUpdate(PresenceUpdateHandler.java:258)
    at org.jivesoftware.wildfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:109)
    at org.jivesoftware.wildfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:153)
    at org.jivesoftware.wildfire.SessionManager$ClientSessionListener.onConnectionClose(SessionManager.java:1458)
    at org.jivesoftware.wildfire.net.VirtualConnection.notifyCloseListeners(VirtualConnection.java:147)
    at org.jivesoftware.wildfire.net.VirtualConnection.close(VirtualConnection.java:121)
    at org.jivesoftware.wildfire.multiplex.ConnectionMultiplexerManager.closeClientSession(ConnectionMultiplexerManager.java:169)
    at org.jivesoftware.wildfire.multiplex.MultiplexerPacketHandler.handle(MultiplexerPacketHandler.java:92)
    at org.jivesoftware.wildfire.net.ConnectionMultiplexerSocketReader$1.run(ConnectionMultiplexerSocketReader.java:121)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)

On the client side I get several warnings saying:

IQ stanza with invalid type was discarded:

I tried changing parameters such as the number of connections or threads for the connection managers, but did not get successful results. I also tried the IBM JVM, but it gave no significant improvement.

From everything I have seen, I think the server is overloaded by the traffic, and that this is caused by a high number of simultaneous transactions.

Does anybody have an idea or some tips I could use to pinpoint the bottleneck more precisely?

Thanks in advance for your help

Best Regards

Pascal

I forgot to mention something important: I also got the following message in stdout:

Error while connecting to server: wildfireserver(DNS lookup: wildfireserver:5262)

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
    at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
    at java.io.InputStreamReader.read(InputStreamReader.java:167)
    at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2992)
    at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
    at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1410)
    at org.jivesoftware.multiplexer.net.MXParser.nextImpl(MXParser.java:333)
    at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
    at org.jivesoftware.multiplexer.ConnectionWorkerThread.createConnection(ConnectionWorkerThread.java:166)

Which version of Wildfire are you using? We recently fixed a rather large performance bottleneck/bug:

JM-842

Thanks,

Alex

Hello

Thanks for the information.

I saw a significant improvement during the user connection phase, but afterwards, when users exchange messages and presence, the load is still really high and the problems still occur.

I will try to profile a bit and see what takes so long.

I'll keep you informed. Thanks again.

Cohen,

We would be interested in seeing any profiling information you can produce, as this helps us improve Wildfire!

Thanks,

Alex

Hello, I am back with some information.

I expect to be more precise and have some screenshots next week.

I ran the tests with YourKit Java Profiler and JProfiler.

With JProfiler my test does not behave the same as without a profiler, while with YourKit it behaves relatively close to the standard behavior. I probably did not set exactly the right parameters for JProfiler.

Anyway, to sum up what I observe when I run the Tsung scenarios: users arrive and the load is very low; when they start to communicate (messages and presences), the load starts to increase. The Tsung reports estimate a number of requests per second and a throughput, and when problems occur I am around a peak of 2200-2500 requests per second and 300,000 kbit/s. In fact the system survives, but from that point on the load becomes high and I get read timeouts from the sockets on the connection manager side. Finally the server dies (we use the WAR version inside a Tomcat server with the HotSpot JVM). When the server dies, Tomcat has several log entries looking like:

Java HotSpot™ 64-Bit Server VM warning: Attempt to allocate stack guard pages failed.

From JProfiler I got some “hot spots”:

Socket read and BufferedInputStream.read, but that seems necessary (77% and 7% of the time respectively); I also see 7% of the time spent accepting sockets (blocking mode).

This seems high and could be correlated with the timeouts observed.

Generally, in the debug log, when the system becomes unstable, the server regularly closes the connections opened to the connection manager.

From the JProfiler tool I get hot spots related to StringPrep and StringBuilder. But as I said, I should probably redo those tests because they do not look sound to me.

Anyway, to conclude, I would say it is possible that the test I run is not “representative” of reality. I am not very experienced and cannot estimate the number of transactions a server should be able to manage per second.

I hope these first details give you some ideas and are of some interest. My main feeling is that I should focus on the time spent accepting sockets.

Best regards and have a nice weekend,

Pascal

P.S. If you need more information or think some details are missing, I can try to get them.

Hello, I am back with some information and questions.

Profiling did not show much that was really interesting. The most costly part was the stream reading, and I don't think we can avoid that ;).

Anyway, I finally was able to simulate traffic with 31,000 users and 4 CMs connected to the Wildfire server. To do that, I replaced my first attempt at simulating ping packets, which regularly sent a short one-character message, with raw sends of just a whitespace character. This significantly reduced the load on the server because, if I am right, the whitespace is interpreted as a heartbeat by the connection manager and is not sent to or processed by the server.
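
For illustration, here is a minimal sketch (in Java, not the actual Connection Manager code, just my understanding of the idea) of how a reader could swallow lone whitespace characters as keep-alives instead of forwarding them to the server; the class and method names are mine:

    import java.io.IOException;
    import java.io.Reader;

    // Minimal sketch: whitespace received between stanzas is treated as a
    // heartbeat and swallowed locally, so it is never sent to the server.
    // Class and method names are illustrative, not taken from the CM sources.
    public class WhitespaceKeepAliveFilter {

        private final Reader clientReader;

        public WhitespaceKeepAliveFilter(Reader clientReader) {
            this.clientReader = clientReader;
        }

        // Returns the next character that actually needs processing.
        public int nextMeaningfulChar() throws IOException {
            int ch;
            while ((ch = clientReader.read()) != -1) {
                if (Character.isWhitespace(ch)) {
                    // Keep-alive: the successful read itself proves the client
                    // is still there; nothing is forwarded to the server.
                    continue;
                }
                return ch;
            }
            return -1; // end of stream
        }
    }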

I tried to go higher, around 45,000 users, but faced problems only at the end, when several users disconnected simultaneously: connection timeouts and sessions not removed (see later). That problem is “reduced” when I make users log off more smoothly. I just get related warning messages looking like:

2006.10.11 08:17:18 Stream error detected. Connection: org.jivesoftware.multiplexer.net.SocketConnection@16b69d7 socket: Socket[addr=/10.0.0.105,port=51934,localport=5222]
org.dom4j.DocumentException: Cannot have text content outside of the root document

Looking at the code, I think I have probably found the explanation of what happens at high load. Could anybody confirm?

When the load on the server side is high, packets queue up and cannot be processed very fast. So at a certain point the connection manager, and more precisely the SocketConnection class, considers that the health of the connection is bad and closes it. It then tries to create a new connection, which adds to the load, and the connection cannot be opened fast enough. That is why I get, on the connection manager side:

Error while connecting to server: fermium.wimba.fr(DNS lookup: fermium.wimba.fr:5262)

java.net.SocketTimeoutException: Read timed out

On the server side, some packets that should be forwarded to the connection manager can no longer be routed because the connection is closed, which is why I get either a connection reset message in the debug log or a broken pipe in the error log, depending on whether the RST sent by the connection manager when it closes the socket has already been processed by the server.

This then becomes a vicious circle where the connection managers keep trying to reconnect to the server and increase the load further. The server can neither process traffic nor accept new connections.

Related to that, I have two questions:

First, it seems that the SocketConnection class is used both for connections between connection managers and the server and for connections between clients and the connection manager. For the checkHealth method, they share the same property, xmpp.session.sending-limit. Would it make sense to allow different health values depending on the connection type (client -> CM and CM -> server)?
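
To make the question concrete, this is roughly what I have in mind (a sketch only: the type-specific property names are hypothetical, only xmpp.session.sending-limit exists as far as I know, and I am assuming the limit is a time in milliseconds for a pending write):

    // Sketch of per-connection-type limits for checkHealth(). The ".client"
    // and ".server" property names are hypothetical; today a single
    // xmpp.session.sending-limit is shared by all SocketConnection instances.
    public class SendingLimits {

        public enum ConnectionType { CLIENT_TO_CM, CM_TO_SERVER }

        private static final long DEFAULT_LIMIT = 60 * 1000; // assumed default

        public static long sendingLimit(ConnectionType type) {
            // The real code would go through the server's property store;
            // plain system properties keep this sketch self-contained.
            String specific = System.getProperty(type == ConnectionType.CLIENT_TO_CM
                    ? "xmpp.session.sending-limit.client"   // hypothetical
                    : "xmpp.session.sending-limit.server"); // hypothetical
            String shared = System.getProperty("xmpp.session.sending-limit");
            String value = (specific != null) ? specific : shared;
            return (value != null) ? Long.parseLong(value) : DEFAULT_LIMIT;
        }
    }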

Second, I noticed that when I had these timeouts, client sessions were not removed from the Wildfire server's list of sessions.

I suppose that when the connection between the CM and the server is lost, or closed because it is in bad shape, it is hard to send a message to the server telling it to remove the clients. But when the client connection is closed, I could not find anything, such as a CloseListener, that would forward the disconnection to the server.

I am now going to focus on user disconnection at a high rate, to see if I can find something useful.

Thanks for reading

Pascal

Hey Pascal,

Thanks a lot for your feedback and testing effort. Could you contact me directly by email at gaston@jivesoftware.com or IM at gato@jivesoftware.com? I would like to join efforts on this task. My current guess is that the TCP/IP queue is filling up and the network gets saturated, so connections are closed. This happens because the rate of incoming traffic is higher than the rate at which incoming packets are processed, so the queue fills up. I think there are a few things worth testing (e.g. fine-tuning) and possibly some parts of the architecture worth changing.

Regards,

– Gato

Hey Pascal,

Each Connection Manager will, by default, open up to 5 sockets to the server. You can change that setting in the CM by setting the system property xmpp.manager.connections in config/manager.xml. The bigger the number, the better the load is distributed among the connections and on the server.

On the server there are a few settings that you can also play with. For each connection coming from a CM there is a thread pool that processes incoming traffic. By default the thread pool will have 10 threads; once they are all busy, incoming traffic will be queued. Once the queue has reached 50 elements, more threads will be started. The max number of threads will be 100. Once you have reached those 100 threads, incoming traffic will be processed by the thread that reads from the socket.

Use the following system properties to change these settings:

xmpp.multiplex.processing.core.threads --> default 10.

xmpp.multiplex.processing.queue --> default 50.

xmpp.multiplex.processing.max.threads --> default 100.

Remember to restart Wildfire or Connection Manager after changing system properties.

As you may have realized from the above description, “Once you have reached those 100 threads, incoming traffic will be processed by the thread that reads from the socket” looks like the culprit for saturating the network, since the rate of processing incoming traffic will then be lower than the rate of reading from the TCP queue. I just checked in an improvement to avoid this problem. I would like to test the improvement to confirm that this was in fact the culprit. Could you try running your test using the next nightly build? The issue I filed for this is JM-867.
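
For reference, the pool behavior described above maps closely onto a standard java.util.concurrent.ThreadPoolExecutor. The sketch below only illustrates those semantics (core threads, bounded queue, max threads, caller-runs fallback); it is not the actual Wildfire code, and the 60-second idle timeout is just an example value:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Illustration of the per-CM-connection processing pool described above:
    // 10 core threads, a 50-element queue, growth up to 100 threads, and,
    // once everything is saturated, execution in the submitting thread
    // (i.e. the thread that reads from the socket). Not the Wildfire code.
    public class MultiplexProcessingPoolSketch {

        public static ThreadPoolExecutor createPool() {
            return new ThreadPoolExecutor(
                    10,                                    // xmpp.multiplex.processing.core.threads
                    100,                                   // xmpp.multiplex.processing.max.threads
                    60L, TimeUnit.SECONDS,                 // idle timeout for extra threads (example value)
                    new LinkedBlockingQueue<Runnable>(50), // xmpp.multiplex.processing.queue
                    // When all 100 threads are busy and the queue is full, the
                    // task runs in the caller -- the socket reader thread --
                    // which is exactly the saturation problem discussed here.
                    new ThreadPoolExecutor.CallerRunsPolicy());
        }

        public static void main(String[] args) {
            ThreadPoolExecutor pool = createPool();
            pool.execute(new Runnable() {
                public void run() {
                    System.out.println("processing a multiplexed packet");
                }
            });
            pool.shutdown();
        }
    }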

Thanks,

– Gato

Hi Gato,

I assume that you still didn't take a close look at the DB pool I sent you, but there I use the housekeeping thread to log the max, available, free, and used connections, and I also log a WARN message if the max is reached. One could do something similar within the Connection Manager to allow better monitoring and tuning of it.
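
As a rough sketch of what I mean (all names and the interval are made up; the real counters would have to come from the Connection Manager's own bookkeeping):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a housekeeping thread that periodically logs connection usage
    // and warns when the configured maximum is reached. All names are
    // illustrative; the real counters would come from the Connection Manager.
    public class ConnectionHousekeeper {

        private final AtomicInteger usedConnections = new AtomicInteger();
        private final int maxConnections;
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public ConnectionHousekeeper(int maxConnections) {
            this.maxConnections = maxConnections;
        }

        public void start() {
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    int used = usedConnections.get();
                    int free = maxConnections - used;
                    System.out.println("connections max=" + maxConnections
                            + " used=" + used + " free=" + free);
                    if (used >= maxConnections) {
                        System.err.println("WARN: connection limit reached");
                    }
                }
            }, 0, 30, TimeUnit.SECONDS); // 30-second interval is an arbitrary choice
        }
    }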

LG

Hello Pascal,

From your stack trace, it seems the server got a TCP reset while reading/writing, so please check your client code: for example, that the send/recv socket buffers are big enough and that your thread code is fast enough to handle thousands of incoming TCP connections. If the CM sends your Jabber client lots of packets that exceed your client machine's default socket receive buffer, your client PC will send a peer reset or some other socket error to the CM, causing the error.

I suggest you use Java NIO and non-blocking I/O to perform the test, not the one-connection-per-thread mode. Most OSes have difficulty handling more than 10k threads. I can log in 30k~50k users to one Connection Manager using NIO.
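
To illustrate what I mean by NIO (this is only a minimal sketch, not my actual test code; the host, port, buffer sizes and connection count are placeholders), one selector thread can drive many non-blocking client connections instead of one thread per connection:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Minimal NIO sketch: many non-blocking client connections driven by a
    // single selector thread. Host/port and the connection count are placeholders.
    public class NioClientSketch {

        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();

            // Open a handful of non-blocking connections (thousands work the same way).
            for (int i = 0; i < 10; i++) {
                SocketChannel channel = SocketChannel.open();
                channel.configureBlocking(false);
                channel.socket().setReceiveBufferSize(64 * 1024); // generous recv buffer
                channel.connect(new InetSocketAddress("localhost", 5222));
                channel.register(selector, SelectionKey.OP_CONNECT);
            }

            ByteBuffer buffer = ByteBuffer.allocate(8192);
            while (selector.select() > 0) {
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    SocketChannel channel = (SocketChannel) key.channel();
                    if (key.isConnectable() && channel.finishConnect()) {
                        // Connected: from here one would send the XMPP stream
                        // header and switch interest to OP_READ.
                        key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buffer.clear();
                        int read = channel.read(buffer);
                        if (read == -1) {
                            key.cancel();
                            channel.close();
                        }
                        // Parse/handle received bytes here.
                    }
                }
            }
        }
    }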

In my experience, increasing the number of TCP connections between the CM and WF doesn't help if your CM and WF are on a local network or a fast enough link (100M or 1G Ethernet). 3~5 connections are enough in my environment.

You've asked me about the details of my 300k concurrent-user test. My test code runs inside one CM, not from incoming TCP connections. I created every user using the XMPP protocol, sent every packet to the Wildfire server, and also handled Wildfire's responses.

The log info from my test after the 300k users logged in (from the JVM running the CM and my code):

298473 Logined. Time elapse(sec): 565, Avg: 528 users / sec
VM Memory: 430.62MB of 910.25 MB (47.3%) used

After all the users logged in, I began sending messages (just a “hello world” message, about 0.2 KB) between them.

The message send/recv log after running for 30 minutes:

Sent packets: 11098353
Recv packets: 11097098
Total time elapse(Sec): 2409
Total Sending time elapse(Sec): 1840
Message Packet sent / Second: 6031
VM Memory: 756.37MB of 980.00 MB (77.2%) used

Tim

Hello Timyang,

Thanks for your answer.

My problem in fact occurred between the connection managers and the Wildfire server.

I am also thinking of an RST message, but one sent by the CM when the server is highly loaded and the checkHealth method decides to close the socket.

The client I use is written in Erlang (Tsung) and I am really not sure I want to dig into it ;).

Anyway, with the blocking reading mode I can reach 15,000 users on a single CM by adjusting the thread stack size. I have started adapting the server's NonBlockingReadingMode to the CM side; initial tests were good (let's say a ratio of about 3 seems achievable), but I had disconnection problems with the clients. I didn't check whether this was updated recently, but when I got the connection manager code, the non-blocking mode was commented out.

I agree with you about the number of connections between the CM and the server. I tried to play with these parameters and did not see any noticeable difference.

I am currently testing the update from Gato, but as with everything related to networking, one should (in my opinion) analyze the results carefully and test several times before drawing conclusions.

I would say that the load stays very high (as expected) but we no longer lose connections between the CMs and the server (although I did observe it on one single CM during all the tests, which is weird). But now it looks like my clients are being disconnected. That is not so bad if the server is really loaded, but it is not exactly what I expected, and I am just trying to understand what is happening, because the logs seem OK and the client sessions are simply closed. This is probably an IOException or EOFException being caught but not logged. I will try to find it.

Can I also ask you what the hardware configuration of your machines was (especially the WF server)?

Thanks for your help

Here are my observations:

As we could expect, the load doesn't really change (things need to be processed in any case), but, also as expected, connections are no longer closed between the CMs and the WF server. In that case processing is generally delayed, but packets are processed anyway.

In the case of the tests using whitespace to simulate pings, the behavior is identical with or without the fix, but the server survives even when many users disconnect simultaneously. With the fix, clients disconnect and the server progressively disconnects them; this is delayed but works fine. Without the fix, generally if users leave too quickly and simultaneously (around 200 simultaneous disconnections), the queuing causes a CM disconnection and I get “zombie” connections: the client is gone but the server thinks it is still alive.

So the fix improves the situation in the case of whitespace pings.

In the case of a character used as the ping, the situation is also better: the load becomes very high with or without the fix, but as in the previous case, we don't keep zombie client sessions.

But under high load, I still observe massive client disconnections (25K -> 5K) and cannot find any log entry for them. At the beginning I had a similar problem with “EOFException”s that were caught as normal exceptions. So it could also be a problem with the Tsung clients. On the client side I only get a “timeout” error message, which is not very helpful.

Concerning the fix, I think it clearly improves the robustness of the connection between the CM and the server and removes the zombie sessions (what happens in the case of bad health between clients and the CM is still an open question, in my opinion).

I will now try to adjust the multiplex parameters on the server side to check their effect.

Hello,

Both my WF and CM machines are 3 GHz Xeons with 4 GB of memory.

I found that the CM's non-blocking mode doesn't work; it hangs when a new Jabber client connects to it.

For the non-blocking bug, see also: http://www.jivesoftware.org/community/click.jspa?searchID=-1&messageID=126535

Sorry, I found that the problem I reported above is actually a TODO feature: NIO is not supported in the current CM.

I've read the CM's non-blocking code and found this in SocketReader.java:

        // Set the blocking reading mode to use
        if (useBlockingMode) {
            readingMode = new BlockingReadingMode(socket, this);
        }
        else {
            //TODO readingMode = new NonBlockingReadingMode(socket, this);
        }

So when non-blocking mode is enabled, readingMode is never assigned and using it throws an NPE.

The last update date for SocketReader.java is June 18.

Tim