Can't handle too many messages

Hi,

I have written a message listener that receives messages from several hundred Smack clients. Before Smack 2.2.0, my listener ran 24/7 without failing. After the upgrade, it won't even last three hours before it hangs.

Is there anything I need to fine-tune on the Smack client or the Wildfire server?

Your post got me curious, sounds like a good test. Difficult to set up with multiple users sending messages though.

Using Smack 2.2 and Wildfire 2.6.2, I sent 10,000 three-line messages from one user to another with no problems noticed.

It took less than a minute.

Sorry about not being too specific. The listener receives on average between 55,000 and 60,000 messages before it hangs; that's the "many" I am talking about. It's running on CentOS 4.3, on an IBM eServer with a 3.4 GHz P4 and 1 GB of RAM.

Any out-of-memory errors or anything along those lines? Do you have JProfiler or anything like that to run? It's curious this is happening now and not before, as we recently fixed a memory leak inside of Smack.

If you don't have JProfiler and your test is easily reproducible, perhaps you could post it here and I will test it on my end and see where the problem is?

Thanks,

Alex

I didn't see any server or Smack memory problems in my one-on-one, 10,000-message test.

I am testing on Windows 2000 (server and 100 clients) and Red Hat 7.3 on a mere 400 MHz P2 (receiving client). A slow computer seems to bring out the worst in threaded software nowadays.

How fast is the receiver getting messages (per second/minute/hour) in your test?

I put together a test with 100 users sending 10,000 messages to one poor client (on the slow machine). I can add delays per 100 messages or between each message. I ran it flat out and it was moving pretty well: more than 50,000 messages, maybe even more than 100,000.

Then I iconified the cmd window on the 100-client test, which increases its performance about 10x. Now it was really pumping, and wham, the receiver threw an exception in PacketReader and PacketWriter.

So I shut off the security layers and ran again, thinking the problem might be in the SSL code. Boom, another exception in PacketReader, similar to the previous one.

33,534 messages received; here is the exception:

Free Memory=8834760

java.io.EOFException: no more data available - expected end tag </stream:stream> to close start tag stream:stream from line 1, parser stopped on END_TAG seen …e following\nis a test message\nwith multiple lines… @111391:37

at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3014)

at org.xmlpull.mxp1.MXParser.more(MXParser.java:3025)

at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1144)

at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:377)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:43)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:63)

I am going to let a slower version (with a delay) run overnight; it's pretty boring to watch, yawn.

Looks like the socket was closed abruptly. Don't know if it is the server or the client. Do you see anything related in your server error logs that might indicate it closed the connection?

Thanks,

Alex

My logs don't show any errors. Yes, I got disconnected for some time, but my workaround is to add a ConnectionListener so that if a closed-connection event fires, it will attempt to reconnect, and this works. However, I suspect a thread deadlock between my listener thread and the executor threads I have written with it.
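For context, here is a rough sketch of that reconnect workaround (not my actual code; it assumes the Smack 2.x ConnectionListener callbacks connectionClosed()/connectionClosedOnError(), and the host, user, and password are placeholders):

import org.jivesoftware.smack.ConnectionListener;
import org.jivesoftware.smack.PacketListener;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.XMPPException;
import org.jivesoftware.smack.filter.PacketTypeFilter;
import org.jivesoftware.smack.packet.Message;

public class ReconnectingListener implements ConnectionListener {

    private final PacketListener messageListener;
    private volatile XMPPConnection connection;

    public ReconnectingListener(PacketListener messageListener) {
        this.messageListener = messageListener;
    }

    public void connectionClosed() {
        reconnect();
    }

    public void connectionClosedOnError(Exception e) {
        reconnect();
    }

    private void reconnect() {
        while (true) {
            try {
                // In Smack 2.x the constructor connects immediately.
                connection = new XMPPConnection("wildfire.example.com");
                connection.login("listener", "secret");
                connection.addConnectionListener(this);
                connection.addPacketListener(messageListener,
                        new PacketTypeFilter(Message.class));
                return;
            }
            catch (XMPPException e) {
                // Wait a bit, then try again.
                try { Thread.sleep(5000); } catch (InterruptedException ie) { return; }
            }
        }
    }
}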

The server logs were clean.

My overnight test of 100 users, each sending one message every 100 milliseconds, was still running this morning with no errors. 540,395 messages were delivered with no problem.

We have jokingly determined that you must be gentle with Smack and Wildfire and they will work OK. If you really push it, Smack at least breaks.

Will run some additional tests today.

I use Smack 2.2.0 and Wildfire 2.5 and I have a process that sends anywhere from 5-20+ messages per second (using it like you would use a JMS topic) and have no problems whatsoever.

I actually restarted the process yesterday for unrelated reasons (config change) and prior to the restart my Session for the user the process logs in as was showing 14.5 million packets sent and received in just under 30 days. It uses SSL for the connection as well.

I have found Smack/Wildfire to work like a charm as a messaging solution for a somewhat large message volume. I don't seem to be anywhere close to maxing it out.

I do have to use a workaround for a memory leak in the Smack libraries as described in this thread:

http://www.jivesoftware.org/community/message.jspa?messageID=117134#117134

Are you sure you aren't running out of memory?

I'm sure I'm not having memory leaks, as nothing shows up in "top", and the listener will just simply stop. Stop as in it will not receive messages; it simply hangs and won't even shut down without killing it. My processPacket() implementation looks like this:

public void processPacket(Packet packet) {
    if (packet instanceof Message) {
        logger.debug("Packet received");
        recv = (Message) packet;
        logger.debug(recv.getBody());
        from = recv.getFrom();
        subject = recv.getSubject();
        body = recv.getBody();
        logger.debug("Subject: " + subject);
        try {
            // This is a QueuedExecutor based on Doug Lea's concurrent API.
            // It runs the run() method, which contains the code that
            // writes to a database.
            executor.execute(this);
        }
        catch (InterruptedException e) {
            logger.error(e.getMessage(), e);
        }
    }
}
// end of snippet

Now my question is, is it advisable to let another thread handle the writing to the database or should I just let the processPacket() method do that?

OK, I think I answered my own question: processPacket() has to delegate to another thread so that it can handle multiple incoming messages.
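A minimal sketch of that hand-off (not my actual code; it uses java.util.concurrent instead of Doug Lea's standalone library, and it copies the message fields into the task so later packets cannot overwrite the shared recv/from/subject/body fields before the database write runs):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.jivesoftware.smack.PacketListener;
import org.jivesoftware.smack.packet.Message;
import org.jivesoftware.smack.packet.Packet;

public class DbWritingListener implements PacketListener {

    // One worker thread and a bounded queue; when the queue is full,
    // CallerRunsPolicy makes the calling thread do the work itself.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(10000),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void processPacket(Packet packet) {
        if (!(packet instanceof Message)) {
            return;
        }
        Message msg = (Message) packet;
        // Copy the fields now; the task may run long after this packet.
        final String from = msg.getFrom();
        final String subject = msg.getSubject();
        final String body = msg.getBody();
        executor.execute(new Runnable() {
            public void run() {
                writeToDatabase(from, subject, body);
            }
        });
    }

    private void writeToDatabase(String from, String subject, String body) {
        // Hypothetical placeholder: the JDBC insert would go here.
    }
}

Note the trade-off with CallerRunsPolicy: if the queue fills up, the Smack reader thread ends up doing the database write itself, which slows packet processing down instead of silently discarding work.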

For the sending part, I have no problem; I can send as many messages as my hundreds of Smack clients want. There's only one client (a listener) that receives them all, and this is where the problem is coming up.

Could you get a thread dump to see what your hanging listener is doing?

Thanks,

Alex

The test…

Log in 100 consecutive users. Each sends 10,000 messages, for a total of 1 million messages, to a specific user. The sender and the server are on the same computer. The receiver is on another computer (albeit an old, slow one).

In one variation of the test, the sender delays 100 milliseconds before continuing to read from its command file. Otherwise, it's full steam ahead as fast as the messages can be sent.

With the delay the test was always successful (with PacketReader fix http://www.jivesoftware.org/community/thread.jspa?threadID=19926&tstart=0).

Without the delay, it was successful twice. In one test 875,000 messages were received out of 1 million.

Usually, I would restart the server between tests. One time I did not, and I actually received more messages than were sent!

So what happened?

My current theory is offline message processing. The server overloads and drops messages because the offline storage (queues?) is full. Perhaps messages are delivered later, causing the more-received-than-sent situation.

Although the test is pretty extreme (it runs at 100% CPU for 20 minutes or so), it may not be a good example of normal real usage. I have shut down offline message storage and will bounce messages instead.

Also, the sender idles when all messages are sent; it takes a while for the receiver to catch up, since it's a 100-to-1 bottleneck.

In one instance of idling, the server apparently closed the connection to one of the 100 senders. This makes me very suspicious of the xmpp.client.idle logic in the server as well. Its default is 30 minutes; the test had run about 20 minutes, plus the senders had a 10-second KeepAlive.

So far I can only report test observations, none of which are repeatable on demand, yet.

The results of the test are very disconcerting!

The 100 sending clients and the server were run on the same machine, the receiver on another. Initially, the senders and server were run at the same priority. The senders run as fast as they can with no delay between message sends. A main thread dispatches the commands to the senders to send the messages, so the test is somewhat serialized. Sometimes (twice) 1 million messages were sent and received, but in most runs an exception would be thrown in a sender and it would die. The exception would typically be something like:

No delay sent: 948639 received: 156165

Apparently, a sender died for unknown reasons: no error message, no sign of the socket closing?

java.net.SocketException: Software caused connection abort: recv failed

at java.net.SocketInputStream.socketRead0(Native Method)

at java.net.SocketInputStream.read(Unknown Source)

at com.sun.net.ssl.internal.ssl.InputRecord.readFully(Unknown Source)

at com.sun.net.ssl.internal.ssl.InputRecord.readV3Record(Unknown Source)

at com.sun.net.ssl.internal.ssl.InputRecord.read(Unknown Source)

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(Unknown Source)

at com.sun.net.ssl.internal.ssl.AppInputStream.read(Unknown Source)

at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(Unknown Source)

at sun.nio.cs.StreamDecoder$CharsetSD.implRead(Unknown Source)

at sun.nio.cs.StreamDecoder.read(Unknown Source)

at java.io.InputStreamReader.read(Unknown Source)

at java.io.BufferedReader.fill(Unknown Source)

at java.io.BufferedReader.read1(Unknown Source)

at java.io.BufferedReader.read(Unknown Source)

at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2971)

at org.xmlpull.mxp1.MXParser.more(MXParser.java:3025)

at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1144)

at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:384)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:43)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:64)

java.net.SocketException: Software caused connection abort: socket write error

at java.net.SocketOutputStream.socketWrite0(Native Method)

at java.net.SocketOutputStream.socketWrite(Unknown Source)

at java.net.SocketOutputStream.write(Unknown Source)

at com.sun.net.ssl.internal.ssl.OutputRecord.writeBuffer(Unknown Source)

at com.sun.net.ssl.internal.ssl.OutputRecord.write(Unknown Source)

at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(Unknown Source)

at com.sun.net.ssl.internal.ssl.AppOutputStream.write(Unknown Source)

at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(Unknown Source)

at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(Unknown Source)

at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(Unknown Source)

at sun.nio.cs.StreamEncoder.flush(Unknown Source)

at java.io.OutputStreamWriter.flush(Unknown Source)

at java.io.BufferedWriter.flush(Unknown Source)

at org.jivesoftware.smack.PacketWriter.writePackets(PacketWriter.java:260)

at org.jivesoftware.smack.PacketWriter.access$000(PacketWriter.java:39)

at org.jivesoftware.smack.PacketWriter$1.run(PacketWriter.java:79)

Exception in thread "main" java.lang.IllegalStateException: Not connected to server.

at org.jivesoftware.smack.XMPPConnection.sendPacket(XMPPConnection.java:699)

at com.proxy.VirtualUser.sendMessage(VirtualUser.java:305)

at com.proxy.VirtualGroup.sendMessage(VirtualGroup.java:156)

at com.proxy.XMPPProxy.sendMessage(XMPPProxy.java:376)

at com.proxy.XMPPProxy.doPluginCommand(XMPPProxy.java:174)

at com.proxy.XMPPProxy.doit(XMPPProxy.java:123)

at com.proxy.XMPPProxy.&lt;init&gt;(XMPPProxy.java:86)

at com.proxy.XMPPProxy.main(XMPPProxy.java:479)

The server logs are empty or indicate a comm failure. The results were not very repeatable and show no pattern.

I increased the server priority to above normal and the behaviour changed. No more exceptions!

But, there is always a but.

30% of the messages never make it to the receiver! There are no errors anywhere indicating anything went wrong!

Analogy

If you take I-75 North out of Atlanta (7 lanes), it eventually goes down to two lanes. Usually traffic backs up painfully, and it can take an hour to go a few miles. So one would expect, in a backed-up, fully loaded scenario, to eventually get to the destination.

Instead the choices become:

  1. For no good reason you are involved in a crash.

  2. You eventually make it.

  3. You disappear mysteriously from the face of the Earth, never to be seen again, and no one knows why.

Turning on server debug yields no clues. I'm pretty much out of ideas on what to do, but I will continue to test and think about it as I have time.

What this tells me, though, is that there is no guarantee your sent message will be received in a heavily loaded situation. I believe there is a problem here somewhere, although most users may never experience it.

One more thing I forgot. I decided to run the test using a separate machine for each of the players, with a 100 Mbps network connecting everyone. I expected to receive 1 million messages for 1 million sent. Instead, the results were the same.

1 million sent, 659,551 received. No errors, no exceptions. The test took about 20 minutes.

If anyone has any good ideas on something to try, I will try it as I have time.

skip

Skip,

It is difficult for me to diagnose the issue without seeing your code. I just spent a few hours today diagnosing a memory leak in Spark that came as a result of how it uses messaging in Smack; it only became apparent with many messages.

There are some API and documentation changes that we are currently discussing to help improve the usability of these aspects of the Smack packages.

I would like to see your code to help diagnose whether we are seeing some of the same issues I am now fixing in Spark, and in turn Smack. If you are sensitive about posting your code to the forums, please feel free to contact me off list by email or IM.

Thanks,

Alex

The code I have been testing with is too sensitive to post. The tests I have been running take a very long 20 minutes, which usually exceeds my attention span!

I changed my testing tactics yesterday and believe I can come up with a smaller, more digestible example of the potential multiple problems.

I had been letting the server have a higher priority than the sending clients. I reversed the tactic and lowered the server priority below that of a single sending client, which was sending 100,000 messages rapidly. This causes the sendPacket queuing mechanism in Smack to kick in. My last test results showed considerably more messages being received than sent. Also, an enormous amount of memory was allocated in the sending client and never released until program exit. I believe there is a huge memory leak in the storing of packets in the linked list of the sendPacket mechanism.

I am going to whip up some small client test programs today and use them to test with. I instrumented a method in PacketWriter that allows me to monitor the size of the transmit queues, which do, at least, seem to get worked down.

All my findings thus far are very inconclusive, though, so I had neglected to post anything. No sense crying wolf, so to speak. I want to be able to repeatedly demonstrate the problem(s) in a shorter period of time with simpler code. I will post my findings this afternoon.
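For the curious, a rough sketch of the kind of instrumentation I mean (not the actual patch; it assumes PacketWriter keeps its outgoing packets in a LinkedList field named queue and that XMPPConnection holds its writer in a field named packetWriter, which may not match the real names):

// Hypothetical accessor added to org.jivesoftware.smack.PacketWriter.
// Assumes the outgoing packets are stored in a LinkedList field named "queue".
protected synchronized int getQueueSize() {
    return queue.size();
}

// Hypothetical pass-through added to org.jivesoftware.smack.XMPPConnection,
// assuming its writer field is named "packetWriter". A test harness can then
// poll connection.getSendQueueSize() once a second while the sender runs.
public int getSendQueueSize() {
    return packetWriter.getQueueSize();
}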

skip

Hey Skip,

I'll let you know where the problem is that we have found. I have investigated further where else this issue exists in the library, and I know of at least two locations. The issue I have seen arises with a PacketCollector; you may run into it if you are using either the Chat object or the MultiUserChat object. The issue with these is that if you don't use the built-in methods of these classes to read packets, packets build up in their PacketCollector(s). The PacketCollectors do have a max size, but that is currently set to 65536 packets.
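To illustrate the pattern, a hedged sketch (not code from Smack or from this thread; it assumes the Smack 2.x Chat API with createChat() and nextMessage(timeout), and the JID is a placeholder):

import org.jivesoftware.smack.Chat;
import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.packet.Message;

public class ChatDrainExample {

    public static void drain(XMPPConnection connection) {
        // Creating a Chat registers a PacketCollector for that conversation.
        // If you never read from the Chat, incoming messages pile up in the
        // collector (up to MAX_PACKETS). Reading them keeps it drained.
        Chat chat = connection.createChat("someone@wildfire.example.com");

        while (true) {
            Message msg = chat.nextMessage(5000); // null after 5 s with no message
            if (msg == null) {
                break;
            }
            System.out.println(msg.getBody());
        }
    }
}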

I just wanted to put that information out there so you could think about it in the context of your current code. I am interested and anxious to hear your findings.

Thanks,

Alex

My test results…

I don't use Chat or PacketCollector. Strictly messages.

I assembled a simple sender/receiver test, with a modified Smack so I could monitor the sending-queue size on a Smack XMPPConnection.

Sender and Wildfire 2.6.2 server.

500,000 small messages.

500K in, 500K out, provided that memory limits were not exceeded in the server; then it worked. 500K messages were queued in Smack in mere seconds and then sent to the server at about 1K messages/sec on my hardware. Sometimes the client would overwhelm the server and cause a server heap overflow, which invalidated the test.

Usually I could temporarily bump the server priority up to get things flowing and primed and then all went well.

The memory leak I see is called Java. As Java allocates memory from the OS, it does not return it to the OS until the JVM terminates. I fully understand this and why; it boils down to sbrk() and sub-memory-management techniques.

In the case of my test, though, the sending queue is using about 15-20K bytes per small message (when converted into XML), so 300 MB or so of OS memory disappears until I kill the sender.

To me this is an issue with using Java for any memory-intense client/server application which must run 24/7 and must coexist with other applications.

So my conclusion is there is NO problem here.

What I saw in my more complex test may be a misconfiguration on my part or a shared-group routing problem I am also chasing.

As many messages were received as were sent, always, unless the heap overflowed.

skip

I take it back. There is a problem. I have tracked this one for a while. I have found other things during the testing that I will post in other threads, for the benefit of others who may be concerned or run into similar situations.

Yes, you can lose packets. Apparently by design (as MS would say). If you don't process packets fast enough and you have a flood of messages (70,000), you will start dropping packets right and left, and Smack will "SILENTLY" dispose of them for you. In previous tests my receiver did nothing but println the message.

The breakthrough came when I decided to simulate some processing with a 10 millisecond sleep. I was pushing 70,000+ messages at my receiver. The messages were counted by the receiver and had sequence numbers in them to make them unique. At around 1,500-3,000 messages received, 25-70 messages were being dropped for every one received. I would lose 5K-10K messages.
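For reference, a minimal reconstruction of the kind of counting receiver described above (not the actual test code; it assumes each message body carries just the sequence number, and the 10 ms sleep stands in for real processing):

import org.jivesoftware.smack.PacketListener;
import org.jivesoftware.smack.packet.Message;
import org.jivesoftware.smack.packet.Packet;

public class CountingListener implements PacketListener {

    private long received = 0;
    private long lastSequence = -1;
    private long dropped = 0;

    public void processPacket(Packet packet) {
        if (!(packet instanceof Message)) {
            return;
        }
        long seq = Long.parseLong(((Message) packet).getBody().trim());
        if (lastSequence >= 0 && seq != lastSequence + 1) {
            dropped += seq - lastSequence - 1;   // a gap means messages were lost
        }
        lastSequence = seq;
        received++;

        try {
            Thread.sleep(10);   // simulate ~10 ms of real processing per message
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        if (received % 1000 == 0) {
            System.out.println("received=" + received + " dropped=" + dropped);
        }
    }
}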

I had seen this previously with a larger test of 1 million messages and was losing 300K messages!

This is why:

PacketCollector.java (snippet)

/**
 * Max number of packets that any one collector can hold. After the max is
 * reached, older packets will be automatically dropped from the queue as
 * new packets are added.
 */
private static final int MAX_PACKETS = 65536;

if (resultQueue.size() == MAX_PACKETS) {
    resultQueue.removeLast();
}


Well, one would think that is reasonable enough right!

Well, whether you know it or not, outgoing messages are queued and sent asynchronously also. BTW, the limit for sending = available memory.

Build up too many and you get a heap overflow.
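A minimal sketch of the kind of pacing used in the delayed test variation earlier in the thread; throttling sendPacket() keeps that unbounded send queue from building up (host, credentials, recipient, and delay are placeholders, not from the thread):

import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.packet.Message;

public class ThrottledSender {

    public static void main(String[] args) throws Exception {
        XMPPConnection connection = new XMPPConnection("wildfire.example.com");
        connection.login("sender", "secret");

        for (int i = 0; i < 100000; i++) {
            Message msg = new Message("receiver@wildfire.example.com");
            msg.setBody(String.valueOf(i));     // sequence number as the body
            connection.sendPacket(msg);
            Thread.sleep(10);                   // ~100 messages/sec; tune to taste
        }

        connection.close();
    }
}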

Also, the server has no internal limits for this that I can find either, except limit = available memory. A single user could flood a server with messages and cause it to fail.

Offline messages are limited and stored in the SQL database; stuff 70K messages in there and it uses some storage!

So my whine is…

WHINE ON!

  1. Why limit the receiver queue if nothing else seems to have a limit? Exhaustion of memory will take care of the problem, as it does in all other cases…

  2. Allow MAX_PACKETS to be gettable and settable. It is documented…at least.

  3. When dropping a message, output a warning message.

  4. Allow receiving and sending queue sizes to be examined.

WHINE OFF…

Skip