Deadlock when writing many messages

Hello, I am using Smack 4.4.6 and I found that if I wrote too many messages at once, my connection would suddenly lock. I did some digging and found this issue. I went ahead and disabled StreamManagement so that I wouldn’t depend on acks, and this helped the connection last a bit longer, but I was still hitting deadlock eventually. Note that I tested sending bursts of 1000 messages per second for several seconds, which is quite a lot, and which is in fact the scale I am working towards supporting in my application.

It appears that this deadlock comes from the XMPPTCPConnection PacketWriter. When it is .init()'d, it creates a thread that runs .writePackets(), which continually calls .nextStreamElement(), which calls queue.take(). At the same time, my thread/the user thread is ultimately calling PacketWriter.sendStreamElement(), which calls queue.put(). Here “queue” is an ArrayBlockingQueueWithShutdown.

The problem arises when this queue fills up (it is only 500 elements). In this case, queue.put() will first call lock.lockInterruptibly() and then putInternal(). It will then notFull.await(). Then, the spinning writePackets() thread will call queue.take(), which first also calls lock.lockInterruptibly(). Now, neither thread will advance, because they are both waiting on the other.

This situation arises when I am putting data into the queue faster than it can be read. There’s a few solutions that could help mitigate this:

  1. Have an unbounded queue, and leave it up to the developer to deal with this
  2. Allow the developer to get metrics on this queue so that we can ensure we aren’t going to fill it
  3. Probably best, ensure the dequeueing thread is not blocked by the enqueueing thread in this scenario.

Please let me know if any of this is unclear or you would like more information about it, thank you!

Another option that may be best is to return a “success” boolean value when calling sendStreamElement while the queue is full, or perhaps raise an Exception. Another option is to return the final Message that was sent, or null if nothing was sent.

In other words, instead of blocking until it is able to succeed, immediately return/raise an exception to signal to the caller that the queue has been filled. Then the caller is able to decide how to handle this. For example, perhaps the caller can just wait 1-2 seconds for the queue to flush and then start making calls again.