Compression bug uncovered?

We might have uncovered the root of a compression-related problem. We’re running openfire 3.5.1 with client/server compression enabled. To isolate the problem, we disabled server-server compression and all tls options.

The results of the problem are a NULL pointer exception in jzlib and an inability of Alice (described next) to send/recv messages in a remotely hosted chat room.

Here’s a description:

Alice logs into server X and connects to chat room Z on server Y (Alice enables compression). Bob logs into server Y and joins chat room Z on server Y. At this point, we exercise Alice and Bob’s chat clients to send numerous, simultaneous messages to room Z. Alice suddenly loses the ability to send/receive from the room, albeit, she hasn’t been kicked out.

We believe interleaved access is being granted to Alice’s deflator object and that this interleaved access is the problem.

Here are some bits of evidence that lead us to this conclusion:

  1. According to ServerSockerReader.java, server X can spawn threads to handle packets received from server Y.

  2. Packets destined to Alice are deflated (using jzlib) before being transmitted.

  3. According to our investigations, threads sometimes interleave access to the jzlib deflator object. (See the output at the end of this message.)

org.jivesoftware.openfire.net.ServerSocketReader is suspicious because it can create new threads (at the server) to handle incoming messages. What happens if two messages are received and destined for Alice? Could deflation of one message be incomplete before deflation of the second begins?

Here’s an example of a compression error we’ve seen. Exception messages vary, but usually end in a NULL pointer exception somewhere in jzlib deflate.

java.lang.NullPointerException
        at com.jcraft.jzlib.ZStream.flush_pending(ZStream.java:162)
        at com.jcraft.jzlib.Deflate.flush_block_only(Deflate.java:776)
        at com.jcraft.jzlib.Deflate.deflate_slow(Deflate.java:1207)
        at com.jcraft.jzlib.Deflate.deflate(Deflate.java:1567)
        at com.jcraft.jzlib.ZStream.deflate(ZStream.java:133)
        at org.apache.mina.filter.support.Zlib.deflate(Zlib.java:176)
        at org.apache.mina.filter.CompressionFilter.filterWrite(CompressionFilter.java:191)
        at org.apache.mina.common.support.AbstractIoFilterChain.callPreviousFilterWrite(AbstractIoFilterChain.java:361)
        at org.apache.mina.common.support.AbstractIoFilterChain.access$1300(AbstractIoFilterChain.java:53)
        at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.filterWrite(AbstractIoFilterChain.java:659)
        at org.apache.mina.filter.codec.ProtocolCodecFilter.filterWrite(ProtocolCodecFilter.java:205)
        at org.apache.mina.common.support.AbstractIoFilterChain.callPreviousFilterWrite(AbstractIoFilterChain.java:361)
        at org.apache.mina.common.support.AbstractIoFilterChain.access$1300(AbstractIoFilterChain.java:53)
        at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.filterWrite(AbstractIoFilterChain.java:659)
        at org.apache.mina.common.IoFilterAdapter.filterWrite(IoFilterAdapter.java:90)
        at org.jivesoftware.openfire.net.StalledSessionsFilter.filterWrite(StalledSessionsFilter.java:49)
        at org.apache.mina.common.support.AbstractIoFilterChain.callPreviousFilterWrite(AbstractIoFilterChain.java:361)
        at org.apache.mina.common.support.AbstractIoFilterChain.access$1300(AbstractIoFilterChain.java:53)
        at org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.filterWrite(AbstractIoFilterChain.java:659)
        at org.apache.mina.common.support.AbstractIoFilterChain$TailFilter.filterWrite(AbstractIoFilterChain.java:587)
        at org.apache.mina.common.support.AbstractIoFilterChain.callPreviousFilterWrite(AbstractIoFilterChain.java:361)
        at org.apache.mina.common.support.AbstractIoFilterChain.fireFilterWrite(AbstractIoFilterChain.java:355)
        at org.apache.mina.transport.socket.nio.SocketSessionImpl.write0(SocketSessionImpl.java:166)
        at org.apache.mina.common.support.BaseIoSession.write(BaseIoSession.java:177)
        at org.apache.mina.common.support.BaseIoSession.write(BaseIoSession.java:168)
        at org.jivesoftware.openfire.nio.NIOConnection.deliver(NIOConnection.java:223)
        at org.jivesoftware.openfire.session.LocalClientSession.deliver(LocalClientSession.java:787)
        at org.jivesoftware.openfire.session.LocalSession.process(LocalSession.java:262)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:235)
        at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:356)
        at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:101)
        at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:68)
        at org.jivesoftware.openfire.net.SocketReader.processIQ(SocketReader.java:244)
        at org.jivesoftware.openfire.net.ServerSocketReader.access$001(ServerSocketReader.java:47)
        at org.jivesoftware.openfire.net.ServerSocketReader$1.run(ServerSocketReader.java:80)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:613)

We instrumented JZLib in com.jcraft.jzlib.Deflate.{longest_match,deflate} and noticed that all problems proceeded immediately after interleaved access to the deflation object. Here’s an example log output where one can see interleaved access to the same hashcode (the hashcode is the identifier of the deflator object).

85553     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 30368, cur_match 29945, best_len 240 85554     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 29945, cur_match 29522, best_len 240 85555     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 29522, cur_match 29099, best_len 240 85556     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 29099, cur_match 28676, best_len 240 85557     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 28676, cur_match 28253, best_len 240 85558     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 28253, cur_match 27830, best_len 240 85559     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 27830, cur_match 27407, best_len 240 85560     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 27407, cur_match 26984, best_len 240 85561     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 26984, cur_match 26436, best_len 240 85562     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 26436, cur_match 26201, best_len 240 85563     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 26201, cur_match 26011, best_len 240 85564     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 26011, cur_match 25776, best_len 240 85565     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 25776, cur_match 25586, best_len 240 85566     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 25586, cur_match 25351, best_len 240 85567     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 25351, cur_match 25161, best_len 240 85568     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 25161, cur_match 24926, best_len 240 85569     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 24926, cur_match 24736, best_len 240 85570     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 24736, cur_match 24501, best_len 240 85571     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, next_in 445, next_in_index 0, avail_in 445, next_out 458, next_out_index 0, avail_out 458 85572     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 24501, cur_match 24311, best_len 240 85573     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 24311, cur_match 24076, best_len 240 85574     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, longest_match: prev_length 2 85575     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 24076, cur_match 23886, best_len 240 85576     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, match 0, cur_match 47140, best_len 2 85577     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 23886, cur_match 23651, best_len 240 85578     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, longest_match: prev_length 2 85579     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, match 0, cur_match 53637, best_len 2 85580     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 23651, cur_match 23461, best_len 240 85581     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, match 53698, cur_match 53214, best_len 61 85582     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 23461, cur_match 23226, best_len 240 85583     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, match 53452, cur_match 52791, best_len 238 85584     thread Thread[pool-9-thread-1,5,main], hashcode 10114092, match 23226, cur_match 23036, best_len 240 85585     thread Thread[pool-9-thread-2,5,main], hashcode 10114092, match 52791, cur_match 52368, best_len 238

Thoughts?

Joe

I had to disable compression on my server. People (using Pandion) were getting some messages but not others. I know this was a bug a while back but with the recent OpenFire update I tried enabling it. The same issue occured. Disabling compression is the only fix.

I am not sure if this issue is Pandion related or OpenFire, but it could be related to what you describe. We don’t have huge amounts of traffic so having it disabled works for us now.

Maybe related to this: http://coccinella.im/node/80