
Is the Smack library un-encoding HTML in packets?

Hi all. I’ve built a pubsub consumer using the smackx library. Thing is, I am getting atom payloads whose content elements are of type html. According to the atom spec, all html entities need to be encoded. When I run my pubsub consumer, they are not encoded by the time they get to my packet listener. However, if I have debug enabled, I can see in the debugger that the entities are properly encoded. I am trying to figure out whether the debugger is doing something special and encoding them for me, or whether the pubsub publisher is giving me bad atom payloads. If I’m not getting bad payloads, then why are they fine in the debugger, but un-encoded once they get to my application code? My simple test is to open a connection, register a packet listener that just logs packet.toXML(), and subscribe to a node. Packets come down just fine, except for the difference between the packet as it gets to me and as it appears in the debugger. I’ve done some code tracing but nothing jumps out at me.

Obviously, once I try to parse the atom payloads as atom, the un-encoded entities cause problems.

Could it be the pull parser somehow?

Is there any way to see the raw stream as it is being sent to me before it hits the pull parser?

Ugh. After some testing, it is indeed the pull parser.

The way I am doing it, which may be wrong, is that I add an extension provider to handle the atom “entry” tag in the http://www.w3.org/2005/Atom namespace. I then just gather the buffer into a string via parser.getText() and use a third-party atom parser to parse that string. The problem is that the pull parser turns stuff like

some html encodings. this is an apostrophe &apos;s then a link &lt;a href="http://google.com"&gt;http://google.com&lt;/a&gt;

into

some html encodings. this is an apostrophe 's then a link <a href="http://google.com">http://google.com</a>

So when I then use my atom parser, it chokes on that. It’s not just because of the type=‘html’ either; those characters could be anywhere and it will decode them.

Is that by design? Is there anything I can do besides re-encoding the html all over again?


Decoding the entities within the content tag is the correct behaviour according to the XML specification, so the XmlPullParser is working correctly.
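
To see that this is general XML parser behaviour and not something specific to Smack, here is a minimal sketch using the JDK's built-in StAX reader as a stand-in for the pull parser (an assumption: Smack itself uses an XmlPullParser implementation, not StAX). The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class EntityDecodingDemo {

    // Collects all character data in the document, the way a provider
    // that concatenates getText() results would see it.
    static String textOf(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    // Predefined entities (&lt; &gt; &amp; ...) arrive here already decoded.
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<content type=\"html\">a &lt;b&gt;bold&lt;/b&gt; word</content>";
        System.out.println(textOf(xml)); // prints: a <b>bold</b> word
    }
}
```

The text handed to application code is the decoded character data; the escaped form only exists on the wire, which is what the debugger shows.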

If your atom parser can’t use an XmlPullParser directly, there is no way around re-encoding the text of elements containing html.
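
Re-encoding can be as simple as escaping the five predefined XML entities on every piece of text (and attribute value) before writing it back into the buffer. A rough sketch, assuming you rebuild the entry markup yourself inside your extension provider; XmlEscaper is a hypothetical helper name, not part of Smack:

```java
public class XmlEscaper {

    // Escapes the five characters that have predefined XML entities,
    // so decoded parser text can be placed back inside an element.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = "a link <a href=\"http://google.com\">http://google.com</a>";
        // Wrap the re-escaped text in the element the atom parser expects.
        System.out.println("<content type=\"html\">" + escape(decoded) + "</content>");
    }
}
```

Apply the escaping to text and attribute values only, never to the tags you emit yourself, otherwise the atom parser will see the entry structure as plain text.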