[Resolved] Error reading vcard (Patch inside)

I get this exception when loading a vcard that contains accented characters like á or non-standard letters like ñ. I've tried with the latest stable version and with the latest nightly build.

:1:21: Invalid byte 2 of 3-byte UTF-8 sequence.

org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.

at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)

at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)

at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:98)

at org.jivesoftware.smackx.provider.VCardProvider._createVCardFromXml(VCardProvider.java:84)

at org.jivesoftware.smackx.provider.VCardProvider.parseIQ(VCardProvider.java:76)

at org.jivesoftware.smack.PacketReader.parseIQ(PacketReader.java:603)RootFlag: false

at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:289)

at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:43)

at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:63)

Message was edited by: isendir

Looks like the server is returning the vcard in an incorrect encoding. This is not a problem with smack, but with the server.

<iq from="robemart10@" to="hector29483@*/Smack" id="PxPV5-12" type="result">

This is, for instance, one of the vcards that make Smack crash, and which Gaim has no trouble at all reading. Even if it's not Smack's fault (though it looks so), it should parse the rest of the vcard.

The problem is that as soon as there's invalid binary data in there, there's no way to handle it correctly. You'd have to go the way of HTML renderers and make educated guesses about the actual meaning of the data.

I'm glad that Smack is doing it the correct way and just refusing invalid content. This is the biggest problem on the web: there's close to no web page that doesn't have any bugs. If every browser refused invalid pages, there would be no invalid stuff, and writing things like HTML parsers/renderers would be much easier.

Excuse me, but where's the invalid binary data?

Hi Hector,

looking at the name “Roberto Martínez Martín”, the “Invalid byte 2 of 3-byte UTF-8 sequence.” error should not occur. The server names are usually 1-byte UTF-8 characters, so I assume that the packet you posted here is valid.

“í” looks to me like Unicode character “U+00ED”, and in UTF-8 this should be “0xC3 0xAD”.
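The byte values mentioned above can be checked with a few lines of Java. This is only a sketch: the class name `AccentBytes` and the helper `toHex` are made up for illustration and are not part of Smack.

```java
import java.nio.charset.StandardCharsets;

class AccentBytes {
    // Render a byte array as space-separated hex values, e.g. "0xC3 0xAD".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String accent = "\u00ED"; // the character "í"
        // Two bytes in UTF-8.
        System.out.println(toHex(accent.getBytes(StandardCharsets.UTF_8)));
        // A single byte in ISO-8859-1.
        System.out.println(toHex(accent.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The same character is two bytes in UTF-8 and one byte (0xED, decimal 237) in ISO-8859-1, so which bytes reach the parser depends on the charset used to encode the string.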

As far as I know, Smack offers an XMPP debugger, but the debugger does not display the received data as hex. As you may be using an encrypted connection, you may not be able to use a network sniffer to get the data.

Hopefully you can look in the code at “_createVCardFromXml(VCardProvider.java:84)” and write the packet as hex to standard out before processing it. This should help a lot to identify where the invalid sequence gets used.
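A hex dump along those lines could look roughly like the sketch below. The class `PacketHexDump`, the method `hexDump`, and the sample `<FN>` element are assumptions for illustration; in Smack the dump would be applied to the bytes built from the packet text just before they are handed to the parser.

```java
class PacketHexDump {
    // Format bytes as rows of up to 16 hex values, each row prefixed with its offset.
    static String hexDump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%04X:", i));
            }
            sb.append(String.format(" %02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample element containing an accented character.
        String xmlText = "<FN>Mart\u00EDn</FN>";
        // getBytes() without an argument uses the platform default charset,
        // so the dump shows exactly what the parser would receive on this machine.
        System.out.println(hexDump(xmlText.getBytes()));
    }
}
```

If the accent shows up as a lone `ED` rather than `C3 AD`, the bytes are not UTF-8.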

LG

My guess is that this character is transferred in another encoding (ISO8859-1 index 237?), which happens to be the beginning of a three-byte sequence in UTF-8 (note that I don't know UTF-8 well enough to be sure of that).
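This guess can be tested outside of Smack. The sketch below (class and method names are hypothetical) checks whether byte 237 (0xED) has the bit pattern `1110xxxx` that opens a three-byte UTF-8 sequence, and runs a strict UTF-8 decoder over the name from the vcard encoded as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // True if the byte has the bit pattern 1110xxxx, i.e. it opens a 3-byte UTF-8 sequence.
    static boolean isThreeByteLead(byte b) {
        return (b & 0xF0) == 0xE0;
    }

    // True if the bytes decode as UTF-8 without malformed or unmappable input.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "Roberto Mart\u00EDnez Mart\u00EDn";
        System.out.println(isThreeByteLead((byte) 0xED));
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

In the ISO-8859-1 bytes, 0xED is followed by the plain ASCII letter "n", which is not a valid continuation byte, so a strict decoder rejects the input at the second byte of the supposed three-byte sequence.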

I've tried using several servers, and the result is always the same. If the vcard contains an accent, Smack crashes when parsing it. I've tried several clients and all of them read the vcards correctly. Maybe it's a problem in the parser library.

Ok. I fixed it. Change line 84 of VCardProvider from:

Document document = documentBuilder.parse(new ByteArrayInputStream(xmlText.getBytes()));

to:

Document document = documentBuilder.parse(new ByteArrayInputStream(xmlText.getBytes("UTF8")));

Please introduce this change upstream.
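For anyone who wants to reproduce this outside of Smack, here is a minimal standalone sketch using the JDK's default XML parser. The class `VCardParseSketch`, its methods, and the sample `<FN>` element are illustrative assumptions, not Smack code. It shows the explicit UTF-8 encoding from the patch, an alternative that parses from a `StringReader` so that no byte encoding is involved at all, and the failing case forced with ISO-8859-1 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class VCardParseSketch {
    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Encode explicitly as UTF-8 before parsing, as in the patch.
    static String parseFromUtf8Bytes(String xmlText) throws Exception {
        byte[] bytes = xmlText.getBytes(StandardCharsets.UTF_8);
        Document doc = newBuilder().parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    // Alternative: parse from characters, so no charset is needed.
    static String parseFromReader(String xmlText) throws Exception {
        Document doc = newBuilder().parse(new InputSource(new StringReader(xmlText)));
        return doc.getDocumentElement().getTextContent();
    }

    // Simulates a platform whose default charset is ISO-8859-1: without an XML
    // declaration the parser assumes UTF-8 and rejects the lone 0xED byte.
    static boolean failsWithLatin1Bytes(String xmlText) {
        try {
            byte[] bytes = xmlText.getBytes(StandardCharsets.ISO_8859_1);
            newBuilder().parse(new ByteArrayInputStream(bytes));
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String xmlText = "<FN>Roberto Mart\u00EDnez Mart\u00EDn</FN>";
        System.out.println(parseFromUtf8Bytes(xmlText));
        System.out.println(parseFromReader(xmlText));
        System.out.println(failsWithLatin1Bytes(xmlText));
    }
}
```

The reader-based variant avoids the round trip through bytes entirely; whether it fits into VCardProvider is for the maintainers to judge.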

Hi,

some classes in the Wildfire code contain

    /**
     * Preferred encoding.
     */
    private final static String PREFERRED_ENCODING = "UTF-8";
    //...
    //...
    try {
        bytes = s.getBytes(PREFERRED_ENCODING);
    } // end try
    catch (java.io.UnsupportedEncodingException uee) {
        bytes = s.getBytes();
    } // end catch
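On Java 7 or later the same idea can be written without the try/catch, since `StandardCharsets.UTF_8` is a constant and `getBytes(Charset)` throws no checked exception. The class and method names below are made up for the example; this is not what Wildfire or Smack actually contain.

```java
import java.nio.charset.StandardCharsets;

class PreferredEncoding {
    // Encode with a fixed charset; there is no fallback to the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ñ" (U+00F1) takes two bytes in UTF-8.
        System.out.println(toUtf8("\u00F1").length);
    }
}
```

UTF-8 is one of the charsets every Java platform is required to support, so the fallback branch in the quoted snippet should never run in practice.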

You can track this issue here: SMACK-151

LG

uh, so it was a Smack error after all… I'm sorry

Hi,

The guess “My guess is that this character is transferred in another encoding (ISO8859-1 index 237?)” was very good, as this seems to be the problem. Depending on the locale setting, one will or will not hit such problems.

LG

*uh, so it was a Smack error after all… I'm sorry*

Don't worry

Hey Guys,

Thanks for the fix and bug report. I have applied the patch, tested it, and I am checking it in now.

Thanks Again,

Alex

No problem. That's the goodness of open source