We have had an issue throughout all versions of Openfire with Unicode characters (such as Chinese) getting corrupted in messages, group names and JIDs.
For instance the JID:
坏脾气是我@ourserver.com
would sometimes (but not always) change to something like this:
Obviously this results in ongoing issues and lots of other random error messages.
Also, if we send a message containing Chinese characters, it will often get these �� inserted at random positions, and some of the original characters are lost as a result.
It isn’t always �� characters - sometimes it’s other funny characters.
I have tried from various different clients and still have the same issues so I’m pretty sure the issue is in Openfire.
I assume this issue is related to MINA (but it may not be). I have seen other discussions about Unicode issues, but I am not sure whether any of them relate specifically to this one. We have tried updating the MINA library to v1.1.7 and it actually seems to help a bit, but we still have the same issues.
Our service is multi-language and it is important that we fully support these character sets.
I couldn’t reproduce this problem, but I did identify a problem in the current code. Your dump in OF-92 contains 226 bytes, while the UTF8Buffer was intended to contain 2-4 bytes. So I uploaded a new version with these three fixes:
LG
...
/* if needed: complete previous incomplete UTF8 char */
if (missingUTF8bytes > 0)
{
    // FIX 2010.04.04 missingUTF8bytes_tmp as missingUTF8bytes is modified in loop (missingUTF8bytes--;)
    int missingUTF8bytes_tmp = missingUTF8bytes;
    for (int i = 0; i <= missingUTF8bytes_tmp; i++)
    {
        if (len == i)
        {
            return; /* not enough data to complete char */
        }
        /* fill the buffer */
        UTF8Buffer.put(byteBuffer.get(i));
        incompleteUTF8bytes++;
        missingUTF8bytes--;
        newbyteBufferPosition++;
        // FIX 2010.04.04 break loop after filling buffer completely
        if (missingUTF8bytes == 0)
        {
            break;
        }
    }
    /* read the buffer */
    UTF8Buffer.flip();
    // FIX 2010.04.04 read the whole UTF8Buffer (should make no difference)
    // -- buffer.append(UTF8Buffer.getString(incompleteUTF8bytes, decoder));
    buffer.append(UTF8Buffer.getString(decoder));
    UTF8Buffer = null;
    ...
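For anyone following along, here is a minimal standalone sketch of the failure mode the snippet above guards against: a multi-byte UTF-8 character split across two network reads. It is not Openfire or MINA code. The class and method names (SplitUtf8Demo, decodeNaively, decodeWithCarryOver) are made up for illustration, and it uses only the JDK's CharsetDecoder. Decoding each chunk on its own produces replacement characters, while carrying the undecoded tail bytes over to the next chunk does not.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitUtf8Demo {

    /** Decodes every chunk independently; a character split across chunks is destroyed. */
    static String decodeNaively(byte[][] chunks) {
        StringBuilder out = new StringBuilder();
        for (byte[] chunk : chunks) {
            // new String(...) silently replaces malformed input with U+FFFD
            out.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    /** Decodes chunk by chunk, but keeps undecoded trailing bytes for the next chunk. */
    static String decodeWithCarryOver(byte[][] chunks) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder out = new StringBuilder();
        byte[] pending = new byte[0]; // bytes of an incomplete character, if any
        for (byte[] chunk : chunks) {
            ByteBuffer in = ByteBuffer.allocate(pending.length + chunk.length);
            in.put(pending);
            in.put(chunk);
            in.flip();
            CharBuffer chars = CharBuffer.allocate(in.remaining());
            // endOfInput = false: an incomplete trailing sequence is left in 'in'
            decoder.decode(in, chars, false);
            chars.flip();
            out.append(chars);
            pending = new byte[in.remaining()];
            in.get(pending);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The JID node from the first post, written with escapes to avoid source-encoding issues
        String text = "\u574F\u813E\u6C14\u662F\u6211";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Split in the middle of the second character (each one is 3 bytes long)
        byte[][] chunks = {
            Arrays.copyOfRange(bytes, 0, 4),
            Arrays.copyOfRange(bytes, 4, bytes.length)
        };
        System.out.println("naive matches original:      " + decodeNaively(chunks).equals(text));
        System.out.println("carry-over matches original: " + decodeWithCarryOver(chunks).equals(text));
    }
}
```

Whether corruption shows up depends on where the chunk boundary happens to fall, which would fit the "sometimes but not always" behaviour described above.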
This is still a major issue for us so I will try implementing your updated code asap.
If you send a few paragraphs of Chinese characters through Jabber in a single message, you will see this problem occurring. You can just copy a bunch of characters from any website. You may need to send it a couple of times before it happens. Every now and then you will see the square, or some other pair of incorrect characters, come through on the receiving client.
And as you can see in my dump, it also affects JID nodes and group names. Interestingly, it doesn’t seem to be affecting other things like nicknames - they never get corrupted? Obviously the JID node corruption causes a bunch of secondary errors in Openfire.
I’ll let you know how it goes after implementing your update. Thanks again.
If this does not improve things, then the Openfire code needs to be reviewed. Unless every connection uses its own parser, these errors will occur, because I use the parser to store the last (incomplete) UTF-8 bytes between reads. My tests are single-threaded, so they always use the same parser.
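To illustrate the point about per-connection state, here is a rough sketch, assuming a decoder that keeps leftover bytes between reads. None of this is Openfire code: PerConnectionUtf8, StatefulDecoder, onBytes and the connection ids are hypothetical names. It shows that when one stateful decoder is shared, the leftover bytes of one connection get glued onto the data of another, while one decoder per connection keeps the streams apart.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerConnectionUtf8 {

    /** Hypothetical stand-in for a parser that stores incomplete UTF-8 bytes between reads. */
    static final class StatefulDecoder {
        private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        private byte[] pending = new byte[0];

        String feed(byte[] chunk) {
            ByteBuffer in = ByteBuffer.allocate(pending.length + chunk.length);
            in.put(pending);
            in.put(chunk);
            in.flip();
            CharBuffer chars = CharBuffer.allocate(in.remaining());
            decoder.decode(in, chars, false); // keep an incomplete tail for the next read
            chars.flip();
            pending = new byte[in.remaining()];
            in.get(pending);
            return chars.toString();
        }
    }

    // One decoder per connection id, created lazily on first use
    private final Map<String, StatefulDecoder> decoders = new ConcurrentHashMap<>();

    String onBytes(String connectionId, byte[] chunk) {
        return decoders.computeIfAbsent(connectionId, id -> new StatefulDecoder()).feed(chunk);
    }

    public static void main(String[] args) {
        byte[] a = "\u574F".getBytes(StandardCharsets.UTF_8); // 3 bytes, sent by connection A
        byte[] b = "\u6211".getBytes(StandardCharsets.UTF_8); // 3 bytes, sent by connection B
        byte[] a1 = {a[0]};
        byte[] a2 = {a[1], a[2]};

        // Reads arrive interleaved: first part of A, all of B, rest of A
        PerConnectionUtf8 server = new PerConnectionUtf8();
        String fromA = server.onBytes("A", a1);
        String fromB = server.onBytes("B", b);
        fromA += server.onBytes("A", a2);
        System.out.println("per-connection intact: "
                + (fromA.equals("\u574F") && fromB.equals("\u6211")));

        // Same reads pushed through a single shared decoder
        StatefulDecoder shared = new StatefulDecoder();
        String mixed = shared.feed(a1) + shared.feed(b) + shared.feed(a2);
        System.out.println("shared decoder corrupted: " + mixed.contains("\uFFFD"));
    }
}
```

A single-threaded test with one connection would never hit the shared-decoder case, which may be why the problem does not reproduce there.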