Encoding Problem: Double Mis-Conversion

Symptom

With this particular double conversion, most characters display correctly. Only characters whose UTF-8 encoding contains one of the bytes 0x81, 0x8D, 0x8F, 0x90, or 0x9D fail. Among the Windows-1252 characters that UTF-8 encodes as two bytes, the ones at Unicode code points U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD show the problem. If you look at the I18nQA Encoding Debug Table, you can see that the second UTF-8 byte of each of these characters is one of the unassigned Windows-1252 code points.

Á Í Ï Ð Ý
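
The list of affected characters can be reproduced with a short sketch. It assumes the five unassigned byte values named above and scans only the range U+00A0 to U+00FF; the helper name `affected_characters` is hypothetical and used for illustration only.

```python
# The five byte values that are unassigned in Windows-1252.
UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def affected_characters(first=0x00A0, last=0x00FF):
    """Return the characters in the given code point range whose UTF-8
    encoding contains an unassigned Windows-1252 byte."""
    result = []
    for code_point in range(first, last + 1):
        ch = chr(code_point)
        if any(b in UNASSIGNED for b in ch.encode("utf-8")):
            result.append(ch)
    return result


print(" ".join(affected_characters()))  # Á Í Ï Ð Ý
```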

Explanation

Software that incorrectly converts the bytes of UTF-8 characters from Windows-1252 to UTF-8 and back will appear to work for most characters, but will fail for certain values such as U+00DD Ý.

The Windows-1252 code points 0x81, 0x8D, 0x8F, 0x90, and 0x9D are unassigned. They do not represent any characters. An attempt to convert any of these code points from Windows-1252 to UTF-8 will return an error, an unknown-character value (usually a question mark "?"), or some other signal that a problem has occurred.
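
As a quick check, the unassigned code points can be found by probing a Windows-1252 decoder. The sketch below assumes Python's built-in `cp1252` codec, which raises an error for these bytes; other converters may behave differently (for example, by substituting a replacement character). The function name `unassigned_cp1252_bytes` is hypothetical.

```python
def unassigned_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    unassigned = []
    for value in range(0x100):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            unassigned.append(value)
    return unassigned


print([f"0x{v:02X}" for v in unassigned_cp1252_bytes()])
# ['0x81', '0x8D', '0x8F', '0x90', '0x9D']

# With errors="replace" the codec substitutes U+FFFD instead of raising.
replaced = b"\x9d".decode("cp1252", errors="replace")
print(f"U+{ord(replaced):04X}")  # U+FFFD
```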

The software performs an incorrect conversion of UTF-8 bytes from Windows-1252 to UTF-8, together with a compensating conversion from UTF-8 back to Windows-1252. This "works", or is at least harmless, for most characters, since the retrieved byte sequences are identical to those that were stored. However, it fails for characters where the unassigned code points are involved. The first conversion generates an error, and the reverse conversion then cannot return the original bytes.
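
The two conversions can be sketched as follows. This is a minimal illustration that assumes Python's strict `cp1252` codec, so the failure surfaces as an exception rather than a question mark; `round_trip` is a hypothetical helper, not part of any real driver.

```python
def round_trip(text):
    """Simulate the double mis-conversion on a string stored as UTF-8.

    Raises UnicodeDecodeError when one of the UTF-8 bytes is an
    unassigned Windows-1252 code point.
    """
    utf8_bytes = text.encode("utf-8")
    # Incorrect step: the UTF-8 bytes are interpreted as Windows-1252 ...
    misread = utf8_bytes.decode("cp1252")
    # ... and converted to UTF-8, which produces double-encoded bytes.
    stored = misread.encode("utf-8")
    # Compensating step: convert from UTF-8 back to Windows-1252.
    retrieved = stored.decode("utf-8").encode("cp1252")
    # The application then reads the retrieved bytes as UTF-8 again.
    return retrieved.decode("utf-8")


print(round_trip("è"))  # è

try:
    round_trip("Ý")
except UnicodeDecodeError as exc:
    print("conversion failed:", exc.reason)
```

For "è" the compensating conversion restores the original bytes 0xC3, 0xA8. For "Ý" the first conversion already fails on byte 0x9D, so there is nothing correct to convert back.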

An example of this occurs if a database driver is not configured correctly. A program uses UTF-8 for text and stores its text in a UTF-8 database. Because of the incorrect configuration, the driver treats the program's UTF-8 text as Windows-1252 character encoding. Each byte of the UTF-8 text is converted from Windows-1252 to UTF-8 as the data is stored in the database, and then converted back from UTF-8 to Windows-1252 when the data is retrieved. The application and database will seem to be working fine except on the occasions when one of the unassigned code points is encountered. See Table 2, Demonstrating Problem with Unassigned Code Points.

Table 2, Demonstrating Problem with Unassigned Code Points

Character | UTF-8 Bytes | View as Windows-1252 | Convert to UTF-8 | Revert to Windows-1252 | View as UTF-8
è (U+00E8) | 0xC3, 0xA8 | Ã, ¨ | 0xC3, 0x83, 0xC2, 0xA8 (Ã, ƒ, Â, ¨) | 0xC3, 0xA8 (Ã, ¨) | è
Ý (U+00DD) | 0xC3, 0x9D | Ã, invalid | Ã, invalid | invalid | invalid
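
The database scenario and the two rows of Table 2 can be simulated with a small sketch. It assumes a lenient driver that substitutes a replacement character instead of raising an error, modeled here with Python's `errors="replace"`; the names `driver_store`, `driver_retrieve`, and `hex_bytes` are hypothetical, and a real driver may signal the problem differently.

```python
def driver_store(app_bytes):
    """Misconfigured driver on write: treats the application's UTF-8 bytes
    as Windows-1252 and converts them to UTF-8 for the database."""
    return app_bytes.decode("cp1252", errors="replace").encode("utf-8")


def driver_retrieve(db_bytes):
    """Compensating conversion on read: UTF-8 back to Windows-1252."""
    return db_bytes.decode("utf-8").encode("cp1252", errors="replace")


def hex_bytes(data):
    """Format bytes in the notation used by Table 2."""
    return ", ".join(f"0x{b:02X}" for b in data)


for ch in ("è", "Ý"):
    original = ch.encode("utf-8")
    stored = driver_store(original)
    returned = driver_retrieve(stored)
    print(ch, "|", hex_bytes(original), "|", hex_bytes(stored), "|", hex_bytes(returned))
# è | 0xC3, 0xA8 | 0xC3, 0x83, 0xC2, 0xA8 | 0xC3, 0xA8
# Ý | 0xC3, 0x9D | 0xC3, 0x83, 0xEF, 0xBF, 0xBD | 0xC3, 0x3F
```

In this sketch the byte 0x9D becomes U+FFFD on the way in and a question mark (0x3F) on the way out, so the bytes returned for "Ý" are no longer valid UTF-8.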

References