Tonight's unrelated (but surprisingly related) project involved importing data from a source which had strings of bytes along with a string form of their encoding. For the result I just want to deal with UTF-8, so I normalized all the encoding names and shoved the strings through iconv.
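For what it's worth, here is roughly what that normalize-then-convert step looks like in Python instead of iconv. It's only a sketch: `normalize_encoding_name` and `to_utf8` are names made up for illustration, and it leans on Python's codec registry, which doesn't recognize exactly the same set of labels that iconv does.

```python
import codecs


def normalize_encoding_name(name: str) -> str:
    """Return the codec registry's canonical name for an encoding label.

    Raises LookupError if Python has no codec registered under that label.
    """
    return codecs.lookup(name.strip()).name


def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode raw bytes from the named encoding and re-encode them as UTF-8."""
    text = raw.decode(normalize_encoding_name(encoding))
    return text.encode("utf-8")


if __name__ == "__main__":
    # "Latin-1" and "ISO-8859-1" are two labels for the same codec.
    print(to_utf8(b"caf\xe9", "Latin-1"))
```

A label the registry doesn't know raises `LookupError`, so that kind of failure shows up at the lookup rather than partway through a conversion.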
But it puked on "ISO-8859-8-i", which is Hebrew where the -i means "in logical order" (as opposed to "visual order"*). (If those terms mean nothing to you, read the star at the bottom.) Not hard to fix -- the byte-to-character mapping is the same as ISO-8859-8's -- but it points out a flaw in my conversion process! If some of these byte sequences are in encodings like ISO-8859-8 that are "backwards" from the logical order (which all UTF-8 is), that means, for converting to UTF-8, something needs to get reversed!
Some tests with the command-line iconv indicate it works a character at a time without reordering anything, so its conversion from UTF-8 to ISO-8859-8 comes out "backwards". (Amusing sidenote: try highlighting this text, which is in visual order... it looks like my Firefox gets confused by RTL text in visual order as well.)
So this means that proper UTF-8 to ISO-8859-8 conversion needs to reverse the order of the characters, while for ISO-8859-8-i you just tell iconv it's ISO-8859-8 and don't reverse anything. Hmm.
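Here's a sketch of that rule for the decoding direction (bytes in, logical-order text out), in Python rather than iconv. The function name `hebrew_to_logical` is made up, and the reversal is deliberately naive: it assumes a single line of pure Hebrew, and it would scramble any embedded digits or Latin words, which is exactly the mixed-text case I'm unsure about. It also takes on faith that anything labelled plain "ISO-8859-8" really is in visual order.

```python
def hebrew_to_logical(raw: bytes, encoding: str) -> str:
    """Decode ISO-8859-8 family bytes into a logical-order string.

    Assumption: plain "ISO-8859-8" data is in visual order and holds a single
    run of right-to-left text, so reversing the characters recovers logical
    order. Mixed-direction text needs a real bidi implementation instead.
    """
    label = encoding.strip().lower()
    if label == "iso-8859-8-i":
        # Already logical: same byte-to-character mapping, no reordering.
        return raw.decode("iso8859_8")
    if label == "iso-8859-8":
        # Visual order: decode, then naively reverse the characters.
        return raw.decode("iso8859_8")[::-1]
    raise ValueError(f"not an ISO-8859-8 variant: {encoding!r}")


if __name__ == "__main__":
    shalom = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem
    logical_bytes = b"\xf9\xec\xe5\xed"  # as stored under ISO-8859-8-i
    visual_bytes = logical_bytes[::-1]   # the same word stored in visual order
    print(hebrew_to_logical(logical_bytes, "ISO-8859-8-i") == shalom)
    print(hebrew_to_logical(visual_bytes, "ISO-8859-8") == shalom)
```

Going the other way (UTF-8 out to visual-order ISO-8859-8) would be the same reversal followed by an encode, with the same caveats.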
But here's the harder question: with these ISO encodings it seems like you know the fundamental directionality of the text. But with UTF-8 that is metadata that is supposed to be carried along externally from the data itself. Here's where I get confused, on two fronts.
- If the encoding really does specify a fundamental directionality, what is the right thing to do? I can't test how to handle an ISO encoding with mixed Hebrew and Western text because I don't trust any of the software to get it right: Firefox doesn't appear to even lay out my UTF-8 Hebrew in the right order (maybe it only applies the bidi algorithm for HTML documents?). Gedit does, but it can't display non-UTF-8 text.
- If I wanted to encode directionality into some UTF-8 text, what is the correct way? I'm guessing you stick an explicit RLE and PDF mark around the text but I'm not sure.
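On that second guess: RLE and PDF are real Unicode formatting characters, U+202B (RIGHT-TO-LEFT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING). Mechanically, wrapping a run in them looks like the sketch below; `embed_rtl` is a made-up helper name, and this shows only how the marks get into the UTF-8, not whether they are the right tool for the job.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed_rtl(text: str) -> str:
    """Wrap text in an explicit right-to-left embedding (RLE ... PDF)."""
    return f"{RLE}{text}{PDF}"


if __name__ == "__main__":
    marked = embed_rtl("\u05e9\u05dc\u05d5\u05dd")
    # Each mark costs three bytes once encoded as UTF-8.
    print(marked.encode("utf-8"))
    # The Unicode database reports each character's bidi class.
    print(unicodedata.bidirectional(RLE), unicodedata.bidirectional(PDF))
```

The marks are ordinary code points in the string, so they survive a round trip through UTF-8 like any other character; it is up to whatever renders the text to honor them.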
My one sample of ISO-8859-8-i turns out to just be all English anyway. All this for nothing...
* "visual order": a pet peeve of mine. It really means "the first characters are the leftmost ones when the text is displayed the way a human would read it", but there's nothing inherently left-first about vision, so a literal interpretation makes it the same as "logical order" (where the first character is the first character you'd read).