evan_tech

12:51 am, 2 Oct 05

bidi

An RTL/bidi support question came up at work the other day, and I was happy to realize I knew the general answer, the approaches available, and where to go to really nail down the exact solution. I partly blame (thank, really) gaal. Here's an old post of mine that would've been on here had this journal existed back then; it's short and shows some of the intricacies as well as the human solutions for them.

Tonight's unrelated (but surprisingly related) project involved importing data from a source which had strings of bytes along with a string form of their encoding. For the result I just want to deal with UTF-8, so I normalized all the encoding names and shoved the strings through iconv.
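For the curious, the normalize-and-convert step looks roughly like this. It's a sketch in Python using the built-in codec registry rather than iconv (which is what I actually used), and the function names `normalize_encoding_name` and `to_utf8` are made up for illustration.

```python
import codecs


def normalize_encoding_name(declared: str) -> str:
    """Map a loosely written encoding label to the codec's canonical name.

    codecs.lookup() already ignores case and treats '-' and '_' alike,
    so this mostly just trims whitespace. It raises LookupError for
    labels that Python does not know.
    """
    return codecs.lookup(declared.strip()).name


def to_utf8(raw: bytes, declared: str) -> bytes:
    """Decode raw bytes using the declared encoding and re-encode as UTF-8."""
    text = raw.decode(normalize_encoding_name(declared))
    return text.encode("utf-8")
```

Labels the registry doesn't recognize raise `LookupError` instead of being silently guessed at, which is the behavior I want from an import script.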

But it puked on "ISO-8859-8-i", which is Hebrew, where the -i means "in logical order" (as opposed to "visual order"*). (If those terms mean nothing to you, read the star at the bottom.) Not hard to fix -- the byte-to-character mapping is the same as ISO-8859-8 -- but it points out a flaw in my conversion process! If some of these byte sequences are in encodings like ISO-8859-8 that are "backwards" from the logical order (which all UTF-8 is), then converting them to UTF-8 means something needs to get reversed!

Some tests with the command-line iconv indicate it works a character at a time, so its conversion to ISO-8859-8 comes out "backwards". (Amusing sidenote: try highlighting this text, which is in visual order... it looks like my Firefox gets confused by RTL text in visual order as well.)

So this means that proper UTF-8 to ISO-8859-8 conversion needs to reverse the order of the characters, while for ISO-8859-8-i you just tell iconv it's ISO-8859-8 and don't reverse anything. Hmm.
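Going the direction I actually care about (into UTF-8), that rule can be sketched like so. This is Python rather than iconv, `hebrew_to_utf8` is a made-up name, and it assumes a bare "ISO-8859-8" label really does mean visual order. The reversal is deliberately naive: it is only right for lines of pure Hebrew, since digits or Latin words embedded in visual-order text are already left-to-right and would come out reversed.

```python
def hebrew_to_utf8(raw: bytes, declared: str) -> bytes:
    """Convert ISO-8859-8 style bytes to logical-order UTF-8.

    Assumption: "ISO-8859-8-i" is already in logical order, while plain
    "ISO-8859-8" is in visual order and must be reversed line by line.
    """
    name = declared.strip().lower()
    if name == "iso-8859-8-i":
        # Same byte-to-character table as ISO-8859-8, nothing to reverse.
        text = raw.decode("iso-8859-8")
    elif name == "iso-8859-8":
        decoded = raw.decode("iso-8859-8")
        # Naive reversal of each line; wrong for mixed-direction text.
        text = "\n".join(line[::-1] for line in decoded.split("\n"))
    else:
        text = raw.decode(name)
    return text.encode("utf-8")
```

With this, the same Hebrew word stored in either byte order ends up as the same UTF-8 string. Anything with mixed directions would need a real inverse of the bidi algorithm, which this is not.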

But here's the harder question: with these ISO encodings it seems like you know the fundamental directionality of the text. But with UTF-8 that is metadata that is supposed to be carried along externally from the data itself. Here's where I get confused, on two fronts.
  1. If the encoding really does specify a fundamental directionality, what is the right thing to do? I can't test how to handle an ISO encoding with mixed Hebrew and Western text because I don't trust any of the software to get it right: Firefox doesn't appear to even lay out my UTF-8 Hebrew in the right order (maybe it only applies the bidi algorithm for HTML documents?). Gedit does, but it can't display non-UTF-8 text.
  2. If I wanted to encode directionality into some UTF-8 text, what is the correct way? I'm guessing you stick an explicit RLE and PDF mark around the text but I'm not sure.
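If the guess in point 2 is right, wrapping text in the marks is at least mechanically simple. Here's a sketch in Python; `embed_rtl` is a made-up helper, and whether this is the *correct* way to carry directionality is exactly the thing I'm unsure about. The code points themselves are not in doubt: RLE is U+202B and PDF is U+202C.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed_rtl(text: str) -> str:
    """Wrap text in an explicit right-to-left embedding (RLE ... PDF)."""
    return RLE + text + PDF


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]
```

`bidi_classes` is only there for poking at strings: Hebrew letters report class `R`, Latin letters `L`, and the two marks report `RLE` and `PDF`.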

My one sample of ISO-8859-8-i turns out to just be all English anyway. All this for nothing...

* "visual order": a pet peeve of mine. It really means "the first characters are the leftmost if you displayed them as a human would read them", but there's nothing about left-first for vision, so a literal interpretation of it means it's the same as "logical order" (where the first character is the first character you'd read).