Evan Martin (evan) wrote in evan_tech,


An RTL/bidi support question came up at work the other day, and I was happy to realize I knew the general answer, the approaches available, and where to go to really nail down the exact solution. I partly blame (thank, really) gaal. Here's an old post of mine that would've been on here had this journal existed back then; it's short, and it reveals some of the intricacies as well as the human solutions for them.

Tonight's unrelated (but surprisingly related) project involved importing data from a source which had strings of bytes along with a string form of their encoding. For the result I just want to deal with UTF-8, so I normalized all the encoding names and shoved the strings through iconv.
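For the curious, here is a rough sketch of that normalize-then-convert step. It uses Python's built-in codecs rather than iconv itself, and the function names (`normalize_encoding_name`, `to_utf8`) are made up for illustration; the real import script was not structured exactly like this.

```python
import codecs


def normalize_encoding_name(name: str) -> str:
    """Tidy a free-form charset label (hypothetical helper, not from the original script)."""
    return name.strip().lower().replace("_", "-")


def to_utf8(raw: bytes, encoding_name: str) -> bytes:
    """Decode raw bytes from the named encoding and re-encode them as UTF-8.

    Raises LookupError if the codec registry does not know the label.
    """
    codec = codecs.lookup(normalize_encoding_name(encoding_name))
    text = raw.decode(codec.name)
    return text.encode("utf-8")
```

Calling `to_utf8(b"caf\xe9", " ISO_8859-1 ")` decodes the Latin-1 bytes and hands back their UTF-8 form. An unrecognized label fails loudly with `LookupError`, which is roughly the behavior that surfaced the problem described next.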

But it puked on "ISO-8859-8-i", which is Hebrew where the -i means "in logical order" (as opposed to "visual order"*). (If those terms mean nothing to you, read the star at the bottom.) Not hard to fix -- encoding conversion is the same as ISO-8859-8 -- but it points out a flaw in my conversion process! If some of these byte sequences are in encodings like ISO-8859-8 that are "backwards" from the logical order (which all UTF-8 is), that means, for converting to UTF-8, something needs to get reversed!

Some tests with the command-line iconv indicate it works a character at a time, so its conversion to ISO-8859-8 comes out "backwards". (Amusing sidenote: try highlighting this text, which is in visual order... it looks like my Firefox gets confused by RTL text in visual order as well.)

So this means that proper UTF-8 to ISO-8859-8 conversion needs to reverse the order of the characters, while for ISO-8859-8-i you just tell iconv it's ISO-8859-8 and don't reverse anything. Hmm.
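Here is a minimal sketch of that rule, going in the direction I actually need (Hebrew bytes to UTF-8). It assumes each input is a single line of pure Hebrew, so that reversing the whole string is enough; mixed Hebrew/Western text or digits would need a real bidi implementation run in reverse. The function name `hebrew_bytes_to_utf8` is hypothetical, and the byte decoding uses Python's `iso8859_8` codec in place of iconv.

```python
def hebrew_bytes_to_utf8(raw: bytes, encoding_name: str) -> bytes:
    """Convert ISO-8859-8 or ISO-8859-8-i bytes to logical-order UTF-8.

    Assumption: a single line of pure Hebrew, so a whole-string reversal
    is a good enough stand-in for undoing visual order.
    """
    name = encoding_name.strip().lower()
    if name not in ("iso-8859-8", "iso-8859-8-i"):
        raise ValueError(f"unexpected encoding: {encoding_name!r}")

    # The byte-to-character mapping is identical for both labels.
    text = raw.decode("iso8859_8")

    # Plain ISO-8859-8 is taken to be visual order, so flip it;
    # the -i variant is already in logical order.
    if not name.endswith("-i"):
        text = text[::-1]

    return text.encode("utf-8")
```

So the only thing the `-i` changes is whether the reversal happens; the character table never differs.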

But here's the harder question: with these ISO encodings it seems like you know the fundamental directionality of the text. But with UTF-8 that is metadata that is supposed to be carried along externally from the data itself. Here's where I get confused, on two fronts.
  1. If the encoding really does specify a fundamental directionality, what is the right thing to do? I can't test how to handle an ISO encoding with mixed Hebrew and Western text because I don't trust any of the software to get it right: Firefox doesn't appear to even lay out my UTF-8 Hebrew in the right order (maybe it only applies the bidi algorithm for HTML documents?). Gedit does, but it can't display non-UTF-8 text.
  2. If I wanted to encode directionality into some UTF-8 text, what is the correct way? I'm guessing you stick an explicit RLE and PDF mark around the text but I'm not sure.
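To make the second guess concrete, here is a small sketch. RLE is U+202B (RIGHT-TO-LEFT EMBEDDING) and PDF is U+202C (POP DIRECTIONAL FORMATTING); whether wrapping is the *right* answer for a given consumer is still an open question to me. The helper names `wrap_rtl` and `first_strong_direction` are made up for illustration. The second one approximates how a renderer might guess the base direction when no metadata is supplied, by looking for the first strongly directional character.

```python
import unicodedata
from typing import Optional

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_rtl(text: str) -> str:
    """Surround text with explicit RLE ... PDF marks (hypothetical helper)."""
    return f"{RLE}{text}{PDF}"


def first_strong_direction(text: str) -> Optional[str]:
    """Guess base direction from the first strongly directional character.

    Returns "ltr", "rtl", or None if the text has no strong characters
    (for example, only digits and spaces).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None
```

The marks are ordinary code points, so they survive the trip through UTF-8 like any other character: `wrap_rtl(s).encode("utf-8")` carries the directionality inside the data instead of beside it.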

My one sample of ISO-8859-8-i turns out to just be all English anyway. All this for nothing...

* "visual order": a pet peeve of mine. It really means "the first characters are the leftmost if you displayed them as a human would read them", but there's nothing about left-first for vision, so a literal interpretation of it means it's the same as "logical order" (where the first character is the first character you'd read).
Tags: i18n, linguistics, project
