Evan Martin (evan) wrote in evan_tech,

arabic is hard

Meena's been here all week so I haven't been doing much. I've been studying my Modern Arabic book and she helps me with that. I want to make a word list program, but unfortunately Pango, which is typically touted as having excellent foreign script support, doesn't actually support Arabic vowel marks (which are pretty crucial when you're just learning words). I found the bug report and I intend to try out some of the patches...

Funny thing: gvim actually gets it right. Vim has really excellent (as far as a monospaced editor goes) Arabic support. ":help arabic.txt" if you're curious.

Anyway, Arabic is unbelievably difficult to get right. Here's an example.
You have the character alif, which basically looks like an I (in a sans-serif font). It can optionally have a hamza (backwards italic 2) on top of it. But that's ok, that's a simple combining character problem: same as o plus umlaut.
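To make the "same as o plus umlaut" comparison concrete, here's a minimal sketch using Python's standard unicodedata module. It assumes the hamza is encoded as the combining mark U+0654 (HAMZA ABOVE) following a plain alif (U+0627); that encoding is my choice for illustration, not something taken from the Pango bug.

```python
import unicodedata

# Latin: "o" followed by COMBINING DIAERESIS (U+0308)
o_umlaut = "o\u0308"
# Arabic: ALEF (U+0627) followed by HAMZA ABOVE (U+0654)
alif_hamza = "\u0627\u0654"


def describe(text):
    """Return (character name, canonical combining class) per code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


# NFC normalization folds each base + mark pair into one precomposed character.
composed_o = unicodedata.normalize("NFC", o_umlaut)
composed_alif = unicodedata.normalize("NFC", alif_hamza)

print(describe(o_umlaut))
print(describe(alif_hamza))
print(len(composed_o), unicodedata.name(composed_o))
print(len(composed_alif), unicodedata.name(composed_alif))
```

In both cases the mark has a non-zero combining class and the pair collapses to a single code point, so at this level the two scripts really do pose the same problem.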
Now you can add a vowel mark on top of that, like a forward slash. Two combining characters; less common, but not unheard of (I think Vietnamese has a lot of them).
Now we add a lam after it. It looks like an L, but since Arabic is right-to-left, the L is to the left of the alif, which means the foot of the L points at the alif. This would in theory make a big squarish U shape. The alif has to change to connect to the lam, but that's ok, you just have an extra "connected-to-the-left-alif" symbol in your font.
Unfortunately, the sequence of lam+alif is always written as a special ligature; instead of being a U, the lines cross at the bottom and make a small loop (as if the U is twisted at the bottom). This is not optional; the alternative is actually unreadable to a native reader. So there's another entry in your font for lam+alif, where both characters are joined in one cell.
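Unicode happens to carry legacy "presentation form" code points for both ideas: a connected alif and the lam+alif ligature. Here's a rough sketch that maps them back to the plain letters with compatibility normalization. As far as I understand it, a real renderer picks these shapes from the font's own tables rather than from these code points, so treat this as an illustration of the idea and not as how Pango does it.

```python
import unicodedata

LAM = "\u0644"   # ARABIC LETTER LAM
ALIF = "\u0627"  # ARABIC LETTER ALEF

# Legacy presentation-form code points (Arabic Presentation Forms-B)
ALIF_FINAL = "\ufe8e"         # alif joined to the preceding letter
LAM_ALIF_ISOLATED = "\ufefb"  # the lam+alif ligature in a single cell


def logical(text):
    """Map presentation forms back to the plain letters they stand for."""
    return unicodedata.normalize("NFKC", text)


for form in (ALIF_FINAL, LAM_ALIF_ISOLATED):
    plain = logical(form)
    print(unicodedata.name(form))
    print("  decomposition:", unicodedata.decomposition(form))
    print("  plain letters:", [unicodedata.name(ch) for ch in plain])
```

The ligature is one code point on the way in and two letters on the way out, which is exactly the "two characters in one cell" situation described above.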
Now you want to add a vowel mark to the lam. (Or, more commonly, a mark that means "no vowel". All of that together produces the "al", meaning "the", that you see throughout Arabic: al-Qaida, al-Jazeera, Allah, etc.)

So we have one glyph that now represents five different attributes: alif, lam, the hamza on the alif, the alif vowel mark, and the lam vowel mark, and the marks on each of those letters need to be drawn over the correct places. I don't even know if it's possible to do correctly... but of course it is, because countless Arabic books (or at least textbooks; normal Arabic text lacks the vowel marks) have it.
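The bookkeeping half of the problem (which mark belongs to which letter) is at least easy to sketch. The snippet below groups each base letter with the combining marks that follow it. The helper name clusters is made up for this post, and I'm assuming the "forward slash" vowel is a fatha (U+064E) and the "no vowel" mark is a sukun (U+0652). Actually positioning those marks over a ligature is the hard part, and this doesn't attempt it.

```python
import unicodedata


def clusters(text):
    """Group each base letter with the combining marks that follow it."""
    groups = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            # Non-zero combining class: attach the mark to the previous base.
            groups[-1][1].append(ch)
        else:
            groups.append((ch, []))
    return groups


# alif, hamza above, fatha, lam, sukun: the five attributes from the text
word = "\u0627\u0654\u064e\u0644\u0652"

for base, marks in clusters(word):
    print(unicodedata.name(base), [unicodedata.name(m) for m in marks])
```

Five code points go in and two clusters come out, and a renderer then has to draw both clusters inside whatever single shape the font uses for the pair of letters.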