evan_tech

02:43 pm, 19 Dec 03

arabic is hard

Meena's been here all week so I haven't been doing much. I've been studying my Modern Arabic book and she helps me with that. I want to make a word list program but unfortunately Pango, which is typically touted as having excellent foreign script support, doesn't actually support Arabic vowel marks (which are pretty crucial when you're just learning words). I found the bug report and I intend to try out some of the patches...

Funny thing: gvim actually gets it right. Vim has really excellent (as far as a monospaced editor goes) Arabic support. ":help arabic.txt" if you're curious.

Anyway, Arabic is unbelievably difficult to get right. Here's an example.
You have the character alif, which basically looks like an I (in a sans-serif font). It can optionally have a hamza (backwards italic 2) on top of it. But that's ok, that's a simple combining character problem: same as o plus umlaut.
Now you can add a vowel mark on top of that, like a forward slash. Two combining characters: less common, but not unheard of (I think Vietnamese has a lot of them).
Now we add a lam after it. It looks like an L, but since Arabic is right-to-left, the L is to the left of the alif, which means the foot of the L points at the alif. This would in theory make a big squarish U shape. The alif has to change to connect to the lam, but that's ok, you just have an extra "connected-to-the-left-alif" symbol in your font.
Unfortunately, the sequence lam+alif is always written as a special ligature: instead of forming a U, the lines cross at the bottom and make a small loop (as if the U were twisted at the bottom). This is not optional; the alternative is actually unreadable to a native reader. So there's another entry in your font for lam+alif, where both characters are joined in one cell.
Now you want to add a vowel mark to the lam. (Or, more commonly, a mark that means "no vowel". All of that together produces the "al", meaning "the", that you see throughout Arabic: al-Qaida, al-Jazeera, Allah, etc.)
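
For what it's worth, the character-level half of this can be poked at from Python's standard unicodedata module. This is only a sketch of how Unicode encodes the pieces described above; it says nothing about how Pango or Vim actually draw them, and the constant names are my own.

```python
import unicodedata

ALIF = "\u0627"         # ARABIC LETTER ALEF
LAM = "\u0644"          # ARABIC LETTER LAM
HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE (a combining mark)
FATHA = "\u064E"        # ARABIC FATHA, the "forward slash" vowel mark

# Same idea as o + umlaut: NFC folds base + mark into one code point.
o_umlaut = unicodedata.normalize("NFC", "o\u0308")
alif_hamza = unicodedata.normalize("NFC", ALIF + HAMZA_ABOVE)

# Both marks are combining characters (non-zero combining class).
mark_classes = [unicodedata.combining(c) for c in (HAMZA_ABOVE, FATHA)]

# Unicode also has a legacy "presentation form" code point for the
# lam-alif ligature; compatibility normalization splits it back
# into the two underlying letters.
LAM_ALIF_LIGATURE = "\uFEFB"
ligature_parts = unicodedata.normalize("NFKC", LAM_ALIF_LIGATURE)

print(unicodedata.name(alif_hamza))
print(mark_classes)
print([unicodedata.name(c) for c in ligature_parts])
```

Text is normally stored as the plain letters plus marks; picking the connected forms and the ligature is left to the rendering layer, which is where it gets hard.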

So we have one glyph that now represents five different attributes: alif, lam, the hamza on the alif, the alif vowel mark, and the lam vowel mark. The marks for each letter need to be placed over the correct part of the glyph. I don't even know if it's possible to do correctly... but of course it is, because countless Arabic books (or at least textbooks; normal Arabic text lacks the vowel marks) have it.
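
In the stored text, at least, each mark sits right after the letter it belongs to, so working out which marks go with which letter is the easy part. Here's a minimal sketch of that grouping. `clusters` is a hypothetical helper of mine, not anything from Pango, and it skips the genuinely hard bit, which is positioning those marks on the combined glyph.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch   # a mark: attach it to the preceding base
        else:
            groups.append(ch)  # a base letter: start a new group
    return groups

# Logical (typing) order: alif, hamza above, fatha, lam, sukun ("no vowel").
word = "\u0627\u0654\u064E\u0644\u0652"

for group in clusters(word):
    base, marks = group[0], group[1:]
    print(unicodedata.name(base), [unicodedata.name(m) for m in marks])
```

A real renderer would then need the font to say where on the combined glyph each letter's marks should attach.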