November 13th, 2003

  • evan

where syntax actually matters

Actually, that last subject was lying. I have some more pseudo-science between linguistics and computers.

Human language is generally pretty efficient, in that speakers are motivated to say what they're trying to say in the fewest words or the least time¹. Languages with longer common constructions get them shortened via slang or muttering ("I am going to", English's weird new future tense, has become "eye-m-un-a") and ideally that means they eventually optimize themselves to the "best" balance between efficiency and other concerns (redundancy, compatibility, and clarity, meaning freedom from ambiguity).

The example I keep thinking about is complementizers. Consider:
(1) The boy that I like has blue eyes.
(2) The boy that likes me has blue eyes.
The "that" in the first sentence can be dropped, but not in the second. Really, the "that" in both sentences isn't a meaningful word in the way "boy" is; "that" is there only to indicate a subclause is following. I mentally liken it to, for example, a curly brace in a programming language: when you hit the "that", you know you need to parse a specific structure that follows (in English, a relative clause) and that it specifies attributes of the word preceding it.
(Contrast class Foo; and class Foo { ...bar... };.)

We can drop the "that" in the first example because we can figure it out from the doubled noun ("boy I"), etc. A different syntax would allow us to avoid "that" completely: in (my broken) Japanese, the sentences would be (roughly, of course):
(1) i like boy blue eyes has.
(2) me likes boy blue eyes has.
Because phrases are verb-final, a noun that follows a verb indicates it's a relative clause. (Japanese does have complementizers in a different context. I recall some discussion in a class about complementizers versus structure that totally blew my mind, but I'm pretty sure I never wrote it down and I can't remember it.)

Ok, programming languages. Who cares about code efficiency in terms of letters (i.e., typing)? Most people don't--though type inference is quite nice once you've used it a bit--with one exception that hit me today: shells.

There was some noise on lambda_ultimate about Microsoft's new shell for their next Windows and the fancy way it passes objects around between processes, and then someone else retorted with another fancy shell, and I realized that with both of these, all I immediately thought about was: how much will I have to type?

Unix shells work because there's a good tradeoff between simplicity and what you can accomplish with them. Really, only being able to communicate via unstructured byte streams kinda sucks (as anyone who has tried to get a list of files in a directory ordered by size has found²), but we've managed to do quite a bit with it (-print0 is a nice hack around some limitations of plain text, for example).
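For comparison, here's a minimal sketch of the same "files ordered by size" job when the sizes stay numbers instead of getting flattened into text. It's in Python, not in either of the fancy shells mentioned above, and the function name files_by_size is made up for illustration. It also only looks at regular files, which is an assumption on my part; ls -l would list directories too.

```python
import os


def files_by_size(directory="."):
    """Return (size_in_bytes, name) pairs for regular files, smallest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            # Assumption: skip directories and symlinks, keep plain files only.
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                entries.append((size, entry.name))
    # Tuples sort by size first, then by name, so no text parsing is needed.
    return sorted(entries)
```

Calling files_by_size("/some/dir") gives back a list you can keep working with directly. Names with spaces in them come through intact, which is exactly where column-splitting a text listing gets shaky.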

And really, how much real programming do we do in the shell anyway, where the difference will matter? (I certainly use for loops pretty regularly.) Contextual tab-completion, for example, needs to know what context you're in, which implies syntactic parsing. Maybe we don't go farther only because we don't know what we're missing.


1 A notable exception is formal speech, where we often introduce useless or longer words/phrases. The effect is really noticeable in Japanese, where something like "come" (kuru, two short syllables) inflates up to five syllables with a long vowel and doubled consonant (irasshaimasu, yeah?) when you're talking about a superior. I assume this is because doing things that take more effort or that are uncomfortable is a way of showing respect/inferiority.
2 ls -lS is cheating; try ls -l | awk '{ print $5 " " $9 }' | sort -n instead. Only knowing a few extra characters like | buys me quite a bit, but in an ideal world all the quoting and the awk step wouldn't be necessary.
  • evan

two potential hacks, both ugly, one glorious

(1) To get the list of arguments a command understands (for use with, for example, tab completion in a shell), why not make an LD_PRELOAD library that wraps getopt (and maybe libpopt, or whatever people use these days)?
(Yes, really the applications ought to all be able to emit argument information in the same format, but I'm not holding my breath for that one.)

(2) For my networking class, I'm supposed to write the glue between students' code (basically implementing TCP in addition to lower-level stuff) and their to-be-assigned application code. The plan is (and was when I was in the class) to run the two pieces in different processes, where the application uses an API just like the Unix sockets API, so the students get a taste for what socket programming is like from both the application's and the implementation's perspective. Another nice benefit of this is that they can test their application code on top of real sockets, so it's easier to debug their application without wondering if the bug is in their TCP code.

It occurred to me that since these processes will communicate over a socket themselves, I could just as well let them write apps in C as I could in Ruby. But, going a step farther, wouldn't it be a rad hack if I could LD_PRELOAD my way underneath socket calls so they could run unmodified binaries (web browsers!) on top of their own network implementation? Unfortunately, the transfer protocol they implement only does one-way connections... I'm still thinking about that.
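Going back to hack (1) for a second: the useful thing a getopt wrapper would get to see is the option string the program passes in (something like "ab:c::"). Here's a rough sketch, in Python, of just the step that turns that string into completion candidates. The interposing itself would have to be done in C and isn't shown, and the names parse_optstring and completions are hypothetical. It assumes the usual getopt(3) conventions: one colon means a required argument, two colons (a GNU extension) mean an optional one, and a leading '+', '-' or ':' is a mode flag rather than an option.

```python
def parse_optstring(optstring):
    """Map each short flag in a getopt(3)-style option string to its argument kind."""
    i = 0
    # Skip GNU's leading '+' or '-' mode character, then the leading ':' that
    # changes error reporting; none of them names an option.
    if i < len(optstring) and optstring[i] in "+-":
        i += 1
    if i < len(optstring) and optstring[i] == ":":
        i += 1

    options = {}
    while i < len(optstring):
        flag = optstring[i]
        i += 1
        # Count up to two colons following the option character.
        colons = 0
        while i < len(optstring) and optstring[i] == ":" and colons < 2:
            colons += 1
            i += 1
        options["-" + flag] = ("none", "required", "optional")[colons]
    return options


def completions(optstring, prefix="-"):
    """Return the sorted flags from optstring that start with what was typed."""
    return sorted(f for f in parse_optstring(optstring) if f.startswith(prefix))
```

So completions("ab:c::") would offer -a, -b and -c, and the argument kind tells a shell whether to go on and complete a value after the flag. Long options (getopt_long, libpopt) would need their own handling.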