11:59 am, 13 Apr 08
google app engine limitations
It's weird to see people talk about Google App Engine online because I think many people focus on minor details. Like, to make apps scale horizontally you do need a "shared-nothing" infrastructure, so that's not really novel. The BigTable aspects are sorta interesting except there's nothing there (in terms of application design) that you couldn't have gotten out of the paper, and the App Engine API is so high-level it's not that close to BigTable. It's more like any other high-level flat-address object database, like maybe CouchDB. The Python thing is also pretty irrelevant; they just picked a language they have experience with (having Guido around helped), and it's easier to launch supporting one API than n languages' worth. A good engineer just solves the problem with the tools available, and Python is a pretty good tool to start with.
As for evil plans to steal ideas or code, that's between you and your skepticism. Big companies are surprisingly good at doing shitty things, and Google is definitely big, but it's also true that within Google people really try to do the right thing. I was touched to see a privacy-concerned friend of mine start using Gmail after he was hired, saying that only after he saw how seriously they take privacy inside the company could he feel confident about using it. But I can't tell you anything that will change your mind about this subject.
I developed an internal application using Google App Engine on and off over a period of months (during its development I kept trying it out) and then finally rewrote it a few weeks before launch (after the APIs had all settled).
Here are some real problems I've encountered:
1) All code runs only in response to HTTP fetches. So that means no cron jobs, and no persistent server-side processes. I know I just wrote above that you can't really have persistent jobs if you want to scale, but ultimately real apps do occasionally need these. For example, imagine a timed test app that needs a consistent view of time no matter which server (or datacenter!) the user hits. A time server becomes a single point of failure but when it's critical for your app it can be engineered around.
2) No long connections means no "comet" (server-push messaging).
My first thought on hearing about App Engine was to port lmnopuz but I can't.
3) Playing around with your data is hard. Since there's no way to perform operations on your data except by uploading code to the server, you're often left creating a new URL per operation you want to perform. Hacks like the shell help with this, but a lot of the time I want to be able to just run a local script and see the output. (For my project I found a decent workaround: make a URL that accepts Python code as a POST and runs it. Then your scripts just need to know to serialize themselves into strings and send them over the wire.) But see the next point.
4) Slow table scans. My app had ~1200 rows that it analyzes in various ways to produce graphs. I can appreciate that such a query is labor-intensive, so I had written it to cache the results of the graph generation (the rows only change once a day). But I can't even seed the cache once, because fetching 1200 rows is too slow to happen within a single query.
5) Bulk operations are hard. Say you want to delete all objects in a table (or class, I forget the App Engine term). The "delete" operation requires you to fetch the object first, and then you're back in slow-table-scan land. The best you can do is batch up your processing into multiple smaller stages, each of which writes its intermediate output into the data store: either make a page that auto-refreshes itself with JavaScript and leave a browser pointed at it, or make a command-line script that repeatedly hits a URL on your app.
6) No arbitrary queries. (If you haven't read the docs in detail, you wouldn't know this, but any query that involves multiple attributes [columns, if you're still thinking SQL] of an object must have an index exactly matching the query. They make index creation and maintenance trivial, and even automatic in most cases.)
Though everyone's repeatedly shoehorned SQL underneath object-relational mappers, App Engine (and others) demonstrate that you can provide an object storage API and gain performance by not using SQL underneath. I argue the real utility of SQL is that it lets you quickly (in terms of programmer time, not machine time) perform queries that you haven't done before and won't do again. Say I learn about a bug where I built all of March's data with the word "none" in place of where a column should really be null (None in Python terms) -- that's a line of SQL to fix but it's a world of pain with App Engine due to the bulk operations thing.
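To make the batching workaround from point 5 concrete, here's a rough sketch of the "none" cleanup done in small stages. It doesn't touch any App Engine API: a plain dict stands in for the datastore, and `fix_batch`, `run_all`, the `note` field and the batch size are all names I made up for illustration. Each call to `fix_batch` plays the role of one HTTP request to the app, and `run_all` plays the role of the command-line script that keeps hitting the URL.

```python
def fix_batch(store, cursor, batch_size=100):
    """Handle one "request": fix up to batch_size rows with keys after cursor.

    Returns the key to resume from, or None once the last batch is done.
    """
    # A real datastore would run a key-ordered query here; sorting the dict
    # keys is only a stand-in for that (assumption, not App Engine code).
    keys = sorted(k for k in store if cursor is None or k > cursor)[:batch_size]
    for key in keys:
        row = store[key]
        if row.get("note") == "none":  # the bad string written by the bug
            row["note"] = None         # what should have been stored
    if len(keys) < batch_size:
        return None                    # short batch: nothing left to scan
    return keys[-1]


def run_all(store, batch_size=100):
    """Driver loop: call fix_batch until it reports completion.

    Returns the number of "requests" it took.
    """
    cursor, requests = None, 0
    while True:
        cursor = fix_batch(store, cursor, batch_size)
        requests += 1
        if cursor is None:
            return requests


if __name__ == "__main__":
    rows = {i: {"note": "none" if i % 3 == 0 else "ok"} for i in range(250)}
    print(run_all(rows, batch_size=100), "requests")
```

The point of passing the cursor back and forth is that no single request ever has to see the whole table, which is what makes this tolerable under a per-request time limit. It's still a lot more ceremony than one UPDATE statement.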
With all that said, it's still pretty good. When I was looking to switch projects about a year ago, it came down to basically three projects and App Engine was one of them, because the guys who work on it are some of the best hackers I know at the company. All of the above bullet points (and minor stuff like the languages thing) aren't fundamental limitations of the design, they're temporary flaws that can be solved by good engineering and are surely being prioritized by the team. I'm pretty confident it'll improve rapidly.
It can't fetch 1200 rows at once? That seems like a low number. Were the rows full of huge data?
"For my project I found a decent workaround: make a URL that accepts Python code as a POST and runs it"
That's the best security hole I've heard about all week ;-)
And regarding the security hole: yep. :\
But it's an internal app, so if someone destroys it there's something more seriously wrong than my security model. You could imagine it requiring an "admin" cookie of some sort.
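For the curious, a minimal sketch of what such a run-this-code URL could look like with a token check in front of it. This is not the code from my app: it's a plain WSGI callable in modern Python rather than the App Engine runtime, and `run_code_app`, `post_code`, `ADMIN_TOKEN` and the `X-Admin-Token` header (used here instead of a cookie) are all made up for the example. It is still a remote code execution endpoint by design, so it only makes sense for something internal.

```python
import hmac
import io
from contextlib import redirect_stdout

ADMIN_TOKEN = "change-me"  # hypothetical shared secret, not a real credential


def run_code_app(environ, start_response):
    """WSGI app: exec the POSTed Python source and return what it printed."""
    token = environ.get("HTTP_X_ADMIN_TOKEN", "")
    authorized = hmac.compare_digest(
        token.encode("latin-1"), ADMIN_TOKEN.encode("latin-1"))
    if environ.get("REQUEST_METHOD") != "POST" or not authorized:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"forbidden\n"]

    length = int(environ.get("CONTENT_LENGTH") or 0)
    source = environ["wsgi.input"].read(length).decode("utf-8")

    output = io.StringIO()
    try:
        # Capture print() output so it can be sent back as the response body.
        with redirect_stdout(output):
            exec(source, {"__name__": "__remote__"})
        status = "200 OK"
    except Exception as exc:
        output.write("error: %r\n" % (exc,))
        status = "500 Internal Server Error"

    start_response(status, [("Content-Type", "text/plain; charset=utf-8")])
    return [output.getvalue().encode("utf-8")]


def post_code(app, source, token):
    """Call the WSGI app in-process, the way a local script would POST to it."""
    body = source.encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
        "HTTP_X_ADMIN_TOKEN": token,
    }
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    chunks = app(environ, start_response)
    return seen["status"], b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    print(post_code(run_code_app, "print(6 * 7)", ADMIN_TOKEN))
```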
Can you elaborate on why server-side jobs categorically don't scale?
*furrows brow*
You know, i'm not an expert in parallelization, but i like to consider myself at least competent. But i really cannot think of a way of approaching this except for modulo-n sorts of solutions. At some point, you have to accept a single centralized point, perhaps one that's totally implicit, or one that's master-master replicated, but it's inevitable. And this single point trickles outwards, into questions like the one you pose. Or am i missing something fundamental about partitioning in the large?
Ning
It still reminds me of Ning.... I can't see the ability to build a system of any non-hack scale in that kind of sandbox. That's probably just down to the kind of systems I tend to build, though. ;)
Re: Ning
"The Python thing is also pretty irrelevant"
I disagree. It shows that "scripting" languages can totally be used for such stuff, which will be the beginning of a huge paradigm shift.
I think you underestimate the relevance of this move in the long run. :)
Re: Ning
People don't seriously think of Python as a "scripting" language, still? That would be very enterprisey...
Re: Ning
Like I was trying to say in the post: people who code for a living and know a bit about what they're talking about (like say, you or me) are fine using Perl or Python or whatever for "real" applications. This whole "paradigm shift" the anon comment suggests happened a decade ago (certainly LJ dates back that far).
Re: Ning
I think problems with scripting languages generally manifest themselves as issues with sloppy programming/design more often than perf problems.
I've learned to write excessive amounts of paranoia into my scripting language code so that I don't get stuck supporting backward compatibility with things that shouldn't have worked in the first place. This paranoia probably hurts performance at least some of the time. I'm starting to prefer languages that let me write down the API contract more completely[1] to avoid misunderstandings later. It's not a perf thing at all.
[1] (Though not to the extreme; I'm not writing Eiffel here.)
The problem with 1200 rows is a bit scary.
Practical data Access
I was very impressed with the App Engine APIs - clearly brilliant people, and as importantly, EXPERIENCED people have thought hard about it.
The two things that really shut my enthusiasm down were these:
1. "Inequality Filters Are Allowed On One Property Only. A query may only use inequality filters (<, <=, >=, and >) on one property across all of its filters."
AND
2. "No Not-Equal Filter"
I am really struggling to imagine an application (I mean a useful one) that does not need to express a data query of the form:
select * where date_updated < yesterday() and account_balance > 0
These days we consider a million row table in a SQL database a triviality -
yet how would we ever handle it here?
Re: Practical data Access
They mention in the doc why these sorts of queries aren't supported.
Here's some undereducated guessing on my part. If you consider what an SQL database must do to answer such a query, it also won't be able to use indexes on both columns simultaneously. (I'm guessing here -- it seems that a database must use one index and then merge its results against those from the other.) In theory you can do the same sort of merging operation in your code:
yesterday_accounts = set(Account.all().filter("date_updated <", yesterday).fetch())
nonzero_accounts = set(Account.all().filter("account_balance >", 0).fetch())
return yesterday_accounts.intersection(nonzero_accounts)
What you lose there versus SQL is that you're fetching all these unused accounts from the datastore (which is a bandwidth thing) and that you're not doing the merge inline (though perhaps the iterator version of fetch returns them in some well-specified order and would do that). It fails if you have more than a thousand accounts in either category, but in theory the SQL query also slows down as you add more rows.
I guess their suggestion would be to denormalize your data more -- if that query matters, make a table of accounts with nonzero balance and put a date index on it. I haven't yet decided how painful that is for real applications (modulo what I've written in the post.)
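Here's roughly what I mean by denormalizing, as a toy sketch in plain Python rather than anything App Engine-specific. The class and method names (`NonzeroAccountIndex`, `update`, `updated_before`) are invented for the example, and a sorted in-memory list stands in for the "table of accounts with nonzero balance" plus its date index. The balance condition is enforced at write time, so the read only needs a single inequality on the date.

```python
import bisect
from datetime import date


class NonzeroAccountIndex:
    """Tracks only accounts with balance > 0, kept sorted by date_updated."""

    def __init__(self):
        self._entries = []  # sorted list of (date_updated, account_id)
        self._dates = {}    # account_id -> date_updated currently indexed

    def update(self, account_id, balance, date_updated):
        """Call on every account write to keep the side table in sync."""
        old = self._dates.pop(account_id, None)
        if old is not None:
            # Drop the stale entry; (date, id) pairs are unique.
            i = bisect.bisect_left(self._entries, (old, account_id))
            del self._entries[i]
        if balance > 0:
            bisect.insort(self._entries, (date_updated, account_id))
            self._dates[account_id] = date_updated

    def updated_before(self, cutoff):
        """Single-inequality query: nonzero accounts with date_updated < cutoff."""
        # (cutoff,) sorts before every (cutoff, id), so this finds the first
        # entry whose date is >= cutoff.
        end = bisect.bisect_left(self._entries, (cutoff,))
        return [account_id for _, account_id in self._entries[:end]]


if __name__ == "__main__":
    index = NonzeroAccountIndex()
    index.update("a", 10, date(2008, 4, 10))
    index.update("b", 0, date(2008, 4, 10))
    index.update("c", 5, date(2008, 4, 13))
    print(index.updated_before(date(2008, 4, 12)))
```

The cost is that every write to an account now has to remember to update the side table too, which is the part I suspect gets painful in a real application.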
Thanks for the post, a question about table scan
I don't know about others, but returning 1200 rows of data seems too trivial to be a problem of any sort if, say, we were using a relational database. Even if I had to get this off a file, it's trivial and should be relatively fast. I just wonder whether this is a bigger problem than it appears to be.
This post is indeed different
Thanks for this post. I have been using appengine for quite a while now. Too bad I didn't have this list when I was starting. By now I can sympathize with you and feel the pain of every single point.
As with most tools, appengine is no silver bullet and is only good for some things. I think it's great to have when you need an app running fast. I once duly impressed a friend by coding a remote UI for his IP robot in 2 hours and having it running for everyone on the web. But I would never actually consider it for building a feature-rich app, especially one with complex persistent data structures.
Thanks for this honest sharing of appengine's real problems. People should write posts like this far more often.