Google: PhDs with Tanks

The other day I was listening to the Stack Overflow Podcast, which features two interesting bloggers whom I’ve read for some time: Joel Spolsky of JoelOnSoftware and Jeff Atwood of Coding Horror. They tend to ramble a bit, and they’re not as funny as they think they are, but I enjoy it. On Episode 28 they speculated about how Google generates its “did you mean” suggestions for typos. Their speculations were incorrect, as I knew from another piece of audio I had listened to: Adam Bosworth speaking at the MySQL Developer’s Conference in 2005. That whole speech is wonderful. It’s funny and entertaining, and it offers food for thought about developing at web scale. The money quote that addresses what Messrs. Spolsky and Atwood were saying is this, which I lovingly transcribed for a comment on their blog:

“How many people here have built a system that takes a billion requests a day? Well, you could. And actually that’s the point of this conversation–what I want to talk about. It’s the same thing that’s made Google possible. I mean, think about what Google does: we take hundreds of millions of fairly hard queries a day; the queries tend to say things like ‘searching for camels in Tanzania’ and we sort of shake our head and try and figure out what that means, and we go over petabytes of content, not terabytes but petabytes of content. And we have a couple hundred milliseconds in which we’re allowed to search the entire petabytes and return back to you what we found in rank order. So not only are we trying to search really, really large amounts of data, we’re trying to search it extraordinarily quickly and we’re trying to do this hundreds of millions of times a day. And we do it. And we do it without a helluva lot of sweat. The way I think about Google is that it’s lots of PhDs driving tanks. It’s all about brute force. Everyone’s sort of General Patton–they don’t drive around the wall, they drive through the wall. It’s really dumb techniques, used in large scale: I mean, for example, the spellchecking. Every so often when you type a Google query it will say ‘did you mean,’ and it’s usually because you put in a typo. This is not because we have some incredible dictionary or some brilliant thesaurus that tells us what you meant. It’s because we’re tracking what people type _after_ they type the query that didn’t return anything, and it turned out that that was a very efficient way to figure out what you probably meant to type; in fact it works much better than any spellchecker. But notice the stupidity of the approach: ‘people who typed this usually wanted to do this’–works great.”
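To make the idea concrete, here is a toy sketch of “people who typed this usually wanted this.” It is only my guess at the shape of the technique, not Google’s code. The log format (session id, query, result count), the function name `build_suggestions`, and the `min_count` threshold are all assumptions of mine; Bosworth gives no such details.

```python
from collections import Counter, defaultdict


def build_suggestions(query_log, min_count=2):
    """Map each failed query to the query people most often typed next.

    query_log is an iterable of (session_id, query, result_count) tuples
    in time order. This format is hypothetical, chosen for illustration.
    """
    followups = defaultdict(Counter)
    last_failed = {}  # session_id -> most recent query that returned nothing

    for session_id, query, result_count in query_log:
        failed = last_failed.pop(session_id, None)
        # Count a follow-up only if it differs from the failed query
        # and actually returned something.
        if failed is not None and query != failed and result_count > 0:
            followups[failed][query] += 1
        if result_count == 0:
            last_failed[session_id] = query

    suggestions = {}
    for failed, counts in followups.items():
        best, seen = counts.most_common(1)[0]
        if seen >= min_count:  # ignore one-off corrections
            suggestions[failed] = best
    return suggestions


# Made-up log entries, purely for illustration.
example_log = [
    ("a", "camles in tanzania", 0),
    ("a", "camels in tanzania", 120),
    ("b", "camles in tanzania", 0),
    ("b", "camels in tanzania", 95),
    ("c", "camles in tanzania", 0),
    ("c", "tanzania safari", 40),
    ("d", "camels in tanzania", 110),
]

print(build_suggestions(example_log))
```

No dictionary and no thesaurus: just counting. The brute-force part is having enough query traffic that the counts mean something.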
