ArtLung: I am Joe Crawford. Welcome to my website circa July 2008. I am a web developer. I live in Moorpark, California, USA. I work in Los Angeles, and have lived elsewhere and done many surprising things. I put up the first ArtLung website 12 years ago, moved it to artlung.com 10 years ago, and I've been blogging for 7 years. I'm still learning, every day. Welcome.

November 11th, 2008

Google: PHDs with Tanks

The other day I was listening to the Stack Overflow Podcast, which features two interesting bloggers whom I’ve read for some time: Joel Spolsky of JoelOnSoftware, and Jeff Atwood of Coding Horror. They tend to ramble a bit, and they’re not as funny as they think they are, but I enjoy it. On Episode 28 they speculated about how Google generates its suggestions for typos. Their speculations were incorrect, and I knew it because I remembered another piece of audio I had listened to: Adam Bosworth speaking at the MySQL Developer’s Conference in 2005. That whole speech is wonderful. It’s funny, entertaining, and full of food for thought about developing at web scale. The money quote that addresses what Messrs. Spolsky and Atwood were saying is this, which I lovingly transcribed for a comment on their blog:

“How many people here have built a system that takes a billion requests a day? Well you could. And actually that’s the point of this conversation–what I want to talk about. It’s the same thing that’s made Google possible. I mean, think about what Google does: we take hundreds of millions of fairly hard queries a day; the queries tend to say things like ‘searching for camels in Tanzania’ and we sort of shake our head and try and figure out what that means and we go over petabytes of content, not terabytes but petabytes of content. And we have a couple hundred milliseconds in which we’re allowed to search the entire petabytes and return back to you what we found in rank order. So not only are we trying to search really, really large amounts of data, we’re trying to search it extraordinarily quickly and we’re trying to do this hundreds of millions of times a day. And we do it. And we do it without a helluva lot of sweat. The way I think about Google is that it’s lots of PHDs driving tanks. It’s all about brute force. Everyone’s sort of General Patton–they don’t drive around the wall, they drive through the wall. It’s really dumb techniques, used in large scale: I mean for example, the spellchecking. Every so often when you type a Google query and it will say ‘did you mean,’ and it’s usually because you put in a typo. This is not because we have some incredible dictionary or some brilliant thesaurus that tells us what you meant. It’s because we’re tracking what people type _after_ they type the query that didn’t return anything–and it turned out that that was a very efficient way to figure out what you probably meant to type, in fact it works much better than any spellchecker. But notice the stupidity of the approach: ‘people who typed this usually wanted to do this’–works great.”
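
To make the idea concrete, here is a minimal Python sketch of the "people who typed this usually wanted to do this" approach. It only illustrates what Bosworth describes and is certainly not Google's actual code: the DidYouMean class, the session format of (query, result count) pairs, and the min_votes threshold are all assumptions I made up for the example.

```python
from collections import Counter, defaultdict


class DidYouMean:
    """Toy model of 'people who typed this usually wanted this' (hypothetical)."""

    def __init__(self):
        # failed query -> counts of the queries users typed right afterwards
        self.followups = defaultdict(Counter)

    def record_session(self, queries_with_hits):
        """Learn from one user's ordered list of (query, result_count) pairs."""
        pairs = zip(queries_with_hits, queries_with_hits[1:])
        for (query, hits), (next_query, next_hits) in pairs:
            # Only count a query that found nothing, followed by one that found something.
            if hits == 0 and next_hits > 0:
                self.followups[query][next_query] += 1

    def suggest(self, query, min_votes=2):
        """Return the most common follow-up query, or None if evidence is thin."""
        counts = self.followups.get(query)
        if not counts:
            return None
        best, votes = counts.most_common(1)[0]
        return best if votes >= min_votes else None


# Made-up sessions: three users mistype the same query, then retry.
dym = DidYouMean()
dym.record_session([("camles in tanzania", 0), ("camels in tanzania", 120)])
dym.record_session([("camles in tanzania", 0), ("camels in tanzania", 95)])
dym.record_session([("camles in tanzania", 0), ("camel rides", 40)])

print(dym.suggest("camles in tanzania"))  # camels in tanzania
print(dym.suggest("never seen before"))   # None
```

Notice there is no dictionary and no edit-distance math anywhere in it. The only "intelligence" is counting what people did next, which is exactly the brute-force stupidity the quote is celebrating; it just needs an enormous number of sessions to work well.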
