I found his talk at the 2005 MySQL Users Conference inspirational. It’s one of a few talks I return to every so often because it’s so packed full of ideas.
I have done some transcription on this file. It might be useful to you. First, here’s the introduction from the original IT Conversations site:
Building a system that is capable of handling one billion transactions a day is easier than it sounds. That is Adam Bosworth’s view and he should know because he works for a company that has managed to achieve this level of scale on a simple architecture based on commodity hardware and simple brute force algorithms. Adam covers a lot of ground in this presentation that focuses on the success of the web, the scalability of simplicity and the emergence of the information server.
It’s not always about finding a simple solution to a complex problem, occasionally it’s about simplifying the problem. Whereas complex solutions are brittle and break, simple solutions just tend to work. HTML, HTTP, RSS and ATOM all fall in this category; simple solutions that have been widely adopted and work well. Adam believes it is time for database vendors to reflect on how they can provide an open, simple data model to easily server up information similar to the way a web server delivers content to the browser. Delivering an information server that is capable of federating information across the web, intelligently caching and scaling linearly is the next big database challenge.
And now the transcript:
Doug Kay: Up next from O’Reilly Media’s MySQL Users Conference. It’s a presentation by Adam Bosworth entitled “Database Requirements in the Age of Scalable Services” from I.T. Conversations.
Doug Kay: Hello, this is Doug Kay, the executive producer of I.T. Conversations. And today I’m excited to bring you another presentation from O’Reilly Media’s MySQL Users Conference held April 18th through 21st 2005 in Santa Clara, California. I.T. Conversations audio content is delivered by Limelight Networks the content delivery network for online broadcasts, music, movies and software downloads — taking the cost and complexity out of internet distribution on the web at www.limelightnetworks.com. And now, here’s our presentation from the MySQL users conference.
Introducer: We do have a lot of companies that we work with at MySQL. Some are famous, some are infamous, some we decline their existence. This company’s name is Google. Some of you guys are using it right now and they have a ton of good people over there. They have a ton of great software and doing good things. And of course also in that crowd is a gentleman named Adam Bosworth.
Adam Bosworth: I was amused to hear about people using Microsoft Project. I always viewed it when I was at Microsoft as this secret weapon that was used to stop everyone else from being able to compete. (Crowd laughs) Not not that Microsoft would do that, of course. The other thing I was thinking as I was listening that incredibly high energy talk that preceded me was “why do these conferences start so early?”
Adam Bosworth: I don’t know about you but you know I’ve been a developer all my life and as far as I can tell I don’t think I’ve ever woken up on a normal day before about 8:30. So this business of having conferences and start 8 o’clock as a theory is completely bemusing to me and I normally get up I have my coffee, I have smoked salmon, I read the paper. I sort of think about life, I stare blankly into space while I assemble myself and finally around 9:30 or 10 I drag myself into the office.
Adam Bosworth: So when I’m asked to give a talk at this hour of the morning I’m feeling slightly fragile to begin with. But all developer conferences do this. It’s sort of some sort of giant masochism. The second thing I noticed was that when this talk began to light they brought up the lights and didn’t turn off the video reminded me of an old movie. In which, a jockey is found dead in one of the Thin Man movies. And when the jockeys found that it said he’s fixing races and Myrna Loy says “Boy, they’re tough in this town.” And I figure if they’re gonna turn me off if I’m late so I better hurry up because I’ve got more slides than time.
Adam Bosworth: What I want to talk about is something that I’ve actually had on my mind for about 10 years. I’ve been at Google for about 10 months so clearly this precedes the Google talk. Though I will admit that I went to Google in part to help realize when I’m about to talk about.
Adam Bosworth: The talk is something that I noticed and got excited about about 10 years ago when I finally started paying attention to the web that that sort of 2 years after everybody else did. In 1995 I woke up from a sort of Microsoft-induced narcoleptic coma and realized there was this thing called the Web and got really excited about it. Because all of a sudden there was a thing where anyone could get to any data or anyone at any time. And this seemed pretty cool. And I had been building a whole bunch of stuff at Microsoft at the time including databases: Access and those infamous IDEO things and various other pieces and building a lot of UI and all of it suddenly seemed profoundly irrelevant to me. I don’t want to get into an argument with MONO but I suddenly decided that the idea that I could have zero install applications was astonishingly useful and the idea that the applications could be accessed from anywhere at any time without me worrying about what PC I had was astonishingly useful. I sort of view the PC very much the way I view my phone. It’s a useful cache of the latest data that I’ve got and a useful access point to what I care about but it’s not actually the place I store stuff.
Adam Bosworth: So what I want to talk about is what this means for everyone in this crowd. But in order to do that I want to talk about –what– how did the Web happen. People take it for granted. You just assume that there’s always been the web. You know “In the beginning there was the Web and the web was good” and and so on but that actually isn’t true. When Tim Berners-Lee came out with HTTP and HTML it turned out that he sort of had a perfect storm of efficiency.
Adam Bosworth: He made something really simple. And because it was really simple virtually any what I call “P programmer” by which I mean Perl, Python and PHP in those days PHP wasn’t quite around could build stuff that actually generated web content. That was a heck of a lot easier than for example building on those very cool looking MONO apps that we just saw. Building a really simple form that came up and said “Here’s a list of the tasks that you have do today” with something that virtually anyone who could who could think could put up on the screen. And then a, we came along and built something called IE4 and had this cute idea which is, sort of like this delayed seven year time bomb which has now ticked off –that the client could actually help, that you could move data up to the client and then you could use intelligent code. How many people here have seen Google Maps? So Google Maps is kind of fun.
Adam Bosworth: How much fun I don’t think even Google realized until we put the satellite stuff in the other day and we very nearly shut down for bandwidth. And we have a lot of bandwidth. It appeared that virtually every human being on the face of the planet decided it would be cool to go to the Maps, put on the satellite view and then start dragging around and see what they could say. You may be surprised how many pictures actually get served up when you do that–it’s more than we expected. But. So simple is huge because it meant that everybody could play.
Adam Bosworth: The other issue was sloppy — HTML is really sloppy. In fact when we were building IE4 we had this very simple mantra. It doesn’t matter what you give us: we will render it. And every so often we sort of choke and we would stare at something we go “no no we will render this”– if table rows overlap table rows: we’d render if
Adam Bosworth: They didn’t have to be sort of high priests of syntax. You’ve never gotten an error that saus “I’m sorry we can’t render your page the HTML is invalid.” Now you do actually occasionally get that on phones with XHTML. And it’s not what we call a good user experience. But the other issue is again because it was a very sloppy standard a lot of people who otherwise maybe couldn’t have authored content could author content. That was a very empowering thing.
Adam Bosworth: Third issue was that it was a standard that meant it ran everywhere and it particularly ran everywhere because it was easy and so everybody who had Linux and didn’t like Microsoft could put it on Linux and everybody had Unix put it on Unix and everyone had Mac put it on Mac and everyone and so on and so forth. Making things that are easy and sloppy causes them to make to run everywhere because there’s a lot of smart people can just go and build them everywhere–and they did.
Adam Bosworth: But the fourth issue which is actually the subject of today’s talk more than the first three is scale. I was making a joke just now about bandwidth though not not a complete joke it actually did use up a third of our bandwidth. But, the scale that’s possible on the Web today is just astonishing. The number of hits that can hit a single interesting thing when it gets slashdotted is overwhelming. And the amount of data that gets served up is incredible and the numbers that they represent dwarf what we in the database world used to think of as really big numbers.
Adam Bosworth: And there are four reasons this was possible.
Adam Bosworth: Number one: partitioning there’s this thing called DNS. I don’t go to one place on the web. There is no centralized web. Now I bring that up because databases aren’t so good at this. One of the things that the web does is it distributes the information across God knows how many machines. We alone have –“Oh no I’m not allowed to say that”–we have quite a lot of machines. But we have a small fraction of the world’s machines. Bigger than you think but nonetheless a small fraction. The Web works because it served up from everywhere and partitioning DNS is what causes that to be possible.
Adam Bosworth: The second thing is caching. We were looking the other day: what were the hot hits just on our blogs and a couple of them became very hot briefly and suddenly there were one 120,000 hits and one second on one blog now 120,001 seconds actually a lot, it turns out. And the only reason it worked was because there were about 18 zillion places caching it all the way across the web. There were proxy servers caching frontends caching, our own frontends were caching and if they’d all gone back to the same database and said fetch this thing one hundred twenty thousand seconds effect times a second a spindle would have melted. And you know that isn’t what happened.
Adam Bosworth: And then there’s “stateless”– and I’m not going to belabor that because I think everyone here understands why that’s important but there’s one thing that I think was actually missed on the web and it’s generally pointed at as as a defect by the rich-client crowd. Coarse grained interactions.
Adam Bosworth: The Web in fact is usually dismissively says “oh god we’re going back to that (IBM) 3270 (terminal) standard.” The browser based model was one where when you talk from the client to the server you talk in terms of a chunk of data. It’s not that every time you type the key or drag the mouse or click on something you caused a reaction to go to the server. You go to the server when you’re good and ready. So instead of having maybe 1500 or a four or even 5000 interactions per second with a server you have one or two.
Adam Bosworth: That’s a huge difference. Servers it turns out are just fine at dealing with fairly large numbers of things per second so if they wanted to support a thousand users and this is users are doing say one thing every five seconds the servers can do that. But if users are doing 5000 things a second and they’re trying to support a few hundred users. It doesn’t work nearly as well. They they basically break. So having course grained interactions really matter. Now this fairly simple recipe had an incredible effect.
Adam Bosworth: We really do have information at our fingertips and it’s not just us. It’s not just Google. You know whenever I want to look up a book I automatically go to Amazon and look up the books or I go elsewhere if I’m looking for what’s for sale I automatically go to eBay. Virtually everybody I know is now using some sort of an online CRM so when they want to look up who are their prospects of their customers or what do they know about their customers. they go to Salesforce. The information is coming on very fast. And one of the reasons that. This is so profoundly effective is not just that the software was there. The hardware was there.
Adam Bosworth: I did a back of the envelope calculation the other day about what a million dollars buys you. At Google we buy fairly cheap machines. They’re about 2 k per machine at most. Actually a little bit less. And the machines tend to be fairly big. They have four gigabytes and they have a bunch of disk, and so I sort of rounded up and said OK what are we actually getting for a million dollars? And it turned out we were getting roughly two terabytes of in-memory data. Well two terabytes of in-memory data is a lot of data.
Adam Bosworth: Two terabytes covers most things. And you’d be surprised how even the 30 gigs photos that came up as this would cover a lot of photos so terabytes are big. You’re getting a petabytes of on-disk data and you’re getting about 50,000 requests per second. For about a billion requests a day. A million dollars is incredibly cost effective today. And yet if you look at the database world and you say oh I spend a million dollars of hardware and I get a database and I put the database on it will I then have two terabytes of in-memory data and so on and so forth and will thing just scale so that I can automatically support a billion requests today? And the answer is no. How many people here have built a system takes a billion requests a day? Well you could: and actually that’s the point of this conversation. What I want to talk about.
Adam Bosworth: It’s the same thing it’s made Google possible. I mean think about what Google does. We get hundreds of millions of fairly hard queries a day. The queries tend to say things like “I’m searching for camels in Tanzania” or Tanzania whatever. And you know we sort of shake our head and try and figure out what that means and then we go over petabytes of content not terabytes but petabytes of content.
Adam Bosworth: And we have a couple of hundred milliseconds in which we’re allowed to take to search the entire petabytes and return back to what we found in rank order. So not only are we trying to search really really large amounts of data we’re trying to search it extraordinarily quickly and we’re trying to do this hundreds of millions of times a day. And we do it. And we do it without a hell of a lot of sweat the way I think about Google as it’s like lots of PHDs driving tanks. It’s all about brute force.
Adam Bosworth: Everyone’s sort of General Patton. They don’t drive around the wall they drive through the wall. It’s really dumb techniques used in large scale. For example the spell checking: every so often when you’re typing a Google query and it will say “Did you–did you mean?” and it’s usually because you put in a typo. This is not because we have some incredible dictionary or some brilliant thesaurus that tells us what you meant. It’s because we’ve been tracking what people typed after they typed a query that didn’t return anything. And it turned out that that was a very efficient way to figure out what you probably meant to type. In fact it works much better than any spell checker. But notice the stupidity of the approach–“oh, people who typed this usually wanted to do this.” Works great. And. So.
Adam Bosworth: It turns out that the hardware is there and the strategy for software is there to deliver this kind of scale. When you do deliver this kind of scale, when you can search really large amounts of data with low latency it changes how you think about data. I mean photos are a good example but virtually anything at once you can just search. This whole business of organizing things into folders starts to diminish in importance. The problem with folders is you can’t remember which folder you put them in. I was talking to some guy at Google the other day who can only be described as being anal compulsive to the nth degree. He has a folder called “school” and under inside that folder as a folder called “college” and a folder called “high school” and inside of the college one he has –I’m not kidding now–he has a folder called freshman, sophomore, junior and senior years. This is a true story with all the stuff that he actually sent or received during. I can’t even imagine doing that. I mean the organizational skills are impressive but of course it means that if he’s searching in any of the myriad of other ways he might care about searching for that content he’s not going to find it. And he won’t even remember you know–who remembers what you did sophomore year versus freshman year?
Adam Bosworth: So folders are not a very efficient way to find things. Searches are. So once you have searches this can be an extraordinarily fast and effective model. So the vision for you to think about is–“can we take what you’ve already been using?” Can we take a database? Can we do the same thing for the Web that was done for content? Can we serve up a web of data? Can we take all the information on the Web–all scientific information, ecological information, Every task, bug, tracking system every optimistic Microsoft Project schedule that said “here are the tasks” and the dependency graphs.
Adam Bosworth: Everything that you’d want to know–my kid my daughter’s fencing schedule and PTA schedule and you know the project I’m working with some people on in bioinformatics and we take all of that data and make it easily findable by anyone who might say “well we’ve already done that”–that was called the Web. But we haven’t done it in a way where you can actually search for information and get back information when you get back as content. And trying to infer information from content is trying to look at people. Are you interested in trying look at people hijabs and figure out whether they really were cute or not. You know, it’s not necessarily an easy way to figure out the information.
Adam Bosworth: In order to do this. We need a model that sort of does for information what HTML/HTTP did for the user interface. We need something that makes it easy to scale massively and linearly across the entire web.
Adam Bosworth: In other words, if you guys go out and start building some information in MySQL and it turns out that more people need more of it you should start buying 10 or 20 or 50 or 100 machines if the market is there and the things you’re automatically and effortlessly scale x100 and then if it turns out that you need one in Europe because there’s a network latency you stick another 100 there and it keeps scaling, you want in Asia it’s take another hundred there and so on and so forth. And then after a while you start federating because you don’t actually have a special expertise so you have some other data that you’re looking at and it still scales. That’s the vision.
Adam Bosworth: That’s the one I’ve been working on for about 10 years. I wrote a memo that’s actually available on the web somewhere about it and I wrote a memo once at Microsoft called “Sea Urchins” — I said think of every site as a sea urchin with lots of prickly spines of useful information coming out of it. Imagine that as he’s finding these urchins you could ask for the intelligent information. Not that you can go to their Web site and say what does it look like? Let’s say what is a task. What is the schedule? What do I know about these activities? when are they happening? where are they happening? in ways so that you could then do- – you could see them in the context of a calendar or you could see them in the context of a map. Or you could find the things that were available for you if you flew into Dallas for a day assuming there are things available to you if you — I shouldn’t say that — I’m flying to Dallas tomorrow and spending the night there some sort of wondering what will I do? So all of these ways for finding information is the thing that I’ve been looking at doing.
Adam Bosworth: Now originally I had this bright very naive idea we would do this using XML. We’d build this text grammar it was simple sort of like HTML we’d make it as simple as flexible as possible. We sort of do it over HTTP. And that would solve our problem. The problem turned out that what we did is we created a Tower of Babel. The big difference between HTML and XML is that there is a single HTML grammar. Any browser talking to any website talks one grammar: HTML. So the websites just have to support one grammar: HTML And that works really well. However if every Web site starts serving up arbitrary XML the browsers don’t really know what to do. They actually can browse the X amount but that’s kind of low calorie.
Adam Bosworth: That’s not a very useful way to deal with the information. And the things that are getting the information while the thing is self-parsing. Don’t usually know how to think about it but they don’t understand it and the schema are so complicated that the average PHP developer reading the XML Schema spec turns and runs out the door and has a sort of stiff drink.
Adam Bosworth: So this vision didn’t occur.
Adam Bosworth: It didn’t occur for a variety of reasons. It didn’t occur because the XML query which was part of the vision took way too long. We kicked off a working group to say “let’s make it easy to walk up to these sea urchins and ask questions of the data” like “how many tasks are going to occur next week?” or “where am I going to be next month?” only to find out that it took them over four years to produce a schema / spec called XML Query. Anything that takes four years is probably not worth doing.
Adam Bosworth: Sort of a general Maxim it’s better to do things in six months and then start learning from customers. If it takes four years you’ve overdesigned the thing.
Adam Bosworth: The second thing is it didn’t work like the web. The Web is all about starting small and simple and flexible and easy. But the schema and the query languages were huge and complex and very inflexible. So people said boy this is hard. Why would I do something that’s hard?
Adam Bosworth: And then they would to actually support this let’s imagine I walked up to you and I said “support a Web site” and “supporting your websites shipping up a set of bugs” and “support this general purpose XML query language” on top of the bugs you’d look at me like I was crazy. Or maybe you’d turn yo MySQL guys. I was warned not to say “My Sequel.” I’ve had several meetings with MySQL because I’m a technical advisor and most of the meetings involve large amounts of Russian vodka and something called Salmiakki which has to be tasted to be experienced but it’s basically licorice mixed with Finnish vodka while Finns grin at you inquiringly while you drink it to make sure that you’re real a real man. And.
Adam Bosworth: So most of it’s been pretty informal. But one thing that we have never agreed that MySQL should do, now I’ve been briefed is support XML Query. That would just be silly. It’s a big really complex language. Then there was this thing called Web Services and Web Services started out as a really simple idea. I’ll send an XML message to you and you sent and you send me one back. And that seemed like a good idea to me at the time. And then I got the WS Addressing spec and the WS Policy spec and the WS Transactions spec and the WS Reliable Messaging spec and then after and I’ve tried to read them. And after a while I also ran out of the room and had a stiff drink.
Adam Bosworth: So that didn’t seem like the good simple small flexible way to start. Why did this happen? It happened because the people who were working on these specs are big institutionalized companies trying to protect themselves. The people who really control these specs are people like IBM and Microsoft. And they were being deliberately frankly my opinion making something hard– because if it was hard only they could deliver technology to do it. What I think we want to do is just the reverse. So what I want to talk about now is what we should do. Let’s see. So in order to talk about I wanted to just talk briefly about MySQL.
Adam Bosworth: I sat there and listened to all the tutorials on Monday and what I heard you guys doing was you were basically pushing MySQL to be Oracle. You were pushing my sequel that stored procedures because real men have stored procedures and you were pushing them to have views because real men have views and your pushing to have triggers because needless to say real men have triggers so they can pull the trigger on the gun. All of this is basically about centralizing the processing logic in the database.
Adam Bosworth: To be blunt centralizing processing logic in one computer even as one that was usually a really bad idea. It doesn’t scale. What has caused the web to work was not centralizing logic. What caused the web to work was you didn’t you took, you basically wrote some piece of front end hardware that none of us really understood but we trusted Cisco or Level3 or whatever to give us. And then so that routed it to as many machines as we could find and if things got slow we added more machines. In short we decentralized logic.
Adam Bosworth: So centralizing logic, while I understand that you all want it. And for some things that’s going to be very useful–is by and large not going to give you scale. In fact it’s going to give you what we can only call anti-scale.
Adam Bosworth: And I said suppose you think about the other way around?
Adam Bosworth: Don’t think about what to do because you want to be Oracle. Think about what to do because Oracle itself isn’t nearly big enough because Oracle itself can’t deliver on the kinds of queries I was talking about. Oracle itself can’t deliver on billions of questions a day over very large amounts of data. So think about the question somewhat differently. Imagine an open model. For what I’m about to describe not for open source. But an open model for how you get at data. So what does “open” mean? Open means two things to most people. If you go to the average customer and say “Why do you like open source” apart from what I’ll call the zealots there are typically two arguments. Number one I know what I’m getting.
Adam Bosworth: It’s why do I know what I’m getting because I can read the source. Now I have to say I’ve counted all the people who actually use open source who’ve read the open source and I’m not going to report that percentage but it’s smaller than you think. But nonetheless they like the idea that they could read the source and some of them actually do read the source. Once in a while you can get someone to check something in our drives to the source but that’s even more rare. But the idea is that at least it’s open and the other thing is that if it doesn’t work they can usually hire someone to go fix it.
Adam Bosworth: After all it’s their source. Anyone who’s ever dealt with a large institutional enterprise company where you needed a fix understands the value of that. Instead of waiting for them to decide it’s two or three years later and they should get a release out that does the fix. You can fix it in a day or two. So that’s actually a good thing. So transparency is clearly a reason. The other issue is lock-in.
Adam Bosworth: Lock-in says most people conflate open source with standards. They basically are trying to avoid lock-in. They’re trying to make sure that if they didn’t like the provider they can go get another provider. Obviously they ultimately the provider can be themselves because it’s open source that people are trying to avoid lock-in. So what’s not open today in the database world is how you talk to the database. The actual wire protocol. In other words if I go and walk up to all these places that have information all over the web I don’t have the equivalent of HTML.
Adam Bosworth: I don’t have a standard simple wire protocol that I can use so that it doesn’t matter whether I’m talking to MySQL or Oracle or whatever I can just write this client that’s talking this really simple protocol. And if I were open to say it’s currently not a standard.
Adam Bosworth: This has really been a major blocker what MySQL has done is they’ve created many different storage engines. They’re using an API can go and talk to their query engine. But it isn’t something where anybody who has any information that’s available and just wanted to let MySQL query it could just serve it up and instantaneously and effortlessly MySQL could do that because it would be an open standard. In other words MySQL’s query engine is not to anyone who has data as the browser is to a website.
Adam Bosworth: Meaning that the average PHP programmer can go build a website and serve up pages that the browser can then consume. But the average simple you know like my son is sitting wiring PHP because he needs to get an app up and running is not going to go and write a storage engine that he can feed into MySQL that he could just use and that they could find and discover. Furthermore over the web and it could be massively decentralized.
Adam Bosworth: Now. I think this is a very 20th century way to think about things. I think that the lesson of the last 10 years is that you open up your formats and your data directly through wire formats that are easy and simple and flexible. And when you do that. You get a huge, huge increase in what’s going to happen.
Adam Bosworth: I think that if we do this, if we make it as easy as I just said to serve up information where the query engine MySQL or anyone else who writes it or some other things that I’ll talk about can serve up any kind of information whether its products are used motorcycles or the books that I have in my garage or some auction or you know the things I think about restaurants in New York City that my or that a group of us think about the restaurants in New York City or anything else you think of it that information is effortlessly available.
Adam Bosworth: That’s going to change how you think about what you do because you’re no longer just going to be querying your tables. You’re going to have the query any data that’s available anywhere on the web and you’re going to be doing do it as easily or actually more easily I would argue that you can query your tables today. And if that happens just as we saw an enormous change in computing between 1994 and 2004 — which is basically the Web– I think we’ll see an equally big change in computing around data. In order to do this we’re going to need an open format for items.
Adam Bosworth: We’re gonna need to say that whether it’s listings or project reviews or tasks or schedules or books or motorcycles or whatever there’s some standard simple way that you could feed these things up. And it’s gonna have to be one grammar because otherwise the query engine that MySQL or anyone else or seek or Google builds is going to have a very hard time with it. It’s going to have to be sloppy because otherwise the average person is not going to figure out how to do it. It’s gonna have to be easy to serve the information up because otherwise there’s no point. The whole point of this argument is that we’re trying to open up and democratize how data is served up to the query world.
Adam Bosworth: And for that matter updated. I haven’t gotten there yet. And it’s gonna have to scale now the scale turns out to require something that’s going to cause everyone in this room to walk out of here after my talk thinking “what was he smoking?” I am a big believer in stupidity. On my blog. Adam Bosworth dot net I frequently talk about the virtues of dumbness. You can go read it for yourself and decide whether I know what I’m talking about. But I believe that complex things tend to break and simple things tend to work. And I believe that really simple things work amazingly well. Now I put my money where my mouth is. I went to work for a company that has probably the dumbest query language we have ever seen in the western–actually in the world.
Adam Bosworth: You type in some words. There is no language. Think about it. You go to Google you type in some words and we try and guess based on those words what you wanted. There is no structure no syntax no grammar no language. There’s just some words. But it’s cheap, it’s linear, it’s fast and it scales across the web. Now today there is actually something like this coming out and this is the point of how this actually can happen. I talked for the last 10 minutes about how we hadn’t realized the vision that I had 10 years ago. But we actually are starting to realize this vision and I don’t know if everyone in this room has been keeping an eye on it.
Adam Bosworth: How many people in this room know what RSS or Atom are. Well that’s very encouraging but about half the hands didn’t go up for those of you who didn’t raise your hand. It’s time to go find out. Because there’s every possibility that both RSS 2.0 and Atom and I’ll talk for a moment about the differences are going to be to data what HTML was to content. They’re going to be the lingua franca that we use to consume virtually all data.
Adam Bosworth: From everywhere. It’s a surprisingly simple model and a surprisingly complete one. I’m going to talk about in a second I’m actually not going to go into it here. It’s sloppily extensible to put it mildly. It serves you up a set of items: think of them as rows. And it gives you some things: Think of them as fields that you get for free like title and description and who wrote it and when was the stuff written or updated and what useful related data could you find and what kinds of tags does that have anyway.
Adam Bosworth: But then and is there an attached file and if so how big is it and what mime-type is it. But in addition it lets you stick anything else you want. So for example if you were looking at books you might want “price” if you were looking at bugs you might want “priority” and “assigned to” and so on and you can just stick this stuff straight in. So it’s completely extensible model. These guys got the web because they got the Web. This thing has taken off like wildfire.
Adam Bosworth: Originally it was started to use both by the news coming in the blog community but by now, for example Flickr which was mentioned earlier serves up RSS and most people starting to serve up RSS. I’m gonna skip past this in the interest of time because while I’m not down to my last 8 seconds, I am down to my last 8 minutes.
Adam Bosworth: But there is a cheat sheet just very quickly saying the only difference is between Atom and RSS. Atom was formed by a consortium of bloggers. Largely due in my opinion to an unnecessary political argument but, a consortium of bloggers got together and came up with sort of an alternative. They’re isomorphic and very virtually everybody in the world can read Atom can read RSS 2.0 and vice versa. They both have things called authors but in RSS they’re just text in Atom it tells you there’s a thing called a name and if you want an email if you want a URI. And then of course you can stick your own things in and has a thing called date but they’re formatted differently which is just nuts. And. You are all in one case it’s just your URL inside of the text and in the other case actually says what kind of URL is it. And in one case for text it can be rich text or not you don’t know: that’s RSS and the other case it actually says oh this is HTML text or plain text or what have you. Which is a little handy.
Adam Bosworth: So the key idea is this is very simple format. The Atom API had one other advantage. It doesn’t just let you ask for data. In other words, most of the RSS models you give it a URL and they give you back some data. And this is how we get blogs. It’s how we get the latest posts, how we get pictures, is how we get news stories these days, that’s how we get book listings. And for — how many people here know about A9 in the Amazon’s Open Search RSS. Now a much smaller percentage of hands.
Adam Bosworth: So I’m just gonna talk briefly. Amazon announced something called Open Search RSS a couple of weeks ago. And what Open Search RSS is is that their query engine will actually go. You can give it a little thing and say here if you give me a query and a URL that looks like this I’ll give you back RSS and if you do this they will incorporate it into their search engine. They will let you therefore show your used motorcycles or your books in your garage or anything else you want to be searchable. Now not everyone is going to want to search for used motorcycles. They have a model which frankly I think needs some work for how you decide which things you want to search over but they provided a model for directly plugging any data you want in. And I’m here to tell you that any programmer that can write a web page. (Apparently this is a tougher town than the prior speaker, maybe it was the content.)
Adam Bosworth: Anyone who wants content, and to author content can do this if you can write a web page you can serve up the RSS feed for Amazon what the Atom guys did is they also said sometimes you’d like to actually change the data. You might actually want to put in a new motorcycle. You might want to delete a motorcycle because it’s already been sold. You might want to actually replace the motorcycle descriptions because it turns out that it was hit in a large car crash with a smart car and while the smart car did worse the motorcycle is still damaged. You know whatever it is you want to do. You may want to be able to do backwards.
Adam Bosworth: In short you may want to say it’s not a read-only database. So there’s some extensions that we would need even over Atom and we’ve been looking at them but this is a very very simple way to talk about things over the web. I just mentioned this. This is the Amazon thing. Basically you give them a little thing and you say “here’s the URL to give me, here’s where you typed the text words and here’s how you say which page I want and how many roads I want” and then they’ll automatically feed that into the UI.
Adam Bosworth: That’s it. That’s what they give you. That’s the query string you get in in the GET and then what you give them back looks like this and this turns out to be RSS. And since we skipped over this just briefly RSS is typically a channel that contains a set of items and items typically contain title and link and description and these things here if you need them.
Adam Bosworth: So suppose that we actually built a model where everyone present presented their information this way. It was an open source stack where we encouraged anyone who had information could serve up the information.
Adam Bosworth: And when I say anyone that includes us I mean you Google anyone who has information and share up the information in this way to make it globally available. There’s still an issue. How do you massively distribute this thing? How do you cause it to scale? Now, It’s easy enough to come up with a basic model: the basic model is I send you in a query and that scales because I send it to any of any very large number of machines in the query goes in as a URL. And. Then there’s this HTTP model for either getting RSS or asking you to change the RSS but the queries have to be limited.
Adam Bosworth: You can’t go and give a query to people that says I want to find all the items that were sold by someone who came last Friday and was hired by me on the 5th of March and isn’t actually a felon because the average person who can write a web page can’t actually solve that query. They can solve the query that says I want all the things that have the word foo in it because they can use some text indexing software or something. But if you make the query hard they can’t do it. So it’s got to be easy to implement.
Adam Bosworth: The other issue is distribution I mentioned earlier that the thing about the Web is partitioning. Partitioning means that if I give a query in one machine with bottleneck I put the data on two machines if two machines would bottleneck I put the data on four machines four machines… and so on and so forth.
Adam Bosworth: But in order to do that the queries have to be queries that don’t need the data across machines. If the query needs data from all four machines at once that isn’t going to work very well. I’m gonna repeat that. In order to scale these queries linearly the queries have to be able to be queries that you can run just on the data that happens to be on the machine that you’re on. And since you can ultimately split all data across all machines that means they have to run at the item level. Now they have to run out the item level.
Adam Bosworth: They’re not going to be as rich as SQL. In particular. They’re not likely to do JOINs. I need to compare rows. They’re not likely to aggregate. I need to add rose up cross things and not likely to subqueries I need to compare rows and so on and so forth. These are going to not be queries as you know them. They’re likely to be things that say “I want to find everything that has an author called Fred” or “I want to find everything and has Fred” or “I want to find everything that was entered in the last six week”s or “I want to find everything in the following date range.”
Adam Bosworth: It’s a very different model. If we don’t do that it won’t scale. So the kind of querying I’m talking about here and this is why I said you’re going to walk out thinking What is he smoking is a much lower tech query. But it is one that virtually everybody can support. It is the HTTP and HTML of data. The model that this thing does says OK, now if I’m a PHP programmer or I’m a Java programmer I catch this query. Remember the query is trivial. I’m only operating at the transaction level so worst case I’ll just fetch all the items I’ve got and then item by item just compare. See if it meets the query.
Adam Bosworth: For updates. We have a very simple model. We basically say if you want to let the thing be updatable all you use a field that’s already in the Atom and RSS world called the GUID (globally unique ID) and probably some kind of a CRC32 to use the MySQL version of this that you would round trip. The semantics are gonna be up to the machine. Some machines are gonna say you can’t update data — the CRC32 has changed since you last read this item. Others are going to say no that’s cool I’ll just create a new version. Others are going to say you can update in the first place.
Adam Bosworth: And that’s just because you’re distributing this data again across load. But that’s like the web today.
Adam Bosworth: If you’re going to have a Web site doesn’t always mean you can do a FORM POST. That’s up to what authentication/authorization you have. So some won’t support DELETE. So what you won’t have again is you won’t have complex conditions and you probably won’t have UPDATE you’ll have REPLACE. Why do I say that? (I’m out of time so I’m not gonna explain why I say that). So the idea very simply I was — Zach complained to me that I had no diagrams — so I have to really hideous diagrams just to make Zach happy.
Adam Bosworth: Really hideous diagram number one is this one. Basically when you come in with a query it’s going to go out to one or more partitions and those partitions are going to go and ask one or more machines for the data because often you’ll duplicate the data as a result of the query and they’re going to send that back and then it will send it back to you. Notice that this works because a query can be sent to each and all machines because the query only operates at the item level. If you try and change data on the other hand.
Adam Bosworth: You’re gonna send in a change request which is going to go to only one machine the one that owns that item. And that machine is then going to send you back what happened. Either your changes were accepted or your changes were denied. So one model scales linearly across caching, one with slaves and one doesn’t. And that’s like. But if we do this, if we build a model where virtually everybody can consume this model of Extended RSS, and people like MySQL start supporting this. So we work with them to deliver this kind of feature. You guys can be Google too. You guys can handle hundreds of millions of queries a day. You can scale up and out in ways that Oracle could only dream of.
Adam Bosworth: You can effortlessly support hard questions where you search the entire web for restaurant reviews and report the ones that we’re interested in about a place and a time by people you you cared about because you can write your own intelligence as you consume this data. In short the technology you’ve been using so far to solve the problems for an individual company with their data. You can now apply to search and asking questions about any data anywhere in the world at any time.
Adam Bosworth: I’m gonna skip over the hard issue. What makes this possible is really simple sloppy highly scaled scalable protocols.
Adam Bosworth: HTML and HTTP did this for the user interface. I believe that RSS 2.0 and Atom can do this for data. There are some things we would need to add: a formalization for query which we’re looking at just doing an extension of what Amazon’s already done. And a standard model for replacing, annotating, archiving, and inserting new items. But when this happens I think that the big centralized databases that right now I feel like sometimes MySQL is trying to compete with, will be completely passé. We will speak about them just ten years from now as being those strange things that never really got the Web.
Adam Bosworth: And instead the open model. Open not just in the open source model that MSQL is today but open in the protocol model. That will replace it will change the world. Thank you.
Adam Bosworth: So, I’m gonna be late for a flight but I can take one or two questions.
Adam Bosworth: The question as best as I can paraphrase it is: “if Moore’s Law is already driving down by an order of magnitude the cost of disk and the cost of memory is there really a scale problem?” Is that the question?
Adam Bosworth: So the comment was Miguel had 30 mega– gigabytes actually– of photos from Beirut. The bigger issue is suppose there’s lots of Miguels out there. And suppose I now want to start querying for all the people who have tagged something as whatever it was Miguel tagged his photos and I’m doing that across the entire web. How do I do it? And that’s the point that I’m really addressing. I’m saying that it’s not I don’t put them all in one database. I don’t go to everybody in the world including Miguel and say “oh all of you have 30 gigabytes of photos please put them all in one centralized database” because if I took a billion people and asked them to put 30 gigabytes of photos and I believe that’s 3000 petabytes and 3000 petabytes is 3 million terabytes, which is, if I have my math right which is a really large amount of data. And you know the average database isn’t going to deal well with over three million terabytes even today. On the other hand if we distribute this. So instead I can go simultaneously to all of the machines like Miguel and ask for questions of what have you got that’s tagged and tagged as one of the things in my query then I solved that problem.
Adam Bosworth: Next question.
Adam Bosworth: I actually did mention the semantic web but I ran out of time. I said two things are actually three things about the semantic web–that’s in the slides.
Adam Bosworth: Number one I said this is not a semantic web. And I actually put it in capitals. I did the equivalent of shouting. I don’t understand top down ontologies. I don’t understand invariant ontologies. I don’t understand. I’ve never been able to get people within one company to agree on the schema of one thing. So how I would get 6, 7 billion people to agree on the schema of anything is completely beyond me. So I’ve specifically said this is not some sort of subject/verb/object or a semantic model for navigation. You– the semantic problem is still yours. All I said is I’ll serve up those pictures with tags. You have to figure out if the tags are talking about the same thing. If “ME” means Maine or means Maryland or whatever the heck it is. So the second thing I said about this semantic I’m going to say that semantic web is RDF actually has failed empirically. The test is simple and easy and flexible that I describe such flexible enough. But the average person and we can see it’s in a very simple way. RSS 1.0 was an RDF grammar. RSS 2.0 was not. RSS 1.0 was a standards based thing supported by everybody everyone had come together they’ve been a long group growth. They’ve all talked about how pure and simple and wonderful it was. RSS 2.0 was one grouchy guy, like me sort of a difficult guy, grew up in Brooklyn New York, who went off and in a hissy fit went and wrote it down. RSS 2.0 won in a walk. And it won in a walk because virtually any programmer in there who can breathe can figure out how to emit RSS 2.0 and all of them get confused about how do you stuff as arcs and nodes and verbs or RDF grammar so I didn’t mention it because frankly– I’m a huge skeptic.
Adam Bosworth: OK. Thank you.
Doug Kay: You’ve been listening to a presentation delivered at the O’Reilly Media MySQL Users Conference April 18th through 21st 2005 in Santa Clara California. I.T. Conversations audio content is delivered by limelight networks the content delivery network for online broadcasts, music, movies, and software downloads. Taking the cost and complexity out of internet distribution on the web at www.limelightnetworks.com. I.T. conversations is listeners supported so please contribute to our tip jar. One hundred percent of your donation will go to our team of volunteers who produce these programs for you in their spare time. The series producer for IT conversations coverage of the MySQL Conference is Arun Talkally the website producer for this program was Jim Alloterris. Post-production audio by Bruce Sharpe. My name is Doug Kay and I hope you’ll join me next time for another edition of I.T. Conversations.