Friday, November 20, 2009

Raindrop, CouchDB and data models

Raindrop uses CouchDB for data storage. We are starting to hit some tough issues with how data is stored and queried. This is my attempt to explain them. I am probably not the best person to talk about these things; Mark Hammond, Raindrop's back-end lead, is a better candidate. But I am hoping that by trying to write it out myself, I can get a better understanding of the issues and trade-offs. Also note that this is my opinion/view; it may not be the view of my employer, work colleagues, etc...

First, what are our requirements for the data?
  • Extensible Data: we want people to write extensions that add new kinds of data.
  • Rollback: we want it to be easy for people to try extensions, but this means some may not work out. We need to be able to roll back an extension by easily removing the data it creates.
  • Efficient Querying: We need to be able to efficiently query this data for UI purposes. This includes possibly filtering the data that comes back.
  • Copies: Having copies of the data helps with two things:
    • Replication: beneficial when we think about a user having a Raindrop CouchDB on the client as well as the server.
    • Backup: for recovering data if something bad happens.
How Raindrop tries to meet these goals today

Extensible Data: each back-end data extension writes a new "schema" for the type of data it wants to emit. A schema for our purposes is just a type of JSON object. It has an "rd_schema_id" on it that tells us the "type" of the schema. For instance, a schema object with rd_schema_id == "rd.msg.body" means that we expect it to have properties like "from", "to" and "body" on it (a sample document follows the list below). Details on how schemas relate to extensions:
  • An extension specifies what input schema it wants to consume, and the extension is free to emit no schemas (if the input schema does not match some criteria), or one or more schemas.
  • Each schema written by an extension is stamped with a property rd_schema_provider = "extension name".
  • All of a message's schemas are tied together via an rd_key value, a unique, per-message value. Schemas that have the same rd_key value all relate to the same message.
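For illustration, a schema document might look something like this (the _id layout, extension name and field values are made up for this example; only the rd_* properties come from the description above):

    {
      "_id": "message!tweet.12345!rd.msg.body",
      "rd_key": ["tweet", "12345"],
      "rd_schema_id": "rd.msg.body",
      "rd_schema_provider": "rd.ext.tweet-to-body",
      "from": ["twitter", "someuser"],
      "to": [["twitter", "me"]],
      "body": "An example message body."
    }
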
More info is on the Document Model page.

Rollback: Right now each schema is stored as a separate couch document. To roll back an extension, we just select all documents whose rd_schema_provider matches the extension's name and remove them. As part of that action, we can re-run extensions that depended on that data to have them recalculate their values, or just remove the schemas generated by those extensions.
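A minimal sketch of how that selection could work, written as a CouchDB view map function (illustrative only, not the actual Raindrop code):

    function (doc) {
      // Key every schema document by the extension that wrote it, so
      // all of an extension's output can be fetched, and then deleted,
      // in one view call.
      if (doc.rd_schema_provider) {
        emit(doc.rd_schema_provider, null);
      }
    }

Querying that view with key="extension name" (and include_docs=true) returns the documents to remove.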

Having each schema as a separate document also helps with the way CouchDB stores data -- if you make a change to a document and save it back, CouchDB appends the new version to the end of the storage file. The previous version is still in storage, but can be removed via a compaction call.
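For reference, triggering compaction is a single HTTP call against the database (assuming a database named "raindrop"):

    POST /raindrop/_compact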

If we store all the schemas for a message in one CouchDB document, then every schema an extension writes means saving a new copy of a larger document, appending more data to storage and making compaction much more necessary.

Efficient Querying: Querying in CouchDB means writing Views. However, a view is like a query that runs as data is written, not when the UI actually wants to retrieve the information. Views can then be very efficient and fast when actually called.

However, the downside is that you must know the query (or have a pretty good idea of it) ahead of time. This is hard since we want extensible data. There may be some interesting things that need to be queried later, but adding a view after there are thousands of documents is painful: when you create the view, you must wait for couch to run all the documents through it.

Our solution to this, started by Andrew Sutherland and refined by Mark, was to create what we call "the megaview". It essentially tries to emit every piece of interesting data in a document as a row in the view. Then, using the filtering capabilities of CouchDB when calling the view (which are cheap), we can select the documents we want to get.
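I will not reproduce the real megaview here, but the basic shape is something like this simplified sketch (not the actual code):

    function (doc) {
      // Emit one row per property, keyed by [schema id, property name,
      // property value], with the message's rd_key as the row value.
      // A caller can then select on any single property using the
      // key/startkey/endkey view parameters.
      for (var prop in doc) {
        if (prop.charAt(0) !== '_' && prop.indexOf('rd_') !== 0) {
          emit([doc.rd_schema_id, prop, doc[prop]], doc.rd_key);
        }
      }
    }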

Copies: While we have not actively tested it, we planned on using CouchDB's built-in replication support. This was seen as particularly valuable for master-master use cases: when I have a Raindrop CouchDB on my laptop and one in the cloud.

Problems Today

It feels like the old saying, "Features, Quality or Time, pick two", except for us it is "Extensible, Rollback, Querying or Copies, pick three". What we have now is an extensible system with rollback and copies, but the querying is really cumbersome.

One of the problems with the megaview: there is no way to do joins. For instance, "give me all twitter messages that have not been seen by the user". Right now, knowledge of a message being from twitter is in a different schema document than the schema document that knows whether it has been seen by the user. And the structure of the megaview means we can really only select on one schema property at a time.

So it means making multiple megaview calls and then doing the join in application code. We recently created a server-side API layer in Python to do this: the browser makes one call to the server API, and that API layer makes multiple network calls to CouchDB to get the data, then does the join merging in memory.
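The sketch below shows the idea in JavaScript (the real layer is Python, and the function name and row shapes here are assumptions): one megaview call finds the twitter messages, another finds the seen messages, and we subtract.

    // Hypothetical in-memory join over rows from two megaview calls,
    // where each row's value is the message's rd_key (see the megaview
    // sketch above).
    function unseenTwitterKeys(twitterRows, seenRows) {
      var seen = {};
      seenRows.forEach(function (row) {
        seen[JSON.stringify(row.value)] = true;
      });
      return twitterRows.filter(function (row) {
        return !seen[JSON.stringify(row.value)];
      }).map(function (row) {
        return row.value; // rd_keys of twitter messages not yet seen
      });
    }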

Possible solutions

Save all schemas for a message in one document and more CouchDB views
Saving all the schemas for a message in one document at least makes it possible to consult a single document for the "type=twitter, seen=false" sort of data, but we still cannot query that with the megaview. It most likely means using more CouchDB views to get at the data. But views are expensive to generate after data has been written, so this approach does not seem to scale for our extensible platform.
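A combined document might look something like this (a hypothetical layout, not a settled design):

    {
      "_id": "message!tweet.12345",
      "rd_key": ["tweet", "12345"],
      "rd_schemas": {
        "rd.msg.body": { "from": ["twitter", "someuser"], "body": "..." },
        "rd.msg.seen": { "seen": false }
      }
    }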

This approach means taking a bit more care on rollbacks, but it is possible. It also increases the size of data stored on disk via Couch's append-only model, and will require compaction. With our existing system, we could consider just never compacting.

This is actually the approach we are starting to take. Mark is looking at creating "summary documents" of the data, where the summary documents are based on the API entry points and the kind of data the API wants to consume. These API entry points are very application-specific, so the summary document generation will likely operate like just another back-end extension. Mark has mentioned possibly going to one document to store all the schemas for a message too.

However, what we have not sorted out is an easier join model: "type=twitter and seen=false". What we really want is "type=twitter and seen=false, ordered by time with most recent first". Perhaps we can get away with a small set of CouchDB views that are very specific and that we can identify up-front; searching on message type and seen/unseen state, ordered by time, seems like a fairly generic need for a messaging system.
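If we went that route, one such up-front view might look like this (a sketch that assumes a summary schema carrying all three properties; the schema and field names are hypothetical):

    function (doc) {
      // Hypothetical summary schema with type, seen and timestamp fields.
      if (doc.rd_schema_id === 'rd.msg.summary' && doc.timestamp) {
        // Keying by [type, seen, timestamp] lets a range query with
        // descending=true answer "type=twitter and seen=false, most
        // recent first" directly.
        emit([doc.type, doc.seen, doc.timestamp], null);
      }
    }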

However, it means that the system as a whole is less extensible. Other applications on the Raindrop platform would need to either use our server API model of querying the megaview and then doing joins in their app API code (which may not be so easy to learn or perform), or tell the user to take the hit of waiting for their custom views to catch up with all the old messages.

Something that could help: make CouchDB views less painful to create after the fact. Right now, creating a new view and then changing any document means waiting for that view to index all the documents in the couch, and it seems to take a lot of resources for this to happen. I think we would be fine with something that started with the most recent documents and worked backwards in time, using a bit more resources at first but then tailing off and doing more of the work in the background, while allowing the view to return data for the documents it has already indexed.

Do not use CouchDB
It would be very hard for us to move away from CouchDB, and we would likely try to work with the CouchDB folks to make our system work best with couch and vice versa. It is helpful, though, to look at alternatives and make sure we are not using a hammer where we need a screwdriver.

Schema-less storage is a requirement for our extensible platform. Something that handles ad-hoc queries better might be nice, since we are basically running ad-hoc queries through our API layer now: it has to do all the join work each time, for each request.

Dan Goldstein in the Raindrop chat mentioned MongoDB. Here is a comparison of MongoDB and CouchDB. Some things that might be useful:
  • Uses update-in-place, so the file-system impact and need for compaction are smaller, and storing all of a message's schemas in one document is likely to work better under this model.
  • Queries are done at runtime (see the example after this list). Some indexes are still helpful to set up ahead of time, though.
  • Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode time as data passes through couch and on to our API layer. This may be improving, though.
  • Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, our data model, the megaview and now the need for a server API layer mean that querying the raw couch with REST calls is actually not that useful. The harder issue is figuring out the right queries to run and how to do the "joins" effectively in our API app code.
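For comparison, if all of a message's schemas lived in one document, the "type=twitter and seen=false, most recent first" query could be written directly in the MongoDB shell (collection and field names are hypothetical):

    // Runtime query; an index on these fields would make it fast.
    db.messages.find({ type: 'twitter', seen: false }).sort({ timestamp: -1 });
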
What we give up:
1) Easy master-master replication. However, for me personally, this is not so important. In my mind, the primary use case for Raindrop is in the cloud, given that we want to support things like mobile devices and simplified systems like Chrome OS. In those cases it is not realistic to run a local couch server. So while we need backups, we are probably fine with master-slave. To support the sometimes-offline case, I think the more likely path is HTML5 local storage. But again, that is just my opinion.

2) Ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language; right now we are on the path to inventing our own, via a server API "query language" made up of other API calls.

Persevere might be a possibility. Here is an older comparison with CouchDB. However, I have not looked at it in depth; I may ask Kris Zyp more about it and how it relates to the issues above. I have admired it from afar for a while. While it would be nice to get other features like built-in comet support, I am not sure it addresses our fundamental issues any differently than, say, MongoDB. It seems like an update-in-place model is used, with queries run at runtime. But it is definitely worth more of a look.

Something else?

What did I miss? Bad formulation of the problem? Missing design solution with the tools we have now?

10 comments:

Anonymous said...

I haven't tried it yet, but Riak looks pretty interesting - especially for the 'join' problem.

Unknown said...

Hi James, nice analysis. I agree that it would generally be more useful if CouchDB views indexed the most recently updated documents first, instead of starting with the oldest data. The current model is cleaner internally; that's probably why it works the way it does now.

By the way, you *can* use the results of a view as it's being generated if you pass the stale=ok parameter in the view query. Combine that with a newest-data-first indexer and you've got something more useful.
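For example (the design document and view names here are just placeholders):

    GET /raindrop/_design/raindrop/_view/megaview?stale=ok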

I'm definitely in favor of implementing a way to "re-nice" background tasks like view generation. It shouldn't get in the way of regular production running, especially if it's indexing low-priority data.

James Burke said...

jpick: Riak seems really new, and reading the docs, it sounds like an erlang-based map/reduce system. So at first look, I do not see it providing anything new over CouchDB. Maybe as it gets more traction and improves, things might change for it.

kocolosk: The stale=ok parameter is nice, but as you also mention, it is just part of the solution. We talk with the CouchDB folks, and I am sure we can talk more with them about newest-data-first views that play nice as the data gets older. That might be enough to make ad-hoc queries, and queries generated for extensions, easier to support.

Unknown said...

James, yep, I should have disclosed up front -- I'm a "CouchDB folk". We're listening :-)

Dan Mosedale said...

I think it's a good problem summary; very helpful to get all these things written down in one place. My suspicion is that we're going to need to actually prioritize the list of requirements, and I suspect that copies and rollback probably want to be lower priority than the other two.

Dean Landolt said...

I should point out that persvr's index-everything approach is pretty much the same as your megaview -- though it has the added benefit that it can stop updating indexes that aren't ever used. You can also use JSONSchema [1] to control that process (don't index this field, don't stop indexing that field) [2].


"2) ad-hoc query cost may still be too high."

But ad-hoc query costs are always high (by some definition of high). This is why couch's views are a real win (as long as your query isn't multidimensional). In persvr you can hook into an object's lifecycle, so you can effectively do what couch does (albeit less efficiently, I imagine) to generate new "views" on the fly. However, it's far more powerful in a lot of ways -- you can precache more than just JSON: you could run the object against a template language and generate any necessary HTML, Atom, whatever...

Oh, and don't forget that you can do other cool stuff to requests/responses before they even hit the DB (persvr supports JSGI 0.2, and Kris is building version 2 from the ground up on 0.3).

Definitely have a look and hit the ML or pop into the #persvr channel on irc.freenode.org with any questions...


[1] http://docs.persvr.org/documentation/storage-model/json-schema

[2] http://docs.persvr.org/documentation/storage-model#TOC-JavaScriptDB

Dean Landolt said...

I forgot to mention -- there are a few other benefits from the persvr side...

First, you could drop all that crazy Twisted code and just hook into an object's lifecycle events -- all your code can live in one place (in the same language -- yay js)...

Another awesome thing -- extensions would be even easier (and would still be js) and could still be installed/enabled/disabled/removed on the fly (perhaps as simple as a dozen-line JSGI middleware).

Finally, deployment would be insanely easy (and after all, this is supposed to be a locally-deployed app). "Install jre if necessary, copy persvr here, copy the raindrop app code there, set up as a service, run" -- an installer would be simple to write.

James Burke said...

Dean Landolt: Thanks for the Persevere info!

What I like about Persevere is all the other stuff around the actual data store. The data store might be nice too. What if Persevere used CouchDB as the data store with our megaview concept, but then used all the Persevere stuff for the server-side JavaScript/filters and such? That might be the best of both worlds.

Or would that not be possible in Persevere? What would be lost with that approach?

I guess I do not have a good handle on the performance characteristics of Persevere's JavaScriptDB. At one time I thought it was actually backed by MySQL, but I think that changed.

Dean landolt said...

James:

Persevere's default data store is actually something Kris wrote called jsDB. I've found it to be plenty fast, but while I haven't tested on this front, I'm sure it can't handle the kind of concurrency that couch is designed for. But where would that ever come into play in a product like Raindrop? It's local for every user, so many of the massive scalability benefits of couch go away. I understand there is some thrust to host Raindrop in the cloud a la Bespin, so couch _would_ be a win there, but I'd bet the bottleneck (and security issues) would be in the server-side Twisted code more than the client app anyway. IMHO it'd be better to focus on the ease-of-install story -- this is one kind of app that many (most?) folks would prefer in isolation on their desktop (with perhaps some kind of hosted mirror).

To your point of using persvr as the framework with couch as the datastore, that's absolutely possible, but you'd have to write a couch datastore adapter to get all the way with the persvr goodness (I'm using persvr as a proxy to MarkLogic for some heavy xml storage to great success so far but I'm also storing everything in jsDB to get the benefits of jsonschema, jsonquery, rest channels, conditional MVCC and the like).

For the next version of persvr (pintura) Kris is using WebSimpleDB (or whatever the cool kids are calling it these days -- ISAM Level 1 or some such) as the means by which to implement stores. I have no clue how long it'll be before that's ready for general use but when it is I'd be happy to write a couch (and perhaps mongo) store implementation.

But I guess what I'm still missing is what you gain from the couch dependency in this context. I'm a big believer in jchris' couchapp idea, but there are so many limiting factors that writing a `real` couchapp (with no server-side proxy) is nearly impossible. What's clean in theory would ultimately end up as architectural spaghetti, as opposed to a cleaner CommonJS model (which was obviously not possible not too long ago).

If you have a big sample megaview I can access, I'd be happy to push it into a hosted persvr server for you to test on.

James Burke said...

Dean landolt: I apologize for the late reply.

For me, the primary use case for Raindrop is in the cloud. But that is my personal opinion, not necessarily shared by the whole Raindrop team.

I think once we look at hosting things like mobile interfaces and even lightweight netbook/Chrome OS machines, installing stuff on the local box gets more unrealistic.

If we need "offline" support I expect using HTML5 features like local storage and an app cache are more likely what will be used.

We still need something that can run locally, at least for advanced users (like raindrop developers), but to me the cloud is the real use case, the highest yield.

I have been thinking about how to get you a dataset to play with. It seems like the best way to go is the public Enron mail data. Mark Hammond did some exploratory work with this last week, for a local perf benchmark he was working on.

Talking with him tonight, it seems best if we can get part of those emails, maybe for one user, into mbox format, and then create a raindrop protocol to pull in mbox data and drop it into raindrop.

I may get to looking at that this week, but it is a lower priority for the week. However, I am still interested in exploring this conversation and trying out real data.