Could a couchdb guru explain this, please?

I’m in the process of trying to build (and benchmarking) a couchdb project and I decided to use some word count & frequency samples as data. Since “word count” and “grep” are the quintessential map/reduce examples I thought this would be fairly simple.

However, couchdb doesn’t seem to be following the expected semantics.

Let’s say I’ve got some data, here’s how it looks in python;

>>> import couchdb
>>> s = couchdb.Server()
>>> db = s['kw2']
>>> for d in db: print db[d]
...
<Document '133da883092e206d7191f81661beb813'@'3188228489' {'word': 'ho'}>
<Document '2287406943e627278d98a3a2f3d3483b'@'634745217' {'word': 'do'}>
<Document '2717deb4df8ba09601166021fb758126'@'2083376980' {'word': 'mo'}>
<Document '38d48e8e069538a55902dd2d2b7e1771'@'2475366164' {'word': 'ho'}>
<Document '39ef4a9e3eb0eeb02d483ce658d08356'@'2904312995' {'word': 'hi'}>
<Document '4237064ad7a89fa11e9bbbc8ca4ed302'@'722283984' {'word': 'do'}>
<Document '4d0e61dedaf2af93a9d4d261cab696de'@'996995145' {'word': 'we'}>
<Document '55ba96501ed1e9573b2cb6e647c35b47'@'3153984663' {'word': 'my'}>
<Document '5be13ca69c76d202b131d50f5b9c1ecb'@'1584030189' {'word': 'do'}>
<Document '612e4a0d32f4c91f7fb2414e4de47845'@'3488016124' {'word': 'be'}>
<Document '61426c868dc388e6edb2b4ce2078ce06'@'2761346180' {'word': 'me'}>
<Document '908acaf4ad704951dbb08d27ddfbe9a9'@'941727127' {'word': 'mo'}>
<Document '9136e093fda2dda7d5585983299fcbc7'@'4166962206' {'word': 'mo'}>
<Document '9decb25944110c04d040feb31e532c78'@'1016718857' {'word': 'do'}>
<Document 'ad7f4aab329d55c3a2fb97390df5ae0a'@'1660663052' {'word': 'my'}>
<Document 'c4d976a789e37e1c3eb4d57bd50d47aa'@'923287257' {'word': 'my'}>
<Document 'cccf15515077d100498573fe40244130'@'3846996388' {'word': 'hi'}>
<Document 'd747a88eb2cb18776237852aceff96fc'@'3596694550' {'word': 'we'}>
<Document 'dc115f5d42d442f0b5e7d3680aeb62c2'@'3446491946' {'word': 'to'}>

Feel free to add your own but that’s what I’ve got. Each doc has a simple structure, an “_id” (supplied by couchdb when the document is created) and an element called “word” which obviously contains some fabricated two letter structures (which I hesitate to actually call words).

What’s important to note is that the same word may appear in multiple documents.

Now we want to build a view to show each word as well as the sum of how many times it appears in our database.

Again, following the classic paradigm we build our map function (in javascript) as such;

function(doc) {
  emit(doc["word"], 1);
}

So far so good, now reduce;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return value.length;
   }
}

You can pretty much ignore the “rereduce” clause as our dataset’s not big enough right now, nor are we updating it. However, I will mention explain the function’s trick which is that while sum(value) is actually the “mathematically correct” action to take regardless of whether this is our first time through, we’re relying on the fact that since we’re emitting a “1” for each key (i.e. each word instance) that the sum of those values is simply the length of the array we’re passed in. [I learned this from one of the masters]

Ok, despite the attempt at “premature optimization” this actually seems to work out, or at least it looks to when shown in the couchdb key/value view. Here’s my screenshot for proof;

picture-21

However, what I see from a direct URL query to this view is markedly different then the data that’s represented. To test this either use Firefox or a command line client like curl and go to the following url;

http://localhost:5984/kw2/_view/finding/word_count

What I see (and I suspect you will as well) is

{"rows":[{"key":null,"value":19}]}

Which seems to break our expected key/value pairing!!!

Suspecting my understanding of couchdb’s map/reduce representation has been occluded by all the Google videos I’ve watched, it seems like an intuitive modification might be to change our reduce function to return the key & and the value, like this;

return [key, value];

However, that yields an even more shocking outcome;

{"rows":[{"key":null,"value":[[["we","d747a88eb2cb18776237852aceff96fc"],["we","4d0e61dedaf2af93a9d4d261cab696de"],["to","dc115f5d42d442f0b5e7d3680aeb62c2"],["my","c4d976a789e37e1c3eb4d57bd50d47aa"],["my","ad7f4aab329d55c3a2fb97390df5ae0a"],["my","55ba96501ed1e9573b2cb6e647c35b47"],["mo","9136e093fda2dda7d5585983299fcbc7"],["mo","908acaf4ad704951dbb08d27ddfbe9a9"],["mo","2717deb4df8ba09601166021fb758126"],["me","61426c868dc388e6edb2b4ce2078ce06"],["ho","38d48e8e069538a55902dd2d2b7e1771"],["ho","133da883092e206d7191f81661beb813"],["hi","cccf15515077d100498573fe40244130"],["hi","39ef4a9e3eb0eeb02d483ce658d08356"],["do","9decb25944110c04d040feb31e532c78"],["do","5be13ca69c76d202b131d50f5b9c1ecb"],["do","4237064ad7a89fa11e9bbbc8ca4ed302"],["do","2287406943e627278d98a3a2f3d3483b"],["be","612e4a0d32f4c91f7fb2414e4de47845"]],19]}]}

Of course I’m still baffled as to why we seem to have no entry set for key and all our rows as values.

However, my larger concern is beyond even that perplexing situation;

What’s most surprising here is that the key we’re being passed includes the doc id even though it was not emitted as part of our map phase!

Let’s give it one last go here, thinking perhaps we need to be more explicit;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return {"key": key[0],"value": value.length};
   }
}

Unfortunately, this seems to still not yield the organized rows we expected and returns;

{"rows":[{"key":null,"value":{"key":["we","d747a88eb2cb18776237852aceff96fc"],"value":19}}]}

Which stands in high contrast to what couchdb continues to show us;

picture-11

So whatever we emit from reduce ends up as the value part of the reply (as indexed by “value”). Which matches our original expectation (that couchdb will handles setting this based) but doesn’t explain why it’s “null”.

In short I’m left with three questions;

1) Why does couchdb pass our reduce function the doc ID, when it’s not emitted in the map phase!

2) Why is “key” null in our output?

3) How do we get our JSON output to match the same pretty key/value representation that couchdb shows?

I wish I could promise that if you tune in next time I’ll have the answers but we’ll have to rely on the good nature of our experts out there to help us out.

About jay

I'm trying to build something interactive where I can learn from others and hopefully share useful knowledge too. thecapacity@gmail.com
This entry was posted in code, couchdb, frustration. Bookmark the permalink.

3 Responses to Could a couchdb guru explain this, please?

  1. Pingback: thecapacity » Blog Archive » CouchDB Performance or Use a File

  2. jay says:

    Thanks Chris, that does indeed give me what I want!!

    I’m kind of surprised by the option because it seems backwards, i.e. I could understand if grouping the results gave me the total count, but not sure why the default isnt’ to emit each distinct key / value result.

    I hadn’t thought to check with firebug so I’ll play around with “group_level”.

    I’m still not sure why “key” in the reduce function is actually the key & the doc ID…

  3. J Chris A says:

    QUOTE:

    http://localhost:5984/kw2/_view/finding/word_count

    What I see (and I suspect you will as well) is

    {“rows”:[{“key”:null,”value”:19}]}

    ENDQUOTE

    I expect that 19 is the total number of documents in your database (or rows in your view). The output you see is the count of all words (regardless of which word they are). If you wanted to find the count of “do” you’d query:

    http://localhost:5984/kw2/_view/finding/word_count?key=“do”

    If you wanted to find the count of all the words that start with a, b, c, or d, you’d query

    http://localhost:5984/kw2/_view/finding/word_count?startkey=“a”&endkey=”d\u9999″

    NOTE that the high numbered unicode character is a bit of a hack. We’re considering changing endkey to be an open interval. Once that’s done, instead of d-followed-by-the-last-character-you-can-think-of, you could just have endkey=”e” ENDNOTE

    To get what you expect here, you should make the same HTTP request that Futon does. Perhaps it could be better documented but most people just spy on it in Firebug.

    I think you want:

    http://localhost:5984/kw2/_view/finding/word_count?group=true

    Group=true is just a macro that says “for every unique key in the map, run the reduce”. As such it has cost O(K log N) where N is the size of the map and K is the number of unique keys.

    group_level gives you even more control over the grouping (when the keys are json arrays).

    Hope that helps!

Comments are closed.