CouchDB Performance or Use a File

If this is the first post you’ve read from my blog, you should probably check out some others and see for yourself that I’m a big fan of CouchDB.

Even if it weren’t easy to be impressed by Damien Katz’s work, it would be hard to overlook the interest his code has created. If even those miracles weren’t enough for you, just look at what other amazing minds have done. Finally, for the truly skeptical, there’s now a business you can contact.

There are quite a few things that make it an amazing piece of engineering, including its simplicity of purpose, something you don’t often get a chance to appreciate these days. I’m a big fan of pipes (lots of individual pieces each doing their dedicated task), and to me the MapReduce model epitomizes that behavior.

Still, there’s a lot I’ve taken on faith, since I’ve been trying to leverage it for projects rather than dedicating weeks to its internals. One of those articles of faith has been the assumption of performance.

Truly, it’s more than just an assumption. Reports are that CouchDB’s performance is already quite decent even though it hasn’t been tuned yet, so I’ve never attempted to benchmark its behavior.

Instead I’ve been working on some language processing code for my wife. Learning about NLP has aligned with my A.I. background, although it’s reminded me about all the math I’ve forgotten!

After reading samples and feeling like I’d gotten my legs underneath me, I decided to “port” a nice little example over to CouchDB. If you want to play along at home, you’ll want to check out the article and grab the author’s code (and keywords2.txt file).

Although the code is geared more toward education than performance, it still runs fairly snappily on my laptop, with pretty consistent times:

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m0.286s
user	0m0.228s
sys	0m0.047s

I’ve also run it with some performance sampling but let’s stick with simple timing for now.

There are about 124,580 “words” in the text file:

>>> key_file = open("keywords2.txt")
>>> data = key_file.read()
>>> words = data.split()
>>> len(words)
124580

This data is then used to build a word frequency count, regenerated each time the program is run; not bad for 0.2 seconds’ worth of work!
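
If you haven’t seen the original, that frequency count amounts to something like the following sketch (the function name here is mine, not the original author’s):

def word_frequencies(path = 'keywords2.txt'):
    # Count how many times each word appears in the file.
    counts = {}
    for word in open(path).read().split():
        counts[word] = counts.get(word, 0) + 1
    return counts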

Naturally, static text data supplemented with some derived “data structures” (like a total count, or a view showing each word and the number of times it appears) is a perfect case for CouchDB.

So I decided to simply load this data right into CouchDB. Here’s how I did it, skipping details like creating the database itself:

def load_db():
    # 'db' is a couchdb-python Database handle, created elsewhere.
    key_file = open('keywords2.txt')
    data = key_file.read()
    words = data.split()
    for word in words:
        # One document (and one HTTP request) per word.
        node = db.create( { "word": word } )

You’d expect this to take some time; databases provide valuable services, but of course they can only do so at the expense of some cycles. However, I was surprised to find that this took almost 30 minutes!

time ./couchdb_finding_keywords.py 

real	27m2.356s
user	2m35.921s
sys	1m14.478s

Based on the low user and sys times, you can guess most of the delay is due to transport overhead, i.e. network communication. This is all going to a CouchDB instance running on localhost, on a MacBook with 4 GB of RAM and a 2.0 GHz Intel Core 2 Duo, so it’s a bit surprising but not really critical.

I didn’t bother running this three times and taking an average; the keywords2.txt file should already be in memory, having been read by the file-backed example. Nor is upfront cost a big consideration for me: I’m willing to spend the time once, especially if it can save me work on the back end!
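
As an aside, CouchDB also has a _bulk_docs API that batches many documents into a single POST (J Chris makes the same point in the comments below), which should avoid most of that per-document round-trip cost. Here’s a rough sketch of that approach; db_name and simplejson are the same pieces used in the examples below, and the batch size is an arbitrary choice:

import urllib2
import simplejson

def bulk_load(batch_size = 1000):
    # Send documents to CouchDB's _bulk_docs endpoint in batches
    # instead of making one HTTP request per word.
    words = open('keywords2.txt').read().split()
    u = "http://localhost:5984/%s/_bulk_docs" % (db_name)
    for i in range(0, len(words), batch_size):
        docs = [ { "word": w } for w in words[i:i + batch_size] ]
        body = simplejson.dumps({ "docs": docs })
        req = urllib2.Request(u, body, { "Content-Type": "application/json" })
        urllib2.urlopen(req).read()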

So naturally I was pretty excited to port things over to a more CouchDB / Pythonic example, and here’s what I came up with. After you load your data, you then need a view, which you can get from my previous post, along with jChris’ helpful comment. Note: if this is your first time with this stuff (or even if it isn’t), you may want to practice on a smaller database first!
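
For reference, the finding/word_count view used below boils down to a design document along these lines (my sketch, not the exact code from that post): the map emits each word with a count of 1, and the reduce sums the counts, so with no grouping you get the overall total and with ?group=true you get per-word counts.

import urllib2
import simplejson

def create_view():
    # Store the 'finding' design document containing the word_count view.
    design = {
        "language": "javascript",
        "views": {
            "word_count": {
                "map": "function(doc) { emit(doc.word, 1); }",
                "reduce": "function(keys, values, rereduce) { return sum(values); }"
            }
        }
    }
    u = "http://localhost:5984/%s/_design/finding" % (db_name)
    req = urllib2.Request(u, simplejson.dumps(design),
                          { "Content-Type": "application/json" })
    req.get_method = lambda: "PUT"
    urllib2.urlopen(req).read()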

Next we’ll need some code to get this data, and while I highly recommend the fantastic couchdb-python library, for the rest of my examples I’ll use JSON & urllib to remove a layer of indirection.

Here’s how we can get the overall word count (used to calculate relative frequencies):

import urllib
import simplejson

def total_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count" % (db_name)
        j = simplejson.loads(urllib.urlopen(u).read())
        # Sample Output: {"rows":[{"key":null,"value":19}]}
        return j['rows'][0]['value']
    except:
        return 0

We can do the same thing with “?group=true” in our URL to get the individual words, each with its respective count. Here’s some code, and a contrived bit of output to serve as our sample:

def all_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count?group=true" % (db_name)
        ### Example Output: {"rows":[{"key":"be","value":1},{"key":"do","value":4},{"key":"to","value":1},{"key":"we","value":2}]}
        j = simplejson.loads(urllib.urlopen(u).read())
        return j['rows']
    except:
        return [{}]

What’s a bit problematic here (vs. the original example) is that we’re actually getting a long list of dictionaries instead of one dictionary, but we can convert it into a single word-frequency dictionary (and compute the relative frequencies) in one pass:

def build_prob_dict(word_list, total_words):
    # Map each word to its relative frequency (count / total words).
    num = float(total_words)
    try:
        return dict([ (r['key'], r['value'] / num ) for r in word_list])
    except:
        return {}

So that should get us the rest of the way. Here’s the relevant excerpt from the new script vs the original:

def find_keyword(test_string = None):
    if not test_string:
        test_string = 'Hacker news is a good site while Techcrunch not so much'
    word_prob_dict = build_prob_dict(all_word_count(), total_word_count())
    non_exist_prob = min(word_prob_dict.values()) / 2.0
    #... everything below should function unchanged
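
As a quick sanity check on those 249160.0 scores (assuming, from the numbers, that a word absent from the corpus is scored as the reciprocal of the probability it gets assigned):

total_words = 124580              # word count from keywords2.txt
min_prob = 1.0 / total_words      # assuming the rarest word appears exactly once
non_exist_prob = min_prob / 2.0   # as computed in find_keyword() above
score = 1.0 / non_exist_prob      # 249160.0, matching 'Hacker' and 'Techcrunch'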

OK, so how does this fare? Well, let’s give it a try:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
real    0m33.878s
user    0m0.692s
sys    0m0.408s

Ouch… and this is after the view had been generated (and thus cached) by CouchDB, thanks to multiple earlier calls. If you look at some more detailed numbers, you can see that the bulk of the delay is again spent in socket calls. Even downloading the view results via wget is painful at ~11.8 KB/s, vs. ~163 MB/s when serving a static file with the same results via Apache.

Here’s an interesting tidbit from more detailed profiling:

   ncalls  tottime  percall  cumtime  percall      filename:lineno(function)
      2     32.437   16.218   32.437  16.218        socket.py:278(read)
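
That table is the standard output format of Python’s profile/cProfile modules; if you want to reproduce it, something along these lines should do it (a sketch; find_keyword is just the entry point from the excerpt above):

import cProfile
import pstats

# Profile the keyword search and print the most expensive calls.
cProfile.run('find_keyword()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)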

I know the team has not focused on tuning CouchDB yet, and I’ve read lots of anecdotal evidence that Erlang is fast for computation, especially on multicore systems, but my hope is that they can get the transport layer working quickly as well!

As a final curiosity, I’d love it if CouchDB supported queries from STDIN! Think about the piping fun you could have if you could insert CouchDB as part of your bash pipeline! I also wouldn’t have to worry about adding another network server to my hosted service!

Did I mess up here? Can someone try this and tell me if they get similar results?

Responses to CouchDB Performance or Use a File

  1. J Chris A says:

    Jay,

    Now that CouchDB 0.10 is out and Beta, it’d be interesting to see an update of this post. Please note that using bulk load should give you orders of magnitude better insert performance. Even the naive approach should be much faster on 0.10.

    0.11 (current trunk) has view processing optimizations to better utilize cores and avoid IO wait for view generation, resulting in a 3-5x speed increase.

    • jay says:

      Absolutely!

      I’ve actually got Jan’s request for OS X (not Snow Leopard) users tagged to try out when I get a chance.

      So I’ll update this post too!

      Thanks for the reminder!

  2. Charlie says:

    I’m gonna guess that you mean “Natural Language Processing” and not, say, “Neuro-Linguistic Programming” 🙂

    • jay says:

      Haha, you’re absolutely right Charlie!
      I wrote my reply and then checked out the website … rather odd but I figured someone else might enjoy my confusion too!

  3. Andy Henry says:

    I found your blog on Google. I’ve bookmarked it and will watch out for your next NLP blog post.

    • jay says:

      Thanks Andy,
      I’ve got a site I’m about to “launch” (seems the wrong term to apply to hobbies) which will probably need some NLP help!

  4. Jan says:

    Can you try CouchDB trunk and Erlang 5.6.5?

  5. jay says:

    I should have also stated that I’m using version 0.8.1, i.e. curl http://localhost:5984/
    {"couchdb":"Welcome","version":"0.8.1-incubating"}

    I know 0.9 or better is in VCS but I’m using Janl’s build for OSX and this is the latest version I think.
