CouchDB Performance or Use a File
If this is the first post you’ve read from my blog you should probably go check some others and assert for yourself that I’m a big fan of couchDB.
Even if it wasn’t easy to be impressed by Damien Katz’s work, it would be hard to overlook the interest his code has created. If even those miracles weren’t enough for you, then just look at what other amazing minds have done. Finally, for the truly skeptical there’s now a business you can contact.
There are quite a few things that make it an amazing piece of engineering, including its simplicity of purpose, something you don’t often get a chance to appreciate these days. I’m a big fan of pipes, lots of individual pieces each doing their dedicated task, and to me the MapReduce model epitomizes that behavior.
Still, there’s a lot I’ve taken on faith, since I haven’t been able to dedicate weeks to its internals and have instead been trying to leverage it for projects. One of those articles of faith has been the assumption of performance.
Truly, it’s more than just an assumption. Reports are that couchDB’s performance is already quite decent even though it hasn’t been tuned, so I’ve never attempted to benchmark its behavior.
Instead I’ve been working on some language processing code for my wife. Learning about NLP has aligned with my A.I. background, although it’s reminded me about all the math I’ve forgotten!
After reading samples and feeling like I’d gotten my legs underneath me, I decided to “port” a nice little example over to couchDB. If you want to play along at home then you’ll want to check out the article and grab his code (and the keywords2.txt file).
Although the code is geared more for education than performance, it still runs fairly snappily on my laptop, with some pretty consistent times;
time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
real 0m0.286s
user 0m0.228s
sys 0m0.047s
I’ve also run it with some performance sampling but let’s stick with simple timing for now.
There are 124,580 “words” in the text file;
>>> key_file = open("keywords2.txt")
>>> data = key_file.read()
>>> words = data.split()
>>> len(words)
124580
The script uses these words to build a word frequency count, regenerating it every time the program is run. Not bad for roughly 0.2 seconds of work!
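As a rough illustration of what a word frequency count involves, here is a minimal sketch. It is not the code from the article; `word_frequencies` and `relative_frequencies` are names made up for this example, and it assumes words are simply whitespace-separated.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word appears in text."""
    return Counter(text.split())


def relative_frequencies(counts):
    """Turn raw counts into fractions of the total word count."""
    total = float(sum(counts.values()))
    if total == 0:
        return {}  # nothing to normalise
    return dict((word, n / total) for word, n in counts.items())


sample = "to be or not to be"
counts = word_frequencies(sample)
freqs = relative_frequencies(counts)
print(counts["to"], freqs["to"])
```

For the real data you would pass in the contents of keywords2.txt instead of `sample`.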
Naturally, having static text data and supplementing this original data with some derived “data structures” (like total count, or a view showing each word and the number of times it appears), is a perfect case for couchDB.
So I decided to simply load this data right into couchDB. Here’s how I did this, skipping details like creating the database itself;
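def load_db():
    # db is an already-opened database handle (setup not shown)
    key_file = open('keywords2.txt')
    data = key_file.read()
    key_file.close()
    words = data.split()
    for word in words:
        # one document, and one request, per word
        node = db.create( { "word": word } )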
You’d expect this to take some time; databases provide valuable services, but of course they can only do so at the expense of some cycles. However, I was surprised to find that this took almost 30 minutes!
time ./couchdb_finding_keywords.py
real 27m2.356s
user 2m35.921s
sys 1m14.478s
Based on the low user and sys times you can guess most of the delay is due to transport overhead, i.e. network communication. This is all going to a couchdb running on localhost, a MacBook with 4G RAM and an Intel 2.0 GHz Core 2 Duo, so it’s a bit surprising but not really critical.
I didn’t bother running this three times and taking an average. The keywords2.txt file should already be in memory having been read by the file backed example. Nor is upfront cost a big consideration for me, I’m willing to spend the time once especially if it can save me work on the backend!
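One of the comments below suggests bulk loading instead of one request per word. Here is a hedged sketch of what that could look like using CouchDB's `_bulk_docs` endpoint. The helper names, the batch size of 1000, and the database URL are all assumptions for illustration, the code is written for Python 3's `urllib.request`, and it has not been timed against the numbers in this post.

```python
import json
import urllib.request


def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_payload(words):
    """Build the JSON body for _bulk_docs: one document per word."""
    return json.dumps({"docs": [{"word": w} for w in words]})


def bulk_load(words, db_url, batch_size=1000):
    """POST the words to db_url/_bulk_docs in batches (needs a running CouchDB)."""
    for batch in batches(words, batch_size):
        request = urllib.request.Request(
            db_url + "/_bulk_docs",
            data=bulk_payload(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request).close()


# Usage (hypothetical database name, not run here):
# bulk_load(open("keywords2.txt").read().split(), "http://localhost:5984/keywords")
```

The idea is simply to trade 124,580 small requests for about 125 larger ones, which should help if per-request overhead really is the bottleneck.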
So naturally I was pretty excited to port things over to a more couchDB / pythonic example, and here’s what I came up with. After you load your data you then need a view, which you can get from my previous post, along with jChris’ helpful comment. Note, if this is your first time with this stuff (or even if it isn’t) you may want to practice on a smaller database first!!
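The view itself lives in that earlier post, so it isn't reproduced here. To give a feel for what a word-count view computes, here is a pure-Python simulation. It assumes the map step emits `(word, 1)` per document and the reduce step sums the values; `map_word`, `reduce_sum` and `run_view` are made-up names, and this only mimics the shape of the results, not how CouchDB actually builds them.

```python
from collections import defaultdict


def map_word(doc):
    """Map step: emit a (key, value) pair of (word, 1) for each document."""
    if "word" in doc:
        yield doc["word"], 1


def reduce_sum(values):
    """Reduce step: add up the emitted values."""
    return sum(values)


def run_view(docs, group=False):
    """Return rows shaped like a view response, with or without grouping by key."""
    emitted = [pair for doc in docs for pair in map_word(doc)]
    if not group:
        return [{"key": None, "value": reduce_sum(v for _, v in emitted)}]
    by_key = defaultdict(list)
    for key, value in emitted:
        by_key[key].append(value)
    return [{"key": k, "value": reduce_sum(by_key[k])} for k in sorted(by_key)]


docs = [{"word": w} for w in "we do to do be do we do".split()]
print(run_view(docs))
print(run_view(docs, group=True))
```

Without grouping you get a single row holding the overall total; with grouping you get one row per distinct word.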
Next we’ll need some code to get this data, and while I highly recommend the fantastic couchdb-python library for the rest of my examples I’ll use JSON & urllib to remove a layer of indirection.
Here’s how we can get the overall word count (used to calculate relative frequencies);
def total_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count" % (db_name)
        j = json.loads(urllib.urlopen(u).read())
        # Sample Output: {"rows":[{"key":null,"value":19}]}
        return j['rows'][0]['value']
    except Exception:
        # fall back to 0 if the request or the parsing fails
        return 0
We can do the same thing with “?group=true” in our URL to get the individual words each with their respective count. Here’s some code and a contrived bit of output to serve as our sample;
def all_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count?group=true" % (db_name)
        ### Example Output: {"rows":[{"key":"be","value":1},{"key":"do","value":4},{"key":"to","value":1},{"key":"we","value":2}]}
        j = json.loads(urllib.urlopen(u).read())
        return j['rows']
    except Exception:
        # no rows if the request or the parsing fails
        return []
What’s a bit problematic here (vs. the original example) is that we actually get back a long list of dictionaries instead of one dictionary. We can convert that list into a full word frequency dictionary and end up on equal footing with the original, all in one step.
def build_prob_dict(word_list, total_words):
    num = float(total_words)
    try:
        return dict([ (r['key'], r['value'] / num ) for r in word_list])
    except:
        return {}
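To make the conversion concrete, here is a small worked example using the contrived rows from the sample output above. The function is restated without the try/except so the sketch stands alone, and the total is computed from the rows themselves rather than fetched from the view. It also shows half the smallest probability, the value the script uses for words it has never seen.

```python
def build_prob_dict(word_list, total_words):
    """Map each row's key to its count divided by the total."""
    num = float(total_words)
    return dict((r["key"], r["value"] / num) for r in word_list)


# Rows copied from the contrived sample output shown earlier.
rows = [
    {"key": "be", "value": 1},
    {"key": "do", "value": 4},
    {"key": "to", "value": 1},
    {"key": "we", "value": 2},
]
total = sum(r["value"] for r in rows)  # 8 for this sample
probs = build_prob_dict(rows, total)
non_exist_prob = min(probs.values()) / 2.0
print(probs)
print(non_exist_prob)
```

With these rows, "do" ends up at 0.5, "we" at 0.25, and "be" and "to" at 0.125 each, so unseen words get 0.0625.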
So that should get us the rest of the way. Here’s the relevant excerpt from the new script vs the original:
def find_keyword(test_string = None):
    if not test_string:
        test_string = 'Hacker news is a good site while Techcrunch not so much'
    word_prob_dict = build_prob_dict(all_word_count(), total_word_count())
    non_exist_prob = min(word_prob_dict.values()) / 2.0
    #... everything below should function unchanged
OK, so how does this fare? Well, let’s give it a try;
time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
real 0m33.878s
user 0m0.692s
sys 0m0.408s
Ouch… this is after the view had already been generated, and thus cached, by couchDB through multiple earlier calls. If you look at some more detailed numbers you can see that the bulk of the delay is again spent in socket calls. Even downloading the view results via wget is painful at ~11.8 KB/s vs. ~163 MB/s when serving a static file with the same results via Apache.
Here’s an interesting tidbit from a more detailed profiling;
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   32.437   16.218   32.437   16.218 socket.py:278(read)
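Those column headings are the layout printed by Python's standard profilers. If you want to produce a similar report yourself, here is a minimal sketch that assumes `cProfile` and `pstats`. The `profile_call` helper and the `busy_work` stand-in are made up for illustration; point it at whatever function you actually care about.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text sorted by tottime)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(5)  # show the top 5 entries
    return result, stream.getvalue()


def busy_work(n):
    """Stand-in workload; swap in the real function you want to profile."""
    return sum(i * i for i in range(n))


result, report = profile_call(busy_work, 10000)
print(report)
```

Sorting by `tottime` puts the functions that spend the most time in their own body at the top, which is how a blocking socket read would stand out.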
I know the team has not focused on tuning couchDB, and I’ve read lots of anecdotal evidence that Erlang is fast for computation, especially on multicore systems, but my hope is they can get the transport layer working quickly as well!
As a final curiosity I’d love it if couchDB supported queries from STDIN!! Think about the piping fun you could have if you could insert couchDB as part of your bash pipe! I also wouldn’t have to worry about adding another network server to my hosted service!
Did I mess up here? Can someone try this and tell me if they get similar results?


March 19th, 2009 at 10:01 pm
I should have also stated that I’m using version 0.8.1, i.e. curl http://localhost:5984/
{"couchdb":"Welcome","version":"0.8.1-incubating"}
I know 0.9 or better is in VCS but I’m using Janl’s build for OSX and this is the latest version I think.
March 20th, 2009 at 7:14 am
Can you try CouchDB trunk and Erlang 5.6.5?
March 20th, 2009 at 10:21 am
April 23rd, 2009 at 3:01 am
I found your blog on Google. I’ve bookmarked it and will watch out for your next NLP blog post.
April 23rd, 2009 at 9:14 am
Thanks Andy,
I’ve got a site I’m about to “launch” (seems the wrong term to apply to hobbies) which will probably need some NLP help!
April 25th, 2009 at 7:33 pm
I’m gonna guess that you mean “Natural Language Processing” and not, say, “Neuro-Linguistic Programming”
April 26th, 2009 at 5:12 pm
Haha, you’re absolutely right Charlie!
I wrote my reply and then checked out the website … rather odd but I figured someone else might enjoy my confusion too!
May 31st, 2009 at 8:20 pm
[...] been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some [...]
October 16th, 2009 at 11:58 am
Jay,
Now that CouchDB 0.10 is out and Beta, it’d be interesting to see an update of this post. Please note that using bulk load should give you orders of magnitude better insert performance. Even the naive approach should be much faster on 0.10.
0.11 (current trunk) has view processing optimizations to better utilize cores and avoid IO wait for view generation, resulting in a 3-5x speed increase.
October 19th, 2009 at 10:47 am
Absolutely!
I’ve actually got Jan’s request for OSX (not Snow Leopard) users tagged to try out when I get a chance.
So I’ll update this post too!
Thanks for the reminder!
October 30th, 2009 at 10:34 pm
[...] that long ago, JChris pointed out that not only was there a new version of couchdb out but that Janl had released a new version of [...]
November 10th, 2009 at 9:13 pm
January 10th, 2010 at 10:41 pm
There is obviously a lot to know about this. There are some good points here.
I’m Out!