Archive for the ‘couchdb’ Category

couchdb coming back for more

Friday, October 30th, 2009

Not that long ago, JChris pointed out that not only was there a new version of couchdb out but that Janl had released a new version of his OSX package, CouchDBX!!

So I knew I needed to find a time to try both new versions out.

‘Thankfully’, I can’t really get to sleep right now so I thought I’d try to be productive and give them both a go again with my small performance test.

And here are the latest results.

Here’s a baseline, which if you recall loads the file from disk.

$ time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real    0m0.259s
user    0m0.216s
sys    0m0.041s

Now for couchdb’s results. Here’s the portion of time required for the database load:

$ time ./couchdb_finding_keywords.py
real	16m53.912s
user	2m57.409s
sys	1m35.209s

This is down quite substantially from the 28 minutes the last version tested took to load.

Rather than run the timing for the loading stage again (since it’s clearly way beyond the time required to analyze the text file), I thought I’d jump to an actual query.

Unfortunately in the process of running the real test I realized I hadn’t created the necessary views for the new database.

Then, in doing so, I made a typo in my map() function and had to wait through many, many error messages like:

OS Process :: function raised exception (ReferenceError: worse is not defined) with doc._id ############

This was certainly my fault, but it would be nice if couch could take a break from spitting out error messages and not bake my processor any further running a bad map()!

I finally was able to click off the temporary view page and found the “Stop” button.

I managed to get most of my view function squared away but then missed the quotes around the dictionary key “word”, so while it should have read:

"map": function(doc) {
    emit(doc[\"word\"], 1);
}

"reduce": function(key, value, rereduce) {
    if (rereduce) {
        return sum(value);
    }
    else {
      return value.length;
    }
}

It didn’t and the bad line came out as:

emit(doc[word], 1);

So as you can imagine, I had to do the dance all over again. This time, after I was able to stop it I went directly to the document for the design itself and edited the code there.

I know I hit the green arrow to save, but when I went back to the design view to see the results it still had the same mistake. So I corrected it there, and quickly hit ‘Save’ and then couchdbx promptly crashed on me.

After I told OSX to restart it I got:

"The application beam.smp quit unexpectedly after it was relaunched"

So yes… sometimes software and I don’t get along. What can I say, but that it makes me a great tester!

I was able to restart couchdbx though, and it seemed to load fine, and eventually got data from a browser after the view was built.

But I also got an interesting tidbit from the DBX console too:

1> [info] [<0.66.0>] 127.0.0.1 - - 'GET' /_config/native_query_servers/ 200
1> [info] [<0.86.0>] checkpointing view update at seq 92542 for keywords _design/finding
1> [error] [<0.69.0>] Uncaught error in HTTP request: {exit,normal}
1> [info] [<0.69.0>] Stacktrace: [{mochiweb_request,send,2},
             {couch_httpd,send_chunk,2},
             {couch_httpd_view,send_json_reduce_row,3},
             {couch_httpd_view,'-make_reduce_fold_funs/5-fun-1-',8},
             {couch_btree,reduce_stream_kv_node2,8},
             {couch_btree,reduce_stream_kp_node2,11},
             {couch_btree,fold_reduce,7},
             {couch_httpd_view,'-output_reduce_view/6-fun-0-',12}]
1> [info] [<0.81.0>] 127.0.0.1 - - 'GET' /keywords/_design/finding/_view/word_count?group=true 200
1>

Yep, I think I broke it yet again…

A subsequent query to:

http://localhost:5984/keywords/_design/finding/_view/word_count?group=true

Seems to show all’s well, so I thought I’d get fancy:

wget -O - http://localhost:5984/keywords/_design/finding/_view/word_count?group=true

But when I hit Control-C to cancel the get (because I realized I hadn’t redirected output to /dev/null) I got yet another stack trace:

1> [info] [<0.126.0>] 127.0.0.1 - - 'GET' /keywords/_design/finding/_view/word_count?group=true 304
1> [error] [<0.387.0>] Uncaught error in HTTP request: {exit,normal}
1> [info] [<0.387.0>] Stacktrace: [{mochiweb_request,send,2},
             {couch_httpd,send_chunk,2},
             {couch_httpd_view,send_json_reduce_row,3},
             {couch_httpd_view,'-make_reduce_fold_funs/5-fun-1-',8},
             {couch_btree,reduce_stream_kv_node2,8},
             {couch_btree,reduce_stream_kp_node2,11},
             {couch_btree,fold_reduce,7},
             {couch_httpd_view,'-output_reduce_view/6-fun-0-',12}]

So let’s just get on with the performance test I guess…

After another DBX restart (more to be sure than anything, since couchdb seems to almost enjoy dumping stack traces while still merrily marching along), I changed my URLs:

old_url = "http://localhost:5984/%s/_view/finding/word_count" % (db_name)

new_url = "http://localhost:5984/%s/_design/finding/_view/word_count" % (db_name)
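The URL change can be captured in a small helper. This is only a sketch: `view_url` and its `legacy` flag are names made up for illustration, and it assumes the database, design document and view names used in this post.

```python
# Hypothetical helper, not part of the original test script.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def view_url(db_name, design, view, legacy=False,
             host="http://localhost:5984", **params):
    """Build a view URL in the old or the new (_design-based) layout."""
    if legacy:
        url = "%s/%s/_view/%s/%s" % (host, db_name, design, view)
    else:
        url = "%s/%s/_design/%s/_view/%s" % (host, db_name, design, view)
    if params:
        # Query parameters such as group=true are appended in sorted order.
        url += "?" + urlencode(sorted(params.items()))
    return url


print(view_url("keywords", "finding", "word_count", legacy=True))
print(view_url("keywords", "finding", "word_count", group="true"))
```

The second call produces the same URL that was passed to wget earlier in this post.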

And can now officially tell you (after one more stack trace) that:

$ time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m31.559s
user	0m0.730s
sys	0m0.429s

It’s still an impressive bit of performance for the functionality, and I think I’ve clearly shown its fault resistance. I just wish it didn’t come at more than 100 times the cost of the flat file.
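For the record, the "more than 100 times" figure comes straight from the two "real" times above. A quick sketch of the arithmetic, where `to_seconds` is just a throwaway helper written for this example:

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '16m53.912s' into seconds."""
    match = re.match(r"^(\d+)m([\d.]+)s$", stamp)
    if match is None:
        raise ValueError("unrecognized time value: %r" % stamp)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


flat_file = to_seconds("0m0.259s")     # baseline 'real' time from this post
couch_query = to_seconds("0m31.559s")  # couchdb query 'real' time from this post

ratio = couch_query / flat_file
print("couchdb query is about %.0f times slower" % ratio)
```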

CouchDB Performance – Too much TCP

Sunday, May 31st, 2009

It’s been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some significant performance improvements. Unfortunately, at the time I didn’t have spare cycles to invest in building the latest branches of erlang, couchdb and everything else, so I hadn’t previously been able to rerun my tests.

However, I started a new project today and, like most developers, I took some time to sharpen my tools before I felt sufficiently prepared to proceed. Of course, since one of my favorite tools is CouchDB itself, I checked in to see how it had been progressing, and I was thrilled to see that Janl (and, it looks like, other contributors) had released a new version of the excellent DBX bundle!

So after a round of updating DBX and CouchDB python library components, I decided to suffer a small distraction and give the new code a test drive.
I wanted to check my baseline, so here’s a rough time sample for the original, file based, keywords code:

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real    0m0.329s
user    0m0.225s
sys    0m0.046s

I ran the initial load and it looks much the same as the previous test:

time ./couchdb_finding_keywords.py

real    28m16.430s
user    2m55.550s
sys    1m30.335s

So perhaps around 20% faster, though on a second test run this actually took more than 39 minutes!

Well, now that the load is out of the way, let’s see how our queries are looking.

Well, after making a view, the results with wget aren’t any more promising than last time (note the view location has changed):

wget -O - http://localhost:5984/keywords/_design/finding/_view/word_count?group=true > /dev/null
--20:35:08--  http://localhost:5984/keywords/_design/finding/_view/word_count?group=true
           => `-'
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]

    [                     <=>                        ] 422,776       12.54K/s             

20:35:40 (12.94 KB/s) - `-' saved [422776]

Alas, it doesn’t look like the performance improvements have really paid off for this test case; in fact, every run I tried was slower than with the last version.
Here’s a sample run which is fairly indicative of the rest:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m52.659s
user	0m0.702s
sys	0m0.441s

And again, with more of the full debugging info:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
>>>---- Begin profiling print
         788 function calls (709 primitive calls) in 51.297 CPU seconds

   Ordered by: internal time, call count
   List reduced from 118 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   51.092   25.546   51.092   25.546 socket.py:278(read)
        2    0.049    0.024    0.049    0.024 decoder.py:320(raw_decode)
        1    0.040    0.040    0.040    0.040 ic.py:182(__contains__)
       14    0.036    0.003    0.036    0.003 socket.py:321(readline)
        1    0.020    0.020    0.020    0.020 couchdb_finding_keywords.py:61(build_prob_dict)
        1    0.011    0.011    0.019    0.019 ic.py:1()
        1    0.011    0.011   51.297   51.297 couchdb_finding_keywords.py:68(find_keyword)
        8    0.008    0.001    0.008    0.001 :1(connect)
        1    0.007    0.007    0.066    0.066 urllib.py:1296(getproxies_internetconfig)
        1    0.005    0.005   51.228   51.228 couchdb_finding_keywords.py:41(all_word_count)
        1    0.003    0.003    0.069    0.069 urllib.py:1329(getproxies)
        1    0.003    0.003    0.003    0.003 Res.py:1()
        1    0.002    0.002    0.002    0.002 File.py:1()
        1    0.001    0.001    0.002    0.002 macostools.py:5()
        2    0.001    0.001    0.001    0.001 socket.py:229(close)
        2    0.001    0.000    0.002    0.001 httplib.py:659(connect)
        2    0.001    0.000    0.005    0.002 httplib.py:224(readheaders)
     11/6    0.001    0.000    0.001    0.000 sre_parse.py:385(_parse)
        1    0.000    0.000    0.000    0.000 ic.py:161(__init__)
        2    0.000    0.000    0.000    0.000 httplib.py:323(__init__)

>>>---- End profiling print

real	0m52.023s
user	0m0.732s
sys	0m0.437s

I’m no erlang expert, but seeing nearly all of the time (51.092 of 51.297 seconds) spent in socket reads makes me still suspect that some TCP level tuning (window size & buffering) might be helpful.
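If you want to produce a listing like the one above yourself, here is a minimal sketch using the standard library’s cProfile and pstats (written for Python 3). The `slow_call` function is a made-up stand-in for `find_keyword()`, and the sort keys and the limit of 20 are my reading of the "Ordered by: internal time, call count" and "restriction <20>" lines in the listing, not a copy of the original script.

```python
import cProfile
import io
import pstats
import time


def slow_call():
    # Hypothetical stand-in for find_keyword(); sleeps to mimic waiting on a socket.
    time.sleep(0.05)
    return sum(range(1000))


profiler = cProfile.Profile()
profiler.enable()
slow_call()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Sort by internal time, then call count, and show at most 20 rows.
stats.sort_stats("time", "calls").print_stats(20)
report = buffer.getvalue()
print(report)
```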

As a final note, I did a database compaction and reran the query, which helped significantly compared to the worst case 0.9 time but at best only matched 0.8.

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m34.605s
user	0m0.687s
sys	0m0.394s

You can find the test script, with changes to work with the slightly different view URLs, and if you’d like to recreate the test all you will need to do (beyond setting up couchdb) is:

  1. Swap the comments on the last two lines to run “load_db()”
  2. Create a map / reduce wordcount view
  3. Change the “view_url” parameter on line 25
  4. Invert the comments again to just run “find_keyword()”

Hadoop Benchmark and CouchDB Implications

Thursday, April 23rd, 2009

Although I don’t write much about it directly, I’m a big fan of the MapReduce approach to computing and data mining. I love the fluid manner with which it encourages partitioning and how little code is required to focus directly on problem solving vs. setup and communication. Also, as Paco admits, it’s familiar to anyone who’s had an enterprise mainframe experience.

In the microcosm I’ve found myself using more of the functional programming methods that python provides, and I’m always interested in big announcements like Amazon’s new Elastic MapReduce. However, I think Hadoop has been an exciting development and I’ve been tracking it for a while though unfortunately I lack a large cluster, and more importantly a large problem to leverage it on.

Thus, you can imagine my skepticism when a recent paper, co-authored by Microsoft, was released stating how poor Hadoop performance was compared to some RDBMS solutions.

Having worked with many product and marketing teams I’m aware that benchmarks don’t always translate into real world behavior. However, I’ve also worked with a number of benchmark teams and I know that these technical experts have an honest integrity to their work.

So I took a book reading break and sat down with the paper, aptly titled “A Comparison of Approaches to Large-Scale Data Analysis” to see what I could discern. It’s a well written paper, is actually straightforward at only 14 pages, and often frank about some of the pros, cons and issues they encountered for all of the products during their tests.

Not being a hadoop expert or having a 100 node cluster, I can’t really argue or recreate the details (go read the paper), but I will mention that there are some interesting acts of naivety. For example they “cat” the output of the SQL command to a file but instead of aggregating the individual output logs with;

hadoop fs -cat output/* > combined_file

They force a final map/reduce phase even though earlier in the paper they admit the JAVA overhead for short tasks is significant.

However, I think regardless of the outcome (and I don’t know of anyone who’s tried to refute the tests) there are many lessons here for CouchDB.

The team split most of the tests into two types of environments, one where the data per node was constant (533MB / node) and the other where the data per cluster was constant (1TB overall). Specifically, in the analytical benchmark example (Section 4.3), they are mining webserver logs and data.

CouchDB has a fantastic ability to replicate data between instances, but unfortunately the only way you could aggregate knowledge across all your data would be to (a) run multiple queries against the view on each server or (b) aggregate all this data to a central CouchDB instance.

I can easily imagine a scenario where this fragmentation could occur (say multiple sharded databases) and picture that as you scale, (b) will quickly become impossible. Imagine trying to aggregate every tweet back to a main store; this would quickly require immense disk space and likely commercial storage solutions (which defeats the point of commodity hardware).

I don’t think CouchDB needs to morph into a parallelized database, but I believe this distributed & aggregated query environment needs to develop for it to function at the envisioned scale.

Finally, let me say that I’m glad couchdb didn’t end up written in JAVA and I’m looking forward to seeing if implementations like Disco can further accelerate the performance of mapreduce systems. I also think Amazon’s offering will make it even easier to afford 1000’s of hadoop nodes instead of a 100 node license for a commercial RDBMS environment (which is certainly not something they mention in their evaluation).

CouchDB Performance or Use a File

Thursday, March 19th, 2009

If this is the first post you’ve read from my blog you should probably go check some others and assert for yourself that I’m a big fan of couchDB.

Even if it wasn’t easy to be impressed by Damien Katz’s work, it would be hard to overlook the interest his code has created. If even those miracles weren’t enough for you, then just look to what other amazing minds have done. Finally, for the truly skeptical there’s now a business you can contact.

There are quite a few things that make it an amazing piece of engineering, which includes its simplicity of purpose, something you don’t often get a chance to appreciate these days. I’m a big fan of pipes, lots of individual pieces doing their dedicated task, and to me the MapReduce model epitomizes that behavior.

Still, there’s a lot I’ve taken on faith, since I haven’t been able to dedicate weeks to its internals, instead trying to leverage it for projects. One of those traits has been the assumption of performance.

Truly, it’s more than just an assumption. Reports are that couchDB’s performance is already quite decent and it’s not even been tuned, so I’ve never attempted to benchmark its behavior.

Instead I’ve been working on some language processing code for my wife. Learning about NLP has aligned with my A.I. background, although it’s reminded me about all the math I’ve forgotten!

After reading samples and feeling like I’d gotten my legs underneath me, I decided to “port” a nice little example over to couchDB. If you want to play along at home then you’ll want to check out the article and grab his code (and keywords2.txt file).

Although the code is geared more for education than performance, it still runs fairly snappy on my laptop, running with some pretty consistent times;

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m0.286s
user	0m0.228s
sys	0m0.047s

I’ve also run it with some performance sampling but let’s stick with simple timing for now.

There’s about 124,580 “words” in the text file;

>>> key_file = open("keywords2.txt")
>>> data = key_file.read()
>>> words = data.split()
>>> len(words)
124580

This data is then used to create a word frequency count and is generated each time the program is run, not a bad 0.2 seconds worth of work!
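The word frequency step boils down to something like the sketch below. This is not the code from the article; `word_frequencies` is a name invented here, and the sample string is a tiny stand-in for the contents of keywords2.txt.

```python
from collections import Counter


def word_frequencies(text):
    """Return (counts, relative frequencies) for whitespace-separated words."""
    words = text.split()
    counts = Counter(words)
    total = float(len(words))
    freqs = dict((word, n / total) for word, n in counts.items())
    return counts, freqs


# Tiny made-up sample; the real input would be the contents of keywords2.txt.
counts, freqs = word_frequencies("do be do we do to we do")
print(counts["do"], freqs["do"])
```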

Naturally, having static text data and supplementing this original data with some derived “data structures” (like total count, or a view showing each word and the number of times it appears), is a perfect case for couchDB.

So I decided to simply load this data right into couchDB. Here’s how I did this, skipping details like creating the database itself;

def load_db():
    key_file = open('keywords2.txt')
    data = key_file.read()
    words = data.split()
    for word in words:
        node = db.create( { "word": word } )

You’d expect this to take some time, databases provide valuable services but of course can only do so at the expense of some cycles. However, I was surprised to find out this took almost 30 minutes!

time ./couchdb_finding_keywords.py 

real	27m2.356s
user	2m35.921s
sys	1m14.478s

Based on the low user and sys times you can guess most of the delay is due to transport overhead, i.e. network communication. This is all going to a couchdb running on localhost, a MacBook with 4G RAM and an Intel 2.0 GHz Core 2 Duo, so it’s a bit surprising but not really critical.
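One document per request means roughly 124,580 HTTP round trips. A possible alternative is to batch the inserts through CouchDB’s `_bulk_docs` endpoint. I haven’t timed this against the versions in this post, so treat it as a sketch: `chunked`, `bulk_payloads` and the batch size are hypothetical choices, and the code only builds the request bodies rather than sending them.

```python
import json


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_payloads(words, batch_size=1000):
    """Build JSON bodies for POST /<db>/_bulk_docs, one per batch of words."""
    for batch in chunked(words, batch_size):
        yield json.dumps({"docs": [{"word": word} for word in batch]})


# Made-up sample data; the real list would come from keywords2.txt.
payloads = list(bulk_payloads(["do", "be", "do", "we", "to"], batch_size=2))
print(len(payloads))
print(payloads[0])
```

Each body would be sent as a POST with a JSON content type to the database’s `_bulk_docs` URL. With batches of 1000, the full load would need about 125 requests instead of 124,580.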

I didn’t bother running this three times and taking an average. The keywords2.txt file should already be in memory having been read by the file backed example. Nor is upfront cost a big consideration for me, I’m willing to spend the time once especially if it can save me work on the backend!

So naturally I was pretty excited to port things over to a more couchDB / pythonic example and here’s what I came up with. After you load your data you then need a view, which you can get from my previous post, along with jChris’ helpful comment. Note, if this is your first time with this stuff (or even if it isn’t) you may want to practice on a smaller database first!!

Next we’ll need some code to get this data, and while I highly recommend the fantastic couchdb-python library for the rest of my examples I’ll use JSON & urllib to remove a layer of indirection.

Here’s how we can get the overall word count (used to calculate relative frequencies);

def total_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count" % (db_name)
        j = json.loads(urllib.urlopen(u).read())
        # Sample Output: {"rows":[{"key":null,"value":19}]}
        return j['rows'][0]['value']
    except Exception:
        # Any failure (connection refused, bad JSON, empty view) counts as zero words
        return 0

We can do the same thing with “?group=true” in our URL to get the individual words each with their respective count. Here’s some code and a contrived bit of output to serve as our sample;

def all_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count?group=true" % (db_name)
        ### Example Output: {"rows":[{"key":"be","value":1},{"key":"do","value":4},{"key":"to","value":1},{"key":"we","value":2}]}
        j = json.loads(urllib.urlopen(u).read())
        return j['rows']
    except:
        return [{}]

Now what is a bit problematic from this (vs the original example) is that we’re actually getting a long list of dictionaries instead of one dictionary, but we can convert this to a full word frequency dictionary and end up on equal footing again all at the same time.

def build_prob_dict(word_list, total_words):
    num = float(total_words)
    try:
        return dict([ (r['key'], r['value'] / num ) for r in word_list])
    except:
        return {}
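As a worked example, here is the conversion applied to the contrived rows shown earlier. The total of 8 is simply the sum of those four counts, not a real query result, and the error handling is left out to keep the sketch short.

```python
def build_prob_dict(word_list, total_words):
    """Turn view rows into a {word: relative frequency} dictionary."""
    num = float(total_words)
    return dict((r["key"], r["value"] / num) for r in word_list)


# Contrived rows in the shape returned by the ?group=true query.
rows = [
    {"key": "be", "value": 1},
    {"key": "do", "value": 4},
    {"key": "to", "value": 1},
    {"key": "we", "value": 2},
]
total = sum(r["value"] for r in rows)  # 8 for this sample

word_prob_dict = build_prob_dict(rows, total)
# Mirrors the non_exist_prob line in find_keyword(): half the smallest frequency.
non_exist_prob = min(word_prob_dict.values()) / 2.0
print(word_prob_dict["do"], non_exist_prob)
```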

So that should get us the rest of the way. Here’s the relevant excerpt from the new script vs the original:

def find_keyword(test_string = None):
    if not test_string:
        test_string = 'Hacker news is a good site while Techcrunch not so much'
    word_prob_dict = build_prob_dict(all_word_count(), total_word_count())
    non_exist_prob = min(word_prob_dict.values()) / 2.0
    #... everything below should function unchanged

OK, so how does this fare? Well let’s give it a try;

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
real    0m33.878s
user    0m0.692s
sys    0m0.408s

Ouch… this is after the view had been generated, by multiple calls (and thus cached), by couchDB. If you look at some more detailed numbers you can see that the bulk of the delay is again spent in socket calls. Even downloading the view results via wget is painful at ~ 11.8 KB/s vs. ~163 MB/s when serving a static file with the results via apache.

Here’s an interesting tidbit from a more detailed profiling;

   ncalls  tottime  percall  cumtime  percall      filename:lineno(function)
      2     32.437   16.218   32.437  16.218        socket.py:278(read)

I know the team has not focused on tuning couchDB, and I’ve read lots of anecdotal evidence that erlang is fast for computation especially on multicore systems, but my hope is they can get the transport layer working quickly as well!

As a final curiosity I’d love it if couchDB supported queries from STDIN!! Think about the piping fun you could have if you could insert couchDB as part of your bash pipe! I also wouldn’t have to worry about adding another network server to my hosted service!

Did I mess up here? Can someone try this and tell me if they get similar results?

Could a couchdb guru explain this, please?

Friday, March 13th, 2009

I’m in the process of trying to build (and benchmark) a couchdb project and I decided to use some word count & frequency samples as data. Since “word count” and “grep” are the quintessential map/reduce examples I thought this would be fairly simple.

However, couchdb doesn’t seem to be following the expected semantics.

Let’s say I’ve got some data, here’s how it looks in python;

>>> import couchdb
>>> s = couchdb.Server()
>>> db = s['kw2']
>>> for d in db: print db[d]
...
<Document '133da883092e206d7191f81661beb813'@'3188228489' {'word': 'ho'}>
<Document '2287406943e627278d98a3a2f3d3483b'@'634745217' {'word': 'do'}>
<Document '2717deb4df8ba09601166021fb758126'@'2083376980' {'word': 'mo'}>
<Document '38d48e8e069538a55902dd2d2b7e1771'@'2475366164' {'word': 'ho'}>
<Document '39ef4a9e3eb0eeb02d483ce658d08356'@'2904312995' {'word': 'hi'}>
<Document '4237064ad7a89fa11e9bbbc8ca4ed302'@'722283984' {'word': 'do'}>
<Document '4d0e61dedaf2af93a9d4d261cab696de'@'996995145' {'word': 'we'}>
<Document '55ba96501ed1e9573b2cb6e647c35b47'@'3153984663' {'word': 'my'}>
<Document '5be13ca69c76d202b131d50f5b9c1ecb'@'1584030189' {'word': 'do'}>
<Document '612e4a0d32f4c91f7fb2414e4de47845'@'3488016124' {'word': 'be'}>
<Document '61426c868dc388e6edb2b4ce2078ce06'@'2761346180' {'word': 'me'}>
<Document '908acaf4ad704951dbb08d27ddfbe9a9'@'941727127' {'word': 'mo'}>
<Document '9136e093fda2dda7d5585983299fcbc7'@'4166962206' {'word': 'mo'}>
<Document '9decb25944110c04d040feb31e532c78'@'1016718857' {'word': 'do'}>
<Document 'ad7f4aab329d55c3a2fb97390df5ae0a'@'1660663052' {'word': 'my'}>
<Document 'c4d976a789e37e1c3eb4d57bd50d47aa'@'923287257' {'word': 'my'}>
<Document 'cccf15515077d100498573fe40244130'@'3846996388' {'word': 'hi'}>
<Document 'd747a88eb2cb18776237852aceff96fc'@'3596694550' {'word': 'we'}>
<Document 'dc115f5d42d442f0b5e7d3680aeb62c2'@'3446491946' {'word': 'to'}>

Feel free to add your own but that’s what I’ve got. Each doc has a simple structure, an “_id” (supplied by couchdb when the document is created) and an element called “word” which obviously contains some fabricated two letter structures (which I hesitate to actually call words).

What’s important to note is that the same word may appear in multiple documents.

Now we want to build a view to show each word as well as the sum of how many times it appears in our database.

Again, following the classic paradigm we build our map function (in javascript) as such;

function(doc) {
  emit(doc["word"], 1);
}

So far so good, now reduce;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return value.length;
   }
}

You can pretty much ignore the “rereduce” clause as our dataset’s not big enough right now, nor are we updating it. However, I will explain the function’s trick, which is that while sum(value) is actually the “mathematically correct” action to take regardless of whether this is our first time through, we’re relying on the fact that since we’re emitting a “1” for each key (i.e. each word instance), the sum of those values is simply the length of the array we’re passed. [I learned this from one of the masters]
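To make the rereduce trick concrete, here is a rough Python model of the reduce function above. It is only an illustration: the way the values are split into two chunks is made up, and real couchdb decides for itself when to rereduce.

```python
def reduce_fn(keys, values, rereduce):
    """Python rendering of the JavaScript reduce function above."""
    if rereduce:
        # values are partial counts from earlier reduce calls
        return sum(values)
    # values are the 1s emitted by map, so their count equals their sum
    return len(values)


emitted = [1] * 19  # one emitted value per document in the sample database

# First pass over two arbitrary chunks, then a rereduce over the partial results.
partials = [reduce_fn(None, emitted[:10], False),
            reduce_fn(None, emitted[10:], False)]
total = reduce_fn(None, partials, True)
print(partials, total)
```

Note that using the length on the rereduce pass would return 2 here (the number of partial results), which is why that branch has to sum.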

Ok, despite the attempt at “premature optimization” this actually seems to work out, or at least it looks to when shown in the couchdb key/value view. Here’s my screenshot for proof;

[screenshot: picture-21]

However, what I see from a direct URL query to this view is markedly different than the data that’s represented. To test this either use Firefox or a command line client like curl and go to the following url;

http://localhost:5984/kw2/_view/finding/word_count

What I see (and I suspect you will as well) is

{"rows":[{"key":null,"value":19}]}

Which seems to break our expected key/value pairing!!!

Suspecting my understanding of couchdb’s map/reduce representation has been occluded by all the Google videos I’ve watched, it seems like an intuitive modification might be to change our reduce function to return the key and the value, like this;

return [key, value];

However, that yields an even more shocking outcome;

{"rows":[{"key":null,"value":[[["we","d747a88eb2cb18776237852aceff96fc"],["we","4d0e61dedaf2af93a9d4d261cab696de"],["to","dc115f5d42d442f0b5e7d3680aeb62c2"],["my","c4d976a789e37e1c3eb4d57bd50d47aa"],["my","ad7f4aab329d55c3a2fb97390df5ae0a"],["my","55ba96501ed1e9573b2cb6e647c35b47"],["mo","9136e093fda2dda7d5585983299fcbc7"],["mo","908acaf4ad704951dbb08d27ddfbe9a9"],["mo","2717deb4df8ba09601166021fb758126"],["me","61426c868dc388e6edb2b4ce2078ce06"],["ho","38d48e8e069538a55902dd2d2b7e1771"],["ho","133da883092e206d7191f81661beb813"],["hi","cccf15515077d100498573fe40244130"],["hi","39ef4a9e3eb0eeb02d483ce658d08356"],["do","9decb25944110c04d040feb31e532c78"],["do","5be13ca69c76d202b131d50f5b9c1ecb"],["do","4237064ad7a89fa11e9bbbc8ca4ed302"],["do","2287406943e627278d98a3a2f3d3483b"],["be","612e4a0d32f4c91f7fb2414e4de47845"]],19]}]}

Of course I’m still baffled as to why we seem to have no entry set for key and all our rows as values.

However, my larger concern is beyond even that perplexing situation;

What’s most surprising here is that the key we’re being passed includes the doc id even though it was not emitted as part of our map phase!

Let’s give it one last go here, thinking perhaps we need to be more explicit;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return {"key": key[0],"value": value.length};
   }
}

Unfortunately, this seems to still not yield the organized rows we expected and returns;

{"rows":[{"key":null,"value":{"key":["we","d747a88eb2cb18776237852aceff96fc"],"value":19}}]}

Which stands in high contrast to what couchdb continues to show us;

[screenshot: picture-11]

So whatever we emit from reduce ends up as the value part of the reply (as indexed by “value”). That matches our original expectation (that couchdb handles setting the key itself) but doesn’t explain why the key is “null”.

In short I’m left with three questions;

1) Why does couchdb pass our reduce function the doc ID, when it’s not emitted in the map phase?

2) Why is “key” null in our output?

3) How do we get our JSON output to match the same pretty key/value representation that couchdb shows?
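A tentative model for questions 2 and 3, offered as a guess rather than an authoritative answer: if a reduce query collapses every row into a single result unless it is asked to group by key (the “?group=true” parameter), then a “null” key is what you would expect from an ungrouped query. The sketch below models that idea in Python; `run_reduce` is a made-up name, the doc ids are shortened fakes, and pairing each key with a doc id simply copies the shape seen in the output above.

```python
from itertools import groupby


def run_reduce(rows, group=False):
    """Rough model of a reduced view query. rows are (key, doc_id) pairs from map."""
    rows = sorted(rows)
    if not group:
        # Everything is reduced together, so no single key applies to the row.
        return [{"key": None, "value": len(rows)}]
    return [
        {"key": key, "value": len(list(members))}
        for key, members in groupby(rows, key=lambda row: row[0])
    ]


# Shortened, made-up doc ids; the words are a subset of the sample database.
sample = [("do", "a1"), ("we", "b2"), ("do", "c3"), ("be", "d4"), ("we", "e5")]
print(run_reduce(sample))
print(run_reduce(sample, group=True))
```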

I wish I could promise that if you tune in next time I’ll have the answers but we’ll have to rely on the good nature of our experts out there to help us out.

How to build Couchdb on Dreamhost

Wednesday, March 4th, 2009

As you know from many of my entries I’m a big fan of couchdb, and if you’re interested you should really be following janl, jchris and lethain as they push this technology forward.

As you might also guess from my earlier post I’m working to build and install it on Dreamhost, another thing I support enthusiastically.

Unfortunately, being on the outer fringe of technology meant I wasn’t able to get them to install it for me, but that’s completely understandable. Given that the current package release has no Auth support (I believe the repository builds do but that would have required more software installs) if I were supporting a multi-user production environment it might make me a little nervous too.

However, it’s a major component of what I’m interested in, so I wanted to give it a shot. I don’t have it up and running 100% right now (it appears to have run though I can’t connect) but I wanted to document the build side of things before I forgot :D

So here’s the rundown;

I was fortunate to follow some excellent advice about getting Django up and running on Dreamhost. It advised that you set up a “~/run” directory to install all your add-on software to, and the steps below will build on that existing environment.

First you need to download some software, I needed to get; Erlang, SpiderMonkey, ICU & CouchDB.

I downloaded all my files into “~/repo” but wherever you like to store them will be fine (“~/software” for example). Now create a temp directory and unpack everything, replacing the paths (and potentially filenames if you picked different versions) as appropriate.

mkdir ~/tmp && cd ~/tmp
tar zxf ~/repo/otp_src*
tar zxf ~/repo/js-1.7.0*
tar zxf ~/repo/icu4c*
tar zxf ~/repo/apache*

I will show the build commands in the order I did them, but as long as you save CouchDB for the final step (naturally) I think you should be fine. Though it’s important to realize I didn’t get this right the first time through, so I did have some partial installs at times.

cd js/src
make -f Makefile.ref BUILD_OPT=1 JS_DIST=$RUN
mkdir -p $RUN/include/js $RUN/lib
cp *.h $RUN/include/js
cd Linux_All_OPT.OBJ
cp jsproto.tbl jsautocfg.h $RUN/include/js
cp libjs.so $RUN/lib

Now for Erlang:

cd ~/tmp/otp_src*
./configure --prefix=$RUN --enable-smp-support --enable-threads --enable-hipe
make && make install

Next, Unicode Support (ICU):

cd ~/tmp/icu/source
./configure --prefix=$RUN
make && make install

And finally, CouchDB!

cd ~/tmp/apache*
./configure --prefix=$RUN --with-js-lib=$RUN/lib --with-js-include=$RUN/include/js --with-erlang=$RUN/lib/erlang/usr/include
make && make install

Once that’s completed I was able to run “couchdb” and see the famous “Apache CouchDB has started. Time to relax.” !!!

Unfortunately, running “couchdb -s” in another window tells me that “Apache CouchDB is not running.” :(

However, I suspect that’s an easier issue for Dreamhost to help me with than building everything from source!

Building CouchDB

Wednesday, March 4th, 2009

This is a quick technical post for posterity, if you’re not interested I won’t be offended if you leave now.

I’m working to try to get couchdb on my hosting provider, Dreamhost. They’ve been a great service for me so far, but understandably this isn’t something they’re yet ready to support. So rather than a few apt-get’s I’m building it from source, which means erlang and spidermonkey.

I’ve got those two dependencies built (I hope) but I kept hitting an error with couchdb’s configure script complaining that it couldn’t find the libraries, despite me setting;

./configure --prefix=$RUN --with-js-lib=/home/wjhuie/run/lib/ --with-js-include=/home/wjhuie/run/include/js

Note that since I don’t have root or sudo access, I use a ~/run directory;

export RUN=/home/wjhuie/run

This is so I can isolate my installed software from the system stuff. In configuring couchdb it was able to find libjs.so but complained about not being able to find jsapi.h. In the end the error wasn’t that it couldn’t find the header but that the header wasn’t able to compile successfully.

I searched far and wide without much luck (and having seen others with similar problems) but in the end I realized I was missing some required files. Since the instructions from the couchdb wiki weren’t very useful;

http://wiki.apache.org/couchdb/Installing_SpiderMonkey

I had followed this, more helpful, advice;

http://avidemux.org/admWiki/index.php?title=Compile_SpiderMonkey

However, the commands listed need to be modified slightly to include two files generated during the build, jsproto.tbl and jsautocfg.h.

So for the record (in case it helps someone else someday) I built my spidermonkey like this;

make -f Makefile.ref BUILD_OPT=1 JS_DIST=$RUN

Then installed it with;

cp *.h $RUN/include/js
cd Linux_All_OPT.OPJ
cp jsproto.tbl jsautocfg.h $RUN/include/js
cp libjs.so $RUN/lib

A good way to get a hint on the problem is to create a sample “program” and try to compile it. To do this, create a file called “test.c” with the contents;

#include </home/wjhuie/run/include/js/jsapi.h>
int main(void)
{ return 0; }

Then try to compile it with;

make test 2>&1 | less

That should allow you to see what’s going on and once that works then couchdb’s configure test should also! Now I have to sort out the unicode library requirements!

Training Neural Nets with CouchDB – part 3

Thursday, December 4th, 2008

Hopefully you’ve been following parts 1 and 2 and I didn’t leave anyone too confused by my approach.

Please visit my posts for a far better recap than I can provide here (DRY). In part 1, I introduced the overall project, discussed the django layout and focused on the jquery I/O part. My goal in part 2 was to show some of the underlying mechanics of how we triggered a neuron as well as queried couchdb for info.

I know we homed in on some serious specifics, but now that we’ve got those specifics I’d like to step back and put together the pieces.

If you check out a copy of the interface (which doesn’t have the NN backend) you can get a sense of the I/O process. In the full application here’s the process;

(0) When a user clicks anywhere on the page, that’s sent to our django URL and (1) django records the location then (2) queries our neural net for a guess. (6) This guess will be sent back to the page to move the second coordinate box; in the sample it is simply set equal to the click location. However, before returning the guess, (3,4) the difference between the guess and the actual click is sent to the network so it can train for next time. Also, (5) the input nodes need to be set to the click location so that next time the net can base its output on your previous click.

Here’s the relevant code;

#get the clicked coordinates
click_X = int(request.POST['X'])
click_Y = int(request.POST['Y'])

#Figure out what the net would have guessed
guess_X = int(get_output("output_x"))
guess_Y = int(get_output("output_y"))

#Find the error and propagate that back to the net
error_X = float(click_X) - float(guess_X)
error_Y = float(click_Y) - float(guess_Y)
back_prop("output_x", error_X)
back_prop("output_y", error_Y)

#Set the inputs for next time
set_input_weight("input_x", "static", request.POST['X'])
set_input_weight("input_y", "static", request.POST['Y'])

#return the guess to jQuery
return HttpResponse('{ "X": %s, "Y": %s}' % (guess_X, guess_Y))

Now I know you only know how a neuron is defined and have no idea of our neural net structure but I think that process will be very illustrative.

Since our net needs to predict X and Y coordinates we need to have two output nodes which I’ve named “output_x” and “output_y” and similarly we have “input_x” and “input_y” in order to inject the coordinate values into the network. You can see the keyword “static” trick that we discussed previously being used.

Neural Nets are typically called “Feed Forward Neural Nets” because you stimulate node “A” and then things cascade A -> B -> C and you can read the output at “C”.

However, programmatically this can be a royal pain.

  • In some situations you’d need a clock to sync with so you can ensure that you’re reading the n’th output of C and not the n+1 (e.g. if the input to A was controlled via a separate thread then things could shift underneath you before you read the values).
  • You could also build this communication via a messaging service. Node “A” could broadcast its output at a given time and then use a pub/sub model so interested nodes could be alerted as events progressed.

The last approach probably scales the best; if I had a ready queuing mechanism then I’d go this way for larger networks. This is more apparent when you realize that “A, B & C” aren’t necessarily single nodes.
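To make the pub/sub idea a little more concrete, here is a minimal in-process sketch. Everything in it (the “Bus” and “Node” classes, the topic names) is hypothetical and only stands in for a real queuing service; the weighting borrows the O = I * sigmoid(W) rule from part 2, with tanh as the sigmoid.

```python
from collections import defaultdict
from math import tanh


class Bus:
    """Tiny in-process publish/subscribe hub (hypothetical, not a real queue)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, value):
        for callback in list(self.subscribers[topic]):
            callback(topic, value)


class Node:
    """A node that listens to its upstream nodes and broadcasts its own output."""

    def __init__(self, bus, name, weights):
        self.bus = bus
        self.name = name
        self.weights = weights  # {upstream node name: weight}
        self.received = {}
        for source in weights:
            bus.subscribe(source, self.on_message)

    def on_message(self, source, value):
        self.received[source] = value
        # Fire only once every upstream node has reported in.
        if len(self.received) == len(self.weights):
            total = sum(v * tanh(self.weights[s]) for s, v in self.received.items())
            self.received = {}
            self.bus.publish(self.name, total)


bus = Bus()
results = []
Node(bus, "B", {"A": 1.0})
Node(bus, "C", {"B": 1.0})
bus.subscribe("C", lambda topic, value: results.append(value))

# Stimulating "A" cascades A -> B -> C without anyone polling for values.
bus.publish("A", 2.0)
print(results)
```

The nice property is that nobody has to ask “is C ready yet?”; the reader of “C” is just another subscriber.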

Neural nets are commonly built with layers, each of which typically contains multiple nodes all of which are connected to all of the nodes in the previous and subsequent layers.

So in my case “A” is the first layer containing “input_x” and “input_y” and “C” is the output layer containing “output_x” and “output_y”. Layer B would have many nodes all receiving the output from both nodes in layer A. (As an aside you can have more complicated layering systems, for example “output_x” might also have a direct link from “input_x” to further augment its correlation and it is legal for a layer to only have a single node.)

So you’ve seen the “API” for our neural net, and part 2 covers the underlying mathematical (and procedural) mechanics, so I should be able to wrap up next time by discussing how to get this thing off the ground and see how it works!
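As a rough sketch of that layering, here is how fully connected layers could be generated as plain dictionaries in the same shape as the neuron documents from part 2. The helper name “build_layer”, the middle-layer node names and the starting weight of 1.0 are all assumptions made for illustration.

```python
def build_layer(names, previous_names, initial_weight=1.0):
    """Connect every node in `names` to every node in `previous_names`."""
    return {
        name: {"inputs": [{"node": prev, "weight": initial_weight}
                          for prev in previous_names]}
        for name in names
    }


layer_a = ["input_x", "input_y"]    # input layer
layer_b = ["1", "2", "3"]           # hidden layer (hypothetical size)
layer_c = ["output_x", "output_y"]  # output layer

net = {}
net.update(build_layer(layer_b, layer_a))
net.update(build_layer(layer_c, layer_b))

for name in sorted(net):
    print(name, [i["node"] for i in net[name]["inputs"]])
```

The direct “input_x” to “output_x” link mentioned in the aside would just be one more dictionary appended to net["output_x"]["inputs"].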

Training Neural Nets with CouchDB – part 2

Tuesday, December 2nd, 2008

It’s always nice to have a little encouragement especially when trying to work through some tough posts. I really prefer white boards and pictures but I find these too hard to make for blogging so I try to let the words shape the images in your mind.

So let’s get started with your first exercise… picture or graph the output of tanh() from [-1,1]. For those who will skip the process of pulling out their graphing calculator, you’ll get a sigmoid curve, i.e. shaped like an “S”, with asymptotes at y=-1 and y=1.

So, I know that’s neat and all but what does it have to do with couchdb? Well it doesn’t per se, but it represents our “trigger” function for our neurons. We’ll use this to convert a node’s input to its output, and let’s call this function sigmoid().

The input to this function is actually the weight our node attributes to a specific input. So let’s say I’m a node, N, and I have an input, I. That input will be a value, numeric in this case, and I’ll assign it a weight, W. That weight will correlate with my “trust” in that input (after an appropriate training process). But the key part now is that the output, O, of N will be; O = I * sigmoid(W).

If you play around and graph some of this, you’ll see that if I don’t “trust” I, and W approaches 0 then even if I is a very big number O will be near 0 also.
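If you’d rather not graph it, here is a small numeric sketch of the same point, assuming tanh as the trigger function. The helper name “node_output” is mine, purely for illustration.

```python
from math import tanh


def node_output(input_value, weight):
    # O = I * sigmoid(W), with tanh playing the role of sigmoid
    return input_value * tanh(float(weight))


big_input = 1000.0
# As the weight (our "trust") shrinks toward 0, so does the output,
# no matter how large the input is.
for weight in (3.0, 0.1, 0.001, 0.00001, 0.0):
    print(weight, node_output(big_input, weight))
```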

Before we get into this business of trust let me give you two functions;

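from math import tanh

def sigmoid(x):
    # tanh squashes any input into the range (-1, 1)
    return tanh(float(x))

def slope_sigmoid(y):
    # slope of tanh, written in terms of its output y = tanh(x)
    return 1.0 - y*y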

Let me comment on that slope function there. The slope tells us how drastically a small change in the input will change our output, and we’ll use this function to adjust our weights appropriately. This is important because if we’re at the center of our sigmoid (x=0), where the curve is steepest, and want to reinforce the weight then we may only need to move it by +.01, but if we’re out toward an extreme on our curve (e.g. x = -.8), where the curve is flatter, then we may need to adjust the weight by -.1 (i.e. 10 times as much) in order to achieve a noticeable change.
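Here’s a quick sanity check of the slope function, as a sketch. It compares 1 - y*y against a numerically estimated slope of tanh at a few points I picked arbitrarily; the helper “numeric_slope” is hypothetical.

```python
from math import tanh


def slope_sigmoid(y):
    # slope of tanh, written in terms of its output y = tanh(x)
    return 1.0 - y * y


def numeric_slope(x, h=1e-5):
    # central-difference estimate of d/dx tanh(x)
    return (tanh(x + h) - tanh(x - h)) / (2 * h)


for x in (0.0, 0.8, 2.0):
    y = tanh(x)
    print(x, slope_sigmoid(y), numeric_slope(x))
```

The two columns should agree closely, and both shrink as x moves away from the center, which is why a bigger weight adjustment is needed out there.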

OK, let’s talk about that mythic “trust” factor and in order to do that let’s talk about neural nets. NNs are built by collecting neurons, connecting outputs to inputs (not usually circular but they can be) and then stimulating an input layer with a set of values and reading an output layer to see what it produced.

Here, in pseudo-Python, is how I represented a neuron;

class neuron():
    name = ""    # unique name, doubles as the document id
    inputs = []  # list of {'node': ..., 'weight': ...} dictionaries

It’s a relatively simple structure. Neurons have a name (which is forced to be unique) and a set of inputs (which is actually a list of dictionaries). An important thing to clarify is that I never actually had to define this class, since CouchDB doesn’t demand a schema. Let’s make this a bit more concrete with three examples;

>>> nn['input_x']
<Document u'input_x'@u'82302167' {u'inputs': [{u'node': u'static', u'weight': u'474'}], u'increment': 0}>
>>> nn['1']
<Document u'1'@u'2027135756' {u'inputs': [{u'node': u'input_x', u'weight': 1}, {u'node': u'input_y', u'weight': -1}], u'increment': 0}>
>>> nn['output_x']
<Document u'output_x'@u'1458836188' {u'inputs': [{u'node': u'1', u'weight': 1}, {u'node': u'2', u'weight': 9.536375211640632e+179}, {u'node': u'3', u'weight': -194411155905.33661}], u'increment': 0}>

Hopefully that shows up ok on your browser. The variable “nn” is a link to the “neural_net” database on my couchdb server;

import couchdb

try:
    server = couchdb.Server('http://127.0.0.1:5984')
    nn = server['neural_net']
    len(nn)
except couchdb.client.ResourceNotFound:
    server.create('neural_net')
    nn = server['neural_net']
    len(nn)

I’ve printed out three neuron nodes. The first node, ‘input_x’, is part of the input layer and you’ll see it has a list ‘inputs’ with a single dictionary element { ‘node’: ‘…’, ‘weight’: ‘…’ }. I’ve opted to use the name “static” as a keyword to represent an input which doesn’t point to another node and use the ‘weight’ as the actual input value. The second node ‘1’ is more of a typical neuron which would be considered a “second level neuron”. This takes two inputs, one from ‘input_x’ and one from ‘input_y’. The output of “1” will be;

output = 0.0
for input in nn['1']['inputs']:
    output += get_output(input['node']) * sigmoid(input['weight'])

Note I used a function, called “get_output”, to find the output value of a node. If the node is static, as ‘input_x’ is, then we could simply dereference it and get the “weight” value, but if the input is another “pure” neuron then it may have some calculations to do. You can see how this would work in practice by examining the final node, ‘output_x’. In this case we have to query many nodes just like node “1” and allow each to do its calculation before we can output our values. So a call to get_output(“output_x”) actually recurses to the various nodes in turn.
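To see that recursion in isolation, here is a self-contained sketch that swaps the CouchDB database for a plain dictionary. The documents follow the shape shown above, but the ‘input_y’ value and the trimmed-down ‘output_x’ are made up for the example.

```python
from math import tanh

# Hypothetical in-memory stand-in for the "neural_net" database.
nn = {
    "input_x": {"inputs": [{"node": "static", "weight": 474}]},
    "input_y": {"inputs": [{"node": "static", "weight": 120}]},
    "1": {"inputs": [{"node": "input_x", "weight": 1},
                     {"node": "input_y", "weight": -1}]},
    "output_x": {"inputs": [{"node": "1", "weight": 1}]},
}


def sigmoid(x):
    return tanh(float(x))


def get_output(name):
    """Return a node's output, recursing into any non-static inputs."""
    output = 0.0
    for in_node in nn[name]["inputs"]:
        if in_node["node"] == "static":
            # Static inputs carry their value in the 'weight' field.
            output += float(in_node["weight"])
        else:
            output += get_output(in_node["node"]) * sigmoid(in_node["weight"])
    return output


print(get_output("input_x"))
print(get_output("1"))
print(get_output("output_x"))
```

Asking for “output_x” touches “1”, which in turn touches both input nodes, so one call walks the whole net.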

Let me take a moment to diverge from “what I did” to talk about “what I almost did”. I’d intended for “get_output()” to be a CouchDB view and take advantage of quick, asynchronous, lookups. However, if this was done as a view then I’d need the emit function to reference the database and I don’t think this is allowed, i.e. I’d need a map function something like;

Note this won’t, and to the best of my knowledge can’t, work;

function(doc) {
    if ( doc['inputs'].length > 0 ) {
        for ( i in doc['inputs'] ) {
            emit( doc['_id'], "_view/getoutput(doc['inputs'][i])" );
        }
    }
}

The broken part of that view is the value part of the emit function (remember emit produces key/value pairs). However, since we’re on the topic of couchdb views here’s one way we can build a view to see what inputs a node has;

function(doc) {
    if ( doc['inputs'].length > 0 ) {
        for ( i in doc['inputs'] ) {
            emit( doc['_id'], doc['inputs'][i] );
        }
    }
}

I also wanted a reduce function to combine the value parts under a single key (so it matched my “data structure”), so here’s the reduce;

function(keys, values, rereduce) {
    return values;
}
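If it helps to see what that map/reduce pair produces, here is an in-memory imitation in Python. To be clear, this is not how CouchDB runs views (it ignores rereduce entirely, for one); the sample documents and the “grouped_view” helper are assumptions, and it only shows the shape of the rows you’d get back when querying with group=True.

```python
from collections import defaultdict

# Hypothetical sample documents in the same shape as the neuron examples.
docs = [
    {"_id": "input_x", "inputs": [{"node": "static", "weight": 474}]},
    {"_id": "1", "inputs": [{"node": "input_x", "weight": 1},
                            {"node": "input_y", "weight": -1}]},
]


def map_inputs(doc):
    # Mirrors the map function: one (key, value) pair per input.
    for in_node in doc.get("inputs", []):
        yield doc["_id"], in_node


def reduce_values(keys, values, rereduce):
    # Mirrors the reduce function: hand the values back untouched.
    return values


def grouped_view(documents):
    # Imitates group=True: one reduced row per distinct key.
    rows = defaultdict(list)
    for doc in documents:
        for key, value in map_inputs(doc):
            rows[key].append(value)
    return {key: reduce_values([key] * len(values), values, False)
            for key, values in rows.items()}


for key, value in grouped_view(docs).items():
    print(key, value)
```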

The great part of couchdb is that you can input these views in its code window and get immediate feedback on what’s being produced! Now I’ll show you the sad part of my design; here’s how to query and act on the view;

#This loop gives us all the inputs to node: name
for input_nodes in db.view("/nnodes/inputs", group=True)[name]:
    #This loop gives us all the input nodes to node: name
    for in_node in input_nodes.value:
        if u'static' in in_node['node']:
            output += float(in_node['weight'])
        else: #Not a static input
            output += get_output(in_node['node']) * sigmoid(in_node['weight'])

What’s sad to me is that it would be less code to query the documents directly;

node = db[str(name)]
for in_node in node['inputs']:
    if u'static' in in_node['node']:
        output += float(in_node['weight'])
    else: #Not a static input
        output += get_output(in_node['node']) * sigmoid(in_node['weight'])

You’ll see that I used the “group=True” parameter that I mentioned in my previous post. This just made things match my conceptual model, but I wish the python library didn’t force me to dereference .key and .value to get them (it should turn them into a dictionary instead). I’ll also mention that several times I got confused trying ['value'] instead of .value (something that wouldn’t matter in javascript, but the former seems more “pythonic”).

I think this is a clear example that I’ve got more to learn about views and better ways to represent my data structures. Here’s a great example which I think will find a lot of analogous fits so read it often.

Back to my situation though, I’d thought maybe each node could store an array called “output_history” which could then be queried with an “increment” value (which would make it a simple emit() process). However, this was much more complicated than it was worth for an initial pass, since if the value didn’t exist it would still have to be calculated via a non-Map/Reduce function (because it would have to reference the database).

Before I get into much more detail let me show you the code so you can take a moment to look it over and formulate some thoughts.

I’ll be back with post 3 to try and tie it all together (including a step back to revisit the overall connections, talk more about Neural Nets and my impressions of couchdb).

Training Neural Nets with CouchDB – part 1

Monday, December 1st, 2008

My goal for this post is a bit technical and I’ll try warping both an artificial neural net, as well as my biological one, around an exploration of CouchDB, so read on if appropriate to your interests.

As you may have noticed in some of my earlier posts I’ve been playing with couchdb but it wasn’t until I started following Janl and reading some of his blog posts that things started to click into place.

I would be remiss if I didn’t also mention the fantastic Eric Florenzano, who seems to get this much more intuitively than I do, and the amazing jChris who provides code to go along with his great ideas.

My desire to give back was piqued by Jan taking the time to answer a few desperate twitters I had. My first question was simple… why does some code say “map” while others use “emit”… simple answer; it was changed and “emit()” is now the proper syntax.

The second question occurred when I was querying a view with a map/reduce pair and couldn’t get the python library to return the key / value pairs to me as expected. As Jan states, if you pass “group=True” you can use results.key & results.value.

Ok, so that’s the simple Q&A recorded for posterity, but what about the neuron bending I promised? First let me say that if you want a tutorial on Neural Nets and AI programming you’ve come to the wrong place, and secondly this is my “answer” that’s evolved more than I care to admit.

So I’ll try not to take you through the brain process I went through but I’m sure we’ll both see how this could be made more “map/reduce”-y with future iterations. If you’ve got some of those insights I’d love to hear them because most of the examples I found already had that “AhhHA” moment made obvious.

One last disclaimer – I was looking to practice my jQuery and Django programming too, so while this could definitely be made more “compact” it met my qualifications of being a “simple” “AJAX web application built with Django and utilizing CouchDB”.

Now onward and upward! You’ll want to start a Django project to hold all this code, a process which is better covered in many other tutorials. The title for my project is “nn_click” and I created a “clicks” application underneath this directory.

So after you’ve got your project made, let’s start with the “front-end” which in this modern age is nearly always the web browser. Of course we need a webpage, and you can find mine here.

It’s nothing fancy but you’ll see there are two HTML elements which will report coordinates on the page. One, “#loc” obviously tracks the mouse and the other, “#guess” holds the contents of our AJAX call, which won’t work now given my blog’s hosting provider.

The magic happens thanks to two jQuery calls. The first simply links the ‘#loc’ element to report our mouse locations;

$(document).ready(function() {
    // $() defaulted to the document in older jQuery; spelled out here
    $(document).mousemove(function(e) {
        $('#loc').html(e.pageX + ', ' + e.pageY);
    });

While the second is where the magic happens when a click occurs;

    $(document).click(function(e) {
        var loc = $('#loc').html(e.pageX + ', ' + e.pageY);
        $('#loc').animate({left: e.pageX, top: e.pageY}, 450);

        $.post("process_click",
            { "X": e.pageX, "Y": e.pageY },
            function (r, status) {
                var guess = $('#guess');
                guess.html(r.X + ', ' + r.Y);
                guess.animate({left: r.X, top: r.Y}, 450);
            },
            "json"
        );
    });
}); // end of $(document).ready

This function first sets the location information for #loc again, just in case, and then proceeds to move that HTML element to where you clicked (a fun effect). For this to work you have to have the CSS “position” attribute set to “absolute”, which is something I’ve done with inline CSS. That’s ugly and poor practice by my style guidelines, but sufficient for this tutorial.

The call to the $.post(…) function handles the AJAX magic. It packs the click coordinates (which are really a measure of ‘X’ and ‘height’ rather than a strict X,Y interpretation) into an object that jQuery sends as the POST data, and the part that reads “function (r, status) { … }” is then called when the django view “def process_click(…)” completes (more on this later).

So copy the HTML file to the “templates” directory under your Django project (make it if this doesn’t exist) and edit your urls.py file to include these two lines;

(r'^process_click/?', 'nn_click.clicks.views.process_click'),
(r'^(.*)/?', 'nn_click.clicks.views.index'),

The first line will receive our jQuery POST call, which has the mouse coordinates as the data, and the second will send anything else to our main page. So now we can edit the application’s view (vi clicks/views.py) and add the next phase of changes.

Here’s the necessary code to take care of the second URL pattern;
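The order of those two URL lines matters, since the second pattern matches everything. Here’s a simplified imitation of first-match-wins routing using only the standard “re” module. It is not Django’s actual resolver, and the “resolve” helper and the short view names are made up for the sketch.

```python
import re

# Same regexes as the urls.py lines, paired with short hypothetical view names.
patterns = [
    (r'^process_click/?', 'process_click'),
    (r'^(.*)/?', 'index'),
]


def resolve(path):
    """Return the name of the first pattern that matches, like a URL conf would."""
    for regex, view_name in patterns:
        if re.search(regex, path):
            return view_name
    return None


for path in ("process_click", "process_click/", "", "anything/else"):
    print(repr(path), "->", resolve(path))
```

If the catch-all were listed first, the POST would never reach process_click. Also worth noticing: without a trailing “$” the first pattern will match any path that merely starts with “process_click”.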

from django.shortcuts import render_to_response

def index(request, something):
    return render_to_response('nn_click_template.html', { })

This simply takes our template and returns it with room for future data (the empty “{ }” part) if I need it later. That was simple, and if you start up your Django project you should be able to load the webpage and it will at least follow your mouse. Since I serve files from a different box than I use for development, try; “python manage.py runserver 0.0.0.0:8000”.

Now let’s figure out how to process that POST data we get from our jQuery POST call.

Again in clicks/views.py add this code;

from django.http import HttpResponse

def process_click(request):
    click_X = int(request.POST['X'])
    click_Y = int(request.POST['Y'])
    return HttpResponse('{ "X": %s, "Y": %s}' % (click_X, click_Y))

Now the POST call comes in, we parse out the X & Y values and return that data to the page (asynchronously). What should happen is that shortly after the click you’ll see the second tuple move to the same location. You may decide to divide the data by 2 or swap the X, Y values for a little bit of fun.
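Building the JSON string by hand works fine for two integers, but as a sketch of an alternative, the standard library’s json module can produce the same body and takes care of the quoting. The helper name “click_response_body” is hypothetical; its result is what you’d pass to HttpResponse.

```python
import json


def click_response_body(x, y):
    # POST values arrive as strings, so convert before serializing.
    return json.dumps({"X": int(x), "Y": int(y)})


body = click_response_body("120", "45")
print(body)
print(json.loads(body)["X"] + 1)  # round-trips as a real number
```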

That seems like a natural place to conclude part one and I know we didn’t get into the couchdb part but never fear I’ll have part two out shortly!

Update: I have this part running (minus the couchdb and neural net code since I can’t figure out how to get those working via fastCGI) but you can check out the basic idea here;

http://nn-click.thecapacity.org/