BookList – Entry 4: Starship Troopers

I know it’s been quite some time since I did my last review. It’s not that I haven’t been reading, quite the opposite in fact. My wife and I have been enjoying access to a great library system. I told a friend that going online and requesting holds was like Amazon, only I had to go pick them up, a relatively short drive, rather than having them delivered.

I think the reason is that I either read too fast and move on to the next book before writing a review, or I’m in one of my slogging phases (whether busy with other things or working through a difficult book) where a book drags on too long to finish.

I’ll of course try to be better, but I’d rather have a gap in my reviews and still be reading than the alternative.

So why this book, why now? Well I won’t get into the reasons that I decided to read Starship Troopers. I’ve read a few Heinlein books before and the way in which he uses “language” as a scientific concept is really interesting. However, my only exposure to this story was the movie, and while I knew they made a second and third film, I didn’t expect the sheer volume of adaptations to say much about the quality of the novel itself.

Let me assure you, this is still every bit the quick read you’d expect. I finished it in just a few nights. However, what I really enjoyed is that it’s much more like one of those hash-it-out “dialogues” you’d have in a class like “Religion and Moral Philosophy” than some sort of bug-zapping hack and smash.

My minor in college allowed me to take a lot of philosophy classes, and I wouldn’t claim to be great at such abstract thinking (it’s really the act of conveying a thought that I find most difficult) but I do love the introspection and the insights you can gain from a good teacher. I happen to have had one at a Summer Scholar’s session I attended and this book brought me back to those sessions.

I think the movie does a good job of conveying the content, though the campy nature of the movie, which I enjoyed, runs opposite to the book’s tone. If you’re looking for an entertaining story that will take you back to the debating days of school, then this is the one.

Posted in books | 2 Comments

Greetings, Officefighter

In the beginning you were a Starfighter, recruited by the Star League to defend the Frontier against Xur and the Ko-Dan armada… well, actually the beginning was something much simpler, something more like Space Invaders.

Space Invaders was a classic: it spurred the video game industry, inspired a vast genre, and can still be played online, occasionally even in a lecture hall.

Now, you can play in my office;

Office Invaders

Well, it would be hubris to claim that my humble creation has reached the same success, but I do believe it has matured in much the same vein. I seem to have moved beyond building a deadly office putting toy to a serious office defense system!

Office Defender

This might even classify as something much more sophisticated than some Russian spy gear! Now I’m truly ready to defend my office against any and all invaders, alien or otherwise!

In truth, I’m hoping I can stash this in a friend’s office, maybe for a surprise ambush! However, in practice it might be hard to convince someone to put on a face shield before heading to work.

Building from my previous creation, I connected my wiimote to a servo via some Python and my ioBridge 204 module, only this time I replaced the coilgun with a remotely triggered airsoft gun! Now there are no more issues with reloading a weapons system half a world away after each shot!
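
For flavor, here’s a rough sketch of what the glue code can look like. The wiimote side uses the cwiid library, which is real; the ioBridge URL and its parameters below are placeholders I made up for illustration, not the module’s actual API;

# Sketch only: cwiid handles the wiimote; the ioBridge endpoint below
# is a made-up placeholder, not the real ioBridge API.
import time
import urllib
import cwiid

TRIGGER_URL = "http://example-iobridge-gateway/servo?position=%d"  # hypothetical

wm = cwiid.Wiimote()         # blocks until you press 1+2 to pair
wm.rpt_mode = cwiid.RPT_BTN  # ask the wiimote to report button state

while True:
    if wm.state['buttons'] & cwiid.BTN_B:        # B is the trigger button
        urllib.urlopen(TRIGGER_URL % 90).read()  # swing the servo to fire
        urllib.urlopen(TRIGGER_URL % 0).read()   # and return to rest
    time.sleep(0.05)                             # don't spin the CPU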

Here is the action sequence;

Would you like to know more?

Posted in hacks, hardware, innovation, iobridge | 107 Comments

CouchDB Performance or Use a File

If this is the first post you’ve read from my blog you should probably go check out some others and see for yourself that I’m a big fan of CouchDB.

Even if it weren’t easy to be impressed by Damien Katz’s work, it would be hard to overlook the interest his code has created. If even those miracles weren’t enough for you, then just look at what other amazing minds have done. Finally, for the truly skeptical, there’s now a business you can contact.

There are quite a few things that make it an amazing piece of engineering, including its simplicity of purpose, something you don’t often get a chance to appreciate these days. I’m a big fan of pipes, lots of individual pieces each doing a dedicated task, and to me the MapReduce model epitomizes that behavior.

Still, there’s a lot I’ve taken on faith, since I haven’t been able to dedicate weeks to its internals, having instead tried to leverage it for projects. One of those articles of faith has been performance.

Truly, it’s more than just an assumption. Reports are that CouchDB’s performance is already quite decent even though it hasn’t been tuned, so I’ve never attempted to benchmark its behavior.

Instead I’ve been working on some language processing code for my wife. Learning about NLP has aligned with my A.I. background, although it’s reminded me about all the math I’ve forgotten!

After reading samples and feeling like I’d gotten my legs underneath me, I decided to “port” a nice little example over to CouchDB. If you want to play along at home, you’ll want to check out the article and grab his code (and the keywords2.txt file).

Although the code is geared more for education than performance, it still runs fairly snappily on my laptop, with some pretty consistent times;

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m0.286s
user	0m0.228s
sys	0m0.047s

I’ve also run it with some performance sampling but let’s stick with simple timing for now.

There are 124,580 “words” in the text file;

>>> key_file = open("keywords2.txt")
>>> data = key_file.read()
>>> words = data.split()
>>> len(words)
124580

This data is then used to create a word frequency count, regenerated each time the program runs; not bad for 0.2 seconds’ worth of work!
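
For reference, the file-backed frequency computation boils down to something like this (a simplified sketch, not the original article’s exact code);

# Simplified sketch of the file-backed approach: rebuild the word
# frequency dictionary from keywords2.txt on every run.
def build_file_prob_dict(path='keywords2.txt'):
    words = open(path).read().split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    total = float(len(words))
    # map each word to its relative frequency
    return dict((w, c / total) for w, c in counts.items())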

Naturally, static text data, supplemented with some derived “data structures” (like a total count, or a view showing each word and the number of times it appears), is a perfect case for CouchDB.

So I decided to simply load this data right into couchDB. Here’s how I did this, skipping details like creating the database itself;

import couchdb

# Assumes the server and database already exist, e.g.:
#   db = couchdb.Server()['kw2']
def load_db():
    key_file = open('keywords2.txt')
    data = key_file.read()
    words = data.split()
    for word in words:
        db.create({"word": word})  # one document (and one POST) per word

You’d expect this to take some time; databases provide valuable services but of course can only do so at the expense of some cycles. However, I was surprised to find that it took almost 30 minutes!

time ./couchdb_finding_keywords.py 

real	27m2.356s
user	2m35.921s
sys	1m14.478s

Based on the low user and sys times you can guess most of the delay is due to transport overhead, i.e. network communication. This is all going to a couchdb running on localhost, a MacBook with 4 GB RAM and a 2.0 GHz Intel Core 2 Duo, so it’s a bit surprising but not really critical.

I didn’t bother running this three times and taking an average. The keywords2.txt file should already be in memory, having been read by the file-backed example. Nor is the upfront cost a big consideration for me; I’m willing to spend the time once, especially if it can save me work on the back end!
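
Still, one HTTP POST per word is about the worst case for transport overhead. couchdb-python can batch documents through CouchDB’s bulk document interface via Database.update(), which should collapse those 124,580 round trips into a little over a hundred; here’s an untested sketch of what I’d try next;

# Untested sketch: batch inserts through CouchDB's bulk document API
# using couchdb-python's Database.update(), one request per 1000 docs.
def load_db_bulk(batch_size=1000):
    words = open('keywords2.txt').read().split()
    batch = []
    for word in words:
        batch.append({"word": word})
        if len(batch) >= batch_size:
            db.update(batch)  # single POST for the whole batch
            batch = []
    if batch:
        db.update(batch)      # flush the remainder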

So naturally I was pretty excited to port things over to a more CouchDB / Pythonic example, and here’s what I came up with. After you load your data you then need a view, which you can get from my previous post along with jChris’ helpful comment. Note: if this is your first time with this stuff (or even if it isn’t) you may want to practice on a smaller database first!

Next we’ll need some code to get this data, and while I highly recommend the fantastic couchdb-python library, for the rest of my examples I’ll use JSON & urllib to remove a layer of indirection.

Here’s how we can get the overall word count (used to calculate relative frequencies);

import urllib
import simplejson

db_name = 'kw2'  # substitute your database name

def total_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count" % db_name
        j = simplejson.loads(urllib.urlopen(u).read())
        # Sample output: {"rows":[{"key":null,"value":19}]}
        return j['rows'][0]['value']
    except Exception:
        return 0

We can do the same thing with “?group=true” in our URL to get the individual words each with their respective count. Here’s some code and a contrived bit of output to serve as our sample;

def all_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count?group=true" % db_name
        # Example output: {"rows":[{"key":"be","value":1},{"key":"do","value":4},{"key":"to","value":1},{"key":"we","value":2}]}
        j = simplejson.loads(urllib.urlopen(u).read())
        return j['rows']
    except Exception:
        return [{}]

One wrinkle versus the original example is that we’re getting a long list of dictionaries instead of one dictionary, but we can convert those rows into a single word frequency dictionary, computing each word’s relative frequency at the same time, and end up on equal footing again.

def build_prob_dict(word_list, total_words):
    # Turn [{'key': word, 'value': count}, ...] into {word: relative frequency}
    num = float(total_words)
    try:
        return dict([(r['key'], r['value'] / num) for r in word_list])
    except Exception:
        return {}

So that should get us the rest of the way. Here’s the relevant excerpt from the new script vs the original:

def find_keyword(test_string = None):
    if not test_string:
        test_string = 'Hacker news is a good site while Techcrunch not so much'
    word_prob_dict = build_prob_dict(all_word_count(), total_word_count())
    non_exist_prob = min(word_prob_dict.values()) / 2.0
    #... everything below should function unchanged

OK, so how does this fare? Well, let’s give it a try;

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m33.878s
user	0m0.692s
sys	0m0.408s

Ouch… and this is after the view had been generated (and thus cached) by multiple earlier calls. If you look at some more detailed numbers you can see that the bulk of the delay is again spent in socket calls. Even downloading the view results via wget is painful at ~11.8 KB/s, vs. ~163 MB/s when serving a static file with the same results via Apache.
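
If you want to check the transport theory yourself, timing just the raw fetch of the grouped view from Python tells the same story; a quick sketch (your numbers will vary, and substitute your database name);

# Time only the HTTP fetch of the grouped view, skipping JSON parsing,
# to separate transport cost from computation.
import time
import urllib

u = "http://localhost:5984/kw2/_view/finding/word_count?group=true"
start = time.time()
body = urllib.urlopen(u).read()
print "%d bytes in %.2f seconds" % (len(body), time.time() - start)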

Here’s an interesting tidbit from a more detailed profiling;

   ncalls  tottime  percall  cumtime  percall      filename:lineno(function)
      2     32.437   16.218   32.437  16.218        socket.py:278(read)

I know the team has not focused on tuning CouchDB yet, and I’ve read lots of anecdotal evidence that Erlang is fast for computation, especially on multicore systems, but my hope is they can get the transport layer moving just as quickly!

As a final curiosity, I’d love it if CouchDB supported queries from STDIN!! Think about the piping fun you could have if you could insert CouchDB as part of your bash pipeline! I also wouldn’t have to worry about adding another network server to my hosted service!
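
In the meantime, the closest I can get is a little filter; this hypothetical stopgap script reads a view result on STDIN and emits tab-separated rows, so you can fake the pipe with curl;

#!/usr/bin/env python
# Hypothetical stopgap: read a view result from STDIN, print one
# tab-separated key/value row per line. Usage:
#   curl -s 'http://localhost:5984/kw2/_view/finding/word_count?group=true' | ./rows2tsv.py | sort
import sys
import simplejson

for row in simplejson.loads(sys.stdin.read())['rows']:
    print "%s\t%s" % (row['key'], row['value'])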

Did I mess up here? Can someone try this and tell me if they get similar results?

Posted in code, couchdb, python | 14 Comments

Could a couchdb guru explain this, please?

I’m in the process of trying to build (and benchmark) a couchdb project, and I decided to use some word count & frequency samples as data. Since “word count” and “grep” are the quintessential map/reduce examples, I thought this would be fairly simple.

However, couchdb doesn’t seem to be following the expected semantics.

Let’s say I’ve got some data; here’s how it looks in Python;

>>> import couchdb
>>> s = couchdb.Server()
>>> db = s['kw2']
>>> for d in db: print db[d]
...
<Document '133da883092e206d7191f81661beb813'@'3188228489' {'word': 'ho'}>
<Document '2287406943e627278d98a3a2f3d3483b'@'634745217' {'word': 'do'}>
<Document '2717deb4df8ba09601166021fb758126'@'2083376980' {'word': 'mo'}>
<Document '38d48e8e069538a55902dd2d2b7e1771'@'2475366164' {'word': 'ho'}>
<Document '39ef4a9e3eb0eeb02d483ce658d08356'@'2904312995' {'word': 'hi'}>
<Document '4237064ad7a89fa11e9bbbc8ca4ed302'@'722283984' {'word': 'do'}>
<Document '4d0e61dedaf2af93a9d4d261cab696de'@'996995145' {'word': 'we'}>
<Document '55ba96501ed1e9573b2cb6e647c35b47'@'3153984663' {'word': 'my'}>
<Document '5be13ca69c76d202b131d50f5b9c1ecb'@'1584030189' {'word': 'do'}>
<Document '612e4a0d32f4c91f7fb2414e4de47845'@'3488016124' {'word': 'be'}>
<Document '61426c868dc388e6edb2b4ce2078ce06'@'2761346180' {'word': 'me'}>
<Document '908acaf4ad704951dbb08d27ddfbe9a9'@'941727127' {'word': 'mo'}>
<Document '9136e093fda2dda7d5585983299fcbc7'@'4166962206' {'word': 'mo'}>
<Document '9decb25944110c04d040feb31e532c78'@'1016718857' {'word': 'do'}>
<Document 'ad7f4aab329d55c3a2fb97390df5ae0a'@'1660663052' {'word': 'my'}>
<Document 'c4d976a789e37e1c3eb4d57bd50d47aa'@'923287257' {'word': 'my'}>
<Document 'cccf15515077d100498573fe40244130'@'3846996388' {'word': 'hi'}>
<Document 'd747a88eb2cb18776237852aceff96fc'@'3596694550' {'word': 'we'}>
<Document 'dc115f5d42d442f0b5e7d3680aeb62c2'@'3446491946' {'word': 'to'}>

Feel free to add your own but that’s what I’ve got. Each doc has a simple structure: an “_id” (supplied by couchdb when the document is created) and an element called “word”, which obviously contains some fabricated two-letter strings (which I hesitate to actually call words).

What’s important to note is that the same word may appear in multiple documents.

Now we want to build a view to show each word as well as the sum of how many times it appears in our database.

Again, following the classic paradigm we build our map function (in javascript) as such;

function(doc) {
  emit(doc["word"], 1);
}

So far so good, now reduce;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return value.length;
   }
}

You can pretty much ignore the “rereduce” clause, as our dataset’s not big enough right now, nor are we updating it. However, I will explain the function’s trick: while sum(value) is actually the “mathematically correct” action to take regardless of whether this is our first time through, we’re relying on the fact that since we emit a “1” for each key (i.e. each word instance), the sum of those values is simply the length of the array we’re passed. [I learned this from one of the masters]

Ok, despite the attempt at “premature optimization” this actually seems to work out, or at least it looks that way in couchdb’s key/value view. Here’s my screenshot for proof;

[screenshot: the view’s key/value rows, one summed count per word]

However, what I see from a direct URL query to this view is markedly different than the data represented there. To test this, either use Firefox or a command line client like curl and go to the following URL;

http://localhost:5984/kw2/_view/finding/word_count

What I see (and I suspect you will as well) is

{"rows":[{"key":null,"value":19}]}

Which seems to break our expected key/value pairing!!!

Suspecting my understanding of couchdb’s map/reduce representation has been occluded by all the Google videos I’ve watched, it seems like an intuitive modification might be to change our reduce function to return the key and the value, like this;

return [key, value];

However, that yields an even more shocking outcome;

{"rows":[{"key":null,"value":[[["we","d747a88eb2cb18776237852aceff96fc"],["we","4d0e61dedaf2af93a9d4d261cab696de"],["to","dc115f5d42d442f0b5e7d3680aeb62c2"],["my","c4d976a789e37e1c3eb4d57bd50d47aa"],["my","ad7f4aab329d55c3a2fb97390df5ae0a"],["my","55ba96501ed1e9573b2cb6e647c35b47"],["mo","9136e093fda2dda7d5585983299fcbc7"],["mo","908acaf4ad704951dbb08d27ddfbe9a9"],["mo","2717deb4df8ba09601166021fb758126"],["me","61426c868dc388e6edb2b4ce2078ce06"],["ho","38d48e8e069538a55902dd2d2b7e1771"],["ho","133da883092e206d7191f81661beb813"],["hi","cccf15515077d100498573fe40244130"],["hi","39ef4a9e3eb0eeb02d483ce658d08356"],["do","9decb25944110c04d040feb31e532c78"],["do","5be13ca69c76d202b131d50f5b9c1ecb"],["do","4237064ad7a89fa11e9bbbc8ca4ed302"],["do","2287406943e627278d98a3a2f3d3483b"],["be","612e4a0d32f4c91f7fb2414e4de47845"]],19]}]}

Of course I’m still baffled as to why we seem to have no entry for the key and all our rows stuffed into the value.

However, my larger concern is beyond even that perplexing situation;

What’s most surprising here is that the key we’re being passed includes the doc id even though it was not emitted as part of our map phase!

Let’s give it one last go here, thinking perhaps we need to be more explicit;

function(key, value, rereduce) {
   if (rereduce) {
      return sum(value);
   }
   else {
      return {"key": key[0],"value": value.length};
   }
}

Unfortunately, this still doesn’t yield the organized rows we expected, and instead returns;

{"rows":[{"key":null,"value":{"key":["we","d747a88eb2cb18776237852aceff96fc"],"value":19}}]}

Which stands in high contrast to what couchdb continues to show us;

[screenshot: couchdb’s key/value view, still showing one count per word]

So whatever we emit from reduce ends up as the value part of the reply (as indexed by “value”). That matches our original expectation (that couchdb handles setting the key itself) but doesn’t explain why it’s “null”.

In short I’m left with three questions;

1) Why does couchdb pass our reduce function the doc ID, when it’s not emitted in the map phase?

2) Why is “key” null in our output?

3) How do we get our JSON output to match the same pretty key/value representation that couchdb shows?

I wish I could promise that if you tune in next time I’ll have the answers, but we’ll have to rely on the good nature of the experts out there to help us out.

Posted in code, couchdb, frustration | 3 Comments

How to build CouchDB on Dreamhost

As you know from many of my entries I’m a big fan of couchdb, and if you’re interested you should really be following janl, jchris and lethain as they push this technology forward.

As you might also guess from my earlier post I’m working to build and install it on Dreamhost, another thing I support enthusiastically.

Unfortunately, being on the outer fringe of technology meant I wasn’t able to get them to install it for me, but that’s completely understandable. Given that the current package release has no auth support (I believe the repository builds do, but that would have required more software installs), if I were supporting a multi-user production environment it would make me a little nervous too.

However, it’s a major component of my ongoing projects, so I wanted to give it a shot. I don’t have it up and running 100% right now (it appears to run, though I can’t connect), but I wanted to document the build side of things before I forgot 😀

So here’s the rundown;

I was fortunate to follow some excellent advice about getting Django up and running on Dreamhost. It advised that you set up a “~/run” directory to install all your add-on software into, and the steps below build on that existing environment. (They assume $RUN points at that directory, e.g. export RUN=$HOME/run.)

First you need to download some software. I needed to get Erlang, SpiderMonkey, ICU & CouchDB.

I downloaded all my files into “~/repo” but wherever you like to store them will be fine (“~/software” for example). Now create a temp directory and unpack everything, replacing the paths (and potentially the filenames, if you picked different versions) as appropriate.

mkdir ~/tmp && cd ~/tmp
tar zxf ~/repo/otp_src*
tar zxf ~/repo/js-1.7.0*
tar zxf ~/repo/icu4c*
tar zxf ~/repo/apache*

I will show the build commands in the order I did them, but as long as you save CouchDB for the final step (naturally) I think you should be fine. Though it’s important to realize I didn’t get this right the first time through, so I did have some partial installs at times.

First up, SpiderMonkey;

cd ~/tmp/js/src
make -f Makefile.ref BUILD_OPT=1 JS_DIST=$RUN
cp *.h $RUN/include/js
cd Linux_All_OPT.OBJ
cp jsproto.tbl jsautocfg.h $RUN/include/js
cp libjs.so $RUN/lib

Now for Erlang:

cd ~/tmp/otp_src*
./configure --prefix=$RUN --enable-smp-support --enable-threads --enable-hipe
make && make install

Next, Unicode Support (ICU):

cd ~/tmp/icu/source
./configure --prefix=$RUN
make && make install

And finally, CouchDB!

cd ~/tmp/apache-couchdb*
./configure --prefix=$RUN --with-js-lib=$RUN/lib --with-js-include=$RUN/include/js --with-erlang=$RUN/lib/erlang/usr/include
make && make install

Once that completed, I was able to run “couchdb” and see the famous “Apache CouchDB has started. Time to relax.”!!!

Unfortunately, running “couchdb -s” in another window tells me that “Apache CouchDB is not running.” 🙁

However, I suspect that’s an easier issue for Dreamhost to help me with than building everything from source!

Posted in code, couchdb | 14 Comments