Archive for the ‘python’ Category

Cloudera’s Hadoop Education

Sunday, June 14th, 2009

A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.

I must say I was a little disappointed by the videos but I believe that’s because I’d seen Christophe Bisciglia’s lectures when he was still at Google.

However, the exercises are definitely something to get you thinking and are worth giving a shot. It’s sort of like ‘programming golf‘ and I thought I’d share my version of the first map function vs. the packaged solution.

Here’s my map function:

import sys, re
WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')

for input in sys.stdin.readlines():
    m = PARSER.match(input)
    if m:
        key = m.groups()[0]
        for word in WORDS.findall(m.groups()[1]):
            print "%s\t%s" % (word, key)

Cloudera’s version is:

import re
import sys
NONALPHA = re.compile("\W")

for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key

By definition they should produce the same output, i.e. the mappings should be identical, and barring buggy corner cases mine certainly passed the test.

What I found interesting was my instinctual desire to let regexps do the work, whereas their version relies on a simple “split()” to sort the input. It’s likely a faster solution and given the massive amounts of data for large data passes, it’s worth benchmarking.

However, although I’m clearly biased, I must admit I found mine easier to grok, and it should be more flexible, e.g. perhaps the input pattern could become a parameter rather than hard-coded into the flow.

There’s certainly not a “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really really short and easy to modify, but I thought others might find it interesting to realize that Perl doesn’t have an exclusive license on ‘TMTOWTDI’.
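For anyone who wants to try the benchmark mentioned above, here is a rough sketch. It is written for Python 3 (the listings above are Python 2), collects the pairs in a list instead of printing them, and uses made-up sample lines, so treat any timings as indicative only. The function names map_regex and map_split are my own labels for the two approaches, not names from either script.

```python
import re
import timeit

WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')
NONALPHA = re.compile(r'\W')


def map_regex(lines):
    """Regex approach: one pattern splits key from text, another finds words."""
    pairs = []
    for line in lines:
        m = PARSER.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                pairs.append((word, key))
    return pairs


def map_split(lines):
    """Split approach: cut on the first tab, then on non-word characters."""
    pairs = []
    for line in lines:
        keyline = line.split("\t", 1)
        if len(keyline) == 2:
            key, text = keyline
            for w in NONALPHA.split(text):
                if w:
                    pairs.append((w, key))
    return pairs


# Made-up input in the "key<TAB>text" shape both mappers expect.
SAMPLE = ["doc%d\tthe quick brown fox jumps over the lazy dog\n" % i
          for i in range(1000)]

if __name__ == "__main__":
    for fn in (map_regex, map_split):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=20)
        print("%s: %.3fs" % (fn.__name__, seconds))
```

Comparing the two return values on the same input is also a cheap way to check that the mappings really are identical before trusting any timing difference.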

CouchDB Performance – Too much TCP

Sunday, May 31st, 2009

It’s been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some significant performance improvements. Unfortunately, at the time I didn’t have spare cycles to invest in building the latest branches of erlang, couchdb and everything else, so I hadn’t previously been able to rerun my tests.

However, I started a new project today and, like most developers, I took some time to sharpen my tools before I felt sufficiently prepared to proceed. Of course, since one of my favorite tools is CouchDB itself, I checked in to see how it had been progressing, and I was thrilled to see that Janl (along with others who, it looks like, have contributed) had released a new version of the excellent DBX bundle!

So after a round of updating DBX and CouchDB python library components, I decided to suffer a small distraction and give the new code a test drive.
I wanted to check my baseline, so here’s a rough time sample for the original, file based, keywords code:

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real    0m0.329s
user    0m0.225s
sys    0m0.046s

I ran the initial load and it looks much the same as the previous test:

time ./couchdb_finding_keywords.py

real    28m16.430s
user    2m55.550s
sys    1m30.335s

So perhaps around 20% faster, though on a second test run this actually took more than 39 minutes!

Well, now that the load is out of the way, let’s see how our queries are looking.

Well, after making a view, the results with wget aren’t any more promising than last time (note the view location has changed):

wget -O - http://localhost:5984/keywords/_design/finding/_view/word_count?group=true > /dev/null
--20:35:08--  http://localhost:5984/keywords/_design/finding/_view/word_count?group=true
           => `-'
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]

    [                     <=>                        ] 422,776       12.54K/s             

20:35:40 (12.94 KB/s) - `-' saved [422776]

Alas, it doesn’t look like most of the performance improvements have really paid off for this test case; in fact, every run I tried was slower than with the last version.
Here’s a sample run which is fairly indicative of the rest:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m52.659s
user	0m0.702s
sys	0m0.441s

And again, with more of the full debugging info:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
>>>---- Begin profiling print
         788 function calls (709 primitive calls) in 51.297 CPU seconds

   Ordered by: internal time, call count
   List reduced from 118 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   51.092   25.546   51.092   25.546 socket.py:278(read)
        2    0.049    0.024    0.049    0.024 decoder.py:320(raw_decode)
        1    0.040    0.040    0.040    0.040 ic.py:182(__contains__)
       14    0.036    0.003    0.036    0.003 socket.py:321(readline)
        1    0.020    0.020    0.020    0.020 couchdb_finding_keywords.py:61(build_prob_dict)
        1    0.011    0.011    0.019    0.019 ic.py:1()
        1    0.011    0.011   51.297   51.297 couchdb_finding_keywords.py:68(find_keyword)
        8    0.008    0.001    0.008    0.001 :1(connect)
        1    0.007    0.007    0.066    0.066 urllib.py:1296(getproxies_internetconfig)
        1    0.005    0.005   51.228   51.228 couchdb_finding_keywords.py:41(all_word_count)
        1    0.003    0.003    0.069    0.069 urllib.py:1329(getproxies)
        1    0.003    0.003    0.003    0.003 Res.py:1()
        1    0.002    0.002    0.002    0.002 File.py:1()
        1    0.001    0.001    0.002    0.002 macostools.py:5()
        2    0.001    0.001    0.001    0.001 socket.py:229(close)
        2    0.001    0.000    0.002    0.001 httplib.py:659(connect)
        2    0.001    0.000    0.005    0.002 httplib.py:224(readheaders)
     11/6    0.001    0.000    0.001    0.000 sre_parse.py:385(_parse)
        1    0.000    0.000    0.000    0.000 ic.py:161(__init__)
        2    0.000    0.000    0.000    0.000 httplib.py:323(__init__)

>>>---- End profiling print

real	0m52.023s
user	0m0.732s
sys	0m0.437s

I’m no erlang expert but seeing that many socket calls makes me still suspect that some TCP level tuning (window size & buffering) might be helpful.
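A report in the format shown above (sorted by internal time, then call count, cut to 20 rows) can be collected with the standard cProfile and pstats modules. This is a minimal Python 3 sketch, not the code from the test script: profile_call is a hypothetical helper, and the find_keyword here is a trivial stand-in so the example runs without a CouchDB server.

```python
import cProfile
import io
import pstats


def find_keyword():
    # Stand-in workload; the real function queries CouchDB over HTTP.
    return sorted(str(n) for n in range(1000))


def profile_call(func, limit=20):
    """Run func under the profiler and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by internal time, then call count, and keep the top rows only.
    stats.sort_stats("time", "calls").print_stats(limit)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(find_keyword)
    print(report)
```

Sorting by internal time is what pushes a single blocking call such as socket read to the top of the list.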

As a final note, I did a database compaction and reran the query, which helped significantly compared to the worst-case 0.9 time but at best only matched 0.8.

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m34.605s
user	0m0.687s
sys	0m0.394s

You can find the test script, with changes to work with the slightly different view URLs, and if you’d like to recreate the test, all you need to do (beyond setting up couchdb) is:

  1. Swap the comments on the last two lines to run “load_db()”
  2. Create a map / reduce wordcount view
  3. Change the “view_url” parameter on line 25
  4. Invert the comments again to just run “find_keyword()”
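For step 2, a word-count view could look something like the design document below. This is a guess at a workable view, not a copy of the one used in the test: the design document name “finding” and view name “word_count” come from the URL above, the “word” field comes from the load code, and the map / reduce bodies (plain JavaScript strings) are my assumption. The put_design_doc helper is hypothetical and needs a running server.

```python
import json
import urllib.request

# Design document holding one map/reduce view. The JavaScript bodies are
# an assumed implementation of a word count over {"word": ...} documents.
DESIGN_DOC = {
    "_id": "_design/finding",
    "language": "javascript",
    "views": {
        "word_count": {
            "map": "function(doc) { if (doc.word) { emit(doc.word, 1); } }",
            "reduce": "function(keys, values) { return sum(values); }",
        }
    },
}


def put_design_doc(db_url="http://localhost:5984/keywords"):
    """Upload the design document (requires a reachable CouchDB)."""
    req = urllib.request.Request(
        db_url + "/" + DESIGN_DOC["_id"],
        data=json.dumps(DESIGN_DOC).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req).read()


if __name__ == "__main__":
    print(json.dumps(DESIGN_DOC, indent=2))
```

Queried without parameters, a view like this reduces to one total; with “?group=true” it returns one row per distinct word.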

A simple twitter library in python

Wednesday, April 29th, 2009

I’ve been working on a project built on Google App Engine and I’m relying on twitter to mediate some of the interaction with my end users.

What I find great about the growing prevalence of social interfaces is that I don’t have to focus predominantly on coding an interface and, with so many clients available, my users can interact in whatever way is most appropriate, i.e. from a mobile phone or a desktop client.

Unfortunately, the standard python-twitter library doesn’t readily run under GAE because of some library issues. Originally, I was looking at providing some code changes for it but it’s a spiderweb of more abstractions than I think the problem deserves.

In the process of building my own library I found out that Avinash had figured out how to set up the authentication properly, so I built upon his work and added a few other functions I needed.

We all know about twitter’s growing popularity so I thought I’d share my version as well in case it proved helpful to anyone. Twitter provides a great mechanism to decouple your interface from your backend code and I hope to see many more smart systems to come!

CouchDB Performance or Use a File

Thursday, March 19th, 2009

If this is the first post you’ve read from my blog you should probably go check some others and assert for yourself that I’m a big fan of couchDB.

Even if it wasn’t easy to be impressed by Damien Katz’s, it would be hard to overlook the interest his code has created. If even those miracles weren’t enough for you, then just look to what other amazing minds have done. Finally, for the truly skeptical there’s now a business you can contact.

There are quite a few things that make it an amazing piece of engineering, which includes its simplicity of purpose, something you don’t often get a chance to appreciate these days. I’m a big fan of pipes, lots of individual pieces doing their dedicated task, and to me the MapReduce model epitomizes that behavior.

Still, there’s a lot I’ve taken on faith, since I haven’t been able to dedicate weeks to its internals, instead trying to leverage it for projects. One of those traits has been the assumption of performance.

Truly, it’s more than just an assumption. Reports are that couchDB’s performance is already quite decent and it’s not even been tuned, so I’ve never attempted to benchmark its behavior.

Instead I’ve been working on some language processing code for my wife. Learning about NLP has aligned with my A.I. background, although it’s reminded me about all the math I’ve forgotten!

After reading samples and feeling like I’d gotten my legs underneath me, I decided to “port” a nice little example over to couchDB. If you want to play along at home then you’ll want to check out the article and grab his code (and keywords2.txt file).

Although the code is geared more for education than performance, it still runs fairly snappily on my laptop, with some pretty consistent times;

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m0.286s
user	0m0.228s
sys	0m0.047s

I’ve also run it with some performance sampling but let’s stick with simple timing for now.

There’s about 124,580 “words” in the text file;

>>> key_file = open("keywords2.txt")
>>> data = key_file.read()
>>> words = data.split()
>>> len(words)
124580

This data is then used to create a word frequency count and is generated each time the program is run, not a bad 0.2 seconds worth of work!

Naturally, having static text data and supplementing this original data with some derived “data structures” (like total count, or a view showing each word and the number of times it appears), is a perfect case for couchDB.

So I decided to simply load this data right into couchDB. Here’s how I did this, skipping details like creating the database itself;

def load_db():
    key_file = open('keywords2.txt')
    data = key_file.read()
    words = data.split()
    for word in words:
        node = db.create( { "word": word } )

You’d expect this to take some time, databases provide valuable services but of course can only do so at the expense of some cycles. However, I was surprised to find out this took almost 30 minutes!

time ./couchdb_finding_keywords.py 

real	27m2.356s
user	2m35.921s
sys	1m14.478s

Based on the low user and sys times you can guess most of the delay is due to transport overhead, i.e. network communication. This is all going to a couchdb running on localhost, a MacBook with 4G RAM and an Intel 2.0 GHz Core 2 Duo, so it’s a bit surprising but not really critical.
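If the per-document HTTP round trip really is the cost, one thing worth trying is sending documents in batches through CouchDB’s _bulk_docs endpoint rather than one create call per word. I have not measured this against the numbers above, so whether it helps is an assumption. The sketch is Python 3; batches and bulk_load are hypothetical helpers, and the database name “keywords” and the batch size of 1000 are placeholders.

```python
import json
import urllib.request


def batches(words, size=1000):
    """Yield JSON request bodies, each holding up to `size` word documents."""
    for start in range(0, len(words), size):
        docs = [{"word": w} for w in words[start:start + size]]
        yield json.dumps({"docs": docs}).encode("utf-8")


def bulk_load(words, db_url="http://localhost:5984/keywords", size=1000):
    """POST each batch to _bulk_docs (requires a reachable CouchDB)."""
    for body in batches(words, size):
        req = urllib.request.Request(
            db_url + "/_bulk_docs",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()


if __name__ == "__main__":
    sample = "to be or not to be".split()
    for body in batches(sample, size=4):
        print(body)
```

With 124,580 words and a batch size of 1000 this would make 125 requests instead of one per word.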

I didn’t bother running this three times and taking an average. The keywords2.txt file should already be in memory having been read by the file backed example. Nor is upfront cost a big consideration for me, I’m willing to spend the time once especially if it can save me work on the backend!

So naturally I was pretty excited to port things over to a more couchDB / pythonic example and here’s what I came up with. After you load your data you then need a view, which you can get from my previous post, along with jChris’ helpful comment. Note, if this is your first time with this stuff (or even if it isn’t) you may want to practice on a smaller database first!!

Next we’ll need some code to get this data, and while I highly recommend the fantastic couchdb-python library for the rest of my examples I’ll use JSON & urllib to remove a layer of indirection.

Here’s how we can get the overall word count (used to calculate relative frequencies);

def total_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count" % (db_name)
        j = simplejson.loads(urllib.urlopen(u).read())
        # Sample Output: {"rows":[{"key":null,"value":19}]}
        return j['rows'][0]['value']
    except:
        return 0

We can do the same thing with “?group=true” in our URL to get the individual words each with their respective count. Here’s some code and a contrived bit of output to serve as our sample;

def all_word_count():
    try:
        u = "http://localhost:5984/%s/_view/finding/word_count?group=true" % (db_name)
        ### Example Output: {"rows":[{"key":"be","value":1},{"key":"do","value":4},{"key":"to","value":1},{"key":"we","value":2}]}
        j = json.loads(urllib.urlopen(u).read())
        return j['rows']
    except:
        return [{}]

Now what is a bit problematic from this (vs the original example) is that we’re actually getting a long list of dictionaries instead of one dictionary, but we can convert this to a full word frequency dictionary and end up on equal footing again all at the same time.

def build_prob_dict(word_list, total_words):
    num = float(total_words)
    try:
        return dict([ (r['key'], r['value'] / num ) for r in word_list])
    except:
        return {}
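To make the conversion concrete, here is the same idea run against the contrived grouped output shown earlier. It is a Python 3 sketch with no server involved: the JSON string is the sample from the comment above, and the total is computed locally instead of coming from the ungrouped view.

```python
import json

# The contrived "?group=true" response used as the sample above.
SAMPLE = ('{"rows":[{"key":"be","value":1},{"key":"do","value":4},'
          '{"key":"to","value":1},{"key":"we","value":2}]}')


def build_prob_dict(word_list, total_words):
    """Turn a list of {"key": word, "value": count} rows into word -> frequency."""
    num = float(total_words)
    return {r["key"]: r["value"] / num for r in word_list}


rows = json.loads(SAMPLE)["rows"]
total = sum(r["value"] for r in rows)  # 8 here; normally the ungrouped view result
probs = build_prob_dict(rows, total)
print(probs)  # {'be': 0.125, 'do': 0.5, 'to': 0.125, 'we': 0.25}
```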

So that should get us the rest of the way. Here’s the relevant excerpt from the new script vs the original:

def find_keyword(test_string = None):
    if not test_string:
        test_string = 'Hacker news is a good site while Techcrunch not so much'
    word_prob_dict = build_prob_dict(all_word_count(), total_word_count())
    non_exist_prob = min(word_prob_dict.values()) / 2.0
    #... everything below should function unchanged

OK, so how does this fare? Well let’s give it a try;

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
real    0m33.878s
user    0m0.692s
sys    0m0.408s

Ouch… this is after the view had been generated (and thus cached) by couchDB through multiple calls. If you look at some more detailed numbers you can see that the bulk of the delay is again spent in socket calls. Even downloading the view results via wget is painful at ~ 11.8 KB/s vs. ~163 MB/s when serving a static file with the results via apache.

Here’s an interesting tidbit from a more detailed profiling;

   ncalls  tottime  percall  cumtime  percall      filename:lineno(function)
      2     32.437   16.218   32.437  16.218        socket.py:278(read)

I know the team has not focused on tuning couchDB, and I’ve read lots of anecdotal evidence that erlang is fast for computation especially on multicore systems, but my hope is they can get the transport layer working quickly as well!

As a final curiosity, I’d love it if couchDB supported queries from STDIN!! Think about the piping fun you could have: you could insert couchDB as part of your bash pipe! I also wouldn’t have to worry about adding another network server to my hosted service!

Did I mess up here? Can someone try this and tell me if they get similar results?

twitterline

Friday, February 20th, 2009

I’ve been using a lot of python, jQuery and web services recently and thought it was time to pull together those skills into a public app.

Like most developers, I often “scratch my own itch” and write code to solve a problem or learn something. I try to post what I can but some solutions are hosted internally and I know there are numerous code fragments scattered across my hard drives which haven’t made it into posts.

Many of these projects are “evolutionary dead ends” but I think it’s important to engage in “purposeful play” without anticipating success or failure. You really have to take time to nurture your childlike creativity and it’s often in these limitless exercises that we develop the foundation for real breakthroughs for more “respected” works.

I was reminded of this recently when I watched a presentation by Aaron Koblin on some of his creative works. His compositions are stunning and while I recall noticing many of those projects independently over time, it was seeing the evolution of his portfolio that really inspired me.

If you watch the video you can see how his work went from a type of deliberate play to having a full “application”. It’s a lesson I try to perpetually embody with a “just do it” attitude, and it’s rewarding to see someone having applied it with such success.

So in that vein, I decided to clean up one of my sites and pull together a lot of these components into something “useful”. I call it “twitterline”, because “twitterbar” may be more descriptive but doesn’t roll off the tongue as well. You can see an example and get a pretty good idea of what it’s used for.

The API is “RESTful” and is simply “http://twitterline.shelv.us/twitterline” followed by your twitter ID, e.g. “/wjhuie”, and the number of days that you’d like to graph, e.g. “/4”. I’ve limited the number to the range [1, 14], and if you don’t supply a number the default is 7, which all makes for reasonable defaults.
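Put as code, building one of these URLs is a one-liner. This is an illustrative Python 3 helper (twitterline_url is a name made up here); clamping out-of-range values on the client side is my assumption about a sensible way to respect the [1, 14] limit, not a description of what the server does with them.

```python
BASE_URL = "http://twitterline.shelv.us/twitterline"


def twitterline_url(twitter_id, days=None):
    """Build a twitterline URL; days defaults to 7 and is clamped to [1, 14]."""
    if days is None:
        days = 7
    days = max(1, min(14, int(days)))
    return "%s/%s/%d" % (BASE_URL, twitter_id, days)


print(twitterline_url("wjhuie", 4))
```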

However, beyond just looking at the bar graph on my site, you should be able to embed it wherever you wish! You can check the source on my example, mostly you’ll need to make sure jQuery and jQuery.Flot are embedded first and you’ll likely want to tweak the CSS. Just let me know if you need help getting it up and running or if you’d like some different defaults.

It’s intended to be a simple culmination of a more complex process (which I’ll blog more on later) but I hope it inspires you to dust off a project of your own or start a new one!

Build your own ioGun!

Friday, January 16th, 2009

I apologize for what will effectively be a brain dump post, but a new friend of mine from the hackaday forums is getting started on his own accelerometer controlled system and I wanted to see if I could save him some time and frustration.

I think standing on the shoulders of Giants is a fantastic aspect of human nature (now if only someone could show me to some friendly Cyclopi; is that really the right pluralization?), and frankly I could not exist, let alone thrive, in the technological world if it weren’t for the good graces of a great many people.

Perhaps I will write a _very long_ blog thank-you to them (if I can remember them all). So consider this my little chance at saving some of you a few hours of your precious time (and brain strain). Because of the very nature of hacking (i.e. everything’s just a little different and thrown together) this can’t quite be the same quality as an instructable but it should get some of you started!

I’ve put up my code here so that’s quite obviously the place to start. You’ll find a python script and an HTML page. The script connects to the MoteDaemon port I referenced earlier and will write these data points out to a JSON file which can be served via your webserver, and is then picked up by the HTML page.

If you look through the code you’ll see there’s clearly some tuning that can be done. As people on many forums pointed out there’s a “lag” which is really just because of my polling rates (and less to do with the network traffic).

Two important things here, first you’ll see that I created a separate thread for the monitoring and one for the output, that’s because the wiimote data is fast and furious and you don’t want to block and miss any!

Second, I know writing to a JSON file isn’t ideal, so the output part actually buffers 10 events and writes those. Luckily jQuery is smart enough to only pull the file if it’s changed, so it’s not as bad as it sounds.
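The two-thread layout described above can be sketched roughly like this. It is a Python 3 illustration, not the posted script: read_event stands in for whatever reads from the MoteDaemon port, the function names are made up, and rewriting the file with only the latest batch (plus flushing leftovers on shutdown) is my assumption about the output format.

```python
import json
import queue
import threading

BATCH_SIZE = 10  # events buffered before each file write


def flush(batch, path):
    """Rewrite the JSON file with the latest batch of events."""
    with open(path, "w") as fh:
        json.dump(batch, fh)


def monitor(read_event, events, count):
    """Reader thread: grab events as they arrive and never touch the disk."""
    for _ in range(count):
        events.put(read_event())
    events.put(None)  # sentinel telling the writer to stop


def writer(events, path):
    """Writer thread: buffer BATCH_SIZE events, then write them out."""
    batch = []
    while True:
        event = events.get()
        if event is None:
            if batch:
                flush(batch, path)  # write whatever is left over
            break
        batch.append(event)
        if len(batch) == BATCH_SIZE:
            flush(batch, path)
            batch = []


def run(read_event, path, count):
    """Start both threads and wait for them to finish."""
    events = queue.Queue()
    threads = [
        threading.Thread(target=monitor, args=(read_event, events, count)),
        threading.Thread(target=writer, args=(events, path)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The queue is what keeps slow file writes from blocking the reader, which is the point of splitting the work across two threads.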

Once you’ve got the accelerometer data into python (and then into JSON) it’s ‘just’ a matter of writing the webpage you want! jQuery makes some stuff really easy so I suggest giving it a look if you haven’t yet!

Of course what made it really easy for me was the ioBridge module. I just plugged in a servo there and defined a widget on their page and voilà, I had a webservice I could send commands to!

I hope that helps give some of you (or at least one of you) a boost, and if I can help out at all please let me know!

Taking (and keeping) your temperature!

Friday, January 2nd, 2009

I swear I don’t have a penchant for medical terminology but this ioBridge stuff is making me feel like that time I stayed at a Holiday Inn… so refreshing, I think I could perform surgery!

After my heart hacks (see my previous posts) I had questions from some friends about how we could graph the data from an ioWidget (my term). Initially, I wanted to push the data into a Google Spreadsheet but unfortunately there doesn’t seem to be a higher level javascript library supporting Google Docs and sorting through the feed URLs was just too complicated.

At the same time I was thinking through this request, I received a resounding response from the ioBridge team! Although I’d worked around the need for a simple API they quickly responded to the desire (apparently I wasn’t the only one with interesting ideas) and now there’s a full JSON API!

I won’t bore you with my initial graphing solution as the sample made it into their official API demo (along with some much needed code enhancements). However, there were a few pitfalls with that approach that I still didn’t like, most specifically that the data is “lost” every time you reload the page.

It took me until after the holiday break (along with a welcome return to python) but I’ve solved my initial frustration with great results.

Here’s a script which will poll my ioBridge module and then store the results of my temperature sensor in a Google Spreadsheet that I created! Once the data’s there you can use Google’s visualization widgets to make some fun graphs!

Aside from some setup and “ease of use” code, the real work is done by two very brief classes. I deliberately didn’t add some error checking nor make the widget class generic (it’s actually proxying the full ioBridge module) so I think it should be straightforward enough to modify for your own uses!

All you need to do is create a spreadsheet and open it in your browser. Copy the key from that URL and paste it, along with your ioBridge feed URL, into the appropriate places in the script (the locations are commented).

I simply run this from a cron script every few minutes (once I get more data I’ll reduce the time) and although there’s not a lot of variation in the data (I deliberately introduced some to make the graph more interesting) it’s a spectacular way to record, visualize and act on the sensor’s findings!

Good luck with your own modifications and let me know if I can help!

Training Neural Nets with CouchDB – part 3

Thursday, December 4th, 2008

Hopefully you’ve been following parts 1 and 2 and I didn’t leave anyone too confused by my approach.

Please visit my posts for a far better recap than I can provide here (DRY). In part 1, I introduced the overall project, discussed the django layout and focused on the jquery I/O part. My goal in part 2 was to show some of the underlying mechanics of how we triggered a neuron as well as queried couchdb for info.

I know we honed in on some serious specifics but now that we’ve got those specifics I’d like to step back and put together the pieces.

If you check out a copy of the interface (which doesn’t have the NN backend) you can get a sense of the I/O process. In the full application here’s the process;

(0) When a user clicks anywhere on the page that’s sent to our django URL and (1) django records the location then (2) queries our neural net for a guess. (Step 6) This guess will be sent back to the page to move the second coordinate box, and in the sample is set to be equal to the click location. However, before returning the guess, (3,4) the difference between the guess and the actual click is sent to the network so it can train for next time. Also, (5) the input nodes need to be set to the click location so that next time the net can base its output on your previous click.

Here’s the relevant code;

#get the clicked coordinates
click_X = int(request.POST['X'])
click_Y = int(request.POST['Y'])

#Figure out what the net would have guessed
guess_X = int(get_output("output_x"))
guess_Y = int(get_output("output_y"))

#Find the error and propagate that back to the net
error_X = float(click_X) - float(guess_X)
error_Y = float(click_Y) - float(guess_Y)
back_prop("output_x", error_X)
back_prop("output_y", error_Y)

#Set the inputs for next time
set_input_weight("input_x", "static", request.POST['X'])
set_input_weight("input_y", "static", request.POST['Y'])

#return the guess to jQuery
return HttpResponse( '{ "X": %s, "Y": %s}' % (guess_X, guess_Y) )

Now I know you only know how a neuron is defined and have no idea of our neural net structure but I think that process will be very illustrative.

Since our net needs to predict X and Y coordinates we need to have two output nodes which I’ve named “output_x” and “output_y” and similarly we have “input_x” and “input_y” in order to inject the coordinate values into the network. You can see the keyword “static” trick that we discussed previously being used.

Neural Nets are typically called “Feed Forward Neural Nets” because you stimulate node “A” and then things cascade A -> B ->C and you can read the output at “C”.

However, programmatically this can be a royal pain.

  • In some situations you’d need a clock to sync with so you can ensure that you’re reading the n’th output of C and not the n+1 (e.g. if the input to A was controlled via a separate thread then things could shift underneath you before you read the values).
  • You could also build this communication via a messaging service. Node “A” could broadcast its output at a given time and then use a pub/sub model so interested nodes could be alerted as events progressed.

The last approach probably scales the best; if you had a ready queuing mechanism, I’d go this way for larger networks. This is more apparent when you realize that “A, B & C” aren’t necessarily single nodes.

Neural nets are commonly built with layers, each of which typically contains multiple nodes all of which are connected to all of the nodes in the previous and subsequent layers.

So in my case “A” is the first layer containing “input_x” and “input_y” and “C” is the output layer containing “output_x” and “output_y”. Layer B would have many nodes all receiving the output from both nodes in layer A. (As an aside you can have more complicated layering systems, for example “output_x” might also have a direct link from “input_x” to further augment its correlation and it is legal for a layer to only have a single node.)

So you’ve seen the “API” for our neural net and part 2 covers the underlying mathematical (and procedural) mechanics so I should be able to wrap up next time by discussing how to get this thing off the ground and see how it works!

Training Neural Nets with CouchDB – part 2

Tuesday, December 2nd, 2008

It’s always nice to have a little encouragement especially when trying to work through some tough posts. I really prefer white boards and pictures but I find these too hard to make for blogging so I try to let the words shape the images in your mind.

So let’s get started with your first exercise… picture or graph the output of tanh() over [-1,1]. For those who will skip the process of pulling out their graphing calculator, you’ll get a sigmoidal curve, i.e. shaped like an “S”, with asymptotes at y=-1 and y=1.

So, I know that’s neat and all, but what does it have to do with couchdb? Well, it doesn’t per se, but it represents our “trigger” function for our neurons. We’ll use this to convert a node’s input to its output, and let’s call this function sigmoid().

The input to this function is actually the weight our node attributes to a specific input. So let’s say I’m a node, N, and I have an input, I. That input will be a value, numeric in this case, and I’ll assign it a weight, W. That weight will correlate with my “trust” in that input (after an appropriate training process). But the key part now is that the output, O, of N will be; O = I * sigmoid(W).

If you play around and graph some of this, you’ll see that if I don’t “trust” I, and W approaches 0 then even if I is a very big number O will be near 0 also.
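If graphing isn’t your thing, a few numbers make the same point. This Python 3 sketch just evaluates O = I * sigmoid(W) for one large input and several weights; node_output is a name made up here, and the input and weight values are arbitrary.

```python
from math import tanh


def sigmoid(x):
    return tanh(float(x))


def node_output(i, w):
    """Output of a node with a single input i and weight w: O = I * sigmoid(W)."""
    return i * sigmoid(w)


# A large input with progressively more "trusted" weights.
for w in (0.0, 0.001, 0.5, 2.0):
    print(w, node_output(1000.0, w))
```

With W at exactly 0 the output is 0 no matter how large I is, and since tanh() never exceeds 1 the output can never be larger in magnitude than the input.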

Before we get into this business of trust let me give you two functions;

from math import tanh

def sigmoid(x):
    return tanh(float(x))

def slope_sigmoid(y):
    return 1.0 - y*y

Let me comment on that slope function there. The slope tells us how “drastic” an effect a small change will have on our output, and we’ll use this function in order to adjust our weights appropriately. This is important because if we’re at the center of our sigmoid (x=.5) and want to reinforce the weight then we may only need to move it by +.01 but if we’re at an extreme on our curve (e.g. x = -.8) then we may need to adjust the weight by -.1 (i.e. 10 times as much) in order to achieve a noticeable change.
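One detail worth spelling out: slope_sigmoid() takes the output y = tanh(x), not x itself, because the derivative of tanh(x) is 1 - tanh(x)^2. The Python 3 sketch below checks that against a finite-difference estimate; numeric_slope and the sample points are illustrative additions of mine.

```python
from math import tanh


def sigmoid(x):
    return tanh(float(x))


def slope_sigmoid(y):
    """Slope of tanh expressed in terms of its output y = tanh(x)."""
    return 1.0 - y * y


def numeric_slope(x, h=1e-6):
    """Central-difference estimate of the slope of sigmoid at x."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)


for x in (0.0, 0.5, -0.8, 2.0):
    y = sigmoid(x)
    print(x, slope_sigmoid(y), numeric_slope(x))
```

The two columns should agree closely, and the slope shrinks as x moves away from 0, which is why weights far out on the curve need larger adjustments to change the output.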

OK, let’s talk about that mythic “trust” factor, and in order to do that let’s talk about neural nets. NNs are built by collecting neurons, connecting outputs to inputs (not usually circular but they can be) and then stimulating an input layer with a set of values and reading an output layer to see what it produced.

Here, in pseudo-python, is how I represented a neuron;

class neuron():
    name = ""
    inputs = []

It’s a relatively simple structure. Neurons have a name (which is forced to be unique) and a set of inputs (which is actually a list of dictionaries). An important thing to clarify is that I never actually had to define this class, since CouchDB doesn’t demand a schema. Let’s make this a bit more concrete with three examples;

>>> nn['input_x']
<Document u'input_x'@u'82302167' {u'inputs': [{u'node': u'static', u'weight': u'474'}], u'increment': 0}>
>>> nn['1']
<Document u'1'@u'2027135756' {u'inputs': [{u'node': u'input_x', u'weight': 1}, {u'node': u'input_y', u'weight': -1}], u'increment': 0}>
>>> nn['output_x']
<Document u'output_x'@u'1458836188' {u'inputs': [{u'node': u'1', u'weight': 1}, {u'node': u'2', u'weight': 9.536375211640632e+179}, {u'node': u'3', u'weight': -194411155905.33661}], u'increment': 0}>

Hopefully that shows up ok on your browser. The variable “nn” is a link to the “neural_net” database on my couchdb server;

import couchdb

try:
    server = couchdb.Server('http://127.0.0.1:5984')
    nn = server['neural_net']
    len(nn)
except couchdb.client.ResourceNotFound:
    server.create('neural_net')
    nn = server['neural_net']
    len(nn)

I’ve printed out three neuron nodes. The first node, ‘input_x’, is part of the input layer and you’ll see it has a list ‘inputs’ with a single dictionary element { ‘node’: ‘…’, ‘weight’: ‘…’ }. I’ve opted to use the name “static” as a keyword to represent an input which doesn’t point to another node and use the ‘weight’ as the actual input value. The second node, ‘1’, is more of a typical neuron which would be considered a “second level neuron”. This takes two inputs, one from ‘input_x’ and one from ‘input_y’. The output of “1” will be;

output = 0.0
for input in nn['1']['inputs']:
    output += get_output(input['node']) * sigmoid(input['weight'])

Note I used a function called “get_output” to find the output value of a node. If the node is static, as ‘input_x’ is, then we can simply dereference it and take the “weight” value, but if the input is another “pure” neuron then it may have some calculations of its own to do. You can see how this works in practice by examining the final node, ‘output_x’. In this case we have to query several nodes just like node ‘1’ and let each do its calculation before we can output our value. So a call to get_output(‘output_x’) actually recurses through the various nodes in turn.
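The recursion described above can be tried out without a database at all. This is only a minimal sketch: a plain dict named nn stands in for the CouchDB database, sigmoid is assumed to be the standard logistic function (its definition isn't shown in this post), and the node values are made up for illustration.

```python
import math

# Hypothetical stand-in for the CouchDB database: node name -> document.
nn = {
    'input_x': {'inputs': [{'node': 'static', 'weight': '3'}]},
    'input_y': {'inputs': [{'node': 'static', 'weight': '1'}]},
    '1': {'inputs': [{'node': 'input_x', 'weight': 1},
                     {'node': 'input_y', 'weight': -1}]},
}


def sigmoid(x):
    """Standard logistic function (an assumption, not taken from the post)."""
    return 1.0 / (1.0 + math.exp(-x))


def get_output(name):
    """Recursively compute a node's output from its inputs."""
    output = 0.0
    for in_node in nn[name]['inputs']:
        if in_node['node'] == 'static':
            # Static inputs carry their value in the 'weight' field.
            output += float(in_node['weight'])
        else:
            # Otherwise recurse into the upstream node first.
            output += get_output(in_node['node']) * sigmoid(in_node['weight'])
    return output
```

Calling get_output('1') first resolves ‘input_x’ and ‘input_y’ and only then combines them, which is the same walk a call on ‘output_x’ would do, just one layer deeper.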

Let me take a moment to digress from “what I did” to talk about “what I almost did”. I’d intended for “get_output()” to be a CouchDB view, to take advantage of quick, asynchronous lookups. However, if this were done as a view then the map function would need to reference the database from inside emit() and I don’t think this is allowed, i.e. I’d need a map function something like;

Note this won’t, and to the best of my knowledge can’t, work;

function(doc) {
    if ( doc['inputs'].length > 0 ) {
        for ( i in doc['inputs'] ) {
            emit( doc['_id'], "_view/getoutput(doc['inputs'][i])" );
        }
    }
}

The broken part of that view is the value part of the emit function (remember emit produces key/value pairs). However, since we’re on the topic of couchdb views here’s one way we can build a view to see what inputs a node has;

function(doc) {
    if ( doc['inputs'].length > 0 ) {
        for ( i in doc['inputs'] ) {
            emit( doc['_id'], doc['inputs'][i] );
        }
    }
}

I also wanted a reduce function to combine the value parts under a single key (so it matched my “data structure”), so here’s the reduce;

function(keys, values, rereduce) {
    return values;
}
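To get a feel for what this map/reduce pair hands back when queried with group=True, here is a rough plain-Python emulation. It's a sketch under assumptions: the documents are invented, the function names are mine, and it ignores rereduce and everything else CouchDB does internally.

```python
from collections import defaultdict

# Hypothetical documents, shaped like the neurons shown earlier.
docs = [
    {'_id': '1', 'inputs': [{'node': 'input_x', 'weight': 1},
                            {'node': 'input_y', 'weight': -1}]},
    {'_id': 'input_x', 'inputs': [{'node': 'static', 'weight': '474'}]},
    {'_id': 'no_inputs', 'inputs': []},
]


def map_doc(doc):
    """Mirror of the map function: one (key, value) pair per input."""
    for in_node in doc.get('inputs', []):
        yield doc['_id'], in_node


def reduce_values(values):
    """Mirror of the reduce function: hand the values back unchanged."""
    return values


def grouped_view(documents):
    """Emulate a group=True query: one reduced row per distinct key."""
    by_key = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            by_key[key].append(value)
    return {key: reduce_values(values) for key, values in by_key.items()}


rows = grouped_view(docs)
```

Here rows['1'] holds both of node 1's inputs in a single list, which is the shape the querying code expects, and a document with no inputs produces no row at all.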

The great part of couchdb is that you can enter these views in its code window and get immediate feedback on what’s being produced! Now I’ll show you the sad part of my design; here’s how to query and act on the view;

# This loop gives us all the inputs to node: name
for input_nodes in db.view("/nnodes/inputs", group=True)[name]:
    # This loop gives us all the input nodes to node: name
    for in_node in input_nodes.value:
        if u'static' in in_node['node']:
            output += float(in_node['weight'])
        else:  # Not a static input
            output += get_output(in_node['node']) * sigmoid(in_node['weight'])

What’s sad to me is that it would be less code to query the documents directly;

node = db[str(name)]
for in_node in node['inputs']:
    if u'static' in in_node['node']:
        output += float(in_node['weight'])
    else:  # Not a static input
        output += get_output(in_node['node']) * sigmoid(in_node['weight'])

You’ll see that I used the “group=True” parameter that I mentioned in my previous post. This just made things match my conceptual model, but I wish the python library didn’t force me to dereference .key and .value to get at them (it should turn them into a dictionary instead). I’ll also mention that several times I got confused trying ['value'] instead of .value (something that wouldn’t matter in javascript, but the former seems more “pythonic”).
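If you share my wish for a dictionary, a few lines will build one from the rows. This is a sketch only: the Row namedtuple below is a hypothetical stand-in that models nothing beyond the .key and .value attributes mentioned above, and rows_to_dict is a helper name I made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the row objects a view query returns;
# only the .key and .value attributes are modelled here.
Row = namedtuple('Row', ['key', 'value'])


def rows_to_dict(rows):
    """Collapse an iterable of rows into a {key: value} dictionary."""
    return dict((row.key, row.value) for row in rows)


rows = [
    Row('1', [{'node': 'input_x', 'weight': 1}]),
    Row('input_x', [{'node': 'static', 'weight': '474'}]),
]
inputs_by_node = rows_to_dict(rows)
```

After that, inputs_by_node['1'] reads the way I kept trying to write it. Note this only makes sense for grouped results, where each key appears once; with repeated keys the later rows would overwrite the earlier ones.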

I think this is a clear example that I’ve got more to learn about views and better ways to represent my data structures. Here’s a great example, which I think has a lot of analogous uses, so read it often.

Back to my situation though, I’d thought maybe each node could store an array called “output_history” which could then be queried with an “increment” value (which would make it a simple emit() process). However, this was much more complicated than it was worth for an initial pass, since if the value didn’t exist it would still have to be calculated via a non-Map/Reduce function (because it would have to reference the database).
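For the curious, the “output_history” idea might look something like the sketch below. None of this is in my actual code: the node is a plain dict standing in for a CouchDB document, and cached_output and its compute argument are hypothetical names.

```python
def cached_output(node, increment, compute):
    """Return the node's output for `increment`, computing it only on a miss.

    `node` is a dict standing in for a CouchDB document and `compute` is a
    hypothetical zero-argument callable that does the real calculation.
    """
    history = node.setdefault('output_history', [])
    if increment < len(history) and history[increment] is not None:
        return history[increment]  # already stored: a simple lookup
    # Pad the list so that index == increment, then store the new value.
    history.extend([None] * (increment + 1 - len(history)))
    history[increment] = compute()
    return history[increment]


calls = []


def expensive():
    """Pretend calculation that records each time it runs."""
    calls.append('ran')
    return 0.5


node = {'inputs': []}
first = cached_output(node, 2, expensive)
second = cached_output(node, 2, expensive)
```

The second call never runs the calculation, which is the part a view could serve. The miss path is the awkward bit: it still needs ordinary code that can reach the other nodes.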

Before I get into much more detail let me show you the code so you can take a moment to look it over and formulate some thoughts.

I’ll be back with post 3 to try and tie it all together (including a step back to revisit the overall connections, talk more about Neural Nets and my impressions of couchdb).

Training Neural Nets with CouchDB – part 1

Monday, December 1st, 2008

My goal for this post is a bit technical and I’ll try warping both an artificial neural net, as well as my biological one, around an exploration of CouchDB, so read on if appropriate to your interests.

As you may have noticed in some of my earlier posts I’ve been playing with couchdb but it wasn’t until I started following Janl and reading some of his blog posts that things started to click into place.

I would be remiss if I didn’t also mention the fantastic Eric Florenzano, who seems to get this much more intuitively than I do, and the amazing jChris, who provides code to go along with his great ideas.

My desire to give back was piqued by Jan taking the time to answer a few desperate twitters I had. My first question was simple… why does some code say “map” while others use “emit”… simple answer; it was changed and “emit()” is now the proper syntax.

The second question occurred when I was querying a view with a map/reduce pair and couldn’t get the python library to return the key / value pairs to me as expected. As Jan states, if you pass “group=True” you can use results.key & results.value.

Ok, so that’s the simple Q&A recorded for posterity, but what about the neuron bending I promised? First let me say that if you want a tutorial on Neural Nets and AI programming you’ve come to the wrong place, and secondly this is my “answer”, which has evolved more than I care to admit.

So I’ll try not to take you through the brain process I went through but I’m sure we’ll both see how this could be made more “map/reduce”-y with future iterations. If you’ve got some of those insights I’d love to hear them because most of the examples I found already had that “AhhHA” moment made obvious.

One last disclaimer – I was looking to practice my jQuery and Django programming too so while this could definitely be made more “compact” it met my qualifications of building an “AJAX web application built with Django and utilizing CouchDB” application which was “simple”.

Now onward and upward! You’ll want to start a Django project to hold all this code, a process which is better covered in many other tutorials. The title for my project is “nn_click” and I created a “clicks” application underneath this directory.

So after you’ve got your project made, let’s start with the “front-end” which in this modern age is nearly always the web browser. Of course we need a webpage, and you can find mine here.

It’s nothing fancy but you’ll see there are two HTML elements which will report coordinates on the page. One, “#loc” obviously tracks the mouse and the other, “#guess” holds the contents of our AJAX call, which won’t work now given my blog’s hosting provider.

The magic happens thanks to two jQuery calls. The first simply links the ‘#loc’ element to report our mouse locations;

$(document).ready(function() {
    $(document).mousemove(function(e) {
        $('#loc').html(e.pageX + ', ' + e.pageY);
    });

While the second is where the magic happens when a click occurs;

    $(document).click(function(e) {
        var loc = $('#loc').html(e.pageX + ', ' + e.pageY);
        $('#loc').animate({left: e.pageX, top: e.pageY}, 450);

        // POST the click coordinates; the reply is parsed as JSON
        $.post("process_click",
            { "X": e.pageX, "Y": e.pageY },
            function (r, status) {
                var guess = $('#guess');
                guess.html(r.X + ', ' + r.Y);
                guess.animate({left: r.X, top: r.Y}, 450);
            },
            "json"
        );
    });
});

This function first sets the location information for #loc again, just in case, and then proceeds to move that HTML element to where you clicked (a fun effect). For this to work you have to have the CSS “position” attribute set to “absolute”, which is something I’ve done with inline CSS. That’s ugly and poor practice by my own style guidelines, but sufficient for this tutorial.

The call to the $.post(…) function handles the AJAX magic. It sends the click coordinates (which are really a measure of ‘X’ and ‘height’ rather than a strict X,Y interpretation) as POST data, and the final “json” argument tells jQuery to parse the reply as JSON. The part that reads “function (r, status) { … }” is then called when the Django view “def process_click(…)” completes (more on this later).
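It can help to see the two halves of that round trip as plain data. The sketch below (written for Python 3, with made-up coordinates) shows roughly what goes out as a form-encoded body and what the callback gets back once the JSON reply is parsed; the variable names are mine.

```python
import json
from urllib.parse import parse_qs, urlencode

# What the browser sends: the data object becomes a form-encoded body.
body = urlencode({'X': 120, 'Y': 45})

# What the server sees after decoding that body (values arrive as strings).
posted = {key: values[0] for key, values in parse_qs(body).items()}

# What the callback receives: the response text parsed as JSON.
response_text = '{ "X": 120, "Y": 45 }'
r = json.loads(response_text)
```

The values arrive on the server as strings, which is why the view has to convert them with int() before doing any arithmetic.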

So copy the HTML file to the “templates” directory under your Django project (make it if this doesn’t exist) and edit your urls.py file to include these two lines;

(r'^process_click/?', 'nn_click.clicks.views.process_click'),
(r'^(.*)/?', 'nn_click.clicks.views.index'),

The first line will receive our jQuery POST call, which has the mouse coordinates as the data, and the second will send anything else to our main page. So now we can edit the application’s views (vi clicks/views.py) and add the next phase of changes.
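The order of those two lines matters, because the second pattern is a catch-all. Here's a small sketch of first-match-wins resolution using only the re module; the resolve function is a hypothetical simplification, and it assumes the path is matched without its leading slash.

```python
import re

# The two patterns from urls.py, in order, paired with the view they name.
urlpatterns = [
    (r'^process_click/?', 'process_click'),
    (r'^(.*)/?', 'index'),
]


def resolve(path):
    """Return the view name of the first pattern that matches the path."""
    for pattern, view_name in urlpatterns:
        if re.search(pattern, path):
            return view_name
    return None
```

With this ordering resolve('process_click') picks the POST handler and everything else falls through to the index view. Swap the two lines and the catch-all would swallow the AJAX call as well.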

Here’s the necessary line to take care of the second url redirection;

from django.shortcuts import render_to_response

def index(request, something):
    return render_to_response('nn_click_template.html', { })

This simply takes our template and returns it with room for future data (the empty “{ }” part) if I need it later. That was simple, and if you start up your Django project you should be able to load the webpage and it will at least follow your mouse. Since I serve files from a different box than I use for development, try “python manage.py runserver 0.0.0.0:8000”.

Now let’s figure out how to process that POST data we get from our jQuery POST call.

Again in clicks/views.py add this code;

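from django.http import HttpResponse

def process_click(request):
    click_X = int(request.POST['X'])
    click_Y = int(request.POST['Y'])
    # Single quotes outside so the JSON keys can keep their double quotes
    return HttpResponse('{ "X": %s, "Y": %s }' % (click_X, click_Y))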

Now the POST call comes in, we parse out the X & Y values and return that data to the page (asynchronously). What should happen is that shortly after the click you’ll see the second tuple move to the same location. You may decide to divide the data by 2 or swap the X, Y values for a little bit of fun.
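If hand-formatting JSON makes you nervous, the standard json module can build the same text. This is just a sketch of that alternative, not what my view does: guess_payload is a hypothetical helper, and its swap flag exists only to try the “swap the X, Y values” experiment.

```python
import json


def guess_payload(click_x, click_y, swap=False):
    """Build the JSON text for the response body.

    `swap` is a hypothetical flag for the "swap the X, Y values" experiment.
    """
    if swap:
        click_x, click_y = click_y, click_x
    return json.dumps({'X': click_x, 'Y': click_y})
```

The string it returns could be handed to HttpResponse in place of the formatted one, and the jQuery callback would read r.X and r.Y exactly as before.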

That seems like a natural place to conclude part one and I know we didn’t get into the couchdb part but never fear I’ll have part two out shortly!

Update: I have this part running (minus the couchdb and neural net code since I can’t figure out how to get those working via fastCGI) but you can check out the basic idea here;

http://nn-click.thecapacity.org/