Links as Code

John Willis’ “Infrastructure as Code” should be a startling epiphany for anyone who has long neglected process and people in favor of technological solutions. Still, I hope no one here needs convincing about the value of institutionalizing collective knowledge.

However, I wonder about a critical level of infrastructure maintenance that seems to be missed: document maintenance.

I’m sure everyone has experienced the frustration of reading a document with an invalid URL, but should this be accepted?

Should documents not be kept in repositories as well? Why not take the same proactive approach to maintaining links as we do to “Not breaking the Build” when programming?

So why doesn’t your team have a utility to scan internal documents and links when they propose changing page structures, before they’re made live?
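
Such a utility doesn’t need to be elaborate. Here is a minimal sketch, assuming the documents are plain text and that the proposed page structure can be expressed as a set of valid paths. The function names (`extract_links`, `find_broken_links`), the host and the sample documents are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches absolute http(s) URLs; trailing punctuation is trimmed separately.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_links(text):
    """Return every http(s) URL found in a document's text."""
    return [url.rstrip(".,;") for url in URL_PATTERN.findall(text)]


def find_broken_links(documents, site_host, proposed_paths):
    """Report links into site_host whose path is missing from proposed_paths.

    documents maps a document name to its text; proposed_paths is the set
    of paths that will exist once the restructuring goes live.
    """
    broken = []
    for name, text in sorted(documents.items()):
        for url in extract_links(text):
            parts = urlparse(url)
            if parts.netloc == site_host and parts.path not in proposed_paths:
                broken.append((name, url))
    return broken


if __name__ == "__main__":
    docs = {
        "runbook.txt": "Deploy steps: http://wiki.example.com/ops/deploy.",
        "faq.txt": "See http://wiki.example.com/old/faq and http://other.example.org/x",
    }
    new_structure = {"/ops/deploy", "/help/faq"}
    for doc, url in find_broken_links(docs, "wiki.example.com", new_structure):
        print("%s: %s would break" % (doc, url))
```

Run as part of a review step, a non-empty report could fail the change in the same way a failing test fails the build.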

Posted in enterprise, frustration | 5 Comments

Converging Google Services

The always fabulous Louis Gray makes good points about using GMail in a corporate environment and got me thinking in a different direction.

I began to consider: why can’t I share emails the same way I can share RSS entries?

Google Reader allows you to publish all entries you tagged with specific keywords, or you can share entries on an individual basis. Yet, despite the obvious analogue, it’s impossible for me to share email messages or threads in the same manner!

I realize there are some privacy concerns, since RSS & Atom explicitly make things public and email does not. However, there’s no reason I couldn’t use an email to RSS gateway and violate the expected convention easily.

I might also argue that opening up email to the same type of social collaboration we get via Google Reader could actually make things more secure.

For example, Google could add a default copyleft-style license, a la Creative Commons, or a per-email “off the record” flag like the one in Google Talk. There could even be “free to share” delivery options rather than keeping everything on an “honor system”.

Posted in uncategorized | 6 Comments

Who’s Really Testing Chrome?

Just a quick gripe to share with anyone using Chrome.

For all of Chrome’s new high performance design, there’s a very simple way to bring your tabbed experience to its knees: print something…

In my case it was a 100+ page PDF printed 2 pages per sheet, but I’m sure most anything of a decent size would work.

Disappointing to say the least, since printing is supposed to be a background task (that’s why we have spooling), not take front and center stage!

Posted in frustration, Google | Comments Off

Mashing up the Dashboard

This post is for anyone interested in any of the Government Transparency initiatives. If you’ve been following this topic then you’re probably aware that Vivek Kundra sees a dashboard as a way of accelerating the transparency and transformation of the Government.

After watching groups like the Sunlight Foundation and Change Congress work their magic, I’ve now begun seeing much of this transformation from the inside due to my new job.

However, I still endeavor to participate externally as well and so I wanted to do some analysis on the public data.

In order to start, I wanted to import the USASpending.gov information into Google spreadsheets, since I can kill two buzzwords at once by leveraging Cloud Computing services for Transparency!

For a while, I fought with the easiest way to import the data and wanted to share what eventually worked for me.

First, create a new spreadsheet and name your tab appropriately. Then go to the USASpending Feeds page to select the specific data you want. I suggest starting with the Exhibit 300 information since it’s typically a smaller dataset and my Exhibit 53 tab with more than 1200 rows has proven to be very slow.

Next pick, and I highly suggest reordering, the data fields you want. You must then pick which Agency or Agencies you’d like info on. Again, consider that lots of data will be pretty slow.

Finally, select the CSV icon which should open the download prompt for your browser. It’s unfortunate that the implementers didn’t use a dynamic tag here because you can’t simply copy the URL. Instead I had to first download the file itself and then copy the originating URL into my clipboard (I was using Chrome so how to do this will depend on your browser).

The URL should look like a much longer version of this:

http://it.usaspending.gov/customcode/build_feed.php?extype=300&select1=agencyName&columns%5B%5D=bureauName…

Now that we’ve got the URL we can import everything into our spreadsheet by selecting the A1 cell and entering:

=ImportData("<url>")

Where “<url>” is of course the long URL you copied earlier.

After a quick few seconds your data should be automajically imported!

For the Exhibit 300, things worked just great but for the Exhibit 53 data I ended up with each cell of Column A holding the full data for each entry. So in B1 I simply entered: =SPLIT(A1, ",") (note there’s a bug with Google where the quotes have to be double quotes, not single) and then things auto-populated left to right.

Unfortunately, the “SPLIT()” didn’t auto-populate downwards as well, and dragging the function down the full B column is very, very painful.
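
If dragging the formula gets too painful, one alternative is to do the split outside the spreadsheet. This is only a minimal sketch, assuming the downloaded feed is an ordinary CSV file; the sample rows are made up for illustration. Python’s standard `csv` module does the job of “SPLIT()” here and also copes with commas inside quoted cells.

```python
import csv
import io


def split_rows(raw_text):
    """Split each CSV line into its cells, honoring quoted commas."""
    return list(csv.reader(io.StringIO(raw_text)))


if __name__ == "__main__":
    # Made-up sample in the shape of an agency/bureau feed.
    sample = 'agencyName,bureauName\n"Department of Example","Bureau A, East"\n'
    for row in split_rows(sample):
        print(row)
```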

Happy Data Hacking!

Posted in business, frustration, Google, hacks | Comments Off

Cloudera’s Hadoop Education

A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.

I must say I was a little disappointed by the videos but I believe that’s because I’d seen Christophe Bisciglia’s lectures when he was still at Google.

However, the exercises are definitely something to get you thinking and are worth giving a shot. It’s sort of like ‘programming golf‘ and I thought I’d share my version of the first map function vs. the packaged solution.

Here’s my map function:

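import sys, re

WORDS = re.compile(r'(\w+)')               # runs of word characters
PARSER = re.compile('(.+?)\t(.+?)\n')      # key<TAB>text<NEWLINE>

for input in sys.stdin.readlines():
    m = PARSER.match(input)
    # Skip lines that don't look like "key<TAB>text"
    if m:
        key = m.groups()[0]
        # Emit "word<TAB>key" for every word in the text
        for word in WORDS.findall(m.groups()[1]):
            print("%s\t%s" % (word, key))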

Cloudera’s version is:

import re
import sys
NONALPHA = re.compile("\W")

for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key

By definition they should produce the same output, i.e. the mappings should be identical, and barring buggy corner cases mine certainly passed the test.

What I found interesting was my instinctual desire to let regexps do the work, whereas their version relies on a simple “split()” to sort the input. It’s likely a faster solution and given the massive amounts of data for large data passes, it’s worth benchmarking.
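
A rough way to run that benchmark is sketched below. It wraps both approaches in functions that return their pairs instead of printing them, checks that they agree, and times them with `timeit`. The function names (`map_regex`, `map_split`) and the synthetic input are my own assumptions, and timings on a laptop say little about behavior on a real cluster.

```python
import re
import timeit

WORDS = re.compile(r"(\w+)")
PARSER = re.compile("(.+?)\t(.+?)\n")
NONALPHA = re.compile(r"\W")


def map_regex(lines):
    """Regexp-driven mapper: returns (word, key) pairs instead of printing."""
    pairs = []
    for line in lines:
        m = PARSER.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                pairs.append((word, key))
    return pairs


def map_split(lines):
    """split()-driven mapper in the style of the packaged solution."""
    pairs = []
    for line in lines:
        keyline = line.split("\t", 1)
        if len(keyline) == 2:
            key, text = keyline
            for w in NONALPHA.split(text):
                if w:
                    pairs.append((w, key))
    return pairs


if __name__ == "__main__":
    # Synthetic input: "key<TAB>text<NEWLINE>", repeated to get measurable times.
    lines = ["doc%d\tthe quick brown fox jumps over the lazy dog\n" % i
             for i in range(10000)]
    assert map_regex(lines) == map_split(lines)
    for fn in (map_regex, map_split):
        seconds = timeit.timeit(lambda: fn(lines), number=5)
        print("%s: %.3f s for 5 passes" % (fn.__name__, seconds))
```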

However, although I’m clearly biased, I must admit I found mine easier to grok and I think it should be more flexible, e.g. perhaps the input pattern could become a parameter rather than hard-coded into the flow.
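
As a sketch of what that flexibility might look like, the record pattern below becomes an argument with the tab-separated form as its default. The `map_lines` name and the command-line handling are hypothetical; the only requirement is that the pattern captures two groups, the key and then the text.

```python
import re
import sys

WORDS = re.compile(r"(\w+)")
DEFAULT_PATTERN = "(.+?)\t(.+?)\n"


def map_lines(lines, pattern=DEFAULT_PATTERN):
    """Yield "word<TAB>key" strings; pattern must capture (key, text)."""
    parser = re.compile(pattern)
    for line in lines:
        m = parser.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                yield "%s\t%s" % (word, key)


if __name__ == "__main__":
    # Optional first argument overrides the record pattern,
    # e.g. a pipe-separated form: (.+?)\|(.+?)\n
    pattern = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_PATTERN
    for out in map_lines(sys.stdin, pattern):
        print(out)
```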

There’s certainly not a “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really short and easy to modify, but I thought others might find it interesting to realize that Perl doesn’t have an exclusive license on ‘TMTOWTDI’.

Posted in code, mapreduce, opensource, python | 1 Comment

CouchDB Performance – Too much TCP

It’s been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some significant performance improvements. Unfortunately, at the time I didn’t have spare cycles to invest in building the latest branches of erlang, couchdb and everything else, so I hadn’t previously been able to rerun my tests.

However, I started a new project today and, like most developers, I took some time to sharpen my tools before I felt sufficiently prepared to proceed. Of course, since one of my favorite tools is CouchDB itself, I checked in to see how it had been progressing and I was thrilled to see that Janl (and it looks like others have contributed) had released a new version of the excellent DBX bundle!

So after a round of updating DBX and CouchDB python library components, I decided to suffer a small distraction and give the new code a test drive.
I wanted to check my baseline, so here’s a rough time sample for the original, file based, keywords code:

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real    0m0.329s
user    0m0.225s
sys    0m0.046s

I ran the initial load and it looks much the same as the previous test:

time ./couchdb_finding_keywords.py

real    28m16.430s
user    2m55.550s
sys    1m30.335s

So perhaps around 20% faster, though on a second test run this actually took more than 39 minutes!

Well, now that the load is out of the way, let’s see how our queries are looking.

After making a view, the results with wget aren’t any more promising than last time (note the view location has changed):

wget -O - http://localhost:5984/keywords/_design/finding/_view/word_count?group=true > /dev/null
--20:35:08--  http://localhost:5984/keywords/_design/finding/_view/word_count?group=true
           => `-'
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]

    [                     <=>                        ] 422,776       12.54K/s             

20:35:40 (12.94 KB/s) - `-' saved [422776]

Alas, it doesn’t look like most of the performance improvements have really paid off for this test case; in fact, every run I tried was slower than with the last version.
Here’s a sample run which is fairly indicative of the rest:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m52.659s
user	0m0.702s
sys	0m0.441s

And again, with more of the full debugging info:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
>>>---- Begin profiling print
         788 function calls (709 primitive calls) in 51.297 CPU seconds

   Ordered by: internal time, call count
   List reduced from 118 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   51.092   25.546   51.092   25.546 socket.py:278(read)
        2    0.049    0.024    0.049    0.024 decoder.py:320(raw_decode)
        1    0.040    0.040    0.040    0.040 ic.py:182(__contains__)
       14    0.036    0.003    0.036    0.003 socket.py:321(readline)
        1    0.020    0.020    0.020    0.020 couchdb_finding_keywords.py:61(build_prob_dict)
        1    0.011    0.011    0.019    0.019 ic.py:1()
        1    0.011    0.011   51.297   51.297 couchdb_finding_keywords.py:68(find_keyword)
        8    0.008    0.001    0.008    0.001 :1(connect)
        1    0.007    0.007    0.066    0.066 urllib.py:1296(getproxies_internetconfig)
        1    0.005    0.005   51.228   51.228 couchdb_finding_keywords.py:41(all_word_count)
        1    0.003    0.003    0.069    0.069 urllib.py:1329(getproxies)
        1    0.003    0.003    0.003    0.003 Res.py:1()
        1    0.002    0.002    0.002    0.002 File.py:1()
        1    0.001    0.001    0.002    0.002 macostools.py:5()
        2    0.001    0.001    0.001    0.001 socket.py:229(close)
        2    0.001    0.000    0.002    0.001 httplib.py:659(connect)
        2    0.001    0.000    0.005    0.002 httplib.py:224(readheaders)
     11/6    0.001    0.000    0.001    0.000 sre_parse.py:385(_parse)
        1    0.000    0.000    0.000    0.000 ic.py:161(__init__)
        2    0.000    0.000    0.000    0.000 httplib.py:323(__init__)

>>>---- End profiling print

real	0m52.023s
user	0m0.732s
sys	0m0.437s

I’m no erlang expert, but seeing that much time spent in socket calls still makes me suspect that some TCP level tuning (window size & buffering) might be helpful.

As a final note, I did a database compaction and reran the query, which helped significantly compared to the worst case 0.9 time but at best only matched 0.8.

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m34.605s
user	0m0.687s
sys	0m0.394s

The test script is available, with changes to work with the slightly different view URLs. If you’d like to recreate the test, all you will need to do (beyond setting up couchdb) is:

  1. Swap the comments on the last two lines to run “load_db()”
  2. Create a map / reduce wordcount view
  3. Change the “view_url” parameter on line 25
  4. Invert the comments again to just run “find_keyword()”
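
For step 2, a word-count view might look something like the sketch below. The design document name `finding` and the view name `word_count` come from the view URL used earlier; the document shape (one document per word occurrence, with a `word` field) is purely an assumption, so adjust the map function to match whatever the load step actually stores.

```python
import json

# Map emits (word, 1) per document; reduce sums the counts per key.
# Assumes each document has a "word" field, which is a guess at the schema.
MAP_FUNCTION = "function(doc) { if (doc.word) { emit(doc.word, 1); } }"
REDUCE_FUNCTION = "function(keys, values, rereduce) { return sum(values); }"

design_doc = {
    "_id": "_design/finding",
    "language": "javascript",
    "views": {
        "word_count": {"map": MAP_FUNCTION, "reduce": REDUCE_FUNCTION},
    },
}

# Database and host taken from the wget example earlier in the post.
view_url = "http://localhost:5984/keywords/_design/finding/_view/word_count?group=true"

if __name__ == "__main__":
    # Print the JSON body that would be saved as the design document.
    print(json.dumps(design_doc, indent=2, sort_keys=True))
```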

Posted in code, couchdb, python | 1 Comment

A simple twitter library in python

I’ve been working on a project built on Google App Engine and I’m relying on twitter to mediate some of the interaction with my end users.

What I find great about the growing prevalence of social interfaces is that I don’t have to focus predominantly on coding an interface, and with so many clients available my users can interact in whatever way is most appropriate, e.g. from a mobile phone or a desktop client.

Unfortunately, the standard python-twitter library doesn’t readily run under GAE because of some library issues. Originally, I was looking at providing some code changes for it but it’s a spiderweb of more abstractions than I think the problem deserves.

In the process of building my own library I found out that Avinash had figured out how to set up the authentication properly, so I built upon his work and added a few other functions I needed.

We all know about twitter’s growing popularity so I thought I’d share my version as well in case it proved helpful to anyone. Twitter provides a great mechanism to decouple your interface from your backend code and I hope to see many more smart systems to come!

Posted in code, python | 3 Comments

Hadoop Benchmark and CouchDB Implications

Although I don’t write much about it directly, I’m a big fan of the MapReduce approach to computing and data mining. I love the fluid manner with which it encourages partitioning and how little code is required to focus directly on problem solving vs. setup and communication. Also, as Paco admits, it’s familiar to anyone who’s had an enterprise mainframe experience.

In the microcosm I’ve found myself using more of the functional programming methods that python provides, and I’m always interested in big announcements like Amazon’s new Elastic MapReduce. However, I think Hadoop has been an exciting development and I’ve been tracking it for a while though unfortunately I lack a large cluster, and more importantly a large problem to leverage it on.

Thus, you can imagine my skepticism when a recent paper, co-authored by Microsoft, was released stating how poor Hadoop performance was compared to some RDBMS solutions.

Having worked with many product and marketing teams I’m aware that benchmarks don’t always translate into real world behavior. However, I’ve also worked with a number of benchmark teams and I know that these technical experts have an honest integrity to their work.

So I took a book reading break and sat down with the paper, aptly titled “A Comparison of Approaches to Large-Scale Data Analysis” to see what I could discern. It’s a well written paper, is actually straightforward at only 14 pages, and often frank about some of the pros, cons and issues they encountered for all of the products during their tests.

Not being a hadoop expert or having a 100 node cluster I can’t really argue or recreate the details (go read the paper) but I will mention that there are some interesting acts of naivety. For example they “cat” the output of the SQL command to a file but instead of aggregating the individual output logs with:

hadoop fs -cat output/* > combined_file

they force a final map/reduce phase even though earlier in the paper they admit the Java overhead for short tasks is significant.

However, I think regardless of the outcome (and I don’t know of anyone who’s tried to refute the tests) there are many lessons here for CouchDB.

The team split most of the tests into two types of environments, one where the data per node was constant (533MB / node) and the other where the data per cluster was constant (1TB overall). Specifically, in the analytical benchmark example (Section 4.3) they mine web server logs and data.

CouchDB has a fantastic ability to replicate data between instances but unfortunately the only way you could aggregate knowledge across all your data would be to (a) run multiple queries against the view on each server or (b) aggregate all this data to a central CouchDB instance.
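
Option (a) means the client has to do the final reduce itself. A minimal sketch of that merge step, assuming each server has already been queried with `?group=true` and returned the usual `rows` list of key/value pairs; the `merge_view_rows` name and the sample rows are hypothetical, and the HTTP fetching is left out.

```python
from collections import Counter


def merge_view_rows(per_server_rows):
    """Sum grouped view rows ({"key": ..., "value": ...}) across servers."""
    totals = Counter()
    for rows in per_server_rows:
        for row in rows:
            totals[row["key"]] += row["value"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical "rows" arrays as returned by ?group=true on two servers.
    server_a = [{"key": "Hacker", "value": 3}, {"key": "Techcrunch", "value": 1}]
    server_b = [{"key": "Hacker", "value": 2}, {"key": "News", "value": 4}]
    print(merge_view_rows([server_a, server_b]))
```

This only works because a count can be summed after the fact; a reduce that isn’t associative in this way would need more care.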

I can easily imagine a scenario where this fragmentation could occur (say multiple sharded databases) and picture that as you scale (b) will quickly become impossible. Imagine trying to aggregate every tweet back to a main store: this would quickly require immense disk space and likely commercial storage solutions (which defeats the point of commodity hardware).

I don’t think CouchDB needs to morph into a parallelized database but I believe this distributed & aggregated query environment needs to develop for it to function at the envisioned scale.

Finally, let me say that I’m glad couchdb didn’t end up written in Java and I’m looking forward to seeing if implementations like Disco can further accelerate the performance of mapreduce systems. I also think Amazon’s offering will make it even easier to afford 1000s of hadoop nodes instead of a 100 node license for a commercial RDBMS environment (which is certainly not something they mention in their evaluation).

Posted in couchdb, mapreduce | Comments Off

BookList – Entry 6 & 7: “Outliers” & “Talent is Overrated”

Here’s a video review I saw on OpenCulture for two books which I really enjoyed, “Outliers” and “Talent is Overrated”.

It’s a little tongue in cheek, with a bit of profanity, but it made me laugh and remember how much I loved these books.

Both are great, and I wouldn’t want to have to pick a favorite, though I love Gladwell’s ability to make complex concepts and their implications so readable. If this is a topic you’re interested in then plan on reading them as a pair; in my case I read “Outliers” second.

Unlike the video’s humorous conclusion, that we still have the fates to blame, I subscribe to the Vaynerchuk belief that it’s up to you to “Crush It”. I also think that if you don’t “make it” then it’s less about fault and more about desire and whether or not that perceived outcome was appropriate.

Regardless of how you perceive Fate, Destiny or Karma, I find it inspirational to believe “you can do it” is a creed experts think you can live by; just don’t stop.

Posted in books | Comments Off

BookList – Entry 8: Wired for War

If you’ve seen either of my recent robotics projects then you might suspect I have some slight fascination with the more exotic forms our technology can take. Which is not to say that I don’t have reservations, nor do I think there’s anything glamorous about the more sinister forms such innovations can take.

Which is why I was really interested to read Singer’s book, Wired for War.


After having seen him at TED I came to believe that he was someone who could appreciate the duality such developments bring. It’s not that I believe Skynet is near, and many of these new ‘bots are really cool. Rather, it’s actually the human side of these components which causes concern: do we really think weapons of war should be the same as playing on your gaming system?

Singer more than delivers, both illuminating the fascinatingly secret world of military robotics and raising some serious moral and social implications of such situations. For example, Predator drones are flown by ‘combatants’, which technically makes them valid military targets; however, these operators never leave US soil and often go home to their families at night.

There’s really so much more to consider. If you find these topics interesting then I can’t recommend this book enough, so instead I’ll leave you with the following thought:

In the arts of peace Man is a bungler. I have seen his cotton factories and the like, with machinery that a greedy dog could have invented if it had wanted money instead of food. I know his clumsy typewriters and bungling locomotives and tedious bicycles: they are toys compared to the Maxim gun, the submarine torpedo boat. There is nothing in Man’s industrial machinery but his greed and sloth: his heart is in his weapons. This marvellous force of Life of which you boast is a force of Death: Man measures his strength by his destructiveness.
– George Bernard Shaw

Posted in books, technology | 3 Comments