Converging Google Services

The always fabulous Louis Gray makes good points about using Gmail in a corporate environment, and it got me thinking in a different direction.

I began to consider: Why can’t I share emails the same way I can share RSS entries?

Google Reader allows you to publish all entries you tagged with specific keywords, or you can share entries on an individual basis. Yet, despite the obvious analogue, it’s impossible for me to share email messages or threads in the same manner!

I realize there are some privacy concerns, since RSS & Atom explicitly make things public and email does not. However, there’s no reason I couldn’t use an email-to-RSS gateway and easily violate that expected convention.
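
Just to show how low that bar is, here’s a minimal sketch of such a gateway, assuming a local mbox file (the path and feed metadata are made up for illustration):

import mailbox
from xml.sax.saxutils import escape

# Hypothetical path; point this at any mbox you can read.
MBOX_PATH = 'inbox.mbox'

print '<?xml version="1.0"?>'
print '<rss version="2.0"><channel><title>My Inbox</title>'
print '<link>http://example.com/</link><description>Mail as a feed</description>'
for msg in mailbox.mbox(MBOX_PATH):
    title = escape(msg['subject'] or '(no subject)')
    sender = escape(msg['from'] or 'unknown')
    print '<item><title>%s</title><author>%s</author></item>' % (title, sender)
print '</channel></rss>'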

I might also argue that opening up email to the same type of social collaboration we get via Google Reader could actually make things more secure.

For example, we could add default copyleft-style licensing, a la Creative Commons, or a per-email “off the record” flag like Google Talk’s. There could even be “free to share” delivery options rather than keeping everything on an “honor system”.

Posted in uncategorized | 6 Comments

Who’s Really Testing Chrome?

Just a quick gripe to share with anyone using Chrome.

For all of Chrome’s new high-performance design, there’s a very simple way to bring your tabbed experience to its knees: print something…

In my case it was a 100+ page PDF printed 2 pages per sheet, but I’m sure most anything of a decent size would work.

Disappointing to say the least, since printing is supposed to be a background task (that’s why we have spooling), not take front and center stage!

Posted in frustration, Google | Comments Off on Who’s Really Testing Chrome?

Mashing up the Dashboard

This post is for anyone interested in any of the Government Transparency initiatives. If you’ve been following this topic, then you’re probably aware that Vivek Kundra sees a dashboard as a way of accelerating the transparency and transformation of the Government.

After watching groups like the Sunlight Foundation and Change Congress work their magic, I’ve now begun seeing much of this transformation from the inside due to my new job.

However, I still endeavor to participate externally as well, and so I wanted to do some analysis on the public data.

To start, I wanted to import the USASpending.gov information into Google Spreadsheets, since I could kill two buzzwords at once by leveraging Cloud Computing services for Transparency!

I fought for a while with finding the easiest way to import the data, and I wanted to share what eventually worked for me.

First, create a new spreadsheet and name your tab appropriately. Then go to the USASpending Feeds page to select the specific data you want. I suggest starting with the Exhibit 300 information, since it’s typically a smaller dataset; my Exhibit 53 tab, with more than 1200 rows, has proven to be very slow.

Next, pick (and I highly suggest reordering) the data fields you want; you must then pick which Agency or Agencies you’d like info on. Again, consider that lots of data will be pretty slow.

Finally, select the CSV icon, which should open the download prompt for your browser. It’s unfortunate that the implementers didn’t use a plain link here, because you can’t simply copy the URL. Instead, I had to first download the file itself and then copy the originating URL into my clipboard (I was using Chrome, so how to do this will depend on your browser).

The URL should look like a much longer version of this:

http://it.usaspending.gov/customcode/build_feed.php?extype=300&select1=agencyName&columns%5B%5D=bureauName…

Now that we’ve got the URL we can import everything into our spreadsheet by selecting the A1 cell and entering:

=ImportData("<url>")

Where “<url>” is of course the long URL you copied earlier.

After a quick few seconds your data should be automagically imported!

For the Exhibit 300, things worked just great, but for the Exhibit 53 data I ended up with each cell of Column A holding the full data for each entry. So in B1 I simply entered: =SPLIT(A1, “,”) (note there’s a bug with Google where the quotes have to be double quotes, not single) and then things auto-populated left to right.

Unfortunately, the “SPLIT()” didn’t auto-populate downwards as well, and dragging the function down the full B column is very, very painful.
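
If dragging formulas gets too painful, one alternative is to skip the spreadsheet functions and do the split locally; here’s a minimal Python sketch, assuming the build_feed.php URL you copied earlier (the URL below is a truncated placeholder, so substitute your own):

import csv, urllib

# Substitute your own full build_feed.php URL; this one is truncated.
FEED_URL = 'http://it.usaspending.gov/customcode/build_feed.php?extype=300&...'

for row in csv.reader(urllib.urlopen(FEED_URL)):
    # the csv module handles the quoting and embedded commas,
    # so there's no need for SPLIT() gymnastics
    print '\t'.join(row)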

Happy Data Hacking!

Posted in business, frustration, Google, hacks | Comments Off on Mashing up the Dashboard

Cloudera’s Hadoop Education

A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.

I must say I was a little disappointed by the videos but I believe that’s because I’d seen Christophe Bisciglia’s lectures when he was still at Google.

However, the exercises are definitely something to get you thinking and are worth giving a shot. It’s sort of like ‘programming golf’, and I thought I’d share my version of the first map function vs. the packaged solution.

Here’s my map function:

import sys, re
WORDS = re.compile(r'(\w+)')
PARSER = re.compile(r'(.+?)\t(.+?)\n')

# each input line looks like "key<TAB>text"; emit a (word, key)
# pair for every word in the text portion
for input in sys.stdin.readlines():
    m = PARSER.match(input)
    if m:
        key = m.groups()[0]
        for word in WORDS.findall(m.groups()[1]):
            print "%s\t%s" % (word, key)

Cloudera’s version is:

import re
import sys
NONALPHA = re.compile("\W")

for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key

By definition they should produce the same output, i.e., the mappings should be identical, and barring buggy corner cases, mine certainly passed the test.
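
For anyone who wants to verify that equivalence themselves, here’s a rough sketch of the check I mean; the file names are hypothetical stand-ins for the two mappers and a sample input:

import subprocess

# Hypothetical file names; adjust to wherever the mappers and data live.
sample = open('sample_input.txt').read()

def run(mapper):
    # pipe the sample through a mapper and capture its output
    p = subprocess.Popen(['python', mapper],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    return p.communicate(sample)[0]

mine = sorted(run('my_mapper.py').splitlines())
theirs = sorted(run('cloudera_mapper.py').splitlines())
print 'identical' if mine == theirs else 'different'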

What I found interesting was my instinctual desire to let regexps do the work, whereas their version relies on a simple “split()” to carve up the input. Theirs is likely the faster solution, and given the massive amounts of data in large passes, it’s worth benchmarking.
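
Here’s the sort of quick-and-dirty micro-benchmark I have in mind, using timeit on a single made-up input line (the line and iteration count are arbitrary; real data would be the honest test):

import timeit

setup = r"""
import re
WORDS = re.compile(r'(\w+)')
PARSER = re.compile(r'(.+?)\t(.+?)\n')
NONALPHA = re.compile(r'\W')
line = 'doc1\tthe quick brown fox jumps over the lazy dog\n'
"""

# my regexp-based approach
regex_stmt = r"""
m = PARSER.match(line)
if m:
    key = m.groups()[0]
    words = WORDS.findall(m.groups()[1])
"""

# Cloudera's split-based approach
split_stmt = r"""
keyline = line.split('\t', 1)
if len(keyline) == 2:
    (key, rest) = keyline
    words = [w for w in NONALPHA.split(rest) if w]
"""

print 'regex:', timeit.Timer(regex_stmt, setup).timeit(100000)
print 'split:', timeit.Timer(split_stmt, setup).timeit(100000)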

However, although I’m clearly biased, I must admit I find mine easier to grok, and it should be more flexible; e.g., the input pattern could become a parameter rather than being hard-coded into the flow.
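
For instance, here’s a hedged sketch of that parameterization (the argument handling is my own invention, not part of the exercises):

import sys, re

# Hypothetical tweak: take the key/value pattern from the command line,
# falling back to the tab-separated layout used above.
PATTERN = sys.argv[1] if len(sys.argv) > 1 else r'(.+?)\t(.+?)\n'
PARSER = re.compile(PATTERN)
WORDS = re.compile(r'(\w+)')

for line in sys.stdin.readlines():
    m = PARSER.match(line)
    if m:
        key, text = m.groups()
        for word in WORDS.findall(text):
            print "%s\t%s" % (word, key)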

There’s certainly not a “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really, really short and easy to modify, but I thought others might find it interesting to realize that Perl doesn’t have an exclusive license on ‘TMTOWTDI’.

Posted in code, mapreduce, opensource, python | 1 Comment

CouchDB Performance – Too much TCP

It’s been a while since I ran my CouchDB performance test, but many of the comments I received suggested that updating my codebase should yield some significant performance improvements. Unfortunately, at the time I didn’t have spare cycles to invest in building the latest branches of Erlang, CouchDB and everything else, so I hadn’t previously been able to rerun my tests.

However, I started a new project today and, like most developers, I took some time to sharpen my tools before I felt sufficiently prepared to proceed. Of course, since one of my favorite tools is CouchDB itself, I checked in to see how it had been progressing, and I was thrilled to see that janl (with contributions from others, it looks like) had released a new version of the excellent CouchDBX bundle!

So after a round of updating the CouchDBX and couchdb Python library components, I decided to suffer a small distraction and give the new code a test drive.

I wanted to check my baseline, so here’s a rough time sample for the original, file-based, keywords code:

time ./finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real    0m0.329s
user    0m0.225s
sys    0m0.046s

I ran the initial load and it looks much the same as the previous test:

time ./couchdb_finding_keywords.py

real    28m16.430s
user    2m55.550s
sys    1m30.335s

So perhaps around 20% faster, though on a second test run this actually took more than 39 minutes!

Well, now that the load is out of the way, let’s see how our queries are looking.

After making a view, the results with wget aren’t any more promising than last time (note the view location has changed):

wget -O - http://localhost:5984/keywords/_design/finding/_view/word_count?group=true > /dev/null
--20:35:08--  http://localhost:5984/keywords/_design/finding/_view/word_count?group=true
           => `-'
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]

    [                     <=>                        ] 422,776       12.54K/s             

20:35:40 (12.94 KB/s) - `-' saved [422776]

Alas, it doesn’t look like most of the performance improvements have really paid off for this test case; in fact, every run I tried was slower than the last version.
Here’s a sample run which is fairly indicative of the rest:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m52.659s
user	0m0.702s
sys	0m0.441s

And again, with more of the full debugging info:

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]
>>>---- Begin profiling print
         788 function calls (709 primitive calls) in 51.297 CPU seconds

   Ordered by: internal time, call count
   List reduced from 118 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   51.092   25.546   51.092   25.546 socket.py:278(read)
        2    0.049    0.024    0.049    0.024 decoder.py:320(raw_decode)
        1    0.040    0.040    0.040    0.040 ic.py:182(__contains__)
       14    0.036    0.003    0.036    0.003 socket.py:321(readline)
        1    0.020    0.020    0.020    0.020 couchdb_finding_keywords.py:61(build_prob_dict)
        1    0.011    0.011    0.019    0.019 ic.py:1()
        1    0.011    0.011   51.297   51.297 couchdb_finding_keywords.py:68(find_keyword)
        8    0.008    0.001    0.008    0.001 :1(connect)
        1    0.007    0.007    0.066    0.066 urllib.py:1296(getproxies_internetconfig)
        1    0.005    0.005   51.228   51.228 couchdb_finding_keywords.py:41(all_word_count)
        1    0.003    0.003    0.069    0.069 urllib.py:1329(getproxies)
        1    0.003    0.003    0.003    0.003 Res.py:1()
        1    0.002    0.002    0.002    0.002 File.py:1()
        1    0.001    0.001    0.002    0.002 macostools.py:5()
        2    0.001    0.001    0.001    0.001 socket.py:229(close)
        2    0.001    0.000    0.002    0.001 httplib.py:659(connect)
        2    0.001    0.000    0.005    0.002 httplib.py:224(readheaders)
     11/6    0.001    0.000    0.001    0.000 sre_parse.py:385(_parse)
        1    0.000    0.000    0.000    0.000 ic.py:161(__init__)
        2    0.000    0.000    0.000    0.000 httplib.py:323(__init__)

>>>---- End profiling print

real	0m52.023s
user	0m0.732s
sys	0m0.437s

I’m no Erlang expert, but seeing that much time spent in socket reads still makes me suspect that some TCP-level tuning (window size & buffering) might be helpful.
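
For what it’s worth, here’s a minimal sketch of the client-side knob I mean, pulling the same view over a raw socket with an enlarged receive buffer (the 256 KB figure is an arbitrary illustration, not a tuned recommendation):

import socket

# enlarge the receive buffer before connecting; 256 KB is a guess
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
s.connect(('localhost', 5984))
s.send('GET /keywords/_design/finding/_view/word_count?group=true HTTP/1.0\r\n\r\n')

chunks = []
while True:
    data = s.recv(65536)
    if not data:
        break
    chunks.append(data)
s.close()
print 'received %d bytes' % len(''.join(chunks))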

As a final note, I did a database compaction and reran the query, which helped significantly compared to the worst-case 0.9 time but at best only matched 0.8.

time ./couchdb_finding_keywords.py
[('Hacker', 249160.0), ('Techcrunch', 249160.0)]

real	0m34.605s
user	0m0.687s
sys	0m0.394s

You can find the test script, with changes to work with the slightly different view URLs, and if you’d like to recreate the test, all you need to do (beyond setting up CouchDB) is:

  1. Swap the comments on the last two lines to run “load_db()”
  2. Create a map / reduce wordcount view (see the sketch after this list)
  3. Change the “view_url” parameter on line 25
  4. Invert the comments again to just run “find_keyword()”
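
For step 2, here’s a rough sketch of the kind of wordcount view I mean, pushed with the couchdb Python library. The design doc and view names match the URL used above, but the “text” field is an assumption about how the load script stores each entry, so adjust the map accordingly:

import couchdb

server = couchdb.Server('http://localhost:5984/')
db = server['keywords']

# Assumption: each document keeps its raw text in a 'text' field;
# change the map to match whatever load_db() actually saves.
db['_design/finding'] = {
    'views': {
        'word_count': {
            'map': r"""function(doc) {
                var words = (doc.text || '').split(/\W+/);
                for (var i = 0; i < words.length; i++) {
                    if (words[i]) emit(words[i], 1);
                }
            }""",
            'reduce': 'function(keys, values) { return sum(values); }',
        }
    }
}
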
Posted in code, couchdb, python | 1 Comment