A simple Twitter library in Python

I’ve been working on a project built on Google App Engine and I’m relying on Twitter to mediate some of the interaction with my end users.

What I find great about the growing prevalence of social interfaces is that I don’t have to focus predominantly on coding an interface, and with so many clients available my users can interact in whatever way is most appropriate, e.g. from a mobile phone or a desktop client.

Unfortunately, the standard python-twitter library doesn’t readily run under GAE because of some library issues. Originally, I was looking at providing some code changes for it, but it’s a spiderweb of more abstractions than I think the problem deserves.

In the process of building my own library I found out that Avinash had figured out how to set up the authentication properly, so I built upon his work and added a few other functions I needed.

We all know about Twitter’s growing popularity, so I thought I’d share my version as well in case it proves helpful to anyone. Twitter provides a great mechanism to decouple your interface from your backend code and I hope to see many more smart systems to come!


Hadoop Benchmark and CouchDB Implications

Although I don’t write much about it directly, I’m a big fan of the MapReduce approach to computing and data mining. I love the fluid manner in which it encourages partitioning and how little code is required, so you can focus directly on problem solving vs. setup and communication. Also, as Paco admits, it’s familiar to anyone who’s had enterprise mainframe experience.

In the microcosm, I’ve found myself using more of the functional programming methods that Python provides, and I’m always interested in big announcements like Amazon’s new Elastic MapReduce. I think Hadoop has been an exciting development and I’ve been tracking it for a while, though unfortunately I lack a large cluster and, more importantly, a large problem to leverage it on.
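To give a flavour of what that looks like in the small, here is a minimal sketch of a word count written with Python's built-in map and functools.reduce. The function names (map_phase, reduce_phase, word_count) and the sample lines are my own illustration, not code from any of the projects mentioned here, and a single-process toy like this only mirrors the shape of a MapReduce job, not its distribution.

```python
from collections import Counter
from functools import reduce


def map_phase(line):
    """Map step: turn one line of text into per-word counts."""
    return Counter(line.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines):
    """Apply the map step to every line, then fold the partial results together."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())


# Hypothetical sample input, purely for illustration.
sample = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(sample)
```

Because each line is mapped independently and the reduce step is associative, the same two functions could in principle be run over separate partitions and the partial counts merged afterwards, which is the property that makes the approach partition so easily.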

Thus, you can imagine my skepticism when a recent paper, co-authored by Microsoft, was released stating how poor Hadoop performance was compared to some RDBMS solutions.

Having worked with many product and marketing teams I’m aware that benchmarks don’t always translate into real world behavior. However, I’ve also worked with a number of benchmark teams and I know that these technical experts have an honest integrity to their work.

So I took a break from book reading and sat down with the paper, aptly titled “A Comparison of Approaches to Large-Scale Data Analysis”, to see what I could discern. It’s well written, straightforward at only 14 pages, and often frank about the pros, cons and issues the authors encountered with all of the products during their tests.

Not being a Hadoop expert or having a 100 node cluster, I can’t really argue with or recreate the details (go read the paper), but I will mention that there are some interesting acts of naivety. For example, they “cat” the output of the SQL command to a file, but instead of aggregating the individual Hadoop output logs with:

hadoop fs -cat output/* > combined_file

they force a final map/reduce phase, even though earlier in the paper they admit the Java overhead for short tasks is significant.

However, I think regardless of the outcome (and I don’t know of anyone who’s tried to refute the tests) there are many lessons here for CouchDB.

The team split most of the tests into two types of environments, one where the data per node was constant (533MB / node) and the other where the data per cluster was constant (1TB overall). Specifically, in the analytical benchmark example (Section 4.3) they mine webserver logs and data.

CouchDB has a fantastic ability to replicate data between instances, but unfortunately the only way you could aggregate knowledge across all your data would be to (a) run multiple queries against the view on each server or (b) aggregate all this data to a central CouchDB instance.

I can easily imagine a scenario where this fragmentation could occur (say, multiple sharded databases) and picture that as you scale, (b) will quickly become impossible. Imagine trying to aggregate every tweet back to a main store: this would quickly require immense disk space and likely commercial storage solutions (which defeats the point of commodity hardware).

I don’t think CouchDB needs to morph into a parallelized database, but I believe this distributed & aggregated query environment needs to develop for it to function at the envisioned scale.
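Option (a), querying the view on each server and combining the answers yourself, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name merge_view_rows is hypothetical, the sample rows are made up, and I assume each server has already returned the rows of a grouped reduce view as key/value pairs whose values can simply be summed. Fetching the rows over HTTP is left out entirely.

```python
from collections import defaultdict


def merge_view_rows(per_node_rows):
    """Re-reduce grouped view rows from several servers by summing values per key.

    per_node_rows is a list with one entry per server; each entry is a list of
    {"key": ..., "value": ...} dicts, assumed to be already fetched.
    """
    totals = defaultdict(int)
    for rows in per_node_rows:
        for row in rows:
            totals[row["key"]] += row["value"]
    return dict(totals)


# Made-up page hit counts from two hypothetical shards.
node_a = [{"key": "/index.html", "value": 120}, {"key": "/about.html", "value": 7}]
node_b = [{"key": "/index.html", "value": 80}, {"key": "/contact.html", "value": 3}]
merged = merge_view_rows([node_a, node_b])
```

This only works when the reduce function can be re-applied to its own output (sums and counts, for example), and the client still has to contact every server for every question, which is exactly the part I’d like to see handled by the database environment rather than by hand.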

Finally, let me say that I’m glad CouchDB didn’t end up written in Java, and I’m looking forward to seeing if implementations like Disco can further accelerate the performance of MapReduce systems. I also think Amazon’s offering will make it even easier to afford thousands of Hadoop nodes instead of a 100 node license for a commercial RDBMS environment (which is certainly not something they mention in their evaluation).


BookList – Entry 6 & 7: “Outliers” & “Talent is Overrated”

Here’s a video review I saw on OpenCulture for two books which I really enjoyed, “Outliers” and “Talent is Overrated”.

It’s a little tongue in cheek, with a bit of profanity, but it made me laugh and remember how much I loved these books.

Both are great, and I wouldn’t want to have to pick a favorite, though I love Gladwell’s ability to make complex concepts and their implications so readable. If this is a topic you’re interested in, then plan on reading them as a pair; in my case I read “Outliers” second.

Unlike the video’s humorous conclusion, that we still have the fates to blame, I subscribe to the Vaynerchuk belief that it’s up to you to “Crush It”. I also think that if you don’t “make it” then it’s less about fault and more about desire and whether or not that perceived outcome was appropriate.

Regardless of how you perceive Fate, Destiny or Karma, I find it inspirational to believe that “you can do it” is a creed experts think you can live by; just don’t stop.


BookList – Entry 8: Wired for War

If you’ve seen either of my recent robotics projects then you might suspect I have some slight fascination with the more exotic forms our technology can take. Which is not to say that I don’t have reservations, nor do I think there’s anything glamorous about the more sinister forms such innovations can take.

Which is why I was really interested to read Singer’s book, Wired for War.

After having seen him at TED I came to believe that he was someone who could appreciate the duality such developments bring. It’s not that I believe Skynet is near, and many of these new ‘bots are really cool. Rather, it’s the human side of these components which causes concern: do we really think operating weapons of war should feel the same as playing on your gaming system?

Singer more than delivers, both illuminating the fascinatingly secret world of military robotics and raising some serious moral and social implications of such situations. For example, Predator drones are flown by ‘combatants’, which technically makes them valid military targets; however, these operators never leave US soil and often go home to their families at night.

There’s really so much more to consider, but rather than cover it all here I’ll just say that if you find these topics interesting then I can’t recommend this book enough. I’ll leave you with the following thought:

In the arts of peace Man is a bungler. I have seen his cotton factories and the like, with machinery that a greedy dog could have invented if it had wanted money instead of food. I know his clumsy typewriters and bungling locomotives and tedious bicycles: they are toys compared to the Maxim gun, the submarine torpedo boat. There is nothing in Man’s industrial machinery but his greed and sloth: his heart is in his weapons. This marvellous force of Life of which you boast is a force of Death: Man measures his strength by his destructiveness.
— George Bernard Shaw


BookList – Entry 5: The Daemon

I heard about this book while listening to a talk the author gave at a “Long Now Foundation” meeting via their iTunes podcast. Apparently, Stewart Brand got an early copy and it was passed around enough to gain quite a bit of popularity.

Well, after vomiting my way through an Ops Center story and a few other bad “tech fiction” books, I don’t think of myself as one to dive into the hype about a new book unless it’s from someone whose name ends in Gibson, Bear, Stephenson or the like.

However, I really respect the Foundation (they’ve got an amazing vision and a bunch of cool members after all) so I thought this one might be interesting enough to give it a try.

Daemon is based heavily on the premise of AI and natural language parsing, but what was even more interesting to me was the distributed systems approach represented in the book. I’ve always been a sucker for grid-like systems.

You can find the plot overview anywhere you like so I won’t bother recreating it here, but I will tell you, yes, it’s a good book, though it has probably been so well received due to the current economic state and fears.

At times I felt like the technical stuff was polarized: either explained with kid gloves or glossed over like magic. However, it’s clear the author focused on actual technological facts and didn’t suffer any “I know this!” crap or pretend hacking had visual displays.

I also thought the story spent some credibility “justifying” its realism, e.g. explicitly calling it narrow-AI (i.e. expert systems). If you have to split hairs then maybe you’re too worried about disclaimers and not focusing enough on the story, but I guarantee you that Suarez has more than enough here for a sequel.

If you take the “futuristic” parts on faith (which is required for any good sci-fi book) then you can ignore the other bits as you see fit and I’m sure you’ll be well entertained and even a little torn as to which side to cheer for!

Also, if you like this one I highly recommend giving any of the books by Charles Stross a try. Specifically Halting State, which seems to have a lot of overlapping & parallel themes and was even more enjoyable for me, but only just.
