Archive for the ‘opensource’ Category

Cloudera’s Hadoop Education

Sunday, June 14th, 2009

A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.

I must say I was a little disappointed by the videos but I believe that’s because I’d seen Christophe Bisciglia’s lectures when he was still at Google.

However, the exercises are definitely something to get you thinking and are worth giving a shot. It’s sort of like ‘programming golf‘ and I thought I’d share my version of the first map function vs. the packaged solution.

Here’s my map function

import sys, re
WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')

for input in sys.stdin.readlines():
    m = PARSER.match(input)
    if m:
        key = m.groups()[0]
        for word in WORDS.findall(m.groups()[1]):
            print "%s\t%s" % (word, key)

Cloudera’s version is:

import re
import sys
NONALPHA = re.compile("\W")

for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key

By definition they should produce the same output, i.e. the mappings should be identical, and barring buggy corner cases mine certainly passed the test.

What I found interesting was my instinctual desire to let regexps do the work, whereas their version relies on a simple “split()” to break up the input. It’s likely a faster solution, and given the massive amounts of data in large passes, it’s worth benchmarking.
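
A rough way to run that benchmark is sketched below. It is written for Python 3 (the listings above are Python 2), wraps both mappers in functions that return lists instead of printing, and uses made-up sample input, so the function names, the sample data and the repeat count are all assumptions rather than anything from the Cloudera exercise.

```python
import re
import timeit

WORDS = re.compile(r"(\w+)")
PARSER = re.compile("(.+?)\t(.+?)\n")
NONALPHA = re.compile(r"\W")


def map_regex(lines):
    """Regex-driven mapper: parse each line, then pull the words out."""
    out = []
    for line in lines:
        m = PARSER.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                out.append((word, key))
    return out


def map_split(lines):
    """split()-driven mapper, following the packaged solution."""
    out = []
    for line in lines:
        keyline = line.split("\t", 1)
        if len(keyline) == 2:
            key, text = keyline
            for w in NONALPHA.split(text):
                if w:
                    out.append((w, key))
    return out


# Made-up sample input: "<key>\t<text>\n" records.
SAMPLE = ["doc%d\tthe quick brown fox, the lazy dog\n" % i for i in range(1000)]

# Both mappers should agree on well-formed input before timing them.
assert map_regex(SAMPLE) == map_split(SAMPLE)

for fn in (map_regex, map_split):
    secs = timeit.timeit(lambda: fn(SAMPLE), number=50)
    print("%s: %.3fs" % (fn.__name__, secs))
```

One corner case where the two do differ: a final line with no trailing newline is dropped by the regex version (the pattern insists on `\n`) but is still processed by the split() version.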

However, although I’m clearly biased, I must admit I find mine easier to grok, and it should be more flexible, e.g. the input pattern could become a parameter rather than being hard-coded into the flow.
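
As a sketch of what “pattern as a parameter” could look like (Python 3; the function name `map_lines` and the comma-separated example are hypothetical, not part of the exercise):

```python
import re

DEFAULT_PATTERN = r"(.+?)\t(.+?)\n"
WORDS = re.compile(r"\w+")


def map_lines(lines, pattern=DEFAULT_PATTERN):
    """Yield (word, key) pairs; `pattern` must capture the key, then the text."""
    parser = re.compile(pattern)
    for line in lines:
        m = parser.match(line)
        if m:
            key, text = m.group(1), m.group(2)
            for word in WORDS.findall(text):
                yield word, key


# Same mapper, but fed comma-separated records instead of tab-separated ones.
pairs = list(map_lines(["doc1,hello world\n"], pattern=r"(.+?),(.+?)\n"))
```

The split() version would need its delimiter threaded through in a similar way, so this is a matter of taste more than capability.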

There’s certainly not a “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really really short and easy to modify, but I thought others might find it interesting to realize that perl doesn’t have an exclusive license on ‘TMTOWTDI’.

Build your own ioGun!

Friday, January 16th, 2009

I apologize for what will effectively be a brain dump post, but a new friend of mine from the hackaday forums is getting started on his own accelerometer controlled system and I wanted to see if I could save him some time and frustration.

I think standing on the shoulders of Giants is a fantastic aspect of human nature (now if only someone could show me to some friendly Cyclopi; is that really the right pluralization?), and frankly I could not exist, let alone thrive, in the technological world if it weren’t for the good graces of a great many people.

Perhaps I will write a _very long_ blog thank-you to them (if I can remember them all). So consider this my little chance at saving some of you a few hours of your precious time (and brain strain). Because of the very nature of hacking (i.e. everything’s just a little different and thrown together) this can’t quite be the same quality as an instructable but it should get some of you started!

I’ve put up my code here so that’s quite obviously the place to start. You’ll find a python script and an HTML page. The script connects to the MoteDaemon port I referenced earlier and will write these data points out to a JSON file which can be served via your webserver, and is then picked up by the HTML page.

If you look through the code you’ll see there’s clearly some tuning that can be done. As people on many forums pointed out there’s a “lag” which is really just because of my polling rates (and less to do with the network traffic).

Two important things here. First, you’ll see that I created one thread for the monitoring and a separate one for the output; that’s because the wiimote data is fast and furious and you don’t want to block and miss any!

Second, I know writing to a JSON file isn’t ideal, so the output part actually buffers 10 events and writes those. Luckily jQuery is smart enough to only pull the file if it’s changed, so it’s not as bad as it sounds.
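
For anyone who wants the shape of that two-thread arrangement without reading the whole script, here is a minimal sketch in Python 3. It is not the actual script: the event source is a stand-in list rather than the MoteDaemon connection, each batch overwrites the file (an assumption), and every name in it is hypothetical.

```python
import json
import os
import queue
import tempfile
import threading

BATCH_SIZE = 10   # events buffered before each write
_DONE = object()  # sentinel telling the writer to finish


def monitor(source, events):
    """Producer: push every reading onto the queue as fast as it arrives."""
    for reading in source:
        events.put(reading)
    events.put(_DONE)


def _dump(batch, path):
    # Write to a temp file and rename, so a reader never sees half a file.
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(batch, fh)
    os.replace(tmp, path)


def write_batches(events, path, batch_size=BATCH_SIZE):
    """Consumer: rewrite the JSON file once per full batch of readings."""
    batch = []
    while True:
        item = events.get()
        if item is _DONE:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            _dump(batch, path)
            batch = []
    if batch:  # flush whatever is left at shutdown
        _dump(batch, path)


def run(source, path):
    """Start one monitoring thread and one output thread, then wait for both."""
    events = queue.Queue()
    threads = [
        threading.Thread(target=monitor, args=(source, events)),
        threading.Thread(target=write_batches, args=(events, path)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Stand-in data: 25 fake accelerometer readings.
fake_readings = [{"x": i, "y": 2 * i, "z": -i} for i in range(25)]
out_path = os.path.join(tempfile.mkdtemp(), "accel.json")
run(fake_readings, out_path)
```

The queue is what keeps the monitoring side from ever waiting on disk I/O; the batch size is the knob that trades file writes against the “lag” mentioned above.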

Once you’ve got the accelerometer data into python (and then into JSON) it’s ‘just’ a matter of writing the webpage you want! jQuery makes some stuff really easy so I suggest giving it a look if you haven’t yet!

Of course what made it really easy for me was the ioBridge module. I just plugged in a servo there and defined a widget on their page and voilà, I had a webservice I could send commands to!

I hope that helps give some of you (or at least one of you) a boost, and if I can help out at all please let me know!

Amazon should participate in the OpenWeb

Tuesday, September 2nd, 2008

The socialweb.tv talks a lot about open standards, particularly in social networks. I find their videos are always energetic and help keep me abreast of aspects of the web that I don’t get to deal with frequently.

I believe their answer to the question of “Who owns your data?” (hint: “You do!”) is a little idealistic but the message and coverage is great. It makes little sense to duplicate this data and especially in tools like flickr, twitter, opensocial, and hopefully someday even Facebook, it seems obvious. Friends are friends no matter which network they’re on and if you tell me that your twitter friends aren’t the same as your facebook friends I’d reply they could be (and argue should) assuming there are more granular levels of classifications and control.

You hear a lot about this nirvana of open security and data for social sites, especially in the context of plaxo, yahoo, google, twitter and all the other “social web” buzzcompanies…. and that’s where it seems to be constrained.

It always seems limited to discussions about why no one would ever implement a microsoft security API and why google and yahoo should talk more. Or to hopeful speculation that Facebook and MySpace will finally accept friend requests and, fingers crossed, that twitter will link with someone, anyone, who could tell them that drunk and disorderly does not make them cool.

What strikes me most is that within all these talks, Amazon is missing. Not only are they not “a player” but people have forgotten that they’re the reigning homecoming king and queen when it comes to some new buzzwords like cloud computing and webservices! Many of these friends are sites built on Amazon’s services, from S3 to EC2 (even the newly announced block storage gets people excited), but they haven’t stopped to think that inviting Amazon to the party would really get it started.

Amazon’s the popular kid that’s just too popular for their own good. Everyone else thinks they’re out at the college parties when instead they’re home alone day-trading while they’re waiting for their friends to call.

I think Amazon would benefit from a vast exposure to new customers and social data! Imagine what they could sell me if they knew I’d been boating with friends or that I had a camping excursion planned (maybe something first aid related). The possibilities run from product “reviews” (which can be found in 140 character “this sucks” twitters) to broadcasting 40% discounts for kindle books when they know I’m stuck at an airport with a layover! There’s a huge wealth of valuable data for consumer companies to be gleaned from these social networks.

Amazon has a ton of users and already with their payment system and associates program they’ve shown that open standards can actually be used to make money, it seems that this would be another area in which they could reap the benefits and help everyone by driving the creation and adoption of standards.

Back in Bash

Sunday, January 27th, 2008

Quick little post today,

Using my previous two scripts you’ve (1) been able to separate a massive photo directory into 3rds (2) been able to take one of those directories and identify matching photos and move them into directories of their own, to make it easier to compare the results.

Now you’ve got a bunch of directories, most of which have 1 photo in them… what do you do?

I decided to take a break from Python for this one, it just didn’t seem the appropriate hammer for this nail.

So here’s a bit of bashing for you, I have an itch somewhere that makes me believe you can stat a directory to determine how many files are in it, but in this case find is the cheap date I was looking for.

After this script, which I call consolidate.sh, you’ll be left with only the directories which have duplicates. Enjoy!


#!/bin/bash

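# Empty out every directory that holds at most one file, then remove it.
for e in */; do
    if [ -d "$e" ]; then
        total_files=$(find "$e" -type f | wc -l)
        if [ "$total_files" -le 1 ]; then
            # Destination is written as in the original post; adjust it to your setup.
            find "$e" -type f -exec mv {} !/uniques \;
            rmdir "$e"
        fi
    fi
done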

Continuing to Code

Friday, January 25th, 2008

Well, my python’s not exactly getting prettier but I’ve been able to make it more functional, may I present “find_image_dups.py”!

Although I’m learning this 4G language (is it truly?) I still tend towards a shell scripting approach, so I write small bits of code rather than trying to write one script that will both subdivide my pictures and find duplicates.

You can see I’ve experimented with different ways to determine if files are “equal”. If they’re the exact same file then the MD5’s would match and there are plenty of tools for that, but I anticipate some instances where a few pixels may be different, so checking a thumbnail may be a more repeatable test of identity.
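
For the exact-match case, the MD5 comparison can be sketched with nothing but the standard library. This is Python 3, covers only byte-identical files (not the “few pixels different” case), and the helper names and demo files are made up for illustration.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Map digest -> sorted paths, keeping only digests seen more than once."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}


# Demo with throwaway files standing in for photos.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name, data in [("a.jpg", b"same"), ("b.jpg", b"same"), ("c.jpg", b"other")]:
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as fh:
        fh.write(data)
    demo_paths.append(path)

dups = group_exact_duplicates(demo_paths)
```

Hashing each file once and grouping by digest avoids comparing every pair of files, which matters with thousands of photos.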

Although my first script allows me to circumvent the constraints imposed by working with too many files, I’m starting to feel frustrated that the limits aren’t easier to “ignore”. I don’t want to waste my time tuning linux limits or tweaking python, I want it to just work in the most simplistic (even if it’s brute force) manner possible!

One last comment on a language deficiency I find constraining. Mike and I were discussing the matter and his view is that it’s not a problem. His solution is more code, which is sometimes a silly metric since both pieces of code would operate equally “efficiently”, but I believe my “style” to be more readable (although it doesn’t work, so Mike is clearly the victor in the argument).

Let’s assume for the purposes of the illustration that you were an old C programmer and didn’t use IDE’s all that much, nor want to learn the python debugger yet. So during your while loop you may be tempted to do;

print "counter: %s p: %s i:%s" % (counter, p.filename, i.filename)

Unfortunately, the first time through p is "None" and you’ll have problems for this single corner case. In C, when printing pointer references, I would use the ?: ternary operator which led me to try this in python;

print "counter: %s p: %s i:%s" % (counter, p ? p.filename : "None", i.filename)

I think it’s pretty straightforward to understand what’s going on there and what you’d like done, but unfortunately python’s ternary is

op1 if condition else op2

Thus, in python I feel like I should be able to do;

print "counter: %s p: %s i:%s" % (counter, p.filename if p else "None", i.filename)

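But that doesn’t work! (Conditional expressions were only added in Python 2.5, so an older interpreter would reject this line as a syntax error.)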
Mike’s solution is;

if p:
    print "counter: %s p: %s i:%s" % (counter, p.filename, i.filename)
else:
    print "counter: %s p: %s i:%s" % (counter, "None", i.filename)

While that is clearly functional, I dislike the added conditionals since, to me, they don’t feel like the main intent of the code. However, Mike expressed a good point: if the code were permanent, the ternary operator might confuse someone who examines the printouts and sees two possible outputs from a single statement.
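
For comparison, here is how the single-statement version might look on an interpreter that supports conditional expressions (Python 2.5 and later; the sketch itself is Python 3). The `Img` type and the `describe` helper are stand-ins invented for the example, not part of the script.

```python
from collections import namedtuple

Img = namedtuple("Img", "filename")  # stand-in for an opened image object


def describe(counter, p, i):
    """Format the debug line, tolerating p being None on the first pass."""
    p_name = p.filename if p is not None else "None"
    # getattr(p, "filename", "None") would give the same result in one call.
    return "counter: %s p: %s i:%s" % (counter, p_name, i.filename)


first = describe(1, None, Img("a.jpg"))
later = describe(2, Img("a.jpg"), Img("b.jpg"))
```

Pulling the conditional onto its own line keeps the format string readable, which goes some way towards Mike’s point about a single statement with two possible outputs.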

Nevertheless it’s great learning and discussing some of the finer points of style in a language I’m only newly familiar with!


#!/usr/bin/python

import Image
import ImageStat

from glob import glob
from os import mkdir
from shutil import move

def RMS_cmp(x, y):
    # Earlier experiments at deciding whether two images are "equal":
    # if ImageStat.Stat(x).rms == ImageStat.Stat(y).rms:
    # if x.copy().resize((300,300)).histogram() == y.copy().resize((300,300)).histogram():
    if x.histogram() == y.histogram():
        return 0
    elif ImageStat.Stat(x).rms > ImageStat.Stat(y).rms:
        return 1
    else: # x<y
        return -1

images = map(Image.open, glob("*.[Jj][Pp][Gg]"))
images.sort(RMS_cmp)

counter = 1
p = None
for i in images:
    if p and ( RMS_cmp(p, i) == 0 ):
        move(i.filename, str(counter))
    else: #no match start a new directory
        if p:
            counter += 1
        try:
            mkdir(str(counter))
        except:
            pass
        move(i.filename, str(counter))
    p = i

Progress in the New Year

Friday, January 18th, 2008

Welcome to the New Year! A lot has happened already, between Google changing their page rank, Apple’s numerous announcements and tons of other cool stuff. I’ve been getting back into the swing of things and mostly trying to figure out how to start off the new year right, i.e. if there was some great “first post” for the year I could conceive.

As can be the case, my desire to make an amazing impression has resulted in me making none, until now.

I graduated from college with a B.S. in Computer Engineering and for anyone not familiar with a Comp.E. degree I tell people I’m a cross between an Electrical Engineer and a Computer Scientist.

Truly, I’m more of a Scientist than an Engineer and probably should have gone the C.S. route based on my job history through the years. However, I like being able to call myself an Engineer and it fits with who I am and how I approach problems.

I tend to act as a “translation layer” and like to approach problems with tangible exploration. I’m more of a system programmer (i.e. C and perl) than a “pomp and circumstance” application architect (e.g. java and relational DB’s).

However, as the “active complexity” (or perceived complexity) of development and IT moves “up the stack” I’ve felt my programming skills start to slip. It’s not theory and design failing but that the barrier of taking ideas to fruition has become steeper.

Intending to rectify that situation, and loving the compartmentalized development approach that web services brings, I’ve begun learning python and intend to practice the application and not just theory of programming.

My first “new year need” has been to “spring clean” my data files, I have around 20,000 pictures with many dups (the result of recovering a HD). So coupling intent with need I’ve written my first real python program.

It’s not fancy, and only serves to break out a large directory of files into a smaller (3rds) subset. However, it was a great exercise for me so far. Transitioning to Python isn’t really a syntactical problem but an exercise in realizing that things which were once hard are now easy!

It’s an exercise in mind-shift that we should all try to initiate this year, try it early and start your year out right!

So in case it helps anyone here’s my “code”. I’m sure there are some more “python-esque” ways to do it so if you’ve got any insights please leave a comment!

#!/usr/bin/python

from glob import glob
from os import stat,mkdir
from shutil import move

mkdir("1")
mkdir("2")
mkdir("3")

def size_cmp(x, y):
    if stat(x).st_size > stat(y).st_size:
        return 1
    elif stat(x).st_size == stat(y).st_size:
        return 0
    else: # x<y
        return -1

files = glob("*.JPG")
files.sort(size_cmp)
num_files = len(files)

# Slices are half-open, so each one starts exactly where the previous one ended.
for f in files[0 : num_files/3]: #slice 1
    move(f, "1")
for f in files[num_files/3 : num_files*2/3]: #slice 2
    move(f, "2")
for f in files[num_files*2/3 : num_files]: #slice 3
    move(f, "3")
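
One possibly more “python-esque” take, sketched in Python 3: sort with a key function instead of a comparison function, and compute the two cut points once so the three slices cover every file exactly once. The helper names are hypothetical, and the demo splits a list of numbers rather than touching any files.

```python
import os


def split_in_thirds(items):
    """Return three consecutive slices that together cover every item."""
    n = len(items)
    a, b = n // 3, 2 * n // 3
    return items[:a], items[a:b], items[b:]


def by_size(paths):
    """Sort paths by file size using a key function instead of a cmp function."""
    return sorted(paths, key=os.path.getsize)


small, medium, large = split_in_thirds(list(range(10)))
```

With real photos the call would be along the lines of `split_in_thirds(by_size(glob("*.JPG")))`, followed by the same `move` loops as above.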

The Next Social Network: WordPress – GigaOM

Thursday, December 13th, 2007

I really appreciated the thoughts in The Next Social Network: WordPress – GigaOM. I’ve been on Facebook for a while now (and prior to that LinkedIn) and despite my initial misgivings, I’ve been surprised at how much value social networks have provided, for a relatively low amount of effort.

On the social side, Facebook has provided me with the ability to reconnect with old friends as well as making it extremely easy to keep up with all my friends. LinkedIn really hasn’t been a “game changer” for me but it’s a nice way of keeping an aggregated collection of business contacts.

However, given Facebook’s alarming disregard for our privacy, most notably with their Beacon project, and their growing commercialization, I would feel more comfortable if I could manage my own social network presence.

Perhaps with the opening of the walled gardens’ social networking APIs (Facebook, LinkedIn and OpenSocial all have announcements), we may gain this ability. It would be amazingly empowering to centrally manage my online presence(s), including the multiple views of “who I am” as well as being able to filter undesirable content, where I get to make that distinction.

This seems a great avenue for WordPress to pursue, given its opensource nature, and might help continue its differentiation now that Movable Type has decided to opensource their product as well.

I wish I was a true “webhacker” and could just make this happen; sharing code is always more compelling than simply spreading an idea. However, I’ll be cheering on whoever does.

update:

Normally I’d make a new post rather than update an existing one, but I just saw Scoble’s Can we get a first step in social networking portability and wanted to comment on it here because it’s so pertinent to these thoughts.

As is typical, I think Scoble’s got the right idea, just misplaced in an outdated modality. True social networking, as in the “seemless” desire of that purpose, is not about portability, import/export or “linking”. Those are all walled ways in which people still have to do the work, and implementors believe they should be in control.

So far google, or even yahoo, are the best representations of this ideal. They’re both in a position to find my interests, my pictures and my friends. Unfortunately, for now, name isn’t a sufficient differentiator for search engines.

Code of Conduct

Sunday, December 2nd, 2007

Brad has some great comments on my post about how software needs to extend beyond a single locality. I’ve yet to be happy with the “comment” structure of the blog’o’sphere, so I’ll follow Brad’s example and put my thoughts here, especially because Brad’s post and mine evolve the topic(s) rather than being follow-ups.

In particular, I like Brad’s insight that it’s not really an application, but the overall data that matters. It’s also liberating to realize that all data you’re interested in (and allowed to see) is “your data”.

Privacy might dictate differently but if I’m allowed to keep it in my brain then I believe Brad’s right that I’m allowed to keep it “on disk”. For example, when a friend updates status on Facebook, that change is their data, but the alert (with the same data) which I receive is my data and I should be able to store it and do with it what I will.

In the enterprise world, we talk about being “data centric” as opposed to the “infrastructure centric” approach of the past. Sometimes there are more pejorative terms like “enterprise data bus” used but they all refer to the goal of making the right data accessible to the right end-points (people or applications) at the right time, while ensuring requirements like consistency, audit-ability and performance are met.

Satisfying these demands usually comes down to one of two perspectives: (1) centralize everything or (2) accept the fractional nature of data and try to embrace an appropriate level of policy. For example you may state “ok, for this application you can use whatever data you want but you’ll probably want to try to update every 30 min”.

In reality it’s really always a mixture of the two, especially given compliance requirements. Whatever the ultimate solutions, even beginning the conversation is usually instructive enough to help business better define, and ultimately solve, these access issues.

So can we use Brad’s realization to inspire an “end consumer, SaaS using, RESTful application” bill of rights? We need a better name, but just as there are rules of behavior for participating in forums or etiquette for blogging and linking, I believe we should require applications to live up to a required set of expectations, perhaps we can get one of the opensource entities to help with defining, auditing and certifying such applications.

An initial standard must require “import/export”. However, I don’t think that captures what we’re after. There needs to be unmodified event reporting, and not the Facebook style of “something’s changed come visit to check it out”. If we’re going to expect an application to report on state changes (presumably reporting these to another application which can act on that state) then we must also require an external mechanism for inducing state changes. Continuing to use Facebook as a model I believe it’s reasonable to expect an API for “friending” or changing status.

What other requirements should services we use be encouraged to provide?

Embodying Opensource

Monday, November 26th, 2007

Just the other day I wrote how I hope the “opensource” culture, not of technology but of social obligation, will begin to evolve the enterprise.

I’ve been working through a serious backlog of articles, and while this one’s been on my “to read” list for a while it was particularly relevant to my last post.

The title “Fire your best people, reward the lazy ones” is inflammatory but the article itself accurately represents one way that opensource culture beats out the traditional enterprise behavior.

“To me, if the knowledge is locked in your head, you are a less valuable, not more valuable, resource.”

I know “traditionally” such knowledge lock-in was considered “job security” but it’s an attitude I find unfathomable. Perhaps it makes sense if you enjoy being stuck forever in your current job; however, such an isolationist tendency will quickly leave your code and skills outdated and ripe for replacement.

So ask yourself, How have you enabled someone today?

If I could change the world…

Thursday, November 15th, 2007

I often think I have good ideas, but I tend to be a “thinker” and have problems “doing”. It’s not that I can’t “do”; it’s usually the familiar feeling of saying “Wouldn’t it be nice if…” quickly followed by “yea, and it would also be nice if I had time for that”, which I’m sure we can all sympathize with.

I recently read Accelerando by Charles Stross. It’s not my favorite book of his but I really like his work. In this story one of the main characters embodies sort of the ultimate in opensource philosophy. Rather than just the typical “writing code and giving it away for free” that we’re all familiar with, he shares his ideas for free and lives off the goodwill of those who implement them.

It would be an ideal job description for me, and although I never expect to get compensated for an idea I thought I would try to embody that idealism. I seem to recall a website somewhere in the ether with the same philosophy but can’t recall the URL.

Anyway, in that spirit, here goes;

I think someone should hack greasemonkey and google reader so that “sharing” or “starring” an entry automatically promotes it on digg and reddit. There shouldn’t be an additional step. Or maybe digg and reddit could simply implement feed slurping for your shared list.

I believe the future of the Internet as a whole will be an increasingly seamless interaction between multiple services. Users don’t want to manage Gmail and Yahoo mail as separate entities (yes I have both). That’s a simple example but ask yourself the question why, given the proper security / policy / profile controls, it’s necessary for you to go to Facebook to receive your Facebook messages.

Why can’t your Facebook “status” be slurped from Twitter, or pushed there?

Seamless integration, it’s what many of the best “hacks” seem to be about.

Let’s be witty: I’ll call it “seemless” integration, where multiple services seem unified.