A while back, after Cloudera released their lectures and VMware image for Hadoop, I watched the training sessions and worked through some of the initial exercises.
I must say I was a little disappointed by the videos, but I believe that’s because I’d already seen Christophe Bisciglia’s lectures from when he was still at Google.
However, the exercises definitely get you thinking and are worth a shot. It’s sort of like ‘programming golf’, and I thought I’d share my version of the first map function alongside the packaged solution.
Here’s my map function:
import sys, re
WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')
for input in sys.stdin.readlines():
    m = PARSER.match(input)
    if m:
        key = m.groups()[0]
        for word in WORDS.findall(m.groups()[1]):
            print "%s\t%s" % (word, key)
Cloudera’s version is:
import re
import sys
NONALPHA = re.compile("\W")
for input in sys.stdin.readlines():
    keyline = input.split("\t", 1)
    if (len(keyline) == 2):
        (key, line) = keyline
        for w in NONALPHA.split(line):
            if w:
                print w + "\t" + key
By design, the two should produce the same output, i.e. the mappings should be identical, and barring buggy corner cases, mine certainly passed the test.
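To make the "same output" claim easy to check, here is a minimal sketch that wraps both mappers in functions and compares them on a few made-up lines. The function names (`regex_map`, `split_map`) and the sample data are my own assumptions for illustration, and the code is written to run on Python 3 rather than the Python 2 of the listings above. It also shows one corner case where the two differ: a final line with no trailing newline.

```python
import re

WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')
NONALPHA = re.compile(r'\W')


def regex_map(lines):
    """Regex-based mapper: returns a list of (word, key) pairs."""
    pairs = []
    for line in lines:
        m = PARSER.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                pairs.append((word, key))
    return pairs


def split_map(lines):
    """split()-based mapper: returns a list of (word, key) pairs."""
    pairs = []
    for line in lines:
        keyline = line.split("\t", 1)
        if len(keyline) == 2:
            key, text = keyline
            for w in NONALPHA.split(text):
                if w:  # skip the empty strings left between separators
                    pairs.append((w, key))
    return pairs


# Hypothetical sample input: "key<TAB>text" lines, plus one line with no tab.
sample = [
    "doc1\tthe quick brown fox\n",
    "doc2\thello, world!\n",
    "no tab here\n",
]
print(regex_map(sample) == split_map(sample))

# Corner case: the regex insists on a trailing newline, split() does not.
unterminated = ["doc3\tlast line"]
print(regex_map(unterminated))
print(split_map(unterminated))
```

On well-formed, newline-terminated input the two agree; the unterminated last line is the kind of corner case where the regex version silently drops a record.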
What I found interesting was my instinctive desire to let regexps do the work, whereas their version relies on a simple “split()” to break up the input. It’s likely the faster solution, and given the massive amounts of data in large passes, it’s worth benchmarking.
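A rough way to benchmark the two approaches is the standard library's `timeit` module. This is only a sketch under assumptions: the function names, the synthetic input, and the repetition count are mine, it targets Python 3, and it measures the parsing alone without any stdin or stdout overhead. Real numbers will depend on the data and the interpreter.

```python
import re
import timeit

WORDS = re.compile(r'(\w+)')
PARSER = re.compile('(.+?)\t(.+?)\n')
NONALPHA = re.compile(r'\W')


def regex_map(lines):
    """Regex-based mapper, collecting (word, key) pairs instead of printing."""
    pairs = []
    for line in lines:
        m = PARSER.match(line)
        if m:
            key, text = m.groups()
            for word in WORDS.findall(text):
                pairs.append((word, key))
    return pairs


def split_map(lines):
    """split()-based mapper, collecting (word, key) pairs instead of printing."""
    pairs = []
    for line in lines:
        keyline = line.split("\t", 1)
        if len(keyline) == 2:
            key, text = keyline
            for w in NONALPHA.split(text):
                if w:
                    pairs.append((w, key))
    return pairs


# Synthetic input (an assumption): 1000 identical-looking records.
lines = ["doc%d\tthe quick brown fox jumps over the lazy dog\n" % i
         for i in range(1000)]

# Time 20 full passes over the input for each mapper.
regex_seconds = timeit.timeit(lambda: regex_map(lines), number=20)
split_seconds = timeit.timeit(lambda: split_map(lines), number=20)
print("regex: %.4fs  split: %.4fs" % (regex_seconds, split_seconds))
```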
However, although I’m clearly biased, I must admit I find mine easier to grok, and it should be more flexible, e.g. the input pattern could become a parameter rather than being hard-coded into the flow.
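As an illustration of that flexibility, here is one way the input pattern could become a parameter. The `make_mapper` helper and the colon-separated format are hypothetical, invented just for this sketch (Python 3); the only requirement is that the pattern has two groups, key first and text second.

```python
import re

WORDS = re.compile(r'(\w+)')


def make_mapper(pattern='(.+?)\t(.+?)\n'):
    """Build a mapper for lines matching `pattern` (two groups: key, text)."""
    parser = re.compile(pattern)

    def mapper(lines):
        for line in lines:
            m = parser.match(line)
            if m:
                key, text = m.groups()
                for word in WORDS.findall(text):
                    # Emit "word<TAB>key", the same shape as the original output.
                    yield "%s\t%s" % (word, key)

    return mapper


# Default: tab-separated "key<TAB>text" lines, as in the exercise.
tab_mapper = make_mapper()
# Hypothetical alternative format: "key: text" lines.
colon_mapper = make_mapper('(.+?): (.+?)\n')

for out in tab_mapper(["doc1\tfoo bar\n"]):
    print(out)
for out in colon_mapper(["doc2: baz qux\n"]):
    print(out)
```

The split-based version could be parameterised too (by passing the separator), so this is a matter of taste more than a decisive advantage.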
There’s certainly no “right” way to do it, other than one that works. The advantage of the MapReduce model is that the necessary code is often really short and easy to modify, but I thought others might find it interesting to see that Perl doesn’t have an exclusive license on ‘TMTOWTDI’.