I’ll be the first to admit that, despite it having been around for over ten years now, I’ve been generally “blind” to MapReduce and Big Data. That is, I have experience working with large datasets, but always through a nice, intuitive GUI designed to let people like me spot the malicious activity I need to find to do my job, without requiring me to craft my own functions. It just wasn’t necessary to know exactly how it all worked underneath. I liken this to being a very competent driver who breaks out in a sweat at the thought of replacing a set of brake pads unassisted.
However, I’ve been making an effort this year (not quite the result of a New Year’s resolution, but you might call it that) to consciously push the boundaries of my knowledge, so that come year end I am far less likely to say “Yeah, I remember hearing something about that”, and far more likely to say “Yeah, that’s a really cool system. Do you want me to explain how it works?” My journey of discovery has recently taken me through Splunk territory (and that really is a cool system!) and, by a meandering path, to an understanding of how Big Data works: the unique computing challenges it brings, and how useful it can be in finding the subtlest of patterns and making businesses more profitable and more efficient (see here and here).
So… I did promise to talk about MapReduce. At this point, I’ll admit that what follows is not my own explanation, just the best one I’ve come across: it explains the concept of MapReduce in terms simple enough to be understandable to almost anyone, without becoming so simple that it’s of no practical use. The introduction that made sense to me comes from the blog of Kaushik Sathupadi, and I’ll reproduce a version of it here in the event that his blog becomes unavailable at some future date.
Imagine that you work for a software company that is producing a brand new blogging platform. The CEO has given you the task of examining all blogs served on the popular blogging platform blogger.com and producing a count of words by length: how many one-letter words there are, how many two-letter words, and so on, up to the number of ten-letter words. The CEO is also paranoid about security, so they want you to do this calculation manually, on paper. You have a week. It’s not all bad news, however. The CEO recognises that it’s a very big task, and so they have graciously agreed to let you have personal control over the company’s 50,000 employees for the week.
So what do you do? Besides resigning and running away? Well, here’s where the MapReduce principles come in very handily. Remember that the CEO said you could have control over the 50,000 employees? How you put them to work, and how you break one giant task down into a great many small tasks, is the key here. For the next week, the staff will be divided up into different groups:
- Mapper (The vast majority of staff will be in this group)
- Grouper (Assume just a single person for now)
- Reducer (Approximately 10 staff in this group)
- Master (Just one – you)
Once you have assigned each staff member to one of the above groups, you need to brief them on their roles and responsibilities. First of all you’d speak with the Mappers. You tell them that each one of them will have a set of 50 blog URLs and a very large piece of paper. They have the next four days to go through each blog and write down on the paper every word they see, prefixed with the number of letters in that word. So for example, take the sentence “The quick brown fox jumped over the lazy dog.”. The corresponding entries on the Mapper’s sheet of paper would look like this:

3,The
5,quick
5,brown
3,fox
6,jumped
4,over
3,the
4,lazy
3,dog
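If you prefer code to paper, here is a minimal sketch of the Mapper’s job in Python (the function name and word-splitting rule are my own illustrative choices, not part of any real MapReduce library):

```python
import re

def mapper(text):
    """Emit one (letter_count, word) pair per word, just like the
    entries a Mapper writes on their sheet of paper."""
    for word in re.findall(r"[A-Za-z]+", text):
        yield (len(word), word)

pairs = list(mapper("The quick brown fox jumped over the lazy dog."))
# pairs[0] is (3, 'The') and (6, 'jumped') appears later in the list.
```

Each Mapper runs this over its own 50 blogs, with no need to coordinate with any other Mapper.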
In reality, with 50 blogs and lots of content in each one, the Mapper’s piece of paper would have many millions of lines of data on it. The Mappers would continue this task independently of each other until they had completed a full reading of all of the blog content each one of them was assigned. As and when each Mapper completes their task, they hand their completed sheet to the Grouper.
The Grouper has ten sheets of paper: one for each length of word (1–10 characters). The Grouper looks through the sheet given to them by a Mapper and, for each line, looks first at the number. This tells The Grouper which of their ten sheets the following word will be written on. For example, “6,jumped” means that The Grouper should write the word “jumped” on sheet 6. “3,dog” tells The Grouper to write the word “dog” on sheet 3. Once The Grouper has received all of the sheets from the Mappers, they send their ten sheets to ten different Reducers: sheet 1 to Reducer 1, sheet 2 to Reducer 2, and so on.
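The Grouper’s sorting step could be sketched like this (again, the names are illustrative; note that nothing is counted here, pairs are only routed onto the right sheet):

```python
from collections import defaultdict

def grouper(all_pairs):
    """Route each (length, word) pair onto the sheet for that length."""
    sheets = defaultdict(list)   # sheet number -> list of words
    for length, word in all_pairs:
        sheets[length].append(word)
    return sheets

sheets = grouper([(3, "The"), (6, "jumped"), (3, "dog"), (3, "fox")])
# sheets[6] is ['jumped']; sheets[3] is ['The', 'dog', 'fox']
```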
The Reducer has a relatively simple task in this example. They are required only to count the number of words on the sheet that The Grouper gave them, then to write that number on the back of the sheet in big bold type. Once this is done, they give the list to The Master, you.
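In code, a Reducer in this example does little more than count its sheet and label the result (a sketch under the same illustrative naming as above):

```python
def reducer(sheet_number, words):
    """Write the word count 'on the back of the sheet':
    return (sheet_number, count)."""
    return (sheet_number, len(words))

# Reducer 3 counts the three-letter words on its sheet:
result = reducer(3, ["The", "fox", "the", "dog"])
# result is (3, 4)
```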
Within the week, you managed to count all instances of all words, by length, across the whole of the blogger.com blogging service. Well done. But why was it successful? Because you effectively used MapReduce. The key components of that success were that:
- The Mappers can work independently (and in parallel)
- The Reducers can work independently (and again, in parallel)
- The Grouper can work at great speed as there is no need to count anything, simply to sort based on the key value from the Mapper.
The key concept here can be applied to a number of different problems using Big Data. The roles of The Master and The Grouper (dividing up the work and grouping based on key value) remain the same from job to job, and these functions are provided by any MapReduce library. It is the functions carried out by The Mapper and The Reducer which change based on the specific task at hand, and so it is these that must be tailored to each job.
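A toy end-to-end version makes that division of labour concrete: in the sketch below, `run_mapreduce` stands in for what the library provides (the Master and Grouper), while `mapper` and `reducer` are the two job-specific functions you plug in. All of these names are my own; a real framework such as Hadoop exposes this differently.

```python
import re
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """The 'library' part: distribute the map work, group by key,
    then hand each group to a reducer."""
    sheets = defaultdict(list)
    for text in inputs:                    # each input could run in parallel
        for key, value in mapper(text):    # Grouper: sort by key only
            sheets[key].append(value)
    # Reduce phase: each sheet could also be processed in parallel.
    return {key: reducer(key, values) for key, values in sheets.items()}

# The two job-specific functions for the word-length problem:
def mapper(text):
    for word in re.findall(r"[A-Za-z]+", text):
        yield (len(word), word)

def reducer(key, words):
    return len(words)

counts = run_mapreduce(["The quick brown fox", "jumped over the lazy dog"],
                       mapper, reducer)
# counts maps word length -> number of words of that length
```

Swapping in a different `mapper` and `reducer` would solve an entirely different problem with the same grouping machinery unchanged.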
So that’s a simple analogy of how MapReduce works. For a more detailed look at it, with real code examples, I recommend taking a look at this piece written by Joel Spolsky.