Online education sites have around for some time now. One of my favourites, Udacity, has recently started a new series of courses: Data Science and Big Data Track. Big Data is a fascinating subject and I’ve been wanting to learn more about it. But so far my introductions were generally short lived. This time I intend to finish all these courses and have at least a guided tutorial. Their first course in this track is Introduction to Hadoop and MapReduce.
Named after the main developer’s child’s toy’s name, Hadoop is an open-source framework based on MapReduce that can run distributed data-intensive tasks. It has its own file system called Hadoop distributed file system (HDFS). It handles data redundancy by dividing the data into 64MB chunks and storing several copies of them (3 copies by default).
A programming model first developed at Google. It consists of 2 steps: Map and Reduce. Map function takes the input data and divides it into smaller datasets. In Reduce function takes the sub-problems as input and calculates the final output.
The course they are offering is very concise and to-the-point. It doesn’t take too long to finish. It’s instructors are employees of Cloudera and they do a very good job in explaining the basic concepts in simple terms. Also, in the course they provide a download to a virtual machine fully loaded with Hadoop and tools. It also contains the example datasets and code they use throughout the course so it makes it quite easy to practice on your own.
Final project was fun to implement. It’s based on the examples so you can develop on top of the code shown in the class. I submitted my answers to GitHub Gist. If you’re interested they’re available here. Files are named with “_xy” prefix where x is the project number (there are two parts for the final project) and y is the quesiton number.
I’m also curious about their new certification model. I haven’t enrolled to any of their paid programs. Basically the courses are still free to enroll but with paid program you have a dedicated tutor who reviews your code and gives you feedback. Also there is an exit interview and if you pass you get a verified certification. I’m not sure how that interviews is going to be conducted though. It’s not cheap ($150/month) though. You still work at your own pace but since you’re paying for it probably you’d want to finish it as soon as possible.