2 days

This course is designed for software researchers, engineers, and managers to learn the basics of Hadoop.

What is Hadoop?

Apache Hadoop is an open source software platform that enables distributed processing of large data sets across clusters of commodity servers. Hadoop can process large amounts of data inexpensively by using the MapReduce programming model and spreading the work across a distributed computing environment. It is currently used by companies such as Facebook and the New York Times for tasks such as e-commerce and image processing, and IBM used it in creating Watson, the supercomputer that won the Jeopardy! game show.
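To give a feel for the MapReduce programming model before the course, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not Hadoop code: it runs in memory on one machine and does not use any Hadoop API. The sample records and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen only for this example, which finds the highest temperature recorded per year.

```python
from collections import defaultdict

# Hypothetical input records: (year, temperature in degrees Celsius).
RECORDS = [
    ("1949", 11.1),
    ("1950", 0.0),
    ("1950", 2.2),
    ("1949", 7.8),
    ("1950", -1.1),
]


def map_phase(record):
    """Map: emit one (key, value) pair for each input record."""
    year, temperature = record
    yield year, temperature


def shuffle(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle(mapped)]


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a real Hadoop job the map and reduce steps would run in parallel on many machines and the framework would perform the shuffle; this sketch only shows how the three steps fit together.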

What Will I Learn?

By attending this course, you will learn what Hadoop is, its strategic role in solving Big Data problems, and what types of problems are good candidates to solve with Hadoop. Topics covered in this course include:

  • An introduction to Hadoop and its role in solving Big Data problems
  • The MapReduce programming model
  • The Hadoop Distributed File System (HDFS)
  • The Hadoop ecosystem (e.g., Pig, Hive, HBase)
  • Case studies of using Hadoop in industry
  • An overview of cloud computing
  • Installing and running Hadoop on a cluster and in the cloud

Who Should Attend?

This is an introductory course to Apache Hadoop. It is suitable for software researchers, engineers (developers and testers), and managers who wish to gain an understanding of the basic aspects of the Hadoop core (MapReduce, HDFS) and the Hadoop ecosystem (Pig, Hive, HBase).

What are the Prerequisites?

Basic knowledge of programming (e.g., Java) is helpful but not necessary.

About the Instructor

Tauhida Parveen, PhD, is an independent consultant with an emphasis on cloud computing and software testing. She has worked in quality assurance with organizations such as the Wikimedia Foundation, MEI, Yahoo!, Sabre, and Progressive. She is an adjunct faculty member with the Department of Engineering Systems at the Florida Institute of Technology. She is co-author of the book Software Testing in the Cloud: Migration & Execution (Springer, 2012) and co-editor of the book Software Testing in the Cloud: Perspectives on an Emerging Discipline (IGI Global, 2012).

Course Outline

Introduction to Hadoop

  • What is Hadoop?
    • Hadoop’s history and relation to distributed computing
    • Where Hadoop is used today
  • The MapReduce Programming Model

    • Functional programming and its relation to MapReduce
    • MapReduce programming model
    • Map
    • Reduce
    • Shuffle and sort
    • Combiner
    • Writing a MapReduce program
    • MapReduce Examples
    • Analyzing data with UNIX tools vs. MapReduce
    • Job failure
    • Job scheduling
    • Task execution
  • The Hadoop Distributed File System (HDFS)

    • Distributed File Systems
    • HDFS Overview
    • HDFS Architecture
    • Data organization
  • Using the Hadoop Core: MapReduce and HDFS

    • Sample MapReduce program
    • MapReduce workflow
  • The Hadoop Ecosystem: Pig, Hive, and HBase

    • The Hadoop ecosystem
    • Using Pig, Hive, and HBase
  • Case Studies

    • Facebook
    • IBM
    • Last.fm
  • Using Hadoop for Problem Solving (Hands-on)

    • Developing a MapReduce application (Weather data example)
    • Other hands-on examples
  • Cloud Computing

    • What is cloud computing?
    • The role of Hadoop in cloud computing
    • Amazon Web Services (AWS)
  • Installing and Running Hadoop

    • Flavors of Hadoop: Apache, Cloudera, Amazon Elastic MapReduce
    • Hadoop modes
    • Building a Hadoop Cluster
    • Running a MapReduce job on a Hadoop cluster
    • Running Hadoop in the cloud