Cluster Configuration on Ubuntu


A guide to installing and setting up a Single-Node Apache Hadoop 2.x cluster

Introduction

    This setup and configuration document is a guide to setting up a Single-Node Apache Hadoop 2.x cluster on an Ubuntu virtual machine on your PC. If you are new to both Ubuntu and Hadoop, this guide comes in handy to quickly set up a Single-Node Apache Hadoop 2.x cluster on Ubuntu and start your Big Data and Hadoop learning journey.

The guide describes the whole process in two parts:

Section 1: Setting up the Ubuntu OS for Hadoop 2.x

    This section provides a step-by-step guide to downloading and configuring an Ubuntu Virtual Machine image in VMware Player, and to installing the prerequisites for Hadoop on Ubuntu.

Section 2: Installing Apache Hadoop 2.x and Setting up the Single Node Cluster.

    This section provides a step-by-step guide to downloading the Apache Hadoop 2.x binaries, installing them, and configuring a Single-Node Apache Hadoop cluster.

1. Setting up the Ubuntu Desktop

This section describes the steps to download and create an Ubuntu image on VMware Player.

    1.1 Creating an Ubuntu VMware Player instance

    The first step is to download an Ubuntu image and create an Ubuntu VMware Player instance.

1.1.1 Download the VMware image

    Access the following link and download the Ubuntu 12.04 image:

    [Screenshot: VMware image]

1.1.2 Open the image file

    Extract the Ubuntu VM image and open it in VMware Workstation.

    [Screenshot: VMware Workstation]

Click 'Open a Virtual Machine' and select the path where you extracted the image.

    [Screenshot: VMware Workstation]

1.1.3 Play the Virtual Machine

    You will see the screen below in VMware Workstation after the VM image creation completes.

    [Screenshot: Home Screen]

Play the virtual machine. You will then get the home screen shown in the following image.

    [Screenshot: Home Screen]

The user details for the virtual instance are:
Username : user
Password : password
Open the terminal to access the file system (use the Ctrl+Alt+T shortcut key).

    [Screenshot: terminal]

1.1.4 Update the package lists

The first task is to run 'apt-get update' to download the package lists from the repositories and "update" them to get information on the newest versions of packages and their dependencies. Type the following command in the terminal:
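$sudo apt-get update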

    [Screenshot: apt-get update]

    Note : If 'sudo apt-get update' fails with an error related to a lock file, remove the lock files and try again:
    $sudo rm /var/lib/apt/lists/lock
    $sudo rm /var/cache/apt/archives/lock

1.1.5 Install Java and the OpenSSH server for Hadoop 2.6.0

Check whether Java is already installed:

$java -version

    [Screenshot: java version]

Use apt-get to install the OpenJDK 7 package.

$sudo apt-get install openjdk-7-jdk

    [Screenshot: Install JDK]

$java -version
Check the location of the folder where Java is installed:
$cd /usr/lib/jvm

    [Screenshot: OpenJDK]

Install the OpenSSH server; the Hadoop scripts use SSH to start and stop the daemons, even on a single node:
$sudo apt-get install openssh-server

    Naming Convention
    Install on server

1.2 Download the Apache Hadoop 2.6.0 binaries
1.2.1 Download the Hadoop package

    Download the binaries to your home directory. Use the default user 'user' for the installation. In live production instances, a dedicated Hadoop user account is used for running Hadoop. Though it is not mandatory to use a dedicated Hadoop user account, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (separating for security, permissions, backups, etc.). Click on the following link to download the hadoop-2.6.0.tar.gz file:
    http://www.apache.org/dyn/closer.cgi/hadoop/common

    [Screenshot: Hadoop tar file]

    Unzip the files and review the package content and configuration files.
    $tar -xvf hadoop-2.6.0.tar.gz

    [Screenshot: unzip Hadoop tar file]

    [Screenshot: folder structure]

    Review the Hadoop configuration files.
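    The configuration files edited in the rest of this guide all live under etc/hadoop inside the extracted package. For example (listing abbreviated; the exact contents may vary slightly by release):
    $ls hadoop-2.6.0/etc/hadoop
    core-site.xml  hadoop-env.sh  hdfs-site.xml  mapred-site.xml.template  yarn-site.xml ...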

    [Screenshot: configuration files]

    After creating and configuring your virtual server, the Ubuntu instance is now ready for the installation and configuration of an Apache Hadoop 2.6.0 Single Node Cluster. The next section describes in detail the steps to install Apache Hadoop 2.6.0 and configure a Single-Node Apache Hadoop cluster.

2. Configure the Apache Hadoop 2.6.0 Single Node Server

    This section explains the steps to configure the Single Node Apache Hadoop 2.6.0 Server on Ubuntu.

2.1 Update the Configuration files
2.1.1 Update the ".bashrc" file for user 'user'.

    Move to the 'user' $HOME directory and edit the '.bashrc' file.

    [Screenshot: .bashrc]

    Update the '.bashrc' file to add the important Apache Hadoop environment variables for the user.
    (a) Change directory to home:
    $ cd
    (b) Edit the file:
    $ sudo gedit .bashrc

    [Screenshot: gedit .bashrc]

    # Set Hadoop-related environment variables
    export HADOOP_HOME=$HOME/hadoop-2.6.0
    export HADOOP_CONF_DIR=$HOME/hadoop-2.6.0/etc/hadoop
    export HADOOP_MAPRED_HOME=$HOME/hadoop-2.6.0
    export HADOOP_COMMON_HOME=$HOME/hadoop-2.6.0
    export HADOOP_HDFS_HOME=$HOME/hadoop-2.6.0
    export YARN_HOME=$HOME/hadoop-2.6.0
    # Set JAVA_HOME
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HOME/hadoop-2.6.0/bin

    [Screenshot: Export Path]

(c) Source the .bashrc file to set the Hadoop environment variables without having to invoke a new shell:
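$ source .bashrc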

    [Screenshot: source .bashrc]

2.2 Set up the Hadoop Cluster

    This section describes the detailed steps needed to set up the Hadoop cluster and configure the core Hadoop configuration files.

2.2.1 Configure JAVA_HOME

    Configure JAVA_HOME in ‘hadoop-env.sh’. This file specifies environment variables that affect the JDK used by Apache Hadoop 2.6.0 daemons started by the Hadoop start-up scripts:
    /hadoop-2.6.0/etc/hadoop$sudo gedit hadoop-env.sh

    [Screenshot: hadoop-env.sh]

Copy this line into hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

    [Screenshot: openjdk-7-jdk]

2.2.2 Create NameNode and DataNode directories

    Check the $HADOOP_HOME value:
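    $echo $HADOOP_HOME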

    [Screenshot: HADOOP_HOME]

    Create DataNode and NameNode directories to store HDFS data.
    $sudo mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/namenode
    $sudo mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/datanode

    [Screenshot: datanode]

    Check that the hadoop2_data folder has been created inside /home/user/hadoop-2.6.0.

    [Screenshot: location]

2.2.3 Configure the Default File system

The core-site.xml file contains the configuration settings for Apache Hadoop Core, such as I/O settings that are common to HDFS, YARN and MapReduce. Configure the default file system (parameter: fs.default.name) used by clients in core-site.xml.

    [Screenshot: core-site.xml]

    Note

      Copy the following <configuration> block and replace the matching block in the core-site.xml file:

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>
    </configuration>

    Here, hostname and port refer to the machine and port on which the NameNode daemon runs and listens. They also inform the NameNode as to which IP and port it should bind to. The commonly used port is 9000, and you can also specify an IP address rather than a hostname.

2.2.4 Configure the HDFS

This file contains the configuration settings for the HDFS daemons: the NameNode and the DataNodes. Configure hdfs-site.xml and specify the default block replication, and the NameNode and DataNode directories for HDFS. The actual number of replications can be specified when a file is created; the default is used if replication is not specified at create time.
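    As a reference, here is a minimal hdfs-site.xml sketch for a single-node cluster, assuming the NameNode and DataNode directories created in section 2.2.2 and a replication factor of 1, which is appropriate for a single node (the original snapshots show the exact values used):

    <configuration>
    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>
    <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/user/hadoop-2.6.0/hadoop2_data/hdfs/namenode</value>
    </property>
    <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/user/hadoop-2.6.0/hadoop2_data/hdfs/datanode</value>
    </property>
    </configuration>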

    [Screenshot: hdfs-site.xml]

2.2.5 Configure YARN framework

    This file contains the configuration settings for the YARN daemons, i.e. the ResourceManager and the NodeManager.
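    As a reference, here is a commonly used minimal yarn-site.xml sketch for a single-node Hadoop 2.6.0 setup, enabling the MapReduce shuffle auxiliary service (the original snapshots show the exact values used):

    <configuration>
    <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    </property>
    </configuration>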

    [Screenshot: yarn-site.xml]

2.2.6 Configure the MapReduce framework

    This file contains the configuration settings for MapReduce. Create mapred-site.xml from the bundled template and specify the framework details.
    /hadoop-2.6.0/etc/hadoop$cp mapred-site.xml.template mapred-site.xml

    [Screenshot: mapred-site.xml]

    /hadoop-2.6.0/etc/hadoop$sudo gedit mapred-site.xml
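    As a reference, here is a minimal mapred-site.xml sketch that tells MapReduce to run on the YARN framework (the original snapshot shows the exact values used):

    <configuration>
    <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    </property>
    </configuration>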

    [Screenshot: mapred-site.xml configuration]

2.2.7 Edit /etc/hosts file

    Run ifconfig in the terminal and note down the IP address. Then put this IP address in the /etc/hosts file as shown in the snapshots below, save the file, and close it.
    $cd
    $ifconfig

    [Screenshot: edit hosts file]

    In this file, the IP address and the hostnames localhost and ubuntu are separated by tabs.
    $sudo gedit /etc/hosts

    [Screenshot: /etc/hosts]

    Note

      If you do not change anything in this /etc/hosts file and only the first two lines are present, that is also correct. The /etc/hosts file generally matters in a multi-node cluster.

2.2.9 Creating the SSH key

    Generate a passwordless RSA key pair so that the Hadoop scripts can log in to localhost without prompting for a password:
    $ssh-keygen -t rsa -P ""

    [Screenshot: ssh-keygen]

2.2.10 Moving the key to authorized_keys

    $cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    [Screenshot: authorized_keys]

2.2.11 Start the DFS services

    The first step in starting up your Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file systems of your cluster. This is required the first time you install Hadoop. Do not format a running Hadoop file system; this will cause all your data to be erased. To format the file system, run the command:
    $cd
    $hadoop namenode -format

    [Screenshot: namenode format]

-----------------------Reboot the system------------------------

    You are now all set to start the Hadoop services, i.e. the NameNode, DataNode, ResourceManager and NodeManager, on your Apache Hadoop cluster.
    $cd hadoop-2.6.0/sbin/
    $./hadoop-daemon.sh start namenode
    $./hadoop-daemon.sh start datanode

    [Screenshot: start namenode and datanode]

    Start the YARN daemons, i.e. the ResourceManager and the NodeManager. Cross-check the service start-up using jps (a Java process status tool).
    $./yarn-daemon.sh start resourcemanager
    $./yarn-daemon.sh start nodemanager

    [Screenshot: start resourcemanager and nodemanager]

    Start the History server.
    $./mr-jobhistory-daemon.sh start historyserver

    [Screenshot: start historyserver]

    Note

      Always suspend your VMware Workstation VM; do not shut it down, so that when you open your VM again your cluster will still be up. If you do shut it down, all your daemons will be down (not running) when you start the VM again. In that case, start all your daemons again, beginning with the NameNode, but do not format the NameNode again.

2.2.12 Perform the Health Check

    a) Check the NameNode status:
    http://localhost:50070/dfshealth.jsp

    [Screenshot: NameNode status]

    b) JobHistory status:
    http://localhost:19888/jobhistory

    [Screenshot: JobHistory status]

    c) Browse HDFS input and output files and log information:
    http://localhost:50070

    [Screenshot: HDFS input and output files]