3 min read 20-10-2024
Installing Hadoop on Ubuntu: A Comprehensive Guide

Hadoop, the open-source framework for distributed storage and processing of massive datasets, is a powerful tool for data scientists, developers, and anyone working with big data. This article will guide you through the installation process of Hadoop on an Ubuntu system, focusing on a single-node setup for ease of understanding.

Pre-requisites:

Before embarking on the installation, make sure you have the following:

  • Ubuntu System: A working Ubuntu system with a minimum of 2GB RAM and 20GB of free disk space is recommended.
  • Java Development Kit (JDK): Hadoop relies on Java, so ensure you have a compatible JDK version installed.
  • SSH Server: Even on a single node, the Hadoop start scripts use ssh to launch the daemons, so install openssh-server and set up passwordless key-based login to localhost.

Step 1: Installing Java

Hadoop requires Java, so install OpenJDK 11 by running the following commands in your terminal:

sudo apt update
sudo apt install openjdk-11-jdk

This will install the Java Development Kit (JDK) version 11. You can verify the installation by checking the Java version:

java -version

Step 2: Downloading Hadoop

Download a stable Hadoop release from the Apache Hadoop website https://hadoop.apache.org/. Choose the "Binary" distribution and extract the downloaded archive to a desired location. For example, for version 3.3.5 (older releases remain available from the Apache archive at https://archive.apache.org/dist/hadoop/common/ after they drop off the main download mirrors):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
sudo tar -xzvf hadoop-3.3.5.tar.gz -C /usr/local/

This will extract the Hadoop archive into the /usr/local/ directory. You can adjust the destination as per your preference.

Step 3: Configuring Hadoop

Navigate to the Hadoop configuration directory:

cd /usr/local/hadoop-3.3.5/etc/hadoop
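
Before editing the XML files, tell Hadoop where Java lives: the Hadoop scripts do not reliably pick up JAVA_HOME on their own, so set it explicitly in etc/hadoop/hadoop-env.sh. The path below is the usual OpenJDK 11 location on Ubuntu amd64; verify yours with readlink -f $(which java):

```shell
# In /usr/local/hadoop-3.3.5/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Optionally, in ~/.bashrc, put Hadoop's bin and sbin directories on your PATH
export HADOOP_HOME=/usr/local/hadoop-3.3.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

With HADOOP_HOME on your PATH, the later commands can be typed as hdfs, start-dfs.sh, and so on, instead of their full paths.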

Now, open the core-site.xml file in a text editor:

sudo nano core-site.xml

Add the following configuration properties within the <configuration> tags:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
</property>

Explanation:

  • fs.defaultFS: The default file system URI. Here it points at an HDFS NameNode listening on localhost, port 9000.
  • hadoop.tmp.dir: The base temporary directory used by Hadoop. Note that /tmp is cleared on reboot, so point this at a persistent path for anything beyond short-lived testing.

Next, open the hdfs-site.xml file:

sudo nano hdfs-site.xml

Add the following properties:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop-3.3.5/data/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop-3.3.5/data/datanode</value>
</property>

Explanation:

  • dfs.replication: This property sets the replication factor for HDFS, which determines how many copies of each block are stored. In a single-node setup, the replication factor is set to 1.
  • dfs.namenode.name.dir: This defines the directory where the NameNode stores its metadata.
  • dfs.datanode.data.dir: This specifies the directory where the DataNode stores data blocks.
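
The NameNode and DataNode directories referenced above must exist and be writable by the user who will run Hadoop before the NameNode is formatted. A minimal setup, assuming you run the daemons as your own user:

```shell
# Create the HDFS metadata and block storage directories
sudo mkdir -p /usr/local/hadoop-3.3.5/data/namenode
sudo mkdir -p /usr/local/hadoop-3.3.5/data/datanode

# Hand the whole installation to the user who will run the daemons
sudo chown -R "$USER":"$USER" /usr/local/hadoop-3.3.5
```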

Finally, open the yarn-site.xml file:

sudo nano yarn-site.xml

Add the following property:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

This property registers MapReduce's shuffle handler as a NodeManager auxiliary service, so the NodeManager can serve map outputs to reducers during MapReduce jobs.
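
If you also plan to submit MapReduce jobs, one more file in the same directory is commonly edited: mapred-site.xml. A minimal addition (again within the <configuration> tags) telling MapReduce to use YARN as its execution framework:

```xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
```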

Step 4: Formatting the NameNode

Before starting Hadoop, you need to format the NameNode. This initializes the HDFS namespace and writes the initial metadata into the configured NameNode directory. Run the following as the user who owns the Hadoop directories:

/usr/local/hadoop-3.3.5/bin/hdfs namenode -format

If the storage directory already contains data, you will be prompted to confirm re-formatting; type "Y" and press Enter.

Step 5: Starting Hadoop

You can now start the HDFS and YARN daemons (start-all.sh still exists but is deprecated in favor of these two scripts):

/usr/local/hadoop-3.3.5/sbin/start-dfs.sh
/usr/local/hadoop-3.3.5/sbin/start-yarn.sh

This starts the NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager. You can list the running daemons with the JDK's jps tool:

jps

To stop all services:

/usr/local/hadoop-3.3.5/sbin/stop-yarn.sh
/usr/local/hadoop-3.3.5/sbin/stop-dfs.sh

Step 6: Verifying the Installation

You can verify that Hadoop is running by opening its web UIs in a browser. In Hadoop 3.x, the HDFS NameNode UI is at http://localhost:9870/ and the YARN ResourceManager UI is at http://localhost:8088/. If both pages load, your installation was successful!
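
Beyond the web UIs, a quick way to confirm HDFS is usable end to end is to create a directory and copy a file into it. The commands below are illustrative; they assume the daemons from the previous step are running:

```shell
# Create a home directory for the current user in HDFS and copy a file in
/usr/local/hadoop-3.3.5/bin/hdfs dfs -mkdir -p /user/"$USER"
/usr/local/hadoop-3.3.5/bin/hdfs dfs -put /etc/hostname /user/"$USER"/
/usr/local/hadoop-3.3.5/bin/hdfs dfs -ls /user/"$USER"
```

If the final listing shows the hostname file, HDFS is accepting both writes and reads.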

Additional Tips:

  • Security: For production environments, consider securing Hadoop by implementing Kerberos authentication.
  • Cluster Setup: For larger deployments, you can configure a multi-node Hadoop cluster with many DataNodes and NodeManagers, plus high-availability pairs of NameNodes and ResourceManagers.
  • Troubleshooting: Refer to the Hadoop documentation and online resources for troubleshooting common issues during installation and configuration.
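
One frequent source of cryptic startup failures is a stray character in one of the edited XML files. Before digging deeper, it is worth checking that each file is at least well-formed XML; the snippet below does this with the python3 interpreter that ships with Ubuntu (run it from the etc/hadoop directory):

```shell
# Parse each config file; python3 exits non-zero on a malformed file
for f in core-site.xml hdfs-site.xml yarn-site.xml; do
  python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" "$f" \
    && echo "$f: well-formed"
done
```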

Conclusion

Installing Hadoop on Ubuntu can be a simple and rewarding process, allowing you to leverage its powerful capabilities for data processing and storage. Remember to consult the official Hadoop documentation for more detailed instructions and customization options. Happy Hadoop-ing!
