Autologica: java

These instructions are for installing and running Hadoop on a single-node cluster. This tutorial follows the same format, and largely the same steps, as Michael Noll's incredibly thorough and well-written tutorial on Ubuntu cluster setup. This is pretty much his procedure with changes made for OS X users. I also added a few things that I pieced together from the Hadoop Quickstart and the forums/archives.

Step 1: Creating a designated hadoop user on your system

This isn't -entirely- necessary, but it's a good idea for security reasons.
To add a user, go to:
System Preferences > Accounts
Click the "+" button near the bottom of the account list. You may need to unlock this ability by hitting the lock icon at the bottom corner and entering the admin username and password.
When the New Account window comes up, enter a name, a short name, and a password. I entered the following:
Name: Hadoop
Short name: hadoop
Password: MyPassword (well you get the idea)

Once you are done, hit "create account".
Now, log in as the hadoop user. You are ready to set up everything!

Step 2: Install/Configure Preliminary Software

Before installing Hadoop, there are several things that you need to make sure you have on your system.

1. Java, and the latest version of the JDK
2. SSH

Because OS X is awesome, you actually don't have to install these things. However, you will have to enable and update what you have. Let's start with Java:

Updating Java
Open up the Terminal application. If it's not already on your dock, you can access it through
Applications > Utilities > Terminal
Next check to see the version of Java that's currently available on the system:
$:~ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

You may want to update this to Sun's Java 6, which is available as an update for OS X 10.5 (Update 1). It's currently only available for 64-bit machines, though. You can download it here.

After you download and install the update, you are going to need to configure Java on your system so the default points to this new update.
Go to Applications > Utilities > Java > Java Preferences
Under "Java Version" hit the radio button next to "Java SE 6"
Down by "Java Application Runtime Settings" change the order so Java SE 6 (64 bit) is first, followed by Java SE 5 (64 bit) and so on.
Hit "Save" and close this window.

Now, when you go to the terminal, and type in "java -version" you should get the following:
$:~ java -version
java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13-120)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_05-b13-52, mixed mode)

and for "javac -version":
$:~ javac -version
javac 1.6.0_05
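If you want to make this check from a script rather than by eye, something like the following sketch could work. The helper names java_minor and version_from_output are my own (hypothetical), and the parsing assumes the "1.x.y_zz" version format shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the minor number of a Java version string,
# e.g. "1.6.0_05" -> 6. Assumes the "1.x.y_zz" numbering shown above.
java_minor() {
    echo "$1" | sed -n 's/^1\.\([0-9][0-9]*\)\..*/\1/p'
}

# Hypothetical helper: read `java -version` output on stdin and print the
# quoted version string found on its first line.
version_from_output() {
    sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Example using the first line of the output shown above.
v=$(echo 'java version "1.6.0_05"' | version_from_output)
if [ "$(java_minor "$v")" -ge 6 ]; then
    echo "Java $v is new enough"
else
    echo "Java $v is older than Java 6"
fi

# On a real system (java prints its version to stderr, hence 2>&1):
# java -version 2>&1 | version_from_output
```

Feeding the helper a literal line keeps the example runnable even on a machine without Java; swap in the commented pipeline to test your own install.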

On to SSH!

SSH: Setting up Remote Login and enabling self-login
SSH also comes installed on your Mac. However, you need to enable access to your own machine (so hadoop doesn't ask you for a password at inconvenient times).
To do this, go to System Preferences > Sharing (under Internet & Network)
Under the list of services, check "Remote Login". For extra security, you can hit the radio button for "Only these users" and select hadoop.

Now, we're going to configure things so we can log into localhost without being asked for a password. Type the following into the terminal:

$:~ ssh-keygen -t rsa -P ""
$:~ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now try:
$:~ ssh localhost

You should be able to log in without a problem.
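One thing to watch out for: if you run the cat command above more than once, the same key gets appended again each time. Here is a small sketch of a repeatable version. The function name add_authorized_key is my own (hypothetical), and the chmod reflects my assumption that your sshd, like many default setups, refuses authorized_keys files with loose permissions.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file only
# if that exact line is not already there, then tighten permissions.
add_authorized_key() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -qxF -e "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    # Assumption: sshd may ignore key files that others can write to.
    chmod 600 "$auth_file"
}

# Typical use, with the key generated above:
# add_authorized_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
```

Running it twice leaves a single copy of the key in the file.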

You are now ready to install Hadoop. Let's go to step 3!

Step 3: Downloading and Installing Hadoop

So this actually involves several smaller steps:

1. Downloading and Unpacking Hadoop
2. Configuring Hadoop
3. Formatting and Testing Hadoop

After we finish these, you should be ready to go! So let's get started:

Downloading and Unpacking Hadoop

Download Hadoop. Make sure you download the latest version (as of this blogpost, 0.17.2 and 0.18.0 are the latest versions). We call our generic version of hadoop hadoop-* in this tutorial.

Unpack the hadoop-*.tar.gz in the directory of your choice. I placed mine in /Users/hadoop. You may also want to set ownership permissions for the directory:

$:~ tar -xzvf hadoop-*.tar.gz
$:~ chown -R hadoop hadoop-*

Configuring Hadoop

There are two files that we want to modify when we configure Hadoop. The first is conf/hadoop-env.sh . Open this in nano or your favorite text editor and do the following:

- uncomment the export JAVA_HOME line and set it to /Library/Java/Home

- uncomment the export HADOOP_HEAPSIZE line and keep it at 2000

You may want to change other settings as well, but I chose to leave the rest of hadoop-env.sh the same. Here is an idea of what part of mine looks like:

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/Library/Java/Home

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
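If you would rather script these two edits than make them by hand, a sed filter along these lines is one option. This is only a sketch: uncomment_export is a name I made up, the sample input lines are stand-ins for whatever your hadoop-env.sh contains, and the filter assumes the value has no "|" or "&" characters in it.

```shell
#!/bin/sh
# Hypothetical helper: read a hadoop-env.sh style file on stdin and write it
# to stdout with the commented-out "export NAME=..." line replaced by an
# active "export NAME=VALUE" line. Other lines pass through unchanged.
uncomment_export() {
    name=$1
    value=$2
    # "|" is the sed delimiter because the value may contain slashes.
    sed "s|^#[[:space:]]*export ${name}=.*|export ${name}=${value}|"
}

# Example on two stand-in lines (the commented defaults are assumptions):
printf '%s\n' '# export JAVA_HOME=/some/default/path' \
              '# export HADOOP_HEAPSIZE=2000' |
    uncomment_export JAVA_HOME /Library/Java/Home |
    uncomment_export HADOOP_HEAPSIZE 2000

# On a real file, write to a new file and review it before replacing:
# uncomment_export JAVA_HOME /Library/Java/Home < conf/hadoop-env.sh > hadoop-env.sh.new
```

Writing to a separate file first means you can diff the result against the original before you commit to it.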

The next file that we need to set up is conf/hadoop-site.xml. The most important things to do here are to set hadoop.tmp.dir (to the directory of your choice) and to add the mapred.tasktracker.tasks.maximum property, which sets the maximum number of tasks that a task tracker can run simultaneously. You should also set the value of dfs.replication to 1.

Below is a sample hadoop-site.xml file:

---------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/Users/hadoop/hadoop-0.17.2.1/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>8</value>
<description>The maximum number of tasks that will be run simultaneously by
a task tracker.
</description>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

</configuration>
-----------
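After editing the file, it can be handy to double check what value a property actually ended up with. The sketch below does that with awk. Note that site_property is a hypothetical helper of mine, and it is a line-oriented shortcut that assumes the one-tag-per-line layout of the sample above; it is not a real XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows a given <name> in a
# hadoop-site.xml style file.
# Usage: site_property PROPERTY_NAME FILE
site_property() {
    awk -v want="<name>$1</name>" '
        # Remember that we have seen the wanted <name> line.
        index($0, want) { found = 1; next }
        # The first <value> line after it holds the answer.
        found && /<value>.*<\/value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$2"
}

# Example, run from inside the unpacked Hadoop directory:
# site_property dfs.replication conf/hadoop-site.xml
```

If the property is missing from the file, the function prints nothing.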

Now to our last step!

Formatting and Testing Hadoop

Our last step involves formatting the namenode and testing our system.

$:~ hadoop-*/bin/hadoop namenode -format

This will give you output along the lines of

$:~ hadoop-*/bin/hadoop namenode -format
08/09/14 21:22:14 INFO dfs.NameNode: STARTUP_MSG:
/***********************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = loteria/127.0.0.1
STARTUP_MSG: args = [-format]
***********************************************************/
08/09/14 21:22:14 INFO dfs.Storage: Storage directory [...] has been successfully formatted.
08/09/14 21:22:14 INFO dfs.NameNode: SHUTDOWN_MSG:
/***********************************************************
SHUTDOWN_MSG: Shutting down NameNode at loteria/127.0.0.1
***********************************************************/

Once this is done, we are ready to test our program.

First, start the Hadoop daemons. On a single node, this starts the DFS daemons (including the NameNode and DataNode) and the MapReduce daemons (JobTracker and TaskTracker) on the machine.

$:~ hadoop-*/bin/start-all.sh

As input for our test, we are going to copy the conf folder up to our DFS.

$:~ hadoop-*/bin/hadoop dfs -copyFromLocal hadoop-*/conf input

You can check to see if this actually worked by doing an ls on the dfs as follows:
$:~ hadoop-*/bin/hadoop dfs -ls
Found 1 item
/user/hadoop/input <dir> 2008-09-11 13:33 rwxr-xr-x hadoop supergroup

Now, we need to compile the code. cd into the hadoop-*/ directory and do:
$:~ ant examples

This will compile the example programs found in hadoop-*/src/examples

Now, we will run the example distributed grep program with the conf folder as input.

$:~ hadoop-*/bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

If this works, you'll see something like this pop up on your screen:
08/09/13 20:47:24 INFO mapred.FileInputFormat: Total input paths to process : 1
08/09/13 20:47:24 INFO mapred.JobClient: Running job: job_200809111608_0033
08/09/13 20:47:25 INFO mapred.JobClient: map 0% reduce 0%
08/09/13 20:47:38 INFO mapred.JobClient: map 13% reduce 0%
08/09/13 20:47:39 INFO mapred.JobClient: map 16% reduce 0%
08/09/13 20:47:43 INFO mapred.JobClient: map 22% reduce 0%
08/09/13 20:47:44 INFO mapred.JobClient: map 24% reduce 0%
08/09/13 20:47:48 INFO mapred.JobClient: map 33% reduce 0%
08/09/13 20:47:53 INFO mapred.JobClient: map 41% reduce 0%
08/09/13 20:47:54 INFO mapred.JobClient: map 44% reduce 0%
08/09/13 20:47:58 INFO mapred.JobClient: map 50% reduce 0%
08/09/13 20:47:59 INFO mapred.JobClient: map 52% reduce 0%
08/09/13 20:48:03 INFO mapred.JobClient: map 61% reduce 0%
08/09/13 20:48:08 INFO mapred.JobClient: map 69% reduce 0%
08/09/13 20:48:09 INFO mapred.JobClient: map 72% reduce 0%
08/09/13 20:48:13 INFO mapred.JobClient: map 78% reduce 0%
08/09/13 20:48:14 INFO mapred.JobClient: map 80% reduce 0%

... and so on

The last step is to check if you have output!
You can do this by doing a

$:~ hadoop-*/bin/hadoop dfs -ls output
Found 2 items
/user/hadoop/output/_logs <dir> 2008-09-13 19:21 rwxr-xr-x hadoop supergroup
/user/hadoop/output/part-00000 <r 1> 2917 2008-09-13 20:10 rw-r--r-- hadoop supergroup

The most important part is that the number next to the <r 1> should not be 0.

To check the actual contents of the output do a

$:~ hadoop-*/bin/hadoop dfs -cat output/*

Alternatively, you can copy it to local disk and check/modify it:

$:~ hadoop-*/bin/hadoop dfs -copyToLocal output myoutput
$:~ cat myoutput/*
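Once the output is on local disk, the "is it non-empty?" check can also be scripted. A minimal sketch, assuming the output was copied into a local directory (myoutput here) and that result files are named part-* as in the listing above; has_output is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR holds at least one non-empty part-* file.
has_output() {
    for f in "$1"/part-*; do
        # -s is true only for files that exist and have a size above zero
        [ -s "$f" ] && return 0
    done
    return 1
}

# Example:
# if has_output myoutput; then echo "job produced output"; else echo "no output"; fi
```

If the directory has no part-* files at all, the check simply fails, which is the same conclusion you would reach from an empty listing.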

When you're done running jobs on the dfs, run the stop-all.sh command.

$:~ hadoop-*/bin/stop-all.sh

And that concludes our tutorial! Hope someone finds this helpful!

So for the last couple of days, I've been trying to get Hadoop working on my computer at work, which runs Fedora. Monday (yesterday), I found out that the source of my troubles was that Hadoop requires Sun's Java. Fedora, by default, comes with GNU's Java. This would be OK in most cases, as GNU's Java mimics Sun's Java and has most of the same functionality. However, certain projects (like Hadoop) require Sun's Java, so it's a good idea to have both on the system.

Installing Java on Fedora is a pain. No one guide online helped me successfully install the damn thing onto my work machine, which led to quite a bit of frustration. However, since things seem to be working correctly (finally!), I thought I would share my procedure with all of you who may have to do this at some point. Even though the majority of my code posts concern Ubuntu, I think this is a worthwhile diversion.

To install Java properly (for the coders out there), there are two parts: i.) the Java Runtime Environment (JRE) and ii.) the Java Development Kit (JDK). Most of the instructions here are an amalgamation of instructions found on two sites, both of which had some instructions that worked for me and others that did not.

----

Before we begin, make sure your system is completely up to date. This is CRUCIAL. The following instructions probably will not work properly if your Fedora install is not fully updated. One thing that slowed me down quite a bit is that I initially skipped this step.
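The instructions in this post lean on a handful of command-line tools: wget, rpm, rpmbuild, and yum. A quick preflight check like the sketch below can tell you up front which ones are missing. The function name missing_tools is my own (hypothetical); command -v is the standard shell way to ask whether a command can be found.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each given tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools used in this post.
missing=$(missing_tools wget rpm rpmbuild yum)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found"
fi
```

Anything it reports as missing can be installed with yum as described in the steps that follow.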

INSTALLING THE JAVA RUNTIME ENVIRONMENT (JRE)

1. Enter root mode by typing in
su

While I imagine it is possible to install this stuff locally, having root access is ideal.

2. Make sure that your system has rpmdevtools and jpackage-utils. You can install these by typing:

yum -y install rpmdevtools
yum -y install jpackage-utils

3. Next grab the jpackage key and set up the jpackage repositories for yum

rpm --import http://jpackage.org/jpackage.asc
cd /etc/yum.repos.d
wget http://www.jpackage.org/jpackage17.repo

If SUCCESS, go to step 4.
If FAIL, check to make sure that you have wget installed. Install it using:
yum -y install wget

4. Get the latest Jpackage java-x-sun-xjpp.nosrc.rpm package from the non-free branch at jpackage. The package I used was this one.
Install this file with rpm by typing something akin to the following (yours may be different based on version numbers)
rpm -ivh java-1.6.0-sun-1.6.0.6-1jpp.nosrc.rpm

If this works, you will see it installed under /usr/src/redhat/

If SUCCESS: go to step 5.
If FAIL: Check permissions. You may have to change permissions as follows:

chmod 755 java-1.6.0-sun-1.6.0.6-1jpp.nosrc.rpm

If SUCCESS, go to step 5.
If FAIL, I cannot help you. Sorry!

5. Get the latest binary. I got mine from here. It should look something like jre-6u6-linux-i586.bin.
Move this file to the SOURCES directory. So something like,

mv jre-6u6-linux-i586.bin /usr/src/redhat/SOURCES/.

Next, rebuild the java rpm using

rpmbuild -ba java-1.x.0-sun.spec

If SUCCESS, you should now be able to see the binary in /usr/src/redhat/RPMS/i586/ . Go to step 6.
If FAIL, make sure that you are running this command in the /usr/src/redhat/SPECS/ directory. If you do not have a spec file, you did something wrong in the previous steps, or I can't help you.

6. Install the binaries in /usr/src/redhat/RPMS/i586/ by using the following command:
yum --nogpgcheck localinstall java*.rpm

If SUCCESS, go on to part 2.
If FAIL, the most likely cause is a message saying that --nogpgcheck is not a valid option for yum. In that case, omit it. Alternatively, you can manually go into the /usr/src/redhat/RPMS/i586/ folder and install all the rpms by double clicking on them.

INSTALLING JDK

1. Make sure you have the following packages: rpm-build and fedora-rpmdevtools. If not, install using
yum install fedora-rpmdevtools
yum install rpm-build

2. Grab the sun jdk rpm file. I got mine from here. The file should look something like: jdk-6u6-linux-i586-rpm.bin

Make the file executable and run it (use the name of the file you downloaded):
chmod 755 jdk-6u6-linux-i586-rpm.bin
./jdk-6u6-linux-i586-rpm.bin

If SUCCESS, you should be able to see a whole bunch of RPMs located in /usr/java/jdk1.6.0_06 and a new directory in /opt/sun . Move to step 3.

3. Next, install the RPM. You'll also need a compat file from Jpackages. I got mine here. It should look something like java-1.6.0-sun-compat-1.6.0.06-1jpp.src.rpm.

Do the following:
yum --enablerepo=jpackage-generic-nonfree install java-1.6.0-sun-compat-1.6.0.06-1jpp.src.rpm

Say yes when prompted. This should complete installation of JDK.

4. Keep in mind that the default Java may still not be Sun Java. To fix this, we need one last step:

/usr/sbin/alternatives --config java

This should show two options: GIJ (the older java) and Sun's Java. On my machine it was option 2. Enter the number that corresponds to the Sun Java install and press enter.

Now, if you type in:

java -version

It should print out something like:

java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06)
Java HotSpot(TM) Client VM (build 1.6.0_06, mixed mode, sharing)

You're done!

Additionally, you'll probably want to install the mozilla plugin. Do something akin to the following
ln -s /usr/lib/jvm/java-1.6.0-sun-1.6.0/jre/plugin/i386/ns7/libjavaplugin_oji.so /usr/lib/mozilla/plugins/libjavaplugin_oji.so

And that should fix everything. Hope this post will eventually be helpful to someone. Comments welcome.

Wednesday, September 10, 2008

Running Hadoop on OS X 10.5 (64-bit) single node cluster

Tuesday, June 24, 2008

Installing Sun's Java JRE and JDK on Fedora 5
