Step-by-Step Guide to Setting Up an R-Hadoop System

This guide assumes that Hadoop is already installed on your system. If it is not, install it first by following the link below:

Install hadoop

1. Install R

The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so the steps should also work with other versions of R, at least with R 2.15.2 and above.

It is recommended to install RStudio as well, if it is not installed yet. It is not mandatory, but it makes R programming and managing R projects easier.

2. Install GCC, Homebrew, git, pkg-config and thrift

GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.

2.1 Download and install GCC

Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.

2.2 Install Homebrew

Homebrew is “the missing package manager” for Mac OS X. To install Homebrew, the current user account needs to be an administrator; otherwise, switch to an administrator account with “su” first.


su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor

If any errors occur in the step above, refer to the Homebrew website at http://brew.sh.

2.3 Install git and pkg-config


brew install git
brew install pkg-config

2.4 Install thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation.

Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it did not work well for the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.

Do NOT run the command below, which installs the latest version of thrift (0.9.1 as of 9 May 2014).


## Do NOT run command below !!!
brew install thrift

Instead, follow the steps below to install thrift 0.9.0.


$ brew versions thrift

Warning: brew-versions is unsupported and may be removed soon.
Please use the homebrew-versions tap instead:
  https://github.com/Homebrew/homebrew-versions
0.9.1    git checkout eccc96b Library/Formula/thrift.rb
0.9.0    git checkout c43fc30 Library/Formula/thrift.rb
0.8.0    git checkout e5475d9 Library/Formula/thrift.rb
0.7.0    git checkout 141ddb6 Library/Formula/thrift.rb
...

Find the formula for thrift 0.9.0 in the list above, and install thrift with that formula.


## go to the homebrew base directory
cd "$(brew --prefix)"

## check out thrift 0.9.0
git checkout c43fc30 Library/Formula/thrift.rb

## install thrift
brew install thrift

Then check whether the pkg-config path is correct.


pkg-config --cflags thrift

The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that the path should end with /include/thrift instead of /include. Otherwise, you will get errors saying that some .h files cannot be found when installing rhbase.
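
The check described above can also be scripted. The following is only a sketch: thrift_cflags_ok is a hypothetical helper name, not part of pkg-config or thrift, and it assumes the include directory appears as a single -I flag without spaces in the path.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any -I flag in the given
# string ends with /include/thrift, and fails otherwise.
thrift_cflags_ok() {
  # $1 is deliberately unquoted so that it is split into separate flags.
  for flag in $1; do
    case "$flag" in
      -I*/include/thrift) return 0 ;;
    esac
  done
  return 1
}

# Run the check against the real pkg-config output, if pkg-config exists.
if command -v pkg-config >/dev/null 2>&1; then
  if thrift_cflags_ok "$(pkg-config --cflags thrift 2>/dev/null)"; then
    echo "thrift include path looks correct"
  else
    echo "thrift include path is missing or does not end with /include/thrift"
  fi
fi
```

If the second message is printed, revisit the thrift installation before attempting to install rhbase.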

If you have any problems installing thrift 0.9.0, see the details on how to install a specific version of a formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.

2.5 More instructions

If there are problems with installing the other packages above, more instructions can be found at the links below.

Note that there are some differences between this process and the instructions from the links below. For example, on Mac there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I did not run the command below, which copies the Thrift library.


sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/

3. Environment settings

Run the code below in R to set the environment variables for Hadoop.


Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

Alternatively, add the lines below to ~/.bashrc so that you don't need to set the variables every time.


export HADOOP_PREFIX=/Users/hadoop/hadoop-1.1.2
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
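
To confirm that the exported values point at real files, a small shell check along the following lines may help. It is illustrative only: check_hadoop_env is a hypothetical helper name, not part of Hadoop or RHadoop, and it tests nothing more than that HADOOP_CMD is an executable file and HADOOP_STREAMING is an existing file.

```shell
#!/bin/sh
# Hypothetical helper: prints a message for each variable that is
# unset or points at a missing file, and returns non-zero if any is bad.
check_hadoop_env() {
  status=0
  if [ ! -f "${HADOOP_CMD:-}" ] || [ ! -x "${HADOOP_CMD:-}" ]; then
    echo "HADOOP_CMD is unset or not an executable file: ${HADOOP_CMD:-<unset>}"
    status=1
  fi
  if [ ! -f "${HADOOP_STREAMING:-}" ]; then
    echo "HADOOP_STREAMING is unset or not a file: ${HADOOP_STREAMING:-<unset>}"
    status=1
  fi
  return $status
}

# Prints the success message only when both checks pass.
check_hadoop_env && echo "Hadoop environment looks usable"
```

Run it in a new terminal after editing ~/.bashrc, so that the check sees the values a fresh shell would pick up.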

4. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr

4.1 Install relevant R packages


install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", 
                   "functional", "stringr", "plyr", "reshape2", "dplyr", 
                   "R.methodsS3", "caTools", "Hmisc"))

RHadoop packages depend on the packages above, which should be installed for all users instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below: it should be in /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to that library, use an administrator account.

The destination library can be set with the lib argument of install.packages() (see the example below) or, in RStudio, by choosing from the drop-down list under “Install to library” in the Install Packages pop-up window.


## find your R libraries
.libPaths()
#"/Users/hadoop/Library/R/3.1/library" 
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

## check which library a package was installed into
system.file(package="functional")
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional"

## install package to a specific library
install.packages("functional", lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")

In addition to the packages above, it is also advisable to install data.table. Without it, I came across an error when running an RHadoop job on a big dataset, although the same job worked fine on a smaller dataset. The reason could be that RHadoop uses data.table to handle large data.


install.packages("data.table")

4.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING

Set the environment variables for Hadoop, if you haven't done so at step 3.


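Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")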

4.3 Install RHadoop packages

Download packages rhdfs, rhbase, rmr2 and plyrmr from https://github.com/RevolutionAnalytics/RHadoop/wiki and install them. As in step 4.1, these packages need to be installed to a library for all users, instead of to a personal library. Otherwise, you will find R-Hadoop jobs fail on those nodes where the packages are not installed in the right library.


install.packages("<path>/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2_2.2.2.tar.gz", repos=NULL, type="source")
install.packages("<path>/plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos=NULL, type="source")

4.4 Further information

If you follow the instructions above but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.

5. Run an R job on Hadoop

Below is an example that counts words in text files from the HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop.

First, we copy some text files to HDFS folder wordcount/data.


## copy local text file to hdfs
bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/

After that, we can use the R code below to run a Hadoop job for word counting.


Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

library(rmr2) 

## map function
map <- function(k,lines) {
  words.list <- strsplit(lines, '\\s') 
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

## reduce function
reduce <- function(word, counts) { 
  keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) { 
  mapreduce(input=input, output=output, input.format="text", 
            map=map, reduce=reduce)
}


## delete previous result if any
system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")

## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data') 
hdfs.out <- file.path(hdfs.root, 'out') 
out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS
results <- from.dfs(out)

## check top 30 frequent words
results.df <- as.data.frame(results, stringsAsFactors=FALSE) 
colnames(results.df) <- c('word', 'count') 
head(results.df[order(results.df$count, decreasing=TRUE), ], 30)

If you can see a list of words and their frequencies, congratulations: you are now ready to do MapReduce work with R!
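
As a rough cross-check of the Hadoop result, the same kind of count can be produced locally on a small sample with standard Unix tools. This is a sketch, not part of RHadoop: local_wordcount is a hypothetical helper, and because tr -s squeezes runs of whitespace, its counts can differ slightly from the R map function above, which splits on every single whitespace character.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and prints "count word" lines,
# most frequent first, with ties broken alphabetically.
local_wordcount() {
  tr -s '[:space:]' '\n' |   # one word per line
    grep -v '^$' |           # drop empty lines
    sort |                   # group identical words together
    uniq -c |                # count each group
    sort -k1,1nr -k2         # highest count first
}

# Small demonstration on an inline sample.
printf 'to be or not to be\n' | local_wordcount | head -n 30

# On the local copies of the files uploaded earlier, this would be:
# cat /Users/hadoop/try-hadoop/wordcount/data/*.txt | local_wordcount | head -n 30
```

If the most frequent words roughly match the top of the list returned by from.dfs(), the MapReduce job is very likely doing what it should.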

Thursday, 2 April 2015