This guide assumes that Hadoop is already installed on your system. If it is not installed yet, you can visit the following link:
1. Install R
The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so it should work with other versions of R, at least with R 2.15.2 and above.
It is recommended to install RStudio as well, if it is not installed yet. It makes R programming and managing R projects easier, although it is not mandatory.
2. Install GCC, Homebrew, git, pkg-config and thrift
GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.
2.1 Download and install GCC
2.2 Install Homebrew
Homebrew is “the missing package manager” for Mac OS X. To install Homebrew, the current user account needs to be an administrator, or be granted administrator privileges using “su”.
su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor
Refer to the Homebrew website at http://brew.sh if any errors occur at the step above.
2.3 Install git and pkg-config
brew install git
brew install pkg-config
2.4 Install thrift 0.9.0
Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation.
Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it did not work well for the rhbase installation. It was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0.
Do NOT run the command below, which installs the latest version of thrift (0.9.1 as of 9 May 2014).
## Do NOT run command below !!!
brew install thrift
Instead, follow the steps below to install thrift 0.9.0.
$ brew versions thrift
Warning: brew-versions is unsupported and may be removed soon.
Please use the homebrew-versions tap instead:
https://github.com/Homebrew/homebrew-versions
0.9.1 git checkout eccc96b Library/Formula/thrift.rb
0.9.0 git checkout c43fc30 Library/Formula/thrift.rb
0.8.0 git checkout e5475d9 Library/Formula/thrift.rb
0.7.0 git checkout 141ddb6 Library/Formula/thrift.rb
...
Find the formula for thrift 0.9.0 in the list above, and install with that formula.
## go to the homebrew base directory
cd $(brew --prefix)
## check out thrift 0.9.0
git checkout c43fc30 Library/Formula/thrift.rb
## install thrift
brew install thrift
Then we check whether the pkg-config path is correct.
pkg-config --cflags thrift
The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include. Otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase.
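The check described above can also be scripted. The following is a minimal sketch, assuming a POSIX-style shell such as sh or bash; the function name thrift_include_ok is hypothetical and is not part of pkg-config, thrift or rhbase.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if one of the -I flags passed in
# as a single string ends with /include/thrift.
thrift_include_ok() {
    # $1 is left unquoted on purpose so that it is split into separate flags
    for flag in $1; do
        case "$flag" in
            -I*/include/thrift) return 0 ;;
        esac
    done
    return 1
}

# Usage (requires pkg-config and thrift to be installed):
#   thrift_include_ok "$(pkg-config --cflags thrift)" \
#       && echo "thrift include path looks right"
```

If the function fails, the include path most likely ends with /include, which is the case that leads to the missing .h file errors mentioned above.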
2.5 More instructions
If there are problems installing the other packages above, more instructions can be found at the links below.
Note that there are some differences between this process and the instructions from the links below. For example, on Mac there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I did not run the command below, which copies the Thrift library.
sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
3. Environment settings
Run the code below in R to set environment variables for Hadoop.
Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
Alternatively, add the lines below to ~/.bashrc so that you do not need to set them every time.
export HADOOP_PREFIX=/Users/hadoop/hadoop-1.1.2
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
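Before moving on, it can help to confirm that these variables point to paths that actually exist, since a wrong path usually only shows up later when an RHadoop job fails. The snippet below is a minimal sketch, assuming a POSIX-style shell; the function name check_hadoop_path is hypothetical and is not part of Hadoop or RHadoop.

```shell
#!/bin/sh
# Hypothetical helper: report whether a variable is set and names an existing path.
# $1 is the variable name (used for messages only), $2 is its value.
check_hadoop_path() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -e "$value" ]; then
        echo "$name points to a missing path: $value"
        return 1
    fi
    echo "$name OK: $value"
}

# Usage, in a new shell after editing ~/.bashrc:
#   check_hadoop_path HADOOP_PREFIX "$HADOOP_PREFIX"
#   check_hadoop_path HADOOP_CMD "$HADOOP_CMD"
#   check_hadoop_path HADOOP_STREAMING "$HADOOP_STREAMING"
```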
4. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
4.1 Install relevant R packages
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",
"functional", "stringr", "plyr", "reshape2", "dplyr",
"R.methodsS3", "caTools", "Hmisc"))
RHadoop packages depend on the packages above, which should be installed for all users instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in path /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to it, use an administrator account.
The destination library can be set with the lib argument of function install.packages() (see an example below) or, in RStudio, by choosing from the drop-down list under “Install to library” in the Install Packages pop-up window.
## find your R libraries
.libPaths()
#"/Users/hadoop/Library/R/3.1/library"
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library"
## check which library a package was installed into
system.file(package="functional")
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional"
## install package to a specific library
install.packages("functional", lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
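The same check can be run from the command line, for example on each node of a cluster. This is a minimal sketch, assuming a POSIX-style shell and the Mac OS X library layout shown above; the function name in_personal_library is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given path lies inside a per-user
# R library on Mac OS X, i.e. /Users/<account>/Library/R/...
in_personal_library() {
    case "$1" in
        /Users/*/Library/R/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (requires R; Rscript prints the directory the package was loaded from):
#   pkg_dir=$(Rscript -e 'cat(system.file(package="functional"))')
#   if in_personal_library "$pkg_dir"; then
#       echo "functional is in a personal library; reinstall it for all users"
#   fi
```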
In addition to the above packages, you are also advised to install data.table. Without it, I came across an error when running an RHadoop job on a big dataset, although the same job worked fine on a smaller dataset. The reason could be that RHadoop uses data.table to handle large data.
install.packages("data.table")
4.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
Set environment variables for Hadoop, if you haven't done so at step 3.
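Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")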
4.3 Install RHadoop packages
Download packages rhdfs, rhbase, rmr2 and plyrmr from https://github.com/RevolutionAnalytics/RHadoop/wiki and install them. As in step 4.1, these packages need to be installed to a library for all users, instead of to a personal library. Otherwise, you would find R-Hadoop jobs fail on those nodes where the packages are not installed in the right library.
install.packages("<path>/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2_2.2.2.tar.gz", repos=NULL, type="source")
install.packages("<path>/plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos=NULL, type="source")
4.4 Further information
5. Run an R job on Hadoop
First, we copy some text files to HDFS folder wordcount/data.
## copy local text file to hdfs
bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/
After that, we can use the R code below to run a Hadoop job for word counting.
Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
library(rmr2)
## map function
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}
## reduce function
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}
## delete previous result if any
system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")
## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
## check top 30 frequent words
results.df <- as.data.frame(results, stringsAsFactors=FALSE)
colnames(results.df) <- c('word', 'count')
head(results.df[order(results.df$count, decreasing=TRUE), ], 30)
If you can see a list of words and their frequencies, congratulations! You are now ready to do MapReduce work with R.
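As a rough cross-check of the Hadoop result on a small input, similar counts can be computed locally with standard Unix tools. This is only a sketch and is not part of RHadoop; the function name local_wordcount is hypothetical. Unlike the map function above, which splits on every single whitespace character, it squeezes runs of whitespace, so it never counts empty strings and the numbers may differ slightly for that reason.

```shell
#!/bin/sh
# Hypothetical helper: count words read from standard input and print the
# 30 most frequent ones, each line holding a count followed by the word.
local_wordcount() {
    # 1. tr:   put one word per line, squeezing runs of whitespace
    # 2. grep: drop an empty line left by leading whitespace
    # 3. sort: group identical words together
    # 4. uniq: prefix each distinct word with its number of occurrences
    # 5. sort: order by count, largest first
    # 6. head: keep the top 30, as in the R code
    tr -s '[:space:]' '\n' | grep -v '^$' | sort | uniq -c | sort -rn | head -n 30
}

# Usage, with the local files that were copied to HDFS:
#   cat /Users/hadoop/try-hadoop/wordcount/data/*.txt | local_wordcount
```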