Wednesday, 16 September 2015

Profanity Editor Mini Project in Python

by Unknown | in Python at Wednesday, September 16, 2015

Python is one of the most widely used general-purpose programming languages. It has many application areas, and almost any type of application can be built with it.

What is Profanity Editor?

Profanity means obscene language or curse words. For example, if you write a large number of emails to your boss every day, a curse word such as "shit" or "crap" may slip into one of them, and using such words could put your job at risk. A profanity editor helps you avoid this.

Profanity Checker tool by Google

Google has developed a profanity checker tool which you can access here. You can type a word or a sentence after the = sign in the link. If the sentence contains profanity, the response will be true; otherwise it will be false.

Creating a Profanity Editor in Python:

import urllib.parse
import urllib.request

def read_text():
    # read the file and close it automatically when done
    with open("F:/Learn Python/text.txt") as quotes:
        contents_of_file = quotes.read()
    # print the file contents
    print(contents_of_file)
    check_profanity(contents_of_file)

def check_profanity(text_to_check):
    # URL-encode the text so that spaces and punctuation survive in the query string
    query = urllib.parse.quote(text_to_check)
    connection = urllib.request.urlopen("http://www.wdyl.com/profanity?q=" + query)
    output = connection.read().decode("utf-8")
    connection.close()
    if "true" in output:
        print("Profanity Alert")
    elif "false" in output:
        print("This document has no curse words")
    else:
        print("Could not scan the document")

read_text()

Since the program reads a text file from my PC, you can paste the email you want to send into that file. You need not reread the email for curse words yourself; just run this program.
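If the web service cannot be reached, a similar check can be approximated offline. The following is only a minimal sketch, not what the program above does: the word list PROFANE_WORDS and the function find_profanity are hypothetical names made up for illustration, and a real word list would need to be much longer.

```python
import re

# Hypothetical, deliberately tiny word list; extend it for real use.
PROFANE_WORDS = {"shit", "crap"}

def find_profanity(text):
    """Return the sorted list of listed curse words that occur in text."""
    # Lowercase the text and split it into whole words, so that
    # "scrap" is not mistaken for "crap".
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(set(words) & PROFANE_WORDS)

email = "Hi Boss, the report is ready. Sorry for the crap formatting."
found = find_profanity(email)
if found:
    print("Profanity Alert:", ", ".join(found))
else:
    print("This document has no curse words")
```

Matching whole words keeps the check simple, at the cost of missing misspelled or disguised words.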

Friday, 11 September 2015

Data Science - The Era of Data yet to begin

by Unknown | in Data Science at Friday, September 11, 2015

Data Scientist has been called the sexiest job of the 21st century. It is a buzzword that is no longer just hype: the world's renowned companies are setting up new Data Science teams.
It has been estimated that the United States alone will fall short of 140,000 to 180,000 Data Scientists in 2018. The following graph depicts the growing trend:

What do Data Scientists do?

After watching this video, you will get a rough idea of what Data Scientists do in their daily work.

Skills Required of a Data Scientist:

Data Science is a combination of a large number of trending technologies. The most important skills required to be an ideal Data Scientist are as follows:

1. Basic Tools: The foremost requirement for becoming a Data Scientist is familiarity with a statistical programming language such as Python or R, and a database querying language such as SQL.

2. Basic Statistics: Statistics is about drawing conclusions from facts, and Data Science is all about facts. If you wish to become a Data Scientist, you must be familiar with the common statistical tests, distributions, likelihood estimators, etc. Since many new startups will be data startups, a good grasp of statistics is vital.

3. Machine Learning: If you work at a large company that manages a lot of data and makes data-driven decisions, machine learning is a must. You need not master all of it, but you should learn some machine learning methods such as random forests, k-means clustering, etc.

4. Data Munging: The data we get is often not clean. You first need to clean out the impurities and then load the data back into your system in order to perform operations on it. This is one of the daily tasks of a Data Scientist.

5. Data Visualisation: Visualisation is a way of telling a story with data. It can be done with newer software such as Tableau, as well as with R, Python and D3.js, a JavaScript visualisation library. A Data Scientist must be able not only to visualise data but also to interpret data presented in pictorial formats or graphs.

6. Creative Thinking: The last and most important thing is that you must think like a Data Scientist. You will be given a huge amount of data and will need to judge the importance of the variables in it. You also need to communicate with your team about how to do the tasks and which tests to apply to the data.

Data Scientist Salary:

All the Best to all the Budding Data Scientists.

Saturday, 2 May 2015

Reading and Parsing JSON Data in R

by Unknown | in R at Saturday, May 02, 2015

R is a wonderful statistical language with a lot of features. R can read and parse a large number of data formats including .xls, .txt, .json, etc.

JSON stands for JavaScript Object Notation. Reading and parsing JSON data in R consists of the following steps:

1. Grabbing Data:

The first step is to grab the data. The data can be on a server, behind an HTML link or stored in a database. In my case the data is stored on a server to which I was given access via SSH.

Here I have 11 GB of tweets which I need to download. I used the jsonlite package in R to get the data from the server. The feature I like most is streaming: you can set the number of lines to read in one iteration.

Here I used the jsonlite package and opted to read 10000 lines in one run. This is the process of reading JSON data in R.
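For comparison, the same chunked-reading idea can be sketched in Python. This is not the jsonlite code used here; it assumes the file holds one JSON object per line, and read_json_batches is a hypothetical helper name.

```python
import io
import json
from itertools import islice

def read_json_batches(lines, batch_size=10000):
    """Yield lists of parsed JSON objects, at most batch_size per list."""
    iterator = iter(lines)
    while True:
        # Take the next batch_size lines without loading the whole file.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        yield [json.loads(line) for line in chunk if line.strip()]

# Small in-memory stand-in for a large file of tweets (made-up data).
sample = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
for batch in read_json_batches(sample, batch_size=2):
    print(len(batch), "records")
```

With a real file, an open file object would take the place of sample, so only one batch is held in memory at a time.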

2. Extracting Data from JSON file

Now that we have loaded our data in R, it is time to extract it from the file. With the following code we can extract data from the JSON file. We need to use [[' ']] (double square brackets) to get our data.

After getting the data, I converted it to a data frame. After that, we can apply text mining or social media analytics to it.
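A rough Python analogue of these two steps, extracting nested fields and collecting them into table-like rows, is sketched below. The field names (user, screen_name, text) are assumptions about what a tweet record might contain, not taken from the data set described here, and to_row is a hypothetical helper.

```python
import json

# Made-up records standing in for parsed tweets.
raw = [
    '{"text": "hello world", "user": {"screen_name": "alice"}}',
    '{"text": "json in R", "user": {"screen_name": "bob"}}',
    '{"text": "no user field"}',
]

def to_row(record):
    """Flatten one parsed JSON object into a flat dict (one table row)."""
    user = record.get("user") or {}
    return {
        "screen_name": user.get("screen_name"),  # None when missing
        "text": record.get("text"),
    }

rows = [to_row(json.loads(line)) for line in raw]
for row in rows:
    print(row["screen_name"], "|", row["text"])
```

Each row has the same keys, which is what makes the result easy to turn into a data frame or any other tabular structure.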

Tuesday, 28 April 2015

How to import different data formats into R

by Unknown | in R at Tuesday, April 28, 2015

Data Import

It is often necessary to import sample textbook data into R before you start working on your homework.

Excel File

Quite frequently, the sample data is in Excel format, and needs to be imported into R prior to use. For this, we can use the function read.xls from the gdata package. It reads from an Excel spreadsheet and returns a data frame. The following shows how to load an Excel spreadsheet named "mydata.xls". This method requires Perl runtime to be present in the system.

> library(gdata)                   # load gdata package
> help(read.xls)                   # documentation
> mydata = read.xls("mydata.xls")  # read from first sheet

Alternatively, we can use the function loadWorkbook from the XLConnect package to read the entire workbook, and then load the worksheets with readWorksheet. The XLConnect package requires Java to be pre-installed.

> library(XLConnect) # load XLConnect package
> wk = loadWorkbook("mydata.xls")
> df = readWorksheet(wk, sheet="Sheet1")

Minitab File

If the data file is in Minitab Portable Worksheet format, it can be opened with the function read.mtp from the foreign package. It returns a list of components in the Minitab worksheet.

> library(foreign)                 # load the foreign package
> help(read.mtp)                   # documentation
> mydata = read.mtp("mydata.mtp")  # read from .mtp file

SPSS File

A data file in SPSS format can be opened with the function read.spss, also from the foreign package. There is a "to.data.frame" option for choosing whether a data frame is to be returned. By default, it returns a list of components instead.

> library(foreign) # load the foreign package
> help(read.spss) # documentation
> mydata = read.spss("myfile", to.data.frame=TRUE)

Table File

A data table can reside in a text file. The cells inside the table are separated by blank characters. Here is an example of a table with 4 rows and 3 columns.

100   a1   b1
200   a2   b2
300   a3   b3
400   a4   b4

Now copy and paste the table above in a file named "mydata.txt" with a text editor. Then load the data into the workspace with the function read.table.

> mydata = read.table("mydata.txt")  # read text file
> mydata                             # print data frame
   V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3
4 400 a4 b4

For further detail of the function read.table, please consult the R documentation.

> help(read.table)

CSV File

The sample data can also be in comma separated values (CSV) format. Each cell inside such data file is separated by a special character, which usually is a comma, although other characters can be used as well.

The first row of the data file should contain the column names instead of the actual data. Here is a sample of the expected format.

Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3

After we copy and paste the data above in a file named "mydata.csv" with a text editor, we can read the data with the function read.csv.

> mydata = read.csv("mydata.csv")  # read csv file
> mydata
  Col1 Col2 Col3
1  100   a1   b1
2  200   a2   b2
3  300   a3   b3

In various European locales, as the comma character serves as the decimal point, the function read.csv2 should be used instead. For further detail of the read.csv and read.csv2 functions, please consult the R documentation.

> help(read.csv)
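To make the difference concrete: read.csv2 expects semicolons between cells and a comma as the decimal mark. The sketch below mimics that convention in Python on made-up data; parse_csv2 is a hypothetical helper, not part of R or of Python's standard library.

```python
import csv
import io

def parse_csv2(text):
    """Parse semicolon-separated text whose numbers use decimal commas."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for record in reader:
        row = {}
        for key, value in record.items():
            try:
                # "1,5" -> 1.5; non-numeric cells are kept as strings.
                row[key] = float(value.replace(",", "."))
            except ValueError:
                row[key] = value
        rows.append(row)
    return rows

data = "Col1;Col2\n1,5;a1\n2,25;a2\n"
for row in parse_csv2(data):
    print(row)
```

Reading such a file with a plain comma separator would instead split "1,5" into two cells, which is why the separate function exists.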

Working Directory

Finally, the code samples above assume the data files are located in the R working directory, which can be found with the function getwd.

> getwd() # get current working directory

You can select a different working directory with the function setwd(), and thus avoid entering the full path of the data files.

> setwd("<new path>") # set working directory

Thursday, 2 April 2015

Step-by-Step Guide to Setting Up an R-Hadoop System

by Unknown | in R at Thursday, April 02, 2015

It is assumed that you have already installed Hadoop on your system. If you have not installed it yet, you can visit the following link:

Install hadoop

1. Install R

The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so it should work with other versions of R, at least R 2.15.2 and above.

It is recommended to install RStudio as well, if it is not installed yet. This will make it easier for R programming and managing R projects, although it is not mandatory.

2. Install GCC, Homebrew, git, pkg-config and thrift

GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.

2.1 Download and install GCC

Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get error “Make Command Not Found” when installing some R packages from source.

2.2 Install Homebrew

Homebrew is "the missing package manager" for Mac OS X. To install Homebrew, the current user account needs to be an administrator, or you need to switch to an administrator account using “su”.


su <administrator_account>
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor

Refer to the Homebrew website at http://brew.sh if any errors occur at the step above.

2.3 Install git and pkg-config


brew install git
brew install pkg-config

2.4 Install thrift 0.9.0

Thrift is needed for installing rhbase. If you do not use HBase, you might skip thrift installation.

Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (which was the latest version at that time), and found it didn't work well for rhbase installation. And then it was a painful process to figure out the reason, uninstall 0.9.1 and then install 0.9.0.

Do NOT run command below, which will install latest version of thrift (0.9.1 as of 9 May 2014).


## Do NOT run command below !!!
brew install thrift

Instead, follow steps below to install thrift 0.9.0.


$ brew versions thrift

Warning: brew-versions is unsupported and may be removed soon.
Please use the homebrew-versions tap instead:
  https://github.com/Homebrew/homebrew-versions
0.9.1    git checkout eccc96b Library/Formula/thrift.rb
0.9.0    git checkout c43fc30 Library/Formula/thrift.rb
0.8.0    git checkout e5475d9 Library/Formula/thrift.rb
0.7.0    git checkout 141ddb6 Library/Formula/thrift.rb
...

Find the formula for thrift 0.9.0 in above list, and install with that formula.


## go to the homebrew base directory
$ cd $( brew --prefix )

## check out thrift 0.9.0
git checkout c43fc30 Library/Formula/thrift.rb

## install thrift
brew install thrift

Then we check whether pkg-config path is correct.


pkg-config --cflags thrift

The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include. Otherwise, you will come across errors saying that some .h files can not be found when installing rhbase.

If you have any problem with installing thrift 0.9.0, see details about how to install a specific version of formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.

2.5 More instructions

If there are problems with installing other packages above, more instructions can be found at links below.

Note that there are some differences between this process and the instructions from the links below. For example, on Mac there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I haven't run the command below to copy the Thrift library.


sudo cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/

3. Environment settings

Run code below in R to set environment variables for Hadoop.


Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

Alternatively, add above to ~/.bashrc so that you don't need to set them every time.


export HADOOP_PREFIX=/Users/hadoop/hadoop-1.1.2
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar

4. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr

4.1 Install relevant R packages


install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", 
                   "functional", "stringr", "plyr", "reshape2", "dplyr", 
                   "R.methodsS3", "caTools", "Hmisc"))

RHadoop packages depend on the packages above, which should be installed for all users instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in path /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to it, use an administrator account.

The destination library can be set with function install.packages() using argument lib (see an example below), or with RStudio, choose from a drop-down list under “Install to library” in a pop-up window Install Packages.


## find your R libraries
.libPaths()
#"/Users/hadoop/Library/R/3.1/library" 
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

## check which library a package was installed into
system.file(package="functional")
#"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional"

## install package to a specific library
install.packages("functional", lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")

In addition to the packages above, it is also advisable to install data.table. Without it, I came across an error when running an RHadoop job on a big dataset, although the same job worked fine on a smaller dataset. The reason could be that RHadoop uses data.table to handle large data.


install.packages("data.table")

4.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING

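Set environment variables for Hadoop, if you haven't done so at step 3.


Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")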

4.3 Install RHadoop packages

Download packages rhdfs, rhbase, rmr2 and plyrmr from https://github.com/RevolutionAnalytics/RHadoop/wiki and install them. As in step 4.1, these packages need to be installed to a library for all users, instead of to a personal library. Otherwise, you would find R-Hadoop jobs fail on those nodes where packages are not installed in the right library.


install.packages("<path>/rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("<path>/rmr2_2.2.2.tar.gz", repos=NULL, type="source")
install.packages("<path>/plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("<path>/rhbase_1.2.0.tar.gz", repos=NULL, type="source")

4.4 Further information

If you follow above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.

5. Run an R job on Hadoop

Below is an example to count words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop.

First, we copy some text files to HDFS folder wordcount/data.


## copy local text file to hdfs
bin/hadoop fs -copyFromLocal /Users/hadoop/try-hadoop/wordcount/data/*.txt wordcount/data/

After that, we can use R code below to run a Hadoop job for word counting.


Sys.setenv("HADOOP_PREFIX"="/Users/hadoop/hadoop-1.1.2")
Sys.setenv("HADOOP_CMD"="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")

library(rmr2) 

## map function
map <- function(k,lines) {
  words.list <- strsplit(lines, '\\s') 
  words <- unlist(words.list)
  return( keyval(words, 1) )
}

## reduce function
reduce <- function(word, counts) { 
  keyval(word, sum(counts))
}

wordcount <- function (input, output=NULL) { 
  mapreduce(input=input, output=output, input.format="text", 
            map=map, reduce=reduce)
}


## delete previous result if any
system("/Users/hadoop/hadoop-1.1.2/bin/hadoop fs -rmr wordcount/out")

## Submit job
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data') 
hdfs.out <- file.path(hdfs.root, 'out') 
out <- wordcount(hdfs.data, hdfs.out)

## Fetch results from HDFS
results <- from.dfs(out)

## check top 30 frequent words
results.df <- as.data.frame(results, stringsAsFactors=FALSE) 
colnames(results.df) <- c('word', 'count') 
head(results.df[order(results.df$count, decreasing=TRUE), ], 30)

If you can see a list of words and their frequencies, congratulations: you are now ready to do MapReduce work with R!
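To see what such a job computes without a Hadoop cluster, the map and reduce steps can be imitated locally. The sketch below is plain Python, not rmr2: map_words, reduce_counts and word_count are hypothetical names, and the grouping that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle step: group the emitted values by key, as Hadoop would.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(w, c) for w, c in grouped.items())

counts = word_count(["the quick fox", "the lazy dog", "the end"])
# Print the most frequent words first.
for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, n)
```

On a cluster, the map calls run in parallel on different chunks of the input and the framework does the grouping, but the per-word result is the same kind of count.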

Wednesday, 18 March 2015

Working with Ebola Dataset

by Unknown | in R at Wednesday, March 18, 2015

During the first month of my training I was given the task of working on an Ebola dataset. I was asked to plot the dataset on Google Maps. I was on the wrong track for the first 15 days. After that I visited udacity.com, which saved my life and guided me well. I recommend that all new R developers follow this Udacity course:

Data Analysis with R

After that I tried hard to load a Google map in R. After 3-4 days of hard work I got this output, a Google map of Africa.

After getting this, I plotted the number of deaths and the number of suspected cases due to Ebola in different countries. On plot.ly you can showcase your R data. Here is the presentation of the Ebola dataset.

Wednesday, 11 March 2015

How to create a WordCloud in R

by Unknown | in R at Wednesday, March 11, 2015

R, the open source package, has become the de facto standard for statistical computing and anything seriously data-related (note I am avoiding the term ‘big data’ here – oops, too late!). From data mining to predictive analytics to data visualisation, it seems like any self-respecting data professional now uses R. Or at least they pretend to. We all know that most people use Excel when nobody’s watching.

Word clouds come under the text mining process in R. Text mining means we are given a collection of text documents and need to extract trends from the text, e.g. the most used keywords in the documents.

1. Create a Corpus:

To create a word cloud in R, we need to create a corpus of text so that the tm package can process it. A corpus is a collection of documents.

2. Apply Text Mining functions on the Corpus

The tm package has various functions for mining your text, e.g. converting to lowercase, removing punctuation, removing stopwords, etc.

3. Creating a WordCloud:

Now we need to pass the arguments to the wordcloud function and display the result.
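The sketch below illustrates the same kind of preprocessing in Python and produces the word frequencies that a word cloud would draw. It does not use the tm or wordcloud packages: the stopword list is a tiny made-up sample, and word_frequencies is a hypothetical helper.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "a", "and", "of", "in", "to"}

def word_frequencies(documents):
    """Lowercase, strip punctuation, drop stopwords, then count words."""
    counts = Counter()
    strip_punct = str.maketrans("", "", string.punctuation)
    for doc in documents:
        cleaned = doc.lower().translate(strip_punct)
        counts.update(w for w in cleaned.split() if w not in STOPWORDS)
    return counts

corpus = ["R is a language.", "The language of data: R!", "Data, data and more data"]
freq = word_frequencies(corpus)
for word, n in freq.most_common(3):
    print(word, n)
```

In a word cloud, each remaining word would be drawn at a size proportional to its count.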

Sunday, 1 February 2015

How to setup RStudio Server accessible via SSH

by Unknown | in R at Sunday, February 01, 2015

RStudio Server provides a browser-based interface to the RStudio IDE. You can access your R work from anywhere in the world; all you need is a system with an Internet connection. The major advantages of setting up RStudio Server are:

>>Easy sharing of code, data, and other files with colleagues.

>>Allowing multiple users to share access to the more powerful compute resources (memory, processors, etc.) available on a well equipped server.

>>Centralized installation and configuration of R, R packages, TeX, and other supporting libraries.

Downloading and Installation:

You can download RStudio Server from the official website. RStudio Server is available for all major Linux distributions.

RStudio Server Download

Accessing RStudio Server:

RStudio Server listens on port 8787 by default. After installing the server, you can point your browser to the following address:

http://<server-ip>:8787

RStudio Server will prompt for the username and password of the account you want to log in with. You will only be allowed to log in if the account has access to a home directory.

In case you are unable to access the server, you can verify your installation with the following command.

$ sudo rstudio-server verify-installation
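If the installation verifies but the browser still cannot reach the page, it can help to check from the client machine whether the port is reachable at all. The sketch below uses only Python's standard library; port_is_open is a hypothetical helper, and <server-ip> must be replaced with your server's address.

```python
import socket

def port_is_open(host, port=8787, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False

# Example: port_is_open("<server-ip>") after substituting the real address.
```

A False result while the server is running usually points to a firewall or network rule between the client and the server rather than to RStudio Server itself.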
