Tag: data mining

Use R in Hadoop by streaming

It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data set.

An aggregation example

The Hadoop version used here is Cloudera’s CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set form a training course by Udacity. Format of each line of the data set is: date, time, store name, item description, cost and method of payment. The six fields are separated by tab. Only two fields, store and cost, are used to aggregate the cost by each store.
A typical MapReduce job contains two R scripts: Mapper.R and reducer.R.
Mapper.R
# Use batch mode under R (don't use the path like /usr/bin/R)  
#! /usr/bin/env Rscript

options(warn=-1)

# We need to input tab-separated file and output tab-separated file

input = file("stdin", "r")
while(length(currentLine = readLines(input, n=1, warn=FALSE)) > 0) {
fields = unlist(strsplit(currentLine, "\t"))
# Make sure the line has six fields
if (length(fields)==6) {
cat(fields[3], fields[5], "\n", sep="\t")
}
}
close(input)
Reducer.R
#! /usr/bin/env Rscript  

options(warn=-1)
salesTotal = 0
oldKey = ""

# Loop around the data by the formats such as key-val pair
input = file("stdin", "r")
while(length(currentLine = readLines(input, n=1, warn=FALSE)) > 0) {
data_mapped = unlist(strsplit(currentLine, "\t"))
if (length(data_mapped) != 2) {
# Something has gone wrong. However, we can do nothing.
continue
}

thisKey = data_mapped[1]
thisSale = as.double(data_mapped[2])

if (!identical(oldKey, "") && !identical(oldKey, thisKey)) {
cat(oldKey, salesTotal, "\n", sep="\t")
oldKey = thisKey
salesTotal = 0
}

oldKey = thisKey
salesTotal = salesTotal + thisSale
}

if (!identical(oldKey, "")) {
cat(oldKey, salesTotal, "\n", sep="\t")
}

close(input)

Testing

Before running MapReduce, it is better to test the codes by some linux commands.
# Make R scripts executable   
chmod w+x mapper.R
chmod w+x reducer.R
ls -l

# Strip out a small file to test
head -500 purchases.txt > test1.txt
cat test1.txt | ./mapper.R | sort | ./reducer.R

Execution

One way is to specify all the paths and therefore start the expected MapReduce job.
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar   
-mapper mapper.R –reducer reducer.R
–file mapper.R –file reducer.R
-input myinput
-output joboutput
Or we can use the alias under CDH4, which saves a lot of typing.
hs mapper.R reducer.R myinput joboutput
Overall, the MapReduce job driven by R is performed smoothly. The Hadoop JobTracker can be used to monitor or diagnose the overall process.

Rhadoop or streaming?

RHadoop is a package developed under Revolution Alytics, which allows the users to apply MapReduce job directly in R and is surely a much more popular way to integrate R and Hadoop. However, this package currently undergoes fast evolution and requires complicated dependency. As an alternative, the functionality of streaming is embedded with Hadoop, and supports all programming languages including R. If the proper installation of RHadoop poses a challenge, then streaming is a good starting point.

A cheat sheet for linear regression validation

The link of the cheat sheet is here.I have benefited a lot from the UCLA SAS tutorial, especially the chapter of regression diagnostics. However, the content on the webpage seems to be outdated. The great thing for PROC REG is that it creates…

Kernel selection in PROC SVM

The support vector machine (SVM) is a flexible classification or regression method by using its many kernels. To apply a SVM, we possibly need to specify a kernel, a regularization parameter c and some kernel parameters like gamma. Besides the selectio…

When ROC fails logistic regression for rare-event data

ROC or AUC is widely used in logistic regression or other classification methods for model comparison and feature selection, which measures the trade-off between sensitivity and specificity. The paper by Gary King warns the dangers using…

A SAS macro that exports data to MongoDB

MongoDB is possibly the most popular NoSQL data store. To bypass schema and constraint, I feel quite convenient to implement MongoDB as buffer to accompany current RDBMS .Also it is straightforward to use MongoDB and other tools (MEAN) to build s…

Regularization adjustment for PROC SVM

SVM is a popular statistical learning method for either classification or regression. For classification, a linear classifier or a hyperplane, such as with w as weight vector and b as the bias, would label data into various categories. The geome…

March Madness, Moneyball, and sports analytics

A big part of  “winning” these days (be it sports or a business) is performing analytics better than your competition.  This is demonstrated in awe-inspiring fashion in the book (and movie) “Moneyball.”  And on that topic, I’d like to show you a f…