SAS employees are now easier to spot on Communities

Since the launch of Communities on SAS, hundreds of SAS employees have been among you. Some SAS employees made themselves known by selecting a telling user name (such as Cynthia@SAS), but others remained camouflaged or incognito, keeping their secret identities like the SAS superheroes they are. That’s about to change. […]

The post SAS employees are now easier to spot on Communities appeared first on The SAS Dummy.

Spark practice(4): malicious web attack

Suppose there is a website tracking user activities to prevent robotic attack on the Internet. Please design an algorithm to identify user IDs that have more than 500 clicks within any given 10 minutes.
Sample.txt: anonymousUserID timeStamp clickCount
123    9:45am    10
234 9:46am 12
234 9:50am 20
456 9:53am 100
123 9:55am 33
456 9:56am 312
123 10:03am 110
123 10:16am 312
234 10:20am 201
456 10:23am 180
123 10:25am 393
456 10:27am 112
999 12:21pm 888

Thought

This is a typical example of stream processing. The key is to build a fixed-length window to slide through all data, count data within and return the possible malicious IDs.

Single machine solution

Two data structures are used: a queue and a hash table. The queue is scanning the data and only keeps the data within a 10-minute window. Once a new data entry is filled, the old ones out of the window are popped out. The hash table counts the data in the queue and will be updated with the changing queue. Any ID with more than 500 clicks will be added to a set.
from datetime import datetime
import time
from collections import deque

def get_minute(s, fmt = '%I:%M%p'):
return time.mktime(datetime.strptime(s, fmt).timetuple())

def get_diff(s1, s2):
return int(get_minute(s2) - get_minute(s1)) / 60

def find_ids(infile, duration, maxcnt):
queue, htable, ans = deque(), {}, set()
with open(infile, 'rt') as _infile:
for l in _infile:
line = l.split()
line[2] = int(line[2])
current_id, current_time, current_clk = line
if current_id not in htable:
htable[current_id] = current_clk
else:
htable[current_id] += current_clk
queue.append(line)
while queue and get_diff(queue[0][1], current_time) > duration:
past_id, _, past_clk = queue.popleft()
htable[past_id] -= past_clk
if htable[current_id] > maxcnt:
ans.add(current_id)
return ans

if __name__ == "__main__":
print find_ids('sample.txt', 10, 500)

Cluster solution

The newest Spark (version 1.2.0) starts to support Python streaming. However, the document is still scarce — wait to see if this problem can be done by the new API.
To be continued

Example 2015.1: Time to refinance?

In the US, it’s typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. What this means over time is that when the mortgage rates go down, the cost of living in your own house magically decreases–you take a new loan at the lower rate and pay off your old loan with it– then you only have to pay off the new loan at the lower rate. You can find mortgage rate calculators on the web very easily– if you don’t mind their collecting your data and being bombarded with ads if you let their cookies trace you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you’re interested in the remaining principal or the proportion of each payment that’s principal. Here, we just want to check the monthly payment.

R
We’ll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.


mortgage J N M monthPay return(monthPay)
}

To compare the monthly costs for a series of loans offered by a local bank, we’ll input the bank’s loans as a data frame. To save typing, we’ll use the rep() function to generate the term of the loan and the points.


offers = data.frame(
principal = rep(275000, times=9),
term = rep(c(30,20,15), each=3),
points = rep(c(0,1,2), times=3),
rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

principal term points rate
1 275000 30 0 3.875
2 275000 30 1 3.750
3 275000 30 2 3.500
4 275000 20 0 3.625
5 275000 20 1 3.500
6 275000 20 2 3.375
7 275000 15 0 3.000
8 275000 15 1 2.875
9 275000 15 2 2.750

(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it’s easy to add the monthly cost to the data frame:


offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

principal term points rate monthly
1 275000 30 0 3.875 1293.152
2 275000 30 1 3.750 1273.568
3 275000 30 2 3.500 1234.873
4 275000 20 0 3.625 1612.610
5 275000 20 1 3.500 1594.889
6 275000 20 2 3.375 1577.282
7 275000 15 0 3.000 1899.100
8 275000 15 1 2.875 1882.611
9 275000 15 2 2.750 1866.210

In theory, each of these costs are fair, and the borrower should choose based on monthly costs they can afford, as well as whether they see a better value in having money in hand to spend on a better quality of life or to invest it in savings or in paying off their house sooner. Financial professionals often discuss things like the total dollars spent or the total spent on interest vs. principal, as well.

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters– principal, payment, rate, and term. To enter the data in a manner similar to R, we’ll use array statements and do loops.


data t;
principal = 275000;
array te [3] (30,20,15);
array po [3] (0,1,2);
array ra [9] (.03875, .0375, .035, .03625, .035,
.03375, .03, .02875, .0275);
do i = 1 to 3;
do j = 1 to 3;
term = te[i];
points = po[j];
rate = ra[ 3 * (i-1) +j];
monthly = mort(principal,.,rate/12, term*12);
output;
end;
end;
run;

proc print noobs data = t;
var principal term points rate monthly; run;

principal term points rate monthly

275000 30 0 0.03875 1293.15
275000 30 1 0.03750 1273.57
275000 30 2 0.03500 1234.87
275000 20 0 0.03625 1612.61
275000 20 1 0.03500 1594.89
275000 20 2 0.03375 1577.28
275000 15 0 0.03000 1899.10
275000 15 1 0.02875 1882.61
275000 15 2 0.02750 1866.21

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission. If you read this on another aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page other than as noted above, the aggregator is violating the terms by which we publish our work.->

Spark practice (3): clean and sort Social Security numbers

Sample.txt
Requirements:
1. separate valid SSN and invalid SSN
2. count the number of valid SSN
402-94-7709 
283-90-3049
124-01-2425
1231232
088-57-9593
905-60-3585
44-82-8341
257581087
327-84-0220
402-94-7709

Thoughts

SSN indexed data is commonly seen and stored in many file systems. The trick to accelerate the speed on Spark is to build a numerical key and use the sortByKey operator. Besides, the accumulator provides a global variable existing across machines in a cluster, which is especially useful for counting data.

Single machine solution

#!/usr/bin/env python
# coding=utf-8
htable = {}
valid_cnt = 0
with open('sample.txt', 'rb') as infile, open('sample_bad.txt', 'wb') as outfile:
for l in infile:
l = l.strip()
nums = l.split('-')
key = -1
if l.isdigit() and len(l) == 9:
key = int(l)
if len(nums) == 3 and map(len, nums) == [3, 2, 4]:
key = 1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2])
if key == -1:
outfile.write(l + 'n')
else:
if key not in htable:
htable[key] = l
valid_cnt += 1

with open('sample_sorted.txt', 'wb') as outfile:
for x in sorted(htable):
outfile.write(htable[x] + 'n')

print valid_cnt

Cluster solution

#!/usr/bin/env python
# coding=utf-8
import pyspark
sc = pyspark.SparkContext()
valid_cnt = sc.accumulator(0)

def is_validSSN(l):
l = l.strip()
nums = l.split('-')
cdn1 = (l.isdigit() and len(l) == 9)
cdn2 = (len(nums) == 3 and map(len, nums) == [3, 2, 4])
if cdn1 or cdn2:
return True
return False

def set_key(l):
global valid_cnt
valid_cnt += 1
l = l.strip()
if len(l) == 9:
return (int(l), l)
nums = l.split('-')
return (1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2]), l)

rdd = sc.textFile('sample.txt')
rdd1 = rdd.filter(lambda x: not is_validSSN(x))

rdd2 = rdd.filter(is_validSSN).distinct()
.map(lambda x: set_key(x))
.sortByKey().map(lambda x: x[1])

for x in rdd1.collect():
print 'Invalid SSNt', x

for x in rdd2.collect():
print 'valid SSNt', x

print 'nNumber of valid SSN is {}'.format(valid_cnt)

# Save RDD to file system
rdd1.saveAsTextFile('sample_bad')
rdd2.saveAsTextFile('sample_sorted')
sc.stop()