PROC-X: an online (unofficial) journal about the SAS(R) software - written by bloggers

SAS employees are now easier to spot on Communities

by Chris hemedinger • January 23, 2015

Since the launch of Communities on SAS, hundreds of SAS employees have been among you. Some SAS employees made themselves known by selecting a telling user name (such as Cynthia@SAS), but others remained camouflaged or incognito, keeping their secret identities like the SAS superheroes they are. That’s about to change. […]

The post SAS employees are now easier to spot on Communities appeared first on The SAS Dummy.

SAS

Spark practice(4): malicious web attack

by Charlie H • January 8, 2015

Suppose there is a website tracking user activities to prevent robotic attack on the Internet. Please design an algorithm to identify user IDs that have more than 500 clicks within any given 10 minutes.

Sample.txt: anonymousUserID timeStamp clickCount
123    9:45am    10
234    9:46am    12
234    9:50am    20
456    9:53am    100
123    9:55am    33
456    9:56am    312
123    10:03am    110
123    10:16am    312
234    10:20am    201
456    10:23am    180
123    10:25am    393
456    10:27am    112
999    12:21pm    888

Thought

This is a typical example of stream processing. The key is to build a fixed-length window to slide through all data, count data within and return the possible malicious IDs.

Single machine solution

Two data structures are used: a queue and a hash table. The queue is scanning the data and only keeps the data within a 10-minute window. Once a new data entry is filled, the old ones out of the window are popped out. The hash table counts the data in the queue and will be updated with the changing queue. Any ID with more than 500 clicks will be added to a set.

from datetime import datetime
import time
from collections import deque

def get_minute(s, fmt = '%I:%M%p'):
return time.mktime(datetime.strptime(s, fmt).timetuple())

def get_diff(s1, s2):
return int(get_minute(s2) - get_minute(s1)) / 60

def find_ids(infile, duration, maxcnt):
    queue, htable, ans = deque(), {}, set()
with open(infile, 'rt') as _infile:
for l in _infile:
            line = l.split()
            line[2] = int(line[2])
            current_id, current_time, current_clk = line
if current_id not in htable:
                htable[current_id] = current_clk
else:
                htable[current_id] += current_clk
            queue.append(line)
while queue and get_diff(queue[0][1], current_time) > duration:
                past_id, _, past_clk = queue.popleft()
                htable[past_id] -= past_clk
if htable[current_id] > maxcnt:
                ans.add(current_id)
return ans

if __name__ == "__main__":
print find_ids('sample.txt', 10, 500)

Cluster solution

The newest Spark (version 1.2.0) starts to support Python streaming. However, the document is still scarce — wait to see if this problem can be done by the new API.

To be continued

SAS

Example 2015.1: Time to refinance?

by ken kleinman • January 5, 2015

In the US, it’s typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. What this means over time is that when the mortgage rates go down, the cost of living in your own house magically decreases–you take a new loan at the lower rate and pay off your old loan with it– then you only have to pay off the new loan at the lower rate. You can find mortgage rate calculators on the web very easily– if you don’t mind their collecting your data and being bombarded with ads if you let their cookies trace you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you’re interested in the remaining principal or the proportion of each payment that’s principal. Here, we just want to check the monthly payment.

R
We’ll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.


mortgage   J   N   M   monthPay   return(monthPay)
}

To compare the monthly costs for a series of loans offered by a local bank, we’ll input the bank’s loans as a data frame. To save typing, we’ll use the rep() function to generate the term of the loan and the points.


offers = data.frame(
  principal = rep(275000, times=9),
  term = rep(c(30,20,15), each=3), 
  points = rep(c(0,1,2), times=3),
  rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

  principal term points  rate
1    275000   30      0 3.875
2    275000   30      1 3.750
3    275000   30      2 3.500
4    275000   20      0 3.625
5    275000   20      1 3.500
6    275000   20      2 3.375
7    275000   15      0 3.000
8    275000   15      1 2.875
9    275000   15      2 2.750

(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it’s easy to add the monthly cost to the data frame:


offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

  principal term points  rate  monthly
1    275000   30      0 3.875 1293.152
2    275000   30      1 3.750 1273.568
3    275000   30      2 3.500 1234.873
4    275000   20      0 3.625 1612.610
5    275000   20      1 3.500 1594.889
6    275000   20      2 3.375 1577.282
7    275000   15      0 3.000 1899.100
8    275000   15      1 2.875 1882.611
9    275000   15      2 2.750 1866.210

In theory, each of these costs are fair, and the borrower should choose based on monthly costs they can afford, as well as whether they see a better value in having money in hand to spend on a better quality of life or to invest it in savings or in paying off their house sooner. Financial professionals often discuss things like the total dollars spent or the total spent on interest vs. principal, as well.

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters– principal, payment, rate, and term. To enter the data in a manner similar to R, we’ll use array statements and do loops.


data t;
principal = 275000; 
array te [3] (30,20,15);
array po [3] (0,1,2); 
array ra [9] (.03875, .0375, .035, .03625, .035, 
              .03375, .03, .02875, .0275);
do i = 1 to 3;
  do j = 1 to 3;
    term = te[i];
 points = po[j];
 rate = ra[ 3 * (i-1) +j];
 monthly = mort(principal,.,rate/12, term*12);
    output;
  end;
end;
run;

proc print noobs data = t; 
var principal term points rate monthly; run;

principal    term    points      rate     monthly

  275000      30        0      0.03875    1293.15
  275000      30        1      0.03750    1273.57
  275000      30        2      0.03500    1234.87
  275000      20        0      0.03625    1612.61
  275000      20        1      0.03500    1594.89
  275000      20        2      0.03375    1577.28
  275000      15        0      0.03000    1899.10
  275000      15        1      0.02875    1882.61
  275000      15        2      0.02750    1866.21

An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission. If you read this on another aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page other than as noted above, the aggregator is violating the terms by which we publish our work.->

SAS

SAS is #1…In Plans to Discontinue Use

by Bob Muenchen • December 30, 2014

I’ve been tracking The Popularity of Data Analysis Software for many years now, and a clear trend is the decline of the market share of the bigger analytics firms, notably SAS and SPSS. Many people have interpreted my comments as implying … Continue reading →

SAS

Spark practice (3): clean and sort Social Security numbers

by Charlie H • December 24, 2014

Sample.txt

Requirements:
1. separate valid SSN and invalid SSN
2. count the number of valid SSN

402-94-7709 
283-90-3049 
124-01-2425 
1231232
088-57-9593 
905-60-3585 
44-82-8341
257581087
327-84-0220
402-94-7709

Thoughts

SSN indexed data is commonly seen and stored in many file systems. The trick to accelerate the speed on Spark is to build a numerical key and use the sortByKey operator. Besides, the accumulator provides a global variable existing across machines in a cluster, which is especially useful for counting data.

Single machine solution

#!/usr/bin/env python
# coding=utf-8
htable = {}
valid_cnt = 0
with open('sample.txt', 'rb') as infile, open('sample_bad.txt', 'wb') as outfile:
for l in infile:
        l = l.strip()
        nums = l.split('-')
        key = -1
if l.isdigit() and len(l) == 9:
            key = int(l)
if len(nums) == 3 and map(len, nums) == [3, 2, 4]:
            key = 1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2])
if key == -1:
            outfile.write(l + 'n')
else:
if key not in htable:
                htable[key] = l
                valid_cnt += 1

with open('sample_sorted.txt', 'wb') as outfile:
for x in sorted(htable):
        outfile.write(htable[x] + 'n')

print valid_cnt

Cluster solution

#!/usr/bin/env python
# coding=utf-8
import pyspark
sc = pyspark.SparkContext()
valid_cnt = sc.accumulator(0)

def is_validSSN(l):
    l = l.strip()
    nums = l.split('-')
    cdn1 = (l.isdigit() and len(l) == 9)
    cdn2 = (len(nums) == 3 and map(len, nums) == [3, 2, 4])
if cdn1 or cdn2:
return True
return False

def set_key(l):
global valid_cnt
    valid_cnt += 1
    l = l.strip()
if len(l) == 9:
return (int(l), l)
    nums = l.split('-')
return (1000000*int(nums[0]) + 10000*int(nums[1]) + int(nums[2]), l)

rdd = sc.textFile('sample.txt')
rdd1 = rdd.filter(lambda x: not is_validSSN(x))

rdd2 = rdd.filter(is_validSSN).distinct() 
    .map(lambda x: set_key(x)) 
    .sortByKey().map(lambda x: x[1])

for x in rdd1.collect():
print 'Invalid SSNt', x

for x in rdd2.collect():
print 'valid SSNt', x

print 'nNumber of valid SSN is {}'.format(valid_cnt)

# Save RDD to file system
rdd1.saveAsTextFile('sample_bad')
rdd2.saveAsTextFile('sample_sorted')
sc.stop()

SAS

SAS BI training session

by CN Sharad • December 22, 2014

My SAS BI training session recordingDirect Source Link

Page 267 of 751

« 1 … 265 266 267 268 269 … 751 »