Category: SAS

Who paid $500k for a US visa? Over 10,000 people!

Having spent many years in graduate school, and living in the Research Triangle Park (RTP) in North Carolina, I have a lot of friends from other countries. Therefore when I recently saw some stories & graphs about EB-5 visas (where you invest a cool half-million US $ to bypass the […]

The post Who paid $500k for a US visa? Over 10,000 people! appeared first on The SAS Training Post.

6 Tips for Data Visualization from a Floral Designer

You never know where you will find inspiration. This past weekend I attended the NC Museum of Art's Art in Bloom Festival. The idea is that local floral designers use a museum masterpiece to draw inspiration for a floral design. [More: WRAL video about event] It was incredible to see how someone could paint with flowers. Turns out there are many Art in Bloom events like …

Post 6 Tips for Data Visualization from a Floral Designer appeared first on BI Notes for SAS® Users. Go to BI Notes for SAS® Users to subscribe.

When art and analytics collide

The best graphs are both beautiful and informative – a smooth blend of art and analytics. But more often than not, the two collide rather than blending smoothly… Here is a link to an artistic infographic I recently saw posted by Vendavo on Twitter. Their message (80% of your profit is generated […]

The post When art and analytics collide appeared first on The SAS Training Post.

Deploy a minimal Spark cluster

Requirements

Since Spark is rapidly evolving, I need to deploy and maintain a minimal Spark cluster for testing and prototyping. A public cloud is the best fit for my current needs, which come down to three requirements:
  1. Intranet speed
    The cluster should copy data quickly from one server to another, because MapReduce shuffles large chunks of data across HDFS. Ideally the disks are SSDs.
  2. Elasticity and scalability
    Before scaling the cluster out to more machines, the cloud should let me size the existing machines up or down.
  3. Locality of Hadoop
    Most importantly, the Hadoop cluster and the Spark cluster should have a one-to-one mapping, as shown below. The computation and the storage should always be on the same machines.
Hadoop      Cluster Manager   Spark      MapReduce
Name Node   Master            Driver     Job Tracker
Data Node   Slave             Executor   Task Tracker

Choice of public cloud:

I simply compare two cloud service providers: AWS and DigitalOcean. Both have nice Python-based monitoring tools (Boto for AWS and python-digitalocean for DigitalOcean).
  1. From storage to computation
    On AWS, S3 is a good place to keep data and load it into the Spark/EC2 cluster. Alternatively, the Spark cluster on EC2 can read an S3 bucket directly through a path such as s3n://file (the speed is still acceptable). On DigitalOcean, I have to upload data from my local machine to the cluster’s HDFS.
  2. DevOps tools:
      • On AWS, running the deployment tool with its default settings gives you
        • 2 HDFSs: one persistent and one ephemeral
        • Spark 1.3 or any earlier version
        • Spark’s stand-alone cluster manager
      • A minimal cluster with 1 master and 3 slaves consists of 4 m1.xlarge EC2 instances
        • Pros: large memory, with 15 GB on each node
        • Cons: no SSD storage; expensive (cost $0.35 * 6 = $2.1 per hour)
      • On DigitalOcean, running the deployment tool with its default settings gives you
        • HDFS
        • no Spark
        • Mesos
        • OpenVPN
      • A minimal cluster with 1 master and 3 slaves consists of 4 droplets with 2 GB of memory and 2 CPUs each
        • Pros: as low as $0.12 per hour; Mesos provides fine-grained control over the cluster (down to 0.1 CPU and 16 MB of memory); the VPN is a nice extra for security
        • Cons: small memory (2 GB per droplet); Spark has to be installed manually
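
To make the storage point above concrete, here is a minimal sketch of reading a text file from S3 through Spark. The helper names `s3n_uri` and `count_lines`, and the bucket and key in the usage note, are hypothetical. The sketch assumes a SparkContext `sc` already exists (as it does in the pyspark shell) and that AWS credentials are already configured for the s3n filesystem.

```python
def s3n_uri(bucket, key):
    """Build an s3n:// URI of the kind mentioned in the text."""
    return "s3n://{}/{}".format(bucket, key.lstrip("/"))


def count_lines(sc, bucket, key):
    """Read a text object from S3 through a SparkContext and count its lines.

    `sc` is expected to be a pyspark SparkContext; only its textFile()
    method is used here.
    """
    rdd = sc.textFile(s3n_uri(bucket, key))
    return rdd.count()
```

In a pyspark shell this would be called as `count_lines(sc, "my-bucket", "data/input.txt")`, where the bucket and key are placeholders. On DigitalOcean the same `sc.textFile` call would point at an HDFS path instead, after the data has been uploaded.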

Add Spark to DigitalOcean cluster

Tom Faulhaber has a quick bash script for deployment. To install Spark 1.3.0, I rewrote it as a fabfile for Python’s Fabric.
Deploying onto DigitalOcean then takes a single command:
# 10.1.2.3 is the internal IP address of the master
fab -H 10.1.2.3 deploy_spark
The source code above is available on my GitHub.

Euro vs Dollar exchange rate: An historic event?

I recently read a Washington Post article about the euro versus the dollar, and I wanted to analyze the data myself to see whether the article was simply stating the facts, or “sensationalizing” things. The washingtonpost.com article started with the headline, “This is historic: The dollar will soon be worth […]

The post Euro vs Dollar exchange rate: An historic event? appeared first on The SAS Training Post.