This post was kindly contributed by SAS Users - go there to comment and to read the full post.
This article is not a tutorial on Hadoop, Spark, or big data. At the same time, no prerequisite knowledge of these technologies is required for understanding. We’ll give you enough background prior to diving into the details. In simplest terms, the Hadoop framework maintains the data and Spark controls and directs data processing. As an analogy, think of Hadoop as a train, big data as the payload, and Spark as the crew driving the train and organizing and distributing the goods.
Big data
I recently read that data volumes are doubling each year. Not that long ago we talked in terms of gigabytes. This quickly turned into terabytes and we’re now in the age of petabytes. The type of data is also changing. Data used to fit neatly into rows and columns. Now, nearly eighty percent of data is unstructured. All these trends and facts have led us to deal with massive amounts of data, aka big data. Maintaining and processing big data required creating technical frameworks. Next, we’ll investigate a couple of these tools.
Hadoop
Hadoop is a technology stack utilizing parallel processing on a distributed filesystem. Hadoop is useful to companies when data sets become so large or complex that their current solutions cannot effectively process the information in a reasonable amount of time. As the data science field has matured over the past few years, so has the need for a different approach to processing data.
Apache Spark
Apache Spark is a cluster-computing framework utilizing both iterative algorithms and interactive/exploratory data analysis. The goal of Spark is to keep the benefits of Hadoop’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce. As a result, this brings to the big data world new applications of data science that were previously too expensive or slow on massive data sets.
Now let’s explore how SAS integrates with these technologies to maximize capturing, managing, and analyzing big data.
SAS capabilities to leverage Spark
SAS provides Hadoop data processing and data scoring capabilities using SAS/ACCESS Interface to Hadoop and In-Database Technologies to Hadoop, with either MapReduce or Spark as the processing framework. This covers traditional batch data management, such as processing huge volumes of extract, transform, load (ETL) data, as well as faster, interactive, in-memory processing for quicker response with Spark.
In SAS Viya, SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. All users with SAS/ACCESS Interface to Hadoop can use the serial data transfer mode it provides. In addition, SAS Data Connect Accelerator to Hadoop can load or save data in parallel between Hadoop and SAS using the SAS Embedded Process, running as a Hive/MapReduce or Spark job.
Connecting to Spark in a Hadoop Cluster
There are two ways to connect to a Hadoop cluster using SAS/ACCESS Interface to Hadoop, based on the SAS environment: LIBNAME and CASLIB statements.
LIBNAME statement to connect to Spark from MVA SAS
options set=SAS_HADOOP_JAR_PATH="/third_party/Hadoop/jars/lib:/third_party/Hadoop/jars/lib/spark";
options set=SAS_HADOOP_CONFIG_PATH="/third_party/Hadoop/conf";

libname hdplib hadoop server="hadoop.server.com" port=10000 user="hive" schema='default' properties="hive.execution.engine=SPARK";
Parameters
SAS_HADOOP_JAR_PATH | Directory path for the Hadoop and Spark JAR files
SAS_HADOOP_CONFIG_PATH | Directory path for the Hadoop cluster configuration files
Libref | The hdplib libref specifies the location where SAS will find the data
SAS/ACCESS engine name | HADOOP, the engine used to connect to the Hadoop cluster
SERVER | Hadoop Hive server to connect to
PORT | Port the Hive server listens on. 10000 is the default, so it is not required; it is included here for clarity
USER and PASSWORD | Not always required
SCHEMA | Hive schema to access. Optional; by default, the connection uses the "default" schema
PROPERTIES | Hadoop properties. Setting the hive.execution.engine property to SPARK enables SAS to use Spark as the execution platform
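With the libref assigned, ordinary SAS steps can read Hive tables with Spark as the execution engine. Here is a minimal sketch, assuming a hypothetical Hive table named gas exists in the default schema:

```sas
/* Count the rows in a Hive table through the Spark-enabled libref */
proc sql;
   select count(*) as row_count
   from hdplib.gas;
quit;
```

Because hive.execution.engine=SPARK is set on the libref, Hive runs the query that this step generates as a Spark job rather than MapReduce.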
CASLIB statement to connect from CAS
caslib splib sessref=mysession datasource=(srctype="hadoop",
   dataTransferMode="auto",
   username="hive",
   server="hadoop.server.com",
   hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark",
   hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
   schema="default",
   platform="spark",
   dfdebug="EPALL",
   properties="hive.execution.engine=SPARK");
Parameters
CASLIB | Name of the CAS library for the specified data access. The splib CAS library points to the Hadoop data source
sessref | Binds the CAS library to a specific CAS session. mysession is the currently active CAS session
SRCTYPE | Type of data source
DATATRANSFERMODE | Type of data movement between CAS and Hadoop. Accepts one of three values: serial, parallel, or auto. When AUTO is specified, CAS chooses the transfer type based on the licenses available on the system. If SAS Data Connect Accelerator to Hadoop is licensed, parallel data transfer is used; otherwise, serial transfer is used
HADOOPJARPATH | Location of the Hadoop and Spark JAR files on the CAS cluster
HADOOPCONFIGDIR | Location of the Hadoop configuration files on the CAS cluster
PLATFORM | Hadoop platform used to execute the job or transfer data with the SAS Embedded Process. The default value is "mapred" for Hive MapReduce. When "spark" is specified, data transfer and job execution run as Spark jobs
DFDEBUG | Writes additional information about SAS Embedded Process data transfers to the SAS log
PROPERTIES | Hadoop properties. Setting the hive.execution.engine property to SPARK enables SAS Viya to use Spark as the execution platform
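Once the CASLIB is assigned, a quick way to verify the connection is to list the Hive tables it exposes. A minimal sketch, using the caslib and session names from the statement above:

```sas
/* List the tables available through the splib caslib */
proc casutil sessref=mysession;
   list files incaslib="splib";
quit;
```

If the connection is healthy, the SAS log shows the Hive tables visible in the default schema of the Hadoop data source.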
Data Access using Spark
SAS Data Connect Accelerator for Hadoop with the Spark platform option uses Hive as the query engine to access Spark data. Data movement happens between Spark and CAS through SAS generated Scala code. This approach is useful when data already exists in Spark and either needs to be used for SAS analytics processing or moved to CAS for massively parallel data and analytics processing.
Loading Data from Hadoop to CAS using Spark
Processing data in CAS offers advanced data preparation, visualization, modeling and model pipelines, and finally model deployment. Model deployment can be performed using available CAS modules or pushed back to Spark if the data is already in Hadoop.
Load data from Hadoop to CAS using Spark
proc casutil incaslib=splib outcaslib=casuser;
   load casdata="gas" casout="gas" replace;
run;
Parameters
PROC CASUTIL | Executes CAS table and file management actions to process data
INCASLIB | Input CAS library to read data from
OUTCASLIB | Output CAS library to write data to
CASDATA | Table to load into the CAS in-memory server
CASOUT | Output CAS table name
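The same load can also be expressed directly as a CAS action, which is useful when scripting in CASL. A minimal sketch, assuming the same splib caslib, mysession session, and gas table as above:

```sas
/* Load a Hive table into CAS with the table.loadTable action */
proc cas;
   session mysession;
   table.loadTable /
      caslib="splib"
      path="gas"
      casOut={caslib="casuser", name="gas", replace=true};
quit;
```

Because the caslib was assigned with platform="spark", the parallel transfer behind this action runs as a Spark job, just as with PROC CASUTIL.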
We can look at the status of the data load job using Hadoop's resource management and job scheduling application, YARN. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
In the figure above, the YARN application executed the data load as a Spark job. This was possible because the CASLIB statement specified the platform="spark" option. The data movement direction, in this case Hadoop to CAS, appears in the Spark job name, "SAS CAS/DC Input," where "Input" indicates data loaded into CAS.
Saving Data from CAS to Hadoop using Spark
You can save data back to Hadoop from CAS at many stages of the analytic life cycle. For example, use data in CAS to prepare, blend, visualize, and model. Once the data meets the business use case, data can be saved in parallel to Hadoop using Spark jobs to share with other parts of the organization.
Using the SAVE CAS action to move data to Hadoop using Spark
proc cas;
   session mysession;
   table.save /
      caslib="splib"
      table={caslib="casuser", name="gas"}
      name="gas.sashdat"
      replace=true;
quit;
Parameters
PROC CAS | Executes CAS action sets and actions to process data. Here "table" is the action set and "save" is the action
TABLE | Location and name of the source table
NAME | Name of the target table saved to the Hadoop library using Spark
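After the save completes, you can confirm from the SAS side that the file landed in the Hadoop data source. A minimal sketch using the table.fileInfo action against the same splib caslib:

```sas
/* Check that gas.sashdat now exists in the Hadoop data source */
proc cas;
   session mysession;
   table.fileInfo / caslib="splib";
quit;
```

The output lists the files visible through the caslib, which should now include the saved gas.sashdat table.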
We can verify the status of saving data from CAS to Hadoop using the YARN application. Data from CAS is saved as a Hadoop table with Spark as the execution platform. Furthermore, because SAS Data Connect Accelerator for Hadoop transfers data in parallel, the individual Spark executors on each Spark executor node handle the data for their specific Hadoop cluster node.
Finally, the SAVE operation executed as a Spark job. As we can see from YARN, the Spark job named "SAS CAS/DC Output" indicates that the data moved from CAS to Hadoop.
Where we are; where we’re going
So far, we have traveled across the Spark pond to set up SAS libraries for Spark and to load and save data between Hadoop and CAS using Spark. In the next section, we'll look at ways to score data and execute SAS code inside Hadoop using Spark.
Data and Analytics Innovation using SAS & Spark – part 1 was published on SAS Users.