Getting Started with Python Integration to SAS® Viya® – Part 9 – Summarize Columns

This post was kindly contributed by SAS Users - go there to comment and to read the full post.

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to filter CAS tables. Now it’s time to focus on how to summarize columns.

Load and explore data

Let’s first load some data. In SAS Viya we can use the Samples caslib for demonstration data. I’ll load the data into the distributed CAS server using my CAS connection object conn, followed by the loadTable action. The loadTable action loads a server-side file into memory. Then I’ll reference the CAS table in the tbl object and preview 5 rows using the SWAT head method. Remember, the data in this example is small for training purposes. Processing data in the CAS server’s massively parallel processing environment is typically reserved for larger data.

conn.loadTable(path = 'WATER_CLUSTER.sashdat', caslib = 'samples',
                            casOut = dict(caslib = 'casuser'))
 
tbl = conn.CASTable('water_cluster', caslib='casuser')
 
tbl.head()

The water_cluster CAS table contains daily water usage for a variety properties. The Daily_W_C_M3 column displays the water used per location in cubic meters (m3) .

Simple summarizations

Next, let’s execute methods for simple summarizations on CAS table columns. If you’ve used Pandas before these techniques will be very familiar.

I’ll start by getting the total water consumption for all the data using the sum method in the SWAT package on the CAS table column Daily_W_C_M3.

(tbl
 .Daily_W_C_M3
 .sum()
)
# and the results
401407.88800000015

Next, I’ll view the maximum and minimum water consumption value in the data using the max and min methods in the SWAT package.

(tbl
 .Daily_W_C_M3
 .max()
)
# and the results
11910.0
 
(tbl
 .Daily_W_C_M3
 .min()
)
# and the results
0.0

Lastly, I’ll find the average water consumption value in the CAS table using the mean method.

(tbl
 .Daily_W_C_M3
 .mean()
)
# and the results
8.591778424657537

Instead of manually specifying each method to view summary information, I could of used the summary CAS action or the describe method from the SWAT package. For more information check out my previous post Descriptive Statistics.

Find the property with the max water consumption

What if I want to find the property or properties with the max water consumption? One technique is to find the max water consumption value, then filter the CAS table for the row (or rows) with that value. Here I will store the max water consumption value in the variable maxWaterConsumption. Then I’ll use the variable in the query method to filter the CAS table. For more information on filtering CAS tables, check out my previous blog post Filtering CAS Tables.

## Store the max water consumption value
maxWaterConsumption = (tbl
                      .Daily_W_C_M3
                      .max()
)
 
## Filter the CAS table for the property with the max water usage
(tbl
 .query(f"Daily_W_C_M3 = {maxWaterConsumption }")
 .head()
)

The results show that the property address 1660 S DAIRY ASHFORD ST had the largest daily water consumption rate.

Find the top 10 daily water consumption values and properties

Lastly, to find the top 10 daily water consumption properties and values I can use the nlargest method in the SWAT package. This method is similar to the Pandas nlargest method. One difference is the SWAT nlargest method does not implement the keep parameter.

I’ll specify the SWAT nlargest method and store the results in the variable df_top10. Then I’ll display the results.

df_top10 = tbl.nlargest(10, 'Daily_W_C_M3')
display(df_top10)

The CAS server will execute the request and return a SASDataFrame to the client. The SASDataFrame shows that a single property contains the top 10 largest water consumption values.

Remember, a SASDataFrame is a subclass of a Pandas DataFrame. As a result, you can work with them as you normally would a Pandas DataFrame. To learn more about working with CAS objects check out my previous post Working with CAS Actions and CASResults Objects.

Summary

The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas techniques to work with your distributed CAS tables. This enables you to utilize the massively parallel processing power of the CAS server in SAS Viya using Python. Here we learned about using familiar Pandas methods in the SWAT package like sum, min, mean, max and nlargest.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® – Part 9 – Summarize Columns was published on SAS Users.