This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
In the aftermath of a natural disaster, most people want to help by donating supplies, money, etc. And then it becomes a matter of logistics – getting all those donations to the people who need them. We recently had several days of rain and flooding in North Carolina, and I […]
The post Feeding the hungry, after a flood appeared first on SAS Learning Post.
The post Visualize a design matrix appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Most SAS regression procedures support a CLASS statement, which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how they are used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.
It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures.
In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or encodings that result in different design matrices.
Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate-sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN= option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML, where the HEATMAPDISC subroutine creates a discrete heat map.
/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150
                keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol
          Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;

ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

ods graphics / width=500px height=800px;
proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;
read all var _NUM_ into X[c=varNames];
close;
run HeatmapDisc(X) title="GLM Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
QUIT;
Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with k levels is represented by k columns. The black-and-white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0.
The GLM parameterization is called a “singular parameterization” because it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal column is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.
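One way to convince yourself of this redundancy is to check that the three BP columns always sum to 1. The following DATA step is only a sketch: it assumes that the Design data set from the PARAM=GLM step still exists and that its columns are named BP_High, BP_Normal, and BP_Optimal, as described above. Rows that contain missing values are skipped.

```sas
/* Sketch: count the rows in which the three BP dummy columns do not sum to 1.
   Assumes the GLM design is in the Design data set and that the column
   names are BP_High, BP_Normal, and BP_Optimal. */
data _null_;
   set Design end=eof;
   /* skip rows that are missing because an explanatory variable was missing */
   if nmiss(BP_High, BP_Normal, BP_Optimal) = 0 then do;
      if BP_High + BP_Normal + BP_Optimal ne 1 then nBad + 1;
   end;
   if eof then put "Rows where the BP columns do not sum to 1: " nBad;
run;
```

If the columns are redundant in the way described, the log should report a count of zero, which means that any one of the three columns can be computed from the other two.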
There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have k levels are replaced by k–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, except that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.
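For completeness, the heat map for the reference design can be sketched by reusing the same PROC IML statements as for the GLM design. Only the title is changed; the title text is my own choice.

```sas
proc iml;  /* same HEATMAPDISC call as before, applied to the PARAM=REF design */
use Design;
read all var _NUM_ into X[c=varNames];
close;
run HeatmapDisc(X) title="Reference Design Matrix"   /* title is illustrative */
     xvalues=varNames displayoutlines=0 colorramp={"White" "Black"};
quit;
```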
The previous design matrices were binary 0/1 matrices.
The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus the design matrix contains three values: -1, 0, and 1.
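To make the three encodings concrete, here is a small hand-built illustration. It does not use the heart data: it assumes a hypothetical categorical variable with levels A, B, and C, where C is the reference level, and the matrix names are arbitrary. Each row shows how one level is encoded.

```sas
proc iml;
/* Hypothetical variable with levels A, B, C; C is the reference level. */
levels = {"A", "B", "C"};
cols3  = {"A" "B" "C"};
cols2  = {"A" "B"};
glmMat = { 1  0  0,
           0  1  0,
           0  0  1};   /* GLM: k=3 columns, one per level           */
refMat = { 1  0,
           0  1,
           0  0};      /* REFERENCE: k-1 columns; reference is all 0 */
effMat = { 1  0,
           0  1,
          -1 -1};      /* EFFECT: k-1 columns; reference is all -1   */
print glmMat[rowname=levels colname=cols3],
      refMat[rowname=levels colname=cols2],
      effMat[rowname=levels colname=cols2];
quit;
```

The only difference between the reference and effect encodings is the row for the reference level, which is why the effect design contains the extra value -1.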
If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.
ods exclude all;  /* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable')
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

proc iml;  /* use HEATMAPDISC call to create heat map of design */
use Design;
read all var _NUM_ into X[c=varNames];
close;
run HeatmapDisc(X) title="Effect Design Matrix"
     xvalues=varNames displayoutlines=0 colorramp={"LightBlue" "White" "Black"};
QUIT;
In the adjacent graph, blue indicates that the value for the patient was the reference category. White and black indicate that the value for the patient was a nonreference category, and the black rectangle appears in the column that indicates the value of the nonreference category. For me, this design matrix takes some practice to “read.” For example, compared to the GLM matrix, it is harder to determine the most frequent levels for a categorical variable.
In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.
If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:
/* convert from wide (matrix) to long (row, col, value)*/
data Long;
set Design;
array dummy[*] _NUMERIC_;
do varNum = 1 to dim(dummy);
   rowNum = _N_;
   value = dummy[varNum];
   output;
end;
keep varNum rowNum value;
run;

proc sgplot data=Long;
/* the observation values are in the order {1, 0, -1}; use STYLEATTRS to set colors */
styleattrs DATACOLORS=(Black White LightBlue);
heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
xaxis type=discrete; /* values=(1 to 11) valuesdisplay=("A" "B" ... "J" "K"); */
yaxis reverse;
run;
The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.
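An alternative to typing the labels by hand is to carry the variable names along when you convert the data to long form. The following is a sketch of that idea, not a tested replacement: the Long2 data set and the varName variable are names I made up, and I am assuming that a character X variable together with DISCRETEORDER=DATA keeps the columns in their original order.

```sas
/* Sketch: store each column's name so that the axis shows names, not numbers */
data Long2;
   length varName $32;            /* character, so not captured by _NUMERIC_ */
   set Design;
   array dummy[*] _NUMERIC_;
   do varNum = 1 to dim(dummy);
      rowNum  = _N_;
      varName = vname(dummy[varNum]);   /* name of the current column */
      value   = dummy[varNum];
      output;
   end;
   keep varName rowNum value;
run;

proc sgplot data=Long2;
   styleattrs DATACOLORS=(Black White LightBlue);
   heatmapparm x=varName y=rowNum colorgroup=value;
   xaxis discreteorder=data;      /* assumed to preserve the column order */
   yaxis reverse;
run;
```

As in the previous program, the colors are assigned in the order in which the group values appear in the data, so check the legend if your design matrix starts with a different value.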
If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.
In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps.
The visualization is useful for teaching and understanding the different parameterization schemes for classification variables.
If you’re a SAS programmer, you have likely used loops in your SAS code to make life easier from time to time. In this blog post, I demonstrate a few ways you can use loops to do clever things in your graph code. Perhaps even the old dogs can learn […]
The post Using loops in your SAS graph code appeared first on SAS Learning Post.
The post Determining the size of a SAS data set appeared first on SAS Learning Post.
When developing SAS® data sets, program code and/or applications, efficiency is not always given the attention it deserves, particularly in the early phases of development. Since data sizes and system performance can affect a program and/or an application’s behavior, SAS users may want to access information about a data set’s […]
Building cars is towards the top of the manufacturing hierarchy – some countries are even known for the cars they build. If you want a good quality car, you probably think of Japan. If you want a stylish sports car, you probably think of Italy. If you want a diesel […]
The post Will your next car be made in China? appeared first on SAS Learning Post.
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
In a previous post I’ve described a method for configuring Active Directory Authentication for SAS® on Linux (with realmd). One of the packages that’s installed is oddjob-mkhomedir. This package normally handles any requirement for auto-creating home directories for those AD users on Linux. Unfortunately it doesn’t seem to get used by the SAS Object Spawner. … Continue reading “Auto Creation of Linux Home Directories for SAS Users”
The post Visualize an ANOVA with two-way interactions appeared first on The DO Loop.
There are several ways to visualize data in a two-way ANOVA model. Most visualizations show a statistical summary of the response variable for each category. However, for small data sets, it can be useful to overlay the raw data.
This article shows a simple trick that you can use to combine two categorical variables and plot the raw data for the joint levels of the two categorical variables.
Recall that an ANOVA (ANalysis Of VAriance) model is used to understand differences among group means and the variation among and between groups.
The documentation for the ROBUSTREG procedure in SAS/STAT contains an example that compares the traditional ANOVA using PROC GLM with a robust ANOVA that uses PROC ROBUSTREG.
The response variable is the survival time (Time) for 16 mice who were randomly assigned to different combinations of two successive treatments (T1, T2). (Higher times are better.) The data are shown below:
data recover;
input T1 $ T2 $ Time @@;
datalines;
0 0 20.2  0 0 23.9  0 0 21.9  0 0 42.4
1 0 27.2  1 0 34.0  1 0 27.4  1 0 28.5
0 1 25.9  0 1 34.5  0 1 25.1  0 1 34.2
1 1 35.0  1 1 33.9  1 1 38.3  1 1 39.9
;
The response variable depends on the joint levels of the binary variables T1 and T2. A first attempt to visualize the data in SAS might be to create a box plot of the four combinations of T1 and T2. You can do this by assigning T1 to be the “category” variable and T2 to be a “group” variable in a clustered box plot, as follows:
title "Response for Two Groups";
title2 "Use VBOX Statement with Categories and Groups";
proc sgplot data=recover;
   vbox Time / category=T1 group=T2;
run;
The graph shows the distribution of response for the four joint combinations of T1 and T2.
The graph is a little hard to interpret because the category levels are 0/1.
The two box plots on the left are for T1=0, which means “Did not receive the T1 treatment.” The two box plots on the right are for mice who received the T1 treatment.
Within those clusters, the blue boxes indicate the distribution of responses for the mice who did not receive the T2 treatment, whereas the red boxes indicate the response distribution for mice that did receive T2.
Both treatments seem to increase the mean survival time for mice, and receiving both treatments seems to give the highest survival times.
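If you want numbers to go with the visual impression, a quick summary by the joint levels of T1 and T2 is one option. This is just a sketch that uses PROC MEANS on the same data; the choice of statistics is mine.

```sas
/* Sketch: summary statistics of Time for each combination of T1 and T2 */
proc means data=recover n mean median min max maxdec=2;
   class T1 T2;
   var Time;
run;
```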
Interpreting the graph took a little thought. Also, the colors seem somewhat arbitrary. I think the graph could be improved if the category labels indicate the joint levels. In other words, I’d prefer to see a box plot of the levels of the interaction variable T1*T2. If possible, I’d also like to optionally plot the raw response values.
The LOGISTIC and GENMOD procedures in SAS/STAT support the EFFECTPLOT statement. Many other SAS regression procedures support the STORE statement, which enables you to save a regression model and then use the PLM procedure (which supports the EFFECTPLOT statement).
The EFFECTPLOT statement can create a variety of plots for visualizing regression models, including a box plot of the joint levels for two categorical variables, as shown by the following statements:
/* Use the EFFECTPLOT statement in PROC GENMOD, or use the STORE statement and PROC PLM */
proc genmod data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   effectplot box / cluster;
   effectplot interaction / obs(jitter);  /* or use interaction plot to see raw data */
run;
The resulting graph uses box plots to show the schematic distribution of each of the joint levels of the two categorical variables. (The second EFFECTPLOT statement creates an “interaction plot” that shows the raw values and mean responses.) The means of each group are connected, which makes it easier to compare adjacent means. The labels indicate the levels of the T1*T2 interaction variable. I think this graph is an improvement over the previous multi-colored box plot, and I find it easier to read and interpret.
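The STORE and PROC PLM route mentioned in the program comment can be sketched as follows. I use PROC GLM here only as an example of a procedure that supports the STORE statement, and the item-store name RecoverModel is made up.

```sas
/* Sketch: fit the model in PROC GLM, save it, and plot it from PROC PLM */
proc glm data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2;
   store work.RecoverModel;       /* hypothetical item-store name */
quit;

proc plm restore=work.RecoverModel;
   effectplot box / cluster;      /* same box plot request as in PROC GENMOD */
run;
```

The item store contains the fitted model, so this is a convenient pattern when the model takes a long time to fit and you want to try several plots afterward.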
Although the EFFECTPLOT statement makes it easy to create this plot, the EFFECTPLOT statement does not support overlaying raw values on the box plots. (You can, however, see the raw values on the “interaction plot”.) The next section shows an alternative way to create the box plots.
You can explicitly form the interaction variable (T1*T2) by using the CATX function to concatenate the T1 and T2 variables, as shown in the following DATA step view. Because the levels are binary-encoded, the resulting levels are ‘0 0’, ‘0 1’, ‘1 0’, and ‘1 1’. You can define a SAS format to make the joint levels more readable. You can then display the box plots for the interaction variable and, optionally, overlay the raw values:
data recover2 / view=recover2;
length Treatment $3;          /* specify length of concatenated variable */
set recover;
Treatment = catx(' ',T1,T2);  /* combine into one group */
run;

proc format;                  /* make the joint levels more readable */
  value $ TreatFmt '0 0' = 'Control'
                   '1 0' = 'T1 Only'
                   '0 1' = 'T2 Only'
                   '1 1' = 'T1 and T2';
run;

proc sgplot data=recover2 noautolegend;
   format Treatment $TreatFmt.;
   vbox Time / category=Treatment;
   scatter x=Treatment y=Time / jitter markerattrs=(symbol=CircleFilled size=10);
   xaxis discreteorder=data;
run;
By manually concatenating the two categorical variables to form a new interaction variable, you have complete control over the plot. You can also overlay the raw data, as shown. The raw data indicate that the “Control” group seems to contain an outlier: a mouse that lived longer than would be expected for its treatment. Using PROC ROBUSTREG to compute a robust ANOVA is one way to deal with extreme outliers in the ANOVA setting.
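As a starting point for the robust analysis, a minimal PROC ROBUSTREG call for the same two-way model might look like the following. This is a sketch rather than the full example from the ROBUSTREG documentation; the DIAGNOSTICS option requests the table that flags outlying observations.

```sas
/* Sketch: robust two-way ANOVA for the mouse data */
proc robustreg data=recover;
   class T1 T2;
   model Time = T1 T2 T1*T2 / diagnostics;
run;
```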
In summary, the EFFECTPLOT statement enables you to quickly create box plots that show the response distribution for joint levels of two categorical variables. However, sometimes you might want more control, such as the ability to format the labels or overlay the raw data. This article shows how to use the CATX function to manually create a new variable that contains the joint categories.
I had been puzzling over why some SAS® Viya™ services were not starting on a machine reboot. Initially I thought the answer appeared in the SAS Viya 3.2 Administration documentation set: see the General Servers and Services: Troubleshooting section. I found that all the expected services started after: [root@hostname ~]# /etc/init.d/sas-viya-all-services stop [root@hostname ~]# rm … Continue reading “Nudging SAS Viya Services Timeout”
The post Your mapping toolkit tip #6 - Geocoding your addresses appeared first on SAS Learning Post.
You’ve got a database containing the addresses of all your customers … but how can you plot them on a map or analyze them spatially? First, you’ll need to convert the address into a numeric coordinate (latitude & longitude). SAS can do that … with Proc Geocode! But before we […]
The post People using smartphones while driving - the numbers are in! appeared first on SAS Learning Post.
You’re sitting in a line of cars at the intersection, waiting for the light to change – when it finally turns green, the 2nd car just sits there for several seconds until someone honks at them, and then they scoot through the light … but everyone behind them has to […]