The post Truncate response surfaces appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so that only positive responses are displayed.
This article shows how you can truncate a surface or a contour plot so that negative values are not displayed. You could do something similar to truncate unreasonably high values in a surface plot.
Before showing how to truncate the surface plot, let’s figure out why the model predicts negative values when all the observed responses are positive. The following DATA step is a simplified version of the real data. The RSREG procedure uses least squares regression to fit a quadratic response surface. If you use the PLOTS=SURFACE option, the procedure automatically displays a contour plot and surface plot for the predicted response:
data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7  24 78 14
24 76 10  25 63  5  26 62 10  26 94 20
26 63 15  27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;

ods graphics / width=400px height=400px ANTIALIASMAX=10000;

proc rsreg data=Sample plots=surface(fill=pred overlaypairs);
   model Z = Y X;
run;

proc rsreg data=Sample plots=surface(3d fill=Pred gridsize=80);
   model Z = Y X;
   ods select Surface;
   ods output Surface=Surface;   /* use ODS OUTPUT to save surface data to a data set */
run;
The contour plot overlays a scatter plot of the data. You can see that the data are observed only in the upper-right portion of the plot (the red regions) and that no data are in the lower-left portion of the plot.
The RSREG procedure fits a quadratic model to the data. The predicted values near the observed data are all positive. Some of the predicted values that are far from the observed data are negative.
I previously wrote about this phenomenon and showed how to compute the convex hull for these bivariate data. When you evaluate the model inside the convex hull, you are interpolating. When you evaluate the model outside the convex hull, you are extrapolating.
It is well known that polynomial regression models can give nonsensical results if you extrapolate far from the data.
The RSREG procedure is not aware that the response variable should be positive. A quadratic surface will eventually get arbitrarily big in the positive and/or negative directions. You can see this on the contour and surface plots, which show the predictions of the model on a regular grid of (X, Y) values.
If you want to display only the positive portion of the prediction surface, you can replace each negative predicted value with a missing value. The first step is to obtain the predicted values on a regular grid. You can use the “missing value trick” to score the quadratic model on a grid, or you can use the ODS OUTPUT statement to obtain the gridded values that are used in the surface plot. I chose the latter option. In the previous section, I used the ODS OUTPUT statement to write the gridded predicted values for the surface plot to a SAS data set named Surface.
As Warren Kuhfeld points out in his article about processing ODS OUTPUT data sets, the names in an ODS data object can be “long and hard to type.” Therefore, I rename the variables. I also combine the gridded values with the original data so that I can optionally overlay the data and the predicted values.
/* rename vars and set negative responses to missing */
data Surf2;
set Surface(rename=(
       Predicted0_1_0_0 = Pred   /* rename the long ODS names */
       Factor1_0_1_0_0  = GY     /* 'G' for 'gridded' */
       Factor2_0_1_0_0  = GX))
    Sample(in=theData);          /* combine with original data */
if theData then Type = "Data   ";
else            Type = "Gridded";
if Pred < 0 then Pred = .;       /* replace negative predictions with missing values */
label GX = 'X' GY = 'Y';
run;
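The core of that DATA step — replacing out-of-range predictions with missing values so that graphics leave gaps — is language-agnostic. Here is a minimal sketch in Python; the `pred` values are made up for illustration:

```python
import math

# Hypothetical grid of predicted values. Any value below zero is
# replaced by NaN, which most plotting tools render as a gap.
pred = [12.5, 3.1, -4.2, 0.0, -0.5, 27.8]
masked = [p if p >= 0 else math.nan for p in pred]
print(masked)
```

The same idea applies to truncating unreasonably high values: change the condition to `p <= upper_limit`.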
You can use the Graph Template Language (GTL) to generate graphs that are similar to those produced by PROC RSREG.
You can then use PROC SGRENDER to create the graph. Because the negative response values were set to missing, the contour plot displays a missing value color (black, for this ODS style) in the lower-left and upper-right portions of the plot.
Similarly, the missing values cause the surface plot to be truncated. By using the GRIDSIZE= option, you can make the jagged edges small.
Notice that the colors in the graphs are now based on the range [0, 50], whereas previously the colors were based on the range [-60, 50]. I’ve added a continuous legend to the plots so that the range of the response variable is obvious.
I’d like to stress that sometimes “nonsensical values” indicate an inappropriate model. If you notice nonsensical values, you should always ask yourself why the model is predicting those values. You shouldn’t modify the prediction surface without a good reason. But if you do have a good reason, the techniques in this article should help you.
You can download the complete SAS program that analyzes the data and generates the truncated graphs.
This post was kindly contributed by SAS – r4stats.com - go there to comment and to read the full post.
In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I show Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.
Forrester Research, Inc. is another company that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.
Historically, Forrester has conducted its analyses similarly to Gartner’s. That approach compares software that uses a point-and-click style, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports. Figure 3c shows the results of The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018. By “multimodal” they mean controllable by various means such as menus, workflows, wizards, or code. Figure 3d shows the results from The Forrester Wave: Notebook-Based Solutions, Q3, 2018 (notebooks blend programming code and output in the same window). Those are the two most recent Forrester reports on the topic. Forrester plans to cover tools for automated modeling in a separate report. Given that automation is now a widely adopted feature of several of the companies shown in Figure 3c, that seems like an odd approach.
Both plots use the x-axis to display the strength of each company’s strategy, while the y-axis measures the strength of its current offering. Blue shading divides the vendors into Leaders, Strong Performers, Contenders, and Challengers. The size of the circle around each data point indicates the vendor’s “presence” in the marketplace, weighted 70% by vendor size and 30% by ISV and service partners.
In Figure 3c, we see a perspective that is radically different from the latest Gartner plot, 3a (see previous post). Here IBM is considered a leader, instead of a middle-of-the-pack Visionary. SAS and RapidMiner are both considered leaders by Gartner and Forrester.
In the Strong Performers segment, Datawatch and Tibco are nearly tied, while Gartner had them far apart, with Datawatch in very last place. KNIME and SAP are next to each other in this segment, even though Gartner rated them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating from Gartner.
The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. FICO is here too; Gartner did not evaluate it.
Forrester’s Challengers segment contains World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.
The plot of notebook-based vendors shown in Figure 3d is also extremely different from Gartner’s perspective. Here Domino Data Labs is a Leader, while Gartner had them at the extreme other end of its plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.
In the Strong Performers segment are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this segment; neither were reviewed by Gartner. Cloudera is in this segment as well; it was left out by Gartner.
The Contenders segment contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.
The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.
As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
The post Interpolation vs extrapolation: the convex hull of multivariate data appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical prediction, such as trillions of people or a negative number of people! The lesson is that you should be careful when you evaluate a model far beyond the range of the training data.
The same dangers exist for multivariate regression models, but they are emphasized less often. Perhaps the reason is that it is much harder to know when you are extrapolating a multivariate model.
Interpolation occurs when you evaluate the model inside the convex hull of the training data. Anything else is an extrapolation. In particular, you might be extrapolating even if you score the model at a point inside the bounding box of the training data. This differs from the univariate case in which the convex hull equals the bounding box (range) of the data. In general, the convex hull of a set of points is smaller than the bounding box.
You can use a bivariate example to illustrate the difference between the convex hull of the data and the bounding box for the data, which is the rectangle
[X_{min}, X_{max}] × [Y_{min}, Y_{max}].
The following SAS DATA step defines two explanatory variables (X and Y) and one response variable (Z). The SGPLOT procedure shows the distribution of the (X, Y) variables and colors each marker according to the response value:
data Sample;
input X Y Z @@;
datalines;
10 90 22  22 76 13  22 75  7  24 78 14
24 76 10  25 63  5  26 62 10  26 94 20
26 63 15  27 94 16  27 95 14  29 66  7
30 69  8  30 74  8
;

title "Response Values for Bivariate Data";
proc sgplot data=Sample;
   scatter x=x y=y / markerattrs=(size=12 symbol=CircleFilled)
                     colorresponse=Z colormodel=AltThreeColorRamp;
   xaxis grid;
   yaxis grid;
run;
The data are observed in a region that is approximately triangular. No observations are near the lower-left corner of the plot. If you fit a response surface to this data, it is likely that you would visualize the model by using a contour plot or a surface plot on the rectangular domain [10, 30] x [62, 95]. For such a model, predicted values near the lower-left corner are not very reliable because the corner is far from the data.
In general, you should expect less accuracy when you predict the model “outside” the data (for example, (10, 60)) as opposed to points that are “inside” the data (for example, (25, 70)).
This concept is sometimes discussed in courses about the design of experiments. For a nice exposition, see the course notes of Professor Rafi Haftka (2012, p. 49–59) at the University of Florida.
You can use SAS to visualize the convex hull of the bivariate observations. The convex hull is the smallest convex set that contains the observations.
The SAS/IML language supports the CVEXHULL function, which computes the convex hull for a set of planar points.
You can represent the points by using an N x 2 matrix, where each row is a 2-D point.
When you call the CVEXHULL function, you obtain a vector of N integers. The first few integers are positive and represent the rows of the matrix that comprise the convex hull. The absolute values of the negative integers identify the rows that are interior to the convex hull. This is illustrated for the sample data:
proc iml;
use Sample;
read all var {x y} into points;
close;

/* get indices of points in the convex hull in counter-clockwise order */
indices = cvexhull( points );
print (indices`)[L="indices"];   /* positive indices are boundary; negative indices are inside */
The output shows that the observation numbers (indices) that form the convex hull are {1, 6, 7, 12, 13, 14, 11}.
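If SAS/IML is not available, you can cross-check that result with a short implementation of Andrew’s monotone chain algorithm. The sketch below is in Python, not part of the original program, and the helper name `convex_hull` is my own:

```python
# Andrew's monotone chain convex hull -- a pure-Python stand-in for
# the CVEXHULL function in SAS/IML.
def convex_hull(points):
    """Return 0-based indices of hull vertices in counterclockwise order."""
    idx = sorted(range(len(points)), key=lambda i: points[i])
    def cross(o, a, b):
        (ox, oy), (ax, ay), (bx, by) = points[o], points[a], points[b]
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)
    lower, upper = [], []
    for i in idx:                      # build the lower chain left-to-right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], i) <= 0:
            lower.pop()
        lower.append(i)
    for i in reversed(idx):            # build the upper chain right-to-left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], i) <= 0:
            upper.pop()
        upper.append(i)
    return lower[:-1] + upper[:-1]     # concatenation traverses the hull CCW

pts = [(10,90),(22,76),(22,75),(24,78),(24,76),(25,63),(26,62),
       (26,94),(26,63),(27,94),(27,95),(29,66),(30,69),(30,74)]
hull = convex_hull(pts)
# adding 1 recovers the observation numbers reported by CVEXHULL
print(sorted(i + 1 for i in hull))
```

The sorted observation numbers agree with the CVEXHULL output above.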
The other observations are in the interior. You can visualize the interior and boundary points by forming a binary indicator vector that has the value 1 for points on the boundary and 0 for points in the interior.
To get the indicator vector in the order of the data, you need to use the SORTNDX subroutine to compute the anti-rank of the indices, as follows:
b = (indices > 0);               /* binary indicator variable for sorted vertices */
call sortndx(ndx, abs(indices)); /* get anti-rank, which is the sort index that "reverses" the order */
onBoundary = b[ndx];             /* binary indicator data in original order */

title "Convex Hull of Bivariate Data";
call scatter(points[,1], points[,2]) group=onBoundary
     option="markerattrs=(size=12 symbol=CircleFilled)";
The blue points are the boundary of the convex hull whereas the red points are in the interior.
You can visualize the convex hull by forming the polygon that connects the first, sixth, seventh, …, eleventh observations.
You can do this manually by using the POLYGON statement in PROC SGPLOT, which I show in the Appendix section. However, there is an easier way to visualize the convex hull. I previously wrote about SAS/IML packages and showed how to install the polygon package. The polygon package contains a module called PolyDraw, which enables you to draw polygons and overlay a scatter plot.
The following SAS/IML statements extract the positive indices and use them to get the points on the boundary of the convex hull. If the polygon package is installed, you can load the polygon package and visualize the convex hull and data:
hullNdx = indices[loc(b)];       /* get positive indices */
convexHull = points[hullNdx, ];  /* extract the convex hull, in CC order */

/* In SAS/IML 14.1, you can use the polygon package to visualize the convex hull:
   https://blogs.sas.com/content/iml/2016/04/27/packages-share-sas-iml-programs-html */
package load polygon;            /* assumes package is installed */
run PolyDraw(convexHull, points||onBoundary) grid={x y}
    markerattrs="size=12 symbol=CircleFilled";
The graph shows the convex hull of the data. You can see that it primarily occupies the upper-right portion of the rectangle. The convex hull shows the interpolation region for regression models. If you evaluate a model outside the convex hull, you are extrapolating. In particular, even though points in the lower left corner of the plot are within the bounding box of the data, they are far from the data.
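You can make the interpolation-versus-extrapolation decision programmatic. The following Python sketch (not part of the original program) hard-codes the hull vertices found by CVEXHULL, in counterclockwise order, and tests whether a scoring location lies inside the hull; an interior point sits on the left of every directed edge:

```python
# Hull vertices of the sample data, in counterclockwise order
hull = [(10,90), (25,63), (26,62), (29,66), (30,69), (30,74), (27,95)]

def inside_convex(poly, q):
    """True if q is inside (or on the boundary of) a CCW convex polygon."""
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        # cross product: q must not be strictly right of any directed edge
        if (x2 - x1) * (q[1] - y1) - (y2 - y1) * (q[0] - x1) < 0:
            return False
    return True

print(inside_convex(hull, (25, 70)))   # inside the data: interpolation
print(inside_convex(hull, (10, 60)))   # outside the data: extrapolation
```

Both test points are inside the bounding box of the data, yet only one is inside the convex hull.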
Of course, if you have 5, 10 or 100 explanatory variables, you will not be able to visualize the convex hull of the data. Nevertheless, the same lesson applies. Namely, when you evaluate the model inside the bounding box of the data, you might be extrapolating rather than interpolating. Just as in the univariate case, the model might predict nonsensical data when you extrapolate far from the data.
Packages are supported in SAS/IML 14.1. If you are running an earlier version of SAS, you create the same graph by writing the polygon data and the binary indicator variable to a SAS data set, as follows:
hullNdx = indices[loc(b)];       /* get positive indices */
convexHull = points[hullNdx, ];  /* extract the convex hull, in CC order */

/* Write the data and polygon to SAS data sets.
   Use the POLYGON statement in PROC SGPLOT. */
p = points || onBoundary;
poly = j(nrow(convexHull), 1, 1) || convexHull;
create TheData from p[colname={x y "onBoundary"}];
append from p;
close;
create Hull from poly[colname={ID cX cY}];
append from poly;
close;
quit;

data All;                        /* combine the data and convex hull polygon */
set TheData Hull;
run;

proc sgplot data=All noautolegend;
   polygon x=cX y=cY ID=id / fill;
   scatter x=x y=y / group=onBoundary markerattrs=(size=12 symbol=CircleFilled);
   xaxis grid;
   yaxis grid;
run;
The resulting graph is similar to the one produced by the PolyDraw modules and is not shown.
How to view or create ODS output without causing SAS® to stop responding or run slowly was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
SAS makes it easy for you to create a large amount of procedure output with very few statements. However, when you create a large amount of procedure output with the Output Delivery System (ODS), your SAS session might stop responding or run slowly. In some cases, SAS generates a “Not Responding” message. Beginning with SAS® 9.3, the SAS windowing environment creates HTML output by default and enables ODS Graphics by default. If your code creates a large amount of either HTML output or ODS Graphics output, you can experience performance issues in SAS. This blog article discusses how to work around this issue.
By default, the SAS windowing environment with SAS 9.3 and SAS® 9.4 creates procedure output in HTML format and displays that HTML output in the Results Viewer window. However, when a large amount of HTML output is displayed in the Results Viewer window, performance might suffer. To display HTML output in the Results Viewer window, SAS uses an embedded version of Internet Explorer within the SAS environment. And because Internet Explorer does not process large amounts of HTML output well, it can slow down your results.
If you do not need to create HTML output, you can display procedure output in the Output window instead. To do so, add the following statements to the top of your code before the procedure step:
ods _all_ close;
ods listing;
The Output window can show results faster than HTML output that is displayed in the Results Viewer window.
If you want to enable the Output window via the SAS windowing environment, take these steps:
A large amount of output in the Output window, which typically does not cause a performance issue, might still generate an “Output window is full” message. In that case, you can route your LISTING output to a disk file. Use either the PRINTTO procedure or the ODS LISTING statement with the FILE= option. Here is an example:
ods _all_ close;
ods listing file="sasoutput.lst";
Beginning with SAS 9.3, the SAS windowing environment enables ODS Graphics by default. Therefore, most SAS/STAT® procedures now create graphics output automatically. Naturally, graphics output can take longer to create than regular text output. If you are running a SAS/STAT procedure but you do not need to create graphics output, add the following statement to the code before the procedure step:
ods graphics off;
If you want to set this option via the SAS windowing environment, take these steps:
For maximum efficiency, you can combine the ODS GRAPHICS OFF statement with the statements listed in the previous section, as shown here:
ods _all_ close;
ods listing;
ods graphics off;
You can ask SAS to write ODS output to disk but not to create output in the Results Viewer window. To do so, add the following statement to your code before your procedure step:
ods results off;
Later in your SAS session, if you decide that you want to see output in the Results Viewer window, submit this statement:
ods results on;
If you want to disable the Results Viewer window via the SAS windowing environment, take these steps:
The ODS RESULTS OFF statement is a valuable debugging tool because it enables you to write ODS output to disk without viewing it in the Results Viewer window. You can then check the size of the ODS output file on disk before you open it.
In certain situations, you might use multiple procedure steps to send output to ODS. However, if you want to exclude certain procedure output from being written to ODS, use the following statement:
ods exclude all;
Ensure that you place the statement right before the procedure step that contains the output that you want to suppress.
If necessary, use the following statement when you want to resume sending subsequent procedure output to ODS:
ods exclude none;
Five reasons to use ODS EXCLUDE to suppress SAS output discusses the ODS EXCLUDE statement in more detail.
Certain web browsers display large HTML files better than others. When you use SAS to create large HTML files, you might try using a web browser such as Chrome, Firefox, or Edge instead of Internet Explorer. However, even browsers such as Chrome, Firefox, and Edge might run slowly when processing a very large HTML file.
Instead, as a substitute for HTML, you might consider creating PDF output (with the ODS PDF destination) or RTF output (with the ODS RTF destination). However, if you end up creating a very large PDF or RTF file, then Adobe (for PDF output) and Microsoft Word (for RTF output) might also experience performance issues.
The information in this blog mainly pertains to the SAS windowing environment. For information about how to resolve ODS issues in SAS® Enterprise Guide®, refer to Take control of ODS results in SAS Enterprise Guide.
This post was kindly contributed by SAS & Statistics - go there to comment and to read the full post.
data km;
seed=12345;
do loc=1 to 2;
do time=2 to 22 by 1+int(4*ranuni(seed));
status=int(2*ranuni(seed));
output;
end;
end;
run;
proc lifetest data=km plots=s outsurv=os;
ods select survivalplot;
time time*status(1);
strata loc;
run;
data os;
retain survhold 1;
set os;
if _censor_ = 1 then SURVIVAL= survhold;
else survhold= SURVIVAL;
run;
proc sgplot data=os;
step x=time y=Survival / name="survival" legendlabel="Survival" group=stratum;
band x=time lower=0 upper=survival / modelname="survival" transparency=.5;
run;
The post The value of pi depends on how you measure distance appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
It’s time to celebrate Pi Day! Every year on March 14th (written 3/14 in the US), math-loving folks celebrate “all things pi-related” because 3.14 is the three-decimal approximation to the mathematical constant, π.
Although children learn that pi is approximately 3.14159…, the actual definition of π is the ratio of a circle’s circumference to its diameter. Equivalently, it is the distance around half of the unit circle. (The unit circle has a unit radius, so its diameter is 2.)
The value for pi, therefore, depends on the definition of a circle.
But we all know what a circle looks like, don’t we? How can there be more than one circle?
A circle is defined as the locus of points in the plane that are a given distance from a given point. This definition depends on the definition of a “distance,” and it turns out that there are infinitely many ways to measure the distance between two points in the plane. The Euclidean distance between two points is the most familiar distance, but there are other definitions. For two points a = (x1, y1) and b = (x2, y2), you can define the “L^{p} distance” between a and b by the formula
D_{p} = ( |x1 – x2|^{p} + |y1 – y2|^{p} )^{1/p}
This formula defines a distance metric for every value of p ≥ 1.
If you set p=2 in the formula, you get the usual L^{2} (Euclidean) distance. If you set p=1, you get the L^{1} metric, which is known as the “taxicab” or “city block” distance.
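As a quick numeric check — a sketch in Python rather than SAS, with a helper name of my own choosing — the point (3, 4) is 7 units from the origin in the taxicab metric but only 5 units away in the Euclidean metric:

```python
def lp_distance(a, b, p):
    """L^p distance between two points in the plane, for p >= 1."""
    return (abs(a[0] - b[0])**p + abs(a[1] - b[1])**p) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(lp_distance(a, b, 1))   # 7.0 : "city block" distance
print(lp_distance(a, b, 2))   # 5.0 : Euclidean distance
```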
You might think that the Euclidean distance is the only relevant distance, but it turns out that some of these other distances have practical applications in statistics, machine learning, linear algebra, and many fields of applied mathematics. For example, the 2-norm (L^{2}) distance is used in least-squares regression whereas the 1-norm (L^{1}) distance is used in robust regression and quantile regression. A combination of the two distances is used for ridge regression, LASSO regression, and “elastic net” regression.
Here’s the connection to pi: If you can define infinitely many distance formulas, then there are infinitely many unit circles, one for each value of p ≥ 1. And if there are infinitely many circles, there might be infinitely many values of pi. (Spoiler alert: There are!)
You can easily solve for y as a function of x and draw the unit circle for a representative set of values for p. The following graph was generated by a SAS DATA step and PROC SGPLOT.
You can download the SAS program that generates the graphs in this article.
The L^{1} unit circle is a diamond (the top half is shown), the L^{2} unit circle is the familiar round shape, and as p gets large the unit circle for the L^{p} distance approaches the boundary of the square defined by the four points (±1, ±1).
For more information about L^{p} circles and metrics, see the Wikipedia article “Lp Space: The p-norm in finite dimensions.”
Here comes the surprise:
Just as each L^{p} metric has its own unit circle, each metric has its own numerical value for pi, which is the length of the unit semicircle as measured by that metric.
So far, we’ve only used geometry, but it’s time to use a little calculus. This presentation is based on Keller and Vakil (2009, p. 931–935), who give more details about the formulas in this section.
For a curve that is represented as a graph (y as a function of x), you can obtain the length of the curve by integrating the arclength. In Calculus 2, the arclength formula is derived for Euclidean distance, but it is straightforward to give the formula for the L^{p} distance:
s(p) = ∫ (1 + |dy/dx|^{p})^{1/p} dx
To obtain a value for pi in the L^{p} metric, you can integrate the arclength for the upper half of the L^{p} unit circle. Equivalently, by symmetry, you can integrate one-eighth of the unit circle and multiply by 4. A convenient choice for the limits of integration is [0, 2^{-1/p}] because 2^{-1/p} is the x value where the 45-degree line intersects the unit circle for the L^{p} metric.
Substituting for the derivative gives the following formula (Keller and Vakil, 2009, p. 932):
π(p) = 4 ∫ (1 + u(x))^{1/p} dx, where u(x) = |x^{-p} – 1|^{1-p} and the interval of integration is [0, 2^{-1/p}].
For each value of p, you get a different value for pi.
You can use your favorite numerical integration routine to approximate π(p) by integrating the formula for various values of p ≥ 1. I used SAS/IML, which supports the QUAD function for numerical integration. The arclength computation for a variety of values for p is summarized by the following graph. The graph shows the computation of π(p), which is the length of the semicircle in the L^{p} metric, versus values of p for p in [1, 11].
The graph shows that the L^{1} value for pi is 4. The value decreases rapidly as p approaches 2 and reaches a minimum value when p=2 and the value of pi is 3.14159…. For p > 2, the graph of π(p) increases slowly. You can show that π(p) asymptotically approaches the value 4 as p approaches infinity.
On Pi Day, some places have contests to see who can recite the most digits of pi. I encourage you to enter the contest and say “Pi, in the L^{1} metric, is FOUR point zero, zero, zero, zero, ….” If they refuse to give you the prize, tell them to read this article! 😉
On the one hand, this article shows that there is nothing special about the value 3.14159….
For an L^{p} metric, the ratio of the circumference of a circle to its diameter can be any value between π and 4.
On the other hand, the graph shows that π is the unique minimizer of the graph. Among an infinitude of circles and metrics, the well-known Euclidean distance is the only L^{p} metric for which pi is 3.14159….
If you ask me, our value of π is special, without a doubt!
Download the SAS program that creates the graphs in this article.
The post How to detect SAS data sets that contain (or do not contain) character variables appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
A SAS programmer posted an interesting question on a SAS discussion forum. The programmer wanted to iterate over hundreds of SAS data sets, read in all the character variables, and then do some analysis. However, not every data set contains character variables, and SAS complains when you ask it to read the character variables in a data set that contains only numeric variables.
The programmer wanted to use PROC IML to solve the problem, but the issue also occurs in the SAS DATA step. The following program creates three data sets. Two of them (AllChar and Mixed) contain at least one character variable. The third data set (AllNum) does not contain any character variables. For the third data set, an error occurs if you try to use the KEEP=_CHARACTER_ data set option, as shown in the following example:
data AllNum;
   x=1; y=2; z=3;
run;
data AllChar;
   A='ABC'; B='XYZW';
run;
data Mixed;
   name='Joe'; sex='M'; Height=1.8; Weight=81; treatment='Placebo';
run;

/* try to use KEEP=_CHARACTER_ to keep only the character variables */
data KeepTheChar;
   set AllNum(keep=_CHARACTER_);   /* ERROR when no character variables in the data set */
run;
ERROR: The variable _CHARACTER_ in the DROP, KEEP, or RENAME list has never been referenced.
The same problem occurs in PROC IML if you try to read character variables when none exist:
proc iml;
use AllNum;
read all var _CHAR_ into X;   /* ERROR when no character variables in the data set */
close;
ERROR: No character variables in the data set.
There are at least two ways to handle this situation:
Of course, the same ideas apply if you want to read only numeric variables and you encounter a data set that does not contain any numeric variables.
If you have ever been to a SAS conference, you know that DICTIONARY tables are a favorite topic for SAS programmers. DICTIONARY tables are read-only tables that provide information about the state of the SAS session, including libraries, data sets, variables, and system options. You can access them directly by using PROC SQL.
If you want to access the information in the DATA step or in other procedures (like PROC IML), you can use special data views in SASHELP. In particular, the Sashelp.VColumn view provides information about the variables in SAS data sets and is often used to find data sets that contain certain variable names.
(See the references at the end of this article for more information about DICTIONARY tables.)
The following SAS/IML program uses the Sashelp.VColumn view to find out which data sets contain at least one character variable:
proc iml;
/* Solution 1: Use dictionary table Sashelp.VColumn */
/* Find data sets in WORK that have AT LEAST ONE character variable */
use sashelp.vcolumn(where=(libname="WORK" & memtype='DATA' & type='char'));  /* read only CHAR variables */
read all var {memname name};    /* memname=data set name; name=name of character variable */
close;

/* loop over data sets. If a set contains at least one character variable, process it */
dsName = {'AllNum' 'AllChar' 'Mixed'};           /* names of potential data sets */
do i = 1 to ncol(dsName);
   idx = loc(memname = upcase(dsName[i]));       /* is data set on the has-character-variable list? */
   /* for demo, print whether data set has character variables */
   msg = "The data set " + (dsName[i]) + " contains " + char(ncol(idx)) + " character variables.";
   print msg;
   if ncol(idx)>0 then do;                       /* the data set exists and has character vars */
      charVars = name[idx];                      /* get the names of the character vars */
      use (dsName[i]);                           /* open the data set for reading */
      read all var charVars into X;              /* read character variables (always succeeds) */
      close;
      /* process the data */
   end;
end;
The output shows that you can use the DICTIONARY tables to determine which data sets have at least one character variable. You can then use the USE/READ statements in PROC IML to read the character variables and process the data however you wish.
As mentioned previously, this technique can also be used in PROC SQL and the DATA step.
The previous section is very efficient because only character variables are ever read into SAS/IML matrices. However, there might be situations when you want to process character variables (if they exist) and then later process numerical variables (if they exist). Although a SAS/IML matrix contains only one data type (either all numeric or all character), you can read mixed-type data into a SAS/IML table, which supports both numeric and character variables. You can then use the TableIsVarNumeric function to generate a binary indicator variable that tells you which variables in the data are numeric and which are character, as follows:
/* Solution 2: Read all data into a table. Use the TableIsVarNumeric function
   to determine which variables are numeric and which are character. */
dsName = {'AllNum' 'AllChar' 'Mixed'};              /* names of potential data sets */
do i = 1 to ncol(dsName);                           /* for each data set... */
   T = TableCreateFromDataset("WORK", dsName[i]);   /* read all variables into a table */
   numerInd = TableIsVarNumeric(T);                 /* binary indicator vector for numeric vars */
   charInd = ^numerInd;                             /* binary indicator vector for character vars */
   numCharVars = sum(charInd);                      /* count of character variables in this data set */
   msg = "The data set " + (dsName[i]) + " contains " + char(numCharVars) + " character variables.";
   print msg;
   if numCharVars > 0 then do;
      X = TableGetVarData(T, loc(charInd));         /* extract the character variables into X */
      /* process the data */
   end;
   /* optionally process the numeric data */
   numNumerVars = sum(numerInd);                    /* count of numeric variables in this data set */
   /* etc */
end;
The output is identical to the output in the previous section.
In summary, this article discusses a programmer who wants to iterate over many SAS data sets and process only character variables. However, some of the data sets do not have any character variables! This article shows two methods for dealing with this situation: DICTIONARY tables (available through Sashelp views) or SAS/IML tables. The first method is also available in Base SAS.
Of course, you can also use this trick to read all numeric variables when some of the data sets might not have any numeric variable. I’ve previously written about how to read all numeric variables into a SAS/IML matrix by using the _ALL_ keyword. If the data set contains both numeric and character variables, then only the numeric variables are read.
The following resources provide more information about DICTIONARY tables in SAS:
Getting Started with SAS Containers was published on SAS Users.
As of December 2018, any customer with a valid SAS Viya order is able to package and deploy their SAS Viya software in Docker containers. SAS has provided a fully documented and supported project (or “recipe”) for easily building these containers. So how can you start? You can simply stop reading this article and go directly to the GitHub repository and follow the instructions there. Otherwise, in this article, Jeff Owens, a solutions architect at SAS, provides a little color commentary around the process in case it is helpful…
Well, at its core, remember that SAS and its massively parallel, in-memory counterpart, Cloud Analytic Services (CAS), form a powerful runtime for data processing and analytics. A runtime is simply an engine responsible for processing and executing a particular type of code (i.e., SAS code). Traditionally, the SAS runtime lives on a centralized server somewhere, and users submit their "jobs" to that SAS runtime (server) in a variety of ways. The SAS server supports a number of different products, tasks, etc., but for this discussion let's just focus on the scenario where a job is a ".sas" file, perhaps developed in an IDE like SAS Enterprise Guide or SAS Studio, and submitted to the SAS runtime engine via the IDE itself, a bash shell, or maybe even SAS' enterprise-grade scheduler and job management solution, SAS Grid. In these cases, the SAS and CAS servers run on dedicated, always-on physical servers.
The brave new containerized world in which we live provides us a new deployment model: submit the job and create the runtime server at the same time. Plus, consume only the exact resources from the host machine or the Kubernetes cluster that the specific job requires. And when the job finishes, release those resources for others to use. Kubernetes and PaaS clusters are quite likely shared environments, and one of the major themes in the rise of containers is the further abstraction between hardware and software. Some of that may be easier said than done, particularly for customers with very large volumes of jobs to manage, but it is indeed possible today with SAS Viya on Docker/Kubernetes.
Another effective (and more immediate) use of this containerized version of SAS Viya is simply an ad hoc, on-demand, temporary development environment. The container package includes SAS Studio, so one can quickly spin up a full SAS Viya programming sandbox (SAS Studio as well as the SAS and CAS runtimes). Here they can develop and test SAS code, and just as quickly tear the environment down when it is no longer needed. This is useful for users that: (a) don't have access to an "always-on" environment for whatever reason, (b) want to try out experimental code that could potentially consume resources from a shared "always-on" SAS environment, and/or (c) have a Kubernetes cluster with many more resources available than their always-on environment and want to try a BIG job.
Yes, it is possible to deploy the entire SAS Viya stack (microservices and all) via Kubernetes but that discussion is for another day. This post focuses strictly on the SAS Viya programming components and running on a single machine Docker host rather than a Kubernetes cluster.
I will begin here with a fresh single-machine RHEL 7.5 server running on OpenStack. But this machine could be running on any cloud or VM platform, and I could use any (modern enough) flavor of Linux thanks to how Docker works. My machine here has 8 CPUs, 16 GB RAM, and a 50 GB root volume. Less or more is fine. A couple of notes to help understand how to configure an instance:
The first step is to install Docker.
Following along with sas-container-recipes now, the first thing I should do is mirror the repo for my order. Note, this is not a required step – you could build this container directly from SAS repos if you wanted, but we’ll mirror as a best practice. We could simply mirror and serve it over the local filesystem of our build host, but since I promised color I’ll serve it over the web instead. So, these commands run on a separate RHEL server. If you choose to mirror on your build host, make sure you have the disk space (~30GB should be plenty). You will also need your SAS_Viya_deployment_data.zip file available on the SAS Customer Support site. Run the following code to execute the setup.
$ wget https://support.sas.com/installation/viya/34/sas-mirror-manager/lax/mirrormgr-linux.tgz
$ tar xf mirrormgr-linux.tgz
$ rm -f mirrormgr-linux.tgz
$ mkdir -p /repos/viyactr
$ mirrormgr mirror --deployment-data SAS_Viya_deployment_data.zip --path /repos/viyactr --platform x64-redhat-linux-6 --latest
$ yum install httpd -y
$ systemctl start httpd
$ systemctl enable httpd
$ ln -s /repos/viyactr /var/www/html/sas_repo
Next, I go ahead and clone the sas-container-recipes repo locally, upload my SAS_Viya_deployment_data.zip file, and I am ready to run the build command. As a bonus, I am also going to use my site's (SAS') sssd.conf file so my container will use our corporate Active Directory for authentication. If you do not need or want that integration you can skip the "vi addons/sssd.conf" line and change the "--addons" option to "addons/auth-demo" so your container seeds with a single "sasdemo:sasdemo" user:password instead.
$ # upload SAS_Viya_deployment_data.zip to this machine somehow
$ git clone https://github.com/sassoftware/sas-container-recipes.git
$ cd sas-container-recipes/
$ vi addons/sssd.conf    # <- paste in your site's sssd.conf file
$ ./build.sh \
    --type single \
    --zip ~/SAS_Viya_deployment_data.zip \
    --mirror-url http://jo.openstack.sas.com/sas_repo \
    --addons "addons/auth-sssd"
The build should take about 45 minutes and produce a single container image for you (there might be a few images, but it is just one with a thin layer or two on top). You might want to give this image a new name (docker tag) and push it into your own private registry (docker push). Aside from that, we are ready to run it.
If you are curious, look in the addons directory for the other optional layers you can add to your container. Several tools are available for easily configuring connections to external databases.
Here is the run command we can use to launch the container. Note the image name I use here is “sas-viya-programming:xxxxxx” – this is the image that has my sssd layer built on top of it.
$ docker run \
    --detach \
    --rm \
    --env CASENV_CAS_VIRTUAL_HOST=$(hostname -f) \
    --env CASENV_CAS_VIRTUAL_PORT=8081 \
    --publish 5570:5570 \
    --publish 8081:80 \
    --name sas-viya-programming \
    --hostname sas-viya-programming \
    sas-viya-programming:xxxxxx
And now, in a web browser, I can go to <hostname>:8081/SASStudio and I will end up in SAS Studio, where I can sign in with my internal SAS credentials. To stop the container, use the name you gave it: "docker stop sas-viya-programming". Because we used the "--rm" flag, the container will be removed (completely destroyed) when we stop it.
Note we are explicitly mapping in the HTTP port (8081:80) so we easily know how to get to SAS Studio. If you want to start up another container here on the same host, you will need to use a different port or else you’ll get an address already in use error. Also note we might be interested in connecting directly to this CAS server from something other than SAS Studio (localhost). A remote python client for example. We can use the other port we mapped in (5570:5570) to connect to the CAS server.
Running this container with the above command means anything and everything done inside the container (configuration changes, code, data) will not persist if the container stops and a new one is started later. Luckily this is a very standard and easy-to-solve scenario with Docker and Kubernetes. Here are a couple of targets inside the container you might be interested in mounting a volume to: /data for data sets, /code for programs you want to keep, and /tmp for temporary scratch files (these are the mount points used in the updated run command below).
Here is what an updated docker run command might look like with these volumes included:
$ # note: /nfsdata:/nfsdata below is bind-mount syntax; the others are named Docker volumes
$ docker run \
    --detach \
    --rm \
    --env CASENV_CAS_VIRTUAL_HOST=$(hostname -f) \
    --env CASENV_CAS_VIRTUAL_PORT=8081 \
    --volume mydata:/data \
    --volume /nfsdata:/nfsdata \
    --volume mycode:/code \
    --volume sastmp:/tmp \
    --publish 5570:5570 \
    --publish 8081:80 \
    --name sas-viya-programming \
    --hostname sas-viya-programming \
    sas-viya-programming:xxxxxx
Yes. You would just need to install Docker on your laptop (go to docker.com for that). You can certainly follow the instructions from the top to build and run locally. You can even push this container image out to an internal registry so other users could skip the build and just run.
So far, we have only talked about the “ad-hoc” or “sandbox” dev type of use case for this container. A later article may cover how to run in batch mode or maybe we will move straight to multi-containers & Kubernetes. In the meantime though, here is how to submit a .sas program as a batch job to this single container we have built.
Try creating your own image and deploying a container. Feel free to comment on your experience.
SAS Communities Article- Running SAS Analytics in a Docker container
SAS Global Forum Paper- Docker Toolkit for Data Scientists – How to Start Doing Data Science in Minutes!
SAS Global Forum Tech Talk Video- Deploying and running SAS in Containers
The post Use PROC BOXPLOT to display hundreds of box plots appeared first on The DO Loop.
A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many schools, the markers might overlap, which makes it difficult to visualize the distribution of scores. This is a general problem with using dot plots. An alternative visualization is to plot a box plot for each school district, which is described in today’s article.
Box plots (also called box-and-whisker plots) are used by statisticians to provide a schematic visualization of the distribution of some quantity. The previous article was written for non-statisticians, so I did not include any box plots. To understand a box plot, the reader needs to know how to interpret the box and whiskers:
Box plots require a certain level of comfort with statistical ideas. Nevertheless, for a statistical audience, box plots provide a compact way to compare dozens or hundreds of distributions.
I almost always use the SGPLOT procedure to create box plots, but today I’m going to demonstrate the BOXPLOT procedure. The BOXPLOT procedure is from the days before ODS graphics, but it has several nice features, including the following:
The second and third features are both useful for visualizing the SAT data for public high schools in NC.
You can use the INSETGROUP statement in PROC BOXPLOT to specify statistics that you want to display under each box plot. For example, the following syntax displays the number of high schools in each district and the median of the schools’ SAT scores. The WHERE clause filters the data so that the graph shows only the largest school districts (those with seven or more high schools).
ods graphics / width=700px height=480px;
title "Average SAT Scores for Large NC School Districts";
proc boxplot data=SATSortMerge;
   where _FREQ_ >= 7;   /* restrict to large school districts */
   plot Total*DistrictAbbr / grid odstitle=title nohlabel
                             boxstyle=schematicID
                             vaxis=800 to 1450 by 50;
   insetgroup Q2 N;
run;
The graph shows schematic box plots for 18 large school districts. The districts are sorted according to the median value of the schools’ SAT scores. The INSETGROUP statement creates a table inside the graph. The table shows the number of schools in each district and gives the median score for the district. The INSETGROUP statement can display many other statistics such as the mean, standard deviation, minimum value, and maximum value for each district.
One of the coolest features of PROC BOXPLOT is that it will automatically create a panel of box plots. It is difficult to visualize all 115 NC school districts in a single graph. The graph would be very wide (or tall) and the labels for the school districts would potentially collide. However, PROC BOXPLOT will split the display into a panel, which is extremely convenient if you plan to print the graphs on a piece of paper.
For example, the following call to PROC BOXPLOT results in box plots for 115 school districts. The procedure splits these box plots across a panel that contains five graphs and plots 23 box plots in each graph. Notice that I do not have to specify the number of graphs: the procedure uses the data to make an intelligent decision. To save space in this blog post, I omit three of the graphs and only show the first and last graphs:
ods graphics / width=640px height=400px;
title "Average SAT Scores for NC School Districts";
proc boxplot data=SATSortMerge;
   plot Total*DistrictAbbr / grid odstitle=title nohlabel
                             boxstyle=schematicID
                             vaxis=800 to 1450 by 50;
run;
Because the districts are ordered by the median SAT score, the first plot shows the school districts with high SAT scores and the last plot shows districts with lower SAT scores. Districts that have only one school are shown as a diamond (the mean value) with a line through it (the median value). Districts that have two or three schools are shown as a box without whiskers. For larger school districts, the box plots show a schematic representation of the distribution of the schools’ SAT scores.
In summary, PROC BOXPLOT has several useful features for plotting many box plots. This article shows that you can use the INSETGROUP statement to easily add a table of descriptive statistics to the graph. The procedure also automatically creates a panel of graphs so that you can more easily look at dozens or hundreds of box plots.
You can download the SAS program (NCSATBoxplots.sas) that creates the data and the graphs.
How to conditionally terminate a SAS batch flow process in UNIX/Linux was published on SAS Users.
In automated production (or business operations) environments, we often run SAS job flows in batch mode and on schedule. A SAS job flow is a collection of several inter-dependent SAS programs executed as a single process.
In my earlier posts, Running SAS programs in batch under Unix/Linux and Let SAS write batch scripts for you, I described how you can run SAS programs in batch mode by creating UNIX/Linux scripts that in turn incorporate other scripts invocations.
In this scenario you can run multiple SAS programs sequentially or in parallel, all while having a single root script kicked off on schedule. The whole SAS processing flow runs like a chain reaction.
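As a sketch of that chain-reaction pattern (with hypothetical stub functions standing in for the real scripts that invoke SAS), a root script can launch two jobs in parallel and run a third only after both finish:

```shell
#!/bin/sh
# job1/job2/job3 are hypothetical stand-ins for scripts that run SAS programs
job1() { echo "job1 done"; }
job2() { echo "job2 done"; }
job3() { echo "job3 done"; }

job1 &     # launch job1 in the background
job2 &     # launch job2 in parallel with job1
wait       # block until both background jobs finish
job3       # then run job3 sequentially
```

Replace the stub functions with invocations of your actual scripts (such as etl.sh) to get a real root script.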
However, sometimes we need to automatically stop and terminate that chain job flow execution if certain criteria are met (or not met) in a program of that process flow.
Let’s say our first job in a batch flow is a data preparation step (ETL) where we extract data tables from a database and prepare them for further processing. The rest of the batch process is dependent on successful completion of that critical first job. The process is kicked off at 3:00 a.m. daily, however, sometimes we run into a situation when the database connection is unavailable, or the database itself is not finished refreshing, or something else happens resulting in the ETL program completing with ERRORs.
This failure means that our data has not updated properly and there is no reason to continue running the remainder of the job flow process as it might lead to undesired or even disastrous consequences. In this situation we want to automatically terminate the flow execution and send an e-mail notification to the process owners and/or SAS administrators informing them about the mishap.
Suppose, we run the following main.sh script on UNIX/Linux:
#!/bin/sh

#1 extract data from a database
/sas/code/etl/etl.sh

#2 run the rest of processing flow
/sas/code/processing/tail.sh
The etl.sh script runs the SAS ETL process as follows:
#!/usr/bin/sh
dtstamp=$(date +%Y.%m.%d_%H.%M.%S)
pgmname="/sas/code/etl/etl.sas"
logname="/sas/code/etl/etl_$dtstamp.log"
/sas/SASHome/SASFoundation/9.4/sas $pgmname -log $logname
We want to run the tail.sh shell script (which itself runs multiple other scripts) only if the etl.sas program completes successfully, that is, if the SAS ETL process etl.sas that is run by etl.sh completes with no ERRORs or WARNINGs. Otherwise, we want to terminate the main.sh script and not run the rest of the processing flow.
To do this, we re-write our main.sh script as:
#!/bin/sh

#1 extract data from a database
/sas/code/etl/etl.sh

exitcode=$?
echo "Status=$exitcode (0=SUCCESS,1=WARNING,2=ERROR)"

if [ $exitcode -eq 0 ]
then
   #2 run the rest of processing flow
   /sas/code/processing/tail.sh
fi
In this code, we use a special shell script variable ($? for the Bourne and Korn shells, $STATUS for the C shell) to capture the exit status code of the previously executed OS command, /sas/code/etl/etl.sh:
exitcode=$?
Then the optional echo command just prints the captured value of that status for our information.
Every UNIX/Linux command executed by the shell script or user has an exit status represented by an integer number in the range of 0-255. The exit code of 0 means the command executed successfully without any errors; a non-zero value means the command was a failure.
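You can see this convention with the shell built-ins true and false; a subshell with an explicit exit statement fabricates an arbitrary code:

```shell
#!/bin/sh
true
echo "exit status of true:  $?"      # 0 = success
false
echo "exit status of false: $?"      # 1 = failure
( exit 42 )                          # a subshell returning a custom code
echo "exit status of subshell: $?"   # 42
```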
SAS System plays nicely with the UNIX/Linux Operating System. According to the SAS documentation Determining the Completion Status of a SAS Job in UNIX Environments, a SAS job returns the exit status code for its completion the same way the shell code does it – in the special shell script variable ($? for the Bourne and Korn shells, and $STATUS for the C shell.) A value of 0 indicates successful termination. For additional flexibility, SAS’ ABORT statement with an optional integer argument allows you to specify a custom exit status code.
The following table summarizes the values of the SAS exit status code:
Condition | Exit Status Code
---|---
All steps terminated normally | 0
SAS issued WARNINGs | 1
SAS issued ERRORs | 2
User issued ABORT statement | 3
User issued ABORT RETURN statement | 4
User issued ABORT ABEND statement | 5
SAS could not initialize because of a severe error | 6
User issued ABORT RETURN n statement | n
User issued ABORT ABEND n statement | n
Since our etl.sh script executes SAS code etl.sas, the exit status code is passed by the SAS System to etl.sh and consequently to our main.sh shell script.
Then, in the main.sh script we check if that exit code equals to 0 and then and only then run the remaining flow by executing the tail.sh shell script. Otherwise, we skip tail.sh and exit from the main.sh script reaching its end.
Alternatively, the main.sh script can be implemented with an explicit exit as follows:
#!/bin/sh

#1 extract data from a database
/sas/code/etl/etl.sh

exitcode=$?
echo "Status=$exitcode (0=SUCCESS,1=WARNING,2=ERROR)"

if [ $exitcode -ne 0 ]
then
   exit
fi

#2 run the rest of processing flow
/sas/code/processing/tail.sh
In this shell script code example, we check the exit return code value, and if it is NOT equal to 0, then we explicitly terminate the main.sh shell script using the exit command, which gets us out of the script immediately without executing the subsequent commands. In this case, our #2 command invoking the tail.sh script never gets executed, which effectively stops the batch flow process.
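The termination logic can be exercised without the real jobs. In this sketch, hypothetical stub functions stand in for etl.sh and tail.sh; the ETL stub fails with exit code 2, so the tail step is never reached:

```shell
#!/bin/sh
# hypothetical stand-ins for etl.sh and tail.sh
etl_job()  { echo "ETL step ran"; return 2; }   # simulate a run with ERRORs (exit code 2)
tail_job() { echo "tail step ran"; }

run_flow() {
   etl_job
   exitcode=$?
   echo "Status=$exitcode (0=SUCCESS,1=WARNING,2=ERROR)"
   if [ $exitcode -ne 0 ]
   then
      return 1      # terminate the flow early
   fi
   tail_job
}
run_flow
echo "flow finished with status $?"
```

Swap the stubs for your real etl.sh and tail.sh invocations and the same branching applies.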
If you also need to automatically send an e-mail notification to the designated people about the failed batch flow process, you can do it in a separate SAS job that runs right before exit command. Then the if-statement will look something like this:
if [ $exitcode -ne 0 ]
then
   # send an email and exit
   /sas/code/etl/email_etl_failure.sh
   exit
fi
That is, immediately after the email is sent, the shell script and the whole batch flow process get terminated by the exit command; no shell script commands beyond that if-statement will be executed.
Be extra careful if you use the special script variable $? directly in a script’s logical expression, without assigning it to an interim variable. For example, you could use the following script command sequence:
/sas/code/etl/etl.sh
if [ $? -ne 0 ]
. . .
However, let’s say you insert another script command between them, for example:
/sas/code/etl/etl.sh
echo "Status=$? (0=SUCCESS,1=WARNING,2=ERROR)"
if [ $? -ne 0 ]
. . .
Then the $? variable in the if [ $? -ne 0 ] statement will have the value of the previous echo command, not the /sas/code/etl/etl.sh command as you might expect.
Hence, I suggest capturing the $? value in an interim variable (e.g., exitcode=$?) right after the command whose exit code you are going to inspect, and then referencing that interim variable (as $exitcode) in your subsequent script statements. That will save you the trouble of inadvertently referring to a wrong exit code when you insert additional commands during your script development.
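Here is a quick way to convince yourself of the pitfall, using a subshell in place of the failing etl.sh:

```shell
#!/bin/sh
( exit 2 )                  # stand-in for a failing etl.sh
echo "Status=$?"            # prints Status=2, but echo itself succeeds...
if [ $? -ne 0 ]             # ...so $? is now 0: this inspects echo, not the job
then
   echo "failure detected"
else
   echo "failure missed!"   # this branch is taken
fi

( exit 2 )                  # same failing stand-in
exitcode=$?                 # capture the exit code immediately
echo "Status=$exitcode"
if [ $exitcode -ne 0 ]
then
   echo "failure detected"  # this branch is taken
fi
```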
What do you think about this approach? Did you find this blog post useful? Did you ever need to terminate your batch job flow? How did you go about it? Please share with us.
How to conditionally terminate a SAS batch flow process in UNIX/Linux was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post. |