This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Of course you know how to create graphs … But do you find that preparing the data to plot is often the hardest part? Well then, this blog post is for you! I’ll be demonstrating how to import Excel data into SAS, transpose the data, use what were formerly column […]
The post Import Excel data, transpose, and plot it! appeared first on SAS Learning Post.
The post SAS code golf: find the max digit in a string of digits appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.
“Code golf” is a fun programming pastime that challenges you to solve a problem with the least amount of code possible. Like regular golf, the goal is to use the fewest code “strokes” to hit the mark. Here’s a recent challenge that was posted to me via Twitter.
@cjdinger @SASJedi got a fun puzzle for you guys, we've been discussing at my office.
You have a character var with the string "000112010302". What's the least about of code that can be written to determine what is the highest number (3) in the string?— Wes (@SigurWes) July 17, 2018
While I feel that I can solve nearly any problem (that I can understand) using SAS, my knowledge of the SAS language is quite limited when compared to that of many experts. And so, I reached out to the SAS Support Communities for help on this one.
The answers were quick, creative, and diverse. I’ll share a few of them here.
The winner, in terms of concision, came from FreelanceReinhard. He supplied a macro-function one-liner:
%sysfunc(findc(123456789,00112010302,b));
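With this entry, FreelanceReinhard defied a natural algorithmic instinct to treat this as a numerical digit comparison problem, and instead approached it as a simple pattern-matching problem. The highest digit comes from a finite set (0..9). The FINDC function can tell you which of those digits is the first to be found in the target string. The b directive tells FINDC to work backwards through the pattern, from ‘9’ down to ‘1’. Because each digit in the pattern sits at the position equal to its own value, the position that FINDC returns is the maximum digit. If the target contains only zeros, FINDC returns 0, which is still the right answer.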
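To see the backward search at work, here is a minimal DATA step sketch. The test strings other than the one from the puzzle are made up for illustration, and the expected values in the comments follow from the position-equals-digit argument above.

```sas
/* Sketch: FINDC with the 'b' modifier searches '123456789' from right to left */
/* and returns the position of the first character that occurs in the list.    */
data _null_;
   length str $12;
   /* the first value is the puzzle string; the others are hypothetical tests */
   do str = '000112010302', '0000', '7', '190';
      maxDigit = findc('123456789', trim(str), 'b');
      put str= maxDigit=;   /* expect 3, 0, 7, and 9, respectively */
   end;
run;
```

The all-zero case returns 0 because none of the characters '1' through '9' is found, which happens to coincide with the correct maximum.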
In a similar vein, novinosrin’s approach uses the COMPRESS function to keep only those digits of the pattern that occur in the target string, in descending order, and then applies the FIRST function to return the top value.
a=first(compress('9876543210','00112010302','k'));
The COMPRESS function is often used to eliminate matching characters from a string, but the k directive inverts the action to keep only the matching characters instead.
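If you want to inspect the intermediate result, the following sketch splits the one-liner into two steps. The variable names kept and top are illustrative and are not part of the original entry.

```sas
/* Sketch: show what COMPRESS keeps before FIRST picks the leading character */
data _null_;
   str  = '00112010302';
   kept = compress('9876543210', str, 'k');  /* digits of the pattern that occur in str */
   top  = first(kept);                       /* first kept character = largest digit */
   put kept= top=;                           /* expect kept=3210 and top=3 */
run;
```

Because the pattern is written in descending order, the kept characters are also in descending order, so the first one is the maximum.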
If you wanted to use the more traditional approach of looping through values, comparing, and keeping just the maximum value, then you can hardly do better than the code offered by hashman.
do j = 1 to length (str) ; d = d <> input (char (str, j), 1.) ; end ;
Experienced SAS programmers will remember that the <> operator is shorthand for MAX (as opposed to “not equal” as some of us learned in Pascal or SQL). “MAX” might be clearer to read, but it requires an additional character. (Remember that “><” is shorthand for the MIN operator in SAS.)
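A short sketch of the two operators in a DATA step follows. The variable names are illustrative. The third assignment shows why hashman's loop works even though d starts out missing: a missing value sorts lower than any number, so the MAX operator returns the nonmissing operand.

```sas
/* Sketch: <> is MAX and >< is MIN in DATA step expressions */
data _null_;
   hi = 2 <> 7;     /* MAX: 7 */
   lo = 2 >< 7;     /* MIN: 2 */
   m  = . <> 5;     /* missing sorts low, so MAX returns 5 */
   put hi= lo= m=;
run;
```

Be aware that in WHERE expressions and in PROC SQL the same <> symbol means "not equal," so this shorthand is best reserved for DATA step code where the intent is clear.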
AhmedAl_Attar offered the most dangerous approach, using memory manipulation techniques to populate members of an array:
array ct [20] $1 _temporary_; call pokelong (str,addrlong(ct[1]),length(str)); c=max(of ct{*});
CALL POKELONG and ADDRLONG are documented along with several cautions due to the risk of overwriting something important in your process or system memory. But, they are fast-acting.
And finally, I knew that there would be an elegant matrix-based approach in SAS/IML. ChanceTGardener offered the first variant, and then Rick Wicklin echoed it shortly after.
proc iml;
str='000112010302';
maximum=max((substr(str,1:length(str),1)));
print maximum;
quit;
Code golf does not always produce the most readable, maintainable code. But puzzles like these encourage us to explore new features and nuanced behaviors of our favorite programming language, and thus broaden our understanding of how SAS really works.
Want to experiment with these different approaches? Here’s a SAS program that combines all of them. Think you can do better (or different)? Visit the communities topic and chime in.
data max;
   str = '00112010302';

   /* novinosrin's approach */
   a=first(compress('9876543210',str,'k'));

   /* FreelanceReinhard's approach */
   b=findc('123456789',str,-9);

   /* AhmedAl_Attar's approach using POKELONG */
   array ct [20] $1 _temporary_;
   call pokelong (str,addrlong(ct[1]),length(str));
   c=max(of ct{*});

   /* loop approach from hashman */
   /* remember that <> is MAX */
   do j = 1 to length (str) ;
      d = d <> input (char (str, j), 1.) ;
   end ;
   drop j;
run;

/* FreelanceReinhard's approach in a one-liner macro function */
%let str=00112010302;
%put max=%sysfunc(findc(123456789,&str.,b));

/* IML approach from ChanceTGardener */
/* Requires SAS/IML to run */
proc iml;
str='000112010302';
maximum=max((substr(str,1:length(str),1)));
print maximum;
quit;
The post Balanced bootstrap resampling in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
This article shows how to implement balanced bootstrap sampling in SAS.
The basic bootstrap samples with replacement from the original data (N observations) to obtain B new samples. This is called “uniform” resampling because each observation has a uniform probability of 1/N of being selected at each step of the resampling process.
Within the union of the B bootstrap samples, each observation is expected to appear B times.
Balanced bootstrap resampling (Davison, Hinkley, and Schechtman, 1986) is an alternative process in which each observation appears exactly B times in the union of the B bootstrap samples of size N. This has some practical benefits for estimating certain inferential statistics such as the bias and quantiles of the sampling distribution (Hall, 1990).
It is easy to implement a balanced bootstrap resampling scheme: Concatenate B copies of the data, randomly permute the B*N observations, and then use the first N observations for the first bootstrap sample, the next N for the second sample, and so forth. (Other algorithms are also possible, as discussed by Gleason, 1988.)
To illustrate the idea, consider the following data set that has N=6 observations. Five observations are clustered near x=0 and the sixth is a large outlier (x=10). The sample skewness for these data is skew=2.316 because of the influence of the outlier.
data Sample(keep=x);
input x @@;
datalines;
-1 -0.2 0 0.2 1 10
;

proc means data=Sample skewness;
run;

%let ObsStat = 2.3163714;
You can use the bootstrap to approximate the sampling distribution for the skewness statistic for these data. I have previously shown how to use SAS to bootstrap the skewness statistic: Use PROC SURVEYSELECT to form bootstrap samples, use PROC MEANS with a BY statement to analyze the samples, and use PROC UNIVARIATE to analyze the bootstrap distribution of skewness values. In that previous article, PROC SURVEYSELECT is used to perform uniform sampling (sampling with replacement).
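It is straightforward to modify the previous program to perform balanced bootstrap sampling. The following program is based on a SAS paper by Nils Penard at PhUSE 2012. It does the following: (1) uses PROC SURVEYSELECT to concatenate B copies of the data, (2) assigns a uniform random number to each of the B*N observations and sorts by that number, which randomly permutes the observations, and (3) merges the permuted values with the Replicate variable from the first step so that each consecutive block of N observations forms one bootstrap sample.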
/* balanced bootstrap computation */
proc surveyselect data=Sample out=DupData noprint
   reps=5000                /* duplicate data B times */
   method=SRS samprate=1;   /* sample w/o replacement */
run;

data Permute;
   set DupData;
   call streaminit(12345);
   u = rand("uniform");     /* generate a uniform random number for each obs */
run;

proc sort data=Permute; by u; run;   /* sort in random order */

data BalancedBoot;
   merge DupData(drop=x) Permute(keep=x);   /* reuse REPLICATE variable */
run;
You can use the BalancedBoot data set to perform subsequent bootstrap analyses.
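As a sketch of such a subsequent analysis, the following steps follow the recipe described earlier (PROC MEANS with a BY statement, then PROC UNIVARIATE). The output data set name BootStats and the variable name Skew are illustrative choices, and the sketch assumes that BalancedBoot is still ordered by the Replicate variable, which is how the merge above leaves it.

```sas
/* Sketch: compute the skewness for each balanced bootstrap sample */
proc means data=BalancedBoot noprint;
   by Replicate;                          /* one analysis per bootstrap sample */
   var x;
   output out=BootStats skewness=Skew;    /* BootStats and Skew are illustrative names */
run;

/* Summarize the bootstrap distribution of the skewness statistic */
proc univariate data=BootStats;
   var Skew;
   histogram Skew;
run;
```

With only six observations per sample, a few samples can consist of a single repeated value; the skewness for those samples is missing and is excluded from the summary.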
If you perform a bootstrap analysis, you obtain the following approximate bootstrap distribution for the skewness statistic. The observed statistic is indicated by a red vertical line. For reference, the mean of the bootstrap distribution is indicated by a gray vertical line. You can see that the sampling distribution for this tiny data set is highly nonnormal. Many bootstrap samples that contain the outlier will have a large skewness value. (In a balanced bootstrap, the outlier accounts for exactly one-sixth of the B*N resampled observations.)
To assure yourself that each of the original six observations appears exactly B times in the union of the bootstrap samples, you can run PROC FREQ, as follows:
proc freq data=BalancedBoot;   /* OPTIONAL: Show that each obs appears B times */
   tables x / nocum;
run;
As shown in the article “Bootstrap estimates in SAS/IML,” you can perform bootstrap computations in the SAS/IML language.
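For uniform sampling, the SAMPLE function samples with replacement from the original data. However, you can modify the sampling scheme to support balanced bootstrap resampling: use the REPEAT function to replicate the data B times, use the SAMPLE function with the "WOR" option to sample without replacement (which permutes the B*N values), and use the SHAPE function to reshape the permuted values into an N x B matrix whose columns are the bootstrap samples.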
The following SAS/IML program modifies the program in the previous post to perform balanced bootstrap sampling:
/* balanced bootstrap computation in SAS/IML */
proc iml;
use Sample; read all var "x"; close;
call randseed(12345);

/* Return a row vector of statistics, one for each column. */
start EvalStat(M);
   return skewness(M);               /* <== put your computation here */
finish;
Est = EvalStat(x);                   /* 1. observed statistic for data */

/* balanced bootstrap resampling */
B = 5000;                            /* B = number of bootstrap samples */
allX = repeat(x, B);                 /* replicate the data B times */
s = sample(allX, nrow(allX), "WOR"); /* 2. sample without replacement (=permute) */
s = shape(s, nrow(x), B);            /* reshape to (N x B) */

/* use the balanced bootstrap samples in subsequent computations */
bStat = T( EvalStat(s) );            /* 3. compute the statistic for each bootstrap sample */
bootEst = mean(bStat);               /* 4. summarize bootstrap distrib such as mean */
bias = Est - bootEst;                /* Estimate of bias */
RBal = Est || BootEst || Bias;       /* combine results for printing */
print RBal[format=8.4 c={"Obs" "BootEst" "Bias"}];
As shown in the previous histogram, the bias estimate (the difference between the observed statistic and the mean of the bootstrap distribution) is sizeable.
It is worth mentioning that the SAS-supplied %BOOT macro performs balanced bootstrap sampling by default. To generate balanced bootstrap samples with the %BOOT macro, set the BALANCED=1 option, as follows:
%boot(data=Sample, samples=5000, balanced=1) /* or omit BALANCED= option */
If you want uniform (unbalanced) samples, call the macro as follows:
%boot(data=Sample, samples=5000, balanced=0)
In conclusion, it is easy to generate balanced bootstrap samples. Balanced sampling can improve the efficiency of certain bootstrap estimates and inferences. For details, see the previous references or Appendix II of Hall (1992).
The Base SAS DATA step has been a powerful tool for many years for SAS programmers. But as data sets grow and programmers work with massively parallel processing (MPP) computing environments such as Teradata, Hadoop or the SAS High-Performance Analytics grid, the data step remains stubbornly single-threaded. Welcome DS2 – […]
The post What DS2 can do for the DATA step appeared first on SAS Learning Post.
The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.
My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.
Robert’s implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of “thickening” or “inflating” a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.
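Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting shape that has no holes.) You can define the offset region of radius r as the union of the following shapes: the polygon itself, a circle of radius r centered at each vertex, and a rectangle of width r positioned outside the polygon along each edge.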
The following graphic shows the offset region (r = 0.5) for a convex “house-shaped” polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):
The image on the right shows why the process is sometimes called “inflating” a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always the union of line segments and arcs of circles.
You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left turning) vertex looks the same as before. However, for the nonconvex (right turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because if the distance r is greater than the minimum distance between vertices, nonlocal effects can occur. For example, the following graphic shows a nonconvex polygon that has two “prongs.” Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset region is not a simple polygon. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for nonconvex polygons.
The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon.
The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.
The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison’s post. Remember, though, that the distances in Robert’s article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane.
For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.
I don’t want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements!
The key part of the computation uses vector operations to construct the outward-facing normal vector of length r to each edge of the polygon. If v is the vector that connects the ith and (i+1)th vertices of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees.
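The following SAS/IML sketch applies that vector expression to every edge of a small polygon. It is not the program from the download; the unit square, the matrix names, and the choice of M are assumptions for illustration. In particular, the sketch assumes that vertices are stored as rows in counterclockwise order, so that multiplying a row vector on the right by M = {0 -1, 1 0} rotates it clockwise by 90 degrees and therefore points away from the interior.

```sas
/* Sketch: outward-facing normal vectors of length r for each edge of a polygon */
proc iml;
P = {0 0,  1 0,  1 1,  0 1};        /* hypothetical polygon: unit square, counterclockwise */
r = 0.5;                            /* offset distance */
M = {0 -1,
     1  0};                         /* (vx, vy)*M = (vy, -vx): clockwise 90-degree rotation */
Q = P[2:nrow(P), ] // P[1, ];       /* the "next" vertex for each edge (wraps around) */
v = Q - P;                          /* edge vectors, one per row */
len = sqrt(v[ ,##]);                /* Euclidean length of each edge */
normals = r * (v / len) * M;        /* one outward normal of length r per row */
print normals;
quit;
```

For the bottom edge of the square, v = (1, 0) and the expression gives (0, -0.5), which points down and away from the square. Adding each normal to the two endpoints of its edge gives the outer corners of the rectangle for that edge.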
You can download the SAS program that computes all the images in this article.
In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon.
The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r of an arbitrary polygon.
The United States declared independence in 1776, and we celebrate it on July 4th every year. But the land areas that make up the United States today weren’t necessarily the same as they were back then. So I thought it would be interesting to create a map showing when each […]
The post How old is your county? appeared first on SAS Learning Post.
The post The probability that two random chords of a circle intersect appeared first on The DO Loop.
In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is “What is the probability that two randomly chosen chords of a circle intersect?”
This article shows how to create a simulation in SAS to estimate the probability.
For this problem, a “random chord” is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle.
The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle. Label the points according to their polar angle as p1, p2, p3, and p4. As illustrated by the following graphic, the four points can be joined into two chords in one of three ways: [p1,p2] and [p3,p4]; [p1,p4] and [p2,p3]; or [p1,p3] and [p2,p4]. Each pairing is equally likely, and only the last one produces chords that cross. Consequently, the probability that two random chords intersect is 1/3.
You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those angles to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.
proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1);
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;

/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                     /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);             /* 4 pts on unit circle */
   p1 = pts[1,]; p2 = pts[2,];
   q1 = pts[3,]; q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
end;
prob = mean(intersect);
print prob;
This simulation produces an estimate that is close to the exact probability of 1/3.
This problem has an interesting connection to Bertrand’s Paradox. Bertrand’s paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define “random chords” in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.
For the definition of “random chords” in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample.
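You can check the nonuniformity yourself with a short simulation. The following DATA step is a sketch, not the program that produced the graph: the data set name, the seed, and the sample size are arbitrary choices. It draws two independent uniform angles, forms the chord between the corresponding points on the unit circle, and records the chord length.

```sas
/* Sketch: simulate lengths of chords whose endpoints are uniform on the unit circle */
data ChordLen(keep=len);
   call streaminit(2468);                  /* arbitrary seed */
   twoPi = 2*constant('pi');
   do i = 1 to 100000;
      t1 = twoPi*rand("uniform");          /* polar angle of first endpoint */
      t2 = twoPi*rand("uniform");          /* polar angle of second endpoint */
      len = sqrt( (cos(t1)-cos(t2))**2 + (sin(t1)-sin(t2))**2 );
      output;
   end;
run;

proc means data=ChordLen n mean min max;
   var len;
run;
```

Under this definition the chord length is 2*sin(φ), where φ is half the angle between the endpoints and is uniform on (0, π), so the theoretical mean length is 4/π ≈ 1.273. The sample mean should be close to that value, which is well above the value of 1 that uniformly distributed lengths on (0, 2) would give.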
If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand’s Paradox.
The post The intersection of two line segments appeared first on The DO Loop.
Back in high school, you probably learned to find the intersection of two lines in the plane. The intersection requires solving a system of two linear equations. There are three cases: (1) the lines intersect in a unique point, (2) the lines are parallel and do not intersect, or (3) the lines are coincident. Thus, for two lines, the intersection problem has either 1, 0, or infinitely many solutions. Most students quickly learn that the lines always intersect when their slopes are different, whereas the special cases (parallel or coincident) occur when the lines have the same slope.
Recently I had to find the intersection between two line segments in the plane. Line segments have finite extent, so segments with different slopes may or may not intersect.
For example, the following panel of graphs shows three pairs of line segments in the plane. In the first panel, the segments intersect. In the second panel, the segments have the same slopes as in the first panel, but these segments do not intersect. In the third panel, the segments intersect in an interval.
This article shows how to construct a linear system of equations that distinguishes between the three cases and computes an intersection point, if it exists.
Let p1 and p2 be the endpoints of one segment and let q1 and q2 be the endpoints of the other. Recall that a parametrization of the first segment is (1-s)*p1 + s*p2, where s ∈ [0,1] and the endpoints are treated as 2-D vectors. Similarly, a parametrization of the second segment is (1-t)*q1 + t*q2, where t ∈ [0,1]. Consequently, the segments intersect if and only if there exist values for (s,t) in the unit square such that
(1-s)*p1 + s*p2 = (1-t)*q1 + t*q2
You can rearrange the terms to rewrite the equation as
(p2-p1)*s + (q1-q2)*t = q1 - p1
This is a vector equation which can be rewritten in terms of matrices and vectors. Define the 2 x 2 matrix A whose first column contains the elements of (p2-p1) and whose second column contains the elements of (q1-q2). Define b = q1 - p1 and x = (s,t). If the solution of the linear system A*x = b is in the unit square, then the segments intersect. If the solution is not in the unit square, the segments do not intersect. If the segments have the same slope, then the matrix A is singular and you need to perform additional tests to determine whether the segments intersect.
As shown above, the intersection of two planar line segments is neatly expressed in terms of a matrix-vector system. In SAS, the SAS/IML language provides a natural syntax for expressing and solving matrix-vector equations. The following SAS/IML function constructs and solves a linear system. For simplicity, this version does not handle the degenerate case of two segments that have the same slope. That case is handled in the next section.
proc iml;
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1);
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;

/* Test 1: intersection at (0.95, 1.25) */
p1 = {1.8 2.1}; p2 = {0.8 1.1};
q1 = {1 1.25};  q2 = {0 1.25};
z = IntersectSegsSimple(p1,p2,q1,q2);
print z;

/* Test 2: no intersection */
p1 = {-1 0.5}; p2 = {1 0.5};
q1 = {0 1};    q2 = {0 2};
v = IntersectSegsSimple(p1, p2, q1, q2);
print v;
The function contains only a few statements. The function is called to solve the examples in the first two panels of the previous graph. The SOLVE function solves the linear system (assuming that a solution exists), and the IF-THEN statement tests whether the solution is in the unit square [0,1] x [0,1]. If so, the function returns the point of intersection. If not, the function returns a pair of missing values.
For many applications, the function in the previous section is sufficient because it handles the generic cases. For completeness the following module also handles segments that have identical slopes. The DET function determines whether the segments have the same slope. If so, the segments could be parallel or collinear. To determine whether collinear segments intersect, you can test for three conditions: (1) q1 is inside [p1,p2], (2) q2 is inside [p1,p2], or (3) p1 is inside [q1,q2].
Notice that the condition “p2 is inside [q1,q2]” does not need to be checked separately because it is already handled by the existing checks. If any of the three conditions is true, there are infinitely many solutions (or the segments share an endpoint). If none of the conditions hold, the segments do not intersect.
For overlapping segments, the following function returns an endpoint of the intersection interval.
/* handle all cases: determine intersection of two planar line segments
   [p1, p2] and [q1, q2] */
start Intersect2DSegs(p1, p2, q1, q2);
   b = colvec(q1 - p1);
   A = colvec(p2-p1) || colvec(q1-q2);
   if det(A)^=0 then do;               /* nonsingular system: 0 or 1 intersection */
      x = solve(A, b);                 /* x = (s,t) */
      if all(0<=x && x<=1) then        /* if x is in [0,1] x [0,1] */
         return (1-x[1])*p1 + x[1]*p2; /* return intersection */
      else                             /* segments do not intersect */
         return ({. .});               /* return missing values */
   end;
   /* same slope: segments are parallel or collinear.
      If q1 is not on the line through p1 and p2, the segments are
      parallel but not collinear, so they cannot intersect. */
   if det(colvec(p2-p1) || b)^=0 then
      return ({. .});
   /* segments are collinear: 0 or infinitely many intersections */
   denom = choose(p2-p1=0, ., p2-p1);  /* protect against division by 0 */
   s = (q1 - p1) / denom;              /* Is q1 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q1;
   s = (q2 - p1) / denom;              /* Is q2 in [p1, p2]? */
   if any(0<=s && s<=1) then
      return q2;
   denom = choose(q2-q1=0, ., q2-q1);  /* protect against division by 0 */
   s = (p1 - q1) / denom;              /* Is p1 in [q1, q2]? */
   if any(0<=s && s<=1) then
      return p1;
   return ({. .});                     /* segments are disjoint */
finish;

/* test overlapping segments; return endpoint of one segment */
p1 = {-1 1}; p2 = {1 1};
q1 = {0 1};  q2 = {2 1};
w = Intersect2DSegs(p1, p2, q1, q2);
print w;
In summary, by using matrices, vectors, and linear algebra, you can easily solve for the intersection of two line segments or determine that the segments do not intersect. The general case needs some special logic to handle degenerate configurations, but the code that solves the generic cases is straightforward when expressed in a vectorized language such as SAS/IML.
The post Using %IF-%THEN-%ELSE in SAS programs appeared first on The SAS Dummy.
SAS programmers have long wanted the ability to control the flow of their SAS programs without having to resort to complex SAS macro programming. With SAS 9.4 Maintenance 5, it’s now supported! You can now use %IF-%THEN-%ELSE constructs in open code. This is big news — even if it only recently came to light on SAS Support Communities. (Thanks to Super User Tom for asking about it.)
Prior to this change, if you wanted to check a condition — say, whether a data set exists — before running a PROC, you had to code it within a macro routine. It would look something like this:
/* capture conditional logic in macro */
%macro PrintIfExists();
  %if %sysfunc(exist(work.result)) %then
    %do;
      proc means data=work.result;
      run;
    %end;
  %else
    %do;
      %PUT WARNING: Missing WORK.RESULT - report process skipped.;
    %end;
%mend;

/* call the macro */
%PrintIfExists();
Now you can simplify this code to remove the %MACRO/%MEND wrapper and the macro call:
/* If a file exists, take an action */
/* else fail gracefully */
%if %sysfunc(exist(work.result)) %then
  %do;
    proc means data=work.result;
    run;
  %end;
%else
  %do;
    %PUT WARNING: Missing WORK.RESULT - report process skipped.;
  %end;
Here are some additional ideas for how to use this feature. I’m sure you’ll be able to think of many more!
When developing your code, it’s now easier to leave debugging statements in and turn them on with a simple flag.
/* Conditionally produce debugging information */
%let _DEBUG = 0; /* set to 1 for debugging */
%if &_DEBUG. %then
  %do;
    proc print data=sashelp.class(obs=10);
    run;
  %end;
If you have code that’s under construction and should never be run while you work on other parts of your program, you can now “IF 0” out the entire block. As a longtime C and C++ programmer, this reminds me of the “#if 0 / #endif” preprocessor directives as an alternative for commenting out blocks of code. Glad to see this in SAS!
/* skip processing of blocks of code */
/* like #if 0 / #endif in C/C++ */
%if 0 %then
  %do;
    proc ToBeDetermined;
      READMYMIND = Yes;
    run;
  %end;
I have batch jobs that run daily, but that send e-mail to people only one day per week. Now this is easier to express inline with conditional logic.
/* If it's Monday, send a weekly report by email */
%if %sysfunc(today(),weekday1.)=2 %then
  %do;
    options emailsys=smtp emailhost=myhost.company.com;
    filename output email
      subject = "Weekly report for &SYSDATE."
      from = "SAS Dummy <sasdummy@sas.com>"
      to = "knowledgethirster@curious.net"
      ct ='text/html';

    ods tagsets.msoffice2k(id=email)
      file=OUTPUT(title="Important Report!")
      style=seaside;
    title "The Weekly Buzz";
    proc print data=amazing.data;
    run;
    ods tagsets.msoffice2k(id=email) close;
  %end;
For batch jobs especially, system environment variables can be a rich source of information about the conditions under which your code is running. You can glean user ID information, path settings, network settings, and so much more. If your SAS program needs to pick up cues from the running environment, this is a useful method to accomplish that.
/* Check for system environment vars before running code */
%if %sysfunc(sysexist(ORACLE_HOME)) %then
  %do;
    %put NOTE: ORACLE client is installed.;
    /* assign an Oracle library */
    libname ora oracle path=corp schema=alldata authdomain=oracle;
  %end;
As awesome as this feature is, there are a few rules that apply to the use of the construct in open code. These are different from what’s allowed within a %MACRO wrapper.
First rule: your %IF/%THEN must be followed by a %DO/%END block for the statements that you want to conditionally execute. The same is true for any statements that follow the optional %ELSE branch of the condition.
And second: no nesting of multiple %IF/%THEN constructs in open code. If you need that flexibility, you can do that within a %MACRO wrapper instead.
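Both rules can be seen in a short sketch. The macro variable runmode and the macro checkEnv are hypothetical names used only for illustration.

```sas
/* Hypothetical flag, for illustration only */
%let runmode = PROD;

/* Open code: each branch needs its own %DO/%END block */
%if &runmode. = PROD %then %do;
   %put NOTE: Running with production settings.;
%end;
%else %do;
   %put NOTE: Running with test settings.;
%end;

/* Nested conditions are allowed only inside a macro definition */
%macro checkEnv(mode, debug);
   %if &mode. = PROD %then %do;
      %if &debug. = 1 %then %do;
         %put NOTE: Production run with debugging on.;
      %end;
      %else %do;
         %put NOTE: Production run.;
      %end;
   %end;
   %else %do;
      %put NOTE: Non-production run.;
   %end;
%mend checkEnv;

%checkEnv(PROD, 1)
```

If the inner %IF were written directly in open code instead of inside %MACRO/%MEND, SAS would reject it, which is why the nested logic is moved into the macro.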
And remember, this works only in SAS 9.4 Maintenance 5 and later. That includes the most recent release of SAS University Edition, so if you don’t have the latest SAS release in your workplace, this gives you a way to kick the tires on this feature if you can’t wait to try it.
The post Using %IF-%THEN-%ELSE in SAS programs appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
The post Compute derivatives for nonparametric regression models appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
SAS enables you to evaluate a regression model at any location within the range of the data. However, sometimes you might be interested in how the predicted response is increasing or decreasing at specified locations. You can use finite differences to compute the slope (first derivative) of a regression model. This numerical approximation technique is most useful for nonparametric regression models that cannot be simply written in terms of an analytical formula.
The following data are the hypothetical concentrations of a drug in a patient’s bloodstream at times (measured in hours) during a 72-hour period after the drug is administered. For a real drug, a pharmacokinetic researcher might construct a parametric model and fit the model by using PROC NLMIXED. The following example uses the EFFECT statement to fit a regression model that uses cubic splines. Other nonparametric models in SAS include loess curves, generalized additive models, adaptive regression, and thin-plate splines.
data Drug;
input Time Concentration @@;
datalines;
1 3 3 7 6 19 12 73 18 81 24 71 36 38 42 28 48 20 72 12
;
proc glmselect data=Drug;
   effect spl = spline(Time/ naturalcubic basis=tpf(noint) knotmethod=percentiles(5));
   model Concentration = spl / selection=none; /* fit model by using spline effects */
   store out=SplineModel; /* store model for future scoring */
quit;
Because the data are not evenly distributed in time, a graph of the spline fit evaluated at the data points does not adequately show the response curve. Notice that the call to PROC GLMSELECT used a STORE statement to store the model to an item store. You can use PROC PLM to score the model on a uniform grid of values to visualize the regression model:
/* use uniform grid to visualize curve */
data ScoreData;
do Time = 0 to 72;
   output;
end;
run;
/* score the model on the uniform grid */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreResults;
run;
/* merge fitted curve with original data and plot the fitted curve */
data All;
set Drug ScoreResults;
run;
title "Observed and Predicted Blood Content of Drug";
proc sgplot data=All noautolegend;
   scatter x=Time y=Concentration;
   series x=Time y=Predicted / name="fit" legendlabel="Predicted Response";
   keylegend "fit" / position=NE location=inside opaque;
   xaxis grid values=(0 to 72 by 12) label="Hours";
   yaxis grid;
run;
A researcher might be interested in knowing the slope of the regression curve at certain time points. The slope indicates the rate of change of the response variable (the blood-level concentration). Because nonparametric regression curves do not have explicit formulas, you cannot use calculus to compute a derivative. However, you can use finite difference formulas to compute a numerical derivative at any point.
There are several finite difference formulas for the first derivative. The forward and backward formulas are less accurate than the central difference formula. Let h be a small value. Then the approximate derivative of a function f at a point t is given by
f′(t) ≈ [ f(t + h) – f(t – h) ] / (2h)
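The accuracy claim can be checked with Taylor expansions. The following is a standard sketch, not part of the original article, and it assumes that f is smooth enough near t (for a cubic spline, that means t ± h does not straddle a knot):

```latex
\begin{aligned}
f(t \pm h) &= f(t) \pm h f'(t) + \frac{h^2}{2} f''(t) \pm \frac{h^3}{6} f'''(t) + O(h^4) \\[4pt]
\frac{f(t+h) - f(t)}{h} &= f'(t) + \frac{h}{2} f''(t) + O(h^2)
  && \text{(forward difference: error is } O(h)\text{)} \\[4pt]
\frac{f(t+h) - f(t-h)}{2h} &= f'(t) + \frac{h^2}{6} f'''(t) + O(h^4)
  && \text{(central difference: error is } O(h^2)\text{)}
\end{aligned}
```

The even-order terms cancel when the two expansions are subtracted, so the central difference gains an order of accuracy for the same number of model evaluations.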
The formula says that you can approximate the slope at t by evaluating the model at the points t ± h. You can use the DIF function to compute the difference between the response function at adjacent time points and divide that difference by 2h. The code that scores the model is similar to the more-familiar case of scoring on a uniform grid. The following statements evaluate the derivative of the model at six-hour intervals.
/* compute derivatives at specified points */
data ScoreData;
h = 0.5e-5;
do t = 6 to 48 by 6;       /* six-hour intervals */
   Time = t - h; output;
   Time = t + h; output;
end;
keep t Time h;
run;
/* score the model at each time point */
proc plm restore=SplineModel noprint;
   score data=ScoreData out=ScoreOut; /* Predicted column contains f(t-h) and f(t+h) */
run;
/* compute first derivative by using central difference formula */
data Deriv;
set ScoreOut;
Slope = dif(Predicted) / (2*h); /* [f(t+h) - f(t-h)] / (2h) */
Time = t;                       /* estimate slope at this time */
if mod(_N_,2)=0;                /* process observations in pairs; keep even obs */
drop h t;
run;
proc print data=Deriv;
run;
The output shows that after six hours the drug is entering the bloodstream at 6.8 units per hour. By 18 hours, the rate of absorption has slowed to 0.6 units per hour.
After 24 hours, the rate of absorption is negative, which means that the blood-level concentration is decreasing. At approximately 30 hours, the drug is leaving the bloodstream at a rate of 3.5 units per hour (a slope of -3.5).
This technique generalizes to other nonparametric models. If you can score the model, you can use the central difference formula to approximate the first derivative in the interior of the data range.
For more about numerical derivatives, including a finite-difference approximation of the second derivative, see Warren Kuhfeld’s article on derivatives for penalized B-splines. Warren’s article is focused on how to obtain the predicted values that are generated by the built-in regression models in PROC SGPLOT (LOESS and PBSPLINE), but it contains derivative formulas that apply to any regression curve.
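As a rough sketch of the second-derivative case, the standard central formula f″(t) ≈ [ f(t + h) – 2 f(t) + f(t – h) ] / h² can be applied with the same scoring approach. The code below is not taken from either article. It assumes that the SplineModel item store from the earlier PROC GLMSELECT step still exists, and the data set names ScoreData2, ScoreOut2, and Deriv2, the variable SecondDeriv, and the step size h = 0.01 are illustrative choices. A larger h is used than for the first derivative because dividing by h² amplifies rounding error.

```sas
/* three scoring points per location: t-h, t, t+h */
data ScoreData2;
h = 0.01;                       /* assumed step size for illustration */
do t = 6 to 48 by 6;
   Time = t - h; output;
   Time = t;     output;
   Time = t + h; output;
end;
keep t Time h;
run;

/* score the stored model; requires the SplineModel item store */
proc plm restore=SplineModel noprint;
   score data=ScoreData2 out=ScoreOut2;
run;

/* central difference approximation of the second derivative */
data Deriv2;
set ScoreOut2;
fMinus = lag2(Predicted);       /* f(t-h) when on the third row of a triple */
fMid   = lag(Predicted);        /* f(t)   when on the third row of a triple */
if mod(_N_,3)=0;                /* keep the row where Predicted = f(t+h) */
SecondDeriv = (Predicted - 2*fMid + fMinus) / h**2;
Time = t;                       /* estimate applies at this time */
keep Time SecondDeriv;
run;

proc print data=Deriv2;
run;
```

The LAG and LAG2 calls are placed before the subsetting IF so that they execute on every observation, which keeps their queues aligned with the rows of each triple.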