Expanding lengths of all character variables in SAS data sets was published on SAS Users.
In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables, on a case-by-case basis. But what if you need to change the lengths of all character variables in a data set? Or do this for all data sets in a data library? For example, suppose you need to expand (increase) the lengths of all your character variables by 50%. Then the case-by-case approach becomes too laborious and inefficient.
Before reading any further, let’s take a quick quiz:
Q: A character variable length attribute represents a number of:
If your answer is anything but B, it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable’s values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.
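As a quick illustration of the length attribute, the LENGTH statement below sets the number of bytes, not characters, that each value occupies. The data set and variable names are made up for this example:

```sas
data example;
   length name $ 40     /* character variable: 40 bytes of storage */
          score 8;      /* numeric variable: 8 bytes (the default) */
   name = 'Müller';     /* in UTF-8, the 'ü' occupies 2 of the 40 bytes */
   score = 95;
run;
```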
It is true, though, that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, because those systems used exactly one byte per character. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is that they can accommodate at most 2^8=256 symbols, which is not nearly enough to cover all the variety of natural languages, special characters, emojis, etc.
For one scenario, consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in the 32-bit version 4 format (IPv4, e.g. ‘125.255.201.180’). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard’s longer character values.
In another scenario, you migrate/move SAS data sets from an older SBCS environment to a newer Multi-Byte Character Set (MBCS) encoding environment. In such a case, the ability to increase character variable lengths in bulk with a simple action becomes especially significant and critical.
Currently, the most commonly used MBCS is Unicode which is supported by all modern operating systems, databases and web browsers. Out of different flavors of Unicode (UTF-8, UTF-16, UTF-32) the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 various characters that covers most modern languages, including Arabic and Hebrew characters, hieroglyphs, emojis as well as many other special characters.
Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.
When SAS reads an SBCS-encoded data set and writes its records into a UTF-8-encoded data set, it throws an ERROR message in the log and stops execution:
ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
When SAS reads an SBCS-encoded data set and produces only a UTF-8-encoded printed report (without generating a UTF-8-encoded output data set), it generates a WARNING message (with a description identical to the ERROR message above) while continuing execution:
WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
Either the ERROR or the WARNING is unacceptable and must be properly addressed.
Whether or not character transcoding is involved, SAS’ CVP engine provides a short and effective answer. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does: it pads, or expands, character variable lengths by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.
The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP engine to actually convert a data set, or a whole data library, into a version with expanded character variables. All we need to do is define our source library as a CVP library, for example:
libname inlib cvp 'c:\source_folder';
Then use PROC COPY to create expanded versions of our original data sets in a target library:
libname outlib 'c:\target_folder';

proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;
Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:
proc copy in=inlib out=outlib noclone;
run;
It’s that easy. And the icing on the cake is that, by default, the CVP engine automatically adjusts the variables’ format widths to match the expanded byte lengths for all converted character variables.
CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.
To avoid data loss from possible truncation during transcoding, we can use the above code with a slight but important modification: define the target library with the outencoding='UTF-8' option. As a result, our target data will be not only expanded lengthwise but properly encoded as well. We then run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:
libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';

proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;
Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:
proc copy in=inlib out=outlib noclone;
run;
After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.
There are several options available with the CVP Engine. Here are the most widely used:
CVPBYTES=bytes – specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.
Example: libname inlib 'SAS data-library' cvpbytes=5;
The CVPBYTES= option implicitly specifies the CVP engine; that is, if you specify the CVPBYTES= option, you don’t have to specify the CVP engine explicitly because SAS will use it automatically.
CVPMULTIPLIER=multiplier – specifies a multiplier value that expands character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and the CVP engine determines the multiplier automatically.
Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;
The CVPMULTIPLIER= option also implicitly specifies the CVP engine; that is, if you specify the CVPMULTIPLIER= option, you don’t have to specify the CVP engine explicitly because SAS will use it automatically.
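To tie these options back to the earlier goal of expanding all character lengths by 50%, here is a minimal sketch that combines CVPMULTIPLIER= with PROC COPY. The folder names are placeholders, and the multiplier value 1.5 is an illustrative assumption:

```sas
/* expand all character variable lengths by 50% (multiplier 1.5 is an example value) */
libname inlib 'c:\source_folder' cvpmultiplier=1.5;   /* implicitly uses the CVP engine */
libname outlib 'c:\target_folder' outencoding='UTF-8';

proc copy in=inlib out=outlib noclone;   /* NOCLONE so the expanded attributes are used */
run;
```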
Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.
The post Use simulation to estimate the power of a statistical test appeared first on The DO Loop.
A previous article about standardizing data in groups shows how to simulate data from two groups. One sample (with n1=20 observations) is simulated from an N(15, 5) distribution, whereas a second (with n2=30 observations) is simulated from an N(16, 5) distribution. Although I didn’t mention it, due to random sampling variation the sample means of the two groups are close to each other: 14.9 and 15.15. In fact,
if you run a t test on these data, the t test concludes that the samples probably came from populations that have the same mean parameter. More formally, the t test
fails to reject the null hypothesis that the two group means are the same.
The probability that a test correctly rejects a false null hypothesis is called the power of the hypothesis test; the probability of making an incorrect inference like this one (a Type 2 error) is 1 minus the power. This article uses PROC POWER to compute the exact power of a t test and then uses a simulation to estimate the power.
Along the way, I show a surprising feature of SAS/IML functions.
Let’s run the t test on the simulated data. The following SAS DATA step is the same as in the previous article. It simulates the data and calls PROC TTEST to analyze the data. (In this article, all tests are performed at the α=0.05 significance level.) The null hypothesis is that the group means are equal; the alternative hypothesis is that they are not equal. The data are simulated from distributions that have the same variance, so you can use the pooled t test to analyze the data:
/* Samples from N(15,5) and N(16, 5). */
%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;

data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);
   output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);
   output;
end;
keep Group x;
run;

proc ttest data=TwoSample plots=none;
   class Group;
   var x;
run;
I blurred the rows of the tables that correspond to the Satterthwaite test so that you can focus on the pooled t test.
The pooled test uses the difference between the sample means (-0.25) and the pooled standard deviation (4.86)
to construct a t test for the difference of means. The p-value for the two-sided test is 0.8613, which means that the data do not provide enough evidence to reject the null hypothesis.
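In symbols, the pooled test uses the standard textbook formulas (which also appear in the SAS documentation):

```latex
s_p^2 = \frac{(n_1-1)\,s_x^2 + (n_2-1)\,s_y^2}{n_1+n_2-2},
\qquad
t = \frac{\bar{x}-\bar{y}}{\sqrt{s_p^2\,(1/n_1 + 1/n_2)}}
```

and compares |t| with the critical value of a t distribution with n1+n2-2 degrees of freedom. For these data, t = -0.25 / (4.86·sqrt(1/20 + 1/30)) ≈ -0.18, which is far from significant.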
Because we simulated the data, we know that, in fact, the difference between the population means is not zero. In other words, the t test failed to reject the null hypothesis even though the alternative hypothesis (unequal means) is true.
This is called a “Type 2” error. The power of a statistical hypothesis test is the probability that the test correctly rejects the null hypothesis, so the probability of making a Type 2 error is 1 minus the power. Typically, the power depends on the sample size and the magnitude of the effect you are trying to detect.
A two-sample t test is one of the tests that is supported by PROC POWER in SAS. You can use PROC POWER to find the power of the t test (and hence the probability of a Type 2 error) for random samples of size n1=20 and n2=30 when the true difference between the means is 1.
/* compute exact power */
proc power;
   twosamplemeans
   power = .          /* missing ==> "compute this" */
   meandiff = 1       /* difference of means */
   stddev = 5         /* N(mu1-mu2, 5) */
   groupns = (20 30); /* num obs in the samples */
run;
The tables tell you that (with these parameter values) the power of the two-sided t test is only about 0.104 at the α=0.05 significance level. In other words, the test commits a Type 2 error about 90% of the time. So it is not surprising that the simulated sample failed to reject the null hypothesis: only about 1 in 10 random samples leads the t test to conclude that the samples come from populations that have different means.
I previously wrote about how to use simulation to estimate the power of a t test.
That article shows how to use the DATA step to simulate many independent samples and then uses BY-group processing in PROC TTEST to analyze each sample. You use the ODS OUTPUT statement to capture
the statistics for each analysis.
For some samples, the test rejects the null hypothesis; for others, it does not.
The proportion of samples that reject the null hypothesis is an estimate of the power of the test.
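That DATA step/BY-group approach can be sketched as follows. This is a minimal illustration rather than the code from that earlier article: the number of samples (1000), the seed, and the data set names Sim and TTests are assumptions for the example:

```sas
/* simulate 1000 samples and run a pooled t test on each */
data Sim;
   call streaminit(12345);
   do SampleID = 1 to 1000;
      Group = 1;  do i = 1 to 20;  x = rand('Normal', 15, 5);  output;  end;
      Group = 2;  do i = 1 to 30;  x = rand('Normal', 16, 5);  output;  end;
   end;
   keep SampleID Group x;
run;

ods exclude all;                  /* suppress the 1000 printed tables */
proc ttest data=Sim;
   by SampleID;
   class Group;  var x;
   ods output TTests=TTests;      /* capture the test statistics */
run;
ods exclude none;

/* the proportion of samples that reject H0 estimates the power */
data Rejections;
   set TTests;
   where Method = "Pooled";
   RejectH0 = (Probt < 0.05);
run;
proc freq data=Rejections;
   tables RejectH0 / nocum;
run;
```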
If you can write a SAS/IML function that implements a t test,
you can perform the same power computation in PROC IML. The following SAS/IML program (based on Chapter 5 of Wicklin (2013) Simulating Data with SAS) defines a function that computes the t statistic for a difference of means. The function returns a binary value that indicates whether the test rejects the null hypothesis.
To test the function, I call it on the same simulated data:
proc iml;
/* SAS/IML function for a pooled t test.
   X : column vector with n1 elements from a population with mean mu_x.
   Y : column vector with n2 elements from a population with mean mu_y.
   Compute the pooled t statistic for the difference of means
   and return whether the statistic rejects the null hypothesis
      H0: mu_x = mu_y
   at the alpha signif level.
   Formulas from the SAS documentation: https://bit.ly/30rr2GV */
start TTestH0(x, y, alpha=0.05);
   n1 = nrow(X);  n2 = nrow(Y);        /* sample sizes */
   meanX = mean(x);  varX = var(x);    /* mean & var of each sample */
   meanY = mean(y);  varY = var(y);
   DF = n1 + n2 - 2;
   /* pooled variance: https://blogs.sas.com/content/iml/2020/06/29/pooled-variance.html */
   poolVar = ((n1-1)*varX + (n2-1)*varY) / DF;
   /* Compute test statistic, critical value, and whether we reject H0 */
   t = (meanX - meanY) / sqrt(poolVar*(1/n1 + 1/n2));
   *print (meanX - meanY)[L="Diff (1-2)"] (sqrt(poolVar))[L="PoolSD"] t;
   t_crit = quantile('t', 1-alpha/2, DF);
   RejectH0 = (abs(t) > t_crit);       /* binary 0|1 for 2-sided test */
   return RejectH0;
finish;

use TwoSample;
   read all var 'x' into x where(group=1);
   read all var 'x' into y where(group=2);
close;

reject = TTestH0(x, y);
print reject;
To confirm that the computations are the same as in PROC TTEST, I uncommented the PRINT statement in the module.
The computations are the same as for PROC TTEST. The test fails to reject the null hypothesis for these data.
Notice that the SAS/IML function is both concise (less than 10 lines) and mimics the formulas that you see in textbooks or documentation. This “natural syntax” is one of the advantages of the SAS/IML language.
The SAS/IML function looks like a typical function for a univariate analysis, but it contains a surprise. Without making any modifications, you can call the function to run thousands of t tests. The module is written in a vectorized manner. If you send in matrices that have B columns, the module runs a t test for each column and returns a binary row vector that contains B values.
Take a close look at the SAS/IML function. The MEAN and VAR functions operate on columns of a matrix and return row vectors. Consequently, the meanX, meanY, varX, and varY variables are row vectors.
Therefore, the poolVar and t variables are also row vectors, and
the RejectH0 variable is a binary row vector.
To demonstrate that the function can analyze many samples in a single call, the following statements simulate 100,000 random samples from the N(15,5) and N(16,5) distributions. Each column of the X and Y matrices is a sample. One call to the TTestH0 function computes 100,000 t tests. The “colon” subscript reduction operator computes the mean of the vector, which is the proportion of tests that reject the null hypothesis. As mentioned earlier, this is an estimate of the power of the test.
/* simulation of power */
B = 1e5;             /* number of independent samples (number of cols) */
n1 = 20;  n2 = 30;   /* sample sizes (number of rows) */
call randseed(321);
x = j(n1, B);        /* allocate array for samples from Group 1 */
y = j(n2, B);        /* allocate array for samples from Group 2 */
call randgen(x, 'Normal', &mu1, &sigma);   /* X ~ N(mu1,sigma) */
call randgen(y, 'Normal', &mu2, &sigma);   /* Y ~ N(mu2,sigma) */
reject = TTestH0(x, y);   /* result is row vector with B elements */
power = reject[:];
print power[L="Power (B=1E5)"];
The result is very close to the exact power as computed by PROC POWER.
Notice also that the SAS/IML function, when properly written, can test thousands of samples as easily as it tests one sample. The program does not require an explicit loop over the samples.
This article starts with a simulated sample that contains two groups. Although the groups are simulated from distributions that have different means,
due to random sampling variation a t test fails to reject the null hypothesis that the means are equal. This is a Type 2 error. For random samples from the specified distributions, PROC POWER shows that the power of the t test is only about 0.1, which means that the test commits a Type 2 error about 90% of the time. You can estimate the power by simulating many samples and computing the proportion that rejects the null hypothesis. The simulation agrees with the exact power.
I previously wrote a simulation of power by using the DATA step and BY-group processing in PROC TTEST. But in this article, I implement a t test by writing a SAS/IML function. You might be surprised to discover that you can pass many samples to the function, which runs a test on each column of the input matrices.
It is worth mentioning that
you can repeat this computation for many parameter values to obtain a power curve.
For example, you can vary the difference between the means to determine how the power depends on the magnitude of the difference between the means.
Learning to think like SAS was published on SAS Users.
The most fundamental concept that students learning introductory SAS programming must master is how SAS handles data. This might seem like an obvious statement, but it is often overlooked by students in their rush to produce code that works. I often tell my class to step back for a moment and “try to think like SAS” before they even touch the keyboard. There are many key topics that students must understand in order to be successful SAS programmers. How does SAS compile and execute a program? What is the built-in loop that SAS uses to process data observation by observation? What are the coding differences when working with numeric and character data? How does SAS handle missing observations?
One concept that is a common source of confusion for students is how to tell SAS to treat rows versus columns. An example that we use in class is how to write a program to calculate a basic descriptive statistic, such as the mean. The approach that we discuss is to identify our goal, rows or columns, and then decide what SAS programming statements are appropriate by thinking like SAS. First, we decide if we want to calculate the mean of an observation (a row) or the mean of a variable (a column). We also pause to consider other issues such as the type of variable, in this case numeric, and how SAS evaluates missing data. Once these concepts are understood we can proceed with an appropriate method: using DATA step programming, a procedure such as MEANS, TABULATE, REPORT or SQL, and so on. For more detailed information about this example there is an excellent user group paper on this topic called “Many Means to a Mean” written by Shannon Pileggi for the Western Users of SAS Software conference in 2017. In addition, The Little SAS® Book and its companion book, Exercises and Projects for the Little SAS® Book, Sixth Edition, address these types of topics in easy-to-understand examples followed up with thought-provoking exercises.
Here is an example of the type of question that our book of exercises and projects uses to address this type of concept.
Avg1 = MEAN(X1, X2, X3);
Avg2 = (X1 + X2 + X3) / 3;

PROC MEANS;
   VAR X1 X2 X3;
RUN;
In the book, we provide solutions for odd-numbered multiple choice and short answer questions, and hints for the programming exercises. Here is the solution for this question:
For more information about The Little SAS Book and its companion book of exercises and projects, check out these blogs:
The post 4 ways to standardize data in SAS appeared first on The DO Loop.
A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable’s standard deviation.
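In symbols, the standardized value of an observation x is

```latex
z = \frac{x - \bar{x}}{s}
```

where \bar{x} is the sample mean and s is the sample standard deviation of the variable.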
Recently, I wanted to perform a slight variation of the usual standardization:
Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.
Let’s start by creating some example data.
The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=20; the second is n2=30.
Notice that the data are ordered by the Group variable.
%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;

/* Create two samples N(mu1, sigma) and N(mu2, sigma),
   with sample sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);
   output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);
   output;
end;
keep Group x;
run;

ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;
The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are
mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.
Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specify the METHOD=STD method (which is the default method). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:
/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;

proc means data=ScaleSample Mean StdDev Min Max ndec=2;
   class Group;
   var x;
run;
As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.
A second way to standardize the data is to use the DATA step to center and scale each variable and each group.
You can overwrite the contents of each column, or (as I’ve done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:
/* because we know the population parameters,
   we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);   /* put parameter values into an array */
set TwoSample;
y = (x - mu[Group]) / &sigma;
run;

proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;
Because I use the population parameters instead of the sample estimates, the transformed variable has neither zero mean nor unit variance.
You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.
You can specify the parameters in “wide form” or in “long form”. In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters.
The following DATA step creates a data set that has three variables: Group, Mu, and Sigma.
Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.
data ScaleWide;
Group=1; Mu=&mu1; Sigma=&sigma; output;
Group=2; Mu=&mu2; Sigma=&sigma; output;
run;

proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* column name in METHOD=IN data set */
   scale sigma;    /* column name in METHOD=IN data set */
run;

proc means data=StdWide Mean StdDev Min Max ndec=2;
   class Group;
   var x;
run;
The output is identical to the output from the previous section and is not shown.
You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_.
The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.
For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_=’LOCATION’) and scale (_TYPE_=’SCALE’) the X variable for each group:
data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;

proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;

proc means data=Scale2;
   class Group;
   var x;
run;
Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.
This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data.
Most analysts want to
standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters: use the DATA step, use PROC STDIZE with METHOD=IN and a wide-form parameter data set, or use PROC STDIZE with METHOD=IN and a long-form parameter data set.
The post Submatrices of matrices appeared first on The DO Loop.
Have you ever seen the “brain teaser” for children that shows a 4 x 4 grid and asks “how many squares of any size are in this grid?” To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square. Hence there are a total of (16+9+4+1) = 30 squares of any size.
Recently a SAS statistical programmer asked a similar question about submatrices of a matrix. Specifically, the programmer wanted to form square submatrices of consecutive rows and columns.
This article shows a few tricks in the SAS/IML language that enable you to generate and compute with all
k x k submatrices of a matrix, where the submatrices are formed by using consecutive rows and columns of the original matrix. Although it is possible to consider more general submatrices, in this article, a “submatrix” always refers to one that is formed by using consecutive rows and columns.
I’ve previously written about several ways to extract submatrices from a matrix.
Recall that the “colon operator” (formally called the index creation operator) generates a vector of consecutive integers. You can use these vectors in the row and column positions of the subscripts to extract a submatrix. For example, the following SAS/IML statements define a 4 x 4 matrix and extract the four 3 x 3 submatrices:
A = {4 3 1 6,
     2 4 3 1,
     0 2 4 3,
     5 0 2 4};

/* extract the four 3 x 3 submatrices */
A11 = A[1:3, 1:3];
A12 = A[1:3, 2:4];
A21 = A[2:4, 1:3];
A22 = A[2:4, 2:4];
Manually specifying the row and column numbers is tedious.
Let’s write a SAS/IML function that will return a k x k submatrix of A by starting at the (i,j)th position of A. Call k the order of the submatrix.
/* Given A, return the square submatrix of order k starting at index (i,j) */
start submat(A, i, j, k);
   return A[ i:(i+k-1), j:(j+k-1) ];
finish;

/* example : 3 x 3 submatrices of A */
A11 = submat(A, 1, 1, 3);
A12 = submat(A, 1, 2, 3);
A21 = submat(A, 2, 1, 3);
A22 = submat(A, 2, 2, 3);

/* https://blogs.sas.com/content/iml/2015/12/02/matrices-graphs-gridded-layout.html */
ods layout gridded columns=4 advance=table;
print A11, A12, A21, A22;
ods layout end;
If A is an n x m matrix, we can ask how many submatrices of order k are in A. Recall that we are only considering submatrices formed by consecutive rows and columns. There are
(n – k + 1) sequences of consecutive rows of length k, such as 1:k, 2:(k+1), and so forth.
Similarly, there are
(m – k + 1) sequences of consecutive columns of length k. So the following SAS/IML function counts the number of submatrices of order k. You can call the function for different k=1, 2, 3, and 4 in order to solve the “Counting Squares” brain teaser:
/* count all submatrices of order k for the matrix A */
start CountAllSubmat(A, k);
   numRows = nrow(A) - k + 1;
   numCols = ncol(A) - k + 1;
   return numRows * numCols;
finish;

n = nrow(A);
numSubmat = j(n, 1, .);
do k = 1 to n;
   numSubmat[k] = CountAllSubmat(A, k);
end;
order = T(1:n);
print order numSubmat;
If you add up the number of squares of each size, you get a total of 30 squares.
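For the 4 x 4 grid, the count of squares of all sizes is the sum

```latex
\sum_{k=1}^{4} (4 - k + 1)^2 = 4^2 + 3^2 + 2^2 + 1^2 = 30.
```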
You can iterate over all submatrices of a specified order. You can perform a matrix computation on each submatrix (for example, compute its determinant), or you can pack it into a list for future processing.
The following SAS/IML function iterates over submatrices and puts each submatrix into a list. The STRUCT subroutine from the ListUtil package can be used to indicate the contents of the list in a concise form.
start GetListSubmat(A, k);
   numSub = CountAllSubmat(A, k);
   L = ListCreate( numSub );          /* create list with numSub items */
   cnt = 1;
   do i = 1 to nrow(A)-k+1;           /* for each row */
      do j = 1 to ncol(A)-k+1;        /* and for each column */
         B = submat(A, i, j, k);      /* compute submatrix of order k from (i,j)th cell */
         call ListSetItem(L, cnt, B); /* put in list (or compute with B here) */
         cnt = cnt + 1;
      end;
   end;
   return L;
finish;

/* Example: get all matrices of order k=2 */
L = GetListSubmat(A, 2);
package load ListUtil;
call Struct(L);
The output from the STRUCT call shows that there are nine 2 x 2 matrices. The Value1–Value4 columns show the values (in row-major order). For example, the first 2 x 2 submatrix (in the upper left corner of A) is {4 3, 2 4}. The second submatrix (beginning at A[1,2]) is {3 1, 4 3}.
It is straightforward to modify the function to compute the determinant or some other matrix computation.
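For example, here is a minimal sketch of such a modification (assuming the submat and CountAllSubmat functions defined earlier are available; the function name DetAllSubmat is made up for this illustration) that computes the determinant of every submatrix of order k:

```sas
/* determinant of every k x k submatrix of A (illustrative sketch) */
start DetAllSubmat(A, k);
   dets = j(CountAllSubmat(A, k), 1, .);   /* allocate result vector */
   cnt = 1;
   do i = 1 to nrow(A)-k+1;
      do j = 1 to ncol(A)-k+1;
         dets[cnt] = det( submat(A, i, j, k) );
         cnt = cnt + 1;
      end;
   end;
   return dets;
finish;

d = DetAllSubmat(A, 2);   /* determinants of the nine 2 x 2 submatrices */
print d;
```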
This article shows some tips and techniques for dealing with submatrices of a matrix.
The submatrices in this article
are formed by using consecutive rows and columns.
Thus, they are similar to the “Counting Squares” brain teaser, which requires that you consider sub-squares of order 1, 2, 3, and so forth.
The post Submatrices of matrices appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
Cast your vote! SAS Support Communities are on the 2020 Khoros Kudos Awards Ballot! was published on SAS Users.
Help us celebrate our collective success! Our SAS Support Communities are nominated for Best-in-Class: Community and Bottom Line Savings Rock Star. Voting closes August 14.
Voting has opened for the first annual Khoros Kudos Awards, and this year the SAS Support Communities have been nominated in two categories:
Join us in celebrating the growth and success of our online communities by casting your vote by 5 pm PT on Friday, August 14! To vote, click the category name above, scroll to the end of the entry and click ‘Kudo’. (Note: Voting requires registering, but it’s quick and Khoros won’t spam you).
Not familiar with the SAS Support Communities? Our Communities are a wonderful resource for SAS users of all levels – and a great way to connect, share code and explore with fellow SAS aficionados, academics, learners, professionals, experts – and employees, too!
SAS Support Communities began more than 15 years ago, and have grown to nearly 340k members (22,800 new members this year alone!) and are home to over 600k discussions around topics like SAS products, solutions, administration and training. The SAS Support Communities are designed to help you succeed, and they’re a great place to network, collaborate and uncover insights with fellow SAS users. Members can take advantage of specially curated content, peer and expert recommendations, critical product updates and more.
Need help? Ask the community any time of day! Start a conversation to get tips and share what you know with others. Our team of SAS experts monitor the forums 24/7 to provide assistance when you need it most – and maintains a community library of user-friendly resources like how-to tutorials and handy tips. If you’re new to the SAS Communities, check out our New SAS User community and our welcome resources to get started!
Thank you for being a part of our SAS Support Communities and Happy Posting!
The post Run tasks in parallel in SAS Viya appeared first on The DO Loop.
The iml action was introduced in Viya 3.5. As shown in a previous article, the iml action supports
ways to implement the map-reduce paradigm, which is a way to distribute a computation by using multiple threads.
The map-reduce paradigm is ideal for “embarrassingly parallel” computations, which are composed of many independent and essentially identical computations. I wrote an example program that shows how to run a Monte Carlo simulation in parallel by using the iml action in SAS Viya.
The map-reduce paradigm is often used when there is ONE task to perform and you want to assign a portion of the work to each thread. A different scenario is when you have SEVERAL independent tasks to perform. You can assign each task to a separate thread and ask each thread to compute an entire task.
The iml action supports the PARTASKS function (short for “parallel tasks”) for running tasks in separate threads.
This article shows a simple example.
Suppose you have three tasks. The first takes 2 seconds to run, the next takes 0.5 seconds, and the third takes 1 second.
If you run a set of tasks sequentially, the total time is the sum of the times for the individual tasks:
2 + 0.5 + 1 = 3.5 seconds. In contrast, if you run the tasks in parallel, the total time is the maximum of the individual times: 2 seconds. Running the tasks in parallel saves time, and the speed-up depends on the ratio of the largest time to the total time, which is 2/3.5 in this example. Thus, the time for the parallel computation is 57% of the time for the sequential computation.
In general, the largest speedup occurs when the tasks each run in about the same time. In that situation (ignoring overhead costs), the total time can be reduced by a factor of as much as 1/k when you run k tasks in k threads.
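The sequential-versus-parallel arithmetic is easy to demonstrate in any language that supports threads. This Python sketch uses scaled-down sleep times (0.2, 0.05, and 0.1 seconds) to stand in for the three tasks; it is an illustration of the timing argument, not of the iml action itself:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Three independent "tasks"; sleeps stand in for real work.
def task(seconds):
    time.sleep(seconds)
    return seconds

times = [0.2, 0.05, 0.1]  # scaled-down versions of 2, 0.5, and 1 seconds

t0 = time.perf_counter()
seq = [task(s) for s in times]       # sequential: total ~ sum(times) = 0.35
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as ex:
    par = list(ex.map(task, times))  # parallel: total ~ max(times) = 0.2
t_par = time.perf_counter() - t0

print(f"sequential {t_seq:.2f}s, parallel {t_par:.2f}s")
```

The parallel elapsed time tracks the longest single task, while the sequential elapsed time tracks the sum, which is exactly the 2 / 3.5 ratio described above.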
Let’s run three tasks sequentially, then run the same tasks in parallel.
Suppose you have a large symmetric N x N matrix and you need to compute three things: the determinant, the matrix inverse, and the eigenvalues.
The following call to PROC IML runs the three tasks sequentially.
The TOEPLITZ function is used to create a large symmetric matrix.
proc iml;
N = 2000;
A = toeplitz( (N:1)/100 );   /* N x N symmetric matrix */
det = det(A);                /* get det(A) */
inv = inv(A);                /* get inv(A) */
eigval = eigval(A);          /* get eigval(A) */
The previous computations are performed by directly calling three built-in SAS/IML functions. Of course,
in practice, the tasks will be more complex than these.
Almost surely, each task will be encapsulated into a SAS/IML module. To emulate that process, I am going to define three trivial "task modules." The following statements perform the same computations, but now each "task" is performed by calling a user-written module:
start detTask(A);
   return det(A);
finish;
start invTask(A);
   return inv(A);
finish;
start eigvalTask(A);
   return eigval(A);
finish;

det = detTask(A);         /* get det(A) */
inv = invTask(A);         /* get inv(A) */
eigval = eigvalTask(A);   /* get eigval(A) */
You can use the PARTASKS function to run the three tasks concurrently.
To use the PARTASKS function, each task
must be a user-defined SAS/IML function that has exactly one input argument and returns one output argument.
For our example, the detTask, invTask, and eigvalTask modules satisfy these requirements.
(If a task requires more
than one input argument, you can pack the arguments into a list and pass the list as a single argument.)
The syntax of the PARTASKS function is
result = PARTASKS( Tasks, TaskArgs, Options );
where Tasks is a character vector that contains the names of the task modules, TaskArgs is a list of arguments (the i_th item is passed to the i_th task), and Options is an optional vector that controls how the tasks are distributed and whether to display information about the tasks.
Suppose you are running SAS Viya and you have access to a machine that has at least four threads. You can use the CAS statement to define a session that uses only one machine (no worker nodes). The CAS statement might look something like this:
cas sess0 host='your_host_name' casuser='your_username' port=&YourPort; /* SMP, 0 workers */
Then you can run the PARTASKS function by making the following call to the iml action. If you are not familiar with the iml action, see the getting started example for the iml action.
proc cas;
session sess0;             /* SMP session: controller node only */
loadactionset 'iml';       /* load the action set */
source partasks;           /* put program in SOURCE/ENDSOURCE block */
   start detTask(A);       /* 1st task */
      return det(A);
   finish;
   start invTask(A);       /* 2nd task */
      return inv(A);
   finish;
   start eigvalTask(A);    /* 3rd task */
      return eigval(A);
   finish;

   /* ----- Main Program ----- */
   N = 2000;
   A = toeplitz( (N:1)/100 );            /* N x N symmetric matrix */
   Tasks = {'detTask' 'invTask' 'eigvalTask'};
   Args = [A, A, A];                     /* each task gets same arg in this case */
   opt = {1,                             /* distribute to threads on controller */
          1};                            /* display information about the tasks */
   Results = ParTasks(Tasks, Args, opt); /* results are returned in a list */

   /* the i_th list item is the result of the i_th task */
   det = Results$1;       /* get det(A) */
   inv = Results$2;       /* get inv(A) */
   eigval = Results$3;    /* get eigval(A) */
endsource;
iml / code=partasks nthreads=4;
run;
The results are identical to the sequential computations in PROC IML. However, this program executes the three tasks in parallel, so the total time for the computations is the time required by the longest task.
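For readers more familiar with general-purpose languages, the PARTASKS calling pattern maps loosely onto submitting a list of functions to a thread pool and collecting the results in task order. This is only an analogy; the task functions and string results below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# A rough analogue of PARTASKS: a vector of tasks, one argument per task,
# and a result list whose i-th item is the result of the i-th task.
def det_task(A):    return "det result"
def inv_task(A):    return "inv result"
def eigval_task(A): return "eigval result"

tasks = [det_task, inv_task, eigval_task]
args  = ["A", "A", "A"]  # each task gets the same argument here

with ThreadPoolExecutor(max_workers=len(tasks)) as ex:
    futures = [ex.submit(f, a) for f, a in zip(tasks, args)]
    results = [f.result() for f in futures]  # ordered by task, not by finish time

print(results)
```

As with PARTASKS, the results come back in the order the tasks were specified, regardless of which thread finishes first.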
You can use the following diagram to help visualize the program. (Click to enlarge.)
The diagram shows the following:
This article demonstrates a simple call to the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute tasks among k threads. Each thread runs concurrently and independently. The total time for the computation depends on the longest time that any one thread spends on its computation.
This example is simple but demonstrates the main ideas of distributing tasks in parallel.
In a future article, I will present a more compelling example.
For more information and examples, see Wicklin and Banadaki (2020), “Write Custom Parallel Programs by Using the iml Action,”
which is the basis for these blog posts.
Another source is the SAS IML Programming Guide, which includes
documentation and examples for the iml action.
How to prompt for a date range in a SAS Visual Analytics Report - Four Part Series was published on SAS Users.
In this demo, I show how to prompt for a date range in a SAS Visual Analytics report. We walk through how to configure prompts using four types of control objects:
Then we cover how to define parameters and configure filters for report objects using the prompted date boundaries. These techniques are demonstrated using SAS Viya 3.4 and SAS Visual Analytics 8.4, but they can be applied to earlier releases.
I’ve also written accompanying articles in a four-part series. Below the video see the links to the articles and their corresponding timestamps in the video.
01:33 – 1st Example: Slider (Range) – How to prompt for a date range in a SAS VA report – Example 1 Slider (Range)
03:01 – 2nd Example: Slider (Single Value) – How to prompt for a date range in a SAS VA report – Example 2 Slider (Single Value)
14:04 – 3rd Example: Drop-down list – How to prompt for a date range in a SAS VA report – Example 3 Drop-down list
19:33 – 4th Example: Text input – How to prompt for a date range in a SAS VA report – Example 4 Text input
How to translate your cURL command into SAS code was published on SAS Users.
How often have you needed Google Translate, but for SAS code?
SAS Technical Support often gets requests like the following: “I have this API named <insert name of really cool API here> and I want to process data I get back from the API with SAS. How do I do it?”
In this article, I’ll show you how to translate the examples from your API documentation (offered in a select few languages) to the equivalent in SAS.
My first recommendation is to know your API. Most APIs come with documentation and will give specific guidance regarding how to POST data (send) to the API and how to GET data (receive) from the API. Often, examples are provided using cURL commands.
With this information, you are welcome to examine the SAS documentation for the HTTP procedure and build your code. Before you call or email SAS Technical Support asking for PROC HTTP code, I encourage you to verify that you can communicate with your API (or URL) from outside of SAS. One way to do so is with cURL. cURL (Client URL) is a command-line tool that is shipped with many UNIX flavors and installed easily on Windows. With a cURL command, you can interact with your URLs from outside of SAS from a command-line prompt like this:
curl -o "c:\temp\file.txt" -request "https://httpbin.org/get"
From SAS, you can use a cURL command with the DATA step, like this:
data _null_; %sysexec curl -o "c:\temp\file.txt" -request "https://httpbin.org/get"; run; |
File.txt contains the response from the URL:
{ "args": {}, "headers": { "Accept": "*/*", "Host": "httpbin.org", "Range": "bytes=Request", "User-Agent": "curl/7.46.0", "X-Amzn-Trace-Id": "Root=1-5f028cd3-2ec7e1e05da1f616e9106ee8" }, "origin": "149.173.8.1", "url": "https://httpbin.org/get" }
However, if you use SAS® Enterprise Guide® or SAS® Studio, you might not have permissions to run operating system commands from SAS, so you need a way to translate your cURL commands to SAS. The previous cURL command is easily translated to the following PROC HTTP code:
filename out "c:\temp\file.txt";
proc http
   url="https://httpbin.org/get"
   out=out;
run;
The cURL command supports many options and features. Check out the cURL reference page. SAS can’t guarantee that all are directly translatable to PROC HTTP, but I do want to cover some of the most popular that SAS customers have asked about.
If your cURL command uses the -d (DATA) option, you’ll use the IN= argument in your PROC HTTP statement. Here I am posting to the URL httpbin.org/post a file called test.csv, which resides in my c:\temp directory:
curl -d "@c:\temp\test.csv" -X POST "https://httpbin.org/post"
This command translates to the following PROC HTTP code:
filename test "c:\temp\test.csv";
proc http
   url="https://httpbin.org/post"
   /* If the IN= argument is used, then method="post" is the default */
   /* and therefore unnecessary in this step */
   method="post"
   in=test;
run;
None of the URLs above require authentication, but you’ll likely find authentication is part of most APIs. Many APIs have moved to OAuth for authentication. This method of authentication requires the use of an access token, which you obtain with a POST command. With the correct credentials, this cURL command posts to the SAS® Viya® SASLogon REST API in order to obtain an access token:
curl -X POST "https://server.example.com/SASLogon/oauth/token" \ -H "Content-Type: application/x-www-form-urlencoded" \ -d "grant_type=password&username=userid&password=mypassword" \ -u "app:mysecret" -o "c:\temp\token.txt"
The following PROC HTTP code does the same task:
filename out temp;
proc http
   url="http://servername/SASLogon/oauth/token"
   in="grant_type=password&username=userid&password=mypassword"
   webusername="clientid"
   webpassword="clientsecret"
   method="post"
   out=out;
   headers "Content-Type"="application/x-www-form-urlencoded";
run;
*Read about ways to hide your credentials in Chris Hemedinger’s post here: How to secure your REST API credentials in SAS programs.
The output file contains an access token, necessary to make requests on behalf of a client to the REST API. In this example, a cURL command like the following requests a list of folders from the Folders microservice:
curl -X GET "https://server.example.com/folders/folders/@myFolder" \ -H "Accept: application/json" \ -H "Authorization: Bearer TOKEN-STRING"
The PROC HTTP code will look like this if you “directly translate”:
filename new temp;
proc http
   url="https://server.example.com/folders/folders/@myFolder"
   method="get"
   out=new;
   headers "Accept"="application/json"
           "Authorization"="Bearer TOKEN-STRING";
run;
But starting in SAS® 9.4M5, there’s a shortcut with the OAUTH_BEARER option:
filename new temp;
proc http
   OAUTH_BEARER="TOKEN-STRING"
   url="https://server.example.com/folders/folders/@myFolder"
   method="get"
   out=new;
run;
I can’t tell you about PROC HTTP without a mention of the JSON engine. Starting in SAS® 9.4M4, the JSON engine enables us to easily read JSON files. I could use the previous cURL command to pipe my access token to a file with the -o argument, but with my PROC HTTP code I can easily move that value into a macro variable. I’ll add a LIBNAME statement that points to the fileref from the previous step:
libname test json fileref=new;
I can then examine the contents of the JSON output with this step:
proc contents data=test._all_;
run;
Here I spy the access token I will need for a later PROC HTTP step:
Here’s how I can place it in a macro variable:
data _null_;
   set test.root;
   call symputx("access_token", access_token);
run;
%put &access_token;
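If you want to see what the JSON engine and CALL SYMPUTX are accomplishing, here is the same extraction sketched with Python's standard json module. The response shape follows the usual OAuth token format, and the token value is invented for illustration:

```python
import json

# A typical OAuth token response body; the token value is made up.
token_response = '{"access_token": "abc123", "token_type": "bearer", "expires_in": 43199}'

payload = json.loads(token_response)       # parse the JSON text
access_token = payload["access_token"]     # pull out the one field we need
print(access_token)
```

This is the value that the macro variable &access_token holds after the DATA step runs.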
So everywhere I used TOKEN-STRING in the previous code, I can now use the macro variable instead, like this:
proc http
   OAUTH_BEARER="&access_token"
   url="https://server.example.com/folders/folders/@myFolder"
   method="get"
   out=new;
run;
With the cURL command, you can use the -v (verbose) argument to get connection and header information. It’s helpful for debugging and diagnosing trouble spots.
In SAS® 9.4M5, the DEBUG statement was added to PROC HTTP. DEBUG supports several options. I’ll highlight the LEVEL= option here:
Level= 0 | 1 | 2 | 3
Selecting 0 provides no debugging information, while 3 provides the most detailed data and messages. The Base SAS 9.4 Procedures Guide includes full descriptions of each debug level.
See the HTTP Procedure documentation for additional syntax.
If you want someone to "write your code for you," I recommend using a product like Postman to test your POST and GET commands with your API from outside of SAS. Postman is a free-to-use product with a super cool feature: it can produce cURL commands from successful communication with a URL.
This post was kindly contributed by Avocet Solutions - go there to comment and to read the full post. |
With COVID-19 spreading worldwide, accurate data have become more important than ever. In this blog, I share some of my favorite sources:
The Economist Reported deaths often underestimate actual deaths. One way to get at the real numbers is to compare total deaths from all causes against the typical death rate. This "mortality tracker" plots excess deaths, which are a more reliable measure than reported deaths.
Johns Hopkins University This interactive dashboard by the Coronavirus Resource Center at the Bloomberg School of Public Health shows detailed data about the pandemic worldwide.
National Public Radio These interactive graphics by NPR focus on the pandemic in the US.
SAS Institute This interactive dashboard gives a different look at global COVID-19 data.
Avi Schiffmann This webpage may be the most impressive effort by an individual person, and shows that tabular data can be profoundly thought-provoking too.
These articles are also highly recommended:
The Risks–Know Them–Avoid Them This article explains in plain language how COVID-19 spreads and how to keep yourself safe. Share this with your family.
COVID-19 Superspreader Events in 28 Countries: Critical Patterns and Lessons This fascinating article compiles data about superspreader events (SSEs) and reveals a lot about how this virus is spread.
Temporary reduction in daily global CO2 emissions during the COVID-19 forced confinement Finally, something positive: an article about the reduction in CO2 emissions due to the pandemic.
Knowledge is power. Working together we can all stay healthy.