Running SAS programs in parallel using SAS/CONNECT® was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
As earth completes its routine annual circle around the sun and a new (and hopefully better) year kicks in, it is a perfect occasion to reflect on the idiosyncrasy of time.
While it is customary to think that 3+2=5, that is only true in a sequential world. In a parallel world, however, 3+2=3. Think about it: if you have two SAS programs, one of which runs for 3 hours and the other for 2 hours, their total duration will be 5 hours if you run them one after another, but only 3 hours if you run them simultaneously, in parallel.
I am sure you remember those “filling up a swimming pool” math problems from elementary school. They clearly and convincingly demonstrate that two pipes will fill up a swimming pool faster than one. That’s the power of running water in parallel.
The same principle of parallel processing (or parallel computing) is applicable to SAS programs (or non-SAS programs) by running their different independent pieces in separate SAS sessions at the same time (in parallel). Divide and conquer.
You might be surprised at how easily this can be done, and at the same time how powerful it is. Let’s take a look.
SAS/CONNECT® is one of the oldest SAS products. It was developed to enable SAS programs to run in multi-machine client/server environments. In its original incarnation, SAS/CONNECT allowed only synchronous execution of remote SAS sessions. That is, when a remote session was started, the client session was suspended until processing by the server session had completed. That was client/server, but not parallel processing.
Starting with SAS 8 released in 1999, Multi-Process Connect (MP CONNECT) parallel processing functionality was added to SAS/CONNECT enabling you to execute multiple SAS sessions asynchronously. When a remote SAS session kicks off asynchronously, a portion of your SAS program is sent to the server session for execution and control is immediately returned to the client session. The client session can continue with its own processing or spawn one or more additional asynchronous remote server sessions.
Sometimes, what comes across as new is just well forgotten old. They used to be Central Processing Units (CPU), but now they are called just processors. Nowadays, practically every single computer is a “multi-machine” (or to be precise “multi-processor”) device. Even your laptop. Just open Task Manager (Ctrl-Alt-Delete), click on the Performance tab and you will see how many physical processors (or cores) and logical processors your laptop has:
That means that this laptop can run eight independent SAS processes (sessions) at the same time. All you need to do is to say nicely “Dear Mr. & Mrs. SAS/CONNECT, my SAS program consists of several independent pieces. Would you please run each piece in its own SAS session, and run them all at the same time?” And believe me, SAS/CONNECT does not care how many logical processors you have, whether your logical processors are far away from each other “remote machines” or they are situated in a single laptop or even in a single chip.
Here is how you communicate your request to SAS/CONNECT in SAS language.
Suppose you have SAS code that consists of several pieces – DATA or PROC steps that are independent of each other, i.e. they do not need to run in a specific sequence. For example, each of the two pieces generates its own data set.
Then we can create these two data sets in two separate “remote” SAS sessions that run in parallel. Here is how you do this. (For illustration purposes, I just create two dummy data sets.)
options sascmd="sas";

/* Current datetime */
%let _start_dt = %sysfunc(datetime());

/* Process 1 */
signon task1;
rsubmit task1 wait=no;

   libname SASDL 'C:\temp';

   data SASDL.DATA_A (keep=str);
      length str $1000;
      do i=1 to 1150000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'A');
         end;
         output;
      end;
   run;

endrsubmit;

/* Process 2 */
signon task2;
rsubmit task2 wait=no;

   libname SASDL 'C:\temp';

   data SASDL.DATA_B (keep=str);
      length str $1000;
      do i=1 to 750000;
         str = '';
         do j=1 to 1000;
            str = cats(str,'B');
         end;
         output;
      end;
   run;

endrsubmit;

waitfor _all_;
signoff _all_;

/* Print total duration */
data _null_;
   dur = datetime() - &_start_dt;
   put 30*'-' / ' TOTAL DURATION:' dur time13.2 / 30*'-';
run;
In this code, the key elements are:
SASCMD= System Option – specifies the command that starts a server session on a multiprocessor computer.
SIGNON Statement – initiates a connection between a client session and a server session.
RSUBMIT Statement – marks the beginning of a block of statements that a client session submits to a server session for execution.
ENDRSUBMIT statement – marks the end of a block of statements that a client session submits to a server session for execution.
WAITFOR Statement – causes the client session to wait for the completion of one or more tasks (asynchronous RSUBMIT statements) that are in progress.
SIGNOFF Statement – ends the connection between a client session and a server session.
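As a hedged sketch of how the statements in this list fit together with some task monitoring, the following code adds three SAS/CONNECT features that the example above does not use: the CMACVAR= option of RSUBMIT, the LISTTASK statement, and the RGET statement. The task name task1, the macro variable name task1_status, and the dummy DATA step are assumptions made for illustration only.

```sas
options sascmd="sas";

signon task1;

/* CMACVAR= names a client macro variable that tracks the state of this RSUBMIT */
rsubmit task1 wait=no cmacvar=task1_status;
   data work.demo;                 /* dummy work, for illustration only */
      do i = 1 to 1000000;
         x = sqrt(i);
         output;
      end;
   run;
endrsubmit;

listtask _all_;                    /* write the status of all tasks to the client log */

waitfor task1;                     /* block until the asynchronous task finishes */

/* 0 = complete, 1 = failed to execute, 2 = still in progress */
%put NOTE: task1_status=&task1_status;

rget task1;                        /* retrieve the spooled server log and output */
signoff task1;
```

Checking the status variable after WAITFOR is a simple way to confirm that a task actually ran before the client continues with steps that depend on its results.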
There is a distinction between parallel processing described above and threaded processing (aka multithreading). Parallel processing is achieved by running several independent SAS sessions, each processing its own unit of SAS code.
Threaded processing, on the other hand, is achieved by developing special algorithms and implementing executable code that runs on multiple processors (threads) within the same SAS session. Many SAS procedures are multithreaded by design (e.g. SORT, SQL, MEANS/SUMMARY, TABULATE, REG, GLM, and others), so each of them can use several threads within a single session.
Simplistically, total duration of several independent processes running in parallel is equal to the duration of the longest of these processes.
In the code example above, we have two single-threaded SAS DATA steps and we can take full advantage of the SAS MP CONNECT. This code spawns off two “remote” SAS sessions, each running its own DATA step. On my PC, SAS log showed that DATA_A step took 3 minutes to complete, while DATA_B step took 2 minutes to complete. However, total duration of these two tasks was 3 minutes, which is equal to the duration of the longest of the two processes. That is how we get 3 + 2 = 3.
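The arithmetic behind "3 + 2 = 3" can be written down explicitly. This is a simplified sketch that assumes m fully independent tasks with durations t_i (a notation introduced here, not in the original post) and ignores start-up overhead and competition for shared resources:

```latex
T_{\text{sequential}} = \sum_{i=1}^{m} t_i = 3 + 2 = 5,
\qquad
T_{\text{parallel}} \approx \max_{1 \le i \le m} t_i = \max(3,\, 2) = 3
```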
It might not look too remarkable when we cut run time from 5 minutes to 3 minutes, but it becomes more significant for longer processes. For example, cutting run time from 5 hours to 3 hours saves 2 whole hours. That time saving can be made even more impressive if we can split our SAS code into more than two parallel processes.
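To illustrate splitting work into more than two parallel processes, here is a minimal sketch that starts a configurable number of sessions from a macro loop. The macro name run_parallel, its parameter n, and the SASDL.PART_ data sets are hypothetical. The sketch relies on the fact that macro variable references inside a client macro (here &i) are resolved by the client before the code is sent to the server session.

```sas
options sascmd="sas";

%macro run_parallel(n=3);
   %local i;

   %do i = 1 %to &n;
      signon task&i;
      rsubmit task&i wait=no;
         libname SASDL 'C:\temp';             /* same folder as in the example above */
         data SASDL.PART_&i (keep=id val);    /* &i is resolved on the client side */
            do id = 1 to 1000000;
               val = id * &i;
               output;
            end;
         run;
      endrsubmit;
   %end;

   waitfor _all_;      /* wait until every task has finished */
   signoff _all_;
%mend run_parallel;

%run_parallel(n=3)
```

A reasonable upper bound for n is the number of logical processors on the machine, although competition for memory and disk may make a smaller number more effective.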
Interestingly, when running in parallel, each step DATA_A and DATA_B takes slightly longer than when they run in a single session. If we run these two data steps in a single session sequentially, DATA_A step takes 2:45 minutes, and DATA_B step takes 1:45 minutes. That is because even though parallel SAS processes run on separate processors, they still share (and compete for) some other common computer resources such as RAM and hard drive.
If each of our parallel SAS processes runs a multithreaded PROC, we may not gain meaningful time savings, because each such PROC will already employ multiple processors at the same time.
On the other hand, you can still accelerate your program by running it in parallel even on a single processor. That is because your spawned “remote” sessions might require different resources at different times: while one session is using the processor, the other one might be doing input/output (I/O) operations, thus eliminating the processor idle time.
For deeper discussion and understanding, you may consider delving into Amdahl’s law, which provides theoretical background and estimation of potential time saving achievable by parallel computing on multiple processors.
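For reference, Amdahl's law in its usual form is shown below. Here p denotes the fraction of the total run time that can be parallelized and N the number of processors; both symbols are introduced for this sketch and do not appear in the text above:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, if 80% of a program can run in parallel (p = 0.8) on N = 2 processors, the estimated speedup is 1 / (0.2 + 0.4) ≈ 1.67, and no number of processors can push it beyond 1 / 0.2 = 5.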
Besides passing pieces of SAS code from client sessions to server sessions, MP CONNECT allows you to pass some other SAS objects.
For example, if you have a data library defined in your client session, you may pass that library definition on to multiple server sessions without re-defining them in each server session.
Let’s say you have two data libraries defined in your client session:
libname SRCLIB oracle user=myusr1 password=mypwd1 path=mysrv1;
libname TGTLIB '/sas/data/datastore1';
In order to make these data libraries available in the remote session all you need is to add inheritlib= option to the rsubmit statement:
rsubmit task1 wait=no inheritlib=(SRCLIB TGTLIB);
This will allow libraries that are defined in the client session to be inherited by and available in the server session. As an option, each client libref can be associated with a libref that is named differently in the server session:
rsubmit task1 wait=no inheritlib=(SRCLIB=NEWSRC TGTLIB=NEWTGT);
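Put together, a complete round trip with an inherited library might look like the following sketch. The data set name squares and its contents are made up for illustration, and the path is the one used for TGTLIB above; adjust it to a folder that exists on your system.

```sas
options sascmd="sas";

/* Library defined once, in the client session */
libname TGTLIB '/sas/data/datastore1';

signon task1;

/* The server session sees the client's TGTLIB under the name NEWTGT */
rsubmit task1 wait=no inheritlib=(TGTLIB=NEWTGT);
   data NEWTGT.squares;
      do i = 1 to 10;
         sq = i * i;
         output;
      end;
   run;
endrsubmit;

waitfor task1;
signoff task1;

/* Back in the client session, the data set is available through TGTLIB */
proc print data=TGTLIB.squares;
run;
```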
The %SYSLPUT statement allows a client session to create a single macro variable in the server session or to copy a specified group of macro variables to the server session. Here is the general syntax of the form that copies a group of macro variables:
%SYSLPUT _ALL_ | _AUTOMATIC_ | _GLOBAL_ | _LOCAL_ | _USER_ </LIKE='character-string'><REMOTE=server-ID>;
And here is an example of how to pass the value of a client-session-defined macro variable run_dt to a remote session as macro variable rem_run_dt:
options sascmd="sas";
%let run_dt = %sysfunc(datetime());
signon task1;
%syslput rem_run_dt=&run_dt / remote=task1;
rsubmit task1 wait=no;
   %put &=rem_run_dt;
endrsubmit;
waitfor task1;
signoff task1;
Similarly, the %SYSRPUT statement assigns a value from the server session to a macro variable in the client session. The general syntax of the %sysrput statement is one of the following:
%SYSRPUT macro-variable=value;
(macro-variable specifies the name of a macro variable in the client session.)
%SYSRPUT _ALL_ | _AUTOMATIC_ | _GLOBAL_ | _LOCAL_ | _USER_ </LIKE='character-string'>;
(/LIKE='character-string' specifies a subset of macro variables whose names match a user-specified character sequence, or pattern.)
Here is a code example that passes two macro variables, rem_start and rem_year from the remote session and outputs them to the SAS log in the client session:
options sascmd="sas";
signon task1;
rsubmit task1 wait=no;
   %let start_dt = %sysfunc(datetime());
   %sysrput rem_start=&start_dt;
   %sysrput rem_year=2021;
endrsubmit;
waitfor task1;
signoff task1;
%put &=rem_start &=rem_year;
SAS’ Multi-Process Connect is a simple and efficient tool enabling parallel execution of independent programming units. Compared to sequential processing of time-intensive programs, it can substantially reduce the overall duration of your program execution.
The post The moving block bootstrap for time series appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
As I discussed in a previous article, the simple block bootstrap is a way to perform a bootstrap analysis on a time series. The first step is to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L) that divides the total length of the series (n). Each bootstrap resample is generated by randomly choosing from among the non-overlapping n/L blocks of residuals, which are added to the predicted model.
The simple block bootstrap is not often used in practice. One reason is that the total number of blocks (k=n/L) is often small. If so, the bootstrap resamples do not capture enough variation for the bootstrap method to make correct inferences. This article describes a better alternative: the moving block bootstrap. In the moving block bootstrap, every block has the same block length but the blocks overlap. The following figure illustrates the overlapping blocks when L=3. The indices 1:L define the first block of residuals, the indices 2:L+1 define the second block, and so forth until the last block, which contains the residuals n-L+1:n.
To form a bootstrap resample, you randomly choose k=n/L blocks (with replacement) and concatenate them. You then add these residuals to the predicted values to create a “new” time series. Repeat the process many times and you have constructed a batch of bootstrap resamples. The process of forming one bootstrap sample is illustrated in the following figure. In the figure, the time series has been reshaped into a k x L matrix, where each row is a block.
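The overlapping blocks and the concatenation step can be seen on a toy example. The following SAS/IML sketch is an illustration only: it uses a made-up series of n=6 "residuals" whose values equal their indices and a block length of L=3, and the names Blocks, nBlocks, and resample are hypothetical.

```sas
proc iml;
call randseed(1);

resid = 1:6;                  /* toy "residuals"; values equal their indices */
L = 3;                        /* block length */
n = ncol(resid);              /* length of the series */
k = n / L;                    /* number of blocks per resample (here 2) */
nBlocks = n - L + 1;          /* number of overlapping blocks (here 4) */

Blocks = j(nBlocks, L, .);
do i = 1 to nBlocks;
   Blocks[i,] = resid[ , i:i+L-1];    /* rows: 1 2 3 / 2 3 4 / 3 4 5 / 4 5 6 */
end;

idx = sample(1:nBlocks, k);           /* choose k block numbers with replacement */
resample = rowvec(Blocks[idx,]);      /* concatenate the chosen blocks (length n) */

print Blocks, idx, resample;
quit;
```

Because the toy values equal their indices, the printed resample shows directly which time points each chosen block came from.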
To demonstrate the moving block bootstrap in SAS, let’s use the same data that I analyzed in the previous article about the simple block bootstrap. The previous article extracted 132 observations from the Sashelp.Air data set and used PROC AUTOREG to form an additive model Predicted + Residuals. The OutReg data set contains three variables of interest: Time, Pred, and Resid.
As before, I will choose the block size to be L=12. The following SAS/IML program reads the data and defines a matrix (R) such that the i_th row contains the residuals with indices i:i+L-1.
In total, the matrix R has n-L+1 rows.
/* MOVING BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);

use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;

/* Restriction for Simple Block Bootstrap:
   The length of the series (n) must be divisible by the number of blocks (k)
   so that all blocks have the same length (L) */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of random blocks to use */
if k ^= int(k) then
   ABORT "The series length is not divisible by the block length";

/* Trick: Reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);   /* there are k rows for Pred */
J = n - L + 1;           /* total number of overlapping blocks to choose from */
R = j(J, L, .);          /* there are n-L+1 blocks of residuals */
Resid = rowvec(Resid);   /* make Resid into row vector so we don't need to transpose each row */
do i = 1 to J;
   R[i,] = Resid[ , i:i+L-1];   /* fill each row with a block of residuals */
end;
With this setup, the formation of bootstrap resamples is almost identical to the program in the previous article. The only difference is that the matrix R for the moving block bootstrap has more rows. Nevertheless, each resample is formed by randomly choosing k rows from R and adding them to a block of predicted values. The following statements generate B=1000 bootstrap resamples, which are written to a SAS data set (BootOut). The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.
/* The moving block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};   /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;
   idx = sample(1:J, k);    /* sample of size k from the set 1:J */
   YBoot = P + R[idx,];
   append;
end;
close BootOut;
QUIT;
The BootOut data set contains B=1000 bootstrap samples.
The rest of the bootstrap analysis is exactly the same as in the previous article.
This article shows how to perform a moving block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which must divide the total length of the series (n), and form the n-L+1 overlapping blocks of residuals. Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the moving block bootstrap in SAS.
The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations.
In a previous article, I presented some of my most popular blog posts from 2020.
The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.
However, among last year’s 100+ articles are many that discuss advanced topics.
Did you make a New Year’s resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.
I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:
SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:
Did I omit one of your favorite blog posts from The DO Loop in 2020?
If so, leave a comment and tell me what topic you found interesting or useful.
And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.
How to schedule and manage your SAS hot fixes was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
This is the last of three posts on our hot-fix process, aimed at helping you better manage your SAS®9 environment through tips and best practices. The first two installments are linked here:
Having a good understanding of the hot-fix process can help you keep your SAS environment running smoothly. This last installment aims to help you get on a schedule with your hot-fix installations and provides an example spreadsheet (available for download on GitHub) to manage hot fixes.
As an administrator, sometimes applying outstanding hot fixes can be a daunting task. However, the longer you wait, the worse your situation becomes—with a potentially unstable environment and a growing backlog of hot fixes to apply. With a little careful planning, the task can become routine and everyone involved will be much happier. The next sections outline a strategy for getting on a quarterly schedule.
The first step of getting on a quarterly schedule to apply hot fixes is to run the SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool. For information about running this tool and analyzing the report it generates, see the first two installments in this blog series, The SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool and Understanding the SAS Hot Fix Analysis, Download and Deployment Tool Report.
Once you review the SASHFADD report, you will have a better understanding of what resources will be needed to apply the outstanding hot fixes. You also need to decide which philosophy of installing hot fixes you want to follow. For more information, see Which hot fixes should I apply?
The second step is to coordinate the process with your IT department. Before you take the system offline to apply hot fixes, IT typically wants to do the following:
After the first session of applying your hot-fix backlog, all these tasks can run on a regular (preferably at least quarterly) schedule that won’t require as much analysis time from IT.
Before you implement the plan that you and the IT department devised, you need to communicate with your end users. Let them know ahead of time (maybe by a week) when the outage will occur, what they need to do to prepare for it, and how long it will take. It’s a best practice to perform the update outside of regular business hours.
When you follow a quarterly schedule of applying hot fixes, there are many benefits:
Applying hot fixes can often be a complicated process with multiple steps before and after you install them. So, a key aspect of successfully applying hot fixes is ensuring that you follow all the steps that are included in the SASHFADD report. A great tool for managing this complexity is a spreadsheet!
Download one I created and customize it:
SANDY’S SPREADSHEET | DOWNLOAD IT NOW
This tool allows you to see and then check off (through highlighting, color coding, or notes) each of the steps to get the best results.
Administrators will have different approaches to their spreadsheets. Mine, linked above, is the result of much trial and error. Here are the items that I keep track of in my spreadsheet:
Another benefit of the spreadsheet is that you can group steps together so that you can do them all at once. Here are some examples of when you can group steps to save time:
See the following links for the detailed and thorough documentation:
I hope that this blog series has been helpful to you! Have a terrific day!
READ PART ONE | The SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool
READ PART TWO | Understanding the SAS Hot Fix Analysis, Download and Deployment Tool Report
The post The simple block bootstrap for time series in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
For ordinary least squares (OLS) regression, you can use a basic bootstrap of the residuals (called residual resampling) to perform a bootstrap analysis of the parameter estimates. This is possible because an assumption of OLS regression is that the residuals are independent. Therefore, you can reshuffle the residuals to get each bootstrap sample.
For a time series, the residuals are not independent. Rather, if you fit a model to the data, the residuals at time t+i are often close to the residual at time t for small values of i. This is known as autocorrelation in the error component.
Accordingly, if you want to bootstrap the residuals of a time series, it is not correct to randomly shuffle the residuals, which would destroy the autocorrelation. Instead, you need to randomly choose a block of residuals (for example, at times t, t+1, …, and t+L-1 for a block of length L) and use those blocks of residuals to create bootstrap resamples. You repeatedly choose random blocks until you have enough residuals to create a bootstrap resample.
There are several ways to choose blocks:
There are many ways to fit a model to a time series and to obtain the model residuals. Trovero and Leonard (2018) discuss several modern methods to fit trends, cycles, and seasonality by using SAS 9.4 or SAS Viya. To get the residuals, you will want to fit an additive model. In this article, I will use the Sashelp.Air data and will fit a simple additive model (trend plus noise) by using the AUTOREG procedure in SAS/ETS software.
The Sashelp.Air data set has 144 months of data. The following SAS DATA step drops the first year of data, which leaves 11 years of 12 months. I am doing this because I am going to use blocks of size L=12, and I think the example will be clearer if there are 11 blocks of size 12 (rather than 12 blocks).
data Air;
   set Sashelp.Air;
   if Date >= '01JAN1950'd;   /* exclude first year of data */
   Time = _N_;                /* the observation number */
run;

title "Original Series: Air Travel";
proc sgplot data=Air;
   series x=Time y=Air;
   xaxis grid; yaxis grid;
run;
The graph suggests that the time series has a linear trend. The following call to PROC AUTOREG fits a linear model to the data. The predicted mean and residuals are output to the OutReg data set as the PRED and RESID variables, respectively. The call to PROC SGPLOT overlays a graph of the trend and a graph of the residuals.
/* Similar to Getting Started example in PROC AUTOREG */
proc autoreg data=Air plots=none outest=RegEst;
   AR12: model Air = Time / nlag=12;
   output out=OutReg pm=Pred rm=Resid;   /* mean prediction and residuals */
   ods select FinalModel.ParameterEstimates ARParameterEstimates;
run;

title "Mean Prediction and Residuals from AR Model";
proc sgplot data=OutReg;
   series x=Time y=Pred;
   series x=Time y=Resid;
   refline 0 / axis=y;
   xaxis values=(24 to 144 by 12) grid valueshint;
run;
The parameter estimates are shown for the linear model. On average, airlines carried an additional 2.8 thousand passengers per month during this time period.
The graph shows the decomposition of the series into a linear trend and residuals. I added vertical lines to indicate the blocks of residuals that are used in the next section.
The first block contains the residuals for times 13-24. The second block contains the residuals for times 25-36, and so forth until the 11th block, which contains the residuals for times 133-144.
For the simple block bootstrap, the length of the blocks (L) must evenly divide the length of the series (n), which means that k = n / L is an integer.
Because I dropped the first year of observations from Sashelp.Air, there are n=132 observations. I will choose the block size to be L=12, which means that there are k=11 non-overlapping blocks.
Each bootstrap resample is formed by randomly choosing k blocks (with replacement) and adding those residuals to the predicted values.
Think about putting the n predicted values and residuals into a matrix in row-wise order. The first L observations are in the first row, the next L are in the second row, and so forth. Thus, the matrix has k rows and L columns.
The original series is of the form Predicted + Residuals, where the plus sign represents matrix addition.
For the simple block bootstrap, each bootstrap resample is obtained by resampling the rows of the residual array and adding the rows together to obtain a new series of the form Predicted + (Random Residuals). This process is shown schematically in the following figure.
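In matrix notation, the resampling step can be sketched as follows. The symbols P and R stand for the k x L matrices of predicted values and residuals described above, while the row indices i_1, ..., i_k and the notation Y* are introduced here for illustration:

```latex
Y^{*}_{r,c} = P_{r,c} + R_{i_r,\,c},
\qquad r = 1,\dots,k, \quad c = 1,\dots,L,
\qquad i_1,\dots,i_k \overset{\text{iid}}{\sim} \mathrm{Uniform}\{1,\dots,k\}
```

The predicted values stay in their original order; only the rows of the residual matrix are drawn at random, with replacement.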
You can use the SAS/IML language to implement the simple block bootstrap. The following call to PROC IML reads in the original predicted and residual values and reshapes the vectors into k x L matrices (P and R, respectively). The SAMPLE function generates a sample (with replacement) of the vector 1:k, which is used to randomly select rows of the R matrix. To make sure that the process is working as expected, you can create one bootstrap resample and graph it. It should resemble the original series:
/* SIMPLE BLOCK BOOTSTRAP */
%let L = 12;
proc iml;
call randseed(12345);

/* the original series is Y = Pred + Resid */
use OutReg;
   read all var {'Time' 'Pred' 'Resid'};
close;

/* For the Simple Block Bootstrap, the length of the series (n)
   must be divisible by the block length (L). */
n = nrow(Pred);          /* length of series */
L = &L;                  /* length of each block */
k = n / L;               /* number of non-overlapping blocks */
if k ^= int(k) then
   ABORT "The series length is not divisible by the block length";

/* Trick: reshape data into k x L matrix. Each row is block of length L */
P = shape(Pred, k, L);
R = shape(Resid, k, L);  /* non-overlapping residuals (also k x L) */

/* Example: Generate one bootstrap resample by randomly selecting from the residual blocks */
idx = sample(1:nrow(R), k);   /* sample (w/ replacement) of size k from the set 1:k */
YBoot = P + R[idx,];

title "One Bootstrap Resample";
title2 "Simple Block Bootstrap";
refs = "refline " + char(do(12,nrow(Pred),12)) + " / axis=x;";
call series(Time, YBoot) other=refs;
The graph shows one bootstrap resample. The residuals from arbitrary blocks are concatenated until there are n residuals. These are added to the predicted value to create a “new” series, which is a bootstrap resample. You can generate a large number of bootstrap resamples and use them to perform inferences for time series statistics.
You can repeat the process in a loop to generate more resamples. The following statements generate B=1000 bootstrap resamples.
These are written to a SAS data set (BootOut). The program uses a technique in which the results of each computation are immediately written to a SAS data set, which is very efficient. The program writes the Time variable, the resampled series (YBoot), and an ID variable that identifies each bootstrap sample.
/* The simple block bootstrap repeats this process B times
   and usually writes the resamples to a SAS data set. */
B = 1000;
J = nrow(R);              /* J=k for non-overlapping blocks, but prepare for moving blocks */
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};   /* open data set outside of loop */
do i = 1 to B;
   SampleId[,] = i;       /* fill array: https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = sample(1:J, k);  /* sample of size k from the set 1:k */
   YBoot = P + R[idx,];
   append;                /* append each bootstrap sample */
end;
close BootOut;
QUIT;
The BootOut data set contains B=1000 bootstrap samples. You can efficiently analyze the samples by using a BY statement. For example, suppose that you want to investigate how the parameter estimates for the trend line vary among the bootstrap samples. You can run PROC AUTOREG on each bootstrap sample by using BY-group processing. Be sure to suppress ODS output during the BY-group analysis, and write the desired statistics to an output data set (BootEst), as follows:
/* Analyze the bootstrap samples by using a BY statement. See
   https://blogs.sas.com/content/iml/2012/07/18/simulation-in-sas-the-slow-way-or-the-by-way.html */
proc autoreg data=BootOut plots=none outest=BootEst noprint;
   by SampleID;
   AR12: model YBoot = Time / nlag=12;
run;

/* OPTIONAL: Use PROC MEANS or PROC UNIVARIATE to estimate standard errors and CIs */
proc means data=BootEst mean stddev P5 P95;
   var Intercept Time _A:;
run;

title "Distribution of Parameter Estimates";
proc sgplot data=BootEst;
   scatter x=Intercept y=Time;
   xaxis grid; yaxis grid;
   refline 77.5402 / axis=x;
   refline 2.7956 / axis=y;
run;
The scatter plot shows the bootstrap distribution of the parameter estimates of the linear trend. The reference lines indicate the parameter estimates for the original data. You can use the bootstrap distribution for inferential statistics such as estimation of standard errors, confidence intervals, the covariance of estimates, and more.
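As one possible way to turn the bootstrap distribution into numbers, the following sketch computes a bootstrap standard error and a 95% percentile interval for the slope of the trend line. It assumes the BootEst data set created above; the output data set name BootCI and the variable names BootMean, BootSE, and the Time_P prefix are hypothetical choices.

```sas
/* Bootstrap standard error and 95% percentile interval for the Time coefficient */
proc univariate data=BootEst noprint;
   var Time;
   output out=BootCI mean=BootMean std=BootSE
          pctlpts=2.5 97.5 pctlpre=Time_P;   /* creates Time_P2_5 and Time_P97_5 */
run;

proc print data=BootCI noobs;
run;
```

The percentile interval is only one of several bootstrap confidence intervals; it is used here because it needs nothing beyond the bootstrap estimates themselves.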
You can perform a similar bootstrap analysis for any other statistic that is generated by any time series analysis. The important thing is that the block bootstrap is performed on some sort of residual or “noise” component, so be sure to remove the trend, seasonality, cycles, and so forth and then bootstrap the remainder.
This article shows how to perform a simple block bootstrap on a time series in SAS. First, you need to decompose the series into additive components: Y = Predicted + Residuals. You then choose a block length (L), which (for the simple block bootstrap) must divide the total length of the series (n). Each bootstrap resample is generated by randomly choosing blocks of residuals and adding them to the predicted model. This article uses the SAS/IML language to perform the simple block bootstrap in SAS.
In practice, the simple block bootstrap is rarely used. However, it illustrates the basic ideas for bootstrapping a time series, and it provides a foundation for more sophisticated bootstrap methods.
The post Top posts from The DO Loop in 2020 appeared first on The DO Loop.
Last year, I wrote more than 100 posts for The DO Loop blog. In previous years, the most popular articles were about SAS programming tips, statistical analysis, and data visualization. But not in 2020.
In 2020, when the world was ravaged by the coronavirus pandemic, the most-read articles were related to analyzing and visualizing the tragic loss and suffering of the pandemic.
Here are some of the most popular articles from 2020 in several categories.
Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization:
Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don’t miss a single article!
The post Create a response variable that has a specified R-square value appeared first on The DO Loop.
When you perform a linear regression, you can examine the R-square value, which is a goodness-of-fit statistic that indicates how well the response variable can be represented as a linear combination of the explanatory variables. But did you know that you can also go the other direction? Given a set of explanatory variables and an R-square statistic, you can create a response variable, Y, such that a linear regression of Y on the explanatory variables produces exactly that R-square value.
In a previous article, I showed how to compute a vector that has a specified correlation with another vector.
You can generalize that situation to obtain a vector that has a specified relationship with a linear subspace that is spanned by multiple vectors.
Recall that the correlation is related to the angle between two vectors by the formula cos(θ) = ρ, where θ is the angle between the vectors and ρ is the correlation coefficient.
Therefore, correlation and “angle between” measure similar quantities.
It makes sense to define the angle between a vector and a linear subspace as the smallest angle the vector makes with any vector in the subspace. Equivalently, it is the angle between the vector and its (orthogonal) projection onto the subspace.
This is shown graphically in the following figure. The vector z is not in the span of the explanatory variables. The vector w is the projection of z onto the linear subspace. As explained in the previous article, you can find a vector y such that the angle between y and w is θ, where cos(θ) = ρ. Equivalently, the correlation between y and w is ρ.
There is a connection between this geometry and the geometry of least-squares regression. In least-squares regression, the predicted response is the projection of an observed response vector onto the span of the explanatory variables. Consequently, the previous article shows how to simulate an “observed” response vector that has a specified correlation with the predicted response.
For simple linear regression (one explanatory variable), textbooks often point out that the R-square statistic is the square of the correlation between the independent variable, X, and the response variable, Y.
So, the previous article enables you to create a response variable that has a specified R-square value with one explanatory variable.
The generalization to multivariate linear regression is that the R-square statistic is the square of the correlation between the predicted response and the observed response.
Therefore, you can use the technique in this article to create a response variable that has a specified R-square value in a linear regression model.
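The fact that R-square equals the squared correlation between the observed and predicted responses follows from the orthogonality of the least-squares residuals. The derivation below is a sketch that assumes the model contains an intercept; the symbols \(\hat{\mathbf{y}}\) (predicted response), \(\mathbf{e}\) (residuals), and the subscript c (centered vector) are notation introduced here, not taken from the article.

```latex
\begin{align*}
\mathbf{y}_c &= \hat{\mathbf{y}}_c + \mathbf{e},
  \qquad \mathbf{e} \perp \hat{\mathbf{y}}_c
  && \text{(residuals are orthogonal to the fit and to the 1 vector)} \\
\mathbf{y}_c^{\top} \hat{\mathbf{y}}_c &= \lVert \hat{\mathbf{y}}_c \rVert^2 \\
\operatorname{corr}(\mathbf{y}, \hat{\mathbf{y}})^2
  &= \frac{\left( \mathbf{y}_c^{\top} \hat{\mathbf{y}}_c \right)^2}
          {\lVert \mathbf{y}_c \rVert^2 \, \lVert \hat{\mathbf{y}}_c \rVert^2}
   = \frac{\lVert \hat{\mathbf{y}}_c \rVert^2}{\lVert \mathbf{y}_c \rVert^2}
   = \frac{SS_{\text{model}}}{SS_{\text{total}}} = R^2
\end{align*}
```

The first line uses the intercept: because the residuals sum to zero, the observed and predicted responses have the same mean, so centering both leaves the residual vector unchanged.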
To be explicit, suppose you are given explanatory variables X_{1}, X_{2}, …, X_{k}, and a correlation coefficient, ρ. The following steps generate a response variable, Y, such that the R-square statistic for the regression of Y onto the explanatory variables is ρ^{2}:
1. Start with an arbitrary guess, z.
2. Use least-squares regression to project z onto the span of the explanatory variables (including the intercept). Call the projection w.
3. Use the technique from the previous article, with the same guess, to find a vector Y such that corr(Y, w) = ρ.
The following program shows how to carry out this algorithm in the SAS/IML language:
proc iml;
/* Define or load the modules from https://blogs.sas.com/content/iml/2020/12/17/generate-correlated-vector.html */
load module=_all_;

/* read some data X1, X2, ... into columns of a matrix, X */
use sashelp.class;
   read all var {"Height" "Weight" "Age"} into X;   /* read data into (X1,X2,X3) */
close;

/* Least-squares fit = Project Y onto span(1,X1,X2,...,Xk) */
start OLSPred(y, _x);
   X = j(nrow(_x), 1, 1) || _x;
   b = solve(X`*X, X`*y);
   yhat = X*b;
   return yhat;
finish;

/* specify the desired correlation between Y and \hat{Y}. Equiv: R-square = rho^2 */
rho = 0.543;

call randseed(123);
guess = randfun(nrow(X), "Normal");   /* 1. make random guess */
w = OLSPred(guess, X);                /* 2. w is in Span(1,X1,X2,...) */
Y = CorrVec1(w, rho, guess);          /* 3. Find Y such that corr(Y,w) = rho */
/* optional: you can scale Y any way you want ... */

/* in regression, R-square is squared correlation between Y and YHat */
corr = corr(Y||w)[2];
R2 = rho**2;
PRINT rho corr R2;
The program uses a random guess to generate a vector Y such that the correlation between Y and the least-squares prediction for Y is exactly 0.543. In other words, if you run a regression model where Y is the response and (X1, X2, X3) are the explanatory variables, the R-square statistic for the model will be ρ^{2} = 0.2948.
Let’s write the Y variable to a SAS data set and run PROC REG to verify this fact:
/* Write to a data set, then call PROC REG */
Z = Y || X;
create SimCorr from Z[c={Y X1 X2 X3}];
append from Z;
close;
QUIT;

proc reg data=SimCorr plots=none;
   model Y = X1 X2 X3;
   ods select FitStatistics ParameterEstimates;
quit;
The “FitStatistics” table that is created by using PROC REG verifies that the R-square statistic is 0.2948, which is the square of the ρ value that was specified in the SAS/IML program. The ParameterEstimates table from PROC REG shows the vector in the subspace that has correlation ρ with Y. It is -1.26382 + 0.04910*X1 - 0.00197*X2 - 0.12016*X3.
Many textbooks point out that the R-square statistic in multivariable regression has a geometric interpretation: It is the squared correlation between the response vector and the projection of that vector onto the linear subspace of the explanatory variables (which is the predicted response vector).
You can use the program in this article to solve the inverse problem: Given a set of explanatory variables and correlation, you can find a response variable for which the R-square statistic is exactly the squared correlation.
You can download the SAS program that computes the results in this article.
The post Find a vector that has a specified correlation with another vector appeared first on The DO Loop.
Do you know that you can create a vector that has a specific correlation with another vector? That is, given a vector, x, and a correlation coefficient, ρ, you can find a vector, y, such that corr(x, y) = ρ. The vectors x and y can have an arbitrary number of elements, n > 2. One application of this technique is to create a scatter plot that shows correlated data for any correlation in the interval (-1, 1). For example, you can create a scatter plot with n points for which the correlation is exactly a specified value, as shown at the end of this article.
The algorithm combines a mixture of statistics and basic linear algebra. The following facts are useful:
Given a centered vector, u, there are infinitely many vectors that have correlation ρ with u. Geometrically, you can choose any vector on a positive cone in the same direction as u, where the cone has angle θ and cos(θ)=ρ. This is shown graphically in the figure below. The plane marked \(\mathbf{u}^{\perp}\) is the orthogonal complement to the vector u. If you extend the cone through the plane, you obtain the cone of vectors that are negatively correlated with x.
One way to obtain a correlated vector is to start with a guess, z. The vector z can be uniquely represented as the sum \(\mathbf{z} = \mathbf{w} + \mathbf{w}^{\perp}\), where w is the projection of z onto the span of u, and \(\mathbf{w}^{\perp}\) is the projection of z onto the orthogonal complement.
The following figure shows the geometry of the right triangle with angle θ such that cos(θ) = ρ.
If you want the vector y to be unit length, you can read off the formula for y from the figure. The formula is
\(\mathbf{y} = \rho \mathbf{w} / \lVert\mathbf{w}\rVert + \sqrt{1 - \rho^2} \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert \)
In the figure, \(\mathbf{v}_1 = \mathbf{w} / \lVert\mathbf{w}\rVert\) and \(\mathbf{v}_2 = \mathbf{w}^\perp / \lVert\mathbf{w}^\perp\rVert\).
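It may help to check the formula algebraically. The short verification below uses only the definitions above, together with the fact that w and \(\mathbf{w}^{\perp}\) are centered whenever u and z are centered, so the correlation is the cosine of the angle between the vectors.

```latex
\begin{align*}
\mathbf{v}_1^{\top}\mathbf{v}_2 &= 0, \qquad
  \lVert\mathbf{v}_1\rVert = \lVert\mathbf{v}_2\rVert = 1 \\
\lVert\mathbf{y}\rVert^2 &= \rho^2 \lVert\mathbf{v}_1\rVert^2
  + (1-\rho^2)\lVert\mathbf{v}_2\rVert^2 = 1 \\
\mathbf{y}^{\top}\mathbf{v}_1 &= \rho \\
\operatorname{corr}(\mathbf{y}, \mathbf{v}_1)
  &= \frac{\mathbf{y}^{\top}\mathbf{v}_1}{\lVert\mathbf{y}\rVert\,\lVert\mathbf{v}_1\rVert} = \rho
\end{align*}
```

Because \(\mathbf{v}_1\) is either u or -u, the correlation between y and u is ρ when the guess points in the same direction as u and -ρ when it points in the opposite direction.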
It is straightforward to implement this projection in a matrix-vector language such as SAS/IML. The following program defines two helper functions (Center and UnitVec) and uses them to implement the projection algorithm. The function CorrVec1 takes three arguments: the vector x, a correlation coefficient ρ, and an initial guess. The function centers and scales the vectors into the vectors u and z. The vector z is projected onto the span of u. Finally, the function uses trigonometry and the fact that cos(θ) = ρ to return a unit vector that has the required correlation with x.
/* Given a vector, x, and a correlation, rho, find y such that corr(x,y) = rho */
proc iml;
/* center a column vector by subtracting its mean */
start Center(v);
   return ( v - mean(v) );
finish;

/* create a unit vector in the direction of a column vector */
start UnitVec(v);
   return ( v / norm(v) );
finish;

/* Find a vector, y, such that corr(x,y) = rho. The initial guess can be almost
   any vector that is not in span(x), not orthog to span(x), and not in span(1) */
start CorrVec1(x, rho, guess);
   /* 1. Center the x and z vectors. Scale them to unit length. */
   u = UnitVec( Center(x) );
   z = UnitVec( Center(guess) );

   /* 2. Project z onto the span(u) and the orthog complement of span(u) */
   w = (z`*u) * u;
   wPerp = z - w;

   /* 3. The requirement that cos(theta)=rho results in a right triangle where
         y (the hypotenuse) has unit length and the legs have lengths rho and
         sqrt(1-rho^2), respectively */
   v1 = rho * UnitVec(w);
   v2 = sqrt(1 - rho**2) * UnitVec(wPerp);
   y = v1 + v2;

   /* 4. Check the sign of y`*u. Flip the sign of y, if necessary */
   if sign(y`*u) ^= sign(rho) then
      y = -y;
   return ( y );
finish;
The purpose of the function is to project the guess onto the green cone in the figure. However, if the guess is in the opposite direction from x, the algorithm will compute a vector, y, that has the opposite correlation.
The function detects this case and flips y, if necessary.
The following statements call the function for a vector, x, and request a unit vector that has correlation ρ = 0.543 with x:
/* Example: Call the CorrVec1 function */
x = {1,2,3};
rho = 0.543;
guess = {0, 1, -1};
y = CorrVec1(x, rho, guess);
corr = corr(x||y);
print x y, corr;
As requested, the correlation coefficient between x and y is 0.543. This process will work provided that the guess satisfies a few mild assumptions. Specifically, the guess cannot be in the span of x or in the orthogonal complement of x. The guess also cannot be a multiple of the 1 vector. Otherwise, the process will work for positive and negative correlations.
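The assumptions on the guess can be checked before the projection is attempted. The following sketch is not part of the article's program: IsValidGuess is a hypothetical helper, and the tolerance 1e-12 is an arbitrary choice. It reports whether the centered guess is nonzero and has a nonzero component both along the centered x and orthogonal to it.

```sas
proc iml;
start Center(v);
   return ( v - mean(v) );
finish;

/* HYPOTHETICAL helper: returns 1 if the guess satisfies the assumptions, else 0 */
start IsValidGuess(x, guess);
   tol = 1e-12;                      /* ASSUMPTION: arbitrary tolerance */
   zc = Center(guess);
   if norm(zc) < tol then
      return ( 0 );                  /* guess is a multiple of the 1 vector */
   u = Center(x) / norm(Center(x));
   z = zc / norm(zc);
   w = (z`*u) * u;                   /* component of z in span(u) */
   wPerp = z - w;                    /* component of z orthogonal to u */
   return ( norm(w) > tol & norm(wPerp) > tol );
finish;

x  = {1, 2, 3};
g1 = {0, 1, -1};     /* the guess used in the example */
g2 = {2, 4, 6};      /* a multiple of x, so it is in span(x) */
g3 = {5, 5, 5};      /* a multiple of the 1 vector */
g4 = {1, -2, 1};     /* orthogonal to the centered x */
valid = IsValidGuess(x, g1) || IsValidGuess(x, g2) ||
        IsValidGuess(x, g3) || IsValidGuess(x, g4);
print valid[colname={"g1" "g2" "g3" "g4"}];
quit;
```

A random guess from a continuous distribution violates these conditions with probability zero, which is presumably why the assumptions are described as mild.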
The function returns a vector that has unit length and 0 mean. However, you can translate the vector and scale it by any positive quantity without changing its correlation with x, as shown by the following example:
/* because correlation is a relationship between standardized vectors,
   you can translate and scale Y any way you want */
y2 = 100 + 23*y;          /* rescale and translate */
corr = corr(x||y2);       /* the correlation will not change */
print corr;
When y is a centered unit vector, the vector β*y has L_{2} norm β.
If you want to create a vector whose standard deviation is β, use β*sqrt(n-1)*y, where n is the number of elements in y.
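The factor sqrt(n-1) comes directly from the definition of the sample standard deviation, written here as s (notation introduced for this sketch). For a centered unit vector y:

```latex
\begin{align*}
s(\mathbf{y}) &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}
   = \frac{\lVert \mathbf{y} \rVert}{\sqrt{n-1}} = \frac{1}{\sqrt{n-1}}
   && (\bar{y} = 0,\ \lVert\mathbf{y}\rVert = 1) \\
s\!\left(\beta \sqrt{n-1}\, \mathbf{y}\right)
  &= \beta \sqrt{n-1} \cdot \frac{1}{\sqrt{n-1}} = \beta
\end{align*}
```

For example, with n = 19 observations and a target standard deviation of 23, the multiplier is 23*sqrt(18), which is approximately 97.58.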
One application of this technique is to create a random vector that has a specified correlation with a given vector, x. For example, in the following program, the x vector contains the heights of 19 students in the Sashelp.Class data set. The program generates a random guess from the standard normal distribution and passes that guess to the CorrVec1 function and requests a vector that has the correlation 0.678 with x. The result is a centered unit vector.
use sashelp.class;
   read all var {"Height"} into X;
close;

rho = 0.678;
call randseed(123);
guess = randfun(nrow(x), "Normal");
y = CorrVec1(x, rho, guess);

mean = 100;
std = 23*sqrt(nrow(x)-1);
v = mean + std*y;
title "Correlation = 0.678";
title2 "Random Normal Vector";
call scatter(X, v) grid={x y};
The graph shows a scatter plot between x and the random vector, v. The correlation in the scatter plot is 0.678. The sample mean of the vector v is 100. The sample standard deviation is 23.
If you make a second call to the RANDFUN function, you can get another random vector that has the same properties. Or you can repeat the process for a range of ρ values to visualize data that have a range of correlations. For example, the following graph shows a panel of scatter plots for ρ = -0.75, -0.25, 0.25, and 0.75. The X variable is the same for each plot. The Y variable is a random vector that was rescaled to have mean 100 and standard deviation 23, as above.
The random guess does not need to be from the normal distribution. You can use any distribution.
This article shows how to create a vector that has a specified correlation with a given vector. That is, given a vector, x, and a correlation coefficient, ρ, find a vector, y, such that corr(x, y) = ρ.
The algorithm in this article produces a centered vector that has unit length. You can multiply the vector by β > 0 to obtain a vector whose norm is β. You can multiply the vector by β*sqrt(n-1) to obtain a vector whose standard deviation is β.
There are infinitely many vectors that have correlation ρ with x. The algorithm uses a guess to produce a particular vector for y. You can use a random guess to obtain a random vector that has a specified correlation with x.
2020 roundup: SAS Users YouTube channel how to tutorials was published on SAS Users.
There’s nothing worse than being in the middle of a task and getting stuck. Being able to find quick tips and tricks to help you solve the task at hand, or simply entertain your curiosity, is key to maintaining your efficiency and building everyday skills. But how do you get quick information that’s ALSO engaging? By adding some personality to traditionally routine tutorials, you can learn and may even have fun at the same time. Cue the SAS Users YouTube channel.
With more than 50 personality-filled videos published to date and more than 10,000 hours watched, there’s no shortage of learning going on. Our team of experts loves to share its knowledge and passion (with personal flavor!) to give you solutions to those everyday tasks.
What better way to round out the year than provide a roundup of our most popular videos from 2020? Check out these crowd favorites:
We’ve got you covered! SAS will continue to publish videos throughout 2021. Subscribe now to the SAS Users YouTube channel, so you can be notified when we’re publishing new videos. Be on the lookout for some of the following topics:
How to create a Napoleon plot with Graph Template Language (GTL) was published on SAS Users.
Do you need to see how long patients have been treated for? Would you like to know if a patient’s dose has changed, or if the patient experienced any dose interruptions? If so, you can use a Napoleon plot, also known as a swimmer plot, in conjunction with your exposure data set to find your answers. We demonstrate how to find the answer in our recent book SAS® Graphics for Clinical Trials by Example.
You may be wondering what a Napoleon plot is. Have you ever heard of the map of Napoleon’s Russian campaign? It was a map that displayed six types of data, such as troop movement, temperature, latitude, and longitude on one graph (Wikipedia). In the clinical setting, we try to mimic this approach by displaying several different types of safety data on one graph: hence, the name “Napoleon plot.” The plot is also known as a swimmer plot because each patient has a row in which their data is displayed, which looks like swimming lanes.
Now that you know what a Napoleon plot is, how do you produce it? In essence, you are merely writing GTL code to produce the graph you need. The key GTL statements for generating a Napoleon plot are DISCRETEATTRMAP, HIGHLOWPLOT, SCATTERPLOT, and DISCRETELEGEND. Other plot statements are used, but the statements that were just mentioned are typically used for all Napoleon plots. In our recent book, one of the chapters carefully walks you through each step to show you how to produce the Napoleon plot. Program 1, below, gives a small teaser of some of the code used to produce the Napoleon plot.
Program 1: Code for Napoleon Plot That Highlights Dose Interruptions
discreteattrmap name = "Dose_Group";
   value "54" / fillattrs = (color = orange)
                lineattrs = (color = orange pattern = solid);
   value "81" / fillattrs = (color = red)
                lineattrs = (color = red pattern = solid);
enddiscreteattrmap;

discreteattrvar attrvar = id_dose_group var = exdose attrmap = "Dose_Group";

legenditem type = marker name = "54_marker" /
   markerattrs = (symbol = squarefilled color = orange)
   label = "Xan 54mg";

< Other legenditem statements >

layout overlay / yaxisopts = (type = discrete display = (line label)
                              label = "Patient");

   highlowplot y = number
               high = eval(aendy/30.4375)
               low = eval(astdy/30.4375) /
               group = id_dose_group type = bar
               lineattrs = graphoutlines barwidth = 0.2;

   scatterplot y = number x = eval((max_aendy + 10)/30.4375) /
               markerattrs = (symbol = completed size = 12px);

   discretelegend "54_marker" "81_marker" "completed_marker" /
                  type = marker autoalign = (bottomright) across = 1
                  location = inside title = "Dose";
endlayout;
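Program 1 is only a fragment of a larger template. For readers who want something that runs end to end, the following is a much-reduced sketch of the same idea, not the book's program: the data set exposure, its variables (patient, dose, start_mo, end_mo), the values, and the template name swimmer_sketch are all made up for illustration. It draws one horizontal bar per exposure interval and colors the bars by dose.

```sas
/* ASSUMPTION: made-up exposure records; variable names are hypothetical */
data exposure;
   input patient $ dose $ start_mo end_mo;
   datalines;
P01 54 0 3
P01 81 4 9
P02 54 0 6
P03 81 0 2
;

proc template;
   define statgraph swimmer_sketch;
      begingraph;
         layout overlay / yaxisopts=(type=discrete label="Patient")
                          xaxisopts=(label="Months on treatment");
            /* one horizontal bar per exposure interval, colored by dose */
            highlowplot y=patient low=start_mo high=end_mo /
                        group=dose type=bar barwidth=0.4 name="dose";
            discretelegend "dose" / title="Dose (mg)";
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=exposure template=swimmer_sketch;
run;
```

In this toy data, patient P01 has a gap between month 3 and month 4, so the space between the two bars in that row plays the role of a dose interruption.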
Without further ado, Output 1 shows you an example of a Napoleon plot. You can see that there are many patients, and so the patient labels have been suppressed. You also see that the patient who has been on the study the longest has a dose delay indicated by the white space between the red and orange bars. While this example illustrates a simple Napoleon plot with only two types, dose exposure and treatment, the book has more complex examples of swimmer plots.
Output 1: Napoleon Plot that Highlights Dose Interruptions