This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
How many planets are there in our solar system? The answer hasn’t always been 9 … er, I mean 8 (sorry Pluto!). The count has changed throughout history as we got a better understanding of astronomy, discovered new planets, and redefined what a ‘planet’ is. Wouldn’t it be helpful to […]
The post How many planets in our solar system? (… tricky question!) appeared first on SAS Learning Post.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
SAS programmers on SAS discussion forums sometimes ask how to run thousands of regressions of the form Y = B0 + B1*X_i, where i=1,2,…. A similar question asks how to solve thousands of regressions of the form Y_i = B0 + B1*X for thousands of response variables. I have previously written about how to solve the first problem by converting the data from wide form to long form and using the BY statement in PROC REG to solve the thousands of regression problems in a single call to the procedure. My recent description of the SWEEP operator (which is implemented in SAS/IML software) provides an alternative technique that enables you to analyze both regression problems when the data are in the wide format.
Why would anyone want to solve thousands of linear regression problems? I can think of a few reasons.
In a previous article, I showed how to simulate data that satisfies a regression model. I created a data set that contains explanatory variables X1-X1000 and a single response variable, Y.
You can download the SAS program that computes the data and performs the regression.
The following SAS/IML program reads the simulated data into a large matrix, M.
For each regression, it forms the three-column matrix A from the intercept column, the k_th explanatory variable, and the variable Y. It then forms the sum of squares and cross products (SSCP) matrix (A`*A) and uses the SWEEP function to solve the least squares regression problem. The two parameter estimates (intercept and slope) for each explanatory variable are saved in a 2 x 1000 array. A few parameter estimates are displayed.
/* program to compute parameter estimates for Y = b0 + b1*X_k, k=1,2,... */
proc iml;
XVarNames = "x1":"x&nCont";          /* names of explanatory variables */
p = ncol(XVarNames);                 /* number of X variables, excluding intercept */
varNames = XVarNames || "Y";         /* names of all data variables */
use Wide;  read all var varNames into M;  close;   /* read data into M */

A = j(nrow(M), 3, 1);                /* columns for intercept, X_k, Y */
A[ ,3] = M[ , ncol(M)];              /* put Y in 3rd col */
ParamEst = j(2, p, .);               /* allocate 2 x p matrix for estimates */
do k = 1 to ncol(XVarNames);
   A[ ,2] = M[ ,k];                  /* X_k in 2nd col */
   S1 = sweep(A`*A, {1 2});          /* sweep in intercept and X_k */
   ParamEst[ ,k] = S1[{1 2}, 3];     /* estimates for k_th model Y = b0 + b1*X_k */
end;
print (ParamEst[,1:6])[r={"Intercept" "Slope"} c=XVarNames];
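If you want to experiment with this technique outside of SAS, the same computation is easy to sketch in Python with NumPy. The sweep function below is my own minimal implementation of the sweep operator (it uses 0-based pivot indices, unlike the 1-based SWEEP function in SAS/IML), and the simulated data are a small stand-in for the X1-X1000 data set:

```python
import numpy as np

def sweep(S, pivots):
    """Minimal sketch of the sweep operator on a symmetric matrix S.
    Uses 0-based pivot indices (SAS/IML's SWEEP function is 1-based)."""
    S = S.astype(float).copy()
    for k in pivots:
        d = S[k, k]
        S[k, :] = S[k, :] / d                    # normalize the pivot row
        for i in range(S.shape[0]):
            if i != k:
                c = S[i, k]
                S[i, :] = S[i, :] - c * S[k, :]  # eliminate column k from other rows
                S[i, k] = -c / d
        S[k, k] = 1.0 / d
    return S

# Small stand-in for the wide data set: n observations, p explanatory variables
rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
Y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

# For each X_k: form A = [intercept, X_k, Y], build the SSCP matrix A'A,
# and sweep the first two rows to solve the least squares problem
ParamEst = np.zeros((2, p))
for k in range(p):
    A = np.column_stack([np.ones(n), X[:, k], Y])
    S1 = sweep(A.T @ A, [0, 1])
    ParamEst[:, k] = S1[[0, 1], 2]   # intercept and slope for Y = b0 + b1*X_k
```

For each k, the column ParamEst[:, k] agrees with the least squares solution for the design matrix [1, X_k], so the swept SSCP matrix really does contain the regression estimates.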
You could perform a similar loop for models that contain multiple variables, such as all two-variable main-effect models of the form Y = b0 + b1*X_k + b2*X_j, where k ≠ j. You can use the ALLCOMB function in SAS/IML to choose the combinations of columns to sweep.
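Here is a Python sketch of that idea, with itertools.combinations standing in for the ALLCOMB function and an ordinary least squares solve standing in for the sweep step; the data and dimensions are made up for illustration:

```python
import numpy as np
from itertools import combinations

# Illustrative loop over all two-variable main-effect models
# Y = b0 + b1*X_k + b2*X_j (k < j). combinations() plays the role of the
# SAS/IML ALLCOMB function; lstsq stands in for building and sweeping the
# SSCP matrix. Data and dimensions are synthetic.
rng = np.random.default_rng(2)
n, p = 40, 4
X = rng.normal(size=(n, p))
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

estimates = {}
for k, j in combinations(range(p), 2):            # all C(p,2) pairs with k < j
    A = np.column_stack([np.ones(n), X[:, k], X[:, j]])
    b, *_ = np.linalg.lstsq(A, Y, rcond=None)     # b = (b0, b1, b2)
    estimates[(k, j)] = b

print(len(estimates))   # prints 6, since C(4,2) = 6 models
```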
You can also use the SWEEP operator to perform the regression of many responses onto a single explanatory variable. This case is easy in PROC REG because the procedure supports multiple response variables. The analysis is also easy if you use the SWEEP function in SAS/IML. It only requires a single function call!
The following DATA step simulates 1,000 response variables according to the model Y_i = b0 + b1*X + ε, where b0 = 39.07, b1 = 0.902, and ε ~ N(0, 5.71) is a normally distributed random error term. These values are the parameter estimates for the regression of SepalLength onto SepalWidth for the virginica species of iris flowers in the Sashelp.Iris data set. For more about simulating regression models, see Chapter 11 of Wicklin (2013).
/* based on Wicklin (2013, p. 203) */
%let NumSim = 1000;                  /* number of Y variables */
data RegSim(drop=RMSE b0 b1 k Species);
set Sashelp.Iris(where=(Species="Virginica")
                 keep=Species SepalWidth rename=(SepalWidth=X));
RMSE = 5.71;  b0 = 39.07;  b1 = 0.902;  /* param estimates for model SepalLength = SepalWidth */
array Y[&NumSim];
call streaminit(321);
do k = 1 to &NumSim;
   Y[k] = b0 + b1*X + rand("Normal", 0, RMSE);  /* simulate responses from model */
end;
output;
run;
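The same simulation can be sketched in Python with NumPy. The X values below are a synthetic stand-in for the SepalWidth measurements (the Sashelp.Iris data set is not available outside SAS), so only the structure of the simulation matches the DATA step:

```python
import numpy as np

# Python sketch of the simulation step: 1,000 responses from the model
# Y_k = b0 + b1*X + N(0, RMSE). The X values are a synthetic stand-in for
# the SepalWidth measurements in the SAS data set Sashelp.Iris.
rng = np.random.default_rng(321)
b0, b1, RMSE = 39.07, 0.902, 5.71
NumSim = 1000

X = rng.uniform(22.0, 38.0, size=50)               # stand-in for SepalWidth (mm)
eps = rng.normal(0.0, RMSE, size=(50, NumSim))     # one error column per response
Y = b0 + b1 * X[:, None] + eps                     # Y[:, k] is the k-th response
```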
After generating the data, you can compute all 1,000 parameter estimates by using a single call to the SWEEP function in SAS/IML. The following program forms a matrix M for which the first column is an intercept term, the second column is the X variable, and the next 1,000 columns are the simulated responses. Sweeping the first two rows of the M`*M matrix computes the 1,000 parameter estimates, which are then graphed in a scatter plot.
/* program to compute parameter estimates for Y_k = b0 + b1*X, k=1,2,... */
proc iml;
YVarNames = "Y1":"Y&numSim";         /* names of response variables */
varNames = "X" || YVarNames;         /* names of all data variables */
use RegSim;  read all var varNames into M;  close;
M = j(nrow(M), 1, 1) || M;           /* add intercept column */

S1 = sweep(M`*M, {1 2});             /* sweep in intercept and X */
ParamEst = S1[{1 2}, 3:ncol(M)];     /* estimates for models Y_k = b0 + b1*X */

title "Parameter Estimates for 1,000 Simulated Responses";
call scatter(ParamEst[1,], ParamEst[2,]) label={"Intercept" "X"}
     other="refline 39.07/axis=x; refline 0.902/axis=y;";
The graph shows the parameter estimates for each of the 1,000 linear regressions. The distribution of the estimates provides more information about the sampling distribution than the estimate of standard error that PROC REG produces when you run a regression model. The graph visually demonstrates the “covariance of the estimates,” which PROC REG estimates if you use the COVB option on the MODEL statement.
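The single-call computation is equally short in Python. Sweeping the first two rows of M`M leaves the block (Z'Z)⁻¹Z'Y in the upper-right corner, where Z = [1 | X] and Y holds the 1,000 response columns, so the sketch below computes that block directly from the normal equations; the data are synthetic stand-ins for the simulated iris responses:

```python
import numpy as np

# Python sketch of the single-call computation: sweeping the first two rows
# of M'M (M = [1 | X | Y1..Yn]) leaves (Z'Z)^{-1} Z'Y in the upper-right
# block, where Z = [1 | X]. We compute that block directly from the normal
# equations. X and Y are synthetic stand-ins for the simulated iris data.
rng = np.random.default_rng(3)
n, NumSim = 50, 1000
X = rng.normal(30.0, 3.0, size=n)                 # stand-in explanatory variable
Y = 39.07 + 0.902 * X[:, None] + rng.normal(0, 5.71, size=(n, NumSim))

Z = np.column_stack([np.ones(n), X])              # intercept column + X
ParamEst = np.linalg.solve(Z.T @ Z, Z.T @ Y)      # 2 x 1000: one (b0, b1) per response
```

Each column of ParamEst matches the estimates you would get by regressing that response on X separately, which is exactly what the scatter plot of intercepts versus slopes displays.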
In summary, you can use the SWEEP operator (implemented in the SWEEP function in SAS/IML) to efficiently compute thousands of regression estimates for wide data. This article demonstrates how to compute models that have many explanatory variables (Y = b0 + b1* X_i) and models that have many response variables (Y_i = b0 + b1 * X).
Are there drawbacks to this approach? Sure. You don’t automatically get the dozens of ancillary regression statistics like R squared, adjusted R squared, p-values, and so forth.
Also, PROC REG automatically handles missing values, whereas in SAS/IML you must first extract the complete cases before you try to form the SSCP matrix. Nevertheless, this computation can be useful for simulation studies in which the data are simulated and analyzed within SAS/IML.
Download the SAS programs for this article.
The post An easier way to run thousands of regressions appeared first on The DO Loop.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
The ODS destination for PowerPoint uses table templates and style templates to display the tables, graphs, and other output produced by SAS procedures. You can customize the look of your presentation in a number of ways, including using custom style templates and images. Here we’ll learn about using background images.
The post Background images and the ODS destination for PowerPoint appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
A SAS practice exam can help you prepare for SAS certification. Practice exams are similar to the actual exam in difficulty, objectives, length, and design. While there’s no guarantee that passing a practice exam means you will pass the actual exam, it can help you determine how prepared you are.
The post Are SAS practice exams worth the cost? appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
SAS variables are variables in the statistics sense, not the computer programming sense. SAS has what many computer languages call “variables”; it just calls them “macro variables.” Knowing the difference between SAS variables and SAS macro variables will help you write more flexible and effective code.
The post When a variable is not a variable appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
Plastic pollution in the oceans is becoming a huge problem. And, as with any problem, finding the solution starts with identifying the source of the problem. A recent study estimated that 95% of the plastic pollution in our oceans comes from 10 rivers – let’s put some visual analytics to […]
The post Can you guess which 10 rivers produce 95% of the ocean's plastic pollution? appeared first on SAS Learning Post.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
You’ve probably heard about the “80-20 Rule,” which describes many natural and manmade phenomena. This rule is sometimes called the “Pareto Principle” because it was discovered by Vilfredo Pareto (1848–1923), who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held by 20% of the population.
The unequal distribution of effort, resources, and other quantities can often be described in terms of the Pareto distribution, which has a parameter that controls the inequity. Whereas some data seem to obey an 80-20 rule, other data are better described by a “70-30 rule” or a “60-40 rule,” and so on. (The two numbers do not need to sum to 100, but choosing them that way makes the statement pleasingly symmetric.)
I thought about the Pareto distribution recently when I was looking at some statistics for the SAS blogs. I found myself wondering whether 80% of the traffic to a blog is generated by 20% of its posts. (Definition: a blog is a collection of articles; each individual article is a post.)
As you can imagine, some posts are more popular than others. Some posts are timeless: they appear in internet searches when someone searches for a particular statistical or programming technique, even though they were published long ago. Other posts (such as Christmas-themed posts) do not get much traffic from internet searches. They generate traffic for a week or two and then fade into oblivion.
To better understand how various blogs at SAS follow the Pareto Principle, I downloaded data for seven blogs during a particular time period. I then kept only the top 100 posts for each blog.
The following line plot shows the pageviews for one blog. The horizontal axis indicates the posts for this blog, ranked by popularity. (The most popular post is 1, the next most popular post is 2, and so forth.) This blog has four very popular posts that each generated more than 7,000 pageviews during the time period. Another group of four posts was slightly less popular (between 4,000 and 5,000 pageviews). After those eight “blockbuster” posts come the rank-and-file posts, each viewed fewer than 3,000 times during the time period.
The pageviews for the other blogs look similar. This distribution is also common in book-sales data: most books sell only a few thousand copies whereas the best-sellers (think John Grisham) sell hundreds of millions of copies. Movie revenue is another example that follows this distribution.
The distribution is a long-tailed distribution, a fact that becomes evident if you graph a histogram of the underlying quantity. For the blog, the following histogram shows the distribution of the pageviews for the top 100 posts:
Notice the “power law” nature of the distribution for the first few bars of the histogram. The height of each bar is about 0.4 of the previous height. About 57% of the blog posts had fewer than 1,000 pageviews. Another 22% had between 1,000 and 2,000 pageviews. The number of rank-and-file posts in each category decreases like a power law, but then the blockbuster posts start to appear. These popular posts give the distribution a very long tail.
Because some blogs (like the one pictured) attract thousands of readers whereas other blogs have fewer readers, you need to standardize the data if you want to compare the distributions for several blogs.
Recall that the Pareto Principle is a statement about cumulative percentages. The following graph shows the cumulative percentages of pageviews for seven blogs at SAS (based on the Top 100 posts):
The graph shows a dashed line with slope –1 overlaid on the cumulative percentage curves. The places where the dashed line intersects a curve are the “Pareto locations,” for which Y% of the pageviews are generated by the (100–Y)% most popular blog posts. In general, these blogs appear to satisfy a “70-30 rule” because about 70% of the pageviews are generated by the 30 most popular posts (out of 100). There is some variation between blogs, with the upper curve following a “72-28 rule” and the lower curve satisfying a “63-37 rule.”
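Here is a sketch in Python of how such a Pareto location can be computed for one blog. The pageview counts are synthetic, drawn from a long-tailed distribution, since the real blog data are not public; the idea is to find the rank at which the cumulative share of pageviews crosses the line cum% = 100 − rank%:

```python
import numpy as np

# Sketch of locating the "Pareto point" for one blog: the rank k (out of
# 100 posts) at which the cumulative share of pageviews crosses the line
# cum% = 100 - rank%. The pageview counts are synthetic, drawn from a
# long-tailed distribution, because the real blog data are not public.
rng = np.random.default_rng(4)
views = np.sort(rng.pareto(1.5, size=100) * 500 + 100)[::-1]   # descending

cum_pct = 100 * np.cumsum(views) / views.sum()   # cum_pct[k-1]: share of top k posts
rank_pct = np.arange(1, 101)                     # with 100 posts, rank k is k percent
k = int(np.argmin(np.abs(cum_pct - (100 - rank_pct)))) + 1

print(f"about {cum_pct[k-1]:.0f}% of pageviews come from the top {k}% of posts")
```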
All this is approximate and is based on only the top 100 posts for each blog. For a more rigorous analysis, you could use PROC UNIVARIATE or PROC NLMIXED to fit the parameters of the Pareto distribution to the data for each blog. However, I am happy to stop here. In general, blogs at SAS satisfy an approximate “70-30 rule,” where 70% of the traffic is generated by the top 30% of the posts.
The post The 80-20 rule for blogs appeared first on The DO Loop.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.
The post What makes a SAS user? SAS thinks like me: Dede Schreiber-Gregory appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes them a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.
The post What makes a SAS user? Introverts find their tribe: Richann Watson appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
There were 97 e-posters in The Quad demo room at SAS Global Forum this year. And the one that caught my eye was Ted Conway’s “Periodic Table of Introductory SAS ODS Graphics Examples.” Here’s a picture of Ted fielding some questions from an interested user… He created a nice/fun graphic, […]
The post A periodic table to help you with your SAS ODS graphics! appeared first on SAS Learning Post.