The post Compute bivariate ranks appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Ranking is a fundamental concept in statistics. Ranks of univariate data are used by statisticians to estimate statistics such as percentiles (quantiles) and empirical distributions. A more advanced use is to compute various rank-based measures of correlation or association between pairs of variables. For example, ranks are used to compute the Spearman rank correlation.
The Spearman correlation uses univariate ranks. That is, the Spearman correlation between variables X and Y is determined by computing the tied ranks for X and Y separately.
Other bivariate measures of association use a different type of ranking, which is known as the bivariate rank. In a bivariate ranking, the pairs of (X,Y) values are ranked. A bivariate ranking assigns a rank to the pairs by using the X values, the Y values, and the joint values. This article provides an example of a bivariate ranking scheme, which is used in computing a statistic known as Hoeffding’s dependence coefficient.
The SAS/IML language supports the BRANKS function, which computes bivariate (tied) ranks according to the following formula. If the data are pairs of values {(X_i, Y_i) | i=1,2,…,n}, then the bivariate rank of the i-th point is
\(Q_i = 3/4 + \sum\nolimits_j u(X_i - X_j)\, u(Y_i - Y_j)\)
where u is a step function that indicates whether one value exceeds another; in the sum, each tied comparison contributes 0.5. Specifically, u(t)=1 if t>0, u(t)=1/2 if t=0, and u(t)=0 if t<0.
You can think of the formula as a (scaled) estimate of the bivariate cumulative distribution function (CDF).
If you assume that all data values are distinct (no tied values), then the argument to the u function is never 0, and the formula counts how many data points have an X coordinate less than X_{i} and (simultaneously) a Y coordinate less than Y_{i}.
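To make the formula concrete, here is a small Python sketch (illustrative only; the SAS computation below uses the BRANKS function) that evaluates Q_i directly from the definition. Note that the sum includes the j=i term, which contributes u(0)·u(0)=1/4; together with the constant 3/4, this means the smallest point in distinct data receives rank 1.

```python
def u(t):
    """Step function from the bivariate-rank formula: 1, 1/2, or 0."""
    return 1.0 if t > 0 else (0.5 if t == 0 else 0.0)

def bivariate_ranks(points):
    """Q_i = 3/4 + sum_j u(x_i - x_j) * u(y_i - y_j), summed over all j (including j=i)."""
    return [0.75 + sum(u(xi - xj) * u(yi - yj) for (xj, yj) in points)
            for (xi, yi) in points]

# For distinct, fully concordant data, the bivariate ranks are 1, 2, ..., n:
print(bivariate_ranks([(1, 1), (2, 2), (3, 3)]))   # [1.0, 2.0, 3.0]
```

For fully concordant distinct pairs the bivariate rank coincides with the ordinary rank, which is a useful sanity check on any implementation.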
If the data have tied values in one or both coordinates, then the formula for the rank of P = (X_i, Y_i) counts each point that ties P in a coordinate with a weight of 1/2 (via the u function) rather than with a weight of 0 or 1.

As noted earlier, the SAS/IML language supports the BRANKS function, which computes bivariate (tied) ranks according to this formula.
Let’s start with a sample that has nine observations. The ninth observation is a repeat of the eighth observation. The BRANKS function returns a matrix that has three columns:
/* BIVARIATE RANKS */
proc iml;
w = {10 20,
     10 21,
     10 22,
     11 20,
     11 21,
     11 22,
     12 20,
     12 22,
     12 22 };     /* last row is a repeat */
Ranks = branks(w);
print w[c={x y}], Ranks[c={'RankX' 'RankY' 'BRank'}];
The first two columns are the univariate tied ranks of the X and Y coordinates. The third column is the bivariate ranking of the pairs of points. If you think about plotting the points on a scatter plot, points in the lower-left corner of the plot have the lowest ranks, and points in the upper-right corner have the highest ranks.
To ensure that the BRANKS function does, in fact, use the formula in the documentation, I wrote the following function, which evaluates the formula “manually.” The following statements verify that the formula gives the same values as the third column of the BRANKS function:
/* compute bivariate ranks manually */
start BivarRank(xy);
   x = xy[,1];  y = xy[,2];
   n = nrow(x);
   Q = j(n, 1, .);
   do i = 1 to n;
      ux = (x[i] > x) + 0.5*(x[i] = x);   /* u(x[i] - x[j]) for each j */
      uy = (y[i] > y) + 0.5*(y[i] = y);   /* u(y[i] - y[j]) for each j */
      Q[i] = 0.75 + sum( ux#uy );         /* bivariate rank of (x[i], y[i]) */
   end;
   return Q;
finish;

Q = BivarRank(w);
bivarRank = Ranks[,3];
print Q bivarRank (Q-bivarRank)[L="Diff"];
It’s not clear (to me) how this formula assigns ranks to a cloud of points in a scatter plot.
We know that the points in the lower-left corner of the graph have low bivariate ranks and that points in the upper-right corner have high bivariate ranks. However, it is not clear what happens in the middle.
Let’s generate some data and find out! The following program statements generate 1000 random uniform points in the unit square. The points and the bivariate ranks are written to a SAS data set. PROC SGPLOT displays the points and colors the markers according to the bivariate rank, as follows:
/* compute bivariate ranks for random data */
call randseed(1234);
xy = randfun({1000 2}, "Uniform");    /* 1000 random uniform points in [0,1]x[0,1] */
bivarRank = branks(xy);               /* third column contains bivariate ranks */
m = bivarRank || xy;
create BivarRanks from m[c={'rx' 'ry' 'brank' 'x' 'y'}];
append from m;
close;
QUIT;

/* palette("spectral",9) */
%let colorRamp = CXD53E4F CXF46D43 CXFDAE61 CXFEE08B CXFFFFBF CXE6F598 CXABDDA4 CX66C2A5 CX3288BD;

title "Bivariate Ranks of 1000 Points";
proc sgplot data=BivarRanks aspect=1;
   scatter x=x y=y / colorresponse=brank markerattrs=(symbol=CircleFilled)
                     colormodel=(&colorRamp);
run;
The colors of the markers indicate the bivariate ranks of the observations.
The graph indicates that low ranks are assigned to points whose X or Y coordinates are small. High ranks are assigned only when both coordinates are large.
Notice that the ranks are not uniformly distributed among the 1000 points: many points have small bivariate ranks, whereas few have large ranks. For example, about 50% of the points have ranks less than 200, but only 10% have ranks greater than 600.
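To spot-check those percentages, here is a small Python simulation (an independent re-implementation of the formula, not the SAS program above) that computes bivariate ranks for 1000 uniform points and tallies the two fractions:

```python
import random

def bivariate_ranks(points):
    """Q_i = 3/4 + sum_j u(x_i - x_j)*u(y_i - y_j), with u = 1, 1/2, or 0."""
    u = lambda t: 1.0 if t > 0 else (0.5 if t == 0 else 0.0)
    return [0.75 + sum(u(xi - xj) * u(yi - yj) for xj, yj in points)
            for xi, yi in points]

random.seed(1234)
pts = [(random.random(), random.random()) for _ in range(1000)]
Q = bivariate_ranks(pts)
low  = sum(q < 200 for q in Q) / len(Q)   # fraction of ranks below 200
high = sum(q > 600 for q in Q) / len(Q)   # fraction of ranks above 600
print(round(low, 2), round(high, 2))      # roughly 0.5 and 0.1
```

For independent uniform coordinates, Q_i/n is approximately X_i·Y_i, and the product of two uniforms concentrates near zero, which is exactly why the low ranks are so crowded.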
As I indicated earlier, the formula for the bivariate ranks is reminiscent of the definition of a bivariate CDF. In retrospect, I should not have been surprised to see that coloring the observations by their bivariate rank looks a lot like a two-dimensional CDF. For example, the following graph shows the CDF for the bivariate normal distribution:
This article discusses the concept of a bivariate rank for ordered pairs. In a bivariate rank, both the X and Y coordinates are used to assign a rank. The formula that computes the bivariate rank is not complicated, but I did not initially understand how it assigns ranks to points in a scatter plot. As usual, a visualization helps. The visualization shows that bivariate ranks are conceptually similar to the computation of a two-dimensional CDF.
Creating a report that displays only the column headings for a data set containing 0 records was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
A customer recently contacted SAS Technical Support and wanted to know how he could generate a report that displays just the column headings for a data set (or table) that does not contain any records. Rather than just omitting the missing data, he wanted to provide his customers with a visual way to see where data was missing.
This blog demonstrates how to create a report that provides only the column headings for data that is missing. The blog also explains how to create, select, and exclude output objects as well as how to generate reports with the SAS® Output Delivery System (ODS). These concepts are relevant to the task of generating a report with the column headings for a data set that contains no (0) observations.
The first section below provides some basic information that you need to understand about ODS. Specifically, it discusses ODS destinations along with the concept of output objects and how they work in ODS.
The second section explains the tools that enable you to achieve the desired output for the report.
The last section provides a code example that unites all of these concepts. The end result is a report that contains only column headings from the WORK.CLASS data set and information from the Moments output object in the SASHELP.CLASS data set.
The SAS Output Delivery System has many destinations that you can use to generate files in various formats. Some of these destinations generate files in third-party formats such as XLSX (Excel), DOCX (Word), PPTX (PowerPoint), and HTML (HTML, HTML5). Other types of destinations are available, too. Examples include ODS Package, which generates Archive (or ZIP) files, and ODS Document, which generates binary objects from SAS DATA steps or procedures.
The foundation of the Output Delivery System is the output object, which is generated when a SAS® procedure or DATA step is executed. An output object is created by combining the data (the text and numbers) with a template definition.
DATA steps generate only one output object, whereas procedures can generate one or more output objects.
To see the contents of an output object, you can use the ODS TRACE statement to generate trace records. A trace record displays the object name, the template location, and the label.
The following example generates trace records for the SASHELP.CLASS data set:
ods trace on;
proc univariate data=sashelp.class;
run;
The output from this code is shown below:
Once you discover an object’s name, you can choose to select or exclude it from the output by using either the ODS SELECT statement or the ODS EXCLUDE statement.
For example, using the previous code example, you can use the ODS SELECT statement to choose a specific output object.
ods trace on;
ods select moments;
proc univariate data=sashelp.class;
run;
In this example, the ODS SELECT statement selects just the Moments object and sends it to any open ODS destinations so that the object’s data can be printed in a report. No other objects are sent to the destination.
The ODS TRACE statement generates the following output in the trace log since only the Moments object is specified.
The information in the previous section is helpful for data sets that contain records. But it is not helpful if you want to generate a report that shows column headings from a data set that does not have any records.
When a table has no records, it does not generate an output object. Because no object is generated, ODS cannot display headings.
However, you can use another strategy with ODS to display headings from a data set with no records. You can use dictionary tables to obtain the names of the column headings for a table that has no records. Dictionary tables are created automatically by SAS® to store information related to SAS libraries, SAS system options, SAS catalogs, and so on. These tables enable you to query information about a data set (column names, titles, and so on).
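As a rough analogy for readers more familiar with SQL databases, here is a Python/SQLite snippet (illustrative only; not part of the SAS solution) that queries table metadata for the column names of a zero-row table, which is conceptually what the DICTIONARY.COLUMNS query does:

```python
import sqlite3

# Analogy only: SQLite exposes table metadata much as SAS dictionary tables do.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE class (name TEXT, age INTEGER)")   # a table with 0 rows
cols  = [row[1] for row in con.execute("PRAGMA table_info(class)")]
nrows = con.execute("SELECT count(*) FROM class").fetchone()[0]
print(cols, nrows)   # ['name', 'age'] 0
```

The key point carries over directly: the column names live in the metadata, so they are available even when the table itself contains no records.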
To accomplish the task at hand, you can use a dictionary table and a SAS macro with ODS. The macro verifies whether you are processing a zero-observation data set.
Note: The column headings are arranged vertically. You need to transpose them so that they are displayed horizontally in the report that is generated by the ODS destination.
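The macro’s branching logic boils down to: count the observations; if there are any, print the real report; otherwise print only the column headings. Here is a minimal Python sketch of that control flow (a hypothetical helper, for illustration only — the actual solution is the SAS macro that follows):

```python
def report(columns, rows):
    """Return report lines: headings only when there are no records,
    headings plus data otherwise (mimicking the macro's %if/%else branch)."""
    header = " | ".join(columns)
    if not rows:                      # zero-observation case
        return ["No data", header]
    return [header] + [" | ".join(str(v) for v in r) for r in rows]

print(report(["Name", "Age"], []))             # headings only
print(report(["Name", "Age"], [("Amy", 12)]))  # headings plus one record
```

The SAS version does the same thing, except that the row count comes from the ATTRN function and the headings come from DICTIONARY.COLUMNS.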
The following example illustrates the strategy that is described in the last section:
/* Sample table with 0 observations */
data work.class;
   set sashelp.class;
   stop;
run;

ods excel file="sample.xlsx" options(embedded_titles="yes");

%macro test(libref=,dsn=);
   %let rc=%sysfunc(open(&libref..&dsn,i));
   %let nobs=%sysfunc(attrn(&rc,NOBS));
   %let close=%sysfunc(close(&rc));
   %if &nobs ne 0 and %sysfunc(exist(&libref..&dsn)) %then %do;
      title "Report for Company XYZ";
      ods select moments;
      proc univariate data=&libref..&dsn;
      run;
   %end;
   %else %do;
      proc sql noprint;
         create table temp as
         select name
         from dictionary.columns
         where libname=%upcase("&libref") and memname=%upcase("&dsn");
      quit;

      proc transpose data=temp out=temp1(drop=_label_ _name_);
         id name;
         var name;
      run;

      proc report data=temp1 noheader style(column)=header[just=center] nowd;
         title "No data for data set &libref..&dsn";
      run;
   %end;
%mend;

%test(libref=work,dsn=class)
%test(libref=sashelp,dsn=class)

ods excel close;
As you can see below, the report that is generated shows the column headings for the empty WORK.CLASS data set as well as the data from the Moments object from the SASHELP.CLASS data set.
Registration is open for a truly inspiring SAS Global Forum 2021 was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
I can’t believe it’s true, but SAS Global Forum is just over a month away. I have some exciting news to share with you, so let’s start with the theme for this year:
New Day. New Answers. Inspired by Curiosity.
What a fitting theme for this year! Technology continues to evolve, so each new day is a chance to seek new answers to what can sometimes feel like impossible challenges. Our curiosity as humans drives us to seek out better ways to do things. And I hope your curiosity will drive you to register for this year’s SAS Global Forum.
We are excited to offer a global event across three regions. If you’re in the Americas, the conference is May 18-20. In Asia Pacific? Then we’ll see you May 19-20. And we didn’t forget about Europe. Your dates are May 25-26. We hope these region-specific dates and the virtual nature of the conference means more SAS users than ever will join us for an inspiring event. Curious about the exciting agenda? It’s all on the website, so check it out.
Want to be inspired to chase your “impossible” dreams? Or hear more about the future of AI? How about learning about work-life balance and your mental health? We have you covered. SAS executives are gearing up to host an exciting lineup of extremely smart, engaging and thought-provoking keynote speakers like Adam Grant, Ayesha Khanna and Hakeem Oluseyi.
And who knows, we might have a few more surprises up our sleeve. You’ll just have to register and attend to find out.
Have you joined the SAS Global Forum online community? You should, because that’s where you’ll find all the discussion around the conference…before, during and after. It’s also where you’ll find a link to the 2021 proceedings, when they become available. Authors are busy preparing their presentations now and they are hard at work staging their proceedings in the community. Join the community so you can connect with other attendees and know when the proceedings become available.
SAS Global Forum is the place where creativity meets curiosity, and amazing analytics happens! I encourage you to regularly check the conference website, as we’re continually adding new sessions and events. You don’t want to miss this year’s conference, so don’t forget to register for SAS Global Forum. See you soon!
The post Compute tied ranks appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
The ranks of a set of data values are used in many nonparametric statistics and statistical tests. When you request a statistic or nonparametric test in SAS, the procedure will automatically compute the ranks that are needed. However, sometimes it is useful to know how to compute the ranks yourself.
This article shows how to compute ranks in SAS when the data contains repeated values, which result in tied ranks. The article shows how to use PROC RANK as well as the RANKTIE function in SAS/IML software.
Ranks are easy to compute if there are no tied values in the data. You simply sort the data values and assign the ordinal position of the sorted data as the rank. For example, in the data {18, 13, 19, 16}, the corresponding ranks are {3, 1, 4, 2} because 13 is the first (sorted) value, 16 is the second (sorted) value, and so forth. For a sample of size n, the ranks are integers in the range [1,n].
When the data contains duplicate values, you must decide how to assign ranks to the tied values. All tied values should have the same rank, but what rank should you assign? There are several ways to handle ties, but the most common way is to assign the average rank of the tied values. This is advantageous in statistical tests because it preserves the sum of the ranks. For example,
in the data {18, 13, 18, 16}, the “average rank” method would assign the ranks {3.5, 1, 3.5, 2} because the third and fourth sorted values are the same. Therefore, the average rank is (3 + 4)/2 = 3.5.
Notice that the sum of the ranks is 10, which equals the sum of the integers 1:n, where n=4.
There are four common methods for handling ties. Suppose there are k tied values. If you sort the data, the tied values will appear in the ordinal positions R, R+1, …, R+k-1. The MEAN method assigns each tied value the average of those positions, which is R + (k-1)/2; the LOW method assigns the smallest position, R; the HIGH method assigns the largest position, R+k-1; and the DENSE method assigns consecutive integer ranks to the distinct values so that there are no gaps.
In SAS, you can compute ranks by using the RANKTIE function in SAS/IML, or you can use PROC RANK in Base SAS. Both support the four tie-handling methods, and both use the mean method by default.
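Before looking at the SAS syntax, it may help to see the four tie-handling rules spelled out in executable form. The following Python function is an illustrative re-implementation (not SAS code) of the MEAN, LOW, HIGH, and DENSE rules:

```python
def ranktie(x, method="mean"):
    """Rank values, handling ties by 'mean', 'low', 'high', or 'dense'."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks, dense = [0.0] * len(x), 0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and x[order[j]] == x[order[i]]:
            j += 1                       # sorted positions i..j-1 are tied
        dense += 1
        r = {"mean": (i + 1 + j) / 2,    # average of positions i+1 .. j
             "low": i + 1, "high": j, "dense": dense}[method]
        for k in order[i:j]:
            ranks[k] = r
        i = j
    return ranks

x = [10, 10, 10, 11, 11, 12, 12, 12, 13]
print(ranktie(x))            # [2.0, 2.0, 2.0, 4.5, 4.5, 7.0, 7.0, 7.0, 9.0]
print(ranktie(x, "low"))     # [1, 1, 1, 4, 4, 6, 6, 6, 9]
print(ranktie(x, "dense"))   # [1, 1, 1, 2, 2, 3, 3, 3, 4]
```

Notice that the mean ranks sum to n(n+1)/2 = 45, which is the rank-sum preservation property mentioned above.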
You can use the RANKTIE function in PROC IML to compute tied ranks, as follows:
proc iml;
x = {10, 10, 10, 11, 11, 12, 12, 12, 13};
rankMean  = ranktie(x);            /* default="MEAN" */
rankLow   = ranktie(x, "Low");
rankDense = ranktie(x, "Dense");
print x rankMean rankLow rankDense;
The result shows three of the four methods:
For simplicity, the previous example lists the data in sorted order. However, the tied ranks will be the same regardless of the order of the values. For example, the following statements use the same values but change the order of the observations. The ranks are the same:
/* Change the order of the data. The result is similar except for the ordering. */
x = {12, 10, 13, 11, 10, 10, 12, 11, 12};
rankMean  = ranktie(x);            /* default="MEAN" */
rankLow   = ranktie(x, "Low");
rankDense = ranktie(x, "Dense");
print x rankMean rankLow rankDense;
Although PROC RANK supports the TIES= option to specify the MEAN, LOW, HIGH, or DENSE methods, you can only use one method at a time. Therefore, the following program calls PROC RANK three times and concatenates the outputs into a single data set:
data X;
input x @@;
datalines;
10 10 10 11 11 12 12 12 13
;

proc rank data=X out=rankMean ties=MEAN;
   var X;  ranks rankMean;
run;
proc rank data=X out=rankLow ties=LOW;
   var X;  ranks rankLow;
run;
proc rank data=X out=rankDense ties=DENSE;
   var X;  ranks rankDense;
run;

data Ranks;
   merge rankMean rankLow rankDense;
run;

proc print data=Ranks; run;
The result is the same as for the RANKTIE function in SAS/IML. In practice, you typically apply only one method for handling ties, so you only need to call PROC RANK once.
Many statistics and statistical tests use ranks. When the data contain duplicate values, the ranks are not unique. You must choose a way to assign ranks to the tied values. This article discusses four ways to compute tied ranks in SAS. Computations are shown by using PROC IML and PROC RANK.
Using shell scripts for massively parallel processing was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
Until recently, I used UNIX/Linux shell scripts in a very limited capacity, mostly as a vehicle for submitting SAS batch jobs. All the heavy lifting (conditional processing logic, looping, macro processing, etc.) was done in SAS and by SAS. If there was a need for parallel processing and synchronization, it was also implemented in SAS. I even wrote a blog post, Running SAS programs in parallel using SAS/CONNECT®, which I proudly shared with my customers.
The post caught their attention and I was asked if I could implement the same approach to speed up processes that were taking too long to run.
However, it turned out that SAS/CONNECT was not licensed at their site and procuring the license wasn’t going to happen any time soon. Bummer!
Or boon? You should never be discouraged by obstacles. In fact, encountering an obstacle might be a stroke of luck. Just add a mixture of curiosity, creativity, and tenacity – and you get a recipe for new opportunity and success. That’s exactly what happened when I turned to exploring shell scripting as an alternative way of implementing parallel processing.
UNIX/Linux OS allows running several scripts in parallel. Let’s say we have three SAS batch jobs controlled by their own scripts script1.sh, script2.sh, and script3.sh. We can run them concurrently (in parallel) by submitting these shell scripts one after another in background mode using & at the end. Just put them in a wrapper “parent” script allthree.sh and run it in background mode as:
$ nohup allthree.sh &
Here is what is inside allthree.sh:
#!/bin/sh
script1.sh &
script2.sh &
script3.sh &
wait
With such an arrangement, the allthree.sh “parent” script starts all three background tasks (and the corresponding SAS programs), which the server runs concurrently, as far as resources allow. Depending on the server capacity (mainly, the number of CPUs), these jobs will run in parallel or quasi-parallel, competing for the server’s shared resources, with the operating system orchestrating their coexistence and load balancing.
The wait command at the end is responsible for the “parent” script’s synchronization. Since no process ID or job ID is specified with the wait command, it waits for all current “child” processes to complete. Once all three tasks have completed, the parent script allthree.sh continues past the wait command.
To evaluate the server’s capacity for parallel processing, we would like to know the number of CPUs. To get this information, we can run the lscpu command, which provides an overview of the CPU architectural characteristics: the number of CPUs, the number of CPU cores, vendor ID, model, model name, the speed of each core, and lots more. Here is what I got:
Ha! 56 CPUs! This is not bad, not bad at all! I don’t even have to usurp the whole server after all. I can just grab about 50% of its capacity and be a nice guy leaving another 50% to all other users.
Here is a simplified description of the problem I was facing.
Each month, shortly after the end of the previous month, we needed to ingest a number of CSV files pertinent to the previous month’s transactions and produce daily SAS data tables for each day of that month. The existing process sequentially looped through all the CSV files, which (given the data volume) took about an hour to run.
This task was a perfect candidate for parallel processing since data ingestions of individual days were fully independent of each other.
The solution comprises two parts: a SAS program that ingests the data for a single day, and a shell script that runs that program for every day of the month in parallel.

The first thing I did was rewrite the SAS program so that, instead of looping through all of the days, it ingests just a single day. Here is a bare-bones version of the SAS program:
/* capture parameter &sysparm passed from OS command */
%let YYYYMMDD = &sysparm;

/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' '
   from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;

/* create fileref inf for the source file */
filename inf "/csvpath/rawdata&YYYYMMDD..csv";

/* create daily output data set */
data SASDL.DATA&YYYYMMDD;
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;
This SAS program (let’s call it oneday.sas) can be run in batch using the following OS command:
sas oneday.sas -log oneday.log -sysparm 20210304

Note that we pass a parameter (e.g. 20210304 means year 2021, month 03, day 04) that defines the requested day YYYYMMDD as the -sysparm value. That value becomes available in the SAS program as the macro variable reference &sysparm.
We also use a pre-created data template PARMSDL.DATA_TEMPLATE – a zero-observations data set that contains descriptions of all the variables and their attributes (see Simplify data preparation using SAS data templates).
The shell script month_parallel_driver.sh below puts everything together. It spawns and runs concurrently as many daily processes as there are days in the specified month, and it synchronizes all the single-day processes (threads) at the end by waiting for them all to complete. It logs all its threads and calculates (and prints) the total processing duration. As you can see, the shell scripting language is quite versatile and powerful. Here it is:
#!/bin/sh
# HOW TO RUN:
#   cd /projpath/scripts
#   nohup sh month_parallel_driver.sh YYYYMM &

# Project path
proj=/projpath

# Program file name
prgm=oneday
pgmname=$proj/programs/$prgm.sas

# Current date/time stamp
now=$(date +%Y.%m.%d_%H.%M.%S)
echo 'Start time:'$now

# Reset timer
SECONDS=0

# Get YYYYMM as the script parameter
par=$1

# Extract year and month from $par
y=${par:0:4}
m=${par:4:2}

# Get number of days in month $m of year $y
days=$(cal $m $y | awk 'NF {DAYS = $NF}; END {print DAYS}')

# Create log directory
logdir=$proj/saslogs/${prgm}_${y}${m}_${now}_logs
mkdir $logdir

# Loop through all days of month $m of year $y
for i in $(seq -f "%02g" 1 $days)
do
   # Assign log name for a single day thread
   logname=$logdir/${prgm}_${y}${m}_thread${i}_$now.log
   # Run single day thread
   /SASHome/SASFoundation/9.4/sas $pgmname -log $logname -sysparm $par$i &
done

# Wait until all threads are finished
wait

# Calculate and print duration
end=$(date +%Y.%m.%d_%H.%M.%S)
echo 'End time:'$end
hh=$(($SECONDS/3600))
mm=$(( $(($SECONDS - $hh * 3600)) / 60 ))
ss=$(($SECONDS - $hh * 3600 - $mm * 60))
printf "Total Duration: %02d:%02d:%02d\n" $hh $mm $ss
echo '------- End of job -------'
This script is documented by detailed comments and can be run as follows:

cd /projpath/scripts
nohup sh month_parallel_driver.sh 202103 &
Note that it will create a separate date-time stamped SAS log file for each thread, i.e. there will be as many log files created as there are days in the month-year for which data is ingested.
The results were as stunning as they were expected. The overall duration was cut roughly by a factor of 25, so this whole task now completes in about two minutes instead of about an hour. Actually, now it is even fun to watch the SAS logs and output data sets being updated in real time.
What is more, this script-centric approach can be used to run not just SAS processes, but also non-SAS, open-source, and hybrid processes. This makes it a powerful amplifier and integrator for developing heterogeneous software applications.
The solution presented in this post is a stripped-down version of the original production quality solution. This better serves our educational objective of communicating the key concepts and coding techniques. If you believe your organization’s computational powers are underutilized and may benefit from a SAS Consulting Services engagement, please reach out to us through your SAS representative, and we will be happy to help.
Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.
The post Overlay other graphs on a bar chart with PROC SGPLOT appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
It can be frustrating to receive an error message from statistical software. In the early days of the SAS statistical graphics (SG) procedures, an error message that I dreaded was
ERROR: Attempting to overlay incompatible plot or chart types.
This error message appears when you attempt to use PROC SGPLOT to overlay two plots that have different properties. For example, you might be trying to overlay a bar chart, which requires a categorical variable, with a scatter plot or series plot, which often displays values of a continuous variable.
In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can combine a bar chart with other plot types by using the VBARBASIC or HBARBASIC statements, which create a bar chart that is compatible with other “basic” plots.
The SAS documentation includes an explanation of chart types, and a table that shows which plots you can overlay when you use PROC SGPLOT or PROC SGPANEL. (The doc erroneously puts HBARBASIC and VBARBASIC in the “categorization” group, but they should be in the “basic group.” I have contacted the doc writer to correct the mistake.)
If you try to overlay plots from different chart types, you will get the dreaded
ERROR: Attempting to overlay incompatible plot or chart types.
First, let me emphasize that this error message only appears when you use PROC SGPLOT or SGPANEL to overlay the plots. If you use the Graph Template Language (GTL) and PROC SGRENDER, you do not have this restriction.
I have previously written about two important cases for which it is necessary to overlay an empirical distribution (bar chart or histogram) and a theoretical distribution, which is visualized by using a scatter plot or series plot.
My previous article shows how to overlay a bar chart and a series plot, but the example is a little complicated. The examples in the next sections are much simpler.
There are two ways to combine a bar chart and a line plot: You can use the HBAR and HLINE statements, or you can use the HBARBASIC and SERIES statements. By using HBARBASIC, you can overlay a bar chart with many other plots.
Suppose you want to use a bar chart to display the average height (by age) of a sample of school children. You also want to add a line that shows the average heights in the population. Because you must use “compatible” plot types, the traditional approach is to combine the HBAR and HLINE statements, as follows. (For simplicity, I omit the legend on this graph.) The DATA step creates fake data, which are supposed to represent the national average and a range of values for the average heights.
data NationalAverage;   /* fake data for demonstration purposes */
label Average = "National Average";
input Age Average Low High;
datalines;
11 60 53 63
12 62 54 65
13 65 55 67
14 66 56 68
15 67 57 70
16 68 58 72
;

data All;
   set Sashelp.Class NationalAverage;
run;

title "Average Heights of Students in Class, by Age";
proc sgplot data=All noautolegend;
   hbar Age / response=height stat=mean;
   hline Age / response=Average markers datalabel;
run;
This is the “classic” bar chart and line plot. This syntax has been available in SAS since at least SAS 9.2. It enables you to combine multiple statements for discrete variables, such as HBAR/VBAR, HLINE/VLINE, and DOT.
However, in some situations, you might need to overlay a bar chart and more complicated plots. In those situations, use the HBARBASIC or VBARBASIC graphs, as shown in the next section.
The VBARBASIC and HBARBASIC statements (introduced in SAS 9.4M3) enable you to combine bar charts with one or more other “basic” plots, such as scatter plots, series plots, and box plots. Like the VBAR and HBAR statements, these statements can summarize raw data, and they have almost the same syntax as the VBAR and HBAR statements.
Suppose you want to combine a bar chart, a series plot, and a high-low plot. You can’t use the VBAR or HBAR statements because that leads to “incompatible plot or chart types.” However, you can use the VBARBASIC and HBARBASIC statements, as follows:
proc sgplot data=All noautolegend;
   hbarbasic Age / response=height stat=mean name="S" legendlabel="Class";
   series y=Age x=Average / markers datalabel=Average name="Avg"
                            legendlabel="National Average";
   highlow y=Age low=Low high=High;
   keylegend "S" "Avg";
run;
Notice that the SERIES and HIGHLOW statements create “basic” graphs. To overlay these on a bar chart, use the HBARBASIC statement. In a similar way, you can overlay many other graph types on a bar chart.
Sometimes you need to overlay a bar chart and another type of graph. If you aren’t careful, you might get the error message: ERROR: Attempting to overlay incompatible plot or chart types.
In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can use the VBARBASIC or HBARBASIC statements to create a “basic” bar chart that is compatible with other “basic” plots.
The post Overlay other graphs on a bar chart with PROC SGPLOT appeared first on The DO Loop.
The post 3 reasons to prefer a horizontal bar chart appeared first on The DO Loop.
Most introductory statistics courses introduce the bar chart as a way to visualize the frequency (counts) for a categorical variable. A vertical bar chart places the categories along the horizontal (X) axis and shows the counts (or percentages) on the vertical (Y) axis. The vertical bar chart is a precursor to the histogram, which visualizes the distribution of counts for a continuous variable that has been binned.
Although bar charts are often displayed by using vertical bars, it is often advantageous to use a horizontal bar chart instead.
This article discusses three situations in which a horizontal bar chart is preferable to a vertical bar chart.
In SAS, it is easy to create a vertical or a horizontal bar chart:
For example, the following calls to SAS procedures create vertical and horizontal bar charts. The charts show the number of patients in a study who smoke, where the smoking category has five different levels, from non-smoker to heavy smoker. This categorical variable is ordinal, so I chose to sort the data and use the DISCRETEORDER=DATA option (for PROC SGPLOT) or the ORDER=DATA option (for PROC FREQ) so that the bars appear in a logical order.
/* Sort the data by smoking status:
   See https://blogs.sas.com/content/iml/2016/06/20/select-when-sas-data-step.html */
data Heart;
set sashelp.heart;
select (Smoking_Status);
   when ('Non-smoker')        Smoking_Cat=1;
   when ('Light (1-5)')       Smoking_Cat=2;
   when ('Moderate (6-15)')   Smoking_Cat=3;
   when ('Heavy (16-25)')     Smoking_Cat=4;
   when ('Very Heavy (> 25)') Smoking_Cat=5;
   otherwise                  Smoking_Cat=.;
end;
run;

proc sort data=Heart;
by Smoking_Cat;
run;

ods graphics / width=400px height=300px;   /* make the graphs small */

/* Standard vertical bar charts */
title "Vertical Bar Chart";
proc sgplot data=Heart;
vbar Smoking_Status;
xaxis discreteorder=data;   /* use data order instead of alphabetical */
yaxis grid;
run;

proc freq data=Heart order=data;
tables Smoking_Status / plot=FreqPlot;
run;

title "Horizontal Bar Chart";
proc sgplot data=Heart;
hbar Smoking_Status;
xaxis grid;
yaxis discreteorder=data;   /* use data order instead of alphabetical */
run;

proc freq data=Heart order=data;   /* Y axis is reversed from PROC SGPLOT */
tables Smoking_Status / plot=FreqPlot(orient=horizontal);
run;
The plots from PROC SGPLOT are displayed; the ones from PROC FREQ are similar.
I intentionally made the graphs somewhat small so that the category labels for the vertical bar chart cannot be displayed without rotating or splitting the text labels.
There are at least three advantages to using horizontal bar charts:
Many categories are easier to display. The previous example has five categories but imagine having 20 or 50 categories. A horizontal bar chart can display the category names in a straightforward manner just by making the chart taller. This is an advantage for graphs on the printed page (in portrait mode) and for an HTML page because it is easy to scroll a web page vertically.
The graph to the right shows a portion of a horizontal bar chart that has 45 categories. Each category is the name of a pair of variables and the bar chart shows the Pearson correlation between the two variables.
A horizontal layout can also be helpful for labeling each segment in a stacked bar chart. An example is shown below.
In practice,
it is not always possible to get the labels to fit fully inside the bars, especially for categories that have few counts. However, if you have 10 or more categories, a horizontal bar chart offers a better chance of displaying segment labels inside the bars. You should experiment with both the vertical and horizontal charts to determine which is the better choice.
SAS offers both vertical and horizontal bar charts. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart and gives some situations in which the horizontal chart is preferable.
The post Double integrals by using Monte Carlo methods appeared first on The DO Loop.
As mentioned in my article about Monte Carlo estimate of (one-dimensional) integrals, one of the advantages of Monte Carlo integration is that you can perform multivariate integrals on complicated regions. This article demonstrates how to use SAS to obtain a Monte Carlo estimate of a double integral over rectangular and non-rectangular regions. Be aware that a Monte Carlo estimate is often less precise than a numerical integration of the iterated integral.
Multivariate integrals are notoriously difficult to solve. But if you use Monte Carlo methods, higher-dimensional integrals are only marginally more difficult than one-dimensional integrals. For simplicity, suppose we are interested in the double integral \(\int_D f(x,y) \,dx\,dy\), where D is a region in the plane.
The basic steps for estimating a multivariate integral are as follows:
1. Generate a large random sample of points from the uniform distribution on D.
2. Evaluate the integrand at each point and store the values in a vector, W.
3. Compute the product Area(D)*mean(W).
The quantity Area(D)*mean(W) is a Monte Carlo estimate of \(\int_D f(x,y) \,dx\,dy\). The estimate depends on the size of the random sample (larger samples tend to give better estimates) and the particular random variates in the sample.
Let’s start with the simplest case, which is when the domain of integration, D, is a rectangular region. This section estimates the double integral of f(x,y) = cos(x)*exp(y) over the region D = [0,π/2] x [0,1]. That is, we want to estimate the integral
\(\int\nolimits_0^{\pi/2}\int\nolimits_0^1 \cos(x)\exp(y)\,dy\,dx = e – 1 \approx 1.71828\)
where e is the base of the natural logarithm. For this problem, the area of D is (b-a)*(d-c) = π/2.
The graph at the right shows a heat map of the function on the rectangular domain.
The following SAS/IML program defines the integrand and the domain of integration.
The program generates 5E6 uniform random variates on the interval [a,b] = [0,π/2] and on the interval [c,d] = [0,1]. The integrand is evaluated at the (x,y) pairs, and the vector W holds the result. The Monte Carlo estimate is the area of the rectangular domain times the mean of W:
/* Monte Carlo approximation of a double integral over a rectangular region */
proc iml;
/* define integrand: f(x,y) = cos(x)*exp(y) */
start func(x);
   return cos(x[,1]) # exp(x[,2]);
finish;

/* Domain of integration: D = [a,b] x [c,d] */
pi = constant('pi');
a = 0; b = pi/2;   /* 0 < x < pi/2 */
c = 0; d = 1;      /* 0 < y < 1 */

call randseed(1234);
N = 5E6;
X = j(N,2);                         /* X ~ U(D) */
z = j(N,1);
call randgen(z, "uniform", a, b);   /* z ~ U(a,b) */
X[,1] = z;
call randgen(z, "uniform", c, d);   /* z ~ U(c,d) */
X[,2] = z;

W = func(X);                        /* W = f(X1,X2) */
Area = (b-a)*(d-c);                 /* area of rectangular region */
MCEst = Area * mean(W);             /* MC estimate of double integral */

/* the double integral is separable; solve exactly */
Exact = (sin(b)-sin(a))*(exp(d)-exp(c));
Diff = Exact - MCEst;
print Exact MCEst Diff;
The output shows that the Monte Carlo estimate is a good approximation to the exact value of the double integral. For this random sample, the estimate is within 0.0002 units of the true value.
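The same computation can be cross-checked outside SAS. The following NumPy sketch mirrors the IML program above; the sample size and seed are arbitrary choices, and NumPy is assumed to be available:

```python
import numpy as np

# Monte Carlo estimate of the double integral of f(x,y) = cos(x)*exp(y)
# over the rectangle D = [0, pi/2] x [0, 1]
rng = np.random.default_rng(1234)
N = 1_000_000                     # sample size (larger samples are more accurate)
a, b = 0.0, np.pi / 2             # range of x
c, d = 0.0, 1.0                   # range of y

x = rng.uniform(a, b, N)          # x ~ U(a,b)
y = rng.uniform(c, d, N)          # y ~ U(c,d)
w = np.cos(x) * np.exp(y)         # evaluate the integrand at each random point

area = (b - a) * (d - c)          # area of the rectangle = pi/2
mc_est = area * w.mean()          # Monte Carlo estimate of the integral

exact = (np.sin(b) - np.sin(a)) * (np.exp(d) - np.exp(c))   # = e - 1
print(mc_est, exact - mc_est)
```

With N = 1E6 variates, the estimate typically agrees with e − 1 ≈ 1.71828 to two or three decimal places.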
It is only slightly more difficult to estimate a double integral on a non-rectangular domain, D.
It is helpful to split the problem into two subproblems: (1) generate points uniformly in D, and (2) estimate the integral on D.
The next two sections show how to estimate the integral of f(x,y) = exp(-(x² + y²)) over the disk of unit radius centered at the origin:
D = {(x,y) | x² + y² ≤ 1}.
The graph at the right shows a heat map of the function on a square. The domain of integration is the unit disk.
By using polar coordinates, you can solve this integral exactly. The double integral has the value
2π(e - 1)/(2e) ≈ 1.9858653, where e is the base of the natural logarithm.
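For reference, the polar-coordinate computation that gives this value is short:

```latex
\int_D e^{-(x^2+y^2)}\,dx\,dy
  = \int_0^{2\pi}\!\!\int_0^1 e^{-r^2}\, r\,dr\,d\theta
  = 2\pi \left[ -\tfrac{1}{2} e^{-r^2} \right]_0^1
  = \pi\left(1 - e^{-1}\right)
  = \frac{2\pi(e-1)}{2e} \approx 1.9858653
```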
For certain shapes, you can directly generate a random sample of uniformly distributed points. For example, for a rectangle you can generate each coordinate independently (as in the previous section), and for a disk of radius R you can generate polar coordinates θ ~ U(0, 2π) and r = R*sqrt(u), where u ~ U(0,1).
In general, for any planar region, you can use the acceptance-rejection technique. I previously showed an example of using Monte Carlo integration to estimate π by estimating the area of a circle.
Because the acceptance-rejection technique is the most general, let’s use it to generate a random set of points in the unit disk, D. The steps are as follows:
1. Generate points uniformly at random in a bounding rectangle, R = [-1,1] x [-1,1], that contains D.
2. Accept the points that satisfy x² + y² ≤ 1 (these are uniformly distributed in D) and reject the rest.
If you know the area of D, you can use a useful trick to choose the sample size. When you use an acceptance-rejection technique, you do not know in advance how many points will end up in D. However, you can estimate the number as N_{D} ≈ N_{R}*Area(D)/Area(R), where N_{R} is the sample size in R.
If you know the area of D, you can invert this formula. If you want approximately N_{D} points to be in D, generate N_{D}*Area(R)/Area(D) points in R. For example, suppose you want N_{D}=5E6 points in the unit disk, D. We can choose N_{R} ≥ N_{D}*4/π, as in the following SAS/IML program, which generates random points in the unit disk:
proc iml;
/* (1) Generate approx N_D points in U(D), where D is the unit disk
   D = { (x,y) | x**2 + y**2 <= 1 } */
N_D = 5E6;       /* we want this many points in D */
a = -1; b = 1;   /* Bounding rectangle, R:  */
c = -1; d = 1;   /* R = [a,b] x [c,d]       */

/* generate points inside R. Generate enough points (N_R) in R
   so that approximately N_D are actually in D */
pi = constant('pi');
area_Rect = (b-a)*(d-c);
area_D = pi;
N_R = ceil(N_D * area_Rect / area_D);   /* estimate how many points in R we'll need */

call randseed(1234);
X = j(N_R,2);
z = j(N_R,1);
call randgen(z, "uniform", a, b);
X[,1] = z;
call randgen(z, "uniform", c, d);
X[,2] = z;

/* which points in the bounding rectangle are in D? */
b = (X[,##] <= 1);   /* x^2+y^2 <= 1 */
X = X[ loc(b), ];    /* these points are in D */
print N_D[L="Target N_D" F=comma10.] (nrow(X))[L="Actual N_D" F=comma10.];
The table shows the result. The program generated 6.3 million points in the rectangle, R. Of these, 4.998 million were in the unit disk, D. This is very close to the desired number of points in D, which was 5 million.
Each row of the matrix, X, contains a point in the disk, D. The points are a random sample from the uniform distribution on D. Therefore, you can estimate the integral by calculating the mean value of the function on these points:
/* (2) Monte Carlo approximation of a double integral over a
   non-rectangular domain. Estimate the integral of
   f(x,y) = exp(-(x**2 + y**2)) over the disk
   D = { (x,y) | x**2 + y**2 <= 1 } */
/* integrand: f(x,y) = exp(-(x**2 + y**2)) */
start func(x);
   r2 = x[,1]##2 + x[,2]##2;
   return exp(-r2);
finish;

W = func(X);
MCEst = Area_D * mean(W);

/* compare the estimate to the exact value of the integral */
e = constant('e');
Exact = 2*pi*(e-1)/(2*e);
Diff = Exact - MCEst;
print Exact MCEst Diff;
The computation shows that a Monte Carlo estimate of the integral over the unit disk is very close to the exact value. For this random sample, the estimate is within 0.00006 units of the true value.
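As a cross-check, both steps (acceptance-rejection sampling of the disk, then the Monte Carlo estimate) can be sketched in Python with NumPy. The sample size is smaller than in the SAS program to keep it fast, and the names are illustrative:

```python
import numpy as np

# Step 1: acceptance-rejection sampling of the unit disk D.
# Generate N_R ~ N_D * Area(R)/Area(D) points in the bounding square
# R = [-1,1] x [-1,1], so that approximately N_D of them land inside D.
rng = np.random.default_rng(1234)
N_D = 1_000_000                     # desired number of points in D
area_R, area_D = 4.0, np.pi         # Area(R) = 2*2, Area(D) = pi*1^2
N_R = int(np.ceil(N_D * area_R / area_D))

x = rng.uniform(-1, 1, N_R)
y = rng.uniform(-1, 1, N_R)
inside = x**2 + y**2 <= 1           # accept only points in the unit disk
x, y = x[inside], y[inside]

# Step 2: Monte Carlo estimate of the integral of exp(-(x^2+y^2)) over D
w = np.exp(-(x**2 + y**2))
mc_est = area_D * w.mean()

exact = np.pi * (1 - np.exp(-1))    # = 2*pi*(e-1)/(2e) ~ 1.9858653
print(len(x), mc_est, exact - mc_est)
```

The number of accepted points is close to N_D, and the estimate is close to the exact value, just as in the SAS program.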
This article shows how to use SAS to compute Monte Carlo estimates of double integrals on a planar region. The first example shows a Monte Carlo integration over a rectangular domain. The second example shows a Monte Carlo integration over a non-rectangular domain. For a non-rectangular domain, the integration requires two steps: first, use an acceptance-rejection technique to generate points uniformly at random in the domain; then use Monte Carlo to estimate the integral.
The technique in this article generalizes to integration on higher-dimensional domains.
The post Sample size for the Monte Carlo estimate of an integral appeared first on The DO Loop.
A previous article shows how to use Monte Carlo simulation to estimate a one-dimensional integral on a finite interval.
A larger random sample will (on average) result in an estimate that is closer to the true value of the integral than a smaller sample. This article shows how you can determine a sample size so that the Monte Carlo estimate is within a specified distance from the true value, with high probability.
This article is inspired by and uses some of the notation from Neal, D. (1983) “Determining Sample Sizes for Monte Carlo Integration,” The College Mathematics Journal.
As shown in the previous article, a Monte Carlo estimate of an integral of g(x) on the interval (a,b) requires three steps:
1. Generate a uniform random sample {x_1, x_2, …, x_n} on (a,b).
2. Evaluate the integrand at each point to form Y = {g(x_1), g(x_2), …, g(x_n)} and compute the sample mean, m_n.
3. Estimate the integral as (b-a)*m_n.
The goal of this article is to choose n large enough so that, with probability at least β, the Monte Carlo estimate of the integral is within δ of the true value.
The mathematical derivation is at the end of this article. The result (Neal, 1983, p. 257) is that you should choose
\(n > \left( \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)s_{Y}}{\delta} \right)^2\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(s_{Y}\) is an estimate of the standard deviation of Y.
Let’s apply the formula to see how large a sample size we need to estimate the integral
\(\int\nolimits_{a}^{b} g(x) \, dx\)
to three decimal places (δ=5E-4) for the function
g(x) = x^{α-1} exp(-x), where α=4 and the interval of integration is (1, 3.5). The function is shown at the top of this article.
The estimate requires knowing an estimate for the standard deviation of Y=g(X), where X ~ U(a,b).
You can use a small “pilot study” to obtain the standard deviation, as shown in the following SAS/IML program. You could also use the DATA step and PROC MEANS for this computation.
%let alpha = 4;   /* shape parameter */
%let a = 1;       /* lower limit of integration */
%let b = 3.5;     /* upper limit of integration */

proc iml;
/* define the integrand */
start Func(x) global(shape);
   return x##(shape - 1) # exp(-x);
finish;

call randseed(1234);
shape = &alpha;  a = &a;  b = &b;

/* Small "pilot study" to estimate s_Y = stddev(Y) */
N = 1e4;                            /* small sample */
X = j(N,1);
call randgen(x, "Uniform", a, b);   /* X ~ U(a,b) */
Y = func(X);                        /* Y = g(X) */
s = std(Y);                         /* estimate of std dev(Y) */
For this problem, the estimate of the standard deviation of Y is about 0.3. For β=0.95, the value of \(\Phi^{-1}((\beta+1)/2) \approx 1.96\), but the following program implements the general formula for any probability, β:
/* find n so that MC est is within delta of true value with 95% prob */
beta = 0.95;
delta = 5e-4;
k = quantile("Normal", (beta+1)/2) * (b-a) * s;
sqrtN = k / delta;
roundN = round(sqrtN**2, 1000);   /* round to nearest 1000 */
print beta delta roundN[F=comma10.];
With 95% probability, if you use a sample size of n=8,765,000, the Monte Carlo estimate will be within δ=0.0005 units of the true value of the integral. Wow! That’s a larger value than I expected!
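The sample-size formula itself needs only a normal quantile, so it can be verified with the Python standard library. The value s_Y = 0.3 below is a stand-in for the pilot-study estimate (the SAS run used a slightly different pilot value, which is why its n differs a little):

```python
from math import ceil
from statistics import NormalDist

# n > ( Phi^{-1}((beta+1)/2) * (b-a) * s_Y / delta )^2
beta = 0.95          # desired probability
delta = 5e-4         # desired accuracy (three decimal places)
a, b = 1.0, 3.5      # interval of integration
s_Y = 0.3            # std dev of Y = g(X) from a pilot study (illustrative value)

z = NormalDist().inv_cdf((beta + 1) / 2)      # Phi^{-1}(0.975) ~ 1.96
n = ceil((z * (b - a) * s_Y / delta) ** 2)    # required sample size
print(z, n)                                   # n is on the order of 8-9 million
```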
I used n=1E6 in the previous article and reported that the difference between the Monte Carlo approximation and the true value of the integral was less than 0.0002. So it is possible to be close to the true value by using a smaller value of n. In fact, the graph to the right (from the previous article) shows that for n=400,000 and n=750,000, the Monte Carlo estimate is very close for the specific random number seed that I used. But the formula in this section provides the sample size that you need to (probably!) be within ±0.0005 REGARDLESS of the random number seed.
I like to use SAS to check my math. The Monte Carlo estimate of the integral of g will vary according to the random sample. The math in the previous section states that if you generate k random samples of size n=8,765,000, that (on average) about 0.95*k of the sample will be within δ=0.0005 units of the true value of the integral, which is 2.666275. The following SAS/IML program generates k=200 random samples of size n and k estimates of the integral. We expect about 190 estimates to be within δ units of the true value and only about 10 to be farther away:
N = roundN;
X = j(N,1);
k = 200;
Est = j(k,1);
do i = 1 to k;
   call randgen(X, "Uniform", a, b);   /* x[i] ~ U(a,b) */
   Y = func(X);                        /* Y = f(X) */
   f_avg = mean(Y);                    /* estimate E(Y) */
   Est[i] = (b-a)*f_avg;               /* estimate integral */
end;

call quad(Exact, "Func", a||b);        /* find the "exact" value of the integral */
Diff = Est - Exact;

title "Difference Between Monte Carlo Estimate and True Value";
title2 "k=200 Estimates; n=8,765,000";
call Histogram(Diff) other="refline -0.0005 0.0005 / axis=x;";
The histogram shows the difference between the exact integral and the estimates. The vertical reference lines are at ±0.0005. As predicted, most of the estimates are less than 0.0005 units from the exact value. What percentage? Let’s find out:
/* how many estimates are within and NOT within delta? */
Close = ncol(loc(abs(Diff)<=delta));      /* number that are closer than delta to true value */
NotClose = ncol(loc(abs(Diff)> delta));   /* number that are farther than delta */
PctClose = Close / k;                     /* percent close to true value */
PctNotClose = NotClose / k;               /* percent not close */
print k Close NotClose PctClose PctNotClose;
For this set of 200 random samples, exactly 95% of the estimates were accurate to within 0.0005. Usually, a simulation of 200 estimates would show that between 93% and 97% of the estimates are close. In this case, the answer was exactly 95%, but don’t expect that to happen always!
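A smaller-scale version of this simulation runs in seconds in Python. Relaxing the accuracy to δ = 0.01 shrinks the required n to roughly 22,000, so the empirical coverage can be checked quickly; the integrand, interval, and pilot value s_Y = 0.3 match the example above:

```python
import numpy as np

# g(x) = x^(alpha-1) * exp(-x) with alpha = 4, integrated over (a,b) = (1, 3.5)
rng = np.random.default_rng(1)
a, b = 1.0, 3.5
delta = 0.01                    # looser accuracy than in the text, for speed
s_Y = 0.3                       # pilot estimate of std dev of Y = g(X)
z = 1.959964                    # Phi^{-1}(0.975)
n = int(np.ceil((z * (b - a) * s_Y / delta) ** 2))   # required sample size

# an antiderivative of x^3 e^{-x} is -e^{-x} (x^3 + 3x^2 + 6x + 6)
F = lambda t: -np.exp(-t) * (t**3 + 3*t**2 + 6*t + 6)
exact = F(b) - F(a)             # ~ 2.666275, matching the value in the text

k = 200
est = np.empty(k)
for i in range(k):
    x = rng.uniform(a, b, n)                  # x ~ U(a,b)
    est[i] = (b - a) * np.mean(x**3 * np.exp(-x))

coverage = np.mean(np.abs(est - exact) <= delta)
print(n, exact, coverage)       # coverage should be near 0.95
```

The fraction of estimates within δ of the exact value should be close to the nominal 95%, just as in the SAS simulation.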
This article shows how you can use elementary statistics to find a sample size that is large enough so that a Monte Carlo estimate is (probably!) within a certain distance of the exact value. The article was inspired by an article by Neal, D. (1983). The mathematical derivation of the result is provided in the next section.
This section derives the formula for choosing the sample size, n.
Because the Monte Carlo estimate is a mean, you can use elementary probability theory to find n.
Let {x_{1}, x_{2}, …, x_{n}} be uniform random variates on (a, b). Let Y be the vector {g(x_{1}), g(x_{2}), …, g(x_{n})}.
Let \(m_{n} = \frac{1}{n} \sum\nolimits_{i = 1}^{n} g(x_{i})\) be the mean of Y and let \(\mu = \frac{1}{b-a} \int\nolimits_{a}^{b} g(x) \, dx\) be the average value of g on the interval (a,b).
We know that \(m_n \to \mu\) as \(n \to \infty\). For any probability 0 < β < 1,
we want to find n large enough so that
\(P\left( | m_n - \mu | < \frac{\delta}{(b-a)} \right) \geq \beta\)
From the central limit theorem, we can substitute the standard normal random variable, Z, inside the parentheses by dividing by the standard error of the mean, which is σ_{Y}/sqrt(n):
\(P\left( | Z | < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) \geq \beta\)
Equivalently,
\(2 P\left( Z < \frac{\delta \sqrt{n}}{(b-a)\sigma_{Y}} \right) - 1 \geq \beta\)
Solving the equation for sqrt(n) gives
\(\sqrt{n} > \Phi^{-1}\left( \frac{\beta+1}{2} \right) \frac{(b-a)\sigma_{Y}}{\delta}\)
Squaring both sides leads to the desired bound on n. Because we do not know the true standard deviation of Y, we substitute a sample statistic, s_{Y}.
To be and not to be – the uncertainty principle in SAS was published on SAS Users.
If I were to say that we live in uncertain times, that would probably be an understatement. Therefore, I won’t say that. Oops, I already did. Or did I?
For centuries, people around the world have been busy scratching their heads in search of a meaningful answer to Shakespeare’s profoundly elementary question: “To be or not to be?”
Have we succeeded? Sure. And in pursuit of even further greatness, we have progressed beyond the simple binary choice. Thanks to human ingenuity, it is now possible to have it all: to be and not to be.
But doesn’t this contradict human logic? Not at all, according to the Heisenberg uncertainty principle – a cornerstone of quantum mechanics asserting a fundamental limit to the certainty of knowledge.
According to the uncertainty principle, it is not possible to determine both the momentum and position of particles (bosons, electrons, quarks, etc.) simultaneously. Here is the famous formula:
Δx · Δp ≥ h / (4π)
where
Δx = uncertainty in position.
Δp = uncertainty in momentum.
h = Planck’s constant (a rare and precious number equal to 6.62607015×10^{−34} representing how much the energy of a photon increases, when the frequency of its electromagnetic wave increases by 1).
4π = π π π π (4 pi’s; no mathematical formula of any scientific significance can do without at least one of them!)
In addition, every particle or quantum entity may be defined as either a particle or a wave depending on how you feel about it according to the wave-particle duality principle. But let’s not let the dual meaning inconvenience us. Let’s just call them matters, or things for simplicity.
Then we can formulate the uncertainty principle in plain and clear terms:
Since it is impossible to know whether the position of a thing is X or not X, then that thing can be in position X and not be in position X simultaneously. Thus “to be and not to be”.
Capeesh?
There is an abundance of examples of the uncertainty principle in SAS software. Let’s consider several of them.
Some of you may remember SAS version 7.0. It’s remarkable in a way that it was the shortest-lived SAS version that lasted roughly one year. It was released in October 1998 and was replaced by SAS 8.0 in November 1999. There were no 7.1 or 7.2 sub-versions, only 7.0.
But (and this is a big BUT), have you noticed that even today the latest SAS products (9.4 and Viya) use the following version 7 file extensions?
.sas7bdat (data set)
.sas7bcat (catalog)
.sas7bndx (index)
.sas7bvew (data set view)
. . . and this is just a partial list.
When you define a SAS library with v9 engine
libname AAA v9 'c:\temp';
SAS log will indicate:
NOTE: Libref AAA was successfully assigned as follows:
Engine: V9
Physical Name: c:\temp
Notice how it’s SAS Engine V9, but SAS datasets created with it have .sas7bdat extensions.
Where do you think that digit “7” came from? Obviously, even almost two decades after version 7.0’s demise it is still alive and kicking. How can you explain that other than by the uncertainty principle: “it is while it is not”!
Let’s take another example. How long have you known the fact that in order to create a permanent SAS data set you need to specify its name as a two-level name, e.g. LIBREF.DATASETNAME, while for temporary data sets you can specify a one-level name, e.g. DATASETNAME, or you can use a two-level name where the first level is WORK to explicitly signify the temporary library. Now, equipped with that “settled science” knowledge, what do you think the following code will create, a temporary or a permanent data set?
options user='c:\temp';
data MYDATA;
   x = 22371;
run;
Just run this code and check your c:\temp folder to make sure that data set MYDATA is permanent. Credit for this shortcut goes to the USER= system option. Now we can say that to create a permanent data set we can use either a two-level name or a one-level name, which makes it indistinguishable from a temporary data set.
To bring this uncertainty to an even higher level, you can drop MYDATA name altogether and still create a permanent data set:
options user='c:\temp';
data;
   x = 22371;
run;
SAS Log will show:
NOTE: The data set USER.DATA1 has 1 observations and 1 variables.
Isn’t this the ultimate proof of the “to be and not to be” principle (sponsored by the DATAn naming convention)?
In addition, you can create a data set by defining its physical pathname without even relying on SAS data set names, whether one or two-level:
data "c:\temp\aaa";
   x = 22371;
   format x date9.;
run;
This code runs perfectly fine, creating a SAS data set as a file named aaa.sas7bdat in the c:\temp folder.
And I am not even talking about the NOWORKTERM option (well, I am now) which preserves all the SAS files and directory of the temporary WORK library at the termination of a SAS session, which essentially makes temporary SAS files permanent.
As you can see even “well settled science” crumbles right in front of your eyes under the certainty of the uncertainty principle.
And now, ladies and gentlemen, you will have to pass your final exam to receive an official April Fools diploma from SAS University.
You know that every SAS data step creates automatic variables, _N_ and _ERROR_, which are available during the data step execution. Is it possible to save those automatic variables on the output data set?
In other words, will the following code create 3 variables on the output data set ABC?
data ABC (keep=MODEL _N_ _ERROR_);
   set SASHELP.CARS(keep=MODEL);
run;
If you answered “No” you get 1 credit. If you answered “Yes” you get 0 credit. But that’s only if you answered the second question (I assume you noticed that I asked two questions in a row). If your “Yes”/ ”No” answer relates to the first question your credits are in reverse.
However, if you not only answered “Yes” to the first question, but also provided a “how-to” code example, you get a bonus in the amount of 10 credits. Here is your bonus for creativity:
data BBC;
   set SASHELP.CARS(keep=MODEL);
   x = _n_;
   e = _error_;
   rename x=_n_ e=_error_;
run;
You still have to run this code to make sure it creates data set BBC with 3 variables: MODEL, _N_, and _ERROR_ in order to get your 10 credits vested.
And lastly, the final curiosity test and exercise where you find out about SAS’ no-nonsense solution in the face of uncertainty. What happens in the following data step when the SAS-created automatic data step variables, _N_ and _ERROR_, collide with the same-name variables brought in by the previously created BBC data set?
data CBC;
   set BBC;
run;
After you complete this test/exercise and find out the answer, you can grab your diploma below and proudly brag about it and display it anywhere.
WAIT! Before you leave, please do not forget to provide your answers, questions, code examples, and comments below.
April 1, 2020: Theory of relativity in SAS programming
April 1, 2019: Dividing by zero with SAS
April 1, 2018: SAS discovers a new planet in the Solar System
April 1, 2017: SAS code to prove Fermat’s Last Theorem