This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
Do a quick search on “data scientist” on any of the popular job boards and there’s no denying the global shortage of data scientists is a real one. And, whether you’re looking for the salary commensurate with the prestigious title, a fast pass to the C-suite, or simply want to […]
The post SAS Academy for Data Science creates top-rate analytical professionals appeared first on SAS Learning Post.
You might have seen a SAS Global Forum infographic floating around the web. And maybe you wondered how you might create something similar using SAS software? If so, then this blog’s for you – I have created my own version of the infographic using SAS/Graph, and I’ll show you how […]
The post Building a SAS Global Forum infographic appeared first on SAS Learning Post.
The post The difference between CLASS statements and BY statements in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements.
A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS procedures.
The BY statement and the CLASS statement in SAS both enable you to specify one or more categorical variables whose levels define subgroups of the data. (For simplicity, we consider only a single categorical variable.) The primary difference is that the BY statement computes many analyses, each on a subset of the data, whereas the CLASS statement computes a single analysis of all the data. Specifically, a BY-group analysis produces a separate set of tables and graphs for each subgroup, whereas a CLASS-statement analysis produces a single set of output that covers all subgroups.
To illustrate the differences between an analysis that uses a BY statement and one that uses a CLASS statement, let’s create a subset (called Cars) of the Sashelp.Cars data. The levels of the Origin variable indicate whether a vehicle is manufactured in “Asia”, “Europe”, or the “USA”.
For efficiency reasons, most classical SAS procedures require that you sort the data when you use a BY statement.
Therefore, a call to PROC SORT creates a sorted version of the data called CarsSorted, which will be used for the BY-group analyses.
data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid';
run;

proc sort data=Cars out=CarsSorted;
   by Origin;
run;
When you generate descriptive statistics for groups of data, the univariate statistics are identical whether you use a CLASS statement or a BY statement.
What changes is the way that the statistics are displayed.
When you use the CLASS statement, you get one table that contains all statistics or one graph that shows the distribution of each subgroup.
However, when you use the BY statement you get multiple tables and graphs.
The following statements use the CLASS statement to produce descriptive statistics. PROC UNIVARIATE displays one (paneled) graph that shows a comparative histogram for the vehicles that are made in Asia, Europe, and USA. PROC MEANS displays one table that contains descriptive statistics:
proc univariate data=Cars;
   class Origin;
   var Horsepower;
   histogram Horsepower / nrows=3;   /* must use NROWS= to get panel */
   ods select histogram;
run;

proc means data=Cars N Mean Std;
   class Origin;
   var Horsepower Weight Mpg_Highway;
run;
In contrast, if you run a BY-group analysis on the levels of the Origin variable,
you will see three times as many tables and graphs. Each analysis is preceded by a label that identifies each BY group. Notice that the BY-group analysis uses the sorted data.
proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower Weight Mpg_Highway;
run;
Always remember that the output from a BY statement is equivalent to the output from running the procedure multiple times on subsets of the data. For example, the previous statistics could also be generated by calling PROC MEANS three times, each call with a different WHERE clause, as follows:
proc means N Mean Std data=CarsSorted( where=(origin='Asia') );
   var Horsepower Weight Mpg_Highway;
run;

proc means N Mean Std data=CarsSorted( where=(origin='Europe') );
   var Horsepower Weight Mpg_Highway;
run;

proc means N Mean Std data=CarsSorted( where=(origin='USA') );
   var Horsepower Weight Mpg_Highway;
run;
In fact, if you ever find yourself repeating an analysis many times (perhaps by using a macro loop), you should consider whether you can rewrite your program to be more efficient by using a BY statement.
As a general rule, you should use a CLASS statement when you want to compare or contrast groups. For example, the following call to PROC GLM performs an ANOVA analysis on the horsepower (response variable) for the three groups defined by the Origin variable. The procedure automatically creates a graph that displays three boxplots, one for each group. The procedure also computes parameter estimates for the levels of the CLASS variable (not shown).
proc glm data=Cars;   /* by default, create graph with side-by-side boxplots */
   class Origin;
   model Horsepower = Origin / solution;
run;
You can specify multiple variables on the CLASS statement to include multiple categorical variables in a model. Any variables that are not listed on the CLASS statement are assumed to be continuous. Thus the following call to PROC GLM analyzes a model that has one continuous and one classification variable. The procedure automatically produces a graph that overlays the three regression curves on the data:
ods graphics / antialias=on;
title "CLASS Variable Regression: One Model with Multiple Parameters";
proc GLM data=Cars plots=FitPlot;
   class Origin;
   model Horsepower = Origin | Weight / solution;
   ods select ParameterEstimates ANCOVAPlot;
quit;
In contrast, if you use a BY statement, the Origin variable cannot be part of the model but is used only to subset the data. If you use a BY statement, you obtain three different models of the form
Horsepower = Weight. You get three parameter estimates tables and three graphs, each showing one regression line overlaid on a subset of the data.
When you use a BY statement and fit three models of the form Horsepower = Weight, the procedure fits a total of six parameters. Notice that when you use the CLASS statement and fit the model
Horsepower = Origin | Weight, you also fit six free parameters. It turns out that these two methods produce the same predicted values. In fact, you can combine the parameter estimates (for the GLM parameterization) for the CLASS model to obtain the parameter estimates from the BY-variable analysis, as shown below. Each parameter estimate for the BY-variable models is obtained as the sum of two estimates for the CLASS-variable analysis:
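As a sketch of that correspondence (assuming the GLM parameterization, in which the last level of Origin is the reference level and its estimates are zero), write the BY-group model for group g as Horsepower = a_g + b_g * Weight. Then

\[
a_g = \hat{\beta}_{\mathrm{Intercept}} + \hat{\beta}_{\mathrm{Origin}=g},
\qquad
b_g = \hat{\beta}_{\mathrm{Weight}} + \hat{\beta}_{\mathrm{Origin}=g\,\times\,\mathrm{Weight}}
\]

so each BY-group intercept and slope is the sum of a main-effect estimate and the corresponding group-specific estimate from the CLASS model.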
For many regression models, the predicted values for the BY-variable analyses are the same as for a particular model that uses a CLASS variable. As shown above, you can even see how the parameters are related when you use a GLM or reference parameterization.
However, the CLASS variable formulation can fit models (such as the equal-slope model
Horsepower = Origin Weight) that are not available when you use a BY variable to fit three separate models. Furthermore, the CLASS statement provides parameter estimates so that you can see the effect of the groups on the response variable. It is more difficult to compare the models that are produced by using the BY statement.
Some SAS procedures use other syntax to analyze groups. In particular, the SGPLOT procedure calls classification variables “group variables.” If you want to overlay graphs for multiple groups, you can use the GROUP= option on many SGPLOT statements. (Some statements support the CATEGORY= option, which is similar.)
For example, to replicate the two-variable regression analysis from PROC GLM, you can use the following statements in PROC SGPLOT:
proc sgplot data=Cars;
   reg y=Horsepower x=Weight / group=Origin;   /* Horsepower = Origin | Weight */
run;
In summary, use the BY statement in SAS procedures when you want to repeat an analysis for every level of one or more categorical variables. The variables define the subsets but are not otherwise part of the analysis. In classical SAS procedures, the data must be sorted by the BY variables. A BY-group analysis can produce many tables and graphs, so you might want to suppress the ODS output and write the results to a SAS data set.
Use the CLASS statement when you want to include a categorical variable in a model. A CLASS statement often enables you to compare or contrast subgroups. For example, in regression models you can evaluate the relative effect of each level on the response variable.
In some cases, the BY statement and the CLASS statement produce identical statistics. However, the CLASS statement enables you to fit a wider variety of models.
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post.
Sometimes I forget whether I’ve added our internal site root and intermediate CA certificates to the Trusted CA Bundle that SAS® Software applications use. Sometimes I also forget the command I can use to find out whether I did! As is often the case with my blog posts, by jotting things down here, I … Continue reading “Did I add that CA Certificate to the SAS Trusted CA Bundle?”
This post was kindly contributed by SAS – r4stats.com - go there to comment and to read the full post.
jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.
The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally, since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.
Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).
jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.
Alternatively, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If the string was a typo, deleting it converts the column immediately back to numeric. I missed some features, such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.
To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.
The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:
You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:
jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which (oddly enough) adds a grid at the major tick-points; SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.
At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.
The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.
jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections. However, the settings in the dialog boxes are not currently saved from session to session. Since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).
While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data), it would be a powerful and nearly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.
As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.
Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.
Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.
If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions looks like this:
> t.test(pretest ~ gender, data = mydata100)

        Welch Two Sample t-test

data:  pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931  1.403879
sample estimates:
mean in group Female   mean in group Male
            74.60417             75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

        Wilcoxon rank sum test with continuity correction

data:  pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

        Welch Two Sample t-test

data:  posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939  1.853119
sample estimates:
mean in group Female   mean in group Male
            81.66667             82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

        Wilcoxon rank sum test with continuity correction

data:  posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0
While the same comparison using the jamovi GUI, or its jmv package, would look like this:
Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:
library("jmv")

ttestIS(
    data = mydata100,
    vars = c("pretest", "posttest"),
    group = "gender",
    mann = TRUE,
    meanDiff = TRUE)
In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.
jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.
jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.
As we have seen, jamovi’s approach has simplified its menus and its R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given the developers’ overall goal.
Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.
A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.
Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.
Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).
Though jamovi is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready,” and a number of universities are already using it in their undergraduate statistics programs.
In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combines the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!
Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.
The post Merged legends: Overlay a symbol and line in a legend item appeared first on The DO Loop.
Did you know that SAS can combine or “merge” a symbol and a line pattern into a single legend item, as shown below? This kind of legend is useful when you are overlaying a group of curves onto a scatter plot. It enables the reader to quickly associate values of a categorical variable with colors, line patterns, and marker symbols in a plot.
When you use PROC SGPLOT and the GROUP= option to create a graph, the SGPLOT procedure automatically displays the group attributes (such as symbol, color, and line pattern) in a legend. If you overlay multiple plot types (such as a series plot on a scatter plot), the default behavior is to create a legend for the first plot statement. You can use the KEYLEGEND statement to control which plots contribute to the legend. In the following example, the KEYLEGEND statement creates a legend that shows the attributes for the scatter plot (the marker shapes) and also for the series plot (line patterns):
data ScatCurve;   /* example data: scatter plot and curves */
   call streaminit(1);
   do Group = 1 to 2;
      do x = 0 to 5 by 0.2;
         Pred = Group + (1/Group)*x - (0.2*x)**2;
         y = Pred + rand("Normal", 0, 0.5);
         output;
      end;
   end;
run;

ods graphics / antialias=on subpixel=on;
title "Legend Not Merged";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group name="obs" markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group name="curves";
   keylegend "obs" "curves" / title="Group";   /* separate items for markers and lines */
run;
The legend contains all the relevant information about symbols, colors, and line patterns, but it is longer than it needs to be.
When you display curves and markers for the same groups, you can obtain a more compact representation by merging the symbols and line patterns into a single legend, as shown in the next sections.
If you are comfortable with the Graph Template Language (GTL) in SAS, then you can use the MERGEDLEGEND statement to create a merged legend. The following statements create a graph template that overlays a scatter plot and a series plot and creates a merged legend in which each item contains both a symbol and a line pattern:
proc template;
define statgraph ScatCurveLegend;
   dynamic _X _Y _Curve _Group _Title;   /* dynamic variables */
   begingraph;
      entrytitle _Title;
      layout overlay;
         scatterplot x=_X y=_Y / group=_Group name="obs" markerattrs=(symbol=CircleFilled);
         seriesplot x=_X y=_Curve / group=_Group name="curves";
         /* specify exactly two names for the MERGEDLEGEND statement */
         mergedlegend "obs" "curves" / border=true title=_Group;
      endlayout;
   endgraph;
end;
run;

proc sgrender data=ScatCurve template=ScatCurveLegend;
   dynamic _Title="Overlay Curves, Merged Legend" _X="x" _Y="y" _Curve="Pred" _Group="Group";
run;
The graph is the same as before, but the legend is different.
The MERGEDLEGEND statement is used in many graphs that are created by SAS regression procedures. As you can see, the legend shows the symbol and line pattern for each group.
Unfortunately, the MERGEDLEGEND statement is not supported by PROC SGPLOT. However, SAS 9.4M5 supports the LEGENDITEM statement in PROC SGPLOT. The LEGENDITEM statement enables you to construct a custom legend and gives you complete control over every item in a legend. The following example constructs a legend that uses the symbols, line patterns, and group values that are present in the graph. Notice that you have to specify these attributes manually, as shown in the following example:
title "Merged Legend by Using LEGENDITEM in SGPLOT";
proc sgplot data=ScatCurve;
   scatter x=x y=y / group=Group markerattrs=(symbol=CircleFilled);
   series x=x y=Pred / group=Group;
   legenditem type=markerline name="I1" / label="1"
      lineattrs=GraphData1 markerattrs=GraphData1(symbol=CircleFilled);
   legenditem type=markerline name="I2" / label="2"
      lineattrs=GraphData2 markerattrs=GraphData2(symbol=CircleFilled);
   keylegend "I1" "I2" / title="Group";
run;
The graph and legend are identical to the previous graph and are not shown.
The advantage of the LEGENDITEM statement is that you can lay out the legend however you choose; the legend is not tied to the attributes of any previous graph component.
However, this is also a disadvantage. If you change the marker attributes in the SCATTER statement, the legend will not reflect that change until you manually modify each LEGENDITEM statement. Although there is no denying the power of the LEGENDITEM statement, the MERGEDLEGEND statement in the GTL always faithfully and automatically reflects the attributes in the SCATTERPLOT and SERIESPLOT statements.
In summary, the SG procedures in SAS automatically create a legend. When you overlay multiple plots, you can use the KEYLEGEND statement to control which plots contribute to the legend. However, it is also possible to merge the symbols and line patterns into a single compact legend. In the GTL, you can use the MERGEDLEGEND statement. In SAS 9.4M5, PROC SGPLOT supports the LEGENDITEM statement to customize the items that appear in a legend.
Want to see my newly minted certified professional badge? Scroll down to take a peek. Yes, I managed to successfully complete the Base SAS Programmer certification exam… with, ahem, flying colors I might add. Here are my tips to tackle the Base SAS certification exam: 1. Get clear on the […]
The post Demystifying certification (Part 3): To the finish appeared first on SAS Learning Post.
The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.
If N random people are in a room, the classical birthday problem provides the probability that at least
two people share a birthday. The birthday problem does not consider how many birthdays are in common.
However, a generalization (sometimes called the Multiple-Birthday Problem) examines the distribution of the number of shared birthdays. Specifically, among N people, what is the probability that exactly k birthdays are shared (k = 1, 2, 3, …, floor(N/2))?
The bar chart at the right shows the distribution for N=23. The heights of the bars indicate the probability of 0 shared birthdays, 1 shared birthday, and so on.
This article uses simulation in SAS to examine the multiple-birthday problem.
If you are interested in a theoretical treatment, see
Fisher, Funk, and Sams (2013, p. 5-14).
You can explore the classical birthday problem by using probability theory or you can estimate the probabilities by using Monte Carlo simulation. The simulation in this article assumes 365 equally distributed birthdays, but see my previous article to extend the simulation to include leap days or a nonuniform distribution of birthdays.
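For reference, under these assumptions (365 equally likely birthdays, no leap day), the classical probability that at least two of N people share a birthday has a closed form:

\[
P(N) = 1 - \frac{365!}{(365-N)!\,365^{N}} = 1 - \prod_{i=0}^{N-1}\frac{365-i}{365}
\]

which gives P(23) ≈ 0.5073, the value that a simulation estimate should approach.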
The simulation-based approach enables you to investigate the multiple birthday problem.
To begin, consider a room that contains N=23 people.
The following SAS/IML statements are taken from my previous article. The Monte Carlo simulation generates one million rooms of size 23 and estimates the proportion of rooms in which the number of shared birthdays is 0, 1, 2, and so forth.
proc iml;
/* Function that simulates B rooms, each with N people, and counts the
   number of shared (matching) birthdays in each room. The return value
   is a row vector that has B counts. */
start BirthdaySim(N, B);
   bday = Sample(1:365, B||N);             /* each column is a room */
   match = N - countunique(bday, "col");   /* N minus the number of unique birthdays in each col */
   return ( match );
finish;

call randseed(12345);   /* set seed for random number stream */
NumPeople = 23;
NumRooms = 1e6;
match = BirthdaySim(NumPeople, NumRooms);   /* 1e6 counts */
call tabulate(k, freq, match);   /* summarize: k=number of shared birthdays, freq=counts */
prob = freq / NumRooms;          /* proportion of rooms that have 0, 1, 2,... shared birthdays */
print prob[F=BEST6. C=(char(k,2)) L="Estimate of Probability (N=23)"];
The output summarizes the results.
In less than half the rooms, no person shared a birthday with anyone else.
In about 36% of the rooms, one birthday was shared by two or more people. In about 12% of the rooms, two birthdays were shared among four or more people. About 2% of the rooms had three birthdays shared among six or more individuals, and so forth. In theory, a room could contain 8, 9, 10, or even 11 shared birthdays, but the probability of these events is very small. The probability estimates are plotted in the bar chart at the top of this article.
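If you want to cross-check these estimates outside of SAS, the same Monte Carlo scheme is straightforward in Python with NumPy. The sketch below is my addition (it mirrors the logic of the IML function, with fewer rooms to keep it fast):

```python
import numpy as np

rng = np.random.default_rng(12345)

def birthday_sim(n_people, n_rooms, days=365):
    # Each row is one room of n_people random birthdays (1..days)
    bdays = rng.integers(1, days + 1, size=(n_rooms, n_people))
    # Count distinct birthdays per room by sorting and counting value changes
    srt = np.sort(bdays, axis=1)
    n_unique = 1 + (np.diff(srt, axis=1) != 0).sum(axis=1)
    # "Matches" = n_people minus the number of distinct birthdays,
    # the same quantity the IML function returns
    return n_people - n_unique

match = birthday_sim(23, 100_000)
probs = np.bincount(match) / match.size   # estimated P(0 matches), P(1 match), ...
```

With 100,000 rooms the estimates agree with the SAS results to within a few tenths of a percent.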
There is a second way to represent this probability distribution: you can create a stacked bar chart that shows the cumulative probability of k or fewer shared birthdays for k = 0, 1, 2, ….
You can see that the probability of 0 matching birthdays is less than 50%, the probability of 1 or fewer matches is 87%, the probability of 2 or fewer matches is 97%, and so forth. The probabilities for 5, 6, or 7 matches are not easily determined from either chart.
Mathematically speaking, the information in the stacked bar chart is equivalent to the information in the regular bar chart. The regular bar chart shows the PMF (probability mass function) for the distribution whereas the stacked chart shows the CDF (cumulative distribution function).
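The conversion between the two views is just a running sum of the PMF. As an illustration (with made-up PMF values, not the post's actual estimates):

```python
import numpy as np

# Hypothetical PMF over the number of shared birthdays; entries sum to 1
pmf = np.array([0.49, 0.36, 0.12, 0.02, 0.01])
cdf = np.cumsum(pmf)   # cdf[k] = P(number of shared birthdays <= k)
```

The stacked bar chart plots the bands between successive values of `cdf`, so the band heights recover the original PMF.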
The previous section showed the distribution of shared birthdays for N=23.
You can download the SAS program that simulates the multiple birthday problem for rooms that contain N=2, 3, …, and 60 people. For each simulation, the program computes estimates for the probabilities of k matching birthdays, where k ranges from 0 to 30.
There are a few ways to visualize those 59 distributions. One way is to plot 59 bar charts, either in a panel or by using an animation.
A panel of three bar charts is shown to the right for rooms of size N=25, 40, and 60.
You can see that the bar chart for N=25 is similar to the one shown earlier. For N=40, most rooms have between one and three shared birthdays. For rooms with N=60 people, most rooms have between three and six shared birthdays. The horizontal axes are truncated at k=12, even though a few simulations generated rooms that contain more than 12 matches.
Obviously, plotting all 59 bar charts would require a lot of space. A different way to visualize these 59 distributions is to place 59 stacked bar charts side by side. This visualization is more compact.
In fact, instead of 59 stacked bar charts, you might want to create a stacked band plot, which is what I show below.
The following stacked band chart is an attempt to visualize the distribution of shared birthdays for ALL room sizes N=2, 3, …, 60. (Click to enlarge.)
How do you interpret this graph? A room that contains N=23 people corresponds to drawing a vertical line at N=23. That vertical line intersects the red, green, and brown bands near 0.5, 0.87, and 0.97. Therefore, these are the probabilities that a room with 23 people contains 0 shared birthdays, 1 or fewer shared birthdays, and 2 or fewer shared birthdays.
The vertical heights of the bands qualitatively indicate the individual probabilities.
Next consider a room that contains N=40 people. A vertical line at N=40 intersects the colored bands at 0.11, 0.37, 0.66, 0.86, and 0.95.
These are the cumulative probabilities P(k ≤ 0), P(k ≤ 1), P(k ≤ 2), and so forth.
If you need the exact values, you can use PROC PRINT to display the actual probability estimates.
For example:
proc print data=BDayProb noobs;
   where N=40 and k < 5;
run;
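If you later move the estimates into Python, the equivalent filter is a one-line pandas expression. (The data frame below is a hypothetical stand-in for BDayProb, with values loosely based on the N=40 cumulative probabilities quoted above.)

```python
import pandas as pd

# Hypothetical stand-in for the BDayProb data set (columns: N, k, prob)
bdayprob = pd.DataFrame({
    'N':    [40, 40, 40, 40, 40, 40],
    'k':    [0, 1, 2, 3, 4, 5],
    'prob': [0.11, 0.26, 0.29, 0.20, 0.09, 0.03],
})

subset = bdayprob[(bdayprob['N'] == 40) & (bdayprob['k'] < 5)]
```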
In summary, this article examines the multiple-birthday problem, which is a generalization of the classical birthday problem. If a room contains N random people, the multiple-birthday problem examines the probability that k birthdays are shared by two or more people for k = 0, 1, 2, …, floor(N/2). This article shows how to use Monte Carlo simulation in SAS to estimate the probabilities. You can visualize the results by using a panel of bar charts or by using a stacked band plot to visualize the cumulative distribution. You can
download the SAS program that simulates the multiple birthday problem, estimates the probabilities, and visualizes the results.
What do you think? Do you like the stacked band plot that summarizes the results in one graph, or do you prefer a series of traditional bar charts? Leave a comment.
The post The distribution of shared birthdays in the Birthday Problem appeared first on The DO Loop.
In an earlier blog, I asked you to participate in CertMag/GoCertify’s Annual IT Salary survey. The response was fantastic and I’m happy to report that we made the list of the Top 75 IT certifications out of more than 900. This marks the first time SAS has been part of […]
The post You know the value of your SAS Certification; does the rest of the world? appeared first on SAS Learning Post.
Good news, learners! SAS University Edition has gone back to school and learned some new tricks.
With the December 2017 update, SAS University Edition now includes the SASPy package, available in its Jupyter Notebook interface. If you’re keeping track, you know that SAS University Edition has long had support for Jupyter Notebook. With that, you can write and run SAS programs in a notebook-style environment. But until now, you could not use that Jupyter Notebook to run Python programs. With the latest update, you can — and you can use the SASPy library to drive SAS features like a Python coder.
Oh, and there’s another new trick that you’ll find in this version: you can now use SAS (and Python) to access data from HTTPS websites — that is, sites that use SSL encryption. Previous releases of SAS University Edition did not include the components that are needed to support these encrypted connections. That’s going to make downloading web data much easier, not to mention using REST APIs. I’ll show one HTTPS-enabled example in this post.
When you first access SAS University Edition in your web browser, you’ll see a colorful “Welcome” window. From here, you can (A) start SAS Studio or (B) start Jupyter Notebook. For this article, I’ll assume that you select choice (B). However, if you want to learn to use SAS and all of its capabilities, SAS Studio remains the best method for doing that in SAS University Edition.
When you start the notebook interface, you’re brought into the Jupyter Home page. To get started with Python, select New->Python 3 from the menu on the right. You’ll get a new empty Untitled notebook. I’m going to assume that you know how to work with the notebook interface and that you want to use those skills in a new way…with SAS. That is why you’re reading this, right?
pandas is the standard for Python programmers who work with data. The pandas module is included in SAS University Edition — you can use it to read and manipulate data frames (which you can think of as tables). Here’s an example of retrieving a data file from GitHub and loading it into a data frame. (Read more about this particular file in this article. Note that GitHub uses HTTPS — now possible to access in SAS University Edition!)
import saspy
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv')
df.describe()
Here’s the result. This is all straight Python stuff; we haven’t started using any SAS yet.
Before we can use SAS features with this data, we need to move the data into a SAS data set. SASPy provides a dataframe2sasdata() method (shorter alias: df2sd) that can import your Python pandas data frame into a SAS library and data set. The method returns a SASdata object. This example copies the data into WORK.PROBLY in the SAS session:
sas = saspy.SASsession()
probly = sas.df2sd(df, 'PROBLY')
probly.describe()
The SASdata object also includes a describe() method that yields a result that’s similar to what you get from pandas:
SASPy includes a collection of built-in objects and methods that provide APIs to the most commonly used SAS procedures. The APIs present a simple “Python-ic” style approach to the work you’re trying to accomplish. For example, to create a SAS-based histogram for a variable in a data set, simply use the hist() method.
SASPy offers dozens of simple API methods that represent statistics, machine learning, time series, and more. You can find them documented on the GitHub project page. Note that since SAS University Edition does not include all SAS products, some of these API methods might not work for you. For example, the SASml.forest() method (representing the HPFOREST procedure) works only when you have SAS Enterprise Miner. (And no, that’s not included in SAS University Edition.)
In SASPy, all methods generate SAS program code behind the scenes. If you like the results you see and want to learn the SAS code that was used, you can flip on the “teach me SAS” mode in SASPy.
sas.teach_me_SAS(True)
Here’s what SASPy reveals about the describe() and hist() methods we’ve already seen:
Interesting code, right? Does it make you want to learn more about the STACKODSOUTPUT option on PROC MEANS? Or the SCALE= option on PROC SGPLOT?
If you want to experiment with SAS statements that you’ve learned, you don’t need to leave the current notebook and start over. There’s also a built-in %%SAS “magic command” that you can use to try out a few of these SAS statements.
%%SAS
proc means data=sashelp.cars stackodsoutput
   n nmiss median mean std min p25 p50 p75 max;
run;
SAS University Edition includes over 300 Python modules to support your work in Jupyter Notebook. To see a complete list, run the help('modules') command from within a Python notebook. This list includes the common Python packages required to work with data, such as pandas and NumPy. However, it does not include any of the popular Python-based machine learning modules, nor any modules to support data visualization. Of course, SASPy has support for most of this within its APIs, so why would you need anything else…right?
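If you only need to know whether one particular module is present, you don't have to scan the full help('modules') listing; a small standard-library check works too (a generic Python sketch, not specific to SAS University Edition):

```python
import importlib.util

def module_available(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# e.g., check a few packages before relying on them
status = {m: module_available(m) for m in ('pandas', 'numpy', 'sklearn')}
```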
Because SAS University Edition is packaged in a virtual machine that you cannot alter, you don’t have the option of installing additional Python modules. You also don’t have access to the Jupyter terminal, which would allow you to control the system from a shell-like interface. All of this is possible (and encouraged) when you have your own SAS installation with your own instance of SASPy. It’s all waiting for you when you’ve outgrown the learning environment of SAS University Edition and you’re ready to apply your SAS skills and tech to your official work!
The post Coding in Python with SAS University Edition appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |