This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Dictionary tables are one of the things I love most about SQL! What a useful thing it is to be able to programmatically determine what your data looks like so you can write self-modifying and data-driven programs. While PROC SQL has a great set of dictionary tables, they all rely […]
The post Jedi SAS Tricks – FedSQL Dictionary Tables appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Here in the US, we typically use top level domains such as .com, .gov, and .org. I guess we were one of the first countries to start using web domains in a big way, and therefore we kind of got squatter’s rights. As other countries started using the web, they […]
The post A map of country code top-level domains (ccTLD) appeared first on SAS Learning Post.
This post was kindly contributed by SAS & Statistics - go there to comment and to read the full post. |
%macro readpass(xlsfile1,xlsfile2,passwd,outfile,sheetname,getnames);
options macrogen symbolgen mprint nocaps; options noxwait noxsync;
%* start Excel if it is not already running *;
filename cmds dde 'excel|system';
data _null_;
length fid rc start stop time 8;
fid=fopen('cmds','s');
if (fid le 0) then do;
rc=system('start excel');
start=datetime();
stop=start+20;
do while (fid le 0);
fid=fopen('cmds','s');
time=datetime();
if (time ge stop) then fid=1;
end;
end;
rc=fclose(fid);
run; quit;
%* then we open the excel sheet here with its password *;
filename cmds dde 'excel|system';
data _null_;
file cmds;
put '[open("'"&xlsfile1"'",,,,"'"&passwd"'")]';
run;
%* then we save it without the password *;
data _null_;
file cmds;
put '[error("false")]';
put '[save.as("'"&xlsfile2"'",51,"")]';
put '[quit]';
run;
%* Then we import the file here *;
proc import datafile="&xlsfile2" out=&outfile dbms=xlsx replace;
%* sheet="%superq(datafilm&i)";
sheet="&sheetname";
getnames=&getnames;
run; quit;
%* then we destroy the non password excel file here *;
systask command "del ""&xlsfile2"" ";
proc contents data=&outfile varnum;
run;
%mend readpass;
%readpass(j:\access\accpcff\excelfiles\passpro.xlsx, /* name of the xlsx 2007 file */
c:\sastest\nopass.xlsx, /* temporary xls file for translation for import */
mypass, /* password of the excel spreadsheet */
work.temp1, /* name of the sas dataset you want to write */
sheet1, /* name of the sheet */
yes) ; /* getnames */
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
One way to assess the precision of a statistic (a point estimate) is to compute the standard error, which is the standard deviation of the statistic’s sampling distribution. A relatively large standard error indicates that the point estimate should be viewed with skepticism, either because the sample size is small or because the data themselves have a large variance.
The jackknife method is one way to estimate the standard error of a statistic.
Some simple statistics have explicit formulas for the standard error, but the formulas often assume normality of the data or a very large sample size. When your data do not satisfy the assumptions or when no formula exists, you can use resampling techniques to estimate the standard error. Bootstrap resampling is one choice, and the jackknife method is another. Unlike the bootstrap, which uses random samples, the jackknife is a deterministic method.
This article explains the jackknife method and describes how to compute jackknife estimates in SAS/IML software.
This is best when the statistic that you need is also implemented in SAS/IML. If the statistic is computed by a SAS procedure, you might prefer to download and use the %JACK macro, which does not require SAS/IML.
The jackknife method estimates the standard error (and bias) of statistics without making any parametric assumptions about the population that generated the data. It uses only the sample data.
The jackknife method manufactures jackknife samples from the data.
A jackknife sample is a “leave-one-out” resample of the data. If there are n observations, then there are n jackknife samples, each of size n-1. If the original data are
{x_{1}, x_{2},…, x_{n}},
then the i_th jackknife sample is
{x_{1},…, x_{i-1},x_{i+1},…, x_{n}}
You then compute n jackknife replicates. A jackknife replicate is the statistic of interest computed on a jackknife sample. You can obtain an estimate of the standard error from the variance of the jackknife replicates. The jackknife method is summarized by the following:
Resampling methods are not hard, but the notation in some books can be confusing. To clarify the method, let’s choose a particular statistic and look at example data. The following example is from Martinez and Martinez (2001, 1st Ed, p. 241), which is also the source for this article. The data are the LSAT scores and grade-point averages (GPAs) for 15 randomly chosen students who applied to law school.
data law;
input lsat gpa @@;
datalines;
653 3.12  576 3.39  635 3.30  661 3.43  605 3.13
578 3.03  572 2.88  545 2.76  651 3.36  555 3.00
580 3.07  594 2.96  666 3.44  558 2.81  575 2.74
;
The statistic of interest (T) will be the correlation coefficient between the LSAT and the GPA variables for the n=15 observations. The observed correlation is T_{Data} = 0.776. The standard error of T helps us understand how much T would change if we took a different random sample of 15 students.
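Because the jackknife formulas are language-agnostic, the computation can be sketched outside of SAS as well. The following Python snippet (my own illustration, not from the original article; the `corr` helper and variable names are mine) applies the leave-one-out recipe to the law-school data and reproduces the observed correlation of 0.776 and a jackknife standard error of roughly 0.14:

```python
# Jackknife standard error of the correlation coefficient for the
# law-school data. Illustrative Python sketch; the article's own
# implementation uses SAS/IML.
import math

lsat = [653, 576, 635, 661, 605, 578, 572, 545,
        651, 555, 580, 594, 666, 558, 575]
gpa  = [3.12, 3.39, 3.30, 3.43, 3.13, 3.03, 2.88, 2.76,
        3.36, 3.00, 3.07, 2.96, 3.44, 2.81, 2.74]

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n = len(lsat)
T = corr(lsat, gpa)                      # statistic on the full sample
# leave-one-out replicates: drop observation i, recompute the statistic
T_loo = [corr(lsat[:i] + lsat[i+1:], gpa[:i] + gpa[i+1:]) for i in range(n)]
T_avg = sum(T_loo) / n
bias = (n - 1) * (T_avg - T)
std_err = math.sqrt((n - 1) / n * sum((t - T_avg) ** 2 for t in T_loo))
print(round(T, 3), round(std_err, 2))
```

The same four quantities (estimate, mean of the replicates, bias, standard error) are computed by the SAS/IML program in the next section.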
The next sections show how to implement the jackknife analysis in the SAS/IML language.
The SAS/IML matrix language is the simplest way to compute general jackknife estimates. If X is an n x p data matrix, you can obtain the i_th jackknife sample by excluding the i_th row of X. The following two helper functions encapsulate some of the computations. The SeqExclude function returns the index vector {1, 2, …, i-1, i+1, …, n}. The JackSamp function returns the data matrix without the i_th row:
proc iml;
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;

/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   n = nrow(X);
   return X[ SeqExclude(n, i), ];   /* return data without i_th row */
finish;
By using the helper functions, you can carry out each step of the jackknife method. To make the method easy to modify for other statistics, I’ve written a function called EvalStat which computes the correlation coefficient. This function is called on the original data and on each jackknife sample.
/* compute the statistic in this function */
start EvalStat(X);
   return corr(X)[2,1];  /* <== Example: return correlation between two variables */
finish;

/* read the data into a (n x 2) data matrix */
use law;  read all var {"gpa" "lsat"} into X;  close;

/* 1. compute statistic on observed data */
T = EvalStat(X);

/* 2. compute same statistic on each jackknife sample */
n = nrow(X);
T_LOO = j(n,1,.);              /* LOO = "Leave One Out" */
do i = 1 to n;
   Y = JackSamp(X,i);
   T_LOO[i] = EvalStat(Y);
end;

/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );

/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
The output shows that the estimate of bias for the correlation coefficient is very small. The standard error of the correlation coefficient is estimated as 0.14, which is about 18% of the estimate.
To use this code yourself, simply modify the EvalStat function. The remainder of the program does not need to change.
When the data are univariate, you can sometimes eliminate the loop that computes jackknife samples and evaluates the jackknife replicates.
If X is a column vector, you can compute the (n-1) x n matrix whose i_th column represents the i_th jackknife sample. (To prevent huge matrices, this method is best for n < 20000.)
Because many statistical functions in SAS/IML operate on the columns of a matrix, you can often compute the jackknife replicates in a vectorized manner.
In the following program, the JackSampMat function returns the matrix of jackknife samples for univariate data. The function calls the REMOVE function in SAS/IML, which deletes specified elements of a matrix and returns the results in a row vector. The EvalStatMat function takes the matrix of jackknife samples and returns a row vector of statistics, one for each column. In this example, the function returns the sample standard deviation.
/* If x is univariate, you can construct a matrix where each column
   contains a jackknife sample. Use for univariate column vector x when n < 20000 */
start JackSampMat(x);
   n = nrow(x);
   B = j(n-1, n, 0);
   do i = 1 to n;
      B[,i] = remove(x, i)`;   /* transpose to column vector */
   end;
   return B;
finish;

/* Input: matrix where each column of X is a jackknife sample.
   Return a row vector of statistics, one for each column. */
start EvalStatMat(x);
   return std(x);   /* <== Example: return std dev of each sample */
finish;
Let’s use these functions to get a jackknife estimate of the standard error for the statistic (the standard deviation). The data (from Martinez and Martinez, p. 246) have been studied by many researchers and represent the weight gain in grams for 10 rats who were fed a low-protein diet of cereal:
x = {58,67,74,74,80,89,95,97,98,107};  /* Weight gain (g) for 10 rats */
n = nrow(x);

/* optional: visualize the matrix of jackknife samples */
*M = JackSampMat(x);
*print M[c=("S1":"S10") r=("1":"9")];

/* Jackknife method for univariate data */
/* 1. compute observed statistic */
T = EvalStatMat(x);

/* 2. compute same statistic on each jackknife sample */
T_LOO = EvalStatMat( JackSampMat(x) );   /* LOO = "Leave One Out" */

/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO` );                  /* transpose T_LOO */

/* 4 & 5. compute jackknife estimates of bias and standard error */
biasJack = (n-1)*(T_Avg - T);
stdErrJack = sqrt( (n-1)/n * ssq(T_LOO - T_Avg) );
result = T || T_Avg || biasJack || stdErrJack;
print result[c={"Estimate" "Mean Jackknife Estimate" "Bias" "Std Error"}];
The output shows that the standard deviation of these data is about 15.7 grams. The jackknife method estimates the standard error for this statistic to be about 2.9 grams, which is about 18% of the estimate.
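This result can be cross-checked in a few lines of Python (my own sketch, not part of the original article; the `std` helper is mine). The same leave-one-out loop over the ten weight gains recovers a standard deviation near 15.7 and a jackknife standard error near 2.9:

```python
# Jackknife standard error of the sample standard deviation for the
# rat weight-gain data. Illustrative Python sketch; the article's own
# implementation uses SAS/IML.
import math

x = [58, 67, 74, 74, 80, 89, 95, 97, 98, 107]   # weight gain (g) for 10 rats

def std(sample):
    """Sample standard deviation (divisor n-1)."""
    n = len(sample)
    m = sum(sample) / n
    return math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))

n = len(x)
T = std(x)                                       # observed statistic
# leave-one-out replicates: drop observation i, recompute the statistic
T_loo = [std(x[:i] + x[i+1:]) for i in range(n)]
T_avg = sum(T_loo) / n
bias = (n - 1) * (T_avg - T)
std_err = math.sqrt((n - 1) / n * sum((t - T_avg) ** 2 for t in T_loo))
print(round(T, 1), round(std_err, 1))
```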
In summary, jackknife estimates are straightforward to implement in SAS/IML. This article shows a general implementation that works for all data and a specialized implementation that works for univariate data. In both cases, you can adapt the code for your use by modifying the function that computes the statistic on a data set. This approach is useful and efficient when the statistic is implemented in SAS/IML.
The post The jackknife method to estimate standard errors in SAS appeared first on The DO Loop.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
There have been several polarizing topics throughout history, such as religion & political affiliation. And for software developers there’s one more biggie … tabs -vs- spaces! Which group is right? Perhaps the opinion of the better programmers should have more weight(?) Is there a metric we can use to determine whether […]
The post Tabs -vs- Spaces: Which coders make more money? appeared first on SAS Learning Post.
This post was kindly contributed by SAS – r4stats.com - go there to comment and to read the full post. |
Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.
New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, IBM Watson, Pentaho, and Google’s TensorFlow.
Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.
Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.
Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.
Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data were collected on 6/8/2017.
SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.
Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.
The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.
Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”
From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.
Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.
The last two packages in Fig 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.
Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.
While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.
Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 shows 400% growth but is still of little interest.
Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.
IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!
While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.
In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.
From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.
In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.
The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.
As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.
Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.
Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.
Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.
The rapid growth in Stata use seems to be finally slowing down. Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.
The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.
This post was kindly contributed by BI Notes for SAS Software Users - go there to comment and to read the full post. |
Recently I was asked to create a Web Analytics dashboard on spec. I decided to use the latest version of SAS Visual Analytics, 8.1, so I could review the new features. When we wrote the Introduction to SAS Visual Analytics book, we were using beta versions of the application. Here’s some of the process I used while creating the Web Analytics report.
Getting the Right …
This post appeared first on BI Notes for SAS Software Users. Go to the site to subscribe or view more content.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
Most numerical optimization routines require that the user provide an initial guess for the solution. I have previously described a method for choosing an initial guess for an optimization, which works well for low-dimensional optimization problems. Recently a SAS programmer asked how to find an initial guess when there are linear constraints and bounds on the parameters. There are two simple approaches for finding an initial guess that is in the feasible region. One is the “shotgun” approach in which you generate many initial guesses and evaluate each one until you find one that is feasible. The other is to use the NLPFEA subroutine in SAS/IML, which takes any guess and transforms it into a feasible point in the linearly constrained region.
The NLPFEA routine returns a point in the feasible region from an arbitrary starting guess. Suppose the problem has p (possibly bounded) parameters, and the feasible region is formed by k > 0 additional linear constraints. Then you can represent the feasible region by a (k+2) x (p+2) matrix, which is the representation that is used for linear programming and constrained nonlinear optimizations. The first row specifies the lower bounds for the parameters and the second row specifies the upper bounds. (Missing values in the first p columns indicate that a parameter is not bounded.) The remaining k rows indicate linear constraints. For example, the following matrix defines a pentagonal region for two parameters. You can call the NLPFEA subroutine to generate a point that satisfies all constraints:
proc iml;
con = { 0   0  .   .,    /* param min */
       10  10  .   .,    /* param max */
        3  -2 -1  10,    /* 3*x1 + -2*x2 LE 10 */
        5  10 -1  56,    /* 5*x1 + 10*x2 LE 56 */
        4   2  1   7 };  /* 4*x1 +  2*x2 GE  7 */
guess = {0 0};               /* arbitrary p-dimensional point */
call nlpfea(z, guess, con);  /* z is a feasible point */
print guess[c={"x" "y"}], z[c={"Tx" "Ty"}];
The output shows that the guess (x,y) = (0,0) was not feasible, but the NLPFEA routine generated the transformed point T(x,y) = (1.2, 1.1), which is feasible.
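As a quick sanity check (mine, not the article's), you can verify that the transformed point satisfies every row of the constraint matrix while the original guess does not. The `is_feasible` helper below is a hypothetical Python illustration of the same five constraints; a small tolerance is used because (1.2, 1.1) lies exactly on the boundary of the GE constraint:

```python
# Check the pentagon constraints from the SAS/IML example by hand.
# Illustrative Python sketch; NLPFEA itself does this inside SAS/IML.
TOL = 1e-9  # tolerance for points on a constraint boundary

def is_feasible(x1, x2):
    """Return True if (x1, x2) satisfies the bounds and all three
    linear constraints from the con matrix in the article."""
    in_bounds = (0 <= x1 <= 10) and (0 <= x2 <= 10)
    c1 = 3*x1 - 2*x2 <= 10 + TOL    # row 3: 3*x1 - 2*x2 LE 10
    c2 = 5*x1 + 10*x2 <= 56 + TOL   # row 4: 5*x1 + 10*x2 LE 56
    c3 = 4*x1 + 2*x2 >= 7 - TOL     # row 5: 4*x1 + 2*x2 GE 7
    return in_bounds and c1 and c2 and c3

print(is_feasible(0, 0))      # the arbitrary guess
print(is_feasible(1.2, 1.1))  # the point returned by NLPFEA
```

The guess (0, 0) fails because 4*0 + 2*0 = 0 < 7, while the transformed point gives 4*1.2 + 2*1.1 = 7, exactly on the GE boundary.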
It is interesting to visualize the NLPFEA subroutine. The following SAS/IML statements create 36 initial guesses that are distributed uniformly on a circle around the feasible region. For each guess, the program transforms the guess into the feasible region by calling the NLPFEA subroutine. The initial and transformed points are saved to a SAS data set and visualized by using PROC SGPLOT:
NumPts = 36;
twopi = 2*constant('pi');
x=.; y=.; Tx=.; Ty=.;
create feasible var {"x" "y" "Tx" "Ty"};
do i = 1 to NumPts;
   x = 2.5 + 5*cos((i-1)/NumPts * twopi);   /* guess on circle */
   y = 2.5 + 5*sin((i-1)/NumPts * twopi);
   call nlpfea(feasPt, x||y, con);          /* transform into feasible */
   Tx = feasPt[1];  Ty = feasPt[2];
   append;
end;
close;
The graph visualizes the result. The graph shows how each point on the circle is transformed into the feasible region. Some points are transformed into the interior, but many are transformed onto the boundary. You can see that the transformed point always satisfies the linear constraints.
Before I finish, I want to point out that the nonlinear programming (NLP) subroutines in SAS/IML software rarely require you to call the NLPFEA subroutine explicitly. When you call an NLP routine for a linearly constrained optimization and provide a nonfeasible initial guess, the NLP routine internally calls the NLPFEA routine. Consequently, you might see the following NOTE displayed in the SAS log: NOTE: Initial point was changed to be feasible for boundary and linear constraints. For example, run the following program, which provides a nonfeasible initial guess to the NLPNRA (Newton-Raphson) subroutine.
start func(x);
   x1 = x[,1];  x2 = x[,2];
   return ( -(x1-3)##4 + -(x2-2)##2 + 0.1*sin(x1#x2) );
finish;

opt = {1,    /* find maximum of function */
       3};   /* print a little bit of output */
x0 = {0 0};
call nlpnra(rc, x_Opt, "func", x0, opt) blc=con;
In this program, the NLPNRA subroutine detects that the guess is not feasible. It internally calls NLPFEA to obtain a feasible guess and then computes an optimal solution. This is very convenient for programmers. The only drawback is that you don’t know the initial guess that produced the optimal solution. However, you can call the NLPFEA subroutine directly if you want to obtain that information.
In optimization problems that have linear and boundary constraints, most optimization routines require an initial guess that is in the feasible region. The NLPFEA subroutine enables you to obtain a feasible point from an arbitrary initial guess. You can then use that feasible point as an initial guess in a built-in or user-defined optimization routine. However, for the built-in NLP subroutines, you can actually skip the NLPFEA call because the NLP subroutines internally call NLPFEA when you supply a nonfeasible initial guess.
The post How to find a feasible point for a constrained optimization in SAS appeared first on The DO Loop.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
If countries have a similar median age, does that mean they are also similar in other ways? My best guess at an answer is – probably. Perhaps if we plot the data on a map, we’ll be able to see the answer more clearly. I first started thinking about this […]
The post Map: Median age by country appeared first on SAS Learning Post.