The post Generalized inverses for matrices appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
A data analyst asked how to compute parameter estimates in a linear regression model when the underlying data matrix is rank deficient. This situation can occur if one of the variables in the regression is a linear combination of other variables. It also occurs when you use the GLM parameterization of a classification variable. In the GLM parameterization, the columns of the design matrix are linearly dependent. As a result, the matrix of crossproducts (the X`X matrix) is singular.
In either case, you can understand the computation of the parameter estimates by learning about generalized inverses in linear systems. This article presents an overview of generalized inverses. A subsequent article will apply generalized inverses to the problem of estimating parameters for regression problems with collinearities.
Recall that the inverse of a square matrix A is a matrix G such that A*G = G*A = I, where I is the identity matrix. When such a matrix exists, it is unique and A is said to be nonsingular (or invertible). If there are linear dependencies in the columns of A, then an inverse does not exist. However, you can define a series of weaker conditions that are known as the Penrose conditions:
1. A*G*A = A
2. G*A*G = G
3. (A*G)` = A*G
4. (G*A)` = G*A
Any matrix, G, that satisfies the first condition is called a generalized inverse (or sometimes a “G1” inverse) for A.
A matrix that satisfies the first and second conditions is called a “G2” inverse for A.
The G2 inverse is used in statistics to compute parameter estimates for regression problems (see Goodnight (1979), p. 155).
A matrix that satisfies all four conditions is called the Moore-Penrose inverse or the pseudoinverse.
When A is square but singular, there are infinitely many matrices that satisfy the first two conditions, but the Moore-Penrose inverse is unique.
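The Penrose conditions are concrete enough to check numerically. Here is an illustrative sketch in Python rather than SAS/IML (the 2x2 matrix is a hypothetical example, not one from this post): the singular matrix A = [[1,1],[1,1]] has the known Moore-Penrose inverse G = A/4, and the code verifies all four conditions.

```python
# Check the four Penrose conditions for a known pseudoinverse.
# Hypothetical example: A = [[1,1],[1,1]] is singular; its
# Moore-Penrose inverse is G = A/4.

def matmul(X, Y):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

A = [[1.0, 1.0], [1.0, 1.0]]
G = [[0.25, 0.25], [0.25, 0.25]]      # Moore-Penrose inverse of A

AG, GA = matmul(A, G), matmul(G, A)
cond1 = matmul(AG, A) == A            # A*G*A = A   (G1 inverse)
cond2 = matmul(GA, G) == G            # G*A*G = G   (G2 inverse)
cond3 = transpose(AG) == AG           # A*G is symmetric
cond4 = transpose(GA) == GA           # G*A is symmetric
print(cond1, cond2, cond3, cond4)     # all four conditions hold
```

A G2 inverse for this A would satisfy only the first two checks; the pseudoinverse passes all four.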
In regression problems, the parameter estimates are obtained by solving the “normal equations.” The normal equations are the linear system (X`*X)*b = (X`*Y), where X is the design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates to be determined. The matrix A = X`*X is symmetric. If the columns of the design matrix are linearly dependent, then A is singular. The following SAS/IML program defines a symmetric singular matrix A and a right-hand-side vector c, which you can think of as X`*Y in the regression context. The call to the DET function computes the determinant of the matrix. A zero determinant indicates that A is singular; for this (consistent) system, there are infinitely many vectors b that solve the linear system:
proc iml;
A = {100  50 20 10,
      50 106 46 23,
      20  46 56 28,
      10  23 28 14};
c = {130, 776, 486, 243};
det = det(A);     /* demonstrate that A is singular */
print det;
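To see why linear dependence in the design matrix makes the crossproduct matrix singular, here is an illustrative sketch in plain Python (the small design matrix is hypothetical, not the 4x4 matrix above): when the third column of X is the sum of the first two, the determinant of X`X is exactly zero.

```python
# Illustrative sketch: when a column of the design matrix X is a
# linear combination of other columns, the crossproduct matrix X`X
# is singular (zero determinant).

X = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 2],
     [2, 1, 3]]          # third column = first column + second column

# form the crossproduct matrix X`X
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)]
       for i in range(3)]

# determinant of a 3x3 matrix by cofactor expansion
a, b, c = XtX[0]
d, e, f = XtX[1]
g, h, i = XtX[2]
det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
print(XtX, det)   # det is 0, so X`X is singular
```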
For nonsingular matrices, you can use either the INV or the SOLVE function in SAS/IML to solve for the unique solution of the linear system. However, both functions give errors when called with a singular matrix. SAS/IML provides several ways to compute a generalized inverse, including the SWEEP function and the GINV function. The SWEEP function is an efficient way to use Gaussian elimination to solve the symmetric linear systems that arise in regression. The GINV function computes the Moore-Penrose pseudoinverse. The following SAS/IML statements compute two different solutions for the singular system A*b = c:
b1 = ginv(A)*c;     /* solution even if A is not full rank */
b2 = sweep(A)*c;
print b1 b2;
The SAS/IML language also provides a way to obtain any of the other infinitely many solutions to the singular system A*b = c. Because A is a rank-3 matrix, it has a one-dimensional kernel (null space). The HOMOGEN function in SAS/IML computes a basis for the null space. That is, it computes a vector that is mapped to the zero vector by A. The following statements compute the unit basis vector for the kernel. The output shows that the vector is mapped to the zero vector:
xNull = homogen(A);     /* basis for nullspace of A */
print xNull (A*xNull)[L="A*xNull"];
All solutions to A*b = c are of the form b + α*xNull, where b is any particular solution.
You can verify that the Moore-Penrose matrix GINV(A) satisfies the four Penrose conditions, whereas the G2 inverse (SWEEP(A)) only satisfies the first two conditions. I mentioned that the singular system has infinitely many solutions, but the Moore-Penrose solution (b1) is unique.
It turns out that the Moore-Penrose solution is the solution that has the smallest Euclidean norm.
Here is a computation of the norm for three solutions to the system A*b = c:
/* GINV gives the estimate that has the smallest L2 norm */
GINVnorm  = norm(b1);
sweepNorm = norm(b2);
/* you can add alpha*xNull to any solution to get another solution */
b3 = b1 + 2*xNull;      /* here's another solution (alpha=2) */
otherNorm = norm(b3);
print ginvNorm sweepNorm otherNorm;
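The minimal-norm property has a simple geometric explanation: the pseudoinverse solution is orthogonal to the null space, so adding any multiple of a null vector can only increase the norm. Here is an illustrative sketch in Python using a small hypothetical singular system (not the 4x4 system above): for A = [[1,1],[1,1]] and c = [2,2], the pseudoinverse solution is b1 = [1,1] and the null space is spanned by [1,-1]/sqrt(2).

```python
# Sketch of the minimal-norm property of the pseudoinverse solution,
# using a hypothetical 2x2 singular system A*b = c with
# A = [[1,1],[1,1]] and c = [2,2].
import math

b1 = [1.0, 1.0]                       # Moore-Penrose solution: A*b1 = c
xNull = [1 / math.sqrt(2), -1 / math.sqrt(2)]   # unit basis for nullspace

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

# b1 is orthogonal to xNull, so ||b1 + alpha*xNull||^2 equals
# ||b1||^2 + alpha^2, which is minimized at alpha = 0.
norms = [norm([b + a * x for b, x in zip(b1, xNull)])
         for a in (-2.0, -1.0, 0.0, 1.0, 2.0)]
print(min(norms) == norm(b1))   # the pseudoinverse solution has minimal norm
```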
Because all solutions are of the form b1 + α*xNull, where xNull is the basis for the nullspace of A, you can graph the norm of the solutions as a function of α. The graph is shown below and indicates that the Moore-Penrose solution is the minimal-norm solution:
alpha = do(-2.5, 2.5, 0.05);
norms = j(1, ncol(alpha), .);    /* named 'norms' to avoid shadowing the NORM function */
do i = 1 to ncol(alpha);
   norms[i] = norm(b1 + alpha[i] * xNull);
end;
title  "Euclidean Norm of Solutions b + alpha*xNull";
title2 "b = Solution from Moore-Penrose Inverse";
title3 "xNull = Basis for Nullspace of A";
call series(alpha, norms) other="refline 0 / axis=x label='b=GINV'; refline 1.78885 / axis=x label='SWEEP';";
In summary, a singular linear system has infinitely many solutions.
You can obtain a particular solution by using the sweep operator or by finding the Moore-Penrose solution.
You can use the HOMOGEN function to obtain the full family of solutions.
The Moore-Penrose solution is expensive to compute but has an interesting property: it is the solution that has the smallest Euclidean norm. The sweep solution is more efficient to compute and is used in SAS regression procedures such as PROC GLM to estimate parameters in models that include classification variables and use a GLM parameterization. The next blog post explores regression estimates in more detail.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
I recently read an article that said a school in Asheville, North Carolina had the worst chickenpox outbreak in the state in 2 decades. The article was interesting, and it also let me know I had a hole in my knowledge … “What?!? – There’s a chickenpox vaccine?!?” When I […]
The post Immunization rates in North Carolina schools appeared first on SAS Learning Post.
The post Select ODS tables by using wildcards and regular expressions in SAS appeared first on The DO Loop.
You might know that you can use the ODS SELECT statement to display only some of the tables and graphs that are created by a SAS procedure.
But did you know that you can use a WHERE clause on the ODS SELECT statement to display tables that match a pattern? This article shows how to use wildcards, regular expressions, and pattern matching to select ODS tables in SAS.
A SAS procedure might produce a dozen or more tables. You might be interested in displaying a subset of those tables. Recall that you can use the ODS TRACE ON statement to obtain a list of all the tables and graphs that a procedure creates. You can then use the ODS SELECT or the ODS EXCLUDE statement to control which tables and graphs are displayed.
Here’s an example from the SAS/STAT documentation. The following PROC LOGISTIC call creates 27 tables and graphs, most of which are related to ROC curves. The ODS TRACE ON statement displays the names of each output object in the SAS log:
data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3 5 1   3.9 6.8 3 1   2.8 4.8 6 0
3.2 5.8 3 1    0.9 4.0 5 0   2.5 5.7 8 0   1.6 5.6 5 1
3.8 5.7 5 1    3.7 6.7 6 1   3.2 5.4 4 1   3.8 6.6 6 1
4.1 6.6 5 1    3.6 5.7 5 1   4.3 7.0 4 1   3.6 6.7 4 0
2.3 4.4 6 1    4.2 7.6 4 0   4.0 6.6 6 0   3.5 5.8 6 1
3.8 6.8 7 1    3.0 4.7 8 0   4.5 7.4 5 1   3.7 7.4 5 1
3.1 6.6 6 1    4.1 8.2 6 1   4.3 7.0 5 1   4.3 6.5 4 1
3.2 5.1 5 1    2.6 4.7 6 1   3.3 6.8 6 0   1.7 4.0 7 0
3.7 6.1 5 1    3.3 6.3 7 1   4.2 7.7 6 1   3.5 6.2 5 1
2.9 5.7 9 0    2.1 4.8 7 1   2.8 6.2 8 0   4.0 7.0 7 1
3.3 5.7 6 1    3.7 6.9 5 1   3.6 6.6 5 1
;

ods graphics on;
ods trace on;
proc logistic data=roc;
   model popind(event='0') = alb tp totscore / nofit;
   roc 'Albumin' alb;
   roc 'K-G Score' totscore;
   roc 'Total Protein' tp;
   roccontrast reference('K-G Score') / estimate e;
run;
The SAS log displays the names of the tables and graphs. A portion of the log is shown below:
Output Added:
-------------
Name: OddsRatios
Label: Odds Ratios
Template: Stat.Logistic.OddsRatios
Path: Logistic.ROC3.OddsRatios
-------------
Output Added:
-------------
Name: ROCCurve
Label: ROC Curve
Template: Stat.Logistic.Graphics.ROC
Path: Logistic.ROC3.ROCCurve
-------------
Output Added:
-------------
Name: ROCOverlay
Label: ROC Curves
Template: Stat.Logistic.Graphics.ROCOverlay
Path: Logistic.ROCComparisons.ROCOverlay
-------------
Output Added:
-------------
Name: ROCAssociation
Label: ROC Association Statistics
Template: Stat.Logistic.ROCAssociation
Path: Logistic.ROCComparisons.ROCAssociation
-------------
Output Added:
-------------
Name: ROCContrastCoeff
Label: ROC Contrast Coefficients
Template: Stat.Logistic.ROCContrastCoeff
Path: Logistic.ROCComparisons.ROCContrastCoeff
-------------
Only a few of the 27 ODS objects are shown here.
Notice that each ODS object has four properties: a name, a label, a template, and a path. Most of the time, the name is used on the ODS SELECT statement to filter the output. For example, if you want to display only the ROC curves and the overlay of the ROC curves, you can put the following statement prior to the RUN statement in the procedure:
ods select ROCCurve ROCOverlay;     /* specify the names literally */
Often the ODS objects that you want to display are related to each other. In the LOGISTIC example, you might want to display all the information about ROC curves. Fortunately, the SAS developers often use a common prefix or suffix, such as ‘ROC’, in the names of the ODS objects. That means that you can display all ROC-related tables and graphs by selecting the ODS objects whose name (or path) contains ‘ROC’ as a substring.
You can use the WHERE clause to select ODS objects whose name (or label or path) matches a particular pattern.
The object’s name is available in a special variable named _NAME_.
Similarly, the object’s label and path are available in variables named _LABEL_ and _PATH_, respectively. You cannot match patterns in the template string; there is no _TEMPLATE_ variable.
In SAS, the following operators and functions are useful for matching strings:
- the =: operator, which matches a prefix (“begins with”)
- the IN: operator, which matches any of several prefixes
- the ? (CONTAINS) operator, which matches a substring
- the LIKE operator, which matches a pattern that can include the % and _ wildcards
- functions such as SUBSTR and PRXMATCH
For example, the following statements select ODS tables and graphs from the previous PROC LOGISTIC call. You can put one of these statements before the RUN statement in the procedure:
/* use any one of the following statements inside the PROC LOGISTIC call */
ods select where=(_name_ =: 'ROC');               /* name starts with 'ROC' */
ods select where=(_name_ like 'ROC%');            /* name starts with 'ROC' */
ods select where=(_path_ ? 'ROC');                /* path contains 'ROC' */
ods select where=(_label_ ? 'ROC');               /* label contains 'ROC' */
ods select where=(_name_ in: ('Odds', 'ROC'));    /* name starts with 'Odds' or 'ROC' */
ods select where=(substr(_name_,4,8)='Contrast'); /* name has substring 'Contrast' at position 4 */
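The filtering logic itself is easy to see outside of SAS. Here is an illustrative Python sketch (not part of the original post) that applies the same prefix and substring tests to the ODS object names from the log excerpt above, mimicking the =:, ?, and IN: operators:

```python
# Illustrative Python analog of the WHERE-clause operators above,
# applied to ODS object names taken from the PROC LOGISTIC log.
names = ["OddsRatios", "ROCCurve", "ROCOverlay",
         "ROCAssociation", "ROCContrastCoeff"]

starts_with_roc = [n for n in names if n.startswith("ROC")]    # like  _name_ =: 'ROC'
contains_roc    = [n for n in names if "ROC" in n]             # like  _name_ ? 'ROC'
odds_or_roc     = [n for n in names
                   if n.startswith(("Odds", "ROC"))]           # like  _name_ in: ('Odds','ROC')

print(starts_with_roc)   # the four ROC-prefixed objects
print(odds_or_roc)       # all five names
```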
For additional examples of using pattern matching to select ODS objects, see Warren Kuhfeld’s graphics-focused blog post and the section of the SAS/STAT User’s Guide that discusses selecting ODS graphics.
Although the CONTAINS and LIKE operators are often sufficient for selecting a table, SAS provides the powerful PRXMATCH function for more complex pattern-matching tasks. The PRXMATCH function uses Perl regular expressions to match strings. SAS provides a Perl regular expression “cheat sheet” that summarizes the syntax and common search queries for the PRXMATCH function.
You can put any of the following statements inside the PROC LOGISTIC call:
/* use any one of the following PRXMATCH expressions inside the PROC LOGISTIC call */
ods select where=(prxmatch('/ROC/', _name_));            /* name contains 'ROC' anywhere */
ods select where=(prxmatch('/^ROC/', _name_));           /* name starts with 'ROC' */
ods select where=(prxmatch('/Odds|^ROC/', _name_));      /* name contains 'Odds' anywhere or 'ROC' at the beginning */
ods select where=(prxmatch('/ROC/', _name_)=0);          /* name does NOT contain 'ROC' anywhere */
ods select where=(prxmatch('/Logistic\.ROC2/', _path_)); /* escape the special character '.' */
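Because PRXMATCH uses Perl-style regular expressions, the same patterns work in any PCRE-like engine. As an illustrative cross-check (not part of the original post), here are the equivalent searches in Python's re module, applied to sample names and paths from the log:

```python
# Illustrative Python re equivalents of the PRXMATCH patterns above,
# applied to sample ODS names and paths from the log excerpt.
import re

names = ["OddsRatios", "ROCCurve", "ROCOverlay", "ROCAssociation"]
paths = ["Logistic.ROC3.OddsRatios", "Logistic.ROCComparisons.ROCOverlay"]

anywhere = [n for n in names if re.search(r"ROC", n)]          # 'ROC' anywhere
at_start = [n for n in names if re.search(r"^ROC", n)]         # starts with 'ROC'
negated  = [n for n in names if not re.search(r"ROC", n)]      # does NOT contain 'ROC'
escaped  = [p for p in paths if re.search(r"Logistic\.ROC", p)]  # '.' escaped, matches a literal dot

print(at_start)   # the three ROC-prefixed names
print(negated)    # ['OddsRatios']
```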
In summary, the WHERE= option on the ODS SELECT (and ODS EXCLUDE) statement is quite powerful.
Many SAS programmers know how to list the names of tables and graphs on the ODS SELECT statement to display only a subset of the output. However, the WHERE= option enables you to use wildcards and regular expressions to select objects whose names or paths match a certain pattern. This can be a quick and efficient way to select tables that are related to each other and share a common prefix or suffix in their name.
This post was kindly contributed by SAS & Statistics - go there to comment and to read the full post.
retain seed 111618;
call ranuni(seed, ranno);
Learn how to become an author with SAS Press.
The post SAS Press Is Recruiting Authors! appeared first on SAS Learning Post.
There has been a lot of controversy surrounding this year’s midterm election, when it comes to counting the ballots … and I kept hearing the term provisional ballots in the news. But I’m embarrassed to say that I didn’t really know much about provisional ballots. I decided to do a […]
The post Where do provisional ballots come from? appeared first on SAS Learning Post.
The US ‘midterm’ elections have finally started to wind down, and we now have some (mostly) finalized results to study. But what’s the best way to visualize who won the US congressional seats in each of the 435 districts? Let’s dive into this topic! … For starters, I couldn’t find […]
The post Building a better election map appeared first on SAS Learning Post.
The post Create newline-delimited JSON (or JSONL) with SAS appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.
JSON is a popular format for data exchange between APIs and some modern databases. It’s also used as a way to archive audit logs in many systems. Because JSON can represent hierarchical relationships of data fields, many people consider it to be superior to the CSV format — although it’s certainly not yet universal.
I learned recently that newline-delimited JSON, also called JSONL or JSON Lines, is growing in popularity. In a JSONL file, each line of text represents a valid JSON object — building up to a series of records. But there is no hierarchical relationship among these lines, so when taken as a whole the JSONL file is not valid JSON. That is, a JSON parser can process each line individually, but it cannot process the file all at once.
In SAS, you can use PROC JSON to create valid JSON files, and you can use the JSON libname engine to parse valid JSON files. But neither of these can create or parse JSONL files directly. Here’s a simple example of a JSONL file (two illustrative records, using fields from the sashelp.class data set that appears later in this post):

{"Name":"Alfred","Sex":"M","Age":14,"Height":69,"Weight":112.5}
{"Name":"Alice","Sex":"F","Age":13,"Height":56.5,"Weight":84}

Each line, enclosed in braces, represents valid JSON. But if you paste the entire body into a validation tool like JSONLint, the parsing fails.
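You can see this behavior with any strict JSON parser. Here is an illustrative sketch using Python's standard json module (the two records are hypothetical): each line parses on its own, but parsing the whole body at once fails because of the trailing "extra data".

```python
# A minimal JSONL example: each line is valid JSON,
# but the file taken as a whole is not.
import json

jsonl = '{"Name":"Alfred","Age":14}\n{"Name":"Alice","Age":13}'

# a JSONL reader processes the file line by line
records = [json.loads(line) for line in jsonl.splitlines()]
print(len(records))   # 2

# a strict parser rejects the body as a whole
try:
    json.loads(jsonl)
    whole_is_valid = True
except json.JSONDecodeError:
    whole_is_valid = False
print(whole_is_valid)   # False
```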
If we need these records to be valid JSON, we need a hierarchy. This requires us to set off the rows with additional braces and brackets and separate them with commas, like this (illustrative records based on sashelp.class):

[
  {"Name":"Alfred","Sex":"M","Age":14,"Height":69,"Weight":112.5},
  {"Name":"Alice","Sex":"F","Age":13,"Height":56.5,"Weight":84}
]
In a recent SAS Support Communities thread, a SAS user was struggling to use PROC JSON and a SAS data set to create a JSONL file for use with the Amazon Redshift database. PROC JSON can’t create the finished file directly, but we can use it to create the individual JSON object records and then concatenate them into the final JSONL file.
Here’s the code we used. You need to change only the output file name and the source SAS data set.
/* Build a JSONL (newline-delimited JSON) file */
/* from the records in a SAS data set          */
filename final "c:\temp\final.jsonl";
%let datasource = sashelp.class;

/* Create a new subfolder in WORK to hold */
/* temp JSON files, avoiding conflicts    */
options dlcreatedir;
%let workpath = %sysfunc(getoption(WORK))/json;
libname json "&workpath.";
libname json clear;

/* Will create and run a separate PROC JSON step */
/* for each record. This might take a while      */
/* for very large data.                          */
/* Each iteration will create a new JSON file    */
data _null_;
   set &datasource.;
   call execute(catt('filename out "',"&workpath./out",_n_,'.json";'));
   call execute('proc json out=out nosastags;');
   call execute("export &datasource.(obs="||_n_||" firstobs="||_n_||");");
   call execute('run;');
run;

/* This will concatenate the collection of JSON files */
/* into a single JSONL file                           */
data _null_;
   file final encoding='utf-8' termstr=cr;
   infile "&workpath./out*.json";
   input;
   /* trim the start and end [ ] characters */
   final = substr(_infile_, 2, length(_infile_)-2);
   put final;
run;
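The final DATA step performs a simple transformation: each one-record JSON file is a bracketed array, and stripping the enclosing [ ] leaves a bare object on one line. For comparison, here is an illustrative sketch of the same array-to-JSONL transformation in Python (stdlib only; not part of the original SAS solution):

```python
# Convert a JSON array of records into newline-delimited JSON,
# the same transformation the SAS DATA step performs.
import json

array_json = '[{"Name":"Alfred","Age":14},{"Name":"Alice","Age":13}]'
records = json.loads(array_json)          # parse the bracketed array

# emit one JSON object per line
jsonl = "\n".join(json.dumps(rec) for rec in records)
print(jsonl)
# {"Name": "Alfred", "Age": 14}
# {"Name": "Alice", "Age": 13}
```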
From what I’ve read, it’s a common practice to compress JSONL files with gzip for storage or faster transfers. That’s a simple step to apply in our example, because SAS supports a GZIP method in SAS 9.4 Maintenance 5. To create a gzipped final result, change the first FILENAME statement to something like:
filename final ZIP "c:\temp\final.jsonl.gz" GZIP;
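As an aside, the same gzip round-trip is easy to sketch with the Python standard library (illustrative only; the temp-file path is generated at run time and is not part of the SAS example):

```python
# Write and re-read a gzip-compressed JSONL file with stdlib gzip.
import gzip, json, os, tempfile

records = [{"Name": "Alfred", "Age": 14}, {"Name": "Alice", "Age": 13}]
path = os.path.join(tempfile.mkdtemp(), "final.jsonl.gz")

# write one JSON object per line into the gzipped file
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# read it back, parsing line by line as any JSONL consumer would
with gzip.open(path, "rt", encoding="utf-8") as f:
    roundtrip = [json.loads(line) for line in f]
print(roundtrip == records)   # True
```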
The JSONL format is new to me and I haven’t needed to use it in any of my applications. If you use JSONL in your work, I’d love to hear your feedback about whether this approach would create the types of files you need.
Joseph Woodside shares examples of applied experimental learning in healthcare from his book, “Applied Health Analytics and Informatics Using SAS”.
The post Big Data in Healthcare appeared first on SAS Learning Post.
The post Create and compare ROC curves for any predictive model appeared first on The DO Loop.
An ROC curve graphically summarizes the tradeoff between true positives and true negatives for a rule or model that predicts a binary response variable.
An ROC curve is a parametric curve that is constructed by varying the cutpoint value at which estimated probabilities are considered to predict the binary event.
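To make the construction concrete, here is an illustrative sketch in plain Python (the labels and predicted probabilities are hypothetical, not from the data in this post): sweep the cutpoint over the predicted probabilities, record the (false positive rate, true positive rate) pairs, and integrate under the piecewise-linear curve to get the AUC.

```python
# Empirical ROC curve: vary the cutpoint over the predicted
# probabilities and record the classification rates at each cutpoint.

def roc_points(y, p):
    """ROC points (FPR, TPR) for binary labels y and probabilities p."""
    pos = sum(y)
    neg = len(y) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(p), reverse=True):      # each cutpoint value
        tp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi >= t and yi == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under a piecewise-linear ROC curve (trapezoid rule)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# hypothetical labels and predicted probabilities
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_points(y, p)))   # 0.75
```

Nothing in this construction depends on where the probabilities p came from, which is exactly the point of the PRED= feature discussed below in PROC LOGISTIC terms.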
Most SAS data analysts know that you can fit a logistic model in PROC LOGISTIC and create an ROC curve for that model, but did you know that PROC LOGISTIC enables you to create and compare ROC curves for ANY vector of predicted probabilities, regardless of where the predictions came from? This article shows how!
If you want to review the basic construction of an ROC curve, see a previous article that constructs an empirical ROC curve from first principles. The PROC LOGISTIC documentation provides the formulas used for constructing an ROC curve.
Before discussing how to create an ROC plot from an arbitrary vector of predicted probabilities, let’s review how to create an ROC curve from a model that is fit by using PROC LOGISTIC.
The following data and model are taken from the PROC LOGISTIC documentation. The data are for 43 cancer patients who also had an intestinal obstruction. The response variable popInd is a postoperative indicator variable: popInd = 1 for patients who died within two months after surgery.
The explanatory variables are three pre-operative screening tests. The goal of the study is to determine patients who might benefit from surgery, where “benefit” is measured by postoperative survival of at least two months.
data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3 5 1   3.9 6.8 3 1   2.8 4.8 6 0
3.2 5.8 3 1    0.9 4.0 5 0   2.5 5.7 8 0   1.6 5.6 5 1
3.8 5.7 5 1    3.7 6.7 6 1   3.2 5.4 4 1   3.8 6.6 6 1
4.1 6.6 5 1    3.6 5.7 5 1   4.3 7.0 4 1   3.6 6.7 4 0
2.3 4.4 6 1    4.2 7.6 4 0   4.0 6.6 6 0   3.5 5.8 6 1
3.8 6.8 7 1    3.0 4.7 8 0   4.5 7.4 5 1   3.7 7.4 5 1
3.1 6.6 6 1    4.1 8.2 6 1   4.3 7.0 5 1   4.3 6.5 4 1
3.2 5.1 5 1    2.6 4.7 6 1   3.3 6.8 6 0   1.7 4.0 7 0
3.7 6.1 5 1    3.3 6.3 7 1   4.2 7.7 6 1   3.5 6.2 5 1
2.9 5.7 9 0    2.1 4.8 7 1   2.8 6.2 8 0   4.0 7.0 7 1
3.3 5.7 6 1    3.7 6.9 5 1   3.6 6.6 5 1
;

ods graphics on;
proc logistic data=roc plots(only)=roc;
   LogisticModel: model popind(event='0') = alb tp totscore;
   output out=LogiOut predicted=LogiPred;   /* output predicted values, to be used later */
run;
You can see the documentation for details about how to interpret the output from PROC LOGISTIC, but the example shows that you can use the PLOTS=ROC option (or the ROC statement) to create an ROC curve for a model that is fit by PROC LOGISTIC. For this model, the area under the ROC curve is 0.77. Because a random “coin flip” prediction has an expected area of 0.5, this model predicts the survival of surgery patients better than random chance.
A logistic model is not the only way to predict a binary response. You could also use a decision tree, a generalized mixed model, a nonparametric regression model, or even ask a human expert for her opinion.
An ROC curve requires only two quantities: for each observation, you need the observed binary response and a predicted probability. In fact, the PROC LOGISTIC documentation points out that the predicted probabilities can come from outside the procedure. In other words, you can use PROC LOGISTIC to create an ROC curve regardless of how the predicted probabilities are obtained! For argument’s sake, let’s suppose that you ask a human expert to predict the probability of each patient surviving for at least two months after surgery. (Notice that there is no statistical model here, only a probability for each patient.) The following SAS DATA step defines the predicted probabilities, which are then merged with the output from the earlier PROC LOGISTIC call:
data ExpertPred;
   input ExpertPred @@;
   datalines;
0.95 0.2  0.05 0.3  0.1  0.6  0.8  0.5  0.1  0.25
0.1  0.2  0.05 0.1  0.05 0.1  0.4  0.1  0.2  0.25
0.4  0.7  0.1  0.1  0.3  0.2  0.1  0.05 0.1  0.4
0.4  0.7  0.2  0.4  0.1  0.1  0.9  0.7  0.8  0.25
0.3  0.1  0.1
;
data Survival;
   merge LogiOut ExpertPred;
run;

/* create ROC curve from a variable that contains predicted values */
proc logistic data=Survival;
   model popind(event='0') = ExpertPred / nofit;
   roc 'Expert Predictions' pred=ExpertPred;
   ods select ROCcurve;
run;
Notice that you need to supply only two variables on the MODEL statement: the observed responses and the variable that contains the predicted values. On the ROC statement, I’ve used the PRED= option to indicate that the ExpertPred variable is not being fitted by the procedure. Although PROC LOGISTIC creates many tables, I’ve used the ODS SELECT statement to suppress all output except for the ROC curve.
You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows:
/* overlay two or more ROC curves by using variables of predicted values */
proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   ods select ROCOverlay;
   /* optional: for a statistical comparison, use the ROCCONTRAST stmt
      and remove the ODS SELECT stmt */
   *roccontrast reference('Expert Model') / estimate e;
run;
This ROC overlay shows that the “expert” prediction is almost always superior or equivalent to the logistic model in terms of true and false classification rates.
As noted in the comments of the previous call to PROC LOGISTIC, you can use the ROCCONTRAST statement to obtain a statistical analysis of the difference between the areas under the curves (AUC).
In summary, you can use the ROC statement in PROC LOGISTIC to generate ROC curves for models that were computed outside of PROC LOGISTIC. All you need are the predicted probabilities and the observed response for each observation. You can also overlay and compare two or more ROC curves and use the ROCCONTRAST statement to analyze the difference between areas under the curves.