The post The sample skewness is a biased statistic appeared first on The DO Loop.
The skewness of a distribution
indicates whether a distribution is symmetric or not.
The Wikipedia article about skewness discusses two common definitions for the sample skewness, including the definition used by SAS. In the middle of the article, you will discover the following sentence:
In general, the [estimators] are both biased estimators of the population skewness.
The article goes on to say that the estimators are not biased for symmetric distributions.
Similar statements are true for the sample kurtosis.
This statement might initially surprise you. After all, the statistics that we use to estimate the mean and standard deviation are unbiased. Although biased estimates are not inherently “bad,” it is useful to get an intuitive feel for how biased an estimator might be.
Let’s demonstrate the bias in the skewness statistic by running a Monte Carlo simulation.
Choose an asymmetric univariate distribution for which the population skewness is known. For example, the exponential distribution has skewness equal to 2. Then do the following:
1. Simulate B random samples of size N from the distribution.
2. Compute the sample skewness for each sample.
3. Average the B skewness statistics to obtain a Monte Carlo estimate of the expected value of the sample skewness, and compare it to the population parameter.
In this article, I will generate B=10,000 random samples of size N=100 from the exponential distribution. The simulation shows that the expected value of the skewness is NOT close to the population parameter. Hence, the skewness statistic is biased.
The following DATA step simulates B random samples of size N from the exponential distribution. The call to PROC MEANS computes the sample skewness for each sample. The call to PROC SGPLOT displays the approximate sampling distribution of the skewness. The graph overlays a vertical reference line at 2, which is the skewness parameter for the exponential distribution, and also overlays a reference line at the Monte Carlo estimate of the expected value.
%let NumSamples = 10000;
%let N = 100;
/* 1. Simulate B random samples from exponential distribution */
data Exp;
call streaminit(1);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      x = rand("Expo");
      output;
   end;
end;
run;

/* 2. Estimate skewness (and other stats) for each sample */
proc means data=Exp noprint;
   by SampleID;
   var x;
   output out=MCEst mean=Mean stddev=StdDev skew=Skewness kurt=Kurtosis;
run;

/* 3. Graph the sampling distribution and overlay parameter value */
title "Monte Carlo Distribution of Sample Skewness";
title2 "N = &N; B = &NumSamples";
proc sgplot data=MCEst;
   histogram Skewness;
   refline 2 / axis=x lineattrs=(thickness=3 color=DarkRed)
               labelattrs=(color=DarkRed) label="Parameter";
   refline 1.818 / axis=x lineattrs=(thickness=3 color=DarkBlue)
               label="Monte Carlo Estimate" labelattrs=(color=DarkBlue) labelloc=inside;
run;

/* 4. Display the Monte Carlo estimate of the statistics */
proc means data=MCEst ndec=3 mean stddev;
   var Mean StdDev Skewness Kurtosis;
run;
For the exponential distribution, the skewness parameter has the value 2. However, according to the Monte Carlo simulation, the expected value of the sample skewness is about 1.82 for these samples of size 100.
Thus, the bias is approximately 0.18, which is about 9% of the true value.
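You can reproduce this Monte Carlo study outside of SAS. The following sketch (in Python with NumPy, which is an assumption and not part of the original post) implements the adjusted Fisher-Pearson skewness statistic that SAS computes and repeats the simulation:

```python
import numpy as np

def sample_skewness(x):
    """Adjusted Fisher-Pearson skewness: the definition that SAS uses."""
    n = len(x)
    d = x - x.mean()
    m2 = np.mean(d**2)                       # second central moment
    m3 = np.mean(d**3)                       # third central moment
    g1 = m3 / m2**1.5                        # population (method-of-moments) estimator
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)

rng = np.random.default_rng(1)
B, N = 5000, 100
skews = np.array([sample_skewness(rng.exponential(size=N)) for _ in range(B)])
print(round(skews.mean(), 2))                # noticeably below the true skewness of 2
```

The Monte Carlo mean of the sample skewness lands near 1.8, well below the parameter value of 2, in agreement with the SAS simulation.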
The kurtosis statistic is also biased.
The output from PROC MEANS includes the Monte Carlo estimates for the expected value of the sample mean, standard deviation, skewness, and (excess) kurtosis. For the exponential distribution, the parameter values are 1, 1, 2, and 6, respectively. The Monte Carlo estimates for the sample mean and standard deviation are close to the parameter values because these are unbiased estimators. However, the estimates for the skewness and kurtosis are biased towards zero.
This article uses Monte Carlo simulation to demonstrate bias in the commonly used definitions of skewness and kurtosis.
For skewed distributions, the expected value of the sample skewness is biased towards zero. The bias is greater for highly skewed distributions. The skewness statistic for a symmetric distribution is unbiased.
The post Confidence intervals for eigenvalues of a correlation matrix appeared first on The DO Loop.
A fundamental principle of data analysis is that a statistic is an estimate of a parameter for the population. A statistic is calculated from a random sample. This leads to uncertainty in the estimate: a different random sample would have produced a different statistic.
To quantify the uncertainty, SAS procedures often support options that estimate standard errors for statistics and confidence intervals for parameters.
Of course, if statistics have uncertainty, so, too, do functions of the statistics. For complicated functions of the statistics,
the bootstrap method might be the only viable technique for quantifying the uncertainty.
This article shows how to obtain confidence intervals for the eigenvalues of a correlation matrix.
The eigenvalues are complicated functions of the correlation estimates.
The eigenvalues are used in a principal component analysis (PCA) to decide how many components to keep in a dimensionality reduction.
There are two main methods to estimate confidence intervals for eigenvalues: an asymptotic (large sample) method, which assumes that the eigenvalues are multivariate normal, and a bootstrap method, which makes minimal distributional assumptions.
The following sections show how to compute each method.
A graph of the results is shown to the right. For the data in this article, the bootstrap method generates confidence intervals that are more accurate than the asymptotic method.
This article was inspired by
Larsen and Warne (2010), “Estimating confidence intervals for
eigenvalues in exploratory factor analysis.” Larsen and Warne discuss why confidence intervals can be useful when deciding how many principal components to keep.
To demonstrate the techniques, let’s perform a principal component analysis (PCA) on the four continuous variables in the Fisher Iris data. In SAS, you can use PROC PRINCOMP to perform a PCA, as follows:
%let DSName = Sashelp.Iris;
%let VarList = SepalLength SepalWidth PetalLength PetalWidth;
/* 1. compute value of the statistic on original data */
proc princomp data=&DSName STD plots=none;   /* stdize PC scores to unit variance */
   var &VarList;
   ods select ScreePlot Eigenvalues NObsNVar;
   ods output Eigenvalues=EV0 NObsNVar=NObs(where=(Description="Observations"));
run;

proc sql noprint;
   select nValue1 into :N from NObs;   /* put the number of obs into macro variable, N */
quit;
%put &=N;
The first table shows that there are 150 observations. The second table displays the eigenvalues for the sample, which are 2.9, 0.9, 0.15, and 0.02.
If you want a graph of the eigenvalues, you can use the PLOTS(ONLY)=SCREE option on the PROC PRINCOMP statement.
The ODS OUTPUT statement creates SAS data sets from the tables.
The PROC SQL call creates a macro variable, N, that contains the number of observations.
If a sample size, n, is large enough, the sampling distribution of the eigenvalues is approximately multivariate normal (Larsen and Warne, 2010, p. 873). If g is an eigenvalue for a correlation matrix, then an asymptotic confidence interval is
g ± z^{*} sqrt( 2 g^{2} / n )
where z^{*} is the standard normal quantile, as computed in the following program:
/* Asymptotic CIs for eigenvalues (Larsen and Warne, 2010, p. 873) */
data AsympCI;
set EV0;
alpha = 0.05;
z = quantile("Normal", 1 - alpha/2);   /* = 1.96 */
SE = sqrt(2*Eigenvalue**2 / &N);
Normal_LCL = Eigenvalue - z*SE;        /* g +/- z* sqrt(2 g^2 / n) */
Normal_UCL = Eigenvalue + z*SE;
drop alpha z SE;
run;

proc print data=AsympCI noobs;
   var Number Eigenvalue Normal_LCL Normal_UCL;
run;
The lower and upper confidence limits are shown for each eigenvalue.
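As a quick numeric check of the asymptotic formula, the following sketch (Python standard library only; the value g = 2.9 is the rounded largest eigenvalue reported earlier, and N = 150) computes the interval by hand:

```python
from math import sqrt
from statistics import NormalDist

g, n, alpha = 2.9, 150, 0.05                 # largest eigenvalue (rounded), sample size
z = NormalDist().inv_cdf(1 - alpha / 2)      # standard normal quantile, about 1.96
se = sqrt(2 * g**2 / n)                      # sqrt(2 g^2 / n), about 0.335
lcl, ucl = g - z * se, g + z * se
print(round(lcl, 2), round(ucl, 2))          # prints 2.24 3.56
```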
The advantage of this method is its simplicity.
The intervals assume that the distribution of the eigenvalues is multivariate normal, which will occur when the sample size is very large. Since N=150 does not seem “very large,” it is not clear whether these confidence intervals are valid. Therefore, let’s estimate the confidence intervals by using the bootstrap method and compare the bootstrap intervals to the asymptotic intervals.
The bootstrap computations in this section follow the strategy outlined in the article “Compute a bootstrap confidence interval in SAS.” (For additional bootstrap tips, see “The essential guide to bootstrapping in SAS.”) The main steps are:
1. Compute the statistic on the original data.
2. Generate many bootstrap samples by resampling the data with replacement.
3. Compute the statistic for each bootstrap sample.
4. Estimate the 95% confidence interval as the 2.5th through 97.5th percentiles of the bootstrap distribution.
The steps are implemented in the following SAS program:
/* 2. Generate many bootstrap samples */
%let NumSamples = 5000;                  /* number of bootstrap resamples */
proc surveyselect data=&DSName NOPRINT seed=12345
     out=BootSamp(rename=(Replicate=SampleID))
     method=urs                          /* resample with replacement */
     samprate=1                          /* each bootstrap sample has N observations */
     /* use the OUTHITS option if you prefer duplicated rows
        instead of the NumberHits frequency variable */
     reps=&NumSamples;                   /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistic for each bootstrap sample */
/* Suppress output during this step:
   https://blogs.sas.com/content/iml/2013/05/24/turn-off-ods-for-simulations.html */
%macro ODSOff(); ods graphics off; ods exclude all; ods noresults; %mend;
%macro ODSOn();  ods graphics on;  ods exclude none; ods results;  %mend;

%ODSOff;
proc princomp data=BootSamp STD plots=none;
   by SampleID;
   freq NumberHits;
   var &VarList;
   ods output Eigenvalues=EV(keep=SampleID Number Eigenvalue);
run;
%ODSOn;

/* 4. Estimate 95% confidence interval as the 2.5th through 97.5th
   percentiles of the bootstrap distribution */
proc univariate data=EV noprint;
   class Number;
   var EigenValue;
   output out=BootCI pctlpre=Boot_ pctlpts=2.5 97.5 pctlname=LCL UCL;
run;

/* merge the bootstrap CIs with the normal CIs for comparison */
data AllCI;
   merge AsympCI(keep=Number Eigenvalue Normal:)
         BootCI(keep=Number Boot:);
   by Number;
run;

proc print data=AllCI noobs;
   format Eigenvalue Normal: Boot: 5.3;
run;
The table displays the bootstrap confidence intervals (columns 5 and 6) next to the asymptotic confidence intervals (columns 3 and 4).
It is easier to compare the intervals if you visualize them graphically, as follows:
/* convert from wide to long */
data CIPlot;
set AllCI;
Method = "Normal   ";  LCL = Normal_LCL;  UCL = Normal_UCL;  output;
Method = "Bootstrap";  LCL = Boot_LCL;    UCL = Boot_UCL;    output;
keep Method Eigenvalue Number LCL UCL;
run;

title "Comparison of Normal and Bootstrap Confidence Intervals";
title2 "Eigenvalues of the Correlation Matrix for the Iris Data";
ods graphics / width=480px height=360px;
proc sgplot data=CIPlot;
   scatter x=Eigenvalue y=Number / group=Method clusterwidth=0.4
           xerrorlower=LCL xerrorupper=UCL groupdisplay=cluster;
   yaxis grid type=discrete colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid;
run;
The graph is shown at the top of this article. The graph nicely summarizes the comparison. For the first (largest) eigenvalue, the bootstrap confidence interval is about half as wide as the normal confidence interval. Thus, the asymptotic result seems too wide for these data. For the other eigenvalues, the normal confidence intervals appear to be too narrow.
If you graph the bootstrap distribution, you can see that the bootstrap distribution does not appear to be multivariate normal. This presumably explains why the asymptotic intervals are so different from the bootstrap intervals. For completeness, the following graph shows a matrix of scatter plots and marginal histograms for the bootstrap distribution. The histograms indicate skewness in the bootstrap distribution.
This article shows how to compute confidence intervals for the eigenvalues of an estimated correlation matrix.
The first method uses a formula that is valid when the sampling distribution of the eigenvalues is multivariate normal. The second method uses bootstrapping to approximate the distribution of the eigenvalues, then uses percentiles of the distribution to estimate the confidence intervals. For the Iris data, the bootstrap confidence intervals are substantially different from the asymptotic formula.
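The percentile-bootstrap procedure is easy to sketch in other languages. The following Python code (an assumption: NumPy is available; synthetic correlated data stands in for the Iris variables, so the numbers differ from the tables above) mirrors the resampling steps:

```python
import numpy as np

rng = np.random.default_rng(12345)
n, B = 150, 2000                      # observations and bootstrap resamples
# Synthetic stand-in for the Iris variables: a shared factor induces correlation
z = rng.standard_normal((n, 1))
X = 0.8 * z + 0.6 * rng.standard_normal((n, 4))

def eigs(data):
    """Eigenvalues of the correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data.T)))[::-1]

est = eigs(X)                                          # statistic on original data
boot = np.array([eigs(X[rng.integers(0, n, n)])        # resample rows with replacement
                 for _ in range(B)])
lcl, ucl = np.percentile(boot, [2.5, 97.5], axis=0)    # percentile CI per eigenvalue
```

Because the trace of a 4 x 4 correlation matrix is 4, each row of bootstrap eigenvalues sums to 4, which is a handy sanity check on the computation.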
The post Generate random points in a polygon appeared first on The DO Loop.
The triangulation theorem for polygons says that every simple polygon can be triangulated.
In fact, if the polygon has V vertices, you can decompose it into V-2 non-overlapping triangles.
In this article, a “polygon” always means a simple polygon. Also, a “random point” means one that is drawn at random from the uniform distribution.
The triangularization of a polygon is useful in many ways, but one application is to generate uniform random points in a polygon or a collection of polygons. Because polygons can be decomposed into triangles, the problem reduces to a simpler one: Given a list of k triangles, generate uniform random points in the union of the triangles. I have already shown
how to generate random points in a triangle, so you can apply this method to generate random points in a polygon or collection of polygons.
Suppose that a polygon or any other planar region is decomposed into k triangles T_{1}, T_{2}, …, T_{k}. If you want to generate N random points uniformly in the region, the number of points in any triangle should be proportional to the area of the triangle divided by the total area of the polygon.
One way to accomplish this is to use a two-step process. First, choose a triangle by using a probability proportional to the relative areas. Next, generate a random point in that triangle. This two-step approach is suitable for
the SAS DATA step. At the end of this process, you have generated N_{i} observations in triangle T_{i}.
An equivalent formulation is to realize that the vector
{N_{1}, N_{2}, …, N_{k}} is a random draw from the multinomial distribution with parameters p = {p_{1}, p_{2}, …, p_{k}}, where
p_{i} = Area(T_{i}) / (Σ_{j} Area(T_{j})).
This second formulation is better for a vector language such as the SAS/IML language.
Therefore, the following algorithm generates random points in a polygon:
1. Decompose the polygon into k triangles.
2. Compute the area of each triangle.
3. Draw the number of points for each triangle, {N_{1}, N_{2}, …, N_{k}}, from the multinomial distribution with probabilities proportional to the relative areas.
4. Generate N_{i} random points uniformly in the triangle T_{i}.
Notice that Steps 2-4 of this algorithm apply to ANY collection of triangles. To make the algorithm flexible, I will implement the first step (the decomposition) in one function and the remaining steps in a second function.
There are various general methods for triangulating a polygon, but for convex polygons, there is a simple method. From among the V vertices, choose any vertex and call it P_{1}. Enumerate the remaining vertices consecutively in a counter-clockwise direction: P_{2}, P_{3}, …, P_{V}. Because the polygon is convex, the V-2 triangles {P_{1}, P_{2}, P_{3}}, {P_{1}, P_{3}, P_{4}}, …, {P_{1}, P_{V-1}, P_{V}} decompose the polygon.
The following SAS/IML function decomposes a convex polygon into triangles. The triangles are returned in a SAS/IML list.
The function is called on a convex polygon, and the resulting decomposition is shown below.
The function uses the PolyIsConvex function, which is part of the Polygon package. You can download and install the Polygon package; you need to load it before you call the function.
/* assume the polygon package is installed */
proc iml;
package load polygon;      /* load the polygon package */

/* Decompose a convex polygon into triangles. Return a list that
   contains the vertices for the triangles. This function uses a
   function in the Polygon package, which must be loaded. */
start TriangulateConvex(P);            /* input parameter (N x 2): vertices of polygon */
   isConvex = PolyIsConvex(P);
   if ^isConvex then
      return ( [] );                   /* The polygon is not convex */
   numTri = nrow(P) - 2;               /* number of triangles in convex polygon */
   L = ListCreate(numTri);             /* create list to store triangles */
   idx = 2:3;
   do i = 1 to ListLen(L);
      L$i = P[1,] // P[idx,];
      idx = idx + 1;
   end;
   return (L);
finish;

/* Specify a convex polygon and visualize the triangulation. */
P = { 2 1 ,
      3 1 ,
      4 2 ,
      5 4 ,
      3 6 ,
      1 4 ,
      1 2 };
L = TriangulateConvex(P);
To illustrate the process, I’ve included a graph that shows a decomposition of the convex polygon into triangles. The triangles are returned in a list. The next section shows how to generate uniform random points in the union of the triangles in this list.
This section generates random points in a union of triangles. The following function takes two arguments: the number of points to generate (N) and a list of triangles (L). The algorithm computes the relative areas of the triangles and uses them to determine the probability that a point will be generated in each. It then uses the RandUnifTriangle function from the previous article to generate the random points.
/* Given a list of triangles (L), generate N random points in the union,
   where the number of points is proportional to
   Area(triangle) / Area(all triangles).
   This function uses functions in the Polygon package, which must be loaded. */
start RandUnifManyTriangles(N, L);
   numTri = ListLen(L);
   /* compute areas of each triangle in the list */
   AreaTri = j(1, numTri, .);          /* create vector to store areas */
   do i = 1 to numTri;
      AreaTri[i] = PolyArea(L$i);      /* PolyArea is in the Polygon package */
   end;
   /* Numbers of points in the triangles are multinomial with probability
      proportional to Area(triangle)/Area(polygon) */
   NTri = RandMultinomial(1, N, AreaTri/sum(AreaTri));
   cumulN = 0 || cusum(NTri);          /* cumulative counts; use as indices */
   z = j(N, 3, .);                     /* columns are (x,y,TriangleID) */
   do i = 1 to numTri;
      k = (cumulN[i]+1):cumulN[i+1];   /* the next NTri[i] elements */
      z[k, 1:2] = RandUnifTriangle(L$i, NTri[i]);
      z[k, 3] = i;                     /* store the triangle ID */
   end;
   return z;
finish;

/* The RandUnifTriangle function is defined at
   https://blogs.sas.com/content/iml/2020/10/19/random-points-in-triangle.html */
load module=(RandUnifTriangle);
call randseed(12345);
N = 2000;
z = RandUnifManyTriangles(N, L);
The z vector is an N x 3 matrix. The first two columns contain the (x,y) coordinates of N random points. The third column contains the ID number (values 1,2,…,k) that indicates the triangle that each point is inside of. You can use the PolyDraw function in the Polygon package to visualize the distribution of the points within the polygon:
title "Random Points in a Polygon";
title2 "Colors Assigned Based on Triangulation";
call PolyDraw(P, z);
The color of each point indicates which triangle the point is inside. You can see that triangles with relatively small areas (blue and purple) have fewer points than triangles with larger areas (green and brown).
In summary, this article shows how to generate random points inside a planar polygon. The first step is to decompose the polygon into triangles. You can use the relative areas of the triangles to determine the probability that a random point is in each triangle. Finally, you can generate random points in the union of the triangles.
(Note: The algorithm works for any collection of planar triangles.)
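Steps 2-4 of the algorithm can also be sketched in Python (NumPy assumed; the unit square split along its diagonal is a hypothetical example, not the polygon from this post):

```python
import numpy as np

rng = np.random.default_rng(12345)

def tri_area(T):
    """Area of a triangle given as a 3x2 array of vertices (cross-product formula)."""
    a, b = T[1] - T[0], T[2] - T[0]
    return 0.5 * abs(a[0] * b[1] - a[1] * b[0])

def rand_unif_triangle(T, n):
    """n uniform random points in triangle T, via the reflection method."""
    a, b = T[1] - T[0], T[2] - T[0]
    u = rng.random((n, 2))
    outside = u.sum(axis=1) > 1            # points that land outside the triangle
    u[outside] = 1 - u[outside]            # reflect them back inside
    return T[0] + u[:, [0]] * a + u[:, [1]] * b

def rand_unif_triangles(N, tris):
    """N uniform random points in a union of triangles; counts are multinomial."""
    areas = np.array([tri_area(T) for T in tris])
    counts = rng.multinomial(N, areas / areas.sum())   # N_i proportional to area
    return np.vstack([rand_unif_triangle(T, k) for T, k in zip(tris, counts)])

# Hypothetical example: the unit square decomposed into two triangles
tris = [np.array([[0., 0.], [1., 0.], [1., 1.]]),
        np.array([[0., 0.], [1., 1.], [0., 1.]])]
pts = rand_unif_triangles(1000, tris)
```

For this example, every generated point lies in the unit square, and the two triangles receive roughly equal counts because their areas are equal.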
This article uses functions in the Polygon package. Installing and loading a package is a way to define a set of related functions that you want to share. It is an alternative to using %INCLUDE to include the module definitions into your program.
Debugging SASUSER issues when you use SAS® software was published on SAS Users.
When you use SAS software, you might occasionally encounter an issue with SASUSER. This post helps you debug some of the more common issues.
By default, SAS tries to store custom templates and styles that PROC TEMPLATE creates in SASUSER. In some SAS environments with multiple users on a server, your SASUSER location might be read-only (set with the RSASUSER option). If you do not need the template or style to persist between sessions, you can set the template path to include the WORK library first:
ods path(prepend) work.template(update);
If you are working with a local SAS session, this issue can occur when a corrupt or old copy of the templat.sas7bitm file exists in your SASUSER directory. To resolve the issue, first determine the location of your SASUSER directory by submitting the following code:
proc options option=sasuser; run;
If you see a note or warning in the log indicating that SAS cannot open the SASUSER.PROFILE catalog, first ensure that you have only a single SAS session running. If you have multiple SAS sessions running concurrently, only the first SAS session has Update access to SASUSER.
If only one SAS session is active and you still receive a note or warning that SAS cannot open SASUSER.PROFILE, determine the location of your SASUSER directory:
proc options option=sasuser; run;
In both Microsoft Windows and UNIX operating environments, rename the corresponding files in that directory so that they have a file extension that SAS does not recognize. SAS creates new, uncorrupted copies of the files when you restart SAS.
If you see a note or warning in the log indicating that SAS cannot open SASUSER.REGSTRY, first ensure that you have only a single SAS session running. If you have multiple SAS sessions running concurrently, only the first SAS session has Update access to SASUSER.
If only one SAS session is active and you still receive a note or warning that SAS cannot open SASUSER.REGSTRY, determine the location of your SASUSER directory:
proc options option=sasuser; run;
If one or more files or catalogs in SASUSER are corrupted, various abnormal endings and errors can occur when you use ODS or when you create graphics output.
If you suspect that this is the case, determine the location of your SASUSER directory by submitting the following code to SAS:
proc options option=sasuser; run;
If you follow the debugging steps for any of the issues outlined above and find that you still have only Read access to SASUSER, the problem might be with your SAS installation. Specifically, your installation might have the RSASUSER SAS system option set. This system option sets SASUSER to Read-Only mode. To determine the current setting for this option, submit the following statements to SAS and then check the new information that is written to the SAS log:
proc options option=rsasuser; run;
In a multiuser SAS environment or SAS Grid Computing environment, RSASUSER might be set by policy. In that case, you must adjust your programs and processes so that they do not rely on SASUSER for personal content. If you are working with a local or private SAS environment, you can change the option to NORSASUSER in your SAS configuration file.
As you can see from this post, a variety of reasons can cause issues with the SASUSER directory. These issues can occur when one or more catalogs or item stores in your SASUSER directory become corrupted or are created with an earlier installation of SAS. However, if you rename the catalogs or item stores with a file extension that SAS does not recognize, SAS creates new, uncorrupted copies of these files when you restart SAS.
The post Generate random points in a triangle appeared first on The DO Loop.
How can you efficiently generate N random uniform points in a triangular region of the plane?
There is a very cool algorithm (which I call the reflection method) that
makes the process easy. I no longer remember where I saw this algorithm, but it is different from the “weighted average” method in Devroye (1986, p. 569-570).
This article describes and implements the reflection algorithm for generating random points in a triangle from the uniform distribution. The graph to the right shows 1,000 random points in the triangle with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0). The method works for any kind of triangle: acute, obtuse, equilateral, and so forth.
In this article, “random points” means that the points are drawn randomly from the uniform distribution.
The easiest way to understand the algorithm is to think about generating points in a parallelogram. For simplicity, translate the parallelogram so that one vertex is at the origin. Two sides of the parallelogram share that vertex. Let a and
b be the vectors from the origin to the adjacent vertices.
To produce a random point in the parallelogram, generate u1, u2 ~ U(0,1) and form the vector sum
p = u1*a + u2*b
This is the 2-D parameterization of the parallelogram, so for random u1 and u2, the point p is uniformly distributed in the parallelogram. The geometry is shown in the following figure.
The following SAS/IML program generates 1,000 random points in the parallelogram. The graph is shown above.
proc iml;
n = 1000;
call randseed(1234);
/* random points in a parallelogram */
a = {3  2};                        /* vector along one side */
b = {1 -2};                        /* vector along adjacent side */
u = randfun(n // 2, "Uniform");    /* u[,1], u[,2] ~ U(0,1) */
w = u[,1]@a + u[,2]@b;             /* linear combination of a and b */

title "Random Points in Parallelogram";
call scatter(w[,1], w[,2]) grid={x,y};
The only mysterious part of the program is the use of the Kronecker product (the ‘@’ operator)
to form linear combinations of the
a and b vectors. The details of the Kronecker product operator are described in a separate article. Briefly, it is an efficient way to generate linear combinations of a and b without writing a loop.
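For readers who think in other matrix languages, here is the same parallelogram sampler sketched in Python (NumPy assumed), where broadcasting plays the role of the Kronecker-product trick:

```python
import numpy as np

rng = np.random.default_rng(1234)
n = 1000
a = np.array([3.0, 2.0])      # vector along one side
b = np.array([1.0, -2.0])     # vector along the adjacent side
u = rng.random((n, 2))        # u[:, 0], u[:, 1] ~ U(0,1)
# Each row of w is u1*a + u2*b; broadcasting forms all linear combinations at once
w = u[:, [0]] * a + u[:, [1]] * b
```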
A useful fact about random uniform variates is that if u ~ U(0,1), then also v = 1 – u ~ U(0,1).
You can use this fact to convert N points in a parallelogram into N points in a triangle.
Let u1, u2 ~ U(0,1) be random variates in (0,1).
If u1 + u2 ≤ 1, then the vector u1*a + u2*b is in the triangle with sides a and b.
If u1 + u2 > 1, then define v1 = 1 – u1 and v2 = 1 – u2, which are also random uniform variates. The sum satisfies v1 + v2 = 2 – (u1 + u2) < 1, which means that the vector v1*a + v2*b is in the triangle with sides a and b.
This is shown in the following graph. The blue points are the points for which u1 + u2 ≤ 1.
The red points are for u1 + u2 > 1. When you form v1 and v2, the red triangle gets reflected twice and ends up on top of the blue triangle. The two reflections are equivalent to a 180-degree rotation about the center of the parallelogram, which might be easier to visualize.
With this background, you can now generate random points in any triangle. Let P1, P2, and P3 be the vertices of the triangle. The algorithm to generate random points in the triangle is as follows:
1. Define the vectors a = P2 – P1 and b = P3 – P1.
2. Generate random uniform variates u1, u2 ~ U(0,1).
3. If u1 + u2 > 1, apply the transformation u1 → 1 – u1 and u2 → 1 – u2.
4. Form w = u1*a + u2*b, which is a random point in the triangle that has a vertex at the origin. The point P1 + w is a random point in the original triangle.
The following SAS/IML program implements this algorithm and runs it for the triangle
with vertices P1=(1, 2), P2=(4, 4), and P3=(2, 0).
/* generate random uniform sample in triangle with vertices
   P1 = (x0,y0), P2 = (x1,y1), and P3 = (x2,y2)
   The triangle is specified as a 3x2 matrix, where each row is a vertex. */
start randUnifTriangle(P, n);
   a = P[2,] - P[1,];               /* translate triangle to origin */
   b = P[3,] - P[1,];               /* a and b are vectors at the origin */
   u = randfun(n // 2, "Uniform");
   idx = loc(u[,+] >= 1);           /* identify points outside of the triangle */
   if ncol(idx)>0 then
      u[idx,] = 1 - u[idx,];        /* transform variates into the triangle */
   w = u[,1]@a + u[,2]@b;           /* linear combination of a and b vectors */
   return( P[1,] + w );             /* translate triangle back to original position */
finish;
store module=(randUnifTriangle);

/* triangle contains three vertices */
call randseed(1234,1);
P = {1 2,      /* P1 */
     4 4,      /* P2 */
     2 0};     /* P3 */
n = 1000;
w = randUnifTriangle(P, n);

title "Random Points in Triangle";
ods graphics / width=480px height=480px;
call scatter(w[,1], w[,2]) grid={x,y};
The graph of the 1,000 random points appears at the top of this program.
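As a cross-check, the reflection method is easy to port to Python (NumPy assumed; the barycentric-coordinate helper is an addition for verifying that every generated point lies inside the triangle, not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(1234)

def rand_unif_triangle(P, n):
    """Uniform random points in the triangle whose vertices are the rows of P."""
    a, b = P[1] - P[0], P[2] - P[0]       # translate the triangle to the origin
    u = rng.random((n, 2))
    outside = u.sum(axis=1) > 1           # these points fall in the wrong half
    u[outside] = 1 - u[outside]           # reflect them into the triangle
    return P[0] + u[:, [0]] * a + u[:, [1]] * b

def barycentric(P, pts):
    """Barycentric coordinates of each point with respect to triangle P."""
    A = np.column_stack([P[1] - P[0], P[2] - P[0]])
    st = np.linalg.solve(A, (pts - P[0]).T).T          # (s, t) for each point
    return np.column_stack([1 - st.sum(axis=1), st])   # all in [0,1] iff inside

P = np.array([[1., 2.], [4., 4.], [2., 0.]])           # P1, P2, P3 from the text
w = rand_unif_triangle(P, 1000)
```

A point is inside the triangle exactly when all three of its barycentric coordinates lie in [0, 1], which holds for every point the reflection method produces.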
As written, the programs in this article create scatter plots that show the random points. To improve the exposition, I used the polygon package to draw graphs that overlay the scatter plot and a polygon. You can download and install the polygon package if you have PROC IML with SAS 9.4m3 or later.
You can download the complete SAS program that performs all the computations and creates all the graphs in this article.
In summary, this article shows how to generate random uniform points in a triangle by using the reflection algorithm. The reflection algorithm is based on generating random points in a parallelogram. If you draw the diagonal of a parallelogram, you get two congruent triangles. The algorithm reflects (twice) all points in one triangle into the other triangle.
The algorithm is implemented in SAS by using the SAS/IML language, although you could also use the SAS DATA step.
Using Microsoft Excel functions in SAS was published on SAS Users.
You might have heard about the SAS–Microsoft partnership announced in June 2020, which officially joined the powers of SAS analytics with Microsoft’s cloud technology to further advance artificial intelligence (AI).
This partnership did not just happen out of nowhere. SAS has a long and deep history of integrating with Microsoft technologies. Examples include:
In this post, we will look at a lesser-known but quite useful feature in SAS that allows SAS users to bring many Microsoft Excel functions right into their SAS programs. I hope that many SAS users (not just MS Excel aficionados) will love to discover this functionality within SAS.
SAS has a wide variety of built-in functions; however, there are still many Microsoft Excel functions that are not intrinsically implemented in SAS. Luckily, many of them are made available in SAS via PROC FCMP as user-defined functions (see the documentation section PROC FCMP and Microsoft Excel). These functions are predefined for you, and their definitions are stored in the SASHELP.SLKWXL data table provided with your SAS installation. You can generate a list of these functions by running the following code:
proc fcmp inlib=SASHELP.SLKWXL listall; run;
You can also capture the list of available Excel functions in a SAS data table by using ODS OUTPUT with the CODELIST= option:
ods noresults;
ods output codelist=WORK.EXCEL_FUNCTIONS_LIST (keep=COL1 COL2);
proc fcmp inlib=SASHELP.SLKWXL listall;
run;
ods output close;
ods results;
From this data table, you can produce a nice-looking HTML report listing all these functions:
data WORK.EXCEL_SAS_FUNCTIONS (keep=exc sas arg);
   label exc='Excel Function' sas='SAS Function' arg='Arguments';
   set WORK.EXCEL_FUNCTIONS_LIST (rename=(col2=arg));
   sas = tranwrd(col1,'Function ','');
   exc = tranwrd(sas,'_slk','');
run;

ods html path='c:\temp' file='excel_sas_functions.html';
title 'List of Excel functions available in SAS (via SASHELP.SLKWXL)';
proc print data=EXCEL_SAS_FUNCTIONS label;
run;
ods html close;
When you run this code, you should get the following list of Excel functions along with their SAS equivalents:
Obs | Excel Function | SAS Function | Arguments |
--- | --- | --- | --- |
1 | even | even_slk | ( x ) |
2 | odd | odd_slk | ( x ) |
3 | factdouble | factdouble_slk | ( x ) |
4 | product | product_slk | ( nums ) |
5 | multinomial | multinomial_slk | ( nums ) |
6 | floor | floor_slk | ( n, sg ) |
7 | datdif4 | datdif4_slk | ( start, end ) |
8 | amorlinc | amorlinc_slk | ( cost, datep, fperiod, salvage, period, rate, basis ) |
9 | amordegrc | amordegrc_slk | ( cost, datep, fperiod, salvage, period, rate, basis ) |
10 | disc | disc_slk | ( settlement, maturity, pr, redemp, basis ) |
11 | tbilleq | tbilleq_slk | ( settlement, maturity, discount ) |
12 | tbillprice | tbillprice_slk | ( settlement, maturity, discount ) |
13 | tbillyield | tbillyield_slk | ( settlement, maturity, par ) |
14 | dollarde | dollarde_slk | ( fdollar, frac ) |
15 | dollarfr | dollarfr_slk | ( ddollar, frac ) |
16 | effect | effect_slk | ( nominal_rate, npery ) |
17 | coupnum | coupnum_slk | ( settlement, maturity, freq, basis ) |
18 | coupncd | coupncd_slk | ( settlement, maturity, freq, basis ) |
19 | coupdaysnc | coupdaysnc_slk | ( settlement, maturity, freq, basis ) |
20 | couppcd | couppcd_slk | ( settlement, maturity, freq, basis ) |
21 | coupdays | coupdays_slk | ( settlement, maturity, freq, basis ) |
22 | db | db_slk | ( cost, salvage, life, period, month ) |
23 | yield | yield_slk | ( settlement, maturity, rate, pr, redemp, freq, basis ) |
24 | yielddisc | yielddisc_slk | ( settlement, maturity, pr, redemp, basis ) |
25 | coupdaybs | coupdaybs_slk | ( settlement, maturity, freq, basis ) |
26 | oddfprice | oddfprice_slk | ( settlement, maturity, issue, fcoupon, rate, yield, redemp, freq, basis ) |
27 | oddfyield | oddfyield_slk | ( settlement, maturity, issue, fcoupon, rate, pr, redemp, freq, basis ) |
28 | oddlyield | oddlyield_slk | ( settlement, maturity, linterest, rate, pr, redemp, freq, basis ) |
29 | oddlprice | oddlprice_slk | ( settlement, maturity, linterest, rate, yield, redemp, freq, basis ) |
30 | price | price_slk | ( settlement, maturity, rate, yield, redemp, freq, basis ) |
31 | pricedisc | pricedisc_slk | ( settlement, maturity, discount, redemp, basis ) |
32 | pricemat | pricemat_slk | ( settlement, maturity, issue, rate, yld, basis ) |
33 | yieldmat | yieldmat_slk | ( settlement, maturity, issue, rate, pr, basis ) |
34 | received | received_slk | ( settlement, maturity, investment, discount, basis ) |
35 | accrint | accrint_slk | ( issue, finterest, settlement, rate, par, freq, basis ) |
36 | accrintm | accrintm_slk | ( issue, maturity, rate, par, basis ) |
37 | duration | duration_slk | ( settlement, maturity, coupon, yld, freq, basis ) |
38 | mduration | mduration_slk | ( settlement, maturity, coupon, yld, freq, basis ) |
39 | avedev | avedev_slk | ( data ) |
40 | devsq | devsq_slk | ( data ) |
41 | varp | varp_slk | ( data ) |
NOTE: The Excel functions made available in SAS are named after their Excel parent functions, with the suffix _SLK appended to distinguish them from their Excel incarnations, as well as from native SAS functions.
To use any of these Excel functions in your SAS code, all you need to do is specify the function definitions data table in the CMPLIB= system option:
options cmplib=SASHELP.SLKWXL;
Let’s consider several examples.
The ODD function (odd_slk) returns a number rounded up to the nearest odd integer:
options cmplib=SASHELP.SLKWXL;
data _null_;
   x = 5.9;
   y = odd_slk(x);
   put 'odd( ' x ') = ' y;
run;
SAS log:
odd( 5.9 ) = 7
The EVEN function (even_slk) returns a number rounded up to the nearest even integer:
options cmplib=SASHELP.SLKWXL;
data _null_;
   x = 6.4;
   y = even_slk(x);
   put 'even( ' x ') = ' y;
run;
SAS log:
even( 6.4 ) = 8
The FACTDOUBLE function (factdouble_slk) returns the double factorial of a number. If the number is not an integer, it is truncated.
Double factorial (or semifactorial) of a number n, denoted by n!!, is the product of all the integers from 1 up to n that have the same parity as n.
For even n, the double factorial is n!!=n(n-2)(n-4)…(4)(2), and for odd n, the double factorial is n!! = n(n-2)(n-4)…(3)(1).
Here is a SAS code example using the factdouble() Excel function:
options cmplib=SASHELP.SLKWXL;
data _null_;
   n = 6;
   m = 7;
   nn = factdouble_slk(n);
   mm = factdouble_slk(m);
   put n '!! = ' nn / m '!! = ' mm;
run;
It will produce the following SAS log:
6 !! = 48
7 !! = 105
Indeed, 6!! = 2 x 4 x 6 = 48 and 7!! = 1 x 3 x 5 x 7 = 105.
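For readers who want to verify the arithmetic outside of SAS, here is a minimal double-factorial sketch in Python (the function name is mine; it is not part of SAS or Excel):

```python
def double_factorial(n: int) -> int:
    """Product of the integers from 1 up to n that have the same parity as n."""
    n = int(n)          # Excel's FACTDOUBLE truncates non-integer input
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

print(double_factorial(6))   # 2 * 4 * 6 = 48
print(double_factorial(7))   # 1 * 3 * 5 * 7 = 105
```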
The PRODUCT function (product_slk) multiplies all elements of a SAS numeric array given as its argument and returns the product:
options cmplib=SASHELP.SLKWXL;
data _null_;
   array x x1-x5 (5, 7, 1, 2, 2);
   p = product_slk(x);
   put 'x = ( ' x1-x5 ')';
   put 'product(x) = ' p;
run;
SAS log:
x = ( 5 7 1 2 2 )
product(x) = 140
Indeed 5*7*1*2*2 = 140.
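As a quick cross-check outside of SAS, Python's built-in math.prod computes the same product:

```python
from math import prod

nums = [5, 7, 1, 2, 2]
print(prod(nums))   # 140
```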
The MULTINOMIAL function (multinomial_slk) returns the ratio of the factorial of a sum of values to the product of the individual factorials:
MULTINOMIAL(a1, a2, … , an) = (a1 + a2 + … + an)! / (a1! a2! … an!)
In SAS, the argument to this function is specified as a numeric array name:
options cmplib=SASHELP.SLKWXL;
data _null_;
   array a a1-a3 (1, 3, 2);
   m = multinomial_slk(a);
   put 'a = ( ' a1-a3 ')';
   put 'multinomial(a) = ' m;
run;
SAS log:
a = ( 1 3 2 )
multinomial(a) = 60
Indeed, (1+3+2)! / (1! × 3! × 2!) = 720 / 12 = 60.
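You can cross-check the multinomial arithmetic in Python with math.factorial (this helper is mine, not SAS code):

```python
from math import factorial

def multinomial(values):
    """(a1 + a2 + ... + an)! divided by (a1! * a2! * ... * an!)."""
    numer = factorial(sum(values))
    denom = 1
    for a in values:
        denom *= factorial(a)
    return numer // denom

print(multinomial([1, 3, 2]))   # 720 // 12 = 60
```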
You can explore other Excel functions available in SAS via SASHELP.SLKWXL user-defined functions by cross-referencing them with the corresponding Microsoft Excel functions documentation (alphabetical or by categories). As you can see in the above List of Excel functions available in SAS, besides mathematical and statistical functions exemplified in the previous section, there are also many Excel financial functions related to securities trading that are made available in SAS.
Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.
Using Microsoft Excel functions in SAS was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
The post A continuous band plot for visualizing uncertainty in regression predictions appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
A previous article discusses the confidence band for the mean predicted value in a regression model.
The article shows a “graded confidence band plot,” which I saw in
Claus O. Wilke’s online book, Fundamentals of Data Visualization (Section 16.3). It communicates uncertainty in the predictions.
A graded band plot is shown to the right for the 95% (outermost), 90%, 80%, 70% and 50% (innermost) confidence levels.
The graded band plot to the right has five confidence levels, but that was an arbitrary choice. Why not use 10 levels? Or 100?
When you start using many levels, it is no longer feasible to identify the individual confidence bands.
However, you can use a heat map to visualize the uncertainty in the mean. Given any (x,y) value in the graph, you can find the value of α such that the 100(1-α)% confidence limit passes through (x,y).
You can then color the point (x,y) by the value of α and use a gradient legend to indicate the α values (or, equivalently, the confidence levels).
An example is shown below.
In this article, I show how to create this visualization, which I call a continuous band plot.
This article was inspired by a question and graph that Michael Friendly posted on Twitter.
Most statistics textbooks provide a formula for the 100(1-α)% confidence limits for the predicted mean of a regression model. You can see the formula in the SAS documentation.
To keep the math simple, I will restrict the discussion to one-dimensional linear regression of the form Y = b0 + b1*x + ε, where ε ~ N(0, σ).
Suppose that the sample contains n observations.
Given a significance level (α) and values for the explanatory variable (x), the upper and lower confidence limits are given by
CLM(x) = pred(x) ± t_α SEM(x)
where
- pred(x) = b0 + b1*x is the predicted mean at x,
- t_α is the 1 – α/2 quantile of the t distribution with n – 2 degrees of freedom, and
- SEM(x) = sqrt(MSE · h(x)) is the standard error of the predicted mean, where h(x) is the leverage of x.
The SAS/IML documentation includes a Getting Started example about computing statistics for linear regression. The following program computes these quantities for α=0.05 and for a sequence of x values:
data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;

proc iml;                      /* Use IML to produce regression estimates */
use Have;  read all var {x y};  close;
X = j(nrow(x),1,1) || x;       /* add intercept col to design matrix */
n = nrow(x);                   /* sample size */
xpxi = inv(X`*X);              /* inverse of X'X */
b = xpxi * (X`*y);             /* parameter estimates of coefficients */
yHat = X*b;                    /* predicted values for data */
dfe = n - ncol(X);             /* error DF */
mse = ssq(y-yHat)/dfe;         /* MSE = SSQ(resid)/dfe */

/* Given M = inv(X`*X), compute leverage at x. In general, the leverage
   is h = vecdiag(x*M*x`), but for a single regressor this simplifies. */
start leverage(x, M);
   return ( M[1,1] + 2*M[1,2]*x + M[2,2]*x##2 );
finish;

alpha = 0.05;
t_a = quantile("t", 1-alpha/2, dfe);

/* evaluate CLM for a sequence of x values */
z = T(do(20,40,1));            /* sequence of x values in [20,40] */
h = leverage(z, xpxi);         /* evaluate h(x) */
SEM = sqrt(mse * h);           /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*z;          /* evaluate pred(x) */
LowerCLM = Pred - t_a * SEM;
UpperCLM = Pred + t_a * SEM;
print z Pred LowerCLM UpperCLM;
The program computes the predicted values and the lower and upper 95% confidence limits for the mean for a sequence of x values in the interval [20, 40]. If you overlay these curves on a scatter plot of the data, you obtain the usual regression fit plot that is produced automatically by regression procedures in SAS.
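If you want to sanity-check the CLM formula outside of SAS/IML, here is a sketch of the same computation in Python using NumPy and SciPy (the variable names loosely mirror the IML program; this is a cross-check, not the original code):

```python
import numpy as np
from scipy.stats import t as t_dist

x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, p = X.shape
xpxi = np.linalg.inv(X.T @ X)               # inverse of X'X
b = xpxi @ (X.T @ y)                        # parameter estimates (b0, b1)
dfe = n - p                                 # error degrees of freedom
mse = np.sum((y - X @ b) ** 2) / dfe        # mean squared error

alpha = 0.05
t_a = t_dist.ppf(1 - alpha / 2, dfe)        # t quantile

z = np.arange(20, 41, dtype=float)          # x values at which to evaluate CLM
h = xpxi[0, 0] + 2 * xpxi[0, 1] * z + xpxi[1, 1] * z**2   # leverage h(z)
sem = np.sqrt(mse * h)                      # standard error of predicted mean
pred = b[0] + b[1] * z
lower, upper = pred - t_a * sem, pred + t_a * sem
# at z=30, pred is about 49.2 and the 95% CLM is about [42.4, 56.0]
```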
The previous section implements formulas that are typically presented in a first course on linear regression. Given α and x, the formulas enable you to find the upper and lower CLM at x.
But here is a cool thing that I haven’t seen before: You can INVERT that formula.
That is, given a
value for the explanatory variable (x) and the response (y), you can find the significance level (α) such that the 100(1-α)% confidence limit passes through (x, y).
This enables you to compute a heat map that visualizes the uncertainty in the predicted mean.
It is not hard to invert the CLM formula. You can use algebra to find
t_α = |y – pred(x)| / SEM(x)
You can then apply the CDF function to both sides to find
1 – α/2 = CDF(“t”, |y – pred(x)| / SEM(x), n-2)
which you can solve for α(x,y).
The following program carries out these computations:
/* Given (x,y), find alpha(x,y) so that the 100(1-alpha)% confidence
   limit passes through (x,y). Vectorize this process and compute
   alpha(x,y) for a grid of (x,y) values. */
xx = do(20, 40, 0.25);      /* x in [20,40] */
yy = do(26, 73, 0.2);       /* y in [26,73] */
xy = ExpandGrid(yy, xx);    /* generate (x,y) pairs where x changes fastest */
x = xy[,2];  y = xy[,1];    /* vectors of x & y */
h = leverage(x, xpxi);      /* h(x) */
SEM = sqrt(mse * h);        /* SEM(x) = standard error of predicted mean */
Pred = b[1] + b[2]*x;       /* best prediction of mean at x */
LHS = abs(y - Pred)/SEM;
p = cdf("t", LHS, dfe);     /* = 1 - alpha/2 */
alpha = 2*(1 - p);
ConfidenceLevel = 100*(1-alpha);

/* write the data to a SAS data set */
create Heat var{'x' 'y' 'alpha' 'ConfidenceLevel'};
append;
close;
QUIT;
For each (x,y) value in a grid, the program computes the α value such that the 100(1-α)% confidence limit passes through (x,y). These values are written to a SAS data set and are used in the next section.
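The inversion itself is easy to express outside of IML. Here is a small self-contained Python sketch (the helper name alpha_through is mine) that refits the line to the article's ten data points and returns α for a given point:

```python
import numpy as np
from scipy.stats import t as t_dist

x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)
n, dfe = len(x), len(x) - 2
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / dfe

def alpha_through(px, py):
    """alpha such that the 100(1-alpha)% CLM passes through (px, py)."""
    h = 1 / n + (px - x.mean()) ** 2 / Sxx          # leverage at px
    sem = np.sqrt(mse * h)                          # std error of predicted mean
    p = t_dist.cdf(abs(py - (b0 + b1 * px)) / sem, dfe)   # = 1 - alpha/2
    return 2 * (1 - p)

# The point (30, 53.3) lies on (approximately) the 80% confidence limit:
print(100 * (1 - alpha_through(30, 53.3)))   # ~ 80
```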
The following statements visualize the uncertainty in the predicted value by creating a heat map of the confidence bands for a range of confidence values:
data ContBand;
set Have Heat(rename=(x=xx y=yy));   /* concatenate with orig data */
label xx="x" yy="y";
run;

title 'Continuous Band Plot';
title2 'Confidence for Mean Predicted Value';
proc sgplot data=ContBand noautolegend;
   heatmapparm x=xx y=yy colorresponse=ConfidenceLevel /
       colormodel=(VeryDarkGray Gray White) name="prob";
   reg x=x y=y / markerattrs=(symbol=CircleFilled size=12);
   gradlegend "prob";
   yaxis min=30 max=68;
run;
The graph is shown in the first section of this article. The graph is similar to Wilke’s graded confidence band plot. However, instead of overlaying a small number of confidence bands, it visualizes “infinitely many” confidence bands. Basically, it inverts the usual process (limits as a function of confidence level) to produce confidence as a function of y, for each x.
The gradient legend associates a color to a confidence level. In theory, you can associate each (x,y) point with a confidence level. In practice, it is hard to visually distinguish grey scales. If it is important to distinguish, say, the 80% confidence region, you can specify a different color ramp in the COLORMODEL= option. For example, you could use colormodel=(black red yellow cyan white) if you are looking for vibrant colors that emphasize the 25%, 50%, and 75% levels.
It is enlightening to take a vertical slice of the continuous band plot. A vertical slice provides a “profile plot,” which shows how the confidence level changes as a function of y for a fixed value of x. Here is the profile plot at x=30:
title 'Profile Plot of Confidence Level';
title2 'x = 30';
proc sgplot data=Heat noautolegend;
   where x = 30;
   series x=ConfidenceLevel y=y;
   refline 49.2 / axis=y;
   yaxis min=30 max=68 grid;
   xaxis grid;
run;
When x=30, the predicted mean is 49.2, which is marked by a horizontal reference line.
That point prediction is the “0% confidence interval,” since it does not account for sampling variability.
For ConfidenceLevel > 0, the curves give the width of the confidence interval for the predicted mean.
The width is small and increases almost linearly with the confidence level until about the 70% level.
The lower and upper 80% CLMs are 45.1 and 53.3, respectively.
The lower and upper 95% CLMs are much bigger at 42.4 and 56.0, respectively.
The 99% CLMs are 39.3 and 59.1. As the confidence level approaches 100%, the width of the interval increases without bound.
In summary, this article shows formulas for the lower and upper 100(1-α)% confidence limits for the predicted mean at a point x. You can invert the formulas: for any (x,y), you can find the value of α for which the 100(1-α)% confidence limits pass through the point (x,y). You can use a heat map to visualize the resulting surface, which is a continuous version of the graded confidence band plot.
The continuous band plot is one way to visualize the uncertainty of the predicted mean in a regression model.
The post Visualize uncertainty in regression predictions appeared first on The DO Loop.
You’ve probably seen many graphs that are similar to the one at the right. This plot shows a regression line overlaid on a scatter plot of some data. Given a value for the independent variable (x), the regression line gives the best prediction for the mean of the response variable (y). The light blue band shows a 95% confidence band for the conditional mean.
This article is about how to understand the confidence band. The band conveys uncertainty in the location of the conditional mean. You can think of the confidence band as being a bunch of vertical confidence intervals, one at each possible value of x.
For example, when x=30, the graph predicts 49.2 for the conditional mean of y and shows that
the confidence interval for the mean of y at x=30 is [42.4, 56.0].
The confidence intervals for x=20 and x=40 are wider, which indicates that there is more uncertainty in the prediction when x is extreme than when x is near the center of the data. (The confidence interval of the conditional mean is sometimes called the confidence interval of the prediction. No matter what you call it, the intervals in this article are for the MEAN, not for individual responses.)
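To make the "wider at the extremes" observation concrete, here is a short Python sketch (an illustration using the article's ten data points; the helper name half_width is mine) that computes the half-width of the 95% confidence band for the mean at several x values:

```python
import numpy as np
from scipy.stats import t as t_dist

x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
y = np.array([45, 40, 32, 47, 55, 60, 42, 67, 50, 54], dtype=float)
n, dfe = len(x), len(x) - 2
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / dfe
t_a = t_dist.ppf(0.975, dfe)                     # 95% -> t quantile, n-2 df

def half_width(x0):
    """Half-width of the 95% confidence band for the mean at x0."""
    h = 1 / n + (x0 - x.mean()) ** 2 / Sxx       # leverage grows away from x-bar
    return t_a * np.sqrt(mse * h)

for x0 in (20.0, 30.0, 40.0):
    print(x0, round(float(half_width(x0)), 2))
# the band is narrowest at the mean of x (x=30) and widens toward the extremes
```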
Statistics have uncertainty because they are based on a random sample from the population. If you were to choose a different random sample of (x,y) values from the same population, you would get a different regression line. If you choose a third random sample, you would get yet another regression line. In this article, I discuss some intuitive ways to think about the confidence band without using any formulas. I show that the confidence band is related to repeatedly choosing a random sample and fitting many regression lines. I show a couple of alternative ways to visualize uncertainty in the predicted values.
You can use a simulation to demonstrate the relationship between a confidence interval and repeated random sampling.
The best way is to use a (known) model to repeatedly generate the random sample. However, I am going to use a slightly different approach, which is discussed in Section 13.3 of Simulating Data with SAS, Wicklin (2013). When you deal with real data, you don’t know the real underlying relationship between the response and explanatory variables, but you can always simulate from the fitted regression model. This means that you fit the data and then use the parameter estimates as if they were the true values of the parameters.
Let’s see how this works. The following DATA step defines a toy example that has 10 observations. The call to PROC REG fits a linear model to the data.
data Have;
input x y;
datalines;
21 45
24 40
26 32
27 47
30 55
30 60
33 42
34 67
36 50
39 54
;

proc reg data=Have outest=PE alpha=0.05 plots(only)=fitplot(nocli);
   model y = x;
quit;

proc print data=PE noobs label;
   label Intercept="b0" x="b1" _RMSE_="s";
   var Intercept x _RMSE_;
run;
The REG procedure creates a fit plot that is similar to the one at the top of this article.
The model is Y = b0 + b1*x + eps, where eps ~ N(0, s). The call to PROC PRINT shows the three parameter estimates: the intercept term (b0), the coefficient of the linear term (b1), and the “root mean square error,” which is the estimate of the standard deviation of the error term (s).
The conditional mean is assumed to be normally distributed inside the confidence band. For a particular value of x, the probability of the conditional mean is high near the regression line and
lower near the upper and lower limits. You can visualize the distribution by overlaying confidence bands for various confidence levels. If you make the bands semi-transparent, the union of the bands is darkest near the regression line and lightest far away from the line.
In SAS, you can use the REG statement (with the CLM option) in PROC SGPLOT to overlay several semi-transparent confidence bands that have different levels of confidence. The following graph overlays the 95%, 90%, 80%, 70% and 50% confidence bands. To save typing, I use a SAS macro to create each REG statement.
%macro CLM(alpha=, thickness=);
   reg x=x y=y / nomarkers alpha=&alpha
       lineattrs=GraphFit(thickness=&thickness)
       clm clmattrs=(clmfillattrs=(color=gray transparency=0.8));
%mend;

title "Graded Confidence Band Plot";
title2 "alpha = 0.05, 0.1, 0.2, 0.3, 0.5";
proc sgplot data=Have noautolegend;
   %CLM(alpha=0.05, thickness=0);
   %CLM(alpha=0.10, thickness=0);
   %CLM(alpha=0.20, thickness=0);
   %CLM(alpha=0.30, thickness=0);
   %CLM(alpha=0.50, thickness=2);
   scatter x=x y=y / markerattrs=(symbol=CircleFilled size=12);
run;
Claus O. Wilke calls this
a “graded confidence band plot” in his book
Fundamentals of Data Visualization (Section 16.3).
It communicates that the location of the conditional mean is uncertain, but it is most likely to be found near the regression line.
The confidence band visualizes the uncertainty in the prediction due to the fact that the data are a random sample from some unknown population.
Although we don’t know the underlying true relationship between the (x,y) pairs, we can simulate random samples from the fitted linear model by using the parameter estimates as if they were the true parameters. The following SAS DATA step simulates 500 random samples from the regression model. The x values are fixed. The y values are
simulated according to Y = b0 + b1*x + eps, where eps ~ N(0, s).
/* simulate many times from the model, using parameter estimates as the true model */
%let numSim = 500;
data RegSim;
call streaminit(1234);
b0 = 21.1014;  b1 = 0.93662;  RMSE = 9.33046;
set Have;                       /* for each X value in the original data */
do SampleID = 1 to &numSim;
   /* simulate Y = b0 + b1*x + eps, eps ~ N(0,RMSE) */
   YSim = b0 + b1*x + rand("Normal", 0, RMSE);
   output;
end;
run;

/* use BY-group processing to fit a regression model to each simulated sample */
proc sort data=RegSim;  by SampleID;  run;

proc reg data=RegSim outest=PESim alpha=0.05 noprint;
   by SampleID;
   model YSim = x;
quit;
The output from PROC REG is 500 pairs of parameter estimates (b0 and b1). Each estimate represents a regression line for a random sample from the same linear model. Let’s see what happens if you overlay all those regression lines on a single plot:
/* two points determine a line, so score regression on [min(x), max(x)] */
data Viz;
set PESim(rename=(Intercept=b0 x=b1));
/* min(x)=21; max(x)=39. Evaluate fit b0 + b1*x for each simulated sample */
xx = 21;  yy = b0 + b1*xx;  output;
xx = 39;  yy = b0 + b1*xx;  output;
keep SampleID xx yy;
run;

/* overlay the fits on the original data */
data Combine;
set Have Viz;
run;

title "Overlay of &numSim Regression Lines";
title2 "Y = b0 + b1*x + eps, eps ~ N(0,RMSE)";
ods graphics / antialias=off GROUPMAX=10000;
proc sgplot data=Combine noautolegend;
   reg x=x y=y / nomarkers alpha=0.05 clm
       clmattrs=(clmfillattrs=(transparency=0.5));
   series x=xx y=yy / group=SampleId
       lineattrs=(color=gray pattern=solid) transparency=0.9;
   scatter x=x y=y;
   reg x=x y=y / nomarkers;
run;
Notice that most of the regression lines are in the interior of the confidence band. In fact, if you fix a value of x (such as x=30) and evaluate all the regression lines at x, then about 95% of the conditional means will be contained within the original confidence band.
This is the intuitive meaning of the confidence band. If you imagine obtaining a different random sample from the same population and fitting a regression line, the conditional mean will be contained in the band for 95% of the random samples.
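The same coverage check can be replicated outside of SAS. Here is a sketch in Python (my translation, not the article's code: seeded RNG, least-squares refits via NumPy; the exact coverage fraction depends on the seed and is only approximately the nominal level):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(1234)
x = np.array([21, 24, 26, 27, 30, 30, 33, 34, 36, 39], dtype=float)
b0, b1, rmse = 21.1014, 0.93662, 9.33046     # estimates from the original fit
n, dfe = len(x), len(x) - 2

# 95% confidence limits for the mean at x0=30, from the original fit
x0 = 30.0
h = 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
sem = np.sqrt(rmse**2 * h)
t_a = t_dist.ppf(0.975, dfe)
lower, upper = (b0 + b1 * x0) - t_a * sem, (b0 + b1 * x0) + t_a * sem

# simulate 500 samples from the fitted model; refit; evaluate each fit at x0
X = np.column_stack([np.ones_like(x), x])
inside = 0
for _ in range(500):
    ysim = b0 + b1 * x + rng.normal(0, rmse, size=n)
    bsim, *_ = np.linalg.lstsq(X, ysim, rcond=None)
    pred = bsim[0] + bsim[1] * x0
    inside += (lower <= pred <= upper)
print(inside / 500)   # a large fraction of simulated conditional means fall inside
```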
As I said earlier, you should think of the confidence bands as a bunch of vertical confidence intervals. The distribution of the conditional mean within each vertical slice is assumed to be normally distributed with mean b0 + b1*x.
The following histograms show the distribution of the predicted means for x=21 and x=39 (respectively) over all 500 regression lines. You should compare the 2.5th and 97.5th percentiles for these histograms
to the vertical limits (at the same x values) of the confidence band in the graph at the top of this article.
title "Distribution of Conditional Means";
proc sgpanel data=Combine(where=(xx^=.));
   label yy="y" xx="x0";
   panelby xx / layout=rowlattice columns=1;
   histogram yy;
run;
In summary, this article visualizes a few facts about the confidence band for a regression line. The confidence band conveys uncertainty about the location of the conditional mean.
The visualization can be improved by overlaying several semi-transparent bands, a graph that
Wilke calls a “graded confidence band plot.”
You can use simulation to relate the confidence band to sampling variability.
If you simulate many random samples and overlay the regression lines, they form a confidence band. The limits of the confidence band are related to the quantiles of the conditional distribution of the means.
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post.
This is a tip for those who use the Metacoda Security Plug-ins Batch Interface for scheduled automation of SAS® metadata security reporting, testing and identity synchronization. You will find this tip useful if you are using the same configuration values in multiple batch configuration or Identity Sync Profile (IDSP) XML files and would like a … Continue reading “Metacoda Plug-ins Tip: Replaceable Tokens”
This tip was prompted by a question from a Metacoda customer who, having seen a SAS® metadata object id in a log file (e.g. A53GCFFH.AP00001D), wanted to quickly identify what SAS metadata object that id referred to. As a Metacoda Plug-ins user they have access to the Metacoda Metadata Explorer plug-in which supports simple text … Continue reading “Metacoda Plug-ins Tip: Locate a Metadata Object By Id”