Using SAS to estimate the link between ozone and asthma (and a neat trick) was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Diseases, 9th edition) codes along with some patient demographic data.
I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and ARIMA) to assess the association between ozone levels and the number of admissions to emergency departments diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating if the date fell on a weekend. (On weekdays, patients were more likely to visit their personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.
It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!
In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here’s the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter’s advice to write clever code that no one can understand so they can’t fire you!)
The following program demonstrates the =: operator:
*An interesting trick to read ICD codes;
data ICD_9;
   input ICD : $7. @@;
   if ICD =: "493" then output;
datalines;
493 770.6 999 493.9 493.90 493.100
;

title "Listing of All Asthma Codes";
proc print data=ICD_9 noobs;
run;
Normally, when SAS compares two strings of different length, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator truncates the longer string to the length of the shorter string before making the comparison.
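For readers outside of SAS, the effect of =: can be sketched in Python (an illustrative analogy, not SAS code): truncate both operands to the length of the shorter one before comparing, which for a literal prefix behaves like startswith().

```python
# Illustrative Python sketch of SAS's =: comparison operator (not SAS itself).
# SAS's =: truncates the longer operand to the length of the shorter one
# before comparing; for a literal prefix this is equivalent to a prefix test.
def eq_colon(a, b):
    """Compare two strings after truncating the longer to the shorter's length."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]

codes = ["493", "770.6", "999", "493.9", "493.90", "493.100"]
asthma = [c for c in codes if eq_colon(c, "493")]
print(asthma)  # keeps exactly the four asthma-related codes
```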
The usual reason to write a SAS blog is to teach some aspect of SAS programming or to just point out something interesting about SAS. While that is usually my motivation, I have an ulterior motive in writing this blog – I want to plug a new book I have just published on Amazon. It’s called 10-8 Awaiting Crew: Memories of a Volunteer EMT. One of the chapters discusses the difficulty of conducting statistical studies in pre-hospital settings. This was my first attempt at a non-technical book. I hope you take a look. (Enter “10-8 awaiting crew” or “Ron Cody” in Amazon search to find the book.) Drop me an email with your thoughts at ron.cody@gmail.com.
The post Minimizing the Kullback–Leibler divergence appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. An application in machine learning
is to measure how distributions in a parametric family differ from a data distribution.
This article shows that if you minimize the Kullback–Leibler divergence over a set of parameters, you can find a distribution that is similar to the data distribution. This article focuses on discrete distributions.
As explained in a previous article,
the Kullback–Leibler (K-L) divergence between two discrete probability distributions is the sum
KL(f, g) = Σ_{x} f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.)
For this sum to be well defined, the distribution g must be strictly positive on the support of f.
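As a small numeric illustration (a Python sketch, not from the original post), the divergence between f = {0.5, 0.5} and g = {0.9, 0.1} is about 0.51, and it is 0 when the two densities coincide:

```python
import math

# K-L divergence for discrete densities f and g (sum over the support of f)
def kl_div(f, g):
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

f = [0.5, 0.5]   # reference distribution
g = [0.9, 0.1]   # model distribution
print(round(kl_div(f, g), 4))   # positive because f and g differ
print(kl_div(f, f))             # zero when the distributions are equal
```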
One application of the K-L divergence is to measure the similarity between a hypothetical model distribution defined by g and an empirical distribution defined by f.
As an example, suppose a call center averages about 10 calls per hour. An analyst wants to investigate whether the number of calls per hour can be modeled by using a Poisson(λ=10) distribution. To test the hypothesis, the analyst records the number of calls for each hour during 100 hours of operation. The following SAS DATA step reads the data. The call to PROC SGPLOT creates a histogram that shows the distribution of the 100 counts:
data Calls;
   input N @@;
   label N = "Calls per hour";
datalines;
11 19 11 13 13  8 11  9  9 14 10 13  8 15  7  9  6 12  7 13
12 19  6 12 11 12 11  9 15  4  7 12 12 10 10 16 18 13 13  8
13 10  9  9 12 13 12  8 13  9  7  9 10  9  4 10 12  5  4 12
 8 12 14 16 11  7 18  8 10 13 12  5 11 12 16  9 11  8 11  7
11 15  8  7 12 16  9 18  9  8 10  7 11 12 13 15  6 10 10  7
;

title "Number of Calls per Hour";
title2 "Data for 100 Hours";
proc sgplot data=Calls;
   histogram N / scale=proportion binwidth=1;
   xaxis values=(4 to 19) valueshint;
run;
The graph should really be a bar chart, but I used a histogram with BINWIDTH=1 so that the graph reveals that the value 17 does not appear in the data. Furthermore, the values 0, 1, 2, and 3 do not appear in the data.
I used the SCALE=PROPORTION option to plot the data distribution on the density scale.
The call center wants to model these data by using a Poisson distribution. The traditional statistical approach is to use maximum likelihood estimation (MLE) to find the parameter, λ, in the Poisson family so that the Poisson(λ) distribution is the best fit to the data.
However, let’s see how using the Kullback–Leibler divergence leads to a similar result.
Let’s compute the K-L divergence between the empirical frequency distribution and a Poisson(10) distribution.
The empirical distribution is the reference distribution; the Poisson(10) distribution is the model.
The Poisson distribution has a nonzero probability for all x ≥ 0, but
recall that the K-L divergence is computed by summing over the observed values of the empirical distribution, which is the set {4, 5, …, 19}, excluding the value 17.
proc iml;
/* read the data, which is the reference distribution, f */
use Calls;  read all var "N" into Obs;  close;
call Tabulate(Levels, Freq, Obs);   /* find unique values and frequencies */
Proportion = Freq / nrow(Obs);      /* empirical density of frequency of calls (f) */

/* create the model distribution: Poisson(10) */
lambda = 10;
poisPDF = pdf("Poisson", Levels, lambda);   /* Poisson model on support(f) */

/* load K-L divergence module or include the definition from:
   https://blogs.sas.com/content/iml/2020/05/26/kullback-leibler-divergence-discrete.html */
load module=KLDiv;
KL = KLDiv(Proportion, poisPDF);
print KL[format=best5.];
Notice that although the Poisson distribution has infinite support, you only need to evaluate the Poisson density on the (finite) support of the empirical density.
The previous section shows how to compute the Kullback–Leibler divergence between an empirical density and a Poisson(10) distribution. You can repeat that computation for a whole range of λ values and plot the divergence versus the Poisson parameter. The following statements compute the K-L divergence for λ on [4, 16] and plot the result. The minimum value of the K-L divergence is achieved near λ = 10.7. At that value of λ, the K-L divergence between the data and the Poisson(10.7) distribution is 0.105.
/* Plot the K-L div versus lambda for a sequence of Poisson(lambda) models */
lambda = do(4, 16, 0.1);
KL = j(1, ncol(lambda), .);
do i = 1 to ncol(lambda);
   poisPDF = pdf("Poisson", Levels, lambda[i]);
   KL[i] = KLDiv(Proportion, poisPDF);
end;

title "K-L Divergence from Poisson(lambda)";
call series(lambda, KL) grid={x y} xvalues=4:16 label={'x' 'K-L Divergence'};
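The same scan is easy to replicate outside of SAS. The following Python sketch (the data values are copied from the DATA step above) evaluates the Poisson PMF directly and scans λ over [4, 16]; on this grid the minimizer lands at λ = 10.7, in agreement with the SAS result:

```python
import math
from collections import Counter

# calls-per-hour data for 100 hours, copied from the DATA step in the post
calls = [11,19,11,13,13,8,11,9,9,14,10,13,8,15,7,9,6,12,7,13,
         12,19,6,12,11,12,11,9,15,4,7,12,12,10,10,16,18,13,13,8,
         13,10,9,9,12,13,12,8,13,9,7,9,10,9,4,10,12,5,4,12,
         8,12,14,16,11,7,18,8,10,13,12,5,11,12,16,9,11,8,11,7,
         11,15,8,7,12,16,9,18,9,8,10,7,11,12,13,15,6,10,10,7]
n = len(calls)
freq = Counter(calls)   # empirical frequency of each observed count

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def kl_to_poisson(lam):
    # sum over the observed support only, as in the post
    return sum((c / n) * math.log((c / n) / poisson_pmf(k, lam))
               for k, c in freq.items())

# scan lambda on [4, 16] in steps of 0.1 and find the minimizer
lams = [4 + 0.1 * i for i in range(121)]
best = min(lams, key=kl_to_poisson)
print(round(best, 1), round(kl_to_poisson(best), 3))
```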
The graph shows the K-L divergence for a sequence of Poisson(λ) models.
The Poisson(10.7) model has the smallest divergence from the data distribution; therefore, it is the most similar to the data among the Poisson(λ) distributions that were considered.
You can use a numerical optimization technique in SAS/IML if you want to find a more accurate value that minimizes the K-L divergences.
The following graph overlays the PMF for the Poisson(10.7) distribution on the empirical distribution for the number of calls.
You might wonder how minimizing the K-L divergence relates to the traditional MLE method for fitting a Poisson model to the data. The following call to PROC GENMOD shows that the MLE estimate is λ = 10.71:
proc genmod data=Calls;
   model N = / dist=poisson;
   output out=PoissonFit p=lambda;
run;

proc print data=PoissonFit(obs=1) noobs;
   var lambda;
run;
Is this a coincidence? No. It turns out that there is a connection between the K-L divergence and the negative log-likelihood. Minimizing the K-L divergence is equivalent to minimizing the negative log-likelihood, which is equivalent to maximizing the likelihood of the data under the Poisson model.
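To see the equivalence numerically, note that KL(f, g_λ) = Σ f log f − Σ f log g_λ, and the second term equals the average log-likelihood of the sample, so their sum is constant in λ. The following Python sketch (with a small made-up sample, not the call-center data) verifies this and shows that the grid minimizer of the K-L divergence equals the grid maximizer of the likelihood:

```python
import math
from collections import Counter

sample = [3, 5, 4, 4, 6, 5, 5, 7, 4, 5]   # small made-up sample
n = len(sample)
freq = Counter(sample)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def kl(lam):
    """K-L divergence from the empirical density to Poisson(lam)."""
    return sum((c / n) * math.log((c / n) / poisson_pmf(k, lam))
               for k, c in freq.items())

def avg_loglik(lam):
    """Average log-likelihood of the sample under Poisson(lam)."""
    return sum(math.log(poisson_pmf(k, lam)) for k in sample) / n

# kl(lam) + avg_loglik(lam) equals sum(f*log(f)), a constant in lam, so
# minimizing the K-L divergence and maximizing the likelihood agree.
grid = [2 + 0.1 * i for i in range(81)]
print(min(grid, key=kl), max(grid, key=avg_loglik))
```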
This article shows how to compute the
Kullback–Leibler divergence between an empirical distribution and a Poisson distribution. The empirical distribution was the observed number of calls per hour for 100 hours in a call center.
You can compute the K-L divergence for many parameter values (or use numerical optimization) to find the parameter that minimizes the K-L divergence. This parameter value corresponds to the Poisson distribution that is most similar to the data. It turns out that minimizing the K-L divergence is equivalent to maximizing the likelihood function. Although the parameter estimates are the same, the traditional MLE estimate comes with additional tools for statistical inference, such as estimates for confidence intervals and standard errors.
Multi-purpose macro function for getting information about data sets was published on SAS Users.
Did you know you could have a single universal function that can replace all the functions in the world? All those sin(x), log(x), … whatever(x) can be replaced by a single super function f(x). Don’t believe me? Just make the function name – sin, log, … whatever – another argument to that all-purpose function f, like this: f(x, sin), f(x, log), … f(x, whatever). Now we only have to deal with a single function instead of many, and its second argument defines what transformation is applied to the first argument in order to arrive at this almighty function’s value.
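For illustration only, the "universal function" idea can be written in a couple of lines of, say, Python (the names here are made up, not SAS):

```python
import math

# A toy "universal" function: the second argument names the transformation
# to apply to the first argument.
def f(x, op):
    ops = {"sin": math.sin, "log": math.log, "sqrt": math.sqrt}
    return ops[op](x)

print(f(1.0, "sin"), f(1.0, "log"), f(4.0, "sqrt"))
```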
Last time I counted there were more than 600 SAS functions, and that is excluding call routines and macro functions. But even that huge number grossly under-represents the actual number of functions available in SAS. That is because there are some functions that are built like the universal multi-purpose super function described above. For example, look at the following functions:
finance() function represents several dozen various financial functions;
finfo() function represents multiple functions returning various information items about files (file size, date created, date modified, access permission, etc.);
dinfo() function returns similar information items about directories;
attrn() function returns numeric attributes of a data set (number of observations, number of variables, etc.)
attrc() function returns character attributes of a data set (engine name, encoding name, character set, etc.)
Each of these functions represents not a single function, but a group of functions, and one of their arguments stipulates specific functionality (an information item or an attribute) that is being requested. You can think of this argument as a function modifier.
%sysfunc() is a super macro function that brings a wealth of SAS functions into the SAS macro language. With very few exceptions, most SAS functions are available in the SAS macro language thanks to %sysfunc().
Moreover, we can build our own user-defined macro functions using SAS-supplied macro functions (such as %eval, %length, %quote, %scan, etc.), as well as hundreds of the SAS non-macro functions wrapped into the %sysfunc() super macro function.
Armed with such a powerful arsenal, let’s build a multi-purpose macro function that taps into the data tables’ metadata and extracts various information items about those tables.
Let’s make this macro function return any of the following most frequently used values:
- NOBS – the number of observations in the data set
- NVARS – the number of variables in the data set
- VARLIST – a space-separated list of the data set’s variable names
- VARLISTC – a comma-separated list of the data set’s variable names
Obviously, we can create many more of these information items and attributes, but here I am just showing how to do this so that you can create your own list depending on your needs.
In my earlier blog post, How to create and use SAS macro functions, we had already built a macro function for getting the number of observations; let’s expand on that.
Here is the SAS Macro code that handles extraction of all four specified metadata items:
%macro dsinfo(dset,info);
/* dset - data set name                             */
/* info - modifier (NOBS, NVARS, VARLIST, VARLISTC) */
   %local dsid result infocaps i;
   %let infocaps = %upcase(&info);
   %let dsid = %sysfunc(open(&dset));
   %if &dsid %then
   %do;
      %if &infocaps=NOBS %then %let result = %sysfunc(attrn(&dsid,nlobs));
      %else %if &infocaps=NVARS %then %let result = %sysfunc(attrn(&dsid,nvars));
      %else %if &infocaps=VARLIST %then
         %do i=1 %to %sysfunc(attrn(&dsid,nvars));
            %let result = &result %sysfunc(varname(&dsid,&i));
         %end;
      %else %if &infocaps=VARLISTC %then
         %do i=1 %to %sysfunc(attrn(&dsid,nvars));
            %if &i eq 1 %then %let result = %sysfunc(varname(&dsid,&i));
            %else %let result = &result,%sysfunc(varname(&dsid,&i));
         %end;
      %let dsid = %sysfunc(close(&dsid));
   %end;
   %else %put %sysfunc(sysmsg());
   &result
%mend dsinfo;
The SAS log will show:
%put NOBS=***%dsinfo(SASHELP.CARS,NOBS)***;
NOBS=***428***
%put NVARS=***%dsinfo(SASHELP.CARS,NVARS)***;
NVARS=***15***
%put VARLIST=***%dsinfo(SASHELP.CARS,VARLIST)***;
VARLIST=***Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length***
%put VARLISTC=***%dsinfo(SASHELP.CARS,VARLISTC)***;
VARLISTC=***Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length***
We used the following statement to make our macro function case-insensitive regarding the info argument:
%let infocaps = %upcase(&info);
Then, depending on the up-cased second argument of our macro function (the modifier), we used the attrn() and varname() functions within %sysfunc() to retrieve and construct our result macro variable.
We stick that result macro variable value, &result, right before the %mend statement so that the value is returned to the calling environment.
While info=VARLIST (space-separated variable list) is useful in DATA steps, info=VARLISTC (comma-separated variable list) is useful in PROC SQL.
Having this %dsinfo macro function at hand, we can use it in multiple programming scenarios. For example:
/* ending SAS session if no observations to process */
%if %dsinfo(SASHELP.CARS,NOBS)=0 %then %do;
   endsas;
%end;

/* further processing */
data MYNEWDATA (keep=%dsinfo(SASHELP.CARS,VARLIST));
   retain %dsinfo(SASHELP.CARS,VARLIST);
   set SASHELP.CARS;
   if _n_=1 then put %dsinfo(SASHELP.CARS,VARLIST);
   /* ... */
run;
Here we first check if there is at least one observation in a data set. If not (0 observations) then we stop the SAS session and don’t do any further processing. Otherwise, when there are some observations to process, we continue.
If SAS code needs multiple calls to the same macro function with the same argument, we can shorten the code by first assigning that macro function’s result to a macro variable and then reference that macro variable instead of repeating macro function invocation. Here is an example:
/* further processing */
%let vlist = %dsinfo(SASHELP.CARS,VARLIST);

data MYNEWDATA (keep=&vlist);
   retain &vlist;
   set SASHELP.CARS;
   if _n_=1 then put &vlist;
   /* ... */
run;
Do you see the benefits of these multi-purpose SAS macro functions? Can you suggest other scenarios of their usage? Please share your thoughts in the comments section below.
The post The Kullback–Leibler divergence between discrete probability distributions appeared first on The DO Loop.
If you have been learning about machine learning or mathematical statistics,
you might have heard about the
Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of dissimilarity between two probability distributions. It measures how much one distribution differs from a reference distribution.
This article explains the Kullback–Leibler divergence and shows how to compute it for discrete probability distributions.
Recall that there are many statistical methods that indicate how much two distributions differ.
For example, a maximum likelihood estimate involves finding parameters for a reference distribution that is similar to the data.
Statistics such as the Kolmogorov-Smirnov statistic are used in goodness-of-fit tests to compare a data distribution to a reference distribution.
Let f and g be probability mass functions that have the same domain.
The Kullback–Leibler (K-L) divergence is the sum
KL(f, g) = Σ_{x} f(x) log( f(x)/g(x) )
where the sum is over the set of x values for which f(x) > 0. (The set {x | f(x) > 0} is called the support of f.)
The K-L divergence measures the similarity between the distribution defined by g and the reference distribution defined by f.
For this sum to be well defined, the distribution g must be strictly positive on the support of f. That is, the Kullback–Leibler divergence is defined only when g(x) > 0 for all x in the support of f.
Some researchers prefer the argument to the log function to have f(x) in the denominator. Flipping the ratio introduces a negative sign, so an equivalent formula is
KL(f, g) = –Σ_{x} f(x) log( g(x)/f(x) )
Notice that if the two density functions (f and g) are the same, then the logarithm of the ratio is 0. Therefore, the K-L divergence is zero when the two distributions are equal.
The K-L divergence is positive if the distributions are different.
The Kullback–Leibler divergence was developed as a tool for information theory, but it is frequently used in machine learning. The divergence has several interpretations. In information theory, it
measures the information loss when f is approximated by g. In statistics and machine learning, f is often the observed distribution and g is a model.
As an example, suppose you roll a six-sided die 100 times and record the proportion of 1s, 2s, 3s, etc. You might want to compare this empirical distribution to the uniform distribution, which is the distribution of a fair die for which the probability of each face appearing is 1/6.
The following SAS/IML statements compute the Kullback–Leibler (K-L) divergence between the empirical density and the uniform density:
proc iml;
/* K-L divergence is defined for positive discrete densities */
/* Face:  1   2   3   4   5   6 */
f = {20, 14, 19, 19, 12, 16} / 100;  /* empirical density; 100 rolls of die */
g = { 1,  1,  1,  1,  1,  1} / 6;    /* uniform density */
KL_fg = -sum( f#log(g/f) );          /* K-L divergence using natural log */
print KL_fg;
The K-L divergence is very small, which indicates that the two distributions are similar.
Although this example compares an empirical distribution to a theoretical distribution, you need to be aware of the limitations of the K-L divergence. The K-L divergence compares two distributions and assumes that the density functions are exact. The K-L divergence does not account for the size of the sample in the previous example. The computation is the same regardless of whether the first density is based on 100 rolls or a million rolls. Thus, the K-L divergence is not a replacement for traditional statistical goodness-of-fit tests.
Let’s compare a different distribution to the uniform distribution. Let h(x)=9/30 if x=1,2,3 and let h(x)=1/30 if x=4,5,6. The following statements compute the K-L divergence between h and g and between g and h.
In the first computation, the step distribution (h) is the reference distribution.
In the second computation, the uniform distribution is the reference distribution.
h = { 9, 9, 9, 1, 1, 1} / 30;   /* step density */
g = { 1, 1, 1, 1, 1, 1} / 6;    /* uniform density */
KL_hg = -sum( h#log(g/h) );     /* h is reference distribution */
print KL_hg;

/* show that K-L div is not symmetric */
KL_gh = -sum( g#log(h/g) );     /* g is reference distribution */
print KL_gh;
First, notice that the numbers are larger than for the example in the previous section. The f density function is approximately constant, whereas h is not. So the distribution for f is more similar to a uniform distribution than the step distribution is.
Second, notice that the K-L divergence is not symmetric. In the first computation (KL_hg), the reference distribution is h, which means that the log terms are weighted by the values of h. The weights from h give a lot of weight to the first three categories (1,2,3) and very little weight to the last three categories (4,5,6). In contrast, g is the reference distribution
for the second computation (KL_gh). Because g is the uniform density, the log terms are weighted equally in the second computation.
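As a check on the asymmetry (an illustrative Python sketch, not part of the original SAS/IML program), computing the divergence in both directions for the step and uniform densities gives clearly different values, about 0.368 versus 0.511:

```python
import math

def kl_div(f, g):
    # K-L divergence with f as the reference distribution
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

h = [9/30] * 3 + [1/30] * 3   # step density
g = [1/6] * 6                 # uniform density
print(round(kl_div(h, g), 3))   # h is the reference distribution
print(round(kl_div(g, h), 3))   # g is the reference distribution
```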
The fact that the summation is over the support of f means that you can compute the K-L divergence between an empirical distribution (which always has finite support) and a model that has infinite support.
However, you cannot use just any distribution for g. Mathematically, f must be absolutely continuous with respect to g. (Another expression is that f is dominated by g.) This means that for every value of x such that f(x)>0, it is also true that g(x)>0.
It is convenient to write a function, KLDiv, that computes the Kullback–Leibler divergence for vectors that give the density for two discrete densities. The call KLDiv(f, g) should compute the weighted sum of log( g(x)/f(x) ), where x ranges over elements of the support of f.
By default, the function verifies that g > 0 on the support of f and returns a missing value if it isn’t. The following SAS/IML function implements the Kullback–Leibler divergence.
/* The Kullback–Leibler divergence between two discrete densities f and g.
   The f distribution is the reference distribution, which means that
   the sum is probability-weighted by f.
   If f(x0)>0 at some x0, the model must allow it. You cannot have g(x0)=0. */
start KLDiv(f, g, validate=1);
   if validate then do;       /* if g might be zero, validate */
      idx = loc(g<=0);        /* find locations where g = 0 */
      if ncol(idx) > 0 then do;
         if any(f[idx]>0) then do;   /* g is not a good model for f */
            *print "ERROR: g(x)=0 when f(x)>0";
            return( . );
         end;
      end;
   end;
   if any(f<=0) then do;      /* f = 0 somewhere */
      idx = loc(f>0);         /* support of f */
      fp = f[idx];            /* restrict f and g to support of f */
      return -sum( fp#log(g[idx]/fp) );   /* sum over support of f */
   end;
   /* else, f > 0 everywhere */
   return -sum( f#log(g/f) );  /* K-L divergence using natural logarithm */
finish;

/* test the KLDiv function */
f = {0, 0.10, 0.40, 0.40, 0.1};
g = {0, 0.60, 0.25, 0.15, 0};
KL_pq = KLDiv(f, g);   /* g is not a valid model for f; K-L div not defined */
KL_qp = KLDiv(g, f);   /* f is valid model for g. Sum is over support of g */
print KL_pq KL_qp;
The first call returns a missing value because the sum over the support of f encounters the invalid expression log(0) as the fifth term of the sum. The density g cannot be a model for f because g(5)=0 (no 5s are permitted) whereas f(5)>0 (5s were observed).
The second call returns a positive value because the sum over the support of g is valid.
In this case, f says that 5s are permitted, but g says that no 5s were observed.
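The same validation logic can be mirrored in a short Python sketch (a hypothetical helper, not the SAS/IML module): it returns None, playing the role of SAS's missing value, when g vanishes somewhere on the support of f.

```python
import math

def kl_div_checked(f, g):
    """K-L divergence KL(f, g); None when g is not a valid model for f."""
    if any(fx > 0 and gx <= 0 for fx, gx in zip(f, g)):
        return None    # f is not absolutely continuous with respect to g
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0)

f = [0, 0.10, 0.40, 0.40, 0.1]
g = [0, 0.60, 0.25, 0.15, 0]
print(kl_div_checked(f, g))   # None: g(5)=0 but f(5)>0
print(kl_div_checked(g, f))   # defined: sum over the support of g
```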
This article explains the Kullback–Leibler divergence for discrete distributions. A simple example shows that the K-L divergence is not symmetric. This is explained by understanding that the K-L divergence involves a probability-weighted sum where the weights come from the first argument (the reference distribution).
Lastly, the article gives an example of implementing the Kullback–Leibler divergence in a matrix-vector language such as SAS/IML. This article focused on discrete distributions.
The next article shows how the K-L divergence changes as a function of the parameters in a model.
The post Bilinear interpolation in SAS appeared first on The DO Loop.
This article shows how to perform two-dimensional bilinear interpolation in SAS by using a SAS/IML function. It is assumed that you have observed the values of a response variable on a regular grid of locations.
A previous article showed how to interpolate inside one rectangular cell.
When you have a grid of cells and a point (x, y) at which to interpolate, the first step is to efficiently locate which cell contains (x, y). You can then use the values at the corners of the cell to interpolate at (x, y), as shown in the previous article.
For example, the adjacent graph shows the bilinear interpolation function that is defined by 12 points on a 4 x 3 grid.
You can download the SAS program that creates the analyses and graphs in this article.
For two-dimensional linear interpolation of points, the following sets of numbers are the inputs for bilinear interpolation:
- xGrd: a vector of Nx increasing values along the x axis
- yGrd: a vector of Ny increasing values along the y axis
- Z: an Ny x Nx matrix of response values, where Z[i,j] is the observed value at (xGrd[j], yGrd[i])
- a set of k locations (s_{i}, t_{j}) at which to interpolate
The goal of interpolation is to estimate the response at each location (s_{i}, t_{j}) based on the values at the corner of the cell that contains the location.
I have written a SAS/IML function called BilinearInterp, which you can freely download and use. The following statements indicate how to define and store the bilinearInterp function:
proc iml;
/* xGrd : vector of Nx points along x axis
   yGrd : vector of Ny points along y axis
   z    : Ny x Nx matrix. The value Z[i,j] is the value at xGrd[j] and yGrd[i]
   t    : k x 2 matrix of points at which to interpolate
   The function returns a k x 1 vector, which is the bilinear
   interpolation at each row of t. */
start bilinearInterp(xgrd, ygrd, z, t);
   /* download function from
      https://github.com/sascommunities/the-do-loop-blog/blob/master/interpolation/bilinearInterp.sas */
finish;
store module=bilinearInterp;   /* store the module so it can be loaded later */
QUIT;
You only need to store the module one time. You then load the module in any SAS/IML program that needs to use it. You can learn more about storing and loading SAS/IML modules.
The fitting data for bilinear interpolation consists of the grid of points (Z) and the X and Y locations of each grid point.
The X and Y values do not need to be evenly spaced, but they can be.
Often the fitting data are stored in “long form” as triplets of (X, Y, Z) values.
If so, the first step is to read the triplets and convert them into a matrix of Z values where the columns are associated with the X values and the rows are associated with the Y values.
For example, the following example data set specifies a response variable (Z) on a 4 x 3 grid. The values of the grid in the Y direction are {0, 1, 1.5, 3} and the values in the X direction are {1, 2, 5}.
As explained in the previous article, the data must be sorted by Y and then by X. The following statements read the data into a SAS/IML matrix and extract the grid points in the X and Y direction:
data Have;
input x y z;
datalines;
1 0   6
2 0   7
5 0   5
1 1   9
2 1   9
5 1   7
1 1.5 10
2 1.5 9
5 1.5 6
1 3   12
2 3   9
5 3   8
;

/* If the data are not already sorted, sort by Y and X */
proc sort data=Have;  by Y X;  run;

proc iml;
use Have;  read all var {x y z};  close;

xGrd = unique(x);   /* find grid points */
yGrd = unique(y);
z = shape(z, ncol(yGrd), ncol(xGrd));   /* data must be sorted by y, then x */
print z[r=(char(yGrd)) c=(char(xGrd))];
Again, note that the columns of Z represent the X values and the rows represent the Y values.
The xGrd and yGrd vectors contain the grid points along the horizontal and vertical dimensions of the grid, respectively.
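For comparison, the long-form-to-matrix step (sort by Y then X, find the unique grid values, reshape row by row) can be sketched outside of SAS, for example in Python:

```python
# Reshape (x, y, z) triplets, sorted by y then x, into a row-per-y matrix.
triplets = [(1,0,6),(2,0,7),(5,0,5),(1,1,9),(2,1,9),(5,1,7),
            (1,1.5,10),(2,1.5,9),(5,1.5,6),(1,3,12),(2,3,9),(5,3,8)]
triplets.sort(key=lambda t: (t[1], t[0]))     # sort by y, then x
xgrd = sorted({t[0] for t in triplets})       # unique x values
ygrd = sorted({t[1] for t in triplets})       # unique y values
zvals = [t[2] for t in triplets]
nx = len(xgrd)
Z = [zvals[i*nx:(i+1)*nx] for i in range(len(ygrd))]  # rows = y, cols = x
print(Z)
```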
The bilinearInterp function assumes that the points at which to interpolate are stored in a k x 2 matrix. Each row is an (x,y) location at which to interpolate from the fitting data.
If an (x,y) location is outside of the fitting data, then the bilinearInterp function returns a missing value.
The scoring locations do not need to be sorted.
The following example specifies six valid points and two invalid interpolation points:
t = {0   0,      /* not in data range */
     1   1,
     1   2,
     2.5 2,
     4   0.5,
     4   2,
     5   3,
     6   3};     /* not in data range */

/* LOAD the previously stored function */
load module=bilinearInterp;
F = bilinearInterp(xGrd, yGrd, z, t);
print t[c={'x' 'y'}] F[format=Best4.];
The grid is defined on the rectangle [1,5] x [0,3].
The first and last rows of t are outside of the rectangle, so the interpolation returns missing values at those locations. The locations (1,1) and (5,3) are grid locations (corners of a cell), and the interpolation returns the correct response values. The other locations are in one of the grid cells. The interpolation is found by scaling the appropriate rectangle onto the unit square and applying the bilinear interpolation formula, as shown in the previous article.
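As a cross-check, the whole scoring procedure (locate the cell, map it to the unit square, apply the bilinear formula) can be sketched in Python against the same 4 x 3 fitting data. This is a hypothetical re-implementation for illustration, not the bilinearInterp module:

```python
from bisect import bisect_right

xgrd = [1, 2, 5]
ygrd = [0, 1, 1.5, 3]
Z = [[6, 7, 5],       # rows correspond to ygrd, columns to xgrd
     [9, 9, 7],
     [10, 9, 6],
     [12, 9, 8]]

def bilinear(x, y):
    """Bilinear interpolation on the grid; None outside the data range."""
    if not (xgrd[0] <= x <= xgrd[-1] and ygrd[0] <= y <= ygrd[-1]):
        return None
    j = min(bisect_right(xgrd, x), len(xgrd) - 1) - 1   # cell column
    i = min(bisect_right(ygrd, y), len(ygrd) - 1) - 1   # cell row
    u = (x - xgrd[j]) / (xgrd[j+1] - xgrd[j])   # map the cell to the unit square
    v = (y - ygrd[i]) / (ygrd[i+1] - ygrd[i])
    return (Z[i][j]*(1-u)*(1-v) + Z[i][j+1]*u*(1-v)
            + Z[i+1][j]*(1-u)*v + Z[i+1][j+1]*u*v)

print(bilinear(1, 1), bilinear(5, 3), bilinear(0, 0))
```

At the grid corners (1,1) and (5,3) the sketch reproduces the data values 9 and 8, and it returns None for the out-of-range location (0,0), matching the behavior described above.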
You can use the ExpandGrid function to generate a fine grid of locations at which to interpolate.
You can visualize the bilinear interpolation function by using a heat map or by using a surface plot.
The heat map visualization is shown at the top of this article. The colors indicate the interpolated values of the response variable. The markers indicate the observed values of the response variable. The reference lines show the grid that is defined by the fitting data.
The following graph visualizes the interpolation function by using a surface plot. The function is piecewise quadratic. Within each cell, it is linear on lines of constant X and constant Y. The “ridge lines” on the surface correspond to the locations where two adjacent grid cells meet.
In summary, you can perform bilinear interpolation in SAS by using a SAS/IML function. The function uses fitting data (X, Y, and Z locations on a grid) to interpolate.
Inside each cell, the interpolation is quadratic and is linear on lines of constant X and constant Y.
For each location point, the interpolation function first determines what cell the point is in.
It then uses the corners of the cell to interpolate at the point.
You can download the SAS program that performs bilinear interpolation and generates the tables and graphs in this article.
The post What is bilinear interpolation? appeared first on The DO Loop.
I’ve previously written about linear interpolation in one dimension.
Bilinear interpolation is a method for two-dimensional interpolation on a rectangle.
If the value of a function is known at the four corners of a rectangle, an interpolation scheme gives you a way to estimate the function at any point in the rectangle’s interior. Bilinear interpolation is a weighted average of the values at the four corners of the rectangle. For an (x,y) position inside the rectangle, the weights are determined by the distance between the point and the corners. Corners that are closer to the point get more weight. The generic result is a saddle-shaped quadratic function, as shown in the contour plot to the right.
The Wikipedia article on bilinear interpolation provides a lot of formulas, but the article is needlessly complicated. The only important formula is how to interpolate on the unit square [0,1] x [0,1]. To interpolate on any other rectangle, simply map your rectangle onto the unit square and do the interpolation there.
This trick works for bilinear interpolation because the weighted average depends only on the relative position of a point and the corners of the rectangle. Given a rectangle with lower-left corner (x0, y0) and upper right corner (x1, y1), you can map it into the unit square by using the transformation (x, y) → (u, v) where u = (x-x0)/(x1-x0) and v = (y-y0)/(y1-y0).
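As a concrete sketch of that change of variables (in Python, with illustrative names):

```python
def to_unit_square(x, y, x0, y0, x1, y1):
    """Map a point (x, y) in the rectangle [x0,x1] x [y0,y1]
    to coordinates (u, v) in the unit square [0,1] x [0,1]."""
    return (x - x0) / (x1 - x0), (y - y0) / (y1 - y0)
```

The corners of the rectangle map to the corners of the unit square, so the weighted-average formula can be applied in the (u, v) coordinates.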
Suppose that you want to interpolate on the unit square. Assume you know the values of a continuous function at the four corners of the square: z00 = f(0,0), z10 = f(1,0), z01 = f(0,1), and z11 = f(1,1). If (x,y) is any point inside the unit square, the interpolation at that point is the following weighted average of the values at the four corners:
F(x,y) = z00*(1-x)*(1-y) +
z10*x*(1-y) +
z01*(1-x)*y +
z11*x*y
Notice that the interpolant is linear with respect to the values of the corners of the square. It is quadratic with respect to the location of the interpolation point. The interpolant is linear along every horizontal line (lines of constant y) and along every vertical line (lines of constant x).
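The four-term formula is easy to sketch in any language. For example, in Python (an illustrative check of the arithmetic, not SAS code):

```python
def bilinear_unit(z00, z10, z01, z11, x, y):
    """Bilinear interpolant on the unit square:
    a weighted average of the four corner values."""
    return (z00 * (1 - x) * (1 - y) + z10 * x * (1 - y)
            + z01 * (1 - x) * y + z11 * x * y)
```

At each corner, three of the four weights vanish, so the interpolant reproduces the corner value exactly. Along an edge such as y=0, the formula reduces to the linear interpolant z00*(1-x) + z10*x.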
Suppose that the function values at the corners of a unit square are z00 = 0, z10 = 4, z01 = 2, and z11 = 1.
For these values, the bilinear interpolant on the unit square is shown at the top of this article. Notice that the value of the interpolant at the corners agrees with the data values.
Notice also that the interpolant is linear along each edge of the square. It is not easy to see that the interpolant is linear along each horizontal and vertical line, but that is true.
In SAS, you can use the SAS/IML matrix language to define a function that performs bilinear interpolation on the unit square. The function takes two arguments: z, a four-element vector of the values at the corners of the square, and t, a two-column matrix whose rows are the (x,y) points at which to interpolate.
In a matrix language such as SAS/IML, the function can be written in vectorized form without using any loops:
proc iml;
start bilinearInterpSquare(z, t);
   z00 = z[1]; z10 = z[2]; z01 = z[3]; z11 = z[4];
   x = t[,1]; y = t[,2];
   cx = 1-x;       /* cx = complement of x */
   return( (z00*cx + z10*x)#(1-y) + (z01*cx + z11*x)#y );
finish;

/* specify the fitting data */
z = {0 4,    /* bottom: values at (0,0) and (1,0) */
     2 1};   /* top: values at (0,1) and (1,1) */
print z[c={'x=0' 'x=1'} r={'y=0' 'y=1'}];

t = {0.5  0,          /* specify the scoring data */
     0    0.25,
     1    0.66666666,
     0.5  0.75,
     0.75 0.5 };
F = bilinearInterpSquare(z, t);   /* test the bilinear interpolation function */
print t[c={'x' 'y'} format=FRACT.] F;
For points on the edge of the square, you can check a few values in your head. The point (1/2, 0) is the midpoint of the bottom edge, so the interpolant is the average of z00=0 and z10=4, which is 2. Similarly, (0, 1/4) is one-quarter of the way up the left edge, so the interpolant is 0.5, and (1, 2/3) is two-thirds of the way up the right edge, so the interpolant is 2.
The other points are in the interior of the square. The interpolant at those points is a weighted average of the corner values.
Let’s use the bilinearInterpSquare function to evaluate a grid of points.
You can use the ExpandGrid function in SAS/IML to generate a grid of points. When we look at the values of the interpolant on a grid or in a heat map, we want the X values to change for each column and the Y values to change for each row. This means that you should reverse the usual order of the arguments to the ExpandGrid function:
/* for visualization, reverse the arguments to ExpandGrid
   and then swap 1st and 2nd columns of t */
xGrd = do(0, 1, 0.2);
yGrd = do(0, 1, 0.2);
t = ExpandGrid(yGrd, xGrd);   /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];               /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
print Q[r=(char(YGrd,3)) c=(char(xGrd,3)) label="Bilinear Interpolant"];
The table shows the interpolant evaluated at the coordinates of a regular 6 x 6 grid.
The column headers give the coordinate of the X variable and the row headers give the coordinates of the Y variable. You can see that each column and each row is an arithmetic sequence, which shows that the interpolant is linear on vertical and horizontal lines.
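You can confirm the arithmetic-sequence claim numerically. The following Python sketch (the helper re-implements the four-term formula from earlier in the post) evaluates the interpolant with the same corner values (0, 4, 2, 1) on a 6 x 6 grid and tests whether each row and column is an arithmetic sequence:

```python
def bilinear_unit(z00, z10, z01, z11, x, y):
    # weighted average of the four corner values of the unit square
    return (z00 * (1 - x) * (1 - y) + z10 * x * (1 - y)
            + z01 * (1 - x) * y + z11 * x * y)

# evaluate on a 6 x 6 grid, as in the table; grid[j][i] = F(i/5, j/5)
grid = [[bilinear_unit(0, 4, 2, 1, i / 5, j / 5) for i in range(6)]
        for j in range(6)]

def is_arithmetic(seq, tol=1e-12):
    """True if consecutive differences of seq are all (nearly) equal."""
    d = seq[1] - seq[0]
    return all(abs((b - a) - d) < tol for a, b in zip(seq, seq[1:]))
```

Every row (constant y) and every column (constant x) passes the check, which reflects the fact that the interpolant is linear on horizontal and vertical lines.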
You can use a finer grid and a heat map to visualize the interpolant as a surface. Notice that in the previous table, the Y values increase as you go down the rows. If you want the Y axis to point up instead of down, you can reverse the rows of the grid (and the labels) before you create a heat map:
/* increase the grid resolution and visualize by using a heat map */
xGrd = do(0, 1, 0.1);
yGrd = do(0, 1, 0.1);
t = ExpandGrid(yGrd, xGrd);   /* want X changing fastest; put in 2nd col */
t = t[ ,{2 1}];               /* reverse the columns from (y,x) to (x,y) */
F = bilinearInterpSquare(z, t);
Q = shape(F, ncol(yGrd), ncol(xGrd));
/* currently, the Y axis is pointing down. Flip it and the labels. */
H = Q[ nrow(Q):1, ];
reverseY = yGrd[ ,ncol(yGrd):1];
call heatmapcont(H) range={0 4} displayoutlines=0
     xValues=xGrd yValues=reverseY
     colorramp=palette("Spectral", 7)[,7:1]
     title="Interpolation at Multiple Points in the Unit Square";
The heat map agrees with the contour plot at the top of this article. The heat map also makes it easier to see that the bilinear interpolant is linear on each row and column.
In summary, you can create a short SAS/IML function that performs bilinear interpolation on the unit square. The interpolant is a quadratic saddle-shaped function inside the square. (You can download the SAS program that generates the tables and graphs in this article.)
You can use the ideas in this article to create a function that performs bilinear interpolation on an arbitrary grid of fitting data. The next article shows a general function for bilinear interpolation in SAS.
How to Add Sticky Headers with ODS HTML was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
This blog demonstrates how to modify your ODS HTML code to make your column headers “sticky,” or fixed in position. Sticky headers are most beneficial when you have long tables on your web page and you want the column headers to stay in view while you scroll through the rest of the page. This capability comes from the cascading style sheet (CSS) position property and its sticky value. You might have seen sticky positioning before it was widely standardized because it was supported by WebKit, the browser engine that Apple developed and that is used primarily in the Safari browser. (In Safari, you use the position property with the value -webkit-sticky.) The position: sticky style property is supported in the latest versions of the major browsers, except for Internet Explorer. As an alternative for browsers that lack support, the FROZEN_HEADERS= option can be used with the TableEditor tagset; see the TableEditor tagset method below.
Here is a brief explanation of the task that this blog helps you accomplish. Because the position: sticky style property is supported on the <TH> HTML tags within tables, it is very easy to add the position: sticky style to the HTML tables that ODS HTML generates. When this CSS style attribute is added to the headers, the headers are fixed within the viewport, which is the viewable area. The content in the viewport remains scrollable, as seen in the example output below.
In the past, JavaScript was the main tool for generating fixed headers that are compatible across browsers and devices. However, the position: sticky property has also made it easier to fix various other elements, such as footers, within the viewport on the web page. This blog demonstrates how to make the <TH> tag or .header class sticky but enable the rest of the web page to be scrolled. The techniques here work for both desktop and mobile applications. There are multiple ways to add this style. Choose the method that is most convenient for you.
This example uses the position: sticky style property for the .header class, which is added by using the HEADTEXT= option in the ODS HTML statement. The .header class and the position style property are placed between the <HEAD> and </HEAD> tags, which form the header section of the web page. This method is very convenient; however, the HEADTEXT= option is limited to 256 characters, which matters if you want to add other CSS style properties. The position style property is attached to the .header class name, which ODS HTML uses to add style attributes to the column headers. As the name suggests, cascading style sheets cascade: rules with like selector names are combined. In the following code example, the HEADTEXT= option adds a CSS rule with the .header class and the position: sticky property to the header section of the web page.
ods html path="c:\temp" file="sticky.html"
    headtext="<style> .header {position: sticky; top: 0}</style>";
proc print data=sashelp.cars;
run;
ods html close;
Here is what the output looks like:
You can also add the position: sticky property to the .header class from an external CSS file, which can be referenced in ODS HTML code by using the STYLESHEET= option with the (URL=) suboption. This method uses a CSS file as the basis for the formatting, unlike the first method above, which applied the default HTMLBLUE style for the destination.
Another item worth mentioning in this second example is the grouping of the CSS class selectors, which match the style element names used with ODS and the TEMPLATE procedure. For example, the .body, .systemtitle, .header, .rowheader, and .data class selectors are added and grouped into the font-family style property. This method is also used for several of the other style properties below. The .data class adds some additional functionality worth discussing, such as the use of a pseudo style selector, which applies a different background color for even alternating rows. In the example below, the .class names and the template element names are the same. You should place the CSS style rules that are shown here in a file that is named sticky.css.
.body, .systemtitle, .header, .rowheader, .data {
   font-family: arial, sans-serif;
}
.systemtitle, .header, .rowheader {
   font-weight: bold;
}
.table, .header, .rowheader, .data {
   border-spacing: 0;
   border-collapse: collapse;
   border: 1px solid #606060;
}
.table tbody tr:nth-child(even) td {
   background-color: #e0e0e0;
   color: black;
}
.header {
   background-color: #e0e0e0;
   position: -webkit-sticky;
   position: sticky;
   top: 0;
}
.header, .rowheader, .data {
   padding: 5px 10px;
}
After you create that CSS file, you can use the ODS HTML statement with the STYLESHEET= option. In that option, the (URL=) suboption tells ODS to use the sticky.css file as the basis for the formatting. If you omit the (URL=) suboption, ODS instead re-creates a CSS file from the current template style.
ods html path="c:\temp" file="sticky.html" stylesheet=(url="sticky.css");
proc print data=sashelp.cars;
run;
ods html close;
Here is what the output looks like:
The pseudo class selector in the CSS file indicated that even alternating rows for all <TD> tags would be colored with the background color gray. Also, the position: sticky property in the .header class fixed the position of the header within the viewport.
A third method uses the TableEditor tagset, which enables sticky headers to be added by using options. Options are also applied to modify the style for the alternating even and odd rows as well as to have sortable headers.
/* Reference the TableEditor tagset from support.sas.com. */
filename tpl url "http://support.sas.com/rnd/base/ods/odsmarkup/tableeditor/tableeditor.tpl";

/* Insert the tagset into the search path for ODS templates. */
ods path(Prepend) work.templat(update);
%include tpl;

ods tagsets.tableeditor file="c:\output\temp.html"
    options(sticky_headers="yes"
            sort="yes"
            banner_color_even="#e0e0e0")
    style=htmlblue;
proc print data=sashelp.cars;
run;
ods tagsets.tableeditor close;
Here is what the output looks like:
In summary, this article describes how easy it is to add sticky headers to tables that are generated by the ODS HTML destination. Fixed headers stay in the viewable area while you scroll through the table, which makes for a much richer experience. Give it a try and let me know how it goes.
The post Find points where a regression curve has zero slope appeared first on The DO Loop.
This article shows how to find local minima and maxima on a regression curve, which means finding points where the slope of the curve is zero. An example appears at the right, which shows locations where the loess smoother in a scatter plot has local minima and maxima.
Except for simple cases like quadratic regression, you need to use numerical techniques to locate these values.
In a previous article, I showed how to use SAS to evaluate the slope of a regression curve at specific points. The present article applies that technique by scoring the regression curve on a fine grid of points. You can use finite differences to approximate the slope of the curve at each point on the grid. You can then estimate the locations where the slope is zero.
In this article, I use the LOESS procedure to demonstrate the technique, but the method applies equally well to any one-dimensional regression curve. There are several ways to score a regression model: some procedures support a SCORE statement, some support the STORE statement (which enables you to score the model by using PROC PLM), and for procedures that support neither, you can use the missing value trick.
The technique in this article will not detect inflection points. An inflection point is a location where the curve has zero slope but is not a local min or max. Consequently, this article is really about “how to find a point where a regression curve has a local extremum,” but I will use the slightly inaccurate phrase “find points where the slope is zero.”
For convenience, I assume the explanatory variable is named X and the response variable is named Y.
The goal is to find locations where a nonparametric curve (x, f(x)) has zero slope, where f(x) is the regression model. The general outline follows: score the model on a fine grid of points, approximate the slope at each grid point by using finite differences, and then find the intervals on which the slope changes sign and use linear interpolation to estimate where the slope is zero.
SAS distributes the ENSO data set in the SASHelp library.
You can create a DATA step view that renames the explanatory and response variables to X and Y, respectively, so that it is easier to follow the logic of the program:
/* Create VIEW where x is the independent variable and y is the response */
data Have / view=Have;
   set Sashelp.Enso(rename=(Month=x Pressure=y));
   keep x y;
run;
After the data set is created, you can use PROC SQL to find the minimum and maximum values of the explanatory variable. You can create an evenly spaced grid of points for the range of the explanatory variable.
/* Put min and max into macro variables */
proc sql noprint;
   select min(x), max(x) into :min_x, :max_x from Have;
quit;

/* Evaluate the model and estimate derivatives at these points */
data Grid;
   dx = (&max_x - &min_x)/201;   /* choose the step size wisely */
   do x = &min_x to &max_x by dx;
      output;
   end;
   drop dx;
run;
This is the step that will vary from procedure to procedure. You have to know how to use the procedure to score the regression model on the points in the Grid data set.
The LOESS procedure supports a SCORE statement, so the call fits the model and scores the model on the Grid data set:
/* Score the model on the grid */
ods select none;                /* do not display the tables */
proc loess data=Have plots=none;
   model y = x;
   score data=Grid;             /* PROC LOESS does not support an OUT= option */
   /* Most procedures support an OUT= option to save the scored values.
      PROC LOESS displays the scored values in a table, so use ODS to
      save the table to an output data set */
   ods output ScoreResults=ScoreOut;
run;
ods select all;
If a procedure supports the STORE statement, you can use PROC PLM to score the model on the data. The SAS program that accompanies this article includes an example that uses the GAMPL procedure. The GAMPL procedure does not support the STORE or SCORE statements, but you can use the missing value trick to find zero derivatives.
This is the mathematical portion of the computation. You can use a backward difference scheme to estimate the derivative (slope) of the curve. If (x0, y0) and (x1, y1) are two consecutive points along the curve (in the ScoreOut data set), then the slope at (x1, y1) is approximately m = (y1 – y0) / (x1 – x0). When the slope changes sign between consecutive points, it indicates that the slope changed from positive to negative (or vice versa) between the points. If the slope is continuous, it must have been exactly zero somewhere on the interval. You can use a linear approximation to find the point, t, where the slope is zero. You can then use linear interpolation to approximate the point (t, f(t)) at which the curve is a local min or max.
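The backward-difference and sign-change logic is not specific to SAS. Here is a Python sketch of the same computation (names are illustrative; a slope of exactly zero at a grid point is treated as "no sign change" for simplicity):

```python
def zero_slope_points(xs, ys):
    """Given points (xs[i], ys[i]) sampled along a curve, estimate (t, f(t))
    wherever the finite-difference slope changes sign between intervals."""
    # slope of each interval [xs[k], xs[k+1]], attributed to its right endpoint
    slopes = [(y1 - y0) / (x1 - x0)
              for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))]
    extrema = []
    for k in range(1, len(slopes)):
        m0, m1 = slopes[k - 1], slopes[k]
        if m0 == 0 or (m0 > 0) == (m1 > 0):
            continue                      # no sign change on this pair
        # assume the slope varies linearly on [xs[k], xs[k+1]] and find its zero
        t = xs[k] - m0 * (xs[k + 1] - xs[k]) / (m1 - m0)
        # linear interpolation of the curve value at t
        f_t = ys[k] + (ys[k + 1] - ys[k]) / (xs[k + 1] - xs[k]) * (t - xs[k])
        kind = "Max" if m0 > 0 else "Min"
        extrema.append((t, f_t, kind))
    return extrema
```

For a finely sampled parabola, the function recovers the vertex to within about one grid step, which is the accuracy you should expect from a backward-difference slope estimate.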
You can use the following SAS DATA step to process the scoring data, approximate the slope, and estimate where the slope of the curve is zero:
/* Compute slope by using finite difference formula. */
data Deriv0;
   set ScoreOut;
   Slope = dif(p_y) / dif(x);   /* (f(x) - f(x-dx)) / dx */
   xPrev = lag(x);
   yPrev = lag(p_y);
   SlopePrev = lag(Slope);
   if n(SlopePrev) AND sign(SlopePrev) ^= sign(Slope) then do;
      /* The slope changes sign between this obs and the previous.
         Assuming linearity on the interval, find (t0, f(t0)) where
         the slope is exactly zero */
      t0 = xPrev - SlopePrev * (x - xPrev)/(Slope - SlopePrev);
      /* use linear interpolation to find the corresponding y value:
         f(t) ~ y0 + (y1-y0)/(x1-x0) * (t - x0) */
      f_t0 = yPrev + (p_y - yPrev)/(x - xPrev) * (t0 - xPrev);
      if sign(SlopePrev) > 0 then _Type_ = "Max";
      else _Type_ = "Min";
      output;
   end;
   keep t0 f_t0 _Type_;
   label f_t0 = "f(t0)";
run;

proc print data=Deriv0 label;
run;
The table shows that there are seven points at which the loess regression curve has zero slope: each is a local min or max.
If you want to display the local extrema on the graph of the regression curve, you can concatenate the original data, the regression curve, and the local extrema. You can then use PROC SGPLOT to overlay the three layers. The resulting graph is shown at the top of this article.
data Combine;
   merge Have                             /* data   : (x, y)     */
         ScoreOut(rename=(x=t p_y=p_t))   /* curve  : (t, p_t)   */
         Deriv0;                          /* extrema: (t0, f_t0) */
run;

title "Loess Smoother";
title2 "Red Markers Indicate Zero Slope for Smoother";
proc sgplot data=Combine noautolegend;
   scatter x=x y=y;
   series x=t y=p_t / lineattrs=GraphData2;
   scatter x=t0 y=f_t0 / markerattrs=(symbol=circlefilled color=red);
   yaxis grid;
run;
In summary, if you can evaluate a regression curve on a grid of points, you can approximate the slope at each point along the curve. By looking for when the slope changes sign, you can find local minima and maxima. You can then use a simple linear estimator on the interval to estimate where the slope is exactly zero.
You can download the SAS program that performs the computations in this article.
What to expect when you take SAS training: Before, during and after was published on SAS Users.
You’ve chosen the right class, added-to-cart, and hit submit.
Once you book a class with us, no matter the format, you can expect an email confirming your request within 24 hours. For instructor-led training courses, a reminder email is sent 3-5 days before the course begins, providing access to the course notes and instructions about what will happen on the first day. SAS Live Web course instructions include tasks to perform to ensure your system is set up properly. If you’re taking in-person, classroom training, you can expect an email with guidelines for the specific training center location, including the address and travel or parking tips.
For e-Learners, you can start right away! Your confirmation email will give you a link to your personal My Training page where you’ll log in to access your training – anytime, anywhere.
Depending on the course level, you may be asked if you’ve met all the prerequisites. Maybe you’ll even take a training assessment. We’re always available to answer your questions and want you to be 100% satisfied with your course, so reach out and we’ll be sure you’re in the correct class.
First day jitters? Nah, we’ve got you covered. The reminder email you’ll receive has all the tools you need to get started. So, relax and just show up! SAS instructors are some of the best teachers in the business – and you can be assured they know their stuff. You’ll learn tips and tricks, even when they’re reviewing familiar content!
Live Web classes are as interactive as traditional classroom training. With our state-of-the-art technology, you’ll interact with the instructor and classmates throughout the course and have access to a virtual lab with the software and data. As you noticed when you registered, the class layout varies – sometimes you have full-day training and sometimes the class is split into half-day sessions over a longer time period. Always check the times to be sure you log in to the right time zone.
One of the greatest things about SAS instructors is their diversity – we really love to encourage uniqueness, so our classes vary a bit. Each instructor has their own way of breaking down the course, and much of it will depend on you, the students who make up the class. So, be ready to speak up and share what you know, what you don’t, and what you want to accomplish.
What remains the same across the board is the fact that you’ll undoubtedly walk away with several ah-ha moments. Expect lectures interspersed with mathematical details on the algorithms used in the demos. You’ll have quizzes and exercises that take it to the next level. Don’t worry, there’s always room for Q&A, and the instructors make themselves available 30 minutes before and after class to answer questions. And, we’re all human, so expect some breaks. Full-day classes will also have a lunch hour.
As you approach the end of your training, reflect on and realize the accomplishments you’ve achieved – including all the new SAS skills you have to show off!
But that doesn’t mean the fun ends!
That’s right, you’ve only scratched the surface – to really solidify your skills, you must use what you learned. Most classes have Extended Learning Pages, which you’ll get access to in your Thank you email after class. As you practice your newfound knowledge you may have questions. While you probably have someone at work who can assist, most instructors encourage students to email them when questions arise.
If you were part of an onsite course or just have a group of people working on similar tasks, it might be a good idea to schedule a mentoring session with an instructor. While this is not free, it’s invaluable to see SAS in action using your own data.
There are plenty of other great resources available free of charge, right at your fingertips.
So, track your progress, earn Learn Badges and prepare for a globally recognized SAS Certification. Then, see where it leads.
The post Cubic spline interpolation in SAS appeared first on The DO Loop.
I recently showed how to use linear interpolation in SAS. Linear interpolation is a common way to interpolate between a set of planar points, but the interpolating function (the interpolant) is not smooth. If you want a smoother interpolant, you can use cubic spline interpolation. This article describes how to use cubic spline interpolation in SAS.
As mentioned in my previous post, an interpolation requires two sets of numbers: the sample data, which define the model, and the scoring data, which are the locations at which to interpolate.
The following SAS DATA steps define the data for this example. The POINTS data set contains the sample data, which are shown as blue markers on the graph to the right. The SCORE data set contains the scoring data, which are shown as red tick marks along the horizontal axis.
/* Example data for 1-D interpolation */
data Points;    /* these points define the model */
   input x y;
   datalines;
0  1
1  3
4  5
5  4
7  6
8  3
10 3
;

data Score;     /* these points are to be interpolated */
   input t @@;
   datalines;
2 -1 4.8 0 0.5 1 9 5.3 7.1 10.5 9
;
On the graph, the blue curve is the cubic spline interpolant. Every point that you interpolate will be on that curve. The red asterisks are the interpolated values for the values in the SCORE data set. Notice that points -1 and 10.5 are not interpolated because they are outside of the data range. The following section shows how to compute the cubic spline interpolation in SAS.
A linear interpolation uses a linear function on each interval between the data points. In general, the linear segments do not meet smoothly: the resulting interpolant is continuous but not smooth. In contrast, spline interpolation uses a polynomial function on each interval and chooses the polynomials so that the interpolant is smooth where adjacent polynomials meet. For polynomials of degree k, you can match the first k – 1 derivatives at each data point. A cubic spline is composed of piecewise cubic polynomials whose first and second derivatives match at each data point. Typically, the second derivatives at the minimum and maximum of the data are set to zero. This kind of spline is known as a “natural cubic spline” with knots placed at each data point.
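The construction sketched above (solve a tridiagonal system for the second derivatives at the knots, with zeros at both ends) can be written in a few dozen lines. Here is a minimal Python sketch of the math; it illustrates the natural-cubic-spline recipe, not the SPLINEC/SPLINEV implementation, and the function name is illustrative:

```python
from bisect import bisect_right

def natural_cubic_spline(x, y):
    """Return an evaluator for the natural cubic spline through (x[i], y[i]).
    x must be strictly increasing; the second derivative is zero at both ends."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    # Tridiagonal system for the knot second derivatives M[0..n-1],
    # with the "natural" conditions M[0] = M[n-1] = 0
    a = [0.0] * n; b = [1.0] * n; c = [0.0] * n; d = [0.0] * n
    for i in range(1, n - 1):
        a[i] = h[i - 1]
        b[i] = 2.0 * (h[i - 1] + h[i])
        c[i] = h[i]
        d[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution
    for i in range(1, n):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    M = [0.0] * n
    M[n - 1] = d[n - 1] / b[n - 1]
    for i in range(n - 2, -1, -1):
        M[i] = (d[i] - c[i] * M[i + 1]) / b[i]

    def S(t):
        i = min(max(bisect_right(x, t) - 1, 0), n - 2)  # interval containing t
        dx = x[i + 1] - x[i]
        A = (x[i + 1] - t) / dx     # weight of the left knot
        B = (t - x[i]) / dx         # weight of the right knot
        return (A * y[i] + B * y[i + 1]
                + ((A**3 - A) * M[i] + (B**3 - B) * M[i + 1]) * dx * dx / 6.0)
    return S
```

A handy sanity check: the spline passes through every knot, and a natural spline of linear data reproduces the line exactly because all the second derivatives are zero.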
I have previously shown how to use the SPLINE call in SAS/IML to compute a smoothing spline. A smoothing spline is not an interpolant because it does not pass through the original data points. However, you can get interpolation by using the SMOOTH=0 option. Adding the TYPE='zero' option results in a natural cubic spline.
For more control over the interpolation, you can use the SPLINEC function (‘C’ for coefficients) to fit the cubic splines to the data and obtain a matrix of coefficients. You can then use that matrix in the SPLINEV function (‘V’ for value) to evaluate the interpolant at the locations in the scoring data.
The following SAS/IML function (CubicInterp) computes the spline coefficients from the sample data and then interpolates the scoring data. The details of the computation are provided in the comments, but you do not need to know the details in order to use the function to interpolate data:
/* Cubic interpolating spline in SAS.
   The interpolation is based on the values (x1,y1), (x2,y2), ..., (xn,yn).
   The X values must be nonmissing and in increasing order: x1 < x2 < ... < xn
   The values of the t vector are interpolated. */
proc iml;
start CubicInterp(x, y, t);
   d = dif(x, 1, 1);                    /* check that x[i+1] > x[i] */
   if any(d<=0) then
      stop "ERROR: x values must be nonmissing and strictly increasing.";
   idx = loc(t>=min(x) && t<=max(x));   /* check for valid scoring values */
   if ncol(idx)=0 then
      stop "ERROR: No values of t are inside the range of x.";
   /* fit the cubic model to the data */
   call splinec(splPred, coeff, endSlopes, x||y) smooth=0 type="zero";
   p = j(nrow(t)*ncol(t), 1, .);   /* allocate output (prediction) vector */
   call sortndx(ndx, colvec(t));   /* SPLINEV wants sorted data, so get sort index for t */
   sort_t = t[ndx];                /* sorted version of t */
   sort_pred = splinev(coeff, sort_t);   /* evaluate model at (sorted) points of t */
   p[ndx] = sort_pred[,2];         /* "unsort" by using the inverse sort index */
   return( p );
finish;

/* example of cubic spline interpolation in SAS */
use Points;  read all var {'x' 'y'};  close;
use Score;   read all var 't';        close;
pred = CubicInterp(x, y, t);
create PRED var {'t' 'pred'};  append;  close;
QUIT;
The visualization of the interpolation is similar to the code in the previous article, so the code is not shown here. However, you can download the SAS program that performs the cubic interpolation and creates the graph at the top of this article.
Although cubic spline interpolation is slower than linear interpolation, it is still fast: the CubicInterp program takes about 0.75 seconds to fit 1000 data points and interpolate one million scoring values.
In summary, the SAS/IML language provides the computational tools for cubic spline interpolation. The CubicInterp function in this article encapsulates the functionality so that you can perform cubic spline interpolation of your data in an efficient manner.