The post Three ways to add a line to a Q-Q plot appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
A quantile-quantile plot (Q-Q plot) is a graphical tool that compares a data distribution and a specified probability distribution. If the points in a Q-Q plot appear to fall on a straight line, that is evidence that the data can be approximately modeled by the target distribution.
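Although it is not necessary, some data analysts like to overlay a reference line to help “guide their eyes” as to whether the values in the plot fall on a straight line. This article describes three ways to overlay a reference line on a Q-Q plot. The first two lines are useful during the exploratory phase of data analysis; the third line visually represents the estimates of the location and scale parameters in the fitted model distribution. The three lines are: (1) a line that connects the 25th and 75th percentiles of the data and target distributions; (2) a least squares regression line; and (3) a line whose intercept and slope are determined by maximum likelihood estimates of the location and scale parameters.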
If you need to review Q-Q plots, see my previous article that describes what a Q-Q plot is, how to construct a Q-Q plot in SAS, and how to interpret a Q-Q plot.
Let me be clear: It is not necessary to overlay a line on a Q-Q plot. You can display only the points on a Q-Q plot and, in fact, that is the default behavior in SAS when you create a Q-Q plot by using the QQPLOT statement in PROC UNIVARIATE.
The following DATA step generates 97 random values from an exponential distribution with scale parameter σ = 2 and appends three artificial “outliers.” The call to PROC UNIVARIATE creates a Q-Q plot:

data Q(keep=y);
call streaminit(321);
do i = 1 to 97;
   y = round( rand("Expon", 2), 0.001);   /* Y ~ Exp(2), rounded to nearest 0.001 */
   output;
end;
do y = 10,11,15; output; end;             /* add outliers */
run;

proc univariate data=Q;
   qqplot y / exp grid;           /* plot data quantiles against Exp(1) */
   ods select QQPlot;
   ods output QQPlot=QQPlot;      /* for later use: save quantiles to a data set */
run;
The vertical axis of the Q-Q plot displays the sorted values of the data; the horizontal axis displays evenly spaced quantiles of the standardized target distribution, which in this case is the exponential distribution with scale parameter σ = 1.
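To make the construction of the axes concrete, here is a minimal sketch that builds the coordinates of an exponential Q-Q plot by hand. It assumes the plotting positions (i - 0.375)/(n + 0.25), which I believe are the documented defaults for the QQPLOT statement; treat that formula as an assumption. The data set names Sample and ManualQQ are hypothetical, and the simulated sample is a stand-in, not the data from this article.

```sas
/* Stand-in sample: 100 draws from an exponential distribution with scale 2 */
data Sample(keep=y);
   call streaminit(321);
   do i = 1 to 100;
      y = 2 * rand("Exponential");     /* Exp(1) variate multiplied by the scale */
      output;
   end;
run;

proc sort data=Sample; by y; run;      /* vertical axis: sorted data values */

data ManualQQ;
   set Sample nobs=n;
   p = (_N_ - 0.375) / (n + 0.25);         /* assumed plotting position for i_th value */
   Quantile = quantile("Exponential", p);  /* horizontal axis: Exp(1) quantiles */
run;

proc sgplot data=ManualQQ;
   scatter x=Quantile y=y;
   xaxis grid label="Exponential Quantiles";
   yaxis grid;
run;
```

If the assumption about the plotting positions holds, the resulting scatter plot should resemble the plot that PROC UNIVARIATE creates for the same data.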
Most of the points appear to fall on a straight line, which indicates that these (simulated) data might be reasonably modeled by using an exponential distribution. The slope of the line appears to be approximately 2, which is a crude estimate of the scale parameter (σ). The Y-intercept of the line appears to be approximately 0, which is a crude estimate of the location parameter (the threshold parameter, θ).
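The reason that the slope and intercept estimate the scale and location parameters can be sketched in one line. Assume the data come from a location-scale family, so that Y = θ + σZ, where Z follows the standardized target distribution with quantile function Q_Z (the symbols Q_Y and Q_Z are notation introduced here for illustration):

```latex
% Quantiles of a location-scale family are a linear function of the standardized quantiles
Y = \theta + \sigma Z,\ \sigma > 0
\quad\Longrightarrow\quad
Q_Y(p) = \theta + \sigma\, Q_Z(p), \qquad 0 < p < 1 .
```

Consequently, when the sorted data are plotted against the standardized quantiles, the points should fall near a line that has intercept θ and slope σ.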
Although the basic Q-Q plot provides all the information you need to decide that these data can be modeled by an exponential distribution, some data sets are less clear. The Q-Q plot might show a slight bend or wiggle, and you might want to overlay a reference line to assess how severely the pattern deviates from a straight line. The problem is, what line should you use?
Cleveland (Visualizing Data, 1993, p. 31) recommends overlaying a line that connects the first and third quartiles. That is, let p_{25} and p_{75} be the 25th and 75th percentiles of the target distribution, respectively, and let y_{25} and y_{75} be the 25th and 75th percentiles of the ordered data values. Then Cleveland recommends plotting the line through the ordered pairs (p_{25}, y_{25}) and (p_{75}, y_{75}).
In SAS, you can use PROC MEANS to compute the 25th and 75th percentiles for the X and Y variables in the Q-Q plot. You can then use the DATA step or PROC SQL to compute the slope of the line that passes through the two quartile points. The following statements analyze the Q-Q plot data that was created by using the ODS OUTPUT statement in the previous section:

proc means data=QQPlot P25 P75;
   var Quantile Data;     /* ODS OUTPUT created the variables Quantile (X) and Data (Y) */
   output out=Pctl P25= P75= / autoname;
run;

data _null_;
set Pctl;
slope = (Data_P75 - Data_P25) / (Quantile_P75 - Quantile_P25);   /* dy / dx */
/* if desired, put point-slope values into macro variables to help plot the line */
call symputx("x1", Quantile_P25);
call symputx("y1", Data_P25);
call symput("Slope", putn(slope,"BEST5."));
run;

title "Q-Q Plot with Reference Line";
title2 "Reference Line through First and Third Quartiles";
title3 "Slope = &slope";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=&x1 y=&y1 slope=&slope / lineattrs=(color=Green) legendlabel="Percentile Estimate";
   xaxis grid label="Exponential Quantiles";
   yaxis grid;
run;
Because the line passes through the first and third quartiles, the slope of the line is robust to outliers in the tails of the data. The line often provides a simple visual guide to help you determine whether the central portion of the data matches the quantiles of the specified probability distribution.
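As a cross-check on the horizontal coordinates of the quartile line, the following sketch computes the exact quartiles of the standard exponential distribution. The data set name ExpQuartiles is hypothetical. The PROC MEANS approach computes sample percentiles of the plotted quantiles, so its values should be close to these but need not match exactly.

```sas
/* Exact quartiles of the Exp(1) distribution, for comparison with the sample percentiles */
data ExpQuartiles;
   p25 = quantile("Exponential", 0.25);   /* equals -log(0.75) */
   p75 = quantile("Exponential", 0.75);   /* equals -log(0.25) */
   dx  = p75 - p25;                        /* denominator of the slope; equals log(3) */
run;

proc print data=ExpQuartiles noobs; run;
```

Dividing the interquartile range of the data by dx gives a slope for the reference line that does not depend on the plotting positions.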
Keep in mind that this is a visual guide. The slope and intercept for this line should not be used as parameter estimates for the location and scale parameters of the probability distribution, although they could be used as an initial guess for an optimization that estimates the location and scale parameters for the model distribution.
Let’s be honest: when a statistician sees a scatter plot for which the points appear to be linearly related, there is a Pavlovian reflex to fit a regression line to the values in the plot.
However, I can think of several reasons to avoid adding a regression line to a Q-Q plot:
If you choose to ignore these problems, you can use the REG statement in PROC SGPLOT to add a reference line. Alternatively, you can use PROC REG in SAS (perhaps with the NOINT option if the location parameter is zero) to obtain an estimate of the slope:
proc reg data=QQPlot plots=NONE;
   model Data = Quantile / NOINT;     /* use NOINT when location parameter is 0 */
   ods select ParameterEstimates;
quit;

title2 "Least Squares Reference Line";
proc sgplot data=QQPlot;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=2.36558 / lineattrs=(color=Red) legendlabel="OLS Estimate";
   xaxis grid label="Exponential Quantiles";
   yaxis grid;
run;
For these data, I used the NOINT option to set the threshold parameter to 0. The zero-intercept line with slope 2.36558 is overlaid on the Q-Q plot. As expected, the outliers in the upper-right corner of the Q-Q plot have pulled the regression line upward, so the regression line has a steeper slope than the reference line based on the first and third quartiles.
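If you prefer not to type the slope by hand, one option is to capture the estimate in a macro variable. The following is a minimal sketch under stated assumptions: the data set QQDemo contains made-up values that stand in for the QQPlot data set, and the names PE and OLSSlope are hypothetical. Replace QQDemo with the data set that the ODS OUTPUT statement created.

```sas
/* Stand-in for the QQPlot data set; the values are hypothetical */
data QQDemo;
   input Quantile Data @@;
   datalines;
0.1 0.2  0.5 1.1  1.0 1.9  1.5 3.2  2.0 4.1  3.0 5.8
;

proc reg data=QQDemo plots=none;
   model Data = Quantile / noint;          /* zero-intercept fit */
   ods output ParameterEstimates=PE;       /* save the estimates to a data set */
quit;

data _null_;
   set PE;
   if upcase(Variable) = "QUANTILE" then
      call symputx("OLSSlope", Estimate);  /* copy the slope into a macro variable */
run;

proc sgplot data=QQDemo;
   scatter x=Quantile y=Data;
   lineparm x=0 y=0 slope=&OLSSlope / lineattrs=(color=Red) legendlabel="OLS Estimate";
   xaxis grid;
   yaxis grid;
run;
```

The same pattern keeps the plot in sync with the fit if the data change.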
Because the tails of an empirical distribution often differ from the tails of the target distribution, the regression-based reference line can be misleading. I do not recommend its use.
The previous sections describe two ways to overlay a reference line during the exploratory phase of the data analysis. The purpose of the reference line is to guide your eye and help you determine whether the points in the Q-Q plot appear to fall on a straight line. If so, you can move to the modeling phase.
In the modeling phase, you use a parameter estimation method to fit the parameters in the target distribution.
Maximum likelihood estimation (MLE) is often the method of choice for estimating parameters from data. You can use the HISTOGRAM statement in PROC UNIVARIATE to obtain a maximum likelihood estimate of the scale parameter for the exponential distribution, which turns out to be 2.21387. If you specify the location and scale parameters in the QQPLOT statement, PROC UNIVARIATE will automatically overlay a line that represents the fitted values:

proc univariate data=Q;
   histogram y / exp;
   qqplot y / exp(threshold=0 scale=est) odstitle="Q-Q Plot with MLE Estimate" grid;
   ods select ParameterEstimates GoodnessOfFit QQPlot;
run;
The ParameterEstimates table shows the maximum likelihood estimate. The GoodnessOfFit table shows that there is no evidence to reject the hypothesis that these data came from an Exp(σ=2.21) distribution.
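For the exponential distribution, the maximum likelihood estimate has a closed form. The following sketch assumes that the threshold parameter is fixed at 0, so that the density is f(y; σ) = (1/σ) exp(-y/σ) for y ≥ 0:

```latex
% Log-likelihood for n independent observations from Exp(sigma) with threshold 0
\ell(\sigma) = -n \ln \sigma - \frac{1}{\sigma} \sum_{i=1}^{n} y_i ,
\qquad
\frac{d\ell}{d\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^{2}} \sum_{i=1}^{n} y_i = 0
\quad\Longrightarrow\quad
\hat{\sigma} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} .
```

Under that assumption, the estimate of the scale parameter should equal the sample mean of the data, which is one reason the three large outliers pull the MLE above 2.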
Notice the distinction between this line and the previous lines. This line is the result of fitting the target distribution to the data (MLE) whereas the previous lines were visual guides.
When you display a Q-Q plot that has a diagonal line, you should state how the line was computed.
In conclusion, you can display a Q-Q plot without adding any reference line. If you choose to overlay a line, there are three common methods. During the exploratory phase of analysis, you can display a line that connects the 25th and 75th percentiles of the data and target distributions. (Some practitioners use an OLS regression line, but I do not recommend it.) During the modeling phase, you can use maximum likelihood estimation or some other fitting method to estimate the location and scale of the target distribution. Those estimates can be used as the intercept and slope, respectively, of a line on the Q-Q plot. PROC UNIVARIATE in SAS displays this line automatically when you fit a distribution.
The post Let's track the falling gasoline prices! appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
When I fill up my daily-driver Prius, the price of gasoline isn’t that important. But when I occasionally take a trip in my V8 Suburban, I pay a lot more attention! Therefore I was pleasantly surprised when I noticed that gasoline prices have been falling. How much have they fallen, […]
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Now is your chance to learn even more about SAS hash tables with four additional articles on the subject.
The post Four more tips about hash tables appeared first on SAS Learning Post.
The post How to align the Y and Y2 axes in PROC SGPLOT appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
When you overlay two series in PROC SGPLOT, you can either plot both series on the same axis or you can assign one series to the main axis (Y) and another to a secondary axis (Y2). If you use the Y and Y2 axes, they are scaled independently by default, which is usually what you want. However, if the measurements for the two series are linearly related to each other, then you might want to specify the tick values for the Y2 axis so that they align with the corresponding tick marks for the Y axis.
This article shows how to align the Y and Y2 axes in PROC SGPLOT in SAS for two common situations.
The simplest situation is a single set of data that you want to display in two different units. For example, you might use one axis to display the data in imperial units (pounds, gallons, degrees Fahrenheit, etc.) and the other axis to display the data in metric units (kilograms, liters, degrees Celsius, etc.).
To plot the data, define one variable for each unit. For example, the Sashelp.Class data records the weight for 19 students in pounds. The following DATA view creates a new variable that records the same data in kilograms. The subsequent call to PROC SGPLOT plots the pounds on the Y axis (left axis) and the kilograms on the Y2 axis (right axis). However, as you will see, there is a problem with the default scaling of the two axes:
data PoundsKilos / view=PoundsKilos;
   set Sashelp.Class(rename=(Weight=Pounds));
   Kilograms = 0.453592 * Pounds;       /* convert pounds to kilos */
run;

title "Independent Axes";
title2 "Markers Do Not Align Correctly!";   /* the tick marks on each axis are independent */
proc sgplot data=PoundsKilos;
   scatter x=Height y=Pounds;
   scatter x=Height y=Kilograms / Y2Axis;
run;
The markers for the kilogram measurements should exactly overlap the markers for pounds, but they don’t. The Y and Y2 axes are independently scaled because PROC SGPLOT does not know that pounds and kilograms are linearly related. The SGPLOT procedure displays each variable by using a range of round numbers (multiples of 10 or 20). The range for the Y2 axis is [20, 70] kilograms, which corresponds to a range of [44.1, 154.3] pounds. However, the range for the Y axis is approximately [50, 150] pounds. Because the axes display different ranges, the markers do not overlap.
To improve this graph, use the VALUES= and VALUESDISPLAY= options on the YAXIS statement (or Y2AXIS statement) to force the tick marks on one axis to align with the corresponding tick marks on the other axis. In the following DATA step, I use the kilogram scale as the standard and compute the corresponding pounds.

data Ticks;
do Kilograms = 20 to 70 by 10;        /* for each Y2 tick */
   Pounds = Kilograms / 0.453592;     /* convert kilos to pounds */
   Approx = round(Pounds, 0.1);       /* use rounded values to display tick values */
   output;
end;
run;

proc print; run;
You can use the Pounds column in the table to set the VALUES= list on the YAXIS statement. You can use the Approx column to set the VALUESDISPLAY= list, as follows:
/* align tick marks on each axis */
title "Both Axes Use the Same Scale";
proc sgplot data=PoundsKilos noautolegend;
   scatter x=Height y=Pounds;
   /* Make sure the plots overlay exactly! Then you can set SIZE=0 */
   scatter x=Height y=Kilograms / markerattrs=(size=0) Y2Axis;
   yaxis grid values=(44.092 66.139 88.185 110.231 132.277 154.324)
         valuesdisplay=('44.1' '66.1' '88.2' '110.2' '132.3' '154.3');
run;
Success! The markers for the two variables align exactly. After verifying that they align, you can use the MARKERATTRS=(SIZE=0) option to suppress the display of one of the markers.
Notice that the Y axis (pounds) no longer displays “nice numbers” because I put the tick marks at the same vertical heights on both axes.
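Typing the tick values by hand is error-prone. As a sketch of one alternative, the following program builds the VALUES= and VALUESDISPLAY= lists from the converted tick values by using PROC SQL. The macro variable names TickVals and TickLabels and the data set name ClassKg are hypothetical. Like the previous example, the sketch assumes that the default range of the Y2 axis is [20, 70].

```sas
/* Tick locations: kilograms converted to pounds */
data Ticks;
   do Kilograms = 20 to 70 by 10;
      Pounds = Kilograms / 0.453592;
      output;
   end;
run;

/* Build space-separated lists of tick values and quoted tick labels */
proc sql noprint;
   select strip(put(Pounds, 8.3)),
          cats("'", put(Pounds, 8.1), "'")
      into :TickVals   separated by ' ',
           :TickLabels separated by ' '
      from Ticks;
quit;
%put &=TickVals;
%put &=TickLabels;

data ClassKg;
   set Sashelp.Class(rename=(Weight=Pounds));
   Kilograms = 0.453592 * Pounds;
run;

proc sgplot data=ClassKg noautolegend;
   scatter x=Height y=Pounds;
   scatter x=Height y=Kilograms / y2axis markerattrs=(size=0);
   yaxis grid values=(&TickVals) valuesdisplay=(&TickLabels);
run;
```

The %PUT statements write the lists to the log so that you can confirm the tick values before you use them on the axis.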
A different way to solve the misalignment problem is to use the MIN=, MAX=, THRESHOLDMIN=, and THRESHOLDMAX= options on both axes. This will enable both axes to use “nice numbers” while still aligning the data. If you want to try this approach, here are the YAXIS and Y2AXIS statements:
/* set the axes ranges to corresponding values */
yaxis  grid thresholdmin=0 thresholdmax=0 min=44.1 max=154.3;
y2axis grid thresholdmin=0 thresholdmax=0 min=20 max=70;
Another situation that requires two Y axes is the case of two series that use different units. For example, you might want to plot the revenue for a US company (in dollars) and the revenue for a Japanese company (in yen) for a certain time period.
You can use the conversion rate between yen and dollars to align the values on the axes.
Of course, the conversion rate from Japanese yen to US dollars changes each day, but you can use an average conversion rate to set the correspondence between the axes.
This situation also occurs when two devices use different methods to measure the same quantity.
The following example shows measurements for a patient who receives a certain treatment. The quantity of a substance in the patient’s blood is measured at baseline and for every hour thereafter. The quantity is measured in two ways: by using a traditional blood test and by using a new noninvasive device that measures electrical impedance.
The following statements define and plot the data. The two axes are scaled by using the default method:
data BloodTest1;
label t="Hours after Medication"
      x="micrograms per deciliter"
      y="kiloOhms";
input x y @@;
t = _N_ - 1;
datalines;
169.0 45.5 130.8 33.4 109.0 23.8 94.1 19.8 86.3 20.4 78.4 18.7 76.1 16.1
72.2 16.7 70.0 11.9 69.8 14.6 69.5 10.6 68.7 12.7 67.3 16.9
;

title "Overlay Measurements for Two Medical Devices";
title2 "Default Scaling";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter";
   y2axis grid label="kiloOhms";
run;
In this graph, the Y axes are scaled independently. However, the company that manufactures the device used Deming regression to establish that the measurements from the two devices are linearly related by the equation Y = -10.56415 + 0.354463*X, where X is the measurement from the blood test. You can use this linear equation to set the scales for the two axes.
The following DATA step uses the Deming regression estimates to convert the tick marks on the Y axis into values for the Y2 axis. The call to PROC SGPLOT creates a graph in which the Y2 axis is aligned with the Y axis according to the Deming regression estimates.

data Ticks;
do Y1 = 60 to 160 by 20;
   /* use Deming regression to find one set of ticks in terms of the other */
   Y2 = -10.56415 + 0.354463 * Y1;    /* kiloOhms as a function of micrograms/dL */
   Approx = round(Y2, 0.1);
   output;
end;
run;

proc print; run;

title "Align Y Axes for Different Series";
title2 "Measurements are Linearly Related";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter" offsetmax=0.1
         values=(60 to 160 by 20);
   /* the same offsets must be used in both YAXIS and Y2AXIS stmts */
   y2axis grid label="kiloOhms" offsetmax=0.1
         values=(10.7036 17.7929 24.8822 31.9714 39.0607 46.1499)
         valuesdisplay=('10.7' '17.8' '24.9' '32.0' '39.1' '46.1');
run;
In this new graph, the measurements are displayed on compatible scales and the reference lines connect round numbers on one axis to the corresponding values on the other axis.
This post was kindly contributed by Avocet Solutions - go there to comment and to read the full post. |
New to SAS? Here are tips from the translator of The Little SAS Book, Fifth Edition.
Hongqiu Gu, Ph.D. works at the China National Clinical Research Center for Neurological Diseases at the National Center for Healthcare Quality Management in Neurological Diseases at Beijing Tiantan Hospital, Capital Medical University.
He shared these important tips to learn SAS well:
1. Read SAS Reference Books
I have not counted the number of SAS books I have read; I would estimate over 50 or 60. The best books to give me a deep understanding of SAS are the SAS Reference Books, including SAS Language Reference Concepts, SAS Functions and CALL Routines Reference, SAS Macro Language Reference, and so on. There are lots of excellent books published by SAS Press, and usually they are concise and suitable for quick learners. However, when I realized that SAS could give me a powerful career advantage, I needed to learn SAS systematically and deeply. I believe the SAS Reference Books are the most authoritative and comprehensive learning materials. Besides, all the updated SAS Reference Books are free to all readers.
2. Use the SAS Help and Documentation frequently
No one can remember all the syntaxes or options in SAS. However, don’t worry, SAS Help and Documentation is our best friend. I use the SAS Help and Documentation quite often. Even as an experienced SAS user, there are still many situations in which I need to ask for help from SAS Help and Documentation. Every time I use it, I learn something new.
3. Solve SAS related questions in SAS communities
As the saying goes, practice makes perfect. Answering SAS related questions is a good way to practice. Questions can come from daily work, from friends around you, or from other SAS users on the web. From 2013 to 2015, I spent a lot of time in the largest Chinese SAS online community answering SAS related questions and I learned many practical skills in a short period.
4. Make friends with skilled SAS programmers
Learning alone without interacting with others will lead to ignorance. I have learned a lot from other experienced SAS users and SAS developers. We share our ideas from time to time, and benefit a lot from the exchange.
This post was kindly contributed by Avocet Solutions - go there to comment and to read the full post. |
Recently The Little SAS Book reached a major milestone. For the first time ever, it was translated into another language. The language in this case was Chinese, and the translator was Hongqiu Gu, Ph.D. from the China National Clinical Research Center for Neurological Diseases at the National Center for Healthcare Quality Management in Neurological Diseases at Beijing Tiantan Hospital, Capital Medical University.
To mark this achievement, I asked Hongqiu a few questions.
Susan: First I want to say how honored I am that you translated our book. It must have been a lot of work. Receiving a copy of the translation was a highlight of the year for me. How did you learn SAS?
Hongqiu: How did I learn SAS? That is a long story. I had not heard of SAS before I took an undergraduate statistics course in 2005. The first time I heard the name “SAS,” I mistook it for SARS (Severe Acute Respiratory Syndrome). Although the pronunciations of these two words are entirely different for native English speakers, most Chinese people pronounced them as /sa:s/. At that time, I was not trying to learn SAS well, and I simply wanted to pass the exam. After the exam, all I had learned about SAS was entirely forgotten. However, during the preparation of my master’s thesis, I had to do a lot of data cleaning and data analysis work with SAS, and I began to learn SAS enthusiastically.
Susan: Why did you decide to translate The Little SAS Book?
Hongqiu: Although I highly recommend the SAS Reference Books for learning SAS, most beginners need a concise SAS book to give them a quick overview of what SAS is and what SAS can do. There is no doubt that The Little SAS Book is the best first SAS book for beginners. However, it was not easy for a Chinese SAS beginner to get a hardcopy of The Little SAS Book because it was not available in the Chinese market and the price was too high if they shopped overseas. Another barrier is the language. Most beginners still want an elementary book in their mother language. Besides, lots of R books had been introduced and translated into Chinese. Therefore, I believed there was an urgent need to translate this book into Chinese. So I tried several times to contact SAS Press to get permission to translate it into Chinese, but I received no reply. Things changed when manager Frank Jiang from SAS China found me after my book, The Romance of SAS Programming, was published by Tsinghua University Press.
Susan: How long did it take you to translate the book?
Hongqiu: First, I must state that the Chinese version of The Little SAS Book is a collaborative work. Manager Frank Jiang from SAS China together with managing editor Yang Liu from Tsinghua University Press did much early-stage work to start this project. We began the translation in early April 2017 and finished the translation in July 2017. After that, we took more than three months to complete the two rounds of cross-audit to make sure the translation was correct and typo errors were minimized.
Members of the translation team include Hongqiu Gu, Adrian Liu, Louanna Kong, Molly Li, Slash Xin, Nick Li, Zhixin Yang, Amy Qian, Wei Wang, and Ke Yang.
Members of the audit team include Silence Zeng, Mary Ma, Wei Wang, Jianping Xue, and Sikan Luan.
Susan: What was the hardest part of translating it?
Hongqiu: The book is written in plain English and easy to understand. We did not find any particular part that hard to translate.
Susan: Are there a lot of SAS users in China?
Hongqiu: There are a lot of SAS users in China. I’ve no idea what the exact number of SAS users in China is. With the increasing need for SAS users in medicine, life science, finance and banking industries, SAS users will become more and more prevalent.
Susan: Thank you for sharing your experiences. Perhaps someday we can meet in person at SAS Global Forum.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Does this situation sound familiar? You have a complex analysis that must be finished urgently. The data was delivered late and its quality and structure are far from the expected standard. The time pressure to present the results is huge, and your SAS program is not giving you the expected […]
The post Favorite SAS Press books from a SAS Press author appeared first on SAS Learning Post.
The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Numbers don’t lie, but sometimes they don’t reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community.
Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.
I believe that learning should not stop when we leave school.
If you, too, are a lifelong learner, the following topics deserve a second look. I’ve included articles from four different categories.
These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!
Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!
The post Deming regression for comparing different measurement methods appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Deming regression (also called errors-in-variables regression) is a total regression method that fits a regression line when the measurements of both the explanatory variable (X) and the response variable (Y) are assumed to be subject to normally distributed errors.
Recall that in ordinary least squares regression, the explanatory variable (X) is assumed to be measured without error. Deming regression is explained in a Wikipedia article and in a paper by K. Linnet (1993).
A situation in which both X and Y are measured with errors arises when comparing measurements from different instruments or medical devices. For example, suppose a lab test measures the amount of some substance in a patient’s blood. If you want to monitor this substance at regular intervals (for example, hourly), it is expensive, painful, and inconvenient to take the patient’s blood multiple times. If someone invents a medical device that goes on the patient’s finger and measures the substance indirectly (perhaps by measuring an electrical property such as bioimpedance), then that device would be an improved way to monitor the patient. However, as explained in Deal, Pate, and El Rouby (2009), the FDA would first need to approve the device and determine that it measures the response as accurately as the existing lab test. The FDA encourages the use of Deming regression for method-comparison studies.
There are several ways to compute a Deming regression in SAS.
The SAS FASTats site suggests maximum likelihood estimation (MLE) by using PROC OPTMODEL, PROC IML, or PROC NLMIXED. However, you can solve the MLE equations explicitly to obtain an explicit formula for the regression estimates.
Deal, Pate, and El Rouby (2009) present a rather complicated macro, whereas Njoya and Hemyari (2017) use simple SQL statements. Both papers also provide SAS code for estimating the variance of the Deming regression estimates, either by using the jackknife method or by using the bootstrap. However, the resampling schemes in both papers are inefficient because they use a macro loop to perform the jackknife or bootstrap.
The following SAS DATA step defines pairs of hypothetical measurements for 65 patients, each of whom received the standard lab test (measured in micrograms per deciliter) and the new noninvasive device (measured in kiloohms):

data BloodTest;
label x="micrograms per deciliter"
      y="kiloOhms";
input x y @@;
datalines;
169.0 45.5 130.8 33.4 109.0 23.8 94.1 19.8 86.3 20.4 78.4 18.7 76.1 16.1 72.2 16.7 70.0 11.9 69.8 14.6 69.5 10.6 68.7 12.7 67.3 16.9
174.7 57.8 137.9 39.0 114.6 30.4 99.8 21.1 90.1 21.7 85.1 25.2 80.7 20.6 78.1 19.3 77.8 20.9 76.0 18.2 77.8 18.3 74.2 15.7 73.1 13.9
182.5 55.5 144.0 38.7 123.8 35.1 107.6 30.6 96.9 25.7 92.8 19.2 87.2 22.4 86.3 18.4 84.4 20.7 83.7 20.6 83.3 20.0 83.9 18.8 82.7 21.8
160.8 49.9 122.7 32.2 102.6 19.2 86.6 14.7 76.1 16.6 69.6 18.8 66.7 7.4 64.4 8.2 63.0 15.5 61.7 13.7 61.2 9.2 62.4 12.0 58.4 15.2
171.3 48.7 136.3 36.1 111.9 28.6 96.5 21.8 90.3 25.6 82.9 16.8 78.1 14.1 76.5 14.2 73.5 11.9 74.4 17.7 73.9 17.6 71.9 10.2 72.0 15.6
;

title "Deming Regression";
title2 "Gold Standard (X) vs New Method (Y)";
proc sgplot data=BloodTest noautolegend;
   scatter x=x y=y;
   lineparm x=0 y=-10.56415 slope=0.354463 / clip;   /* Deming regression estimates */
   xaxis grid label="Lab Test (micrograms per deciliter)";
   yaxis grid label="New Device (kiloohms)";
run;
The scatter plot shows the pairs of measurements for each patient. The linear pattern indicates that the new device is well calibrated with the standard lab test over a range of clinical values. The diagonal line represents the Deming regression estimate, which enables you to convert one measurement into another.
For example, a lab test that reads 100 micrograms per deciliter is expected to correspond to 25 kiloohms on the new device and vice versa.
(If you want to convert the new readings into the old, you can regress X onto Y and plot X on the vertical axis.)
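As a small illustration of the conversion, the following DATA step applies the Deming regression estimates quoted in this article to a range of lab values. The data set and variable names (Convert, LabValue, DeviceValue) are hypothetical; the sketch converts only from the lab scale to the device scale.

```sas
/* Predict the device reading (kiloohms) from the lab value (micrograms/dL) */
data Convert;
   b0 = -10.56415;                     /* Deming intercept from this article */
   b1 = 0.354463;                      /* Deming slope from this article */
   do LabValue = 60 to 180 by 20;
      DeviceValue = b0 + b1*LabValue;
      output;
   end;
   keep LabValue DeviceValue;
run;

proc print data=Convert noobs;
   format DeviceValue 6.2;
run;
```

For a lab value of 100, the predicted device reading is about 24.9 kiloohms, which agrees with the rounded value of 25 in the text.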
The following SAS/IML function implements the explicit formulas that compute the slope and intercept of the Deming regression line:
/* Deming Regression in SAS */
proc iml;
start Deming(XY, lambda=);
   /* Equations from https://en.wikipedia.org/wiki/Deming_regression */
   m = mean(XY);
   xMean = m[1]; yMean = m[2];
   S = cov(XY);
   Sxx = S[1,1]; Sxy = S[1,2]; Syy = S[2,2];
   /* if lambda is specified (eg, lambda=1), use it. Otherwise, estimate. */
   if IsEmpty(lambda) then
      delta = Sxx / Syy;        /* estimate of ratio of variance */
   else delta = lambda;
   c = Syy - delta*Sxx;
   b1 = (c + sqrt(c**2 + 4*delta*Sxy**2)) / (2*Sxy);
   b0 = yMean - b1*xMean;
   return (b0 || b1);
finish;

/* Test the program on the blood test data */
use BloodTest; read all var {x y} into XY; close;
b = Deming(XY);
print b[c={'Intercept' 'Slope'} L="Deming Regression"];
The SAS/IML function can estimate the ratio of the variances of the X and Y variable. In the SAS macros by Deal, Pate, and El Rouby (2009) and Njoya and Hemyari (2017), the ratio is a parameter that is determined by the user. The examples in both papers use a ratio of 1, which assumes that the devices have an equal accuracy and use the same units of measurement. In the current example, the lab test and the electrical device use different units. The ratio of the variances for these hypothetical devices is about 7.4.
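For reference, the computations in the Deming function can be written as formulas. In this transcription of the code, s_xx and s_yy denote the sample variances of X and Y, s_xy denotes the sample covariance, and δ is either the user-supplied ratio (the lambda argument) or the estimate s_xx / s_yy that the function uses by default:

```latex
% Explicit Deming regression estimates, as computed by the SAS/IML function
c = s_{yy} - \delta\, s_{xx}, \qquad
\hat{\beta}_1 = \frac{c + \sqrt{c^{2} + 4\,\delta\, s_{xy}^{2}}}{2\, s_{xy}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\, \bar{x} .
```

The intercept formula implies that the fitted line passes through the point of means, just as an ordinary least squares line with an intercept does.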
You might wonder how accurate the parameter estimates are. Linnet (1993) recommends using the jackknife method to answer that question. I have previously explained how to jackknife estimates in SAS/IML, and the following program is copied from that article:
/* Helper modules for jackknife estimates of standard error and CI for parameters: */
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;

/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   return X[ SeqExclude(nrow(X), i), ];   /* return data without i_th row */
finish;

/* 1. Compute T = statistic on original data */
T = b;

/* 2. Compute statistic on each leave-one-out jackknife sample */
n = nrow(XY);
T_LOO = j(n,2,.);             /* LOO = "Leave One Out" */
do i = 1 to n;
   J = JackSamp(XY,i);
   T_LOO[i,] = Deming(J);
end;

/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );

/* 4. Compute jackknife estimates of standard error and CI */
stdErrJack = sqrt( (n-1)/n * (T_LOO - T_Avg)[##,] );
alpha = 0.05;
tinv = quantile("T", 1-alpha/2, n-2);   /* use df=n-2 b/c both x and y are estimated */
Lower = T - tinv#stdErrJack;
Upper = T + tinv#stdErrJack;
result = T` || T_Avg` || stdErrJack` || Lower` || Upper`;
print result[c={"Estimate" "Mean Jackknife Estimate" "Std Error" "Lower 95% CL" "Upper 95% CL"}
             r={'Intercept' 'Slope'}];
The formulas for the jackknife computation differ slightly from the SAS macro by Deal, Pate, and El Rouby (2009). Because both X and Y have errors, the t quantile must be computed by using n-2 degrees of freedom, not n-1.
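The jackknife computations in the program can be summarized by the following formulas, which are a transcription of the code. Here T is the estimate on the full data and T_{(i)} is the estimate when the i-th observation is left out (this notation is introduced here for illustration):

```latex
% Jackknife standard error and confidence limits, applied separately to the intercept and slope
\bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_{(i)}, \qquad
\mathrm{SE}_{\mathrm{jack}} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \bigl( T_{(i)} - \bar{T} \bigr)^{2}}, \qquad
T \pm t_{1-\alpha/2,\; n-2}\; \mathrm{SE}_{\mathrm{jack}} .
```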
If X and Y are measured on the same scale, then the methods are well-calibrated when the 95% confidence interval (CI) for the intercept includes 0 and the CI for the slope includes 1. In this example, the devices use different scales. The Deming regression line enables you to convert from one measurement scale to the other; the small standard errors (narrow CIs) indicate that this conversion is accurate.
In summary, you can use a simple set of formulas to implement Deming regression in SAS. This article uses SAS/IML to implement the regression estimates and the jackknife estimate of the standard errors. You can also use the macros that are mentioned in the section “Deming regression in SAS,” but the macros are less efficient, and you need to specify the ratio of the variances of the data vectors.
The post Top posts from The DO Loop in 2018 appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Last year, I wrote more than 100 posts for The DO Loop blog. Of these, the most popular articles were about data visualization, SAS programming tips, and statistical data analysis.
Here are the most popular articles from 2018 in each category.
I write this blog because I love to learn new things and share what I know with others.
If you want to learn something new, read (or re-read!) these popular articles from 2018. Then share this page with one of your colleagues.
Happy New Year! I hope we both have many opportunities to learn and share in 2019!