Accounting for Annonimization noise in Binary Target variable (Nlmixed)

This post was kindly contributed by BusinessKen - go there to comment and to read the full post.

I usually am emphatic that the best way to improve model quality is to GET BETTER DATA!. However, Yesterday I used advanced statistics (well advanced for the business world I am in) to address a data quality issue. The situation arises due to restrictions put by government on the use of prescription data put in the public domain. When the data reaches my computer it has been through several hands and on the was some intentional noise is introduced and the relevant information put into bands to protect the individual GP. The way the annonimization is introduced may not be treated as fully random and independent noise.
Usually I just define a binary target variable identifying the top 20% prescribers and fit a logistic regression explaining who are the big prescribers (or even better, who are the fastest growers). However, for a drug that has just been launched the rate of misscalsification in the data is too uncomfortable. About 30% of the prescribers are masked in the data handed to me as non-prescribers – Gahhhh! 
Lets say my binary target variable is called ‘Prescribed’ where 1 means at least one prescription in the period. It has two flavours the annonimized that I have and the true value that I wish I had. I can define two probabilities:
P1=P(True Prescibed=1|Annonimized Prescibed=1)
P0=P(True Prescibed=0|Annonimized Prescibed=0)
The less equal these probabilities are the more you should be worried even if both are relatively small.
If I disregard this issue and assuming I have only one independent variable called Var1 then I would naively do a logistic regression:
proc logistic data=Staging.Mart;
 model Prescibed(event=‘1’)=Var1;
 run;
Analysis of Maximum Likelihood Estimates
Parameter
DF
Estimate
Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept
1
-3.2095
0.0278
13351.1260
<.0001
Var1
1
0.0848
0.00257
1087.7234
<.0001
I could use NLMIXED to achieve the same analysis:
proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 model Prescibed ~ binary(p);
 run;
Parameter Estimates
Parameter
Estimate
Standard Error
DF
t Value
Pr > |t|
Alpha
Lower
Upper
Gradient
b0
-3.2095
0.02778
37E3
-115.55
<.0001
0.05
-3.2639
-3.1551
-0.00003
b1
0.08480
0.002571
37E3
32.98
<.0001
0.05
0.07976
0.08984
-0.00291
The code may be modified slightly to do the same job but allow for further flexability – a general log likelihood definition:
proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 ll=Prescibed*log(p)+(1-Prescibed)*log(1-p);
 model Prescibed ~ general(ll);
 run;
Now I can introduce the noise probabilities P1 and P0:
proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 P1=0.99; *P(True Prescibed=1|Prescibed=1);
 P0=0.90; *P(True Prescibed=0|Prescibed=0);
 if Prescibed=1 then ll=(log(p)+0*log(1-p))*P1+
                     (0*log(p)+1*log(1-p))*(1-P1);
             else ll=(log(p)+0*log(1-p))*(1-P0)+
                     (0*log(p)+1*log(1-p))*P0;
 model Prescibed ~ general(ll);
 run;
Parameter Estimates
Parameter
Estimate
Standard Error
DF
t Value
Pr > |t|
Alpha
Lower
Upper
Gradient
b0
-1.8449
0.01555
37E3
-118.61
<.0001
0.05
-1.8753
-1.8144
-0.00004
b1
0.03795
0.001994
37E3
19.03
<.0001
0.05
0.03404
0.04186
-0.00089
The parameter estimates have changed significantly:

 
Another fun day at the office.

This post was kindly contributed by BusinessKen - go there to comment and to read the full post.