This post was kindly contributed by SAS Analysis - go there to comment and to read the full post. |
Business understanding: A healthcare company would like to know how a patient's health status and personal information influence his or her total healthcare expense.
Data understanding: The target variable is the total health cost (an interval variable). In addition, there are 40 input variables, such as sex, age, income, marital status, disease indicators, etc. A multiple linear regression is used to predict the total health cost from the patient's overall situation.
Data preparation: The data have already been cleaned. Since the target variable is highly skewed, a log transformation is applied to it. In each variable, any missing or inapplicable value is coded as ‘-1’. Some variables, such as education level, are aggregated into ordinal variables.
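The log transformation described above can be sketched in a short DATA step. This is only an illustration under stated assumptions: the input data set name USHETH_RAW and the raw cost variable totalexp are hypothetical names that do not appear in the post, and the post does not say whether the ‘-1’ codes were recoded before modeling, so that step is shown as an optional choice for one continuous input.

```sas
/* Hypothetical names: USHETH_RAW and totalexp are assumptions, not from the post */
data USHETH;
  set USHETH_RAW;

  /* The log is undefined for zero or negative costs, so guard the transform */
  if totalexp > 0 then log_totalexp = log(totalexp);
  else log_totalexp = .;

  /* Optional (an assumption): treat the -1 code in a continuous input as missing */
  if total_income = -1 then total_income = .;
run;
```

Rows with a missing response or a missing input are left out of the fit, so whether to recode ‘-1’ or keep it as its own level depends on how much data that would discard.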
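Modeling: PROC GLMSELECT is used to build the linear model. The raw data are randomly split into two equal parts, one for training and one for validation. A stepwise method adds or removes effects according to their significance level, and selection stops once a further step would no longer reduce the validation error.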
Model evaluation: The fit statistics on the validation dataset are used to evaluate the performance of the model. The final model is the one determined by the validation average squared error (ASE).
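For reference, the validation ASE can be written out as follows. The symbols are introduced here for illustration: V is the set of validation rows, n_V is their count, y_i is the observed log_totalexp and \hat{y}_i is the model's prediction for row i.

```latex
\mathrm{ASE}_{\text{validate}}
  = \frac{1}{n_V} \sum_{i \in V} \left( y_i - \hat{y}_i \right)^2
```

Because the response is on the log scale, this error is measured in squared log-cost units, not in squared dollars.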
Deployment: The fitted GLM equation can be output and applied to new data with PROC SCORE in the future.
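The post names PROC SCORE for deployment. A different route, shown here only as a sketch and not as the author's method, is the SCORE statement inside PROC GLMSELECT, which applies the selected model to new rows in the same step and handles CLASS effects directly. NEW_PATIENTS and scored_patients are hypothetical data set names, and the model is shortened to four of the inputs for readability.

```sas
/* Sketch only: NEW_PATIENTS and scored_patients are hypothetical data sets */
proc glmselect data=USHETH seed=93093;
  partition fraction(validate=0.5);
  class sex marital_status;
  model log_totalexp = sex age marital_status total_income
        / selection=stepwise(stop=validate select=sl);
  /* Apply the selected model to new rows; the prediction is p_log_totalexp by default */
  score data=NEW_PATIENTS out=scored_patients;
run;

/* Simple back-transform to the cost scale; it ignores retransformation bias */
data scored_patients;
  set scored_patients;
  pred_totalexp = exp(p_log_totalexp);
run;
```

The new data set needs the same input variables, coded the same way, as the data used to fit the model.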
Acknowledgment: Dr. Goutam Chakraborty, Department of Marketing, Oklahoma State University.
ods html;
ods graphics on;
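/* Fit and select the model; the seed makes the random partition reproducible */
proc glmselect data=USHETH seed=93093 plots(stepaxis=number)=all;
/* Hold out a random 50% of the rows as validation data */
partition fraction(validate=0.5);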
class sex census_region marital_status years_educ highest_degree served_armed_forces foodstamps_purchase more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis diabetes_diag_binary
;
model log_totalexp=sex census_region age marital_status years_educ highest_degree served_armed_forces foodstamps_purchase total_income more_than_one_job wears_eyeglasses person_blind wear_hearing_aid is_deaf person_weight numb_visits dental_checkup cholest_lst_chck last_checkup last_flushot lost_all_teeth last_psa last_pap_smear last_breast_exam last_mammogram bld_stool_tst sigmoidoscopy_colonoscopy wear_seat_belt high_blood_pressure_diag heart_disease_diag angina_diagnosis heart_attack other_heart_disease stroke_diagnosis emphysema_diagnosis joint_pain currently_smoke asthma_diagnosis child_bmi adult_bmi diabetes_diag_binary
/selection=stepwise(stop=validate select=sl)
;
/* Save the input rows together with the predicted values from the selected model */
output out=outdata;
run;
ods graphics off;
ods html close;