Tag: data mining

Learn about new data mining and machine learning procedures in SAS Viya

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and […]


Foresight is 2020! New books to take your skills to the next level

Are you ready to get a jump start on the new year? If you’ve been wanting to brush up your SAS skills or learn something new, there’s no time like a new decade to start! SAS Press is releasing several new books in the upcoming months to help you stay […]


SAS introduces the blended classroom

We all have different learning styles. Some learn best by seeing and doing; others by listening to lectures in a traditional classroom; still others simply by diving in and asking questions along the way. Traditional face-to-face classroom instruction,…

Flexibility of SAS Enterprise Miner

Do you use an array of tools to perform predictive analytics on your data? Is your current tool not flexible enough to accommodate some of your requirements? SAS Enterprise Miner may be your solution. With a growing number of data mining applications, having a tool that can perform a variety of analyses […]


sklearn DecisionTree plot example needs pydotplus

In Python, sklearn (scikit-learn)’s DecisionTree example uses pydot for plotting the generated tree: @here. But under Python 3, pydot has some issues with the string from dot_data.getvalue(); for example, it will report “TypeError: startswith first arg mus…

Use recursion and gradient ascent to solve logistic regression in Python

This post was kindly contributed by DATA ANALYSIS – go there to comment and to read the full post. In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to realize a recursive…

PROC PLS and multicollinearity

Multicollinearity and its consequences

Multicollinearity poses significant challenges to a regression model, whether the coefficients are estimated by the normal equation or by gradient descent.

1. Non-invertible SSCP for the normal equation

According to the normal equation, the coefficients are obtained by \hat{\beta} = (X'X)^{-1}X'y. If the SSCP matrix X'X is singular (and therefore non-invertible) due to multicollinearity, then the coefficients are theoretically not solvable.
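A minimal SAS/IML sketch (with made-up numbers, not from the baseball example below) shows the failure directly: when one regressor is an exact multiple of another, the determinant of X'X is zero and the inverse does not exist.

proc iml;
   x1 = {1, 2, 3, 4, 5};
   x2 = 2 * x1;                    /* x2 is perfectly collinear with x1 */
   X = j(5, 1, 1) || x1 || x2;     /* design matrix: intercept, x1, x2 */
   sscp = X` * X;                  /* the SSCP matrix X'X */
   print sscp;
   print (det(sscp))[label="determinant of SSCP"];  /* 0: not invertible */
quit;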

2. Unstable solution for gradient descent

The gradient descent algorithm uses iterative updates to minimize the residual sum of squares (RSS). When there is a strong relationship between two regressors, many combinations of \beta_1 and \beta_2 lie along a narrow valley of the RSS surface, all corresponding to nearly the minimal RSS. Thus \beta_1 can be negative, positive, or even zero, which makes it hard to obtain a stable model.
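To make this concrete, write out the standard gradient descent update on RSS(\beta) = (y - X\beta)'(y - X\beta), with learning rate \alpha:

\beta^{(t+1)} = \beta^{(t)} + 2\alpha X'(y - X\beta^{(t)})

Since the step is driven by X'X, near-collinear columns make X'X ill-conditioned: the gradient is nearly zero along the valley floor, so very different coefficient combinations fit almost equally well and the iterates can settle anywhere along the valley.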

Partial Least Squares vs. Principal Components Regression

The most direct way to deal with multicollinearity is to break down the regressors and construct new orthogonal variables. PLS and PCR are both dimension reduction methods that eliminate multicollinearity. The difference is that PLS also uses the response variable to construct the new components, which makes it particularly useful for problems with multiple response variables. The PLS procedure in SAS is a powerful and flexible tool that implements both PLS and PCR. One book, An Introduction to Statistical Learning, suggests that PLS offers no clear advantage over PCR:
“While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.”
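Concretely, the two methods differ in how each new direction w is chosen (this is the standard formulation, not specific to PROC PLS):

w_{PCR} = \arg\max_{\|w\|=1} \mathrm{Var}(Xw)
w_{PLS} = \arg\max_{\|w\|=1} \mathrm{Cov}(Xw, y)^2

So PCR is unsupervised in its choice of components, while PLS is supervised by y; subsequent directions are constrained to be orthogonal to the earlier ones.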
In the example below, which fits the baseball data set with 10-fold cross-validation, PLS chooses 9 components, while PCR picks 5.
/* Read the baseball data set directly from the URL */
filename myfile url 'https://svn.r-project.org/ESS/trunk/fontlock-test/baseball.sas';
%include myfile;

/* Capture the variable positions so the regressors can be selected by position */
proc contents data=baseball position;
   ods output position = pos;
run;

/* Build a macro variable holding the candidate regressor names */
proc sql;
   select variable into :regressors separated by ' '
   from pos
   where num between 5 and 20;
quit;
%put &regressors;

/* Log-transform the response */
data baseball_t;
   set baseball;
   logsalary = log10(salary);
run;

/* Partial least squares with 10-fold cross-validation */
proc pls data=baseball_t censcale nfac=10 cv=split(10);
   title 'partial least squares';
   model logsalary = &regressors;
run;

/* Principal components regression with the same settings */
proc pls data=baseball_t censcale method=pcr nfac=10 cv=split(10);
   title 'principal components regression';
   model logsalary = &regressors;
run;
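As a hypothetical follow-up (the NFAC=9 below simply assumes the PLS cross-validation result quoted above), the model could be refit with the selected number of factors, requesting the coefficient estimates:

/* Refit PLS with the cross-validated number of factors */
proc pls data=baseball_t censcale nfac=9 details;
   title 'partial least squares, refit with 9 factors';
   model logsalary = &regressors / solution;  /* SOLUTION prints the coefficients */
run;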