PROC VARCLUS is a SAS procedure for variable reduction. It is based on a divisive clustering technique.
The HI option requires the clusters at different levels to maintain a hierarchical structure: once a split is made, variables cannot transfer from one cluster to another. In other words, a variable cannot be reassigned to a different branch after the split.
- All variables start in one cluster. Then, a principal components analysis is done on the variables in the cluster to determine whether the cluster should be split into two subsets of variables.
- If the second eigenvalue for the cluster is greater than the specified cutoff, then the initial cluster is split into two clusters. A large second eigenvalue means that at least two principal components account for a large amount of variation among the inputs.
- To determine which inputs are included in each cluster, the first two principal components are rotated obliquely, and each variable is assigned to the rotated component with which it has the higher squared correlation. This maximizes the correlation within a cluster and minimizes the correlation between clusters.
- This process ends when the second eigenvalues of all current clusters fall below the cutoff.
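As a rough check of the split criterion described above, the eigenvalues of the correlation matrix can be inspected before clustering. The sketch below assumes the `imputed` data set and the variable list used in this article's PROC VARCLUS call. It only shows the eigenvalues for the starting cluster of all variables and does not reproduce the later splits.

```sas
/* Sketch: look at the eigenvalues that drive the first split decision. */
/* Assumes the IMPUTED data set and variable names used in this article. */
ods select Eigenvalues;            /* print only the eigenvalue table */
proc princomp data=imputed;
   var Q1-Q5 VAR1-VAR20;
run;
```

If the second eigenvalue in this table is above the MAXEIGEN cutoff, the first split would be expected to take place.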
proc varclus data=imputed maxeigen=.7 short hi;
var Q1-Q5 VAR1-VAR20;
run;
The MAXEIGEN option specifies the largest permissible value of the second eigenvalue in each cluster. The default is 1 when the correlation matrix is analyzed and neither PROPORTION= nor MAXCLUSTERS= is specified.
The SHORT option suppresses some of the output generated by PROC VARCLUS.
Important Points
- By default, the maximum number of clusters is equal to the number of variables in the model. The MAXCLUSTERS option can be used to specify the largest number of clusters desired. It's better not to specify the option and let SAS decide the number of clusters.
- By default, PROC VARCLUS uses a non-hierarchical version of this algorithm, in which variables can also be reassigned to other clusters. The HI option is used to run the hierarchical version.
- Larger eigenvalue thresholds result in fewer clusters, and smaller thresholds yield more clusters.
- Variables belonging to different clusters may still be correlated, because VARCLUS is a type of oblique component analysis.
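The effect of these options can be tried out with a sketch like the following, which assumes the same `imputed` data set and variable list as the example above. The values 1 and 8 are arbitrary illustrations, not recommendations.

```sas
/* Higher threshold than .7: expect the same number of clusters or fewer */
proc varclus data=imputed maxeigen=1 short hi;
   var Q1-Q5 VAR1-VAR20;
run;

/* Cap the number of clusters instead of relying on the eigenvalue cutoff */
proc varclus data=imputed maxclusters=8 short hi;
   var Q1-Q5 VAR1-VAR20;
run;
```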
How to select best variables from each cluster
The best variable in a cluster has a high correlation with its own cluster and a low correlation with the other clusters.
A variable that has the lowest 1 - R**2 ratio is likely to be a good representative for the cluster.
Why lowest 1 - R**2 ratio?
The ratio is smallest when a variable has its maximum correlation with its own cluster and its minimum correlation with the next closest cluster. The formula is:
1 - R**2 ratio = (1 - R**2 with own cluster) / (1 - R**2 with next closest cluster)
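A small arithmetic sketch may make the ratio concrete. The two variables and their R**2 values below are made up for illustration and do not come from any real PROC VARCLUS output.

```sas
/* Hypothetical R**2 values: A fits its own cluster well, B does not */
data ratio_example;
   input variable $ own_cluster next_closest;
   /* 1 - R**2 ratio = (1 - own cluster R**2) / (1 - next closest R**2) */
   ratio = (1 - own_cluster) / (1 - next_closest);
   datalines;
A 0.80 0.20
B 0.60 0.50
;
run;

proc print data=ratio_example noobs;
run;
```

By the formula, A gets 0.2 / 0.8 = 0.25 and B gets 0.4 / 0.5 = 0.8, so A would be the better representative of its cluster.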
SAS Macro for Variable Selection
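%macro varsel(input=, vars=, output=);
/* Run PROC VARCLUS quietly and capture its ODS tables */
ods select none;
ods output clusterquality=summary
           rsquare=clusters;
proc varclus data=&input maxeigen=.7 short hi;
   var &vars;
run;
ods select all;

/* The last row of SUMMARY holds the final number of clusters */
data _null_;
   set summary;
   call symputx('nvar', NumberOfClusters);
run;

/* Keep only the R-square rows for the final cluster solution */
data selvars;
   set clusters (where=(NumberOfClusters=&nvar));
   keep Cluster Variable RSquareRatio;
run;

/* Cluster can be blank on rows after the first row of each cluster, */
/* so carry the last non-missing value forward                       */
data cv / view=cv;
   retain dummy 1;
   set selvars;
   keep dummy cluster;
run;
data filled;
   update cv(obs=0) cv;
   by dummy;
   set selvars(drop=cluster);
   output;
   drop dummy;
run;

/* Within each cluster, the lowest 1 - R**2 ratio comes first */
proc sort data=filled;
   by cluster RSquareRatio;
run;

/* Keep the first (best) variable of each cluster */
data &output;
   set filled (rename=(variable=Best_Variables));
   by cluster;
   if first.cluster then output;
run;
%mend;
%varsel(input=abc, vars=_numeric_, output=rest);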
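To try the macro on a concrete data set, a call along these lines could be used. This is only a sketch: it assumes the SASHELP sample library (with its CARS data set) is available in your session, and `best_vars` is a hypothetical output name chosen for illustration.

```sas
/* Sketch: run the macro on a SASHELP sample data set (assumed available) */
%varsel(input=sashelp.cars, vars=_numeric_, output=best_vars);

/* One row per cluster: the variable with the lowest 1 - R**2 ratio */
proc print data=best_vars noobs;
run;
```

The clusters and selected variables depend on the MAXEIGEN value hard-coded in the macro, so the result is worth reviewing rather than accepting blindly.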
Same Variable Selection Technique (VARCLUS) in R
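install.packages("Hmisc")                # one-time install
library(Hmisc)
v = varclus(x, similarity="spearman")    # x: numeric matrix of candidate variables
plot(v)                                  # dendrogram of the variable clusters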