SAS : Proc Varclus Explained

The VARCLUS procedure is a useful SAS procedure for variable reduction. It is based on a divisive clustering technique.

All variables start in one cluster. Then, a principal components analysis is done on the variables in the cluster to determine whether the cluster should be split into two subsets of variables.
If the second eigenvalue for the cluster is greater than the specified cutoff, the initial cluster is split into two clusters. A large second eigenvalue means that at least two principal components account for a large amount of the variation among the inputs.
To determine which inputs are included in each cluster, the principal component scores are rotated obliquely to maximize the correlation within a cluster and minimize the correlation between clusters.
This process ends when the second eigenvalues of all current clusters fall below the cutoff.

A cluster that contains only one variable has a single principal component, so its second eigenvalue is 0 and it cannot be split further.
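The split rule described above can be illustrated numerically. The Python sketch below is not what PROC VARCLUS runs internally. It only takes a correlation matrix, computes its eigenvalues with a small Jacobi routine, and compares the second eigenvalue with the cutoff. The matrices, the function names and the 0.7 cutoff are assumptions made up for this example.

```python
import math

def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-20):
    """Eigenvalues of a symmetric matrix (cyclic Jacobi rotations), largest first."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        # Stop once the off-diagonal entries are numerically zero
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) entry
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def second_eigenvalue(corr):
    """Second largest eigenvalue; 0 for a single-variable cluster."""
    eig = symmetric_eigenvalues(corr)
    return eig[1] if len(eig) > 1 else 0.0

def should_split(corr, maxeigen=0.7):
    """Split rule from the text: split when the second eigenvalue exceeds the cutoff."""
    return second_eigenvalue(corr) > maxeigen

# Hypothetical correlation matrices (illustration only, not real data)
two_groups = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
tight_pair = [[1.0, 0.8], [0.8, 1.0]]
single = [[1.0]]

for name, corr in [("two_groups", two_groups), ("tight_pair", tight_pair), ("single", single)]:
    print(name, round(second_eigenvalue(corr), 4), should_split(corr))
```

With these made-up matrices, the four-variable cluster has a second eigenvalue of 1.6, which is above the 0.7 cutoff, so it would be split. The two-variable cluster has a second eigenvalue of 0.2 and stays together, and the single-variable cluster has a second eigenvalue of 0.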

proc varclus data=imputed maxeigen=.7 short hi;
var Q1-Q5 VAR1-VAR20;
run;

The MAXEIGEN option specifies the largest permissible value of the second eigenvalue in each cluster. The default value is 1 when the correlation matrix is analyzed and neither PROPORTION= nor MAXCLUSTERS= is specified.

The HI option specifies that the clusters at different levels maintain a hierarchical structure, which prevents variables from moving from one cluster to another after a split is made. In other words, once a variable has been assigned to a cluster, it cannot be reassigned to a different one.

The SHORT option suppresses some of the output generated by PROC VARCLUS.

Important Points

By default, the maximum number of clusters is equal to the number of variables in the model. The MAXCLUSTERS option can be used to specify the largest number of clusters desired. It is usually better to leave this option unspecified and let SAS decide the number of clusters.
By default, PROC VARCLUS uses a non-hierarchical version of this algorithm, in which variables can also be reassigned to other clusters. The HI option is used to run the hierarchical version.
Larger eigenvalue thresholds result in fewer clusters, and smaller thresholds yield more clusters.
Variables belonging to different clusters may still be correlated, because PROC VARCLUS is a type of oblique component analysis.

How to select best variables from each cluster

The best variable in a cluster has a high correlation with its own cluster and a low correlation with the other clusters.

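The variable with the lowest 1 - R-squared ratio in a cluster is likely to be a good representative of that cluster. A low ratio means a high R-squared with its own cluster and a low R-squared with the next closest cluster, that is, the other cluster with which the variable has its highest R-squared.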

Why lowest 1 - R-squared ratio?

The ratio is smallest when a variable has a high R-squared with its own cluster and a low R-squared with the next closest cluster, as the formula shows:

1 - R**2 ratio = (1 - R**2 own cluster) / (1 - R**2 next closest cluster)
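As a small worked example of this ratio, the Python sketch below computes it for a handful of variables and keeps the variable with the lowest ratio in each cluster. The R-squared values, variable names and function names are hypothetical and chosen only for illustration. They are not PROC VARCLUS output.

```python
# Hypothetical R-squared values (illustration only):
# (cluster, variable, R-squared with own cluster, R-squared with next closest cluster)
rows = [
    ("Cluster 1", "Q1", 0.85, 0.10),
    ("Cluster 1", "Q2", 0.60, 0.30),
    ("Cluster 2", "VAR1", 0.90, 0.20),
    ("Cluster 2", "VAR2", 0.75, 0.05),
]

def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R-squared own cluster) / (1 - R-squared next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)

def best_per_cluster(rows):
    """Return {cluster: (variable, ratio)} for the lowest ratio in each cluster."""
    best = {}
    for cluster, variable, r2_own, r2_next in rows:
        ratio = one_minus_r2_ratio(r2_own, r2_next)
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (variable, ratio)
    return best

best = best_per_cluster(rows)
for cluster, (variable, ratio) in sorted(best.items()):
    print(cluster, variable, round(ratio, 4))
```

With these made-up numbers, Q1 has a ratio of 0.15 / 0.90 (about 0.1667) against about 0.5714 for Q2, and VAR1 has 0.10 / 0.80 = 0.125 against about 0.2632 for VAR2, so Q1 and VAR1 would be chosen as the cluster representatives.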

SAS Macro for Variable Selection

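%macro varsel(input=, vars=, output=);
/* Capture the cluster summary and R-squared tables without printing any output */
ods select none;
ods output clusterquality=summary
           rsquare=clusters;

proc varclus data=&input maxeigen=.7 short hi;
   var &vars;
run;
ods select all;

/* Store the final number of clusters (last row of the summary) in the nvar macro variable */
data _null_;
   set summary;
   call symputx('nvar', NumberOfClusters);
run;

/* Keep only the rows that describe the final cluster solution */
data selvars;
   set clusters (where = (NumberOfClusters=&nvar));
   keep Cluster Variable RSquareRatio;
run;

/* Fill in the Cluster label on rows where it is blank by carrying the last label forward */
data cv / view=cv;
   retain dummy 1;
   set selvars;
   keep dummy cluster;
run;

data filled;
   update cv(obs=0) cv;
   by dummy;
   set selvars(drop=cluster);
   output;
   drop dummy;
run;

/* Sort so that the lowest 1 - R-squared ratio comes first within each cluster */
proc sort data = filled;
   by cluster RSquareRatio;
run;

/* Keep the first (best) variable of each cluster */
data &output;
   set filled (rename = (variable = Best_Variables));
   by cluster;
   if first.cluster then output;
run;

%mend varsel;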

%varsel(input= abc, vars= _numeric_ , output = rest);

Same Variable Selection Technique (Varclus) in R

install.packages("Hmisc")
library(Hmisc)
# x is a numeric matrix or data frame of the candidate variables
v = varclus(x, similarity="spearman")


About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

