Example 14-2: Woodyard Hammock Data
SAS uses the Euclidean distance metric, while Minitab can also use Manhattan, Pearson, Squared Euclidean, and Squared Pearson distances. Both SAS and Minitab use only agglomerative clustering.
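To make the distance options concrete, here is a minimal Python sketch of three of the metrics named above. The two sites and their values are made up for illustration and are not taken from wood.txt. The Pearson variants are omitted here.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two sites."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Sum of absolute differences across variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical measurements on three variables at two sites.
site_a = [3.0, 0.0, 4.0]
site_b = [0.0, 0.0, 0.0]

print(euclidean(site_a, site_b))          # 5.0
print(squared_euclidean(site_a, site_b))  # 25.0
print(manhattan(site_a, site_b))          # 7.0
```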
Download the text file containing the data here: wood.txt
In SAS, cluster analysis is carried out with the cluster procedure. We will look at how this is done in the SAS program below.
Download the SAS Program here: wood1.sas
Dendrograms (Tree Diagrams)
The results of cluster analysis are best summarized using a dendrogram. In a dendrogram, distance is plotted on one axis, while the sample units are given on the remaining axis. The tree shows how sample units are combined into clusters, the height of each branching point corresponding to the distance at which two clusters are joined.
In looking at the cluster history section of the SAS (or Minitab) output, we see that the Euclidean distance between sites 33 and 51 was smaller than between any other pair of sites (clusters). Therefore, this pair of sites was clustered first in the tree diagram. Following the clustering of these two sites, there are a total of n - 1 = 71 clusters, and so, the cluster formed by sites 33 and 51 is designated "CL71".
The Euclidean distance between sites 15 and 23 was smaller than between any other pair of the 70 unclustered sites or the distance between any of those sites and CL71. Therefore, this pair of sites was clustered second. Its designation is "CL70".
In the seventh step of the algorithm, the distance between site 8 and cluster CL67 was smaller than the distance between any pair of unclustered sites and the distances between those sites and the existing clusters. Therefore, site 8 was joined to CL67 to form the cluster of 3 sites designated as CL65.
The clustering algorithm is completed when clusters CL2 and CL5 are joined.
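The cluster history described above can be imitated with a short Python sketch. It is an illustration only, not the SAS or Minitab implementation. It assumes complete linkage and Euclidean distance, uses five made-up points rather than the 72 Woodyard Hammock sites, and the function names are hypothetical. Each new cluster is named "CLk", where k is the number of clusters remaining after the merge, mirroring the CL71, CL70, ... labels in the output.

```python
import math
from itertools import combinations

def complete_linkage(c1, c2, points):
    """Largest pairwise distance between members of two clusters."""
    return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

def cluster_history(points, linkage=complete_linkage):
    """Merge the closest pair of clusters until one cluster remains."""
    # Map each cluster name to the indices of its member points.
    clusters = {str(i + 1): (i,) for i in range(len(points))}
    history = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        d = linkage(clusters[a], clusters[b], points)
        members = clusters.pop(a) + clusters.pop(b)
        # Name the new cluster by the number of clusters left after merging.
        name = f"CL{len(clusters) + 1}"
        clusters[name] = members
        history.append((name, a, b, len(members), d))
    return history

# Five made-up sites measured on two variables.
points = [(0, 0), (0, 1), (5, 0), (5, 3), (20, 0)]
history = cluster_history(points)
for name, a, b, size, d in history:
    print(f"{name}: joins {a} and {b} ({size} sites) at distance {d:.2f}")
```

With n = 5 points, the first merge is labeled CL4 (n - 1 clusters remain) and the last is CL1. The distance recorded at each step is the height at which the corresponding branch would be drawn in a dendrogram.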
The plot below is generated by Minitab. In SAS the diagram is horizontal. The color scheme depends on how many clusters are created (discussed later).
What do you do with the information in this tree diagram?
We decide the optimum number of clusters and which clustering technique to use. We adapted the wood1.sas program to specify the use of the other clustering techniques. Here are links to these program changes. In Minitab, you may likewise select a linkage method other than single linkage from the appropriate box.
|wood1.sas||specifies complete linkage|
|wood2.sas||is identical, except that it uses average linkage|
|wood3.sas||uses the centroid method|
|wood4.sas||uses single linkage|
As we run each of these programs, we must keep in mind that our goal is a good description of the data.
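The four methods differ only in how the distance between two clusters is defined. The Python sketch below shows the usual textbook definitions on two made-up clusters. The function names are hypothetical, and the exact conventions used by SAS or Minitab (for example, whether distances are squared) may differ.

```python
import math

def single_linkage(c1, c2):
    """Distance between the two closest members."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the two farthest members."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Mean of all between-cluster pairwise distances."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the cluster mean vectors."""
    def centroid(cluster):
        return [sum(coords) / len(cluster) for coords in zip(*cluster)]
    return math.dist(centroid(c1), centroid(c2))

# Two made-up clusters of two sites each.
cluster_a = [(0.0, 0.0), (0.0, 4.0)]
cluster_b = [(3.0, 0.0), (6.0, 4.0)]

for method in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(method.__name__, round(method(cluster_a, cluster_b), 3))
```

For any pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.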
Applying the Cluster Analysis Process
First, we compare the results of the different clustering algorithms.
To arrive at the optimum number of clusters, we may follow this simple guideline: for each method, choose the number of clusters by finding a breakpoint (distance) in the dendrogram below which further branching is ignored. In practice, this is not necessarily straightforward. You will need to try several different cut points to see which gives the most decisive partition. Here are the results of this type of partitioning using the different clustering methods on the Woodyard Hammock data.
|Complete Linkage||Partitioning into 6 clusters yields clusters of sizes 3, 5, 5, 16, 17, and 26.|
|Average Linkage||Partitioning into 5 clusters would yield 3 clusters containing only a single site each.|
|Centroid Linkage||Partitioning into 6 clusters would yield 5 clusters containing only a single site each.|
|Single Linkage||Partitioning into 7 clusters would yield 6 clusters containing only 1-2 sites each.|
For this example, complete linkage yields the most satisfactory result.
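The comparison of cluster sizes can be sketched in Python as follows. This is a toy illustration, not a reanalysis of the Woodyard Hammock data. The seven one-dimensional points are made up so that the gaps between neighbors grow steadily, and the function names are hypothetical. Merging stops once k clusters remain, which corresponds to cutting the tree at a breakpoint.

```python
import math
from itertools import combinations

def single(c1, c2):
    """Single linkage: smallest pairwise distance."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete(c1, c2):
    """Complete linkage: largest pairwise distance."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def cluster_sizes(points, k, linkage):
    """Agglomerate until k clusters remain; return the sorted cluster sizes."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Indices (i < j) of the closest pair of current clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return sorted(len(c) for c in clusters)

# Made-up sites on one variable, with gaps of 10, 11, 12, 13, 14, 15.
points = [(0,), (10,), (21,), (33,), (46,), (60,), (75,)]

print("single:  ", cluster_sizes(points, 3, single))    # [1, 1, 5]
print("complete:", cluster_sizes(points, 3, complete))  # [2, 2, 3]
```

On these toy points, single linkage chains most sites into one large cluster and leaves two singletons, while complete linkage gives more balanced groups. This is the same kind of pattern reported in the table above.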