14.9 - Defining Initial Clusters

14.9 - Defining Initial Clusters

Now that you have a good idea of what is going to happen, we need to go back to our original question for this method... How should we define the initial clusters? Again, there are two main approaches that are taken to define the initial clusters.

1st Approach: Random assignment

The first approach is to assign the clusters randomly. This does not seem like it would be a very efficient approach. The main reason to take this approach would be to avoid any bias in this process.

2nd Approach: Leader Algorithm

The second approach is to use a Leader Algorithm. (Hartigan, J.A., 1975, Clustering Algorithms). This involves the following procedure:

  • Step 1. Select the first item from the list. This item forms the centroid of the initial cluster.
  • Step 2. Search through the subsequent items until an item is found that is at least distance δ away from any previously defined cluster centroid. This item will form the centroid of the next cluster.
  • Step 3: Step 2 is repeated until all K cluster centroids are obtained or no further items can be assigned.
  • Step 4: The initial clusters are obtained by assigning items to the nearest cluster centroids.

The following video illustrates this procedure for k = 4 clusters and p = 2 variables plotted in a scatter plot:

Example 14-5: Woodyard Hammock Data (Initial Clusters)

Now, let's take a look at each of these options, in turn, using our Woodyard Hammock dataset.

We first must determine:

  • The number of clusters K
  • The radius \(δ\) for the leader algorithm.

In some applications, the theory specific to the discipline may suggest reasonable values for K.  In general, however, there is no prior knowledge that can be applied to find K.  Our approach is to apply the following procedure for various values of K.  For each K, we obtain a description of the resulting clusters. The value of K is then selected to yield the most meaningful description. We wish to select K large enough so that the composition of the individual clusters is uniform, but not so large as to yield too complex a description for the resulting clusters.

Here, we shall take K = 4 and use the random assignment approach to find a reasonable value for \(δ\).

Using SAS

This random approach is implemented in SAS using the following program below.

Download the SAS Program here: wood6.sas

We use the fastclus procedure, which stands for fast cluster analysis. This is designed specifically to develop results quickly especially with very large datasets. Remember, unlike the previous cluster analysis methods, we will not get a tree diagram out of this procedure.

We need to first specify the number of clusters that we want to include.  In this case, we ask for four clusters. Then, we set replace=random, indicating the initial cluster centroids will be randomly selected from the study subjects (sites).

Using Minitab

Click on the video below to see how to perform a cluster analysis using the K-means procedure in Minitab's statistical software.

When you run this program, you will always get different results because a different random set of subjects is selected each time.

The first part of the output gives the initial cluster centers.  SAS picks four sites at random and lists how many species of each tree are at each site.

The procedure works iteratively until no reassignments can be obtained. The following table was copied from the SAS output for discussion purposes.

Cluster Maximum Point to Centroid Distance Nearest Cluster Distance to Closest Cluster
1 21.1973 3 16.5910
2 20.2998 3 13.0501
3 22.1861 2 13.0501
4 23.1866 3 15.8186

In this case, we see that cluster 3 is the nearest neighboring cluster to cluster 1 and the distance between those two clusters is 16.591.

To set delta for the leader algorithm, we want to pay attention to maximum distances between the cluster centroids and the furthest site in that cluster. We can see that all of the maximum distances exceed 20.  Based on these results, we set the radius \(δ = 20\).

Now, we can turn to SAS program below where this radius \(δ\) value is used to run the Leader Algorithmic approach.

Using SAS

Here is the SAS program modified to accommodate these changes:

Download the SAS Program here: wood7.sas

The fastclus procedure is used again, only this time with the leader algorithm options are specified.

We set the maximum number of clusters to four and also set the radius to equal 20, the delta value that we decided on earlier.

Again, the output produces the initial cluster centroids. Given the first site, it will go down the list of the sites until it finds another site that is at least 20 away from this first point. The first one it finds forms the second cluster centroid. Then it goes down the list until it finds another site that is at least 20 away from the first two to form the third cluster centroid. Finally, the fourth cluster is formed by searching until it finds a site that is at least 20 away from the first three.

SAS also provides an iteration history showing what happens during each iteration of the algorithm. The algorithm stops after five iterations, showing the changes in the location of the centroids. In other words, convergence was achieved after 5 iterations.

Next, the SAS output provides a cluster summary which gives the number of sites in each cluster. It also tells you which cluster is closest. From this, it seems that Cluster 1 is in the middle because three of the clusters (2,3, and 4) are closest to Cluster 1 and not the other clusters.  The distances between the cluster centroids and their nearest neighboring clusters are reported, i.e., Cluster 1 is 14.3 away from Cluster 4.  The SAS output from all four clusters is in the table below:

Cluster Size Nearest Neighbor Distance
1 28 4 14.3126
2 9 1 17.6003
3 18 1 19.3971
4 17 1 14.3126

In comparing these spacings with the spacing that we found earlier, you will notice that these clusters are more widely spaced than the previously defined clusters.

The output of fastclus also gives the results of individual ANOVAs for each species. However, only the \(r^{2}\) values for each ANOVAs are presented. The \(r^{2}\) values are computed, as usual, by dividing the model sum of squares by the total sum of squares. These are summarized in the following table:

Code Species \(\boldsymbol{r^{2}}\) \(\boldsymbol{r ^ { 2 } / \left( 1 - r ^ { 2 } \right)}\) F
carcar Ironwood 0.785 3.685 82.93
corflo Dogwood 0.073 0.079 1.79
faggra Beech 0.299 0.427 9.67
ileopa Holly 0.367 0.579 13.14
liqsty Sweetgum 0.110 0.123 2.80
maggra Magnolia 0.199 0.249 5.64
nyssyl Blackgum 0.124 0.142 3.21
ostvir Blue Beech 0.581 1.387 31.44
oxyarb Sourwood 0.110 0.124 2.81
pingla Spruce Pine 0.033 0.034 0.76
quenig Water Oak 0.119 0.135 3.07
quemic Swamp Chestnut Oak 0.166 0.199 4.50
symtin Horse Sugar 0.674 2.063 46.76

Given \(r^{2}\) , the F-statistic is:

\(F = \dfrac{r^2/(K-1)}{(1-r^2)/(n-K)}\)

where K-1 is the degrees of freedom between clusters and n-K is the degrees of freedom within clusters.

In our example, n = 72 and K = 4.  If we to take the ratio of \(r^{2}\) divided by 1-\(r^{2}\), multiply the result by 68, and divide by 3, we arrive at the F-values in the table.

Each of these F-values is tested at K - 1 = 3 and n - K = 68 degrees of freedom.  Using the Bonferroni correction, the critical value for an \(α = 0.05\) level test is \(F_{3,68,0.05/13} = 4.90 \). Therefore, anything above 4.90 will be significant here.  In this case, the species in boldface in the table above are the species where the F-value is above 4.90.

Let's look at the cluster means for the significant species identified above.  The species and the species' means are listed in the table below.  As before, the larger numbers within each row are boldfaced.  As a result, you can see that ironwood is most abundant in Cluster 3, Beech is most abundant in Cluster 1 and so forth...

  Cluster
Species 1 2 3 4
Ironwood 4.1 7.2 21.2 2.1
Beech 11.1 6.1 5.7 6.2
Holly 5.5 5.9 4.4 13.2
Magnolia 5.3 3.3 2.8 3.0
Blue Beech 4.5 5.3 2.4 14.6
Horse Sugar 0.9 16.1 0.6 2.2

In looking down the columns of the table, we can characterize the individual clusters:

  • Cluster 1: Primarily Beech and Magnolia: These are the large canopy species typical of old-growth forest.
  • Cluster 2: Primarily Horse Sugar: These are a small understory species typical of small-scale disturbances (light gaps) in the forest.
  • Cluster 3: Primarily Ironwood: This is an understory species typical of wet habitats.
  • Cluster 4: Primarily Holly and Blue Beech: This is also an understory species typical of dry habitats.

Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility