for each value of the variable, the variable’s mean is subtracted and the result is divided by the standard deviation, so that the resulting variable has mean 0 and standard deviation 1. [This means that each standardised value measures how many standard deviations above or below the mean the original value was.]

Standardisation of numerical variables is performed for cluster analysis because the clusters are to be formed based on how different the various variable values are, and this should not depend on the scale of the variables. With standardised variables, when measuring the distance between two data points or between two clusters, equal weighting is given to the contribution from each variable. If there is a reason to put more weight on one variable, this can be done by multiplying the standardised variable by an appropriate factor.

(ii) grps is first defined in line 5 as an empty matrix with 9 rows and two columns. There is then a loop which, for each value of n from 2 to 10, applies the k-means command to Utils, thus dividing the data into n clusters. Then n is placed in row (n-1) of the first column, while the between-cluster variance as a proportion of the total variance goes in row (n-1) of the second column. Thus we will plot the proportion of between-cluster variance as a function of the number of clusters. Since we would like to maximise the between-cluster variance, we can use this plot to decide how many clusters it is useful to have.

(iii) It seems appropriate to choose 6 clusters since, beyond 6, the increase in the between-clusters proportion is minimal.

(b) (i) Linkage methods are methods of defining the distance between clusters. They are calculated based on the distances between pairs of points, one in each cluster. For example:

Single linkage: for two clusters C1 and C2, find the pair of points, one in C1 and one in C2, that are the shortest distance apart. The distance between that pair is taken as the distance between the clusters.
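As a rough illustration of the single-linkage definition above, the following Python sketch searches all cross-cluster pairs for the closest one. The helper name linkage_distance, its pick parameter and the sample points are hypothetical and are not taken from the question; Euclidean distance is assumed.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def linkage_distance(c1, c2, pick=min):
    """Distance between two clusters, chosen by `pick` over all
    pairs of points with one point in each cluster."""
    return pick(dist(p, q) for p, q in product(c1, c2))


# Hypothetical 2-D points, not the data from the question.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]

single = linkage_distance(c1, c2)         # closest pair: (1, 0) and (4, 0)
furthest = linkage_distance(c1, c2, max)  # furthest pair: (0, 0) and (6, 0)
print(single, furthest)  # 3.0 6.0
```

Passing max instead of min makes the same helper pick out the furthest pair rather than the closest.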
Complete linkage: same idea, but find the pair of points, one in each cluster, that are the furthest apart.

(ii) The shortest distances are AB and EF, which are both 2, so create clusters AB and EF. The distance CD is 3, and C and D are more distant from all other points, so form a cluster CD.

Now we calculate the distances between these 3 clusters, as determined by complete linkage:

AB to CD: the biggest distance is AD = 12
AB to EF: the biggest distance is AF = 8.5
CD to EF: the biggest distance is DE = 8

The smallest of these is 8, so form cluster CDEF at a height of 8.

The maximum distance from AB to CDEF is AD = 12, so form cluster ABCDEF at a height of 12.

(c) (i) Simple matching: 6/10 = 0.6
Jaccard: 2/5 = 0.4

(ii)
