# The cross validation concepts described above can be


16 Mathematical Background of Machine Learning

The cross-validation concepts described above can be applied by analogy when both a validation data set and a testing data set are used.

Fig. 21 K-fold cross-validation procedure (panel labels: data set, train set, test set; split data into K groups; score; compute mean of scores)

## 4.2 Cluster Evaluation

There are numerous methods to choose from when performing cluster analysis, and the same applies to cluster evaluation. As with the choice of algorithm, one needs to understand the specifics of the data to which it is applied, and the objectives, in order to use the evaluation parameters correctly and determine which cluster analysis is more appropriate for the case. There is usually no single or universal measure, but rather a bundle of them.

The general requirements of evaluation include determining the performance of the applied algorithm and deciding on the number, size, shape, etc. of clusters, or sometimes even sub-clusters. Cluster validity indices (CVIs) can be used to assess these questions.

There are two types of CVIs: internal and external. External CVIs can be used only when the true clusters are known (hence the name), since they compare the partition produced by a procedure with the true one. Internal CVIs are to be applied in all other cases. The other distinction is the kind of method they apply to, which can be either a crisp (hard) or a fuzzy (soft) clustering. [11] However, the former might be converted into the latter if needed.

Of more than fifty different indices, we describe some of those we had the option to choose from [12] when evaluating the clusters created throughout this chapter.

The following external CVIs were selected in our cluster evaluation process:

Crisp partitions:

• Rand index (to be maximized).
• Adjusted Rand index (to be maximized).
• Jaccard index (to be maximized).
• Fowlkes–Mallows index (to be maximized).
• Variation of information (to be minimized).

Fuzzy partitions:

• Soft Rand index (to be maximized).
• Soft adjusted Rand index (to be maximized).
• Soft variation of information (to be minimized).
• Soft normalized mutual information based on max entropy (to be maximized).

Since the external cluster validity indices gauge how well the distribution of data into the simulated clusters agrees with the distribution in the true ones, one needs to deal with a confusion matrix, which is based on the

[11] Hard or crisp clustering assigns each data point strictly and exclusively to one particular cluster, meaning that an observation cannot belong to two or more clusters at the same time. This would be possible under fuzzy or soft partitioning, where one point can belong to several clusters to differing extents.

[12] The list of CVIs is taken from Sarda-Espinosa (2019).
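To make the comparison of a simulated partition with a true one more concrete, the following is a minimal sketch of three of the crisp external indices named in this section (Rand, Jaccard, Fowlkes–Mallows). It assumes the standard pair-counting definitions, which are not spelled out in the text. The function names and the two label vectors are hypothetical illustrations, not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Classify every pair of points by agreement between two crisp partitions.

    a: pair shares a cluster in both partitions
    b: pair shares a cluster in the true partition only
    c: pair shares a cluster in the predicted partition only
    d: pair is separated in both partitions
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("both partitions must label the same points")
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(true_labels, pred_labels):
    # Share of pairs on which the two partitions agree (needs >= 2 points).
    a, b, c, d = pair_counts(true_labels, pred_labels)
    return (a + d) / (a + b + c + d)


def jaccard_index(true_labels, pred_labels):
    # Like the Rand index, but ignores pairs separated in both partitions.
    # Undefined (division by zero) if no pair shares a cluster anywhere.
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    return a / (a + b + c)


def fowlkes_mallows(true_labels, pred_labels):
    # Geometric mean of pairwise precision a/(a+c) and recall a/(a+b).
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    return a / sqrt((a + b) * (a + c))


# Hypothetical example: six points, two true clusters, one point misplaced.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

print("Rand:", rand_index(true_labels, pred_labels))
print("Jaccard:", jaccard_index(true_labels, pred_labels))
print("Fowlkes-Mallows:", fowlkes_mallows(true_labels, pred_labels))
```

All three values equal 1 when the two partitions coincide, which is why they are to be maximized. Because only pair membership is compared, the cluster labels themselves may be permuted without changing the result. The loop over all pairs is quadratic in the number of points; it is chosen here for readability rather than speed.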

Tags: Financial services, Volker Liermann