{"id":232,"date":"2022-10-08T15:06:41","date_gmt":"2022-10-08T15:06:41","guid":{"rendered":"https:\/\/memerabble.com\/?p=232"},"modified":"2022-10-11T23:31:00","modified_gmt":"2022-10-11T23:31:00","slug":"clustering-algorithms-overview","status":"publish","type":"post","link":"https:\/\/memerabble.com\/?p=232","title":{"rendered":"Clustering Algorithms Overview"},"content":{"rendered":"\n<p>Less precise than other forms of modelling (specifically classification). The development of useful information from the data, and the identification of what a particular cluster represents, often requires significant domain expertise. <\/p>\n\n\n\n<p>It is a form of unsupervised learning, as the algorithm defines the clusters and groups the instances accordingly. The aim of applying a clustering algorithm is to identify groups of rows that are both similar to each other and dissimilar to the rows in other groups.<\/p>\n\n\n\n<p>Clustering algorithms map out the relative distances between different sets of instances, grouping those that are &#8216;closer&#8217; to each other. <\/p>\n\n\n\n<p>There are two key forms of &#8216;distance&#8217; in this context: the within-group similarity, the <em>intra-group distance<\/em>, and the between-group dissimilarity, the <em>inter-group distance<\/em>.<\/p>\n\n\n\n<p>They can also be used as summarisation tools: e.g. reducing the computational cost involved in training a neural network by training it on a reduced selection of rows, or on <em>cluster-specific prototypes<\/em>.<\/p>\n\n\n\n<p>It is always wise to use a variety of clustering tools when analysing data. They are for fishing expeditions where we don&#8217;t know what is in the lake. 
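<\/p>\n\n\n\n<p>As a minimal, dependency-free sketch of the kind of centre-based grouping discussed here &#8211; Lloyd&#8217;s k-Means algorithm. The data, the starting centroids, and all names are illustrative assumptions, not anything taken from this post:<\/p>\n\n\n\n

```python
from statistics import fmean

def kmeans(points, centroids, iters=20):
    # Lloyd's algorithm: alternate assignment and centroid update.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x, y in points:
            # assign each row to its nearest centroid (centre-based clusters)
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[dists.index(min(dists))].append((x, y))
        # move each centroid to the mean of the rows assigned to it
        centroids = [
            (fmean(p[0] for p in c), fmean(p[1] for p in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# two visibly separated groups of made-up rows
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
cents, groups = kmeans(pts, centroids=[(0.0, 0.0), (5.0, 5.0)])
```

\n\n\n\n<p>Because the number of centroids is fixed up front, the sketch will always report exactly that many groups, however the rows are actually arranged.<\/p>\n\n\n\n<p>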
The risk is that in using only a particular algorithm we might, say, identify three groups within a k-Means analysis because that is what we have specified, but miss distinct subgroups within those groups that could be identified by examining a dendrogram.<\/p>\n\n\n\n<p><strong>Different Forms of Clusters:<\/strong><\/p>\n\n\n\n<p><em>Well-Separated Clusters <\/em>&#8211; instances that are part of each cluster are closer to the other members of that cluster than they are to the members of any other cluster<\/p>\n\n\n\n<p><em>Centre-Based Clusters <\/em>&#8211; instances are closer to the centrepoint of their own cluster than they are to the centrepoint of any other cluster<\/p>\n\n\n\n<p><em>Contiguous Clusters <\/em>&#8211; instances are closer to another member of the same cluster than they are to a member of any other cluster<\/p>\n\n\n\n<p><em>Density-Based Clusters <\/em>&#8211; clusters are dense regions of instances separated by regions of lower density; they emerge once noise-reducing algorithms have screened out the sparse areas<\/p>\n\n\n\n<p><em>Proprietary\/Pre-specified Clusters <\/em>&#8211; instances are sorted according to some pre-specified category based on domain expertise<\/p>\n\n\n\n<p><strong>Categories of Clustering Algorithms:<\/strong><\/p>\n\n\n\n<p><em>Partitioning Methods: <\/em>The number of clusters is specified in advance. Such methods are examples of <em>complete clustering<\/em>, where each row is allocated to exactly one cluster; the classifications are both <em>exhaustive <\/em>and <em>non-overlapping <\/em>(e.g. k-Means). Best used where the data are convex, normalised, and have Gaussian distributions.<\/p>\n\n\n\n<p><em>Hierarchical\/Agglomerative Clustering: <\/em>Rows are sorted into <em>tree structures<\/em>. Branches split from parent nodes at optimal points, which identifies the relationships between, and so the relative proximity of, distinct groups. 
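<\/p>\n\n\n\n<p>The tree structure just described can be sketched with a naive single-linkage agglomeration &#8211; a dependency-free toy whose data and function names are assumptions; in practice a library routine such as scipy&#8217;s linkage and dendrogram functions would do this far more efficiently:<\/p>\n\n\n\n

```python
def agglomerate(points):
    # Bottom-up single-linkage clustering: repeatedly merge the two
    # closest clusters, recording each merge (the dendrogram structure).
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: squared distance between the closest
                # pair of members drawn from the two clusters
                d = min(
                    (ax - bx) ** 2 + (ay - by) ** 2
                    for ax, ay in clusters[i]
                    for bx, by in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# made-up rows: two tight pairs plus one distant outlier
merges = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)])
```

\n\n\n\n<p>Reading the recorded merges first to last reproduces the dendrogram bottom-up: the tight pairs join first at small distances, and the outlying row joins last at a much larger one.<\/p>\n\n\n\n<p>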
These methods are another form of <em>complete clustering<\/em>, as at every level of the hierarchy each row is a member of exactly one cluster.<\/p>\n\n\n\n<p><em>Density-Based Methods: <\/em>These methods are used where there are dense areas of rows separated by areas of lower density. The sparsely populated areas are screened out or down-weighted, so not all rows are assigned to a cluster. This results in <em>partial clustering<\/em>, though where rows are assigned to a cluster they are assigned to a single cluster only (<em>non-overlapping<\/em>), e.g. DBSCAN.<\/p>\n\n\n\n<p><em>Model-Based Clustering:<\/em> The distribution of clusters is parameterised, i.e. it is assumed that they follow particular distributions &#8211; such as the normal distribution &#8211; and, <em>given that<\/em>, population statistics (e.g. mean, variance) are estimated, along with the probability that each row is a member of a particular cluster. As any row potentially has some likelihood of being part of any given cluster, these are a class of clustering algorithms which are <em>overlapping<\/em>, but they may be either partial or complete in their clustering.<\/p>\n\n\n\n<p><em>Grid-Based Methods: <\/em>Rows are organised into clusters linearly &#8211; rows that are furthest from each other are separated to the greatest degree &#8211; and those that fall into the middle are not assigned to a group (<em>partial clustering<\/em>). Useful for identifying outliers; members of any cluster are assigned to a single cluster only, so these algorithms are <em>non-overlapping<\/em>.<\/p>\n\n\n\n<p><em>Fuzzy Clustering<\/em>: Every row <em>i<\/em> belongs to every cluster <em>j<\/em> with a membership weight <em>p<sub>ij<\/sub><\/em>, where <em>p<sub>ij<\/sub><\/em> &#8712; [0, 1] and, for each row, the weights sum to one across the <em>k<\/em> clusters: &#931;<sub>j=1&#8230;k<\/sub> <em>p<sub>ij<\/sub><\/em> = 1.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Less precise than other forms of modelling (specifically classification). 
The development of useful information from the data and the identification of a particular cluster often requires significant domain expertise. It is a form of unsupervised learning as the algorithm defines the clusters, and groups the instances accordingly. The aim of&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[38],"tags":[],"class_list":["post-232","post","type-post","status-publish","format-standard","hentry","category-data"],"_links":{"self":[{"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/posts\/232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/memerabble.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=232"}],"version-history":[{"count":1,"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/posts\/232\/revisions"}],"predecessor-version":[{"id":233,"href":"https:\/\/memerabble.com\/index.php?rest_route=\/wp\/v2\/posts\/232\/revisions\/233"}],"wp:attachment":[{"href":"https:\/\/memerabble.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/memerabble.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/memerabble.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}