Unsupervised machine learning models

Hierarchical Clustering

Split

Consider all the samples as one cluster, then recursively split it into sub-clusters according to the distances between samples (divisive, top-down).

Combine

Consider each sample as its own cluster, then repeatedly merge the two nearest clusters into one (agglomerative, bottom-up).
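A minimal sketch of the combine (agglomerative) approach using SciPy; the toy data and the choice of 2 clusters are illustrative assumptions, not from the original notes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D samples (illustrative data: two loose groups)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Agglomerative ("combine") clustering: start with each sample as its
# own cluster and repeatedly merge the two nearest clusters.
Z = linkage(X, method="ward")          # linkage matrix recording the merges

# Cut the merge tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                          # e.g. [1 1 1 2 2 2]
```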

K-means

Finding the optimal clustering is NP-hard; k-means is a heuristic that usually converges to a local optimum.

Steps:

  1. Select k random samples as the initial cluster centers
  2. For each of the remaining n-k samples, calculate its distance to every center
  3. Assign each sample to the cluster with the nearest center
  4. Recalculate each cluster's center as the mean of its members
  5. Repeat steps 3-4 until the centers no longer change
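A minimal NumPy sketch of these steps; the toy data, k, and iteration cap are illustrative assumptions, and empty-cluster handling is omitted for brevity:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every sample to its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers, labels = kmeans(X, k=2)
print(labels)   # e.g. [0 0 1 1]
```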

How to decide the first centers
The randomly chosen initial centers influence the final clustering, so different runs can give different results.

Solutions:

  1. Run the algorithm several times with different random initial centers and keep the best result (see the sketch below)
  2. Use hierarchical clustering to find k clusters, then use their centers as the initial centers
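A sketch of solution 1 using scikit-learn, whose n_init parameter restarts k-means from several random initializations and keeps the run with the lowest within-cluster sum of squares; the random data here is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # illustrative data

# n_init=10 restarts k-means from 10 different random initial centers
# and keeps the run with the lowest inertia (within-cluster sum of squares).
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```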

How to choose k
If k is not fixed by business rules, use the elbow rule: plot the diameter of the clusters against the value of k.

Elbow Rule

The x-axis is the value of k, and the y-axis is the diameter of the clusters, e.g., the sum of the squared distances between the points in a cluster and the cluster centroid.

The best k is where the curve bends sharply (the "elbow"); in the plot described above, that point is k = 3.
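A sketch of how such an elbow plot could be produced with scikit-learn and matplotlib; the data and the range of k are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))  # illustrative data

# Inertia (sum of squared distances to the nearest centroid) for each k
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()   # pick k at the bend ("elbow") of the curve
```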

Other Improvements for K-Means

Sensitive to Initial Values
k-means++: pick the initial centers one by one. The first center is a uniformly random sample; each later center is chosen with probability proportional to its squared distance from the nearest center already picked, so the initial centers tend to be spread out.
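A minimal sketch of the k-means++ seeding step, assuming Euclidean distance; the function name and toy inputs are illustrative:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread out the initial centers."""
    rng = np.random.default_rng(seed)
    # First center: a uniformly random sample
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each sample to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d2,
        # so far-away points are more likely to become centers
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```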

Sensitive to Noise and Outliers
Because each center is the mean of its cluster, a single outlier can drag it far from the bulk of the points; variants such as k-medoids (which restricts centers to actual samples) or k-median are more robust.
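A tiny illustration of the problem, with made-up values: one outlier drags the mean far more than the median.

```python
import numpy as np

cluster = np.array([1.0, 1.1, 0.9, 1.2, 100.0])  # last value is an outlier

print(cluster.mean())      # 20.84 -- the mean is dragged toward the outlier
print(np.median(cluster))  # 1.1   -- the median stays near the bulk
```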

Distance

Jaccard Distance
For two sets A and B, the Jaccard distance is 1 - |A ∩ B| / |A ∪ B|: identical sets have distance 0, disjoint sets have distance 1.

Cosine Distance
For two vectors x and y, the cosine distance is 1 - (x · y) / (||x|| ||y||), i.e. one minus the cosine of the angle between them; it ignores vector length and compares only direction.
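Minimal sketches of both distances; the example sets and vectors are illustrative:

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def cosine_distance(x, y):
    """1 - cos(angle between x and y) for two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5 (2 shared of 4 total)
print(cosine_distance([1, 0], [1, 1]))          # ≈ 0.293
```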