What is clustering?

  • clustering algorithm looks at a number of data points and automatically finds data points that are related or similar to each other
  • In supervised learning, the dataset included both the inputs x as well as the target outputs y
  • In unsupervised learning, you are given a dataset with just x, but not the labels or the target labels y

Applications of clustering

  • Grouping similar news
  • Market segmentation
  • DNA analysis
  • Astronomical data analysis

K-means intuition

  • Take a random guess at where might be the center of the clusters
    • The first is assign points to cluster centroid and the second is move cluster centroids
    • Repeat until it finds that there are no more changes to the points or to the locations of the clusters centroids

K-means algorithm

  • are the vectors that have the same dimensions
  • Corner cases
    • when a cluster has 0 points assigned to it
      • delete the cluster or initialize one more random cluster

Optimization objective

  • = index of cluster (1,2,…,K) to which example is currently assigned
  • = cluster centroid k
  • = cluster centroid of cluster to which example has been assigned

Cost function

  • Distortion function

Initializing K-means

  • Choose
  • Randomly pick training examples
  • Set ,,…, equal to these examples

Random initialization

For i = 1 to 100 {
  # Randomly initialize K-means
  # Run K-menas. Get c_1,...c_m, mu_1,...mu_k
  # Compute cost function (distortion)
}
# pick set of clusters that gave lowest cost J

Choosing the number of clusters

  • Elbow method
    • Minimizing the number of cluster is not a good practice
    • Evaluate K-means based on how well it performs on that later purpose