16 - K-Means

K-means: Lloyd-Forgy Algorithm

  1. Specify the number of output clusters K
  2. Select K observations at random from the N data points as the initial cluster centroids
  3. Assignment step: assign each observation to the closest centroid under the chosen distance measure
  4. Update step: for each of the K clusters, recompute the centroid as the mean of all data points currently assigned to it
  5. Repeat steps 3-4 until a stopping criterion is met (e.g., assignments no longer change or a maximum number of iterations is reached)

K-means vs. K-means++

  • Plain K-means (Lloyd-Forgy) picks the K initial centroids uniformly at random, so an unlucky draw can lead to a poor local minimum
  • K-means++ picks the first centroid uniformly, then each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far, spreading the seeds out and improving the expected solution quality
</gr-replace>

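The seeding step that distinguishes K-means++ can be sketched as follows (`kmeans_pp_init` is a hypothetical helper name; each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: spread the initial centroids out."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: uniform
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

After seeding, the ordinary Lloyd iterations (assignment/update) run unchanged; only the initialization differs.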
Clustering can be evaluated on the clustered data itself (internal evaluation)

  • A good clustering produces high-quality clusters with:
    • high intra-cluster similarity
    • low inter-cluster similarity
  • The measured quality of a clustering depends on:
    • the data representation
    • the similarity measure
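One simple way to quantify these two criteria is to compare the mean pairwise distance within clusters against the mean pairwise distance between clusters (a sketch under the assumption of Euclidean distance; `cluster_cohesion_separation` is a hypothetical helper name, and more refined internal measures such as the silhouette coefficient exist):

```python
import numpy as np

def cluster_cohesion_separation(X, labels):
    """Mean pairwise distance within clusters (lower = higher intra-cluster
    similarity) vs. between clusters (higher = lower inter-cluster
    similarity)."""
    intra, inter = [], []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)
```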

External evaluation compares the clustering against ground-truth class labels: each cell of the pair-counting confusion matrix counts point pairs that fall in the same/different cluster vs. the same/different class, and the Rand index is the fraction of pairs on which the two partitions agree.
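The pair-counting idea can be sketched directly (`rand_index` is a hypothetical helper name; the quadratic loop over pairs is for clarity, not efficiency):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two partitions agree
    (the pair is together in both, or separated in both)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

Note that the index is invariant to relabeling the clusters: only pair relationships matter, not the cluster IDs themselves.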


  • K-means is an iterative (approximate) clustering method that converges to a local minimum
  • It tries to minimize the within-cluster sum of squared Euclidean distances (Lloyd-Forgy Algorithm)
  • Many variants exist:
    • K-means++, K-medoids (PAM Algorithm), BFR K-means, etc.
  • Internal vs. external measures of clustering quality