16 - K-Means
K-means: Lloyd-Forgy Algorithm
- Specify the number of output clusters K
- Select K observations at random from the N data points as the initial cluster centroids
- Assignment step: Assign each observation to the closest centroid based on the distance measure chosen
- Update step: For each of the K clusters update the centroid by computing the new mean values of all the data points now in the cluster
- Iteratively repeat steps 3-4 until a stopping criterion is met
K-means vs. K-means++
TBD
Evaluations
Clustering is evaluated based on the data that was clustered itself
- A good clustering will produce high quality clusters with:
- high intra-cluster similarity
- low inter-cluster similarity
- The measured quality of a clustering depends on
- data representation
- similarity measure
Each value of confusion matrix is given by rand.
Review…
- K-means is an iterative (approximated) clustering method that converges to a local minimum
- Tries to minimize the internal sum of squared Euclidean distances (Lloyd-Forgy Algorithm)
- Many variants:
- K-means++, K-medoids (PAM Algorithm), BFR K-means, etc.
- Internal vs. External measures of clustering quality