15 - Clustering
Review
- Clustering is an unsupervised learning technique to group “similar” data objects together
- Depends on:
- object representation
- similarity measure
- Harder when data dimensionality gets large (curse of dimensionality)
- Number of output clusters is part of the problem itself!
Partitioning: Hard Clustering
There are several ways to categorize clustering algorithms; one of them is partitioning-based.
An assignment can be encoded like a binary string: we have N elements and K clusters, and each element gets a label stating which cluster it belongs to; for K=2, a 0 or 1 states which of the two clusters the element is assigned to.
- For example, if K=2 → 0000001, 101001011, or 1010100101, as long as EVERY cluster contains AT LEAST one element.
- Technically, if K=2 we have 2^N possible assignments. More in general, however, since we require every cluster to be non-empty (and relabeling the clusters yields the same partition), the number of distinct partitions is slightly smaller.
- cases like 000000000 (one cluster left empty) are excluded
K-way non-empty partitions of N elements → Stirling number of the second kind:

$$S(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^i \binom{K}{i} (K - i)^N$$
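The number of K-way non-empty partitions (a Stirling number of the second kind) can be computed directly by inclusion-exclusion; a minimal sketch (the function name `stirling2` is mine, not from the notes):

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind S(n, k): the number of ways to
    partition n elements into k non-empty (unlabeled) clusters."""
    # Inclusion-exclusion over how many of the k clusters are left empty;
    # dividing by k! removes the ordering of the cluster labels.
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    return total // factorial(k)

# For K=2 the count is 2^(N-1) - 1: all binary strings, minus the
# two all-same-label cases, divided by 2 for the label symmetry.
print(stirling2(10, 2))  # 511
```

Even these small instances hint at why exhaustive search over all partitions is hopeless: the count grows exponentially in N.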
- Objective function
- exact solution must explore an exponential search space → NP-hard
- non-convex due to the discrete assignment matrix A → multiple local minima
- Lloyd-Forgy Algorithm
- NP-hardness doesn't allow us to compute the exact solution (i.e., the global optimum)
- Non-convexity doesn't allow us to rely on the nice properties of convex optimization (a unique global optimum)
- A convex objective could be (approximately) solved with numerical methods to find the global optimum
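The objective in question can be written out explicitly; a sketch of the standard K-means formulation with the discrete assignment matrix A (the symbols x_n, μ_k, a_nk are standard notation, not defined in the notes above):

```latex
\min_{A,\,\mu}\; J(A, \mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} a_{nk}\, \lVert x_n - \mu_k \rVert^2,
\qquad a_{nk} \in \{0, 1\}, \quad \sum_{k=1}^{K} a_{nk} = 1 \;\; \forall n
```

The minimization over the binary entries a_nk is exactly what makes the problem discrete, non-convex, and NP-hard.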
Lloyd-Forgy Algorithm
It is a 2-step iterative algorithm:
- assignment step: assign each point to its nearest centroid
- update step: recompute each centroid as the mean of the points assigned to it
It does not guarantee to find the global optimum, as it may get stuck at a local optimum or a saddle point.
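The two steps above can be sketched as follows; a minimal NumPy implementation (function and variable names are illustrative, and Forgy-style initialization picks random data points as the initial centroids):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """A minimal sketch of the Lloyd-Forgy iteration.
    X: (N, D) data matrix; k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Forgy-style initialization: pick k distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # fixed point reached: a local optimum (or saddle point)
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: the iterations recover them quickly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.01, (20, 2)), rng.normal(10, 0.01, (20, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
```

Note that the result depends on the random initialization; in practice the algorithm is restarted several times and the run with the lowest objective is kept.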