15 - Clustering

Review

  • Clustering is an unsupervised learning technique to group “similar” data objects together
    • Depends on:
      • object representation
      • similarity measure
  • Harder when data dimensionality gets large (curse of dimensionality)
  • Number of output clusters is part of the problem itself!

Partitioning: Hard Clustering

There are several ways to divide Cluster Algorithms, one of them is the Partitioning-based.

It is like doing binary strings: we have N elements and K ways to divide them, and a 0 or 1 that states if an element belongs or not to the string.

  • For example, if K=2 0000001, 101001011 or 1010100101 whatever as long as EVERY element appears AT LEAST once.
    • technically if we have K=2, we have assignment. More in general , however, since we are interested to the fact which every element must appear AT LEAST once, this number is slightly minor
    • cases like 000000000 are excluded

K-way non-empty partitions of N elements

  • Objective function
    • exact solution must explore exponential search space NP-Hard
    • non-convex due to the discrete assignment matrix A multiple local minima
  • LLoyd-Forgy Function
    • NP-hardness doesn’t allow us to compute the exact solution (i.e., global optimum)
    • Non-convexity doesn’t allow us to rely on nice property of convex optimization (unique global optimum)
    • A convex objective can be (approximately) solved with numerical methods to find the global optimum

Lloyd-Forgy Function

It is a 2-step iterative algorithm:

  • assignment step
  • update step

It Does not guarantee to find the global optimum as it may stuck to a local optimum or a saddle point