Sunday, June 22, 2014

a bit of data structure: Disjoint Set Find-Union

Problem Statement:
How do we represent disjoint sets in a manner such that it supports these two functions efficiently:
1. union(A,B): given disjoint sets \(A\) and \(B\), return \(C = A\cup B\)
2. same(u,v): given 2 element \(u\) and \(v\), check whether they are in the same set.

To efficiently support a data structure of disjoint sets, we need a clever construct to represent the disjoint sets. How? Here we'll discuss the standard way of doing it.

Let's say we have \(N\) elements indexed as \(1,2,3,\ldots , N\). Initially, all elements are disjoint to each other. We have a notion of parent, expressed as an array parent[\(1 \ldots N] such that parent[i] points to the parent of element i. At first, every element is a parent to itself since they are disjoint. We call a procedure to initialize this condition as init():

input: none
output: initialize the disjoint set data structure

for i in 1 to N:
   parent[i] = i

The union of 2 elements will result in parent of the first element pointing to the second element. As this procedure proceeds, a forest of trees will result, where each tree is a representation of a set. Now imagine if we want to check if 2 element is indeed inside the same tree, that means we need to traverse up the parent chain of each element until we find an element which parent points to itself (hence the root of the tree) and finally compare whether the two roots are the same. This traversal may take \(O(\log{N})\) each time (the height of the tree). Hence we need to compress this tree such that subsequent check of the root of a tree will result in \(O(1)\) average. Call this procedure find(u) as described below:

input: an element
output: the root of the disjoint set it is in

if u != parent[u]:
    parent[u] = find(parent[u])
return parent[u]

Here we check whether \(u\) is root, if so then we return \(u\). Otherwise, we need to update the parent of u to point to the root of the tree, and return it. Hence it updates the parent-child relationship in the tree such that the tree is compressed to a flatter tree.

Finally we can describe the two main functions above in an efficient manner:

input: 2 elements u,v
output: union of the set containing u and the set containing v

parent[find(u)] = find(v)

input: 2 elements u,v
output: is u and v in the same set?

return find(u) == find(v)

Here is a minimalist implementation of the data structure:
vector<int> p(100);
vector<int> r(100);
int init() { for (int i=0;i<100;++i) p[i]=i, r[i] = 0; }
int find_set(int i) { return (p[i] == i) ? i : (p[i] = find_set(p[i])); }
bool same_set(int i, int j) { return find_set(i) == find_set(j); }
void union_set(int i, int j) {
 if ( same_set(i,j) ) return;
 int x = find_set(i);
 int y = find_set(j);
 if(r[x] > r[y]) {
  p[y] = x;
 } else {
  p[x] = y;
  if (r[x] == r[y]) ++r[y];