Thursday, November 22, 2012 CC-BY-NC
Approximation algorithms: set cover

Maintainer: admin

Lecture notes for COMP 360 class, taught by Hamed Hatami. These lecture notes are student-generated and any errors or omissions should be assumed to be the fault of the notetaker and not of the lecturer. To correct an error, you have to be registered and logged-in; alternatively, you can contact @dellsystem directly.

1Set cover

Given some subsets $S_1, \ldots, S_m \subseteq U$, find the smallest number of sets whose union is $U$.

Example: $\{\{1, 2, 3\}, \{2, 3, 4\}, \{1, 2, 5\}, \{4\}, \{5\}\}$. In this case, we need at least 2 sets - for instance, $\{2, 3, 4\}$ and $\{1, 2, 5\}$ would work.

1.1NP-completeness

To show that this is NP-complete, we reduce vertex cover to this. (It's obviously in NP, because a certifier would just find the union of the proposed subsets, etc.) For each vertex, we make a subset consisting of all its adjacent edges. Then $U$ would be the set of all the edges in the graph.

1.2A greedy approximation algorithm

Here's a simple greedy algorithm. We first sort the subsets in order of decreasing size (largest first). In pseudocode:

R = U
S = {}
while R != {}:
    S_i = the set that covers the most things in R
    R = R / S_i (remove the elements from R that are also in S_i)
    add S_i to S
output S

With the above example, our algorithm would first identify $\{1, 2, 3\}$, with $R = \{4, 5\}$. Then, $\{2, 3, 4\}$, with $R = \{5\}$. Then, $\{1, 2, 5\}$, with $R = \{\}$ and so we terminate.

1.2.1The factor

What's the factor of this algorithm? First, we introduce some terminology. Harmonic factor (which I don't think is actually a term):

$$H_n = \sum \frac{1}{n} \approx \ln n \tag{from calculus}$$

Cost of an element $a$:

$$c_a = \frac{1}{|S_i \cap R|}$$

where $S_i$ is the first set chosen by the algorithm which contains the element $a$. In the example above, the costs would be:

$a$ 1 2 3 4 5
$c_a$ $\frac{1}{3}$ $\frac{1}{3}$ $\frac{1}{3}$ 1 1

The total cost, $\sum c_a$, would be 3 in this case. More generally, it's equal to the number of selected sets, because for each set, you divide by the number of elements in it that haven't yet been placed into $U$, etc (so the total cost is 1).

Claim: for each set $S_i$ (even those not selected by the algorithm), $\sum c_a \leq 1 + \frac{1}{2} + \ldots + \frac{1}{|S_i|} \leq H_n$. Proof left for another day.

We will now prove that this algorithm is $H_n$-factor where $n = |U|$. Well, later.

1.3An ILP approach

We could also formulate this problem as an integer linear program. We want to minimise $\sum x_i$ such that $\sum x_i \geq 1$ for all $a \in U$, $x_i \in \{0, 1\}$.

Under construction