In which we describe a -approximate algorithm for the Metric TSP, we introduce the Set Cover problem, observe that it can be seen as a more general version of the Vertex Cover problem, and we devise a logarithmic-factor approximation algorithm.
1. Better Approximation of the Traveling Salesman Problem
In the last lecture we discussed equivalent formulations of the Traveling Salesman problem, and noted that Metric TSP-R can also be seen as the following problem: given a set of points and a symmetric distance function
that satisfies the triangle inequality, find a multi-set of edges such that
is a connected multi-graph in which every vertex has even degree and such that
is minimized.
Our idea will be to construct by starting from a minimum-cost spanning tree of
, and then adding edges so that every vertex becomes of even degree.
But how do we choose which edges to add to ?
Definition 1 (Perfect Matching) Recall that graph
is a matching if no two edges in
have an endpoint in common, that is, if all vertices have degree zero or one. If
is a matching, we also call the edge set
a matching. A matching is a perfect matching if every vertex has degree one
Note that a perfect matching can exist only if the number of vertices is even, in which case .
Definition 2 (Min Cost Perfect Matching) The Minimum Cost Perfect Matching Problem is defined as follows: an input of the problem is a an even-size set of vertices
and a non-negative symmetric weight function
; the goal is to find a perfect matching
such that the cost
of the matching is minimized.
We state, without proof, the following important result about perfect matchings.
There is a polynomial-time algorithm that solves the Minimum Cost Perfect Matching Problem optimally.
We will need the following observation.
In every undirected graph, there is an even number of vertices having odd degree.
Proof: Let be any graph. For every vertex
, let
be the degree of
, and let
be the set of vertices whose degree is odd. We begin by noting that the sum of the degrees of all vertices is even, because it counts every edge twice:
The sum of the degrees of the vertices in is also even, because it is a sum of even numbers. So we have that the sum of the degrees of the vertices in
is even, because it is a difference of two even numbers:
Now it follows from arithmetic modulo 2 that if we sum a collection of odd numbers and we obtain an even result, then it must be because we added an even number of terms. (Because the sum of an even number of odd terms is even.) So we have proved that is even.
We are now ready to describe our improved polynomial-time approximation algorithm for General TSP-R.
- Input: instance
of Metric TSP-R
- Find a minimum cost spanning tree
of
relative to the weight function
- Let
be the set of points that have odd degree in
- Find a minimum cost perfect matching
over the points in
relative to the weight function
- Let
be the multiset of edges obtained by taking the edges of
and the edges of
, with repetitions
- Find a Eulerian cycle
in the graph
- Output
We first note that the algorithm is correct, because is a connected multigraph (because it contains the connected graph
) and it is such that all vertices have even degree, so it is possible to find an Eulerian cycle, and the Eulerian cycle is a feasible solution to General TSP-R.
The cost of the solution found by the algorithm is
We have already proved that, if is an optimal spanning tree, then
.
Lemma 3 below shows that , and so we have that the cost of the solution found by the algorithm is
, and so we have a polynomial time
-approximate algorithm for Metric TSP-R. (And also General TSP-R and Metric TSP-NR by the equivalence that we proved in the previous lecture.)
Lemma 3 Let
be a set of points,
be a symmetric distance function that satisfies the triangle inequality, and
be an-even size subset of points. Let
be a minimum cost perfect matching for
with respect to the weight function
. Then
Proof: Let be a cycle which is an optimal solution for the Metric TSP-R instance
. Consider the cycle
which is obtained from
by skipping the elements of
, and also the elements of
which are repeated more than once, so that exactly once occurrence of every element of
is kept in
. For example, if
,
and
is the cycle
then we obtain
by skipping the occurrences of
and the second occurrence of
, and we have the cycle
. Because of the triangle inequality, the operation of skipping a point (which means replacing the two edges
with the single edge
) can only make the cycle shorter, and so
Now, is a cycle with an even number of vertices and edges, so we can write
, where
is some ordering of the vertices and
. We note that we can partition the set of edges in
into two perfect matchings: the perfect matching
and the perfect matching
. Since
is made of the union of the edges of
and
, we have
The perfect matching is the minimum-cost perfect matching for
, and so
and
, so we have
and hence
An important point is that the algorithm that we just analyzed, like every other approximation algorithm, is always able to provide, together with a feasible solution, a certificate that the optimum is greater than or equal to a certain lower bound. In the 2-approximate algorithm TSP algorithm from the previous lecture, the certificate is a minimum spanning tree, and we have that the TSP optimum is at least the cost of the minimum spanning tree. In the improved algorithm of today, the cost of minimum spanning tree gives a lower bound, and twice the cost of the minimum cost perfect matching over gives another lower bound, and we can take the largest of the two.
Let us work out an example of the algorithm on a concrete instance, and see what kind of solution and what kind of lower bound we derive. Our set of points will be: Cupertino, Mountain View, Palo Alto, Santa Clara, and Sunnyvale. We have the following distances in miles, according to Google map:
C | MV | PA | SC | SV | |
C | 0 | 7 | 12 | 7 | 4 |
MV | 0 | 8 | 9 | 4 | |
PA | 0 | 14 | 10 | ||
SC | 0 | 5 | |||
SV | 0 |
The reader can verify that the triangle inequality is satisfied. If we run a minimum spanning tree algorithm, we find the following tree of cost 21
This tells us that the optimum is at least 21 miles.
If we employ the algorithm from the last lecture, we perform a DFS which gives us the cycle Palo Alto Mountain View
Sunnyvale
Cupertino
Sunnyvale
Santa Clara
Sunnyvale
Mountain View
Palo Alto, which has a length of 42 miles. After skipping the places that have already been visited, we get the cycle Palo Alto
Mountain View
Sunnyvale
Cupertino
Santa Clara
Palo Alto, whose length is 37 miles.
Today’s algorithm, instead, looks for a minimum cost perfect matching of the points that have odd degree in the spanning tree, that is all the places except Mountain View. A minimum cost perfect matching (there are two optimal solutions) is whose cost is 17 miles, 10 for the connection between Palo Alto and Sunnyvale, and 7 for the one between Cupertino and Santa Clara.
This tells us that the TSP optimum must be at least 34, a stronger lower bound than the one coming from the minimum spanning tree.
When we add the edges of the perfect matching to the edges of the spanning tree we get the following graph, which is connected and is such that every vertex has even degree:
We can find an Eulerian cycle in the graph, and we find the cycle Palo Alto Mountain View
Sunnyvale
Santa Clara
Cupertino
Sunnyvale
Palo Alto, whose length is 38 miles. After skipping Sunnyvale the second time, we have the cycle Palo Alto
Mountain View
Sunnyvale
Santa Clara
Cupertino
Palo Alto whose length is 36 miles.
In summary, yesterday’s algorithm finds a solution of 37 miles, and a certificate that the optimum is at least 21. Today’s algorithm finds a solution of 36 miles, and a certificate that the optimum is at least 34.
2. The Set Cover Problem
Definition 4 The Minimum Set Cover problem is defined as follows: an input of the problem is a finite set
and a collection of subsets
, where
and
.
The goal of the problem is to find a smallest subcollection of sets whose union is
, that is we want to find
such that
and
is minimized.
For example, suppose that we want to assemble a team to work on a project, and each of the person that we can choose to be on the team has a certain set of skills; we want to find the smallest group of people that, among themselves, have all the skills that we need. Say, concretely, that we want to form a team of programmers and that we want to make sure that, among the team members, there are programmers who can code in C, C++, Ruby, Python, and Java. The available people are Andrea, who knows C and C++, Ben, who knows C++ and Java, Lisa, who knows C++, Ruby and Python, and Mark who knows C and Java. Selecting the smallest team is the same as a Minimum Set Cove problem in which we have the instance
In which the optimal solution is to pick , that is Lisa and Mark.
Although this is an easy problem on very small instances, it is an NP-hard problem and so it is unlikely to be solvable exactly in polynomial time. In fact, there are bad news also about approximation.
Theorem 5 Suppose that, for some constant
, there is an algorithm that, on input an instance of Set Cover finds a solution whose cost is at most
times the optimum; then every problem in
admits a randomized algorithm running in time
, where
is the size of the input.
If, for some constant
, there is a polynomial time
-approximate algorithm, then
.
The possibility of nearly-polynomial time randomized algorithms is about as unlikely as , so the best that we can hope for is an algorithm providing a
factor approximation.
A simple greedy approximation provides such an approximation.
Consider the following greedy approach to finding a set cover:
- Input: A set
and a collection of sets
-
- while there is an uncovered element, that is an
such that
- Let
be a set with the largest number of uncovered elements
-
- Let
- return
To work out an example, suppose that our input is
The algorithm will pick four sets as follows:
- At the first step, all the elements of
are uncovered, and the algorithm picks
, which is the set that covers the most elements (five);
- At the second step, there are five remaining uncovered elements, and the best that we can do is to cover two of them, for example picking
;
- At the third step there remain three uncovered elements, and again the best we can do is to cover two of them, by picking
;
- At the fourth step only
remains uncovered, and we can cover it by picking
.
As with the other algorithms that we have analyzed, it is important to find ways to prove lower bounds to the optimum. Here we can make the following easy observations: at the beginning, we have 10 items to cover, and no set can cover more than 5 of them, so it is clear that we need at least two sets. At the second step, we see that there are five uncovered items, and that there is no set in our input that contains more than two of those uncovered items; this means that even the optimum solution must use at least sets to cover those five items, and so at least
sets, that is at least 3 sets, to cover all the items.
In general, if we see that at some point there are items left to cover, and that every set in our input contains at most
of those items, it follows that the optimum contains at least
sets. These simple observations are already sufficient to prove that the algorithm is
-approximate.
We reason as follows. Let be the input to the algorithm, and let
be an ordering of the elements of
in the order in which they are covered by the algorithm. Let
be the number of elements that become covered at the same time step in which
is covered. Let
be the number of sets used by an optimal solution and
be the number of sets used by the algorithm.
For every , define
The intuition for this definition is that, at the step in which we covered , we had to “pay” for one set in order to cover
elements that were previously uncovered. Thus, we can think of each element that we covered at that step as having cost us
times the cost of a set. In particular, we have that the total number of sets used by the algorithm is the sum of the costs:
Now, consider the items and let us reason about how the optimum solution manages to cover them. Every set in our input covers at most
of those
items, and it is possible, using the optimal solution, to cover all the items, including the items
with
sets. So it must be the case that
from which we get
The quantity
is known to be at most , and so we have
It is easy to prove the weaker bound , which suffices to prove that our algorithm is
-approximate: just divide the sum into terms of the form
, that is
and notice that each term is at most 1 (because each term is itself the sum of terms, each
) and that the whole sum contains at most
such terms.
3. Set Cover versus Vertex Cover
The Vertex Cover problem can be seen as the special case of Set Cover in which every item in appears in precisely two sets.
If is an instance of Vertex Cover, construct the instance of Set Cover in which
, and in which we have one set
for every vertex
, defined so that
is the set of all edges that have
as an endpoint. Then finding a subcollection of sets that covers all of
is precisely the same problem as finding a subset of vertices that cover all the edges.
The greedy algorithm for Set Cover that we have discussed, when applied to the instances obtained from Vertex Cover via the above transformation, is precisely the greedy algorithm for Vertex Cover: the algorithm starts from an empty set of vertices, and then, while there are uncovered edges, adds the vertex incident to the largest number of uncovered edges. By the above analysis, the greedy algorithm for Vertex Cover finds a solution that is no worse than times the optimum, a fact that we mentioned without proof a couple of lectures ago.