# On the Theoretical Complexity of Efficient Testing

Complexity was one of the courses I took at Imperial that I found to be especially intellectually stimulating. I would say it was probably the most challenging course of the degree for me at least (excluding the individual project; the second half of Intelligent Data and Probabilistic Inference probably comes close, also bearing in mind the bar being set at 90). I would say the course has certainly helped me reason about algorithms and problems in more interesting and divergent ways, and has also been useful in figuring out when to stop trying to devise excessively clever algorithms (by understanding when that is impossible).

The course starts by looking at decision problems (that is, problems with a yes/no) answer, and whether they fit into the class called P (decidable in polynomial time). Examples of problems in P include finding the maximum element in an array or list, and finding the shortest path between a pair of nodes in a graph (this is breadth-first search for unweighted, and Dijkstra’s algorithm for weighted, both of which run in polynomial time).

We then move on to NP (nondeterministically decidable in polynomial time) – effectively, this means that it is possible to verify a “proof” of the problem being decided as a yes in polynomial time. Well-known problems in NP include the Hamiltonian path problem (given a graph, is there a path passing through each node once?) – this one is in NP because a “proof” that the answer is yes can be given in the form of a path. We can check that the edges match simply by walking down that path, and the path is a permutation of the nodes. Note that we don’t need to be able to prove negative answers in polynomial time – there is a separate class called co-NP for cases where we can prove these. (NP $\cap$ co-NP is a strict subset of both NP and co-NP, and includes P as well as a couple of other problems like integer factorisation.)

(The course also covered the underlying theory behind P and NP, concerning Turing machines – I won’t go into detail here.)

One of the biggest parts of the course involved the concept of many-one reduction; a problem V reduces to a problem W if we can translate an instance of V to an instance of W, within certain constraints. For example, the problem of determining whether all integers in a set are positive reduces to that of determining whether all integers are negative under a polynomial time constraint; the first problem (positive) holds for some set $S$ if and only if the second holds for $T$, which is $S$ with all elements multiplied by $-1$. In a sense, we’re showing that if we can solve W, then we can solve V as well.

We then defined the hardest problems in a given class to be the problems that are complete for that class; that is, every problem in that class reduces to a complete problem (and yes, the complete problems reduce to themselves). For NP, the constraint on the reduction is that it can be carried out in polynomial time. This makes sense as NP would not be closed under reductions allowing more generous bounds (for example, if V reduced to W under some exponential time procedure and W was in NP, that doesn’t mean V is in NP; however, if V reduces to W under a polynomial time procedure and W is in NP, then so is V). This holds because if we had a guess for V, we could “translate” it using our reduction to a guess for W which can be verified in polynomial time.

The Hamiltonian path problem (hereafter HP) I introduced earlier is NP-complete, and can be used to show that other problems are NP-complete (by reducing HP to the other problem). This works because from our definition of NP-completeness, any problem in NP can be reduced to HP, and then reduced to our problem. This involves potentially composing two polynomial time mappings – but, the composition of polynomials is polynomial too. Note that it’s not just a sum, because the “output” of the first mapping, which is the input to the second mapping, could be bigger than the initial input (though only polynomially so!).

(For showing HP itself is NP-complete, there is a reduction of SAT – the problem of determining whether a logical formula in conjunctive normal form – to HP, and SAT itself can be proven NP-complete from the definition of NP over Turing machines.)

The inspiration for this post came from some preparatory work I was doing for a reading group with James, on algorithms for test-suite minimisation. The authors comment that the problem of finding a minimal representative subset for a test-suite is NP-complete. As a starting point, removing some of the details, their specific problem (involving mapping of tests to requirements shown by said tests) given is easily reformulated as the Minimum Set Cover problem; that is, given some possible subsets $C$ of a bigger set $S$, find the smallest subset of $C$, such that the union of the sets we put in our subset itself is $S$.

Strictly speaking, this isn’t a decision problem, but we can put it into a decision form: given some possible subsets $C$ of a bigger set $S$, is there a subset of $C$, such that the union of the sets we put in our subset itself is $S$ with size not greater than k? Solving the original problem can then be performed by a binary search on k, at each step testing this decision problem for a different value of k. Let’s call this problem Set Cover (Decision) or SC(D).

We could proceed with trying to show HP reduces to SC(D), though I find it easier to do things a bit differently – we can pick any problem that’s NP-complete (so I’ll pick SAT), and we can also exploit transitivity of reductions as mentioned above. I’ll thus use two intermediate steps:

1. Reduce SAT to 3SAT (the problem of determining whether a formula in conjunctive normal form where each clause has three literals is satisfiable)
2. Reduce 3SAT to Vertex Cover (VC – given a graph and some positive integer k, is there a set of nodes, size up to k, such that every edge has at least one endpoint in the set?)
3. Reduce VC to SC(D).

Step 3 is probably the easiest now – given an instance of VC, we map each edge to be an element in $S$, and each node to be a candidate subset containing all of the edges that are adjacent to it. That mapping requires us to iterate through the graph perhaps a few times, but that’s definitely polynomial in the input. Comparing the definitions it’s pretty clear that this should work (that said, if you were actually doing a complexity course you might need a more rigorous proof).

Step 1 relies on the use of logical equivalences – the idea being that we can always rewrite a clause that doesn’t use three literals into one or more clauses that do. The one and two cases simply involve repeating one of the literals that’s already there; if we have more than three, we can introduce auxiliary variables to chain them together:

```// SAT
A or B or C or D or E

// 3-SAT equivalent version
(A or B or X) and
(!X or C or Y) and
(!Y or D or E)```

Consider that the 3-SAT version is satisfiable if and only if the SAT version is satisfiable. If any one of the original variables can be set to true, then we can choose suitable values of X and Y (plus other auxiliary variables in larger cases) to satisfy all of the other clauses, “rippling out” from some variable that was set to true. However, if none of them are, we’ll run into a conflict somewhere.

Finally, step 2 is somewhat trickier. At a high level, the idea is that we want to enforce some way of selecting a truth value for each variable, that would satisfy at least one of the literals in a clause. We can then construct triangles for each clause, somehow wanting to have our cover only include two of the nodes in a triangle (corresponding to ‘false’ literals). The unselected literal must then be “true” in some sense, but we cannot have multiple conflicting literals in different clauses being true. To handle this, we introduce an additional constraint linking the literals we pick as true across clauses, yielding a construction like this:

The formula is satisfiable if and only if we can build a vertex cover picking two of the nodes in each clause, and one truth value for each variable. So k = 2C plus V, where C is the number of clauses and V the number of variables.

I think the course and especially exams on it were difficult because these constructions require nontrivial insight, which can be really difficult under pressure. Having a large set of known starting points would certainly have helped; this kind of three-stage derivation would probably not be needed in an exam, though constructions of a similar level of difficulty as step 2 were certainly present.