On the Practical Complexity of Efficient Testing

This is a part 2 to the previous post on the theoretical complexity of efficient testing. Recall that we modelled tests as being used to verify that code satisfied some requirements, and then modelled the problem of efficient verification as finding the smallest set of tests that covered all of the requirements.

Although the decision problem is NP-complete, we can still put forth a decent attempt at solving it. We can rewrite the set-covering problem as an integer linear programming problem (define an indicator variable indicating whether each test was included in the test set or not, and define a constraint for each requirement, indicating that at least one of the tests that satisfies it is true; we then need to minimise the sum of all of the indicator variables). There are many solvers such as GLPK or CBC that can solve even fairly large instances of these problems. Similarly, we can also reformula set cover as boolean satisfiability; there are many solvers that can handle large formulae with many variables as well.

That said, although we can minimise the number of tests being used, it’s not entirely certain that we should, for several reasons. For example, suppose we wanted to test a function that returns all instances of characters occurring exactly two times in a string. Well, this is one possible implementation – and I’d be fairly confident in saying that you can’t really do better than linear time (you can’t avoid parts of the string in general, though there are some cases where you can shortcircuit e.g. if you have examined a portion of the string where every allowable character has appeared at least three times).

The first problem would obviously be whether the number of tests is even a good metric. I’ve written a few tests for the method above:

I’d certainly prefer having the three tests which each test something specific, as opposed to the single canFindPairs() test (in fact, if I came across the latter in a code review I would push back on it). The main problem here is that one way of reducing the number of tests is simply to merge existing tests or run large integration tests only, which is generally not great. Incidentally, this could lead to an extended heuristic, where we weight test methods by number of assertions.

But let’s suppose tests have disjoint assertions, and we don’t attempt to game the system in the way described above. The next issue is then how we define requirements. One possibility is to give methods well-defined postconditions and check that tests verify these, but this is unlikely to scale to large systems.

A common method, then, is to use code coverage as a proxy (this can be measured automatically via tracing of test calls). Line coverage, including adjusting for conditionals could be a good starting point. However, this isn’t really a good metric either – the three tests introduced above or the single canFindPairs() test actually achieve 100 percent coverage, by most definitions:

  • We have an input that violates the precondition, and two that pass it (line 2).
  • We do exercise the body of the for loop with the “aa” and “aaa” tests (lines 5-6).
  • We have both true and false outputs in the filter construction (line 10). This might not even be considered to be a requirement for the line to be covered.

Yet, if someone submitted the above tests only for findPairs() and I did a code review, I would ask them to add more testing. I’d probably expect at least the following:

Furthermore, the above method is not actually correct if going beyond UTF-16, so if (but only if) that would be likely given the context of the application involved I would ask for a test featuring that as well.

Thus, by optimising for code coverage and eliminating tests based on that, we run the risk of weakening our tests to the point where they couldn’t catch legitimate faults. For example, a test using characters outside of UTF-16 as described above would be unlikely to improve coverage at all, and thus might be pruned (thus allowing our implementation to pass even though it wouldn’t work). Approaches for evaluating whether this is worthwhile can include having developers plant faults in code, seeing if test suites after pruning can still catch them, or automatically mutating implementations (e.g. interchanging operations, changing the order of lines of code etc.) and seeing if test suites behave differently before and after pruning.

Coverage is still probably one of the least worst metrics in my opinion – I can’t really think of a good way of improving on it cheaply and scalably. Furthermore, studies have shown that in spite of line coverage being a kind of blunt instrument, it is nonetheless able to in several practical cases achieve decent reductions in test suite sizes without harming fault detection too much; nonetheless, the most aggressive solutions (such as using integer linear programming) seem to overfit to some extent, performing more than commensurately less well at detecting faults that were introduced.

Leave a Reply