A Metric for Speedrunning Exams

In gaming, a speedrun is a playthrough of a game in which one attempts to complete it as quickly as possible, perhaps subject to some constraints (such as finishing with 100% completion of all sidequests). I wouldn’t say I’m skilled or fast enough to focus aggressively on speed in many of the tests I’ve covered; I didn’t finish the AIME, and while I did more or less “complete” the BMO and MAT within their time limits, it wasn’t with much time to spare.

Typically, when taking exams where one expects to finish comfortably (and even when one does not), I find it’s rarely a good idea to go as fast as possible; mistakes are likely, and one typically wants to secure the marks for the questions one is able to answer. I’d almost always use up nearly all of the time as well; there’s typically no reason not to go back and check one’s past work and/or improve the quality of one’s existing answers. Up to and including the IB exams I took at the end of high school, there was generally time left over at the end for most subjects; once at Imperial, most exams tended to be very fast-paced: I would either finish them with fewer than ten minutes on the clock, or not finish them at all. (There were exceptions, such as first-year Logic, second-year Concurrency and fourth-year Modal Logic.)

A speedrun of an exam would involve completing the questions as quickly as possible, perhaps using much less time than is allowed. In constructing a metric for this, it becomes apparent that incorrect answers need to be penalised in some way (otherwise it would be feasible to leave all questions blank and immediately stop the clock). However, ideally the penalties would not be too harsh; for example, invalidating any performance with an incorrect answer would address this concern, but would not be enjoyable at all. It also seems awkward to invalidate an entire run on the basis of not knowing, say, a single fact or definition.

There are two obvious raw metrics for performance of a speedrun of an exam:

  • the proportion of total marks scored, M \in [0, 1] where higher is better;
  • the proportion of time taken, T \in [0, 1] where lower is better.

Combining those metrics, I think the following metric for an overall performance P is fairly intuitive.

P_0 = M/T

In a sense we’re measuring efficiency with respect to time against a benchmark student who uses all of the time and scores 100% of the marks (and so has P_0 = 1). However, I don’t think this is a good metric, because it can readily be abused: a strategy that quickly identifies the easiest mark(s) to score on the paper and then attempts only those marks will do very well (note: marks, not necessarily questions; for example, “find the roots of f(x) = 36x^4 - 23x^3 - 31x^2 - 102x” is a rather nasty question, but the obvious root x = 0 drops out in a second or two).
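To make the abuse concrete, here is a minimal Python sketch of P_0. The function name `p0` and both runs are hypothetical, made up purely for illustration: an honest run scoring 95 percent of the marks in 60 percent of the time, versus a cherry-picking run that banks 2 percent of the marks in 1 percent of the time.

```python
def p0(m: float, t: float) -> float:
    """Raw efficiency P_0 = M / T, with M and T as proportions in [0, 1]."""
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    return m / t


# Hypothetical runs, for illustration only.
honest = p0(0.95, 0.60)        # attempts every question
cherry_pick = p0(0.02, 0.01)   # attempts only the easiest mark(s)

print(f"honest run:      {honest:.2f}")
print(f"cherry-pick run: {cherry_pick:.2f}")
```

The cherry-picking run scores about 2.00 against roughly 1.58 for the honest run, even though it answered almost nothing.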

Of course, one way around this could be to check scripts at the end, to ensure that there’s a reasonable, bona fide attempt at each and every problem. However, that requires manual verification. An alternative could be to introduce a minimum mark threshold m; attempts with M < m are invalidated.

P_1 = \begin{cases} 0 & M < m \\ M/T & M \geq m \end{cases}

This metric is a decent improvement, though it still has some issues:

  • A strategy that identifies just enough of the easiest marks to reach m and then attempts only those would still perform well. This can be mitigated if m is fairly high; for A-level mathematics papers, for example, I would set something like m = 0.93.
  • On the other hand, if m is set too high (m = 1 is the aforementioned “invalidate all non-perfect runs” strategy), too many runs may be invalidated, including runs that fall short only because of unknown facts or definitions.
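A minimal Python sketch of P_1 shows both sides of this trade-off. The function name `p1` and the runs are hypothetical; the threshold m = 0.93 is the A-level figure suggested above.

```python
def p1(m: float, t: float, threshold: float) -> float:
    """P_1: zero if M is below the threshold m, otherwise M / T."""
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    return 0.0 if m < threshold else m / t


M_THRESHOLD = 0.93  # the A-level figure suggested above

# Hypothetical runs, for illustration only.
print(p1(0.02, 0.01, M_THRESHOLD))  # cherry-picking run: invalidated, scores 0.0
print(p1(0.95, 0.60, M_THRESHOLD))  # honest run: passes, scores 0.95 / 0.60
print(p1(0.92, 0.40, M_THRESHOLD))  # fast run with a couple of slips: invalidated, scores 0.0
```

The last run is the cliff edge: missing the threshold by a single percentage point throws away an otherwise excellent time entirely.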

We can also consider continuous penalty functions of M that fall off more harshly than M itself as M decreases from 1. For example,

P_2 = \max \left\lbrace 0, \dfrac{1 - 2(1 - M)}{T} \right\rbrace

Thus, a run with M = 0.7 scores only 40 percent as much as a run with the same time but M = 1, since 1 - 2(1 - 0.7) = 0.4. The max with 0 could be dropped if one wished to give negative scores to runs with M < 0.5, though I think that’s fairly harsh.
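As a quick check of the 40 percent figure, here is a minimal Python sketch of P_2; the function name `p2` and the `clamp` flag (which toggles the max with 0) are illustrative choices, not part of the original definition.

```python
def p2(m: float, t: float, clamp: bool = True) -> float:
    """P_2: linear penalty (1 - 2(1 - M)) / T, i.e. (2M - 1) / T.

    With clamp=True, negative scores (M < 0.5) are floored at 0.
    """
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    score = (1 - 2 * (1 - m)) / t
    return max(0.0, score) if clamp else score


# Same time, different accuracy: the M = 0.7 run keeps 40% of the perfect score.
ratio = p2(0.7, 0.5) / p2(1.0, 0.5)
print(f"{ratio:.2f}")  # 0.40

# Below M = 0.5 the clamped score is 0, while the unclamped one goes negative.
print(p2(0.4, 0.5), p2(0.4, 0.5, clamp=False) < 0)  # 0.0 True
```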

There’s also no reason we should restrict ourselves to linear functions. Consider

P_3 = M^\alpha / T, \alpha > 1

Higher values of \alpha penalise flaws more heavily. Consider two runs with the same time, one with M = 0.9 and the other with M = 1: with \alpha = 2 the imperfect run scores 81 percent of the score of the perfect run, but with \alpha = 10 it scores a mere 34.9 percent! Also, observe that as \alpha \rightarrow \infty we approach P_1 with a 100 percent threshold.
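The figures above can be reproduced with a minimal Python sketch of P_3 (the function name `p3` is hypothetical). It compares an M = 0.9 run with an M = 1 run over the same time for increasing \alpha.

```python
def p3(m: float, t: float, alpha: float) -> float:
    """P_3: power-law penalty M**alpha / T, with alpha > 1."""
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    if alpha <= 1:
        raise ValueError("alpha must exceed 1")
    return m ** alpha / t


# Ratio of an M = 0.9 run to an M = 1 run with the same time.
for alpha in (2, 10, 100):
    ratio = p3(0.9, 1.0, alpha) / p3(1.0, 1.0, alpha)
    print(f"alpha = {alpha:>3}: {100 * ratio:.1f} percent")
```

For \alpha = 2 and \alpha = 10 this gives the 81 and 34.9 percent figures; at \alpha = 100 the imperfect run keeps well under 0.01 percent of the perfect score, which illustrates the limiting behaviour towards a 100 percent threshold.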

Of course, we also have the exponential option:

P_4 = e^{-\lambda (1 - M)} / T, \lambda > 0

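In approaches 3 and 4, mistakes compound multiplicatively rather than each costing a fixed amount; under approach 4, for instance, every additional 1 percent of the marks lost multiplies the score by a further factor of e^{-0.01\lambda}. I think this makes sense for papers where candidates who understand the concepts can be expected to lose most or all of their dropped marks to careless errors.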
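A minimal Python sketch of P_4 shows how the exponential penalty compounds. The function name `p4`, the choice \lambda = 10, and the assumption that each careless slip costs 2 percent of the marks are all hypothetical. The parameter is spelled `lam` because `lambda` is a Python keyword.

```python
import math


def p4(m: float, t: float, lam: float) -> float:
    """P_4: exponential penalty exp(-lam * (1 - M)) / T, with lam > 0."""
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    if lam <= 0:
        raise ValueError("lambda must be positive")
    return math.exp(-lam * (1 - m)) / t


LAM = 10.0   # hypothetical choice of lambda
SLIP = 0.02  # hypothetical: each careless slip costs 2% of the marks

scores = [p4(1 - k * SLIP, 1.0, LAM) for k in range(5)]
for k in range(1, 5):
    # Each extra slip multiplies the score by the same factor exp(-LAM * SLIP).
    print(f"{k} slip(s): {scores[k]:.3f}  (x{scores[k] / scores[k - 1]:.3f} vs previous)")
```

Every printed ratio is e^{-0.2}, approximately 0.819, so four slips leave a little under half of the perfect score.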

One or two slips should, in my opinion, result in a notable performance hit relative to perfect runs, but the run should still be salvageable; more than a few would seem like grounds for invalidating the run. It would be possible to blend the threshold idea (approach 1) with either approach 3 or 4, though one could argue that the heavy penalties involved already destroy runs with “more than a few” errors.
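Here is one hypothetical way to blend the threshold of approach 1 with the power penalty of approach 3. The function name `p_blend` and the values m = 0.8 and \alpha = 10 are illustrative assumptions, not recommendations.

```python
def p_blend(m: float, t: float, threshold: float, alpha: float) -> float:
    """Hypothetical blend: invalidate runs below the threshold, else M**alpha / T."""
    if t <= 0:
        raise ValueError("time proportion T must be positive")
    if m < threshold:
        return 0.0
    return m ** alpha / t


THRESHOLD, ALPHA = 0.8, 10.0  # illustrative values only

# Hypothetical runs with the same time but increasing numbers of slips.
for m in (1.0, 0.98, 0.9, 0.82, 0.78):
    print(f"M = {m:.2f}: P = {p_blend(m, 0.5, THRESHOLD, ALPHA):.3f}")
```

Even before the threshold applies, a run at M = 0.82 keeps under 14 percent of the perfect score. This is consistent with the argument that the power penalty alone already does most of the invalidating.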
