Should Exams Be Graded Out Of 137?

I reviewed Misbehaving a couple of weeks ago, and in the introduction the author (Richard Thaler) relates an interesting tale about grading exams. He mentions that setting the maximum score to be 137 instead of 100 in a US college context was useful, in that it allowed him to gain a better understanding of students’ performance without drawing too much complaint [1].

Thaler finds that exams can be made a lot more difficult without hurting students’ confidence too much. The actual percentages students obtain are immaterial (because raw scores are translated into actual grades on a curve); yet, in a US college context where A grades often start at 90% or above, suddenly getting a 72 (even if it is an A+) could be unnerving. Making exams difficult allows students to show that they haven’t just learned the material but have understood the logic and thinking behind it. For a first year logic course, say, I wouldn’t be opposed to the exam including unseen extensions of the material covered in class (for example, branching into monadic second-order logic or simple temporal logics); for a final year course, I’d expect students to be able to understand papers on the subject (even if unseen) and apply knowledge from said papers. This may be related to my work and interests, but the courses I remember best at Imperial include disproportionately many with difficult exams (Software Engineering (Design); Complexity; and Intelligent Data and Probabilistic Inference come to mind).

There is a risk of stressing students out if they do the math, of course. A few quick presses on a keyboard or calculator should reveal that, say, a 96/137 is just about 70 percent. Still, one has to be fairly deliberate about doing such calculations, as 137 is a rather awkward number to mentally calculate percentages with. If you asked me to estimate what a mark out of 137 corresponded to on a 100-point scale, I would probably engage in some kind of iterative search where I added 13.7s (steps of 10 percent), and then 1.4s (steps of roughly 1 percent). Apparently, though, people are generally lazy enough that they won’t bother with computing the percentage of their raw score. They’d simply look at the grade itself (which largely depends on where the raw score stands in relation to others, not so much on its absolute value).
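To make the arithmetic concrete, here is a minimal Python sketch of both the calculator approach and the "add 13.7s, then smaller steps" mental search. The function names are my own, and the 1 percent step uses the exact 1.37 rather than the rounded 1.4, so treat it as an illustration rather than a description of how anyone actually does this in their head.

```python
def to_percent(raw, max_score=137):
    """Convert a raw mark into a percentage of the maximum score."""
    return 100 * raw / max_score


def mental_estimate(raw, max_score=137):
    """Estimate the percentage by stepping up in 10% chunks, then 1% chunks.

    Assumption: this mirrors the iterative search described in the text,
    with 13.7 as the 10% step and 1.37 as the 1% step for a 137-point exam.
    The result is rounded down to a whole percent.
    """
    ten_step = max_score / 10
    one_step = max_score / 100
    total, percent = 0.0, 0
    # Add 10% chunks for as long as they still fit under the raw mark.
    while total + ten_step <= raw:
        total += ten_step
        percent += 10
    # Then refine with 1% chunks.
    while total + one_step <= raw:
        total += one_step
        percent += 1
    return percent


if __name__ == "__main__":
    for mark in (96, 82):
        print(mark, round(to_percent(mark), 1), mental_estimate(mark))
```

For a 96, the search takes seven steps of 13.7 to reach 95.9 and then stops, which is where the "just about 70 percent" figure comes from.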

Would this system be applicable or useful in the UK? I’ll acknowledge it was indeed a very clever solution in a US context, though I’m not sure it’s as useful in the UK. This mainly stems from the fact that grades in the UK are generally numerically lower; an A (a First) typically starts at 70 percent rather than 90. There’s thus already a fair bit of room to separate students who have merely learned the material from those who have engaged with and understood it. Thaler mentions that students scored an average of 72 out of 100 on his initial exam (which caused an uproar among his students); that’s on the high side in the UK! Typically, exams at Imperial would have a fairly difficult ‘tail end’ for the last 30 percent; I could see a case for a really hard ‘last stretch’ (10 percent), which tended to occur in some of the programming tests, but I think most exams have had enough discriminatory power. Generally, you’d want to use research projects and other forms of work over longer periods of time to sift out the top students (as those are probably more similar to both industry and postgraduate academia).

Another difference is that UK universities typically aggregate course grades by a percentage system, rather than by GPA. This does make sense; I would generally consider a student scoring 90 and 65 in two modules (77.5 average) to be far superior to one scoring a pair of 72s, while the GPA system would say otherwise (the 65 falls into a lower grade band, while both 72s count as top grades). I’d even be inclined to prefer the 90/65 student to one with a pair of 78s. None of this is a problem for using a raw limit of 137, as raw scores can always be translated (non-linearly, if need be) to percentage grades.
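The disagreement between the two aggregation schemes can be sketched in a few lines of Python. The grade-point bands below are an assumption of mine for illustration (70 and above as an A worth 4.0, then one point lower per 10 percent band); real GPA schemes differ between institutions.

```python
def mean_percent(marks):
    """Aggregate module marks the percentage way: a plain average."""
    return sum(marks) / len(marks)


def grade_points(mark):
    """Map a percentage mark to grade points.

    Hypothetical bands, loosely following UK classification boundaries:
    70+ -> 4.0, 60-69 -> 3.0, 50-59 -> 2.0, 40-49 -> 1.0, below 40 -> 0.0.
    """
    if mark >= 70:
        return 4.0
    if mark >= 60:
        return 3.0
    if mark >= 50:
        return 2.0
    if mark >= 40:
        return 1.0
    return 0.0


def gpa(marks):
    """Aggregate module marks the GPA way: average the grade points."""
    return sum(grade_points(m) for m in marks) / len(marks)


if __name__ == "__main__":
    for marks in ([90, 65], [72, 72], [78, 78]):
        print(marks, mean_percent(marks), gpa(marks))
```

Under these assumed bands, the 90/65 student averages 77.5 percent but only a 3.5 GPA, while the pair of 72s averages 72 percent with a 4.0 GPA, so the two schemes rank the students in opposite orders.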

One benefit of this system is psychological; even if the raw score might well be just 60 percent, scoring an 82 sounds much better. However, at Imperial at least, exam scripts generally aren’t returned to students. Also, this relies on most exams being graded out of 100 points (or fewer); if one is accustomed to 200-point exams, then being confronted with a score in the 80s might feel terrible (even though it is percentage-wise better than a 110 in a 200-point exam!).

There is also one minor objection I have, speaking more generally. Unless the exam being set is really, really hard, it’s likely that some students will score above 100. Thaler claims this “(generated) a reaction approaching ecstasy” – I’m not sure I’d feel the same way, as it would once again become front and center that the raw score is not a percentage, unlike in most exams! The illusion would be clearly broken. Nonetheless, these students can take comfort (if they need to) in the fact that they’re probably well ahead of the rest of the class.

In general, I think grading out of 137 (in raw scoring terms) can be pretty useful if (1) existing mandated grade boundaries or expectations are too high, so that one can’t distinguish students’ command of material; (2) one can control how raw scores are actually translated into grades that the students receive; (3) students get to view the raw scores; and (4) most exams are graded out of 100 (or possibly fewer).

All four appear to be true in the US system. While (2), (4) and maybe (1) are true at Imperial at least, (3) is not – and I’d say without (3) it probably wouldn’t be worth the effort.

