Limitations of the Bus Factor

Over lunch today, a friend and I discussed what we were doing in response to COVID-19. I mentioned I was working from home, and that the purpose of this was presumably to avoid the offices becoming a coronavirus cluster. This isn’t the same as quarantine or self-isolation, in that I can and do still leave the house, attend smaller-scale social events and meet friends. The probability of my becoming sick seems not too high, but even if I do, the idea is that we don’t want this to spread to others on the team or at the company.

The bus factor is a simple measurement of how concentrated key personnel are in a team. It is defined as the minimum number of people whose removal would cause the project to stall. For example, if a shop has five employees, where everyone can restock shelves but only two people are authorised to operate the cash register, then the shop team has a bus factor of 2, since if those two people are removed the shop will not function.

Generally, a higher number is better, as it means the team can tolerate more people being unavailable. This is often used to inform risk planning, such as by prioritising context-sharing, developer documentation or other means of spreading knowledge.

However, there is quite a lot of variance among teams with the same measurement. For example, suppose we have a team A which has two leads, and we say that the team will stall if both of them are unavailable (but only them – any other group of people becoming unavailable, including everyone except one of the leads, is okay). Also consider team B, where the absence of any two people will cause the team to stall (though no individual’s absence would). Both teams have a bus factor of 2. However, A is strictly safer than B, and by quite a large margin, it seems. For that matter, I’m not sure B is even safer than a team C where there is one absolutely critical lead (bus factor 1).

This leads to a concrete problem. Suppose each person on a team independently has a 1 percent probability of becoming unavailable for some reason. What is the smallest team size at which a team with a bus factor of 2 can actually be riskier than a team with a bus factor of 1?


Finding the Extremes

To answer this question, we want to find the extremes of a given bus factor BF in terms of risk, because we want to compare the riskiest bus-factor-2 team against the safest bus-factor-1 team. This drops out quite naturally: the safest team is one where there is a single specific set of BF people whose removal causes operations to stall, and no other combination (other than supersets of that set) is critical. Conversely, the riskiest team is one where every combination of BF people is critical. As defined above, team B is the riskiest possible bus-factor-2 team, and team C is the safest possible bus-factor-1 team.

Without loss of generality, let the people in the team be numbered 1, 2, \ldots, n, with the critical people numbered first (1 and 2 for team A’s leads; 1 for team C’s lead). Let the probability that an entity E becomes unavailable be P(E_\times). Then, the probability that each team’s work stalls is as follows:

\displaystyle P(A_\times) = P(1_\times \wedge 2_\times)

\displaystyle P(B_\times) = P \left( \bigvee_{i=1}^{n} \bigvee_{j=i+1}^{n} i_\times \wedge j_\times \right)

\displaystyle P(C_\times) = P(1_\times)

As written, these terms aren’t directly comparable – how they relate depends on the underlying probability distributions.

Attempting to Compare Terms

I’ve used logical formulas above to avoid making claims about the underlying probability distributions. However, if we assume that people become unavailable independently, each with the same probability p over some fixed period of time, then we have

\displaystyle P(A_\times) = p^2

\displaystyle P(C_\times) = p

P(B_\times) is marginally trickier. This is the probability there are at least two people who become unavailable. It is probably easier to work out the complement P(B_\times'): that is the probability zero people or one person is unavailable. If we define X as the number of people (out of n) who become unavailable, then P(B_\times') = P(X=0) + P(X=1). It’s clear that P(X=0) = (1-p)^n. P(X=1) = np(1-p)^{n-1}: the probability a specific person is unavailable and no others are is p(1-p)^{n-1}, but there are n valid versions of this. More generally, X follows what’s called a binomial distribution. Thus,

\displaystyle P(B_\times') = (1-p)^n + np(1-p)^{n-1} = (1-p)^{n-1} \left( 1 - p + np \right)

and thus

\displaystyle P(B_\times) = 1 - (1-p)^{n-1} \left( 1 - p + np \right)

Notice that, as expected, this is dependent on n, while the other teams A and C have probabilities that don’t depend on n. Furthermore, as n increases, P(B_\times) increases – intuitively, as you consider more people, every way in which the team was already broken ignoring the new people remains broken, but some previously unbroken states can also break. (A more mathematical argument could look at the ratios of successive terms.)

Synthesis

The combination of exponential and polynomial (well, linear) terms in P(B_\times) makes direct algebraic comparison with P(C_\times) difficult. We can, however, fix some risk probability p and find the value of n at which team B (no one is individually critical, but every pair of people is critical) has a greater probability of stalling than team C (there is a single critical lead). To answer our initial question, we choose p = 0.01. Since we have a closed form for the probability of team B stalling, we can plot a graph, or simply scan values of n:
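A minimal Java sketch (in the same style as the code later on this page), scanning n until team B’s stall probability first exceeds team C’s:

private static double teamBStallProbability(long n, double p) {
    // P(B_x): at least two of n people are unavailable, via the closed form above.
    return 1 - Math.pow(1 - p, n - 1) * (1 - p + n * p);
}

public static void main(String[] args) {
    double p = 0.01; // individual unavailability probability
    long n = 2;      // a bus factor of 2 requires at least two people
    while (teamBStallProbability(n, p) <= p) { // team C stalls with probability p
        n++;
    }
    System.out.println(n); // prints 16 for p = 0.01
}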

It turns out the answer here, n = 16, is actually relatively small in computational terms, even if it is large as far as teams are concerned. That said, the conditions for a team being the safest possible for a given bus factor are extremely strict – the sheer under-resourcing from removing many people, even if they aren’t information silos and don’t hold unique knowledge, would still severely impair team function.

Is It Someone’s Birthday Every Day? Solving the Inverse Birthday Problem

The birthday paradox is a fairly well-known but unintuitive observation that although there are normally 365 days in a year (for simplicity, ignoring leap years and assuming a uniform distribution for birthdays), with as few as 23 people it is more likely than not that two of them will share a birthday. One of my teammates proposed an inverse of this problem: what is the minimum number of people needed so that it is more likely than not that it is someone’s birthday every day?

Modelling A Specific Instance

There’s a lot to take in here, so let’s try to solve a simpler problem first. We could attempt to solve: given a specific number of people, how would we calculate the probability that it is someone’s birthday every day? If we have this method, we can reconstruct the answer to the original problem via a search algorithm.

The more general problem here would be: suppose we have N \geq 1 people. Each person draws one of K \geq 1 events with replacement – the events are equally probable and the draws independent. What is the probability that every event occurs at least once? The original birthday problem has K = 365, but that’s tricky to work with right off the bat, so let’s examine a couple of smaller cases.

Algorithmic Ideation

Base Case Analysis

If K = 1, since we always have at least one person, the one possible event will happen and the probability is 1. Moving on to K = 2, N = 1 is a special case with probability 0. For larger N, there are two specific situations we want to avoid: every person drawing the first event, or every person drawing the second. Each of these happens with probability 2^{-N}, so the probability of every event occurring at least once is 1 - 2 \times 2^{-N} = 1 - 2^{-N + 1}.

We can now analyse K = 3. If N \leq 2, the probability is 0. For N = 3, the first person will draw a novel outcome with certainty, but the second and third people also need novel outcomes, which happens with probability \frac{2}{3} \times \frac{1}{3} = \frac{2}{9}. It gets a little trickier with N = 4:

  • The first person to draw an event will get a novel event. This leaves 3 people to draw events, and 2 out of 3 events undrawn. We can call this state (3, 2, 3).
  • The second person to draw an event will get a novel event with probability 2/3, putting us in the state (2, 1, 3). Alternatively, he/she will draw the same event as the first person with probability 1/3, putting us in the state (2, 2, 3).
  • If we’re in (2, 1, 3), there’s a 1/3 probability that the third person draws the last event (in which case we are done), and a 2/3 probability that he/she doesn’t, putting us in (1, 1, 3).
  • If we’re in (1, 1, 3), we have one person who needs to draw the last event, which happens with probability 1/3.
  • If we’re in (2, 2, 3), the third person will get a novel event with probability 2/3, putting us in (1, 1, 3). Alternatively, he/she might draw the same event as the first two people, in which case we have definitely failed.

More generally, we can model the problem using a state machine (a technique which I discussed two and a half years ago). The states of this machine can be written as 3-tuples (p, q, r), meaning that p people still have to draw events, and of the r events, q have yet to be drawn. Notice that q \leq r; for our problem, we have r = K, and we want to find the probability of success starting from the state (N, K, K).

For K = 3, N = 4, the resulting state machine is exactly the one traced in the list above.

Recurrence Relations

We can define V(p, q, r) as the probability that transitions from the state (p, q, r) will eventually end up in a SUCCESS state, defined as (p, 0, r) for all values of p and r (as that means all events were drawn). We make some observations on simple cases:

  • V(p, 0, r) = 1. We are already in a success state by definition.
  • V(0, q \geq 1, r) = 0. If everyone has drawn an event and there are still events that haven’t occurred, there is no chance for us to draw those events.

That’s all we need, so we can look at the recursive case. In the state (p, q, r), there is a \frac{q}{r} probability that the next person draws a still-undrawn event, putting us in the state (p-1, q-1, r). Otherwise, with probability 1 - \frac{q}{r}, we end up in the state (p-1, q, r). It thus follows that

V(p, q, r) = \dfrac{q}{r} \times V(p-1, q-1, r) + \left( 1 - \dfrac{q}{r} \right) \times V(p-1, q, r)

Notice that this recursion is well-founded, since we handled the q = 0 and p = 0 base cases earlier. It is possible to optimise this by adding some more sophisticated base cases, but this already gives us a clear enough method for working out the probability. For example,

  • V(2, 1, 3) = \frac{1}{3} \times V(1, 0, 3) + \frac{2}{3} \times V(1, 1, 3) = \frac{1}{3} \times 1 + \frac{2}{3} \times \frac{1}{3} = \frac{1}{3} + \frac{2}{9} = \frac{5}{9}
  • V(2, 2, 3) = \frac{2}{3} \times V(1, 1, 3) + \frac{1}{3} \times V(1, 2, 3) = \frac{2}{3} \times \frac{1}{3} + \frac{1}{3} \times 0 = \frac{2}{9}
  • V(4, 3, 3) = V(3, 2, 3) = \frac{2}{3} \times V(2, 1, 3) + \frac{1}{3} \times V(2, 2, 3) = \frac{2}{3} \times \frac{5}{9} + \frac{1}{3} \times \frac{2}{9} = \frac{4}{9}

 

Solving The Specific Instance

The answer to our original problem, for a given number of people N, is then V(N, 365, 365). Working this out by hand is likely to be painful, so we should avoid that. It’s easy to code up something simple based on the recurrence relation above:

private static double findSuccessProbability(long remainingPeople, long remainingEvents, long totalEvents) {
    if (remainingEvents == 0) {
        // All events have been drawn at least once: success.
        return 1.0;
    }
    if (remainingPeople == 0) {
        // No one is left to draw, but some events are still missing: failure.
        return 0.0;
    }

    // With remainingEvents of totalEvents still undrawn, the next person
    // draws a new event with probability remainingEvents / totalEvents.
    double probabilityOfNewEvent = ((double) remainingEvents) / totalEvents;
    double successProbabilityOnNewEvent = findSuccessProbability(remainingPeople - 1, remainingEvents - 1, totalEvents);
    double successProbabilityOnOldEvent = findSuccessProbability(remainingPeople - 1, remainingEvents, totalEvents);

    return probabilityOfNewEvent * successProbabilityOnNewEvent
            + (1 - probabilityOfNewEvent) * successProbabilityOnOldEvent;
}

We can test this and verify that indeed V(4, 3, 3) = \frac{4}{9}. However, if we try to call this with, say, (500, 365, 365), the program doesn’t return an answer in any reasonable amount of time.

One solution here is to use memoisation. The naive version written above ends up computing the answer for the same state potentially exponentially many times, but we can cache the results of previous computations and use them. This could be done with a map with the 3-tuple as a cache key; alternatively, a bottom-up dynamic programming solution can be written, since there’s a clear order in which we process the subproblems. In any case, these methods should get the runtime down from exponential to O(NK).
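As a sketch, a memoised version (assuming java.util.Map and java.util.HashMap are imported) could pack the (people, events) pair into a single long cache key, since totalEvents is fixed for a given run:

private static final Map<Long, Double> CACHE = new HashMap<>();

private static double findSuccessProbabilityMemoised(long remainingPeople, long remainingEvents, long totalEvents) {
    if (remainingEvents == 0) {
        return 1.0;
    }
    if (remainingPeople == 0) {
        return 0.0;
    }
    // remainingEvents <= totalEvents, so this packing is collision-free.
    long key = remainingPeople * (totalEvents + 1) + remainingEvents;
    Double cached = CACHE.get(key);
    if (cached != null) {
        return cached;
    }

    double probabilityOfNewEvent = ((double) remainingEvents) / totalEvents;
    double result = probabilityOfNewEvent * findSuccessProbabilityMemoised(remainingPeople - 1, remainingEvents - 1, totalEvents)
            + (1 - probabilityOfNewEvent) * findSuccessProbabilityMemoised(remainingPeople - 1, remainingEvents, totalEvents);
    CACHE.put(key, result);
    return result;
}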

Reconstructing The General Solution

We want to find the smallest N such that V(N, 365, 365) > 0.5. We could start from N = 365 and then keep trying larger values of N: since K is constant this is actually not that bad as we can reuse the results of computations from previous attempts when trying new values of N.

public static void main(String[] args) {
    long numPeople = 365;
    double successProbability = findSuccessProbability(numPeople, 365, 365);
    while (successProbability <= 0.5) {
        numPeople++;
        successProbability = findSuccessProbability(numPeople, 365, 365);
    }
    System.out.println(numPeople);
}

This gives us our answer: 2287. We have V(2286, 365, 365) = 0.499414 \ldots while V(2287, 365, 365) = 0.500371 \ldots.

This approach looks a bit primitive – normally we’d use a modified binary search, where we first find an upper bound via exponential (galloping) search, and then binary search up to that bound. This works, since V(p, q, r) increases as p increases for constant q, r.
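A sketch of that approach, reusing the (memoised) findSuccessProbability from before; the doubling phase assumes only that values below the starting point are insufficient, which holds here since the probability is 0 for fewer than 365 people:

private static long smallestSufficientGroup(double threshold) {
    // Gallop: keep doubling until the success probability exceeds the threshold.
    long hi = 365;
    while (findSuccessProbability(hi, 365, 365) <= threshold) {
        hi *= 2;
    }
    long lo = hi / 2; // at or below this, known (or assumed) insufficient

    // Binary search on (lo, hi]: the predicate is monotone in the number of people.
    while (lo + 1 < hi) {
        long mid = (lo + hi) / 2;
        if (findSuccessProbability(mid, 365, 365) > threshold) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return hi;
}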

However, for this specific problem – and more generally for recurrences where parameters decrease by 1 – computing the answer for a large n still requires filling in almost all of the smaller subproblems, so skipping over values of n doesn’t actually save us that much here.

Usually when working with recurrences it is nice to find a closed-form solution (i.e. a general expression for V(p, q, r) that doesn’t itself reference V), though I didn’t see how to make progress with this one.

Detecting Progress in Sequences

I often try to evaluate whether something that is difficult to measure directly and clearly has improved. For example, I might like to evaluate how my German or logic puzzle skills have changed over time. I could try German past exams or look at logic contest results – however, one problem with these is that there is a lot of noise. For example, for the logic contest results, a score could be artificially low because I had a bad day or the contest was unexpectedly hard; it could also be high because I made multiple lucky guesses on difficult puzzles. Thus, a single measurement is unlikely to be sufficiently reliable.

One solution is then to use one’s average score, or take other statistical summaries of multiple data points. However, we probably don’t want to consider all of the data points we have as equally important. For example, I ranked 155th in a Sudoku GP contest in late 2017 – if I’m trying to evaluate my skill level at the end of 2019, that’s probably not relevant.

We could pick a cut-off point (for example, the beginning of 2019, or the last ten contests), discard all of the data from before it, and treat the remaining data points as equally important. This is the basis of sliding-window algorithms: if we’re interested in the average score over the last ten contests, we can track this metric over time by considering, at each point, the window of data ending there. There are methods for calculating such metrics efficiently (taking time linear in the length of the data stream).

Unfortunately, choosing a suitable window can be difficult – small windows can be vulnerable to noise, while large ones may fail to account for trends present within an individual window. As far as I know, this selection is more of an art than a science.

We can use more complicated approaches as well. Instead of picking a hard cut-off, where data from before the cut-off is irrelevant, we can instead treat data points as becoming less relevant over time. A method that’s often used is exponential weighting: giving the most recent observation a weight of \alpha (for some 0 < \alpha < 1), the second most recent a weight of \alpha (1 - \alpha), the third \alpha (1 - \alpha)^2 and so on. As \alpha approaches 0, we approach a simple historical average; as \alpha approaches 1, we approach remembering just the most recent element. I’m not sure the underlying assumption – that events become exponentially less relevant over time – is always appropriate.

Despite possibly sounding complex, this method has computationally favourable properties. If we’re keeping track of a stream of data, we don’t need more than constant additional memory: it’s enough to keep just the previously reported average, because incorporating a fresh data point D into our metric is simply S_{new} = \alpha D + (1 - \alpha) S_{old}.

There are some dangers here as well. The first challenge is bootstrapping; how should one pick the initial value of S? One could use the first observation, or perhaps an average of the first few observations if short-term divergence from reality is unacceptable.
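A minimal sketch of such a tracker, bootstrapping with the first observation:

public final class ExponentialMovingAverage {
    private final double alpha;
    private Double average = null; // null until the first observation arrives

    public ExponentialMovingAverage(double alpha) {
        if (alpha <= 0 || alpha >= 1) {
            throw new IllegalArgumentException("alpha must be strictly between 0 and 1");
        }
        this.alpha = alpha;
    }

    public double observe(double value) {
        // Seed with the first data point; thereafter apply S_new = alpha * D + (1 - alpha) * S_old.
        average = (average == null) ? value : alpha * value + (1 - alpha) * average;
        return average;
    }
}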

I think there’s also a risk of massive outliers skewing the average (e.g. an API call which usually takes nanoseconds exceptionally taking an hour because of a system outage). This risk exists with any statistical technique, but if \alpha is small, our estimate will remain ‘corrupted’ by the exceptional data point even after quite a few additional measurements. With the sliding window method, once the window has passed, the exceptional data point drops out entirely.

In general, the methods we’ve covered assign weighting functions to the data points – the simple average just assigns the same weight to everything, the sliding window assigns the same weight to everything in the window and 0 to things outside the window, while the exponentially weighted moving average (EWMA) weights each point differently based on how recent it is.

As an extension, there are techniques for maintaining constant-size reservoirs of values that can be used to approximate more general summaries like order statistics, standard deviations or skewness. These often rely on holding a subset of the values being observed in memory. The selection mechanism for which values should be kept can be written to bias towards more recent measurements. In some ways, the calculation of our standard sliding-window based moving average can be implemented as a special case of this, where new entries are always included, and the oldest entry at each insertion is evicted. That said, we would probably not do this for just an average, as we can do that with constant memory (just remember the current average).

It’s not a particularly scientific or deterministic method, but in practice I find it useful to consider graphs with various transforms applied to them and draw conclusions from those. Unfortunately, I don’t have sufficient statistical background or intuition to decide beforehand what would work well.

Anatidaephobia: Ducks, Ponds and Probability

We discussed another interesting question at work, this time over Slack. This one seemed more mathematical than programming-based, though.

Four small ducks are in a large circular pond. They can be at any point in the circle, with equal probability. What is the probability that a diameter can be drawn so that all four ducks are in the same semicircle in the pond?

Naturally, there is a straightforward generalisation:

N small ducks are in a large circular pond. They can be at any point in the circle, with equal probability. What is the probability that we can fence off some sector of the pond which subtends an angle P, so that all N ducks are enclosed in the fenced area?

If I had to do this as part of an engineering problem, my first reaction would be to implement a Monte Carlo simulation. This would quickly reveal an answer of \frac{1}{2} for the first question, though for the generalisation things might be less obvious.
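For instance, a quick sketch of such a simulation, reducing each duck to an angle in [0, 1) (a simplification justified two paragraphs below); the arc parameter is the fraction of the circumference the fenced region covers:

import java.util.Arrays;
import java.util.Random;

public class DuckSimulation {
    private static final Random RANDOM = new Random();

    // True if all ducks (as angles in [0, 1)) fit within an arc of the given length;
    // equivalently, some circular gap between consecutive ducks is at least 1 - arc.
    private static boolean fitsInArc(double[] ducks, double arc) {
        double[] sorted = ducks.clone();
        Arrays.sort(sorted);
        double maxGap = 1 - sorted[sorted.length - 1] + sorted[0]; // wrap-around gap
        for (int i = 1; i < sorted.length; i++) {
            maxGap = Math.max(maxGap, sorted[i] - sorted[i - 1]);
        }
        return maxGap >= 1 - arc;
    }

    public static void main(String[] args) {
        int trials = 1_000_000;
        int successes = 0;
        for (int t = 0; t < trials; t++) {
            double[] ducks = new double[4];
            for (int i = 0; i < ducks.length; i++) {
                ducks[i] = RANDOM.nextDouble();
            }
            if (fitsInArc(ducks, 0.5)) { // a semicircle is half the circumference
                successes++;
            }
        }
        System.out.println((double) successes / trials); // should be close to 0.5
    }
}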

Usually for this kind of problem I tend to start by varying some of the parameters and trying to solve a simpler version of the problem. With one duck, drawing a suitable diameter is trivial; with two, drawing a diameter through one of the ducks is sufficient (since the second duck is on one side or the other – ‘small’ here means that we don’t consider the case where a duck lies exactly on the diameter). Going up to three, things get a little complicated; depending on the position of the first two ducks, the third duck can either be placed anywhere (if the first two ducks are at the same location, for example) or perhaps might be quite restricted (if the ducks are on almost opposite sides of the pond).

I then looked at other possible ways of simplifying the problem. For example, we don’t really care about where the ducks are relative to the centre of the pond. The angles of the ducks relative to the centre suffice to determine whether drawing the diameter is possible; how far each duck is from the centre doesn’t affect this. We can thus treat the ducks as occupying points uniformly at random on a looping one-dimensional continuum, such as the interval [0, 1).

Returning to three ducks, we can try to formalise the intuition that the range of positions allowed for the third duck varies depending on the positions of the first two. Define the span S_n of n ducks to be the total space on the continuum that the ducks occupy. For base cases, S_1 = 0, and S_2 is just the smaller circular distance between the first and second duck, so it is uniformly distributed between 0 and 0.5.

If we fix the value of S_2 = x, we can attempt to find the range of allowable third-duck positionings such that S_3 \leq n. Without loss of generality, suppose the two ducks are sitting at points 0 and x. Then, the lowest possible point the third duck can sit at would be x - n, and the highest possible point n (assuming of course n \geq x; if this is not the case then there are clearly no possible positions). The range of this interval is n - (x - n) = 2n - x, and the probability that a duck lands in that range is 2n - x. This of course makes the assumption that 2n - x is less than 1; if it is more than 1, then the probability would be 1; however, in our specific case since x > 0 and n = \frac{1}{2} we don’t have to worry about this.

This then reduces to a problem about conditional probabilities. We want to weight each possible value of S_2 based on how likely it is; the relative likelihood of each value is given by the probability density function. For S_2, we have

\displaystyle f_{S_2}(x) = \left \{ \begin{array}{lr} 2 & 0 \leq x \leq 0.5 \\ 0 & \text{otherwise} \end{array} \right.

Then, weighting based on this density function, we have:

\displaystyle P(S_3 \leq n) = \int_{x} P(S_3 \leq n | S_2 = x) f_{S_2}(x) \text{d}x = \int_{0}^{n} (2n-x) f_{S_2}(x) \text{d}x

If n \leq 0.5 then that will be equal to

\displaystyle \int_{0}^{n} (2n-x) (2) \text{d}x = \left[ 4nx - x^2 \right]_{0}^{n} = 3n^2

Thus we can find a diameter for \dfrac{3}{4} = 0.75 of cases with three ducks. If n > 0.5, then we need to make several refinements – most obviously, the upper bound of the integral stops at 0.5, as there’s no span with two ducks larger than that. Furthermore, there may be values of x where 2n - x > 1 in which case we need to clamp it down to 1. For simplicity, we focus on the case where n \leq 0.5 which is sufficient for our purposes.

Moving on, to find P(S_4 \leq n), we can similarly formulate

\displaystyle P(S_4 \leq n) = \int_{x} P(S_4 \leq n | S_3 = x) f_{S_3}(x) \text{d}x

f_{S_3}(x) may not seem immediately obvious, but we can rely on the CDF of S_3 which was calculated earlier – the probability density function is simply its derivative. Thus f_{S_3}(x) = 6x, and the conditional probability can be handled with the same logic that we used to go from S_2 to S_3, bearing in mind that the middle duck(s) aren’t important. Hence

\displaystyle P(S_4 \leq n) = \int_{0}^{n} (2n-x) (6x) \text{d}x = 6 \int_{0}^{n} 2nx - x^2 \text{d}x = 6 \left[ nx^2 - \frac{x^3}{3} \right]_{0}^{n} = 4n^3

and we have the answer to the original question, by substituting n = \frac{1}{2} we obtain an answer of \frac{1}{2}.

Interestingly, the CDFs of the various S_k seem to take on the form P(S_k \leq n) = kn^{k-1} (for n \leq 0.5). We can prove this result by induction on k. It clearly holds for k \leq 4, based on our work so far – though note that k = 2 is enough as a base case. So we take some arbitrary integer k, assume that P(S_k \leq n) = kn^{k-1}, and seek to show that P(S_{k+1} \leq n) = (k+1)n^k. The approach is very similar to what we did for the concrete steps.

\displaystyle P(S_{k+1} \leq n) = \int_{x} P(S_{k+1} \leq n | S_k = x) f_{S_k}(x) \text{d}x

Since we’re assuming that P(S_k \leq n) = kn^{k-1}, we have f_{S_k}(x) = k(k-1)x^{k-2}. The conditional probability term simplifies using the same argument as before. Thus

\displaystyle P(S_{k+1} \leq n) = \int_{0}^{n} (2n-x) k(k-1)x^{k-2} \text{d}x = k(k-1) \int_{0}^{n} (2n-x) x^{k-2} \text{d}x

What follows is a bit of careful rewriting:

\displaystyle P(S_{k+1} \leq n) = k(k-1) \int_{0}^{n} 2n x^{k-2}- x^{k-1} \text{d}x = k(k-1) \left[ \frac{2n x^{k-1}}{k-1} - \frac{x^k}{k} \right]_{0}^{n}

And we can simplify this to:

\displaystyle P(S_{k+1} \leq n) = \left[ 2nk x^{k-1} - (k-1)x^k \right]_{0}^{n} = 2kn^k - (k-1)n^k = (k+1)n^k

which was what we expected. This also allows us to draw more specific conclusions – for example, for n = \frac{1}{2} the probability is simply \frac{k}{2^{k-1}} for k ducks.

Death by Cubes (Algorithmic Modelling: Pandemic)

“So the probability we lose is 2/3, multiplied by 5/23, or 10/69. It’s pretty much a 1 in 7 shot.”

I met a friend for dinner and board games a while back, and one of the games we played was Pandemic. Designed by Matt Leacock, the game is themed around managing and discovering cures for diseases before they spread across the world.

Diseases are modelled using cubes, and they spread across a network of cities (which are nodes in a graph). On players’ turns, they perform actions to contain the spread of diseases, research cures for these diseases or share knowledge with other players. After they are done with their actions, players draw cards from an infection deck, where each city (node) is represented once. For each card, players need to add a disease cube of the city’s colour to the city. For example, consider a simplified network of six cities A through F:

  • If D is drawn, one yellow cube will be added to D. Similarly, if E is drawn, one red cube will be added to E; if F is drawn, one blue cube will be added to F.
  • A city is allowed to have a maximum of three disease cubes of each type; if a fourth cube of the same type needs to be placed, an outbreak happens. Instead of placing the fourth cube, all connected cities receive one cube of the relevant colour. For example, if C is drawn, a blue outbreak occurs in C, and one blue cube is added to both B and F.
  • The rules on having at most three cubes of each type still apply when placing additional cubes, meaning that outbreaks can cause chain reactions. For example, if A is drawn, a red outbreak occurs in A. This causes one red cube to be added to B, D and E. However, B already has three cubes, meaning that a chain outbreak occurs in B; this causes one red cube to be added to A, C, E and F.
  • This looks like it would loop infinitely, but there is some mercy; in a given chain reaction, each city may only outbreak once. Thus, instead of adding a red cube and triggering an outbreak again in A, no red cube is added.

The game ramps up steadily in pace, by increasing the infection rate (number of cards drawn) over time. Players lose the game when the eighth outbreak occurs, when they need to place disease cubes but can’t (these are component-limited), or when the deck of player cards runs out.

Separately, the goal of the game is to research a cure for each of the four diseases in the game, which requires collecting data on the spread of the diseases. There is a deck of player cards which contains twelve city cards associated with each disease; players need to collect five of them (in one player’s hand) and bring them to a research station. This can be tricky, as players draw cards independently, and giving cards is restricted (I can only give you a card if we’re both in the one city that is associated with that card, and if it’s my turn). At the end of each player’s turn, players draw two cards from this deck (if they can’t, the game is lost).

The deck of player cards also contains a few special event cards which help players, and more interestingly several Epidemic cards which are disruptive. The number of Epidemic cards to include is variable (4 are used in an easy game, 6 for a difficult one). When an Epidemic card is drawn, players resolve a three-step process:

  1. “Increase”: The infection rate marker moves along its track, possibly increasing the infection rate (number of infection cards to draw at the end of each player’s turn). This begins at 2, but eventually becomes 4.
  2. “Infect”: There is a sudden wave of disease in a new city. Players draw the bottom card of the infection deck, and the city in question receives 3 disease cubes of its colour. The usual Outbreak rules apply, though thankfully only one Outbreak occurs even if the city already has two or three cubes.
  3. “Intensify”: The infection discard pile is shuffled (including the just-revealed card from the bottom of the infection deck). These cities are likely to suffer from more disease soon, especially since this happens as part of players drawing cards (before the infect phase of the current player’s turn).

Going back to the initial calculation, we were on the second-last turn of a game with six Epidemic cards, five of which had already been drawn. There were three cards left in the player deck. Seven outbreaks had already occurred, meaning that one more would lose us the game. However, we had also cured three diseases, and the Medic (which I was playing) was sitting in a research station with all of the cards needed to cure the last disease on the next turn. Furthermore, my friend, who was playing the Operations Expert, held the One Quiet Night event card, which allows skipping of the Infection phase.

Thus, the only risk of losing was drawing the last Epidemic card, and then in the Infect step revealing a city on the board which already had disease cubes of its colour (from outbreaks in adjacent cities). These events draw cards from separate decks, so their probabilities should be independent.

The first term is easy – there were three player cards remaining, one of which was the last Epidemic card, so there would be a 2/3 chance we would draw it. For the second term, we looked through the infection discard pile, which at that time had 24 cards (along with Sydney, which we removed from the deck using a special event card). There were thus 23 unseen infection cards; six of them corresponded to locations on the board which had disease cubes. On my friend’s last turn, we were able to defuse one of these locations, leaving a final loss probability of (2/3) * (5/23).

It turned out that we got lucky: we didn’t draw the Epidemic card, and even if we had, the card at the bottom of the infection deck would have been safe. I wasn’t initially keen on doing too much of the calculation, figuring that clearing one risky place was easy but two was hard, but my friend (who is a statistician) decided to go ahead with the calculation.

On Set Collecting

When I hear the word ‘set’, the first thing I think of is the computing data structure, and then the mathematical construct that that data structure models. Essentially, a set is an unordered collection of distinct objects. I’d probably then think of the Egyptian deity before thinking of the card game, or of defeating a contract in bridge. That said, we recently had some interesting discussions about the card game over lunch at work.

The game consists of a deck of 81 distinct cards which each show a number of shapes. Each card has four attributes, and each of these attributes can take on three values – number of shapes (one, two or three), colour (red, purple or green), shading (shaded, striped or unshaded) and shape (diamond, rounded rectangle, ‘S’).

Initially, twelve cards are dealt to the table; players compete in real-time to identify a set of three cards where, for each attribute, the cards are either all the same or all different. The player collects these cards, and the dealer replenishes the cards on the table.

It is possible that no set exists among the twelve cards. In physical games, the dealer will, after a while, deal three more cards to the table if no one can identify a Set. We were playing on an app, though; this app waits for players to claim that there is no set available, and penalises players who make an erroneous claim. More often than not, even though all of us were looking carefully at the board, we were too quick to declare no-set with 12 cards. However, we had one round where even with 15 cards (after one declaration of no-set), no one found anything. After a long while, someone pressed the no-set button, and indeed there wasn’t a set!

After the game, discussion turned towards finding the maximum number of cards that can be on the table with no set present. We quickly settled on 16 as a lower bound; this is achievable if you only use two of the three values for each of the four attributes (e.g. by throwing out all cards that have three symbols, are red, are shaded or are diamonds). It then becomes impossible to match any attribute as all-different, since that would require all three values of that attribute to appear; also, because cards are unique, there is no set of three cards where all four attributes are the same (that would be three copies of the same card). This arrangement has 2^4 = 16 cards.

Going beyond this is tricky, mainly because the number of possible groups of cards you have to pick from is large. There are 81 ways of selecting the first card, and 80 ways of selecting the second. The third would have 78 ways, since there is one specific card that won’t go with the first and second. However, later on you can’t just deduct one card for each already selected card; for a simple counterexample, suppose I have already selected:

  • (1) one red shaded diamond
  • (2) one green shaded rectangle
  • (3) one green shaded S

Now suppose I try to add a fourth card, (4), with one red shaded rectangle. This means that the card with one purple shaded rectangle would now become ineligible for selection, because it would form a set with (2) and (4). However, we had already eliminated this card from consideration, because it forms a set with (1) and (3).

I turned to writing a quick program here. This is a standard algorithm based on depth-first search.

  • Each card is represented as a vector with four components.
  • There is an initial set of 81 vectors created.
  • To guide our search, whenever we consider adding a card to a working set, we iterate over the elements of the existing working set and determine which cards would, together with the new card, create a set. This relies on an observation that for any pair of (distinct) cards, there is one and only one card that would form a set with the first two. We then prune these set-creating cards from the search tree.
private static Set<SetCard> best = Sets.newHashSet(); // largest cap set found so far (Sets is Guava's)

private static void recursiveSearchForCapSet(Set<SetCard> possibleCards, Set<SetCard> workingSet) {
    if (possibleCards.isEmpty()) {
        // Nothing can be added without forming a set; record the working set if it is a new best.
        if (workingSet.size() > best.size()) {
            System.out.println("New best found: " + workingSet.size() + "; " + workingSet);
            best = Sets.newHashSet(workingSet); // copy, since workingSet is mutated on backtracking
        }
        return;
    }

    for (SetCard candidate : possibleCards) {
        // Adding 'candidate' rules out, for each card already in the working set,
        // the unique card that would complete a set with that pair.
        Set<SetCard> elimination = Sets.newHashSet();
        for (SetCard workingCard : workingSet) {
            SetCard newBadCard = computeLastElementOfSet(candidate, workingCard);
            elimination.add(newBadCard);
        }
        elimination.add(candidate);
        workingSet.add(candidate);
        recursiveSearchForCapSet(Sets.difference(possibleCards, elimination), workingSet);
        workingSet.remove(candidate);
    }
}

private static SetCard computeLastElementOfSet(SetCard first, SetCard second) {
    // XXX There's probably a cleaner way to compute 'all different'
    long targetP1 = first.p1 == second.p1 ? first.p1 : 3 - (first.p1 + second.p1);
    long targetP2 = first.p2 == second.p2 ? first.p2 : 3 - (first.p2 + second.p2);
    long targetP3 = first.p3 == second.p3 ? first.p3 : 3 - (first.p3 + second.p3);
    long targetP4 = first.p4 == second.p4 ? first.p4 : 3 - (first.p4 + second.p4);
    return new SetCard(targetP1, targetP2, targetP3, targetP4);
}
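For completeness, a sketch of the SetCard value class the code above assumes: each attribute is encoded as 0, 1 or 2, which is what makes the 3 - (a + b) trick work, and a Java record provides the value-based equals and hashCode that the set operations rely on.

// Four attributes (number, colour, shading, shape), each encoded as 0, 1 or 2.
public record SetCard(long p1, long p2, long p3, long p4) {}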

This found sets of size 17 quite quickly, but knowing that the bound was 20, I knew that wasn’t good enough. There were several implementation tricks that I used to help speed up the search.

  • This search is easily parallelisable; each branch of the search tree may be explored in parallel. A simple implementation (and what I used) could rely on Java’s parallel streams when doing the recursion, for instance. The orchestration around keeping track of the best answer needs a few small changes to be thread-safe, of course.
  • If the goal is merely to find some set of size 20, then we can exploit the symmetry of the problem and fix the first card to be any card (I used 0,0,0,0), since any cap set could be re-expressed as one that contains 0,0,0,0 by permuting the assignment of attributes to variable values. There may be more clever symmetry to exploit here than what I did.
  • Computing the last element of a set involves a bunch of jumps, branches and math; the result of this computation is deterministic, and can be cached so that it can be reused along different branches of the search tree. There’s probably something more clever I can do regarding overlapping sub-problems (this improves the runtime only by a constant factor).

This was sufficient to get the tool to spit out many sets of size 20, though I unfortunately didn’t churn through every possible selection, so this doesn’t show that 20 is an optimum.

Algorithmic Modelling – Codenames

Codenames is a board game designed by Vlaada Chvatil which tests communication. The game is played with two teams; each team splits itself into clue-givers and guessers.

There is a board of 25 cards, each with one (or more) words on them. Of these cards, 9 are owned by the first team, 8 owned by the second team, 7 are neutral and one is the ‘assassin’.

The clue-givers attempt to communicate to the guessers which cards are owned by their team. This is done by giving a one-word clue followed by the number of cards N corresponding to the clue. The rules establish some bounds on allowed associations (for example, ‘sounds like’ clues are not allowed).

I don’t know how much time went into selecting the words to appear on individual cards, but there are certainly many words in the deck that can be interpreted in multiple ways, which makes the game fun. For example, I can think of quite a number of ways to clue BOND (COVALENT for unambiguity; DURATION, ASSET or DEBT for the financial instrument; JAMES, AGENT, SEVEN or even at a stretch something like CASINO for the character; FRIEND or FRIENDSHIP; STREET or TUBE; CONTRACT, PROMISE or WORD for the idea of a promise; ADHERE or CONNECT, among others). Which one I might pick would depend on the other words on the board.

Guessers then identify up to N+1 cards they think their team owns. This can be based on the entire history of clues given, not just the most recent one. These cards are guessed one at a time; a team is only allowed to make further guesses if it correctly guesses one of its own team’s cards. Teams may also refrain from making all of their guesses.

The game ends either when all cards owned by a team are revealed (that team wins), or if a team reveals the assassin (that team loses).

In practice, clue-givers will need to consider all cards, not just the ones their team owns. The penalty for ‘failure’ varies; the assassin is an instant loss, and revealing a card owned by the other team is also bad. For example, with the cards APPLE, BANANA and GRAPE it would be very tempting to declare (FRUIT, 3) as a clue; yet, if KIWI is owned by the other team (or worse, is the assassin) it might be dangerous.

However, if the other team has already revealed that KIWI is theirs (e.g. with another clue, maybe SOUTH or BIRD) then the FRUIT clue becomes safe. Thus, pre-computing a strategy at the beginning of the game (e.g. by partitioning the eight or nine clues owned into several logical groups while avoiding other words) may not be optimal.

I tend to consider making an effort to win to be a key part of games, and thus what moves may be considered good also often depends on the state of the game. For example, if the opposing team has just one card left, I will give a broader clue that may have more tenuous links in a ‘do or die’ effort to finish on this turn. A less extreme example would be attempting weaker links if behind, or perhaps playing slightly more conservatively if ahead.

Mathematically modelling Codenames can be tough. We can try modelling the game state as a tuple (O, E, N, x, h, f_O, f_E) where O, E, N are the sets of remaining words our own team has, words the enemy team has, and neutral words respectively, x is the assassin, h = (t, w, n)^+ is the history of the game as a sequence of clues – each (t, w, n) meaning that team t gave the clue word w and hinted that n cards were relevant – and f_O and f_E are preference functions of the form h, w, n, O, E, N, x \rightarrow P, returning an ordering over O \cup E \cup N \cup \lbrace x \rbrace. This already abstracts two difficult parts away – ordering the clues, and determining how many of the top preferences to pick.

I think the preference functions need to take into account previous clues from both teams; if a previous clue could clearly have corresponded to two words and I picked the wrong one, it might be fairly high on the list even if unrelated. Similarly if this scenario happens to the opposing team, I would probably avoid the word that would blatantly be theirs.

The notion of degree of confidence also isn’t captured well by our ordering; going back to the fruit example, having a clear ordering would imply that clues for fewer than 4 items could reliably result in correct guesses (if we knew that APPLE was by a small margin the best pick, we could safely and confidently give (FRUIT, 1) to clue it, which seems wrong). One could imagine modelling this with selection probabilities, though things would become even more complex.

The above representation still seems computationally difficult to work with. The evolution of the preference function as one moves from one state to another is unclear (so in that representation it is fully general), making lookahead difficult. It doesn’t seem like a greedy choice is always best; for example, given eight clues that are reasonably divisible into four pairs, a clue for 4 words taking one element from each pair might be a bad idea if the other words can’t easily be linked.

We can proceed by simplifying the preference functions; a simple approach could be that for each (w, n), each team has a persistent preference function that returns an ordered list over O_0 \cup E_0 \cup N_0 \cup \lbrace x \rbrace (the initial sets at the start of the game). The preference functions during the game then return the subsequence of that list containing exactly the words still left in the game. This of course doesn’t take into account past information or clues from the other team.

With this, we can attempt to solve the game using standard techniques; assuming that the vocabulary of clues is bounded (let’s say it must be from the Linux dictionary), a game state is winning for the current team if there exists some word for which the preference function returns everything in O as a prefix. A state is losing if all moves in that state produce a state which is winning for the opposing team.
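As a sketch of this immediate-win test (the names vocabulary and preference are hypothetical stand-ins; preference returns the persistent ordering for a clue word, and the usual java.util, java.util.function and java.util.stream imports are assumed):

private static boolean hasWinningClue(Set<String> ownWords, Set<String> wordsInPlay,
                                      List<String> vocabulary,
                                      Function<String, List<String>> preference) {
    for (String clue : vocabulary) {
        // Restrict the persistent preference order for this clue to words still in play.
        List<String> order = preference.apply(clue).stream()
                .filter(wordsInPlay::contains)
                .collect(Collectors.toList());
        // The clue wins outright if our remaining words are exactly a prefix of the order.
        if (order.size() >= ownWords.size()
                && new HashSet<>(order.subList(0, ownWords.size())).equals(ownWords)) {
            return true;
        }
    }
    return false;
}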

We can generalise this to a simple probabilistic model as well; the preference ‘functions’ instead return a discrete random variable that indicates either guessing some word or passing. A simplified model could then, at the start of the game, assign weights to each candidate indicating the relative probability that candidate would be selected. These can be normalized to account for a pass probability. As words are removed from the board, the probability of the remaining words being selected scales (we can imagine a simple rejection-sampling where we discard words that aren’t actually on the board any more).

The algorithm for the probability that we get a win from a given state is then slightly more complex (though I think still reasonably covered by standard techniques).

Likelihood Estimation

One of my team-mates introduced another interesting question over lunch. Working through it reminded me of some of the statistics problems I struggled with at Imperial, specifically during Intelligent Data and Probabilistic Inference. It reinforced that in spite of scoring 90 for that course I’m still not confident I knew what I was doing then (or now).

Suppose you have a symmetric triangular distribution of unknown width and mean. Given that the distribution has yielded three independent samples of 2, 4 and 5, what is the expectation of the mean?

The triangular distribution can be used as an estimate for something which is known to be bounded both above and below, and that also takes into account a value known to be the most probable. Some argue that it is the simplest distribution satisfying these constraints (though one could argue that a cosine or some kind of truncated normal might be more applicable).

My instinctive answer is simply the mean of the samples, \frac{11}{3}, though I was suspicious, as probability and statistics often yield unintuitive results.

The distribution is named as such because it has a triangular probability density function; because of the laws of probability (the area under the function must be 1), specifying the minimum, maximum and mode is enough to uniquely identify it. Supposing we have a minimum a, a maximum b and a mode c, this yields a pdf of:

f(x) =\begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)} & a \leq x \leq c \\ \dfrac{2(b-x)}{(b-a)(b-c)} & c \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

We are only dealing with the symmetric case, so we can set c = \frac{a+b}{2} (which makes the mean coincide with the mode), cleaning things up a little:

f(x) =\begin{cases} \dfrac{4(x-a)}{(b-a)^2} & a \leq x \leq c \\ \dfrac{4(b-x)}{(b-a)^2} & c \leq x \leq b \\ 0 & \text{otherwise} \end{cases}

Based on our three samples of 2, 4 and 5, we can construct the likelihood that a given triangular distribution gave rise to this result. While the probability that a continuous distribution yields any exact set of values is infinitesimally small, we can still compare distributions in relative terms using their density functions. We can write this as

P(2,4,5;a,b) = f(2)f(4)f(5)

Expanding this term will depend on where exactly 2, 4 and 5 fall in our triangle. Let’s work out the most promising case (where 2 falls on the left of c while 4 and 5 fall on its right); the rest are left as an exercise to the reader. In this case, we have

P(2,4,5;a,b) = \dfrac{4(2-a)}{(b-a)^2} \times \dfrac{4(b-4)}{(b-a)^2} \times \dfrac{4(b-5)}{(b-a)^2} = \dfrac{64(2-a)(b-4)(b-5)}{(b-a)^6}

At this point, we notice the original question needs a bit more specification: we aren’t told the prior distribution of possible values of a and b. One way of getting around this is to pick a uniform distribution; however, a uniform distribution isn’t well-defined over the whole real line. For now, we can instead find the maximum likelihood estimate for a and b.

Alternatively, if we specified prior probability distributions for a and b, we could use the samples to obtain a posterior distribution. Usually we would pick a conjugate prior, whose functional form is preserved when updating on the sample; I didn’t find one for the triangular distribution, though.

To find the most likely distribution, we seek an extreme point of the likelihood; this can be done by taking partial derivatives (the expression as written would call for repeated use of the quotient rule). There is a fairly standard ‘trick’ for handling these: since the logarithm is a strictly increasing function, we can work with the log-likelihood instead, whose maximum coincides with that of the original function. Using the laws of logarithms, we get something a lot more tractable:

\log P(2,4,5;a,b) = \log 64 - 6 \log(b-a) + \log(2-a) + \log(b-4) + \log(b-5)

Computing the partial derivatives is then straightforward. Setting them to zero and solving the resulting equations simultaneously yields

a, b = \left( \dfrac{1}{3} \left( 4 + \sqrt{\frac{2}{5}} \right), \dfrac{1}{3} \left( 16 - \sqrt{10} \right) \right), \left( \dfrac{1}{3} \left( 4 - \sqrt{\frac{2}{5}} \right), \dfrac{1}{3} \left( 16 + \sqrt{10} \right) \right)

We want the second pair of solutions; the first actually has b \approx 4.279, which is no good (we need b > 5). Interestingly, the mean of the resulting triangular distribution is then \dfrac{2}{15} \left(25 + \sqrt{10} \right) \approx 3.75497, which is not quite the sample mean of 11/3 \approx 3.667.

The log-likelihood we get with these values of a and b is about -4.74053. By contrast, if we look at the family of symmetric distributions centred on the sample mean, with a = \frac{11}{3} - \alpha and b = \frac{11}{3} + \alpha, the best log-likelihood we can get is about -4.7473.
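As a quick numerical cross-check (a sketch; the candidate values are hard-coded from the closed forms above):

private static double logLikelihood(double a, double b) {
    // Valid for a < 2 and b > 5, matching the case analysed above.
    return Math.log(64) - 6 * Math.log(b - a)
            + Math.log(2 - a) + Math.log(b - 4) + Math.log(b - 5);
}

public static void main(String[] args) {
    double a = (4 - Math.sqrt(0.4)) / 3;
    double b = (16 + Math.sqrt(10)) / 3;
    System.out.println(logLikelihood(a, b)); // approximately -4.74053

    // Scan the symmetric family a = 11/3 - t, b = 11/3 + t; t > 5/3 keeps a < 2.
    double best = Double.NEGATIVE_INFINITY;
    for (double t = 5.0 / 3 + 1e-4; t < 10; t += 1e-4) {
        best = Math.max(best, logLikelihood(11.0 / 3 - t, 11.0 / 3 + t));
    }
    System.out.println(best); // approximately -4.7473
}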

On Challenges that Build

On my return flight from Singapore to London, I listened to quite a few hours of music. Two of the songs I listened to, and enjoyed at least partially for similar reasons, were It’s Gonna Be Me (by NSync) and I Can’t Be Mad (by Nathan Sykes). It’s a bit of a strange pairing, as the former is an upbeat, relaxed pop song while the latter is a fairly moody piano ballad. However, the common element I latched on to is that both songs feature sections that are repeated multiple times, with the vocals developing additional complexity on each iteration (thinking about it, this is fairly common both in songs that are reviewed well critically and in songs I like). For example, in It’s Gonna Be Me there is a line in the chorus which is sung four times over the course of the song, and its complexity develops on each pass.

The repeated sections in I Can’t Be Mad have a couple of changed notes, but also (if one is trying to reproduce the original) demand different productions of the notes (falsetto vs not, belts, etc.). There’s always a risk of adding too many embellishments, though I find expanding upon base melodies can be quite interesting. Singing these, and considering what would be reasonable for my voice (adding a closing run to the last syllable of the line above, for instance) and what would not be (adding a +1 semitone key change after the second chorus in I Can’t Be Mad – the original is already awfully hard), can be enjoyable too.

Generalising this, I quite like the idea of “increasingly complex variations on the same theme” when learning concepts and when teaching them. This already happens for many concepts in mathematics. Over the course of an A-level student’s mathematics education, he/she might learn how to write a quadratic expression as a product of linear factors (e.g. converting 6x^2 - 19x - 7 into (2x-7)(3x+1)). This could begin with expressions where factorisation by inspection is feasible. However, students should also be presented with some examples where inspection is extremely difficult or even impossible (though probably only after gaining some confidence with the cases where inspection is plausible). For general expressions, one could use the quadratic formula and the factor theorem to factorise something like 6x^2 - 19x - 8 into -\frac{1}{24}(-12x + \sqrt{553} + 19)(12x + \sqrt{553} - 19). There will also be some expressions, like 6x^2 - 19x + 16, where the solutions to the quadratic are not real; later, with some understanding of complex numbers, these would make sense. Students will also learn about problems which may not obviously be quadratics but can be written as such (like x^4 + 2x^2 + 1); the ability to synthesise the various techniques can then be tested with something like 7x^8 - 10x^4.

To some extent my Masters project also had this theme – linear time logic, adding knowledge, adding dynamic modalities, generalising that to full branching time logic, and then switching out the infinite traces for finite traces. I haven’t written a course or a book on a computer science topic yet, but I can imagine that there might at least be sections that follow this kind of sequence.

This pattern also occurs a fair bit in many technical interviews I’ve seen as well, where problems start easy, but additional and progressively more challenging constraints are repeatedly introduced. The purposes here could include testing for a breaking point, seeing how candidates react to problems without an obvious solution, or whether they are able to synthesise additional information to come to a solution.

I find that I often learn best by practising on smaller examples first, and then (attempting to) generalise their conclusions to larger models, considering when those conclusions hold and when they fail. Having multiple variations of progressive difficulty can be useful, as they give a sense of achievement as partial progress towards an overall goal is made. Furthermore, I find that understanding how changes in the problem scenario make the base solution method applicable or inapplicable is a key part of understanding in general; there is a clear need to reason about this when considering incremental variations. Going back to It’s Gonna Be Me, for example, aiming downwards at the word ‘love’ and not conserving sufficient air or energy for it might work for the first three passes, but it’s unlikely to on the last round.

There is a risk that the method can be frustrating in that it seems like it is consistently ‘moving the goalposts’, especially if one forgets that the partial goals are partial goals (and starts to think of them as complete ends in and of themselves). The standard I’m using for understanding (ability to critically evaluate applicability in novel contexts) may be seen as a little high. I also haven’t covered how to bootstrap the method (that is, how to develop an understanding of how to attack the base problem before any variations are introduced). Nonetheless I think there are some contexts where this works well. I’ve found it to be useful in singing, mathematics and interviewing at least!

Making Heads of Tail Risks

I remember being fairly anxious at the beginning of my fourth year at Imperial: I was concerned about securing work after university. Looking back, this seems patently ridiculous; I had topped my class for the third time and already had a return offer in hand from Palantir. However, owing to sweeping government rhetoric at the time about controlling post-study work visas, I saw “not being able to get a work visa” as the primary risk, even if it was remote. That assessment in and of itself was probably correct, though the time I spent monitoring and mitigating that risk (reading up on government committee reports, and considering alternatives like an H-1B1, an EU Blue Card or doing a Tier 2 ICT after a year) was excessive.

Of course, this never materialised; and even if it had, the only likely impact would have been that I’d have to fly home to Singapore between finishing uni and starting work (I did not, though in hindsight that might have been a good thing to do).

I’m not sure when I first became aware of the concept of probability density functions (or, for that matter, continuous random variables). These functions take on nonnegative values and integrate (across all variables) to 1. In the case of single-variable functions, one can plot them on a two-dimensional graph; the return profiles of the three investments discussed below are examples.

Areas of regions underneath the graph correspond to the probability that a value falls in that region. For example, a uniform distribution has a density function that’s just a horizontal line. The returns of investments 1 and 2 in the example follow what’s called a normal distribution; investment 3 follows a Student’s t distribution, which has fatter tails.

Since areas correspond to probabilities, a simple technique for generating random values from an arbitrary bounded distribution is rejection sampling: draw a box around the density and throw darts randomly at it; the x-coordinate of the first dart that lands underneath the curve is a representative random sample.
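A sketch of this in Java; the pdf, its support [lo, hi] and a ceiling on its density are supplied by the caller:

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

public class RejectionSampler {
    private static final Random RANDOM = new Random();

    // Throw darts at the bounding box [lo, hi] x [0, maxDensity] and return the
    // x-coordinate of the first dart that lands underneath the density curve.
    public static double sample(DoubleUnaryOperator pdf, double lo, double hi, double maxDensity) {
        while (true) {
            double x = lo + (hi - lo) * RANDOM.nextDouble();
            double y = maxDensity * RANDOM.nextDouble();
            if (y < pdf.applyAsDouble(x)) {
                return x;
            }
        }
    }
}

For example, sample(x -> x <= 0.5 ? 4 * x : 4 * (1 - x), 0, 1, 2) draws from the symmetric triangular distribution on [0, 1].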

That’s a basic mathematical introduction. If we had to rank the quality of the return profiles above (remember: further right means higher returns), a lot would depend on what we were trying to do. I would personally rank investment 2 (the green curve) on top; it has a considerably higher mean return than investment 1 (blue) and adds only a small amount of variability. We can calculate the standard deviation of a given distribution, a measure of how much variability there is about the mean; the blue curve has a standard deviation of 0.6, compared with 0.7 for the green curve.

Ranking investments 1 and 3 is more difficult; the mean of 3 is higher, but it adds a lot of uncertainty. I’d probably rank them 2, 1, 3. However, there is also an argument in favour of investment 3 if one is only interested in whether the returns exceed a certain level. It’s a similar line of argument to the following: if you asked me to double a large sum of money (nominally) in 20 years, I’d pick a bond; in 10 years, a general stock index fund; and in 10 minutes, probably blackjack or aggressive forex speculation.

Whichever investment we pick, it’s possible that we get unexpectedly awful (or excellent!) results. The standard deviation gives us some measure of what to expect, but there is still a non-zero probability of an extreme result. For the normal distributions (the blue and green curves), there is a 99.7% probability that a single observation will be within three standard deviations of the mean; this also means there’s a 0.3% probability it is not, and about a 0.15% probability it’s more than three standard deviations below the mean.

Tail risk refers to the risk of events that may have severe impact but are low-probability; considering them is important. Going back to the work visa situation, I think I correctly identified visa policy changes as a tail risk, though in hindsight controlling the amount of time spent mitigating them was done poorly – akin to spending $10 to insure against a 1% probability of $100 loss (provided the $100 loss wasn’t crippling – which it wouldn’t have been).

I also spent a lot of time mitigating this specific tail risk, when perhaps a better approach would have been developing resilience to general tail risks affecting my employment. The obvious routes at the time would have been continuing to do well academically and developing my skills, though others exist too – such as having a greater willingness to relocate, living below one’s means and building up an emergency fund. There are still further tail risks that the above wouldn’t address (e.g. a scenario where computers and automation are universally condemned, all countries practise strict closed-border policies and the global fiat money system collapses), but the costs of mitigating those seem untenably high. I haven’t read Antifragile yet (what I describe here is weaker, as it doesn’t involve benefiting from low-probability events), though it’s on my reading list for some point in the future.
