Browse Month

# That One Thing (2018 Q3 Review)

Monomania refers to a mental disorder where one pathologically focuses on a single thing (naturally, at the expense of others, as one’s attention is limited). To some extent, I may have experienced this in my second year at Imperial, where for some reason I became fixated on my academic performance. I remember studying almost 100 hours per week when preparing for my exams then (typically achieved by a very strict 9am-11pm schedule everyday for about five to six weeks). I’m not sure how I managed at the time. In terms of grades, I think one should aim at 60 + epsilon, 70 + epsilon or 100, and I chose the last of these.

Things have changed since. I now evaluate how things have changed over a variety of facets. Even professionally, the metrics are a bit more multi-dimensional now. I’m not sure if it was the Palantir internship or something else, but in fourth year I made some decisions that clearly weren’t optimal as far as academic performance was concerned (working 15-20 hours a week). At that time, my main goals were get an average of 90% and submit twenty pull requests as a part-time software engineer. After I graduated the metrics became writing papers and pull requests; the latter quickly changed to patents as, all other things held equal, I find developing two or three innovative systems far more interesting and probably more meaningful than fifty rudimentary changes.

In terms of development, I’ve certainly found my recent work to be challenging (which is a good thing!). I’ve been continuing my focus on performance work, though now that many easy wins have been taken the features are getting more complex. I’m also starting to have more clarity on the kind of work that motivates me (and discussing this with my lead and others); having such discussions feels relevant and good.

I presented the LDLK on finite traces paper at AAMAS’18 (and it seems I now have a h-index of 2). It’s unlikely there’ll be another conference paper for a while though, as I’m currently diverting my computer-science energies towards writing a journal paper which summarises and extends the three conference papers to some extent.

A highlight of the quarter was receiving the VCLA award for an outstanding Masters’ thesis as well (Imperial article). In terms of raw computer science difficulty, MCMAS-Dynamic is probably the most complicated project I’ve delivered (there are a couple I’ve done since that I’m also proud of, but I’m pretty sure MCMAS still wins). Some recognition is always nice; too much ego-stroking can be unhealthy, but the odd bit is enjoyable.

In terms of finance, I didn’t really pay much attention to things this quarter. I was aware of the Turkish and Argentinian currency troubles as well as the US-China trade spat, but things on my end are still largely business-as-usual. In fact I thought this quarter was not going to be good, but looking at things it’s largely an effect of recency bias; while performance in September was indeed weak, July and August were actually very good.

The high expenses of Q2 have actually snowballed further in Q3, though most of this is explainable by six-month periodic expenses and booking travel for the end of the year. Discretionary expenses have actually fallen a fair amount, which is unsurprising as Q2 expenses included Stockholm travel as well as renewing my web server. Savings rate is down, though still on track to be in a good place this year.

Work on dealing with logical puzzles has continued, and I think I’ve managed to find a source of metrics; Logic Masters India has a good selection of past contests along with a rating system. Based on a few of these, my Puzzle skills are currently around the low to middle 600s, while my Sudoku skills are in the high 700s. Individual contests may vary in difficulty, so normalisation is used. Interestingly, this is done on two axes, comparing both one’s score relative to the top scorer and one’s ranking relative to all participants.

I’ve also started to take an interest in cryptic crosswords, though I’m certainly nowhere as competitive at these; my general knowledge is not sufficiently broad. “Logical puzzles” are designed with the intention of being culture-neutral; these often demand a wide vocabulary (which I think I have) and broad general knowledge (less so). It helps that there is a group at Palantir that attempts these recreationally; I’m not aware of a similar group for general logic puzzles.

When one is investigating a search problem, one typically needs to balance exploration (searching prospectively for additional solutions) and exploitation (locally refining good known solutions). Historically when managing my own decisions I’ve largely been pushing exploitation for quite some time. I first discovered an interest in mathematics when I was around six; while this isn’t quite computer science or software engineering, that was near-by. I started programming at 15, and when choosing my degree looked at CS which I felt balanced my interests, skills and prospects well. When I was at university, I had the aforementioned monomaniac episode; since then I’ve branched out a little.

I’ve been behind the times (the song was released in 2017, apparently), but I’ve been listening to Meant to Be (Bebe Rexha and Florida Georgia Line) a fair bit this quarter. I don’t particularly enjoy the vocals in the original, but it’s been covered well; this one is better polished, while this one has the approach I go for when I try to sing it!

It’s enjoyable for me to (try to) sing along to, including the high vocal line in the chorus; it quickly becomes a question of hitting the tenor Bb4 repeatedly. Listening to the song certainly does make me thing of search space exploration problems, specifically genetic/evolutionary algorithms. The lyrics in the chorus sound very much like evaluating a fitness function on a candidate solution and then deciding whether to accept it:

So won’t you ride with me, ride with me?
See where this thing goes?
If it’s meant to be, it’ll be, it’ll be; baby if it’s meant to be.

The first verse has elements of not being too pressured with exploitation as well, and thematically echoes bits of Good Things Come To Those Who Wait which I looked at last quarter. The second verse reflects a degree of caution after being burned by poor approaches in the past. As I plan for what I might want to try to achieve (God willing) on a timescale of months or years, it’s important to not index too heavily on past experiences or initial reads of how a given path may turn out.

# Algorithmic Modelling – Alphabear 2

Alphabear 2 is a word game where players are given a two-dimensional grid of tiles. Some of these tiles show letters, along with timers; others are concealed and must first be exposed by using an adjacent letter in a word. Players use letter tiles (anywhere on the grid) to form a word. When a player makes a word, he/she scores points equivalent to the sum of the timers on the tiles that are used. The timers on all tiles that aren’t used then count down by 1; tiles that reach zero become unusable stone blocks.

Usually, this means that you want to form long words that incorporate tiles that have shorter timers; for example, there is a ten letter word on the above board (PEDESTALED), but it doesn’t use the F. After playing this word, the timers on the remaining letters will count down by 1; the P will now have a timer of 2, the first E 10, and so on.

However, the game does have various mechanics that mean this isn’t always the best strategy. For example, one needs to ensure that there are actually enough letter tiles to work with, and that the distribution of letter tiles allows reasonable words to be formed. The game also has players choose bears for each mission which have special abilities such as awarding additional points for certain words. Actually, in hindsight FEASTED might have been a better choice than DEADLIFTS, as I was using a bear that gives a point bonus for words ending in ED.

In addition to scoring points by making words, players also receive bonus points for the largest fully cleared rectangular area. The game identifies this for you. I’m not sure how this is actually implemented, though there are large enough boards (15 by 15?) that it’s probably not the naive algorithm of generating and testing all rectangles; where $N$ is the width of the board that would be $O(N^6)$. There is a DP-based approach where one can keep track of the number of stone blocks that slashes this to $O(N^4)$ which would probably be enough. With more clever approaches it’s possible to knock the time complexity down to $O(N^2)$ – and that’s probably optimal as any less would involve not actually reading some parts of the grid.

Most of the time, this isn’t something to worry about too much. I’m able to clear most levels with no stones at all. However, if we need to take a stone, ideally it should be as far away from the center of the board as possible (note that only the largest rectangular area counts).

Defining optimal play in general is difficult, owing to the spatial reasoning required by the largest clear area bonus, uncertainty of tiles that are yet to be revealed and sheer breadth of options when faced with many open tiles.

Before starting a round, the player has to select bears to use. Optimal decisions here largely depend on the specific challenge being played. A bear that scores bonus points for six-letter words isn’t particularly useful on a tight board where one has only a few open tiles at a time; some bears add time to timed boards, which isn’t useful if the board has no time limit. Furthermore, players select up to three bears each round; these bears can have useful synergies. In particular, I like a combination of “summon three Zs” and “score triple points for words including JQXZ (not stacking, unfortunately)”; “summon ING” or “summon ED” goes quite well with these two as well. I’ve used these to complete a mission deemed “impossible” given the strength of my bears at the time.

We can perhaps begin by focusing more narrowly on optimal play of an Alphabear endgame (where all tiles have been revealed), ignoring special abilities bears may have for now and assuming that the board has a solution. Here is an example:

From this point on, there is no more uncertainty as to what letters we have to use. Interestingly, this board shows that a greedy algorithm where one tries to play the longest possible words doesn’t work. One could begin WORKSHOPPED, leaving RAII, but this is actually an unwinnable state. There are several two-turn solutions, like ROADWORKS + HIPPIE, or WORSHIPPED + KORAI. (Obviously, this is very high-level play; I think I played SKIPPER with this board. That leaves RODIAHWO, for which I’d probably have played something like RADIO + HOW, though HAIRDO + WO is better.)

Given a multiset of pairs of letters and timers $L = (l,t)^+$, we want to return a suitably ordered list of words $S = \left[ l^+ \right]$, that satisfies the following properties:

• legitimacy: every word $s \in S$ is included in the game’s dictionary.
• resourcing: words do not use tiles that aren’t in $L$.
• timing: for each pair $(l, t) \in L$, $l$ is used in a word before or on position $t$.

The above set of properties is not correctly specified without a refinement to deal with duplicate letters. There are several ways to deal with this; assuming low cardinality, one approach could be to recognise all duplicate instances of letters as separate letters, and suitably expand the dictionary to allow this. Our list is then $L = \left[ (A,3), (B_1,1), (B_2,1), (E, 1) \right]$, and the dictionary recognises both $B_1 E$ and $B_2 E$.

I don’t have a good way of finding an optimal solution; I suspect the problem is actually NP-complete; it feels like SAT or 3SAT may be reducible to this problem, though I haven’t worked through the details. Nonetheless, there are several search heuristics we can try. Clearly, on a given turn we must use everything which now has a timer of 1. Also, although this board is an example of why you don’t want to be greedy, in general it probably makes sense to do some kind of ordered backtracking, where one expands the most promising nodes first. I can imagine an A* search could actually work well, though finding an admissible heuristic could be hard.

There’s a fourth property we want as well:

• optimality: the score produced by the words is maximal.

Going back to our sample board, WORSHIPPED and KORAI will score one more point than ROADWORKS and HIPPIE, because there is one fewer timer tick. In general, we want to minimise the number of timer ticks by playing longer words upfront where possible, as this reduces the number of timers that are still ticking. Once one has devised a feasible solution, it may make sense to perform some iterative optimisation by seeing if it’s possible to bring letters forward. Of course, that approach may get us stuck in a local optimum. I could see some kind of iterative hill-climbing algorithm from the top-n admissible solutions (as found by our ordered backtracking) yielding decent results.

Unfortunately my vocabulary and anagram skills are far from good enough to handle the required complexity, especially under time pressure. Most of the time, successfully playing every letter on the board is enough to complete the mission’s objectives. Interestingly, it looks like the game wasn’t really programmed to deal properly with successfully using every letter but failing the mission. The game claims that there’ll be an additional reward on the reward screen when identifying the board as clear; yet, there is no reward screen as the mission was failed.

# Rebooting Ourselves

It was a fairly rough week. Thus, over this weekend I thought about re-examining and resetting various procedures and things I do, as opposed to actively filling my time with activities. This reminded me of a Humans of New York post I came across several years ago:

“I’m rebooting my life entirely, again. It’s time for Andrew 5.0.”

In computer science, semantic versioning is a system for identifying different versions of software products in a standardised way. Under this system, a product’s version is an ordered triple of nonnegative integers written $major.minor.patch$. This is typically used for software, though the definition does not seem to require it. The system discusses changes in terms of a public application programming interface (API), which specifies what functionality the product offers.

In terms of software, a SQL database’s API could include the types of queries that may be processed. For MCMAS-Dynamic, the public API would include the details of the modelling language it is able to verify properties for. A non-software example could include a simple kettle; the public API could include how one adds or removes liquid, how one turns it on or off, and possibly alarms or other feedback mechanisms for when the liquid has boiled.

When a new version of a product is released, the version number is increased (in lexicographic terms). How this increase is done depends on the types of changes since the previous version:

• If the public API is ‘broken’, meaning that previously valid ways of using the API are no longer valid or accomplish different things, then the change requires a major version bump. To do this, the major version is incremented, and the minor and patch versions are reset to 0 (e.g. $7.5.1 \leadsto 8.0.0$). For example, if the kettle used to play an alarm when it the liquid was boiled and this was a part of the public API, then the major version should be bumped if this functionality is removed. (In particular, if the API did not specify that the kettle had to play an alarm, this change might not warrant a major version bump.)
• If new features are added without ‘breaking’ the API or there are non-trivial internal improvements, the change leads to a minor version bump. The minor version is incremented and the patch version is reset to 0 (e.g. $7.5.1 \leadsto 7.6.0$). For example, if the new version of the kettle is substantially more energy-efficient, then that could be a minor version bump.
• If something was broken and has been fixed (without changing the public API), then the patch version should be incremented (e.g. $7.5.1 \leadsto 7.5.2$). For example, if the kettle previously rang an alarm twice when the liquid was boiled even though the kettle’s API specifies it should only ring once, then a change that makes the alarm only ring once could be part of a patch version bump.
• Multiple changes in aggregate should be evaluated in aggregate. In most cases, the largest magnitude of all constituent changes applies, though generally speaking this is not true (consider one bugfix plus two changes, one which breaks the API and another that reverts that change – that is a patch bump, not a major bump).

Generally, making a more aggressive version bump than would be required for one’s change is acceptable, though it can confuse users. In particular, I tend to expect backward-incompatible changes when facing a major version bump; not finding any can be surprising and confusing.

The sentiment of the quote sounded like it was a major version bump. Defining an API for one’s life is obviously very difficult; even if one tries to use a lot of abstraction, I find that there are just too many facets. Rather loosely, our API might be split into a bunch of micro-services. We can treat physical needs and bodily functions like breathing and digestion as infrastructural. These services might then focus on the range of activities we involve ourselves in, or range of activities we could involve ourselves in. For me personally, this could include software engineering, getting along with other people, finance and budgeting, computer science, writing, puzzle solving and so on.

Hopefully, I would imagine that we chew through a lot of patch versions as we continue to improve skills. Today’s release notes could include “Jeremy knows a little bit more about thread pools” (I read a chapter of Java Performance Tuning today over a post-lunch coffee). Minor versions would also be relatively common; this wouldn’t be from today specifically, but “Jeremy can vaguely attempt a Balance Loop puzzle” is probably pretty recent, extending the sudoku and other puzzle-solving features.

Depending on how we define the API, major version bumps could be very common. It is typically important to be relatively disciplined with breaking changes in an API in a software engineering context, as clients may often depend on one’s product in non-obvious ways. While others’ dependencies on us can indeed be non-obvious, I think one factor that changes things is that our systems seem to be ephemeral whilst program code is not. A codebase left as it is over years or centuries retains its capabilities (admittedly, finding suitable infrastructure to run the product might be an issue).

On the other hand, there is some evidence that we lose skills that are underutilised with time. I used to play Dance Dance Revolution quite a lot and could probably pass an arbitrary level 15 song as well as some 17s; I doubt I can do that today as I haven’t played in a few years. The ways we interact with others or manage our finances may change as our personal environments change as well; for example, if I moved away from the UK, I would not be able to allocate my investments the way I do now, because I would lose the ability to use ISAs (and probably most other forms of UK-specific tax-free savings). This may even happen without action (for example, if the UK government changes how ISAs or tax-free savings work) – though you could argue that declaring the use of specific vehicles in one’s API might be too specific and implementation-dependent (“I will use tax-advantaged accounts that are valid in my location appropriately” is maybe better).

In light of the above, I would be a bit laxer with what constituted a ‘breaking change’, which pulls things back toward subjectivity which I think semantic versioning was trying to avoid. I might regard myself as having major version 2 right now; I could consider everything up to and including my second year at Imperial as version 0, which is typically used in development to refer to a pre-release period of rapid iteration. Although National Service and/or moving to the UK for studies did bring about nontrivial changes, I really didn’t know what I wanted to do at that time (not that I know now, but there is at least a vague direction).

The Google internship was probably the turning point for version 1; that also coincided with several major changes with regard to finance, investment, philosophy and priorities. I’d call the second major change to be when graduating from Imperial and starting at Palantir; even then, I’d regard the first set of changes to be more fundamental. The re-examination I did over the weekend is actually probably a patch release (or maybe a minor that improves several non-functional characteristics); it certainly doesn’t warrant a major version bump.

# Algorithmic Modelling – Codenames

Codenames is a board game designed by Vlaada Chvatil which tests communication. The game is played with two teams; each team splits itself into clue-givers and guessers.

There is a board of 25 cards, each with one (or more) words on them. Of these cards, 9 are owned by the first team, 8 owned by the second team, 7 are neutral and one is the ‘assassin’.

The clue-givers attempt to communicate to the guessers which cards are owned by their team. This is done by giving a one-word clue followed by the number of cards N corresponding to the clue. The rules establish some bounds on allowed associations (for example, ‘sounds like’ clues are not allowed).

I don’t know how much time went into selecting the words to appear on individual cards, but there are certainly many words in the deck that can be interpreted in many ways, which makes the game fun. For example, I can think of quite a number of ways to clue BOND (COVALENT for unambiguity; DURATION, ASSET or DEBT for the debt; JAMES, AGENT or SEVEN or even at a stretch something like CASINO for the character; FRIEND or FRIENDSHIP; STREET or TUBE; CONTRACT, PROMISE or WORD for the idea of a promise; ADHERE or CONNECT, among others). Which one I might pick would depend on the other words on the board.

Guessers then identify up to N+1 cards they think their team owns. This can be based on the entire history of clues given, not just the previous clue. These cards are guessed one at a time; a team is only allowed to make further guesses if they guess ‘correctly’ one of their team’s cards. Teams may refrain from making all guesses.

The game ends either when all cards owned by a team are revealed (that team wins), or if a team reveals the assassin (that team loses).

In practice, clue-givers will need to consider all cards, not just the ones their team owns. The penalty for ‘failure’ varies; the assassin is an instant loss, and revealing a card owned by the other team is also bad. For example, with the cards APPLE, BANANA and GRAPE it would be very tempting to declare (FRUIT, 3) as a clue; yet, if KIWI is owned by the other team (or worse, is the assassin) it might be dangerous.

However, if the other team has already revealed that KIWI is theirs (e.g. with another clue, maybe SOUTH or BIRD) then the FRUIT clue becomes safe. Thus, pre-computing a strategy at the beginning of the game (e.g. by partitioning the eight or nine clues owned into several logical groups while avoiding other words) may not be optimal.

I tend to consider making an effort to win to be a key part of games, and thus what moves may be considered good also often depends on the state of the game. For example, if the opposing team has just one card left, I will give a broader clue that may have more tenuous links in a ‘do or die’ effort to finish on this turn. A less extreme example would be attempting weaker links if behind, or perhaps playing slightly more conservatively if ahead.

Mathematically modelling Codenames can be tough. We can try modelling the game state as a tuple $(O, E, N, x, h, f_O, f_E)$ where $O, E, N$ are sets of the remaining clue words our own team has, words the enemy team has, and the neutral words, $x$ is the assassin, $h = (t,w,n)^+$ is the history of the game, and $f_O$ and $f_E$ are preference functions of the form $h, w, n, O, E, N, x \rightarrow P$, returning an ordered list $O \cup E \cup N \cup \lbrace x \rbrace$. $(t, w, n)$ is a clue meaning that a given team gave a clue word and hinted that some number of cards was relevant. This already abstracts two difficult parts away – ordering the clues, and determining how many of the top preferences to pick.

I think the preference functions need to take into account previous clues from both teams; if a previous clue could clearly have corresponded to two words and I picked the wrong one, it might be fairly high on the list even if unrelated. Similarly if this scenario happens to the opposing team, I would probably avoid the word that would blatantly be theirs.

The notion of degree of confidence also isn’t captured as well in our ordering; going back to the fruit example, having a clear ordering would imply that clues for less than 4 items could reliably result in correct guesses (if we knew that APPLE was by a small margin the best pick, we could safely and confidently give (FRUIT, 1) to clue it, which seems wrong). One could imagine modelling these with selection probabilities, though things would be even more complex.

The above representation still seems computationally difficult to work with. The evolution of the preference function as one moves from one state to another is unclear (so in that representation it is fully general), making lookahead difficult. It doesn’t seem like a greedy choice is always best; for example, given eight clues that are reasonably divisible into four pairs, a clue for 4 words taking one element from each pair might be a bad idea if the other words can’t easily be linked.

We can proceed by simplifying the preference functions; a simple approach could be that for each $w, n$ each team has a persistent preference function that returns an ordered list of $O_0 \cup E_0 \cup N_0 \cup \lbrace x \rbrace$. The preference functions during the game then return the subsequence of that list that contains strictly the words still left in the game. This of course doesn’t take into account past information or clues from the other team.

With this, we can attempt to solve the game using standard techniques; assuming that the vocabulary of clues is bounded (let’s say it must be from the Linux dictionary), a game state is winning for the current team if there exists some word for which the preference function returns everything in $O$ as a prefix. A state is losing if all moves in that state produce a state which is winning for the opposing team.

We can generalise this to a simple probabilistic model as well; the preference ‘functions’ instead return a discrete random variable that indicates either guessing some word or passing. A simplified model could then, at the start of the game, assign weights to each candidate indicating the relative probability that candidate would be selected. These can be normalized to account for a pass probability. As words are removed from the board, the probability of the remaining words being selected scales (we can imagine a simple rejection-sampling where we discard words that aren’t actually on the board any more).

The algorithm for the probability that we get a win from a given state is then slightly more complex (though I think still reasonably covered by standard techniques).

# Likelihood Estimation

One of my team-mates introduced another interesting question over lunch. Working through it reminded me of some of the statistics problems I struggled with at Imperial, specifically during Intelligent Data and Probabilistic Inference. It reinforced that in spite of scoring 90 for that course I’m still not confident I knew what I was doing then (or now).

Suppose you have a symmetric triangular distribution of unknown width and mean. Given that the distribution has yielded three independent samples of 2, 4 and 5, what is the expectation of the mean?

The triangular distribution can be used as an estimate for something which is known to be bounded both above and below, that also takes into account a value known to be the most probable. Some argue that it is the simplest distribution satisfying these (though one could argue that a cosine or some kind of truncated normal might be more applicable).

The instinctive answer I have is simply the mean of the samples, or $\frac{11}{3}$, though I was suspicious as probability and statistics often yield non-intuitive results.

The distribution is named as such because it has a triangular probability density function; because of the laws of probability (area under the function must be 1), specifying the minimum, maximum and mean is enough to uniquely identify it. Supposing we have a minimum $a$, a maximum $b$ and a mean $c$, this yields a pdf of:

$f(x) =\begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)} & a \leq x \leq c \\ \dfrac{2(b-x)}{(b-a)(b-c)} & c \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$

We are only dealing with a symmetric case, so we can set $c = \frac{a+b}{2}$ which cleans things up a little:

$f(x) =\begin{cases} \dfrac{4(x-a)}{(b-a)^2} & a \leq x \leq c \\ \dfrac{4(b-x)}{(b-a)^2} & c \leq x \leq b \\ 0 & \text{otherwise} \end{cases}$

Based on our observations that we have three samples of 2, 4 and 5, we can construct the likelihood that a given triangular distribution gave rise to a certain result. While the probability sampling a given distribution resolves to a precise value is infinitesimally small, we can still compare them in relative terms using the density functions. We can write this as

$P(2,4,5;a,b) = f(2)f(4)f(5)$

Expanding this term will depend on where exactly 2, 4 and 5 fall in our triangle. Let’s work out the most promising case (where 2 falls on the left of $c$ while 4 and 5 fall on its right); the rest are left as an exercise to the reader. In this case, we have

$P(2,4,5;a,b) = \dfrac{4(2-a)}{(b-a)^2} \times \dfrac{4(b-4)}{(b-a)^2} \times \dfrac{4(b-5)}{(b-a)^2} = \dfrac{64(2-a)(b-4)(b-5)}{(b-a)^6}$

At this point, we notice the original question needs a bit more specification. We aren’t given what the distribution of possible values of $a$ and $b$ is. One way of getting around this is just to pick a uniform distribution; however, that isn’t quite defined over the real line. We can for now simply find the maximum likelihood estimate for $a$ and $b$.

Alternatively, if we give prior probability distributions for $a$ and $b$, we could also use the samples as information to get a posterior distribution. Usually we would pick a conjugate prior distribution that wouldn’t fundamentally change even when accounting for the sample; I didn’t find one for the triangular distribution, though.

If we want to find the most likely distribution, we seek to find an extreme point; this can be done by taking partial derivatives (and this expression actually lines up quite well with the quotient rule). There is a fairly standard ‘trick’ for handling these, though; since the logarithm is a strictly increasing function, we compute the log-likelihood instead. The maximum of the logarithm will also be the maximum of the original function. Using the laws of logarithms, we get something a lot more tractable:

$\log P(2,4,5;a,b) = \log 64 - 6 \log(b-a) + \log(2-a) + \log(b-4) + \log(b-5)$

Computing the partial derivatives is then straightforward. We then set them to zero, and solve the resulting equations simultaneously; this yields

$a, b = \left( \dfrac{1}{3} \left( 4 + \sqrt{\frac{2}{5}} \right), \dfrac{1}{3} \left( 16 - \sqrt{10} \right) \right), \left( \dfrac{1}{3} \left( 4 - \sqrt{\frac{2}{5}} \right), \dfrac{1}{3} \left( 16 + \sqrt{10} \right) \right)$

We’ll want the second pair of solutions; the first actually has $b \approx 4.279$ which is no good (we need $b > 5$). Interestingly, the mean of that triangular distribution is then $\dfrac{2}{15} \left(25 + \sqrt{10} \right) \approx 3.75497$ which is not quite $11/3$.

Indeed, though, the log-likelihood we get with these values of $a$ and $b$ is about $-4.74053$. Indeed, if we look at the family of distributions with $a = \frac{11}{3} - \alpha$ and $b = \frac{11}{3} + \alpha$, the best we get is about $-4.7473$.