Boggle Cubeset Design


Introduction

One of my summer projects for 2019 was a web-based Boggle game so that I could play with friends living in other cities. Traditional Boggle is played with physical cubes, with found words written down with pen and paper. Each round is followed by a phase in which each player reads out the words they found, so that the other players can verify those words (i.e., check whether they are real words and whether they can actually be found on the board) and strike any words found by more than one player. This phase is often quite time-consuming, and one of the advantages of software-hosted Boggle is that it can be automated.

Sometimes, we played in French to mix things up (sidenote: my French is terrible, but it's pretty good practice). Of course, in order to still have the game track words for us at the end of each round, I needed to find a list of valid French words. I eventually found a list from ODS5, and looking through this list led me to think about how the original creators of Boggle decided what letters to paint on the faces of the cubes, and what would be a good "cubeset" for a French lexicon. I wanted to share the results of my investigations here.

Determining the Average Score of a Cubeset

The first step was to determine how to evaluate a candidate cubeset for a given lexicon (for English, I used Collins Scrabble Words 2019). Since the number of possible boards a candidate cubeset can produce is astronomically large, I relied on statistical sampling - "rolling" a board can be simulated by generating a random ordering of the cubes and choosing a random face for each cube. As an indicator of a cubeset's fitness for the chosen lexicon, I chose the average maximum possible score. It's worth noting that this is certainly not the only valid choice. We are assuming that a cubeset that tends to generate boards with a very high maximum possible score is desirable, but it's conceivable that someone might prefer sparse boards (if you've ever played Boggle, you will no doubt have experienced both boards that are full of words and boards that seem to have none in them... and you'll understand that those games have very different feels to them). That said, the average maximum possible score (henceforth referred to as just "average score") seems like a reasonable place to begin.
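The sampling step described above is small enough to sketch directly. This is a minimal illustration of the idea, not the code from my actual application; the names here (`roll_board`) are my own for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Simulate one board "roll": randomly order the 25 cubes, then show a
// uniformly random face of each cube. Returns 25 letters, row-major.
std::string roll_board(std::vector<std::string> cubes, std::mt19937& rng) {
    std::shuffle(cubes.begin(), cubes.end(), rng);
    std::uniform_int_distribution<int> face(0, 5);
    std::string board;
    for (const std::string& cube : cubes)
        board += cube[face(rng)];
    return board;
}
```

In the profiling loop, each iteration rolls a board this way, solves it against the lexicon, and accumulates the maximum possible score into a running average.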

I wrote a C++ application that generates boards and finds all the words on them (github link). I profiled the standard 5x5 cubeset (the exact cubes are listed later in this post) and, 200,000 runs later (about 8 minutes on my i7-4790, although I didn't bother to make it multithreaded), out pops the following:

[Chart: running average of the maximum possible score over 200,000 runs]

It seems we can get a fairly good idea of the average score after about 100,000 runs. The curve is not as smooth at iteration 200,000 as I would have expected, but I think we can confidently say that the average after 100,000 runs should be within ±2 of the true average score, which appears to be around 497.
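Finding every word on a board is the other half of the work. The standard approach is a depth-first search from every cell, visiting each cube at most once per path and pruning any path that is not a prefix of some lexicon word. The sketch below is an assumption about how such a solver can look, not my actual implementation (which, among other things, also handles the Qu face and uses a trie rather than a prefix set):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Minimal 5x5 Boggle solver sketch: DFS from every cell, pruning with a
// set of all word prefixes. Qu expansion is omitted for brevity.
struct Solver {
    std::set<std::string> words, prefixes, found;
    std::string board;  // 25 letters, row-major

    void add_word(const std::string& w) {
        words.insert(w);
        for (size_t i = 1; i <= w.size(); ++i) prefixes.insert(w.substr(0, i));
    }

    void dfs(int cell, std::string cur, unsigned visited) {
        cur += board[cell];
        visited |= 1u << cell;
        if (!prefixes.count(cur)) return;      // no word starts with cur
        if (words.count(cur)) found.insert(cur);
        int r = cell / 5, c = cell % 5;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc) {
                int nr = r + dr, nc = c + dc, next = nr * 5 + nc;
                if (nr < 0 || nr > 4 || nc < 0 || nc > 4) continue;
                if (visited & (1u << next)) continue;  // cube already used
                dfs(next, cur, visited);
            }
    }

    std::set<std::string> solve(const std::string& b) {
        board = b;
        found.clear();
        for (int i = 0; i < 25; ++i) dfs(i, "", 0);
        return found;
    }
};
```

With a real lexicon loaded via `add_word`, `solve` returns every findable word, and the maximum possible score is just the sum of each word's point value.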

Switching the Lexicon - English vs. French

So what happens if we naively switch to a French lexicon without modifying the cubes? Running another 200,000 runs showed an average score of 270. This is likely due partly to lexicon size (280k words in the English CSW19, but only 200k words in the French ODS5), but also due to the letter distribution being tailored towards the English language. We'll dive into a more thorough analysis shortly.

But first, here are some pretty charts:

[Histograms: distribution of maximum possible scores, English vs. French]

The two distributions look similar, but the histogram for French shows more boards with low maximum scores, as expected. It's also interesting to note that, even for French, there were boards on which it would be possible to score well over 3000 points.

N-gram Analysis

First, let's introduce the default 5x5 cubeset. Each of the 25 strings below represents one cube, with each of the six letters representing one face of that cube. For the sake of alignment, the letter Q actually stands for a Qu face. (In Boggle, Q is never alone; it is always paired with a u, which unfortunately means words like QANAT can never be scored.) The number of times each letter appears is listed in the table below.

  AAEEEE IKLQUW DDHNOT FIPRSY DHHLOR
  CCENST OOOTUU CEIPST EIITTT GORRVW
  AAAFRS AEGMNN ENSSSU EMOTTT CEILPT
  AEEEEM DHHLNO ADENNN NOOUTW BJKQXZ
  AFIRSY AEEGMU CEIILT AAFIRS DHLNOR
Default Cubeset Letter Counts
  e 19    m 4
  t 13    f 4
  a 12    g 3
  n 11    k 3
  o 11    p 3
  i 10    w 3
  s  9    q 2
  r  8    y 2
  d  6    b 1
  h  6    j 1
  l  6    v 1
  u  6    x 1
  c  5    z 1

An N-gram is a sequence of linguistic elements (here, letters) whose counts may reveal meaningful statistics about a body of text (here, the lexicon under consideration). It's not hard to imagine that the unigram counts of a lexicon should be reflected in a "good" cubeset. For example, J is the least common letter in our English lexicon, so a board that is overpopulated with J probably will not yield as many words as, say, a board that replaces some of them with S and E.

Unigrams - English Lexicon
  e 287058    m 73708
  s 245015    g 71315
  i 229895    h 63613
  a 196745    b 47310
  r 177701    y 41123
  n 170300    f 30331
  o 168711    k 23873
  t 165990    v 23418
  l 133085    w 19567
  c 102008    z 12279
  d  85376    x  7216
  u  84212    q  4301
  p  76371    j  4240

Comparing these counts to the letter counts of the default cubeset, things make sense for the most part. There are a few departures from what I would have expected (it seems the cubeset could use a few more S, and why are there 3 W but only 1 B?), but the two tables are clearly loosely aligned.

However, since Boggle is about linking letters together to make words, the unigram counts don't tell the whole story. Let's extend N to 2, 3, and 4.
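The counting itself is a small exercise. Here's a sketch of an n-gram counter over a word list (`ngram_counts` is a name I've made up for this example, not something from the actual codebase):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Count n-grams across a word list. Each word of length L contributes its
// L - n + 1 overlapping n-grams, e.g. "tests" -> te, es, st, ts for n = 2.
std::map<std::string, int> ngram_counts(const std::vector<std::string>& lexicon,
                                        int n) {
    std::map<std::string, int> counts;
    for (const std::string& w : lexicon)
        for (int i = 0; i + n <= (int)w.size(); ++i)
            ++counts[w.substr(i, n)];
    return counts;
}
```

Running this over the lexicon with n = 1 through 4 produces the tables in this section.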

Bigrams - English Lexicon - Top 20
  es 59483    st 26854
  in 52550    ed 26708
  er 50630    en 26189
  ti 36466    ri 25637
  te 30597    li 25254
  ng 30357    al 25000
  at 30232    an 24615
  is 30071    ra 24190
  re 29916    le 24138
  on 29854    ne 22662
Trigrams - English Lexicon - Top 20
  ing 26264    ate 8100
  ati 11354    ent 7478
  ess 11210    ist 6737
  ion 10964    ons 6318
  ers 10652    tin 6317
  nes 10401    ine 6206
  ies 10281    est 6164
  ter  9812    ise 5616
  tio  8996    ali 5607
  ses  8516    tic 5503
Quadrigrams - English Lexicon - Top 20
  tion 8686    ines 3036
  ness 7538    ical 2884
  atio 6359    over 2806
  ting 4518    ring 2623
  sses 4072    ment 2612
  ings 4022    iest 2400
  esse 3972    olog 2394
  ions 3964    alis 2357
  able 3168    ties 2334
  ling 3080    nter 2199

(Okay, quadrigram apparently is not a word, and the concept is instead usually referred to as a "four-gram" in the literature, which seems like a real shame.)

It's worth noting at this point that our counts look drastically different from N-gram counts that you might see published elsewhere (here, for instance). If you look at bigram counts, there's a good chance you'll see TH and HE as the top bigrams for English. This is because these studies often use real English texts as their corpus, rather than a raw list of words. In those studies, the prevalence of the word "the" in the English language is enough to propel TH and HE to the top of the list (not to mention words like "that", "this", "than", etc.), whereas when we analyze our lexicon, TH doesn't even make the top 20!

A note here for Boggle players: knowing these tables will be a great help the next time you're hunting for long words. If you see ING, ESS, or ION on the board in front of you, it may be worth spending a little extra time looking for words that contain them.

Let's see what the counts look like for French. I should note that conjugations of verbs are treated as different words, as long as they are represented by a different letter sequence after accents are removed. For example, "acheter", "achetas", and "achetais" (different conjugations of "to buy") are all counted as different words, but "achète" and "acheté" are considered the same word (and are counted as if they were simply "achete").
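The accent stripping itself can be done with a small byte-level substitution over UTF-8 strings. This is a sketch under the assumption that the word list is UTF-8 encoded and lowercase; the table below covers the accented letters that occur in French, with œ expanding to "oe":

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Normalize a UTF-8 French word by replacing accented letters with their
// unaccented equivalents, so "achète" and "acheté" both become "achete".
std::string strip_accents(const std::string& s) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"à", "a"}, {"â", "a"}, {"é", "e"}, {"è", "e"}, {"ê", "e"}, {"ë", "e"},
        {"î", "i"}, {"ï", "i"}, {"ô", "o"}, {"ù", "u"}, {"û", "u"}, {"ü", "u"},
        {"ç", "c"}, {"œ", "oe"}};
    std::string out;
    for (size_t i = 0; i < s.size();) {
        bool matched = false;
        for (const auto& entry : table) {
            // Each accented letter is a multi-byte UTF-8 sequence; compare
            // the bytes at position i against it.
            if (s.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += s[i++];
    }
    return out;
}
```

Applying this to every word and de-duplicating yields the accent-free lexicon used for the analysis below.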

Unigrams - French Lexicon
  e 286228    d 44522
  s 205335    g 30094
  i 195212    f 26993
  a 193894    b 26360
  r 178809    z 22500
  n 151284    v 20154
  t 141537    h 19476
  o 116867    q  9422
  l  75137    x  5554
  u  70659    y  5311
  c  68985    j  3426
  m  53717    k   646
  p  47781    w   158
Bigrams - French Lexicon - Top 10
  er 67360    en 45195
  on 49745    ai 41748
  nt 48851    es 40942
  ra 48829    is 35453
  re 45336    ss 33944
Trigrams - French Lexicon - Top 10
  ent 31010    ion 19057
  era 24892    sse 17381
  ons 24675    iez 13065
  ass 22390    ron 10833
  rai 20641    ssi 10673
Quadrigrams - French Lexicon - Top 10
  erai 15468    assi 8266
  ions 14504    eron 7806
  asse 12520    sion 6018
  ient  9166    ment 5306
  aien  8634    sent 5294

Comparing the unigram counts with English, we see that the ranking of the letters is quite similar. Notable differences are Z and Q, which are more common in French, and K and W, which are very rare in French (they appear almost exclusively in words borrowed from other languages).

The trigrams and quadrigrams show that a sizable percentage of words in our French lexicon come from verb conjugations. Note that the top quadrigram, "erai", was seen 15468 times, whereas the top English quadrigram, "tion", was seen only 8686 times, despite the English lexicon containing 80k more words. Those familiar with French will recognize that a large number of verbs contribute "erai" multiple times. For example, "to buy" puts "acheterai", "acheterais", "acheterait", and "acheteraient" all in the lexicon. Conjugations also explain the larger number of Z noted above, as almost every verb contributes multiple words containing Z: "achetez", "achetiez", "acheterez", "achetassiez", "acheteriez".

One final note: the n-gram analysis here treats a letter sequence and its reverse as different n-grams, in accordance with the conventional definition of the term. However, in the context of finding an n-gram on a Boggle board, there's no reason to distinguish them: if "MOOD" is on a board, then "DOOM" is also present. Perhaps a more appropriate analysis would therefore count "era" and "are" as the same trigram, and so on. I ran this analysis as well: "es/se", "ing/gni", and "tion/noit" are still the top bigram, trigram, and quadrigram, respectively, but some other n-grams do shift around. The second most frequent bigram is no longer "in/ni" but "er/re", which becomes almost as frequent as "es/se". "ess/sse" also gets a boost over "ati/ita", and "ter/ret" jumps all the way up to fourth most frequent trigram.
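Merging an n-gram with its reverse only takes a canonical key. One simple choice (an assumption on my part, not necessarily how the actual analysis picked its labels) is the lexicographically smaller of the two readings:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// On a Boggle board a letter path can be read in either direction, so for
// board analysis an n-gram and its reverse are interchangeable. Use the
// lexicographically smaller of the two as the canonical key, so that "era"
// and "are" accumulate into the same counter.
std::string canonical_ngram(const std::string& g) {
    std::string rev(g.rbegin(), g.rend());
    return std::min(g, rev);
}
```

Keying the counting map on `canonical_ngram(g)` instead of `g` produces the direction-insensitive counts discussed above.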

Can We Do Better?

Armed with our new knowledge, let's see how much we can increase the average score for English. Starting from the default set, I tried the replacements K -> S, Qu -> S, 2W -> 2R, 2C -> 2G, 5O -> 5I, 3U -> 3I, 2H -> 2N, 2H -> 2T. These are motivated largely by the unigram counts presented above, but also by the longer n-gram counts: I, N, and G were added to try to catch more of those "ing" words, and several O were replaced by I because I appears quite a bit more often in the top trigrams and quadrigrams (I may have gone a bit overboard on this replacement, since it's not that O is uncommon, but it's a start). The changes are depicted below:

    AAEEEE IKLQUW DDHNOT FIPRSY DHHLOR
    CCENST OOOTUU CEIPST EIITTT GORRVW
    AAAFRS AEGMNN ENSSSU EMOTTT CEILPT
    AEEEEM DHHLNO ADENNN NOOUTW BJKQXZ
    AFIRSY AEEGMU CEIILT AAFIRS DHLNOR

                    ↓↓

    AAEEEE ISLSUR DDHNIT FIPRSY DNTLOR
    CGENST OIOTIU GEIPST EIITTT GORRVW
    AAAFRS AEGMNN ENSSSI EMITTT CEILPT
    AEEEEM DTNLNO ADENNN NOIUTR BJKQXZ
    AFIRSY AEEGMI CEIILT AAFIRS DHLNIR

Here are the results:

[Chart: average score for the modified cubeset]

Not bad at all - average score goes from 497 to 648, approximately a 30% increase.

Let's try a different kind of change. S is a somewhat special letter in English, as it is the most common pluralizing suffix. Since the plural form of a word counts as a separate word, a well-positioned S can really increase the points available on a board. So what if we rearranged letters to guarantee at least one S? Starting from our modified cubeset with its 11 S, let's collect all of the S onto two cubes, as shown below. The swapped letters are those that differ from the previous cubeset; the overall letter distribution is unchanged.

    AAEEEE IILIUR DDHNIT FIPRFY DNTLOR
    CGENET OIOTIU GEIPAT EIITTT GORRVW
    ASSSSS AEGMNN SSSSSS EMITTT CEILPT
    AEEEEM DTNLNO ADENNN NOIUTR BJKQXZ
    AFIRAY AEEGMI CEIILT AAFIRR DHLNIR

This guarantees at least one S (from the all-S cube) and makes a second very likely (the ASSSSS cube shows an S with probability 5/6). There is a slight downside in that we can no longer find words ending in "sses" (a common quadrigram, from words like "weaknesses", "harnesses", and "obsesses"), but my intuition tells me this is still a net positive change.
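These probabilities can be checked exactly: all 25 cubes appear on a 5x5 board, so the number of S faces showing is a sum of independent per-cube events with probability (number of S faces)/6 each, and its distribution falls out of a small dynamic program. This is a worked check of the claim above, not code from the project:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// dist[k] = P(exactly k S faces showing on the rolled board). Each cube
// contributes an S with probability (# of S faces on that cube) / 6,
// independently of the others.
std::vector<double> s_count_distribution(const std::vector<std::string>& cubes) {
    std::vector<double> dist = {1.0};
    for (const std::string& cube : cubes) {
        int s_faces = 0;
        for (char c : cube)
            if (c == 'S') ++s_faces;
        double p = s_faces / 6.0;
        std::vector<double> next(dist.size() + 1, 0.0);
        for (size_t k = 0; k < dist.size(); ++k) {
            next[k]     += dist[k] * (1.0 - p);  // this cube shows no S
            next[k + 1] += dist[k] * p;          // this cube shows an S
        }
        dist = next;
    }
    return dist;
}
```

For the S-swapped cubeset, this gives P(0 S) = 0, P(exactly 1 S) = 1/6, and P(exactly 2 S) = 5/6.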

We see a smaller increase: the average score moves from 648 to 674, approximately 4%. I would say a 26-point increase in average score is definitely real, though, and we can conclude that the way the letters are arranged across the cubes can indeed have a noticeable effect on a cubeset's potential.

What's Next?

After all of this effort to find a way to assign a numerical score to a cubeset, I'm left with one lingering thought: there is an astronomically large number of possible cubesets, but one of those cubesets must be the best, yielding a higher score than all of the others. Can we find that cubeset somehow? Moreover, can we apply that same methodology to find a better cubeset for French? Stay tuned!