Optimal Ancient Alphabets and Letter Frequencies in Ugaritic

July, 2024

Writing has been independently invented a handful of times, including in Mesopotamia, China, and Mesoamerica. The alphabet, however, has only been invented once, and all alphabets are either descendants of the original alphabet or were inspired by it.

One of the earliest alphabets for which we have a relatively large corpus is the Ugaritic script. The Ugaritic script has 29 to 30 letters, which are written by pressing a stylus into clay. This gives it an appearance like cuneiform (an older writing system, though not an alphabet), but the signs used are unique to Ugaritic and don't mean anything in cuneiform.

Text KTU 1.1 in the Ugaritic alphabet from circa 1400 BCE to 1200 BCE. Image courtesy of the Louvre. © 2004 GrandPalaisRmn (musée du Louvre) / Franck Raux

I was reading the Wikipedia article on the Ugaritic script, when I came across the claim that "Jared Diamond believes the alphabet was consciously designed, citing as evidence the possibility that the letters with the fewest strokes may have been the most frequent."

This claim comes from a 1994 piece called "Writing Right" that Diamond wrote for Discover magazine, in which he says:

The other remarkable feature of the Ugaritic alphabet is that the letters requiring the fewest strokes may have represented the most frequently heard sounds of the Semitic language then spoken at Ugarit. Again, this would make it easier to write fast.

I can't find any source for this idea apart from Diamond, nor can I find any analysis of letter frequencies in Ugaritic that could be used to substantiate that claim (though I'm not sure I would know where to look).

I decided to run my own analysis to test whether the most commonly used letters in Ugaritic really did use the fewest strokes/wedges.

Letter Frequencies in Ugaritic

I extracted all of the transliterations from The Texts of the Ugaritic Data Bank by Cunchillos, Vita, Zamora, and Cervigón, and tallied up all of the letters.
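A tally like this can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the charts in this post: the function name `letter_frequencies` and the `EDITORIAL_MARKS` set are hypothetical, and it assumes every transliterated letter is a single Unicode character once the text is normalized.

```python
import unicodedata
from collections import Counter

# Hypothetical set of editorial marks to skip; a real transliteration
# scheme will have its own conventions for breaks and uncertain readings.
EDITORIAL_MARKS = set("[]<>().-?")

def letter_frequencies(lines):
    """Return {letter: fraction of all letters}, most frequent first."""
    counts = Counter()
    for line in lines:
        # NFC keeps letters such as "š" or "ḥ" as single code points
        # where a precomposed form exists.
        text = unicodedata.normalize("NFC", line)
        counts.update(
            ch for ch in text
            if not ch.isspace() and ch not in EDITORIAL_MARKS
        )
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}

# Toy input, not the real corpus.
sample = ["bʿl", "b bt"]
for letter, share in letter_frequencies(sample).items():
    print(letter, round(share, 3))
```

Dividing by the total makes the result a fraction of all letters, which is the quantity plotted in the bar charts.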

In total, there were 177,762 letters in the corpus with the following frequencies (as a fraction of all letters in the corpus):

Letter frequencies in Ugaritic.

Here, I've labeled each bar with both the Ugaritic character and the standard Latin transliteration. The bar chart above has the bars in the Ugaritic alphabet order, which is quite similar to the order of the Greek and Latin alphabets.

The chart below shows the same data, but with the letters sorted from most frequent to least frequent:

Sorted letter frequencies in Ugaritic.

(Interestingly, there seems to be a drop-off in letter frequency after the sixth letter.)

At first glance, it doesn't look like the most frequently used letters use fewer wedges than the least frequently used ones.

Further, the scatter plot below doesn't seem to show any relationship between letter frequency and the number of wedges needed to draw that letter:

Scatter plot showing the relationship between the number of wedges or strokes in an Ugaritic letter, and the frequency of that letter in the corpus.

Overall, it looks like Diamond's claim that Ugaritic uses fewer wedges for frequently used letters does not hold up empirically, at least in this corpus.
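One way to put a number on what the scatter plot suggests is a rank correlation between wedge count and frequency. If Diamond's claim held, the correlation would be strongly negative (more wedges, lower frequency). The sketch below computes Spearman's rank correlation by hand so that it needs no third-party libraries. The input values are made-up placeholders, not the real wedge counts or frequencies, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Placeholder data: wedges per sign and the matching letter frequencies.
wedges = [1, 2, 3, 4]
frequencies = [0.4, 0.3, 0.2, 0.1]
print(spearman(wedges, frequencies))
```

Ranks are used instead of raw values because many signs share the same wedge count, and tie-averaged ranks handle that without assuming a linear relationship.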

Making Ugaritic Optimal

Ugaritic is not "optimal" in the way Diamond supposed. However, if we took the symbols with the fewest wedges and remapped them to the most frequently used letters, we could make an "optimal" version of the Ugaritic alphabet that would require fewer wedges to write typical text!

The table below shows such an optimal remapping from the regular Ugaritic alphabet to an "optimal Ugaritic" alphabet:

Using the regular Ugaritic alphabet, it took 552,324 wedges to write all 177,762 letters in the corpus I am using. However, using the "optimal Ugaritic" alphabet, only 365,964 wedges are needed, which is only about 66% as many!
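The remapping idea comes down to sorting: pair the most frequent letter with the sign that has the fewest wedges, the second most frequent with the next fewest, and so on. By the rearrangement inequality, this pairing minimizes the total wedge count when the set of signs is fixed. The sketch below shows the idea on toy numbers; the function names and the three-letter "alphabet" are hypothetical, not the real Ugaritic data.

```python
def total_wedges(counts, wedges, mapping=None):
    """Wedges needed to write the corpus.

    counts:  {letter: occurrences in the corpus}
    wedges:  {sign: wedges needed to draw it}
    mapping: {letter: sign used to write it}; defaults to each letter
             keeping its own sign.
    """
    if mapping is None:
        mapping = {letter: letter for letter in counts}
    return sum(counts[letter] * wedges[mapping[letter]] for letter in counts)

def optimal_mapping(counts, wedges):
    """Give the cheapest signs to the most frequent letters."""
    letters = sorted(counts, key=counts.get, reverse=True)  # most frequent first
    signs = sorted(wedges, key=wedges.get)                  # fewest wedges first
    return dict(zip(letters, signs))

# Toy data: "b" is the most common letter but has the costliest sign.
counts = {"a": 10, "b": 50, "c": 40}
wedges = {"a": 1, "b": 3, "c": 2}

before = total_wedges(counts, wedges)
after = total_wedges(counts, wedges, optimal_mapping(counts, wedges))
print(before, after)
```

For the figures quoted above, 365,964 / 552,324 ≈ 0.663, and dividing each total by the 177,762 letters gives an average of about 3.1 wedges per letter for the regular alphabet versus about 2.1 for the remapped one.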

Huffman Codes

Another approach for optimizing Ugaritic could involve looking at Huffman codes to choose the optimal combination of wedges to use for each letter. Such an undertaking would require designing new symbols, which I'm not prepared to do currently, but I thought I would include this binary Huffman tree generated from the actual frequencies of Ugaritic letters:

Binary Huffman tree for the Ugaritic alphabet using letter frequencies from the corpus.
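For readers who want to reproduce a tree like this, here is a minimal sketch of binary Huffman coding using Python's `heapq`. It returns only the depth of each symbol in the tree (the code length), which is the quantity that would correspond to "number of wedges" in a redesigned script. The function name and the toy frequencies are hypothetical; this is not the code that generated the figure.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: depth in a binary Huffman tree} for {symbol: weight}."""
    tiebreak = count()  # unique ids so heap entries never compare the dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside them
        # moves one level deeper.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2] if heap else {}

# Toy frequencies, not the Ugaritic ones.
lengths = huffman_code_lengths({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(lengths)
```

Frequent symbols end up near the root with short codes, and rare symbols end up deeper in the tree with longer ones.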

Incidentally, this is what the Huffman tree looks like using the optimal remapping of Ugaritic letters, which brings more symbols with few wedges to the top.

Binary Huffman tree for the Ugaritic alphabet using letter frequencies from the corpus, but with symbols remapped to use fewer wedges for commonly used letters.