Confound it!—Supporting languages with multiple writing systems

Painting by Pieter Brueghel the Elder, public domain.

In this post, we’ll dip into examples of several multi-script languages, with a deeper dive into Serbian and Chinese, which have interestingly different needs. We’ll try to get a better sense of the complications that arise from supporting readers, editors, and searchers in multi-script languages, and briefly get to know some of the tools that help make it all possible. While the subject can be complicated and the tools are undoubtedly complex, handling multi-script languages well is an essential part of providing information to people in a form that they can readily use.

Making software engineers cry

Depending on your definition of “language” and your definition of “support”, the Wikimedia Foundation supports a bit shy of 300 languages across more than 800 projects. That’s a lot of languages, and the variation across those languages, along with the complexity of supporting them, can be staggering.[1] It’s enough to strike fear in the hearts of software engineers everywhere.

Image by Guillaume Benjamin Armand Duchenne via Wellcome Images, CC BY 4.0.

Human languages, however, don’t care one iota how hard they make software engineers’ lives, so in addition to the baffling variability between languages, there is often considerable variability within a language. English dialects seem able to proliferate without end, and there are many differences in words and phrases[2] (elevator vs lift) and spelling[3] (color vs colour) just between the standard American and British varieties. But at least we all use the same writing system.

Not so in other languages! Let’s take a look…

Serbian—Cyrillic and Latin

Serbian is one of the standard forms of the Bosnian-Croatian-Montenegrin-Serbian language. It can be written in either the Cyrillic or the Latin alphabet. While having two scripts complicates matters, the correspondence between the Cyrillic and Latin alphabets is mercifully exact, which makes converting between the two relatively straightforward.

A highway sign outside Belgrade, Serbia, showing both Cyrillic and Latin script. Photo by Jeff Attaway, CC BY 2.0.

———

Enter language converter!

On the Serbian Wikipedia, most articles are written in Cyrillic, though some are written in Latin. If you haven’t set any language preferences on the Serbian Wikipedia, then where English Wikipedia has “Main Page | Talk” near the upper left of the page, Serbian Wikipedia has “Главна страна | Разговор | Ћир./lat.” Under “Ћир./lat.” you have three options: “Ћир./lat.”, “Ћирилица”, and “Latinica”. That’s our language converter in action!

Not too surprisingly, “Latinica” converts the page to Latin text, “Ћирилица” (which in Latin script is “Ćirilica”) gives you Cyrillic, and the default, “Ћир./lat.”, shows the text in whichever script it was originally written. Logged-in users can set a preference so that they normally see their preferred script.
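To give a feel for why the Serbian case counts as straightforward, here is a minimal sketch of Cyrillic-to-Latin conversion as a letter-by-letter lookup. It is an illustration only, not the language converter's actual code: the names `CYR_TO_LAT` and `to_latin` are made up for this example, and capitalization is handled in the simplest possible way.

```python
# Serbian Cyrillic -> Latin, one letter at a time.
# Illustrative sketch; names here are hypothetical, not MediaWiki's.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Convert Serbian Cyrillic letters to Latin; leave everything else alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # digits, punctuation, Latin text, etc.
            continue
        latin = CYR_TO_LAT[lower]
        if ch != lower:
            # Uppercase input: capitalize only the first letter of a digraph,
            # so "Љ" becomes "Lj". All-caps words would need extra care.
            latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(to_latin("Главна страна | Разговор"))
```

Three Cyrillic letters (љ, њ, џ) become two Latin letters each, which is harmless in this direction but is exactly what makes the reverse trip a little less trivial.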

Even with the relatively straightforward transliteration, there are complications. For example, in the article about Serbian-American actress Sasha Alexander on Serbian Wikipedia, her stage name is provided in English. It wouldn’t be helpful to have the specifically English version of her name converted along with the general text on the page when transliterating to Cyrillic. In this case, the language converter is smart enough to know not to transliterate inside language-specific templates. There’s also -{special markup}- available that can block the conversion for any bit of text. It often gets used for standard abbreviations, such as the units km (kilometer) and mm (millimeter), and for cardinal directions in coordinates (e.g., the N and E in “44°48′N 20°28′E”). There’s also a magic word[4] to block title conversion: __NOTC__ (also available in Cyrillic as __БЕЗКН__), which gets used for domain names, abbreviations and initialisms, scientific and technical terms, etc.
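The blocking markup is easy to picture in code. The sketch below converts Latin text to Cyrillic but passes anything wrapped in -{ }- through untouched (dropping the markers). It is a rough illustration under my own assumptions, not how the language converter is implemented; `to_cyrillic`, `convert_page`, and the regular expressions are hypothetical names for this example.

```python
import re

# Serbian Latin -> Cyrillic. Keys are Latin, values are Cyrillic.
LAT_TO_CYR = {
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ", "e": "е",
    "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к", "l": "л", "lj": "љ",
    "m": "м", "n": "н", "nj": "њ", "o": "о", "p": "п", "r": "р", "s": "с",
    "t": "т", "ć": "ћ", "u": "у", "f": "ф", "h": "х", "c": "ц", "č": "ч",
    "dž": "џ", "š": "ш",
}

# Try digraphs first so "lj" wins over "l" followed by "j".
LETTER = re.compile(r"lj|nj|dž|.", re.IGNORECASE | re.DOTALL)
# Text wrapped in -{ ... }- is protected from conversion.
PROTECTED = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def to_cyrillic(text: str) -> str:
    def convert(match: re.Match) -> str:
        chunk = match.group(0)
        cyr = LAT_TO_CYR.get(chunk.lower())
        if cyr is None:
            return chunk  # not a Serbian Latin letter: keep as-is
        return cyr.upper() if chunk[0].isupper() else cyr

    return LETTER.sub(convert, text)


def convert_page(wikitext: str) -> str:
    # With one capture group, re.split alternates:
    # outside, protected, outside, protected, ...
    parts = PROTECTED.split(wikitext)
    return "".join(
        part if i % 2 else to_cyrillic(part) for i, part in enumerate(parts)
    )


print(convert_page("Dugačak je 5 -{km}-."))
```

Note that the greedy digraph rule is wrong for the occasional word where "n" and "j" (or "d" and "ž") are really two separate letters; a production converter needs a way to handle those, which this sketch skips.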

Complicated is as complicated does—editing and search

So you’re cruising around the Serbian Wikipedia, with your preference for Cyrillic set, reading about your favorite TV show with a controversial ending, Изгубљени (English: Lost), and you decide to add a little detail or correct a minor typo. When you get to the edit page, you discover that the article is actually titled Izgubljeni and it’s written in Latin script, which you aren’t so comfortable reading or writing. Bummer.

Printing it out would not help, even with the nice binding. Photo by CollegeDegrees360, CC BY-SA 2.0.

Help is coming! It’s a very complicated issue, though. Do you convert the entirety of every article, from Cyrillic to Latin and back, every time someone wants to edit it in a different script? Or do you try to identify just what they changed in one script and convert it to the majority script of the article? What about cases unlike Serbian—oooo, foreshadowing!—where the conversion is good, but far from perfect? Fortunately for me, that’s not my problem—whew! But the WMF Parsing team has plans.[5]

Similarly, searching in a given script generally only finds matches in the same script. So on Serbian Wikipedia, searching for Izgubljeni gives a few dozen results, while searching for Изгубљени returns several hundred. This one is my problem, as I’m part of the WMF Search Platform team. I’m working on a plugin for our search engine that will not only merge the Cyrillic and Latin scripts in the search index, but will also do some basic stemming, which lets searches for one form of a word return related forms—like hope, hoped, and hoping in English. Speaking of hope, I also hope in the future to be able to bridge part of the mixed-script gap for other languages and projects where the conversion is, like that for Serbian, relatively straightforward.
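The idea of merging the two scripts in the index can be sketched in a few lines: fold every token to a single script before it is stored, and fold queries the same way. This is a toy illustration of the general technique, not the plugin itself; `fold`, `tokens`, and `Index` are names I made up, and stemming is left out entirely.

```python
import re
from collections import defaultdict

# The 30 Serbian Cyrillic letters and their Latin equivalents, in order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
FOLD = str.maketrans(dict(zip(CYR, LAT)))


def fold(text: str) -> str:
    """Lowercase, then map Cyrillic letters onto Latin ones."""
    return text.lower().translate(FOLD)


def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", fold(text))


class Index:
    """A toy inverted index keyed on script-folded tokens."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for tok in tokens(text):
            self.postings[tok].add(doc_id)

    def search(self, query: str) -> set[int]:
        sets = [self.postings.get(tok, set()) for tok in tokens(query)]
        return set.intersection(*sets) if sets else set()


idx = Index()
idx.add(1, "Изгубљени је серија")
idx.add(2, "Izgubljeni je serija")
print(idx.search("Izgubljeni"), idx.search("изгубљени"))
```

Folding toward Latin rather than Cyrillic is a deliberate shortcut here: each Cyrillic letter has exactly one Latin spelling, so the fold never has to guess.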

Less straightforward transliteration—Uzbek, Kazakh, and Crimean Tatar

Not all transliteration systems are as straightforward as Serbian. To varying degrees, Uzbek (Wikipedia), Kazakh (Wikipedia),[6] and Crimean Tatar (Wikipedia)—all Turkic languages—need more complicated support for their Cyrillic/Latin transliteration, up to and including regular expressions and lists of exceptions that just can’t be handled in any straightforward way.

For these languages, the difficulties of reading, editing, and searching are greater than for Serbian, because any automatic conversion has to be significantly more clever, which also makes it more likely to make mistakes.

More straightforward transliteration—Inuktitut and Shilha

Language converter, of course, supports scripts other than Cyrillic and Latin. The transliteration for Kazakh includes Arabic as well. Other scripts, including some you may be unfamiliar with, are supported, too! For example…

Inuktitut (Wikipedia) is an Inuit language spoken in Canada and written in both Latin and Inuktitut syllabics. Fortunately the mapping between the two is straightforward, like Serbian.

A stop sign in Iqaluit, Nunavut, featuring Inuktitut and English. Photo by Sébastien Lapointe, CC BY-SA 3.0.

Shilha is a Berber language spoken in Morocco and written in Arabic, Latin, and Tifinagh. Its Wikipedia is small and still in the incubator, but it already uses language converter to support Latin/Tifinagh transliteration, in part because the mapping is straightforward.

Photo by Mohamed Amarochan, CC BY-SA 3.0.

———

Confounded, confused, and confuzzled—Chinese characters

The Chinese[7] Wikipedia (language code zh) uses the language converter to transform its text into several varieties, including those of mainland China (zh-cn), Hong Kong (zh-hk), Macau (zh-mo), Singapore (zh-sg), and Taiwan (zh-tw). These varieties can have small differences in punctuation and in a few particular words, but the largest split is between whether they use Traditional or Simplified Chinese characters. Chinese Wikipedia also supports the language codes zh-hant and zh-hans, which are generic Traditional and Simplified Han (Chinese) characters, respectively; we’ll use those codes for our next example.

Image by Kjoonlee and Tomchen1989, Arphic Public License (2001 version).

For those who don’t read Chinese, the difference between Traditional and Simplified characters can be subtle, but it’s easier to see when you can compare the exact same text in the two systems. Open the article for “Wikipedia” rendered in Traditional (維基百科) and Simplified (维基百科) characters in adjacent tabs in your browser. Flip back and forth between them and notice that the Traditional variants of characters are often a bit more complex and have a few more strokes, and look a bit darker on the screen as a result. Simplified characters are, well, a bit simpler looking. Punctuation marks, like periods, commas, and quotes, are also a bit different.

The Chinese language and Chinese Wikipedia together have a number of additional complexities that make the situation even more challenging:

The mapping between Traditional and Simplified is not one-to-one—in multiple ways! Some words that are a single Traditional character are written with two Simplified characters. A Traditional character that’s part of a multi-character word or phrase might get converted to a different Simplified character as part of that phrase than it would if it were on its own.
Chinese is written without spaces, making it hard for a computer to break it into words (so that it can make decisions about how to transliterate). As a simple analogy in English—written without spaces—the string “MARGARITA” could be “Margarita” or “Marga Rita”. Context provides clues: “IWANTTODRINKAMARGARITA.” vs “HERFIRSTANDMIDDLENAMESAREMARGARITA.”—but computers are terrible at context.
Chinese Wikipedia, unlike Serbian, often has a mix of Traditional and Simplified characters in a given article. Not just in the same article or sentence, but in the same name!
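One common way to cope with the "depends on the phrase" problem is to check a phrase table before falling back to single characters, always preferring the longest match. The sketch below shows that approach on a tiny hand-picked table: on its own, 乾 usually becomes 干, but it stays 乾 in 乾坤 and 乾隆. This is an illustration of the general technique, not the conversion tables or algorithm Chinese Wikipedia actually uses; `CHAR_MAP`, `PHRASE_MAP`, and `to_simplified` are made-up names.

```python
# Tiny illustrative tables; a real converter carries thousands of entries.
CHAR_MAP = {"乾": "干", "淨": "净"}          # single-character defaults
PHRASE_MAP = {"乾坤": "乾坤", "乾隆": "乾隆"}  # phrases that override them

MAX_PHRASE = max(len(phrase) for phrase in PHRASE_MAP)


def to_simplified(text: str) -> str:
    """Greedy longest-match conversion: phrases first, then single characters."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible phrase starting at position i.
        for n in range(min(MAX_PHRASE, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASE_MAP:
                out.append(PHRASE_MAP[chunk])
                i += n
                break
        else:
            # No phrase matched: convert (or copy) a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


for word in ("乾淨", "乾燥", "乾坤", "乾隆皇帝"):
    print(word, "->", to_simplified(word))
```

Greedy matching sidesteps real word segmentation, which is why it can still pick the wrong phrase when two dictionary entries overlap in running text.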

An example I found of the last problem: “UEFA Champions League Final” appears on Chinese Wikipedia in Traditional characters (歐洲冠軍聯賽決賽), in Simplified characters (欧洲冠军联赛决赛), and in a mix of the two (欧洲冠军联赛決賽). In that last instance, the last two characters are Traditional, the rest are Simplified. It strikes me as very odd because the last and third-from-last characters are the same!—so two different versions of the same character are used in the name of a soccer/football[8] league.
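Spotting a mixed-script name like that one is simple once you know which characters differ between the two systems. Here is a minimal sketch that covers only the five character pairs that differ in this one name; the table and the `classify` function are hypothetical and nowhere near complete.

```python
# Traditional -> Simplified pairs that differ in this one example name.
TRAD_TO_SIMP = {"歐": "欧", "軍": "军", "聯": "联", "賽": "赛", "決": "决"}
SIMP_FORMS = set(TRAD_TO_SIMP.values())


def classify(text: str) -> str:
    """Label text by which script-specific characters it contains."""
    has_trad = any(ch in TRAD_TO_SIMP for ch in text)
    has_simp = any(ch in SIMP_FORMS for ch in text)
    if has_trad and has_simp:
        return "mixed"
    if has_trad:
        return "traditional"
    if has_simp:
        return "simplified"
    return "either"  # only characters shared by both systems


for name in ("歐洲冠軍聯賽決賽", "欧洲冠军联赛决赛", "欧洲冠军联赛決賽"):
    print(name, classify(name))
```

Characters such as 洲 and 冠 are written the same way in both systems, which is why a short string can legitimately come back as "either".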

In the Bad Old Days, searching for any of these variants would only find articles that contained those specific characters. As of spring 2017, the situation has improved considerably because we convert all text on Chinese Wikipedia to Simplified characters[9] before indexing them for search.
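In miniature, normalizing before indexing looks something like the sketch below: map everything to Simplified, then compare character bigrams. It is a toy under my own assumptions, with a five-character table and a plain per-character mapping (no phrase awareness, no word segmentation); `normalize`, `bigrams`, and `matches` are illustrative names, not our search code.

```python
# Five-character table, enough for the example name only.
TRAD_TO_SIMP = str.maketrans("歐軍聯賽決", "欧军联赛决")


def normalize(text: str) -> str:
    """Map Traditional characters to Simplified, character by character."""
    return text.translate(TRAD_TO_SIMP)


def bigrams(text: str) -> set[str]:
    """Overlapping two-character chunks of the normalized text."""
    text = normalize(text)
    return {text[i:i + 2] for i in range(max(len(text) - 1, 1))}


def matches(query: str, document: str) -> bool:
    """True if every bigram of the query also occurs in the document."""
    return bigrams(query) <= bigrams(document)


doc = "欧洲冠军联赛決賽"  # the mixed-script spelling
print(matches("歐洲冠軍聯賽決賽", doc), matches("欧洲冠军联赛决赛", doc))
```

Because both the query and the document go through the same normalization, it no longer matters which variant either one was typed in.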

Editing Chinese Wikipedia is still pretty complicated. Unlike Serbian’s Cyrillic/Latin situation (where you could presumably study really hard for a few weeks and become passingly familiar with the few dozen characters in the script you don’t already know), Traditional and Simplified Chinese have thousands of different characters to learn.

In conclusion…

…there is no conclusion! Well, this blog post is about to end, but the road to fully supporting all the languages of Wikipedia and its sister projects—for reading, editing, and searching—is probably never-ending.

But that shouldn’t be disheartening—every day we come a little bit closer to a world in which every single human being can freely share in the sum of all knowledge. It may never be perfect, but it’s always getting better.

Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation

Footnotes

1. For a very brief but very entertaining review of just some of that complexity, see Roan Kattouw’s lightning talk, given at linux.conf.au 2017 (“Human language wats”): YouTube video, slides on Wikimedia Commons.

2. And of course there is nearly endless variety to be found in English around the world: Appalachian English has sigogglin; Australian English has fair dinkum; Canadian English has namaycush; East African English has boda-boda; Hawaiian English has makai; Indian English has burra-khana; Indonesian English has gotong-royong; Irish English has knawvshawl; Maryland English has moonack; Namibian English has oukie; Philippine English has kilig; Quebec English has sugar pie; Singapore English has taxi uncle; and Texas English has whomperjawed.

3. This is only exacerbated by the fact that English spelling is horrible. The relevant technical term is “orthographic depth”, which is a rough sense of how much a spelling system is WYSIWYG. In the English Wikipedia article on orthographic depth, English is the only example in the category “irregular”. The French Wikipedia article specifically calls out English -ough. It’s a travesty.

4. That’s a rather technical term—follow the link!

5. See C. Scott Ananian’s slides for his Wikimania 2017 presentation on multi-script editing. The slides include his speaker notes, so while it’s not as good as being there, it’s still quite good and full of useful information.

6. Kazakhstan is currently planning to officially shift from the Cyrillic to the Latin alphabet by 2025, and the language converter will need to adapt to that change. An earlier proposal, from October 2017, involved a lot of apostrophes, and was widely criticized. (See an article from The New York Times.) Very recently, a new version was announced that favors acute accents and digraphs. (See an article from Kazinform.) Tables showing the Oct 2017 and Feb 2018 transliterations are on Commons.

7. The label “Chinese” is itself complicated because it can refer to many languages, which can differ as much from each other as the Romance languages do—meaning they are often mutually unintelligible. The Chinese Wikipedia is written in Modern Written Chinese, which is based on the varieties of Chinese spoken throughout China. See also the article on Chinese Wikipedia, on English Wikipedia.

8. To-may-to, to-mah-to. I already said that English is a mess.

9. The software libraries available to handle segmenting Chinese text into words operate on Simplified characters, so converting everything to Simplified first allowed us to also index articles by actual words, rather than by character n-grams; n-grams are much better than nothing, but not great. For more info on the process, you can read my write up for that project.