In discussions with clients as well as among the Digitally Disadvantaged Languages community, I’ve had discussions about BCP 47 Language Tags. I’m writing this up as a general overview, meant to benefit the globalization commuity.
Edit: Thank you for all of the feedback—I continue to update this post.
What is a Language Tag?
To introduce this topic generally, a language tag is a short string that refers minimally to a language (fr
for French), but can optionally include additional context.
Who uses this?
Language Tags are used primarily for two different purposes:
Identifying the language of a document or other resource. For example,
<html lang="fr">
specifies that the contents of the HTML document is in the French language.Identifying a user’s own language (and other) preferences. In this usage, the language tags are sometimes a language preference and sometimes (though technically, incorrectly) referred to as a "locale code". For example,
fr-CA
specifies French as spoken in Canada.
Haven't I seen these before?
One might ask, Haven’t these tags been around for a long time, even before 2006? These aren't always referred to as BCP 47.
Well, yes and no. By way of analogy, Unicode UTF-8 is backwards compatible with ASCII. In other words, the string Hello
is valid ASCII (1963) — but it’s also valid UTF-8. Similarly, fr
and fr-CA
are valid BCP 47, but also valid RFC 3066 (which was previously designated as BCP 47) and even RFC 1766 (1995).
So, just as with a move to Unicode, you might be able to move to BCP 47 by starting with the identifiers you are already using for language tags / locale preferences.
What is BCP 47, formally?
IETF is the Internet Engineering Task Force, responsible for the standards that keep the Internet running. An “RFC” is a Request For Comment — a document put out in public for review. These documents have a wide variety of statuses, from purely informational, to core standards that are widely implemented. BCP 47 is a “Best Common Practice” and actually consists of two documents:
RFC 5646 Tags for Identifying Languages
This specifies the structure of the tags themselves, including private use and extensions.
Note that RFC 5646 is the update to RFC 3066 which is itself the update of RFC 1766. (RFCs, once published, can't be updated, they can only be superseded. For this reason, it's best to refer to 'BCP 47' as a stable identifier, rather than a specific RFC.)
RFC 4647 Matching of Language Tags
This defines how to take user’s preferences of a tag or list of tags. We’ll discuss this in a future post, but also be sure to see CLDR’s Language Matching algorithm.
OK, let’s dig into some sample tags
The following sections explore various BCP 47 language tags. We will “disassemble” them, to see how they tick.
fr
This refers to French. The language part of the language tag is taken from ISO 639. It needs to be the shortest form, so it is fr
and not fra
.
Easy enough? Let’s add some more complications.
ssy
This is the Saho language. Note that this is based on a three letter code, We have left "Set 1" of ISO 639 and are now using a code from the other sets.
It’s important to note that ISO 639 language identifiers can be two or three letters. If only two letter identifiers are supported, software is restricted to supporting the “major, mostly national” languages (to quote ISO).
fr-CA
French in Canada. The region part of the tag in this case is from ISO 3166.
Note that I said region and not country. A simple example is that AQ
, Antarctica, is not a country (as of this writing). There are more complex examples, but that is because what constitutes a "country" is complex. Calling it a "region" is more accurate, and avoids some geopolitical challenges. Plus, see the next example:
fr-002
Okay, what is this? French… in Africa!
002
here is a region, specifically, a UN M.49 numerical code for “Geographic Regions.” The UN statistics division also helpfully provides a hierarchy of which regions are subregions of others, starting with 001
(the world). So for example:
001
(World) contains002
(Africa) which contains015
(Northern Africa) which itself contains regions such asDZ
(Algeria),EG
, (Egypt), …
This hierarchy can be very useful when referring to, for example, a document, preference, or setting that applies to multiple regions. In some situations, fr-002
can be used to refer to fr-DZ
, fr-EG
, etc.
sr-Cyrl
and sr-Latn
The Serbian language can be written with two different scripts: Cyrillic ("српски") and Latin ("srpski"). Notice here that the second component is not a region code, but an ISO 15924 script code.
Let's talk about chopping. Chopping is an old practice whereby locale tags are "chopped up" into hyphen separated segments.
Before script codes and UN M.49 codes were part of language tags, it was common to simply assume this structure: ll-RR That is, a 2-letter language, and a 2-letter region. Rather than yearning for “simpler” times, I’ll point out that this is a good example of where code that worked before will get the wrong answer for BCP 47. At worst case, if you assume (not bothering to look for the hyphen) the region is the fourth and fifth character of the language tag, our example tags will land you in Cyprus (CY
) and Laos (LA
) respectively!
zh-Hant-HK
Now we put together a few items we've picked up along the way. This is Chinese, written in the Traditional script, in Hong Kong. The "old" way to distinguish Simplified and Traditional was to use zh
or zh-CN
(Simplified) versus zh-HK
/zh-TW
(Traditional). But the correct way to do so is to use the script codes Hans
and Hant
.
Yes, "Chinese" is a broad term. Generally, zh
refers to Mandarin, which has a specific language tag of cmn
.
el-polyton
"polyton" here is a variant subtag, registered with the IANA Language Subtag Registry. Together, this tag refers to Polytonic Greek, that is, modern Greek written with polytonic accents.
x-codehive
This is a private use language tag. It could refer to anything! But, don’t use it for open interchange, because there's no way to know what it means or prevent someone else from colliding with your usage of it.
Conclusion
That wraps up our brief introduction to BCP 47. In a future installment, we'd like to cover:
- more about private use tags including
-x-
and other reserved tags - extensions such as
-u-
and-v-
and beyond - how to use BCP 47
- limitations of BCP 47 in current implementations - and some ideas
Meanwhile, you are also recommended to read CLDR’s Unicode BCP 47 Locale Identifier spec, which specifies further constraints and canonicalization operations.