On a Sandy Beach: Incorporating languages for i18n

Mike's Notes

My notes on the architecture decisions required to internationalise (i18n) Pipi for every human language.

The underlying master language is English.
The Locale User Interface (LUI) engine is responsible for i18n.
Subsystems include Region, Language, Script, Calendar, UOM, Person Naming, Strings, etc.
Data from Unicode, CLDR, ISO 639-3, the IANA Language Subtag Registry, and other sources automatically updates LUI.
Each translated string is stored in a database and mapped to its English source string.
Strings are substituted at render time to create code or static web pages in a particular language and script.
A translation process by volunteers will eventually become necessary.
Expect these assumptions to change a lot following experimental evidence.
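
The string storage and render-time substitution described above can be sketched as follows. This is only a minimal illustration, not Pipi's actual design: the table layout, the `{{...}}` placeholder syntax, the function names, and the sample French and German strings are all assumptions made for the example. A string with no translation falls back to the English master string.

```python
import re
import sqlite3


def build_store():
    """Create an in-memory table mapping (English string, locale) to a translation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translation ("
        " english TEXT NOT NULL,"
        " locale TEXT NOT NULL,"
        " translated TEXT NOT NULL,"
        " PRIMARY KEY (english, locale))"
    )
    # Illustrative sample rows only (hypothetical data).
    rows = [
        ("Save", "fr", "Enregistrer"),
        ("Delete", "fr", "Supprimer"),
        ("Save", "de", "Speichern"),
    ]
    conn.executemany("INSERT INTO translation VALUES (?, ?, ?)", rows)
    return conn


def translate(conn, english, locale):
    """Return the translation for a locale, or the English string if none exists."""
    row = conn.execute(
        "SELECT translated FROM translation WHERE english = ? AND locale = ?",
        (english, locale),
    ).fetchone()
    return row[0] if row else english


def render(conn, template, locale):
    """Replace every {{English string}} placeholder at render time."""
    return re.sub(
        r"\{\{(.+?)\}\}",
        lambda match: translate(conn, match.group(1), locale),
        template,
    )


conn = build_store()
page = "<button>{{Save}}</button> <button>{{Help}}</button>"
french_page = render(conn, page, "fr")
```

Here "Save" is replaced with its French entry, while "Help" has no French row in the sample data and stays in English.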

Last Updated

17/05/2025

Incorporating languages for i18n

By: Mike Peters

On a Sandy Beach: 23/12/2023

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

One of the significant challenges this year was enabling the Pipi 9 Content Management System (CMS) to use any language or writing system (script).

The decision was made to alter the architecture to save future rework, even though English will be the only language available initially.

Some of the critical resources used were:

Unicode Consortium
SIL International
IETF Language Tag
W3C Language tags in HTML and XML
r12a.io

Unicode Consortium

The Unicode Consortium describes its overall purpose as:

...enabling people worldwide to use computers in any language by providing freely available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters.

UTF-8

UTF-8 can encode every character in the Unicode standard, so Pipi uses UTF-8. Separately, the ISO 15924 standard gives each writing system (script) a 4-letter code and a 3-digit numeric code, so 1,000 numeric slots are available. About 178 scripts are listed so far, with many more being added.

For example, Latn is the 4-letter code for the Latin alphabet.
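
As a small illustration of why one encoding can serve every script: UTF-8 is variable-width, using between one and four bytes per character. The sketch below encodes one sample character from each of four scripts. The dictionary keys are ISO 15924 codes, and the choice of sample characters and the function name are mine, picked only for the example.

```python
# One sample character per script, keyed by its ISO 15924 code.
SAMPLES = {
    "Latn": "A",           # U+0041 LATIN CAPITAL LETTER A
    "Cyrl": "\u042f",      # U+042F CYRILLIC CAPITAL LETTER YA
    "Hani": "\u4e2d",      # U+4E2D, a Han ideograph
    "Goth": "\U00010348",  # U+10348 GOTHIC LETTER HWAIR
}


def utf8_length(character):
    """Number of bytes UTF-8 needs for a single character."""
    return len(character.encode("utf-8"))


byte_lengths = {script: utf8_length(ch) for script, ch in SAMPLES.items()}

# Encoding and decoding round-trips without loss for every sample.
round_trip_ok = all(
    ch.encode("utf-8").decode("utf-8") == ch for ch in SAMPLES.values()
)
```

The four samples need 1, 2, 3 and 4 bytes respectively, which covers the full range of UTF-8 sequence lengths.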

CLDR

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

Among the types of data that CLDR includes are the following:

Translations for language names
Translations for territory and country names
Translations for currency names, including singular/plural modifications
Translations for weekday, month, era, and period of the day, in full and abbreviated forms
Translations for time zones and example cities (or similar) for time zones
Translations for calendar fields
Patterns for formatting/parsing dates or times of day
Exemplar sets of characters used for writing the language
Patterns for formatting/parsing numbers
Rules for language-adapted collation
Rules for spelling out numbers as words
Rules for formatting numbers in traditional numeral systems (such as Roman and Armenian numerals)
Rules for transliteration between scripts, much of it based on BGN/PCGN romanisation

The information is currently used in International Components for Unicode, Apple's macOS, LibreOffice, MediaWiki, and IBM's AIX, among other applications and operating systems.
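
To give a feel for what CLDR data looks like, the sketch below reads translated language names from a tiny LDML-style fragment using Python's standard XML parser. The fragment is hand-written for this example and heavily simplified. Real CLDR files are far larger and carry extra attributes, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of a CLDR French locale file.
LDML_FRAGMENT = """<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
      <language type="es">espagnol</language>
    </languages>
  </localeDisplayNames>
</ldml>"""


def language_names(ldml_text):
    """Return (locale of the file, {language code: name in that locale})."""
    root = ET.fromstring(ldml_text)
    locale = root.find("identity/language").get("type")
    names = {
        element.get("type"): element.text
        for element in root.findall("localeDisplayNames/languages/language")
    }
    return locale, names


locale, names = language_names(LDML_FRAGMENT)
```

With this fragment, `locale` is "fr" and `names["en"]` is "anglais", the French name for English.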

SIL International

ISO 639-3, for which SIL International is the registration authority, assigns 3-letter codes to more than 7,000 languages.

For example, eng is the code for English.
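
A minimal sketch of a lookup keyed on these codes is shown below. The table holds only the seven languages used in the string experiment later in this post, and the function name and validation rule are assumptions for illustration. A real system would load the full ISO 639-3 code table instead.

```python
# ISO 639-3 codes for the languages mentioned in this post.
ISO_639_3 = {
    "eng": "English",
    "fra": "French",
    "spa": "Spanish",
    "deu": "German",
    "pwn": "Paiwan",
    "abk": "Abkhazian",
    "mri": "Maori",
}


def language_name(code):
    """Return the language name for a 3-letter code, or None if it is not in the table."""
    normalised = code.strip().lower()
    # ISO 639-3 codes are exactly three ASCII letters.
    if len(normalised) != 3 or not (normalised.isascii() and normalised.isalpha()):
        raise ValueError(f"not a 3-letter code: {code!r}")
    return ISO_639_3.get(normalised)


english = language_name("eng")
unknown = language_name("zzz")  # well-formed, but not in this small table
```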

IETF

An IETF BCP 47 language tag is a standardized code or tag used to identify human languages on the Internet (a language code). The tag structure has been standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained by the IANA Language Subtag Registry.

To distinguish language variants for countries, regions, or writing systems (scripts), IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49. For example, the tag "en" stands for English; "es-419" for Latin American Spanish; "rm-sursilv" for Romansh Sursilvan; "sr-Cyrl" for Serbian written in Cyrillic script; "nan-Hant-TW" for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; and "gsw-u-sd-chzh" for Zürich German. In accordance with ISO 639-3, however, it does not provide codes for distinguishing between Arabic-based scripts and maintains two duplicate codes for Punjabi, as well as a number of dubious or non-existent language distinctions made by its parent's standard.

IETF language tags are used by computing standards such as HTTP, HTML, XML and PNG.

A tag is built from subtags in this order:

language-extlang-script-region-variant-extension-privateuse

Simple examples include:

en-NZ
en-US
en-GB
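
The subtag order above can be turned into a small parser. The following is a simplified sketch, not a full BCP 47 implementation: it classifies subtags by length and character type only, does not check them against the IANA Language Subtag Registry, and does not handle grandfathered or purely private-use tags. The class and function names are my own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LanguageTag:
    language: str
    extlangs: List[str] = field(default_factory=list)
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = field(default_factory=list)
    extensions: Dict[str, List[str]] = field(default_factory=dict)
    private_use: List[str] = field(default_factory=list)


def _is_variant(subtag):
    """Variants are 5-8 alphanumerics, or 4 characters starting with a digit."""
    if not subtag.isalnum():
        return False
    return 5 <= len(subtag) <= 8 or (len(subtag) == 4 and subtag[0].isdigit())


def parse_tag(tag):
    """Split a language tag into its parts, checking shape only."""
    if not tag.isascii():
        raise ValueError(f"non-ASCII tag: {tag!r}")
    parts = tag.split("-")
    count = len(parts)
    if not (parts[0].isalpha() and 2 <= len(parts[0]) <= 8):
        raise ValueError(f"bad primary language subtag in {tag!r}")
    result = LanguageTag(language=parts[0].lower())
    i = 1
    # Up to three extended language subtags of exactly three letters.
    while (i < count and len(result.extlangs) < 3
           and len(parts[i]) == 3 and parts[i].isalpha()):
        result.extlangs.append(parts[i].lower())
        i += 1
    # Script: four letters, conventionally written in title case.
    if i < count and len(parts[i]) == 4 and parts[i].isalpha():
        result.script = parts[i].title()
        i += 1
    # Region: two letters or three digits, conventionally upper case.
    if i < count and ((len(parts[i]) == 2 and parts[i].isalpha())
                      or (len(parts[i]) == 3 and parts[i].isdigit())):
        result.region = parts[i].upper()
        i += 1
    while i < count and _is_variant(parts[i]):
        result.variants.append(parts[i].lower())
        i += 1
    # Extensions and private use each start with a single-character subtag.
    while i < count:
        singleton = parts[i].lower()
        if len(singleton) != 1 or not singleton.isalnum():
            raise ValueError(f"unexpected subtag {parts[i]!r} in {tag!r}")
        i += 1
        if singleton == "x":
            result.private_use = [p.lower() for p in parts[i:]]
            i = count
            if not result.private_use:
                raise ValueError(f"empty private-use section in {tag!r}")
        else:
            values = []
            while i < count and 2 <= len(parts[i]) <= 8 and parts[i].isalnum():
                values.append(parts[i].lower())
                i += 1
            if not values:
                raise ValueError(f"empty extension {singleton!r} in {tag!r}")
            result.extensions[singleton] = values
    return result


min_nan = parse_tag("nan-Hant-TW")
zurich = parse_tag("gsw-u-sd-chzh")
```

Because only the shape is checked, a well-formed tag containing an unregistered subtag would still parse. Validating against the registry would be a separate step.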

IANA Language Subtag Registry

W3C Internationalisation

Language Strings

The initial experiment used string sets from the MediaWiki translation effort:

English (4,000+ strings)
French (4,000+ strings)
Spanish (4,000+ strings)
German (4,000+ strings)
Paiwan (Taiwan) (600+ strings)
Abkhazian (Abkhazia) (425 strings)
Maori (New Zealand) (128 strings)

Other sources of string libraries discovered so far:

Moodle (100 languages)
LibreOffice (The Document Foundation translations)
transifex.com

Examples of some standard UI strings:

Login
Logout
Username
Password
Help
File
Edit
Forgot password
Change password
Save
Delete

Translating help documents, the actual content, and scientific terms is a much bigger problem and will require a different solution.
