Incorporating languages for i18n

One of the significant challenges this year was enabling any language or writing system (script) to be used by the Pipi 9 Content Management System (CMS).

The decision was made to alter the architecture now to save rework later, even though English will be the only language available initially.

Some of the critical resources used were:

Unicode Consortium

The Unicode Consortium describes its overall purpose as:

...enabling people around the world to use computers in any language by providing freely available specifications and data to form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of this purpose is to standardize, maintain, educate and engage academic and scientific communities and the general public about, make publicly available, promote, and disseminate to the public a standard character encoding that provides for an allocation for more than a million characters.

UTF-8

UTF-8 can encode every Unicode character, and therefore every writing system that Unicode covers, so Pipi uses UTF-8. Separately, ISO 15924 gives each writing system (script) a 4-letter code and a 3-digit numeric code, so 1,000 numeric spots are available. About 178 scripts are listed so far, with more being added.

For example, Latn is the 4-letter ISO 15924 code for the Latin script.
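As a small illustration of why UTF-8 suits this, the Python sketch below encodes text from three scripts and shows that each round-trips unchanged, although the number of bytes per character differs. The sample words and the utf8_summary helper are assumptions made for this example and are not part of Pipi.

```python
# Sample text in three scripts; the keys are 4-letter script codes.
samples = {
    "Latn": "M\u0101ori",                            # Latin, with a macron
    "Cyrl": "\u0421\u0440\u043f\u0441\u043a\u0438",  # Serbian in Cyrillic
    "Hant": "\u81fa\u7063",                          # Taiwan, traditional Han
}


def utf8_summary(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    encoded = text.encode("utf-8")
    # Decoding must give back exactly the original string.
    assert encoded.decode("utf-8") == text
    return len(text), len(encoded)


for script, text in samples.items():
    chars, size = utf8_summary(text)
    print(f"{script}: {text} -> {chars} code points, {size} bytes")
```

Plain ASCII letters take one byte each, accented Latin and Cyrillic letters take two, and the Han characters here take three, so storage size depends on the script even though the encoding is the same.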

CLDR

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

Among the types of data that CLDR includes are the following:

  • Translations for language names
  • Translations for territory and country names
  • Translations for currency names, including singular/plural modifications
  • Translations for weekday, month, era, and period of the day, in full and abbreviated forms
  • Translations for time zones and example cities (or similar) for time zones
  • Translations for calendar fields
  • Patterns for formatting/parsing dates or times of day
  • Exemplar sets of characters used for writing the language
  • Patterns for formatting/parsing numbers
  • Rules for language-adapted collation
  • Rules for spelling out numbers as words
  • Rules for formatting numbers in traditional numeral systems (such as Roman and Armenian numerals)
  • Rules for transliteration between scripts, much of it based on BGN/PCGN romanisation

The information is currently used in International Components for Unicode, Apple's macOS, LibreOffice, MediaWiki, and IBM's AIX, among other applications and operating systems.

SIL International

ISO 639-3, for which SIL International is the registration authority, lists 7,000+ languages, each identified by a 3-letter code.

For example, eng is the code for English.

IETF

An IETF BCP 47 language tag is a standardized code or tag used to identify human languages on the Internet (a language code). The tag structure has been standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained by the IANA Language Subtag Registry.

To distinguish language variants for countries, regions, or writing systems (scripts), IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49. For example, the tag "en" stands for English; "es-419" for Latin American Spanish; "rm-sursilv" for Romansh Sursilvan; "sr-Cyrl" for Serbian written in Cyrillic script; "nan-Hant-TW" for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; and "gsw-u-sd-chzh" for Zürich German. In accordance with ISO 639-3, however, it does not provide codes for distinguishing between Arabic-based scripts and maintains two duplicate codes for Punjabi, as well as a number of dubious or non-existent language distinctions made by its parent's standard.

BCP 47 language tags are used by computing standards such as HTTP, HTML, XML and PNG.

The general structure of a tag is:

language-extlang-script-region-variant-extension-privateuse

Simple examples include:
  • en-NZ
  • en-US
  • en-GB
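To make the tag structure concrete, here is a minimal Python sketch that splits a tag into language, script, and region, and keeps anything after that as a list. It is a deliberate simplification and not a full BCP 47 parser: extlang subtags and grandfathered tags are not handled, and the names TAG_RE and parse_tag are made up for this example.

```python
import re

# Simplified pattern: language, optional script, optional region, then any
# remaining subtags (variants, extensions, private use) kept as-is.
# Assumption: extlang and grandfathered tags are out of scope here.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<rest>(?:-[A-Za-z0-9]{1,8})*)$"
)


def parse_tag(tag: str) -> dict:
    """Split a language tag into its main subtags, in conventional case."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a well-formed tag: {tag!r}")
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
        "rest": [s for s in parts["rest"].split("-") if s],
    }


print(parse_tag("nan-Hant-TW"))
print(parse_tag("es-419"))
```

This only checks the shape of a tag. Whether a given language, script, or region subtag is actually registered still needs a lookup in the IANA Language Subtag Registry.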

IANA Language Subtag Registry 

W3C Internationalisation

Language Strings

The initial experiment used string sets from the MediaWiki translation effort:

  • English (4,000+ strings)
  • French (4,000+ strings)
  • Spanish (4,000+ strings)
  • German (4,000+ strings)
  • Paiwan (Taiwan) (600+ strings)
  • Abkhazian (Abkhazia) (425 strings)
  • Maori (New Zealand) (128 strings)

Other sources of string libraries so far discovered 

Examples of some standard UI strings:
  • Login
  • Logout
  • Username
  • Password
  • Help
  • File
  • Edit
  • Forgot password
  • Change password
  • Save
  • Delete

Translating help documents, the actual content, and scientific terms is a much bigger problem and will require a different solution.

Architecture Decisions

  • The underlying master language is English.
  • The Locale User Interface (LUI) engine is responsible for locale handling.
  • Subsystems include Region, Language, Script, Calendar, UOM, Person Naming, Strings, etc.
  • LUI is updated automatically from Unicode, CLDR, ISO 639-3, the IANA Language Subtag Registry, etc.
  • Strings are stored in a database structure mapped to the English string.
  • At render time, strings are substituted to create code or static web pages in a particular language and script.
  • Some sort of translation process by volunteers will eventually become necessary.
  • Expect these assumptions to change a lot following experimental evidence.
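
The string storage and render-time substitution in the list above could look roughly like the Python sketch below. It is an illustration under stated assumptions, not Pipi's implementation: the in-memory STRINGS dictionary stands in for the database, and the {{...}} placeholder syntax, the translate and render functions, and the sample translations are all invented for this example.

```python
import re

# Hypothetical stand-in for the string database:
# English master string -> {language tag -> translated string}.
STRINGS = {
    "Login": {"fr": "Connexion", "de": "Anmelden"},
    "Password": {"fr": "Mot de passe", "de": "Passwort"},
    "Help": {"fr": "Aide"},  # no German entry, to show the fallback
}

# Assumed placeholder syntax: {{English string}}
PLACEHOLDER = re.compile(r"\{\{(.+?)\}\}")


def translate(english: str, lang: str) -> str:
    """Look up a translation, falling back to the English master string."""
    return STRINGS.get(english, {}).get(lang, english)


def render(template: str, lang: str) -> str:
    """Replace every {{English string}} placeholder with its string for lang."""
    return PLACEHOLDER.sub(lambda m: translate(m.group(1), lang), template)


page = "<button>{{Login}}</button> <a>{{Help}}</a>"
print(render(page, "fr"))
print(render(page, "de"))
```

Keying every entry on the English string means a missing translation degrades to English instead of failing, which matches English being the master language.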
