The i18n Issue

Mike's Notes

This is a copy of the March issue of Ajabbi Research.

It is about the history of the effort to make Pipi available in any language, localisation or script (i18n) requested by users.

Ajabbi Research is published on Substack on the first Friday of each month, and subscriptions are free.

Each issue is a broad historical overview of a research topic, serving as an index to dozens of previously posted related articles. There are now over 650 articles/posts.

This copy of the issue will be updated with additional information as it becomes available. Check the Last Updated date given below.

Eventually, each issue will be reused on the separate Ajabbi Research website as an introduction to a research area comprising multiple research projects.

Resources

References

  • SIL
  • Unicode

Repository

  • Home > Ajabbi Research > Library >
  • Home > Handbook > 

Last Updated

03/04/2026

The i18n Issue

By: Mike Peters
Ajabbi Research: 6/03/2026

Mike is the inventor and architect of Pipi and the founder of Ajabbi.

This is the story of the effort to make Pipi available in any human language and script. The steps taken have been part of Pipi's development since 2005, spanning 5 versions.

The NZERN Pipi 2003-2005 Development Plan started it all.

Pipi 4 (2005-2008)

The story starts with Pipi 4. It was a big, successful system that supported community-driven Ecological Restoration in NZ. Here is a history of that Pipi version.

Initially, the large websites that Pipi generated were in English only. Then, as botanical and zoological information was added, Latin, English, and Māori names were used. Eventually, provision for Chinese was planned to support Chinese community-led conservation programmes. There were no separate language data structures in the 850-table Pipi 4 database; instead, some entities had additional columns for each language.

English

  • The 25,000-page websites that Pipi generated were initially in English.

Latin

  • Scientific names were written in Latin in biological data.

Māori

  • Over time, it was realised that support for the Māori language, te reo Māori, was required. As a start, bilingual volunteers provided lists of words to use for regional areas, towns, etc.

Chinese

  • An Auckland-based, Chinese community-driven initiative set out to reach older residents who didn't speak English with information about conservation.

    Pipi 6 (2017-2019)

    When Pipi was rebuilt from memory, based on the limited experience with Pipi 4, foundational work was done to prepare Pipi for better multilingual support. This would require extra databases.

    Metadata using international codes was added to every database table to enable future language usage. The codes used were ISO 639-3 (languages), ISO 3166-1 alpha-3 (countries), Unicode (scripts), and CLDR/LDML (locale data).

    ISO 639-3

    ISO 639 gives comprehensive provisions for the identification and assignment of language identifiers to individual languages, and for the creation of new language code elements or for the modification of existing ones (Terms of Reference of the ISO639/MA). - ISO 639-3

    *** 

    ISO 639-3 defines three-letter codes for identifying languages. The standard was published by the International Organization for Standardization (ISO) on 1 February 2007. As of 2023, this edition of the standard has been officially withdrawn and replaced by ISO 639:2023.

    ISO 639-3 extends the ISO 639-2 alpha-3 codes with an aim to cover all known natural languages. The extended language coverage was based primarily on the language codes used in the Ethnologue (volumes 10–14) published by SIL International, which is now the registration authority for ISO 639-3. It provides an enumeration of languages as complete as possible, including living and extinct, ancient and constructed, major and minor, written and unwritten. However, it does not include reconstructed languages such as Proto-Indo-European.

    ISO 639-3 is intended for use as metadata codes in a wide range of applications. It is widely used in computer and information systems, such as the Internet, in which many languages need to be supported. In archives and other information storage, it is used in cataloguing systems, indicating what language a resource is in or about. The codes are also frequently used in the linguistic literature and elsewhere to compensate for the fact that language names may be obscure or ambiguous. - Wikipedia

    Examples

    • eng (English)
    • fra (French)

    ISO 3166-1 alpha-3

    ISO 3166-1 alpha-3 codes are three-letter country codes defined in ISO 3166-1, part of the ISO 3166 standard published by the International Organization for Standardization (ISO), to represent countries, dependent territories, and special areas of geographical interest. They allow a better visual association between the codes and the country names than the two-letter alpha-2 codes (the third set of codes is numeric and hence offers no visual association). They were first included as part of the ISO 3166 standard in its first edition in 1974. - Wikipedia

    Examples

    • ABW  (Aruba)
    • AFG  (Afghanistan)
    • AGO  (Angola)

    Unicode

    Unicode (also known as The Unicode Standard and TUS) is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 17.0 defines 159,801 characters and 172 scripts used in various ordinary, literary, academic and technical contexts. - Wikipedia

    Examples

    • Latn (Latin)
    • Linb (Linear B)
    • Hebr (Hebrew)

    CLDR/LDML

    The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML). - Wikipedia

    Example

    <?xml version="1.0" encoding="UTF-8" ?>
    <ldml>
      <identity>
        <version number="1.1">ldml version 1.1</version>
        <generation date="2024-03-06"/>
        <language type="en"/>
        <territory type="US"/>
      </identity>
      <!-- other locale data sections follow -->
    </ldml>
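
As a rough illustration of how an application might read the identity of an LDML file, here is a minimal Python sketch. It assumes the language and territory sit inside an `<identity>` element, as in the LDML specification. The sample data and the `read_identity` helper are hypothetical and are not part of Pipi.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the LDML <identity> element.
LDML_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<ldml>
  <identity>
    <version number="1.1"/>
    <language type="en"/>
    <territory type="US"/>
  </identity>
</ldml>"""


def read_identity(ldml_bytes: bytes) -> dict:
    """Return the language, script and territory declared in an LDML file."""
    root = ET.fromstring(ldml_bytes)
    identity = root.find("identity")
    if identity is None:
        raise ValueError("LDML document has no <identity> element")
    result = {}
    for name in ("language", "script", "territory"):
        element = identity.find(name)
        if element is not None:
            # Each identity child carries its code in the "type" attribute.
            result[name] = element.get("type")
    return result


print(read_identity(LDML_SAMPLE))
```

Elements that are absent from the file, such as the script here, are simply left out of the result.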

    Localisation (L10N)

    Language localisation (or localization) is the process of adapting a product's translation to a specific country or region. It is the second phase of a larger process of product translation and cultural adaptation (for specific countries, regions, cultures or groups) to account for differences in distinct markets, a process known as internationalisation and localisation. - Wikipedia

    ***

    Internally, Pipi automatically stores and uses 3-letter language codes, 4-letter script codes, and 3-letter country codes to define locales.

    Examples

    • eng-Latn-NZL (New Zealand English)
    • eng-Latn-USA (United States English)

    Customers can configure the options for their own websites.

    Examples

    • en-NZ
    • en-GB
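
The relationship between the internal 3-4-3 locale and the shorter customer-facing form can be sketched as below. This is only an illustration, not Pipi's implementation: the `Locale` class and the two lookup tables are hypothetical, and it assumes ISO 3166-1 alpha-3 country codes such as NZL.

```python
from dataclasses import dataclass

# Hypothetical lookup tables; a real system would load the full
# ISO 639 and ISO 3166-1 code lists instead of hard-coding a few rows.
LANG_3_TO_2 = {"eng": "en", "fra": "fr"}
COUNTRY_3_TO_2 = {"NZL": "NZ", "USA": "US", "GBR": "GB"}


@dataclass(frozen=True)
class Locale:
    language: str  # 3-letter ISO 639-3 code, e.g. "eng"
    script: str    # 4-letter script code, e.g. "Latn"
    country: str   # 3-letter ISO 3166-1 alpha-3 code, e.g. "NZL"

    @classmethod
    def parse(cls, tag: str) -> "Locale":
        """Parse a language-script-country tag and normalise its casing."""
        language, script, country = tag.split("-")
        if len(language) != 3 or len(script) != 4 or len(country) != 3:
            raise ValueError(f"not a 3-4-3 locale tag: {tag!r}")
        return cls(language.lower(), script.title(), country.upper())

    def internal_tag(self) -> str:
        """Long form, as stored internally in this sketch."""
        return f"{self.language}-{self.script}-{self.country}"

    def short_tag(self) -> str:
        """Short form of the kind a customer might configure."""
        return f"{LANG_3_TO_2[self.language]}-{COUNTRY_3_TO_2[self.country]}"


locale = Locale.parse("eng-Latn-NZL")
print(locale.internal_tag(), locale.short_tag())
```

Keeping the long form as the stored value means the short form can always be derived, while the reverse would require guessing the script.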

    Pipi 7 (2020)

    Small, simple, static HTML mockups of websites were created to test how different languages could be used. Experiments with HTML and CSS were conducted to display text on a website in Left-to-Right (LTR) and Right-to-Left (RTL) text direction.
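
One way such a mockup could pick its text direction is to derive the HTML `dir` attribute from the script code. The Python sketch below is an assumption about how this might be done, not a record of the Pipi 7 experiments. The `RTL_SCRIPTS` set is a small, incomplete sample, and both function names are hypothetical.

```python
from html import escape

# A small, incomplete sample of right-to-left script codes.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}


def text_direction(script: str) -> str:
    """Return the HTML dir value for a 4-letter script code."""
    return "rtl" if script in RTL_SCRIPTS else "ltr"


def wrap_paragraph(text: str, lang: str, script: str) -> str:
    """Wrap text in a <p> element carrying lang and dir attributes."""
    direction = text_direction(script)
    return f'<p lang="{escape(lang)}" dir="{direction}">{escape(text)}</p>'


print(wrap_paragraph("Kia ora", "mi", "Latn"))
print(wrap_paragraph("שלום", "he", "Hebr"))
```

Setting `dir` in the markup leaves the browser to handle the layout, so the CSS only needs to avoid hard-coded left and right values.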

    Pipi 8 (2021-2022)

    System-wide i18n and L10N namespaces were implemented in all parts of Pipi to enable reliable automation and rapid scaling across multiple languages.

    Pipi 9 (2023-2026)

    Pipi 9 joins up all the built systems to self-generate documentation and a front-end User Interface (UI).

    Experiments were conducted to determine how to integrate i18n support with Pipi's other features. It was confirmed that the Pipi core is written in British Standard English and checked by Grammarly.

    A source-target data model structure was created to store i18n scripts. It was greatly influenced by the system used by Wikipedia (MediaWiki) and OpenOffice.

    Experiments were done using 23 languages and writing scripts to test the CMS Engine (cms), data storage, UI layout, etc.

    String Translation

    Community translation will be required, and a dedicated workspace will be used for this purpose.

    Account Settings

    Each Pipi is built in 1 language and script. Each account can have many languages. An account has many deployments, each of which is in only one language. A deployment can have many workspaces.
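
The relationships described above can be sketched as a small data model. This is a simplified Python illustration under stated assumptions, not the Pipi schema: the class and field names are hypothetical, and a locale is reduced to a plain string.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str


@dataclass
class Deployment:
    name: str
    locale: str  # exactly one language and script per deployment
    workspaces: list[Workspace] = field(default_factory=list)


@dataclass
class Account:
    name: str
    deployments: list[Deployment] = field(default_factory=list)

    def languages(self) -> set[str]:
        """An account's languages are those of its deployments."""
        return {deployment.locale for deployment in self.deployments}


account = Account("Example School")
account.deployments.append(
    Deployment("main-site", "eng-Latn", [Workspace("teachers")])
)
account.deployments.append(Deployment("reo-site", "mri-Latn"))
print(sorted(account.languages()))
```

Because the language lives on the deployment rather than the account, adding a language to an account means adding a deployment.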

    Localisation

    API Endpoints

    All API connections include a choice of API version and language/script.

    Scripts

    Noto from Google was chosen as the default font for Ajabbi due to the number of scripts it supports.

    Keyman

    SIL provides the open-source Keyman, which enables keyboards for 2,500 different languages to be added to websites. Pipi will use a Keyboard Engine (kyb) to provide this integration. This will be built as part of Pipi 10.

    URL Naming Pattern

    Many experiments were conducted to determine a URL structure that could accommodate websites in many languages. Wikipedia was the main influence.

    Examples

    • eng.example.com
    • example.com/eng/
    • en-gb.example.com
    • example.com/en-nz/
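
Both patterns, locale as a subdomain and locale as the first path segment, can be recognised by the same small routine. The Python sketch below is illustrative only. The function name and the `known_locales` parameter are hypothetical, and the source does not say which pattern Pipi finally adopted.

```python
from typing import Optional
from urllib.parse import urlsplit


def locale_from_url(url: str, base_domain: str, known_locales: set) -> Optional[str]:
    """Return the locale named in a URL, or None if there is none.

    The subdomain is checked first, then the first path segment.
    A candidate only counts if it appears in known_locales.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    suffix = "." + base_domain
    if host.endswith(suffix):
        candidate = host[: -len(suffix)]
        if candidate in known_locales:
            return candidate
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments and segments[0].lower() in known_locales:
        return segments[0].lower()
    return None


LOCALES = {"eng", "en-nz"}
print(locale_from_url("https://eng.example.com/about", "example.com", LOCALES))
print(locale_from_url("https://example.com/en-nz/about", "example.com", LOCALES))
```

Checking candidates against a known list stops ordinary subdomains or pages, such as `www` or `/about`, from being mistaken for a language.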

    Documentation

    Documentation and learning material will need to be provided in many languages. The data models are ready for this. As the British English documentation is completed, it could be auto-translated into US English using Grammarly and into the 9 world languages using Google Translate. It would then need to be checked by volunteer users. This is speculative and will require trial and error to confirm.

    Language Prioritisation

    English > 9 world languages > 7,000 local languages + localisation.

    Priority will be given to English, which will then serve as the source for translation into 9 world languages.

    • Arabic
    • Bahasa Indonesia
    • Chinese
    • French
    • German
    • Japanese
    • Hindi
    • Portuguese
    • Russian
    • Spanish

    Carefully edited material in those languages can then be translated into any of the other 7,000 languages by volunteers, based on user requests.
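
The tiered order above amounts to a translation chain: English first, then a world language, then a local language. The Python sketch below shows one way to express it. The function name, the `pivot` parameter, and the use of ISO 639-3 codes for the listed languages are assumptions for illustration.

```python
SOURCE = "eng"

# ISO 639-3 codes for the world languages listed above.
WORLD_LANGUAGES = {
    "ara", "ind", "zho", "fra", "deu",
    "jpn", "hin", "por", "rus", "spa",
}


def translation_chain(target: str, pivot: str = SOURCE) -> list[str]:
    """Return the languages a text passes through to reach the target.

    World languages are translated directly from English. Any other
    language is translated from English or from a chosen world language.
    """
    if target == SOURCE:
        return [SOURCE]
    if target in WORLD_LANGUAGES:
        return [SOURCE, target]
    if pivot == SOURCE:
        return [SOURCE, target]
    if pivot not in WORLD_LANGUAGES:
        raise ValueError(f"{pivot!r} is not a source or world language")
    return [SOURCE, pivot, target]


print(translation_chain("fra"))
print(translation_chain("que", pivot="spa"))
```

Here a Quechua translation is made from the edited Spanish text, which may suit volunteers who read Spanish more easily than English.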

      Model-driven UI

      The User Interface Description Language (UIDL) was an EU-funded project that was abandoned in 2010 after 10 years of excellent work. It was to enable accessibility on different screens and devices. The research results were reverse-engineered to build a User Interface Engine (usi) that would run in reverse to generate accessibility solutions for Pipi. The CSS Engine (css) replaced some redundant components of the UIDL project. Additional engines for localisation and personalisation were created.


      Pipi CMS Engine (cms)

      For a first teaching customer, a decision was made early on to autogenerate a separate website for each language (English, Māori, NZ Sign Language, and AAC picture language). This was the simplest solution for the CMS and the users.

      Creating UI for each natural language, including sign languages (i18n), requires user requests and volunteer testers.

      Sign Language

      A scheme was devised to embed NZ Relay Video Interpreting on any webpage and in user workspaces. This is an ongoing experiment, driven by deaf people.

      Picture Language

      Professor Stephen Hawking used AAC via a computer-generated voice. There are many forms of AAC, including picture language. Providing this as a UI is being explored, with other AAC to follow. This is important for the millions of people with Cerebral Palsy and Motor Neurone Disease.

      Invented Languages

      This system will be able to provide support for Klingon, Elvish, and other invented languages from books and movies, upon request and with volunteers prepared to do the work. This could be useful for fan communities.

      Dead Languages

      This system will be able to provide support for long-dead languages often studied by linguists and historians, such as Ancient Egyptian, Sumerian, Sanskrit, and Ancient Greek, upon request, with volunteers prepared to do the work. This could be useful for museums and faith communities.

      Workspace personalisation

      The workspace settings will eventually offer complete personalisation of the UI in other languages. This will use a personalisation form in account settings.

      Future Ajabbi Foundation Sponsorship

      Once Ajabbi has established ongoing sponsorship of Ortus, which provides the open-source BoxLang, the Ajabbi Foundation will generously sponsor the open-source SIL Keyman on an ongoing basis.

      What's next

      Pipi 9 is available only in English. However, users can request any other language through their profile. Pipi 10 (2027-) will feature those multiple languages.

      The most useful and inspiring resource has been SIL Global.

      Dedication

      Every child has the right to be educated in the language of their people and of their birth. This is dedicated to those working tirelessly to record, strengthen or revive human languages.
