The State of Internationalization in JavaScript

By on July 19, 2017 11:09 am

As businesses expand into new markets and existing markets become more diverse, it is increasingly rare that enterprise applications can expect to serve speakers of only one language, with identical expectations for how they should be addressed or be presented data. In spite of this, globalization — the process of catering an experience to users in specific regions — receives less attention than is warranted, and many times is an afterthought in application development. In part, however, this is due to how difficult it is to correctly globalize an application.


Globalizing an application requires a tremendous amount of information. Beyond language-specific messaging, applications must also account for locale-aware string sorting, formatting and parsing dates, lists, currencies, and units. Even within a single language, applications may need to cater to multiple dialects — adapting to regional variations in grammar and vocabulary. There are also technical considerations like proper Unicode and bidirectional text handling.

Further, JavaScript itself has not made globalization easy. Between notorious frustrations with localizing Date objects and confusing issues with handling Unicode text, developers are forced to reach for libraries to manage what they hoped would be provided out of the box.

Overview of Globalization

Globalization encompasses two separate steps: internationalization and localization. Internationalization (often abbreviated to i18n, for the 18 letters between the initial “i” and the final “n”) is the process of preparing an application for adaptation to multiple contexts, while localization (often abbreviated to l10n for the same reason) refers to the process of adapting an application to a specific context. In other words, internationalization can be seen as the architecture and localization as the data.

Some of that data, like locale-specific messaging, will be provided directly by individual applications, while most other data used to localize an application will come from a locale data repository. By far the best and most prevalent repository is The Unicode Consortium’s Common Locale Data Repository (CLDR). The CLDR data describe how to display just about any information that varies from one locale to the next, making it possible for applications to correctly sort values and display lists, currencies, dates, plurals and ordinals. The amount of information in the CLDR and the knowledge needed to process it is overwhelming, so applications will usually access them indirectly through various third-party libraries.

Over the years, several tools have been developed to address the various aspects of globalization. Some aim to be one-stop solutions, whereas others focus on a specific aspect of globalizing JavaScript applications. The list below is certainly not exhaustive, but effectively demonstrates different solutions to the complex issues with globalizing JavaScript applications.

Dojo 1

As a complete JavaScript toolkit, Dojo 1 provides an entire globalization ecosystem, with mechanisms for locale-specific messaging, as well as for handling numbers, currencies, and dates and times. Dojo 1 leverages data from the CLDR, but stores only a subset of it directly in its repository. This removes the need to download the entire CLDR and covers most use cases, with the downside being that more complex locale support requires additional setup. One of the major strengths of the Dojo 1 implementation is that internationalization is a first-class citizen, so it supports most application needs, and it is clear the entire toolkit was built with internationalization in mind. Its main weaknesses are that it supports only a single locale at a time and that it is not feature-complete – requiring applications to rely on additional tools for things like robust message or unit formatting.

MessageFormat.js

Alex Sexton’s excellent MessageFormat.js is a JavaScript implementation of the International Components for Unicode (ICU) project’s MessageFormat standard. When localizing messages displayed to the user, often the only requirement is a way to swap out messages as the locale changes and a way to replace tokens with values. However, this approach fails when message specifics must adapt to variables like gender or count. For example, the classic TodoMVC application displays a count for the number of remaining todos. In English, this message takes two forms: one for the singular form (“1 item left”), and the other to display the plural form (“{n} items left”). Other languages, however, treat pluralization differently. Arabic, for example, requires an additional form when there are two items left. The ICU MessageFormat is a very well thought-out format that solves these exact issues.
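To make the plural problem concrete, here is a minimal sketch of category-based message selection. It is not the MessageFormat.js API: `pluralCategory`, `formatCount`, and `englishItemsLeft` are hypothetical names, and the rules are hand-written simplifications of the CLDR cardinal rules that only handle non-negative integers.

```javascript
// Hypothetical, simplified plural rules for non-negative integers,
// modeled on the CLDR cardinal rules for English and Arabic.
const pluralCategory = {
  en: (n) => (n === 1 ? 'one' : 'other'),
  ar: (n) => {
    if (n === 0) return 'zero';
    if (n === 1) return 'one';
    if (n === 2) return 'two';
    const mod100 = n % 100;
    if (mod100 >= 3 && mod100 <= 10) return 'few';
    if (mod100 >= 11) return 'many';
    return 'other';
  }
};

// Pick the message form for the count's plural category, falling back to "other".
function formatCount(locale, forms, n) {
  const category = pluralCategory[locale](n);
  const template = forms[category] !== undefined ? forms[category] : forms.other;
  return template.replace('{n}', String(n));
}

// Hypothetical message table for the TodoMVC counter.
const englishItemsLeft = { one: '{n} item left', other: '{n} items left' };
console.log(formatCount('en', englishItemsLeft, 1)); // "1 item left"
console.log(formatCount('en', englishItemsLeft, 3)); // "3 items left"

// An Arabic translation would have to cover up to six categories:
const arabicCategories = [0, 1, 2, 3, 11, 100].map(pluralCategory.ar);
console.log(arabicCategories); // ["zero", "one", "two", "few", "many", "other"]
```

A simple token-replacement scheme has no place to put those extra forms, which is the gap a format like ICU MessageFormat is designed to fill.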

Globalize.js

Globalize.js, developed by Rafael Xavier de Souza, is arguably the most complete JavaScript internationalization ecosystem available today. Not only does it support the ICU MessageFormat format (via MessageFormat.js), it also includes pluralization support and date/time/relative time, number/currency, and unit formatting support. Unlike Dojo 1, Globalize.js does not include CLDR data in its repo, but requires that users supply the CLDR data they need. Aiding this is the companion cldr-data module that downloads the (near) entire CLDR. The main benefit of this approach is that developers can easily update the application’s CLDR data as changes are made. The downside is that cldr-data installs a large portion (~250MB) of the CLDR data, which may not be desirable. However, the project does supply a build tool that can compile formatters and eliminate the need to include CLDR data in the output modules. At the moment, Globalize.js does not fall back to the native Intl global (see below) when available, but that may change in future.

Moment.js

By now most JavaScript developers will be at least marginally familiar with Moment.js, even without prior experience in globalizing applications. Along with Globalize and the Dojo Toolkit, Moment is a member of the JS Foundation, and is the go-to library for parsing, formatting, and validating dates and times. Localizing dates is one of the most frustrating aspects of globalization in JavaScript, but Moment makes doing so easy.

@dojo/i18n

@dojo/i18n is the internationalization ecosystem for Dojo 2. Rather than reinvent the wheel, Dojo 2 delegates message, date, number, and unit formatting and parsing to Globalize.js. Attempting to avoid the limitations of the Dojo 1 implementation, @dojo/i18n supports dynamic locale switching, provides sensible fallbacks for unsupported locales, and even allows multiple locales with multiple text directions to be used within the same application simultaneously. Like Globalize, @dojo/i18n does not supply CLDR data, instead requiring applications to provide all required data. The impetus behind this decision is that the project does not wish to bind applications to a specific version of the CLDR, but also does not want to require that every application download the entire CLDR.

Native Internationalization Support with ES6 and ECMA-402

Recognizing the need for a dedicated focus on internationalization, TC39 established a separate internationalization specification, ECMA-402. Development follows the TC39 process, so new features are proposed, evaluated, and promoted through a series of stages until the change is either rejected or formally added to the language. Thus far, Intl.DateTimeFormat (date and time formatting), Intl.NumberFormat (number and currency formatting), and Intl.Collator (locale-aware string sorting) enjoy support in modern browsers (IE11, Edge, and latest Chrome, Firefox, Safari, iOS Safari, and Android).
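As a rough illustration of the three constructors named above, the following sketch formats a date, a currency amount, and a sorted list. The sample values and variable names are arbitrary choices, and the output for locales other than en-US depends on the locale data shipped with the runtime.

```javascript
// Fixed instant so the output does not depend on when or where this runs.
const published = new Date(Date.UTC(2017, 6, 19)); // months are zero-based: 6 is July

// Date formatting; `timeZone: 'UTC'` keeps the calendar day stable across machines.
const usDate = new Intl.DateTimeFormat('en-US', { timeZone: 'UTC' }).format(published);
console.log(usDate); // "7/19/2017"

// Number and currency formatting.
const usd = new Intl.NumberFormat('en-US', { style: 'currency', currency: 'USD' });
console.log(usd.format(1234.5)); // "$1,234.50"

// Locale-aware sorting: a plain sort() compares UTF-16 code units,
// so "ä" lands after "z"; a German collator places it next to "a".
const words = ['z', 'ä', 'a'];
const byCodeUnit = [...words].sort();
const byGermanRules = [...words].sort(new Intl.Collator('de').compare);
console.log(byCodeUnit);    // ["a", "z", "ä"]
console.log(byGermanRules); // ["a", "ä", "z"]
```

Creating a formatter or collator once and reusing it, as `usd` is reused here, avoids resolving the locale data again on every call.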

For applications that must support older browsers, there is an Intl polyfill that implements the DateTimeFormat and NumberFormat constructors, along with the corresponding ECMA-402 changes to the Date and Number globals. As the project README notes, certain implementation-dependent features have not been polyfilled. One is “best fit” locale matching, which provides more sensible locale fallbacks, such as falling back from “es-UY” (Uruguayan Spanish) to “es-AR” (Argentine Spanish) rather than to the default “es” (Castilian). Another is the Intl.Collator constructor, which was left out because of the complexity of the sorting algorithm and the amount of data that would need to be sent over the network to enable proper sorting. In the past, many applications delegated locale-specific sorting to the server, which unfortunately may still be required where Intl.Collator support is missing.

Beyond the new functionality introduced with ECMA-402, ES6 has also improved internationalization support for the existing String, Date, RegExp, and Number natives. For one, modern browsers now support additional locale and options arguments that have been added to methods like String.prototype.localeCompare and Date.prototype.toLocaleString, and inconsistencies in older browser implementations have been fixed, making those methods safer to use than they previously had been. Especially welcome is the improved support for Unicode text, both in strings and in regular expressions.
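The additional locale and options arguments mentioned above can be sketched as follows. The sample values are arbitrary, and the results assume the runtime ships the relevant locale data.

```javascript
// Without a locale argument the result depends on the host's default locale.
// With explicit arguments the comparison is pinned to German rules, and
// `sensitivity: 'base'` treats "a" and "ä" as the same base letter.
const sameBaseLetter = 'a'.localeCompare('ä', 'de', { sensitivity: 'base' });
console.log(sameBaseLetter); // 0

// Date.prototype.toLocaleString with an explicit locale and date-only options.
const launch = new Date(Date.UTC(2017, 6, 19)); // months are zero-based: 6 is July
const longDate = launch.toLocaleString('en-US', {
  timeZone: 'UTC',
  year: 'numeric',
  month: 'long',
  day: 'numeric'
});
console.log(longDate); // "July 19, 2017"

// Number.prototype.toLocaleString accepts the same pair of arguments.
const grouped = (1234567.891).toLocaleString('en-US', { maximumFractionDigits: 1 });
console.log(grouped); // "1,234,567.9"
```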

Without going into too much detail, the \uXXXX escape in JavaScript strings and regular expressions prior to ES6 reads only four hexadecimal digits, so it can directly express only code points in the Basic Multilingual Plane (0x0000 – 0xFFFF). For example, the “G Clef” musical symbol (“𝄞”) is represented by the code point U+1D11E. However, when we attempt to use that code point in a string ('\u1D11E') or regular expression (/\u1D11E/), JavaScript actually treats it as the two-character sequence “ᴑE” (“U+1D11 Latin Small Letter Sideways O” followed by “U+0045 Latin Capital Letter E”). Since such characters can also be represented as surrogate pairs, we can work around this problem by using either the surrogate pair ('\uD834\uDD1E') or the character itself (which JavaScript stores internally as its surrogate pair), but surrogate pairs cannot be used as boundaries for regular expression ranges. ES6 solves these issues with Unicode code point escapes (supported in modern browsers, excepting IE11 and under), which use braces to delimit code points:

'\u1D11E' // "ᴑE" (without braces, only four hex digits are read)
'\u{1D11E}' // '𝄞'

/\u1D11E/.test('𝄞') // false (without braces, this matches "ᴑE")
/\u{1D11E}/u.test('𝄞') // true (note the required `u` flag)

/[𝄞-𝄢]/.test('𝄞') // Error: invalid range
/[𝄞-𝄢]/u.test('𝄞') // true (note the required `u` flag)
/[\u{1D11E}-\u{1D122}]/u.test('𝄞') // true (note the required `u` flag)
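The same distinction between code units and code points shows up when measuring or iterating over strings. The sketch below uses only standard ES6 string methods; the variable names are arbitrary.

```javascript
const clef = '\u{1D11E}'; // "𝄞", outside the Basic Multilingual Plane

// length and charCodeAt count UTF-16 code units, so the clef looks like two "characters".
console.log(clef.length); // 2
console.log(clef.charCodeAt(0).toString(16)); // "d834" (the high surrogate)

// codePointAt and String.fromCodePoint work with the whole code point.
console.log(clef.codePointAt(0).toString(16)); // "1d11e"
console.log(String.fromCodePoint(0x1D11E) === clef); // true

// String iteration (for...of, spread, Array.from) steps by code point.
const characters = [...`a${clef}b`];
console.log(characters.length); // 3
```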

Finally, it is worth mentioning String.prototype.normalize, which returns a string’s Unicode normalization form. Since there are multiple code point combinations that can be used to represent the same visual character, it is possible that two visually identical strings will fail a comparison operation. For example, the “é” in “résumé” can be represented either as the character “U+00E9 Latin Small Letter E with Acute”, or as lowercase “e” combined with “U+0301 Combining Acute Accent”:

're\u0301sume\u0301' === 'résumé' // false
'r\u00e9sum\u00e9' === 'résumé' // true

Although the strings are visually identical, they are composed of different code points and therefore are not equal. To account for such scenarios, the normalize method maps canonically equivalent sequences to a single canonical form:

const normalized = 're\u0301sume\u0301'.normalize();
normalized === 'résumé' // true
normalized === 'r\u00e9sum\u00e9' // true
normalized.charAt(1) === 'e' // false
normalized.codePointAt(1).toString(16) // "e9", i.e., "U+00E9 Latin Small Letter E with Acute"
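In practice, this usually means normalizing both sides before comparing. A minimal sketch, assuming NFC (the default form used by normalize) is the desired target; `equalsNormalized` is a hypothetical helper name.

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function equalsNormalized(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const decomposed = 're\u0301sume\u0301'; // "e" followed by a combining acute accent
const precomposed = 'r\u00e9sum\u00e9';  // single code point U+00E9

console.log(decomposed === precomposed);                // false
console.log(equalsNormalized(decomposed, precomposed)); // true

// The decomposed spelling is longer: 8 code units versus 6.
console.log(decomposed.length, precomposed.length); // 8 6

// NFD goes the other way and splits precomposed characters apart.
console.log(precomposed.normalize('NFD') === decomposed); // true
```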

Current Proposals

ECMA-402 has made significant improvements to the language, yet those improvements are but a stepping stone toward complete internationalization support. A number of other proposals have been submitted and if these additions are incorporated into the standard, JavaScript will have native support for much of the functionality required to internationalize the majority of applications. The most notable exception is message translation; for the time being, tools like MessageFormat.js and ecosystems like Globalize or @dojo/i18n are still required for robust translation support.

  • A proposal to add Unicode property escapes to regular expressions is currently at Stage 3, and will allow readable keywords in regular expressions instead of unreadable Unicode ranges. For example, if you want to match uppercase characters in a regular expression, how might you do so? /[A-Z]/ does not work for all languages, so an exhaustive solution requires knowledge of which Unicode characters are considered uppercase. Unicode itself, however, already uses properties to classify characters as uppercase or lowercase, and this change will allow those property names to be used:
    		const matchUppercase = /\p{Uppercase}/u; // `u` flag required
    		const matchArabic = /\p{Script_Extensions=Arabic}/u; // `u` flag required
  • Intl.DateTimeFormat.prototype.formatToParts and Intl.NumberFormat.prototype.formatToParts are Stage 4 and Stage 3 proposals (respectively) with very limited (and experimental) browser support. These methods allow dates and numbers (like currencies) to be formatted to an array of individual parts, allowing them to be further formatted as needed. For example,
    		// Create a list element with the following text items:
    		// month: 7
    		// day: 11
    		// year: 2017
    		const ul = document.createElement('ul');
    		new Intl.DateTimeFormat('en-US')
    			.formatToParts(new Date(2017, 6, 11)) // months are zero-based: 6 is July
    			.forEach(({ type, value }) => {
    				// skip any separator like "/"
    				if (type !== 'literal') {
    					const li = document.createElement('li');
    					li.textContent = `${type}: ${value}`;
    					ul.appendChild(li);
    				}
    			});
  • Intl.PluralRules is a Stage 3 proposal that adds native support for loading locale-specific plural forms. Most developers may never need to use this feature directly; it is more likely to be used by libraries like MessageFormat.js to eliminate the need to load additional CLDR data.
    		const ordinal = new Intl.PluralRules('en', { type: 'ordinal' });
    		ordinal.select(1); // "one" ("1st")
    		ordinal.select(2); // "two" ("2nd")
    		ordinal.select(3); // "few" ("3rd")
    		ordinal.select(4); // "other" ("4th")
  • Intl.Segmenter is a Stage 3 proposal that provides locale-specific text segmentation, which is useful for such operations as iterating over strings with multiple combining code points (grapheme clusters that are meant to be interpreted as a single character) or correctly determining word boundaries.
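    		const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
    		const segments = segmenter.segment('The quick brown fox jumps over the lazy dog');
    		// spaces and punctuation are segments too, so keep only the word-like ones
    		const words = Array.from(segments).filter(({ isWordLike }) => isWordLike);
    		const reversed = words.map(({ segment }) => segment).reverse();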
    
    		console.log(reversed.join(' ')); // "dog lazy the over jumps fox brown quick The"
  • Intl.ListFormat is a Stage 2 proposal that converts an array into a localized list string. For example,
    		// "Miles, Scott, and Paul"
    		new Intl.ListFormat('en').format([ 'Miles', 'Scott', 'Paul' ]);
  • Intl.UnitFormat is also a Stage 2 proposal, and introduces locale-specific unit formatting capabilities. For example,
    		// "15 hours"
    		new Intl.UnitFormat('en', {
    			type: 'duration',
    			unit: 'hour',
    			style: 'long'
    		}).format(15);
  • Intl.RelativeTimeFormat is a Stage 1 proposal that introduces locale-specific relative time formatting. For example,
    		// `numeric: 'auto'` allows wording like "yesterday" instead of "1 day ago"
    		new Intl.RelativeTimeFormat('en', { numeric: 'auto' }).format(-1, 'day'); // "yesterday"
  • The dateStyle and timeStyle options for Intl.DateTimeFormat, a Stage 1 proposal, allow dates and times to be formatted based on a single pattern. Currently, formatting for individual date/time parts must be specified as different options (e.g., ‘month’, ‘day’, ‘hour’). For example,
    		const date = new Date(2017, 6, 11); // months are zero-based: 6 is July
    		// current implementation: "July 11, 2017"
    		new Intl.DateTimeFormat('en', {
    			year: 'numeric',
    			month: 'long',
    			day: '2-digit'
    		}).format(date);
    
    		// Use a single "dateStyle" option for the same effect:
    		new Intl.DateTimeFormat('en', { dateStyle: 'long' }).format(date);
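The formatToParts item in the list above shows only the date variant. The number variant can be sketched the same way, assuming a runtime that implements Intl.NumberFormat.prototype.formatToParts; the `<small>` markup is a hypothetical use case, not something the API prescribes.

```javascript
const formatter = new Intl.NumberFormat('en-US', { style: 'currency', currency: 'USD' });
const parts = formatter.formatToParts(1234.5);

// Each part carries a type such as "currency", "integer", "group", "decimal", or "fraction".
const types = parts.map(({ type }) => type);
console.log(types);

// Joining the values reproduces the output of format().
const joined = parts.map(({ value }) => value).join('');
console.log(joined === formatter.format(1234.5)); // true

// Hypothetical markup: wrap only the fraction digits, e.g. to render cents in a smaller font.
const markup = parts
  .map(({ type, value }) => (type === 'fraction' ? `<small>${value}</small>` : value))
  .join('');
console.log(markup); // "$1,234.<small>50</small>"
```

Working from typed parts avoids parsing the formatted string with a regular expression, which would break as soon as the locale changes the separators or the symbol position.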

Summary

JavaScript has come a long way in improving its accessibility across languages and cultures, but the need for third-party tools is not going away anytime soon. As proposals are not guaranteed to advance at a set pace, it is not clear when new features like locale-aware list formatting can be expected to land. While some of the more advanced options are yet to be implemented, applications that target modern browsers can begin using native functionality today for string sorting and many date, time, number, and currency formatting or parsing operations. Until existing proposals are added to the specification, developers must still reach for libraries for exhaustive Unicode support or unit, list, and duration formatting. And unless a proposal for a message formatting implementation is submitted, libraries like MessageFormat.js may always be necessary.

Comments

  • Jovica Aleksic

    No mention of Yahoo’s format.js project (https://formatjs.io). Any particular reason for that? I hoped to read about how it compares and relates to the approaches presented here.

  • Mostly just that it’s a derivative of messageFormat.js. We should probably either add to this post or do a follow-up post to cover formatjs, iLib, and i18next.

  • jamuhl

    +1 for showing other options too…ICU is just one format – not the only format ;)