This is a post in the series about data as an asset, see previous post here.
With a purely client-side JavaScript, ActionScript, DHTML or Flash architecture, genuine data-driven structures cannot be implemented. Flash is a notable departure from the text-based scripting techniques, as it is compiled into a binary for runtime. As a binary, it needs disassembly, and while the XML-based XFL interchange format is supported, it isn’t particularly well suited to localization.
Asynchronous technologies were devised before the turn of the millennium to address the issue of high traffic and mandatory reloads of dynamic websites. This evolution advanced hand-in-hand with the development of serialized, atomic data-oriented formats and systems that opened the gate for innovative services and laid the groundwork for the so-called Web 2.0. Data-driven architectures gave rise to the back-end/front-end dichotomy, which in turn allowed for a significant leap in complexity, design, modularity, and more recently, the integration of various platforms into a consolidated ecosystem. As a second instalment on internationalized web architectures, this post is still mainly concerned with the practical aspects. Later in the series, we will look into the ongoing transformation of technologies and talk about the change of mainstream attitude from well-defined to hybrid/fuzzy logic.
A broad array of web technologies allows asynchronous communication with a server in the background without interfering with the current state of the page. The choice depends on complexity, traffic demand, ease of deployment, modularity and other requirements.
PHP
One of the simpler methods to organize linguistic information in a data pool is via PHP. While PHP allows string literals to be stored explicitly in code, as complexity grows this practice can bring the maintenance and localization process to a grinding halt. The following example also presents a common localization issue of plurality mapping:
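// $articleCount is a hypothetical variable holding the number of items in the cart.
$htmlBody .= sprintf('<p>You have (%s) article(s) in your cart.</p>', $articleCount);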
A more robust, generic solution that applies to all implementations, not only PHP, is to abstract all strings into dedicated resources. Data-oriented formats such as YAML and JSON feature common data types and are designed with interoperability, serialization and atomicity in mind. The syntax of both YAML and JSON reflects the underlying data structures and can store any information, including binary types such as images. GNU gettext, on the other hand, serves a narrower purpose: storing localization strings in a bilingual format. All three are in widespread use in offline application development and internationalization; gettext is especially popular in the open-source community. gettext is available natively in PHP (and is used by WordPress as well) as the standard linguistic data exchange format via functions such as dcgettext(), and inherently supports context-aware plural forms. Depending on the target language, plurals can have one or more variants; however, the variant is selected by integer count only. For example, Finnish partitive plurals are not supported, while Slavic paucal variants are:
nplurals=3; plural=n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
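To make the rule above more tangible, here is a minimal sketch of how it picks a variant. The helper names pluralIndex() and pluralize() are hypothetical and not part of gettext, and the Polish word forms are illustrative sample data (this particular rule is the one commonly listed for Polish). The nested ternary is unrolled into if statements because PHP 8 rejects unparenthesized nested ternaries.

```php
<?php
// Hypothetical helper: returns the plural-form index (0, 1 or 2) for a count,
// mirroring: nplurals=3; plural=n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
function pluralIndex(int $n): int
{
    if ($n === 1) {
        return 0; // singular
    }
    $mod10 = $n % 10;
    $mod100 = $n % 100;
    if ($mod10 >= 2 && $mod10 <= 4 && ($mod100 < 10 || $mod100 >= 20)) {
        return 1; // paucal: 2-4, 22-24, ... but not 12-14
    }
    return 2; // everything else: 0, 5-21, 25-31, ...
}

// Hypothetical helper: picks the matching variant from a list of forms.
function pluralize(int $n, array $forms): string
{
    return sprintf('%d %s', $n, $forms[pluralIndex($n)]);
}

// Illustrative sample data: three Polish forms of "article".
$forms = ['artykuł', 'artykuły', 'artykułów'];

foreach ([1, 2, 5, 12, 22] as $count) {
    echo pluralize($count, $forms), PHP_EOL;
}
```

In a real project the selection is done by gettext itself from the Plural-Forms header of the catalog; the sketch only shows the arithmetic behind it.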
One weak spot of gettext is gender support: there is none. This issue can lead to non-localizable strings, especially in Slavic languages, but usually it is not a show-stopper. For example, the string “Dear %s” is impossible to translate into Czech without knowing the gender of the addressee, but it can be replaced by “Dear Customer” or similar.
YAML and JSON behave differently and require additional custom code to address such problems. For these resource types, the common practice is to separate language streams and store them in individual files at a predefined location in a strict hierarchy, so one can loop through the translations after decoding them with json_decode() or yaml_parse(). To speed up page rendering, it’s worth decoding the data once and keeping the result in the PHP session, for example under $_SESSION["language"].
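The per-language file approach could look roughly like the sketch below. Everything named here is an assumption made for illustration: the loadTranslations() and translate() helpers, the one-file-per-language layout (<dir>/<lang>.json), the key cart.title and the throwaway demo directory. The cache is passed in as a plain array so the example also runs from the command line.

```php
<?php
// Hypothetical loader: decodes <dir>/<lang>.json once and caches the result.
function loadTranslations(string $dir, string $lang, array &$cache): array
{
    // Accept only codes like "de" or "pt_BR" so the code cannot escape $dir.
    if (!preg_match('/^[a-z]{2}(_[A-Z]{2})?$/', $lang)) {
        return [];
    }
    if (isset($cache[$lang])) {
        return $cache[$lang]; // already decoded, skip disk access
    }
    $path = $dir . '/' . $lang . '.json';
    $raw = is_file($path) ? file_get_contents($path) : false;
    $data = $raw === false ? null : json_decode($raw, true);
    if (!is_array($data)) {
        $data = []; // missing or malformed file: fall back to an empty table
    }
    return $cache[$lang] = $data;
}

// Hypothetical lookup: returns the translation, or the key itself if missing.
function translate(array $table, string $key): string
{
    return $table[$key] ?? $key;
}

// Demo with a throwaway resource file (illustrative data only).
$dir = sys_get_temp_dir() . '/i18n_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/de.json', json_encode(['cart.title' => 'Warenkorb']));

$cache = [];
$de = loadTranslations($dir, 'de', $cache);
echo translate($de, 'cart.title'), PHP_EOL; // translated string
echo translate($de, 'cart.empty'), PHP_EOL; // missing key falls back to the key
```

In a web request, the cache argument could be a session entry that was initialized to an empty array after session_start(), so the files are decoded only once per session. yaml_parse() could be swapped in for json_decode() where the YAML extension is installed.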
The immediate advantage of this approach is that the interoperability formats are human-readable, do not need recompilation, are easy to automate and can be quickly pushed through the localization supply chain with minimal or no overhead at all after an initial setup of the localization import plugin.
Perl
In the Perl world, there are two main contenders: gettext and Maketext. As mentioned, gettext is a true localization format that respects syntax differences, which can be flexibly defined in the file header. Maketext, on the other hand, offloads a lot of linguistic work to the programmer and requires language-dependent modules. Unlike gettext, Maketext relies on the quant notation (e.g. “Your search yielded [quant,_1,entry].”) to produce the appropriate version, which is substantially inferior to the gettext functions for two main reasons. Most importantly, it expects word order not to change in the localized versions, which, needless to say, is far from a workable assumption. Secondly, the way the variant is selected cannot be customized, and the final result cannot be guaranteed. If you are considering Perl for your web application, go for gettext.
The final part of this article on web architectures is coming soon with more complex solutions.