In our previous post, we floated an idea for managing the multilingual adaptation of a certain type of E-Learning content. The example we chose was a poster child for meticulousness; more straightforward materials may not need such an approach. But let’s push forward with investigating a complex case, and look into the technical localization of MS Office content.
With a few rare exceptions, CAT software usually supports only the most common features of the more intricate formats, and with good reason. It is not every day that structural content, embedded locale-specific information or metadata has to be extracted, but it does happen when these are an integral part of the localization scope. Making such options available would not only send development time through the roof, but would also likely result in a confusing experience for the user. The downside is that when this is a requirement, language engineers are left with the choice of going in manually and bushwhacking through the files, or finding a somewhat smarter and less excruciating, if unconventional, method. In the case of the Office 2007+ formats which are of interest in these posts, Microsoft’s Open XML offers just that.
Excel, PowerPoint and Word files are basically zip containers with XML-based content. Because of the strict structure and well-defined document schemas, it’s fairly easy to manipulate not only the text, but also the formatting and settings without even firing up Office. To automate this, one only needs a zip/unzip tool and a regex tool, the latter of which, in our case, usually turns out to be PowerGrep. It is needed to distil the useful content, prepare and mark up the individual XML data and generate a single output file, which can then be imported into the CAT tool of one’s choosing with a format-specific custom DTD or filter configuration. Needless to say, the regular expressions have to be tuned well so that all elements are identified correctly, and the import filter or DTD has to be explicit and complete to achieve the desired results and avoid corruption.
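To make the zip-plus-regex idea concrete, here is a minimal Python sketch. It is not our actual PowerGrep setup. It builds a tiny stand-in package in memory (not a complete workbook), lists its parts and pulls the text of the t elements out of xl/sharedStrings.xml, the part where Excel keeps shared cell strings. The function names and the pattern are illustrative assumptions, and the pattern deliberately ignores edge cases such as self-closing elements.

```python
import io
import re
import zipfile

# Matches <t>...</t> and <t xml:space="preserve">...</t>, but not <tbl> etc.
# Simplification: self-closing <t .../> elements are not handled.
T_ELEMENT = re.compile(r"<t(?:\s[^>]*)?>(.*?)</t>", re.DOTALL)


def list_parts(package: bytes) -> list[str]:
    """Return the names of all parts (zip entries) in an Office package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def shared_strings(package: bytes) -> list[str]:
    """Return the raw (still XML-escaped) shared strings of a workbook."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        xml = zf.read("xl/sharedStrings.xml").decode("utf-8")
    return T_ELEMENT.findall(xml)


def make_demo_package() -> bytes:
    """Build a stand-in package holding only a shared-strings part."""
    xml = (
        '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        "<si><t>Revenue</t></si>"
        '<si><t xml:space="preserve">Total </t></si>'
        "</sst>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/sharedStrings.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    demo = make_demo_package()
    print(list_parts(demo))
    print(shared_strings(demo))
```

For a real file, the same two functions can be fed the bytes of any .xlsx read from disk. The extracted strings keep their XML escaping, which is convenient when they are written straight into another XML file.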
Microsoft products tend to have “hard-coded”, embedded data pulled from a template, which cannot be changed at a later stage from within the software, or even with VBA or via the API, but which still needs to be localized. Excel and PowerPoint files store less locale-dependent information than Word files, and thus they are good candidates for the XML-based approach. Word stands apart, however, as the sheer amount and variety of formatting tags prevent generating an easily readable and clear output. With single-step processing of structural information and text out of the question, one has to resort mostly to VBA or the API, as XML-level processing runs out of steam here.
Having said all this, let’s see how Excel and PowerPoint stand up to the challenge.
First of all, our command-line tool cycles through all files of the predefined formats in the chosen folder, extracts the localizable content, marks up the output using regular expressions, modifies the structural elements according to the target locale, sets the language, merges the processed content into a single bulk file and finally saves the unused structural data for post-processing.
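A heavily simplified Python sketch of such a pre-processing pass might look as follows. It is an illustration under stated assumptions, not our production tool: the rule table, the unit and bulk element names, and the functions set_language and build_bulk are all hypothetical, and only two kinds of parts are handled. Saving the unused structural data is left out, and the language switch is shown as a separate helper.

```python
import re
import tempfile
import zipfile
from pathlib import Path
from xml.sax.saxutils import quoteattr

EXTENSIONS = {".xlsx", ".pptx"}

# Hypothetical rules: (part name pattern, context label, text pattern).
RULES = [
    (re.compile(r"^xl/sharedStrings\.xml$"), "cell-text",
     re.compile(r"<t(?:\s[^>]*)?>(.*?)</t>", re.DOTALL)),
    (re.compile(r"^ppt/slides/slide\d+\.xml$"), "slide-text",
     re.compile(r"<a:t>(.*?)</a:t>", re.DOTALL)),
]


def set_language(xml: str, target: str) -> str:
    """Point every bare lang="..." attribute at the target locale."""
    return re.sub(r'(\s)lang="[^"]*"',
                  lambda m: f'{m.group(1)}lang="{target}"', xml)


def build_bulk(folder: Path) -> str:
    """Collect the localizable text of all packages into one marked-up file."""
    units = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() not in EXTENSIONS:
            continue  # not one of the predefined formats
        with zipfile.ZipFile(path) as package:
            for part in package.namelist():
                for part_re, context, text_re in RULES:
                    if not part_re.match(part):
                        continue
                    xml = package.read(part).decode("utf-8")
                    for index, match in enumerate(text_re.finditer(xml)):
                        # The text is already XML-escaped, so it is copied as-is.
                        units.append(
                            f"<unit file={quoteattr(path.name)} "
                            f"part={quoteattr(part)} "
                            f'context="{context}" index="{index}">'
                            f"{match.group(1)}</unit>"
                        )
    return "<bulk>\n" + "\n".join(units) + "\n</bulk>"


def demo() -> str:
    """Run the pass over a temporary folder with two stand-in packages."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        with zipfile.ZipFile(folder / "budget.xlsx", "w") as zf:
            zf.writestr("xl/sharedStrings.xml",
                        "<sst><si><t>Revenue</t></si></sst>")
        with zipfile.ZipFile(folder / "deck.pptx", "w") as zf:
            zf.writestr("ppt/slides/slide1.xml",
                        '<p:sld><a:r><a:rPr lang="en-US"/>'
                        "<a:t>Welcome</a:t></a:r></p:sld>")
        (folder / "notes.txt").write_text("ignored")
        return build_bulk(folder)


if __name__ == "__main__":
    print(demo())
```

Recording the file, part, context and index on every unit is what makes the round trip possible later: each translated segment can be traced back to the exact spot it came from.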
The resulting XML file contains all custom data, named ranges, fields, object names and descriptions, lists, conditional formatting, validation options, theme names, links, document properties, comments, slide categorization, embedded content, and so on. Moreover, these text fragments are also marked up to reflect their context, which ensures proper translation and TM matches. Once the files have been imported into the CAT environment, the material is in the capable hands of the translators. To help their work further, units, number formats and UI references are checked against auto-translation rules as well as the term base to ensure consistency and conformity to MS naming conventions. After the translation and QA rounds, the original target format is restored using the exported XML and the saved, unmodified structural data, producing files which look exactly as if they had been created from scratch in the native target environment. And the best thing is that the pre- and post-processing stages require manual intervention only once: just think of a target locale and hit the red launch button.
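The restore step can be sketched in Python along these lines. Again, this is a minimal illustration under assumptions rather than our actual post-processing: the restore function, the shape of the translations mapping (part name to a list of target strings in document order) and the text pattern are hypothetical. Parts without translations are copied over byte for byte, which stands in for the saved, unmodified structural data.

```python
import io
import re
import zipfile
from xml.sax.saxutils import escape

# Opening tag, text and closing tag of <t> (Excel) or <a:t> (PowerPoint).
TEXT_RE = re.compile(r"(<(?:a:)?t(?:\s[^>]*)?>)(.*?)(</(?:a:)?t>)", re.DOTALL)


def restore(original: bytes, translations: dict[str, list[str]]) -> bytes:
    """Rebuild a package, swapping in translated text part by part."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(original)) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename in translations:
                pending = iter(translations[info.filename])

                def swap(match: re.Match) -> str:
                    target = next(pending, None)
                    # Keep the source text if no translation is left.
                    body = match.group(2) if target is None else escape(target)
                    return match.group(1) + body + match.group(3)

                data = TEXT_RE.sub(swap, data.decode("utf-8")).encode("utf-8")
            dst.writestr(info.filename, data)  # untouched parts pass through
    return output.getvalue()


def read_part(package: bytes, name: str) -> str:
    """Return one part of a package as text."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(name).decode("utf-8")


def make_original() -> bytes:
    """Build a stand-in source package with one text part and one style part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/sharedStrings.xml",
                    "<sst><si><t>Revenue</t></si><si><t>Total</t></si></sst>")
        zf.writestr("xl/styles.xml", "<styleSheet/>")
    return buffer.getvalue()


if __name__ == "__main__":
    translated = restore(make_original(),
                         {"xl/sharedStrings.xml": ["Umsatz & Kosten"]})
    print(read_part(translated, "xl/sharedStrings.xml"))
```

Replacing by position only works if the extraction and the restore step walk the text elements in the same order, which is one reason the regular expressions on both sides have to match exactly.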
Word is a maverick, but we will try to get a handle on it next week.