We have touched upon interoperability in our terminology lifecycle management posts, but today we have arrived at a major milestone with our back-end tool, so it’s time for a second helping. It may seem redundant to develop our own software for converting and managing data when solutions available in every LSP’s toolbox, such as qTerm, MultiTerm or Swordfish, already handle conversion to a certain extent. To be fair, their interoperability capabilities satisfy straightforward, elementary processes; as complexity rises, however, their limitations become apparent.
Interoperability: While the tbx specification is standardized, every developer implements it differently and bends it to their needs. For complex databases, one may not be able to escape writing custom DTDs or XSLT transformations. qTerm supports the MultiTerm tbx model, but when it doesn’t recognize an SDL-specific namespace element, the data is simply ignored.
Delimited lists: In spite of the growing acceptance of dedicated tools that treat terminology as it should be treated, i.e. in databases, many clients and even LSPs rely on unstructured, flat lists to distribute and manage data. Moreover, authoring solutions and less prominent CAT tools follow the same path, with only sporadic tbx support (Acrolinx IQ, for example).
Curbed features: The variety of usage scenarios prevents developers from implementing solutions to niche requirements, or such solutions prove too complex for generic use, as with automated stemming. Another reason may be brand protection: some features are deliberately implemented in an abstruse fashion to make migration to another tool less, well, entertaining.
Complexity: Databases can grow unwieldy, and data relations don’t appear so clear-cut to novice users. Even brilliant translators sometimes shy away from working with a full-fledged multi-dimensional database, not to mention clients. It was also important for us to have a malleable tool that allows language engineers, language leads and terminologists alike to manage and conceptualize large databases.
Because none of the stock solutions proved flexible enough for everyday production, and all lacked features, we set out to create one that:
- Handles conversion between flat and multi-dimensional resources without data loss and with minimal user assistance;
- Transcends format limitations;
- Uses heuristic methods to find faults, structure data and assign stemming and casing rules for automated QA;
- Enables pre-set structures that can be applied to any database;
- Supports version control and deduplication;
- As part of lifecycle management, provides an unobtrusive interface, especially during the initial conceptualization and structuring phase.
The tool relies on Excel interoperability libraries and MSXML 4.0 assemblies to store and manipulate data. While arguably less elegant than a standalone solution, this approach has the following advantages:
- Portability: Structural data, parsing information and user settings are all stored in a single document as embedded, structured CustomXML that can be reapplied to other resources instantly;
- Simplicity and flexibility: All Excel features are available, such as sorting, pivoting, subset selection, comparison, etc. Not being bound by the interface and capabilities of a front-end tool, it offers diverse ways to manipulate data;
- Scalability: It sidesteps the interface limitations and roundabout workflows of qTerm and MultiTerm when it comes to large volumes. The next version of qTerm will support multiple selections, deduplication, a forum, etc., which will address some of the headaches, while lifecycle management using MultiTerm remains extremely convoluted and time-consuming.
We are currently maintaining multiple databases with the tool, the largest containing more than 3000 concepts in 39 languages with meta-data, which translates into over 200,000 data points. The most recent version added deduplication, version control, content validation, support for various tbx implementations and many minor features and fixes. The progress over earlier versions is readily apparent.
Parsing
The data has to be parsed first, a step that takes only 5-15 minutes, during which categories are defined and relations are stored as xml data. With the initial parsing completed, the user can manipulate the content as a flat list without worrying about restructuring. Referenced images, whether from tbx or in an Excel file, are automatically resolved and embedded, facilitating portability.
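To give a flavor of this step, here is a minimal sketch in Python (the tool itself is built on the Excel interop libraries, so this is purely illustrative), assuming a hypothetical layout where column A holds a concept ID and the header row names each category:

```python
import xml.etree.ElementTree as ET
from openpyxl import load_workbook

def parse_flat_list(path, sheet="Terms"):
    """Turn a flat Excel glossary into a structured XML tree."""
    ws = load_workbook(path, read_only=True)[sheet]
    rows = ws.iter_rows(values_only=True)
    headers = next(rows)                 # header row names each category
    root = ET.Element("termbase")
    concepts = {}                        # concept ID -> XML element
    for row in rows:
        cid = str(row[0])                # column A: concept ID (assumption)
        if cid not in concepts:
            concepts[cid] = ET.SubElement(root, "concept", id=cid)
        concept = concepts[cid]
        for header, value in zip(headers[1:], row[1:]):
            if value is not None:
                field = ET.SubElement(concept, "field", category=str(header))
                field.text = str(value)
    return ET.ElementTree(root)

parse_flat_list("glossary.xlsx").write("glossary-structure.xml", encoding="utf-8")
```

The file name, sheet name and element names here are placeholders; the real tool keeps this structural information as embedded CustomXML inside the workbook itself.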
Interoperability
Once a database is parsed, export and import are only one click away. By default, the tool produces qTerm-compatible tbx and xcs files, and provides customizable merging of existing structural data with any xcs definition, either upon import or separately. Two different termbases can thus be merged without any hassle.
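A stripped-down sketch of the export direction, continuing from the structure built above. The element names follow the generic tbx layout (martif/termEntry/langSet/tig); a real qTerm-compatible file also needs the matching xcs definition and namespace declarations, which are omitted here:

```python
import xml.etree.ElementTree as ET

def export_tbx(structure: ET.ElementTree, out_path: str):
    martif = ET.Element("martif", type="TBX")
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for concept in structure.getroot().iter("concept"):
        entry = ET.SubElement(body, "termEntry", id=concept.get("id"))
        for field in concept.iter("field"):
            category = field.get("category")
            if category.endswith("term"):          # e.g. "en-US term" (assumed layout)
                lang = category.rsplit(" ", 1)[0]
                lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
                tig = ET.SubElement(lang_set, "tig")
                ET.SubElement(tig, "term").text = field.text
            else:                                  # everything else: descriptive meta-data
                descrip = ET.SubElement(entry, "descrip", type=category)
                descrip.text = field.text
    ET.ElementTree(martif).write(out_path, encoding="utf-8")
```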
Tweaks
The tool uses several tweaks to enhance interoperability:
- The tbx specification is not without its quirks: for example, space is used as the delimiter of list elements. If an element contains a space, it will no longer be treated as a single item, which breaks DTD schema validation upon import and consequently drops all such elements from entries. As a workaround, unique list elements are automatically scanned and spaces are replaced with non-breaking ones (see the sketch after this list).
- qTerm allows xcs data to be embedded in the tbx main file. Before version 6.2, the complete xcs file was simply appended to a single node, producing non-compliant xml output. Because of these issues, qTerm produces tbx files that may fail to import into other applications.
- MultiTerm and qTerm use different namespaces, and without modifying their default tbx output, only limited interoperability is possible. Using the tool, the range of mutually recognized meta-data definitions can be extended.
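The space-to-non-breaking-space workaround mentioned above boils down to a simple substitution over the affected picklist values. In this sketch the values are assumed to live in descrip elements, as in the export sketch earlier; note that the same substitution must also be applied to the matching items in the xcs, or validation will still fail:

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"
PICKLIST_CATEGORIES = {"Status", "Domain"}   # hypothetical picklist fields

def protect_picklist_spaces(tbx_path: str):
    tree = ET.parse(tbx_path)
    for descrip in tree.getroot().iter("descrip"):
        if descrip.get("type") in PICKLIST_CATEGORIES and descrip.text:
            # "Not approved" would otherwise be parsed as two list items
            descrip.text = descrip.text.replace(" ", NBSP)
    tree.write(tbx_path, encoding="utf-8")
```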
Automatic stemming and casing
If stemming and casing rules are not adjusted properly, the QA engine may produce copious amounts of false warnings, which ultimately reduces productivity and can even lead to introducing mistakes. In large termbases, marking up terms for stemming and casing by hand is very time-consuming and rarely justifies the effort. Therefore, based on the term’s form and purpose, the tool heuristically determines which stemming rule to use (if this option is selected, of course), or allows setting the rule en masse for any number of entries. For example, UI elements should never be inflected or changed and must be strictly matched.
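A minimal sketch of what such a heuristic might look like. The rule names are hypothetical, but the cues (all-caps runs, digits, placeholders, hotkey ampersands, trailing ellipses) are the kind of signals that mark a term as a UI string or code-like token requiring strict matching:

```python
import re

def guess_matching_rule(term: str) -> str:
    # UI labels and codes: all-caps runs, digits, placeholders, hotkeys, ellipses
    if re.search(r"[A-Z]{2,}|\d|\{\d+\}|&\w|\.\.\.$", term):
        return "exact"                       # never inflected, strictly matched
    if term[:1].isupper() and not term.isupper():
        return "stem-case-insensitive"       # ordinary capitalized headword
    return "stem"                            # plain term, inflection allowed

for t in ("Save As...", "USB 3.0", "terminology", "Firewall"):
    print(t, "->", guess_matching_rule(t))
```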
Validation
The tbx format requires its complementary xcs definition, which determines the termbase structure. Certain elements, such as list items, must be explicitly listed in the xcs; therefore, if the user adds a new, unique element, the definition has to reflect the change. In such cases, reparsing is handled transparently by the tool, without the user noticing.
Finalized structures, on the other hand, do not permit such changes, and if a predefined structure is used, non-conforming or inconsistent user input raises a flag.
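The core of that check is comparing the picklist values that actually occur in the data against the items declared in the xcs. A sketch, loosely mirroring the xcs layout (descripSpec elements with space-separated items in a contents node); anything the function returns triggers either a transparent reparse or, for finalized structures, a flag:

```python
import xml.etree.ElementTree as ET

def find_undeclared_items(tbx_path: str, xcs_path: str) -> dict:
    declared = {}
    for spec in ET.parse(xcs_path).getroot().iter("descripSpec"):
        contents = spec.find("contents")
        if contents is not None and contents.text:
            # xcs picklists are space-separated lists of permitted items
            declared[spec.get("name")] = set(contents.text.split())
    undeclared = {}
    for descrip in ET.parse(tbx_path).getroot().iter("descrip"):
        category = descrip.get("type")
        if category in declared and descrip.text:
            if descrip.text not in declared[category]:
                undeclared.setdefault(category, set()).add(descrip.text)
    return undeclared   # empty dict means the xcs is up to date
```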
Version control
If version control is enabled, any modification to an existing term updates the term’s history. Previous versions are stored like any other meta-data and are imported into the dedicated tool for lookup or QA. This applies to user input as well as to importing large sets of data.
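Since history entries are just meta-data, the update itself can be as simple as appending a note before the term text is overwritten. A sketch, reusing the element names from the earlier examples (the history format shown is an assumption):

```python
import xml.etree.ElementTree as ET
from datetime import date

def update_term(tig: ET.Element, new_text: str, user: str):
    term = tig.find("term")
    if term.text == new_text:
        return                                # no change, no history entry
    # store the previous value as an ordinary meta-data field
    history = ET.SubElement(tig, "termNote", type="history")
    history.text = f"{date.today().isoformat()}|{user}|{term.text}"
    term.text = new_text
```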
Deduplication
The user can select any category to be deduplicated, which merges not only the data of that category but the complete concept entry. Duplicates are automatically identified, and the user can choose which ones to process. For example, “Speaker” as an English term may appear three times, but refer to only two concepts: a presenter and an audio device. In this case, only two entries can be considered duplicates. Conflicts can be resolved in several ways (a sketch of the automatic modes follows after this list):
- Automatic mode 1 (unattended): The entry marked as base will have priority every time, and retain its values. In this mode, multi-selection lists will be merged and single entries retained.
- Automatic mode 2 (unattended): The entry marked as base will have priority, except when empty. In the latter case, it inherits the value of the first non-empty element.
- Manual mode: Each conflict raises a flag, and the user determines which value to keep. Depending on the category type, several actions can be chosen, except for fixed definitions, such as built-in fields (e.g. case sensitivity in qTerm) and boolean, integer and similar types. As the default option, retaining the base element and deleting the unnecessary one is available in all cases:
- Terms: If all conflicting terms are correct, they can be split, retaining all related meta-data
- Multiple picklists: Automatically processed, collecting unique entries and updating the definition
- Single picklists: Converted into multiple picklists
- Free text elements: Converted into multiple picklists with a warning that the number and length of the unique items may cause problems.
Once a duplicate has been processed, it is stored in the action history, so intentional skips (e.g. two homonyms that are separate concepts) will not be checked again until the user decides otherwise.
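The two unattended modes can be sketched as follows, operating on simple category-to-value dictionaries instead of full concept entries; multi-selection picklists are represented as sets, everything else as scalars, and all names are hypothetical:

```python
def merge_entries(base: dict, dup: dict, mode: int) -> dict:
    merged = dict(base)
    for category, value in dup.items():
        current = merged.get(category)
        if isinstance(value, set):
            # multi-selection lists are merged in both automatic modes
            existing = current if isinstance(current, set) else set()
            merged[category] = existing | value
        elif mode == 1:
            # mode 1: the base entry always wins; its values are retained
            merged.setdefault(category, value)
        elif mode == 2 and not current:
            # mode 2: base wins unless empty, then inherit the first non-empty value
            merged[category] = value
    return merged

base = {"Status": "approved", "Domain": {"audio"}, "Definition": ""}
dup  = {"Domain": {"hardware"}, "Definition": "A device that emits sound."}
print(merge_entries(base, dup, mode=2))
# {'Status': 'approved', 'Domain': {'audio', 'hardware'},
#  'Definition': 'A device that emits sound.'}
```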
Moreover, duplicate term entries can be exported as xml and sent offline to terminologists if necessary.
Terminology is only as good as it is fresh. As products and target audiences change and style becomes smoother over time, terms also mature. To leave space for creative and useful work, it’s worthwhile to automate wherever possible. Continuously developing our tool and expanding its features lends agility to our terminology management processes, and not only enhances productivity considerably, but simply makes the experience more fluid and effortless for everyone involved.