We hope you are already famished for another LABS post after last week’s hiatus. With all the industry talk on best practices, database creation and terminology concepts, there is no immediate urge for us to add to the sea of discussions. Not that we have a silver bullet for all the issues, but there are two topics where the eloquence seems to be subsiding: interoperability and lifecycle management are not topping the charts, yet they are key to integrated, flexible solutions, so this week we try to get a word in edgeways about our experiences.
Comedy of interoperability
All CAT tools on the market worth mentioning support standard interchange formats: XLIFF (an OASIS standard), TMX and TBX (LISA standards) are well-defined, XML-based formats for bilingual files, translation memories and termbases, respectively. In practice, strict conformity to the specifications is usually not enforced at all, or only partially implemented; what is more, certain tools prefer proprietary formats, triumphantly preventing effortless and straightforward information exchange. The business rationale behind the latter approach is understandable, albeit not commendable, and such artificial barriers can usually be happily circumvented. As for the open standards, divergence can be seen even in commonly supported features, such as term definitions and flags. Because of the diversity and inconsistency of implementations, built-in support for the formats of other tools has been little more than a checkbox feature in most cases so far.
espell’s terminology management solution is based on Kilgray’s qTerm, which integrates seamlessly with memoQ, providing a bridge between client validators, terminologists, language engineers, translators and project managers. The platform supports multi-user, fine-grained permission management, custom data structures and collaboration options, among other features. As opposed to using a wide range of tools internally, relying on a single robust, modular translation ecosystem has overwhelming benefits for every party involved, and in many cases necessitates conversion between other formats and CAT software. It goes without saying that interoperability is a key issue for us to achieve the best quality and stay productive at the same time. Thus, we had to find reliable, secure and dynamic solutions to convert data, and make sure that all the back-and-forth is handled transparently, with zero margin for error.
Today our focus is on terminology and TBX, though we may revisit XLIFF and TMX anomalies and interoperability later. The TBX specification has become the benchmark to go by; it separates core term data from the structure definition, which is stored either in the TBX header or separately, as an XCS file. This makes the format flexible and reflects a genuinely hierarchical database structure, as opposed to flat glossaries and term lists.
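To picture that separation, here is a minimal sketch that assembles a TBX skeleton with Python’s standard ElementTree: the header points to the external structure definition (the XCS file), while the body holds the hierarchical concept entries. Element names follow the TBX-Basic dialect; the XCS file name and the sample entry are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

tbx = ET.Element("martif", type="TBX")

# Structure definition lives in the header: a pointer to an external
# XCS file (file name is a hypothetical example)
header = ET.SubElement(tbx, "martifHeader")
encoding = ET.SubElement(header, "encodingDesc")
ET.SubElement(encoding, "p", type="XCSURI").text = "TBXBasicXCS.xcs"

# Core term data lives in the body, as hierarchical concept entries
body = ET.SubElement(ET.SubElement(tbx, "text"), "body")
entry = ET.SubElement(body, "termEntry", id="c1")
# Concept-level information, shared by all language sections
ET.SubElement(entry, "descrip", type="definition").text = "A worked example."
for lang, term in (("en", "example"), ("de", "Beispiel")):
    lang_set = ET.SubElement(entry, "langSet")
    lang_set.set("{http://www.w3.org/XML/1998/namespace}lang", lang)
    tig = ET.SubElement(lang_set, "tig")
    ET.SubElement(tig, "term").text = term

print(ET.tostring(tbx, encoding="unicode"))
```

Note how the concept-level definition sits directly under the `termEntry`, while each synonym hangs off its own `langSet` — exactly the hierarchy that flat glossaries lack.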
Typically, conversion problems originate from two sources: either the initial collection is not a structured set, or the database definitions differ between the tools. While doctoring a structured output is not a complex task if the differences are documented, foraying into the reconstruction of a pile of unrelated data can be a challenging IT endeavour. Termbases are usually not extreme in size (and should not be, for that matter), so algorithms that restructure the data don’t have to be fuzzy or partition-tolerant. On the contrary, strict parsing is essential so as not to lose any information and to successfully resolve and deduplicate entries. And if one is not yet scared away by the inconsistency of implementations, it is worth mentioning that the standard itself features some oddities, such as the peculiar use of the space character as a delimiter within entry definitions.
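The fail-fast flavour of that strict parsing can be sketched like this (a simplified assumption: entries carry a term, a language and a definition, and identity is the case-folded term plus language — not espell’s actual data model): duplicates are collapsed only when their data agree exactly, and any conflict raises instead of being fuzzily merged away.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    term: str
    lang: str
    definition: str

def deduplicate(entries):
    seen = {}
    for e in entries:
        key = (e.term.casefold(), e.lang)   # normalised identity
        if key not in seen:
            seen[key] = e
        elif seen[key].definition != e.definition:
            # Strict parsing: never drop or guess, surface the conflict
            raise ValueError(f"Conflicting data for {e.term!r} [{e.lang}]")
    return list(seen.values())

rows = [
    Entry("memory", "en", "A translation memory unit."),
    Entry("Memory", "en", "A translation memory unit."),  # exact duplicate
]
print(deduplicate(rows))  # one entry survives
```

Since termbases are small, this linear, exact pass is all that is needed; there is no excuse for lossy heuristics at this stage.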
Homage to Rube Goldberg?
As one of our philosophically inclined colleagues put it, any important problem in life can be solved with regular expressions, and those that can’t be are not worth solving at all. Unfortunately for him, termbase structuring defies this otherwise universally acknowledgeable rule. The perfect example of a non-hierarchical, two-dimensional set is a term list, usually collected in Excel or similar, but some CAT tools export delimited text or TMX to exchange data, which falls into the same ballpark. Structuring such a set manually is a hassle, so espell’s engineers developed a tool to convert the data, supporting not only text but media as well.
In the first phase, the tool gets cracking on the content and cycles through the columns, letting the user determine categories and definitions; it then structures the set automatically.
So far so good – the parser can determine the level and relations of the metadata as long as definitions of the same type are not multiplied at the same entry level. If multiple entries are present, for example synonyms, the tool cannot set up the proper relations between the respective metadata or related information with 100% confidence. It tries to establish them heuristically, and if it fails, asks for the user’s input.
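One way to picture such a heuristic (a sketch under the assumption that rows sharing a concept ID are synonyms — the column names and the "ask the user" fallback label are illustrative, not the tool’s actual logic): a column whose value is constant within every concept is treated as concept-level metadata, a column that varies per synonym is attached to the individual term, and anything in between is handed back to the user.

```python
from collections import defaultdict

def classify_columns(rows, concept_key):
    """Guess whether each column holds concept-level or term-level metadata.

    rows: list of dicts, one per term; concept_key: the column that
    groups synonyms into concepts. Returns {column: "concept" | "term" | "ask"}.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[concept_key]].append(row)

    verdict = {}
    for col in (c for c in rows[0] if c != concept_key):
        constant = all(len({r[col] for r in g}) == 1 for g in groups.values())
        varying = all(len({r[col] for r in g}) == len(g) for g in groups.values())
        if constant:
            verdict[col] = "concept"   # e.g. a definition shared by synonyms
        elif varying:
            verdict[col] = "term"      # e.g. a per-synonym value
        else:
            verdict[col] = "ask"       # ambiguous: fall back to user input
    return verdict

rows = [
    {"id": "c1", "term": "CAT tool", "definition": "Translation software"},
    {"id": "c1", "term": "translation tool", "definition": "Translation software"},
    {"id": "c2", "term": "termbase", "definition": "Structured term collection"},
]
print(classify_columns(rows, "id"))
# {'term': 'term', 'definition': 'concept'}
```

The interesting cases are precisely the "ask" columns: no amount of column-level statistics can decide, with full confidence, which synonym a usage note belongs to.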
If the user chooses not to define term-level rules for capitalization, forbidden flags or prefix matching, the tool generates them automatically based on various factors. For example, if a pipe character is present in the term, marking the lexical root, the prefix matching rule is set accordingly. Naturally, this feature can be switched off at any time.
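The pipe convention can be sketched roughly as follows (the rule shape and matching behaviour are illustrative assumptions, not the tool’s actual output format): everything before the pipe is the invariant lexical root, so the generated rule matches any candidate sharing that prefix, while terms without a pipe default to exact matching.

```python
def prefix_rule(raw_term):
    """Derive a matching rule from a pipe-marked term.

    "activat|e" -> match anything beginning with "activat"
    (activate, activated, activation, ...). Sketch only.
    """
    if "|" in raw_term:
        root, _inflected_part = raw_term.split("|", 1)
        return {"term": raw_term.replace("|", ""), "match": "prefix", "root": root}
    return {"term": raw_term, "match": "exact", "root": raw_term}

def matches(rule, candidate):
    """Check a candidate word against a generated rule."""
    if rule["match"] == "prefix":
        return candidate.startswith(rule["root"])
    return candidate == rule["term"]

rule = prefix_rule("activat|e")
print(matches(rule, "activation"))  # True
print(matches(rule, "deactivate"))  # False
```

This is why the pipe is so convenient for inflecting languages: one marked entry covers a whole paradigm of surface forms without resorting to fuzzy matching.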
We hope this post managed to sate your hunger for espell LABS content. Next week, we’ll look at a question no less intriguing: terminology lifecycle management. Stay tuned!