Ideally, clients should have large volumes of translation memories, dictionaries and glossaries. However, often clients will come to us with little or no data at all. Other machine translation vendors cannot customize a system without data provided by clients, but with Advance Data Manufacturing, we are able to overcome this challenge.

High-quality translation memories will always help to deliver greater quality more quickly, but when they are not available, Language Studio™ Linguists will work with clients to create a set of bilingual data using data analysis and generation tools. The data created focuses on the key linguistic building blocks needed to customize an engine for a specific language pair, domain, target audience, purpose and writing style. Custom engines created with manufactured data are immediately available for use. Manufacturing data gives clients a head start with a quality machine translation system that will continue to mature and improve over time.

Where possible, client validation of manufactured data is used to ensure quality and resolve any issues. Language Studio™ Linguists will discuss with clients their level of involvement at the start of the project and then tailor the Customization Training Plan around the agreed level of client involvement.

WARNING! Some machine translation vendors claim to be able to customize a high-quality engine with as little as 10,000 sentences.

Try it for yourself and see the effect that such a minimal amount of data has on translation quality. Looking at this from another perspective, if the same 10,000 sentences were used as translation memories in a TMS platform like memoQ or XTM, they would have very little impact on the translation project, as there simply is not sufficient data to match new translations against. SMT requires even more data to be effective.

The S in SMT stands for Statistical: SMT is all about statistical data and which data is most statistically relevant. If the machine translation vendor has 3 million sentences in their baseline data, your 10,000 sentences will have almost no statistical relevance at all, and the impact on translation quality will be almost unnoticeable.
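
To put the 10,000 versus 3 million comparison in rough numbers, here is a minimal sketch in Python. It assumes that the client data is simply added to the baseline data and that its raw share of the combined corpus is a fair proxy for its statistical weight. That is a simplification made for illustration only, not a description of how any particular SMT engine weights its training data, and the function name `client_share` is hypothetical.

```python
def client_share(client_sentences: int, baseline_sentences: int) -> float:
    """Return the fraction of the combined corpus that comes from client data.

    Assumption: client and baseline sentences are pooled and counted equally.
    """
    total = client_sentences + baseline_sentences
    if total <= 0:
        raise ValueError("the combined corpus must contain at least one sentence")
    return client_sentences / total


# Figures taken from the text: 10,000 client sentences, 3 million baseline sentences.
share = client_share(10_000, 3_000_000)
print(f"Client data share: {share:.2%}")  # prints "Client data share: 0.33%"
```

Under these assumptions, the client data makes up roughly one third of one percent of the combined corpus.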

Put simply, without creating or sourcing high-quality in-domain data in enough volume to be statistically relevant, a low-quality engine should be the expected outcome.