Using JavaScript to prepare content for better translation quality.


Many steps in the translation process can benefit from JavaScript rules, but one of the most useful is Pre-Process JavaScript 2 (JS2). JS2 is so useful because it runs immediately after the text to be translated has been loaded into memory (Retrieve Source - RTS) and before sentences are extracted for translation (Extract Translation Units - ETU), so any change made here shapes how the text is segmented.

You can find a Pre-Processing JavaScript template in this article.


In JS2, you can do the following:


1. Provide guidance to the engine on what should and should not be translated.


· Use a regular expression to locate text and simply insert the XML tags <notran>….</notran> around it.


· All content within the notran tags (<notran>) will not be processed by the engine and does not count as words when processing content.


sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(^From:)/gmi), '<notran>$1</notran>');
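
The same idea extends to other content that should pass through untouched. The following is a minimal sketch, assuming you want to protect URLs and email addresses; the patterns and the function name protectNoTran are illustrative assumptions, not part of Language Studio. Combining the patterns into one expression avoids wrapping the same text twice.

```javascript
// Hypothetical patterns for content that should not be translated.
const PROTECTED_PATTERNS = [
  /https?:\/\/[^\s<>"]+/,          // URLs
  /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/,  // email addresses
];

// One combined expression, so each stretch of text is wrapped at most once.
const PROTECTED_RE = new RegExp(
  PROTECTED_PATTERNS.map((p) => p.source).join('|'),
  'g'
);

// Wrap every match in <notran>…</notran>; $& is the matched text itself.
function protectNoTran(text) {
  return text.replace(PROTECTED_RE, '<notran>$&</notran>');
}

console.log(protectNoTran('Mail sales@example.com or see https://example.com/help today'));
// Mail <notran>sales@example.com</notran> or see <notran>https://example.com/help</notran> today
```

In a JS2 rule this would be applied as sAllSourceSegments = protectNoTran(sAllSourceSegments);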


 

2. Repair damaged content or content that would not translate well


· Often it is known that content is not in a format that would translate well.

o Examples include:

§ Content that has double-encoded entities such as “&amp;amp;” can be corrected to either “&” or “&amp;”


sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(&amp;amp;)/gmi), '&');


 

§ Text in Asian languages may contain additional spacing, which can be removed so that Language Studio word segmentation will be more accurate.


· Some content management systems strip formatting, which breaks the output.

o For example, “Intel<sup>TM</sup>” becomes “IntelTM”. A regular expression can find such text and repair it to “Intel™”.


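// Keep the letter before "TM" and replace "TM" with the trademark sign. This also matches words that end in TM, such as ATM, so narrow the pattern for your content.
sAllSourceSegments = sAllSourceSegments.replace(RegExp(/([A-Za-z])TM\b/gm), '$1\u2122');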
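
Two of the repairs described above can be sketched as small helpers: collapsing ampersands that have been encoded more than once, and removing spacing between Asian characters. The function names are hypothetical, the character ranges cover only Japanese kana and common CJK ideographs, and the entity helper normalizes to “&amp;” rather than “&”; adjust all of these to your content.

```javascript
// Function names are hypothetical; only the techniques come from the article.

// Collapse "&amp;amp;", "&amp;amp;amp;", ... down to a single "&amp;".
function collapseDoubleEncodedAmp(text) {
  return text.replace(/&(?:amp;){2,}/gi, '&amp;');
}

// Japanese kana and common CJK ideographs; extend the ranges as needed.
const CJK = '[\\u3040-\\u30ff\\u4e00-\\u9fff]';
const CJK_GAP = new RegExp(`(${CJK})[ \\t\\u3000]+(?=${CJK})`, 'g');

// Remove spaces, tabs and ideographic spaces that sit between two CJK
// characters. Spaces that are meaningful (for example inside a personal
// name) are removed too, so apply this only where spacing is known noise.
function removeCjkSpacing(text) {
  return text.replace(CJK_GAP, '$1');
}

console.log(collapseDoubleEncodedAmp('Fish &amp;amp; Chips')); // Fish &amp; Chips
console.log(removeCjkSpacing('東 京 都 Tokyo Tower'));          // 東京都 Tokyo Tower
```

The lookahead in CJK_GAP leaves the following character unconsumed, so runs such as “東 京 都” are repaired in a single pass.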


 


3. Force sentence breaks where the standard ETU process does not recognize the sentence boundaries that are specific to your content.


· This is very common in eCommerce where content often lacks sentence structure

o “161/700 Grade Up Series No. 88 Japanese Navy battleship Kongo fast Showa (1941) - Etched Parts (Japan import)”

o The above could be split into two sentences, which would help improve the translation:

o “161/700 Grade Up Series No. 88”

o “Japanese Navy battleship Kongo fast Showa (1941) - Etched Parts (Japan import)”


sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(Series No\. \d+)(\s+)/gmi), '$1<notran>$2</notran>');
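
Before deploying a rule like this, it can help to run it against sample titles outside Language Studio and inspect the result. The sketch below does that for the product title above; the function name markSeriesBoundary is hypothetical, and the literal dot in “No.” is escaped so that it matches only a full stop.

```javascript
// Hypothetical wrapper around the rule, so it can be tested on its own.
function markSeriesBoundary(text) {
  // Wrap the whitespace after "Series No. <digits>" in a notran span,
  // which marks the point where the title should be divided.
  return text.replace(/(Series No\. \d+)(\s+)/gi, '$1<notran>$2</notran>');
}

const title =
  '161/700 Grade Up Series No. 88 Japanese Navy battleship Kongo fast Showa (1941) - Etched Parts (Japan import)';

console.log(markSeriesBoundary(title));
// 161/700 Grade Up Series No. 88<notran> </notran>Japanese Navy battleship Kongo fast Showa (1941) - Etched Parts (Japan import)
```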


4. Change input text structure

You may have a format such as XML where you have some source language content and you want to retain the original text, but add a translated version. In the example below, the XML file contains a list of books (just one book for the example), which is in English.


We first take the English and clone it, marking it as French. At this point, even though it is marked as French it is still English. We then add NoTran markers (<notran>) to the text to guide the engine that it should only translate the content that is outside of the NoTran markers.


In this example, that would mean the French Title and Description only, with all other content remaining untouched.


Original XML

<?xml version="1.0"?>

<catalog>

   <book id="bk101">

      <author>Gambardella, Matthew</author>

      <genre>Computer</genre>

      <price>44.95</price>

      <publish_date>2000-10-01</publish_date>

      <content lang="EN">

         <title>XML Developer's Guide</title>

         <description>An in-depth look at creating applications with XML.</description>

      </content>

   </book>

</catalog>

Restructured XML to add new language

Clone the EN content and change the language to FR

<?xml version="1.0"?>

<catalog>

   <book id="bk101">

      <author>Gambardella, Matthew</author>

      <genre>Computer</genre>

      <price>44.95</price>

      <publish_date>2000-10-01</publish_date>

      <content lang="EN">

         <title>XML Developer's Guide</title>

         <description>An in-depth look at creating applications with XML.</description>

      </content>

      <content lang="FR">

         <title>XML Developer's Guide</title>

         <description>An in-depth look at creating applications with XML.</description>

      </content>

   </book>

</catalog>


Restructured XML with NoTran markers

Add NoTran tags around content that you do not wish to translate. Only the text left outside the NoTran tags, here the French title and description, will be sent for translation.


<notran><?xml version="1.0"?>

<catalog>

   <book id="bk101">

      <author>Gambardella, Matthew</author>

      <genre>Computer</genre>

      <price>44.95</price>

      <publish_date>2000-10-01</publish_date>

      <content lang="EN">

         <title>XML Developer's Guide</title>

         <description>An in-depth look at creating applications with XML.</description>

      </content>

      <content lang="FR">

         <title></notran>XML Developer's Guide<notran></title>

         <description></notran>An in-depth look at creating applications with XML. <notran></description>

      </content>

   </book>

</catalog></notran>
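
The two restructuring steps above (clone the EN block as FR, then add NoTran markers) could be sketched as follows. This is a minimal illustration, not the article's own script: the helper names are hypothetical, and the regular expressions assume the simple, regular layout of the sample catalog rather than arbitrary XML.

```javascript
// Helper names are hypothetical. These regex-based edits assume the simple
// layout of the sample catalog; they are not a general XML parser.

// Append a lang="FR" copy after every <content lang="EN"> block.
function cloneContentAsFrench(xml) {
  return xml.replace(
    /([ \t]*)<content lang="EN">([\s\S]*?)<\/content>/g,
    (block, indent, body) =>
      `${block}\n${indent}<content lang="FR">${body}</content>`
  );
}

// Wrap the whole document in <notran>, leaving only the text of <title> and
// <description> inside FR blocks outside the markers.
function markTranslatable(xml) {
  const marked = xml.replace(
    /<content lang="FR">[\s\S]*?<\/content>/g,
    (block) =>
      block.replace(
        /<(title|description)>([\s\S]*?)<\/\1>/g,
        '<$1></notran>$2<notran></$1>'
      )
  );
  return `<notran>${marked}</notran>`;
}

const sampleCatalog = [
  '<catalog>',
  '   <book id="bk101">',
  '      <content lang="EN">',
  "         <title>XML Developer's Guide</title>",
  '      </content>',
  '   </book>',
  '</catalog>',
].join('\n');

console.log(markTranslatable(cloneContentAsFrench(sampleCatalog)));
```

In a JS2 rule the two helpers would be chained on sAllSourceSegments in the same way. The EN block stays fully inside the NoTran markers, so only the cloned FR title and description reach the engine.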


 

5. Merging Split Lines


· You may know that content is split across lines, which breaks sentences into multiple pieces. You can merge such content back into one sentence.


· The example below is simplistic: it looks for a lowercase letter, number, or comma at the end of a line followed by a lowercase letter at the start of the next line. While this may resolve many such issues, it is not perfect, and logic specific to your application and document format should be applied.


sAllSourceSegments = sAllSourceSegments.replace(RegExp(/([a-z0-9,])(\r\n|\r|\n)([a-z])/gm), '$1 $3');


Original text with line breaks

Bora Bora isn't known as 'the romantic island' for nothing. This idyllic tropical getaway[CR]
is one of the most popular honeymoon destinations in the world, and nothing is out of the[CR]
realm of possibility here.[CR]
Consider breakfast on your over-water bungalow's balcony, a picnic on a private island,[CR]
or even getting married in a chapel that sits over the lagoon[CR]
Whether you're planning your honeymoon or just a romantic escape, you won't have to look[CR]
very hard to find romance in Bora Bora.



Repaired Text

Bora Bora isn't known as 'the romantic island' for nothing. This idyllic tropical getaway is one of the most popular honeymoon destinations in the world, and nothing is out of the realm of possibility here.[CR]
Consider breakfast on your over-water bungalow's balcony, a picnic on a private island, or even getting married in a chapel that sits over the lagoon[CR]
Whether you're planning your honeymoon or just a romantic escape, you won't have to look very hard to find romance in Bora Bora.
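
The merge rule can be wrapped in a small function and checked against sample text before use. The sketch below is one possible variant, with a hypothetical function name: it accepts CR, LF and CRLF line endings, and it uses a lookahead for the next line's first letter so that the letter is not consumed and very short lines are also merged.

```javascript
// Hypothetical helper: join a line to the next one when it ends in a
// lowercase letter, digit or comma and the next line starts lowercase.
function mergeSplitLines(text) {
  return text.replace(/([a-z0-9,])(?:\r\n|\r|\n)(?=[a-z])/g, '$1 ');
}

const wrapped =
  'This idyllic tropical getaway\r' +
  'is one of the most popular honeymoon destinations in the world, and nothing is out of the\r' +
  'realm of possibility here.\r' +
  'Consider breakfast on your balcony.';

console.log(JSON.stringify(mergeSplitLines(wrapped)));
```

The break after “here.” is kept because the line ends in a full stop and the next line starts with a capital letter.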


 

6. Managing Email and Document Headers


· You can put rule-based logic in place to lock down header content that should be excluded from translation.


· The rules will depend on your specific content. For example, you could search for a line header such as “From:” and then mark any non-Japanese content on that line as not to be translated. Ensure that you apply this kind of rule only to header lines like “From:” and not to the entire document.


Original text with header

From: "風魔 小太郎" [fuma.kotaro@fumaclan.com]

Sent: Wednesday, May 04, 2011 9:36 PM

To: saito.toshi@examplecompany.com

Subject: ホテルの場所

Attachments: afile(version1)PDPLCD.xls; anotherfile.ppt

 

私は、ホテルの住所を教えてください

 

Text after rules have been applied


<notran>From:</notran> <notran>"</notran>風魔 小太郎<notran>"</notran> <notran> [fuma.kotaro@fumaclan.com] </notran>

<notran>Sent:</notran> <notran>Wednesday, May 04, 2011 9:36 PM</notran>

<notran>To:</notran> <notran>saito.toshi@examplecompany.com</notran>

<notran>Subject:</notran> ホテルの場所

<notran>Attachments: afile(version1)PDPLCD.xls; anotherfile.ppt</notran>

 

私は、ホテルの住所を教えてください
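
A header rule of this kind could be sketched as follows. The list of header labels, the Japanese character ranges and the function names are all assumptions for illustration. The output also differs slightly from the hand-marked example above: adjacent non-Japanese text is wrapped in a single NoTran span, and the file names on the Attachments line are wrapped separately from the label.

```javascript
// Hypothetical header labels; adjust to the headers in your documents.
const HEADER_LINE = /^(From|Sent|To|Subject|Attachments):[ \t]*(.*)$/gim;

// Runs of Japanese text (punctuation, kana, common ideographs), allowing
// single spaces between words so that a name stays in one run.
const JAPANESE_RUN =
  /[\u3000-\u30ff\u4e00-\u9fff]+(?: +[\u3000-\u30ff\u4e00-\u9fff]+)*/g;

// Wrap everything in a header value that is not Japanese text.
function protectHeaderValue(value) {
  let out = '';
  let last = 0;
  for (const m of value.matchAll(JAPANESE_RUN)) {
    if (m.index > last) out += `<notran>${value.slice(last, m.index)}</notran>`;
    out += m[0];
    last = m.index + m[0].length;
  }
  if (last < value.length) out += `<notran>${value.slice(last)}</notran>`;
  return out;
}

// Apply the rule to header lines only; body lines are left untouched.
function protectHeaders(text) {
  return text.replace(
    HEADER_LINE,
    (line, label, value) => `<notran>${label}:</notran> ${protectHeaderValue(value)}`
  );
}

const email = [
  'From: "風魔 小太郎" [fuma.kotaro@fumaclan.com]',
  'Sent: Wednesday, May 04, 2011 9:36 PM',
  'Subject: ホテルの場所',
  '',
  '私は、ホテルの住所を教えてください',
].join('\n');

console.log(protectHeaders(email));
```

Because HEADER_LINE is anchored to the start of a line, a body line that happens to begin with one of the labels would also be treated as a header; restricting the rule to the lines before the first blank line is one way to tighten it.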