Some source content contains an embedded new line marker, such as \n, within the text. It often appears in the middle of a segment and is used for layout and display purposes.


Example:


This text has a new line\nin the middle of a segment.


When the above text is displayed on screen after publishing, it looks similar to the following:


This text has a new line
in the middle of a segment.


If the \n is passed into the translation engine, it confuses the engine because it is not a word, and translation quality drops. In the worst case it breaks a word, as the engine would likely try to translate the text as below:


This text has a new line \ nin the middle of a segment .


The n of the \n would be separated from the \ and joined to the following word, making a new word “nin”.
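
The effect can be reproduced with a naive tokenizer. The sketch below is a hypothetical stand-in, not the tokenizer used by the translation engine: the function name naiveTokenize and its rule (put spaces around anything that is not a letter, digit, or white space) are assumptions made for illustration only.

```javascript
// Hypothetical naive tokenizer (an assumption for illustration only):
// surrounds every character that is not a letter, digit, or white space with spaces.
function naiveTokenize(text) {
    return text
        .replace(/([^A-Za-z0-9\s])/g, " $1 ")
        .replace(/ {2,}/g, " ")
        .trim();
}

// The segment contains the two characters "\" and "n", not a real line break.
var segment = "This text has a new line\\nin the middle of a segment.";
var tokens = naiveTokenize(segment);
console.log(tokens);
```

The backslash is treated as punctuation and split off, which leaves the n attached to the next word.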


A human translator can estimate where to put the \n when translating the text, and a machine can do the same when given the right guidance. Neither will be perfect, because exact placement requires seeing the text laid out in the publication format and adjusting accordingly. In both cases an estimate is about the best possible. A proofreader can fine-tune the visual publication layout at a later stage in the process.


Two steps are needed within Language Studio to achieve this:

  1. Remove the \n so that it does not impact translation quality.
  2. Estimate positioning and reinsert the \n after translation is complete.
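
The two steps can be sketched as a small round trip. This is a simplified illustration under stated assumptions, not the attached scripts: stripBreaks and reinsertBreaks are hypothetical names, tokens are split on spaces only, and an identity "translation" stands in for the engine output.

```javascript
// Step 1: replace each literal "\n" (backslash + n) with a space and tidy spacing.
function stripBreaks(segment) {
    return segment.replace(/\\n/g, " ").replace(/ {2,}/g, " ").trim();
}

// Step 2: put "\n" back in front of the target token that sits at the same
// token position as the token that followed the break in the source.
function reinsertBreaks(sourceSegment, targetSegment) {
    // Record, for each break, how many source tokens came before it.
    var breakPositions = [];
    var parts = sourceSegment.split("\\n");
    var count = 0;
    for (var p = 0; p < parts.length - 1; p++) {
        count += parts[p].trim().split(/ +/).filter(Boolean).length;
        breakPositions.push(count);
    }
    var tokens = targetSegment.split(" ");
    for (var b = 0; b < breakPositions.length; b++) {
        var pos = breakPositions[b];
        if (pos < tokens.length) {
            // Attach the marker to an existing token so the token count stays the same.
            tokens[pos] = "\\n" + tokens[pos];
        }
    }
    return tokens.join(" ");
}

var source = "This text has a new line\\nin the middle of a segment.";
var cleaned = stripBreaks(source);
// Identity "translation": the cleaned source is used as if it were the target.
var restored = reinsertBreaks(source, cleaned);
console.log(cleaned);
console.log(restored);
```

With a real translation the target word order and length differ from the source, so the reinserted position is only an estimate.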


Attached to this article are two JavaScript scripts that achieve this. Step 1 is performed in JS3, just after the segments have been extracted from the source format into translation units. Step 2 is performed in JS9, just before detokenization. Upload these two rule files and apply them as runtime rules to the Language Studio project and custom engine that require this feature.


For a definition of the translation workflow steps, see http://www.asiaonline.net/EN/LanguageStudio/JavaScriptPrePostProcessing.aspx#TranslationWorkflowSteps


Pre-Translation JavaScript - JS3:

 

 
function main(sAllSourceSegments) {
    //Remove all embedded new lines
    sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(\\n)/gm), " ");

    //Clean up any noise the changes made to the text.
    //Remove multiple white space
    sAllSourceSegments = sAllSourceSegments.replace(RegExp(/([ ]{2,})/gm), " ");
    //Remove leading spaces
    sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(^|\r\n|\n)([ ]{1,})/gm), "$1");
    //Remove trailing spaces
    sAllSourceSegments = sAllSourceSegments.replace(RegExp(/([ ]{1,})($|\r\n|\n)/gm), "$2");
    //Return output
    return sAllSourceSegments;
}
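
Note that the pattern in JS3 targets the literal text \n (a backslash followed by n), not a real line break. A minimal sketch with made-up sample text, assuming segments are separated by real line breaks as the JS9 script below does when it splits them:

```javascript
// Two segments separated by a real line break; the first one also
// contains the literal two-character sequence "\" + "n".
var batch = "First line\\nsecond line\nNext segment";

// /\\n/ matches a backslash followed by "n"; the real line break is left alone.
var cleanedBatch = batch.replace(/\\n/g, " ");
var segments = cleanedBatch.split("\n");

console.log(segments.length);
console.log(segments[0]);
console.log(segments[1]);
```

The segment boundary survives the clean-up, so the number of segments passed on to translation is unchanged.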
 

 



Post-Translation JavaScript - JS9:

 

function main(sAllSourceSegments, sAllTargetSegments) {
    //Split all source and target segments and then loop through, processing one segment at a time
    sAllSourceSegments = sAllSourceSegments.replace(RegExp(/(\r\n)/g), "\n");
    sAllTargetSegments = sAllTargetSegments.replace(RegExp(/(\r\n)/g), "\n");
    var aSourceSegments = sAllSourceSegments.split("\n");
    var aTargetSegments = sAllTargetSegments.split("\n");
    for (var i = 0; i < aSourceSegments.length; i++) {
        aTargetSegments[i] = processSegment(aSourceSegments[i], aTargetSegments[i]);
    }
    sAllTargetSegments = aTargetSegments.join("\n");
    return sAllTargetSegments;
}

//Processes a single segment
function processSegment(sSourceSegment, sTargetSegment) {
    var sTarget = "";

    //Check to see if this segment has a \n in it
    if (/\\n/.test(sSourceSegment)) {
        //Note: the source is not tokenized, so mark each \n with a placeholder token and tokenize here.
        //basicTokenize also trims the text and collapses repeated spaces.
        var sSource = sSourceSegment.replace(RegExp(/(\\n)/gm), " AOBRAO ");
        sSource = basicTokenize(sSource);
        var aSource = sSource.split(' ');

        //Target is tokenized - Safety clean to be sure
        //Remove leading spaces
        sTargetSegment = sTargetSegment.replace(RegExp(/(^)([ ]{1,})/gm), "$1");
        //Remove trailing spaces
        sTargetSegment = sTargetSegment.replace(RegExp(/([ ]{1,})($)/gm), "$2");
        //Remove multiple spaces
        sTargetSegment = sTargetSegment.replace(RegExp(/([ ]{2,})/gm), " ");
        var aTarget = sTargetSegment.split(' ');

        //Loop through the target tokens, inserting the \n in approximately the same position as it was in the source
        for (var i = 0; i < aTarget.length; i++) {
            //NOTE: It is very important not to change the number of target language tokens. This could negatively impact any XML markup that is to be reinserted.
            if (i >= aSource.length) {
                sTarget += ' ' + aTarget[i];
            }
            else {
                if (aSource[i] == "AOBRAO") {
                    sTarget += ' \\n' + aTarget[i];
                }
                else {
                    sTarget += ' ' + aTarget[i];
                }
            }
        }
        //Safety clean
        sTarget = sTarget.replace(RegExp(/(^)([ ]{1,})/gm), "$1");
        sTarget = sTarget.replace(RegExp(/([ ]{1,})($)/gm), "$2");
        sTarget = sTarget.replace(RegExp(/([ ]{2,})/gm), " ");

        //Assign the final return value;
        sTargetSegment = sTarget;
    }
    return sTargetSegment;
}

function basicTokenize(sIn) {
    var sOut = sIn;

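    //Put spaces around any punctuation except for . _ , - @
    //NOTE: \p{P} relies on the script engine supporting Unicode property escapes. In standard ECMAScript this only works when the regular expression has the "u" flag.
    sOut = sOut.replace(RegExp(/((?![\._,\-@])\p{P}|[\"\&:;])/gmi), " $1 ");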

    //Handle commas
    sOut = sOut.replace(RegExp(/([a-zA-Z][ ]{0,}),([ ]{0,}[a-zA-Z0-9])/g), "$1 , $2");
    sOut = sOut.replace(RegExp(/([a-zA-Z0-9][ ]{0,}),([ ]{0,}[a-zA-Z])/g), "$1 , $2");

    sOut = sOut.replace(RegExp(/([ ]{2,})/gi), " "); //Remove multiple spaces

    sOut = sOut.trim();

    return sOut;
}