Detect type of field in translator implementation

Jamie Sammons, modified 2 Years ago.

While fixing a problem in the DeepL translator and testing the fix in our staging environment, I noticed a problem in the translator integration. DeepL has special modes for handling XML and HTML. If they are not used, you can get broken output for HTML text.

Consider this input:

{
    "tag_handling": "html",
    "text": [
        "<h4>AAAA-Bahn-Nutzung</h4>\n<p>Zusammen mit einem amtlichen Lichtbildausweis kann YYYYY als Berechtigung zur Nutzung der AAAAA-Bahn genutzt werden.</p>"
    ],
    "source_lang": "de",
    "target_lang": "en"
}

Without the flag "tag_handling" you'll get this:

{
    "translations": [
        {
            "detected_source_language": "DE",
            "text": "<h4>AAAA Railroad Use</h4>.\n<p>When used in conjunction with a government-issued photo ID, YYYY may be used as authorization to use the AAAAA railroad.</p>"
        }
    ]
}

With that flag you'll get:

{
    "translations": [
        {
            "detected_source_language": "DE",
            "text": "<h4>AAAA-Bahn use</h4>\n<p>Together with an official photo ID, YYYY can be used as authorization to use the AAAAA-Bahn.</p>"
        }
    ]
}

The problem is not easy to spot: look at the end of the h4 tag. Without the tag_handling parameter, an additional full stop is inserted after the closing tag. It would be fine if the full stop were inside the element, but placed after the closing tag it damages the HTML structure.

I extracted this into a standalone sample, but the same thing happens inside Liferay when you try to translate a field with HTML content.

The TranslatorPacket passed to the translator implementation does not hold enough information to decide whether a field contains HTML. At this point I think the only option is to detect HTML by searching for closing tags and to use that to switch to HTML mode.
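To make the idea concrete, here is a minimal sketch of that heuristic in Java. The class and method names (HtmlFieldDetector, looksLikeHtml, tagHandlingFor) are hypothetical and not part of Liferay's API; the only value taken from above is "html" for the tag_handling parameter. It assumes that any closing tag in the text is enough to treat the field as HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Liferay: guesses whether a field value is HTML.
final class HtmlFieldDetector {

    // Matches a closing tag such as </p>, </h4> or </my-element>.
    // The tag name must start with a letter, so "a </ b" does not match.
    private static final Pattern CLOSING_TAG =
            Pattern.compile("</[A-Za-z][A-Za-z0-9-]*\\s*>");

    private HtmlFieldDetector() {
    }

    // Returns true if the text contains at least one closing tag.
    static boolean looksLikeHtml(String text) {
        return text != null && CLOSING_TAG.matcher(text).find();
    }

    // Returns "html" when the heuristic fires, otherwise null,
    // meaning the tag_handling parameter should be left out of the request.
    static String tagHandlingFor(String text) {
        return looksLikeHtml(text) ? "html" : null;
    }
}
```

This is only a guess based on the content. Text that uses nothing but void elements such as <br> would be missed, and XML content would also be sent in HTML mode, so it is a workaround rather than a real replacement for knowing the field type.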

Question 1: So does anyone have ideas?
Question 2: Please create a tracking issue for this.