Document Translator Service

Document Translator

The document translator translates documents from one language to another. The following file formats are currently supported:

  1. Microsoft Word (.docx)

  2. Microsoft Excel (.xlsx)

Additionally, a glossary can be configured via the GlossaryService, and different post-processors can be applied via a TextPipeLine.

Known Issues

General

  • Cannot upload small documents in chat for translation

  • After translation, the translated document appears in the uploaded documents list

  • Large documents may cause rate limit issues, depending on the limits in place

Docx

  • Tables of contents are not translated

  • Footnotes and headers are not translated

  • Comments are not translated

  • Hyperlink descriptions are not translated

Excel

  • Sheet names of xlsx are not translated

  • Some formulas containing words may break

  • Comments are not translated

  • SmartArt is lost during translation

  • Text boxes are lost during translation

Configuration

The document translator service has two configurations: one for the prompts and one for the settings.

Default Settings

json
{
    "languageModelName": "AZURE_GPT_4_0613",
    "maxTokensPerTranlationRequest": 500,
    "maxTokenPerMinute": 40000,
    "saveTranslationCallsEnabled": false,
    "translationExamplesIgnored": false,
    "allowedInputLanguages": [
            "Afrikaans", "Albanian", "Arabic", "Aragonese", "Armenian", "Azeri", "Bashkir",
            "Basque", "Belarusian", "Bengali", "Bislama", "Bosnian", "Breton", "Bulgarian",
            "Burmese", "Catalan", "Chamorro", "Chechen", "Chinese", "Cornish", "Corsican",
            "Croatian", "Czech", "Danish", "Dutch", "English", "Esperanto", "Estonian", "Ewe",
            "Faroese", "Fijian", "Finnish", "French", "Galician", "Georgian", "German", "Greek",
            "Greenlandic", "Guaraní", "Haitian Creole", "Hausa", "Hebrew", "Hindi", "Hungarian",
            "Icelandic", "Ido", "Indonesian", "Interlingua", "Interlingue", "Inuktitut", "Irish",
            "Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Kurdish",
            "Kyrgyz", "Lao", "Latin", "Latvian", "Limburgish", "Lingala", "Lithuanian", "Luxembourgish",
            "Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Manx", "Maori", "Marathi",
            "Marshallese", "Mongolian", "Navajo", "Nepali", "Northern Sami", "Norwegian", "Norwegian Bokmål",
            "Norwegian Nynorsk", "Occitan", "Ojibwe", "Old Church Slavonic", "Ossetian", "Pashto", "Persian",
            "Polish", "Portuguese", "Punjabi", "Quechua", "Romanian", "Romansch", "Russian", "Samoan", "Sanskrit",
            "Sardinian", "Scottish Gaelic", "Serbian", "Serbo-Croatian", "Sichuan Yi", "Sindhi", "Slovak",
            "Slovene", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tahitian", "Tajik",
            "Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Tongan", "Tswana", "Turkish", "Turkmen", "Ukrainian",
            "Urdu", "Uyghur", "Uzbek", "Vietnamese", "Volapük", "Walloon", "Welsh", "West Frisian", "Yiddish",
            "Yoruba", "Zhuang", "Zulu"]
}
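As an illustration of how a custom configuration relates to these defaults, here is a minimal Python sketch that overlays a partial settings document on top of the default values. The `merge_settings` helper is hypothetical and is not part of the service; how the service itself applies overrides is not documented on this page. The language list is shortened for readability.

```python
import json

# Defaults copied from the "Default Settings" block above.
# The language list is shortened here; the full list is on this page.
DEFAULT_SETTINGS = {
    "languageModelName": "AZURE_GPT_4_0613",
    "maxTokensPerTranlationRequest": 500,
    "maxTokenPerMinute": 40000,
    "saveTranslationCallsEnabled": False,
    "translationExamplesIgnored": False,
    "allowedInputLanguages": ["English", "French", "German"],
}


def merge_settings(overrides_json: str) -> dict:
    """Hypothetical helper: overlay user overrides on the defaults.

    Unknown keys are rejected so that a misspelled parameter name
    does not silently fall back to its default value.
    """
    overrides = json.loads(overrides_json)
    unknown = set(overrides) - set(DEFAULT_SETTINGS)
    if unknown:
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")
    return {**DEFAULT_SETTINGS, **overrides}


settings = merge_settings('{"maxTokenPerMinute": 80000}')
print(settings["maxTokenPerMinute"])
```

Rejecting unknown keys is a design choice of this sketch: the parameter names are case-sensitive and easy to mistype, so failing early is usually safer than ignoring the entry.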

Parameter Description

| Parameter | Description | Default Value |
| --- | --- | --- |
| languageModelName | The model used to translate between languages. | AZURE_GPT_4o_2024_1120 |
| maxTokensPerTranlationRequest | The maximum number of tokens sent for translation in a single request. Text that exceeds this limit is split across multiple requests. | 500 |
| maxTokenPerMinute | The maximum number of tokens available for translation tasks per minute. | 40000 |
| saveTranslationCallsEnabled | If true, each individual LLM call made during document translation is saved in the chat message. | false |
| translationExamplesIgnored | If true, the translation examples (few-shot learning) are ignored. | false |
| allowedInputLanguages | Input languages that can be recognized, so that the correspondingly configured few-shot examples, glossary, and text post-processing can be applied. | See below |
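The splitting behaviour described for maxTokensPerTranlationRequest can be pictured with the following minimal Python sketch. It is not the service's implementation: `batch_text_pieces` and `count_tokens_approx` are hypothetical names, and the word-count tokenizer is only a stand-in for the model's real tokenizer.

```python
from typing import Callable, List, Sequence


def count_tokens_approx(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def batch_text_pieces(
    pieces: Sequence[str],
    max_tokens_per_request: int = 500,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[List[str]]:
    """Group text pieces into batches that stay within the token budget.

    A single piece that is larger than the budget is placed in a batch
    of its own rather than being cut in the middle.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        n = count_tokens(piece)
        # Start a new batch when adding this piece would exceed the budget.
        if current and current_tokens + n > max_tokens_per_request:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


batches = batch_text_pieces(["a b c", "d e", "f g h i"], max_tokens_per_request=5)
print(batches)
```

Each batch would then correspond to one translation request, so a lower maxTokensPerTranlationRequest yields more, smaller requests for the same document.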

allowedInputLanguages

This parameter is relevant when using the GlossaryService and the PostProcessingService: for these services to work, the input language must be recognized unambiguously.

The supported languages are:

"Afrikaans", "Albanian", "Arabic", "Aragonese", "Armenian", "Azeri", "Bashkir",
"Basque", "Belarusian", "Bengali", "Bislama", "Bosnian", "Breton", "Bulgarian",
"Burmese", "Catalan", "Chamorro", "Chechen", "Chinese", "Cornish", "Corsican",
"Croatian", "Czech", "Danish", "Dutch", "English", "Esperanto", "Estonian", "Ewe",
"Faroese", "Fijian", "Finnish", "French", "Galician", "Georgian", "German", "Greek",
"Greenlandic", "Guaraní", "Haitian Creole", "Hausa", "Hebrew", "Hindi", "Hungarian",
"Icelandic", "Ido", "Indonesian", "Interlingua", "Interlingue", "Inuktitut", "Irish",
"Italian", "Japanese", "Javanese", "Kannada", "Kazakh", "Khmer", "Korean", "Kurdish",
"Kyrgyz", "Lao", "Latin", "Latvian", "Limburgish", "Lingala", "Lithuanian", "Luxembourgish",
"Macedonian", "Malagasy", "Malay", "Malayalam", "Maltese", "Manx", "Maori", "Marathi",
"Marshallese", "Mongolian", "Navajo", "Nepali", "Northern Sami", "Norwegian", "Norwegian Bokmål",
"Norwegian Nynorsk", "Occitan", "Ojibwe", "Old Church Slavonic", "Ossetian", "Pashto", "Persian",
"Polish", "Portuguese", "Punjabi", "Quechua", "Romanian", "Romansch", "Russian", "Samoan", "Sanskrit",
"Sardinian", "Scottish Gaelic", "Serbian", "Serbo-Croatian", "Sichuan Yi", "Sindhi", "Slovak",
"Slovene", "Somali", "Spanish", "Sundanese", "Swahili", "Swedish", "Tagalog", "Tahitian", "Tajik",
"Tamil", "Tatar", "Telugu", "Thai", "Tibetan", "Tongan", "Tswana", "Turkish", "Turkmen", "Ukrainian",
"Urdu", "Uyghur", "Uzbek", "Vietnamese", "Volapük", "Walloon", "Welsh", "West Frisian", "Yiddish",
"Yoruba", "Zhuang", "Zulu"
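To illustrate what recognizing the input language unambiguously could mean in practice, the Python sketch below maps a detected language name onto exactly one entry of allowedInputLanguages, or onto `None` when there is no match. The `resolve_input_language` helper and the case-insensitive matching are assumptions made for this example; the service's actual detection logic is not described here.

```python
from typing import Optional, Sequence

# Excerpt of allowedInputLanguages; the full list is shown above.
ALLOWED_INPUT_LANGUAGES = ["English", "French", "German", "Norwegian", "Norwegian Bokmål"]


def resolve_input_language(
    detected: Optional[str], allowed: Sequence[str]
) -> Optional[str]:
    """Return the canonical allowed language name, or None if not recognized.

    Matching is exact apart from case and surrounding whitespace, so
    "Norwegian" never resolves to "Norwegian Bokmål" by accident.
    """
    if not detected:
        return None
    lookup = {name.casefold(): name for name in allowed}
    return lookup.get(detected.strip().casefold())


print(resolve_input_language(" german ", ALLOWED_INPUT_LANGUAGES))
print(resolve_input_language("Klingon", ALLOWED_INPUT_LANGUAGES))
```

In this sketch a result of `None` would mean that language-specific features such as the glossary or post-processing cannot be selected for the document.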

Prompt configuration

| Parameter | Description | Default |
| --- | --- | --- |
| systemPromptInstruction | System prompt instruction for the document translation service. | See below |
| userMessageTemplate | A jinja2 template for the user message. | See below |

systemPromptInstruction

You are a helpful AI designed to to translate text to a specified language.
Do it even if the target language is the same as the source language.
Make sure the translated text contains the same amount of carriage returns '\\n' as the original text block.
Try to keep the translated text as close to the original as possible and having approximately the same lenght.

userMessageTemplate

"Please translate the following text pieces in {{format_style}} {% if input_language %}from {{input_language}} {% endif %}to {{output_language}}

{% if glossary %}Use the following translation rules 

{{ glossary_text }}{% endif %}

{{formatted_text_pieces}}"

Prompting instructions

The userMessageTemplate is a jinja2 template that the service renders with a fixed set of variables. The table below lists these variables; a user-defined template may use any of them.

| Parameter | Description |
| --- | --- |
| input_language | The input language if it could be detected, otherwise None |
| output_language | The output language as a string |
| glossary | Boolean indicating whether a glossary is available |
| glossary_text | The glossary text |
| format_style | The style in which the text pieces of a document are presented to the LLM |
| formatted_text_pieces | The text pieces, formatted in, for example, an HTML structure |
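To make the variables concrete, the following Python sketch renders the default userMessageTemplate with the jinja2 library. The sample values (the "HTML" format style, the glossary line, and the `<p>` markup) are assumptions chosen for illustration, and `render_user_message` is a hypothetical helper; the values the service actually passes may look different.

```python
from jinja2 import Template  # third-party package: pip install jinja2

# Default userMessageTemplate from this page, without the surrounding quotes.
USER_MESSAGE_TEMPLATE = (
    "Please translate the following text pieces in {{format_style}} "
    "{% if input_language %}from {{input_language}} {% endif %}"
    "to {{output_language}}\n\n"
    "{% if glossary %}Use the following translation rules\n\n"
    "{{ glossary_text }}{% endif %}\n\n"
    "{{formatted_text_pieces}}"
)


def render_user_message(**variables) -> str:
    """Render the template with the variables listed in the table above."""
    return Template(USER_MESSAGE_TEMPLATE).render(**variables)


message = render_user_message(
    input_language="German",          # None when the language was not detected
    output_language="English",
    glossary=True,
    glossary_text="Vertrag -> contract",   # illustrative glossary format
    format_style="HTML",                   # illustrative value
    formatted_text_pieces="<p id='1'>Der Vertrag ist gültig.</p>",
)
print(message)
```

When `input_language` is None, the `{% if input_language %}` block is skipped and the first line reads "... in HTML to English"; when `glossary` is false, the translation rules paragraph is left out entirely.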
