Annotation Guidelines
for English-Dutch Machine Translation Quality Assessment

version 1.3.3

LT3 Technical Report - LT3 15-03

Arda Tezcan, Lieve Macken, Joke Daems & Laura Van Brussel
arda.tezcan@ugent.be, lieve.macken@ugent.be, joke.daems@ugent.be, laura.vanbrussel@ugent.be
LT3 - Language and Translation Technology Team
Department of Translation, Interpreting and Communication
Ghent University

URL: http://www.lt3.ugent.be [1]

August, 2015


Overview


Introduction

Step 1: Annotating Accuracy

Step 2: Annotating Fluency



Introduction

Assessing translation quality is a complex task, and the appropriate approach depends on the goal of the assessment. We propose a categorization of the most typical translation errors, intended specifically for use in the machine translation (MT) context.


Categorization

The categories are divided into two main groups: Accuracy and Fluency. Accuracy is concerned with the relationship between source text and target text, whereas Fluency is concerned with the construction of the target text and language. Accuracy errors only become visible when source and target text are compared, whereas fluency errors are already visible in the target text alone. If we can detect an error by looking at the target text only, we annotate it as a fluency error. If we have to look at the source text to detect an error, we annotate it as an accuracy error. Because one type of error can occur together with the other, accuracy and fluency errors can both be annotated on the same text.

The technical report 2.0

This technical report contains a possible classification of translation problems for the translation of texts from English to Dutch and guidelines on how to annotate these problems. Though tuned to suit the needs of the English-Dutch language pair, the categorization can be customized to suit the needs of other language pairs.

These guidelines detail the annotation process with the brat rapid annotation tool.

Using the brat-tool

Hover over the word 'brat' in the top right-hand corner to be able to select 'log in', and use your username and password to log in.

You are now ready to start annotating. Just double-click a word or click and drag to select smaller/larger pieces of text and the tool will give you an overview of the possible categories.

The categories are listed below 'entity type'. After selecting the appropriate (sub)category, select the correct (sub)subcategory from the drop-down menu below 'entity attributes'. Make sure to select a (sub)subcategory and an attribute when possible!

Click 'ok' when you're done, or 'cancel' when you've selected a piece of text that you didn't want to select. To change an annotation, double-click the label above the word. You can change the category, subcategory and notes, delete the annotation, or move it. To move an annotation, first select 'move' and then select the text span that you want to move the annotation to.

! Be careful when changing the category of an annotation: sometimes the tool remembers the first chosen subcategory alongside the new subcategory (even when the first subcategory belongs to a different main category than the second). If this happens, simply delete your annotation and make a new one with the correct subcategory.

You can select a word or span more than once, so it is possible to assign different problem categories to the same word.


Linking spans

When annotating accuracy errors (see below), you will need to link words in the source sentence to words in the target sentence. You can think of linking as aligning words in the source sentence with their corresponding translation in the target sentence. To link two annotations of the same category, click the first annotation (in the source sentence) and drag your mouse pointer to the second annotation (in the target sentence). An arrow labeled with the error category will appear. The guidelines specify when you are allowed or required to insert a link between annotations.
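For readers who process the annotations afterwards: brat stores annotations in a standoff file, in which a linked pair could appear as two text-bound lines and one relation line. The following is a minimal sketch of that layout, not the project's actual configuration. The label `Mistranslation_Sense`, the use of the same label for the arrow, the example sentence pair and the layout with the source sentence on the first line are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A text-bound annotation covering one contiguous character span."""
    ann_id: str    # e.g. "T1"
    category: str  # hypothetical label, not taken from the real brat config
    start: int     # character offset of the first character
    end: int       # character offset just past the last character


def span_line(text: str, span: Span) -> str:
    """Serialize a span as a brat text-bound annotation line."""
    covered = text[span.start:span.end]
    return f"{span.ann_id}\t{span.category} {span.start} {span.end}\t{covered}"


def link_line(rel_id: str, source: Span, target: Span) -> str:
    """Serialize a source-to-target link as a brat relation line."""
    if source.category != target.category:
        raise ValueError("linked annotations must share the same category")
    return f"{rel_id}\t{source.category} Arg1:{source.ann_id} Arg2:{target.ann_id}"


# Invented example: English "bank" (of a river) rendered as Dutch "bank".
# Assumed layout: source sentence on line 1, target sentence on line 2.
doc = "He sat on the bank of the river.\nHij zat op de bank van de rivier."
src_start = doc.index("bank")
tgt_start = doc.index("bank", src_start + 1)
src = Span("T1", "Mistranslation_Sense", src_start, src_start + 4)
tgt = Span("T2", "Mistranslation_Sense", tgt_start, tgt_start + 4)

lines = [span_line(doc, src), span_line(doc, tgt), link_line("R1", src, tgt)]
print("\n".join(lines))
```

The check in `link_line` mirrors the rule that only annotations of the same category are linked.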

Adding fragments to your annotations

When words that belong to the same MT error are not adjacent, you can annotate them without including the words in between. To do this, select your first annotation (the first part of your annotation) and click "Add frag.". This allows you to annotate a second span of words in a different part of the sentence. Once you have made the second annotation, the two annotations will be linked to each other with a dashed line.
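In brat's standoff format, an annotation with added fragments is stored as a single text-bound line whose offset pairs are separated by semicolons. The sketch below parses such a line. The label `Grammar` is a placeholder, and the Dutch sentence is a correct sentence used only to show the span mechanics, not a real error.

```python
def parse_text_bound(line: str):
    """Parse one brat text-bound line into (id, label, fragments, text).

    Fragments are (start, end) character offsets; a discontinuous
    annotation has more than one fragment, separated by ';'.
    """
    ann_id, header, covered = line.rstrip("\n").split("\t", 2)
    label, offsets = header.split(" ", 1)
    fragments = []
    for pair in offsets.split(";"):
        start, end = pair.split()
        fragments.append((int(start), int(end)))
    return ann_id, label, fragments, covered


text = "Hij heeft het boek gelezen."
line = "T1\tGrammar 4 9;19 26\theeft gelezen"  # hypothetical label

ann_id, label, fragments, covered = parse_text_bound(line)
pieces = [text[start:end] for start, end in fragments]
print(fragments, pieces)
```

Joining the extracted pieces with a space reproduces the covered text stored at the end of the line, which is a convenient consistency check.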

General annotation rules

  • Accuracy annotations require two sets of annotations (one in source sentence and one in target sentence) and a "link" from source to target annotation.
  • Fluency annotations require annotations only on the target sentence.
  • Make sure to select a subcategory whenever possible!
  • If you are not confident about an annotation, select the "low confidence" check box in the "Entity attributes" window. When "low confidence" is selected, you will see "**" at the beginning of your annotation.
  • If you are not sure about which subcategory to select, select only the main category.
  • If you think an error fits a main category but none of its subcategories, select "other" (when applicable) and provide an explanation in the notes section.
  • Enter the correction for the annotated text span in the 'notes' section.
  • If an item contains more than one problem, highlight the item as many times as there are problems, once for each problem (for example, a word that contains both a compound error and a capitalization error).
  • If the same problem occurs more than once, select each occurrence separately.

  • When in doubt…

    If you are not sure whether or not something is an error, consult external sources. You are perfectly allowed to use a dictionary or a search engine to look things up. Some useful sites that you could consult are:

    http://www.vandale.be/
    http://taaladvies.net/
    http://woordenlijst.org
    http://www.vrt.be/taal/

    You can also look up a word in one of these resources by double-clicking the word in brat and clicking the link for the resource in the "Search" section of the annotation window. You can refer to external sources in the 'notes' section to support your decisions. You may also look back at previous texts to check how you annotated the same problem in a different translation.
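The structural rules in the list above (accuracy errors need a source annotation, a target annotation and a link; fluency errors are annotated on the target only; a correction goes in the notes) lend themselves to an automatic check. The following is a minimal sketch of such a check. The class and field names are hypothetical, and it encodes only the general rules; category-specific guidelines may relax them.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorAnnotation:
    """One annotated problem; all field names are hypothetical."""
    group: str                                        # "accuracy" or "fluency"
    category: str                                     # e.g. "Mistranslation"
    source_spans: list = field(default_factory=list)  # (start, end) pairs
    target_spans: list = field(default_factory=list)  # (start, end) pairs
    linked: bool = False                              # source-to-target arrow?
    note: str = ""                                    # correction or explanation


def structural_problems(ann: ErrorAnnotation) -> list:
    """Return descriptions of the general rules that the annotation violates."""
    problems = []
    if ann.group == "accuracy":
        if not ann.source_spans:
            problems.append("accuracy: no source annotation")
        if not ann.target_spans:
            problems.append("accuracy: no target annotation")
        if not ann.linked:
            problems.append("accuracy: source and target are not linked")
    elif ann.group == "fluency":
        if ann.source_spans:
            problems.append("fluency: annotation on the source sentence")
        if not ann.target_spans:
            problems.append("fluency: no target annotation")
    else:
        problems.append(f"unknown group: {ann.group}")
    if not ann.note.strip():
        problems.append("no correction in the notes")
    return problems


ok = ErrorAnnotation("accuracy", "Mistranslation", [(14, 18)], [(47, 51)], True, "oever")
bad = ErrorAnnotation("fluency", "Grammar", [(0, 2)], [], False, "")
print(structural_problems(ok))
print(structural_problems(bad))
```

A check like this only flags structural gaps; whether the chosen category is the right one remains a judgment for the annotator.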


    Step 1: Annotating Accuracy Errors

    The texts that you are about to annotate are the results of a machine translation task from English to Dutch. To be able to judge the quality of the translations, they will be marked for two important error types: Accuracy and Fluency errors. Accuracy errors are concerned with the relationship between source text and target text, whereas fluency errors are concerned with the target text and language. The goal of the current assignment is to annotate translations for accuracy errors. Fluency errors will be dealt with separately.

    Accuracy errors can be described as errors which lead to a target text that does not reflect the same information as the source text. This means that all misinterpretations, contradictions, meaning shifts, additions or deletions are potential errors.

    Please remember that problems regarding only the conventions of the Dutch language (where meaning transfer from source to target has been successful) are not the focus of accuracy and should be handled as fluency.



    Categorization

    A detailed explanation of each category can be found below the overview. The information consists of the category name in the brat-tool (the color is always blue for accuracy annotations), followed by a definition, important remarks, guidelines for annotation and examples. The words that should be annotated are highlighted and the information after the arrow sign is an example of a possible annotation note.

    Accuracy Errors

      • Addition
      • Omission
      • Untranslated
      • Do-Not-Translate (DNT)
      • Mistranslation
          • Multi-Word Expressions
          • Part-Of-Speech (POS)
          • Word Sense Disambiguation (WSD)
          • Partial
          • Semantically Unrelated
          • Other
      • Mechanical
          • Capitalization
          • Punctuation
          • Other
      • Bilingual Terminology
      • Source Error
      • Other


    1. Addition

    Definition: The target text contains content that is not present in the source.

    Annotation: Annotate the added target text that is not present in the source. You do not need to annotate any source text, so no linking is required either (since the text is not present in the source). However, if you cannot annotate the added target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. In this case, also annotate the source words that are covered by the target annotation and link the source annotation to the target annotation with an arrow.


    2. Omission

    Definition: Source content cannot be found in target.

    Annotation: Annotate the source text that is omitted in the target text.


    3. Untranslated

    Definition: The source text is not translated (but was copied to target) when it should have been translated into Dutch.

    Annotation: Annotate the source text that is copied to the target. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the copy in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.


    4. Do not translate (DNT)

    Definition: The source content is unnecessarily translated into the target language when it should have been left untranslated.

    Annotation: Annotate the source text that is translated but should not have been translated. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.


    5. Mistranslation (MisTra)

    Definition: Source content has been translated (and should have been translated), but the translation is incorrect.

    5.1. Multi-Word Expression (MWE)

    Definition: The translation is incorrect (and often too literal) because the English sentence contains a multi-word expression such as an idiom, a proverb, a collocation, a compound or a phrasal verb. Idiomatic expressions and proverbs are indicated with a paragraph sign (¶) in Van Dale. An example is the idiom "call it a day".

    Annotation: Annotate the source multi-word expression that is incorrectly translated. Annotate the corresponding translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.

    5.2. Part of Speech (POS)

    Definition: The translation has an incorrect lexical category (part of speech) with respect to the corresponding source text.

    Annotation: Annotate the source text that is translated with the wrong part of speech. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.

    5.3. Word Sense Disambiguation (Sense)

    Definition: The target content refers to a different (and wrong) sense of the source content.

    Annotation: Annotate the source text that is translated with the wrong sense. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.

    5.3.1 Function Word

    Determiners, prepositions, auxiliaries, conjunctions, pronouns.

    5.3.2 Content Word

    Nouns, verbs, adjectives, adverbs.

    5.4. Partial

    Definition: The translation is incorrect due to the partial translation of a Dutch separable verb.

    Annotation: Annotate the source text that is mistranslated. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.

    5.5. Semantically Unrelated

    Definition: The meaning of one or more translated words is not related in any way to the meaning of the source word(s) and does not make any sense in the context.

    Annotation: Annotate the unrelated word(s) in the target text and their corresponding word(s) in the source. Link annotation in the source text to the annotation in the target text with an arrow.

    5.6. Other

    Definition: The translation is incorrect but the problem cannot be captured with any of the subcategories above.

    Annotation: Annotate the source text that is mistranslated. If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.


    6. Mechanical

    Definition: Mechanical transfer errors which are not related to content transfer.

    Annotation: Annotate the source text or punctuation that is affected by the incorrect mechanical transfer (depending on the type of error). If you cannot annotate the source or target word(s) separately because they are part of a compound or a specific phrase, select multiple words or the whole phrase. Annotate the translation in the target text. Link the annotation in the source text to the annotation in the target text with an arrow.

    6.1 Capitalization

    Definition: Errors related to the transfer of capitalization rules from source to target.

    Annotation: Annotate the source text which is subject to the wrong capitalization transfer. Annotate the corresponding text in target. Link annotation in the source text to the annotation in the target text with an arrow.

    6.2 Punctuation

    Definition: Errors related to the transfer of punctuation from source to target.

    Annotation: Annotate the source punctuation which is subject to the wrong mechanical transfer. Annotate the corresponding text in target (if the punctuation is not missing in target). Link annotation in the source text to the annotation in the target text with an arrow.

    6.3 Other

    Definition: Errors related to the transfer of other mechanical aspects.

    Annotation: Annotate the source text which is subject to the wrong mechanical transfer. Annotate the corresponding text in target (if available). Link annotation in the source text to the annotation in the target text with an arrow.


    7. Bilingual Terminology

    Definition: The translation does not match the predefined bilingual terminology requirements.

    Annotation: Select the words in the source text that are involved in the error. Select the corresponding translation in the target text and link the source annotation to the target annotation with an arrow.


    8. Source Error

    Definition: Errors that are present in the source segment.

    Annotation: Select the words in the source text that contain the error. Select the corresponding translation in the target text and link the source annotation to the target annotation with an arrow.

    Be careful! Marking source errors and the corresponding translation does not mean that other observed errors should be skipped. Source errors often lead to errors in the target word as well, which should be annotated separately.


    9. Other Accuracy errors (ACC)

    Definition: Other errors regarding the relationship between source and target text, which do not belong to any of the accuracy error categories above.

    Annotation: Select the words in the source text that are involved in the error. Select the corresponding translation in the target text and link the source annotation to the target annotation with an arrow.


    Step 2: Annotating Fluency Errors

    The goal of the current assignment is to annotate fluency errors. Accuracy errors have been dealt with in the previous section.

    Fluency errors can be described as errors which relate to the construction of the target language. A good translation should read as a native Dutch text. This includes respecting the conventions of the language (grammar, lexicon, orthography).

    Please remember that errors of translation and meaning transfer (accuracy) are not the focus of fluency error annotations and should be handled as accuracy errors.

    Following is an overview of all the (sub)categories for fluency errors and guidelines on how to annotate these issues within the brat-tool.


    Categorization

    A detailed explanation of each category can be found below the overview. The information consists of the category name in the brat-tool (the color is always red for fluency annotations), followed by a definition, important remarks, guidelines for annotation and examples. The words that should be annotated are highlighted and the information after the arrow sign is an example of a possible annotation note.

    Fluency Errors

      • Grammar
          • Multi-Word Syntax
          • Word Form
          • Word Order
          • Extra Words
          • Missing Words
          • Other
      • Lexicon
          • Non-existing or Foreign Word
          • Lexical Choice
      • Orthography
          • Spelling
          • Capitalization
          • Punctuation
          • Other
      • Multiple Errors
      • Other


    1. Grammar

    Definition: Errors regarding the grammatical rules of the Dutch language.

    1.1. Multi-Word Syntax

    Definition: The syntax of a multi-word expression is wrong even though the individual word choices are correct. The text needs a combination of corrections, such as reordering and adding and/or removing function words. Please keep in mind that the same text can also include other problems, such as "Lexicon" or "Orthography" errors, but these should be annotated independently of the grammar errors.

    1.2. Word Form

    Definition: One or more words are used in an incorrect form.

    1.2. Word Order

    Definition: Wrong word order.

    Be careful! If the word order is grammatically correct, but another word order would be better, do not annotate it as a word order error!

    Annotation: Annotate the words in the target sentence where a word order error can be seen. Any word that needs to be reordered should be within your annotation. If you would like to annotate words that are not adjacent but belong to the same word order error, annotate the first set of adjacent words and select "add frag." to add the other words to your annotation. Indicate the correct order of the words in the 'notes' section.

    There are different types of word order errors: some words might need to switch places, some might need to move to a different location in the sentence, and other errors are more complex. Make sure you annotate all relevant words that belong to the same error, and please remember to include the text with the correct word order in your notes.

    1.3. Extra Words

    1.3.1 Repetitions

    Definition: One or more words are unnecessarily repeated in the target sentence.

    Annotation: Select the words that are repeated in the target text.

    1.3.2 Other

    Definition: One or more extra words make the target grammatically incorrect.

    Be careful! An 'extra word' error (fluency) can be caused by an 'addition' error (accuracy). If this is the case, annotate both types of errors. Remember that not every 'extra word' error (fluency) implies an 'addition' error (accuracy), or the other way around.

    Annotation: Select the words in target text which should not be present.

    1.4. Missing Words

    Definition: One or more missing words make the target grammatically incorrect.

    Be careful (1): Span. Only select 'missing words' if a whole structure (article, constituent, preposition ...) is missing.

    Be careful (2): Omissions. A 'missing word' error (fluency) can be caused by an 'Omission' error (accuracy). If this is the case, annotate both types of errors. Remember that not every 'missing word' error (fluency) implies an 'Omission' error (accuracy), or the other way around.

    Annotation: Select (only) the preceding word for the correct location of the missing word. As an exception, if the missing word should appear at the beginning of the sentence, select the first word of the sentence and add 'first word' to your comments to indicate that this is an exception. Please remember to include the missing word and the correct target text for your annotation in your notes as well.
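The selection rule above can be expressed as a small function: given the tokens of the target sentence and the position at which the missing word belongs, it returns which token to select and a note to enter. This is only a sketch of the rule as stated. The function name, the token-index convention and the wording of the note are assumptions, and the Dutch sentences are invented examples.

```python
def missing_word_anchor(tokens, insert_at, missing):
    """Return (index of the token to select, note text) for a missing word.

    insert_at is the position in `tokens` at which the missing word
    belongs: 0 means before the first token, len(tokens) after the last.
    """
    if not tokens:
        raise ValueError("the target sentence has no tokens")
    if not 0 <= insert_at <= len(tokens):
        raise ValueError("insert_at is outside the sentence")
    corrected = " ".join(tokens[:insert_at] + [missing] + tokens[insert_at:])
    if insert_at == 0:
        # Exception: nothing precedes the gap, so select the first word
        # and flag the exception in the note.
        return 0, f"first word; missing: {missing}; correct: {corrected}"
    # Regular case: select the word that precedes the gap.
    return insert_at - 1, f"missing: {missing}; correct: {corrected}"


mid_sentence = missing_word_anchor(["Ik", "heb", "boek", "gelezen"], 2, "het")
sentence_start = missing_word_anchor(["man", "loopt", "naar", "huis"], 0, "De")
print(mid_sentence)
print(sentence_start)
```

In the first call the word to select is "heb" (index 1), the word preceding the gap; in the second call the first word is selected and the note starts with 'first word'.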

    Be careful (3): Links

    If a part of a separable verb is missing (normally, you annotated this word as Accuracy > Mistranslation > Partial), create two fluency annotations (Fluency > Grammar Syntax > Missing > Function or Content word) and link them. First, create an annotation for the word that is also annotated as Accuracy > Mistranslation > Partial. Next, create an annotation on the word preceding the location of the missing word. Then draw a link from the first annotation to the second annotation. If the missing word is independent of the other words in the sentence, annotate only the location of the missing word.

    1.4.1 Function Word

    Determiners, prepositions, auxiliaries, conjunctions, pronouns.

    1.4.2 Content Word

    Nouns, verbs, adjectives, adverbs.

    1.5. Other

    Definition: Other grammar errors which do not belong to any of the subcategories above.

    Annotation: Select the words in the target sentence where a grammar error is identified.


    2. Lexicon

    Definition: Errors regarding the use of the lexicon in the Dutch language.

    2.1. Non-existing or Foreign Word

    Definition: The word(s) is not a part of the Dutch lexicon or is a foreign word. This error often occurs when the source word(s) is not translated into Dutch. However, the MT system can also generate a word that does not belong to either the source or the target language.

    Annotation: Select the whole word in target sentence.

    2.2. Lexical Choice

    Definition: The word(s) is a part of the Dutch lexicon but another word(s) should be used for generating a correct Dutch sentence.

    Annotation: Select the whole word in target sentence.

    2.2.1 Function Word

    Determiners, prepositions, auxiliaries, conjunctions, pronouns.

    2.2.2 Content Word

    Nouns, verbs, adjectives, adverbs.

    3. Orthography

    Definition: Errors regarding the conventions for writing the Dutch language.

    Be careful! If there is more than one type of orthography error in one word, select the word twice: once for each type. For example, when a compound is split up and spelled with a capital letter, you select the word once for 'capitalization' and once for 'compound'.

    3.1. Spelling

    Definition: Errors related to spelling.

    Annotation: Select the whole word in target sentence.

    3.1.1 Compounds

    Definition: Errors related to spelling of compounds.

    Be careful! When a punctuation mark is causing a compound error, annotate it as 'Spelling & Capitalization -> compound' and not as 'Punctuation'.

    3.1.2 Diacritics

    Definition: Errors related to the use of diacritics.

    Annotation: Select the entire word which contains issue(s) of diacritics.

    3.1.3 Other

    Definition: Other spelling error(s) which do not belong to any of the subcategories above.

    Annotation: Select the entire word which contains spelling error(s).

    3.2. Capitalization

    Definition: Errors related to capitalization in the Dutch language. Transfer of capitalization rules from the source text is handled separately.

    Annotation: Select the whole word in target sentence.

    Be careful! Orthography errors are visible on the target language level, without the need for checking the source language. If a capitalization error is caused by the wrong transfer of capitalization rules from the source text, it should be annotated as "Accuracy > Mechanical > Capitalization".

    3.3 Punctuation

    Definition: Errors related to punctuation.

    Annotation: Select the punctuation mark or symbol that is used unnecessarily or is placed incorrectly in the Dutch sentence. If a punctuation mark is missing, select the word preceding the correct location of the missing punctuation mark. As an exception, if the missing punctuation mark needs to appear at the beginning of the sentence, select the first word of the sentence and add 'first word' to your comments to indicate that this is an exception. Remember to provide the correct text in the 'notes' section.

    Be careful! Orthography errors are visible on the target language level, without the need for checking the source language. If a punctuation error is caused by wrong transfer, it should be annotated as "Accuracy > Mechanical > Punctuation".

    3.4 Other

    Definition: Other orthography errors which do not fall under any of the subcategories above.

    4. Multiple Fluency Errors

    Definition: A combination of errors that makes it difficult to annotate fluency errors separately.

    Be careful! Please first try to annotate specific errors if they can be identified. If it becomes difficult to identify specific errors because there is a chain of errors that affect each other, because the structure can be corrected in many different ways, or because the structure should be completely rephrased, select 'multiple errors'.

    Annotation: Select the whole span of words that contains multiple errors.

    5. Other Fluency Errors

    Definition: Other fluency errors, which do not belong to any of the above fluency error categories.
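The category hierarchy described in Steps 1 and 2 can be written down as a nested structure, which makes it possible to check whether a category path such as "Accuracy > Mistranslation > Partial" exists. The sketch below is built from the section headings of these guidelines. The labels configured in the brat-tool may differ, so the names and the `>` path syntax should be read as assumptions.

```python
# Hierarchy taken from the section headings; an empty dict marks a leaf.
TAXONOMY = {
    "Accuracy": {
        "Addition": {},
        "Omission": {},
        "Untranslated": {},
        "Do-Not-Translate": {},
        "Mistranslation": {
            "Multi-Word Expression": {},
            "Part of Speech": {},
            "Word Sense Disambiguation": {"Function Word": {}, "Content Word": {}},
            "Partial": {},
            "Semantically Unrelated": {},
            "Other": {},
        },
        "Mechanical": {"Capitalization": {}, "Punctuation": {}, "Other": {}},
        "Bilingual Terminology": {},
        "Source Error": {},
        "Other": {},
    },
    "Fluency": {
        "Grammar": {
            "Multi-Word Syntax": {},
            "Word Form": {},
            "Word Order": {},
            "Extra Words": {"Repetitions": {}, "Other": {}},
            "Missing Words": {"Function Word": {}, "Content Word": {}},
            "Other": {},
        },
        "Lexicon": {
            "Non-existing or Foreign Word": {},
            "Lexical Choice": {"Function Word": {}, "Content Word": {}},
        },
        "Orthography": {
            "Spelling": {"Compounds": {}, "Diacritics": {}, "Other": {}},
            "Capitalization": {},
            "Punctuation": {},
            "Other": {},
        },
        "Multiple Errors": {},
        "Other": {},
    },
}


def is_valid_path(path: str) -> bool:
    """Check that every step of 'Group > Category > ...' exists in TAXONOMY."""
    node = TAXONOMY
    for part in (piece.strip() for piece in path.split(">")):
        if part not in node:
            return False
        node = node[part]
    return True


print(is_valid_path("Accuracy > Mistranslation > Partial"))
print(is_valid_path("Fluency > Mechanical"))
```

A path may stop at a main category, which matches the rule that only the main category is selected when the subcategory is unclear.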




    1. The reports of the LT3 Technical Report Series (ISSN 2032-9717) are available from http://www.lt3.ugent.be/en/publications/ All rights reserved. LT3, Ghent University, Belgium.