Information about Meteor

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. In addition to standard exact word matching, it has several features that are not found in other metrics, such as stemming and synonymy matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level. This differs from BLEU, which seeks correlation at the corpus level.
Example alignment (a).
Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.[1]
Example alignment (b).

Algorithm

As with BLEU, the basic unit of evaluation is the sentence. The algorithm first creates an alignment (see illustrations) between two sentences: the candidate translation string and the reference translation string. The alignment is a set of mappings between unigrams. A mapping can be thought of as a line between a unigram in one string and a unigram in the other string. The constraint is that every unigram in the candidate translation must map to zero or one unigrams in the reference translation, and vice versa; in any alignment, a unigram in one string cannot map to more than one unigram in the other string.
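
The mapping constraint can be illustrated with a minimal Python sketch. It assumes an alignment is represented as a set of (candidate index, reference index) pairs; this representation and the function name `is_valid_alignment` are illustrative choices, not something defined by METEOR itself.

```python
def is_valid_alignment(mappings):
    """Return True if no unigram position takes part in more than one mapping.

    `mappings` is a set of (candidate_index, reference_index) pairs
    (a hypothetical representation chosen for this sketch).
    """
    cand_positions = [c for c, _ in mappings]
    ref_positions = [r for _, r in mappings]
    return (len(set(cand_positions)) == len(cand_positions)
            and len(set(ref_positions)) == len(ref_positions))


candidate = "the cat was sat on the mat".split()
reference = "the cat sat on the mat".split()

# "was" (candidate position 2) is left unmapped, which the constraint allows.
alignment = {(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)}
print(is_valid_alignment(alignment))         # True

# Mapping both candidate "the" tokens to reference position 0 is not allowed.
print(is_valid_alignment({(0, 0), (5, 0)}))  # False
```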

An alignment is created incrementally through a series of stages, which are controlled by modules. A module is simply a matching algorithm; for example, the "wn_synonymy" module maps synonyms using WordNet, while the "exact" module matches exact words. Examples are given as follows:

Examples of pairs of words which will be mapped by each module:

Module     Candidate  Reference  Match
Exact      good       good       Yes
Stemmer    goods      good       Yes
Synonymy   well       good       Yes

Each stage is split into two phases. In the first phase, all possible unigram mappings are collected for the module being used in this stage. In the second phase, the largest subset of these mappings is selected to produce an alignment as defined above. If there are two alignments with the same number of mappings, the one with the fewest crosses is chosen, that is, the one with the fewest intersections between pairs of mappings. From the two alignments shown, alignment (a) would be selected at this point. Stages are run consecutively, and each stage only adds to the alignment those unigrams which have not been matched in previous stages. Once the final alignment is computed, the score is computed as follows. Unigram precision is calculated as:


$$P = \frac{m}{w_t}$$

where $m$ is the number of unigrams in the candidate translation that are also found in the reference translation, and $w_t$ is the number of unigrams in the candidate translation. Unigram recall is computed as:

$$R = \frac{m}{w_r}$$

where $m$ is as above, and $w_r$ is the number of unigrams in the reference translation. Precision and recall are combined using the harmonic mean in the following fashion, with recall weighted 9 times more than precision:

$$F_{mean} = \frac{10PR}{R + 9P}$$
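
As a rough illustration, unigram precision, unigram recall and their recall-weighted harmonic mean can be computed as in the Python sketch below. The function name `precision_recall_fmean` and its parameters are hypothetical, and the returned triple is a choice made for this example only. The count of matched unigrams is assumed to come from an alignment that has already been computed.

```python
def precision_recall_fmean(matches, candidate_len, reference_len):
    """Return (precision, recall, F-mean) from unigram counts.

    matches       -- number of mapped unigrams in the final alignment
    candidate_len -- number of unigrams in the candidate translation
    reference_len -- number of unigrams in the reference translation
    """
    if matches == 0:
        # No matched unigrams: all three quantities are taken to be zero
        # (an assumption of this sketch, to avoid dividing by zero).
        return 0.0, 0.0, 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean with recall weighted 9 times more than precision.
    fmean = 10 * precision * recall / (recall + 9 * precision)
    return precision, recall, fmean


# Candidate "the cat was sat on the mat" against reference
# "the cat sat on the mat": 6 matches, 7 candidate and 6 reference unigrams.
p, r, f = precision_recall_fmean(6, 7, 6)
print(f"P={p:.4f} R={r:.4f} Fmean={f:.4f}")  # P=0.8571 R=1.0000 Fmean=0.9836
```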



The measures that have been introduced so far only account for congruity with respect to single words, not with respect to larger segments that appear in both the reference and the candidate sentence. In order to take these into account, longer n-gram matches are used to compute a penalty for the alignment. The more mappings there are that are not adjacent in the reference and the candidate sentence, the higher the penalty will be.

In order to compute this penalty, unigrams are grouped into the fewest possible chunks, where a chunk is defined as a set of unigrams that are adjacent in the candidate and in the reference. The longer the adjacent mappings between the candidate and the reference, the fewer chunks there are. A translation that is identical to the reference will give just one chunk. The penalty is computed as follows:

$$p = 0.5 \left( \frac{c}{u_m} \right)^3$$

where $c$ is the number of chunks, and $u_m$ is the number of unigrams that have been mapped. The final score for a segment is calculated as below. The penalty has the effect of reducing the $F_{mean}$ by up to 50% if there are no bigram or longer matches.

$$M = F_{mean}(1 - p)$$
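
The chunk count, the penalty and the final segment score can be sketched in Python as follows. This is only an illustration under stated assumptions: the alignment is supplied by hand as a set of (candidate index, reference index) pairs, and the names `count_chunks` and `meteor_score` are hypothetical rather than taken from any METEOR implementation.

```python
def count_chunks(mappings):
    """Count maximal runs of mappings adjacent in both candidate and reference."""
    chunks = 0
    previous = None
    for cand_idx, ref_idx in sorted(mappings):  # walk in candidate order
        # A new chunk starts unless this mapping directly continues the
        # previous one in both strings.
        if previous is None or (cand_idx, ref_idx) != (previous[0] + 1, previous[1] + 1):
            chunks += 1
        previous = (cand_idx, ref_idx)
    return chunks


def meteor_score(mappings, candidate_len, reference_len):
    """Segment score: F-mean reduced by the fragmentation penalty."""
    matches = len(mappings)
    if matches == 0:
        return 0.0  # assumption of this sketch: no matches scores zero
    precision = matches / candidate_len
    recall = matches / reference_len
    fmean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(mappings) / matches) ** 3
    return fmean * (1 - penalty)


# Candidate "the cat was sat on the mat", reference "the cat sat on the mat".
with_insertion = {(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)}
print(count_chunks(with_insertion))                   # 2
print(f"{meteor_score(with_insertion, 7, 6):.4f}")    # 0.9654

# Candidate identical to the reference: a single chunk.
identical = {(i, i) for i in range(6)}
print(f"{meteor_score(identical, 6, 6):.4f}")         # 0.9977
```

The two printed scores agree with the corresponding worked figures in the Examples section.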



To calculate a score over a whole corpus, or collection of segments, the aggregate values for precision ($P$), recall ($R$) and the penalty ($p$) are taken and then combined using the same formula. The algorithm also works for comparing a candidate translation against more than one reference translation. In this case, the algorithm compares the candidate against each of the references and selects the highest score.
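
Scoring against several references can be sketched in Python as below. The scorer `exact_fmean` is a deliberately simplified stand-in (exact unigram matches only, no alignment stages and no penalty) so that the example is self-contained; both it and `best_reference_score` are hypothetical names, and any segment-level scoring function could be passed in instead.

```python
from collections import Counter


def exact_fmean(candidate, reference):
    """Toy stand-in scorer: recall-weighted F-mean over exact unigram matches."""
    cand, ref = candidate.split(), reference.split()
    # Multiset intersection counts each word at most as often as it
    # appears in both strings.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


def best_reference_score(candidate, references, score_fn=exact_fmean):
    """Score the candidate against each reference and keep the highest score."""
    return max(score_fn(candidate, ref) for ref in references)


references = ["the cat sat on the mat", "a cat was sitting on the mat"]
print(best_reference_score("the cat sat on the mat", references))  # 1.0
```

Note that `max` raises a `ValueError` if the list of references is empty, so a caller would need at least one reference.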

Examples

Reference:  the cat sat on the mat
Hypothesis: on the mat sat the cat


Score: 0.5000 = Fmean: 1.0000 * (1 - Penalty: 0.5000)
Fmean: 1.0000 = 10 * Precision: 1.0000 * Recall: 1.0000 / (Recall: 1.0000 + 9 * Precision: 1.0000)
Penalty: 0.5000 = 0.5 * (Fragmentation: 1.0000 ^3)
Fragmentation: 1.0000 = Chunks: 6.0000 / Matches: 6.0000


Reference:  the cat sat on the mat
Hypothesis: the cat sat on the mat


Score: 0.9977 = Fmean: 1.0000 * (1 - Penalty: 0.0023)
Fmean: 1.0000 = 10 * Precision: 1.0000 * Recall: 1.0000 / (Recall: 1.0000 + 9 * Precision: 1.0000)
Penalty: 0.0023 = 0.5 * (Fragmentation: 0.1667 ^3) 
Fragmentation: 0.1667 = Chunks: 1.0000 / Matches: 6.0000


Reference:  the cat sat on the mat
Hypothesis: the cat was sat on the mat


Score: 0.9654 = Fmean: 0.9836 * (1 - Penalty: 0.0185)
Fmean: 0.9836 = 10 * Precision: 0.8571 * Recall: 1.0000 / (Recall: 1.0000 + 9 * Precision: 0.8571)
Penalty: 0.0185 = 0.5 * (Fragmentation: 0.3333 ^3)
Fragmentation: 0.3333 = Chunks: 2.0000 / Matches: 6.0000

Notes

  1. Banerjee, S. and Lavie, A. (2005)

References

  • Banerjee, S. and Lavie, A. (2005) "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.
  • Lavie, A., Sagae, K. and Jayaraman, S. (2004) "The Significance of Recall in Automatic Metrics for MT Evaluation" in Proceedings of AMTA 2004, Washington, DC, September 2004.

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.