Is all that glitters in MT quality estimation really gold standard?

Graham, Yvette

dc.contributor.author	Graham, Yvette
dc.date.accessioned	2021-04-20T12:21:21Z
dc.date.available	2021-04-20T12:21:21Z
dc.date.created	11/12/16	en
dc.date.issued	2016
dc.date.submitted	2016	en
dc.identifier.citation	Graham, Yvette, Baldwin, Timothy, Dowling, Meghan, Eskevich, Maria, Lynn, Teresa and Tounsi, Lamia (2016) Is all that glitters in MT quality estimation really gold standard? In: 26th International Conference on Computational Linguistics, 11-17 Dec 2016, Osaka, Japan	en
dc.identifier.isbn	978-4-87974-702-0
dc.identifier.other	Y
dc.description.abstract	Human-targeted metrics provide a compromise between human evaluation of machine translation, where high inter-annotator agreement is difficult to achieve, and fully automatic metrics, such as BLEU or TER, that lack the validity of human assessment. Human-targeted translation edit rate (HTER) is by far the most widely employed human-targeted metric in machine translation, commonly used, for example, as the gold standard in quality estimation evaluation. The original experiments justifying the design of HTER, as opposed to other possible formulations, were, however, limited to a small sample of translations and a single language pair, which motivates our re-evaluation of a range of human-targeted metrics on a substantially larger scale. Results show significantly stronger correlation with human judgment for HBLEU than for HTER in two of the nine language pairs we include, and no significant difference between the correlations achieved by HTER and HBLEU for the remaining language pairs. Finally, we evaluate a range of quality estimation systems using HTER and direct assessment (DA) of translation adequacy as gold labels, find that the two lead to divergent system rankings, and propose the use of DA for future quality estimation evaluations.	en
dc.format.extent	3124-3134	en
dc.language.iso	en	en
dc.rights	Y	en
dc.subject	Machine Learning	en
dc.title	Is all that glitters in MT quality estimation really gold standard?	en
dc.title.alternative	Proceedings of the 26th International Conference on Computational Linguistics (COLING)	en
dc.title.alternative	26th International Conference on Computational Linguistics (COLING)	en
dc.type	Conference Paper	en
dc.type.supercollection	scholarly_publications	en
dc.type.supercollection	refereed_publications	en
dc.identifier.peoplefinderurl	http://people.tcd.ie/ygraham
dc.identifier.rssinternalid	227712
dc.rights.ecaccessrights	openAccess
dc.identifier.orcid_id	0000-0001-6741-4855
dc.identifier.uri	https://www.aclweb.org/anthology/C16-1
dc.identifier.uri	http://hdl.handle.net/2262/96107
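HTER, as referred to in the abstract, is TER computed between a machine translation output and a human post-edit of that same output, normalised by the length of the post-edit. The sketch below is a simplified illustration only: it counts word insertions, deletions and substitutions, whereas full TER also allows block shifts, and the helper names `word_edit_distance` and `simplified_hter` are hypothetical, not taken from the paper or from any official TER tool.

```python
def word_edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Minimum number of word insertions, deletions and substitutions
    that turn hyp into ref (no block shifts, unlike full TER)."""
    prev = list(range(len(ref) + 1))  # distances from the empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i] + [0] * len(ref)
        for j, r in enumerate(ref, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete hypothesis word h
                curr[j - 1] + 1,         # insert reference word r
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            )
        prev = curr
    return prev[-1]


def simplified_hter(mt_output: str, post_edit: str) -> float:
    """Edits needed to turn the MT output into its human post-edit,
    divided by the number of words in the post-edit (hypothetical helper)."""
    hyp = mt_output.split()
    ref = post_edit.split()
    if not ref:
        raise ValueError("post-edit must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)
```

For instance, `simplified_hter("the cat sat on mat", "the cat sat on the mat")` needs one insertion against a six-word post-edit, giving 1/6. Because this is an error rate, lower scores indicate better translations.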

Files in this item

Name:: C16-1294.pdf
Size:: 211.1Kb
Format:: PDF

Name:: license.txt
Size:: 3.424Kb
Format:: Text file

This item appears in the following Collection(s)

Computer Science (Scholarly Publications)