dc.contributor.author | Vogel, Carl | |
dc.contributor.author | Moreau, Erwan | |
dc.contributor.editor | Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga | en |
dc.date.accessioned | 2020-02-21T16:50:14Z | |
dc.date.available | 2020-02-21T16:50:14Z | |
dc.date.created | May 7-12, 2018 | en |
dc.date.issued | 2018 | |
dc.date.submitted | 2018 | en |
dc.identifier.citation | Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association (ELRA), 2018, 1119-1127 | en |
dc.identifier.other | Y | |
dc.description.abstract | This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be considered as a supervised task; (2) language scalability requires a streamlined software engineering process across languages. | en |
dc.format.extent | 1119-1127 | en |
dc.language.iso | en | en |
dc.publisher | European Language Resources Association (ELRA) | en |
dc.rights | Y | en |
dc.subject | Universal dependencies | en |
dc.subject | Word segmentation | en |
dc.subject | Tokenization | en |
dc.subject | Multilinguality | en |
dc.subject | Interoperability | en |
dc.title | Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus | en |
dc.title.alternative | Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) | en |
dc.type | Conference Paper | en |
dc.type.supercollection | scholarly_publications | en |
dc.type.supercollection | refereed_publications | en |
dc.identifier.peoplefinderurl | http://people.tcd.ie/vogel | |
dc.identifier.peoplefinderurl | http://people.tcd.ie/moreaue | |
dc.identifier.rssinternalid | 188594 | |
dc.rights.ecaccessrights | openAccess | |
dc.subject.TCDTheme | Digital Humanities | en |
dc.subject.TCDTag | Computational linguistics | en |
dc.identifier.orcid_id | 0000--000-8928-8546 | |
dc.status.accessible | N | en |
dc.contributor.sponsor | Science Foundation Ireland (SFI) | en |
dc.contributor.sponsorGrantNumber | 13/RC/2106 | en |
dc.identifier.uri | http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html | |
dc.identifier.uri | http://hdl.handle.net/2262/91610 | |