Field | Value | Language
dc.contributor.author | Vogel, Carl |
dc.contributor.author | Moreau, Erwan |
dc.contributor.editor | Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga | en
dc.date.accessioned | 2020-02-21T16:50:14Z |
dc.date.available | 2020-02-21T16:50:14Z |
dc.date.created | May 7-12, 2018 | en
dc.date.issued | 2018 |
dc.date.submitted | 2018 | en
dc.identifier.citation | Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association (ELRA), 2018, 1119-1127 | en
dc.identifier.other | Y |
dc.description.abstract | This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be considered as a supervised task; (2) language scalability requires a streamlined software engineering process across languages. | en
dc.format.extent | 1119-1127 | en
dc.language.iso | en | en
dc.publisher | European Language Resources Association (ELRA) | en
dc.rights | Y | en
dc.subject | Universal dependencies | en
dc.subject | Word segmentation | en
dc.subject | Tokenization | en
dc.subject | Multilinguality | en
dc.subject | Interoperability | en
dc.title | Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus | en
dc.title.alternative | Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) | en
dc.type | Conference Paper | en
dc.type.supercollection | scholarly_publications | en
dc.type.supercollection | refereed_publications | en
dc.identifier.peoplefinderurl | http://people.tcd.ie/vogel |
dc.identifier.peoplefinderurl | http://people.tcd.ie/moreaue |
dc.identifier.rssinternalid | 188594 |
dc.rights.ecaccessrights | openAccess |
dc.subject.TCDTheme | Digital Humanities | en
dc.subject.TCDTag | Computational linguistics | en
dc.identifier.orcid_id | 0000--000-8928-8546 |
dc.status.accessible | N | en
dc.contributor.sponsor | Science Foundation Ireland (SFI) | en
dc.contributor.sponsorGrantNumber | 13/RC/2106 | en
dc.identifier.uri | http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html |
dc.identifier.uri | http://hdl.handle.net/2262/91610 |