dc.contributor.author | Vogel, Carl | |
dc.contributor.author | Moreau, Erwan | |
dc.contributor.editor | Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga | en |
dc.date.accessioned | 2020-02-21T16:50:14Z | |
dc.date.available | 2020-02-21T16:50:14Z | |
dc.date.created | May 7-12, 2018 | en |
dc.date.issued | 2018 | |
dc.date.submitted | 2018 | en |
dc.identifier.citation | Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga, European Language Resources Association (ELRA), 2018, 1119-1127 | en |
dc.identifier.other | Y | |
dc.description.abstract | This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be considered as a supervised task; (2) language scalability requires a streamlined software engineering process across languages. | en |
dc.format.extent | 1119-1127 | en |
dc.language.iso | en | en |
dc.publisher | European Language Resources Association (ELRA) | en |
dc.rights | Y | en |
dc.subject | Universal dependencies | en |
dc.subject | Word segmentation | en |
dc.subject | Tokenization | en |
dc.subject | Multilinguality | en |
dc.subject | Interoperability | en |
dc.title | Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus | en |
dc.title.alternative | Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) | en |
dc.type | Conference Paper | en |
dc.type.supercollection | scholarly_publications | en |
dc.type.supercollection | refereed_publications | en |
dc.identifier.peoplefinderurl | http://people.tcd.ie/vogel | |
dc.identifier.peoplefinderurl | http://people.tcd.ie/moreaue | |
dc.identifier.rssinternalid | 188594 | |
dc.rights.ecaccessrights | openAccess | |
dc.subject.TCDTheme | Digital Humanities | en |
dc.subject.TCDTag | Computational linguistics | en |
dc.identifier.orcid_id | 0000--000-8928-8546 | |
dc.status.accessible | N | en |
dc.contributor.sponsor | Science Foundation Ireland (SFI) | en |
dc.contributor.sponsorGrantNumber | 13/RC/2106 | en |
dc.identifier.uri | http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html | |
dc.identifier.uri | http://hdl.handle.net/2262/91610 | |