Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
File Type: PDF
Item Type: Conference Paper
Date: 2018
Access: openAccess
Citation: Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga (eds.), European Language Resources Association (ELRA), 2018, 1119-1127
Abstract:
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be considered as a supervised task; (2) language scalability requires a streamlined software engineering process across languages.
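The "supervised task" framing corresponds to the character-level sequence-labeling view taken by Elephant: each character of the raw text receives a boundary label, and a labeler is trained on gold-tokenized data such as a Universal Dependencies treebank. The sketch below is hypothetical and is not the paper's tool; it only illustrates how per-character training labels could be derived from raw text plus its gold tokens. The function name and the simplified T/I/O label scheme (loosely modeled on Elephant's tag set) are assumptions for illustration.

```python
# Hypothetical sketch: derive per-character labels for supervised tokenization.
# Labels: "T" = token-initial character, "I" = token-internal, "O" = outside
# any token (e.g. whitespace). This scheme is illustrative, not the exact
# tag set used by the paper's tool.

def char_labels(text: str, tokens: list[str]) -> list[str]:
    """Align gold tokens against the raw text and label every character."""
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)           # locate the token in the raw text
        labels[start] = "T"                    # token-initial character
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"                    # token-internal characters
        pos = start + len(tok)
    return labels

if __name__ == "__main__":
    text = "Don't stop."
    tokens = ["Do", "n't", "stop", "."]        # UD-style gold tokenization
    for ch, lab in zip(text, char_labels(text, tokens)):
        print(repr(ch), lab)
```

Pairs of (character, label) like these form the training instances for a sequence labeler; at prediction time, the labels assigned to unseen text directly induce the token boundaries, which is what makes the same pipeline reusable across every UD language.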
URI:
http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html
http://hdl.handle.net/2262/91610
Sponsor: Science Foundation Ireland (SFI)
Grant Number: 13/RC/2106
Author's Homepage:
http://people.tcd.ie/vogel
http://people.tcd.ie/moreaue
Author: Vogel, Carl; Moreau, Erwan
Other Titles: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Publisher: European Language Resources Association (ELRA)
Type of material: Conference Paper
Availability: Full text available
Keywords: Universal dependencies, Word segmentation, Tokenization, Multilinguality, Interoperability
Subject (TCD): Digital Humanities, Computational linguistics