Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus
File Type: PDF
Item Type: Conference Paper
Date: 2018
Access: openAccess
Citation: Moreau, E. & Vogel, C., Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga (eds.), European Language Resources Association (ELRA), 2018, 1119-1127
Abstract:
This paper describes how a tokenizer can be trained from any dataset in the Universal Dependencies 2.1 corpus (UD2) (Nivre et al., 2017). A software tool, which relies on Elephant (Evang et al., 2013) to perform the training, is also made available. Beyond providing the community with a large choice of language-specific tokenizers, we argue in this paper that: (1) tokenization should be considered as a supervised task; (2) language scalability requires a streamlined software engineering process across languages.
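The "supervised task" framing corresponds to the character-level sequence-labeling view taken by Elephant: each character of the raw text receives a boundary label, and a labeler is trained on gold-tokenized data such as a Universal Dependencies treebank. The sketch below is hypothetical and is not the paper's tool; it only illustrates how per-character training labels could be derived from raw text plus its gold tokens. The function name and the simplified T/I/O label scheme (loosely modeled on Elephant's tag set) are assumptions for illustration.

```python
# Hypothetical sketch: derive per-character labels for supervised tokenization.
# Labels: "T" = token-initial character, "I" = token-internal, "O" = outside
# any token (e.g. whitespace). This scheme is illustrative, not the exact
# tag set used by the paper's tool.

def char_labels(text: str, tokens: list[str]) -> list[str]:
    """Align gold tokens against the raw text and label every character."""
    labels = ["O"] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)           # locate the token in the raw text
        labels[start] = "T"                    # token-initial character
        for i in range(start + 1, start + len(tok)):
            labels[i] = "I"                    # token-internal characters
        pos = start + len(tok)
    return labels

if __name__ == "__main__":
    text = "Don't stop."
    tokens = ["Do", "n't", "stop", "."]        # UD-style gold tokenization
    for ch, lab in zip(text, char_labels(text, tokens)):
        print(repr(ch), lab)
```

Pairs of (character, label) like these form the training instances for a sequence labeler; at prediction time, the labels assigned to unseen text directly induce the token boundaries, which is what makes the same pipeline reusable across every UD language.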
URI:
http://www.lrec-conf.org/proceedings/lrec2018/summaries/1072.html
http://hdl.handle.net/2262/91610
Sponsor: Science Foundation Ireland (SFI)
Grant Number: 13/RC/2106
Author's Homepage:
http://people.tcd.ie/vogel
http://people.tcd.ie/moreaue
Author: Vogel, Carl; Moreau, Erwan
Other Titles: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Publisher: European Language Resources Association (ELRA)
Type of material: Conference Paper
Availability: Full text available
Keywords: Universal dependencies, Word segmentation, Tokenization, Multilinguality, Interoperability
Subject (TCD): Digital Humanities, Computational linguistics