dc.description.abstract | Learning from natural language is one of the great challenges of Natural Language Processing (NLP) and Machine Learning (ML). Word meanings evolve over time, and one of the challenges is how to model such dynamic behaviour. The task of Diachronic Word Sense Induction (DWSI) aims at learning the meanings of words across time, i.e. representing the dynamic evolution of a word sense. The word meaning is inferred in an unsupervised manner from time-stamped examples. Many other NLP tasks, such as machine translation, question answering, information retrieval and text classification, can be affected by word sense change.
This thesis addresses the problem of modelling the meaning changes of ambiguous target words from unlabelled time-stamped text. The modelling techniques used for DWSI rely on Bayesian hierarchical mixture models and are closely related to Topic Modelling techniques. The random variables of DWSI models are the time Y, the sense S, and the context words W surrounding the target word. The sense is the latent variable in this dynamic story, in which a target word acquires new senses and/or loses old senses over time. The existing DWSI models assume that sense changes can be dependent on time, represented either by the multinomial probability distribution over words given sense, P(W|S), or by the multinomial probability distribution over words given sense and time, P(W|S,Y). It is also assumed that the sense proportions change over time, represented by the multinomial probability distribution over senses given time, P(S|Y).
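As a minimal sketch of the generative story these models share (the notation here is assumed for illustration, not taken from the thesis), for an occurrence i of the target word with time stamp t_i, a sense is first drawn from the time-specific sense proportions, and each context word is then drawn from the word distribution of that sense:

\[
s_i \sim \mathrm{Mult}\big(P(S \mid Y = t_i)\big), \qquad
w_{ij} \sim \mathrm{Mult}\big(P(W \mid S = s_i)\big) \ \text{or} \ w_{ij} \sim \mathrm{Mult}\big(P(W \mid S = s_i, Y = t_i)\big),
\]

depending on whether the given model treats the sense-to-word distribution as static or time-dependent.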
The main disadvantage of the existing models is that they are parametric, in the sense that the number of senses is a hyperparameter which has to be known a priori. This is not ideal given the nature of the DWSI task, which is meant to infer senses (unobserved variables) from unlabelled data in their optimal representations. For example, one of the parametric methods relies on Dirichlet priors, while the other relies on priors defined as intrinsic Gaussian Markov Random Fields, which add artificial constraints on the estimation of P(W|S,Y) as well as P(S|Y).
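As a minimal sketch of such an intrinsic Gaussian Markov Random Field prior (a first-order random walk over time slices; the notation is assumed for illustration), the time series of parameters \psi^{1:T} governing a distribution receives the density

\[
p(\psi^{1:T}) \;\propto\; \exp\!\Big(-\tfrac{\kappa}{2} \sum_{t=2}^{T} \big\|\psi^{t} - \psi^{t-1}\big\|^{2}\Big),
\]

which ties the parameters of adjacent time slices together and thereby smooths, and constrains, the estimated evolution of P(W|S,Y) and P(S|Y).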
This thesis also addresses the issue of DWSI evaluation. This is a very challenging problem, since a reliable quantitative evaluation requires a large amount of sense-annotated and time-stamped data. I propose the first quantitative evaluation framework for DWSI, allowing systematic and objective comparisons between models. I introduce a wide range of evaluation measures and a novel method for collecting a gold-standard dataset from large sequential collections of scientific publications in a new domain, the biomedical domain. This evaluation framework also allows comparisons with any future Bayesian models related to DWSI. The new evaluation measures are validated on the task, with detailed comparisons between the state-of-the-art (SOA) models showing their respective strengths and weaknesses on various aspects of the task. In particular, the results demonstrate that the complexity of the time dimension, together with the parametric constraints, does not lead to an accurate estimation of the evolution of senses across time. I then propose four DWSI models with different properties, based on Topic Modelling techniques. I advance the SOA by redefining the task through a novel dynamic approach that first models P(W|S) using non-parametric priors (hence the time-dependent representations are calculated after estimating P(W|S)), and I achieve a new SOA with the model I designed.
Firstly, I investigate the issue of the number of senses and propose models based on hierarchical Dirichlet process priors. This assumes that an infinite number of senses is possible in theory, allowing a word to be assigned to a new sense during the inference process. It also assumes that the corpus is subdivided into a set of groups and that the senses are shared among multiple related groups of documents.
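As a minimal sketch of this construction (standard hierarchical Dirichlet process notation, assumed here for illustration), a corpus-level measure G_0 provides a shared, countably infinite inventory of senses, and each group j of documents draws its own distribution G_j over that inventory:

\[
G_0 \sim \mathrm{DP}(\gamma, H), \qquad
G_j \sim \mathrm{DP}(\alpha_0, G_0), \qquad
\theta_{ji} \sim G_j, \qquad
w_{ji} \sim \mathrm{Mult}(\theta_{ji}),
\]

so all groups share the same sense atoms while weighting them differently, and a new sense can be instantiated whenever the data require it.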
Such a model has two advantages: first, it finds, during the inference process, the optimal number of senses that fits the data; second, it allows a high-quality merging when the desired number of senses is known. This thesis assumes that these properties contribute to a more accurate representation of the meaning of the words, in turn leading to better clustering of the senses across time. Indeed, P(S|Y) is calculated after estimating the models in both modes, non-parametric and parametric. The results demonstrate that such properties offer a dramatic performance gain compared to the bespoke DWSI models. Another issue common to other parametric models is choosing the number of features (words) a priori. This comes with the disadvantages of handling the imbalance in word frequency as well as the sparsity introduced into the data by "one-hot" representations of words. Thus, word embeddings are another direction of potential improvement, as they provide a distributed representation in which words with similar meanings are close in a lower-dimensional vector space. Secondly, I investigate parametric dynamic models which compute the multinomial probability distribution P(W|S,Y) as the exponentiated inner product between word embeddings and per-time sense embeddings.
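As a minimal sketch of this parameterisation (the notation is assumed for illustration), with an embedding \rho_w for word w and a per-time embedding \alpha_s^{(t)} for sense s at time t, the distribution over words is the softmax of their inner product:

\[
P(W = w \mid S = s, Y = t) \;=\; \frac{\exp\!\big(\rho_w^{\top} \alpha_s^{(t)}\big)}{\sum_{w'} \exp\!\big(\rho_{w'}^{\top} \alpha_s^{(t)}\big)},
\]

so words whose embeddings align with the current sense embedding receive higher probability, and the sense representation is free to change across time through \alpha_s^{(t)}.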
Indeed, the results demonstrate an improvement when the model is provided with prefitted embeddings rather than with embeddings trained simultaneously. Lastly, I conclude that the models based on hierarchical Dirichlet processes show drastically better results and clusters, even when compared with a model based on high-quality, domain-specific pretrained embeddings. | en |