Data Driven Models for Language Evolution
Delmestri, Antonella
Languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words that share the same ancestor and etymology. The evolutionary history of languages may therefore be discovered through cognate identification and estimated through phylogenetic inference. Using several techniques originally designed for biological sequence analysis, an orthographic learning system for measuring string similarity has been developed and successfully applied to these tasks. Using PAM-like matrices, the system has outperformed the best comparable phonetic and orthographic cognate identification models previously reported in the literature, with statistically significant results that remain remarkably stable regardless of the size of the training dataset. The method has also inferred high-quality Indo-European phylogenies, which are compatible with the benchmark tree and correctly reproduce all the established major language groups and subgroups present in the dataset. This book focuses on computational historical linguistics, but its contribution is also relevant to computational linguistics and natural language processing.
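To make the idea of borrowing biological sequence analysis for orthographic string similarity concrete, here is a minimal sketch in Python. It is not the system described in the book. It assumes a standard global alignment (Needleman-Wunsch) and a hand-written toy scoring function in place of the learned PAM-like matrices; the function names, the score values, and the normalisation by self-alignment scores are all illustrative assumptions.

```python
VOWELS = set("aeiou")


def substitution_score(a: str, b: str) -> int:
    """Toy stand-in for a learned PAM-like matrix (assumed values)."""
    if a == b:
        return 2   # identical letters
    if a in VOWELS and b in VOWELS:
        return 0   # vowel-for-vowel change, treated as cheap
    return -1      # any other substitution


def alignment_score(s: str, t: str, gap: int = -1) -> int:
    """Global alignment score via Needleman-Wunsch dynamic programming."""
    # prev[j] holds the best score for aligning the current prefix of s
    # with the first j letters of t.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + substitution_score(s[i - 1], t[j - 1]),  # substitute
                prev[j] + gap,                                         # gap in t
                curr[j - 1] + gap,                                     # gap in s
            ))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Alignment score normalised by the mean self-alignment score (assumed scheme)."""
    self_mean = (alignment_score(s, s) + alignment_score(t, t)) / 2
    return alignment_score(s, t) / self_mean if self_mean else 0.0


if __name__ == "__main__":
    # English "night" and German "nacht" are cognates; "dog" is unrelated.
    for pair in [("night", "nacht"), ("night", "dog")]:
        print(pair, round(similarity(*pair), 3))
```

Under this toy scoring, the cognate pair receives a higher similarity than the unrelated pair. Pairwise scores of this kind could be thresholded for cognate identification or turned into a distance matrix for phylogenetic inference, which is the general pipeline the abstract outlines.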