Evolution of Spoken Language Decrypted and Reconstructed by Computer Algorithm
Archaeologists will gain a powerful new tool in their quest to recover lost ancestral languages: combining machine-learning algorithms with large linguistic datasets has proved successful in this field and will automate a process that until now took decades. Spoken language changes continuously, and ancient languages eventually die out, but they do not vanish without a trace: even modern languages still carry the roots of those proto-languages.
Computer scientists at the University of California, Berkeley, succeeded in creating an automated “time machine” by feeding huge amounts of known modern and ancient vocabulary and grammar to their self-learning algorithms, which cross-reference the material and find patterns in the chaos.
The computational model, which uses probabilistic reasoning (combining logic and statistics to infer likely outcomes), was able to reconstruct the vocabulary of Proto-Austronesian, the ancestor of more than 600 Austronesian languages, including those spoken in Polynesia, from an existing database of more than 140,000 words. While manual reconstruction is a meticulous process that can take years, the computer system can perform large-scale reconstructions in a matter of days or even hours, replicating with 85 percent accuracy what linguists had done manually. This could accelerate the reconstruction of hundreds more proto-languages (such as Proto-Indo-European and Proto-Afroasiatic), the linguistic ancestors from which all modern languages have evolved and the earliest-known languages.
“What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time,” said Dan Klein, co-author of the paper published online on February 11 in the journal Proceedings of the National Academy of Sciences.
The new program, developed in cooperation with University of British Columbia researchers, could not only boost our understanding of ancient civilizations, based on their vocabularies, by reconstructing the world’s proto-languages on a large scale, but also provide clues to how languages might change years from now.
“Our statistical model can be used to answer scientific questions about languages over time, not only to make inferences about the past, but also to extrapolate how language might change in the future,” said Tom Griffiths, associate professor of psychology, director of UC Berkeley’s Computational Cognitive Science Lab and another co-author of the paper.
The theoretical background of the computational model is the established linguistic theory that words evolve along the branches of a family tree – much like a genealogical tree – reflecting linguistic relationships that evolve over time, with the roots and nodes representing proto-languages and the leaves representing modern languages.
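As an illustration of that tree structure, a toy fragment can be written as a simple nested data structure: internal nodes stand for (reconstructed) proto-languages, leaves for attested modern languages. The language names are real, but this particular shape is a simplified sketch, not the phylogeny the researchers actually used.

```python
# Toy fragment of a language family tree: internal nodes are
# proto-languages, leaves are modern languages. Illustrative only.
family_tree = {
    "Proto-Austronesian": {
        "Proto-Malayo-Polynesian": {
            "Proto-Oceanic": {
                "Hawaiian": {},
                "Maori": {},
            },
            "Malay": {},
        },
    },
}

def leaves(tree):
    """Return the modern languages (leaf nodes) of the tree."""
    result = []
    for name, children in tree.items():
        if not children:
            result.append(name)
        else:
            result.extend(leaves(children))
    return result

print(leaves(family_tree))  # ['Hawaiian', 'Maori', 'Malay']
```

The model's task is essentially to fill in the word forms at the internal nodes of such a tree, given only the words observed at the leaves.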
Based on this structure, the program sorts through sets of cognates, words in different languages that share a common sound, history and origin, and calculates the probability that each set derives from a given proto-language, using an algorithm known as a Markov chain Monte Carlo sampler. Over many iterations, storing a hypothesized reconstruction for each cognate and each ancestral language at every step, it converges on an ever more precise picture of how the languages evolved.
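The Markov chain Monte Carlo idea can be caricatured in a deliberately tiny sketch: score candidate ancestral forms by how cheaply they explain the observed cognates, then let a Metropolis sampler wander over the candidates and keep the hypothesis it visits most often. The cognate set, the candidate list and the edit-distance score are all toy assumptions standing in for the paper's actual sound-change model.

```python
import math
import random

# Toy cognate set (Polynesian-style words for "five"); the real system
# uses thousands of cognate sets and a learned model of sound change.
cognates = ["lima", "rima", "lima"]

# Hypothetical candidate ancestral forms (an assumption of this sketch).
candidates = ["lima", "rima", "nima", "lama"]

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def log_score(form):
    # Stand-in likelihood: ancestral forms close to all observed
    # cognates score higher.
    return -sum(edit_distance(form, w) for w in cognates)

random.seed(0)
current = random.choice(candidates)
counts = {c: 0 for c in candidates}
for _ in range(5000):
    proposal = random.choice(candidates)  # symmetric proposal
    # Metropolis rule: always accept improvements, sometimes accept
    # worse hypotheses, so the chain explores the whole space.
    if random.random() < math.exp(log_score(proposal) - log_score(current)):
        current = proposal
    counts[current] += 1

reconstruction = max(counts, key=counts.get)
print(reconstruction)  # most frequently visited hypothesis: "lima"
```

Because "lima" explains the three cognates with the fewest sound changes, the chain spends most of its time there, which is the sampler's (toy) reconstruction.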
“Because the sound changes and reconstructions are closely linked, our system uses them to repeatedly improve each other,” Klein said. “It first fixes its predicted sound changes and deduces better reconstructions of the ancient forms. It then fixes the reconstructions and re-analyzes the sound changes. These steps are repeated, and both predictions gradually improve as the underlying structure emerges over time.”
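The back-and-forth Klein describes can be sketched as a tiny coordinate-ascent loop. Everything here, the two invented daughter "languages", the single substitution rule per language, and the majority-vote reconstruction, is an illustrative assumption; the published system uses a far richer probabilistic model.

```python
# Observed cognates in two toy daughter languages (aligned, equal length).
observed = {"lang_a": ["lima", "lani"], "lang_b": ["rima", "rani"]}

# Hypothesized regular sound change per language: ancestral -> modern char.
sound_changes = {"lang_a": {}, "lang_b": {}}

def update_reconstructions():
    """Step 1: with sound changes fixed, pick each ancestral letter by
    majority vote after undoing each language's hypothesized changes."""
    recs = []
    for i in range(len(observed["lang_a"])):
        word = []
        for pos in range(len(observed["lang_a"][i])):
            votes = {}
            for lang, words in observed.items():
                ch = words[i][pos]
                inverse = {v: k for k, v in sound_changes[lang].items()}
                anc = inverse.get(ch, ch)  # undo the change if one applies
                votes[anc] = votes.get(anc, 0) + 1
            word.append(max(votes, key=votes.get))
        recs.append("".join(word))
    return recs

def update_sound_changes(recs):
    """Step 2: with reconstructions fixed, re-estimate each language's
    regular sound change from the mismatching letters."""
    for lang, words in observed.items():
        changes = {}
        for rec, word in zip(recs, words):
            for a, m in zip(rec, word):
                if a != m:
                    changes[a] = m
        sound_changes[lang] = changes

recs = update_reconstructions()
for _ in range(3):  # the two steps repeated, as in the quote
    update_sound_changes(recs)
    recs = update_reconstructions()

print(recs, sound_changes["lang_b"])
```

After a few alternations the loop settles: the ancestral forms come out as "lima" and "lani", and lang_b's regular change is inferred to be l → r, mirroring in miniature how fixing one kind of prediction lets the other improve.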