February 12, 2013

Automated reconstruction of proto-languages

PNAS doi: 10.1073/pnas.1204678110

Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté et al.

One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.
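As a rough illustration of the "within one character" figure in the abstract, here is a minimal sketch. It assumes the comparison is a Levenshtein edit distance between the automatic and the manual reconstruction, which the abstract itself does not spell out. The function names and the word pairs are invented for illustration and are not real reconstructions or the authors' code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fraction_within(pairs, max_distance=1):
    """Share of (automatic, manual) pairs whose edit distance is at most max_distance."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(1 for auto, manual in pairs
               if edit_distance(auto, manual) <= max_distance)
    return hits / len(pairs)


# Invented toy strings (hypothetical), NOT real reconstructions.
toy_pairs = [("kata", "kata"), ("kata", "kota"), ("kata", "kat"), ("kata", "tiko")]
print(fraction_within(toy_pairs))
```

On these toy pairs, three of the four fall within one edit. A real evaluation would presumably compare sequences of phonemes rather than raw characters, which this sketch does not attempt.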


9 comments:

tew said...

Oh, god, here we go again. Another day, another "linguistics" paper published in a non-linguistics journal by a group of non-linguists, without a single linguist on the team, claiming to outsmart linguists and their 200-year-old, tried-and-true linguistic methodology.

To be sure - the subject is fascinating, but the trend described above is really disturbing.

Ned said...

I am very suspicious of working out language relationships by statistical means, but this is very different. As far as I can see, this is replicating the process of comparative linguistics with a computer. The results are slightly different, but are they better? Only an Austronesian linguist could tell you, and Bouchard-Côté would have to release the sound change laws his method produced.

Any comparative linguistic method, with or without a computer, has problems that this does not overcome, for example: (a) inter-dialect loans (consider how the words 'great' and 'one' would fit in if English were unwritten; both are inter-dialect loans in their spoken form) and (b) loss from all daughter languages (if the Romance languages were unwritten, how would the Latin 'h' be reconstructed?).

Anonymous said...

I am not a linguist and I don't really consider linguistics to be a study worth a university education. The field is so out of touch with any form of reality. Cultural Anthropology is another of those self-indulgent "studies".

I gather Austronesian languages are comparatively young, that is, spread out much later than other language families and had less time to corrupt, drift and absorb alien language forms. If the language group is that young, and judging by the late arrival in some of its far flung spoken regions, then it should be child's play to work out the proto language base without computers or statistics. Romance languages are very recent in origins and cannot be compared with the Austronesian language family as far as working out its proto form.

MOCKBA said...

Although the paper doesn't come directly from the Gray-Atkinson collaboration, the authors use Gray's database of Austronesian languages, the same one that has been exploited by Atkinson and that is considered faulty or rigged by some linguists (although I have no clue whether these suspicions have merit).

In addition, the authors admit that their reconstructions rely on simple tree phylogenies, and won't work with most other language families which typically experienced stronger admixture / borrowing than the relatively isolated Austronesian languages. But of course the genetic toolkits are also only gradually starting to account for admixture. So the present method isn't directly applicable to most languages of other families, but depending on how you think of it, the glass may be half empty or half full.

limetom said...

@Ponto

It's kind of ironic that you say linguistics is "completely out of touch with reality," and then proceed to show how ignorant you are (which is fine, not everybody can know everything, and I'm guessing you're not an Austronesianist, either) about what we know about how languages change, and in particular how the Austronesian languages have changed.

Austronesian languages are not "particularly young." Most linguists agree that the time-depth involved is on par with Indo-European languages, likely dispersing from Taiwan around 6,000 BP. Perhaps you were thinking of Polynesian languages? Greenhill and Gray (2005) give a good overview of the major hypotheses of Austronesian origins.

Having "less time to corrupt, drift and absorb alien language forms" has nothing to do with how different modern languages will look from their common ancestor. Some will look very close, having had very little change. Others will look very different. It has been more or less known for decades now in linguistics that there is no "glottoclock"--languages do not change at a fixed rate (see, among others, Blust 2000).

Additionally, you cannot judge by its "late arrival in some of its far flung spoken regions" whether or not the family as a whole is young. It's clear that only a subgroup (Polynesian) of a subgroup (Malayo-Polynesian) is very young. It would be like saying that because the common myna first appeared in the Hawaiian Islands in the mid 1800s, the genus Acridotheres is particularly young.

terryt said...

"If the language group is that young, and judging by the late arrival in some of its far flung spoken regions, then it should be child's play to work out the proto language base without computers or statistics".

Which is the case. Presumably that is why the authors used that group to test their program. If the program is successful it may help in reconstructing the deeper origins of the Austronesian languages. Many see a link to Tai-Kadai (or whatever is the current designation). Perhaps these two language groups have even deeper connections to other East Asian languages which the program may be able to discern.

"the authors admit that their reconstructions rely on simple tree phylogenies, and won't work with most other language families which typically experienced stronger admixture / borrowing than the relatively isolated Austronesian languages".

Yes. That makes Austronesian languages an ideal point from which to start. Most languages developed progressively during the Austronesian expansion into the Pacific although the SE Asian ones were obviously subject to admixture with and borrowing from neighbouring languages.

"the glass may be half empty or half full".

I think you are realistic there.

Simon J Greenhill said...

MOCKBA - I'm the author of the Austronesian Basic Vocabulary data. I've spent years building it. I would be *very* interested to hear about these claims that it's "faulty" or "rigged"?!

It is a collection of wordlists from published data sources or linguists, with cognates from published data sources or Austronesian language experts. To say that it's faulty or rigged is bullshit.

--Simon Greenhill.

Jim said...

"I am not a linguist and I don't really consider linguistics to be a study worth a university education. The field is so out of touch with any form of reality."

Oh the irony. "Out of touch with any form of reality", Ponto, as you express that thought by means of language?

The reality that linguistics studies is you and your behavior, Ponto. You are the object of study.

MOCKBA said...

Sorry for bringing in the dirty laundry of anti-Atkinsonianism, Simon! But that's what Quentin's detractors say. Citing verbatim from a Facebook post which made me gasp, sort of: "this database was created with a certain conclusion in mind, wink". As I said, I wouldn't know whether there is any merit in these claims. But that's how some linguists feel.