April 30, 2010

The shape and tempo of language evolution (Greenhill et al. 2010)

This is an extremely interesting paper which addresses the claim that typological features of languages (e.g., whether they use Subject-Verb-Object) are more conservative than the lexicon. If that is the case, then typological features could be used to infer evolutionary relationships between languages that are older than ten thousand years or so (an upper limit on what can be inferred using vocabulary).

In general, the authors reject the idea of typological conservation, although they note that typological features differ in this respect, and some of them may appear to be conservative within some language family but evolve rapidly in another. Their tree reconstruction is able to infer well-known language families (e.g., Indo-European), or suspected ones (e.g., Nostratic), but the corresponding clusters are not robust (e.g., Hindi is broken away from the IE cluster, and unrelated non-Eurasian languages fall into the Nostratic one).

There is a freely available link to a preprint of the paper (pdf).

Proc Biol Sci. 2010 Apr 7. [Epub ahead of print]

The shape and tempo of language evolution.

Greenhill SJ, Atkinson QD, Meade A, Gray RD.


There are approximately 7000 languages spoken in the world today. This diversity reflects the legacy of thousands of years of cultural evolution. How far back we can trace this history depends largely on the rate at which the different components of language evolve. Rates of lexical evolution are widely thought to impose an upper limit of 6000-10 000 years on reliably identifying language relationships. In contrast, it has been argued that certain structural elements of language are much more stable. Just as biologists use highly conserved genes to uncover the deepest branches in the tree of life, highly stable linguistic features hold the promise of identifying deep relationships between the world's languages. Here, we present the first global network of languages based on this typological information. We evaluate the relative evolutionary rates of both typological and lexical features in the Austronesian and Indo-European language families. The first indications are that typological features evolve at similar rates to basic vocabulary but their evolution is substantially less tree-like. Our results suggest that, while rates of vocabulary change are correlated between the two language families, the rates of evolution of typological features and structural subtypes show no consistent relationship across families.



Unknown said...

Re the colored, circular chart classifying the world's languages according to their typological features:

Group 2's features should be thought of as the neolithic innovation pattern, emerging with agriculture. The un-numbered group to the left represents the original, pre-agriculture pattern. Group 1 is, if not an actual hybrid of the two, at least a pattern that is mid-way between the others, sharing aspects of each.

pconroy said...


Basque appears closest to Hunzib
Which is from the Northern Caucasus - right where one would expect, based also on Blood Type O-Negative, which is also high in this area.

pconroy said...

Also, check out Mandarin, closest to Yoruba???!!!

pconroy said...

Pity they didn't extend it to languages like Irish Gaelic, which has some grammar features similar to Berber and other Afro-Asiatic languages...

Gioiello said...

It seems from this research, too, that the Nostratic family has a good chance of being real. And it was formed where we have always thought hg. R was born: among the Caucasus and the East and West Urals. Given the links with Sino-Tibetan, which I think I have demonstrated in the studies I have mentioned elsewhere in this forum, it is probably not absurd to think of a very ancient time when hg. NO and P were in South Siberia. My thinking, as you know, is that some R (R1b1* etc.) came to Western Europe very early, between the LGM and the Younger Dryas.

Andrew Oh-Willeke said...

While interesting, the study has some real methodological flaws and relies on a very incomplete data set (the WALS data).

The analysis is not very explicit at all on the issue of weighting. Not all language features in WALS are created equal, and this study doesn't appear to reflect this fact. Likewise, giving equal weight to similar languages with recent shared histories and to ones without is problematic. If there is ever a case for being a Bayesian, it is in a context like this one, where we have a lot of context to guide our statistical inferences.

The big issue that the study expressly says it is trying to deal with is the degree to which non-lexical features are stable within IE languages and Austronesian languages. The key question is the degree to which particular language features: (1) derive from parent languages, (2) derive from substrate languages, (3) derive from areal relationships (i.e. borrowing based on geographical proximity), and (4) are random. But the study design is ill suited to capturing this issue. A particularly glaring omission is the absence of proximity and lexical language commonality as variables to be considered in the study.

If adjacent languages known to be part of different language families based on history and lexicon tend to share a feature, this is a good argument that the feature is areal or due to a substrate. Conversely, if adjacent languages known to be part of different language families preserve distinctions in other language features, this is a good argument that the feature is due to a parent language and strongly conserved over time.

Similarly, when features are shared by languages spoken by groups with the same population genetics, but not by adjacent languages spoken by people with different population genetics, this argues for either substrate or parent language influences; while features shared across these lines are likely to be areal.

The most interesting data based on this kind of analysis would be one showing the most and least areal features in the WALS data supplemented by proximity and FsT population genetic similarity data.
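The test described in the last few paragraphs can be sketched in a few lines. This is only a toy illustration under stated assumptions: the language names, families, feature values and adjacency pairs below are all hypothetical placeholders (not WALS data), and `share_rates` is an illustrative helper, not anything used in the paper. It compares how often a feature is shared by adjacent languages from different families with how often it is shared inside a family.

```python
from itertools import combinations

# Hypothetical toy data: language -> (family, feature values). Not WALS values.
LANGS = {
    "A1": ("FamA", {"word_order": "SOV", "ergative": True}),
    "A2": ("FamA", {"word_order": "SVO", "ergative": True}),
    "B1": ("FamB", {"word_order": "SOV", "ergative": False}),
    "B2": ("FamB", {"word_order": "SVO", "ergative": False}),
}

# Hypothetical adjacency: unordered pairs of languages spoken next to each other.
ADJACENT = {frozenset(p) for p in [("A1", "B1"), ("A2", "B2"), ("A1", "A2")]}


def _rate(flags):
    """Fraction of True values; NaN when there are no pairs to compare."""
    return sum(flags) / len(flags) if flags else float("nan")


def share_rates(feature):
    """Return (share rate among adjacent cross-family pairs,
               share rate among same-family pairs) for one feature."""
    cross, within = [], []
    for a, b in combinations(LANGS, 2):
        fam_a, feats_a = LANGS[a]
        fam_b, feats_b = LANGS[b]
        same_value = feats_a[feature] == feats_b[feature]
        if fam_a == fam_b:
            within.append(same_value)
        elif frozenset((a, b)) in ADJACENT:
            cross.append(same_value)
    return _rate(cross), _rate(within)


for feat in ("word_order", "ergative"):
    cross, within = share_rates(feat)
    print(f"{feat}: adjacent cross-family={cross:.2f}, within-family={within:.2f}")
```

In this made-up data, word order is shared across family lines but not within families (the areal pattern), while ergativity shows the reverse (the inherited pattern). A real analysis would need actual geographic distances, far more languages, and some control for chance agreement.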

Eyeballing the WALS data in maps (which reveals proximity issues) and with a knowledge of population genetics, for example, makes the extreme distinctiveness of the various Caucasian languages in both certain grammar features (like ergativity) and phonetics leap off the page. The case from proximity for ergativity not being a trait that is spread areally is very strong.

Another data set that is missing, and important for the kind of analysis the study purports to do, is the actual evolution of features over time where there is a continuous record in literate languages. What features actually varied from Old English to modern English, from Sanskrit to Hindi, from early Akkadian to late Akkadian? Scratch those from the list of conservative features.

We also have some very good data on language evolution in cases of known substrates, like the replacement of Sumerian by Akkadian, where there was a known period of bilingualism. This also makes a case for a lack of substrate-to-superstrate carryover of ergativity.

In short, dumb statistics don't cut it, and one needs the right data set and the right use of that data set to answer the questions asked in this paper.

Dienekes said...

Andrew, the authors cull languages that have many missing features, and they also study different features separately, so your points really don't stand.

Also, if you take the idea that certain languages form a family, then of course you will discover features that are stable within them. But, that's hardly interesting. What the authors are interested in is discovering new language relationships, especially older ones for which typological features have been claimed to be stable.

All in all, I think that the paper is excellent at what it does, which is cast doubt on the alleged stability of typological features or their superiority to the lexicon.

ashraf said...

This paper confirms Greenberg's Euro-Asiatic hypothesis (but not Carleton Hodge's Lislakh hypothesis), although the presence of languages such as Basque, Burushaski and Ingush in the first group is somewhat odd, as those languages are said to be Sino-Caucasian. It also confirms Pedersen's Nostratic hypothesis.

But at the same time it undermines the Afrasan hypothesis, as it shows that South Afrasan (Cushitic, Omotic etc.) is very distant from North Afrasan (Egyptian, Berber, Hausa, Semitic).

Anonymous said...

Quechua? How could that possibly be related to languages like Turkish or Hindi?

Naren Palepu said...

It may be a good effort. But if we go by Dienekes' words, it goes from an extremely interesting paper to an extremely amateurish one.

Many Indologists have previously said that some of the typological features of the Indian Indo-European languages come from a Dravidian substratum.
The paper gives very little consideration to the Dravidian languages, which also provided scripts for many ASEAN countries (Thai, Burmese, Cambodian, etc.). That aspect makes this paper ridiculous.

eurologist said...

Quechua? How could that possibly be related to languages like Turkish or Hindi?

Random noise / chance.

I like that the typology groups the "Balkan" languages across 3 to 4 IE subfamilies. Probably both substrate and diffusion at work. I am puzzled, though, by how far away Greek is listed in both methodologies: anecdotally, I see at least as many (if not more) shared words with Germanic as with Slavic languages, especially if you look at similar or related meanings, and not exact matches.

Irish is also interesting. I have argued before that insular Celtic is extremely different from what we know of "Celtic" languages close to the western Alps - which to me appear much closer to Latin and Germanic. And I still don't believe that the language spoken ~2,500 years ago east of the Rhine and south of the Danube was anything resembling Celtic at all - likely much closer to proto-Germanic, given the ~1,000km shared "border" that in fact has no significant geographic obstacle whatsoever.

Maju said...

"In general, the authors reject the idea of typological conservation"...

Thank goodness! I was already establishing the 'nasal theory of language families': what matters is the shape of the nose, that's what makes Evo Morales and Oteiza speak "similar": their big squarish noses. :D

But pity anyhow because I'd love linguists to begin ranting about 'Indo-Georgian', 'Altai-Andean', 'Burusho-Caucasic', 'Afro-Pacific', 'Nippo-Burman', 'Afrasian-Uto-Caddoan', etc. :D

Now seriously, one may come to suspect that beyond phylogeny and mere areal features, part of the typological structure might reflect some sort of similar substrate, some particular way of thinking of language, any language (for instance the existence of certain phonemes or not, or how tonality affects speech, if at all). In fact this would not be too different from areal features or even phylogeny itself. But it seems difficult to disentangle the mesh.

Belenos said...

Andrew, excellent post, and a good illustration of why statistical comparison of large numbers of languages is really not a very useful tool.

The processes through which languages change are really far more complicated than anything we find in genetics. Even if we are looking at time depths of just 3000 years, we have to take into consideration the effects of sprachbunds, processes of creolisation, adult second language acquisition and diglossia, not just with known language families, but with languages which have left no historical record.

I'd also like to add a couple of problems with the analysis.

1. Typological features have not been proven to be stable over the massive depths of time needed for this analysis to have value. In fact they are clearly subject to language contact effects and language change. The S-V-O example given in the post illustrates this perfectly. English has been SVO for 500 years; for 500 years prior to that it was becoming SVO. Prior to that it was an analytical language, where various sentence orders were possible.

Imagine how many more of these effects have occurred over 10 or 20k years? It more or less destroys any hope of accurately finding links.

2. Typological similarities are not always significant. One could easily declare Greek and Basque were both analytical languages, and so similar. But the way they are analytical is completely different. In individual cases, one can see the mistake easily, but with a statistical analysis it will be missed.

Maju said...

PConroy said: "Basque appears closest to Hunzib"...

Actually to the whole NE Caucasian family: they converge very clearly well below (rather 'above' in the graph) where they converge with Basque. One may also argue that they also converge with Burusho and NW Caucasian (Abkhaz) well before converging with Basque.

This could be some support for the Vasco-Caucasian hypothesis and especially for my favorite version: Basque-NE Caucasian-Hurro-Urartean-Sumerian with a Gravettian origin for all.

But who knows!

Belenos said...

The most useful sentence of the article

"our analysis of rates of evolution failed to identify any typological features that evolve at consistently slower rates than the basic lexicon. If the signal in the lexicon does stretch back as far as 10,000 years then our results suggest that typological data is constrained by a similar time horizon."

One to remember when looking at the chart at the top of the article...

Maju said...

On a related thought, the graph does suggest a further grouping of this putative Vasco-Caucasian-Burusho with 'Dravido-Georgian', 'Altai-Andean' and maybe also Barbacoan (Awa Pit) and Uralic at a very deep common root, distinct from Indoeuropean, which might be rooted at the colonization of West Eurasia and Central Asia (extensions as far as Mongolia). Just a hunch that makes some potential sense to me.

Gioiello said...

Maju, remember that the great finding of Alfredo Trombetti (La lingua basca, 1925) was that Basque was linked with the Caucasian languages. As we now know that there is a macro-group (Basque-Caucasian-Sino-Tibetan-Na-Dené), to which I think Sumerian belongs, the link of Basque with Sumerian is possible. Of course, other languages have undergone some mixing that Basque and Sumerian have not, since the two have not been in contact since very ancient times. The relatedness could go back tens of thousands of years.

eurologist said...

1. Typological features have not been proven to be stable over the massive depths of time needed for this analysis to have value.

2. Typological similarities are not always significant.

I would suggest that people read the actual article before making such disparaging comments...

Maju said...

Gioiello: sure, the tentative Basque-Caucasian connection has been there for a while, but the very structure of the three Caucasian language families is itself a matter of controversy, and whatever connection they have with Basque is very blurry.

There is not one Caucasian family but three (NW Caucasian, NE Caucasian and Kartvelian) and they have never been conclusively connected to each other. NW Caucasian has been related to Hattic and NE Caucasian to Hurro-Urartean, while Kartvelian sometimes shows up in the Nostratic hypothesis.

However, the most convincing stuff I have read was about a NE Caucasian-Basque connection. Also I have toyed a bit with shrunk-down versions of mass lexical comparison using numbers 1-5, in fact looking for potential cousins for Sumerian. What I 'found' was that Sumerian seemed closest in this aspect to NE Caucasian-Hurro-Urartean and that Basque also showed up at a most distant position in that same grouping.

Considering that archaeological evidence places the origins of Sumerians at the Zagros Neolithic and that its precursor, the Zagros Epipaleolithic (Zarzian culture) is very possibly derived from Eastern European Epigravettian, via the Caucasus, the elements converge at Gravettian, so it's only logical that the distance is so huge and so hard to spot and confirm.

Another possibility might be Neolithic but we should see much more clear affinities in that case.

ashraf said...

Basque numbers are similar to afrasan and indo-european numbers rather than to caucasian ones.
1 bat, aa ad
2 bi clearly ie
3 hiru also clearly ie/aa
4 lau
5 bost clearly ie (panca,fist,pandj)
6 sei clearly aa
7 zazpi clearly aa
8 (z)o(r)tzi clearly ie
9 bederatzi
10 hamar

Maju said...

LOL, Ashraf. "Clearly ie/aa" - whatever that means.

So according to you:

bat = waahid
bi = ithnaan
hiru = thalaatha
lau = arbaa
bost = xamsha

I just hope it's a joke. ;)
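For anyone who wants to make this kind of side-by-side numeral check a little less impressionistic, here is a minimal sketch. It is an assumption-laden toy, not the comparative method: it simply computes a length-normalized edit distance on the romanized spellings quoted above (Basque 1-5 against the modern Arabic forms). The function names are illustrative, and raw spelling distance says nothing about regular sound correspondences or chance resemblance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mean_normalized_distance(xs, ys):
    """Average per-word edit distance, each scaled to the range 0..1."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += levenshtein(x, y) / max(len(x), len(y))
    return total / len(xs)


# Spellings exactly as quoted in the thread (numbers 1-5).
BASQUE = ["bat", "bi", "hiru", "lau", "bost"]
ARABIC = ["waahid", "ithnaan", "thalaatha", "arbaa", "xamsha"]

score = mean_normalized_distance(BASQUE, ARABIC)
print(f"Basque vs Arabic, numbers 1-5: {score:.2f} (0 = identical, 1 = nothing shared)")
```

A score near 1 means the spellings share almost nothing; a low score would still not demonstrate relatedness, since loans and coincidence produce low scores too.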

Gioiello said...

Before the beginning of modern linguistics, during the 19th century, that every language derives from Hebrew was a common belief. This is in line with the belief that every man derives from Adam and that the world has the age of the Jewish calendar and that man has been made at image of God.

Of course in linguistics and genetics things are very different.

Anyway Muslims do think that Arab was the language of Allah, for this they don't translate it.

ashraf said...

mr Maju, I did not understand how you extrapolate such conclusions, have you read my comment?
ie=indo-european so basque bi(2) is similar to ie bi as in (bi)national,(bi)directional.
The numbers you gave are modern Arabic ones not Semitic nor aa (afro-asiatic).
I think you know that hurrian(6)and ie+kartvelian+altaic+uralic (7)[which is very similar to basque ones] are considered loans from ps to pie,p uralic,p kartvelian,hurrian (and perhaps basque) by mainstream linguists.
Here I rewrite my comment (ie=indo-european,aa=afro-asiatic,and please note that I did not comment on the numbers 4,9,10)

Maju said...


Oops, I did not understand what you meant. My apologies.

"so basque bi(2) is similar to ie bi as in (bi)national,(bi)directional".-

Problem is that bi is not indoeuropean. The IE root for two is dwos and the Latin word is similar: duo.

'Bi-' is a particle taken by Latin from some other pre-IE language (Ligurian? Iberian?) and extended via Latin through many languages of the West, including Basque itself. Making 'bi' be IE is forcing things a lot. It's a clear case of Vascoid substratum in West Europe.

"The numbers you gave are modern Arabic ones not Semitic nor aa (afro-asiatic)".

Do you prefer I compare with Akkadian?, ancient Egyptian?, Kabyle? It's the same: they do not correspond (though I do suspect some odd Vascoid influence in Berber from either Solutreo-Gravettian substratum or, more likely, Megalithic influence).



4-lau (*laur)-erbe-yaf'daw-rebea
5-bost (*bortz)-hhamish-'di:yaw-xemsa

Clear now? Otherwise please point to the exact connections with living or even proto-AA reconstructions (hard to make looking at such high diversity).

"I think you know that hurrian(6)and ie+kartvelian+altaic+uralic (7)[which is very similar to basque ones] are considered loans from ps to pie,p uralic,p kartvelian,hurrian (and perhaps basque) by mainstream linguists".

It is possible that Basque sei (6) and maybe even sazpi (7) may well be loans from some IE, like Vulgar Latin. But my exercise only dealt with numbers from 1 to 5 (because in Sumerian six, etc. are said "five-one", etc. and I was looking for Sumerian relations when I undertook it, not Basque ones - also because these low range numbers seem more linguistically stable, conservative).

However I'm not really persuaded by the theories that consider that proto-Semitic influenced PIE. I'd rather think that both have similar influences by other "Neolithic" languages of the area, but hard to tell with the limited information we have. IMO PSem and PIE never really interacted directly.

Will continue to review your original post I misunderstood.

ashraf said...

Thank you for your comment. I recommend you read Blazek's book on numerals: for example, pie (2) is not connected with p north afrasan (2) but rather with proto north afrasan for twin, and the case is similar for pie 5 with ps fist.

All mainstream Indo-Europeanists (including even the most "ie-centrist" ones) recognize pie 7 as a loan from ps; the ps word for 7 has a Semitic etymology, Semitic morphology and afrasan parallels.

The more liberal mainstream ie'ists also find tentative that pie 7 and 3 are ps loans or common ps-pie roots.

Maju said...

Ashraf wrote before:

1 bat, aa ad
2 bi clearly ie
3 hiru also clearly ie/aa
4 lau
5 bost clearly ie (panca,fist,pandj)
6 sei clearly aa
7 zazpi clearly aa
8 (z)o(r)tzi clearly ie
9 bederatzi
10 hamar


I have already discussed above that "bi" just cannot be IE. Even if one might argue for a b@ <> d@ (where @ is any random vowel, so we can think of dwos and bi as hypothetically related - quite forced but anyhow), the relation should be most remote, highly archaic: Paleolithic. Even English 'two' is a zillion times closer.

Let's see the other numbers:

1-bat. Potentially there might be a connection with some AA words: at/att in some Amharic languages, with close relatives in Hebrew (ahat, axat) and South Arabian t'ad. A few non-Semitic AA languages also have similar forms of 1 (adda, ta, da), so it's possible (though rather hard to explain).

3-hiru (pronounced iru, the 'h' is a modernism of Occitan influence). Any relation with IE *treyes can only be in letter R, hence not closer than for 2 (see above). I could not find any AA word that would be closer than PIE/modern IE.

4-lau (*laur). Rien de rien. You acknowledge this.

5- bost (*bortz). Proto-Berber *fuss with possible connections in Omotic only. Very tentative, especially as modern Berber is rather different.

6- sei. In principle cognate of Vulgar Latin/Iberian Romance 'seis'. IE.

7- sazpi. Might be from IE *septm via Latin septem. But I think it was Krutwig who noticed that sazpi and sortzi (8) could be read as S+azpi/ortz, meaning 'azpi' below and 'ortz' the sky (above). He had some speculations on this and its hypothetical relation to the Scottish Mason pillars Jakin (in Basque meaning 'to know') and Boaz (in archaic Basque 'boz' means heart ('bihotz' modernly) and happiness ('poz' modernly)). But, well, he believed Picts spoke Basque... IDK.

8- sortzi. See above. I don't think it can have any relation with PIE *okto: except in the extremely remote sense as with 2 and 3.

9- bederatzi. Such a long word should be an artificial creation. I have recently speculated on a Megalithic age possibility here. Most likely with Basque etymology, IMO.

10- hamar (read: amar). Might be related to proto-Berber *meraw (and many derivatives in modern Berber).

What would I make of all this?

A. There may well be a very remote Basque-IE connection (Gravettian?). I have already detected a few other basic words, like Basque 'izan': to be (however, mainstream linguists consider 'izan' a modern word, which is contradicted by the Veleia inscriptions - but these have been 'inquisitioned' by the linguist popes' camarilla). However, I don't see any reason to think of a modern (post-Paleolithic) connection: such a thing would be much more obvious.

B. There may well have been Vascoid influence in Berber, both at the pre-AA Oranian substratum and in the Megalithic Age. Otherwise, I'd consider the occasional connection with mostly Ethiopian languages product of mere chance. Anything else would need much stronger demonstration.

C. I don't think that because an element or series of elements are (arguably) present in two distinct language families, that means any sort of phylogenetic influence. Areal influences (sprachbund) and shared influence from a third language or group of languages can perfectly and often explain better such borrowings.


ashraf said...

Thank you very much. Here is another amateurish interpretation (originally written in French) that shows that all norafrasan and iranohittite (or indohittite) numbers are interconnected (i.e. norafrasan and ih are genetically connected and arose in the Middle East after the demographic explosion that occurred after the discovery of agriculture => that's why we can see these numbers in such distant language families as Na-Dene, Altaic, Uralosiberic, Hurrourartean...)

1/ Proto-Semitic numbers and their etymologies (following Dolgopolsky's reconstruction, assuming that "n" is a "linking consonant", and with a reanalysis for convergence under the lislakh family)

1 ad (solitary) [Slavic ed=1]

2 ti (taw=twin, Arabic tawam, Aramaic toma; via lislakh connection/borrowing, English twin) [iranohittite tu=2]

3 tal (triplet twin) [iranohittite ter=3]

4 arb (season, 4 seasons)

5 ham

6 sis [iranohittite ses]

7 sab (index finger = 7th finger) [iranohittite sap]

8 tam [Slavic sam/tam]

9 tis

10 as [iranohittite (t)as]

2/ Proto-Libyan = proto-Berber numbers (following Blazek, with reanalysis => removing prefixes, suffixes and linking consonants, and simplifying by convergence from a norafrasan perspective)

1 wan [iranohittite oin, as in English one]

2 ti [norafrasan, ih, lislakh]

3 kar

4 ok [iranohittite okt=8 (4*2)]

5 sam

6 sas [ih, norafrasan, lislakh]

7 sa [ih, norafrasan, lislakh]

8 tam [ih, norafrasan, lislakh]

9 tis

10 mar [Semitic miy=100]

3/ Proto-Egyptian numbers (same principle)

1 oye [ih, Berber, lislakh]

2 snaw [ih, norafrasan, lislakh]

3 kham

4 fdaw/fur (fur in African Afrasian languages => a preserved archaism?) [ih, norafrasan, lislakh]

5 daw

6 sis [lislakh, norafrasan, ih]

7 sap [same]

8 ham [same]

9 pisd [meaning "new" in Egyptian => semantic parallel with ih nav (9) = new; for example, in French neuf means both 9 and new => a possible shared Middle Eastern lislakh mythological taboo against saying the number of months of pregnancy for fear of the evil eye, owing to the very high fetal mortality in those remote times]

10 maw (Semitic miy=100) [norafrasan]

So only the number 5 remains that is not common to ih and norafrasan (that is, there are more numbers shared between ih and norafrasan than numbers shared within the ih family itself; see the numbers of the Armenian and Anatolian branches), yet the ih number 5 pank/pak can be connected by metathesis to norafrasan pak/fas/kap (= wrist, Semitic kap/kappa, Berber afus, English fist)

Taking into account that even within the iranohittite family the proto-iranohittite reflexes are missing (in the Armenian, Tocharian or Anatolian branches, among others), one can consider lislakh reflexes that are present in one norafrasan branch but not in another as norafrasan.

Other similarities besides numbers:

norafrasan na / ih na = no

Arabic NAkar, NAfa, NAha = to deny

norafrasan ist / ih hes = to be

Arabic ist / English is

ISTasaad al thawr = the bull IS "lionized"

Arabic laysat (la+ist) / English is not

lastu marid = I AM (not) sick

Onur Dincer said...

Ashraf, are these French writings yours or are they quotes from someone else?

ashraf said...

No, they're mine ie amateurish but based on an interpretation of datas and some readings.

Onur Dincer said...

they're mine

Clear from the choice of words and some distinctive spellings.:)

Marnie said...

nice ashraf,

i can read you better in french than english.