August 28, 2009

Refinement of ancestry informative markers in Europeans (Tian et al. 2009)

From the paper:
In general, Fst values corresponded to geographical relationships with smaller values between population groups with origins in neighboring countries/regions (e.g. Tuscan/Greek, Fst = 0.001) compared with those from very different regions in Europe (e.g. Russian/Palestinian, Fst = 0.020) similar to previous studies [10].

...

The current study extends the analysis of European population genetic structure to include additional southern European groups and Arab populations. Even within Italy, the relative position of northern Italians compared with subjects from Tuscany is consistent with the general geographic correspondence of PCA results. Interestingly, the majority of Italian Americans (NYCP 4 grandparent defined) appear to derive from southern Italy and overlap with subjects of Greek heritage. Both of these observations are consistent with previous historical information [30,31].
The paired Fst table confirms that the closest populations to Greeks are Italians (negative Fst=-0.0001) and Tuscans (Fst=0.0005). Much further apart are Spaniards (Fst=0.0035) and Germans (Fst=0.0039), who are still much closer than the most distant Russians (Fst=0.0108) and Orcadians (Fst=0.0103).

The low genetic distance between Greeks and Italians (the lowest in the table) suggests, once again, that southern Italians are little more than Latin-speaking Greeks as their history suggests, without discounting the possibility that they have experienced some non-Greek admixture.

Also of interest is the proximity of Ashkenazi Jews to Greeks and Italians, who are about twice as close to them as Bedouins, Palestinians, or Druze from the Near East are. As I have argued before, a major component in the ancestry of Jews was picked up in Hellenistic-Roman times; most published models of Ashkenazi Jewish origins have only considered admixture between a Near Eastern component and a northern European (German-Slavic) component. Indeed, Ashkenazi Jews are closer to several European populations than they are to Middle Eastern ones.

However, as the PCA analysis shows, Ashkenazi Jews are distinct from both Europeans and non-Jewish Middle Eastern populations and cannot be viewed as a simple mix of the two; their distinctiveness must be -in part- due to the specific features of the small founder population of that community after it became effectively reproductively semi-isolated from gentiles after Roman times. It would be interesting to see different Jewish communities studied in the context of a broad variety of European and Middle Eastern populations, to determine whether Ashkenazi distinctiveness is specifically Ashkenazi or more generally Jewish distinctiveness; I would bet on a combination of the two.

Also of interest is the analysis of European populations in comparison to South Asian Burusho and Balochi, which shows on the one hand, substantial homogeneity of West Eurasians compared to South Asians, but also, to some extent, the transitional nature of some populations such as Bedouins or Adygei.

Related: A previous article by Tian et al.

UPDATE (Aug 29)

The PCA analysis is also quite interesting:

Some observations:
  • In A we see a west-east differentiation in northern Europe, with Irish and Russians at the two ends of PC1.
  • In B we see differentiation of non-Jewish southern European populations from Ashkenazi Jews along PC1 and from Druze, Palestinians, and Bedouins along PC2. Greeks are concentrated near the center, in the lower left quadrant.
  • In C we see all the populations using only ancestry-informative markers and in D with all 270k markers. The two plots are similar, although the full set gives clearer results. We observe a cline of populations from the Near East to Northern Europe at the bottom. A little discontinuity between Greeks and Arabs would probably disappear if geographically intermediate populations had been included. Ashkenazi Jews are differentiated from the entire sample, suggesting that due to genetic drift, selection, or cryptic other ancestry (?) they cannot be reckoned as a simple European-Near Eastern mix genetically.
UPDATE (Aug 30):

Here is a dendrogram I created based on the paired Fst table from the paper. It is of course better to refer to the original table, but the plot nonetheless shows, in a different form, a "southern" group (divided into European and Arab clusters) and a "northern" group (divided into "western" and "eastern" clusters).
Also a dendrogram after removing the island populations of Orkney and Sardinia, and the non-IE Basques.
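
A dendrogram of this kind can be built from a pairwise distance table by average-linkage (UPGMA) clustering. The sketch below is only an illustration under stated assumptions: the post does not say which clustering method was used, the function name `upgma` is hypothetical, and the four labels and their distances are made-up placeholders, not values from the paper's Fst table.

```python
def upgma(labels, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    labels: list of population names
    dist:   dict mapping frozenset({a, b}) -> pairwise distance
    """
    # Each cluster is stored as (subtree, tuple_of_member_names).
    clusters = [(name, (name,)) for name in labels]

    def avg(c1, c2):
        # Mean distance over all cross-cluster member pairs.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merged = ((a[0], b[0]), a[1] + b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Placeholder distances (NOT from the paper): two "southern" and two
# "northern" groups that are close within a pair and far across pairs.
labels = ["S1", "S2", "N1", "N2"]
toy = {
    frozenset(("S1", "S2")): 0.001,
    frozenset(("N1", "N2")): 0.002,
    frozenset(("S1", "N1")): 0.008,
    frozenset(("S1", "N2")): 0.009,
    frozenset(("S2", "N1")): 0.008,
    frozenset(("S2", "N2")): 0.010,
}
tree = upgma(labels, toy)
print(tree)
```

With these placeholder values the two close pairs are joined first and then merged at the root, which is the same kind of two-branch picture described above. Swapping in the real table would only require filling `toy` with the published pairwise values.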

Mol Med.
2009 Aug 24. [Epub ahead of print]

European Population Genetic Substructure: Further Definition of Ancestry Informative Markers for Distinguishing Among Diverse European Ethnic Groups.

Tian C, Kosoy R, Nassir R, Lee A, Villoslada P, Klareskog L, Hammarström L, Garchon HJ, Pulver AE, Ransom M, Gregersen PK, Seldin MF.

The definition of European population genetic substructure and its application to understanding complex phenotypes is becoming increasingly important. In the current study using over 4000 subjects genotyped for 300 thousand SNPs we provide further insight into relationships among European population groups and identify sets of SNP ancestry informative markers (AIMs) for application in genetic studies. In general, the graphical description of these principal components analyses (PCA) of diverse European subjects showed a strong correspondence to the geographical relationships of specific countries or regions of origin. Clearer separation of different ethnic and regional populations was observed when northern and southern European groups were considered separately and the PCA results were influenced by the inclusion or exclusion of different self-identified population groups including Ashkenazi Jewish, Sardinian and Orcadian ethnic groups. SNP AIM sets were identified that could distinguish the regional and ethnic population groups. Moreover, the studies demonstrated that most allele frequency differences between different European groups could be effectively controlled in analyses using these AIM sets. The European substructure AIMs should be widely applicable to ongoing studies to confirm and delineate specific disease susceptibility candidate regions without the necessity to perform additional genome-wide SNP studies in additional subject sets.

Link

86 comments:

Polak said...

Hey Dienekes, have you seen the supplemental material for this anywhere? I can't find the link, but it looks like it should be available.

Also, you should redesign your Euro ancestry calculator based on these markers.

Gioiello said...

Dienekes underlines: “Interestingly, the majority of Italian Americans (NYCP 4 grandparent defined) appear to derive from southern Italy and overlap with subjects of Greek heritage” and I can quote: “The low genetic distance between Greeks and Italians (the lowest in the table), suggests, once again, that southern Italians are little more than Latin-speaking Greeks as their history suggests, without discounting the possibility that they have experienced some non-Greek admixture”. But Tuscans, who had no Greek colonization, are also among the closest to Greeks, so I think of a common origin (also the Pelasgian background about which I have always spoken).
What is said about Ashkenazim tallies completely with what I have always said.

Dienekes said...

Look in the journal website for the author's manuscript which if I remember correctly contains all the supplementary material.

Maju said...

Thanks "God" for the Burusho because they demonstrate that Ashkenazim are not from outer space after all (fig.4). ;)

Seriously:

...the inclusion or exclusion of particular ethnic groups (i.e. Ashkenazi Jewish, and Sardinian for southern European, and Orcadian for Northern European) shifted the relationships in PCA when southern or northern Europeans were examined separately. Similarly, the inclusion of South Asian populations (Figure 4C) changes the relationships of the population groups with the Ashkenazi Jewish population appearing in the center of a presumed southern European cline. These findings are consistent with our previous observations [12], and show that PCA results are highly dependent on which population groups are included in the analysis. Thus, there should be some caution in interpreting these results and other results from similar analytic methods with respect to ascribing origins of particular ethnic groups.

That sums it up IMO in regard to what we can expect from PC analysis.

I am personally not that interested in Ashkenazim (though again I do miss a Turkish comparison) but I do find it curious that Basques show up as rather distant from all neighbours (table 1):

1. Closest are Spaniards (0.0060), similar distance to that between Orcadians and Spaniards or Palestinians and Greeks.

2. Next are Germans, Italians/Tuscans, Irish and US Europeans. At a distance (c. 0.0080-90) similar to that of Italians with Orcadians or Russians.

Most distant are Eastern populations (Bedouins, Palestinians, Druze and also Russians, Adygei and Sardinians).

DagoRed said...

They are all "pelasgians" but many don't want to understand this.

Joshua said...

Only because they don't have England there -- England ought to be closer to German than Italian to Greek.

Gioiello said...

Dienekes writes: “Ashkenazi Jews are differentiated from the entire sample, suggesting that due to genetic drift, selection, or cryptic other ancestry (?) they cannot be reckoned as a simple European-Near Eastern mix genetically”.
I think it is important that they deviate to the north on the graphic, so far from Middle Easterners and between Europeans and Caucasians (ADY). If genetic drift has carried Sardinians (who are Italians but isolated for many thousands of years) to the opposite position with respect to Jews, I think that Ashkenazim, with their three components (Middle Easterners, Europeans, Khazarians) and a well-known genetic drift, can stay in that position. At this point the Middle Easterner component seems the least important.

Ebizur said...

Gioiello said...

"I think it is important that they deviate to North on the graphic, so far from Middle Easterners and between Europeans and Caucasians (ADY). ... At this point the Middle Easterner component seems the least important."

To which graphic have you referred? According to Graph A in Figure 1, the tested Adyghes (Circassians/Northwest Caucasians) are most similar to the tested Spaniards, Italians, and Greeks, along with a few Swedish outliers. The only individuals who have been positioned among or beyond the Ashkenazim are three Eastern Europeans, one Swede, and one German. One Hungarian and several other Swedish individuals are intermediate between the Northern European/Germano-Slavo-Celtic cluster and the Ashkenazi cluster.

Furthermore, it is only PC2 that distances the Ashkenazim so greatly from the Arab samples. The Druze, Palestinian, and Ashkenazi samples exhibit equally high values for the first principal component. However, the Ashkenazim and the Arabs are polar opposites in regard to their values for PC2, with the Ashkenazim having values that surpass even the Swedes and the Russians and the Arabs having values approximately equal to those of the Sardinians.

argiedude said...

A quick glance through the pdf shows they tested like 4000 people!! Niiice...

But there are some problems.

The Greek distance to Germany was 0,0040, and to Orkney 0,0100, implying the distance between Germany and Orkney is around 0,0060, while in the Heath study the distance between UK and southeast Germany was 0,0006, a difference of an order of magnitude.

Also the Greek distance to Russia and Germany is 0,0110 and 0,0040, implying a difference between Germans and Russians of perhaps 0,0070, but in the Heath study Germans and Russians had a distance of 0,0016.

Obviously they're using HGDP samples for some of the populations: definitely Orkney, probably Palestinians, probably Tuscans, probably Russians. The study may have huge sample sizes for Germans and Greeks, but for Orkney and Russia and Palestinians they probably have just 20 or 30, as per the typical sample size of the HGDP populations. I've already established reliably that the sample sizes in these FST estimates need to be in the hundreds for accurate results, if the populations studied are extremely closely related, as should be the case in any study of Europe. And I've already established that the typical error rate introduced by the low sample sizes of the HGDP samples will increase FST by a fixed amount of around 0,0050, though it varies around that average hugely. With this established, everything fits into place. The distance between Greeks and Italians and Germans is extremely accurate because the study used proper sample sizes numbering in the hundreds. But the distances in which one of the pairs is an HGDP population, such as Orkney, Russia, Tuscans, or Basques, is very wrong, and as a rule of thumb, to arrive at the probable result had the study used big sample sizes for these HGDP populations, the FST score should be reduced by about 0,0050 FST. Thus, the Greek-Orkney distance of 0,0100 becomes 0,0050, which then compared to the Greek-German distance of 0,0040, implies a German-Orkney distance of 0,0010, which is absolutely perfect (UK-southeast German distance was 0,0006 in the Heath study, which used hundreds of samples per population).

The Italian samples are from Americans, so it's likely that close to half of them are ultimately of Sicilian origin. The Italian-Greek distance is 0,0000 FST. The North African-Europe distance is 0,0300 FST. Greeks have virtually no non-European ancestry (y-dna and mtdna). If Sicilians had even as little as 5% North African ancestry, their distance to the Greeks should have shot up to 0,0010 or 0,0015 FST.

argiedude said...

I can't help but note how absolutely perfect my guesstimate for Greeks and southern Italians was, back when we didn't have any autosomal studies of Greeks and southern Italians and it was anyone's guess where they would fall.

http://i88.photobucket.com/albums/k178/argiedude/FSTfromBritain.gif

Polak said...

>>The Greek distance to Germany was 0,0040, and to Orkney 0,0100, implying the distance between Germany and Orkney is around 0,0060.<<

It doesn't work like that at all.

Also, the Orkneys don't represent the UK very well because they're a breeding isolate of sorts, and this affects Fst distance.

Btw, 23andme tests 3516 of the 3519 SNPs, so it's possible to devise intra-European ancestry tests based on this paper for all those who have the data.

Ponto said...

All this study shows is that Western Eurasian nationalities and ethnic groups that live close to each other are closer genetically. No big deal.

The study shows clearly that isolating Europe from the Middle East genetically, I mean the Paleolithic/Neolithic paradigm, is foolhardy. The study shows that one group, the Adygei ethnic group in the Caucasus region, has a genetic connection to South Asian populations. Gee whiz, what a no-brainer that one was! Geographic closeness means genetic closeness.

The American Jews, that is what they really are now, the Ashkenazim, are as I have said before, just the descendants of the Peninsula inhabitants of Italy with some minor Middle Eastern or "Turkish" embellishments but their genetic origins are skewed by restrictive marriage practices in order to preserve their religion and eating practices. The Finns are outliers in Europe because of the small group of people that formed the original population and who had unusual alleles. Founder effects.

Taking Europe as a whole, it is obvious that the Southern parts have been mostly seeded by immigrants from the Middle East. So Italians are close to Greeks who are close to Middle Eastern ethnic groups genetically. Other Europeans are a mix of Caucasoids from the Middle East and further north in Southwest Asia.

The Paleolithic/Neolithic paradigm, the farmer versus the hunters is rather simplistic. Europe's peopling was more complex than just two groups separated by 25 ky and coming together as a result of the invention of farming.

The Bedouin may be highly inbred, but when they outbred it was with other Middle Easterners, maintaining a Middle Eastern array of SNPs. The same can be said for Europeans like Italians, Spanish and Greeks, who were also fond of close-relative unions until recently in historical terms. The Middle Eastern component of Jewry, however, outbred mostly with non-Middle Easterners, so the results of their later restrictive unions would be quite different from those of Bedouins or other Arabs, and from the inbreeding practiced by Europeans.

Dienekes said...

There were only 7 Greek subjects.

antoine1706 said...

Argiedude writes: "The North African-Europe distance is 0,0300 FST. Greeks have virtually no non-European ancestry (y-dna and mtdna). If Sicilians had even as little as 5% North African ancestry, their distance to the Greeks should have shot up to 0,0010 or 0,0015 FST"

Where did you get these numbers regarding North Africa? Currently the only group tested are Mozabites, who are Saharan and not representative at all, with more than 20% of sub-Saharan ancestry. So it is almost certain that if the Mozabite-Europe distance is 0,0300 FST, distances between Europe and Mediterranean North African groups should be very low, even less than that between Europe and Palestinians... By the way, it is visible: Bedouins, Druze and Palestinians are much darker than coastal North African people

Gioiello said...

Ponto writes: “The American Jews, that is what they really are now, the Ashkenazim, are as I have said before, just the descendants of the Peninsula inhabitants of Italy with some minor Middle Eastern or "Turkish" embellishments but their genetic origins are skewed by restrictive marriage practices in order to preserve their religion and eating practices”.

It is no accident that we have both earned many bans.

Ebizur said...

Gioiello, the Ashkenazi Jews have been placed "so far from Middle Easterners and between Europeans and Caucasians (ADY)" only in regard to their value for PC1 in Graph A of Figure 3 (Principal component analyses of Southern European populations). Actually, a couple Greek subjects also have been assigned similar values for PC1, so it would be more accurate for you to say that Ashkenazim, Adyghes, and some Greeks have been assigned values for PC1 that are intermediate between those of other Southern Europeans and those of Mashreqi Arabs in a principal component analysis performed with the exclusion of all Northern European populations.

Furthermore, Graph C in Figure 3 shows that the reason for the placement of the Adyghe near the Ashkenazim in Graph A is due to the Adyghes' opposition to the Sardinians. When the Sardinians have been excluded from the analysis, the Adyghes cluster squarely among the Greeks and Italians.

Gioiello said...

Ebizur, thank you for your explanations. Unfortunately I don't have much competence in this matter, and I trust you. But why exclude the Sardinians, who are Italians genetically, with only a drift due to an isolation of many thousands of years? And who are the Mashreqi Arabs? What is their origin? What are their DNA and mtDNA? Argiedude, who is more competent than I am in this matter, says that autosomal analysis with only a few persons is unreliable. I hope Dienekes makes a new calculator. I have deCODEme and 23andMe data which can be examined.

argiedude said...

PUAAAAHHHJJJJJJ! I wish I had read this thing before making a preliminary post yesterday. ABSOLUTELY WORTHLESS! They actually outdid HGDP. The average sample size, excluding Sweden and Ireland (591 and 84 samples, respectively) was 17. That's even worse than HGDP, which is pretty bad. The only acceptable pair-wise estimate in the FST table is the Irish-Sweden comparison! And it's only acceptable, not super accurate. The results for the other 15 populations are worthless. Their margin of error eats up (several fold) the real result. In some cases the margin of error is probably ten times greater than the result itself. Completely worthless.

Here are some examples.

Greece and Italy have a distance of 0,0000 FST, supposedly indicating they're virtually the same people (Australians, British, and CEU had an FST of 0,0002 with respect to each other). But... comparing Greek and Italian results to other populations shows very big differences: Greek-Spanish GD is 0,0035 FST, while Italian-Spanish GD is 0,0010. That difference is equal to the GD between Spain and England, or between Russia and England. In the very excellent study of north Europe done by McEvoy earlier this year, which included the Aussies, Brits, and CEU, the FST never changed when comparing these 3 English-speakers with other Europeans. For example, the distance to Finland was 0,0064, 0,0064, and 0,0064. Or to Denmark: 0,0005, 0,0005, 0,0005. Or to Ireland: 0,0004, 0,0009, 0,0006 (partly Irish Aussies scored the 0,0004). Consistency in results, thanks to gargantuan sample sizes (I think 500+ in the McEvoy study; it HAS to be in the hundreds when testing very closely related people). Compared to McEvoy's results, this new study is a total joke.

Basques and Spain have a distance of 0,0060 FST, equal to the distance between Spain and Russia in the infinitely much better study of Europe done by Heath in which the average sample size was 300.

After closely looking at the results that can be compared with Heath's infinitely much better study, I'm surprised that the margin of error seems to be just 0,0020 FST. With sample sizes of just 17, I would have expected a margin of error of more like 0,0050 FST or worse. But even 0,0020 FST is completely unacceptable. The distance between Spain and England is 0,0024 FST. How can you draw conclusions about the relationships amongst Europeans when your margin of error is as big as the distance from one extreme of Europe to the other?!! [In hindsight, the negative FST between Greece-Italy (-0,0001 FST) was a warning flag.]

To summarize: worthless! TOTALLY WORTHLESS!

Dienekes said...

"Their margin of error eats up (several fold) the real result. In some cases the margin of error is probably ten times greater than the result itself. Completely worthless."

The standard deviations are given in Table S2.

GRK-ITN: -0.0001 +/- 0.0011

So, yes, even accounting for the confidence intervals, Greeks and Italians are practically the same people.

To put it in perspective, the average Fst between populations is 0.0095; Greeks and Italians, even if their real Fst is +1SD (=0.0011) are nine times closer to each other than two random European populations are.
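
The "nine times" figure can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the three numbers quoted in this comment; the variable names are illustrative, and the 0.0095 average is taken as stated rather than recomputed from the table.

```python
# Figures quoted in the comment above.
fst_grk_itn = -0.0001    # Greek-Italian point estimate (Table S2)
sd = 0.0011              # its standard deviation (Table S2)
mean_pairwise = 0.0095   # average pairwise Fst, as stated in the comment

upper = fst_grk_itn + sd                  # point estimate plus one SD
ratio_vs_upper = mean_pairwise / upper    # average relative to estimate + 1 SD
ratio_vs_sd = mean_pairwise / sd          # reading "+1SD (=0.0011)" literally

print(round(upper, 4), round(ratio_vs_upper, 1), round(ratio_vs_sd, 1))
```

Either reading gives a ratio of roughly nine (about 9.5 against the estimate plus one SD, about 8.6 against 0.0011 itself), so the comparison does not depend on which of the two was meant.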

argiedude said...

If you think that a distance of -0,0001 + 0,0011 results in "practically the same people", then you must also think that French and Spaniards are practically the same people (0,0008), or the French and British (0,0006), or Germans and Poles (0,0012), or Poles and Russians (0,0002).

"The small differences in these independent samplings (mean SD = 0.0009; median SD =0.0008) indicate that this approach resulted in good estimations of paired Fst values."

The average deviation from equivalent results in Heath's study was 0,0020 FST. Heath's study is the Bible compared to this garbage.

argiedude said...

"To put it in perspective, the average Fst between populations is 0.0095;"

You calculated that average including all the non-Europeans, which make up 1/3 of the samples, and the Sardinians, which are relatively extremely far away from Europe. And these autosomal studies are showing that Europeans form a distinct genetic bloc relatively set apart from North Africans and Middle Easterners. Plus, the results are all pumped up artificially by 0,0020 FST as an artifact of the extremely small sample sizes. In Heath's study of Europe the average distance between 2 European populations was 0,0019 FST.

"Greeks and Italians, even if their real Fst is +1SD (=0.0011) are nine times closer to each other than two random European populations are."

That's south Italians, by the way. Spaniards and Russians seem to be very homogenous, Italians seem to be very different genetically (relatively speaking, of course, within a European perspective). Anyhow, I do think that south Italians and Greeks probably are close, I'll bet their real result will be determined, in an authentic study and not in this garbage, to be around 0,00010 FST. But that's typical of 2 populations so close geographically. Ireland and Britain have around 0,0005 FST, and Britain and Denmark have around 0,0006 FST, and they're all separated by water, too, like south Italians and Greeks.

Dienekes said...

Only 1/10 of the pairwise Fst distances are within the Greek-Italian distance+3 standard deviations, so yes, they are practically the same people in the European context.

Incidentally, there is really no problem with the small sample sizes. As previous research has shown, even samples of 1-2 individuals are generally placed correctly in a European map, suggesting that most individuals are representative of their populations. When you have clusters with a lot of "scatter" around the mean, then undersampling leads to error, but if clusters are very tight, then even a sample of 1 is sufficient.

argiedude said...

"I'll bet their real result will be determined, in an authentic study and not in this garbage, to be around 0,00010 FST"

I meant to say 0,0010 FST.

argiedude said...

"Incidentally, there is really no problem with the small sample sizes."

The average GD between the European HGDP samples is 0,0075 FST, but in the Heath study comparable sets of samples averaged 0,0025 FST. The HGDP populations have an average sample size of 25, while Heath's populations averaged 300. I've seen it over and over. I've thought before about why this could be, and I don't have a real answer, but I think it might have to do with the small sample sizes producing too choppy results in each of the thousands of equations that make up a single FST estimate. Of course, the point of using thousands of SNPs is to smooth out those results.

argiedude said...

post by antoine1706:

Argiedude writes: "The North African-Europe distance is 0,0300 FST. Greeks have virtually no non-European ancestry (y-dna and mtdna). If Sicilians had even as little as 5% North African ancestry, their distance to the Greeks should have shot up to 0,0010 or 0,0015 FST"

Where did you get these numbers regarding North Africa? Currently the only group tested are Mozabites, who are Saharan and not representative at all, with more than 20% of sub-Saharan ancestry. So it is almost certain that if the Mozabite-Europe distance is 0,0300 FST, distances between Europe and Mediterranean North African groups should be very low, even less than that between Europe and Palestinians... By the way, it is visible: Bedouins, Druze and Palestinians are much darker than coastal North African people

end post by antoine1706

Yes, I used the Mozabites as representative of all North Africans, and yes, it's not ideal. But you're wrong about them having 20% sub-Saharan ancestry. The Rosenberg study from which you're almost certainly taking this info found them to have ~12% sub-Saharan ancestry. That's very typical of North Africans, given that they usually have around 25% mtdna L and 5% y-dna E1b1a. The Turchi (2009) study of Morocco and Tunisia included a reference on Mozabite mtdna. Mozabites had 13% mtdna L, out of 85 total samples. And your observation about Palestinians being darker than North Africans because they have more black ancestry: in the Rosenberg study Middle Easterners had about 2% sub-Saharan ancestry; like I said, North Africans (not Mozabites) have 15% black ancestry according to y-dna/mtdna, so it follows very logically that North Africans would be expected to have the greatest genetic distance to all other Caucasians. For now we only know about the Mozabites. North Africans are probably similar to them, and only somewhat closer to other Caucasians, not "very close".

...............................

post by Polak:

>>The Greek distance to Germany was 0,0040, and to Orkney 0,0100, implying the distance between Germany and Orkney is around 0,0060.<<

It doesn't work like that at all.

end post by Polak

You're right only in a strict sense, such as, if we take 3 random samples and we know 2 of their distances we can estimate the 3rd. It doesn't work like that, as you pointed out. But this isn't a set of 3 random samples. This is a straight line from Orkney to Germany to Greece. The only way it wouldn't work is if one of the 3 was heavily inbred.
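
The disagreement over whether two distances to a common population pin down the third can be made concrete. The sketch below assumes, purely for illustration, that the distances obey the triangle inequality (Fst is not guaranteed to behave as a true metric), and the helper name `third_side_bounds` is hypothetical. Under that assumption two known distances only bound the third, and simple subtraction gives the lower end of the range, which is reached only when the three populations lie on a straight line.

```python
def third_side_bounds(d_ab, d_bc):
    """Range allowed for d(A, C) if the distances obey the triangle inequality."""
    return abs(d_ab - d_bc), d_ab + d_bc


# Greek-German and Greek-Orkney figures as quoted in the comment.
low, high = third_side_bounds(0.0040, 0.0100)
print(round(low, 4), round(high, 4))
```

So the 0,0060 obtained by subtraction is the smallest German-Orkney value compatible with the two quoted distances under this assumption, not an estimate of it; anything up to 0,0140 would be equally compatible.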

argiedude said...

Anyhow, whatever I said about North African ancestry in Sicilians can now be forgotten, because this study is probably the worst autosomal study of Europe ever. Seldin's study from 3 years ago was better (it even used more SNPs).

argiedude said...

Here's a final example regarding small sample sizes and inaccuracy in FST estimates.

In early 2009 I did an FST estimate on all the HGDP samples, and I used it to build a map of global genetic distances. I just finished comparing my results with the exact same population pairs used in this study and they were identical, so I know what I'm doing (I even had the same margin of error versus Tian's results, 0,0008 FST, as he had between his 3 separate runs of random SNPs).

When I was making the FST estimates earlier this year, I bunched up the very small East Asian samples (n=10) into 2 sets of northern and southern Chinese samples. The pairwise difference between these 2 sets of 50 samples was 0,0110 FST. But individually, the average difference between the 5 southern and 5 northern samples (25 total pairwise FST results) was 0,0208, a giant difference. And yet they were the exact same samples. If the sample size didn't matter, smaller subsets should've obtained roughly the same result as all together.

That was the most extreme example, because East Asia had the smallest sample sizes. But in every case, when I bunched up the HGDP samples, increasing their sample size, their FST distances to other populations dropped, generally by a roughly fixed amount, around 0,0050 +/-0,0030. It didn't matter if the original distance was 0,1500 or 0,0100, increasing the sample size resulted in a decrease of roughly the same size. The distance between sub-Saharan and Europeans dropped 0,0050 when bunching up all sub-Saharan and Europeans into 2 mega-samples, the same between Europeans and East Asians, between Europeans and Middle Easterners, etc. They never increased, the FST value always went down.
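
The pattern described here, where Fst drops when small samples are pooled, is what one would expect from an estimator that does not correct for sample size. The simulation below is a hedged sketch, not a reconstruction of the calculation above: it assumes a plain heterozygosity-based estimate (ratio of sums over SNPs), draws both samples from one and the same population so that the true Fst is zero, and uses a hypothetical function name `naive_fst`.

```python
import random


def naive_fst(n, n_snps=1000, seed=0):
    """Uncorrected Fst between two samples of n diploids from ONE population."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_snps):
        # True allele frequency, shared by both samples (true Fst = 0).
        p = rng.uniform(0.05, 0.95)
        # Sampled allele frequencies from 2n chromosomes each.
        p1 = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        p2 = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
        p_bar = (p1 + p2) / 2
        # Mean within-sample heterozygosity and pooled heterozygosity.
        h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        num += h_t - h_s
        den += h_t
    return num / den


for n in (10, 25, 300):
    print(n, round(naive_fst(n), 4))
```

In this setup the spurious Fst is always positive and shrinks roughly in proportion to 1/(4n), so samples of 10 to 25 give an inflation of about one to a few hundredths while samples in the hundreds give about a thousandth. Estimators that include a sample-size correction (Weir and Cockerham's is a common one) are designed to remove this bias and can therefore return slightly negative values; whether the inflation applies to a given table depends on which estimator was used.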

Major Tom said...

They write:
“Interestingly, the majority of Italian Americans (NYCP 4 grandparent defined) appear to derive from southern Italy and overlap with subjects of Greek heritage”
But what do they mean by Italians from southern Italy?
Many immigrants from southern Italy came from places that the Greeks never colonized. Many of them originate from the inland mountainous regions of Irpinia, Abruzzo, Molise, Lucania and others that never had relations with Greece. Yet they are often genetically closer to the Greeks than the inhabitants of the places that saw Greek settlements.
On this site there are published documents that show it. These "scholars" are dealing with things they don't know well.

Maju said...

Good point, Major. It highlights the fact that Aegean influence in Southern Italy pre-dates not just classical Greece but Greece altogether.

It began with Cardium Pottery (Mediterranean Neolithic), but this influx is more from Albania and the Adriatic Balkans, though it may have ultimate origins in Greece (Otzaki). It continued in the Bronze Age (Aegean chronology, Italy remained Chalcolithic for quite a while yet), including the Late Bronze Age, which is already a Greek influence. And finally we have the Magna Grecia historical episode, with connections that surely lasted also into the Roman era.

So you have like 6000 years of more or less continuous flow of people and ideas from the Aegean and nearby areas (not necessarily Greece all the time).

Major Tom said...

Maju, we have to use the correct terms. There were not Greeks 6000 years ago. The concept of a Greek nation is much more recent, and the term Hellenes itself is more recent. It was at first a term "ad excludendum" (many people were often called non-Hellenes even though they spoke a Greek dialect); only later was it used to define an ethnic affiliation.
Then we have to say that the Mediterranean peoples have a common substratum, that cannot be defined Greek or with the name of another historical people.

Dienekes said...

The trouble is, that "cardium pottery" or the "Neolithic" or any number of cultures that originated in the Aegean didn't affect just southern Italy, but a great deal of the Balkans and the Mediterranean.

But it is not a great deal of the Balkans and the Mediterranean that are indistinguishable from Greeks, but specifically southern Italy. Therefore, these similarities are not the result of "cardium pottery" or the Neolithic, but of the fact that the Greeks colonized southern Italy, and southern Italy was essentially a full part of the Greek-speaking world for about 15-20 centuries.

Maju said...

Maju, we have to use the correct terms. There were not Greeks 6000 years ago.

Where did I say that? I talked all the time about Aegean and Balkan influences.

...

Then we have to say that the Mediterranean peoples have a common substratum, that cannot be defined Greek or with the name of another historical people.

Nope, because it is not a substratum (but one or several superstrata) and it does not affect all Mediterranean peoples equally (these influences were much weaker and more intermittent further west in Europe, or further north in Italy itself, and elsewhere in the Mediterranean basin the processes were different).

The trouble is, that "cardium pottery" or the "Neolithic" or any number of cultures that originated in the Aegean didn't affect just southern Italy, but a great deal of the Balkans and the Mediterranean.

This is true for CP but not for the rest, at least not to a similar extent. What arrived in Italy with clear strength as the "new fashion" reached SW Europe only as some sort of mysterious echo. Eventually, in the Bronze Age, the Hesperides became some sort of magnet for easterners, but Italy was always much closer and received a more direct impact.

Greek colonies in Italy too, only existed in some coastal areas, notably Calabria and Eastern Sicily (their coasts) but the classical Greek influence in the interior was always indirect. There's no way you can claim that South Italians were replaced in the Classical Greek period alone (or even including the Mycenaean). We know with certainty that the interior was never colonized by Greeks and many coastal areas weren't either.

It can only be explained by a much longer and gradual process, of which historical Greeks were just the penultimate layer. This process is, IMO, perfectly in agreement with the archaeology of the region, especially since the Chalcolithic (Early Bronze in the Aegean).

But it is not a great deal of the Balkans and the Mediterranean that are indistinguishable from Greeks, but specifically southern Italy.

They are not so extremely identical: they show clear influence but it's not like they are "Greeks" 100% - not at all. And the Aegean peoples could in fact propagate themselves much more easily by sea, notably once they acquired Bronze technology. Just as historical Greeks never paid much attention to the inland Balkans (didn't the Amazons live there in legend?), their predecessors probably did not either.

Dienekes said...

Greek colonies in Italy too, only existed in some coastal areas, notably Calabria and Eastern Sicily (their coasts)

Your historical knowledge is laughable.

Major Tom said...

Dienekes, you are confusing the diffusion of the Greek language and culture with the genetic impact of Greek colonization.
The influence of Greek culture was enormous and the language spread widely in Italy. The whole Roman upper class was bilingual from the early days of the Republic, but actual colonization remained on the coasts of the peninsula and went deeper in Sicily.
However, the studies done show that the populations of central Italy are often genetically closer to the Greeks than those who live on the coasts today; as for Sicily, the population that lives in the western part is closer to the Greeks than the one that lives in the east.
So would this show that the Elymians were more Greek than the people of Syracuse?

Maju said...

Your historical knowledge is laughable.

That's not an argument (and it's the third or fourth time you "tackle" a discussion without arguments like that in few days, what doesn't say much for you having any reason). And it is obvious that classical Greeks did never colonize the interior - not sure if there was some inland colony but, if so, it is the exception, not the rule.

Vincent said...

But it is not a great deal of the Balkans and the Mediterranean that are indistinguishable from Greeks, but specifically southern Italy.

You are basing this on the huge number of non-Italian, non-Greek samples from the Balkans and Mediterranean in this paper?

Seriously, this paper revealed the same sort of diversity clines in Europe that we've seen in countless recent papers. Greece and southern Italy appear similar in this kind of analysis because they have had recent contact but also because they share many aspects of similar origins. Had the study included Albanians, Croatians, Hungarians, and so forth you would see that the Greek/Italy connection is not dramatically different from any connection you'd find between geographically proximate locations.

Also, I doubt anyone is convinced by Dienekes' attempt to demonstrate "clusters" with his dendrogram but if you want a look at what an honest unrooted plot of the fst matrix from the paper reveals I've put one up.

http://vizachero.com/images/TianNJ.pdf

VV

Dienekes said...

You are basing this on the huge number of non-Italian, non-Greek samples from the Balkans and Mediterranean in this paper?

Well, Yugoslavs and Central Italians are distinguishable from Greeks.

So, yeah, the fact that southern Italians are not, while central Italians are is significant.

Had the study included Albanians, Croatians, Hungarians, and so forth you would see that the Greek/Italy connection is not dramatically different from any connection you'd find between geographically proximate locations.

No need to speculate, as such studies have already occurred (vide supra) and show that Greeks can be distinguished from Central Italians/Yugoslavs. Not sure if the same will be true for Albanians at this level of genetic resolution, but Albania, like southern Italy had plenty of Greeks in its territory until recently, and numerous Greek colonies in antiquity, so I wouldn't be surprised if they were somewhat close.

Also, I doubt anyone is convinced by Dienekes' attempt to demonstrate "clusters" with his dendrogram but if you want a look at what an honest unrooted plot of the fst matrix from the paper reveals I've put one up.

Pray tell, what makes my dendrogram "dishonest" and your unrooted plot "honest"?

Dienekes said...

That's not an argument (and it's the third or fourth time you "tackle" a discussion without arguments like that in few days, what doesn't say much for you having any reason).

Well, you only need to look at a map of ancient Greek colonies in Italy and Sicily to be convinced that they were not limited to "Calabria" and "East Sicily". Hell, the very first colony of the Greeks in Italy was in the bay of Naples, do you propose that Naples is in "Calabria"?

Vincent said...

Pray tell, what makes my dendrogram "dishonest" and your unrooted plot "honest"?

The UPGMA tree you built involves a dubious assumption (constant molecular clock) and assumes a known root. UPGMA is very unlikely to produce an accurate topology except in the most blessed of circumstances - which is why most bioinformatics references discourage its use. A pairwise matrix of fst is not one of those blessed circumstances, for sure.

VV

Dienekes said...

Did you see it in your dream that this is an UPGMA tree?

antoine1706 said...

Agietude writes : "Yes, I used the Mozabites as representative of all North Africans, and yes, it's not ideal. But you're wrong about them having 20% sub-Saharan ancestry. The Rosenberg study from which you're almost certainly taking this info found them to have ~12% sub-Saharan ancestry. That's very typical of North Africans, given that they usually have around 25% mtdna L and 5% y-dna E1b1a. The Turchi (2009) study of Morocco and Tunisia included a reference on Mozabite mtdna. Mozabites had 13% mtdna L, out of 85 total samples. And your observation about Palestinians being darker than North Africans because they have more black ancestry: in the Rosenberg study Middle Easterners had about 2% sub-Saharan ancestry; like I said, North Africans (not Mozabites) have 15% black ancestry according to y-dna/mtdna, so it follows very logically that North Africans would be expected to have the greatest genetic distance to all other Caucasians. For now we only know about the Mozabites. North Africans are probably similar to them, and only somewhat closer to other Caucasians, not very close".

That's not correct. Mozabites, who are considered a "mixed" population (like the Tuareg) by coastal North Africans, have on average 20% sub-Saharan ancestry, as indicated by this last study.

http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000519

Regarding mtdna, it is true that it is found at between 3% and 45%, but some coastal ethnic groups show less than 10%, which is less than what is found in south-central Portugal.

Saying that other North-Africans are similar to Mozabites is (almost) as ridiculous as saying that South-portuguese are similar to Northern-Finns...

By the way, while the Mozabite sample shows 12% sub-Saharan at 23andme AncestryPainting (using > 500.000 SNPs), coastal North-Africans show less than 0.5% and even 0% African ancestry in some cases. So your estimation is wrong.

Dienekes said...

There are no typical Northern Africans. Northern Africa has anything in it from full Caucasoids to substantially Negroid populations. Even if 20% (or X%) is considered typical, there is still the issue that the remaining 80% is not necessarily the same all over North Africa. How similar is the Caucasoid component in Spain vs. Greece/Anatolia in the northern Mediterranean? Why would it be homogeneous in the south Mediterranean?

antoine1706 said...

Dienekes writes "There are no typical Northern Africans. Northern Africa has anything in it from full Caucasoids to substantially Negroid populations."

Correct.

Here are some African % at 23andme AncestryPainting (> 500.000 SNPs).

North-Africans :
Mozabite individual (from HGDP) : 12%
Northern Moroccan 0.17%
Egyptian 0.10%
...

Near-eastern :
Palestinian1 0.77%
Yemen 0.48%
Palestinian2 0.39%
...

Almost all fully Iberian and Sicilian individuals (from Europe, not North America) of course usually have 0%, but a very few of them do have some African %:

Spanish 1.5%
Sicilian 0.48%
Portuguese 0.18%

3 Greek individuals show 0% African

Dienekes said...

Here is what the UPGMA tree actually looks like:

http://i25.tinypic.com/x0txg8.jpg

VV's contention that his neighbor-joining tree is "honest" is of course complete BS.

The NJ algorithm is nothing more than a computationally cheap method for inferring a phylogeny. There are much better methods to do this when dealing with only 17 populations.

But, the real trouble with the use of NJ is that these populations are not taxa of a phylogeny as they didn't evolve tree-like from a common ancestor, but have exchanged genes every which way. So, the notion that creating a phylogeny using a cheap algorithm is somehow "honest" is complete nonsense.

Note also that VV erroneously assumed that my dendrogram was an UPGMA tree and that it pretended to present a phylogeny (both unfounded assumptions of his own imagination). He could've easily asked for the technical details of the dendrograms, but why bother when he can pretend to be knowledgeable by copying a little bit of Wikipedia to the effect that NJ is "superior" to UPGMA?

In fact, UPGMA is a particular kind of linkage method for hierarchical clustering. "Dendrogram" does not equal UPGMA.

The dendrograms presented in my blog post are certainly not "phylogenies", and they are not UPGMA dendrograms. They are the result of hierarchical clustering using the complete linkage method, which results in compact, similar clusters, and they are a way of visualizing similar populations, and not, of course, of deriving their "phylogeny".
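For readers who have not met complete linkage before, here is a minimal pure-Python sketch of how it merges clusters. It is only an illustration, not the code behind the dendrograms in the blog post: the labels A to D and the distances are made-up placeholders, not values from the Fst table in Tian et al.

```python
from itertools import combinations

def complete_linkage(labels, dist):
    """Agglomerative clustering with complete linkage.

    dist maps frozenset({a, b}) -> distance between items a and b.
    Returns the merges, in order, as (cluster_1, cluster_2, height).
    """
    clusters = [frozenset([name]) for name in labels]
    merges = []

    def linkage(c1, c2):
        # Complete linkage: the distance between two clusters is the
        # LARGEST pairwise distance between their members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest linkage distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((set(c1), set(c2), linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Hypothetical toy distances (placeholders, not Fst values from the paper).
toy = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 3.0,
    frozenset(("A", "D")): 9.0,
    frozenset(("B", "D")): 8.0,
    frozenset(("C", "D")): 5.0,
}

merges = complete_linkage(["A", "B", "C", "D"], toy)
for left, right, height in merges:
    print(sorted(left), "+", sorted(right), "at height", height)
```

In this toy table C joins the {A, B} cluster at height 4.0, the larger of its two distances to A and B, so a merge height describes the whole cluster rather than any single pair.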

Vincent said...

Did you see it in your dream that this is an UPGMA tree?

Huh? I don't need to dream it: I can see it right there on your blog. It is distance-based, it is rooted, and it is ultrametric: it is a UPGMA tree.

You can easily see some of the distortions in your presentation. In your second tree, the Russians are shown being just as distant from the Spanish as from the Bedouins, even though according to the original data matrix they are much more distant from the Bedouins (fst=0.0211) than from the Spanish (fst=0.0079).

The unrooted, non-ultrametric NJ tree, while perhaps not perfect, is a far more accurate representation of the original data.

VV

Dienekes said...

Huh? I don't need to dream it: I can see it right there on your blog. It is distance-based, it is rooted, and it is ultrametric: it is a UPGMA tree.

Ok, your ignorant belief that dendrogram = UPGMA tree is noted.

Vincent said...

Ok, your ignorant belief that dendrogram = UPGMA tree is noted.

I don't mind acknowledging that UPGMA and CL are subtly different variations of the same flawed approach to tree building. Happy?

The bottom line is that it is the forced imposition of ultrametric distances on a set of data that are not ultrametric which is the fatal error.
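To make "ultrametric" concrete, here is a small Python sketch of the three-point condition: a set of distances is ultrametric only if, in every triple, the two largest distances are equal. The Russian/Spanish and Russian/Bedouin values are the ones quoted above; the Spanish/Bedouin value is a placeholder I made up for illustration and should be replaced with the figure from the paper's table.

```python
from itertools import combinations

def is_ultrametric(labels, d, tol=1e-12):
    """Three-point condition: in every triple the two largest distances tie."""
    for x, y, z in combinations(labels, 3):
        low, mid, high = sorted(
            (d[frozenset((x, y))], d[frozenset((x, z))], d[frozenset((y, z))])
        )
        if high - mid > tol:
            return False
    return True

labels = ["Russian", "Spanish", "Bedouin"]

observed = {
    frozenset(("Russian", "Spanish")): 0.0079,  # quoted in this thread
    frozenset(("Russian", "Bedouin")): 0.0211,  # quoted in this thread
    frozenset(("Spanish", "Bedouin")): 0.0150,  # PLACEHOLDER, not from the paper
}

# What an ultrametric tree has to display if Russians join the other two
# populations at a single height (here taken as the larger observed value).
tree_like = dict(observed)
tree_like[frozenset(("Russian", "Spanish"))] = 0.0211

print(is_ultrametric(labels, observed))   # the quoted values cannot tie
print(is_ultrametric(labels, tree_like))  # the forced version does
```

Whatever the real Spanish/Bedouin value is, this triple can only pass the test if that value is exactly 0.0211, which is why the placeholder does not change the conclusion.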

In this case the "computationally cheap" NJ algorithm produces a superior rendering of the data, and one that is at odds with your distorted presentation. No wonder you simply presented "a dendrogram I created" instead of providing any real details.

VV

Vincent said...

The dendrograms presented in my blog post are certainly not "phylogenies", and they are not UPGMA dendrograms. They are the result of hierarchical clustering using the complete linkage method, which results in compact, similar clusters, and they are a way of visualizing similar populations, and not, of course, of deriving their "phylogeny".

I think you are only fooling yourself. Do you really expect anyone else to believe the suggestion that representing populations as more or less similar is something other than an effort to represent them as more or less related?

I hope not.

The goal of hierarchical clustering is to deduce relationships, and when you represent those deduced relationships in a tree form you are creating a phylogeny whether you call it that or not.

VV

Dienekes said...

What is distorted is your flawed imposition of a phylogeny on a set of populations that did not evolve tree-like, your flawed assumption that clustering is only used to infer a phylogeny, your flawed assumption that dendrogram = UPGMA, and your ridiculous attempt to hide your ignorance, by concluding (an impressive 18 minutes after learning the difference) that my approach is "flawed".

1. These populations are not part of a phylogeny; NJ builds a phylogeny; therefore your "superior rendering of the data" is a truckload of BS.

2. Clustering is not useful only to infer phylogenies (which are nonsensical in this case), but also as a way of visualizing distance, and grouping similar entities together.

What happened here is that your half-educated mental apparatus assumed that clustering is only useful for inferring phylogenies. It combined this assumption with the assumption that UPGMA is equivalent with dendrogram. It combined these two assumptions with your knowledge that NJ is better than UPGMA in inferring phylogenies, and it concluded that "your" tree was much better than "my" tree, under the false belief that (a) I was as ignorant as you in wanting to fit a phylogeny to populations that didn't belong in one, and (b) I wasn't capable of fitting a phylogeny correctly if I wanted to.

It's sad when one's attempt to show off fails so miserably. Better luck next time.

Major Tom said...

About the history of "Magna Graecia", it needs to be said that it rose, developed and decayed, like all human activities.
At the height of its flowering the Greeks occupied the coasts of southern Italy and built some towns inland, for defensive reasons and for commerce.
But from the 5th century B.C. the pressure of the Italic tribes grew stronger and the Greeks lost control of almost all the territory. In particular, the Campanians occupied the coasts of the region that bears their name, the Iapygians defeated the Tarentines in 473 B.C. (one of the worst defeats ever suffered by a Greek people, says Herodotus) and the Lucanians occupied everything in southern Calabria too. Many Greek cities were abandoned or occupied by the Italics. Greek influence was notably reduced.
The campaign of Alexander the Molossian was a failure that ended with his death.
Later the Romans entered on the stage, and all know what happened.

Vincent said...

Give it a rest: NJ and complete linkage are both clustering methods based on a distance matrix.

They differ in some ways (ways that can be important) but at their heart they are both clustering algorithms. If complete linkage is merely "a way of visualizing distance" then NJ can be described in the same terms. There is no ground to be gained in claiming one is "clustering" while the other "builds a phylogeny".

The fact is, the NJ tree is not more or less a phylogeny than your CL tree. Both are trees, but neither is actually depicting a true phylogenetic relationship. In either case, the goal is to visualize which populations are most closely related. Right? So the only criteria about which tree is better or worse SHOULD be how well it corresponds to the original distance matrix.

Complete linkage does a worse job of this than NJ - not always, but with this kind of data - because complete linkage is an ultrametric method (like UPGMA is). NJ doesn't suffer this constraint, and so can better represent the actual topology or at least the actual distances. This is clear in the Russian/Spanish/Bedouin example I gave earlier, and in any number of other examples I could highlight.

The final take-away is that you should concentrate less on what you think I am assuming, and more on what I am actually saying. And if you are going to present a tree that purports to show the relationship between populations, then take the time to do it right.

VV

Dienekes said...

Later the Romans entered on the stage, and all know what happened.

Don't forget that there was Greek settlement of southern Italy and Sicily in medieval times as well, until they were lost to the Byzantine Empire.

Vincent said...

By the way, for anyone interested in yet another alternative presentation of the fst data:

http://www.vizachero.com/images/TianNN.pdf

This is a NeighborNet network, similar to an NJ representation but allowing the display of alternate paths between taxa. Areas in which the relationship is most ambiguous will have the most reticulation (aka multiple paths).

VV

Dienekes said...

There is no ground to be gained in claiming one is "clustering" while the other "builds a phylogeny".

I will not distort terminology to help you save face. Hierarchical clustering is used to group similar entities according to a distance measure. It is sometimes used to infer phylogenies, and sometimes it is not. The Neighbor-joining algorithm is used to infer phylogenies period, as the title and content of the paper where it was introduced make clear: "The neighbor-joining method: a new method for reconstructing phylogenetic trees".

you should concentrate less on what you think I am assuming, and more on what I am actually saying.

What you said is I built an UPGMA tree to represent a phylogeny. Both parts of the previous sentence are false.

This is clear in the Russian/Spanish/Bedouin example I gave earlier, and in any number of other examples I could highlight.

The example you gave earlier is based on the false assumption that branch lengths are additive, which anyone even vaguely familiar with dendrograms (not you, since you don't know the difference between an UPGMA tree and a dendrogram) is aware of.

Vincent said...

Neighbor-joining and complete linkage are simply two different hierarchical clustering algorithms. You don't have to take my word for it, any competent treatment of the subject will confirm it.

Travis Wheeler (2009): Neighbor-joining is a hierarchical clustering algorithm. It takes as input a distance matrix D, where dij is the distance between clusters i and j, and initially each sequence forms its own cluster.

Stackebrandt (2006): The most universally applied clustering methods are pairwise clustering algorithms that use a distance or resemblance matrix as input. The unweighted pair group method using arithmetic averages (UPGMA), complete linkage (furthest neighbor), single linkage (nearest neighbor), Ward's method, and neighbor joining are examples of such methods.

And so on. Any of these methods can be used to infer a phylogeny, but that doesn't change the basic nature of what they are: algorithms for building clusters from a matrix of distances.
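As a rough illustration of what neighbor joining does with such a matrix, here is a Python sketch of its first joining step only. It uses the standard Q criterion, Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k), and the usual branch-length formula. The five taxa a to e and their distances are a toy example, not the populations or Fst values from the paper, and the function name is my own.

```python
def nj_first_join(labels, d):
    """Return (i, j, branch_i, branch_j) for the first neighbor-joining step.

    d is a nested dict: d[x][y] is the distance between x and y.
    """
    n = len(labels)
    # Row sums: total distance from each taxon to all the others.
    total = {x: sum(d[x].values()) for x in labels}

    best = None
    for idx, i in enumerate(labels):
        for j in labels[idx + 1:]:
            # Q criterion: small Q means i and j are close to each other
            # relative to how far both are from everything else.
            q = (n - 2) * d[i][j] - total[i] - total[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    _, i, j = best
    # Branch lengths from i and j to the new internal node.
    branch_i = d[i][j] / 2 + (total[i] - total[j]) / (2 * (n - 2))
    branch_j = d[i][j] - branch_i
    return i, j, branch_i, branch_j

# Toy distance matrix (hypothetical taxa, not populations from the paper).
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
labels = ["a", "b", "c", "d", "e"]
d = {x: {} for x in labels}
for (x, y), value in pairs.items():
    d[x][y] = value
    d[y][x] = value

print(nj_first_join(labels, d))
```

Note that the first pair joined here is a and b (distance 5), not d and e (distance 3): the criterion is not simply "smallest distance", which is one of the ways NJ differs from the linkage methods.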

The different algorithms have strengths and weaknesses, and one weakness of complete linkage is that it often distorts the relationships (e.g. doesn't accurately cluster the taxa based on the actual distances) just as it did in this case.

So when you present a dendrogram that purports to visualize the distance between the populations in this study, I don't mind taking the time to point out that your visualization is not an accurate representation of the actual fst table.

VV

Dienekes said...

I don't mind taking the time to point out that your visualization is not an accurate representation of the actual fst table.

Neither is your NJ tree an "accurate representation of the Fst table." Your NJ tree is a nonsensical attempt to fit a phylogeny to the table, since NJ is used to infer phylogenies period, while hierarchical clustering -which is a group of methods, NJ being one of them- does not necessarily infer phylogenies.

The presented dendrogram successfully shows groups of populations clustered according to their similarity. Your "objection" amounts to the following proposition:

A, B belong in cluster X
C belongs in cluster Y

But A is closer to C than it is to B, so the clusters are "distorted".

But, any halfwit knows that this is not the case.

For example, the following distribution shows the clear presence of two clusters

http://i27.tinypic.com/9sqfb8.jpg

yet it is also clear that some points of the bottom left cluster are closer to some points of the top right one than to other points of the bottom left cluster.
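The same point can be checked numerically. The following Python sketch uses made-up coordinates (they are not the points in the linked image) for a "bottom left" and a "top right" group, and compares one point's nearest neighbor in the other group with its farthest neighbor in its own group.

```python
import math

# Hypothetical coordinates chosen for illustration only.
bottom_left = [(0, 0), (1, 1), (2, 2), (3, 3)]
top_right = [(5, 5), (6, 6), (7, 7), (8, 8)]

p = (3, 3)  # the bottom-left point nearest the gap

# Closest point in the OTHER cluster.
nearest_other = min(math.dist(p, q) for q in top_right)
# Farthest point in its OWN cluster.
farthest_own = max(math.dist(p, q) for q in bottom_left)

# Smallest distance between the two groups, and the spacing between
# neighboring points inside a group.
gap = min(math.dist(a, b) for a in bottom_left for b in top_right)
step = math.dist(bottom_left[0], bottom_left[1])

print(f"nearest point in other cluster: {nearest_other:.2f}")
print(f"farthest point in own cluster:  {farthest_own:.2f}")
print(f"gap between clusters: {gap:.2f}, spacing within a cluster: {step:.2f}")
```

In this toy layout the two groups are separated by a gap twice the within-group spacing, yet the point (3, 3) is still nearer to (5, 5) than to (0, 0).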

In conclusion:

1. The presented dendrogram is a useful visualization of the Fst table which shows the presence of North, and South main clusters, each of which can be divided into East-West and European-Arab subclusters respectively. This is a useful visualization that cannot be immediately perceived when reading a table of 17x17 numbers.
2. Knowledgeable people are aware that hierarchical clustering branch lengths are not additive, so if they are interested in the Fst between particular population pairs, they can look it up in the original table, exactly as I advised them to do in my blog post.
3. Ignorami will build a NJ tree over the same data which neither preserves the Fst distances, nor does it represent any sort of phylogeny (as there is none to be preserved).

Vincent said...

Neither is your NJ tree an "accurate representation of the Fst table." Your NJ tree is a nonsensical attempt to fit a phylogeny to the table, since NJ is used to infer phylogenies period, while hierarchical clustering -which is a group of methods, NJ being one of them- does not necessarily infer phylogenies.

The presented dendrogram successfully shows groups of populations clustered according to their similarity.


You really are immune to the truth, aren't you? Can you read what you just wrote?

". . . hierarchical clustering . . . does not necessarily infer phylogenies."

". . . [Hierarchical clustering] is a group of methods, NJ being one of them. . ."

". . . NJ is used to infer phylogenies, period . . ."

Are you so desperate to save face that you can't even maintain positional consistency over the course of a single sentence?

Both my NJ tree and your CL tree are visually representing the same fst table. The NJ tree is doing it more faithfully. There really isn't any more to it than that.

Vincent said...

2. Knowledgeable people are aware that hierarchical clustering branch lengths are not additive, so if they are interested in the Fst between particular population pairs, they can look it up in the original table, exactly as I advised them to do in my blog post.

The whole point of hierarchical cluster analysis is to cluster the most closely related (or less distant, if you prefer) taxa with each other.

The NJ method has the advantage of doing that and ALSO giving you additive branch lengths. Complete linkage, at least in this case, does neither.

The preservation of proportionate distance in the NJ representation is a huge asset. For example, at quick glance the NJ diagram shows you that the Tuscan sample is more closely related to the southern Italian sample than it is to the Greek sample. This observation cannot be made using your dendrogram.

VV

Dienekes said...

You really are immune to the truth, aren't you? Can you read what you just wrote?

". . . hierarchical clustering . . . does not necessarily infer phylogenies."

". . . [Hierarchical clustering] is a group of methods, NJ being one of them. . ."

". . . NJ is used to infer phylogenies, period . . ."


You are really immune to logic, aren't you?

". . . knives . . . are not necessarily used in surgery"

". . . Knives are a group of tools, scalpels being one of them. . ."

". . . scalpels are used to perform surgery, period . . ."

Dienekes said...

The whole point of hierarchical cluster analysis is to cluster the most closely related (or less distant, if you prefer) taxa with each other.

Dude, an hour ago you couldn't tell apart a dendrogram from an UPGMA tree, and now you became an expert in "hierarchical cluster analysis"?

For example, at quick glance the NJ diagram shows you that the Tuscan sample is more closely related to the southern Italian sample than it is to the Greek sample.

Er, a quick look in your NJ diagram shows that ITN-TUSC and ITN-GRK are about equi-distant, something which is false.

Let's summarize:

NJ is a method for inferring phylogenies. It does not preserve Fst; sum of branch lengths is not equal to the Fst between a pair of populations.

One uses NJ if one wants to infer a phylogeny. This is not the case here, so we're left with a piss-poor approximation of Fst distances.

So much for your superior "honest" tree.

Vincent said...

. . . scalpels are used to perform surgery, period . . .

You can't win for losing.

http://en.wikipedia.org/wiki/Scalpel

"A scalpel is a small but extremely sharp bladed instrument used for surgery, anatomical dissection, and various arts and crafts."

Vincent said...

NJ is a method for inferring phylogenies. It does not preserve Fst; sum of branch lengths is not equal to the Fst between a pair of populations.

No. As you yourself observed above, NJ is a method of clustering. NJ can be used to infer phylogenies, but that is not what it IS.

I agree that you'd have to be very lucky if you could get a NJ diagram that preserved each individual fst distance with perfect fidelity. The real world is messy, after all. And you can have negative fst . . .

That doesn't change the fact that the NJ approach does a better job than your linkage method at visualizing the data from this paper. The added degree of freedom you have with NJ (i.e. branch lengths actually meaning something) is not a trivial advantage.

VV

Dienekes said...

As for your notion that your NJ tree is "superior" to the CL tree, let's summarize your argument:

1. You see -in a completely non-formal way- that some branch sums in the NJ tree are kinda similar to some of the Fst's.

2. Ignorant of the fact that branch lengths CANNOT be added in the CL tree, you nonetheless add them and discover that "the Russians are shown being just as distant from the Spanish as from the Bedouins, even though according to the original data matrix they are much more distant from the Bedouins (fst=0.0211) than from the Spanish (fst=0.0079)."

It's like a schoolkid reading on a tourist guide that the distance between Berlin and Paris is X, that the distance between Paris and Rome is Y, and then saying that the tourist guide is inaccurate because the distance between Berlin and Rome is not X+Y. It ain't the representation that is flawed, but its improper use.

3. From this exercise, you conclude that the CL tree does not preserve Fst's rather than what we ought to conclude, i.e., that the whole point of building the CL tree isn't to "preserve Fst's" which is impossible for any hierarchical clustering method, but rather to group populations into clusters of similarity, a task in which it succeeds admirably.

Vincent said...

You see -in a completely non-formal way- that some branch sums in the NJ tree are kinda similar to some of the Fst's.

You have not yet comprehended the real difference between CL and NJ, have you? Clearly not, and if not by now then probably never.

Ignorant of the fact that branch lengths CANNOT be added in the CL tree, you nonetheless add them and discover that "the Russians are shown being just as distant from the Spanish as from the Bedouins, even though according to the original data matrix they are much more distant from the Bedouins (fst=0.0211) than from the Spanish (fst=0.0079)."

Read closer, and you'll see I am right. From the beginning I've been trying to bring you to this essential point. You led yourself to it, but still can't admit it. A dendrogram (or any tree representation) is creating a hierarchical presentation: things which are most closely related are closest in the tree. Regardless of whether branch lengths are proportionate to distance or are additive, this relative positioning is an undeniable trait.

So if you show A and B separated by one node while A+B are separated from C by an additional node, you are inescapably representing A as more closely related to B than to C.

I pointed out that you have A being more closely related to B, when in truth A is more closely related to C. At least according to fst. And I made not only this empirical observation, but gave you the theory to explain why your approach led you to the mistake. And I gave you an alternate method that largely avoids the mistake.

The least you could do is say "thank you".

VV

Dienekes said...

The added degree of freedom you have with NJ (i.e. branch lengths actually meaning something) is not a trivial advantage.

I am compelled to continue the free lesson, since your continued spread of misinformation may actually harm some of my readers.

Branch lengths "actually mean something" in a dendrogram like the one in my blog post; they are the distances between either a pair of populations (if they are joined directly), or between populations and clusters, or between clusters.

The sum of branch lengths means nothing in either method; one is better off looking at the original distance table rather than trying to calculate it off the tree by "adding up branches" as you misguidedly did.

Vincent said...

The sum of branch lengths means nothing in either method; one is better off looking at the original distance table rather than trying to calculate it off the tree by "adding up branches" as you misguidedly did.

The sum of branch lengths most certainly means something in an additive tree. Even when the fit is less than 100% between the tree distances and the matrix distances, the ability to estimate distances based on branch lengths is a real benefit.

The real problem is trying to fit a non-ultrametric dataset (like Tian's) into an ultrametric method like CL or UPGMA. When you do that (and you did), you end up with a topology that is suboptimal AND branch lengths that are impossible to interpret.

VV
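
For readers following the ultrametric point, one way to check it is the three-point condition: in an ultrametric dataset, the two largest distances in every triple are equal. The sketch below assumes a hypothetical helper `is_ultrametric` and made-up integer distances, not the Fst values from the paper.

```python
from itertools import combinations

def is_ultrametric(labels, dist, tol=0):
    """True if, in every triple, the two largest distances agree within tol."""
    for a, b, c in combinations(labels, 3):
        trio = sorted([
            dist[frozenset((a, b))],
            dist[frozenset((a, c))],
            dist[frozenset((b, c))],
        ])
        # trio[2] is the largest distance, trio[1] the second largest.
        if trio[2] - trio[1] > tol:
            return False
    return True

# Hypothetical toy matrices in arbitrary integer units.
clock_like = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6,
}
not_clock_like = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 9,
}

print(is_ultrametric(["A", "B", "C"], clock_like))      # True
print(is_ultrametric(["A", "B", "C"], not_clock_like))  # False
```

Real data will rarely pass with `tol=0`, so in practice one would pick a tolerance and look at how many triples fail and by how much.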

Dienekes said...

"A scalpel is a small but extremely sharp bladed instrument used for surgery, anatomical dissection, and various arts and crafts."

The point being that the species (scalpel) has a more restricted field of applicability than the genus (knife), just as NJ has a more restricted field of applicability (phylogeny inference) than the genus (hierarchical clustering, which is not only used for phylogeny inference).

Dienekes said...

the ability to estimate distances based on branch lengths is a real benefit.

Lol, ok, take your ruler and add up branch lengths to get a piss-poor approximation of a number it takes me a second to look up in the original table.

The point of using NJ is to infer a phylogeny, and NJ is completely inapplicable in this case.

The point of CL hierarchical clustering on the other hand is to create similar clusters, and it succeeds admirably in achieving its purpose.

So if you show A and B separated by one node while A+B are separated from C by an additional node, you are inescapably representing A as more closely related to B than to C.

I even drew you a picture, but you still don't seem to get it.

http://i27.tinypic.com/9sqfb8.jpg

If I say that the points in the bottom left blob belong to cluster A, and the points in the top right blob belong to cluster B, I am certainly NOT claiming that all points in A are closer to other points in A than they are to some points in B.
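
The same point can be sketched numerically. The coordinates below are hypothetical one-dimensional positions chosen for illustration; they are not read off the linked picture.

```python
# Two hypothetical clusters laid out along a line (arbitrary units).
cluster_a = [0, 1, 2, 3, 4]
cluster_b = [5, 6, 7, 8, 9]

# Take the member of A that sits at the edge facing B.
edge_point = 4

# Distance from that point to its nearest neighbour in the OTHER cluster.
nearest_in_b = min(abs(edge_point - x) for x in cluster_b)

# Distance from that point to the farthest member of its OWN cluster.
farthest_in_a = max(abs(edge_point - x) for x in cluster_a)

print("nearest point in B:", nearest_in_b)    # 1
print("farthest point in A:", farthest_in_a)  # 4
```

The edge point belongs to A, yet it is closer to the nearest member of B (1) than to the farthest member of A (4).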

Vincent said...

Lol, ok, take your ruler and add up branch lengths to get a piss-poor approximation of a number it takes me a second to look up in the original table.

The point is not to reverse-engineer the matrix (though you could get an approximation if you had your heart set on it), but to give the user a visual clue that you don't get with your dendrogram.

Don't focus on that "visual clue" as the only (or even main) benefit. If you do, you'll continue to miss the greater point, which is that using any ultrametric method on a non-ultrametric dataset is likely to produce a suboptimal clustering. If you think I am making that up, then you really need to get yourself an education on the topic.

VV

Maju said...

I was trying to find out where the Southern Italian sample was taken, but I can't find it. I am under the impression from the Materials and Methods section that they are not a direct sample from Europe but a proxy taken from several US databases (selecting those declaring 4 grandparents). Is it possible that all of them (they cluster quite tightly) are from genuinely Greek areas, i.e. from some heavily colonized coastal areas?

Dienekes said...

Don't focus on that "visual clue" as the only (or even main) benefit. If you do, you'll continue to miss the greater point, which is that using any ultrametric method on a non-ultrametric dataset is likely to produce a suboptimal clustering. If you think I am making that up, then you really need to get yourself an education on the topic.

You are clearly not educated on the topic (your confusion about the difference between a dendrogram and a UPGMA tree has clearly demonstrated this), so keep your advice to yourself.

As for "suboptimal clustering", this is yet another example of your basic ignorance of what clustering is:

"Suboptimal" implies that different clustering methods can be arranged in order of the goodness of the trees they produce. What objective criterion -your muddle-headed grammar school attempts to add up branch lengths don't count- do you propose for saying that one clustering method is better than another? IF we were dealing with a known phylogeny, then one could compare inferred trees against it and see which one is closest to it. However, in this case, there is no phylogeny (either known or unknown).

Thus, your contention that NJ is inherently better than complete linkage hierarchical clustering is an empty slogan.

First, your claim was based on the supposed ability of NJ to generate an accurate topology of the tree. Once it was pointed out to you that there is no accurate tree topology since these populations did not evolve tree-like, you switched to vague claims about "preserving Fst" better by adding up branch lengths. Once it was pointed out to you that "preserving Fst" is not the point of clustering, since you can look up Fst in the table, you switched to generalities about "suboptimal clustering" without giving a criterion of optimality.

Dienekes said...

BTW I wonder how come your "superior tree" places Russians and Swedes in a separate branch (Fst=0.0036), even though the Swedes are closest to Germans and vice versa (Fst=0.0007), while my "distorted" tree correctly puts them in the same branch.

According to your "logic":

A dendrogram (or any tree representation) is creating a hierarchical presentation: things which are most closely related are closest in the tree. Regardless of whether branch lengths are proportionate to distance or are additive, this relative positioning is an undeniable trait.

So if you show A and B separated by one node while A+B are separated from C by an additional node, you are inescapably representing A as more closely related to B than to C.


Swedes are thus "inescapably shown" by your "superior tree" to be closest to Russians, Irish, Orkney, Eastern Europe, and Germans in that order, while they are in fact closest to Germans, Irish, Eastern Europe, Russians, Orkney, which is precisely how they are "inescapably shown" in my distorted tree.

Vincent said...

BTW I wonder how come your "superior tree" places Russians and Swedes in a separate branch (Fst=0.0036), even though the Swedes are closest to Germans and vice versa (Fst=0.0007), while my "distorted" tree correctly puts them in the same branch.

NJ correctly places the Swedes closer to the Germans than to the Russians.

Not only that, the distance on the tree between the Swedes and Russians is . . . wait for it . . . precisely 0.0036.
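
For anyone unsure what "distance on the tree" means in an additive tree, it is the sum of branch lengths along the single path joining two leaves. Below is a minimal sketch under stated assumptions: the leaf names, internal nodes X and Y, and integer branch lengths are hypothetical, and `tree_distance` is a placeholder name, not something from the NJ software used for the plot.

```python
def tree_distance(edges, start, goal):
    """Sum branch lengths along the unique path between two nodes of a tree."""
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    # Depth-first walk; a tree has no cycles, so skipping the parent suffices.
    stack = [(start, None, 0)]
    while stack:
        node, parent, total = stack.pop()
        if node == goal:
            return total
        for neighbour, length in adjacency[node]:
            if neighbour != parent:
                stack.append((neighbour, node, total + length))
    raise ValueError("nodes are not connected")

# Hypothetical unrooted tree: leaves A, B hang off X; leaves C, D hang off Y.
edges = {
    ("A", "X"): 3,
    ("B", "X"): 4,
    ("X", "Y"): 10,
    ("C", "Y"): 5,
    ("D", "Y"): 20,
}

print(tree_distance(edges, "A", "B"))  # 7
print(tree_distance(edges, "A", "C"))  # 18
print(tree_distance(edges, "C", "D"))  # 25
```

How closely such path sums match the original matrix depends on how tree-like the data are, which is the point under dispute here.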

Dienekes said...

Not only that, the distance on the tree between the Swedes and Russians is . . . wait for it . . . precisely 0.0036

And the distance between Swedes and Germans is precisely 0.0007 I am guessing (not).

Thankfully people have eyes and they can see that your "superior tree" in no way reflects that Germans and Swedes are closest to each other, as they really are, and in no way reflects that they are five times closer to each other than Swedes are to Russians.

Dienekes said...

Or will you deny, e.g., that your "superior tree" shows the Swedes to be closer to the Irish than to Germans, even though in reality they are about three times more distant, while my "distorted" tree correctly joins Swedes with Germans before joining them with the Irish?

My eyes are just too bad for the superior "visual cues" provided by your NJ plot...

Vincent said...

My eyes are just too bad for the superior "visual cues" provided by your NJ plot...

I'm pretty sure the problem is not in your eyes, but rather on the other end of the optic nerve.

Dienekes said...

I'm pretty sure the problem is not in your eyes, but rather on the other end of the optic nerve.

It seems you have run out of excuses, and no longer pretend to address the substantive issues of my reply.

Well, people can easily see that your superior plot shows Swedes closer to the Irish than to Germans and does not show them anywhere near five times closer to the Germans than to Russians.

http://vizachero.com/images/TianNJ.pdf

Your participation in this thread wouldn't have been as embarrassing for yourself if you had gracefully accepted your errors several posts ago, instead of piling on new ones.

I can't say I mind too much though, as this series has been doubly-educational: on both the interpretation of clustering and the perils of being an uneducated know-it-all.

Joshua said...

I can't say I mind too much though, as this series has been doubly-educational: on both the interpretation of clustering and the perils of being an uneducated know-it-all.

What, exactly, is wrong with being an uneducated know-it-all? Did I miss the part where you talked about your academic tenure or professional position? I have been a faithful reader for many years, and always assumed that you were simply an enthusiastic amateur -- and never thought less of you for it. Was I wrong?

Dienekes said...

"Uneducated" in terms of exhibited knowledge. I don't really care about people's credentials only about their exhibited grasp of concepts.

For example, if someone sees a dendrogram and assumes it was built by UPGMA, they are revealing that they do not know that UPGMA is not the only hierarchical method of clustering that can be represented with a dendrogram. Or, if someone thinks that every tree representation of genetic distance represents a phylogeny. Or, if they don't know that the Neighbor-Joining algorithm always produces a phylogeny, and that it is not appropriate in the presence of reticulation.

Patrick said...

"Only because they don't have England there -- England ought to be closer to German than Italian to Greek."

Why? Languages do not equal genetics. "Italians" are a heterogeneous population (possibly even more so than "English" and "Germans") because of the distinctive history and geography of Italy and the fact that southern European populations contain more genetic diversity than northern ones.

If you know anything about Italian history, you would know that large numbers of people migrated from Greece to South Italy throughout Archaic and Classical Greek times (Magna Graecia).
Later, during Byzantine times, after the Slavic invasion of the Greek peninsula and the Persian/Arab invasions of Asia Minor, there were yet more Greek migrations to south Italy.

From what I can tell, most people in southern Calabria and eastern Sicily are basically Latinized Greeks. About 20% of surnames in the provinces of Reggio and Messina (surnames were only adopted after medieval times in southern Italy) are of Greek origin, and there are still a few Calabrian towns where people speak a Greek dialect (Griko).

Joshua said...

"Only because they don't have England there -- England ought to be closer to German than Italian to Greek."

Why? Languages do not equal genetics. "Italians" are a heterogeneous population (possibly even more so than "English" and "Germans") because of the distinctive history and geography of Italy and the fact that southern European populations contain more genetic diversity than northern ones.

Hmm, it is probably that my knowledge of the history is bad. I understood that England was populated and subjugated by germanic invaders long after Homer. However, now that I think about it, that could very well be wrong.

Maju said...

It was subjugated but probably not so much populated, Joshua. Even in paternal ancestry (Y-DNA), the most similar of all English to North Germans and Danes are only like 40% that, and some of that could be older, one could argue, from the times of Doggerland. The average English may be, by exclusively paternal ancestry, only like 20% Anglo-Saxon plus Viking.

It is one thing to conquer and another to populate. England conquered India but did not populate it, right?

Joshua said...

Thanks Maju,

I don't think the India example is very comparable, though I get your larger point. However, you have a stronger point with the comments about Y-chromosome. I was thinking specifically of the various studies that have been done on Y-chromosome, which show that the Anglo-Saxons (through the use of complex sociopolitical codes) replaced the genes of the indigenous populations in record time. I think there was one that showed that the Y-chromosomes of Friesland are indistinguishable from England today. And of course, these studies always show that the similarities do not continue into Wales or South Ireland, suggesting that this was an issue of subjugation coupled with genetic replacement.

I found it interesting that my own DNA marker had the most 37-marker Y-chromosome matches in the Netherlands versus any other place, although my patrilineage is documented to the American Revolution, and in central England back to the 1600s.

To your point, though, I believe that the studies also showed that mtdna is much more diverse across these populations, and incorporates plenty from the subjugated populations (IOW, there is probably plenty of Pict mtdna floating around, but zero Y-chromosome).

Maju said...

My bad. My document of reference on this issue is Capelli 2005: A Y chromosome census of the British islands... but I was talking from memory and got the data somewhat wrong.

In fact, the two areas more affected by the Nordic invasions (York and East Anglia) show a Y-DNA intrusion of c. 60%, but most of England is in the 40% range, with some areas well below. Anyhow, this refers only to Y-DNA and does not exclude other previous flows in earlier times.

MtDNA is in fact more akin to mainland NW Europe but this is usually interpreted as meaning that the original population arrived largely from there (via Doggerland in the Epipaleolithic, as it's attested archaeologically for NE England and Scotland) and that the difference in Y-DNA signals a greater alien male input in the mainland than in Britain.

In any case it was not a settlement, like Australia but something more of a conquest with some settlement of males mostly. Even the more genetically "Germanized" areas are still 40% aboriginal (at least).

Ponto said...

Totally uninterested in the Brits and their haplogroups.

I am interested in SNP differences between ethnic groups.

I have tested with 23andMe, and have transferred my data to deCODEme.

This is what I have observed. Jews (Ashkenazim) have minor Asian admixture, as do some Iberians and Italians. However, this admixture does not affect their PCA diagram placement. At 23andMe, most Jews are clustered in the European groups, mostly with Tuscan and Bergamo Italians but often with the French. Few end up in the Middle Eastern PCA diagram, but some Southern Europeans do end up in the Middle Eastern groups. At deCODEme, most of the Jews I am friends with and share some data with are clustered with the Italians, but these Jews tend to be more Sephardic. Other Eastern people like Samaritans, Anatolian Turks, Armenians, Assyrians, and Israeli Jews tend to cluster on the Europe PCA further up from the Italian group, heading towards the Adygei.

Since the study of Tian et al. stated that they separated Southern Europeans from Ashkenazi Jews, why can't 23andMe or deCODEme do the same?

Andrew Oh-Willeke said...

I'm surprised at the emphasis of the Southern Italian origins of the Italian-American population in the U.S. which was common knowledge (a good part of that immigration was pre-Italian unification and was from the Kingdom of Sicily).

The study doesn't resolve one of the more pertinent questions about Italian origins, which is whether they are closer to the Western Anatolian population (who spoke a Hittite-derived Indo-European language) or to the population speaking the post-Mycenaean Greek-derived Indo-European language. The PCA chart in a study I've seen (sorry, no cite immediately at hand) including all three puts the Italians closer to the Anatolians than the Greeks, a finding which the linguistic evidence (the Italic-Celtic languages, most notably Latin, are closer to Hittite than Greek) supports. There was an isolated colony or two in Italy with paleo-Balkan linguistic origins, but many with Italic languages in South Italy.