http://www.sendspace.com/file/is4ynf Dienekes, I ...

2009-09-17T03:03:38.546+03:00

http://www.sendspace.com/file/is4ynf

Dienekes, I made a file that calculates FST for any combination of samples from the HGDP dataset. It's very easy to understand. I made this to demonstrate that the sample size in an FST estimate really does affect the result. In the recent Tian study of Europeans, where sample sizes were exceptionally small, some of the results were predictably odd and contradictory, such as Greeks having a distance of 0,0000 to south Italians, while the distances of these two to Spain were 0,0010 and 0,0035. That is a huge difference within Europe, equivalent to the genetic distance between Spain and the Czech Republic, or between Ireland and Russia. These results paint a blurry genetic map, which gives the false impression that there's a gradual genetic change between Europe and the Middle East.

You need Excel 2007 to run the file. I tried converting it to Excel 2003, but every time it has to do a calculation it takes ten minutes. The same calculations in Excel 2007 take just 2 seconds.

Pick any population and put half of their samples in Pop 1 and the other half in Pop 2. Theoretically, their FST should be 0,0000. In practice, this hardly ever happens.

After loading the samples and obtaining the first FST results, interchange any 2 samples between the 2 columns and recalculate FST. Sometimes it will change a little (such as by 0,0002); other times it will change by as much as 0,0050 or even more, always just from interchanging 2 samples between the columns.
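For anyone without Excel 2007, the split-and-swap procedure above can be sketched in Python. This is only a rough illustration under stated assumptions: the genotypes are simulated (independent SNPs, unrelated individuals), not HGDP data, and the estimator is Hudson's FST, which may not be the formula the spreadsheet uses. The names `hudson_fst` and `simulate_population` are made up for this example. It illustrates the procedure only and is not expected to reproduce the spreadsheet's numbers.

```python
import random


def hudson_fst(pop1, pop2):
    """Hudson's FST estimator, pooled over SNPs as a ratio of sums.

    pop1 and pop2 are lists of individuals; each individual is a list of
    genotypes coded 0, 1 or 2 (copies of one allele at each SNP).
    ASSUMPTION: the spreadsheet may use a different FST formula.
    """
    n1, n2 = 2 * len(pop1), 2 * len(pop2)  # chromosomes sampled per column
    num = den = 0.0
    for s in range(len(pop1[0])):
        p1 = sum(ind[s] for ind in pop1) / n1
        p2 = sum(ind[s] for ind in pop2) / n2
        # Squared frequency difference minus a small-sample correction.
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den


def simulate_population(n_individuals, n_snps, rng):
    """Simulated stand-in for real samples: one unstructured population."""
    freqs = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
    return [[(rng.random() < f) + (rng.random() < f) for f in freqs]
            for _ in range(n_individuals)]


rng = random.Random(1)
people = simulate_population(24, 4500, rng)  # 24 samples, 4500 SNPs
col1, col2 = people[:12], people[12:]        # "Pop 1" and "Pop 2" columns
print(f"initial split: FST = {hudson_fst(col1, col2):.4f}")

# Interchange one sample between the columns, then recalculate.
for step in range(1, 4):
    i, j = rng.randrange(12), rng.randrange(12)
    col1[i], col2[j] = col2[j], col1[i]
    print(f"after swap {step}: FST = {hudson_fst(col1, col2):.4f}")
```

Because both columns come from the same simulated population, the expected FST is zero, and this estimator can return small negative values as well as positive ones.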

When I pointed out that the Greek-south Italian distance (in the Tian study) was completely unreliable because of the sample size, you observed that the standard deviation was only 0,0010, so that my argument that the real result could be off by as much as 0,0050 was incorrect, and that the real result was close to the study's estimate after all. But if you look at the standard deviations in the file I uploaded, you'll see that the standard deviation isn't a reliable guide, either. For example, taking the 24 Yoruba samples, 12 in each column, the mean FST of 4 random subsets of 4500 SNPs is 0,0035 (a little on the high side when comparing samples of the same population, but it happens). The standard deviation is 0,0004. After interchanging 2 samples I recalculated FST, and now the mean of the 4 subsets is 0,0042 and the SD is 0,0003. Another interchange of samples and the mean drops to 0,0004, with an SD of 0,0005. The last 4 results don't come even close to overlapping with the previous 2 sets of 4 results each. The SD was very tight in all 3 runs, yet the mean shifted by about an order of magnitude more than the standard deviation. This is just a typical example that can be easily reproduced in other samples.
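To put the Yoruba figures above on a common scale, here is a small Python sketch that expresses the gap between each pair of runs in units of standard deviation. The inputs are the three (mean, SD) pairs quoted in this comment; dividing by the larger SD of each pair is a choice made for this example, and `gap_in_sds` is a made-up helper name.

```python
# (mean FST, SD over 4 SNP subsets) for three 12-vs-12 splits of the same
# 24 Yoruba samples, as quoted in the comment.
runs = [(0.0035, 0.0004), (0.0042, 0.0003), (0.0004, 0.0005)]


def gap_in_sds(a, b):
    """Difference between two run means, in units of the larger of the two SDs."""
    (mean_a, sd_a), (mean_b, sd_b) = a, b
    return abs(mean_a - mean_b) / max(sd_a, sd_b)


for x in range(len(runs)):
    for y in range(x + 1, len(runs)):
        gap = gap_in_sds(runs[x], runs[y])
        print(f"run {x + 1} vs run {y + 1}: {gap:.2f} SDs apart")
```

Using the larger SD is the conservative choice. Even so, with the figures quoted, the second and third runs come out about 7.6 SDs apart.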

One last note: I recommend disabling automatic recalculation in Excel [go to Excel Options > Formulas > then select Manual]. Afterwards, to make Excel recalculate all cells, press F9.


Comments on Dienekes’ Anthropology Blog: Y chromosome and mtDNA of goats in North Africa
