October 06, 2010

Eurasian ADMIXTURE (a precursor to Eurasian-DNA-Calc?)

(Last Update: Oct 8; K=7 added)

I took the 540,814 markers from the HGDP dataset that are also included in the 23andMe personal genomics test, and that have less than 1% no-call rate.
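
The marker selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used for this post: the function names `no_call_rate` and `select_markers`, the encoding of a no-call as `None`, and the toy marker ids are all made up for the example.

```python
def no_call_rate(calls):
    """Fraction of individuals with a missing genotype (None) at one marker."""
    return sum(c is None for c in calls) / len(calls)


def select_markers(hgdp_calls, chip_markers, max_no_call=0.01):
    """Keep markers that are on the chip and have a no-call rate below the cutoff.

    hgdp_calls   -- dict: marker id -> list of genotype calls (None = no-call)
    chip_markers -- set of marker ids included in the personal genomics test
    """
    return sorted(
        m for m, calls in hgdp_calls.items()
        if m in chip_markers and no_call_rate(calls) < max_no_call
    )


# Toy example with hypothetical marker ids and genotypes (200 individuals each).
toy = {
    "rs_a": ["AA"] * 200,               # no missing calls, on chip
    "rs_b": ["AG"] * 197 + [None] * 3,  # 1.5% no-calls, on chip
    "rs_c": ["GG"] * 199 + [None],      # 0.5% no-calls, not on chip
    "rs_d": ["AG"] * 199 + [None],      # 0.5% no-calls, on chip
}
kept = select_markers(toy, chip_markers={"rs_a", "rs_b", "rs_d"})
print(kept)
```

Only markers that pass both conditions (shared with the chip, and under the 1% no-call cutoff) survive the filter.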

I ran ADMIXTURE on all the West- (and some mainly Caucasoid Central-) Eurasian populations, including Yoruba and Han Chinese to account for non-Caucasoid admixture in parts of Eurasia.

The populations are (left-to-right): Tuscan, North Italian, Sardinian, French, French Basque, Orcadian, Russian, Adygei, Palestinian, Bedouin, Druze, Mozabite, Pathan, Sindhi, Balochi, Brahui, Burusho, Yoruba, Han Chinese.

Here are the admixture proportions corresponding to this experiment:

This seems like a good starting point for the new EURASIAN-DNA-CALC I have in the works.

Relative to the existing EURO-DNA-CALC, doubling the number of ancestral populations (from 3 to 6) and increasing the number of SNPs (by 3 orders of magnitude) introduce some obvious computational problems. I have some ideas on how to resolve them, so stay tuned.

APPENDIX

For the sake of completeness here are the ADMIXTURE runs for K=3 to K=5.

At K=3, the three major races (Caucasoid: green, Mongoloid: red, Negroid: blue) emerge.
At K=4, the Caucasoids are split into West Eurasians (red) and Central Asians (purple).
At K=5, the West Eurasians are split into Europeans (yellow) and West Asians (blue).
PS: I will probably do some ADMIXTURE runs for K=7 and higher in the next few days; the results will be posted in this blog post as an update.

UPDATE (K=7)
The Druze get their own cluster (pink), in which Druze individuals have an average membership of 65.4%.

29 comments:

GrIQ said...

The central asian in French seems very high. Overall these admixture levels are similar to the admixture analysis of Davidski (Polako). And Spain in this experiment would have around 64-65% European, 3-4% North African, 17% West Asian, 13% Central Asian and the rest between SSA and East Asian.

Dienekes said...

The central asian in French seems very high

The labels correspond to the geographical origin of populations where each component is highest (bolded part in table), so we don't have to imagine Central Asian tribes moving to France.
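
The labeling rule just described (name each component after the population where it peaks) can be sketched as follows. This is a minimal illustration, assuming per-individual admixture proportions are available as plain lists; the function names and the toy populations "PopA" to "PopC" are hypothetical and the numbers are made up, not taken from the table.

```python
from collections import defaultdict


def population_means(individuals):
    """Average each component's proportion over the individuals of a population.

    individuals -- list of (population, [q_1, ..., q_K]) tuples
    """
    sums, counts = {}, defaultdict(int)
    for pop, q in individuals:
        if pop not in sums:
            sums[pop] = [0.0] * len(q)
        sums[pop] = [s + x for s, x in zip(sums[pop], q)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


def label_components(means):
    """Name each component after the population in which its mean is highest."""
    k = len(next(iter(means.values())))
    return [max(means, key=lambda pop: means[pop][j]) for j in range(k)]


# Made-up proportions for three hypothetical populations and K=3.
toy = [
    ("PopA", [0.80, 0.15, 0.05]),
    ("PopA", [0.70, 0.25, 0.05]),
    ("PopB", [0.20, 0.75, 0.05]),
    ("PopC", [0.02, 0.03, 0.95]),
]
means = population_means(toy)
print(label_components(means))
```

A label produced this way says only where a component is most frequent in the sample; it makes no claim about where that component originated.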

princenuadha said...

Wow, the middle eastern component doesn't really distinguish the Fr. Basque from the French, only the central Asian one.

ashraf said...

Thank you for the interesting topic

It seems that the population clines are geographical rather than ethnolinguistic. Until very recently, the great majority of people were illiterate countryside inhabitants who would, with time, adopt the language of whichever ruling class could afford to build a solid state (except, of course, in very remote areas and in areas inhabited by the same linguistic group as the ruling-class newcomers). The clines go back at least to Neolithic times, and Bronze Age migration waves (and, to a lesser extent, historical migration waves) could only diffuse some newer component, with a gradient decreasing the farther we go from the "Bronze Age newcomers' homeland". All this being said, the disparity in the Central Asian green component between the Indo-European-speaking French and the Vasconic-speaking French Basques is rather suggestive.

My question is:
*Could we determine the spatial as well as the temporal "homeland" of each component? And could we consider the homogeneity of Yorubas (100% Sub-Saharan African) and Hans (99% East Asian) as a "proof" that the Africanid (aka "negroid") and the Mongoloid biocultural "homelands" would be, respectively, Africa and Eastern Asia?
Thanks

onur said...

It is obvious (once again) from the uniform distribution of a very small and very similar amount of the "Mongoloid component" in most of the West Eurasian populations at K=3 that it isn't actually a Mongoloid component in them but a misleading result of using relatively isolated and drifted populations like Sardinians and Basques. In Mozabites, OTOH, the same allegedly "Mongoloid component" is "pretty diminished" due to the significant Negroid admixture.

pconroy said...

Dienekes,

This confirms what Polako found, that North Russians are very similar to Orcadians - thus, as I mentioned earlier - are a poor proxy for North Western Europeans.

Dienekes said...

At K=7, the Druze split off from the W Asian cluster. I think it will be feasible to do at least one K a night for a while, so let's see how many of these populations will be re-discovered from the data.

Dienekes said...

This confirms what Polako found, that North Russians are very similar to Orcadians - thus, as I mentioned earlier - are a poor proxy for North Western Europeans.

I don't see how this follows from being similar to N Russians. Actually they are not that similar, as they lack the Yellow component (K=6).

pconroy said...

What I mean is that at K=6, the other West Europeans have a significant West Asian component, lacking in the Orcadians and Russians - e.g. French (13%), French Basque (12%).

What I've gathered from Polako's results, are that:
1. The Irish/British would be similar to the French, but the West Asian component is under 10%
2. The North Russians have less than 2% East Asian component

So the Orcadians are more similar to North Russians than any North West European population - ergo not a good proxy sample for NW Europeans.

Vincent said...

I think it'd be interesting to skip directly to k=19 and see how well the 19 sample populations get picked up.

pconroy said...

Vincent,

I agree - like the French cattle paper

Marnie said...

The Orkney Islanders are a proxy for themselves.

Wish we could see the Orkney sample run against the Norse, Shetlanders, Welsh, Irish, Sweden, Finland, Estonia, Icelanders, Eastern Scots and Sami peoples. Something like that.

In other words, there need to be more samples from Northern Europe if there is going to be any understanding of the appearance of various Asian components in Western Europe.

Why would you exclude the Orkney sample because it is indicating an Asian component?

Which other "West European" populations have this? Why?

Fanty said...

"What I've gathered from Polako's results, are that:
1. The Irish/British would be similar to the French, but the West Asian component is under 10%
2. The North Russians have less than 2% East Asian component"

Where have you gathered that?

In Polako's results the Orcadians are similar to Norse, Swedes, Germans, French and British.

They are nowhere close to Northern Russians.

There are the Germans, Swedes, Poles and Belorussians between the Orcadians and the North Russians.

Average Joe said...

How about including some British, Irish, Dutch and German samples?

Dienekes said...

2. The North Russians have less than 2% East Asian component

I don't know what he means by "East Asian", but it's impossible to arrive at such a low admixture estimate for this population.

My guess is that he is using a high-K estimate. A good analogy is the Mozabites, who have substantial Sub-Saharan influence at K=3 to 5, which is diminished at higher K when their own cluster emerges. Their own cluster is actually a partially inbred composite of majority Caucasoids and minority Negroids.

This is evidenced from the fact that at K=5 they are 25.2% Negroid, which diminishes to 7.3% at K=6. Where did the rest go? It got incorporated into the Mozabite/N African cluster which emerged at K=6.

Let's go to Russians and examine their East Asian component at successive K in %:

3: 12.4
4: 9.5
5: 7.5
6: 6.4
7: 7.0
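
As a quick check of the arithmetic above, here is a small Python sketch. The percentages are the ones quoted in this comment; the variable names are only illustrative.

```python
# East Asian component in Russians (%), as listed above for K=3..7.
russian_east_asian = {3: 12.4, 4: 9.5, 5: 7.5, 6: 6.4, 7: 7.0}

# Change in percentage points from each K to the next.
ks = sorted(russian_east_asian)
changes = {
    k: round(russian_east_asian[k] - russian_east_asian[prev], 1)
    for prev, k in zip(ks, ks[1:])
}
print(changes)

# Lowest estimate over all K in this experiment.
lowest = min(russian_east_asian.values())
print(lowest)

# Mozabite Negroid component (%) at K=5 and K=6: percentage points that
# moved elsewhere once the Mozabite/N African cluster emerged.
absorbed = round(25.2 - 7.3, 1)
print(absorbed)
```

The estimate shrinks as K grows but levels off; none of the values at K=3 to K=7 comes anywhere near 2%.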

Dienekes said...

I think it'd be interesting to skip directly to k=19 and see how well the 19 sample populations get picked up.

Too slow, even at K=8 with all 540K markers.

might be said...

Fantastic blog!
It is unrealistic to separate individual human populations of the same region (say Europe) by ADMIXTURE (or any other similar structure-like approach) just as was done in the cattle paper. It is likely that only populations with a lot of drift - Kalash in Pakistan for example - will get their own "colour". The populations just are not different enough. The new components at higher Ks will be distributed across populations (with clines). Moreover, at these very high K values the algorithm will probably not converge (meaning that parallel runs will not arrive at the same result). To save calculation time you can thin the dataset by getting rid of some LD, using PLINK for example. See the ADMIXTURE manual.

Fanty said...

"My guess is that he is using a high-K estimate."

This is from the calculation he now uses with FTDNA Family Finder members:

K5 Intra-West-East-North-Asian: (rounded up or down, by the rounding laws)

North-Russians:
Amerindian: 1%
Anatolia/Caucasus: 4%
North-East European: 85%
North Eurasian (Yakut etc): 9%
East Asian: 0.3%

Orcadians:
Amerindian: 0.7%
Anatolia/Caucasus: 18%
North-East European: 80%
North Eurasian (Yakut etc): 0.4%
East Asian: 0.4%

My own results (German blend from all over the northern half of the German Empire, from the Dutch border to the Russian border):

Amerindian: 1.5%
Anatolia/Caucasus: 20%
North-East European: 77%
North Eurasian (Yakut etc): 1%
East Asian: 0%

Now
K6 Intra West Eurasian (again, the calculation he used on the FTDNA members):

North-Russians:
Mediterranean: 6%
Southeast European/Anatolian/Caucasian: 1%
Middle Eastern: 3%
East/North-East European: 67%
West European: 22%
North African: 1%

Orcadians:
Mediterranean: 10%
Southeast European/Anatolian/Caucasian: 5%
Middle Eastern: 0.08%
East/North-East European: 38%
West European: 46%
North African: 0.5%

My own results:
Mediterranean: 23%
Southeast European/Anatolian/Caucasian: 4%
Middle Eastern: 0.6%
East/North-East European: 44%
West European: 28%
North African: 0%

Dienekes said...

North-Russians:
Amerindian: 1%
Anatolia/Caucasus: 4%
North-East European: 85%
North Eurasian (Yakut etc): 9%
East Asian: 0.3%


I don't see where this particular breakup is coming from. How are e.g., Pathans supposed to fit into this scheme, which seems to be lacking the Central Asian component?

Is the methodology explained anywhere?

Fanty said...

"Is the methodology explained anywhere?"

I can only name what appear to be the anchors he uses (or where the clusters are the greatest):

K6 Intra European:
Western European (French Basque)
East-Northeastern European (Chuvash)
Southeast European/Anatolian/Caucasus (Georgian)
Mediterranean (Sardinian)
North Africa (Mozabites)
Middle Eastern (Saudi Arabia)

K5 Euro-Asia-Amerind ...

Amerind (Karitiana)
Anatolian/Georgian (seems pretty even between Georgian and Armenian)
Northeasteuropean (Lithuania)
North Eurasian (Yakut)
East Asian (Dai)

.......
I also know there is quite an amazingly large difference between the "North Russians" from the reference population and his Russian 23andMe project members, who somehow appear almost as distant from those "North Russians" as they are from Germans, but appear very close to the Belorussian and Polish 23andMe profiles.

"Pconroy" would say: "Northern Russians appear to be not a good proxy for Russians!" ;-)

You recall this map I did from his 23andme thingy?
http://img4.imageshack.us/img4/519/500ksnp.jpg

The 23andMe Russians are all in the region where Russia overlaps with Belorussia. All the other parts, close to the 23andMe Finns and the Chuvash reference, are the "Northern Russians".

"How are e.g., Pathans supposed to fit into this scheme, which seems to be lacking the Central Asian component?"

I think Pathans are more like South Asia.
It's totally lacking ;)
But that is already visible in the name: "K5 Intra-West-East-North-Asian:"

And in the anchor list too. There is no anchor in that region to mark Iranian tribes like the Pathans.

Marnie said...

might be:

From the ADMIXTURE manual:

"As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction
for continentally separated populations (for example, African, Asian, and European populations
FST > .05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01)."

Dienekes, I see you're running 540,814 markers. A third of them are European. So don't you need more "markers" to pick out populations within Europe, for instance? In any case, it's very interesting that a distinct population for the Druze was resolved.

As K increases, could you let us know the termination criteria for this run? Are you converging or exceeding "N" iterations?

Thinning the marker set for linkage disequilibrium would seem to be laborious. ADMIXTURE is begging to be set up for parallelization on multiple platforms. Anybody know if this is in the works?

Another team seems to have done this. Anybody heard of parLEA? The paper is open source:

http://bioinformatics.oxfordjournals.org/content/25/11/1440.full

Marnie said...

Sorry Dienekes. Your marker size is OK and just above the "rule-of-thumb" of > 100,000 markers for the Euro population.

Again, this speaks to a runtime/marker size requirement limitation for ADMIXTURE. At K=7, sounds like you're taking at least a day. So for a minimal data set for Eurasia~=500,000, at K=7, you're taking about a day.

Don't know what platform you're running on.

Again, I'd be curious to know what your runs (convergence, runtime) look like for k > 7.

Anybody tried changing the convergence condition?

ie. admixture -a qn2 ...

(manual, section 2.8)

Ponto said...

Those admix runs all depend on which ethnic groups are used, and from what geographic region.

Dienekes is using one African reference, the Yoruba, and one East Asian reference, the Han Chinese. West Eurasians, Central South Asians and North Africans probably have admix from Africa, and other parts of Asia other than from the Yoruba, a West African ethnic group, or the Han Chinese, one ethnic group from East Asia. That is the problem 23andMe have encountered with American blacks and mixed race people; using one reference from Africa, and one reference from Asia does not work well in finding out admixture or racial composition. With Europeans, no one has been able to tease out Mesolithic, Neolithic, Indo-European, North African, Islamic Middle Eastern, Central Asian or East Asian contributions to the makeup of Europeans. The best that is done is ambiguous labels like Mediterranean, high in Sardinian Islanders and so on.

My interest in admixture is knowing what is what, and when, and to whom can the admix be attributed. I would like to know how much contribution came from the Neolithics, and separated from later Middle Eastern contributions. After all the immigration events are separated by thousands of years.

Dienekes said...

Dienekes, I see you're running 540,814 markers. A third of them are European. So don't you need more "markers" to pick out populations within Europe, for instance? In any case, it's very interesting that a distinct population for the Druze was resolved.

The trouble is that Europeans have relatively low distances from each other, so the subdivisions occur among other populations first.

It's possible to speed it up of course by reducing either the likelihood convergence criteria or pruning the markers. That will no doubt allow most populations to be resolved, but the admixture estimates will suffer (inferred "mixedness" increases as the number of markers decreases).

might be said...

hi again.
Marnie: thinning the dataset is very easy and quick. See --indep-pairwise in PLINK. After that you just use --extract with the plink.prune.in subset (--keep filters individuals, not SNPs). For a set of 1,000 individuals and 600,000 SNPs it takes a few hours.
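
To show the idea behind LD thinning, here is a rough Python sketch. It is a simplified greedy stand-in, not PLINK's actual algorithm: it keeps a marker only if its squared correlation with every already-kept marker in a preceding window stays under a threshold. The function names, the default window and threshold, and the toy genotypes are assumptions made for the example.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic marker: correlation undefined
        return 0.0
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, max_r2=0.2):
    """Greedy pruning: keep a marker only if its r^2 with every kept marker
    among the preceding `window` markers is at most max_r2.

    genotypes -- list of per-marker genotype vectors, in chromosomal order
    Returns the indices of the markers that are kept.
    """
    kept = []
    for i, g in enumerate(genotypes):
        recent = [j for j in kept if i - j <= window]
        if all(r_squared(g, genotypes[j]) <= max_r2 for j in recent):
            kept.append(i)
    return kept


# Made-up genotypes for three markers in six individuals.
toy = [
    [0, 1, 2, 0, 1, 2],  # marker 0
    [0, 1, 2, 0, 1, 2],  # marker 1: identical to marker 0, fully redundant
    [1, 2, 0, 1, 0, 2],  # marker 2: uncorrelated with marker 0
]
print(ld_prune(toy))
```

The redundant marker is dropped while the uncorrelated one is retained, which is why thinning can cut runtime without discarding much independent information.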

Fanty said...

"My interest in admixture is knowing what is what, and when, and to whom can the admix be attributed. I would like to know how much contribution came from the Neolithics, and separated from later Middle Eastern contributions. After all the immigration events are separated by thousands of years."

Well, we all want this.

But I doubt that this is possible without knowing the DNA of exactly those mentioned people.

If I imagine mixing tea, coffee, coke, juice etc...

There is no way to sort out at what time the coke was put in.

It's also questionable whether it's possible to sort out if the "sugar" component is from the coke or if sugared coffee was involved.

And sometimes it's even worse.
It's still disputed whether something like "Indo-Europeans" ever existed as a people or if this is only a language and culture that spread. As long as this problem is not sorted out, the question of what the DNA of the Indo-Europeans actually is, is kind of off.

I want to know some other things too: Why do all Europeans have the same mothers (mtDNA) but different fathers (Y-DNA)? ;)
Yeah, most of us tend to think in Y-DNA because it seems to make sense, and ignore mtDNA because it does not seem to make any sense at all.

Who spread the blue eyes mutation all over the northern half of Europe? An externally visible indicator that autosomal DNA must have a North/South split, at least in one of the many layers. And all that in the last 6k years (the blue eyes mutation is meant to have happened 6k-10k years ago in the Ukraine, and ALL modern day blue eyed humans are supposed to have inherited it from the same person... kind of unbelievable).

That's a connection between Northern Russians and Orcadians too, btw ;-)


I have a lot of hope in A-DNA to solve some problems. Like debunking Y-DNA drift effects (which would be the case if populations had a totally different Y-DNA/mtDNA setup but close A-DNA).

But some things it won't solve, unless we have ancient A-DNA to compare, and a lot of it, from all over the world and from each millennium. Which we will never have.

Marnie said...

Dienekes:

"The trouble is that Europeans have relatively low distances from each other, so the subdivisions occur among other populations first."

Your K=7 run likely indicates that the populations of Europe still have real admixture in them. The Orcadians have both Euro and Central Asian components in them, and the Central Asian component is higher for them than the French. As someone with a partly Scottish background, I think that's real.

The French have a higher subcomponent of Western Asian in them. Perhaps that's a result of either a refugium population infusing into France from Corsica and Sardinia or Greek (Marseilles) influence. Probably both.

Maybe that's too simple, but there's a hint of both in your K=7 run. (Actually, you can see West and Central Asia in Euro populations at K=6.)

I don't mean to get on your case, Dienekes, but it would be good to know runtime, platform type and convergence (yes, no) for these runs.

might be:

Thanks for the recommendation about pruning. What are your observations about loss of accuracy? How much redundancy is there in a data set like the one Dienekes just ran?

Also, given the massive amount of raw computer power out there, again, this problem should be parallelized. Forget about messing around with datasets! ADMIXTURE, where is your option for parallel runs??!!

Thanks! Very informative.

might be said...

Marnie:
I have not seen any loss of (in fact, any change in) results after pruning (at least for ADMIXTURE). In fact, LD pruning is recommended both by clustering algorithm writers (see the STRUCTURE or ADMIXTURE manuals) and even more so for smartpca of EIGENSTRAT.
About parallel runs of ADMIXTURE: I'm not sure what you mean - distributing a single run on different cores? If yes then I don't really see a need for that. Since in a normal setting you would anyhow have to run say 100 runs at each K to see how the log-likelihoods behave, that's where you do it in parallel. You run say 1000 runs (10 different Ks) at the same time on 1000 cores.
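
Once the replicate runs are done, comparing their log-likelihoods could look roughly like the Python sketch below. It assumes the final log-likelihood of every run has already been collected into a dictionary; the function name `summarize_runs`, the tolerance, and the numbers are all made up for illustration and do not come from any real run.

```python
def summarize_runs(loglik, tolerance=1.0):
    """For each K, report the best log-likelihood and how many replicate runs
    landed within `tolerance` log-likelihood units of it.

    loglik -- dict: K -> list of final log-likelihoods, one per replicate run
    """
    summary = {}
    for k, values in sorted(loglik.items()):
        best = max(values)
        near_best = sum(best - v <= tolerance for v in values)
        summary[k] = {"best": best, "near_best": near_best, "runs": len(values)}
    return summary


# Hypothetical log-likelihoods, for illustration only.
toy = {
    3: [-1000.0, -1000.2, -1000.1],
    7: [-900.0, -950.0, -900.5, -975.0],
}
for k, s in summarize_runs(toy).items():
    print(k, s)
```

If most replicates at a given K end up near the best log-likelihood, the runs agree; if they scatter widely, as in the made-up K=7 case, the solution at that K should be treated with caution.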

Marnie said...

might be:

thanks.

I think you mean throwing say, 10 different K's and 10 different cflag termination criteria values in parallel?

With the parallel cflag runs, you'd be hoping to catch the minimal cflag condition for convergence? I guess that would change with K.

That sounds like a good brute force way of evading mucky convergence problems. :)