December 01, 2010

Human genetic variation: the first 50 dimensions

Here is a huge data dump for anyone interested in human variation. Part of the reason I started the Dodecad Project was to be able to analyze data on my own, rather than having to squint to make sense of a plot, to speculate about what might show up at higher dimensions, or with more clusters, to wonder how the inclusion of additional populations would affect the results, and so on.

The following dataset represents the culmination (so far) of my efforts.

Number of SNP markers: ~177,000 as in here
Populations: 139
Individuals: 2,230

In the RAR file (~11MB) you will find 49 scatterplots (5000x5000 pixels each) representing the first 50 dimensions of a multi-dimensional scaling analysis of this dataset, together with information about the samples and their sources. There is a plot of the 1st and 2nd dimensions, 2nd and 3rd, 3rd and 4th, and so on, until the 49th and 50th.
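As a quick check on the counts above, here is a minimal sketch of how 50 dimensions yield 49 plots when each plot pairs a dimension with the next one. The function name `consecutive_pairs` is purely illustrative and not part of any tool used for the analysis.

```python
# Enumerate the consecutive dimension pairs described above.
N_DIMENSIONS = 50  # number of MDS dimensions, as stated in the post


def consecutive_pairs(n_dims):
    """Return [(1, 2), (2, 3), ..., (n_dims - 1, n_dims)]."""
    return [(d, d + 1) for d in range(1, n_dims)]


pairs = consecutive_pairs(N_DIMENSIONS)
print(len(pairs))            # 49
print(pairs[0], pairs[-1])   # (1, 2) (49, 50)
```

Each dimension except the first and the last appears in two plots, once on each axis.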

I don't believe Picasa allows such huge pics, so I've made a few smaller (still 1600x1600 pixels each) ones to give you an idea of what to expect. Note that the legend in these small ones is only partly visible.

In all plots, population labels have been placed on the population averages; these usually correspond to blobs of datapoints belonging to that population, but occasionally they are shifted due to the presence of outliers.
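To illustrate how an outlier can pull a label away from its blob, here is a small sketch with made-up coordinates (not taken from the dataset). The label position is the per-dimension average, as described above; the median is shown only for comparison and is not something the plots use.

```python
from statistics import mean, median


def label_position(points):
    """Per-dimension mean of 2D points: where a population label would sit."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def median_position(points):
    """Per-dimension median, shown only for comparison with the mean."""
    xs, ys = zip(*points)
    return (median(xs), median(ys))


# Hypothetical population: four individuals in a tight blob plus one outlier.
population = [(0, 0), (1, 0), (0, 1), (1, 1), (20, 20)]

print(label_position(population))   # (4.4, 4.4): dragged away from the blob
print(median_position(population))  # (1, 1): still inside the blob
```

With a single distant individual, the average lands where no datapoint actually is, which is the kind of shifted label you may notice in some plots.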

Before I proceed, it might be worth giving a visual representation of the three poles of human variation in its broadest context; these are Basques/Sardinians, Mbuti/Biaka Pygmies, and She. Well, these are only marginally closer to the three poles than many others, but they will do:



Mbuti image by Mikael Strandberg; She image from Portraits of Chinese ethnic groups and links therein.

1 vs 2


3 vs 4

5 vs 6

7 vs 8

Inspection of these plots gives you an idea of why Clusters Galore works so well. It can detect "clusteredness" of individuals along multiple dimensions. It does not look at a series of 2D plots, but it considers proximity of individuals to each other along multiple dimensions, and adapts to the shape, size, and orientation of the clusters.

19 comments:

clusteredmaps said...

Pontikos, why do the Ethiopians cluster so closely with the Koryaks?

Dienekes said...

You must be more specific than that.

clusteredmaps said...

http://3.bp.blogspot.com/_Ish7688voT0/TPZ9g-sDtQI/AAAAAAAAC-M/X1fWtBzKGRU/s1600/7_8.png

Dienekes said...

http://3.bp.blogspot.com/_Ish7688voT0/TPZ9g-sDtQI/AAAAAAAAC-M/X1fWtBzKGRU/s1600/7_8.png

Nothing strange about that, as dimension #7 separates Northern Europeans from non-Northern Europeans, and #8 separates African hunters from everyone else.

wijjy said...

Nice work.

Are any of these components picking up overall variation or platform differences?

What does a plot of heterozygosity against the components look like?

Dienekes said...

Are any of these components picking up overall variation or platform differences?

I don't see any platform differences; almost all of them are typed on the same Illumina chips, though.

Not sure what you mean by "overall variation".

What does a plot of heterozygosity against the components look like?

That's a good idea. I'll keep it in mind.

Ward said...

Just wanted to leave a note of how impressed I am with your blog. The genetic detail you are going into is very interesting. Looking forward to your coming posts.

Perahu said...

Why are Australoids (Papuans etc) closer to Africa than East Asians and Amerindians are in the first two dimensions? Don't they possess one of the greatest Fst distances to Africa, even more so than some East Asians?

Dienekes said...

Why are Australoids (Papuans etc) closer to Africa than East Asians and Amerindians are in the first two dimensions? Don't they possess one of the greatest Fst distances to Africa, even more so than some East Asians?

I don't know how far they are from Africa off the top of my head, but you must remember that you need ALL dimensions to recreate the distance matrix.

If A is closer to B than to C in one 2D projection, you CANNOT conclude that A is closer to B than to C using the full distance matrix.
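A toy example makes the point concrete. The coordinates below are hypothetical (three dimensions instead of fifty, and not taken from the dataset); they only show that the ordering of distances in a 2D projection can reverse once the remaining dimensions are included.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical coordinates in three dimensions.
A = (0.0, 0.0, 0.0)
B = (1.0, 0.0, 5.0)
C = (2.0, 0.0, 0.0)


def projected(point, dims=(0, 1)):
    """Keep only the chosen dimensions, as a 2D scatterplot does."""
    return tuple(point[i] for i in dims)


# In the plot of the first two dimensions, A looks closer to B than to C.
d_ab_2d = dist(projected(A), projected(B))  # 1.0
d_ac_2d = dist(projected(A), projected(C))  # 2.0

# Using all three dimensions, A is closer to C than to B.
d_ab_full = dist(A, B)  # sqrt(26), about 5.1
d_ac_full = dist(A, C)  # 2.0

print(d_ab_2d < d_ac_2d)      # True
print(d_ab_full < d_ac_full)  # False
```

B differs from A mostly along the third dimension, which the 2D plot never shows. Squared Euclidean distance is a sum over dimensions, so dropping dimensions can only shrink a distance, and it shrinks different pairs by different amounts.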

Annie Mouse said...

"Why are Australoids (Papuans etc) closer to Africa than East Asians and Amerindians are in the first two dimensions?"

This is my interpretation. The first dimension (y, height) is a measure of African vs European. By this measure Australoids are more closely related to Europeans (or more likely the common ancestors in the Middle East/India) than Africans.

The second dimension is East Asia (x). So movement to the right is a measure of Asian-ness.

But these are modern populations and things have changed over the last 200,000 years. Europeans have become increasingly European and Asians have become increasingly Asian (and Africans increasingly African).

At the time the East Asians and Papuans split off from the common population, it would have sat at about where the Ethiopians are now. This is why they lie at the same height. These populations seem to be the earliest to split away.

Other populations may have split off later (greater height) as the Europeans continued to deviate. This is why there are a series of lines fanning out from the Europeans roughly towards East Asia.

To complicate matters, however, there are a number of overlaid admixed populations. This is best illustrated by the African Americans, who lie distributed along the line that connects the modern Africans to the modern Europeans. Other admixed populations also form the spokes that fan out roughly between Europe and Asia. It appears that there is more than one spoke either because there are several East Asian populations in connection with the Europeans, or because these represent connections at different times in history. So the Australoid "spoke" represents a genetic tendency towards Asian-ness perhaps earlier in history (lower down, so when the Europeans were less deviant). The Amerindian "spoke" is higher, representing flow when the European population was more deviant (more recent, higher).

So basically the Australoids are actually slightly less African than the Chinese and just appear closer because they are less Asian than the Chinese/Pima/Maya. Also Maya and Pima are less African than the Chinese and Papuans but more African than the Europeans, by this measure.

There are no spokes connecting Asia and Africa because there was little historical connection.

Perahu said...

Annie & Dienekes,

Thank you both for the interpretation.

C Bard said...

I'm curious to know the precise geographical origin of the Egyptian sample. Is it from all over the country or just one part (north, south, Cairo, etc.)? I can't seem to find the answer in the Behar article or the GEO database.

princenuadha said...

Love the plots, beautiful structures. Just wish I could read the labels!

eurologist said...

Just wish I could read the labels!

Right-click, open in new window/tab, expand the browser window and/or use the plus-magnification cursor symbol (in Firefox).

yungsiyebu said...

Dear Dienekes, could you give a larger picture, focused on East Eurasian groups? Thanks!

Dienekes said...

Dear Dienekes, could you give a larger picture, focused on East Eurasian groups? Thanks!

Did you download the rar? Bigger than 5,000x5,000 pixels?

princenuadha said...

@eurologist

Thanks, but I already know how to open it up. My problem is that everything becomes blurry. I have had no problem viewing all the other plots on this site. It's only this one I have a problem with. Dienekes, could you help me out?

yungsiyebu said...

Did you download the rar? Bigger than 5,000x5,000 pixels?
---

Sorry, I haven't found where the RAR file can be downloaded yet. If sending it by email is possible, my email address is Yungsiyebu@gmail.com. Thanks!

princenuadha said...

OK, the ones I was looking at must be the 1600x1600 ones you were talking about. The problem is I can't download the RAR on my phone or download the supporting browsers, such as Firefox. Is there any way to just relabel the populations for the smaller number of pixels, or something else?